
BCS-29

Advanced Computer Architecture


Instruction Set Architectures
RISC Processors
RISC vs CISC
Computer Architecture

• The term architecture is used here to describe the attributes of a system as seen by the programmer.
• It is the conceptual structure and functional behavior, as distinct from the organization of the data flow and control flow, the logic design, and the physical implementation.
• Instruction set architecture: the program-visible instruction set
• Instruction format, memory addressing modes, architectural registers, endian type, alignment, …
• Examples: RISC, CISC, VLIW, EPIC



Instruction Set Architecture
• Elements of ISA
• Programming Registers
• Type and Size of Operands
• Addressing Modes
• Types of Operations
• Instruction Encoding



Instruction Set Architecture

• Instruction set architecture is the structure of a computer that a machine language programmer must understand to write a correct (timing-independent) program for that machine.
• The instruction set architecture is also the machine description that a hardware designer must understand to design a correct implementation of the computer.



Components of an ISA
• Sometimes known as The Programmer’s Model of the machine
• Storage cells
• General and special purpose registers in the CPU
• Many general-purpose cells of same size in memory
• Storage associated with I/O devices
• The machine instruction set
• The instruction set is the entire repertoire of machine operations
• Makes use of storage cells, formats, and results of the fetch/execute cycle
• i.e., register transfers
• The instruction format
• Size and meaning of fields within the instruction
• The nature of the fetch-execute cycle
• Things that are done before the operation code is known
A Basic Model of the machine



An Instruction
• Instruction: add r0, r1, r2
• Operation to be performed: the op code (add, load, branch, etc.)
• Where to find the operands: in CPU registers, memory cells, I/O locations, or as part of the instruction itself
• Where to store the result: again, a CPU register or a memory cell
• Location of the next instruction (e.g. after br endloop): almost always the memory cell pointed to by the program counter (PC), also called the instruction pointer (IP)
Instructions May be Divided into 3 Classes
• Data movement instructions
• Move data from a memory location or register to another
memory location or register without changing its form
• Load—source is memory and destination is register
• Store—source is register and destination is memory
• Arithmetic and logic (ALU) instructions
• Change the form of one or more operands to produce a result
stored in another location
• Add, Sub, Shift, etc.
• Branch instructions (control flow instructions)
• Alter the normal flow of control from executing the next
instruction in sequence
• Br Loc, Brz Loc2: unconditional or conditional branches



Classifying ISA
[Figure: four ISA models side by side: stack processor (with top-of-stack register), accumulator processor, register-memory processor, and register-register processor, each connected to memory]


Example: X = (A+B)*(C+D)
• Zero-Address Instructions:
• A stack-based computer does not use an address field in its computational instructions. To evaluate an expression, it is first converted to postfix notation.

PUSH A TOP = A
PUSH B TOP = B
ADD TOP = A+B
PUSH C TOP = C
PUSH D TOP = D
ADD TOP = C+D
MUL TOP = (C+D)*(A+B)
POP X M[X] = TOP
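
The PUSH/ADD/MUL/POP trace above can be checked with a minimal Python sketch of a stack machine; the operand values for A, B, C, D are made up for illustration:

    memory = {"A": 2, "B": 3, "C": 4, "D": 5}   # assumed operand values
    stack = []

    def push(var):            # PUSH var: stack top <- M[var]
        stack.append(memory[var])

    def alu(op):              # ADD/MUL: pop two operands, push the result
        b, a = stack.pop(), stack.pop()
        stack.append(a + b if op == "ADD" else a * b)

    def pop(var):             # POP var: M[var] <- stack top
        memory[var] = stack.pop()

    # postfix form of X = (A+B)*(C+D) is A B + C D + *
    push("A"); push("B"); alu("ADD")
    push("C"); push("D"); alu("ADD")
    alu("MUL"); pop("X")
    print(memory["X"])        # (2+3)*(4+5) = 45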



Example: X = (A+B)*(C+D)
• One-Address Instructions
• There is an implied ACCUMULATOR (AC) register for data manipulation. One operand is in the accumulator; the other is in a register or memory location.
• The ALU always takes one operand from the accumulator and routes the result back into the accumulator.

Instruction format: | opcode | operand / address of operand |

LOAD A      AC = M[A]
ADD B       AC = AC + M[B]
STORE T     M[T] = AC
LOAD C      AC = M[C]
ADD D       AC = AC + M[D]
MUL T       AC = AC * M[T]
STORE X     M[X] = AC



Example: X = (A+B)*(C+D)
• Two-Address Instructions:
• This is common in commercial computers. Two addresses can be specified in the instruction. Unlike the one-address case, where the result always went to the accumulator, the result can be stored at a different location, at the cost of more bits to represent the extra address.

Instruction format: | opcode | destination address | source address |

MOV R1, A    R1 = M[A]
ADD R1, B    R1 = R1 + M[B]
MOV R2, C    R2 = M[C]
ADD R2, D    R2 = R2 + M[D]
MUL R1, R2   R1 = R1 * R2
MOV X, R1    M[X] = R1



Example: X = (A+B)*(C+D)
• Three-Address Instructions:
• These have three address fields, each specifying a register or a memory location. Programs become much shorter, but the number of bits per instruction increases. Such instructions make programs easier to write, but they do not necessarily run faster: each instruction simply carries more information, and each micro-operation is still performed in one cycle.

Instruction format: | opcode | destination address | source address | source address |

ADD R1, A, B    R1 = M[A] + M[B]
ADD R2, C, D    R2 = M[C] + M[D]
MUL X, R1, R2   M[X] = R1 * R2



Classifying ISA (Instructions)
• Stack Architectures -
operands are implicitly on the top of the stack
0 address add tos <- tos + next

• Accumulator Architectures -
one operand is implicitly accumulator
1 address add A acc <- acc + mem[A]

• General-Purpose Register Architectures -
only explicit operands, either registers or memory locations
• Memory-Memory:
access memory locations as part of any instruction
2 address add A, B mem[A] <- mem[A] + mem[B]
3 address add A, B, C mem[A] <- mem[B] + mem[C]



Classifying ISA (Instructions)
• General-Purpose Register Architectures -
only explicit operands, either registers or memory
locations
• register-memory:
access memory as part of any instruction
2 address add R1, A R1 <- R1 + mem[A]
load R1, A R1 <- mem[A]

• register-register:
access memory only with load and store instructions
3 address add R1, R2, R3 R1 <- R2 + R3
load R1, R2 R1 <- mem[R2]
store R1, R2 mem[R1] <- R2



Operand Access

• Register-Register (0,3)
(m, n) means m memory operands and n total operands
in an ALU instruction
• Pure RISC, register to register operations
• Advantages
• Simple, fixed length instruction encoding.
• Simple code generation.
• Instructions take similar number of clocks to execute.
Uniform CPI
• Disadvantages
• Higher inst. count.
• Some instructions are short and bit encoding may be
wasteful.



Operand Access

• Register-Memory (1,2)
• Register – Memory ALU Architecture
• In later evolutions of RISC and CISC
• Advantages
• Data can be accessed without loading first.
• Instruction format easy to encode
• Good instruction density
• Disadvantages
• Source operand also destination, data overwritten
• Need for memory address field may limit # registers
• CPI varies by operand location



Operand Access

• Memory-Memory (3,3)
• True memory-memory ALU model, e.g. full orthogonal
CISC architecture
• Advantages
• Most compact instruction density, no temporary registers needed
• Disadvantages
• Memory accesses create a bottleneck
• Variable CPI
• Large variation in instruction size
• Expensive to implement
• Not used in today’s architectures



Memory Addressing
• What is accessed: a byte, a word, multiple words?
• Today's machines are byte-addressable, due to legacy issues
• But main memory is organized in 32-64 byte lines
• matches the cache model
• Retrieve data in, say, 4-byte chunks
• Alignment Problem
• accessing data that is not aligned on one of these boundaries requires multiple references
• E.g. fetching a 16-bit integer at byte offset 3 requires two 4-byte chunks to be read in (chunk 0, chunk 1)
• This can make it tricky to accurately predict execution time with misaligned data
• The compiler should try to align! Some instructions auto-align too
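
A minimal Python sketch (assuming 4-byte chunks, as in the example above) of how many aligned chunks a possibly misaligned access touches:

    CHUNK = 4   # assumed chunk size in bytes

    def chunks_touched(offset, size):
        first = offset // CHUNK              # index of first aligned chunk
        last = (offset + size - 1) // CHUNK  # index of last aligned chunk
        return last - first + 1

    print(chunks_touched(3, 2))   # 16-bit integer at byte offset 3 -> 2 chunks
    print(chunks_touched(4, 2))   # aligned access -> 1 chunk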



Addressing Modes
• The addressing mode specifies the address of an operand we
want to access
• Register or Location in Memory
• The actual memory address we access is called the effective address
• Effective address may go to memory or a register array
• typically dependent on location in the instruction field
• multiple fields may combine to form a memory address
• register addresses are usually simple - needs to be fast
• Effective address generation is important and should be fast!
• Falls into the common case of frequently executed instructions



Memory Addressing
Mode          Example           Meaning                                    When used
Register      Add R4, R3        Regs[R4] <- Regs[R4] + Regs[R3]            Value is in a register
Immediate     Add R4, #3        Regs[R4] <- Regs[R4] + 3                   For constants
Displacement  Add R4, 100(R1)   Regs[R4] <- Regs[R4] + Mem[100+Regs[R1]]   Access local variables
Indirect      Add R4, (R1)      Regs[R4] <- Regs[R4] + Mem[Regs[R1]]       Pointers
Indexed       Add R3, (R1+R2)   Regs[R3] <- Mem[Regs[R1] + Regs[R2]]       Traverse an array
Direct        Add R1, $1001     Regs[R1] <- Regs[R1] + Mem[1001]           Static data; address constant may be large
Memory Addressing
Mode             Example               Meaning                                     When used
Memory indirect  Add R1, @(R3)         Regs[R1] <- Regs[R1] + Mem[Mem[Regs[R3]]]   *p if R3 = p
Autoincrement    Add R1, (R2)+         Regs[R1] <- Regs[R1] + Mem[Regs[R2]];       Stepping through arrays in a loop
                                       Regs[R2] <- Regs[R2] + 1
Autodecrement    Add R1, (R2)-         Regs[R1] <- Regs[R1] + Mem[Regs[R2]];       Same as above; can push/pop for a stack
                                       Regs[R2] <- Regs[R2] - 1
Scaled           Add R1, 100(R2)[R3]   Regs[R1] <- Regs[R1] +                      Index arrays by a scaling factor d,
                                       Mem[100 + Regs[R2] + Regs[R3] x d]          e.g. word offsets
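
To make effective-address generation concrete, here is a minimal Python sketch of three of the modes above; the register contents and scale factor are invented for illustration:

    regs = {"R1": 100, "R2": 8, "R3": 2}   # assumed register contents
    d = 4                                  # assumed scale factor (word size)

    def ea_displacement(disp, base):       # disp(base)
        return disp + regs[base]

    def ea_indexed(base, index):           # (base+index)
        return regs[base] + regs[index]

    def ea_scaled(disp, base, index):      # disp(base)[index]
        return disp + regs[base] + regs[index] * d

    print(ea_displacement(100, "R1"))      # 100 + 100 = 200
    print(ea_indexed("R1", "R2"))          # 100 + 8 = 108
    print(ea_scaled(100, "R2", "R3"))      # 100 + 8 + 2*4 = 116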
Instruction Set Encoding Options
Variable (e.g. VAX)

OpCode and # of ops Operand 1 Operand 2 … Operand N

Fixed (e.g. DLX, SPARC, PowerPC)

OpCode Operand 1 Operand 2 Operand 3

Hybrid (e.g. x86, IBM 360)

OpCode Operand 1 Operand 2 Operand 3

OpCode Operand 1 Operand 2

OpCode

Trade-off: instruction size versus decode complexity.


Reduced Instruction Set Computer (RISC)
• RISC architectures represent an important innovation in the area of computer
organization.
• The RISC architecture is an attempt to produce more CPU power by simplifying
the instruction set of the CPU.
• The opposed trend to RISC is that of complex instruction set computers (CISC).
• Both RISCs and CISCs try to solve the same problem. CISCs are going the
traditional way of implementing more and more complex instructions. RISCs try
to simplify the instruction set.
• Innovations in RISC architectures are based on a close analysis of a large set of
widely used programs.
• One of the main concerns of RISC designers was to maximize the efficiency of
pipelining.
• Present architectures often include both RISC and CISC features.
• Both RISC and CISC architectures have been developed as attempts to bridge the semantic gap.
Typical RISC Processor Architecture



Main Characteristics of RISC Architectures:

• The instruction set is limited and includes only simple instructions.
• Only LOAD and STORE instructions reference data in memory.
• Instructions use only few addressing modes.
• Instructions are of fixed length and uniform format.
• A large number of registers is available.



Main Characteristics of RISC
A Small number of Simple Instructions:
• Simple and small decode and execution hardware is required.
• A hard-wired controller is used, rather than microprogramming.
• The CPU takes less silicon area to implement, and runs faster too.
• Execution of one machine instruction per clock cycle.
• Register-to-register operation.
• Simple addressing mode.
• Simple instruction format.



Main Characteristics of RISC
Try to achieve one instruction per clock
• The machine cycle is defined to be the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register.
• The instruction pipeline performs more efficiently due to
simple instructions and simpler execution patterns.
• Complex operations are executed as a sequence of simple
instructions. In the case of CISC, these operations are
executed as one single or few complex instructions.



Example

An illustrative example with the following assumptions:
• A program with 80% of executed instructions being simple and 20% complex.
• CISC: simple instructions take 4 cycles, complex instructions take 8 cycles; cycle time is 100 ns.
• RISC: simple instructions are executed in one cycle; complex instructions are implemented as sequences of simple instructions (say 14 instructions on average); cycle time is 75 ns.
How much time does a program of 1,000,000 instructions take?
CISC: (10^6 x 0.80 x 4 + 10^6 x 0.20 x 8) x 100 x 10^-9 s = 0.48 s
RISC: (10^6 x 0.80 x 1 + 10^6 x 0.20 x 14) x 75 x 10^-9 s = 0.27 s

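The arithmetic can be reproduced with a short Python sketch (same assumptions as above):

    N = 1_000_000                                    # instructions in the program

    # CISC: 100 ns cycle; simple = 4 cycles, complex = 8 cycles
    t_cisc = (N * 0.80 * 4 + N * 0.20 * 8) * 100e-9
    # RISC: 75 ns cycle; complex op = ~14 simple one-cycle instructions
    t_risc = (N * 0.80 * 1 + N * 0.20 * 14) * 75e-9

    print(t_cisc, t_risc)                            # 0.48 s vs 0.27 s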


Main features of CISC
• A large number of instructions (> 200) and complex instructions and data
types.
• Many and complex addressing modes.
• Direct hardware implementations of high-level language statements.
• Microprogramming techniques are used so that complicated instructions
can be implemented.
• Memory bottleneck is a major problem, due to complex addressing modes
and multiple memory accesses per instruction.



Typical CISC Processor Architecture



An overview



BCS-29
Advanced Computer Architecture
Parallel Computer Models
Elements of Modern Computers

• The hardware, software, and programming elements of modern computer systems can be characterized by looking at a variety of factors, including:
• Computing problems
• Algorithms and data structures
• Hardware resources
• Operating systems
• System software support
• Compiler support



Elements of Modern Computers



Computing Problems
• Numerical computing
• complex mathematical formulations
• tedious integer or floating-point computation
• Transaction processing
• accurate transactions
• large database management
• information retrieval
• Logical Reasoning
• logic inferences
• symbolic manipulations



Algorithms and Data Structures
• Traditional algorithms and data structures are designed
for sequential machines.
• New, specialized algorithms and data structures are
needed to exploit the capabilities of parallel
architectures.
• These often require interdisciplinary interactions among
theoreticians, experimentalists, and programmers.



Hardware Resources
• The architecture of a system is shaped only partly by the
hardware resources.
• The operating system and applications also significantly influence
the overall architecture.
• A modern computer system demonstrates its power through coordinated efforts by hardware resources, an operating system, and application software.
• Not only must the processor and memory architectures be considered, but also the architecture of the device interfaces (which often include their own advanced processors).



Operating System
• Operating systems manage the allocation and deallocation of
resources during user program execution.
• UNIX, Mach, and OSF/1 provide support for
• multiprocessors and multicomputers
• multithreaded kernel functions
• virtual memory management
• file subsystems
• network communication services
• An OS plays a significant role in mapping hardware resources onto algorithms and data structures.



Operating System
• An effective operating system manages the allocation and deallocation of resources during user program execution.
• An OS plays a significant role in mapping hardware resources onto algorithms and data structures.
• Efficient mapping benefits the programmer and produces better source code.
• The mapping of algorithms and data structures onto the machine architecture includes processor scheduling, memory maps, inter-processor communication, etc. These activities are usually architecture-dependent.



System Software Support
• Compilers, assemblers, and loaders are traditional tools for developing programs in high-level languages. Together with the operating system, these tools determine the binding of resources to applications, and this binding determines the efficiency of hardware utilization and the system's programmability.
• Most programmers still employ a sequential mind set, abetted by a lack of
popular parallel software support.
• Parallel software can be developed using entirely new languages designed
specifically with parallel support as its goal, or by using extensions to existing
sequential languages.
• New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system
software.
• The most common approach is to extend an existing language.



Compiler Support
• Preprocessors
• use existing sequential compilers and specialized libraries to implement
parallel constructs
• Precompilers
• perform some program flow analysis, dependence checking, and limited parallel optimizations
• Parallelizing Compilers
• requires full detection of parallelism in source code, and transformation of
sequential code into parallel constructs
• Compiler directives are often inserted into source code to aid
compiler parallelizing efforts



The history of Intel’s processors
Intel 4004 (1971)
• 16-pin DIP package
• 4-bits processed at a time
• 12-bit addresses
• Clock: 740KHz
• Address Space: 4 KB
• Instruction Set: 46
• Registers: 16
Intel 4040 (1974)
• Instruction Set expanded to 60 instructions
• Program memory expanded to 8 KB (13-bit address
space)
• Registers expanded to 24
• Subroutine stack expanded to 7 levels deep
8-bit Microprocessors
8008
• Processed 8-bits at a time
• 14-bit addresses
• First to include an interrupt line
Features
• Seven 8-bit "scratchpad" registers: The main accumulator (A) and six
other registers (B, C, D, E, H, and L).
• 14-bit program counter (PC).
• Seven-level push-down address call stack. Eight registers are actually
used, with the top-most register being the PC.
• Four condition code status flags: carry (C), even parity (P), zero (Z), and
sign (S).
• Indirect memory access using the H and L registers (HL) as a 14-bit data
pointer (the upper two bits are ignored).



Elementary Discussion on Modern Computers
8085 architectural block diagram

16-bit and 8-bit Microprocessors
• 8086 (16-bit) & 8088 (8-bit)
• 20-bit addresses
• Internally organized as two units: the Bus Interface Unit and the Execution Unit
• Segmented Memory

Elementary Discussion on Modern Computers
8088 architectural block diagram

80286 & 80386
• Two modes
  • 8086 Real Address Mode
  • Protected Virtual Address Mode
• Protected Virtual Address Mode provides
  • More address space
  • Multi-user protection
  • Dedicated 286 task handling
  • Virtual memory
  • Interrupt handler

Introduction to 80486
• Increased speed (2 x 80386)
• 80386 with an internal math coprocessor
• Upgradeable (Compatibility)
• Utilize new technology (RISC, cache)

80486 Processor

Intel Pentium
• Introduced in '93 but was expected in 1992
• 60/66 MHz Bus Speed
• Improved cache structure
• Re-organized to form two caches that are each 8K bytes
in size
• One for caching data
• Another for caching instructions
• Wider data bus width
• Increased from 32 bits to 64 bits

Intel Pentium
• Faster numeric processor
• Operates about 5 times faster than the
486 numeric processor
• A dual integer processor
• Often allows two instructions per clock
• Branch prediction logic
• MMX instructions --- a later introduction to
Pentium

Intel Pentium
• Pentium has two execution lines
• U-line --- can execute any Pentium instruction
• V-line --- only executes simple instructions.
• Each line has 5 stages
i. Pre-fetch
ii. Instruction Decode
iii. Address Generation
iv. Execute, Cache, and ALU Access
v. Write back

Elementary Discussion on Modern Computers
Pentium architectural block diagram

Elementary Discussion on Modern Computers
P6 Microarchitecture



Classification of Parallel Architectures

• Classification based on Architectural schemes


• Flynn’s classification
• It is based upon the number of concurrent instruction (or control)
and data streams available in the architecture:

• Feng’s classification
• This classification is mainly based on degree of parallelism to
classify parallel computer architecture.

• Handle’s classification
• Handler’s proposed an elaborate notation for expressing the
pipelining and parallelism of computers. He divided the computer
at three levels such as Processor Control Unit(PCU), Arithmetic
Logic Unit(ALU), Bit Level Circuit(BLC)



Flynn’s classification
• Single instruction, single data stream (SISD)

• Single instruction, multiple data streams (SIMD)


• vector computers with scalar and vector hardware



Flynn’s classification
• Multiple instructions, single data stream (MISD): systolic arrays

• Multiple instructions, multiple data streams (MIMD): parallel


computers

• Among parallel machines, MIMD is most popular, followed by SIMD, and finally MISD.



Feng’s classification

• Feng suggested the use of the degree of parallelism to classify various computer architectures.
• The maximum number of binary digits that can be
processed within a unit time by a computer system is
called the maximum parallelism degree P.
• A bit slice is a string of bits one from each of the words at
the same vertical position.
• Under this classification:
• Word Serial and Bit Serial (WSBS)
• Word Parallel and Bit Serial (WPBS)
• Word Serial and Bit Parallel(WSBP)
• Word Parallel and Bit Parallel (WPBP)



Feng’s classification

• WSBS has been called bit-serial processing because one bit is processed at a time.
• WPBS has been called bit-slice processing because an m-bit slice is processed at a time.
• WSBP is found in most existing computers and has been called word-slice processing because one word of n bits is processed at a time.
• WPBP is known as fully parallel processing, in which an array of n x m bits is processed at one time.



Handle’s classification

• Handler’s proposed an elaborate notation for expressing


the pipelining and parallelism of computers. He divided
the computer at three levels.
• Processor Control Unit(PCU)
• Arithmetic Logic Unit(ALU)
• Bit Level Circuit(BLC)
PCU corresponds to CPU, ALU corresponds to a functional unit or PE’s in
an array processor. BLC corresponds to the logic needed for performing
operation in ALU.
He uses three pairs of integers to describe computer:
Computer = (k*k’,d*d , w*w’)
Where,
k= no. of PCUs k’=no. of PCUs which are pipelined;
d=no. of ALUs control by each PCU d’=no. of ALUs that can be pipelined
w=no. of bits or processing elements in ALU w’=no. of pipeline segments
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-29
Shor’s Classification:
• In this classification computers are classified on the basis of
organization of the constituent elements in the computer. He
proposed 6 machines which are recognized and distinguished
by numerical designators.
• Machine1:



Shor’s Classification:
Machine2:

Machine3:

Machine4:



Modern classification(Sima, Fountain, Kacsuk)
• Classify based on how parallelism is achieved
• by operating on multiple data: data parallelism
• by performing many functions in parallel: function
parallelism
• Function parallelism appears as control parallelism or task parallelism, depending on the level of the functional parallelism.

Parallel architectures

Data-parallel Function-parallel
architectures architectures



Data parallel architectures

• Vector processors, SIMD (array processors), systolic


arrays.
IP

MAR

MEMORY

OP ADDR MDR
A1 B1 C1 A2 B2 C2 A N B N CN
DECODER

ALU ALU ALU



Function Parallel Architectures

Function-parallel architectures are divided into:
• Instruction-level parallel architectures (ILPs): pipelined processors, VLIW processors, superscalar processors
• Thread-level parallel architectures
• Process-level parallel architectures (MIMDs): shared-memory MIMD and distributed-memory MIMD


Categories of Parallel Computers (MIMD)

• Considering their architecture only, there are two main categories of parallel computers:
• systems with shared common memories, and
• systems with unshared distributed memories.
Shared-Memory Multiprocessors
• Shared-memory multiprocessor models:
• Uniform-memory-access (UMA)
• Nonuniform-memory-access (NUMA)
• Cache-only memory architecture (COMA)
• These systems differ in how the memory and
peripheral resources are shared or distributed.



Categories of Parallel Computers (MIMD)



The UMA Model
• Physical memory uniformly shared by all processors, with equal access
time to all words.
• Processors may have local cache memories.
• Peripherals also shared in some fashion.
• Tightly coupled systems use a common bus, crossbar, or multistage
network to connect processors, peripherals, and memories.
• Many manufacturers have multiprocessor (MP) extensions of uniprocessor
(UP) product lines.
• Synchronization and communication among processors achieved through shared
variables in common memory.
• Symmetric MP systems – all processors have access to all peripherals, and any
processor can run the OS and I/O device drivers.
• Asymmetric MP systems – not all peripherals accessible by all processors; kernel
runs only on selected processors (master); others are called attached processors
(AP).



The UMA Multiprocessor Model

P1 P2 … Pn

System Interconnect
(Bus, Crossbar, Multistage network)

I/O SM1 … SMm



Performance Calculation
• Consider two loops. The first loop adds corresponding elements of two N-
element vectors to yield a third vector. The second loop sums elements of
the third vector. Assume each add/assign operation takes 1 cycle, and
ignore time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.). Assume interprocessor
communication requires k cycles.
• On a sequential system, each loop will require N cycles, for a total of 2N
cycles of processor time.
• On an M-processor system, we can partition each loop into M parts, each
having L = N / M add/assigns requiring L cycles. The total time required is
thus 2L. This leaves us with M partial sums that must be totaled.
• Computing the final sum from the M partial sums requires l = log2(M) additions, each requiring k cycles (to access a non-local term) plus 1 cycle (for the add/assign), for a total of l x (k + 1) cycles.
• The parallel computation thus requires
  2N / M + (k + 1) log2(M) cycles.



Performance Calculation
• Assume N = 2^20.
• Sequential execution requires 2N = 2^21 cycles.
• If processor synchronization requires k = 200 cycles, and we have M = 256 processors, parallel execution requires
  2N / M + (k + 1) log2(M)
  = 2^21 / 2^8 + 201 x 8
  = 2^13 + 1608 = 9800 cycles
• Comparing results, the parallel solution is 214 times faster than the sequential one, with the best theoretical speedup being 256 (since there are 256 processors). Thus the efficiency of the parallel solution is 214 / 256 = 83.6%.
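
A minimal Python sketch of this cycle-count model, using the numbers from the example:

    from math import log2

    def parallel_cycles(N, M, k):
        # 2N/M cycles for the partitioned loops + (k+1)*log2(M) for the reduction
        return 2 * N // M + (k + 1) * int(log2(M))

    N, M, k = 2**20, 256, 200
    seq = 2 * N                          # 2^21 cycles sequentially
    par = parallel_cycles(N, M, k)       # 2^13 + 201*8 = 9800 cycles
    print(seq, par)
    print(seq / par)                     # speedup ~ 214
    print(seq / par / M)                 # efficiency ~ 0.836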
The NUMA Model
• Shared memories, but access time depends on the location of the data
item.
• The shared memory is distributed among the processors as local
memories, but each of these is still accessible by all processors (with
varying access times).
• Memory access is fastest from the locally-connected processor, with the
interconnection network adding delays for other processor accesses.
• Additionally, there may be global memory in a multiprocessor system,
with two separate interconnection networks, one for clusters of
processors and their cluster memories, and another for the global shared
memories.



Shared Local Memories

• The shared memory is physically distributed to all processors, as local memories. The collection of all local memories forms a global address space accessible by all processors.
• It is faster to access a local memory through its local processor. Access to remote memory attached to other processors takes longer, due to the added delay through the interconnection network.

[Figure: local memories LM1..LMn, each attached to a processor P1..Pn, linked through an interconnection network]



Hierarchical Cluster Model
• In a hierarchically structured multiprocessor, the processors are divided into several clusters. Each cluster is itself a UMA or NUMA multiprocessor.
• The clusters are connected to global shared-memory (GSM) modules. The entire system is considered a NUMA multiprocessor.
• All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory (CSM) modules.

[Figure: clusters of processors P with cluster shared memories CSM, each cluster joined by a cluster interconnection network (CIN); the clusters connect through a global interconnect network to the global shared-memory modules GSM]



The COMA Model
• In the COMA (Cache-Only Memory Architecture) model, processors only have cache memories; the caches, taken together, form a global address space.
• Each cache has an associated directory that aids remote machines in their
lookups; hierarchical directories may exist in machines based on this
model.
• Initial data placement is not critical, as cache blocks will eventually
migrate to where they are needed.
[Figure: COMA organization: processors P, each with a cache C and a directory D, connected by an interconnection network]



Distributed-Memory Multicomputers
• This system consists of multiple computers, often called nodes, inter-
connected by a message-passing network. Each node is an autonomous
computer consisting of a processor, local memory and sometimes
attached disks or I/O peripherals.
• The message-passing network provides point-to-point static connections
among the nodes.
• All local memories are private and are accessible only by the local processor. For this reason, traditional multicomputers have also been called no-remote-memory-access (NORMA) machines.
• Internode communication is carried out by passing messages through the
static connection network. With advances in interconnection and network
technologies, this model of computing has gained importance.



Distributed-Memory Multicomputers

[Figure: distributed-memory multicomputer: nodes, each a processor P with local memory M, connected point-to-point by a message-passing interconnection network]
Programming Environments
• Programmability depends on the programming environment provided to
the users.
• Conventional computers are used in a sequential programming
environment with tools developed for a uniprocessor computer.
• Parallel computers need parallel tools that allow specification or easy
detection of parallelism and operating systems that can perform parallel
scheduling of concurrent events, shared memory allocation, and shared
peripheral and communication links.
• Implicit Parallelism:
• Explicit Parallelism



Programming Environments
• Implicit Parallelism:
• Use a conventional language (like C, Fortran, Lisp, or Pascal) to write the
program.
• Use a parallelizing compiler to translate the source code into parallel code.
• The compiler must detect parallelism and assign target machine resources.
• Success relies heavily on the quality of the compiler.

• Explicit Parallelism
• Programmers write explicit parallel code using parallel dialects of common languages.
• Compiler has reduced need to detect parallelism, but must still preserve
existing parallelism and assign target machine resources.



System Attributes to Performance

• Performance depends on
• hardware technology
• architectural features
• efficient resource management
• algorithm design
• data structures
• language efficiency
• programmer skill
• compiler technology

Performance Indicators

• Turnaround time depends on:


• disk and memory accesses
• input and output
• compilation time
• operating system overhead
• CPU time
• Since I/O and system overhead frequently overlap processing by other programs, it is fair to consider only the CPU time used by a program, and the user CPU time is the most important factor.

Clock Rate and CPI

• The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds).

• The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz).

• The size of a program is determined by its instruction count, Ic, the number of
machine instructions to be executed by the program.

• Different machine instructions require different numbers of clock cycles to execute. CPI (cycles per instruction) is thus an important parameter.

Average CPI
• It is easy to determine the average number of cycles
per instruction for a particular processor if we know
the frequency of occurrence of each instruction type.

• Of course, any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions.

• In general, the term CPI is used with respect to a particular instruction set and a given program mix.
Performance Factors (1)
• The time required to execute a program containing Ic instructions is just T = Ic x CPI x τ.

• Each instruction must be fetched from memory, decoded, then operands fetched from memory, the instruction executed, and the results stored.

• The time required to access memory is called the memory cycle time, which is usually k times the processor cycle time τ. The value of k depends on the memory technology and the processor-memory interconnection scheme.
Performance Factors (2)
• The processor cycles required for each instruction
(CPI) can be attributed to
• cycles needed for instruction decode and execution (p), and
• cycles needed for memory references (m x k).

• The total time needed to execute a program can then be rewritten as T = Ic x (p + m x k) x τ.

System Attributes
• The five performance factors (Ic, p, m, k, τ) are influenced by four system attributes:
• instruction-set architecture (affects Ic and p)
• compiler technology (affects Ic, p, and m)
• CPU implementation and control (affects p x τ)
• cache and memory hierarchy (affects memory access latency, k x τ)

• Total CPU time can be used as a basis for estimating the execution rate of a processor.

MIPS Rate

• If C is the total number of clock cycles needed to execute a given program, then total CPU time can be estimated as T = C x τ = C / f.

• Other relationships are easily observed:
• CPI = C / Ic
• T = Ic x CPI x τ
• T = Ic x CPI / f

• Processor speed is often measured in terms of millions of instructions per second, frequently called the MIPS rate of the processor.
MIPS Rate

Ic f f  Ic
MIPS rate = = =
T 10 6
CPI 10 6
C 10

• The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI.

• All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program.
Throughput Rate

• The system throughput, Ws, is the number of programs a system can execute per unit time, in programs per second.
• CPU throughput, Wp, is defined as

Wp = f / (Ic x CPI)
In a multiprogrammed system, the system throughput is
often less than the CPU throughput.

Consider the execution of an object code with 200,000 instructions on a 40 MHz processor. The program instruction mix is as follows:

Instruction type                    CPI   Instruction mix
Arithmetic and logic                 1        60%
Load/store with cache hit            2        18%
Branch                               4        12%
Memory reference with cache miss     8        10%

1. Calculate the average CPI when the program is executed on a uniprocessor with the above trace results.
2. Calculate the corresponding MIPS rate based on the CPI obtained.

Solution: Ic = 200,000 and f = 40 MHz.

CPI_avg = Σ(Ic_i x CPI_i) / Σ Ic_i

Σ(Ic_i x CPI_i) = 120,000 x 1 + 36,000 x 2 + 24,000 x 4 + 20,000 x 8
                = 120,000 + 72,000 + 96,000 + 160,000 = 448,000
Σ Ic_i = 120,000 + 36,000 + 24,000 + 20,000 = 200,000

CPI_avg = 448,000 / 200,000 = 2.24

MIPS rate = f / (CPI x 10^6) = (40 x 10^6) / (2.24 x 10^6) = 17.86
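
The same calculation as a minimal Python sketch:

    Ic, f = 200_000, 40e6
    mix = [          # (CPI, fraction of the instruction mix)
        (1, 0.60),   # arithmetic and logic
        (2, 0.18),   # load/store with cache hit
        (4, 0.12),   # branch
        (8, 0.10),   # memory reference with cache miss
    ]

    cycles = sum(Ic * frac * cpi for cpi, frac in mix)   # 448,000 cycles
    cpi_avg = cycles / Ic                                # 2.24
    mips = f / (cpi_avg * 1e6)                           # ~17.86
    print(cpi_avg, mips)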
BCS-29
Advanced Computer Architecture
Parallel Computing
Programming Environments
Issues in Parallel Computing
• Design of parallel computers
• Design of efficient parallel algorithms
• Parallel programming models
• Parallel computer language
• Methods for evaluating parallel algorithms
• Parallel programming tools
• Portable parallel programs





Important Issues in Parallel Programming
Important Issues:
• Partitioning of data
• Mapping of data onto the processors
• Reproducibility of results
• Synchronization
• Scalability and Predictability of performance
• Success depends on the combination of
• Architecture, Compiler, Choice of Right Algorithm, Programming
Language
• Design of software, Principles of Design of algorithm, Portability,
Maintainability, Performance analysis measures, and Efficient
implementation



Exploitation of PARALLELISM
Attributes of parallelism
• Computational granularity,
• Time and space complexities,
• Communication latencies,
• Scheduling policies,
• Load balancing, etc.
Types of Parallelism
• Data parallelism
• Task parallelism
• Combination of Data and Task parallelism
• Stream parallelism



Data Parallelism
• Identical operations being applied concurrently on
different data items is called data parallelism.
• It applies the SAME OPERATION in parallel on
different elements of a data set.
• It uses a simpler model and reduces the programmer's work.
• The programmer's responsibility is to specify the distribution of data across the processing elements, as in the sketch below.
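
A minimal Python sketch of the idea, with a thread pool standing in for the processing elements and a made-up data distribution:

    from concurrent.futures import ThreadPoolExecutor

    data = list(range(16))
    P = 4                                      # number of processing elements
    parts = [data[i::P] for i in range(P)]     # programmer-specified distribution

    def work(part):                            # the SAME operation on every part
        return [x * x for x in part]

    with ThreadPoolExecutor(max_workers=P) as pool:
        results = list(pool.map(work, parts))
    print(results)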



Task Parallelism
• The concurrent execution of many different tasks is called task parallelism.
• This can be visualized by a task graph. In this graph, each node represents a task to be executed; edges represent the dependencies between the tasks.
• A task in the task graph can be executed as soon as all of its preceding tasks have been completed (see the sketch after this list).
• The programmer defines different types of processes. These processes communicate and synchronize with each other through MPI or other mechanisms.
• The programmer's responsibility is to deal explicitly with process creation, communication, and synchronization.
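
A minimal Python sketch of executing a task graph level by level; the task names and dependencies are invented for illustration:

    deps = {                       # node -> list of tasks it depends on
        "load": [], "parse": ["load"], "index": ["parse"],
        "stats": ["parse"], "report": ["index", "stats"],
    }

    done, schedule = set(), []
    while len(done) < len(deps):
        # a task is ready once all tasks it depends on have completed
        ready = [t for t in deps
                 if t not in done and all(d in done for d in deps[t])]
        schedule.append(ready)     # everything in 'ready' could run concurrently
        done.update(ready)
    print(schedule)   # [['load'], ['parse'], ['index', 'stats'], ['report']]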



Data and Task Parallelism
Integration of Task and Data Parallelism
• Two Approaches
• Add task-parallel constructs to data-parallel constructs.
• Add data-parallel constructs to task-parallel constructs.
• Approach to Integration
• Language based approaches.
• Library based approaches.



Stream Parallelism
• Stream parallelism refers to the simultaneous execution of different
programs on a data stream. It is also referred to as pipelining.
• The computation is parallelized by executing a different program at
each processor and sending intermediate results to the next
processor.
• The result is a pipeline of data flow between processors.
• Many problems exhibit a combination of data, task and stream
parallelism.
• The amount of stream parallelism available in a problem is usually
independent of the size of the problem.
• The amount of data and task parallelism in a problem usually
increases with the size of the problem.



Conditions of Parallelism
• The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• Inter-processor communication in parallel architectures
• integration of parallel systems into general environments



Data and Resource Dependencies
• Program segments cannot be executed in parallel
unless they are independent.
• Independence comes in several forms:
• Data dependence: data modified by one segment must not be modified by another parallel segment.
• Control dependence: if the control flow of segments
cannot be identified before run time, then the data
dependence between the segments is variable.
• Resource dependence: even if several segments are
independent in other ways, they cannot be executed in
parallel if there aren’t sufficient processing resources (e.g.
functional units).
Data Dependence
• Flow dependence: S1 precedes S2, and at least one output of S1 is input
to S2.
• Anti-dependence: S1 precedes S2, and the output of S2 overlaps the input
to S1.
• Output dependence: S1 and S2 write to the same output variable.
• I/O dependence: two I/O statements (read/write) reference the same
variable, and/or the same file.
• Unknown dependence: Dependence relationships cannot be
determined in the following situations:
• Indirect addressing
• The subscript of a variable is itself subscripted.
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different
coefficients of the loop variable (that is, different functions of the loop
variable).
• The subscript is nonlinear in the loop index variable.
• Parallel execution of program segments which do not have total
data independence can produce non-deterministic results.
Example
S1: Load R1, A /R1  Memory(A)/
S2: Add R2, R1 /R2  (R1) + (R2)/
S3: Move R1,R3 /R1  (R3)/
S4: Store B, R1 /Memory(B)  (R1)/
S2 is flow dependent on S1 because the variable R1
S3 is anti-dependent on S1 because of register R1.
S3 is output-dependent on S1 because of register R1and more …..
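
These dependences can be derived mechanically from each statement's read and write sets; a minimal Python sketch (the sets below are read off the four statements above):

    def dependences(w1, r1, w2, r2):
        # statement 1 precedes statement 2
        kinds = []
        if w1 & r2: kinds.append("flow")     # 2 reads what 1 writes
        if r1 & w2: kinds.append("anti")     # 2 writes what 1 reads
        if w1 & w2: kinds.append("output")   # both write the same location
        return kinds

    # S2: Add R2, R1  writes {R2}, reads {R1, R2}
    # S3: Move R1, R3 writes {R1}, reads {R3}
    print(dependences({"R2"}, {"R1", "R2"}, {"R1"}, {"R3"}))   # ['anti']
    # S1: Load R1, A writes {R1}; S3 also writes {R1}
    print(dependences({"R1"}, {"A"}, {"R1"}, {"R3"}))          # ['output']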



Program Transformation and Code Scheduling

S1: A = 1
S2: B = A + 1
S3: C = B + 1
S4: D = A + 1
S5: E = D + B

[Figure: dependence graph over S1..S5: S2 and S4 depend on S1; S3 and S5 depend on S2; S5 also depends on S4]

The code can be rescheduled into two concurrent streams, synchronized with post/wait on an event e:

S1: A = 1
cobegin
    S2: B = A + 1    ||    S4: D = A + 1
    post(e)          ||    wait(e)
    S3: C = B + 1    ||    S5: E = D + B
coend
Control Dependence
• It is the situation, when the order of the execution cannot be determined
before run time.
• Different paths taken after a conditional branch may introduce or eliminate
data dependence among instructions.
• Dependence may also exist between operations performed in successive
iterations of a looping procedure.

• Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
• Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
• Compiler techniques are needed to get around control dependence limitations.



Control Dependences

A control dependence can be converted into a data dependence by guarding statements with a Boolean condition:

S: if A ≠ 0 then        S: b = [A ≠ 0]
T:   C = C + 1          T: C = C + 1 when b
U:   D = C / A          U: D = C / A when b
else
V:   D = C              V: D = C when not b
end if
W: X = C + D            W: X = C + D
Resource Dependence
• Data and control dependencies are based on the
independence of the work to be done.
• Resource independence is concerned with conflicts
in using shared resources, such as registers, integer
and floating point ALUs, etc.
• ALU conflicts are called ALU dependence.
• Memory (storage) conflicts are called storage
dependence.



Bernstein’s Conditions
• Bernstein's conditions are a set of conditions which must hold for two processes to be able to execute in parallel.
• Notation
• Ii is the set of all input variables for a process Pi .
• Oi is the set of all output variables for a process Pi .
• If P1 and P2 can execute in parallel (which is written as P1 ||
P2), then:
I1  O2 = 
I 2  O1 = 
O1  O2 = 



Bernstein’s Conditions
• In terms of data dependencies, Bernstein’s conditions imply
that two processes can execute in parallel if they are flow-
independent, anti-independent, and output-independent.
• The parallelism relation || is commutative (Pi || Pj implies Pj
|| Pi ), but not transitive (Pi || Pj and Pj || Pk does not imply Pi
|| Pk ) . Therefore, || is not an equivalence relation.
• Intersection of the input sets is allowed.



Detection of Parallelism
• Example
P1: C = D x E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E

[Figure: dependence graph among P1..P5, showing the dependences induced by variables C and M]


Execution (Data-flow)

[Figure: data-flow execution of P1..P5: operands D, E, G, B and L stream through the multiply (P1), add (P2, P3, P4) and divide (P5) operators, producing A, F and the final value of C]
Hardware Parallelism & Software Parallelism
Hardware parallelism
• Hardware parallelism is defined by machine architecture and hardware
multiplicity.
• It can be characterized by the number of instructions that can be issued per
machine cycle. If a processor issues k instructions per machine cycle, it is
called a k-issue processor. Conventional processors are one-issue machines.
• Examples. Intel i960CA is a three-issue processor (arithmetic, memory access,
branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point,
memory access, branch).
• A machine with n k-issue processors should be able to handle a maximum of
nk threads simultaneously.
Software Parallelism
• Software parallelism is defined by the control and data dependence of
programs, and is revealed in the program’s flow graph.
• It is a function of algorithm, programming style, and compiler optimization.



Mismatch between software and hardware
parallelism
Example:
A = (P x Q) + (R x S)
B = (P x Q) - (R x S)

Code sequence:
L1: Load P
L2: Load Q
L3: Load R
L4: Load S
X1: Mul P, Q
X2: Mul R, S
+ : Add X1, X2
- : Sub X1, X2

[Figure: maximum software parallelism with no limitation on functional units (L = load, X/+/- = arithmetic): cycle 1 issues L1-L4, cycle 2 issues X1 and X2, cycle 3 issues + and -, producing A and B in 3 cycles]
Mismatch between software and hardware parallelism

Example (same code sequence as above): execution using a single functional unit for each of Load, Mul and Add/Sub:

Cycle 1: L1
Cycle 2: L2
Cycle 3: X1, L3
Cycle 4: L4
Cycle 5: X2
Cycle 6: +
Cycle 7: -    (A and B produced)
Mismatch between software and hardware parallelism

Example (same code sequence as above): execution using two functional units for each of the Load, Mul and Add/Sub operations, i.e. a dual-processor schedule. Stores S1, S2 and loads L5, L6 are inserted for inter-processor synchronization:

Cycle 1: L1, L3
Cycle 2: L2, L4
Cycle 3: X1, X2
Cycle 4: S1, S2    (store partial results)
Cycle 5: L5, L6    (load the other processor's partial result)
Cycle 6: +, -      (A and B produced)
Program Partitioning & Scheduling
• Program Partitioning
• The transformation of sequentially coded program into a
parallel executable form can be done manually by the
programmer using explicit parallelism or by a compiler
detecting implicit parallelism automatically.
• Program partitioning determines whether the given
program can be partitioned or split into pieces that can be
executed in parallel or follow a certain pre-specified order
of execution.



Program Partitioning & Scheduling
• Grain size or Granularity
• It is the size of the parts or pieces of a program that can be considered for
parallel execution.
• The simplest measure of grain size counts the number of instructions in a program segment chosen for parallel execution.
• Grain sizes are usually described as fine, medium or coarse, depending on
the level of parallelism involved
• Latency
Latency is the time required for communication between different subsystems in
a computer.
• Memory latency, for example, is the time required by a processor to
access memory.
• Synchronization latency is the time required for two processes to
synchronize their execution.
• Computational granularity and communication latency are closely related.



Levels of Parallelism

Levels of parallelism in program execution, from coarse grain to fine grain (communication demand, scheduling overhead, and the achievable degree of parallelism all vary with the grain level):

• Coarse grain: jobs or programs; subprograms, job steps, or related parts of a program
• Medium grain: procedures, subroutines, tasks, or coroutines
• Fine grain: non-recursive loops or unfolded iterations; instructions or statements



BCS-29
Advanced Computer Architecture
Pipelined Processing
Linear & Nonlinear pipelines
Instruction Pipelines & Arithmetic Operations
Principles of Pipelining
• A pipeline may be compared directly with an
assembly line in a manufacturing plant. Thus
• Input task or process is subdivided into a sequence of subtasks;
• Each subtask is executed by a specialized hardware stage;
• Many such hardware stages operate concurrently;
• When successive tasks are streamed into the pipeline they are
executed in an overlapped fashion at the subtask level.
• The creation of the correct sequence of subtasks is crucial to
the performance of the pipeline.
• Slowest subtask is the bottleneck in the pipeline.



Linear Pipeline
• A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other.
• Linear pipelines are static pipelines because they are used to perform fixed functions.
• On the basis of the control of data flow along the pipeline, we
model linear pipelines in two categories:
• Synchronous Pipeline
• Clocked latches between Stage i and Stage i+1
• Equal delays in all stages
• Asynchronous Pipeline (Handshaking)



Asynchronous Pipeline

[Figure: k-stage asynchronous pipeline from input to output; adjacent stages S1..Sk exchange Ready/Ack handshaking signals]

• Data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol.
• Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers.
Different amount of delay may be experienced in different
stages.



Synchronous Pipeline
[Figure: k-stage synchronous pipeline: latches L0..Lk separate stages S1..Sk and are driven by a common clock; τ is the clock period, τM the maximum stage delay, and d the latch delay]

• Clocked latches are used to interface between stages. On the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
• The pipeline stages are combinational logic circuits. It is desired to have
approximately equal delays in all stages.
• These delays determine the clock period and thus the speed of the
pipeline.
Reservation Table
• The utilization pattern of successive stages in a pipeline is
specified by a Reservation Table
One task flowing through a four-stage linear pipeline:

      T1  T2  T3  T4
S1    X
S2        X
S3            X
S4                X

Five tasks streamed through the same four-stage linear pipeline:

      T1  T2  T3  T4  T5  T6  T7  T8
S1    X   X   X   X   X
S2        X   X   X   X   X
S3            X   X   X   X   X
S4                X   X   X   X   X

Reservation table of 5 tasks on a four-stage linear pipeline.
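The table for any k-stage linear pipeline with n streamed tasks can be generated with a minimal Python sketch:

    def reservation_table(k=4, n=5):
        width = k + n - 1                         # total clock cycles occupied
        table = [["." for _ in range(width)] for _ in range(k)]
        for task in range(n):
            for stage in range(k):
                table[stage][task + stage] = "X"  # task enters stage s at cycle task+s
        for s, row in enumerate(table, 1):
            print(f"S{s}: " + " ".join(row))

    reservation_table()   # reproduces the 5-task, 4-stage table above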


Clock period and frequency
Consider τi to be the time delay due to the logic circuitry in stage Si, and d to be the time delay of each interface latch. Then:
• The clock period τ of a linear pipeline is given by
  τ = max{τi} + d = τM + d
• The frequency f = 1/τ = 1/(τM + d) (i.e., the reciprocal of the clock period)



Speedup, Efficiency & Throughput
• Speedup
• Ideally, a linear pipeline of k stages can process n tasks in k + (n − 1) clock cycles, so the total time required is
  Tk = [k + (n − 1)] τ
• The time required for non-pipelined processing is
  T1 = n k τ
• The speedup factor is
  Sk = T1 / Tk = n k / [k + (n − 1)]
• The maximum speedup is Sk → k as n → ∞.
• This maximum speedup is very difficult to achieve because of the dependence structure of programs.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-8


Speedup, Efficiency & Throughput
• Pipeline efficiency:
  η = Sk / k = n / [k + (n − 1)]
  So efficiency η → 1 when n → ∞, and the lower bound of η is 1/k when n = 1.
• Throughput, i.e., the number of tasks performed per unit time:
  Hk = n / {[k + (n − 1)] τ} = n f / [k + (n − 1)]
  The maximum throughput is f, reached when efficiency → 1 as n → ∞.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-9
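A small Python sketch of the speedup, efficiency and throughput formulas above (k, n and the clock period below are assumed example values):

def pipeline_metrics(k, n, tau):
    Tk = (k + (n - 1)) * tau       # time for n tasks, pipelined
    T1 = n * k * tau               # time for n tasks, non-pipelined
    Sk = T1 / Tk                   # speedup = n k / (k + n - 1)
    eta = Sk / k                   # efficiency, -> 1 as n -> infinity
    Hk = n / Tk                    # throughput: tasks per unit time
    return Sk, eta, Hk

print(pipeline_metrics(k=4, n=100, tau=100e-9))   # Sk ~ 3.88, close to k = 4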


Linear Instruction Pipelines
Assume the following instruction execution phases:
Fetch (F)
Decode (D)
Operand Fetch (O)
Execute (E)
Write results (W)

Overlapped execution:

Cycle →  1   2   3   4   5   6   7
F        I1  I2  I3
D            I1  I2  I3
O                I1  I2  I3
E                    I1  I2  I3
W                        I1  I2  I3
Dr P K Singh Slide-2.10
Dependencies
• Data Dependency
(Operand is not ready yet)
• Instruction Dependency
(Branching)
Will that Cause a Problem?

Data Dependency
I1: Add R1, R2, R3
I2: Sub R4, R1, R5

Cycle →  1   2   3   4   5   6
F        I1  I2
D            I1  I2
O                I1  I2
E                    I1  I2
W                        I1  I2

(I2 fetches operand R1 in cycle 4, before I1 writes it back in cycle 5.)

Solutions:
• Stall
• Forwarding
• Write and read in one cycle
• …
Dr P K Singh Slide-2.11
Instruction Dependency
I1: Branch (outcome and target are not known until late in the pipeline)
I2: next instruction

Cycle →  1   2   3   4   5   6
F        I1  I2
D            I1  I2
O                I1  I2
E                    I1  I2
W                        I1  I2

(I2 is fetched in cycle 2, before I1's branch outcome is resolved.)

Solutions:
• Stall
• Predict branch taken
• Predict branch not taken
• …

Dr P K Singh Slide-2.12
Non Linear Pipelines
• Non-linear pipelines are dynamic pipelines because they can be reconfigured to perform variable functions at different times.
• Non-linear pipelines allow feed-forward and feedback connections in addition to the streamline connections.

[Figure: a three-stage pipeline S1, S2, S3 with streamline, feed-forward, and feedback connections, producing outputs X and Y]

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-13


Difference Between Linear and Non-Linear
pipeline
Linear Pipeline vs. Non-Linear Pipeline:
• Linear pipelines are static pipelines, used to perform fixed functions. / Non-linear pipelines are dynamic pipelines that can be reconfigured to perform variable functions at different times.
• Linear pipelines allow only streamline connections. / Non-linear pipelines allow feed-forward and feedback connections in addition to the streamline connections.
• It is relatively easy to partition a given function into a sequence of linearly ordered subfunctions. / Function partitioning is relatively difficult because the pipeline stages are interconnected with loops in addition to streamline connections.
• The output of the pipeline is produced from the last stage. / The output of the pipeline is not necessarily produced from the last stage.
• The reservation table is trivial in the sense that data flows in a linear streamline. / The reservation table is non-trivial in the sense that there is no linear streamline for data flows.
• Static pipelining is specified by a single reservation table. / Dynamic pipelining is specified by more than one reservation table.
• All initiations to a static pipeline use the same reservation table. / A dynamic pipeline may allow different initiations to follow a mix of reservation tables.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-14


Reservation Table
[Figure: the three-stage non-linear pipeline of the previous slide, producing output X or output Y]

• There are two reservation tables corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table.
• A dynamic pipeline may be specified by more than one reservation table.
• Each reservation table displays the time-space flow of data through the pipeline for one function evaluation.
• Different functions follow different paths through the pipeline.
• The number of columns in a reservation table is called the evaluation time of the given function.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-15


Reservation Table
Reservation table for function X (evaluation time 8):

Time →  1   2   3   4   5   6   7   8
S1      X                   X       X
S2          X       X
S3              X       X       X
Reservation table for function Y (evaluation time 6):

Time →  1   2   3   4   5   6
S1      Y               Y
S2          Y
S3              Y   Y       Y
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-16
Reservation Table
• The check marks in each row of the reservation table
correspond to the time instants (cycles) that a particular stage
will be used.
• There may be multiple check marks in a row, which means
repeated usage of the same stage in different cycles.
• Contiguous check marks in a row simply imply the extended
usage of a stage over more than one cycle.
• Multiple check marks in a column mean that multiple stages
need to be used in parallel during a particular clock cycle.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-17


Latency Analysis
• Latency
• The number of time units [clock cycles] between two initiations of a
pipeline is the Latency between them.
• A latency of K means that two initiations are separated by K clock
cycles.
• Collision
• Any attempt by two or more initiations to use the same pipeline stage
at the same time will cause a collision.
• A collision implies resource conflicts between two initiations in the
pipeline. Therefore, all collisions must be avoided in scheduling a
sequence of pipeline initiations.
Forbidden Latency: Latencies that cause collisions.
Permissible Latency: Latencies that will not cause collisions.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-18
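The forbidden latencies of a pipeline can be read off its reservation table mechanically: every distance between two check marks in the same row is forbidden. A small Python sketch, using the function-X table from the earlier slide (column positions as reconstructed there):

table = {            # stage -> set of cycles in which the stage is used
    "S1": {1, 6, 8},
    "S2": {2, 4},
    "S3": {3, 5, 7},
}

forbidden = {b - a
             for cycles in table.values()
             for a in cycles for b in cycles if a < b}
print(sorted(forbidden))    # -> [2, 4, 5, 7] for this table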


Forbidden Latencies
• X after X, latency 2:
  S1: X1 in cycles 1, 6, 8; X2 in cycles 3, 8, 10 → collision in cycle 8
  S2: X1 in cycles 2, 4; X2 in cycles 4, 6 → collision in cycle 4
  S3: X1 in cycles 3, 5, 7; X2 in cycles 5, 7, 9 → collisions in cycles 5 and 7

• X after X, latency 5:
  S1: X1 in cycles 1, 6, 8; X2 in cycles 6, 11, 13 → collision in cycle 6
Dr P K Singh Slide-2.19
Forbidden Latencies
• X after X, latency 4:
  S3: X1 in cycles 3, 5, 7; X2 in cycles 7, 9, 11 → collision in cycle 7

• X after X, latency 7:
  S1: X1 in cycles 1, 6, 8; X2 in cycles 8, 13, 15 → collision in cycle 8

Dr P K Singh Slide-2.20
Permissible Latencies
• X after X, latency 1: no stage is needed by X1 and X2 in the same cycle
  S1: X1 in cycles 1, 6, 8; X2 in cycles 2, 7, 9
  S2: X1 in cycles 2, 4; X2 in cycles 3, 5
  S3: X1 in cycles 3, 5, 7; X2 in cycles 4, 6, 8

• X after X, latency 3: again no collisions
  S1: X1 in cycles 1, 6, 8; X2 in cycles 4, 9, 11
  S2: X1 in cycles 2, 4; X2 in cycles 5, 7
  S3: X1 in cycles 3, 5, 7; X2 in cycles 6, 8, 10

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-21


Nonlinear Pipeline Design
• Latency Sequence
• A sequence of permissible latencies between successive task initiations
• Latency Sequence → 1, 8
• Latency Cycle
• A Latency Cycle is a latency sequence which repeats the same
subsequence (cycle) indefinitely.
• Latency cycle (1,8) → 1, 8, 1, 8, 1, 8, …
• Average latency
• The average latency of a latency cycle is obtained by dividing the sum of all
latencies by the number of latencies along the cycle.
• Average Latency (of a latency cycle) → sum of all latencies / number of
latencies along the cycle {(1+8)/2=4.5}
• Constant Cycle → One latency value (3*)
Objective → Obtain the shortest average latency between
initiations without causing collisions.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-22


Latency Cycle (1,8)

Over cycles 1–21: initiations X1 and X2 enter in cycles 1 and 2 (latency 1), X3 and X4 in cycles 10 and 11 (latency 8, then 1), X5 and X6 in cycles 19 and 20, and so on; the stage-usage pattern repeats with period 9 and no collisions occur.

Average Latency = (1+8)/2 = 4.5

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-23


Scheduling events
• Collision-Free Scheduling:
• When scheduling events in a nonlinear pipeline, the main objective is
to obtain the shortest average latency between initiations without
causing collisions.
• Collision vector
• By examining the reservation table, one can distinguish the set of
permissible latencies and set of forbidden latencies.
• The combined Set of permissible and forbidden latencies can be easily
displayed by a collision vector

• C = (Cm, Cm−1, …, C2, C1), m <= n−1
  • n = number of columns in the reservation table
  • Ci = 1 if latency i causes a collision, 0 otherwise
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-24


Collision Vector
Forbidden Latencies: 2, 4, 5, 7
Collision Vector = 1 0 1 1 0 1 0

The next state is obtained by bitwise ORing the initial collision vector with the shifted register:

  C.V. = 1 0 1 1 0 1 0  (first state)
         0 1 0 1 1 0 1  (C.V. right-shifted by 1 bit, for latency 1)
         1 0 1 1 0 1 0  (initial C.V.)
         --------------- OR
         1 1 1 1 1 1 1
State Transition
State Transition using an n-bit Right shift register (n is
maximum forbidden latency)

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-26
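The collision vector and the shift-register state transition can be expressed directly in Python; a sketch for the example above, holding the vector as an integer whose bit i−1 is Ci:

forbidden = {2, 4, 5, 7}                        # from the reservation table
cv0 = sum(1 << (i - 1) for i in forbidden)      # C = 1011010

def next_state(state, latency):
    # shift right by the latency taken, then OR with the initial C.V.
    return (state >> latency) | cv0

print(format(cv0, "07b"))                 # 1011010 (initial state)
print(format(next_state(cv0, 1), "07b"))  # 1111111, as computed above
print(format(next_state(cv0, 3), "07b"))  # 1011011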


Latency Analysis
X after X

[Figure: hardware for collision-free scheduling: the collision vector is loaded into a shift register; each cycle the register shifts right, an initiation X is granted when the bit shifted out is 0, and the register is then ORed with the initial C.V.]

[State diagram: from the initial state 1011010, latency 1 leads to 1111111; latencies 3 and 6 lead to 1011011; state 1011011 loops on latencies 3 and 6; any latency of 8 or more (8+) returns to 1011010. The minimum-latency edges are marked 1* and 3*.]

Cycles (1, 8), (1, 8, 6, 8), (1, 8, 3, 8), (3), (6), (3, 8), (3, 6, 3), and many more legitimate cycles may be traced.
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-27
Latency Analysis
• Latency Sequence:
• A sequence of permissible latencies between successive initiations
• Latency Cycle:
• A latency sequence that repeats the same subsequence (cycle)
indefinitely
• Simple cycles:
• A simple cycle is a latency cycle in which each state appears only
once.
(3), (6), (8), (1, 8), (3, 8), and (6,8)
• Greedy Cycles:
• Simple cycles whose edges are all made with minimum latencies from
their respective starting states.
• Greedy cycles must first be simple, and their average latencies must
be lower than those of other simple cycles.
(1,8), (3) → one of them yields the MAL (minimum average latency)
Minimum Average latency(MAL)
• The minimum-latency edges in the state diagrams are
marked with asterisks.
• At least one of the greedy cycles will lead to the MAL.

• Bounds on the MAL


• MAL is lower bounded by the maximum number of checkmarks
in any row of the reservation table.
• MAL is lower than or equal to the average latency of any greedy
cycle in the state diagram.
• The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1. This is also an upper bound on the MAL.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-29
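These bounds can be checked by brute force; a Python sketch that enumerates short latency cycles for the running example (it reuses cv0 and next_state from the previous sketch):

from itertools import product

def sustainable(cycle, reps=10):
    # a cycle is legitimate if repeating it never uses a forbidden latency
    state = cv0
    for lat in cycle * reps:
        if (state >> (lat - 1)) & 1:     # bit set -> collision
            return False
        state = next_state(state, lat)
    return True

cycles = [c for r in (1, 2) for c in product(range(1, 9), repeat=r)
          if sustainable(c)]
mal = min(sum(c) / len(c) for c in cycles)
print(mal)    # -> 3.0, from the constant cycle (3), within the stated bounds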


Optimization of MAL
• To optimize the MAL, one needs to find the lower bound by modifying the
reservation table.
• The approach is to reduce the maximum number of check marks in any
row.
• The modified reservation table must preserve the original function being
evaluated.
• use of non-compute delay stages to increase pipeline performance with a
shorter MAL.
Delay Insertion:
• The purpose of delay insertion is to modify the reservation table, yielding a new collision vector.
• This leads to a modified state diagram, which may produce greedy cycles meeting the lower bound on the MAL.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-30


Delay Insertion

[Figure: a three-stage pipeline with output taken from S3 and feedback connections back to earlier stages]

Reservation table (5 cycles):

Time →  1   2   3   4   5
S1      X               X
S2          X       X
S3              X   X

Forbidden latencies: 1, 2, 4    C.V. → 1 0 1 1

[State diagram: the single state 1011 loops on latency 3 (marked 3*); latencies of 5 or more (5+) also return to it.]    MAL = 3
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-31
Delay Insertion
[Figure: the same pipeline with two non-compute delay stages inserted: D1 breaks up the back-to-back uses of S3, and D2 delays the path to the output]

Modified reservation table (7 cycles):

Time →  1   2   3   4   5   6   7
S1      X                       X
S2          X       X
S3              X       X
D1                  X
D2                          X

Forbidden latencies: 2, 6    C.V. → 1 0 0 0 1 0

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-32


Delay Insertion
[State diagram for C.V. = 100010: from the initial state 100010, latency 1 (1*) leads to 110011, latency 3 (3*) to 100110, latency 5 to 100011, and latencies 4 and 7+ return to 100010; from 110011, latency 3 leads to 100110; from 100110, latency 1 leads back to 110011; 100011 loops on latency 5 and reaches 100110 on latency 3.]

Greedy cycle (1, 3), resulting in a reduced MAL = (1 + 3)/2 = 2.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-33


BCS-29
Advanced Computer Architecture
Pipelined Processing
INSTRUCTION PIPELINE DESIGN
Pipeline Hazards
• Pipeline Hazards
• The situations that prevent the next instruction in the instruction
stream from executing in its designated clock cycle.
• Hazards reduce the performance from the ideal speedup gained by
pipelining
• Three types of hazards
• Structural hazards
• Arise from resource conflicts when the hardware can’t support all
possible combinations of overlapping instructions
• Data hazards
• Arise when an instruction depends on the results of a previous
instruction in a way that is exposed by overlapping of instruction in
pipeline
• Control hazards
• Arise from the pipelining of branches and other instructions that
change the PC (Program Counter)

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-2


Structural Hazards
• If certain combination of instructions can’t be accommodated
because of resource conflicts, the machine is said to have a
structural hazard
• It can be generated by:
• Some functional unit is not fully pipelined
• Some resources has not been duplicated enough to allow all the
combinations in the pipeline to execute
• For example: a machine may have only one register file write port, but
under certain conditions, the pipeline might want to perform two writes
in one clock cycle – this will generate structural hazard
• When a sequence of instructions encounter this hazard, the pipeline will
stall one of the instructions until the required unit is available
• Such stalls will increase the clock cycles per instruction (CPI) from its ideal value of 1 for pipelined machines

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-3


Structural Hazards
Clock number →   1    2    3    4     5    6    7    8    9    10
Load             IF   ID   EX   MEM   WB
Instruction i+1       IF   ID   EX    MEM  WB
Instruction i+2            IF   ID    EX   MEM  WB
Instruction i+3                 stall IF   ID   EX   MEM  WB
Instruction i+4                       IF   ID   EX   MEM  WB
Instruction i+5                            IF   ID   EX   MEM

• A machine with structural hazards will have a higher CPI (and lower performance)

• Why would a designer allow structural hazards?
  • To reduce cost
    • Pipelining all the functional units or duplicating them may be too costly
  • To reduce latency
    • Introducing too many pipeline stages may cause latency issues
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-4
INSTRUCTION PIPELINE DESIGN
Pipelined Instruction Processing

X = Y + Z and A = B × C

In-order Instruction issuing

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-5


INSTRUCTION PIPELINE DESIGN

Reordered Instruction issuing

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-6


Mechanisms for Instruction Pipelining
Prefetch Buffers:
Three types of buffers can be used to match the instruction fetch rate to the pipeline
consumption rate.
In one memory-access time, a block of consecutive instructions is fetched into a prefetch buffer.
Sequential Buffer: Sequential instructions are loaded into this Buffer for in-sequence
pipelining.
Target Buffer: Instructions from a branch target are loaded into this buffer for out-of-
sequence pipelining.
Loop Buffer: This buffer holds sequential instructions contained in a small loop. The
loop buffers are maintained by the fetch stage of the pipeline.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-7


Mechanisms for Instruction Pipelining
Multiple Functional Unit:
Sometimes a certain pipeline stage becomes the bottleneck. This stage
corresponds to the row with the maximum number of checkmarks in the
reservation table. This bottleneck problem can be alleviated by using multiple
copies of the same stage simultaneously. This leads to the use of multiple
execution units in a pipelined processor design.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-8


P6 Microarchitecture

[Figure: block diagram of the Intel P6 microarchitecture, an example of multiple execution units]

9
Mechanisms for Instruction Pipelining
Internal Dara Forwarding:
The throughput of a pipelined processor can be further improved with
internal data forwarding among multiple functional units. In some eases,
some memory-access operations can be replaced by register transfer
operations.
Store M, R1 Store M, R1
Load R2, M Move R2, R1

Store-Load forwarding
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-10
Mechanisms for Instruction Pipelining
Internal Data Forwarding

Load R1, M           Load R1, M
Load R2, M     →     Move R2, R1

Load-Load forwarding

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-11


Mechanisms for Instruction Pipelining
Example of Internal Data Forwarding

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-12


Pipeline Hazards
• The read and write of shared variables by different instructions in a
pipeline may lead to Hazard.
• Hazards should be prevented before these instructions enter the
pipeline.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-13


Dynamic Instruction Scheduling
Static Scheduling:
• This scheme is supported by an optimizing compiler.
• Data dependences in a sequence of instructions create
interlocked relationships among them.
• Interlocking can he resolved through a compiler-based static
scheduling approach.
• A compiler can be used to increase the separation between interlocked instructions by resequencing them.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-14


Dynamic Instruction Scheduling
Tomasulo's Algorithm
  • Register tagging
  • Hardware-based dependence resolution
Scoreboarding Technique
  • Scoreboard: the centralized control unit
  • A type of data-driven mechanism
  • Enables out-of-order execution and allows out-of-order completion
  • Splits the ID pipe stage into 2 stages:
    • Issue: decode instructions, check for structural hazards
    • Read operands: wait until no data hazards, then read operands

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-15


Dynamic Instruction Scheduling
Tomasulo's Algorithm
• Control & buffers distributed with the functional units (FUs)
  • FU buffers are called "reservation stations" (RS); they hold pending operands
• Registers in instructions are replaced by values or by pointers to reservation stations (RS)
  • a form of register renaming
  • avoids WAR and WAW hazards
• More reservation stations than registers, so optimizations compilers can't do become possible
• Results pass to the FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
• Loads and stores are treated as FUs with RSs as well

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-16


Dynamic Instruction Scheduling
Tomasulo’s Algorithm
• An issued instruction whose operands are
not available is forwarded to an RS associated
with the functional unit it will use.
• It waits until its data dependences have been
resolved and its operands become available.
• The dependence is resolved by monitoring
the result bus, when all operands for an
instruction are available, it is dispatched to
the functional unit for execution
• All working registers are tagged. If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to an RS.
• When the register data becomes available, it also reaches the RS that has the same tag.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-17


Dynamic Instruction Scheduling
Scoreboarding –
• Technique for allowing instructions to execute out of order
when there are sufficient resources and no data dependencies.
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages
• ID1: Issue — decode instructions & check for structural hazards
• ID2: Read operands — wait until no data hazards, then read operands
• EX: Execute — operate on operands; when the result is ready, it notifies
the scoreboard that it has completed execution
• WB: Write results — finish execution; the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the instruction

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-18


The Scoreboard
Three Parts of the Scoreboard
• Instruction status—which of 4 steps the instruction is in
• Functional unit status—Indicates the state of the functional unit (FU).
• Register result status—Indicates which functional unit will write each
register, if one exists. Blank when no pending instructions will write
that register

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-19
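A sketch of these three tables as plain Python data structures (the sample instructions and field names below are illustrative, following the common textbook presentation, not taken from these slides):

instruction_status = {              # which of the 4 steps each instruction is in
    "LD  F6,34(R2)": "write_result",
    "MULT F0,F2,F4": "read_operands",
}

functional_unit_status = {          # one entry per functional unit (FU)
    "Mult1": dict(busy=True, op="MULT", Fi="F0", Fj="F2", Fk="F4",
                  Qj=None, Qk=None, Rj=True, Rk=True),
}

register_result_status = {          # which FU will write each register
    "F0": "Mult1",
    "F6": None,                     # None when no pending write exists
}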


Branch Handling Techniques
Source code segment:
  for (i = 0; i < 100; i++)
    if (a[i] <= 50)
      j = j + 1;
    else
      k = k + 1;

Assembly code segment:
      mov r1,0
      mov r2,0
      ld_i r3,A,0
  L1: ld_i r4,r3,r2
      bgt r4,50,L2
      add r5,r5,1
      jmp L3
  L2: add r6,r6,1
  L3: add r1,r1,1
      add r2,r2,4
      blt r1,100,L1

Assembly code segment after if-conversion:
      mov r1,0
      mov r2,0
      ld_i r3,A,0
  L1: ld_i r4,r3,r2
      pgt p1(U),p2(U),r4,50
      add r5,r5,1 (p2)
      add r6,r6,1 (p1)
      add r1,r1,1
      add r2,r2,4
      blt r1,100,L1

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-20


Branch Handling Techniques
[Figure: pipeline with predicated execution: Instruction Fetch → Instruction Decode → Execute, where the register file and a predicate register file (T/F) feed the execute stage → Memory Access → Write Back; the predicate decides whether the result is committed]

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-21


Branch Handling Techniques
Delayed branches:
• A delayed branch of d cycles allows at least d-1 useful instructions to be
executed after the branch instruction.
• Execution of these instructions must be independent of branch instruction
to achieve the zero branch penalty.
• Code motion across branches can be used to achieve a delayed branch; when enough useful instructions are not available, NOPs are used as fillers.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-22


Branch Handling Techniques
Branch Prediction:
Branch can be predicted either based on branch code types statically or
dynamically, based on branch history during program execution.
Static Branch Prediction:
• The static prediction direction (taken or not taken) can even be wired into the processor. According to past experience, the best performance is given by predicting taken.
• The wired-in static prediction cannot be changed once committed to the
hardware.
• However, the scheme can be modified to allow the compiler to select the
direction of each branch on a semi-static prediction basis.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-23


Branch Handling Techniques
Dynamic Branch Prediction:
• The dynamic branch prediction strategy works better because it uses recent
branch history to predict whether or not the branch will be taken next time
when it occurs.
• Dynamic prediction demands additional hardware to keep track of the past
behavior of the branch instructions at run time.
• The following state transition diagram may be used for tracking the last two outcomes at each branch instruction in a given program.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-24
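A standard concrete form of such a two-outcome tracker is the 2-bit saturating counter; a Python sketch (the state encoding below is an assumed, conventional one, not read off the slide's diagram):

class TwoBitPredictor:
    def __init__(self):
        self.state = 2                  # 0,1: predict not taken; 2,3: taken

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:    # an assumed branch history
    print("predict taken?", p.predict(), "actual:", outcome)
    p.update(outcome)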


BCS-29
Advanced Computer Architecture
Pipelined Processing
ARITHMETIC PIPELINE DESIGN
Other ILP Architectures
Pipeline Multiplication
Multiplication
• Consider as an example the multiplication of two 8-bit integers, A × B = P, where P is the 16-bit product.
• This fixed-point multiplication can be written as the summation of eight partial products, as shown below: P = A × B = P0 + P1 + P2 + … + P7.

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-2
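The summation view of fixed-point multiplication is easy to sketch in Python; each partial product Pj is A shifted left by j when bit j of B is 1:

def multiply8(a, b):
    p = 0
    for j in range(8):
        if (b >> j) & 1:        # Pj = A << j if bit j of B is set, else 0
            p += a << j
    return p                    # 16-bit product

assert multiply8(93, 58) == 93 * 58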


Pipeline Multiplication
Carry Propagate Adder

Carry Save Adder

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-3
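A carry-save adder (CSA) reduces three operands to a sum word and a carry word without propagating carries; only the final carry-propagate adder (CPA) performs a full addition. A small Python sketch of that identity:

def csa(a, b, c):
    s = a ^ b ^ c                               # bitwise sum
    k = ((a & b) | (b & c) | (a & c)) << 1      # saved carries, shifted left
    return s, k

a, b, c = 23, 99, 40
s, k = csa(a, b, c)
assert s + k == a + b + c       # the CPA performs this final addition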


Pipeline Multiplication

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-4


Floating Point Addition
• A linear pipeline with four functional stages. The inputs are two normalised floating-point numbers, a*2^p and b*2^q; the output is a normalised floating-point number d*2^s, which is the sum of the two inputs.
• Stage 1: exponent subtractor and fraction selector compute |p − q| and pick the fraction with min(p, q) for alignment.
• Stage 2: a right shifter aligns the smaller fraction by shifting it |p − q| places; r = max(p, q).
• Stage 3: the fraction adder adds the aligned fractions, producing c.
• Stage 4: a leading-zero counter, left shifter, and exponent adder normalise c to d and adjust the exponent to s.
• The hardware units other than the latches can all be implemented using combinational logic.
• If the time delay of the interface latches is 10 ns and the time delays of the four stages are 60, 50, 90 and 80 ns, respectively, then the cycle time of the pipeline can be chosen to be 100 ns.
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-5
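The four stages map naturally onto code; a much-simplified Python sketch on (fraction, exponent) pairs, ignoring signs, rounding and special cases (the 24-bit fraction width is an assumption):

def fp_add(a, b):
    (fa, ea), (fb, eb) = a, b
    if ea < eb:                                  # stage 1: compare exponents,
        (fa, ea), (fb, eb) = (fb, eb), (fa, ea)  # select fraction to align
    fb >>= (ea - eb)                             # stage 2: right-shift by |p-q|
    f, e = fa + fb, ea                           # stage 3: fraction add, r = max(p,q)
    while f >= (1 << 24):                        # stage 4: normalise and
        f >>= 1                                  # adjust the exponent
        e += 1
    return f, e

print(fp_add((6 << 20, 3), (5 << 20, 1)))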


Combined Adder and Multiplier

[Figure: a multifunction arithmetic pipeline whose stages are shared by floating-point add and multiply: A (exponent subtract), B (partial products), C (add mantissa), D (round), E (renormalize), F (partial shift), G (find leading 1), H (partial shift), with feed-forward and feedback connections between them]
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-6


Reservation Table for Multiplication
Time →  1   2   3   4   5   6   7
A       X
B           X   X
C               X   X
D                       X   X
E                               X
F
G
H

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-7


Reservation Table for Addition

Time →  1   2   3   4   5   6   7   8   9
A       Y
B
C                   Y
D                                   Y
E                                       Y
F           Y   Y
G                       Y
H                           Y   Y

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-8


Superscalar Architectures
• Superscalar processors attempt to issue multiple instructions
per cycle
• However, essential dependencies are specified by sequential ordering so
operations must be processed in sequential order
• This proves to be a performance bottleneck that is very expensive to
overcome
• Program contains no explicit information regarding
dependencies that exist between instructions
• Dependencies between instructions must be determined by
the hardware
• It is only necessary to determine dependencies with sequentially
preceding instructions that have been issued but not yet completed
• Compiler may re-order instructions to facilitate the hardware’s
task of extracting parallelism

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-9


Superscalar Architectures

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-10


Superscalar Architectures
[Figure: a superscalar pipeline of degree 3: three instructions enter the IF, ID, EX, WB stages together in each cycle]

• Superscalar:
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle (n x speedup?)

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-11


Superscalar Performance
Estimate the ideal execution time of N independent instructions
through the pipeline.
• The time required by the scalar base machine (single pipeline) is
  T(1,1) = k + N − 1 (base cycles)
• The ideal execution time required by an m-issue superscalar machine is
  T(m,1) = k + (N − m)/m (base cycles)
• The ideal speedup of the superscalar machine over the base machine is
  S(m,1) = T(1,1)/T(m,1) = (k + N − 1)/(k + (N − m)/m) = m(k + N − 1)/(N + m(k − 1))

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-12
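A quick numeric check of these formulas in Python (k, N and m below are assumed example values):

def superscalar_speedup(k, N, m):
    T1 = k + N - 1              # T(1,1), base cycles
    Tm = k + (N - m) / m        # T(m,1), base cycles
    return T1 / Tm              # = m(k + N - 1) / (N + m(k - 1))

print(superscalar_speedup(k=5, N=1000, m=4))    # ~3.95, approaching m as N grows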


Other ILP Architectures
Superpipelined Architecture:
• cycle time = 1/m of baseline
• Issue parallelism = IP = 1 inst / minor cycle
• Operation latency = OP = m minor cycles
• Peak IPC = m instr / major cycle (m x speedup?)

[Figure: a superpipelined processor of degree m = 3: a new instruction enters the IF, DE, EX, WB stages every minor cycle, i.e., every 1/3 of the base cycle]
Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-13
Other ILP Architectures
• VLIW: Very Long Instruction Word
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle = 1 VLIW / cycle

[Figure: VLIW execution: one long instruction word is fetched (IF) and decoded (DE), then its multiple operations execute in parallel functional units (EX) and write back (WB) together]

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-14


Other ILP Architectures
• Superpipelined-Superscalar
• Issue parallelism = IP = n inst / minor cycle
• Operation latency = OP = m minor cycles
• Peak IPC = n x m instr / major cycle
[Figure: a superpipelined-superscalar processor: n instructions enter every minor cycle, each taking m minor cycles per IF, DE, EX, WB stage]

Dr. P K Singh MMMUT, Gorakhpur BCS-29(!)-15


BCS-29
Advanced Computer Architecture
Memory Subsystem
Memory Hierarchy
Memory Subsystem

[Figure: the memory subsystem: the CPU talks to cache memory and main memory; an I/O processor connects main memory to auxiliary memory (magnetic disks and magnetic tapes)]

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-2


Memory Subsystem
L0: CPU registers, which hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM), which holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM), which holds cache lines retrieved from main memory.
L3: main memory (DRAM), which holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks), which holds files retrieved from disks on remote network servers.
L5: remote secondary storage (tapes, distributed file systems, Web servers).

Storage devices become smaller, faster, and costlier (per byte) toward L0, and larger, slower, and cheaper (per byte) toward L5.

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-3


Memory Hierarchy
• Registers
• In CPU

• Internal or Main memory


• May include one or more
levels of cache
• “RAM”

• External memory
• Backing store

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-4


Internal Memory Types

Memory Type                  Category            Erasure                     Write Mechanism   Volatility
Random-access memory (RAM)   Read-write memory   Electrically, byte-level    Electrically      Volatile
Read-only memory (ROM)       Read-only memory    Not possible                Masks             Nonvolatile
Programmable ROM (PROM)      Read-only memory    Not possible                Electrically      Nonvolatile
Erasable PROM (EPROM)        Read-mostly memory  UV light, chip-level        Electrically      Nonvolatile
Electrically Erasable PROM   Read-mostly memory  Electrically, byte-level    Electrically      Nonvolatile
(EEPROM)
Flash memory                 Read-mostly memory  Electrically, block-level   Electrically      Nonvolatile

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-5


External Memory Types
• HDD
• Magnetic Disk(s)
• SDD (Solid State Disk(s))

• Optical
• CD-ROM
• CD-Recordable (CD-R)
• CD-R/W
• DVD

• Magnetic Tape

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-6


Hierarchical Memory Technology
• Memory in system is usually characterized as
appearing at various levels (0, 1, …) in a hierarchy,
with level 0 being CPU registers and level 1 being the
cache closest to the CPU.
• Each level is characterized by five parameters:
• access time ti (round-trip time from CPU to ith level)
• memory size si (number of bytes or words in the level)
• cost per byte ci
• transfer bandwidth bi (rate of transfer between levels)
• unit of transfer xi (grain size for transfers)

BCS-29 Advanced Computer


Dr P K Singh Slide-4.7
Architecture
Memory Generalities
• It is almost always the case that memories at lower-numbered levels, when compared to those at higher-numbered levels,
• are faster to access,
• are smaller in capacity,
• are more expensive per byte,
• have a higher bandwidth, and
• have a smaller unit of transfer.
• In general, then, ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi, and
xi-1 < xi.
BCS-29 Advanced Computer
Dr P K Singh Slide-4.8
Architecture
The Inclusion Property
• The inclusion property is stated as:
M1 ⊆ M2 ⊆ … ⊆ Mn
The implication of the inclusion property is that all
items of information in the “innermost” memory
level (cache) also appear in the outer memory levels.
• The inverse, however, is not necessarily true. That is,
the presence of a data item in level Mi+1 does not
imply its presence in level Mi. We call a reference to
a missing item a “miss.”

BCS-29 Advanced Computer


Dr P K Singh Slide-4.9
Architecture
The Coherence Property
• The inclusion property is, of course, never
completely true, but it does represent a desired
state. That is, as information is modified by the
processor, copies of that information should be
placed in the appropriate locations in outer memory
levels.
• The requirement that copies of data items at
successive memory levels be consistent is called the
“coherence property.”

BCS-29 Advanced Computer


Dr P K Singh Slide-4.10
Architecture
Coherence Strategies
• Write-through
• As soon as a data item in Mi is modified, immediate
update of the corresponding data item(s) in Mi+1, Mi+2, …
Mn is required. This is the most aggressive (and
expensive) strategy.
• Write-back
• The update of the data item in Mi+1 corresponding to a modified item in Mi is not made until it (or the block/page/etc. in Mi that contains it) is replaced or removed. This is the most efficient approach, but cannot be used (without modification) when multiple processors share Mi+1, …, Mn.

BCS-29 Advanced Computer


Dr P K Singh Slide-4.11
Architecture
Locality of References
• In most programs, memory references are assumed to
occur in patterns that are strongly related (statistically) to
each of the following:
• Temporal locality – if location M is referenced at time t, then it (location M) will be referenced again at some time t + Δt.
• Spatial locality – if location M is referenced at time t, then another location M ± m will be referenced at time t + Δt.
• Sequential locality – if location M is referenced at time t, then locations M+1, M+2, … will be referenced at times t + Δt, t + Δt′, etc.
• In each of these patterns, both m and Δt are "small."
• H&P suggest that 90 percent of the execution time in
most programs is spent executing only 10 percent of the
code.

BCS-29 Advanced Computer


Dr P K Singh Slide-4.12
Architecture
Working Sets
• The set of addresses (bytes, pages, etc.) referenced by a program during the interval from t to t + θ, where θ is called the working set parameter, changes slowly.
• This set of addresses, called the working set, should
be present in the higher levels of M if a program is to
execute efficiently (that is, without requiring
numerous movements of data items from lower
levels of M). This is called the working set principle.

BCS-29 Advanced Computer


Dr P K Singh Slide-4.13
Architecture
Hit Ratios
• When a needed item (instruction or data) is found in
the level of the memory hierarchy being examined, it
is called a hit. Otherwise (when it is not found), it is
called a miss (and the item must be obtained from a
lower level in the hierarchy).
• The hit ratio, hi, for Mi is the probability (between 0 and 1) that a needed data item is found when sought in memory level Mi.
• The miss ratio is obviously just 1 − hi.
• We assume h0 = 0 and hn = 1.

BCS-29 Advanced Computer


Dr P K Singh Slide-4.14
Architecture
Access Frequencies
• The access frequency fi to level Mi is
  fi = (1 − h1)(1 − h2) … (1 − hi−1) hi

• Note that f1 = h1, and
  Σ (i = 1 to n) fi = 1
BCS-29 Advanced Computer


Dr P K Singh Slide-4.15
Architecture
Effective Access Times
• There are different penalties associated with misses at different levels in the memory hierarchy.
  • A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at the next level).
  • A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.
• The effective access time of a memory hierarchy can be expressed as
  Teff = Σ (i = 1 to n) fi ti
       = h1t1 + (1 − h1)h2t2 + … + (1 − h1)(1 − h2) … (1 − hn−1)hntn
• The first few terms in this expression dominate, but the effective access time is still dependent on program behavior and memory design choices.
BCS-29 Advanced Computer
Dr P K Singh Slide-4.16
Architecture
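The same expression in Python, folding the access frequencies in as it walks down the hierarchy (the three-level hit ratios and access times below are assumed example values, with hn = 1 at the last level):

def effective_access_time(h, t):
    teff, miss = 0.0, 1.0
    for hi, ti in zip(h, t):
        teff += miss * hi * ti      # f_i * t_i, f_i = (1-h1)...(1-h_{i-1}) h_i
        miss *= (1 - hi)
    return teff

print(effective_access_time([0.95, 0.99, 1.0],     # cache, DRAM, disk
                            [2e-9, 60e-9, 5e-3]))  # ~2.5 us, disk-dominated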
Memory Technologies

BCS-29 Advanced Computer


Dr P K Singh Slide-4.17
Architecture
Memory Components
• Static RAM Static RAM (SRAM) is a memory technology based on flip-flops. SRAM has
an access time of 2 – 10 nanoseconds. From a logical view, all of main memory can be
viewed as fabricated from SRAM, although such a memory would be unrealistically
expensive.
• Dynamic RAM Dynamic RAM (DRAM) is a memory technology based on capacitors –
electrical elements that store electronic charge. Dynamic RAM is cheaper than static
RAM and can be packed more densely on a computer chip, thus allowing larger capacity
memories. DRAM has an access time (to be defined later) in the order of 60
nanoseconds, slower than SRAM.
• DDR SDRAM (double data rate synchronous dynamic random access memory)
• DDR2 SDRAM (double data rate two synchronous dynamic random access memory)

BCS-29 Advanced Computer


Dr P K Singh Slide-4.18
Architecture
How does cache memory work?

• Main memory consists of up to 2n addressable words, with each word having a


unique n-bit address. For mapping purposes, this memory is considered to
consist of a number of fixed-length blocks of K words each. Thus, there are M =
2n/ K blocks.
• The cache is split into C lines of K words each, with the number of lines considerably smaller than the number of main memory blocks (C << M).
• At any time, only a subset of the blocks of main memory resides in the lines in the cache.
• If a word in a block of memory is read, that block is transferred to one of the lines of the cache.

BCS-29 Advanced Computer


Dr P K Singh Slide-4.19
Architecture
Random Access Memory (RAM)

• Misnamed as all semiconductor memory is random


access
• Read/Write
• Volatile
• Temporary storage
• Static or dynamic
Types of RAM
• Dynamic RAM (DRAM) – are like leaky capacitors; initially
data is stored in the DRAM chip, charging its memory cells
to maximum values. The charge slowly leaks out and
eventually would go to low to represent valid data; before
this happens, a refresh circuitry reads the contents of the
DRAM and rewrites the data to its original locations, thus
restoring the memory cells to their maximum charges
• Static RAM (SRAM) – is more like a register; once the data
has been written, it will stay valid, it doesn’t have to be
refreshed. Static RAM is faster than DRAM, also more
expensive. Cache memory in PCs is constructed from SRAM
memory.
Dynamic RAM
• Bits stored as charge in capacitors
• Charges leak
• Need refreshing even when powered
• Simpler construction
• Smaller per bit than SRAM
• Less expensive
• Need refresh circuits
• Slower
• Used for main memory in computing systems
• Essentially analogue
• Level of charge determines value
DRAM Structure & Operation
• Address line active when bit read or written
• Transistor switch closed (current flows)
• Write
• Voltage to bit line
• High for 1 low for 0
• Then signal address line
• Transfers charge to capacitor

• Read
• Address line selected
• transistor turns on
• Charge from capacitor fed via bit line to sense amplifier
• Compares with reference value to determine 0 or 1
• Capacitor charge must be restored
DRAM Refreshing
• Refresh circuit included on chip
• Disable memory array chip
• Count through rows and select each in turn
• Read contents & write it back (restore)

• Takes time

• Slows down apparent performance


Static RAM
• Bits stored as on/off switches
• No charges to leak
• No refreshing needed when powered
• More complex construction
• Larger per bit
• More expensive
• Does not need refresh circuits
• Faster
• Cache
• Digital
• Uses flip-flops
Static RAM Structure & Operation
• Transistor arrangement gives
stable logic state
• State 1
• C1 high, C2 low
• T1 T4 off, T2 T3 on
• State 0
• C2 high, C1 low
• T2 T3 off, T1 T4 on
• Address line transistors T5 T6 is
switch
• Write – apply the value to line B and its complement to line B̄
• Read – the value is on line B
Static RAM cell

• Cache is much faster than main memory because it is


implemented using SRAM (Static Random Access
Memory).

TCS-802 Advance Computer


Dr P K Singh Slide-4.27
Architecture
SRAM v DRAM
• Both volatile
• Power needed to preserve data

• Dynamic cell
• Simpler to build, smaller
• More dense
• Less expensive
• Needs refresh
• Larger memory units

• Static
• Faster
• Cache
Read Only Memory (ROM)
• Provides permanent storage (nonvolatile)

• Used for: microprogramming, library subroutines (code) and constant


data, systems programs (BIOS for PC or entire application + OS for certain
embedded systems)

• Types
• Written during manufacture (very expensive for small runs)
• Programmable (once) PROM (needs special equipment to program)
• Read “mostly”
• Erasable Programmable (EPROM) - Erased by UV
• Electrically Erasable (EEPROM) - Takes much longer to write than read
• Flash memory - Erase whole memory electrically
Internal linear organization

• 8X2 ROM chip


• As the number of locations
increases, the size of the
address decoder needed,
becomes very large
• Multiple dimensions of
decoding can be used to
overcome this problem
Internal two-dimensional organization

• High order address bits (A2A1) select one of the rows


• The low order address bit selects one of the two locations in
the row
Memory Subsystems Organization (1)

• Two or more memory chips can be combined to create


memory with more bits per location (two 8X2 chips can
create a 8X4 memory)
Memory Subsystems Organization (2)

• Two or more memory chips can be combined to create more


locations (two 8X2 chips can create 16X2 memory)
Processor and Memory

TCS-802 Advance Computer


Dr P K Singh Slide-4.35
Architecture
Memory control signals
• Controlling the memory system in the minimum mode requires: ALE, /BHE, M/IO, DT/R, /RD, /WR, and /DEN
• ALE – address latch enable, signals external circuitry
that a valid address is on the bus (0->1) so the
address can be stored in the latch (or buffer)
• M/IO – identify whether it is a memory or IO
(Input/Output) operation (high – memory, low – I/O)
• DT/R – transmit or receive (1 – transmit)
• DEN – to enable the data bus

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-36


Read Cycle
• Consists of 4 time states
• T1 – memory address is on the address bus, /BHE is also
output, ALE is enable
• Address is latch to external device at the trailing edge of ALE
• T2 – M/IO and DT/R are set to 1 and 0 respectively. These
signals remain their status during the cycle
• Late in T2 - /RD is switched to 0 and /DEN also set to 0
• T3 and T4 – status bits S3, S4 are output
• Data are read during T3
• /RD and /DEN return to 1 at T4

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-37


Read Cycle

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-38


Write Cycle
• T1 – address and /BHE are output and latched with
ALE pulse
• M/IO is set to 1, DT/R is also set to 1
• T2 - /WR set to 0 and data put on data bus
• Data remain in the data bus until /WR returns to 1
• When /WR returns to 1 at T4, data is written into
memory

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-39


Write Cycle

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-40


Double Data Rate (DDR) DRAM

• An SDRAM type of memory where data are transferred on


both the rising and the falling clock edge, effectively doubling
the transfer rate without increasing the clock frequency
• DDR-200 means a transfer rate of 200 million transfers per
second, at a clock rate of 100 MHz
• DDR1 supports up to 400 MHz
• DDR2 standard allows higher clock frequencies

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-41


Memory Expansion on Motherboards

[Figure: memory expansion on a motherboard: four SIMM slots (Slot 1 to Slot 4) connected to the processor, and a SIMM built from four memory chips]

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-42


RAM Example
Design an 8K×8 RAM module using 2K×8 RAM chips. The module should be connected to an 8-bit processor with a 16-bit address bus and occupy the address range starting from address A000. Show the circuit and the memory map.

• Starting address = A000 = 1010-0000-0000-0000 ⇒ A15 = 1, A14 = 0, A13 = 1
• Number of memory devices needed = 8K / 2K = 4
• Number of address lines on each 2K×8 memory chip: 2^m = 2K = 2^1 × 2^10 = 2^11 ⇒ m = 11 (A0..A10)
• Decoder needed = 2×4 ⇒ 2 address lines for the decoder (A11, A12)
• Number of address lines needed for the address selection circuit = 16 − 11 − 2 = 3 (A13, A14, A15)

Memory map:
  0000 – 9FFF   Not used
  A000 – A7FF   RAM1
  A800 – AFFF   RAM2
  B000 – B7FF   RAM3
  B800 – BFFF   RAM4
  C000 – FFFF   Not used
Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-43
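A quick Python check of this decoding (A15..A13 select the module, A12..A11 select the chip, A10..A0 address a byte within the chip):

def decode(addr):
    if (addr >> 13) != 0b101:           # A15..A13 must be 1,0,1
        return None                     # outside A000-BFFF: not selected
    chip = ((addr >> 11) & 0b11) + 1    # A12..A11 -> RAM1..RAM4
    offset = addr & 0x7FF               # A10..A0 within the 2K chip
    return chip, offset

print(decode(0xA000))   # (1, 0x000): first byte of RAM1
print(decode(0xBFFF))   # (4, 0x7FF): last byte of RAM4
print(decode(0x9FFF))   # None: below the module's range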


Circuit Diagram

[Circuit: four 2K×8 RAM chips share the data bus D0..D7, the address lines A0..A10, and the common RD and WR signals; a 2×4 decoder driven by A11 (input A) and A12 (input B) generates the four chip selects, and the decoder itself is enabled only when A15 = 1, A14 = 0, A13 = 1]

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-44


BCS-29
Advanced Computer Architecture
External Memory
Types of External Memory
Magnetic Disk
Floppy Disk
Hard Disk
Removable
Optical
CD-ROM
CD-Recordable (CD-R)
CD-R/W
DVD
Magnetic Tape
PD/SSD

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-2


Floppy Disk
• The first floppy disks were invented and made by IBM.
• The first disks had a diameter of 8 inches (203 mm); 5 1⁄4-inch (133 mm) and then 3 1⁄2-inch (90 mm) formats followed.
• Capacity
  • 360 KB
  • 1.2 MB
  • 1.44 MB

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-3


Magnetic Disk
• Disk substrate coated with magnetizable material
(iron oxide…rust)
• Substrate used to be aluminium
• Now glass
• Improved surface uniformity
• Increases reliability
• Reduction in surface defects
• Reduced read/write errors
• Lower flight heights (See later)
• Better stiffness
• Better shock/damage resistance

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-4


Disk Medium Materials
• Aluminum with a deposit of magnetic material
• Some disks also use glass platters
• Eg. Newer IBM/Hitachi products
• Better surface uniformity and stiffness but harder to
deposit magnetic material
• Anti-Ferromagnetically Coupled media
• Uses two magnetic layers of opposite polarity to reinforce
the orientation.
• Can provide higher densities but at higher manufacturing
complexity

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-5


Read and Write Mechanisms
• Recording & retrieval via conductive coil called a head
• May be single read/write head or separate ones
• During read/write, head is stationary, platter rotates
• Write
• Current through coil produces magnetic field
• Pulses sent to head
• Magnetic pattern recorded on surface below
• Read (traditional)
• Magnetic field moving relative to coil produces current
• Coil is the same for read and write
• Read (contemporary)
• Separate read head, close to write head
• Partially shielded magneto resistive (MR) sensor
• Electrical resistance depends on direction of magnetic field
• High frequency operation
• Higher storage density and speed

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-6


Inductive Write MR Read

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-7


New Recording Technologies
• Longitudinal Recording now expected to extend
above 100 Gb/sq-in.

• Perpendicular Recording expected to extend to 1


Tb/sq-in

• Beyond that:
• Heat-assisted recording (HAMR)

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-8


Data Organization and Formatting
• Concentric rings or tracks
• Gaps between tracks
• Reduce gap to increase capacity
• Same number of bits per track (variable packing density)
• Constant angular velocity
• Tracks divided into sectors
• Minimum block size is one sector
• May have more than one sector per block

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-9


Storage Density
• Determines both capacity and performance
• Density metrics
  • Linear density (bits/inch, BPI): bits along a track
  • Track density (tracks/inch, TPI): tracks across the platter radius
  • Areal density = BPI × TPI

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-10


Disk Data Layout

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-11


Disk Rotation
• Bit near centre of rotating disk passes fixed point slower than
bit on outside of disk
• Increase spacing between bits in different tracks
• Rotate disk at constant angular velocity (CAV)
• Gives pie shaped sectors and concentric tracks
• Individual tracks and sectors addressable
• Move head to given track and wait for given sector
• Waste of space on outer tracks
• Lower data density
• Can use zones to increase capacity
• Each zone has fixed bits per track
• More complex circuitry

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-12


Disk Layout Methods Diagram

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-13


Winchester Disk Format (Seagate ST506)

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-14


HDD Organization

[Figure: HDD organization: spindle, platters, and the arm assembly with one head per surface; a track is one ring on a platter, and a cylinder is the set of tracks at one arm position]
Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-15


HDD Organization
• Typical configurations seen in disks today
• Platter diameters: 3.7”, 3.3”, 2.6”
• RPMs: 5400, 7200, 10000, 15000
• 0.5-1% variation in the RPM during operation
• Number of platters: 1-5
• Mobile disks can be as small as 0.75”
• Power proportional to: (# Platters) × (RPM)^2.8 × (Diameter)^4.6
• Tradeoff in the drive-design
• Read/write head
• Reading – Faraday’s Law
• Writing – Magnetic Induction
• Data-channel
• Encoding/decoding of data to/from magnetic phase changes

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-16


Tracks and Sectors
• Bits are grouped into sectors
• Typical sector-size = 512 B of data
• Sector also has overhead information
• Error Correcting Codes (ECC)
• Servo fields to properly position the head
Internal Data Rate (IDR)
• Rate at which data can be read from or written to the
physical media
• Expressed in MB/s
• IDR is determined by
• BPI
• Platter-diameter
• RPM

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-17


Formatting
Low Level Formatting (Physical or true formatting)
1. It is done at the factory level. (In low level
formatting all the data stored on the disk is lost as
the disk is physically formatted)
2. It magnetically divides the disk into tracks and
sector.
3. Basic addressing information is written to each
sector of each cylinder.
4. It checks for bad sectors and maps them out.

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-18


Formatting
High Level Formatting
1. It is done with the help of OS.
2. High level Format program scans the disk for tracks and sectors marked
bad during low level formatting. The scanning program performs five
retries to read the tracks or sectors. If the tracks are still unreadable, the
area is noted as bad cluster in FAT.
3. After scanning the entire disk, the drive heads return to the first sector of the partition and write the MBR. Immediately in the next sector the 1st copy of the FAT is written, and after that the 2nd copy. Initially the FATs are blank except for the bad cluster marks found in the initial scan.
4. After the 2nd copy of FAT blank root directory is created.

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-19


FAT (File Allocation Table) File System
1. Developed by Microsoft for MS-DOS, MS-Windows 95, 98, Me
2. FAT located in the MBR sector of a bootable disk
3. Two important functions of the FAT:
   • contains allocation information (in the form of a linked list)
   • indicates which allocation units are free
4. It is simple and reliable. Two identical copies of the FAT are used.

Structure of FAT

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-20


New Technology File System (NTFS)
Structure of NTFS

1. Used by Windows NT, XP, 2000 , Server 2003, Server 2008, Windows Vista
2. NTFS provides better performance, security compatibility and
extendibility than FAT
3. Read, Search, Write, Recovery are done fast.
4. Master File Table (MFT) contain information about all files and folders.
First file on NTFS volume.
5. Partition Boot Sector Start at Sector 0 to 16. First Info on an NTFS volume.

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-21


Features of NTFS

1. It allows you to encrypt files and automatically decrypt them as they are read.
2. Supports long file names up to 255 characters
3. Supports file sizes up to 2 TB
4. For keeping track of clusters it uses a B-tree directory
5. More reliable file system as compared to FAT
6. Allows large partition sizes, i.e., more than 4 GB
7. Built-in file compression facility
8. Improved security and access control, deciding who can perform what sorts of operations on various data within the file system

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-22


Addressing and Finding Sectors
• Must be able to identify start of track and sector
• Format disk
• Additional information not available to user
• Marks tracks and sectors
• Seek time
• Moving head to correct track
• (Rotational) latency
• Waiting for data to rotate under head
• Access time = Seek + Latency
• Transfer rate

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-23
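The access-time relation is simple to compute; a Python sketch with assumed drive parameters (seek time, RPM, block size, transfer rate):

def access_time_ms(seek_ms, rpm, block_kb, transfer_mb_s):
    rotational_ms = 0.5 * 60_000 / rpm               # half a revolution, on average
    transfer_ms = block_kb / 1024 / transfer_mb_s * 1000
    return seek_ms + rotational_ms + transfer_ms

print(access_time_ms(seek_ms=9, rpm=7200, block_kb=4, transfer_mb_s=100))
# ~13.2 ms, dominated by seek and rotational latency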


Timing of Disk I/O Transfer

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-24


RAID
• Redundant Array of Inexpensive Disks
• Basic idea is to connect multiple disks together to
provide
• large storage capacity
• faster access to reading data
• redundant data
• Many different levels of RAID systems
• differing levels of redundancy, error checking, capacity,
and cost
Striping

• Take file data and map it to different disks


• Allows for reading data in parallel

file data block 0 block 1 block 2 block 3

Disk 0 Disk 1 Disk 2 Disk 3


Parity
• Way to do error checking and correction
• Add up all the bits that are 1
• if even number, set parity bit to 0
• if odd number, set parity bit to 1
• To actually implement this, do an exclusive OR of
all the bits being considered
• Consider the following 2 bytes
byte parity
10110011 1
01101010 0
• If a single bit is bad, it is possible to correct it
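A Python sketch of the parity computation and single-bit correction, using the first example byte above (which bit is bad is assumed to come from the ECC, as described for RAID Level-2 below):

def parity(bits):
    p = 0
    for b in bits:
        p ^= b                      # 1 if an odd number of 1s, else 0
    return p

data = [1, 0, 1, 1, 0, 0, 1, 1]     # 10110011 -> parity 1
stored_parity = parity(data)
data[5] ^= 1                        # one bit goes bad
if parity(data) != stored_parity:   # mismatch detected:
    data[5] ^= 1                    # flip the bit the ECC identified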
Mirroring
• Keep to copies of data on two separate disks
• Gives good error recovery
• if some data is lost, get it from the other source
• Expensive
• requires twice as many disks
• Write performance can be slow
• have to write data to two different spots
• Read performance is enhanced
• can read data from file in parallel
RAID Level-0
• Often called striping
• Break a file into blocks of data
• Stripe the blocks across disks in the system
• Simple to implement
• disk = file block % number of disks
• sector = file block / number of disks
• provides no redundancy or error detection
• important to consider because lots of disks means low
Mean Time To Failure (MTTF)
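The two mapping formulas in Python, reproducing the 2-disk layout shown on the next slide:

def locate(block, ndisks):
    disk = block % ndisks       # disk   = file block % number of disks
    sector = block // ndisks    # sector = file block / number of disks
    return disk, sector

for b in range(5):
    print("block", b, "->", locate(b, 2))
# blocks 0,2,4 -> disk 0 sectors 0,1,2; blocks 1,3 -> disk 1 sectors 0,1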
RAID Level-0

file data: block 0, block 1, block 2, block 3, block 4

Disk 0 sectors: 0 = block 0, 1 = block 2, 2 = block 4
Disk 1 sectors: 0 = block 1, 1 = block 3
RAID Level-1

• A complete file is stored on a single disk


• A second disk contains an exact copy of the file
• Provides complete redundancy of data
• Read performance can be improved
• file data can be read in parallel
• Write performance suffers
• must write the data out twice
• Most expensive RAID implementation
• requires twice as much storage space
RAID Level-1

file data: block 0, block 1, block 2, block 3, block 4

Disk 0 sectors: 0 = block 0, 1 = block 1, 2 = block 2, 3 = block 3, 4 = block 4
Disk 1 sectors: an identical copy (block 0 … block 4)
RAID Level-2
• Stripes data across disks similar to Level-0
• difference is data is bit interleaved instead of block
interleaved
• Uses ECC to monitor correctness of information on
disk
• Multiple disks record the ECC information to
determine which disk is in fault
• A parity disk is then used to reconstruct corrupted
or lost data
RAID Level-2

file data block 0 block 1 block 2 block 3 block 4

Data Disk Data Disk ECC Disk ECC Disk Parity Disk
RAID Level-2
• Reconstructing data
• assume data striped across eight disks
• correct data: 10011010
• parity: 0
• data read: 10011110
• if we can determine that disk 2 is in error
• just use read data and parity to know which bit to flip
RAID Level-2
• Requires fewer disks than Level-1 to provide
redundancy
• Still needs quite a few more disks
• for 10 data disks need 4 check disks plus parity disk
• Big problem is performance
• must read data plus ECC code from other disks
• for a write, have to modify data, ECC, and parity disks
• Another big problem is only one read at a time
• while a read of a single block can be done in parallel
• multiple blocks from multiple files can’t be read
because of the bit-interleaved placement of data
RAID Level-3
• One big problem with Level-2 are the disks needed to
detect which disk had an error
• Modern disks can already determine if there is an
error
• using ECC codes with each sector
• So just need to include a parity disk
• if a sector is bad, the disk itself tells us, and use the parity
disk to correct it
RAID Level-4
• The big problem with Level-2 and Level-3 is the bit interleaving
• to access a single file block of data, must access all the
disks
• allows good parallelism for a single access but doesn’t
allow multiple I/O’s
• Level-4 interleaves file blocks
• allows multiple small I/O’s to be done at once
RAID Level-4
• Still use a single disk for parity
• Now the parity is calculated over data from multiple
blocks
• Level-2,3 calculate it over a single block
• If an error detected, need to read other blocks on
other disks to reconstruct data
Level-4 vs. Level-2,3
Four transfer units a, b, c, d, each split into four pieces (a0..a3, etc.), spread over four data disks plus parity:

Level-3 (bit/byte interleaved):
  Disk 1: a0 b0 c0 d0
  Disk 2: a1 b1 c1 d1
  Disk 3: a2 b2 c2 d2
  Disk 4: a3 b3 c3 d3
  Parity disk: parity for the a, b, c, d stripes

Level-4 (block interleaved):
  Disk 1: a0 a1 a2 a3
  Disk 2: b0 b1 b2 b3
  Disk 3: c0 c1 c2 c3
  Disk 4: d0 d1 d2 d3
  Parity disk: parity blocks 0 1 2 3
RAID Level-4
• Reads are simple to understand
• want to read block A, read it from disk 0
• if there is an error, read in blocks B,C, D, and parity
block and calculate correct data
• What about writes?
• it looks like a write still requires access to 4 data disks
to recalculate the parity data
• not true, can use the following formula
• new parity = (old data xor new data) xor old parity
• a write requires 2 reads and 2 writes
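A Python sketch of this small-write parity update, on an assumed 3-data-disk stripe of 8-bit blocks:

def new_parity(old_data, new_data, old_parity):
    return (old_data ^ new_data) ^ old_parity   # the formula above

stripe = [0b10110011, 0b01101010, 0b11100001]
p = stripe[0] ^ stripe[1] ^ stripe[2]           # initial parity block
old, new = stripe[1], 0b00001111                # rewrite the middle block
p = new_parity(old, new, p)                     # 2 reads (old data, old parity)
stripe[1] = new                                 # 2 writes (new data, new parity)
assert p == stripe[0] ^ stripe[1] ^ stripe[2]   # parity remains consistent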
RAID Level-4
• Doing multiple small reads is now faster than before
• However, writes are still very slow
• this is because of calculating and writing the parity blocks
• Also, only one write is allowed at a time
• all writes must access the check disk so other writes have
to wait
RAID Level-5
• Level-5 stripes file data and check data over all the
disks
• no longer a single check disk
• no more write bottleneck
• Drastically improves the performance of multiple
writes
• they can now be done in parallel
• Slightly improves reads
• one more disk to use for reading
RAID Level-5
Level-4: data on disks 1–4, with all check (parity) sectors S0–S5 on the dedicated disk 5.
Level-5: data and check sectors distributed over all five disks; the parity sector for each stripe S0–S5 rotates to a different disk.
RAID Level-5

• Notice that for Level-4, a write to sector 0 on disk 2 and a write to sector 1 on disk 3 both require a write to disk 5 for the check information
• In Level-5, a write to sector 0 on disk 2 and a write to sector 1 on disk 3 require writes to different disks for the check information (disks 5 and 4, respectively)
• Best of all worlds
  • read and write performance close to that of RAID Level-1
  • requires as much disk space as Levels-3,4
• requires as much disk space as Levels-3,4
RAID Level-10
• Combine Level-0 and Level-1
• Stripe a files data across multiple disks
• gives great read/write performance
• Mirror each strip onto a second disk
• gives the best redundancy
• The most high performance system
• The most expensive system
Optical Storage CD-ROM
• Originally for audio
• 650Mbytes giving over 70 minutes audio
• Polycarbonate coated with highly reflective coat,
usually aluminium
• Data stored as pits
• Read by reflecting laser
• Constant packing density
• Constant linear velocity

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-47


CD Operation

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-48


CD-ROM Drive Speeds
• Audio is single speed
  • Constant linear velocity
  • 1.2 m/s
  • Track (spiral) is 5.27 km long
  • Gives 4391 seconds = 73.2 minutes
• Other speeds are quoted as multiples
  • e.g. 24x
  • Quoted figure is the maximum the drive can achieve

Dr P K Singh BCS-29 Advanced Computer Architecture BCS-29(!)-49


CD-ROM Format
• Mode 0 = blank data field
• Mode 1 = 2048 bytes data + error correction
• Mode 2 = 2336 bytes data
Virtual Memory
• To facilitate the use of memory hierarchies, the memory
addresses normally generated by modern processors
executing application programs are not physical addresses,
but are rather virtual addresses of data items and
instructions.
• Physical addresses, of course, are used to reference the
available locations in the real physical memory of a system.
• Virtual addresses must be mapped to physical addresses
before they can be used.
Virtual to Physical Mapping
• The mapping from virtual to physical addresses can be formally defined as follows:

f_t(v) = m, if m ∈ M has been allocated to store the data identified by virtual address v
f_t(v) = Ω (miss), if the data for v is missing in M

The mapping returns a physical address m if a memory hit occurs. If there is a memory miss, the referenced item has not yet been brought into primary memory.
Mapping Efficiency
• The efficiency with which the virtual to physical mapping can
be accomplished significantly affects the performance of the
system.
• Efficient implementations are more difficult in multiprocessor
systems where additional problems such as coherence,
protection, and consistency must be addressed.
Virtual Memory Models
• Private Virtual Memory
• In this scheme, each processor has a separate virtual
address space, but all processors share the same physical
address space.
• Advantages:
• Small processor address space
• Protection on a per-page or per-process basis
• Private memory maps, which require no locking
• Disadvantages
• The synonym problem – different virtual addresses in different/same
virtual spaces point to the same physical page
• The same virtual address in different virtual spaces may point to
different pages in physical memory
Virtual Memory Models
• Shared Virtual Memory
• All processors share a single shared virtual address space, with each
processor being given a portion of it.
• Some of the virtual addresses can be shared by multiple processors.
• Advantages:
• All addresses are unique
• Synonyms are not allowed
• Disadvantages
• Processors must be capable of generating large virtual addresses (usually > 32
bits)
• Since the page table is shared, mutual exclusion must be used to guarantee
atomic updates
• Segmentation must be used to confine each process to its own address space
• The address translation process is slower than with private (per processor) virtual
memory
Memory Allocation
• Both the virtual address space and the physical
address space are divided into fixed-length pieces.
• In the virtual address space these pieces are called pages.
• In the physical address space they are called page frames.
• The purpose of memory allocation is to allocate
pages of virtual memory using the page frames of
physical memory.
Address Translation Mechanisms
• [Virtual to physical] address translation requires use
of a translation map.
• The virtual address can be used with a hash function to
locate the translation map (which is stored in the cache,
an associative memory, or in main memory).
• The translation map comprises a translation lookaside buffer, or TLB (usually in associative memory), and a page table (or tables). The virtual address is first sought in the TLB, and if that search succeeds, no further translation is necessary. Otherwise, the page table(s) must be referenced to obtain the translation result.
• If the virtual address cannot be translated to a physical
address because the required page is not present in
primary memory, a page fault is reported.
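The lookup order just described can be sketched in C; the direct-mapped TLB, the single-level page table, and all names here are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12                 /* assumed 4 KB pages */

typedef struct { uint32_t vpn; uint32_t pfn; bool valid; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_table[1u << 20];  /* pfn per vpn; 0 = not resident */

/* Translate a virtual address; returns false to report a page fault. */
static bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t off = vaddr & ((1u << PAGE_SHIFT) - 1);

    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped TLB */
    if (e->valid && e->vpn == vpn) {         /* TLB hit: no further work */
        *paddr = (e->pfn << PAGE_SHIFT) | off;
        return true;
    }
    uint32_t pfn = page_table[vpn];          /* TLB miss: consult page table */
    if (pfn == 0)
        return false;                        /* page fault */
    *e = (TlbEntry){ vpn, pfn, true };       /* refill the TLB */
    *paddr = (pfn << PAGE_SHIFT) | off;
    return true;
}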
Centralized vs. Distributed Shared Memory
• Centralized shared-memory multiprocessor or Symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single
centralized memory – since all processors see the
same memory organization → uniform memory
access (UMA)
• Shared-memory because all processors can access
the entire memory address space
• Can centralized memory emerge as a bandwidth
bottleneck? – not if you have large caches and
employ fewer than a dozen processors
SMPs or Centralized Shared-Memory
Distributed Memory Multiprocessors
• For higher scalability, memory is distributed among
processors → distributed memory multiprocessors
• If one processor can directly address the memory local to
another processor, the address space is shared → distributed
shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to
communicate data → cluster of computers or multicomputers
• Non-uniform memory architecture (NUMA) since local
memory has lower latency than remote memory
Distributed Memory Multiprocessors
Categories of Parallel Computers (MIMD)
• Considering their architecture only, there are two main categories of parallel computers:
• systems with shared common memories, and
• systems with unshared distributed memories.
Shared-Memory Multiprocessors
• Shared-memory multiprocessor models:
• Uniform-memory-access (UMA)
• Nonuniform-memory-access (NUMA)
• Cache-only memory architecture (COMA)
• These systems differ in how the memory and
peripheral resources are shared or distributed.
Categories of Parallel Computers (MIMD)
The UMA Model
• Physical memory uniformly shared by all processors, with equal access
time to all words.
• Processors may have local cache memories.
• Peripherals also shared in some fashion.
• Tightly coupled systems use a common bus, crossbar, or multistage
network to connect processors, peripherals, and memories.
• Many manufacturers have multiprocessor (MP) extensions of uniprocessor
(UP) product lines.
• Synchronization and communication among processors achieved through shared
variables in common memory.
• Symmetric MP systems – all processors have access to all peripherals, and any
processor can run the OS and I/O device drivers.
• Asymmetric MP systems – not all peripherals accessible by all processors; kernel
runs only on selected processors (master); others are called attached processors
(AP).
The UMA Multiprocessor Model
(Figure: processors P1, P2, …, Pn connect through the System Interconnect (Bus, Crossbar, Multistage network) to I/O and shared memories SM1, …, SMm.)
Performance Calculation
• Consider two loops. The first loop adds corresponding elements of two N-
element vectors to yield a third vector. The second loop sums elements of
the third vector. Assume each add/assign operation takes 1 cycle, and
ignore time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.). Assume interprocessor
communication requires k cycles.
• On a sequential system, each loop will require N cycles, for a total of 2N
cycles of processor time.
• On an M-processor system, we can partition each loop into M parts, each
having L = N / M add/assigns requiring L cycles. The total time required is
thus 2L. This leaves us with M partial sums that must be totaled.
• Computing the final sum from the M partial sums requires l = log2(M) additions, each requiring k cycles (to access a non-local term) and 1 cycle (for the add/assign), for a total of l × (k+1) cycles.
• The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles.
Performance Calculation
• Assume N = 2^20.
• Sequential execution requires 2N = 2^21 cycles.
• If processor synchronization requires k = 200 cycles, and we have M = 256 processors, parallel execution requires
2N / M + (k + 1) log2(M)
= 2^21 / 2^8 + 201 × 8
= 2^13 + 1608 = 9800 cycles
• Comparing results, the parallel solution is 214 times faster than the sequential, with the best theoretical speedup being 256 (since there are 256 processors). Thus the efficiency of the parallel solution is 214 / 256 = 83.6 %.
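The same numbers can be checked with a few lines of C; the constants are the slide's assumptions:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 1048576.0;   /* 2^20 elements */
    double M = 256.0;       /* processors */
    double k = 200.0;       /* cycles per interprocessor access */

    double seq = 2.0 * N;                            /* sequential cycles */
    double par = 2.0 * N / M + (k + 1.0) * log2(M);  /* parallel cycles */

    printf("parallel cycles = %.0f\n", par);                  /* 9800 */
    printf("speedup = %.0f\n", seq / par);                    /* ~214 */
    printf("efficiency = %.1f%%\n", 100.0 * (seq / par) / M); /* ~83.6 */
    return 0;
}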
The NUMA Model
• Shared memories, but access time depends on the location of the data
item.
• The shared memory is distributed among the processors as local
memories, but each of these is still accessible by all processors (with
varying access times).
• Memory access is fastest from the locally-connected processor, with the
interconnection network adding delays for other processor accesses.
• Additionally, there may be global memory in a multiprocessor system,
with two separate interconnection networks, one for clusters of
processors and their cluster memories, and another for the global shared
memories.
Shared Local Memories
• The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors.
• It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer time due to the added delay through the interconnection network.
(Figure: local memories LM1, LM2, …, LMn, each attached to its processor P1, P2, …, Pn, with all processors joined through an interconnection network.)
Hierarchical Cluster Model
• In the hierarchically structured multiprocessor, the processors are divided into several clusters. Each cluster is itself a UMA or a NUMA multiprocessor.
• The clusters are connected to global shared-memory (GSM) modules. The entire system is considered a NUMA multiprocessor.
• All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory (CSM) modules.
(Figure: each cluster joins its processors P and CSM modules through a cluster interconnection network (CIN); a global interconnect network connects the clusters to the GSM modules.)
The COMA Model
• In the COMA(Cache-Only Memory Architecture) model, processors only
have cache memories; the caches, taken together, form a global address
space.
• Each cache has an associated directory that aids remote machines in their
lookups; hierarchical directories may exist in machines based on this
model.
• Initial data placement is not critical, as cache blocks will eventually
migrate to where they are needed.
(Figure: processors P, each with a cache C and an associated directory D, connected through an interconnection network.)
Distributed-Memory Multicomputers
• This system consists of multiple computers, often called nodes, inter-
connected by a message-passing network. Each node is an autonomous
computer consisting of a processor, local memory and sometimes
attached disks or I/O peripherals.
• The message-passing network provides point-to-point static connections
among the nodes.
• All local memories are private and are accessible only by local processors.
For this reason, traditional multicomputers have also been called no-
remote-memory-access (NORMA) machines.
• Internode communication is carried out by passing messages through the
static connection network. With advances in interconnection and network
technologies, this model of computing has gained importance.
Distributed-Memory Multicomputers
(Figure: multiple nodes, each an autonomous computer pairing a processor P with its local memory M, connected point-to-point through a message-passing interconnection network.)
BCS-29
Advanced Computer Architecture
Interconnection Topologies and
Synchronization
Connecting Processors and Memories
• Buses
• Interconnection Networks
– Static Networks
– Dynamic Networks
(Figures: in the shared-memory organization, processors P reach the memories M through an interconnection network; in the distributed organization, each processor has its own local memory, and a global interconnection network connects the nodes.)
Buses – Common Characteristics
• Multiple devices communicating over a single set of
wires
• Only one device can talk at a time or the message is
garbled
• Each line or wire of a bus can at any one time contain
a single binary digit. Over time, however, a sequence
of binary digits may be transferred
• These lines may and often do send information in
parallel
• A computer system may contain a number of
different buses
Buses – Structure
• Serial versus parallel
• Around 50-100 lines although it's possible to have as few as 3 or 4
• Lines can be classified into one of four groups
• Data lines
• Address Lines
• Control Lines
• Power
• Bus lines (parallel)
• Data
• Address
• Control
• Power
• Bus lines (serial)
• Data, address, and control are sequentially sent down single wire
• There may be additional control lines
• Power
Parallel Data Transmission
• Simultaneous transmission
• Requires separate data lines
• Bits must stay synchronized
• Fast
• Expensive
• Example: Printer connections
(Figure: the transmitter sends all bits of one word in parallel to the receiver.)
Serial Data Transmission
• Transfers one bit at a time
• Requires only one data line
• Bits must stay synchronized
• Slow compared to parallel transmission
• Less expensive
• Example: USB
(Figure: the transmitter sends one word bit-by-bit over a single line to the receiver.)
Serial Data Communication
• Two basic types of Serial Data Communication:
• Synchronous Communication
• Asynchronous Communication

• Synchronous Communication
• Transmitter and receiver have their clocks synchronized
• Data rates are dependent on clock rates
• Continuously transmitting characters to remain in sync.
Asynchronous Communication
• NO synchronization
• No need to send idle characters
• Transmitter and receiver operate independently
• Transmitter can send data at any time
• Receiver is always ready to accept data
• Requires a start and stop bit to identify each byte of
data
• How does the receiver know that data is arriving?
• If the line is idle, it is sending a constant '1' (mark state)
• The receiver recognizes a jump from '1' to '0' as the start bit and is alerted that data is about to be sent.
Bus Operation
• Sending Data
• Obtain the use of the bus
• Transfer the data via the bus
• Possible acknowledgement
• Requesting Data
• Obtain the use of the bus
• Transfer the data request via the bus
• Wait for other module to send data
• Possible acknowledgement
Bus System
• A bus system is a hierarchy of buses connecting various system and subsystem components.
• Each bus has a complement of control, signal, and
power lines.
• There is usually a variety of buses in a system:
• Local bus – (usually integral to a system board) connects
various major system components (chips)
• Memory bus – used within a memory board to connect
the interface, the controller, and the memory cells
• Data bus – might be used on an I/O board or VLSI chip to
connect various components
• Backplane – like a local bus, but with connectors to which
other boards can be attached
System Bus (An Abstract View)
System Bus (A Technical View)
Most processors use a two-layer approach to
system bus communication.
FSB (Front Side Bus) for traditional system bus tasks
BSB (Back Side Bus) for CPU-memory cache tasks
The FSB is normally implemented using two chips:
1. the Northbridge and
2. the Southbridge, as shown
System Bus (A Modern Example)
Bus Types
• Dedicated
• Separate data & address lines
• Time multiplexed
• Shared lines
• Address valid or data valid control line
• Advantage - fewer lines
• Disadvantages
• More complex control
• Degradation of performance
• Physical Dedication
• Physically separating buses and controlling them with a
"channel changer“
• Advantages – faster
• Disadvantages – physically larger
Bus Arbitration
• Listening to the bus is not usually a problem
• Talking on the bus is a problem – need arbitration to
allow more than one module to control the bus at
one time
• Centralised Arbitration
• Single hardware device controlling bus access – Bus
Controller/Arbiter
• May be part of CPU or separate
• Distributed Arbitration
• Each module may claim the bus
• Access control logic is on all modules
• Modules work together to control bus
Bus Arbitration and Control
• Bus arbitration is a process by which contention due to the
use of bus by more than one device is prevented. For
instance, a DMA transfer uses the same address, data, and
control lines used by the processor to communicate with the
memory and the I/O device. Therefore we need to have a
means to coordinate the use of these lines, to prevent 2
devices from initiating transfers at the same time.
• A device that is allowed to initiate data transfers on the bus is
called the master. Clearly, only one master can exist at a
time.
• When the master relinquishes the control, the other devices
can acquire this status. The bus arbiter performs the required
scheduling function.
DMA transfer arbitration
(Figure: the CPU and DMA controllers 1 and 2 share the bus-busy line \BBSY and the bus-request line \BR; the CPU's bus-grant signal is daisy-chained through the controllers as BG1 → BG2.)
• The above figure indicates a basic arrangement when the CPU contains
the bus arbiter circuitry.
• In this case, the processor is normally the bus master, unless it grants the
ownership to any of the devices.
• A DMA controller indicates that it needs to become a bus master by activating the bus request line \BR ('BR bar'). The signal on the BR line is a logical OR of the requests from all the devices connected to it. When the BR line is activated, the processor responds by asserting the bus grant signal, which is daisy-chained through the controllers (BG1, BG2).
Bus Timing
• Co-ordination of events on bus
• Synchronous – controlled by a clock
• Asynchronous – timing is handled by well-defined
specifications, i.e., a response is delivered within a
specified time after a request
Synchronous Bus Timing
• Events determined by clock signals
• Control Bus includes clock line
• A single 1-0 cycle is a bus cycle
• All devices can read clock line
• Usually sync on leading/rising edge
• Usually a single cycle for an event
• Analogy – Orchestra conductor with baton
• Usually stricter in terms of its timing requirements
Synchronous Bus Timing
Asynchronous Timing
• Devices must have certain tolerances to provide responses to
signal stimuli
• More flexible allowing slower devices to communicate on
same bus with faster devices.
• Performance of faster devices, however, is limited to speed of
bus
Asynchronous Timing – Read
Asynchronous Timing – Write
ISA BUS
Switched Networks
BUS
• Shared media
• Lower Cost
• Lower throughput
• Scalability poor

Switched Network
• Switched paths
• Higher cost
• Higher throughput
• Scalability better
Interconnection Networks
• Topology: who is connected to whom
• Direct / Indirect: where is switching done
• Static / Dynamic: when is switching done
• Circuit switching / packet switching: how are
connections established
• Store & forward / worm hole routing: how is the path
determined
• Centralized / distributed: how is switching controlled
• Synchronous/asynchronous: mode of operation
Direct and Indirect Networks
Static and Dynamic Networks
• Static Networks
• fixed point to point connections
• usually direct
• each node pair may not have a direct connection
• routing through nodes
• Dynamic Networks
• connections established as per need
• usually indirect
• path can be established between any pair of nodes
• routing through switches
Static Network Topologies
Non-uniform connectivity (figures): Linear, 2D-Mesh, Tree, Star
Static Network Topologies
• Uniform connectivity (figures): Ring, Torus, Fully Connected
Dynamic Networks
Crossbar Switch and Multiport Memory
• Single stage networks are sometimes called
re_circulating networks because data items may
have to pass through the single stage many times.
• The crossbar switch and the multiported memory
organization (seen later) are both single-stage
networks.
• This is because even if two processors attempted to
access the same memory module (or I/O device) at
the same time, only one of the requests is serviced
at a time.
Multistage Networks
• Multistage networks consist of multiple stages of
switch boxes, and should be able to connect any
input to any output.
• A multistage network is called blocking if the
simultaneous connections of some multiple input-
output pairs may result in conflicts in the use of
switches or communication links.
• A nonblocking multistage network can perform all
possible connections between inputs and outputs by
rearranging its connections.
Crossbar Networks
• Crossbar networks connect every input to every
output through a crosspoint switch.
• A crossbar network is a single stage, non-blocking
permutation network.
• In an n-processor, m-memory system, n × m
crosspoint switches will be required. Each crosspoint
is a unary switch which can be open or closed,
providing a point-to-point connection path between
the processor and a memory module.
Crosspoint Switch Design
• Out of n crosspoint switches in each column of an n × m crossbar mesh, only one can be connected at a
time.
• Crosspoint switches must be designed to handle the
potential contention for each memory module.
• Each processor provides a request line, a read/write
line, a set of address lines, and a set of data lines to a
crosspoint switch for a single column.
• The crosspoint switch eventually responds with an
acknowledgement when the access has been
completed.
Schematic of a Crosspoint Switch
Multiport Memory
• Since crossbar switches are expensive, and not suitable for
systems with many processors or memory modules, multiport
memory modules may be used instead.
• A multiport memory module has multiple connection points
for processors (or I/O devices), and the memory controller in
the module handles the arbitration and switching that might
otherwise have been accomplished by a crosspoint switch.
Baseline Network
Benes Network
(Figure: a Benes network, which is non-blocking.)
Switching Mechanism
• Circuit Switching (connection-oriented
communication)
• A circuit is established between the source and the
destination
• Packet Switching (connectionless communication)
• Information is divided into packets and each packet is sent
independently from node to node
Omega Networks
• N-input Omega networks, in general, have log2(n) stages, with the input stage labeled 0.
• The inter-stage connection (ISC) pattern is a perfect shuffle.
• Routing is controlled by inspecting the destination address. When the i-th highest order bit is 0, the 2×2 switch in stage i connects the input to the upper output. Otherwise it connects the input to the lower output.
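Destination-tag routing can be sketched in a few lines of C; the print-out and function name are illustrative, assuming 2×2 switches and a power-of-two network size:

#include <stdio.h>

/* Set one switch per stage by scanning the destination address
   from the most significant bit to the least significant bit. */
static void omega_route(unsigned dest, unsigned log2n)
{
    for (unsigned stage = 0; stage < log2n; stage++) {
        unsigned bit = (dest >> (log2n - 1 - stage)) & 1u;
        printf("stage %u: %s output\n", stage, bit ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(5u, 3u);   /* dest = 101b: lower, upper, lower */
    return 0;
}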
Blocking Effects
• Blocking exists in an Omega network when the requested permutation would
require that a single switch be set in two positions simultaneously.
• Obviously this is impossible, and requires that one of the permutation requests
be blocked and tried in a later pass.
• In general, with 22 switches, an Omega network can implement n n/2
permutations in a single pass. For n = 8, this is about 10% of all possible
permutations.
• In general, a maximum of log2n passes are needed for an n-input Omega
network.
Omega Broadcast
• An Omega network can be used to broadcast data to multiple destinations.
• The switch to which the input is connected is set to the broadcast position (input
connected to both outputs).
• Each additional switch (in later stages) to which an output is directed is also set
to the broadcast position.
Larger Switches
• Larger switches (more inputs and outputs, and more switching patterns) can be used to build an Omega network, resulting in fewer stages.
• For example, with 4×4 switches, only log4(16) = 2 stages are required for a 16-input network.
• A k-way perfect shuffle is used as the ISC for an Omega network using k × k switches.
BCS-29
Advanced Computer Architecture
The Cache Coherence Problem
The Cache Coherence Problem
• Since there are multiple levels in a memory
hierarchy, with some of these levels private to one or
more processors, some levels may contain copies of
data objects that are inconsistent with others.
• This problem is manifested most obviously when
individual processors maintain cached copies of a
unique shared-memory location, and then modify
that copy. The inconsistent view of that object
obtained from other processors' caches and main
memory is called the cache coherence problem.
Causes of Cache Inconsistency
• Cache inconsistency only occurs when there are
multiple caches capable of storing (potentially
modified) copies of the same objects.
• There are three frequent sources of this problem:
• Sharing of writable data
• Process migration
• I/O activity
Inconsistency in Data Sharing
• Suppose two processors each use (read) a data item X from a shared memory.
Then each processor’s cache will have a copy of X that is consistent with the
shared memory copy.
• Now suppose one processor modifies X (to X’). Now that processor’s cache is
inconsistent with the other processor’s cache and the shared memory.
• With a write-through cache, the shared memory copy will be made consistent, but
the other processor still has an inconsistent value (X).
• With a write-back cache, the shared memory copy will be updated eventually,
when the block containing X (actually X’) is replaced or invalidated.
Inconsistency After Process Migration
• If a process accesses variable X (resulting in it being placed in the processor cache), and is then moved to a different processor and modifies X (to X'), then the caches on the two processors are inconsistent.
• This problem exists regardless of whether write-through caches or write-back caches are used.
Inconsistency Caused by I/O
• Data movement from an I/O device to a shared primary
memory usually does not cause cached copies of data to be
updated.
• As a result, an input operation that writes X causes it to
become inconsistent with a cached value of X.
• Likewise, writing data to an I/O device usually uses the data in the shared primary memory, ignoring any potential cached data with different values.
• A potential solution to this problem is to require the I/O processors to maintain consistency with at least one of the processor's private caches, thus "passing the buck" to the processor cache coherence solution (which we will see).
I/O Operations Bypassing the Cache
A Possible Solution
Cache Coherence Protocols
• When a bus is used to connect processors and memories in a
multiprocessor system, each cache controller can “snoop” on
all bus transactions, whether they involve the current
processor or not. If a bus transaction affects the consistency
of a locally-cached object, then the local copy can be
invalidated.
• If a bus is not used (e.g. a crossbar switch or network is used),
then there is no convenient way to “snoop” on memory
transactions. In these systems, some variant of a directory
scheme is used to ensure cache coherence.
Snoopy Bus Protocols
• Two basic approaches
• write-invalidate – invalidate all other cached copies of a
data object when the local cached copy is modified
(invalidated items are sometimes called “dirty”)
• write-update – broadcast a modified value of a data object
to all other caches at the time of modification
• Snoopy bus protocols achieve consistency among
caches and shared primary memory by requiring the
bus interfaces of processors to watch the bus for
indications that require updating or invalidating
locally cached objects.
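A minimal sketch of the write-invalidate side in C; the cache structure, block size, and the bus primitive are illustrative assumptions, not a real protocol engine:

#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 256
#define BLOCK 64u                    /* assumed block size in bytes */

typedef struct { uint32_t tag; bool valid; } Line;
static Line cache[CACHE_LINES];

/* Snooping logic calls this when another processor's write is seen
   on the bus: invalidate our copy if we hold that block. */
static void snoop_remote_write(uint32_t addr)
{
    Line *l = &cache[(addr / BLOCK) % CACHE_LINES];
    if (l->valid && l->tag == addr / BLOCK)
        l->valid = false;            /* local copy is now invalid ("dirty") */
}

/* A local write must broadcast an invalidate before modifying the line. */
static void local_write(uint32_t addr)
{
    /* bus_broadcast_invalidate(addr);  -- hypothetical bus primitive */
    Line *l = &cache[(addr / BLOCK) % CACHE_LINES];
    l->tag = addr / BLOCK;
    l->valid = true;
}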
Initial State – Consistent Caches
After Write-Invalidate by P1
After Write-Update by P1
Operations on Cached Objects
• Read – as long as an object has not been invalidated,
read operations are permitted, and obviously do not
change the object’s state
• Write – as long as an object has not been invalidated,
write operations on the local object are permitted,
but trigger the appropriate protocol action(s).
• Replace – the cache block containing an object is
replaced (by a different block)
Write-Through Cache
• In the transition diagram (next slide), the two possible object
states in the “local” cache (valid and invalid) are shown.
• The operations that may be performed are read, write, and
replace by the local processor or a remote processor.
• Transitions from locally valid to locally invalid occur as a result
of a remote processor write or a local processor replacing the
cache block.
• Transitions from locally invalid to locally valid occur as a result
of the local processor reading or writing the object
(necessitating, of course, the fetch of a consistent copy from
shared memory).
Write-Through Cache State Transitions
R = Read, W = Write, Z = Replace
i = local processor, j = other processor
Write-Back Cache
• The state diagram for the write-back protocol divides the valid
state into RW and RO states.
• The protocol essentially gives “ownership” of the cache block
containing the object to a processor when it does a write
operation.
• Before an object can be modified, ownership for exclusive
access must first be obtained by a read-only bus transaction
which is broadcast to all caches and memory.
• If a modified block copy exists in a remote cache, memory
must first be updated, the copy invalidated, and ownership
transferred to the requesting cache
Write-Back Cache
Goodman’s Write-Once Protocol State
Diagram
Goodman’s Cache Coherence Protocol
• Combines advantages of write-back and write-through protocols.
• First write of a cache block uses write-through.
• Cache states (see previous slide):
• Valid: block is consistent with memory, has been read, but not
modified.
• Invalid: block not in cache, or is inconsistent with memory.
• Reserved: block written once after being read and is consistent with
memory copy (which is the only other copy).
• Dirty: block modified more than once, inconsistent with all other
copies.
Commands and State Transitions
• Local processor accesses:
• Read-hit or read-miss (P-Read) – transition to valid state.
• Write-hit (P-Write)
• First one results in transition to reserved state.
• Additional writes go to (or stay in) dirty state.
• Write-miss – transition to dirty state.
• Remote processor invalidation commands (issued over
snoopy bus):
• Read-invalidate – read a block and invalidate all other copies.
• Write-invalidate – invalidate all other copies of a block.
• Bus-read (Read-blk) – normal read; transition to valid state.
• (Note textbook correction.)
Snoopy Bus Protocol Performance
• Depends heavily on the workload.
• In uniprocessors:
• bus traffic and memory-access time heavily influenced by
cache misses.
• Miss ratio decreases as block size increases, up to a data pollution point (that is, as blocks become larger, the probability of finding a desired data item in the cache increases).
• Data pollution point increases with larger cache sizes.
Snoopy Bus Protocol Performance
• In multiprocessor systems
• Write-invalidate protocol
• Better handles process migrations and synchronization than other protocols.
• Cache misses can result from invalidations sent by other processors before a cache
access, which significantly increases bus traffic.
• Bus traffic may increase as block sizes increase.
• Write-invalidate facilitates writing synchronization primitives.
• Average number of invalidated cache copies is small in a small multiprocessor.
• Write-update protocol
• Requires bus broadcast facility
• May update remote cached data that is never accessed again
• Can avoid the back and forth effect of the write-invalidate protocol for data shared
among multiple caches
• Can’t be used with long write bursts
• Requires extensive tracing to identify actual behavior
Directory-based Protocols
• The snoopy bus-based protocols may be adequate
for relatively small multiprocessor systems, but are
wholly inadequate for large multiprocessor systems.
• Commands (in the form of messages) to control the
consistency of remote caches must be sent only to
those processors with caches containing a copy of
the affected block (since broadcast is very expensive
in a multistage network – like Omega).
• This gives rise to directory-based protocols.
Directory Structures
• Cache directories store information on where (in which
processors) copies of cache blocks reside.
• Central directory approaches (with copies of all cache
directories) are very large, and require an associative search
(like the individual cache directories).
• Memory modules might keep track of which processor caches
have copies of their data, thus allowing the memory module
to redirect cache miss requests to the cache that contains the
“dirty” data (causing the associated writing of the block to
memory).
Types of Directory Protocols
• Directory entries are pairs identifying cache blocks
and processor caches holding those blocks.
• Three different types of directory protocols:
• Full-map directories – each directory entry can identify all
processors with cached copies of data; with N processors,
each directory entry must have N processor ID identifiers.
• Limited directories – each entry has a fixed number of
processor identifiers, regardless of the system size.
• Chained directories – emulate full-map directories by
distributing entries among the caches.
Full-map Protocols
• Directory entries have one bit per processor in the system,
and another bit to indicate if the data has been modified
(“dirty”).
• If the dirty bit is set, then only one processor must be
identified in the bit map; only that processor is allowed to
write the block into memory.
• Cache maintains two bits of state information per block:
• Is the cached block valid?
• Can a valid cached block be written to memory?
• The purpose of the cache coherence protocol is to keep the
cache’s state bits and those in the memory directory
consistent.
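A full-map entry can be pictured as a small C struct; the processor count and field names are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

#define NPROC 64                     /* assumed number of processors */

/* One directory entry per memory block: one presence bit per processor
   plus a dirty bit. If dirty is set, exactly one presence bit is set,
   identifying the only processor allowed to write the block to memory. */
typedef struct {
    uint64_t present;                /* bit p set => processor p has a copy */
    bool     dirty;
} FullMapEntry;

/* The two per-block state bits kept by each cache. */
typedef struct {
    bool valid;                      /* is the cached block valid? */
    bool writable;                   /* may it be written to memory? */
} CacheState;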
Three States of a Full-Map Directory
Full Map State Changes
• In the first state (upper left in previous slide), X is
missing from all caches.
• In the second state, three caches are requesting
copies of X. The bits of the three processors are set,
and the dirty bit is still ‘C’ (clean), since no processor
has requested to write X.
• In the third state, the dirty bit is set (‘D’), since a
processor requested to write X. Only the
corresponding processor has its bit set in the map.
Write Actions
• Cache C3 detects the block is valid, but the processor doesn’t have
write permission.
• Write request issued to memory, stalling the processor.
• Other caches receive invalidate requests and send
acknowledgements to memory.
• Memory receives acknowledgements, sets dirty bit, clears pointers
to other processors, sends write permission to C3.
• By waiting for acknowledgements, the memory ensures sequential
consistency.
• C3 gets write permission, updates cache state, and reactivates the
processor.
Full-Map Protocol Benefits
• The full-map protocol provides an upper bound on
the performance of centralized directory-based
cache coherence.
• It is not scalable, however, because of the excessive
memory overhead it incurs.
Limited Directories
• Designed to solve the directory size problem.
• Restricts the number of cached copies of a datum, thus
limiting the growth of the directory.
• Agrawal notation: Dir_i X
• i indicates number of pointers in directory
• X is NB for no broadcast, B for broadcast
• E.g. full map with N processors is Dir_N NB
• In the example (next slide), the left figure shows C1 and C2
holding copies of X. When C3 requests a copy, the C1 or C2
copy must be invalidated using a process called “eviction,” as
shown by the right figure.
Eviction in a Limited Directory
Limited Directory Memory Size
• In the full-map protocol, it is sufficient to use a single
bit to identify if each of the N processors has a copy
of the datum.
• In a limited directory scheme, processor numbers
must be maintained, requiring log2 N bits each.
• If the code being executed on a multiprocessor
system exhibits “processor locality,” then a limited
directory is sufficient to capture the identity of the
processors.
Limited Directory Scalability
• Limited directory schemes for cache coherency in
non-bus systems are scalable, in that the number of
resources required for their implementation grows
linearly as the number of processors grows.
• Dir_i B protocols exist that allow more than i copies of
a block to exist in caches, but must use broadcast to
invalidate more than i copies of a block (because of a
write request). Without a broadcast capability in the
connection network, ensuring sequential consistency
is difficult.
Chained Directories
• Chained directories are scalable (like limited directories).
• They keep track of shared copies of data using a chain of
directory pointers.
• Each cache must include a pointer (which can be the chain
termination pointer) to the next cache that contains a datum.
• When a processor requests a read, it is sent the datum along
with a pointer to the previous head of the list (or a chain
termination pointer if it is the only processor requesting the
datum).
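The sharing list can be sketched as a per-cache pointer in C; the chain-termination value CT and the field names are illustrative assumptions:

#include <stdint.h>

#define CT 0xFFFFu                   /* assumed chain-termination pointer */

/* Each cache holding the datum stores the ID of the next sharer;
   the memory module stores the head of the list. */
typedef struct { uint16_t next_cache; } ChainedDirEntry;

static uint16_t head = CT;           /* kept by the memory module */

/* A read by cache c pushes c onto the head of the sharing list. */
static void add_reader(ChainedDirEntry caches[], uint16_t c)
{
    caches[c].next_cache = head;     /* new reader points at old head (or CT) */
    head = c;                        /* memory now points at the new reader */
}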
A Chained Directory Example
Invalidation in Chained Directories
• When a processor requests to write a datum, the processor at the head of the list is sent an invalidate request.
• Processors pass the invalidate request along until it
reaches the processor at the end of the list.
• That processor sends an acknowledgement to the
memory, which then grants write access to the
processor requesting such.
• Author suggests this be called the “gossip” protocol.
Complications with Chained Dirs
• Suppose processor i requests Y, and the (direct-mapped)
cache already contains an entry X which maps to the same
location as Y. It must evict X from its cache, thus requiring the
list of X’s users to be altered.
• Two schemes for the list alteration:
• Send a message “down the list” to cache i-1 with a pointer to cache
i+1, removing i from the list.
• Invalidate X in caches i+1 through N.
• Alternately, a doubly-linked list could be used, with the
expected implications for size, speed, and protocol
complexity.
• Chained directories are scalable, and cache sizes (not number
of processors) control the number of pointers.
BCS-29
Advanced Computer Architecture
Input Output Techniques
Input/Output Problems
• Wide variety of peripherals
• Delivering different amounts of data
• At different speeds
• In different formats
• All slower than CPU and RAM
• Need I/O modules w/ some “intelligence”
Accessing I/O Devices
• Computer system components communicate through an interconnection network
• How to address the I/O devices? Memory-mapped I/O allows I/O registers to be accessed as memory locations. As a result, these registers can be accessed using only Load and Store instructions
I/O Device Interface
• An I/O device interface is a circuit between
a device and the interconnection network
• Provides the means for data transfer and
exchange of status and control information
• Includes data, status, and control registers
accessible with Load and Store instructions
• Memory-mapped I/O enables software to view these
registers as locations in memory
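In C, memory-mapped registers are typically accessed through volatile pointers, so plain loads and stores become device accesses; the addresses and register layout below are illustrative assumptions, not a real device:

#include <stdint.h>

/* Hypothetical memory-mapped device registers. */
#define DEV_BASE   0x4000F000u
#define DEV_DATA   (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_STATUS (*(volatile uint32_t *)(DEV_BASE + 0x4))

#define STATUS_READY 0x1u

/* Read one item from the device using only loads (Load/Store I/O). */
static uint32_t dev_read(void)
{
    while ((DEV_STATUS & STATUS_READY) == 0)
        ;                    /* an ordinary load polls the status register */
    return DEV_DATA;         /* an ordinary load reads the data register */
}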
Generic Model of I/O Module
Input/Output Module
• Interface to CPU and Memory
• Interface to one or more peripherals
External Devices
• Human readable (human interface)
• Monitor, printer, keyboard, mouse
• Machine readable
• Disk, tape, sensors
• Communication
• Modem
• Network Interface Card (NIC)
External Device Block Diagram
I/O Module Function
• Control & Timing
• CPU (Processor) Communication
• Device Communication
• Data Buffering
• Error Detection (e.g., extra parity bit)
I/O Steps
• CPU checks (interrogates) I/O module device status
• I/O module returns status
• If ready, CPU requests data transfer by sending a
command to the I/O module
• I/O module gets a unit of data (byte, word, etc.) from
device
• I/O module transfers data to CPU
• Variations of these steps for output, DMA, etc.
Processor/device Communications
• Command decoding
• Data
• Status reporting
• Address recognition
I/O Module Diagram
(Figure: the I/O module sits between the systems bus interface and the external device interface. The data lines load data registers and status/control registers; the address and data lines drive the input/output logic, which connects through per-device interface logic carrying data, status, and control signals to each external device.)
I/O Module Decisions
• Hide or reveal device properties to CPU
• Support multiple or single device
• Control device functions or leave for CPU
• Also O/S decisions
• e.g. Unix treats everything it can as a file
Input Output Techniques
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)
Programmed I/O
• CPU has direct control over I/O
• Sensing status
• Read/write commands
• Transferring data
• CPU waits for I/O module to complete operation
• Wastes CPU time
Programmed I/O - detail
• CPU requests I/O operation
• I/O module performs operation
• I/O module sets status bits
• CPU checks status bits periodically
• I/O module does not inform CPU directly
• I/O module does not interrupt CPU
• CPU may wait or come back later
Program-Controlled I/O
• Discuss I/O issues using keyboard & display
• Read keyboard characters, store in memory, and
display on screen
• Implement this task with a program that performs all
of the relevant functions
• This approach called program-controlled I/O
• How can we ensure correct timing of actions
and synchronized transfers between devices?
Signaling Protocol for I/O Devices
• Assume that the I/O devices have a way to send a
‘READY’ signal to the processor
• For keyboard, it indicates that the character is ready
to be read.
• For display, it indicates the display is ready to receive
the character.
• The ‘READY’ signal in each case is a status flag
in a status register that is polled by the processor
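A program-controlled keyboard-to-display loop might look like this in C; all register addresses and flag bits are hypothetical:

#include <stdint.h>

/* Hypothetical memory-mapped keyboard and display registers. */
#define KBD_STATUS  (*(volatile uint32_t *)0x4000F010u)
#define KBD_DATA    (*(volatile uint32_t *)0x4000F014u)
#define DISP_STATUS (*(volatile uint32_t *)0x4000F020u)
#define DISP_DATA   (*(volatile uint32_t *)0x4000F024u)
#define READY 0x1u

void echo_line(void)
{
    for (;;) {
        while ((KBD_STATUS & READY) == 0)
            ;                          /* busy-wait on keyboard READY flag */
        char ch = (char)KBD_DATA;      /* reading clears the flag */

        while ((DISP_STATUS & READY) == 0)
            ;                          /* busy-wait on display READY flag */
        DISP_DATA = (uint32_t)ch;      /* writing starts the display transfer */

        if (ch == '\r')
            break;                     /* stop at carriage return (assumed) */
    }
}

The two busy-wait loops are exactly the timing cost that interrupt-driven I/O, discussed below, removes.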
I/O Commands
• CPU issues address
• Identifies module (& device if >1 per module)
• CPU issues command
• Control - telling module what to do
• e.g. spin up disk
• Test - check status
• e.g. power? Error?
• Read/Write
• Module transfers data via buffer from/to device
Addressing I/O Devices
• Under programmed I/O, data transfer is very like memory access (CPU viewpoint)
• Each device given unique identifier
• CPU commands contain identifier (address)
I/O Mapping
• Memory mapped I/O
• Devices and memory share an address space
• I/O looks just like memory read/write
• No special commands for I/O
• Large selection of memory access commands available
• Isolated I/O
• Separate address spaces
• Need I/O or memory select lines
• Special commands for I/O
• Limited set
Interrupt Driven I/O
• Overcomes CPU waiting
• No repeated CPU checking of device
• I/O module interrupts when ready
Interrupts
• Drawback of a programmed I/O: BUSY-WAIT LOOP
• Due to the time needed to poll if the I/O device is ready, the processor often cannot perform useful computation
• Instead of using a BUSY-WAIT LOOP, let the I/O device alert the processor when it is ready
• Hardware sends an interrupt-request signal to the processor at the appropriate time, much like a phone call.
• Meanwhile, processor performs useful tasks
Example of Using Interrupts
• Consider a task with extensive computation and
periodic display of current results
• Timer circuit can be used for desired interval, with
interrupt-request signal to processor
• Two software routines: COMPUTE & DISPLAY
• Processor suspends COMPUTE execution to execute
DISPLAY on interrupt, then returns
• DISPLAY is short; time is mostly in COMPUTE
Interrupt Driven I/O
Basic Operation
• CPU issues read command
• I/O module gets data from peripheral whilst CPU
does other work
• I/O module interrupts CPU
• CPU requests data
• I/O module transfers data
CPU Viewpoint
• Issue read command
• Do other work
• Check for interrupt at end of each instruction cycle
• If interrupted:-
• Save context (registers)
• Process interrupt
• Fetch data & store
• See Operating Systems notes
Design Issues
• How do you identify the module issuing the
interrupt?
• How do you deal with multiple interrupts?
• i.e. an interrupt handler being interrupted
Identifying Interrupting Module (1)
• Different line for each module
• PC
• Limits number of devices
• Software poll
• CPU asks each module in turn
• Slow
Identifying Interrupting Module (2)
• Daisy Chain or Hardware poll
• Interrupt Acknowledge sent down a chain
• Module responsible places vector on bus
• CPU uses vector to identify handler routine
• Bus Master
• Module must claim the bus before it can raise interrupt
• e.g. PCI & SCSI
Multiple Interrupts
• Each interrupt line has a priority
• Higher priority lines can interrupt lower priority lines
• If bus mastering only current master can interrupt
Example - PC Bus
• 80x86 has one interrupt line
• 8086 based systems use one 8259A interrupt
controller
• 8259A has 8 interrupt lines
Sequence of Events
• 8259A accepts interrupts
• 8259A determines priority
• 8259A signals 8086 (raises INTR line)
• CPU Acknowledges
• 8259A puts correct vector on data bus
• CPU processes interrupt
ISA Bus Interrupt System
• ISA bus chains two 8259As together
• Link is via interrupt 2
• Gives 15 lines
• 16 lines less one for link
• IRQ 9 is used to re-route anything trying to use IRQ 2
• Backwards compatibility
• Incorporated in chip set
82C59A Interrupt Controller
Intel 82C55A
Programmable Peripheral Interface
Using 82C55A To Control Keyboard/Display
8086 External Interrupt Connections
NMI – Non-Maskable Interrupt; INTR – Interrupt Request
(Figure: an NMI-requesting device drives the 8086 NMI input directly; the Intel 8259A Programmable Interrupt Controller (PIC, part of the chipset) drives INTR; divide error, single step, and the int instruction reach the CPU's interrupt logic as software traps.)
Interrupt Vector Table – IVT (in memory)
• x86 has 256 interrupts, specified by Type Number or Vector
• 1 byte of data must accompany each interrupt; specifies Type
• Vector is a pointer (address) into Interrupt Vector Table, IVT
• IVT is stored in memory from 0000:0000 to 0000:03ffh
• IVT contains 256 far pointer values (addresses)
• Far pointer is CS:IP values
• Each far pointer is address of Interrupt Service Routine, ISR
• Also referred to as Interrupt Handler
When an interrupt is requested, which IVT entry? Type!
IVT Format
(Figure: IVT layout from 0000:0000 to 0000:03ff. Each entry is four bytes – IP LSB, IP MSB, CS LSB, CS MSB – so Interrupt 0 occupies 0000:0000–0000:0003, Interrupt 1 occupies 0000:0004–0000:0007, and Interrupt 255 occupies 0000:03fc–0000:03ff.)
Given a Vector, where is the ISR address stored in memory?
Offset = Type × 4
Example: int 36h → Offset = 54 × 4 = 216 = 00d8h
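The same lookup written out in C; treating the IVT as an array at physical address zero is a sketch of real-mode behavior, not code that runs under a protected-mode OS:

#include <stdint.h>

typedef struct { uint16_t ip; uint16_t cs; } FarPtr;

/* Real-mode view: 256 four-byte IVT entries starting at address 0. */
static FarPtr *const ivt = (FarPtr *)0x00000000;

/* Return the ISR far pointer for a given interrupt type. */
static FarPtr isr_address(uint8_t type)
{
    return ivt[type];        /* entry lives at offset type * 4 */
}

/* isr_address(0x36) reads the far pointer stored at 0000:00d8. */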
What Happens During an Interrupt?
(Flowchart: the CPU completes the current instruction, then checks interrupt sources in priority order – internal interrupt, NMI, then INTR (accepted only when IF = 1, after an interrupt-acknowledge cycle that reads the type code), then single step when TF = 1. To service an interrupt it pushes the flags, saves TEMP = TF, clears IF and TF, pushes CS and IP, and calls the ISR. On return, iret pops IP and CS and restores the flags, and the interrupted procedure resumes; otherwise the CPU fetches the next instruction.)
Similarity to Subroutine Procedure
• call ↔ int
• ret ↔ iret
• call pushes CS, IP and loads CS:IP with address of subroutine
• int does what call does and more
• ret pops IP, CS
• iret pops FLAGS, IP, and CS
• This is why ALL programs MUST have a stack segment, so that interrupts can be handled
Interrupt Vector Assignments
Type Function Comment
0 Divide Error Processor - zero or overflow
1 Single Step (DEBUG) Processor - TF=1
2 Nonmaskable Interrupt Pin Processor - NMI Signal
3 Breakpoint Processor - Similar to Sing Step
4 Arithmetic Overflow Processor - into
5 Print Screen Key BIOS - Key Depressed
6 Invalid Opcode Processor - Invalid Opcode
7 Coprocessor Not Present Processor - no FPU
8 Time Signal BIOS - From RT Chip (AT - IRQ0)
9 Keyboard Service BIOS - Gen Service (AT - IRQ1)
A-F Originally Bus Ops (IBM PC) BIOS - (AT - IRQ2-7)
10 Video Service Request BIOS - Accesses Video Driver
11 Equipment Check BIOS - Diagnostic
12 Memory Size BIOS - DOS Memory
13 Disk Service Request BIOS - Accesses Disk Driver
14 Serial Port Service Request BIOS - Accesses Serial Port Drvr
15 Miscellaneous BIOS - Cassette, etc.
16 Keyboard Service Request BIOS - Accesses KB Driver
Interrupt Vector Assignments
Type Function Comment
17 Parallel Port LPT Service BIOS - Printer Driver
18 ROM BASIC BIOS - BASIC Interpreter in ROM
19 Reboot BIOS - Bootstrap
1A Clock Service BIOS - Time of Day from BIOS
1B Control-Break Handler BIOS - Keyboard Break
1C User Timer Service BIOS - Timer Tick
1D Pointer to Video Parm Table BIOS - Video Initialization
1E Pointer to Disk Parm Table BIOS - Disk Subsystem Init.
1F Pointer to Graphics Fonts BIOS - CGA Graphics Fonts
20 Program Terminate DOS - Clear Memory, etc.
21 Function Call DOS - Transfer Control
22 Terminate Address DOS - program Terminate handler
23 Control-C Handler DOS - For OS Use
24 Fatal Error Handler DOS - Critical Error
25 Absolute Disk Read DOS - Disk Read
26 Absolute Disk Write DOS - Disk Write
27 Terminate DOS - TSR Usage
28 Idle Signal DOS - Idle
2F Print Spool DOS - Cassette, etc.
70-77 Hardware Interrupts in AT Bios DOS - (AT - IRQs 8-15)
AT – IRQ Definitions
IBM-AT (Advanced Technology) - Intel 80286

Name Interrupt Vector Priority Description


NMI 02 1 Memory Parity Error
IRQ0 08 2 Timer (Intel 8253 Chip 55 ms intervals)
IRQ1 09 3 Keyboard
IRQ2 0A 4 8259 PIC Slave or EGA/VGA Vert. Retrace
IRQ3 0B 13 Serial Port (COM2 or COM4)
IRQ4 0C 14 Serial Port (COM1 or COM3)
IRQ5 0D 15 Fixed Disk or LPT2 Request
IRQ6 0E 16 Floppy Disk Driver
IRQ7 0F 17 LPT1 Request
IRQ8 70 5 CMOS Real-Time Clock (RT Chip)
IRQ9 71 6 Re-directed to IRQ2
IRQ10 72 7 RESERVED
IRQ11 73 8 RESERVED
IRQ12 74 9 Mouse or other
IRQ13 75 10 Math Coprocessor (NPX)
IRQ14 76 11 Hard Disk
IRQ15 77 12 RESERVED
Interrupt Service Routines
• ALL Interrupts
• Interrupts for Input
• Interrupts for Output
Interrupts for Input
• INT 16h / AH = 00h - get keystroke from keyboard
(no echo)
• INT 16h / AH = 01h - check for keystroke in the
keyboard buffer
• INT 21h / AH=1 - read character from standard input,
with echo
• INT 21h / AH=6 - direct console input or output
• INT 21h / AH=7 - character input without echo to AL
• INT 21h / AH=0Ah - input of a string
• INT 21h / AH=0Bh - get input status
Interrupts for Output
• INT 21h / AH=2 - write character to standard output
• INT 21h / AH=9 - output of a string
Direct Memory Access
• Interrupt driven and programmed I/O require active
CPU intervention
• Transfer rate is limited
• CPU is tied up
• DMA is the answer
DMA Application
• DMA is for high-speed data transfer from/to mass storage peripherals, e.g. hard disk drive, magnetic tape, CD-ROM, and sometimes video controllers.
• For example, a hard disk may boast a transfer rate of 5 M bytes per second, i.e. 1 byte transmission every 200 ns. To make such data transfer via the CPU is both undesirable and unnecessary.
• The basic idea of DMA is to transfer blocks of data directly between
memory and peripherals. The data don’t go through the microprocessor
but the data bus is occupied.
• “Normal” transfer of one data byte takes up to 29 clock cycles. The DMA
transfer requires only 5 clock cycles.
• Nowadays, DMA can transfer data as fast as 60 M byte per second. The
transfer rate is limited by the speed of memory and peripheral devices.
Basic DMA concept
• Direct memory access (DMA) is a feature of modern
computer systems that allows certain hardware subsystems to
read/write data to/from memory without microprocessor
intervention, allowing the processor to do other work.
• Used in disk controllers, video/sound cards etc, or between
memory locations.
• Typically, the CPU initiates DMA transfer, does other
operations while the transfer is in progress, and receives an
interrupt from the DMA controller once the operation is
complete.
• Can create cache coherency problems (the data in the cache
may be different from the data in the external memory after
DMA)
The 8237 DMA controller
• Supplies memory and I/O with control signals and addresses during DMA
transfer
• 4-channels (expandable)
• 0: DRAM refresh
• 1: Free
• 2: Floppy disk controller
• 3: Free
• 1.6MByte/sec transfer rate
• 64 KByte section of memory address capability with single programming
• “fly-by” controller (data does not pass through the DMA) - only memory to I/O transfer capability
• Initialization involves writing into each channel:
• i) The address of the first byte of the block of data that must be transferred (called the base
address).
• ii) The number of bytes to be transferred (called the word count).
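A hedged sketch of that two-step initialization for channel 2 (my illustration, not from the slides; the port numbers are the standard PC mapping of the 8237, and outb() here is just a stub standing in for a real OUT instruction):

    #include <stdint.h>
    #include <stdio.h>

    /* Stub for illustration; a real driver would execute an x86 OUT. */
    static void outb(uint16_t port, uint8_t val) {
        printf("out 0x%02X <- 0x%02X\n", port, val);
    }

    /* Program 8237 channel 2 with a base address and a transfer count.
       The controller performs count+1 transfers, so pass length-1. */
    static void dma2_setup(uint16_t base, uint16_t count) {
        outb(0x0A, 0x06);          /* mask channel 2 while programming    */
        outb(0x0C, 0x00);          /* clear the LSB/MSB byte flip-flop    */
        outb(0x04, base & 0xFF);   /* i) base address, low byte then high */
        outb(0x04, base >> 8);
        outb(0x05, count & 0xFF);  /* ii) word count, low byte then high  */
        outb(0x05, count >> 8);
        outb(0x0B, 0x46);          /* mode: single transfer, write to
                                      memory, channel 2                   */
        outb(0x0A, 0x02);          /* unmask channel 2                    */
    }

    int main(void) {
        dma2_setup(0x1234, 512 - 1);   /* e.g. one 512-byte sector */
        return 0;
    }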

DMA Operation
• CPU tells DMA controller:-
• Read/Write
• Device address
• Starting address of memory block for data
• Amount of data to be transferred
• CPU carries on with other work
• DMA controller deals with transfer
• DMA controller sends interrupt when finished
DMA Transfer & Cycle Stealing
• DMA controller takes over bus for a cycle
• Transfer of one word of data
• Not an interrupt
• CPU does not switch context
• CPU suspended just before it accesses bus
• i.e. before an operand or data fetch or a data write
• Slows down CPU but not as much as CPU doing
transfer
Basic process of DMA
In maximum mode:
Handshaking pins: RQ/GT1 and RQ/GT0 → DMA request and acknowledge (RQ/GT0 has higher priority)
Sequence:
1. Peripheral asserts one of the request pins, e.g. RQ/GT1 or RQ/GT0
2. CPU completes its current bus cycle and enters a HOLD state
3. CPU grants the right of bus control by asserting a grant signal via the same pin as the request signal
4. DMA operation starts
5. Upon completion of the DMA operation, the peripheral asserts the request/grant pin again to relinquish bus control

In minimum mode:
The HOLD and HLDA pins are used for handshaking
DMA Controller
• A DMA controller interfaces with several peripherals that may request
DMA.
• The controller decides the priority of simultaneous DMA requests and
communicates with the peripheral and the CPU, and provides memory
addresses for data transfer.
• DMA controller commonly used with 8088 is the 8237 programmable
device.
• The 8237 is in fact a special-purpose processor.
• Normally it appears as part of the system controller chip-sets.
• The 8237 is a 4-channel device. Each channel is dedicated to a specific peripheral device and is capable of addressing a 64 KByte section of memory.
DMA on the 8088 CPU
• The I/O device asserts the appropriate DRQ signal for the channel.
• The DMA controller will enable appropriate channel, and ask the CPU to
release the bus so that the DMA may use the bus. The DMA requests the
bus by asserting the HOLD signal which goes to the CPU.
• The CPU detects the HOLD signal, and will complete executing the current
instruction. Now all of the signals normally generated by the CPU are
placed in a tri-stated condition (neither high nor low), and then the CPU
asserts the HLDA signal which tells the DMA controller that it is now in
charge of the bus.
• The CPU may have to wait (hold cycles).
• DMA activates its -MEMR, -MEMW, -IOR, -IOW output signals, and the
address outputs from the DMA are set to the target address, which will be
used to direct the byte that is about to be transferred to a specific memory
location.
• The DMA will then let the device that requested the DMA transfer know
that the transfer is commencing by asserting the -DACK signal.
DMA on the 8088 CPU
• The peripheral places the byte to be transferred on the bus Data lines.
• Once the data has been transferred, the DMA will de-assert the -DACK signal, so that the peripheral (e.g. the floppy disk controller) knows it must stop placing data on the bus.
• The DMA will now check to see if any of the other DMA channels have any
work to do. If none of the channels have their DRQ lines asserted, the
DMA controller has completed its work and will now tri-state the -MEMR,
-MEMW, -IOR, -IOW and address signals.
• Finally, the DMA will de-assert the HOLD signal. The CPU sees this, and de-asserts the HLDA signal. Now the CPU resumes control of the buses and
address lines, and it resumes executing instructions and accessing main
memory and the peripherals.

An 8237 DMA application

DMA Module Diagram
DMA Configurations (1)

• Single Bus, Detached DMA controller


• Each transfer uses bus twice
• I/O to DMA then DMA to memory
• CPU is suspended twice
DMA Configurations (2)

• Single Bus, Integrated DMA controller


• Controller may support >1 device
• Each transfer uses bus once
• DMA to memory
• CPU is suspended once
DMA Configurations (3)

• Separate I/O Bus


• Bus supports all DMA enabled devices
• Each transfer uses bus once
• DMA to memory
• CPU is suspended once
BCS-29
Advanced Computer Architecture
Memory Hierarchy Design
Memory Subsystem
Hierarchy List
• Registers
• L1 Cache
• L2 Cache
• Main memory
• Disk cache
• Disk
• Optical
• Tape

Cache and Main Memory

Cache/Main Memory Structure

Cache operation – overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main
memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main
memory is in each cache slot
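A minimal sketch of this flow in C (mine, not from the slides), using the 64 KByte direct-mapped organization worked out later in this section (14-bit line index, 8-bit tag, 4-byte blocks):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINES      16384              /* 2^14 cache lines         */
    #define BLOCK_SIZE 4                  /* 4-byte blocks            */

    struct line { int valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };

    static struct line cache[LINES];
    static uint8_t main_memory[1u << 24]; /* stand-in 16 MByte memory */

    static uint8_t read_byte(uint32_t addr) {
        uint32_t word = addr & 0x3;                 /* word within block */
        uint32_t idx  = (addr >> 2) & (LINES - 1);  /* line index        */
        uint32_t tag  = addr >> 16;                 /* tag               */
        struct line *l = &cache[idx];
        if (!l->valid || l->tag != tag) {           /* miss: fetch block */
            memcpy(l->data, &main_memory[addr & ~3u], BLOCK_SIZE);
            l->tag = tag;
            l->valid = 1;
        }
        return l->data[word];                       /* deliver to "CPU"  */
    }

    int main(void) {
        main_memory[0x16339C] = 42;
        printf("%u\n", read_byte(0x16339C));  /* miss, fills the line */
        printf("%u\n", read_byte(0x16339C));  /* hit                  */
        return 0;
    }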

Cache Read Operation - Flowchart

How does cache memory work?

• Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. For mapping purposes, this memory is considered to consist of a number of fixed-length blocks of K words each. Thus, there are M = 2^n / K blocks.
• The cache is split into C lines of K words each, with the number of lines considerably smaller than the number of main memory blocks (C << M).
• At any time, only a subset of the blocks of main memory resides in the lines of the cache.
• If a word in a block of memory is read, that block is transferred to one of the lines of the cache.

Cache Design
• Addressing
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches

Mapping Function
• how does cache contents map to main memory contents?
(figure: cache lines, each holding a tag and a data field, shown alongside main memory blocks; the tag - and possibly the line address - identifies which main memory block currently occupies a line)
Cache Basics

• cache line vs. main memory location
• same concept - avoid confusion (?)
• line has address and contents
• contents of cache line divided into tag and data fields
• fixed width
• fields used differently!
• data field holds contents of a block of main memory
• tag field helps identify the start address of the block of memory that is in the data field


Mapping Function Example

• cache of 64 KByte - holds up to 64 KBytes of main memory contents
• 16 K (2^14) lines - each line is 5 bytes wide = 40 bits
• tag field: 1 byte; data field: 4 bytes (one 4-byte block of main memory)
• 16 MBytes main memory
• 24 bit address (2^24 = 16 M)
• will consider DIRECT and ASSOCIATIVE mappings


Direct Mapping

• each block of main memory maps to only one cache line
• i.e. if a block is in cache, it must be in one specific place - based on address!
• split the address into two parts: s bits and w bits
• least significant w bits identify a unique word within the block
• most significant s bits specify one memory block
• split the s bits into:
• a line field of r bits - identifies the line containing the block!
• a tag field of the s-r most significant bits
Direct Mapping: Address Structure Example

24 bit address:

tag: s-r = 8 bits | line address: r = 14 bits | word: w = 2 bits

• 2 bit word identifier (4 byte block)
• s = 22 bit block identifier
• two blocks may have the same r value, but then they always have different tag values!
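The same 8/14/2 split, as a quick C sketch (my addition) with shifts and masks:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t addr = 0x16339C;              /* any 24-bit address */
        uint32_t word = addr & 0x3;            /* low 2 bits         */
        uint32_t line = (addr >> 2) & 0x3FFF;  /* next 14 bits       */
        uint32_t tag  = addr >> 16;            /* top 8 bits         */
        printf("tag=%02X line=%04X word=%X\n",
               (unsigned)tag, (unsigned)line, (unsigned)word);
        return 0;
    }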
Direct Mapping Cache Line Table
(each block = 4 bytes; s = 22, m = 2^14)

cache line   main memory blocks held
0            0, m, 2m, 3m, …, 2^s - m
1            1, m+1, 2m+1, …, 2^s - m + 1
…            …
m-1          m-1, 2m-1, 3m-1, …, 2^s - 1

But… a line can contain only one of these at a time!


Direct Mapping Cache Organization

Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size - tag size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits


Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
• If a program accesses 2 blocks that map to the
same line repeatedly, cache misses are very high

Associative Mapping
• main memory block can load into any line of cache
• memory address is interpreted as tag and word select within the block (s bits = tag → does not use a line address!)
• tag uniquely identifies block of memory !
• every line’s tag is examined for a match
• cache searching gets expensive

Fully Associative Cache Organization
(figure: fully associative organization - no line field!)

Associative Mapping Example

(figure: worked example - tag = most significant 22 bits of the address; note: the figure has a typo, a leading F is missing)
Associative Mapping (Address Structure)

Tag: 22 bit | Word: 2 bit

• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to check for hit
• Least significant 2 bits of address identify which 8 bit word is required from the 32 bit data block
• e.g.
Address FFFFFC | Tag 3FFFFF | Data 24682468 | Cache line: any, e.g. 3FFF


Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size - tag size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits

Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains k lines → k – way associative
• A given block maps to any line in a given set
• e.g. Block B can be in any line of set i
• e.g. 2 lines per set
• 2 – way associative mapping
• A given block can be in one of 2 lines in only one set

Set Associative Mapping Address Structure

Tag: 9 bit | Set: 13 bit | Word: 2 bit

E.g. given our 64 KByte cache with a line size of 4 bytes, we have 16384 lines. Say that we decide to create 8192 sets, where each set contains 2 lines. Then we need 13 bits to identify a set (2^13 = 8192).
Use the set field to determine the cache set to look in.
Compare the tag field of all slots in the set to see if we have a hit, e.g.:
Address = 16339C = 0001 0110 0011 0011 1001 1100
  Tag = 0 0010 1100 = 02C
  Set = 0 1100 1110 0111 = 0CE7
  Word = 00 = 0
Address = 008004 = 0000 0000 1000 0000 0000 0100
  Tag = 0 0000 0001 = 001
  Set = 0 0000 0000 0001 = 0001
  Word = 00 = 0
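The same two worked addresses can be checked mechanically; a small C sketch (my addition) of the 9/13/2 split:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t addrs[] = { 0x16339C, 0x008004 };
        for (int i = 0; i < 2; i++) {
            uint32_t a    = addrs[i];
            uint32_t word = a & 0x3;             /* low 2 bits   */
            uint32_t set  = (a >> 2) & 0x1FFF;   /* next 13 bits */
            uint32_t tag  = a >> 15;             /* top 9 bits   */
            printf("%06X -> tag=%03X set=%04X word=%X\n",
                   (unsigned)a, (unsigned)tag, (unsigned)set, (unsigned)word);
        }
        return 0;                 /* prints 02C/0CE7/0 and 001/0001/0 */
    }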


Set Associative Mapping

• To compute cache set number:
• SetNum = j mod v
• j = main memory block number
• v = number of sets in cache
(figure: a 2-way set associative cache - Set 0 holds Lines 0 and 1, Set 1 holds Lines 2 and 3 - alongside Main Memory Blocks 0-5)
K-Way Set Associative Cache Organization
(figure: k-way set associative organization - the set select is direct, the tag compare is associative: Direct + Associative Mapping)
Breaking into Tag, Set, Word
• Given Tag=9 bits, Set=13 bits, Word=2 bits
• Given address FFFFFD (hexadecimal)
• What are the values of Tag, Set, Word?
• First 9 bits are Tag, next 13 are Set, last 2 are Word
• Rewrite the address in base 2: 1111 1111 1111 1111 1111 1101
• Group each field in groups of 4 bits starting at the right
• Add zero bits as necessary to the leftmost group of each field
• 0001 1111 1111 | 0001 1111 1111 1111 | 0001
• → 1FF 1FFF 1 (Tag, Set, Word)


Replacement Algorithms Direct Mapping
• what if bringing in a new block, but no line available
in cache?
• must replace (overwrite) a line – which one?

• direct → no choice
• each block only maps to one line
• replace that line

Replacement Algorithms (Associative & Set Associative)
• hardware implemented algorithm (speed)
• Least Recently Used (LRU)
• e.g. in 2-way set associative
• which of the 2 blocks is LRU?
• First In First Out (FIFO)
• replace block that has been in cache longest
• Least Frequently Used (LFU)
• replace block which has had fewest hits
• Random
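For 2-way set associativity, LRU needs only one bit per set, which is why it is easy to implement in hardware. A tiny sketch of that bookkeeping (my illustration, assuming 8192 sets as in the earlier example):

    #include <stdint.h>

    #define NUM_SETS 8192
    static uint8_t lru[NUM_SETS];       /* one LRU bit per 2-way set    */

    /* On every hit or fill, mark 'way' (0 or 1) as most recently used. */
    void touch(int set, int way) {
        lru[set] = (uint8_t)(way ^ 1);  /* the other way becomes LRU    */
    }

    /* On a miss, the victim is whichever way the LRU bit points at. */
    int victim(int set) {
        return lru[set];
    }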

Write Policy
• must not overwrite a cache block unless main
memory is up to date
• Complication: Multiple CPUs may have individual
caches!!
• Complication: I/O may address main memory too
(read and write)!!
• N.B. 15% of memory references are writes

Write Through Method
• all writes go to main memory as well as cache
• Each of multiple CPUs can monitor main memory
traffic to keep its own local cache up to date
• lots of traffic → slows down writes

Write Back Method
• updates initially made in cache only
• update (dirty) bit for cache slot is set when update
occurs
• if block is to be replaced, write to main memory only
if update bit is set
• Other caches get out of sync
• I/O must access main memory through cache

Example-1
• M1: 16K words, 50 ns access time
• M2: 1M words, 400 ns access time
• 8-word cache lines; set size is 256 words, with set-associative mapping
• Give the mapping between M2 and M1
• Calculate the effective memory access time (cache hit ratio = 0.95)
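One way to work this, under the common convention that a miss costs the M1 access plus the M2 access (block-transfer time ignored):

  EAT = h × t_M1 + (1 - h) × (t_M1 + t_M2)
      = 0.95 × 50 + 0.05 × (50 + 400)
      = 47.5 + 22.5 = 70 ns

For the mapping: 256 words per set / 8 words per line = 32 lines per set (32-way), and 16K / 256 = 64 sets, so an M2 block j maps to cache set (j mod 64).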

Cache Performance

• Two measures that characterize the performance of a cache are the hit ratio and the effective access time

Hit Ratio = (number of times referenced words are in cache) / (total number of memory accesses)

Eff. Access Time = ((# hits)(TimePerHit) + (# misses)(TimePerMiss)) / (total number of memory accesses)


Cache Performance Example

• Direct-Mapped Cache Memory
Cache memory: Slots 0-3 (cache access time = 80 ns)
Main memory (access time = 2500 ns), 16-word blocks:
Block 0: 0-15    Block 1: 16-31
Block 2: 32-47   Block 3: 48-63
Block 4: 64-79   Block 5: 80-95
Block 6: …       Block 7: …
Cache Performance Example
• Sample program executes from memory locations 48-95 once. Then it executes from 15-31 in a loop ten times before exiting.
• (That is 48 accesses touching blocks 3-5, then 17 × 10 = 170 accesses touching blocks 0 and 1: 218 accesses in total, of which 5 are first-touch misses.)

Cache Performance Example

• Hit Ratio: 213 / 218 = 97.7%
• Effective Access Time: ((213)(80 ns) + (5)(2500 ns)) / 218 ≈ 136 ns
• Although the hit ratio is high, the effective access time in this example is roughly 70% longer than the cache access time, due to the large amount of time spent during a cache miss
• What sequence of main memory block accesses would result in much worse performance?
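A quick numeric check of these figures (my sketch), applying the formulas from the earlier slide:

    #include <stdio.h>

    int main(void) {
        int hits = 213, misses = 5;
        double t_hit = 80.0, t_miss = 2500.0;             /* ns */
        int total = hits + misses;                        /* 218 accesses */
        printf("hit ratio = %.1f%%\n", 100.0 * hits / total);
        printf("EAT = %.1f ns\n",
               (hits * t_hit + misses * t_miss) / total); /* ~135.5 ns */
        return 0;
    }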


Cache Performance Example
• Consider a cache and main memory hierarchy
• Cache targeted to maintain a hit ratio of 0.9
• A cache access on a read-hit takes 20 ns; a write-hit takes 60 ns with a write-back scheme, and 400 ns with a write-through scheme
• The probability that a cache block to be replaced is dirty is estimated as 0.1
• An average block transfer time between the cache and shared memory via the bus is 400 ns
• Read and write accesses are assumed equally probable
• Derive the effective memory-access times per instruction for the write-through and write-back caches separately
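A hedged setup (my reading of the problem, assuming a miss always transfers the block first, and that replacing a dirty victim under write-back costs one extra block transfer):

Write-through:
  T_read  = 0.9 × 20 + 0.1 × (400 + 20) = 60 ns
  T_write = 400 ns (every write goes through to memory)
  T_WT    = 0.5 × 60 + 0.5 × 400 = 230 ns
Write-back (miss penalty = 400 + 0.1 × 400 = 440 ns before the access itself):
  T_read  = 0.9 × 20 + 0.1 × (440 + 20) = 64 ns
  T_write = 0.9 × 60 + 0.1 × (440 + 60) = 104 ns
  T_WB    = 0.5 × 64 + 0.5 × 104 = 84 ns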
