BCS-29 Advanced Computer Architecture
• Stack Architectures – operands are taken implicitly from the top of a stack; example evaluating X = (A+B) * (C+D):
PUSH A     TOP = A
PUSH B     TOP = B
ADD        TOP = A+B
PUSH C     TOP = C
PUSH D     TOP = D
ADD        TOP = C+D
MUL        TOP = (C+D)*(A+B)
POP X      M[X] = TOP
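A minimal Python sketch of how a stack machine executes this sequence; the memory contents and helper names are invented for illustration and are not part of the slide.

# Minimal sketch of a stack machine evaluating X = (A + B) * (C + D).
# Memory contents are illustrative sample values.
mem = {"A": 2, "B": 3, "C": 4, "D": 5, "X": None}
stack = []

def push(addr):            # PUSH addr: TOP <- mem[addr]
    stack.append(mem[addr])

def pop(addr):             # POP addr: mem[addr] <- TOP
    mem[addr] = stack.pop()

def add():                 # ADD: TOP <- sum of the two topmost entries
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def mul():                 # MUL: TOP <- product of the two topmost entries
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

# The instruction sequence from the slide:
push("A"); push("B"); add()      # TOP = A + B
push("C"); push("D"); add()      # TOP = C + D
mul()                            # TOP = (C + D) * (A + B)
pop("X")                         # M[X] = TOP
print(mem["X"])                  # 45 for the sample values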
• Accumulator Architectures –
one operand is implicitly the accumulator
1 address    add A            acc <- acc + mem[A]
• Register-Register (load/store):
memory is accessed only with load and store instructions
3 address    add R1, R2, R3   R1 <- R2 + R3
             load R1, R2      R1 <- mem[R2]
             store R1, R2     mem[R1] <- R2
• Register-Register (0,3)
(m, n) means m memory operands and n total operands
in an ALU instruction
• Pure RISC: register-to-register ALU operations
• Advantages
• Simple, fixed length instruction encoding.
• Simple code generation.
• Instructions take a similar number of clocks to execute (uniform CPI).
• Disadvantages
• Higher instruction count.
• Some instructions are short, so fixed-length encoding can waste bits.
• Register-Memory (1,2)
• Register–Memory ALU architecture
• Found in later evolutions of RISC and CISC
• Advantages
• Data can be accessed without a separate load first.
• Instruction formats are easy to encode.
• Good instruction density.
• Disadvantages
• A source operand is also the destination, so its original value is overwritten.
• The memory address field may limit the number of registers.
• CPI varies with operand location.
• Memory-Memory (3,3)
• True memory–memory ALU model, e.g. a fully orthogonal CISC architecture
• Advantages
• Most compact instruction density; no temporary registers needed.
• Disadvantages
• Memory accesses create a bottleneck.
• Variable CPI.
• Large variation in instruction size.
• Expensive to implement
• Not used in today’s architectures
Memory Addressing
Mode              Example          Meaning                                      When used
Memory indirect   Add R1, @(R3)    Regs[R1] <- Regs[R1] + Mem[Mem[Regs[R3]]]    Accessing *p when R3 holds the pointer p
Autoincrement     Add R1, (R2)+    Regs[R1] <- Regs[R1] + Mem[Regs[R2]];        Stepping through arrays in a loop
                                   Regs[R2] <- Regs[R2] + 1
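A rough Python sketch of the semantics of these two modes, with registers and memory modeled as dictionaries; the register and memory values are made up for the example.

# Register file and word-addressed memory as dicts (sample contents).
regs = {"R1": 0, "R2": 100, "R3": 200}
mem = {100: 7, 101: 8, 200: 300, 300: 42}

# Memory indirect:  Add R1, @(R3)   =>  R1 <- R1 + Mem[Mem[R3]]
regs["R1"] = regs["R1"] + mem[mem[regs["R3"]]]   # adds mem[300] = 42

# Autoincrement:    Add R1, (R2)+   =>  R1 <- R1 + Mem[R2]; R2 <- R2 + 1
regs["R1"] = regs["R1"] + mem[regs["R2"]]        # adds mem[100] = 7
regs["R2"] += 1                                  # now points at the next element

print(regs)   # R1 = 49, R2 = 101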
Instruction Set Encoding Options
Variable (e.g. VAX)
16-bit and 8-bit Microprocessors
• 8086 (16-bit data bus) & 8088 (8-bit external data bus)
• 20-bit addresses
• Internally split into two units: the Bus Interface Unit (BIU) and the Execution Unit (EU)
• Segmented memory
Elementary Discussion on Modern Computers
8088 architectural block diagram
80286 & 80386
• Two modes
• 8086 Real Address Mode
• Protected Virtual Address Mode
• Protected Virtual Address Mode
• More address space
• Multi-user protection
• Dedicated task support
• Virtual memory
• Interrupt handling
Introduction to 80486
• Increased speed (2 x 80386)
• 80386 with an internal math coprocessor
• Upgradeable (Compatibility)
• Utilize new technology (RISC, cache)
80486 Processor
Intel Pentium
• Introduced in 1993, a year later than originally expected
• 60/66 MHz Bus Speed
• Improved cache structure
• Re-organized to form two caches that are each 8K bytes
in size
• One for caching data
• Another for caching instructions
• Wider data bus
• Increased from 32 bits to 64 bits
Intel Pentium
• Faster numeric processor
• Operates about 5 times faster than the
486 numeric processor
• A dual integer processor
• Often allows two instructions per clock
• Branch prediction logic
• MMX instructions --- a later introduction to
Pentium
Intel Pentium
• Pentium has two execution lines
• U-line --- can execute any Pentium instruction
• V-line --- only executes simple instructions.
• Each line has 5 stages
i. Pre-fetch
ii. Instruction Decode
iii. Address Generation
iv. Execute, Cache, and ALU Access
v. Write back
Elementary Discussion on Modern Computers
Pentium architectural block diagram
Elementary Discussion on Modern Computers
P6 Microarchitecture
• Feng's classification
• This classification uses the degree of parallelism to classify parallel computer architectures.
• Handler's classification
• Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers. He described a computer at three levels: the Processor Control Unit (PCU), the Arithmetic Logic Unit (ALU), and the Bit-Level Circuit (BLC).
• Among parallel machines, MIMD is most popular, followed by SIMD, and finally MISD.
Parallel architectures fall into two broad classes: data-parallel architectures and function-parallel architectures.
(Figure: UMA shared-memory multiprocessor — processors P1 … Pn share memory modules through a system interconnect such as a bus, crossbar, or multistage network.)
(Figure: NUMA model — each processor Pi has a local memory LMi, and processors reach remote local memories through an interconnection network.)
(Figure: hierarchical cluster and COMA models — processor (P), cache (C), and directory (D) nodes with cluster shared memories (CSM) connected by cluster and global interconnection networks.)
(Figure: message-passing multicomputer — processor–memory (P–M) nodes connected by a message-passing interconnection network.)
Programming Environments
• Programmability depends on the programming environment provided to
the users.
• Conventional computers are used in a sequential programming
environment with tools developed for a uniprocessor computer.
• Parallel computers need parallel tools that allow specification or easy
detection of parallelism and operating systems that can perform parallel
scheduling of concurrent events, shared memory allocation, and shared
peripheral and communication links.
• Implicit Parallelism:
• The programmer writes conventional sequential code; a parallelizing compiler detects the parallelism automatically.
• Explicit Parallelism:
• The programmer writes explicitly parallel code using parallel dialects of common languages.
• The compiler has less need to detect parallelism, but must still preserve existing parallelism and assign target machine resources.
• Performance depends on
• hardware technology
• architectural features
• efficient resource management
• algorithm design
• data structures
• language efficiency
• programmer skill
• compiler technology
• The inverse of the cycle time τ is the clock rate: f = 1/τ, measured in megahertz.
• The size of a program is determined by its instruction count, Ic, the number of machine instructions to be executed by the program.
• MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6), where T is the CPU time and C = Ic × CPI is the total number of clock cycles.
• CPU throughput (programs executed per second): Wp = f / (Ic × CPI)
In a multiprogrammed system, the system throughput is
often less than the CPU throughput.
Example: a program of Ic = 200,000 instructions runs on a 40-MHz processor. Weighting each instruction type's CPI by its frequency in the mix (for instance, branches take 4 cycles and make up 12% of the instructions) gives
  CPI_avg = Σ (Ic_i × CPI_i) / Ic = 2.24
  MIPS rate = f / (CPI_avg × 10^6) = (40 × 10^6) / (2.24 × 10^6) = 17.86
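A quick numeric check of the example in Python (only the resulting CPI of 2.24 is taken from the slide; the instruction-mix table itself is not reproduced):

Ic = 200_000
f = 40e6            # Hz
cpi = 2.24          # weighted average CPI from the instruction mix

T = Ic * cpi / f                    # CPU time in seconds
mips = f / (cpi * 1e6)              # MIPS rate
print(f"T = {T*1e3:.2f} ms, MIPS = {mips:.2f}")   # T = 11.20 ms, MIPS = 17.86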
BCS-29 Advanced Computer Architecture
Parallel Computing and Programming Environments
Issues in Parallel Computing
• Design of parallel computers
• Design of efficient parallel algorithms
• Parallel programming models
• Parallel computer language
• Methods for evaluating parallel algorithms
• Parallel programming tools
• Portable parallel programs
• Explicit Parallelism
S1: A = 1
S2: B = A + 1
S3: C = B + 1
S4: D = A + 1
S5: E = D + B
(Dependence graph: S2 depends on S1, S3 on S2, S4 on S1, and S5 on S2 and S4.)
S1: A = 1
cobegin
  S2: B = A + 1
      post(e)
  S3: C = B + 1
||
  S4: D = A + 1
      wait(e)
  S5: E = D + B
coend
Control Dependence
• The situation in which the order of execution cannot be determined before run time.
• Different paths taken after a conditional branch may introduce or eliminate
data dependence among instructions.
• Dependence may also exist between operations performed in successive
iterations of a looping procedure.
• Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
• Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
• Compiler techniques are needed to get around control dependence limitations.
Dependence Graph
(Figure: an example dependence graph in which operation nodes P1–P5 — multiplies, adds, and a divide — combine the operands A, B, C, D, E, F, G, L, and M.)
Hardware Parallelism & Software Parallelism
Hardware parallelism
• Hardware parallelism is defined by machine architecture and hardware
multiplicity.
• It can be characterized by the number of instructions that can be issued per
machine cycle. If a processor issues k instructions per machine cycle, it is
called a k-issue processor. Conventional processors are one-issue machines.
• Examples. Intel i960CA is a three-issue processor (arithmetic, memory access,
branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point,
memory access, branch).
• A machine with n k-issue processors should be able to handle a maximum of
nk threads simultaneously.
Software Parallelism
• Software parallelism is defined by the control and data dependence of
programs, and is revealed in the program’s flow graph.
• It is a function of algorithm, programming style, and compiler optimization.
Code sequence for A = (P × Q) + (R × S) and B = (P × Q) − (R × S):
  L1: Load P      L2: Load Q      L3: Load R      L4: Load S
  X1: Mul P, Q    X2: Mul R, S    +: Add X1, X2   −: Sub X1, X2
Maximum software parallelism, with no limit on functional units (L = load, X/+/− = arithmetic):
  Cycle 1: L1, L2, L3, L4
  Cycle 2: X1, X2
  Cycle 3: +, −
Eight operations complete in three cycles, an average parallelism of 8/3 ≈ 2.67.
Mismatch between software and hardware parallelism
Example: execution using a single functional unit each for Load, Mul, and Add/Sub
  A = (P × Q) + (R × S)
  B = (P × Q) − (R × S)
Code sequence:
  L1: Load P      L2: Load Q      L3: Load R      L4: Load S
  X1: Mul P, Q    X2: Mul R, S    +: Add X1, X2   −: Sub X1, X2
Schedule:
  Cycle 1: L1
  Cycle 2: L2
  Cycle 3: X1, L3
  Cycle 4: L4
  Cycle 5: X2
  Cycle 6: +   (A produced)
  Cycle 7: −   (B produced)
Mismatch between software and hardware parallelism
Example: execution using two functional units for each of Load, Mul, and Add/Sub (a dual-unit configuration)
  A = (P × Q) + (R × S)
  B = (P × Q) − (R × S)
Code sequence as above, with stores S1, S2 and loads L5, L6 inserted for synchronization (to exchange intermediate results between the two halves):
  Cycle 1: L1, L3
  Cycle 2: L2, L4
  Cycle 3: X1, X2
  Cycle 4: S1, S2   (inserted for synchronization)
  Cycle 5: L5, L6
  Cycle 6: +, −     (A and B produced)
Program Partitioning & Scheduling
• Program Partitioning
• The transformation of sequentially coded program into a
parallel executable form can be done manually by the
programmer using explicit parallelism or by a compiler
detecting implicit parallelism automatically.
• Program partitioning determines whether the given
program can be partitioned or split into pieces that can be
executed in parallel or follow a certain pre-specified order
of execution.
Grain sizes, from coarse to fine:
  • Jobs or programs (coarse grain)
  • Related parts of a program
  • Procedures, subroutines, tasks, or coroutines  } Medium grain
  • Non-recursive loops or unfolded iterations
  • Instructions or statements  } Fine grain
Finer grains expose a higher degree of parallelism, but also bring increasing communication demand and scheduling overhead.
(Figure: linear pipeline models — an asynchronous pipeline uses Ready/Ack handshaking between stages S1 … Sk, while a synchronous pipeline separates the stages with clocked latches. The clock period is τ = τm + d, where τm is the maximum stage delay and d is the latch delay.)
(Space-time diagram: successive tasks stream through the stages of the linear pipeline, one stage per cycle.)
• The frequency f = 1/τ = 1/(τm + d), i.e., the reciprocal of the clock period.
(Space-time diagram of a five-stage instruction pipeline — Fetch (F), Decode (D), Operand fetch (O), Execute (E), Write-back (W) — overlapping instructions I1, I2, I3.)
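A small numeric sketch of these relations; the stage delay, latch delay, and instruction count are invented values, and the k + n − 1 cycle count for a filled k-stage pipeline is the standard result read off a space-time diagram like the one above.

# Clock frequency of a linear pipeline, f = 1 / (tau_m + d), and the cycle count
# for n instructions on a k-stage pipeline versus a non-pipelined unit.
tau_m = 9e-9     # maximum stage delay (s), sample value
d = 1e-9         # latch delay (s), sample value
f = 1.0 / (tau_m + d)
print(f"f = {f/1e6:.0f} MHz")             # 100 MHz

k, n = 5, 3                               # 5 stages (F, D, O, E, W), 3 instructions
pipelined = k + (n - 1)                   # 7 cycles, as in the space-time diagram
non_pipelined = k * n                     # 15 cycles
print(pipelined, non_pipelined, non_pipelined / pipelined)   # 7 15 ~2.14x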
Dependencies
• Data Dependency (the operand is not ready yet)
• Instruction Dependency (branching)
Will that cause a problem?
Data Dependency
  I1: Add R1, R2, R3
  I2: Sub R4, R1, R5   (I2 needs R1 before I1 has written it back)
Solutions:
  • Stall I2 until R1 is available
  • Forwarding (bypass I1's result directly to I2)
  • Write and read the register file in one cycle
  • …
(Space-time diagram over cycles 1–6: I2's operand-fetch stage would overlap I1's execute and write-back stages, exposing the hazard.)
Instruction Dependency
  I1: Branch
  I2: (the next instruction is not known until the branch resolves)
Solutions:
  • Stall until the branch outcome is known
  • Predict branch taken
  • Predict branch not taken
  • …
(Space-time diagram over cycles 1–6: I2's fetch must either wait for, or speculate past, I1's branch resolution.)
Non-Linear Pipelines
• Non-linear pipelines are dynamic pipelines: they can be reconfigured to perform variable functions at different times.
• A non-linear pipeline allows feed-forward and feedback connections in addition to the streamline connections.
(Figure: a three-stage non-linear pipeline — stages S1, S2, S3 with feed-forward and feedback connections producing two outputs, X and Y.)
Reservation table for function X (8 cycles; consistent with the collision vector 1011010 used below):
        1  2  3  4  5  6  7  8
  S1    X              X     X
  S2       X     X
  S3          X     X     X
(Overlaid versions of this table on the following slides show a second initiation X2 issued at various latencies after X1.)
Forbidden Latencies
• X after X: latencies such as 4 and 7 cause the second initiation X2 to collide with X1 in some stage (overlaid reservation tables). Together with latencies 2 and 5, these form the forbidden latency set {2, 4, 5, 7} for function X.
Permissible Latencies
• X after X: latencies such as 1 let X2 proceed with no collision (overlaid reservation tables). The permissible latencies for function X are 1, 3, 6, and any latency of 8 or more.
• A 21-cycle timeline shows a stream of initiations X1, X2, X3, … issued at permissible latencies, each occupying the stages according to the reservation table.
State diagram for function X:
• The initial collision vector is 1011010; a new initiation at latency p is granted only if bit p of the current state vector is 0.
• Shifting the current vector right by the chosen latency and ORing it with the initial collision vector gives the next state. Here latency 1 leads to state 1111111, latencies 3 and 6 lead to state 1011011, and a latency of 8 or more returns to the initial state 1011010. The minimum-latency edges are marked 1* (from the initial state) and 3* (from state 1011011); state 1111111 permits only 8+.
• Cycles such as (1, 8), (1, 8, 6, 8), (1, 8, 3, 8), (3), (6), (3, 8), (3, 6, 3), and many more legitimate cycles can be traced from the diagram.
Latency Analysis
• Latency Sequence:
• A sequence of permissible latencies between successive initiations
• Latency Cycle:
• A latency sequence that repeats the same subsequence (cycle)
indefinitely
• Simple cycles:
• A simple cycle is a latency cycle in which each state appears only
once.
(3), (6), (8), (1, 8), (3, 8), and (6,8)
• Greedy Cycles:
• Simple cycles whose edges are all made with minimum latencies from
their respective starting states.
• Greedy cycles must first be simple, and their average latencies must
be lower than those of other simple cycles.
The greedy cycles here are (1, 8) and (3); one of them yields the MAL (minimum average latency).
Minimum Average Latency (MAL)
• The minimum-latency edges in the state diagrams are
marked with asterisks.
• At least one of the greedy cycles will lead to the MAL.
(Figure: the three-stage pipeline delivering its output from S3.)
The greedy cycle (3) has average latency 3, lower than the 4.5 of (1, 8), so MAL = 3.
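The sketch below derives the forbidden latencies, the collision vector, and the cycle average latencies from a reservation table; the table used is the three-stage table assumed earlier (reconstructed to match the collision vector 1011010 on the slide).

table = {                       # stage -> cycles (1-based) in which it is used
    "S1": [1, 6, 8],
    "S2": [2, 4],
    "S3": [3, 5, 7],
}

forbidden = set()
for cycles in table.values():
    for i in cycles:
        for j in cycles:
            if j > i:
                forbidden.add(j - i)     # distance between two marks in one row
print(sorted(forbidden))                 # [2, 4, 5, 7]

n = max(forbidden)
cv = "".join("1" if lat in forbidden else "0" for lat in range(n, 0, -1))
print(cv)                                # 1011010 (bit i set => latency i forbidden)

for cycle in [(1, 8), (3,), (6,), (8,)]:
    print(cycle, sum(cycle) / len(cycle))   # (3,) gives the MAL of 3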
Reservation Tables
• First example: a function with a 5-column reservation table in which each of the three stages S1–S3 is used twice; its collision vector is 1011 (forbidden latencies 1, 2, and 4), and the state-diagram edges are labeled 3* and 5+.
• Second example: a function with a 7-column reservation table (each stage used twice) whose forbidden latencies are 2 and 6, giving the collision vector C.V. = 1 0 0 0 1 0.
• Statements used in the following example: X = Y + Z and A = B × C.
Mechanisms for Instruction Pipelining
Internal Data Forwarding:
The throughput of a pipelined processor can be further improved with internal data forwarding among multiple functional units. In some cases, memory-access operations can be replaced by register-transfer operations.
Store M, R1          Store M, R1
Load R2, M     ==>   Move R2, R1
Store-Load forwarding
Mechanisms for Instruction Pipelining
Internal Data Forwarding
Load R1, M           Load R1, M
Load R2, M     ==>   Move R2, R1
Load-Load forwarding
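A toy Python peephole pass illustrating these two forwarding rewrites; the tuple instruction encoding is hypothetical and only meant to show the transformation.

def forward(code):
    # Each instruction is (op, dst, src): store M,R1 -> ("store","M","R1"),
    # load R2,M -> ("load","R2","M"), move R2,R1 -> ("move","R2","R1").
    out = []
    for op, dst, src in code:
        prev = out[-1] if out else None
        if op == "load" and prev:
            p_op, p_a, p_b = prev
            if p_op == "store" and p_a == src:      # Store M,R1 ; Load R2,M
                out.append(("move", dst, p_b))      #   -> Move R2,R1
                continue
            if p_op == "load" and p_b == src:       # Load R1,M ; Load R2,M
                out.append(("move", dst, p_a))      #   -> Move R2,R1
                continue
        out.append((op, dst, src))
    return out

print(forward([("store", "M", "R1"), ("load", "R2", "M")]))
# [('store', 'M', 'R1'), ('move', 'R2', 'R1')]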
(Figure: an instruction pipeline with predicated execution — the decode stage produces a true/false predicate that controls whether execute, memory access, and write-back update the register file.)
(Figure: a multifunction arithmetic pipeline with stages labeled A–H, including exponent subtract, partial shift/align, mantissa add, find leading 1, partial shift/normalize, round, and re-normalize for floating-point addition, plus a partial-products path for multiplication.)
(Reservation table over 9 cycles showing, for each stage A–H, the cycles in which it is used.)
• Superscalar:
  – Issue parallelism: IP = n instructions / cycle
  – Operation latency: OP = 1 cycle
  – Peak IPC = n instructions / cycle (an n-fold speedup over a scalar pipeline)
(Figure: six instructions flowing through the IF, DE, EX, WB stages of an n-issue superscalar pipeline over cycles 1–6.)
Other ILP Architectures
• VLIW: Very Long Instruction Word
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle = 1 VLIW / cycle
(Figure: VLIW pipeline — IF and DE stages feed several parallel EX units, followed by WB.)
(Figure: memory hierarchy — the CPU accesses cache memory and main memory directly; an I/O processor connects auxiliary memory such as magnetic disks and magnetic tapes to main memory.)
• External memory
• Backing store
Memory Type                      Category             Erasure                     Write Mechanism   Volatility
Random-access memory (RAM)       Read-write memory    Electrically, byte-level    Electrically      Volatile
Read-only memory (ROM)           Read-only memory     Not possible                Masks             Nonvolatile
Programmable ROM (PROM)          Read-only memory     Not possible                Electrically      Nonvolatile
Erasable PROM (EPROM)            Read-mostly memory   UV light, chip-level        Electrically      Nonvolatile
Electrically Erasable PROM
(EEPROM)                         Read-mostly memory   Electrically, byte-level    Electrically      Nonvolatile
Flash memory                     Read-mostly memory   Electrically, block-level   Electrically      Nonvolatile
• Optical
• CD-ROM
• CD-Recordable (CD-R)
• CD-R/W
• DVD
• Magnetic Tape
• Read
• Address line selected
• transistor turns on
• Charge from capacitor fed via bit line to sense amplifier
• Compares with reference value to determine 0 or 1
• Capacitor charge must be restored
DRAM Refreshing
• Refresh circuit included on chip
• Disable memory array chip
• Count through rows and select each in turn
• Read contents & write it back (restore)
• Takes time
• Dynamic cell
• Simpler to build, smaller
• More dense
• Less expensive
• Needs refresh
• Larger memory units
• Static
• Faster
• Cache
Read Only Memory (ROM)
• Provides permanent storage (nonvolatile)
• Types
• Written during manufacture (very expensive for small runs)
• Programmable (once) PROM (needs special equipment to program)
• Read “mostly”
• Erasable Programmable (EPROM) - Erased by UV
• Electrically Erasable (EEPROM) - Takes much longer to write than read
• Flash memory - Erase whole memory electrically
(Figure: internal linear organization — SIMM memory modules in slots 1–4 connected to the processor.)
Memory interfacing example: four 2K × 8 RAM chips mapped from address A000h upward.
• Number of address lines on each 2K × 8 memory chip = 11, since 2K = 2^1 × 2^10 = 2^11 (A0 … A10).
• A 2-to-4 decoder is needed to select among the chips: A11 and A12 drive the decoder inputs, while the high-order bits A13–A15 enable it.
• Memory map: 0000h–9FFFh is not used; RAM1 occupies A000h–A7FFh, and the remaining decoder outputs select the chips that follow it.
(Figure: the chips share the data lines D0–D7 and the RD/WR control signals; each chip's CS input comes from one decoder output Y0–Y3.)
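A sketch of the chip-select decode implied by this map; the exact enable gating on A13–A15 is an assumption read off the memory map, and the function name is invented.

def decode(addr):
    # A15 A14 A13 must be 1 0 1 (addresses A000h-BFFFh) to enable the decoder.
    if (addr >> 13) != 0b101:
        return None                     # outside the RAM bank (e.g. 0000h-9FFFh)
    chip = (addr >> 11) & 0b11          # decoder inputs: B = A12, A = A11
    offset = addr & 0x7FF               # A10 ... A0 within the 2K chip
    return chip, offset

print(decode(0xA000))   # (0, 0)      -> RAM1, first location
print(decode(0xA7FF))   # (0, 2047)   -> RAM1, last location
print(decode(0xB800))   # (3, 0)      -> fourth chip
print(decode(0x9FFF))   # None        -> not used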
• Beyond that:
• Heat-assisted magnetic recording (HAMR)
(Figure: disk drive geometry — arm assembly, arm, head, spindle, platter, track, and cylinder.)
NTFS (New Technology File System)
1. Used by Windows NT, XP, 2000, Server 2003, Server 2008, and Windows Vista.
2. NTFS provides better performance, security, compatibility, and extendibility than FAT.
3. Read, search, write, and recovery operations are fast.
4. The Master File Table (MFT) contains information about all files and folders; it is the first file on an NTFS volume.
5. The Partition Boot Sector occupies sectors 0 to 16 and is the first information on an NTFS volume.
RAID Level-0 (striping)
  Disk 0 sectors hold block 0, block 2, block 4, …
  Disk 1 sectors hold block 1, block 3, …
  Consecutive blocks are striped across the disks; there is no redundancy.
RAID Level-1 (mirroring)
  Disk 0 sectors hold block 0, block 1, block 2, block 3, block 4, …
  Disk 1 sectors hold block 0, block 1, block 2, block 3, block 4, …
  Every block is duplicated on both disks.
RAID Level-2
• Stripes data across disks similar to Level-0
• difference is data is bit interleaved instead of block
interleaved
• Uses ECC to monitor correctness of information on
disk
• Multiple disks record the ECC information to
determine which disk is in fault
• A parity disk is then used to reconstruct corrupted
or lost data
RAID Level-2
(Figure: disk layout — data disk, data disk, ECC disk, ECC disk, parity disk.)
RAID Level-2
• Reconstructing data
• assume data striped across eight disks
• correct data: 10011010
• parity: 0
• data read: 10011110
• if we can determine that disk 2 is in error
• just use read data and parity to know which bit to flip
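A small Python illustration of this parity-based fix, using the bit patterns from the slide; in a real array the failing bit position comes from the ECC disks (or the disk's own error report), so finding it by comparison below is only for the demo.

correct = "10011010"
read = "10011110"
parity = sum(int(b) for b in correct) % 2          # 0, as on the slide

# The ECC / failing disk would normally identify this position; the demo just
# compares against the known-correct data to locate it.
bad = next(i for i, (c, r) in enumerate(zip(correct, read)) if c != r)
assert sum(int(b) for b in read) % 2 != parity      # parity mismatch flags an error

fixed = list(read)
fixed[bad] = "1" if fixed[bad] == "0" else "0"      # flip the bit on the bad disk
print("".join(fixed) == correct)                    # True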
RAID Level-2
• Requires fewer disks than Level-1 to provide
redundancy
• Still needs quite a few more disks
• for 10 data disks need 4 check disks plus parity disk
• Big problem is performance
• must read data plus ECC code from other disks
• for a write, have to modify data, ECC, and parity disks
• Another big problem is only one read at a time
• while a read of a single block can be done in parallel
• multiple blocks from multiple files can’t be read
because of the bit-interleaved placement of data
RAID Level-3
• One big problem with Level-2 are the disks needed to
detect which disk had an error
• Modern disks can already determine if there is an
error
• using ECC codes with each sector
• So just need to include a parity disk
• if a sector is bad, the disk itself tells us, and use the parity
disk to correct it
RAID Level-4
• The big problem with Level-2 and Level-3 is the bit interleaving
• to access a single file block of data, must access all the
disks
• allows good parallelism for a single access but doesn’t
allow multiple I/O’s
• Level-4 interleaves file blocks
• allows multiple small I/O’s to be done at once
RAID Level-4
• Still use a single disk for parity
• Now the parity is calculated over data from multiple
blocks
• Level-2,3 calculate it over a single block
• If an error detected, need to read other blocks on
other disks to reconstruct data
Level-4 vs. Level-2,3
Four transfer units a, b, c, d are each divided into four pieces (a0–a3, b0–b3, …) spread over 4 different disks:
  Level-3 (interleaved):       disk 0: a0 b0 c0 d0   disk 1: a1 b1 c1 d1   disk 2: a2 b2 c2 d2   disk 3: a3 b3 c3 d3
  Level-3 parity disk:         one parity piece per transfer unit (a, b, c, d)
  Level-4 (block interleaved): disk 0: a0 a1 a2 a3   disk 1: b0 b1 b2 b3   disk 2: c0 c1 c2 c3   disk 3: d0 d1 d2 d3
  Level-4 parity disk:         parity blocks 0–3, each computed across the corresponding pieces of a, b, c, d
RAID Level-4
• Reads are simple to understand
• want to read block A, read it from disk 0
• if there is an error, read in blocks B,C, D, and parity
block and calculate correct data
• What about writes?
• it looks like a write still requires access to 4 data disks
to recalculate the parity data
• not true, can use the following formula
• new parity = (old data xor new data) xor old parity
• a write requires 2 reads and 2 writes
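A short Python sketch of this small-write parity update; the block contents are invented byte strings, and xor() is a helper defined for the example.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

blocks = [b"\x0a\x0b", b"\x10\x20", b"\x01\x02"]      # data blocks on 3 data disks
parity = xor(xor(blocks[0], blocks[1]), blocks[2])    # parity disk contents

# new parity = (old data XOR new data) XOR old parity  -> 2 reads + 2 writes
old_data, new_data = blocks[1], b"\x7f\x7f"
new_parity = xor(xor(old_data, new_data), parity)
blocks[1] = new_data

# Same result as recomputing parity over all the data blocks:
assert new_parity == xor(xor(blocks[0], blocks[1]), blocks[2])
print(new_parity.hex())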
RAID Level-4
• Doing multiple small reads is now faster than before
• However, writes are still very slow
• this is because of calculating and writing the parity blocks
• Also, only one write is allowed at a time
• all writes must access the check disk so other writes have
to wait
RAID Level-5
• Level-5 stripes file data and check data over all the
disks
• no longer a single check disk
• no more write bottleneck
• Drastically improves the performance of multiple
writes
• they can now be done in parallel
• Slightly improves reads
• one more disk to use for reading
RAID Level-5
(Figure: Level-4 vs. Level-5 layout — in Level-4, disks 1–4 hold the data of stripes S0–S5 and disk 5 is a dedicated check (parity) disk; in Level-5, the data and check blocks of stripes S0–S5 are rotated across all five disks, so no single disk is the check disk.)
(Figure: shared-memory multiprocessor — processors P1 … Pn connected to shared memory through a system interconnect such as a bus, crossbar, or multistage network.)
• Consider two loops. The first loop adds corresponding elements of two N-
element vectors to yield a third vector. The second loop sums elements of
the third vector. Assume each add/assign operation takes 1 cycle, and
ignore time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.). Assume interprocessor
communication requires k cycles.
• On a sequential system, each loop will require N cycles, for a total of 2N
cycles of processor time.
• On an M-processor system, we can partition each loop into M parts, each
having L = N / M add/assigns requiring L cycles. The total time required is
thus 2L. This leaves us with M partial sums that must be totaled.
• Computing the final sum from the M partial sums requires l = log2(M)
additions, each requiring k cycles (to access a non-local term) and 1 cycle
(for the add/assign), for a total of l (k+1) cycles.
• The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles.
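The same cycle counts in a few lines of Python; the values of N, M, and k are arbitrary sample inputs.

from math import log2

def parallel_cycles(N, M, k):
    # 2N/M cycles for the partitioned loops + (k+1)*log2(M) for the final sum
    return 2 * N / M + (k + 1) * log2(M)

N, k = 1_000_000, 10
for M in (1, 4, 16, 64):
    seq, par = 2 * N, parallel_cycles(N, M, k)
    print(M, round(par), round(seq / par, 1))   # processors, cycles, speedup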
BCS-29
Advanced Computer Architecture
Interconnection Topologies and
Synchronization
Connecting Processors and Memories
• Buses
• Interconnection Networks
– Static Networks
– Dynamic Networks
(Figures: shared-memory organizations — processors P connected to memory modules M through an interconnection network, and a hierarchical variant in which clusters with local memories are joined through a global interconnection network to further memory modules.)
• Synchronous Communication
• Transmitter and receiver have their clocks synchronized
• Data rates are dependent on clock rates
• Continuously transmitting characters to remain in sync.
• The above figure indicates a basic arrangement when the CPU contains
the bus arbiter circuitry.
• In this case, the processor is normally the bus master, unless it grants the
ownership to any of the devices.
• A DMA controller indicates that it needs to become a bus master by activating the bus request line BR (active low). The signal on the BR line is a logical OR of the requests from all the devices connected to it. When the BR line is activated, the processor grants bus mastership to the requesting device.
Switched Network
• Switched paths
• Higher cost
• Higher throughput
• Scalability better
(Figures: static network topologies — linear array, ring, star, tree, 2D mesh, torus, and fully connected network; a fully connected network is non-blocking.)
• Suppose two processors each use (read) a data item X from a shared memory.
Then each processor’s cache will have a copy of X that is consistent with the
shared memory copy.
• Now suppose one processor modifies X (to X’). Now that processor’s cache is
inconsistent with the other processor’s cache and the shared memory.
• With a write-through cache, the shared memory copy will be made consistent, but
the other processor still has an inconsistent value (X).
• With a write-back cache, the shared memory copy will be updated eventually,
when the block containing X (actually X’) is replaced or invalidated.
(Figure: I/O module structure — address and data lines connect to the module's input/output logic, which drives external-device interface logic exchanging data, status, and control signals with the device.)
I/O Module Decisions
• Hide or reveal device properties to CPU
• Support multiple or single device
• Control device functions or leave for CPU
• Also O/S decisions
• e.g. Unix treats everything it can as a file
Input Output Techniques
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)
Programmed I/O
• CPU has direct control over I/O
• Sensing status
• Read/write commands
• Transferring data
• CPU waits for I/O module to complete operation
• Wastes CPU time
Programmed I/O - detail
• CPU requests I/O operation
• I/O module performs operation
• I/O module sets status bits
• CPU checks status bits periodically
• I/O module does not inform CPU directly
• I/O module does not interrupt CPU
• CPU may wait or come back later
Program-Controlled I/O
• Discuss I/O issues using keyboard & display
• Read keyboard characters, store in memory, and
display on screen
• Implement this task with a program that performs all
of the relevant functions
• This approach called program-controlled I/O
• How can we ensure correct timing of actions
and synchronized transfers between devices?
Signaling Protocol for I/O Devices
• Assume that the I/O devices have a way to send a
‘READY’ signal to the processor
• For keyboard, it indicates that the character is ready
to be read.
• For display, it indicates the display is ready to receive
the character.
• The ‘READY’ signal in each case is a status flag
in status register that is polled by processor
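A simulated Python sketch of this polling loop; the Device class and its randomly-set ready flag are stand-ins for real status registers.

import random

class Device:
    def __init__(self):
        self.data = None
    @property
    def ready(self):                    # status flag polled by the processor
        return random.random() < 0.3    # simulate the device becoming ready

keyboard, display = Device(), Device()
keyboard.data = "A"

while not keyboard.ready:               # busy-wait: wasted CPU time
    pass
char = keyboard.data                    # read the character

while not display.ready:
    pass
display.data = char                     # write it to the display
print("echoed:", display.data)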
I/O Commands
• CPU issues address
• Identifies module (& device if >1 per module)
• CPU issues command
• Control - telling module what to do
• e.g. spin up disk
• Test - check status
• e.g. power? Error?
• Read/Write
• Module transfers data via buffer from/to device
Addressing I/O Devices
(Figure: 8086 interrupt structure — a device requiring non-maskable service drives the CPU's NMI input directly, while other devices request interrupts through the Intel 8259A Programmable Interrupt Controller (PIC), part of the chipset, which drives the CPU's INTR input and internal interrupt logic.)
Internal interrupt sources: Divide Error, Single Step, the INT n / INTO instructions, and software traps.
Interrupt Vector Table – IVT (in memory)
• Each of the 256 interrupt types has a 4-byte entry at address 0000:(Type × 4): the handler Offset (IP LSB, then IP MSB) followed by its Segment (CS LSB, then CS MSB).
• Interrupt 1 occupies 0000:0004 – 0000:0007; interrupt 255 occupies 0000:03FC – 0000:03FF.
• Example: int 36h → entry offset = 36h × 4 = 54 × 4 = 216 = 00D8h.
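A Python sketch of the table lookup; the handler segment and offset written below are made-up values, and only the Type × 4 addressing and the IP-then-CS layout follow the slide.

import struct

ivt = bytearray(1024)                            # 256 entries * 4 bytes
def set_vector(n, cs, ip):
    struct.pack_into("<HH", ivt, n * 4, ip, cs)  # IP first, then CS, little-endian

def get_vector(n):
    ip, cs = struct.unpack_from("<HH", ivt, n * 4)
    return cs, ip

set_vector(0x36, 0xF000, 0x1234)                 # hypothetical handler address
print(hex(0x36 * 4))                             # 0xd8 -- entry offset, as on the slide
print([hex(x) for x in get_vector(0x36)])        # ['0xf000', '0x1234']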
What Happens During an Interrupt?
• The CPU first completes the current instruction.
• It checks for an internal interrupt; if none is pending, it samples INTR, which is recognized only when IF = 1.
• To service an interrupt it pushes the flags, saves TEMP = TF and clears TF (and IF), pushes CS and IP, and loads CS:IP from the interrupt vector table.
• The handler ends with iret, which pops IP, CS, and the flags (restoring TF = TEMP and IF); a plain ret does not restore the flags, so interrupt handlers must use iret.
• call pushes CS and IP and loads CS:IP with the address of the subroutine.
• This is why ALL programs MUST have a stack segment, so that interrupts can be handled.
In minimum mode:
The HOLD and HLDA pins are used for handshaking
DMA Controller
• A DMA controller interfaces with several peripherals that may request
DMA.
• The controller decides the priority of simultaneous DMA requests and
communicates with the peripheral and the CPU, and provides memory
addresses for data transfer.
• The DMA controller commonly used with the 8088 is the 8237 programmable device.
• The 8237 is in fact a special-purpose processor.
• Normally it appears as part of the system controller chip-set.
• The 8237 is a 4-channel device. Each channel is dedicated to a specific peripheral device and is capable of addressing a 64-Kbyte section of memory.
DMA on the 8088 CPU
• The I/O device asserts the appropriate DRQ signal for the channel.
• The DMA controller enables the appropriate channel and asks the CPU to release the bus so that the DMA may use it. The DMA requests the bus by asserting the HOLD signal, which goes to the CPU.
• The CPU detects the HOLD signal and completes the current instruction. All of the signals normally generated by the CPU are then placed in a tri-stated condition (neither high nor low), and the CPU asserts the HLDA signal, which tells the DMA controller that it is now in charge of the bus.
• The CPU may have to wait (hold cycles).
• The DMA activates its -MEMR, -MEMW, -IOR, -IOW output signals, and the address outputs from the DMA are set to the target address, which will be used to direct the byte that is about to be transferred to a specific memory location.
• The DMA will then let the device that requested the transfer know that it is commencing by asserting the -DACK signal.
DMA on the 8088 CPU
• The peripheral places the byte to be transferred on the bus Data lines.
• Once the data has been transferred, The DMA will de-assert the -DACK2
signal, so that the FDC knows it must stop placing data on the bus.
• The DMA will now check to see if any of the other DMA channels have any
work to do. If none of the channels have their DRQ lines asserted, the
DMA controller has completed its work and will now tri-state the -MEMR,
-MEMW, -IOR, -IOW and address signals.
• Finally, the DMA de-asserts the HOLD signal. The CPU sees this and de-asserts the HLDA signal. The CPU then resumes control of the buses and address lines, and it resumes executing instructions and accessing main memory and the peripherals.
(Figure: cache lookup — the tag (and, where applicable, the line address) identifies which main-memory block currently occupies a cache line.)
Cache Basics
• A 64-KByte cache holds up to 64 KBytes of main-memory contents.
• 16K (2^14) lines; each line is 5 bytes wide = 40 bits:
  • tag field: 1 byte
  • data field: 4 bytes (one 4-byte block of main memory)
• 16-MByte main memory, so a 24-bit address (2^24 = 16 M).
• We will consider DIRECT and ASSOCIATIVE mappings.
Direct mapping address structure (24-bit address): Tag (8 bits) | Line (14 bits) | Word (2 bits)
(Direct mapping: cache line i holds main-memory blocks i, i+m, i+2m, …; line m−1 holds blocks m−1, 2m−1, 3m−1, …, 2^s−1.)
Associative Mapping (Address Structure): Tag (22 bits) | Word (2 bits)
Set-Associative Mapping (Address Structure): Tag (9 bits) | Set (13 bits) | Word (2 bits)
E.g. given our 64-KByte cache with a line size of 4 bytes, we have 16,384 lines. Say we decide to create 8,192 sets, each containing 2 lines. Then we need 13 bits to identify a set (2^13 = 8192).
Use set field to determine cache set to look in
Compare tag field of all slots in the set to see if we have a hit, e.g.:
Address = 16339C = 0001 0110 0011 0011 1001 1100
Tag = 0 0010 1100 = 02C
Set = 0 1100 1110 0111 = 0CE7
Word = 00 = 0
Address = 008004 = 0000 0000 1000 0000 0000 0100
Tag = 0 0000 0001 = 001
Set = 0 0000 0000 0001 = 0001
Word = 00 = 0
(Figure: 2-way set-associative organization — each set contains two lines, and each main-memory block maps to exactly one set.)
K-Way Set-Associative Cache Organization
A combination of direct and associative mapping: the set field selects one set directly, and the tag is then compared associatively against every line in that set.
Breaking into Tag, Set, Word
• Given Tag=9 bits, Set=13 bits, Word=2 bits
• Given address FFFFFD16
• What are values of Tag, Set, Word?
• First 9 bits are Tag, next 13 are Set, next 2 are Word
• Rewrite address in base 2: 1111 1111 1111 1111 1111
1101
• Group each field in groups of 4 bits starting at right
• Add zero bits as necessary to leftmost group of bits
• 0001 1111 1111 0001 1111 1111 1111 0001
• → 1FF 1FFF 1 (Tag, Set, Word)
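A small Python helper that performs this field extraction for both splits; the function name is arbitrary.

def fields(addr, widths):
    # Split an address into fields of the given bit widths, left to right.
    out, shift = [], 0
    for w in reversed(widths):                  # peel fields off from the right
        out.append((addr >> shift) & ((1 << w) - 1))
        shift += w
    return list(reversed(out))

tag, s, word = fields(0xFFFFFD, (9, 13, 2))     # set-associative split
print(hex(tag), hex(s), hex(word))              # 0x1ff 0x1fff 0x1   (as above)

tag, s, word = fields(0x16339C, (9, 13, 2))
print(hex(tag), hex(s), hex(word))              # 0x2c 0xce7 0x0

tag, line, word = fields(0x16339C, (8, 14, 2))  # direct-mapped split
print(hex(tag), hex(line), hex(word))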
• Direct mapping → no replacement choice: each block maps to exactly one line, so that line is replaced.