0% found this document useful (0 votes)
6 views

5.Advanced-1

The document outlines advanced topics in computer architecture, focusing on various forms of parallelism including Instruction Level Parallelism (ILP) and Thread Level Parallelism. It discusses techniques such as out-of-order execution, multiple issue pipelines, loop unrolling, and speculation, along with their implications for processor design and performance. Key concepts include dynamic scheduling, hazards, and the use of reorder buffers for managing instruction execution and completion.

Uploaded by

findingaish00
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

5.Advanced-1

The document outlines advanced topics in computer architecture, focusing on various forms of parallelism including Instruction Level Parallelism (ILP) and Thread Level Parallelism. It discusses techniques such as out-of-order execution, multiple issue pipelines, loop unrolling, and speculation, along with their implications for processor design and performance. Key concepts include dynamic scheduling, hazards, and the use of reorder buffers for managing instruction execution and completion.

Uploaded by

findingaish00
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 60

Advanced Topics

Outline
Data Level Parallelism

ILP Instruction Level Parallelism
Thread Level Parallelism

Out-of-order execution

Multiple Issue, Superscalar Processors

Loop Unrolling

Speculation

Scoreboard

Pipeline of a modern processor
ILP

Instruction Level Parallelism
ILP

Instruction Level Parallelism
– Can multiple instructions be executed in parallel
(without the programmer intending)?

ILP

Instruction Level Parallelism
– Increasing the depth of the pipeline
ILP

Instruction Level Parallelism
– Increasing the depth of the pipeline
– Multiple Issue - Replicate components to launch
multiple instructions
Multiple Issue Pipeline
L/S Unit
L/S Unit
Branch Prediction EA
EA MM
Exception handling

INT ALU
INT ALU
EX
EX

Multiple Fetch,
Multiple Decode FP/INT Multiply
FP/INT Multiply

IF
IF ID
ID WB
M1
M1 M2
M2 M3
M3 M4
M4 M5
M5 M6
M6 M7
M7 WB
PC --> 100
PC <-- PC + 16
A1
A1 A2
A2 A3
A3 A4
A4
100 add FP Adder
FP Adder
104 sub
108 xor
10C ld
110 sd DIV
....
FP/INT Divider
FP/INT Divider
Multiple Issue Pipelines

3 way – 6 way
Multiple Issue Pipelines

3 way – 6 way

IPC > 1 (CPI < 1)
Multiple Issue Pipelines

3 way – 6 way

IPC > 1 (CPI < 1)

Static issue vs. Dynamic issue
Compiler inside the Processor
Multiple Issue Pipelines

3 way – 6 way

IPC > 1 (CPI < 1)

Static issue vs. Dynamic issue

Multiple Issue requires:
– Packing instructions into Issue Slots
– Dealing with data and control hazards
OoO Motivating Code Sequence
0 --> 2 --> 3
Out-of-Order
1 --> 4 --> (5, 6)

00 mul
mul x1,
x1, x2,
x2, x3
x3 F D M1 M2 M3 M4 W
11 addi
addi x11,x10,
x11,x10, 11 FDX W
22 mul
mul x5,
x5, x1,
x1, x4
x4 F D D D D M1 ...
33 mul
mul x7,
x7, x5,
x5, x6
x6 F D D D D M1 ...
44 addi
addi x12,x11,
x12,x11, 11 F D X W
55 addi
addi x13,x12,
x13,x12, 11 F D D X W
66 addi
addi x14,x12,
x14,x12, 22 F D D X W
OoO Motivating Code Sequence

00 mul
mul x1,
x1, x2,
x2, x3
x3
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling

Out-of-Order Execution
– Choose next instruction to be executed based on
ready inputs and available functional units
– (Check for structural and data hazards)
– Begin executing as soon as operands are available
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion

00 mul
mul x1,
x1, x2,
x2, x3
x3 F D M1 M2 M3 M4 M5 M6 M7 W
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion
should not commit out-of-order

00 mul
mul x1,
x1, x2, x3 F D M1 M2 M3 M4 M5 M6 M7 W
x2, x3
11 addi
addi x11,x10,
x11,x10, 11 F D X W
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion

WAR and WAW hazards

00 mul
mul x1,
x1, x2,
x2, x3
x3
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x5,x11,
x5,x11, 11
55 addi
addi x12,x12,
x12,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion

WAR and WAW hazards

Imprecise exceptions
Dynamic Scheduling

Out-of-Order Execution

Implies out-of-order completion

WAR and WAW hazards

Imprecise exceptions
00 div
div x1,
x1, x2, x3 F D M1 M2 M3 M4 M5 M6 M7 W
x2, x3
11 addi
addi x11,x10,
x11,x10, 11 F D X W
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
Speculation
Speculation

Execute instructions along predicted execution paths but
only commit the results if prediction was correct
Speculation

Execute instructions along predicted execution paths but
only commit the results if prediction was correct
– Instructions after a predicted branch
– Executing a Load earlier than a Store
Speculation

Execute instructions along predicted execution paths but
only commit the results if prediction was correct

Instruction commit: allowing an instruction to update
the register file when instruction is no longer speculative
Hardware based Speculation

Execute instructions along predicted execution paths but
only commit the results if prediction was correct

Instruction commit: allowing an instruction to update the
register file when instruction is no longer speculative

Need an additional piece of hardware to prevent any
irrevocable action until an instruction commits
Hardware based Speculation

Execute instructions along predicted execution paths but
only commit the results if prediction was correct

Instruction commit: allowing an instruction to update the
register file when instruction is no longer speculative

Need an additional piece of hardware to prevent any
irrevocable action until an instruction commits
– Reorder Buffer
Reorder Buffer (RoB)

Reorder Buffer
– In-order commit
– Stores instruction results before instruction commits
– Clear ROB on misprediction
– Exceptions
Reorder Buffer (RoB)

Reorder Buffer
– In-order commit
– Stores instruction results before instruction commits
– Clear ROB on misprediction
– Exceptions 00 div
div
11 addi
x1,
x1, x2,
x2, x3
x11,x10,
x3
addi x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7, x5, x6
x7, x5, x6

Instruction Destination Value


filled at issue --> div x1
addi x11 10
mul x5
mul x7
Compiler based Speculation

Is done both by the Compiler and the processor

Compiler inserts instructions to check if
speculation failed. Fix-up routine.
Static Multiple Issue
Multiple Issue
L/S Unit
L/S Unit
EA
EA MM

INT ALU
INT ALU
EX
EX

FP/INT Multiply
FP/INT Multiply

IF
IF ID
ID WB
M1
M1 M2
M2 M3
M3 M4
M4 M5
M5 M6
M6 M7
M7 WB

A1
A1 A2
A2 A3
A3 A4
A4
FP Adder
FP Adder

DIV

FP/INT Divider
FP/INT Divider
Static Multiple Issue Pipeline

INT ALU

L/S Unit
Static Multiple Issue Pipeline

ALU/BEQ
ALU/BEQ

MEM
MEM
Static Multiple Issue
Static Multiple Issue

Issue Packet
Static Multiple Issue

Issue Packet

Compiler technique
– VLIW Very Large Instruction Word
Static Multiple Issue – Example
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop

Issue Slot 1 Issue Slot 2 Clock cycle


ALU/BEQ Data Transfer Instruction
1
2
3
4
5
Static Multiple Issue – Example
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop

Issue Slot 1 Issue Slot 2 Clock cycle


ALU/BEQ Data Transfer Instruction
ld x31, 0(20) 1
addi x20, x20, -8 2
add x31, x31, x21 3
blt x22, x20, Loop sd x31, 0(20) 4
5
Static Multiple Issue

Increased use latency
Static Multiple Issue

Use latency: Clock cycles between a generator
instruction and its consumer that can use the
result of without stalling the pipeline

Increased use latency
– Load and successive dependent instruction
– Add and paired dependent instruction
Static Multiple Issue

Increased use latency
– Load and successive dependent instruction
– Add and paired dependent instruction

Decrease stalls/NOPs?
Loop Unrolling
Loop Unrolling
Original
OriginalLoop
Loop

Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8 LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8 LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
Unroll
Unroll22times
times OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
Unroll
Unroll22times
times OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
Dependences?
Dependences? OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
name
nameand
andanti-
anti- OO
dependences
dependences PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F6, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F8, F2, F6 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F8, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD

LL
OO
Register
Register OO
Renaming
Renaming PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F6, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F8, F2, F6 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F8, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop
L.D F10, -16(R1) DD

ADD.D F12, F2, F10 LL


OO
Loop
Loophas
hasbeen
been S.D F12, -16(R1) OO
unrolled
unrolled44times
times PP
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F6, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F8, F2, F6 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F8, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop
L.D F10, -16(R1) DD

ADD.D F12, F2, F10 LL


OO
Loop
Loophas
hasbeen
been S.D F12, -16(R1) OO
unrolled
unrolled44times
times PP
L.D F14, -24(R1)
ADD.D F16, F2, F14
Loop
LoopWork
Work = 12
12 S.D F16, -24(R1)
Loop =
LoopOverhead
Overhead 22 DADDUI R1, R1, #-32
BNE R1, R2, Loop
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop

Issue Slot 1 Issue Slot 2 Clock cycle

ALU/BEQ Data Transfer Instruction


1
2
3
4
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop

Issue Slot 1 Issue Slot 2 Clock cycle

ALU/BEQ Data Transfer Instruction


ld x31, 0(20) 1
addi x20, x20, -8 2
add x31, x31, x21 3
blt x22, x20, Loop sd x31, 0(20) 4
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop

Issue Slot 1 Issue Slot 2 Clock cycle

ALU/BEQ Data Transfer Instruction


ld x31, 0(20) 1
addi x20, x20, -8 2
add x31, x31, x21 3
blt x22, x20, Loop sd x31, 0(20) 4


Improve throughput by unrolling the loop body 4 times
Static Multiple Issue
Loop:
Original
OriginalLoop
Loop
ld x31, 0(20)
Loop:
Loop: add x31, x31, x21
ld
ld x31,
x31, 0(20)
0(20) sd x31, 0(20) UU
add
add x31,
x31, x31,
x31, x21
x21 NN
sd x31, 0(20) ld x30, 8(20) RR
sd x31, 0(20)
addi
addi x20,
x20, x20,
x20, -8
-8 add x30, x30, x21 OO
LL
blt x22, x20, sd x30, 8(20)
blt x22, x20, Loop
Loop LL
ld x29, 16(20) EE
DD
add x29, x29, x21
sd x29, 16(20) LL
OO
ld x28, 24(20) OO
add x28, x28, x21 PP
sd x28, 24(20)
addi x20, x20, -32
blt x22, x20, Loop
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21 Issue Slot 1 Issue Slot 2 Clock
cycle
sd x31, 0(20) ALU/BEQ Data Transfer Instruction
ld x30, 8(20)
ld x31, 0(20) 1
add x30, x30, x21
2
sd x30, 8(20)
ld x29, 16(20) 3
add x29, x29, x21 4
sd x29, 16(20) 5
ld x28, 24(20) 6
add x28, x28, x21 7
sd x28, 24(20)
8
addi x20, x20, -32
blt x22, x20, Loop
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21 Issue Slot 1 Issue Slot 2 Clock
cycle
sd x31, 0(20) ALU/BEQ Data Transfer Instruction
ld x30, 8(20)
addi x20, x20, -32 ld x31, 0(20) 1
add x30, x30, x21
sd x30, 8(20) ld x30, 8(20) 2

ld x29, 16(20) add x31, x31, x21 ld x29, 16(20) 3


add x29, x29, x21 add x30, x30, x21 ld x28, 24(20) 4
sd x29, 16(20) add x29, x29, x21 sd x31, 0(20) 5
ld x28, 24(20) add x28, x28, x21 sd x30, 8(20) 6
add x28, x28, x21 sd x29, 16(20) 7
sd x28, 24(20)
blt x22, x20, Loop sd x28, 24(20) 8
addi x20, x20, -32
blt x22, x20, Loop
Dynamic Multiple Issue Processors

Superscalar Processors
Dynamic Multiple Issue Processors

Superscalar Processors

Dynamic pipeline scheduling
Multiple Issue Processors
Scoreboard

You might also like