5.Advanced-1
5.Advanced-1
Outline
Data Level Parallelism
●
ILP Instruction Level Parallelism
Thread Level Parallelism
●
Out-of-order execution
●
Multiple Issue, Superscalar Processors
●
Loop Unrolling
●
Speculation
●
Scoreboard
●
Pipeline of a modern processor
ILP
●
Instruction Level Parallelism
ILP
●
Instruction Level Parallelism
– Can multiple instructions be executed in parallel
(without the programmer intending)?
–
ILP
●
Instruction Level Parallelism
– Increasing the depth of the pipeline
ILP
●
Instruction Level Parallelism
– Increasing the depth of the pipeline
– Multiple Issue - Replicate components to launch
multiple instructions
Multiple Issue Pipeline
L/S Unit
L/S Unit
Branch Prediction EA
EA MM
Exception handling
INT ALU
INT ALU
EX
EX
Multiple Fetch,
Multiple Decode FP/INT Multiply
FP/INT Multiply
IF
IF ID
ID WB
M1
M1 M2
M2 M3
M3 M4
M4 M5
M5 M6
M6 M7
M7 WB
PC --> 100
PC <-- PC + 16
A1
A1 A2
A2 A3
A3 A4
A4
100 add FP Adder
FP Adder
104 sub
108 xor
10C ld
110 sd DIV
....
FP/INT Divider
FP/INT Divider
Multiple Issue Pipelines
●
3 way – 6 way
Multiple Issue Pipelines
●
3 way – 6 way
●
IPC > 1 (CPI < 1)
Multiple Issue Pipelines
●
3 way – 6 way
●
IPC > 1 (CPI < 1)
●
Static issue vs. Dynamic issue
Compiler inside the Processor
Multiple Issue Pipelines
●
3 way – 6 way
●
IPC > 1 (CPI < 1)
●
Static issue vs. Dynamic issue
●
Multiple Issue requires:
– Packing instructions into Issue Slots
– Dealing with data and control hazards
OoO Motivating Code Sequence
0 --> 2 --> 3
Out-of-Order
1 --> 4 --> (5, 6)
00 mul
mul x1,
x1, x2,
x2, x3
x3 F D M1 M2 M3 M4 W
11 addi
addi x11,x10,
x11,x10, 11 FDX W
22 mul
mul x5,
x5, x1,
x1, x4
x4 F D D D D M1 ...
33 mul
mul x7,
x7, x5,
x5, x6
x6 F D D D D M1 ...
44 addi
addi x12,x11,
x12,x11, 11 F D X W
55 addi
addi x13,x12,
x13,x12, 11 F D D X W
66 addi
addi x14,x12,
x14,x12, 22 F D D X W
OoO Motivating Code Sequence
00 mul
mul x1,
x1, x2,
x2, x3
x3
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling
●
Out-of-Order Execution
– Choose next instruction to be executed based on
ready inputs and available functional units
– (Check for structural and data hazards)
– Begin executing as soon as operands are available
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
00 mul
mul x1,
x1, x2,
x2, x3
x3 F D M1 M2 M3 M4 M5 M6 M7 W
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
should not commit out-of-order
00 mul
mul x1,
x1, x2, x3 F D M1 M2 M3 M4 M5 M6 M7 W
x2, x3
11 addi
addi x11,x10,
x11,x10, 11 F D X W
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x12,x11,
x12,x11, 11
55 addi
addi x13,x12,
x13,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
●
WAR and WAW hazards
00 mul
mul x1,
x1, x2,
x2, x3
x3
11 addi
addi x11,x10,
x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
44 addi
addi x5,x11,
x5,x11, 11
55 addi
addi x12,x12,
x12,x12, 11
66 addi
addi x14,x12,
x14,x12, 22
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
●
WAR and WAW hazards
●
Imprecise exceptions
Dynamic Scheduling
●
Out-of-Order Execution
●
Implies out-of-order completion
●
WAR and WAW hazards
●
Imprecise exceptions
00 div
div x1,
x1, x2, x3 F D M1 M2 M3 M4 M5 M6 M7 W
x2, x3
11 addi
addi x11,x10,
x11,x10, 11 F D X W
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7,
x7, x5,
x5, x6
x6
Speculation
Speculation
●
Execute instructions along predicted execution paths but
only commit the results if prediction was correct
Speculation
●
Execute instructions along predicted execution paths but
only commit the results if prediction was correct
– Instructions after a predicted branch
– Executing a Load earlier than a Store
Speculation
●
Execute instructions along predicted execution paths but
only commit the results if prediction was correct
●
Instruction commit: allowing an instruction to update
the register file when instruction is no longer speculative
Hardware based Speculation
●
Execute instructions along predicted execution paths but
only commit the results if prediction was correct
●
Instruction commit: allowing an instruction to update the
register file when instruction is no longer speculative
●
Need an additional piece of hardware to prevent any
irrevocable action until an instruction commits
Hardware based Speculation
●
Execute instructions along predicted execution paths but
only commit the results if prediction was correct
●
Instruction commit: allowing an instruction to update the
register file when instruction is no longer speculative
●
Need an additional piece of hardware to prevent any
irrevocable action until an instruction commits
– Reorder Buffer
Reorder Buffer (RoB)
●
Reorder Buffer
– In-order commit
– Stores instruction results before instruction commits
– Clear ROB on misprediction
– Exceptions
Reorder Buffer (RoB)
●
Reorder Buffer
– In-order commit
– Stores instruction results before instruction commits
– Clear ROB on misprediction
– Exceptions 00 div
div
11 addi
x1,
x1, x2,
x2, x3
x11,x10,
x3
addi x11,x10, 11
22 mul
mul x5,
x5, x1,
x1, x4
x4
33 mul
mul x7, x5, x6
x7, x5, x6
INT ALU
INT ALU
EX
EX
FP/INT Multiply
FP/INT Multiply
IF
IF ID
ID WB
M1
M1 M2
M2 M3
M3 M4
M4 M5
M5 M6
M6 M7
M7 WB
A1
A1 A2
A2 A3
A3 A4
A4
FP Adder
FP Adder
DIV
FP/INT Divider
FP/INT Divider
Static Multiple Issue Pipeline
INT ALU
L/S Unit
Static Multiple Issue Pipeline
ALU/BEQ
ALU/BEQ
MEM
MEM
Static Multiple Issue
Static Multiple Issue
●
Issue Packet
Static Multiple Issue
●
Issue Packet
●
Compiler technique
– VLIW Very Large Instruction Word
Static Multiple Issue – Example
Loop:
ld x31, 0(20)
add x31, x31, x21
sd x31, 0(20)
addi x20, x20, -8
blt x22, x20, Loop
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8 LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8 LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
Unroll
Unroll22times
times OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
Unroll
Unroll22times
times OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
Dependences?
Dependences? OO
PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F0, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F4, F0, F2 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F4, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
name
nameand
andanti-
anti- OO
dependences
dependences PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F6, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F8, F2, F6 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F8, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop DD
LL
OO
Register
Register OO
Renaming
Renaming PP
Loop Unrolling
Original
OriginalLoop
Loop Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
Loop:
Loop: L.D
L.D F0,
F0,0(R1)
0(R1)
S.D F4, 0(R1) UU
ADD.D
ADD.D F4,
F4,F0,
F0,F2
F2 NN
L.D F6, -8(R1) RR
S.D
S.D F4,
F4,0(R1)
0(R1) OO
ADD.D F8, F2, F6 LL
DADDUI
DADDUI R1,
R1,R1,
R1,#-8
#-8
S.D F8, -8(R1) LL
BNE R1, EE
BNE R1,R2,
R2,Loop
Loop
L.D F10, -16(R1) DD
●
Improve throughput by unrolling the loop body 4 times
Static Multiple Issue
Loop:
Original
OriginalLoop
Loop
ld x31, 0(20)
Loop:
Loop: add x31, x31, x21
ld
ld x31,
x31, 0(20)
0(20) sd x31, 0(20) UU
add
add x31,
x31, x31,
x31, x21
x21 NN
sd x31, 0(20) ld x30, 8(20) RR
sd x31, 0(20)
addi
addi x20,
x20, x20,
x20, -8
-8 add x30, x30, x21 OO
LL
blt x22, x20, sd x30, 8(20)
blt x22, x20, Loop
Loop LL
ld x29, 16(20) EE
DD
add x29, x29, x21
sd x29, 16(20) LL
OO
ld x28, 24(20) OO
add x28, x28, x21 PP
sd x28, 24(20)
addi x20, x20, -32
blt x22, x20, Loop
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21 Issue Slot 1 Issue Slot 2 Clock
cycle
sd x31, 0(20) ALU/BEQ Data Transfer Instruction
ld x30, 8(20)
ld x31, 0(20) 1
add x30, x30, x21
2
sd x30, 8(20)
ld x29, 16(20) 3
add x29, x29, x21 4
sd x29, 16(20) 5
ld x28, 24(20) 6
add x28, x28, x21 7
sd x28, 24(20)
8
addi x20, x20, -32
blt x22, x20, Loop
Static Multiple Issue
Loop:
ld x31, 0(20)
add x31, x31, x21 Issue Slot 1 Issue Slot 2 Clock
cycle
sd x31, 0(20) ALU/BEQ Data Transfer Instruction
ld x30, 8(20)
addi x20, x20, -32 ld x31, 0(20) 1
add x30, x30, x21
sd x30, 8(20) ld x30, 8(20) 2