ch4-3
ch4-3
•
We can resolve hazards with forwarding
– How do we detect when to forward?
Dependencies & Forwarding
FIGURE 4.51
Detecting the Need to Forward
•
Pass register numbers along pipeline
–
e.g., ID/EX.RegisterRs = register number for Rs sitting
in ID/EX pipeline register
•
ALU operand register numbers in EX stage
are given by
–
ID/EX.RegisterRn1, ID/EX.RegisterRm2
•
Data hazards when
Fwd from
1a. EX/MEM.RegisterRd = ID/EX.RegisterRn1 EX/MEM
1b. EX/MEM.RegisterRd = ID/EX.RegisterRm2 pipeline reg
•
And only if Rd for that instruction is not XZR
– EX/MEM.RegisterRd ≠ 31,
MEM/WB.RegisterRd ≠ 31
FIGURE 4.52
Forwarding Paths
FIGURE 4.53
Forwarding Conditions
Mux control Source Explanation
ForwardA = 00 ID/EX The first ALU operand comes from the register file.
ForwardA = 10 EX/MEM The first ALU operand is forwarded from the prior ALU
result.
ForwardA = 01 MEM/WB The first ALU operand is forwarded from data memory or
an earlier
ALU result.
ForwardB = 00 ID/EX The second ALU operand comes from the register file.
ForwardB = 10 EX/MEM The second ALU operand is forwarded from the prior ALU
result.
ForwardB = 01 MEM/WB The second ALU operand is forwarded from data memory
or an
earlier ALU result.
FIGURE 4.54 The control values for the forwarding multiplexors in Figure 4.53.
Double Data Hazard
• Consider the sequence:
add X1,X1,X2
add X1,X1,X3
add X1,X1,X4
• Both hazards occur
– Want to use the most recent
• Revise MEM hazard condition
– Only fwd if EX hazard condition isn’t true
Revised Forwarding Condition
• MEM hazard
– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 31)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 31)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRn1))
and (MEM/WB.RegisterRd = ID/EX.RegisterRn1)) ForwardA = 01
– if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 31)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 31)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRm2))
and (MEM/WB.RegisterRd = ID/EX.RegisterRm2)) ForwardB = 01
Datapath with Forwarding
FIGURE 4.57 A
pipelined sequence of
instructions
Load-Use Hazard Detection
• Check when using instruction is decoded in ID
stage
• ALU operand register numbers in ID stage
are given by
– IF/ID.RegisterRn1, IF/ID.RegisterRm2
• Load-use hazard when
– If (ID/EX.MemRead and
((ID/EX.RegisterRd = IF/ID.RegisterRn1) or
(ID/EX.RegisterRd = IF/ID.RegisterRm2))
• If detected, stall and insert bubble
How to Stall the Pipeline
• Force control values in ID/EX register to 0
– EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
– Using instruction is decoded again
– Following instruction is fetched again
– 1-cycle stall allows MEM to read data for LDUI
•
Can subsequently forward to EX stage
Load-Use Data Hazard
Stall inserted
here
FIGURE 4.58
Datapath with Hazard Detection
FIGURE 4.59
Stalls and Performance
The BIG Picture
•
Stalls reduce performance
–
But are required to get correct results
•
Compiler can arrange code to avoid
hazards and stalls
–
Requires knowledge of the pipeline structure
Control Hazards
Branch Hazards
•
If branch outcome determined in MEM
Flush these
instructions
(Set control
values to 0)
PC
FIGURE 4.61 a
Example: Branch Taken
FIGURE 4.61 b
Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch
penalty is more significant
• Use dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses
– Stores outcome (taken/not taken)
– To execute a branch
•
Check table, expect the same outcome
•
Start fetching from fall-through or target
•
If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
•
Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
CBZ …, …, inner
…
CBZ …, …, outer
Mispredict as taken on last
iteration of inner loop
Then mispredict as not taken on
first iteration of inner loop next
time around
2-Bit Predictor
•
Only change prediction on two
successive mispredictions
FIGURE 4.64
Exception Example
• Exception on ADD in
40 SUB X11, X2, X4
44 AND X12, X2, X5
48 ORR X13, X2, X6
4C ADD X1, X2, X1
50 SUB X15, X6, X7
54 LDUR X16, [X7,#100]
…
• Handler
80000180 STUR X26, [X0,#1000]
80000184 STUR X27, [X0,#1008]
…
Exception Example
FIGURE 4.66
LEGv8 with Static Dual Issue
FIGURE 4.67
use latency
•
More instructions executing in parallel
•
EX data hazard
–
Forwarding avoided stalls with single-issue
–
Now can’t use ALU result in load/store in same packet
•
ADD X0, X0, X1
LDUR X2, [X0,#0]
•
Split into two packets, effectively a stall
•
Load-use hazard
–
Still one cycle use latency, but now two instructions
•
More aggressive scheduling required
Scheduling Example
•
Schedule this for dual-issue LEGv8
Loop: LDUR X0, [X20,#0] //
X0=array element
ADD X0, X0,X21 //
add scalar in X21
STUR X0, [X20,#0] //
store result
SUBI X20, X20,#4 //
decrement pointer
CMP X20, X22 //
branch $s1!=0
BGT Loop
FIGURE 4.68
Loop Unrolling
• Replicate loop body to expose
more parallelism
– Reduces loop-control overhead
• Use different registers per replication
– Called “register renaming”
– Avoid loop-carried “anti-dependencies”
• Store followed by a load of the same register
• Aka “name dependence”
– Reuse of a register name
Loop Unrolling Example
ALU/branch Load/store cycle
Loop: SUBI X20, X20,#32 LDUR X0, [X20,#0] 1
nop LDUR X1, [X20,#24] 2
ADD X0, X0, X21 LDUR X2, [X20,#16] 3
ADD X1, X1, X21 LDUR X3, [X20,#8] 4
ADD X2, X2, X21 STUR X0, [X20,#32] 5
ADD X3, X3, X21 sw X1, [X20,#24] 6
CMP X20,X22 sw X2, [X20,#16] 7
BGT Loop sw X3, [X20,#8] 8
Figure 4.69
•
IPC = 15/8 = 1.875
– Closer to 2, but at cost of registers and code size
Dynamic Multiple-Issue Processors
• “Superscalar” processors
• CPU decides whether to issue 0, 1, 2, …
each cycle
– Avoiding structural and data hazards
• Avoids the need for compiler scheduling
– Though it may still help
– Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out
of order to avoid stalls
– But commit result to registers in order
• Example
LDUR X0, [X21,#20]
ADD X1, X0, X2 SUB
X23,X23,X3 ANDI X5,
X23,#20
– Can start sub while ADD is waiting for LDUI
Dynamically Scheduled CPU
Preserves
dependencies
Hold pending
operands