ACA Notes (Prof. Silvano)
2021/2022
Topics:
The topics are almost the same as those of the pre-exam. Note: the performance part is missing!
1. Some basic concepts about MIPS
2. Static Branch Prediction Techniques
3. Dynamic Branch Prediction Techniques
4. Exception handling
5. Instruction Level Parallelism
6. Scoreboard Dynamic Scheduling Algorithm
7. Tomasulo: Dynamic Scheduling algorithm
8. Tomasulo: Implicit Register Renaming
9. ReOrder Buffer
10. Speculative Tomasulo Architecture with ROB
11. Some basic concepts about cache memories
12. Memory Hierarchy
MIPS pipeline
Performance metrics
• IC = Instruction Count
• #Clock_Cycles = IC + #Stall_Cycles + 4
• CPI = Clock cycles Per Instruction = #Clock_Cycles / IC
• MIPS = fclock / (CPI * 10^6)
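A small worked example applying these formulas (the numbers are made up, just to illustrate; Python):

    # Hypothetical numbers, only to illustrate the formulas above.
    IC = 100                      # instructions executed
    stall_cycles = 20             # total stall cycles
    f_clock = 500e6               # 500 MHz clock
    clock_cycles = IC + stall_cycles + 4      # 5-stage pipeline fill adds 4 cycles
    CPI = clock_cycles / IC                   # 124 / 100 = 1.24
    MIPS = f_clock / (CPI * 10**6)            # 500e6 / 1.24e6 ≈ 403 MIPS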
Branch resolution
• Branch Outcome and Branch Target Address are ready at the end of the EX-stage
• Conditional branches are resolved when the PC is updated at the end of the MEM stage
Static Branch Prediction Techniques
• It is done and fixed at compile time
• It is typically used when the branch behavior for the target application is highly
predictable at compile time
• It can also be used in combination with dynamic predictors
• Early evaluation of the BO, BTA and update of the PC during ID stage
Branch Always Not Taken
• Every branch is simply predicted as not taken at compile time
Backward Taken Forward Not Taken (BTFNT)
• Backward-going branches are predicted as taken (the branches at the end of DO-WHILE loops going back to the beginning of the next loop iteration)
• Forward-going branches are predicted as not taken (the branches of IF-THEN-ELSE conditional statements)
Profile-Driven Prediction
• We profile the behavior of the target application program through several early runs using different data sets; the branch prediction is based on the profiling information collected during those earlier runs
Delayed Branch Technique
• Scheduling technique: the compiler statically schedules an independent instruction in
the Branch Delay Slot
• The instruction in the branch delay slot is always executed, whether or not the branch is taken (it is never flushed)
The job of the compiler is to find a valid and useful instruction to be scheduled in the branch delay slot. There are 4 ways to fill the branch delay slot:
From before
• The branch delay slot is scheduled with an independent instruction from before the
branch
• The instruction is always executed
From target
• The branch delay slot is scheduled with one instruction from the target of the branch
(branch taken)
• If the branch is not taken (misprediction), the instruction in the delay slot needs to be flushed, or it must be safe to execute even when the branch goes in the unexpected direction (in which case the work is wasted)
• It is preferred when the branch is taken with high probability, such as DO-WHILE loop
branches (backward branches)
From fall-through
• The branch delay slot is scheduled with one instruction from the fall-through path
(branch not taken)
• If the branch is taken -> flush or wasted work
• It is preferred when the branch is not taken with high probability, such as IF-THEN-ELSE statements where the ELSE path is less probable (forward branches)
From after
//
The main limitations on delayed branch scheduling arise from:
• The restrictions on the instructions that can be scheduled in the delay slot
• The ability of the compiler to statically predict the outcome of the branch
To improve the ability of the compiler to fill the branch delay slot, most processors have introduced canceling or nullifying branches: the instruction includes the direction in which the branch is predicted to go. In this way, the compiler need not be as conservative when filling the delay slot
The MIPS architecture has the branch-likely instruction, which can be used as a cancel-if-not-taken branch:
• The instruction in the branch delay slot is executed if the branch is taken
• The instruction in the branch delay slot is not executed (it is turned into a nop) if the branch is not taken
It is a useful approach for backward branches (such as loop branches)
Dynamic Branch Prediction Techniques
• Branch Outcome Predictor (BOP): to predict the direction of a branch (taken or not taken)
• Branch Target Predictor or Branch Target Buffer (BTB): to predict the branch target
address in case of taken branch (Predicted Target Address - PTA)
If the branch is predicted by the BOP in the IF stage as not taken, the PC is incremented (the BTB is not useful in this case); otherwise the BTB gives the Predicted Target Address (PTA).
If the BO computed at the end of the ID stage reveals a misprediction, we need to flush the instruction already fetched (it is turned into a nop) and restart execution by fetching at the correct address, paying a one-cycle penalty.
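A minimal sketch of this IF-stage lookup (hypothetical data structures, not the real hardware; Python):

    # BOP maps a branch PC to a predicted direction, BTB maps it to its PTA.
    def next_pc(pc, bop, btb):
        if bop.get(pc, False) and pc in btb:  # predicted taken: fetch from the PTA
            return btb[pc]
        return pc + 4                         # predicted not taken: increment the PC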
Branch History Table (1-bit BHT)
• Table containing 1 bit per entry that says whether the branch was recently taken or not
• Table indexed by the lower portion k-bit of the address of the branch instruction (to
keep the size of the table limited)
• For locality reasons, we expect that the most significant bits of the branch address do not change
• Prediction: hint that it is assumed to be correct and fetching begins in the predicted
direction. If the hint turns out to be wrong, the prediction bit is inverted and stored
back. The pipeline is flushed, and the correct sequence is executed with one cycle
penalty
• The table has no tags (every access is a hit) and the prediction bit could have been put
there by another branch with the same low-order address bits
A misprediction occurs when:
1. The prediction is incorrect for that branch or
2. the same index has been referenced by two different branches, and the previous history
refers to the other branch (This can occur because there is no tag check).
To reduce this problem, it is enough to increase the number of rows in the BHT (that is
to increase k) or to use a hashing function (such as in GShare).
Shortcoming of the 1-bit BHT: in a loop branch, even if the branch is almost always taken and not taken only once (at loop exit), the 1-bit BHT mispredicts twice (rather than once) per loop execution. The scheme causes two wrong predictions:
• At the last loop iteration, since the prediction bit will say taken, while we need to exit
from the loop
• When we re-enter the loop, at the end of the first loop iteration we need to take the
branch to stay in the loop, while the prediction bit says to exit from the loop, since the
prediction bit was flipped on previous execution of the last iteration of the loop
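A quick simulation of this double misprediction on a hypothetical trace (Python):

    # Loop branch taken 4 times then not taken at exit, executed 3 times in a row.
    state = 1                                 # 1 = predict taken, 0 = predict not taken
    mispredictions = 0
    for outcome in ([1] * 4 + [0]) * 3:
        if state != outcome:
            mispredictions += 1
        state = outcome                       # 1-bit scheme: remember the last outcome
    print(mispredictions)  # 5: one at the first exit, then 2 per run (re-entry + exit)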
2-bit Branch History Table
• The prediction must miss twice before it is changed. In a loop branch, at the last loop
iteration, we do not need to change the prediction.
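A sketch of the 2-bit saturating counter behind each entry (Python):

    # States 0-1 predict not taken, 2-3 predict taken; the prediction
    # must miss twice before it flips.
    def predict(counter):
        return counter >= 2

    def update(counter, taken):
        return min(counter + 1, 3) if taken else max(counter - 1, 0)

On the loop trace used above, a counter starting at 3 mispredicts only once per run (at the exit), since the single decrement to 2 still predicts taken on re-entry.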
Correlating Branch Predictors
• The 2-bit BHT uses only the recent behavior of a single branch to predict the future
behavior of that branch.
• Basic Idea: the behavior of recent branches is correlated, that is the recent behavior of
other branches rather than just the current branch (we are trying to predict) can
influence the prediction of the current branch. We try to exploit the correlation existing
among different branches
A (1,1) correlating predictor is composed of two 1-bit predictors with 1 bit of correlation.
It records whether the most recently executed branch was taken or not taken. The branch is predicted based on the previously executed branch, by selecting the appropriate 1-bit BHT
In general, an (m, n) correlating predictor records the outcomes of the last m branches to choose among 2^m BHTs, each of which is an n-bit predictor
A (2, 2) correlating predictor has 4 2-bit Branch History Tables, and it uses the 2-bit global
history to choose among the 4 BHTs
For example, if the predictor has 64 total entries, a 6-bit index is needed, composed of the 2-bit global history and 4 low-order branch address bits (2^4 = 16, k = 4); see the sketch below.
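A sketch of that index computation (hypothetical helper; Python):

    # 64-entry (2,2) predictor: 2 bits of global history select one of the
    # 4 BHTs, 4 low-order PC bits select the entry inside it (6-bit index).
    def index_2_2(pc, global_history):
        return ((global_history & 0x3) << 4) | (pc & 0xF)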
Note that the (2,2) predictor not only outperforms the simple 2-bit predictor with the same
number of total bits, but it also often outperforms a 2-bit predictor with an unlimited number
of entries.
Two-level Adaptive Branch Predictors
The first-level history is recorded in one (or more) k-bit shift register, called Branch History Register (BHR), which records the outcomes of the k most recent branches (e.g., T, NT, NT, T) (used as the global history)
The second level history is recorded in one (or more) tables called Pattern History Table (PHT)
of two-bit saturating counters (used as a local history)
The BHR is used to index the PHT to select which 2-bit counter to use.
Once the two-bit counter is selected, the prediction is made using the same method as in the
two-bit counter scheme
GA Predictor: the global 2-level predictor uses the correlation between the current branch and
the other branches in the global history to make the prediction
Global and local predictor with 2-level predictor: PHT (local history) indexed by the content of
BHR (global history)
GShare: a variation of the GA predictor where we want to correlate the BHR recording the outcomes of the most recent branches (global history) with the low-order bits of the branch address
We make the XOR of 4-bit BHR (global history) with the low-order 4-bit of PC (branch address)
to index the PHT (local history).
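A sketch of the GShare index described here (Python):

    # XOR of the 4-bit BHR with the 4 low-order PC bits gives a 4-bit
    # index into a 16-entry PHT of 2-bit counters.
    def gshare_index(pc, bhr):
        return (pc & 0xF) ^ (bhr & 0xF)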
Branch Target Buffer
It is a cache storing the Predicted Target Address (PTA) expressed as PC-relative
We access the BTB in the IF stage by using the address of the fetched branch instruction to
index the cache
Exception handling
We use the term ‘exception’ to cover not only exceptions but also interrupts (an event that
requests the attention of the processor) and faults.
Synchronous vs asynchronous:
▪ asynchronous events are caused by devices external to the CPU and memory and
can be handled after the completion of the current instruction (easier to handle)
User requested vs coerced:
▪ User requested (such as I/O events) are predictable: treated as exceptions because
they use the same mechanisms that are used to save and restore the state; handled
after the instruction has completed (easier to handle)
▪ Coerced are caused by some HW event not under control of the user program; hard
to implement because they are unpredictable
User maskable vs user nonmaskable:
▪ the mask simply controls whether the HW responds to the exception or not
Within vs between instructions:
▪ exceptions that occur within instructions are usually synchronous since the
instruction triggers the exception. The instruction must be stopped and restarted.
▪ Asynchronous exceptions that occur between instructions arise from catastrophic situations such as HW malfunctions and always cause program termination.
Resume vs terminate:
▪ Terminating event: program’s execution always stops after the interrupt/exception
▪ Resuming event: program’s execution continues after the interrupt/exception
Asynchronous Interrupts
When the processor decides to process the interrupt:
▪ It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (precise interrupt)
▪ It saves the PC of instruction Ii in a special register, the Exception Program Counter (EPC)
▪ It disables interrupts and transfers control to a designated
interrupt handler running in the kernel mode
Interrupt Handler
Needs a way to mask further interrupts, at least until the PC can be saved
Needs to read a status register that indicates the cause of the interrupt
Uses a special indirect jump instruction RFE (Return-From-Exception) which restores the PC and:
- enables interrupts
- restores the processor to the user mode
- restores hardware status and control state
The instruction Ii and the next instructions (Ii+1, …) are restarted
Synchronous Interrupts
A synchronous interrupt (exception) is caused by an instruction in a particular pipeline stage
In general, the instruction Ii cannot be completed and needs to be restarted
after the exception has been handled (in the pipeline this would require undoing the effect of
one or more partially executed instructions)
Precise Interrupts/Exceptions
An interrupt or exception is precise if there is a single instruction (the interrupt point) for which all preceding instructions have committed their state and no following instruction (including the interrupting instruction Ii) has modified any state.
This means, effectively, that we can restart execution at the interrupt point and “get the right
answer”
Exceptions in 5-stage Pipeline
Exceptions may occur at different stages in pipeline => Exceptions may be raised out of order
Example:
- Data page fault occurs in memory stage of first instruction
- Arithmetic exception occurs in execution stage of second instruction
Due to pipelined execution, the page fault occurs simultaneously with the arithmetic overflow, but the data page fault must be handled first, since it belongs to the earlier instruction
Exception handling:
▪ Write in Data Memory and Write Back in RF are disabled to guarantee a precise
interrupt model
▪ Hold exception flags and PC in pipeline until commit point (M stage)
▪ Exceptions in earlier pipe stages override later exceptions for a given instruction
▪ Inject external interrupts at commit point (override others)
Instruction Level Parallelism
The max throughput would be to complete one Instruction Per Clock cycle: IPC_ideal = 1 => CPI_ideal = 1
In a pipelined machine, actual CPI is derived as:
CPI_pipeline = CPI_ideal + Structural Stalls + Data Hazard Stalls + Control Stalls + Memory Stalls
Reducing any right-hand term brings CPI_pipeline closer to CPI_ideal and increases the Instructions Per Clock: IPC = 1 / CPI
Different types of hazards limit performance:
• Data Dependences
- True data dependence: Read After Write hazards
• Name Dependences (use same register or memory location)
- Anti-dependence: When Ij writes a register or memory location that instruction Ii
reads (it can generate a Write After Read hazard)
- Output Dependence: When Ii and Ij write the same register or memory location (it
can generate a Write After Write hazard)
Name dependences are not true data dependences since there is no value (no data
flow) being transmitted between instructions
If the name (register number or memory location) used in the instructions could be
changed (register renaming), the instructions do not conflict anymore
Register renaming can be done more easily if there are enough registers available in the ISA; it can be done either statically by the compiler or dynamically by the hardware
Program properties
Two properties are critical to preserve program correctness:
1. Data flow: Actual flow of data values among instructions that produces the correct
results and consumes them
2. Exception behavior: Preserving exception behavior means that any changes in the
ordering of instruction execution must not change how exceptions are raised in the
program
Multi-cycle pipelining
• ID stage split in 2 stages: Instr. Decode (ID) & Register Read (Issue)
• Multiple functional units with variable latency
• Long latency multi-cycle floating-point instructions
• Memory systems with variable access time: multi-cycle memory accesses due to data
cache misses (unpredictable statically)
• No more single commit point => out-of-order completion, and we must check for WAR & WAW hazards
Instruction level parallelism
Exploit potential overlap of execution among unrelated instructions
Multiple-issue pipeline
In a multiple-issue pipelined processor, the ideal CPI would be CPI_ideal < 1
If we consider for example a dual-issue processor, the max throughput would be to complete 2 Instructions Per Clock cycle: IPC_ideal = 2; CPI_ideal = 0.5
Two strategies to support ILP: dynamic scheduling (the hardware locates the parallelism at runtime) and static scheduling (the compiler identifies the parallelism).
Dynamic scheduling
• Problem: hazards due to data dependences that cannot be solved by forwarding cause stalls of the pipeline: no new instructions are fetched nor issued, even if they are not data dependent
• Solution: allow data-independent instructions behind a stall to proceed
• The hardware reorders the instruction execution to reduce stalls, while maintaining data flow and exception behavior
• Enables out-of-order execution and completion (commit)
Instructions are fetched and issued in program order (in-order-issue)
Execution begins as soon as operands are available (possibly, out of order execution)
Out-of-order execution introduces the possibility of WAR and WAW data hazards
Out-of-order execution implies out-of-order completion, unless there is a re-order buffer to get in-order completion
• Problem with out-of-order completion: we must preserve the exception behavior as in in-order execution
• Solution: ensure that no instruction can generate an exception until the processor
knows that the instruction raising the exception will be executed
An exception is imprecise if the processor state when an exception is raised does not look
exactly as if the instructions were executed in-order. Imprecise exceptions can occur because:
• The pipeline may have already completed instructions that are later in program order
than the instruction causing the exception
• The pipeline may have not yet completed some instructions that are earlier in program
order than the instruction causing the exception
Imprecise exception makes it difficult to restart execution after handling
Main limitations of dynamic scheduling:
• Very expensive logic is needed to decide dependences and independences, i.e., to decide which instructions can be issued every clock cycle
• It does not scale: it is almost impractical to make the issue width greater than 4 (we would have to slow down the clock)
Static scheduling -> VLIW processors
Static detection and resolution of dependences, which are avoided by code reordering.
Output of the compiler: code reordered into dependency-free code
Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). The size of a basic block (a straight-line code sequence with no branches in, except to the entry, and no branches out, except at the exit) is usually quite small, and the amount of parallelism available within a basic block is quite small.
Data dependence can further limit the amount of ILP we can exploit within a basic block to
much less than the average basic block size
To obtain substantial performance enhancements, we must exploit ILP across multiple basic
blocks (i.e. across branches such as in trace scheduling).
Main Limits of Static Scheduling:
• Unpredictable branches
• Variable memory latency (unpredictable cache misses)
• Code size explosion
• Compiler complexity
• Code portability
• Performance portability
Limits of Multiple-Issue processors
The issue width is the number of instructions that can be issued in a single cycle by a multiple
issue processor
It is too hard to decide which 8, or 16, instructions can execute every cycle (too many combinations!): it takes too long to compute, so the frequency of the processor would have to be decreased
There is also a limitation due to the intrinsic level of parallelism available in the program.
Scoreboard Dynamic Scheduling Algorithm
• Every instruction goes through the Scoreboard, where a record of data dependences is
constructed
• The Scoreboard then determines when the instruction can read its operand and begin
execution
• If the scoreboard decides the instruction cannot execute immediately, it monitors every
change and decides when the instruction can execute
• The scoreboard controls when the instruction can write its result into destination
register
Four Stages of Scoreboard Control
1. Issue: decode instruction and check for structural hazards & WAW hazards
2. Read Operands: wait until no RAW hazards, then read the operands. Check for structural hazards in reading the RF
3. Execution: the functional unit begins execution upon receiving the operands; when the result is ready, it notifies the scoreboard that it has completed execution
4. Write Result: once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If none, it writes the result. If a WAR hazard exists, it stalls the completing instruction (see the sketch below).
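A simplified sketch of this Write Result check (illustrative data structures, not the real scoreboard tables; Python):

    # The completing instruction may write its destination Fi only if no
    # earlier issued instruction still has to read Fi as a source (WAR check).
    def can_write_result(completing, earlier_pending):
        dest = completing["Fi"]
        for instr in earlier_pending:
            if dest in instr["unread_sources"]:
                return False                  # WAR hazard: stall the completing instr.
        return True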
Scoreboard optimizations
Tomasulo: Dynamic Scheduling algorithm
It introduces the Implicit Register Renaming to avoid WAR & WAW hazards to get high
performance at runtime without special compilers
It introduces some FU buffers called “Reservation Stations” in front of the FUs to keep pending
operands
The control logic & FU buffers are distributed with the Functional Units (vs. centralized in the scoreboard)
Registers in instructions are replaced by their values or pointers to reservation stations (RS) to
enable Implicit Register Renaming
Tomasulo stages
1. Issue (In-Order)
- Get an instruction I from the head of instruction queue (maintained in FIFO order to
ensure in-order issue)
- Check if an RS is empty (i.e., check for structural hazards in the RSs), otherwise stall. If the operands are not in the RF, keep track of the FUs that will produce them (Q pointers)
2. Execution (Out-of-Order): when both operands are ready, execute the operation on the FU; if an operand is not yet available, watch the Common Data Bus (CDB) for it (this resolves RAW hazards)
3. Write result (Out-of-Order): when the result is available, write it on the CDB, and from there into the RF and into all the RSs waiting for it; then mark the reservation station as available
Tomasulo: Implicit Register Renaming
How can Tomasulo overlap iterations of loops?
Register renaming provided by Reservation Stations (which buffer the operands of
instructions) to eliminate WAR and WAW hazards
• Multiple iterations use different physical destinations for registers (dynamic loop
unrolling without changing the code)
• Replace static register names from code with dynamic register “pointers”
• Effectively increases the size of Register File
• Permit instruction issue to advance past integer control flow operations
Crucial: integer unit must “get ahead” of floating-point unit so that we can issue multiple
iterations (by using branch prediction)
Example of compiler transformation called Loop Unrolling combined with Register Renaming by
using more registers specified in the ISA:
Unrolled Loop (unrolling factor 4) + Reg. Renaming + Code Rescheduling to minimize RAW stalls
Pros with respect to static loop unrolling:
• Static loop unrolling uses more instruction cache than standard Tomasulo
• Static loop unrolling uses a lot of ISA registers
• With static loop unrolling, the unrolling factor must be chosen carefully at compile time
Explicit Register Renaming: use a physical register file larger than the set of registers specified by the ISA
• On issue, each instruction that writes a result is allocated a new physical register from a freelist
• An ISA register → physical register mapping is maintained
• When a register is written, its mapping entry is replaced with a new register from the freelist
• A physical register becomes free when it is no longer used by any active instruction
Advantages:
ReOrder Buffer
We introduce the concept of HW-based speculation, which extends the ideas of dynamic
scheduling beyond branches
HW-based speculation combines 3 key concepts:
1. Dynamic branch prediction
2. Speculation to enable the execution of instructions before the control dependences are resolved, with the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling beyond branches
• Buffer to hold the results of instructions that have finished execution but not yet
committed
• Buffer to pass results among instructions that have started speculatively after a branch,
so the renaming function of Reservation Stations is replaced by ROB entries
• Reservation Stations now are used only to buffer instructions and operands to FUs (to
reduce structural hazards)
• ROB supplies operands to other RSs between execution complete & commit
• Once the instruction commits, the result is written in the register file
• Originally (1988) introduced to maintain the precise interrupt model → Processors with
ROB can execute dynamically while maintaining a precise interrupt model because
instruction commit happens in order
The ROB is a circular buffer with the head pointer indicating the instruction that will commit
(leaving ROB) first and with the tail pointer (indicating next free entry)
Instructions are written in ROB in strict program order – when an instruction is issued, an entry
is allocated to the ROB in sequence (in-order)
Each entry must keep the status of an instruction: issued, execution started, execution
completed, ready to commit
Each entry includes a speculative status flag, indicating whether the instruction is being
speculatively executed or not
An instruction is ready to commit (retire) if and only if:
1. It has finished, and
2. it is not marked as speculative, and
3. all previous instructions have already retired.
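A sketch of this retire test over the circular buffer (field names are illustrative; Python):

    # rob: list of entries ordered from head (oldest) to tail; since only the
    # head can retire, condition 3 (all previous retired) holds by construction.
    def commit(rob, write_back):
        while rob and rob[0]["finished"] and not rob[0]["speculative"]:
            write_back(rob.pop(0))            # update RF/memory, free the entry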
Speculative Tomasulo Architecture with ROB
Pointers are directed towards the ROB entries instead of reservation stations
Implicit register renaming by using the ROB entries to eliminate WAR and WAW hazards and
enable dynamic loop unrolling
A register or memory location is updated only when the instruction reaches the head of ROB
There are Load buffers, while Store Buffers are completely replaced by the ReOrder Buffer.
By using the RoB to hold the result, it is easy to:
1. Undo speculated instructions on mispredicted branches
2. Manage exceptions precisely
Four Phases of Speculative Tomasulo Algorithm
1. Issue (or dispatch): get an instruction from the instruction queue
• If there are one free RS entry and one free ROB entry → the instruction is issued
• If its operands are available in the RF or in the ROB, they are sent to the RS; the number of the ROB entry allocated for the result is sent to the RS (to tag the result when it will be placed on the CDB)
• If the RS and/or the ROB is full: the instruction stalls
2. Execution (out-of-order): if one or more operands are not yet available, monitor the CDB while waiting for them to be computed (this resolves RAW hazards); when both operands are ready, execute the operation
3. Write result (out-of-order): when the result is available, write it on the CDB (tagged with its ROB entry number), and from there into the ROB and into any RSs waiting for it; mark the RS as available
4. Commit (in-order): update the RF or memory with the result at the head of the ROB
• When the instruction is at the head of the ROB and its result is ready, update the RF with the result (or store to memory) and remove the instruction from the ROB (commit is sometimes called "graduation")
• A mispredicted branch flushes the ROB
There are 3 different possible sequences:
a. Normal commit: instruction reaches the head of the ROB; result is present in
the buffer. Result is stored in the Register File, instruction is removed from
ROB
b. Store commit: as above, but result is stored in memory rather than in the RF
c. Instruction is a branch with incorrect prediction: it indicates that the speculation was wrong. The ROB is flushed and execution restarts at the correct successor of the branch. If the branch was correctly predicted, the branch is finished
• Instruction type field indicates whether instruction is a branch (no destination result), a
store (has memory address destination), or a load/ALU (register destination)
• Destination field supplies register number (for loads and ALU instructions) or memory
address (for stores) where results should be written
• Value field used to hold value of result until instruction commits
• Ready field indicates that instruction has completed execution, value is ready
• Speculative field indicates whether instruction is executed speculatively or not
Interrupt handling
Exceptions generated in connection with instruction execution are made precise by accepting the exception request only when the instruction becomes "ready to retire" (exceptions are processed in order)
Some basic concepts about cache memories
Main goal: to increase the performance of the computer through the memory system, in order to:
• Provide the user with the illusion of a memory that is simultaneously large and fast
• Provide the data to the processor at high frequency
Problem: Processor-memory performance gap
Solution: to exploit the principle of locality of references
• Temporal Locality: when there is a reference to one memory element, the trend is to refer again to the same memory element soon (e.g., instructions and data reused in loop bodies)
• Spatial Locality: when there is a reference to one memory element, the trend is to refer soon to other memory elements whose addresses are close by (e.g., sequences of instructions, or accesses to data organized as arrays or matrices)
Caches exploit both types of predictability:
- Exploit temporal locality by keeping the contents of recently accessed locations
- Exploit spatial locality by fetching blocks of data around recently accessed locations
Definitions
Hit Rate: Number of memory accesses that find the data in the upper level with respect to the
total number of memory accesses
Hit Rate = # hits / # memory accesses
Hit Time: time to access the data in the upper level of the hierarchy, including the time needed
to decide if the attempt of access will result in a hit or miss
Miss Rate: number of memory accesses not finding the data in the upper level with respect to
the total number of memory accesses
Miss Rate = # misses / # memory accesses
Hit Rate + Miss Rate = 1
Miss Penalty: time needed to access the lower level and to replace the block in the upper level
Miss Time = Hit Time + Miss Penalty
Average Memory Access time = AMAT= Hit Time + Miss Rate * Miss Penalty
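A worked example with made-up numbers (Python):

    # Hit time 1 cycle, miss rate 5%, miss penalty 100 cycles.
    hit_time, miss_rate, miss_penalty = 1, 0.05, 100
    AMAT = hit_time + miss_rate * miss_penalty   # 1 + 0.05 * 100 = 6 cycles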
Cache structure
Each entry in the cache includes:
1. Valid bit to indicate if this position contains valid data or not. At the bootstrap, all the
entries in the cache are marked as INVALID
2. Cache Tag contains the value that uniquely identifies the memory address corresponding to the stored data
3. Cache Data contains a copy of data (block or cache line)
Problem of block placement
Given the address of a block in the main memory, where can the block be placed in the cache (upper level)?
We need to find the correspondence between the memory address of the block and the cache
address of the block, such correspondence depends on the cache architecture:
• Byte offset in the word: to identify the given byte within the word: B bits
If the memory is not byte-addressable: B = 0
• Word offset in the block: to identify the given word within the block: K bits
If the block contains only one word: K = 0
• Index to identify the block: M bits
• Tag to be compared with the cache tag associated to the block selected by the index:
N - (M + K + B) bits, where N is the total number of bits of the memory address (see the sketch below)
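A sketch of this address decomposition for a hypothetical configuration (N = 32 bits, 4-byte words so B = 2, 4-word blocks so K = 2, 256 blocks so M = 8; Python):

    # Tag width: N - (M + K + B) = 32 - 12 = 20 bits.
    N, B, K, M = 32, 2, 2, 8
    def split_address(addr):
        byte_offset = addr & ((1 << B) - 1)
        word_offset = (addr >> B) & ((1 << K) - 1)
        index = (addr >> (B + K)) & ((1 << M) - 1)
        tag = addr >> (B + K + M)
        return tag, index, word_offset, byte_offset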
Fully Associative Cache
In a fully associative cache, the memory block can be placed in any position of the cache
→ All the cache blocks must be checked during the search of the block
The index does not exist in the memory address; besides the offsets, there are only the tag bits
Number of blocks = Cache Size / Block Size
Recap
Increasing associativity
Main advantage: it reduces conflict misses (lower miss rate). Drawback: it increases hit time, area and power consumption.
Block replacement
Main strategies used to choose the block to be replaced in associative caches: Random, LRU (Least Recently Used), FIFO.
Write policies
Write-Back: the information is written only to the block in the cache; the modified block is written to the main memory only when the block is replaced. Main advantages:
• The block can be written by the processor at the frequency at which the cache, and not the main memory, can accept it
• Multiple writes to the same block require only a single write to the main memory
Write-Through: the information is written to both the block in the cache and the block in the main memory.
Write Miss Options: Write Allocate vs No Write Allocate
What happens on a write miss?
Write allocate (or "fetch on write"): allocate a new cache line in the cache, then write (double write to the cache!)
- Usually means that you must do a "read miss" to fill in the rest of the cache line
- Alternative: per-word valid bits
No write allocate (or "write-around"): simply send the write data to the lower-level memory - don't allocate a new cache line!
Write-Back vs Write-Through | Write Allocate vs No Write Allocate
To manage a write miss, both options (Write Allocate and No Write Allocate) can be used with both write policies (Write-Back and Write-Through), but usually:
A Write-Back cache uses the Write Allocate option (hoping that the next writes to the block will again be done in the cache)
A Write-Through cache uses the No Write Allocate option (since the next writes to the block will go to memory anyway)
Hit and Miss: Read and Write
Memory Hierarchy
Programmers want unlimited amounts of memory with low latency
Fast memory technology (SRAM) is more expensive per bit than slower memory (DRAM)
Solution: organize memory system into a hierarchy
- Entire addressable memory space available in largest, slowest memory
- Incrementally smaller and faster memories, each containing a copy of a subset of
the memory below it
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
- This gives the illusion of a large, fast memory being presented to the processor
Classifying Cache Misses
Three major categories of cache misses:
1. Compulsory Misses: cold start misses or first reference misses
2. Capacity Misses: can be reduced by increasing cache size
3. Conflict Misses: can be reduced by increasing cache size and/or associativity
Compulsory Misses: the first access to a block is not in the cache, so the block must be loaded
in the cache from the MM
- there are compulsory misses even in an infinite cache: they are independent of the
cache size
Capacity Misses: if the cache cannot contain all the blocks needed during execution of a
program, capacity misses will occur due to blocks being replaced and later retrieved
- capacity misses decrease as capacity increases
Conflict Misses: if the block-placement strategy is set associative or direct mapped, conflict
misses will occur because a block can be replaced and later retrieved when other blocks map to
the same location in the cache.
- also called collision misses or interference misses
- conflict misses decrease as associativity increases
- fully associative caches avoid all conflict misses, but they consume a large area
Improving Cache Performance
AMAT= Hit Time + Miss Rate * Miss Penalty
How to improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the hit time
Reducing the miss rate
1. A way to reduce capacity misses: to increase the cache capacity
• Drawback: Increases hit time, area, power consumption and cost
2. Increasing the block size reduces the miss rate up to a point when the block size is too
large with respect to the cache size
• Larger block size will reduce compulsory misses taking advantage of spatial locality
• Drawbacks:
- Larger blocks increase miss penalty
- Larger blocks reduce the number of blocks so increase conflict misses (and even
capacity misses) if the cache is small
3. Increasing the associativity reduces the conflict misses
• Drawback: it increases the hit time
4. Reduce Misses (and Miss Penalty) via a “Victim Cache”
• Victim cache is a small fully associative cache used as a buffer to place data discarded
from cache to better exploit temporal locality
• Victim cache placed between cache and its refilling path towards the next lower level in
the hierarchy
• The victim cache is checked on a miss to see if it has the required data before going to the lower-level memory. If the block is found in the victim cache, the victim block and the cache block are swapped (see the sketch below)
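A minimal sketch of this check-and-swap on a miss (dicts stand in for the two structures, with a simplified eviction choice; Python):

    def lookup(cache, victim, addr):
        if addr in cache:
            return cache[addr]                # ordinary cache hit
        if addr in victim:                    # hit in the victim cache: swap blocks
            evicted_addr, evicted_data = cache.popitem()   # simplified eviction
            cache[addr] = victim.pop(addr)
            victim[evicted_addr] = evicted_data
            return cache[addr]
        return None                           # real miss: go to the lower level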
7. Reducing Misses by Software Pre-fetching Data
• Compiler-controlled pre-fetching (the compiler can help in reducing useless pre-
fetching): compiler inserts prefetch LOAD instructions to load data in registers/cache
before they are needed
• Data Pre-fetch
- Register Pre-fetch: Load data into register (HP PA-RISC loads)
- Cache Pre-fetch: Load data into cache (MIPS IV, PowerPC, SPARC v. 9)
• Issuing pre-fetch instructions takes time (instr. overhead)
- Is cost of pre-fetch issues < savings in reduced misses?
- Wider superscalar processors reduce the difficulty of finding issue bandwidth for the pre-fetch instructions (see the example below)
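A back-of-the-envelope version of that cost/benefit question, with made-up numbers (Python):

    prefetch_issues, avoided_misses, miss_penalty = 1000, 300, 50
    cost = prefetch_issues * 1                # one issue slot per pre-fetch
    savings = avoided_misses * miss_penalty   # 15000 cycles of misses avoided
    print(cost < savings)                     # True: pre-fetching pays off here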
Reducing the miss penalty
1. Giving priority to Read Misses over Writes
• Write-Through: the write buffer might hold the updated value of a location needed on a read miss (a RAW conflict through memory)
- Check the contents of the write buffer on a read miss: if there are no conflicts, let
the memory access continue sending the read miss before the write
- Otherwise, the read miss must wait until the write buffer is empty, but this might
increase the read miss penalty
• Write Back
- Let’s consider a read miss replacing a dirty block
- Instead of writing the dirty block to memory, then reading from memory, we could
copy the dirty block to a write buffer, then do the read miss, and then do the write
to memory
- CPU stalls are reduced, since the CPU can restart as soon as the read is done
2. Sub-block Placement
• Don't have to load the full block on a miss: move sub-blocks
• We need valid bits per sub-block to indicate validity
• Drawback: it does not exploit the spatial locality enough
3. Non-blocking Caches to reduce stalls on misses
• “Hit under Miss” reduces the effective miss penalty by working during a miss instead of
stalling CPUs on misses
- Requires an out-of-order execution CPU: the CPU must not stall on a cache miss. For example, the CPU can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
- This approach is a sort of "out-of-order" pipelined memory access
• “Hit under Multiple Miss” or “Miss under Miss” may further lower the effective miss
penalty by overlapping multiple misses
- Requires multiple memory banks to serve multiple misses (otherwise they cannot be supported)
- Significantly increases the complexity of the cache controller as there can be
multiple simultaneous memory accesses
Reducing the hit time
1. Fast Hit Times via Small and Simple L1 Caches
• Direct Mapped, small, and simple on-chip L1 cache
- Critical timing path: addressing tag memory, then comparing tags, then selecting
correct set
- Direct-mapped caches can overlap tag compare and transmission of data
- Lower associativity reduces power because fewer cache lines are accessed
2. Avoiding Address Translation during Indexing of the Cache
3. Pipelined Writes
• Basic idea: pipeline the Tag Check and the Cache Data Update as separate stages: the tag check of the current write overlaps the cache update of the previous write
• The "Delayed Write Buffer" must be checked on reads: either complete the write first or read directly from the write buffer
Cache optimization summary