ACA Notes (Prof. Silvano)
2021/2022
Topics:
The topics are almost the same as those of the pre-exam. Note: the performance part is missing!
1. Some basic concepts about MIPS
2. Static Branch Prediction Techniques
3. Dynamic Branch Prediction Techniques
4. Exception handling
5. Instruction Level Parallelism
6. Scoreboard Dynamic Scheduling Algorithm
7. Tomasulo: Dynamic Scheduling algorithm
8. Tomasulo: Implicit Register Renaming
9. ReOrder Buffer
10. Speculative Tomasulo Architecture with ROB
11. Some basic concepts about cache memories
12. Memory Hierarchy
MIPS pipeline
Performance metrics
• IC = Instruction Count
• #Clock_Cycles = IC + #Stall_Cycles + 4
• CPI = Clock cycles Per Instruction = #Clock_Cycles / IC
• MIPS = fclock / (CPI * 10^6)
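A small worked example applying these formulas (the numbers are made up, just to illustrate; Python):

    # Hypothetical numbers, only to illustrate the formulas above.
    IC = 100                      # instructions executed
    stall_cycles = 20             # total stall cycles
    f_clock = 500e6               # 500 MHz clock
    clock_cycles = IC + stall_cycles + 4      # 5-stage pipeline fill adds 4 cycles
    CPI = clock_cycles / IC                   # 124 / 100 = 1.24
    MIPS = f_clock / (CPI * 10**6)            # 500e6 / 1.24e6 ≈ 403 MIPS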
Branch resolution
• Branch Outcome and Branch Target Address are ready at the end of the EX-stage
• Conditional branches are resolved when the PC is updated at the end of the MEM stage
Static Branch Prediction Techniques
• It is done and fixed at compile time
• It is typically used when the branch behavior for the target application is highly
predictable at compile time
• It can also be used in combination with dynamic predictors
• Early evaluation of the BO, BTA and update of the PC during ID stage
Branch Always Not Taken
• Every branch is simply predicted as not taken at compile time
Backward Taken Forward Not Taken (BTFNT)
• Backward-going branches are predicted as taken (the branches at the end of DO-WHILE loops going back to the beginning of the next loop iteration)
• Forward-going branches are predicted as not taken (the branches of IF-THEN-ELSE conditional statements)
Profile-Driven Prediction
• We profile the behavior of the target application program through several early runs using different data sets; the branch prediction is based on the profiling information collected during those earlier runs
Delayed Branch Technique
• Scheduling technique: the compiler statically schedules an independent instruction in
the Branch Delay Slot
• The instruction in the branch delay slot is always executed, whether or not the branch is taken (it is never flushed)
The job of the compiler is to find a valid and useful instruction to be scheduled in the branch delay slot. There are 4 ways to fill the branch delay slot:
From before
• The branch delay slot is scheduled with an independent instruction from before the
branch
• The instruction is always executed
From target
• The branch delay slot is scheduled with one instruction from the target of the branch
(branch taken)
• If the branch is not taken (misprediction), the instruction in the delay slot needs to be flushed, or it must be safe to execute even when the branch goes in the unexpected direction (in which case the work is wasted)
• It is preferred when the branch is taken with high probability, such as DO-WHILE loop
branches (backward branches)
From fall-through
• The branch delay slot is scheduled with one instruction from the fall-through path
(branch not taken)
• If the branch is taken -> flush or wasted work
• It is preferred when the branch is not taken with high probability, such as IF-THEN-ELSE statements where the ELSE path is less probable (forward branches)
From after
//
The main limitations on delayed branch scheduling arise from:
• The restrictions on the instructions that can be scheduled in the delay slot
• The ability of the compiler to statically predict the outcome of the branch
To improve the ability of the compiler to fill the branch delay slot, most processors have introduced canceling or nullifying branches: the instruction includes the direction in which the branch is predicted to go. In this way, the compiler need not be as conservative when filling the delay slot
The MIPS architecture has the branch-likely instruction, which can be used as a cancel-if-not-taken branch:
• The instruction in the branch delay slot is executed if the branch is taken
• The instruction in the branch delay slot is not executed (it is turned into a nop) if the branch is not taken
It is a useful approach for backward branches (such as loop branches)
Dynamic Branch Prediction Techniques
• Branch Outcome Predictor (BOP): to predict the direction of a branch (taken or not taken)
• Branch Target Predictor or Branch Target Buffer (BTB): to predict the branch target
address in case of taken branch (Predicted Target Address - PTA)
If the branch is predicted by the BOP in the IF stage as not taken, the PC is incremented (the BTB is not useful in this case); otherwise the BTB gives the Predicted Target Address (PTA).
If the BO computed at the end of the ID stage reveals a misprediction, we need to flush the instruction already fetched (it is turned into a nop) and restart execution by fetching at the correct address, paying a one-cycle penalty.
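A minimal sketch of this IF-stage lookup (hypothetical data structures, not the real hardware; Python):

    # BOP maps a branch PC to a predicted direction, BTB maps it to its PTA.
    def next_pc(pc, bop, btb):
        if bop.get(pc, False) and pc in btb:  # predicted taken: fetch from the PTA
            return btb[pc]
        return pc + 4                         # predicted not taken: increment the PC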
Branch History Table (1-bit BHT)
• Table containing 1 bit per entry that says whether the branch was recently taken or not
• Table indexed by the lower portion k-bit of the address of the branch instruction (to
keep the size of the table limited)
• For locality reasons, we expect that the most significant bits of the branch address do not change
• Prediction: hint that it is assumed to be correct and fetching begins in the predicted
direction. If the hint turns out to be wrong, the prediction bit is inverted and stored
back. The pipeline is flushed, and the correct sequence is executed with one cycle
penalty
• The table has no tags (every access is a hit) and the prediction bit could have been put
there by another branch with the same low-order address bits
A misprediction occurs when:
1. The prediction is incorrect for that branch or
2. the same index has been referenced by two different branches, and the previous history
refers to the other branch (This can occur because there is no tag check).
To reduce this problem, it is enough to increase the number of rows in the BHT (that is
to increase k) or to use a hashing function (such as in GShare).
Shortcoming of the 1-bit BHT: in a loop branch, even if the branch is almost always taken and not taken only once (at loop exit), the 1-bit BHT mispredicts twice (rather than once) per loop execution. The scheme causes two wrong predictions:
• At the last loop iteration, since the prediction bit will say taken, while we need to exit
from the loop
• When we re-enter the loop, at the end of the first loop iteration we need to take the
branch to stay in the loop, while the prediction bit says to exit from the loop, since the
prediction bit was flipped on previous execution of the last iteration of the loop
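A quick simulation of this double misprediction on a hypothetical trace (Python):

    # Loop branch taken 4 times then not taken at exit, executed 3 times in a row.
    state = 1                                 # 1 = predict taken, 0 = predict not taken
    mispredictions = 0
    for outcome in ([1] * 4 + [0]) * 3:
        if state != outcome:
            mispredictions += 1
        state = outcome                       # 1-bit scheme: remember the last outcome
    print(mispredictions)  # 5: one at the first exit, then 2 per run (re-entry + exit)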
2-bit Branch History Table
• The prediction must miss twice before it is changed. In a loop branch, at the last loop
iteration, we do not need to change the prediction.
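A sketch of the 2-bit saturating counter behind each entry (Python):

    # States 0-1 predict not taken, 2-3 predict taken; the prediction
    # must miss twice before it flips.
    def predict(counter):
        return counter >= 2

    def update(counter, taken):
        return min(counter + 1, 3) if taken else max(counter - 1, 0)

On the loop trace used above, a counter starting at 3 mispredicts only once per run (at the exit), since the single decrement to 2 still predicts taken on re-entry.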
Correlating Branch Predictors
• The 2-bit BHT uses only the recent behavior of a single branch to predict the future
behavior of that branch.
• Basic Idea: the behavior of recent branches is correlated, that is the recent behavior of
other branches rather than just the current branch (we are trying to predict) can
influence the prediction of the current branch. We try to exploit the correlation existing
among different branches
A (1,1) correlating predictor is composed of two 1-bit predictors with 1 bit of correlation.
It records whether the most recently executed branch was taken or not taken. The branch is predicted based on the previously executed branch, by selecting the appropriate 1-bit BHT
In general, an (m, n) correlating predictor records the outcomes of the last m branches to choose among 2^m BHTs, each of which is an n-bit predictor
A (2, 2) correlating predictor has 4 2-bit Branch History Tables, and it uses the 2-bit global
history to choose among the 4 BHTs
For example, if the predictor has 64 total entries, a 6-bit index is needed, composed of the 2-bit global history and 4 low-order branch address bits (2^4 = 16, k = 4); see the sketch below.
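A sketch of that index computation (hypothetical helper; Python):

    # 64-entry (2,2) predictor: 2 bits of global history select one of the
    # 4 BHTs, 4 low-order PC bits select the entry inside it (6-bit index).
    def index_2_2(pc, global_history):
        return ((global_history & 0x3) << 4) | (pc & 0xF)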
Note that the (2,2) predictor not only outperforms the simple 2-bit predictor with the same
number of total bits, but it also often outperforms a 2-bit predictor with an unlimited number
of entries.
Two-level Adaptive Branch Predictors
The first-level history is recorded in one (or more) k-bit shift register, called Branch History Register (BHR), which records the outcomes of the k most recent branches (e.g., T, NT, NT, T) (used as the global history)
The second level history is recorded in one (or more) tables called Pattern History Table (PHT)
of two-bit saturating counters (used as a local history)
The BHR is used to index the PHT to select which 2-bit counter to use.
Once the two-bit counter is selected, the prediction is made using the same method as in the
two-bit counter scheme
GA Predictor: the global 2-level predictor uses the correlation between the current branch and
the other branches in the global history to make the prediction
Global and local predictor with 2-level predictor: PHT (local history) indexed by the content of
BHR (global history)
GShare: a variation of the GA predictor where we want to correlate the BHR recording the outcomes of the most recent branches (global history) with the low-order bits of the branch address
We make the XOR of 4-bit BHR (global history) with the low-order 4-bit of PC (branch address)
to index the PHT (local history).
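A sketch of the GShare index described here (Python):

    # XOR of the 4-bit BHR with the 4 low-order PC bits gives a 4-bit
    # index into a 16-entry PHT of 2-bit counters.
    def gshare_index(pc, bhr):
        return (pc & 0xF) ^ (bhr & 0xF)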
Branch Target Buffer
It is a cache storing the Predicted Target Address (PTA) expressed as PC-relative
We access the BTB in the IF stage by using the address of the fetched branch instruction to
index the cache
Exception handling
We use the term ‘exception’ to cover not only exceptions but also interrupts (an event that
requests the attention of the processor) and faults.
Synchronous vs asynchronous:
▪ asynchronous events are caused by devices external to the CPU and memory and
can be handled after the completion of the current instruction (easier to handle)
User requested vs coerced:
▪ User requested (such as I/O events) are predictable: treated as exceptions because
they use the same mechanisms that are used to save and restore the state; handled
after the instruction has completed (easier to handle)
▪ Coerced are caused by some HW event not under control of the user program; hard
to implement because they are unpredictable
User maskable vs user nonmaskable:
▪ the mask simply controls whether the HW responds to the exception or not
Within vs between instructions:
▪ exceptions that occur within instructions are usually synchronous since the
instruction triggers the exception. The instruction must be stopped and restarted.
▪ Asynchronous exceptions that occur between instructions arise from catastrophic situations such as HW malfunctions and always cause program termination.
Resume vs terminate:
▪ Terminating event: program’s execution always stops after the interrupt/exception
▪ Resuming event: program’s execution continues after the interrupt/exception
Asynchronous Interrupts
When the processor decides to process the interrupt:
▪ It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (precise interrupt)
▪ It saves the PC of instruction Ii in a special register, the Exception Program Counter (EPC)
▪ It disables interrupts and transfers control to a designated
interrupt handler running in the kernel mode
Interrupt Handler
Needs a way to mask further interrupts, at least until the PC can be saved
Needs to read a status register that indicates the cause of the interrupt
Uses a special indirect jump instruction RFE (Return-From-Exception) which restores the PC and:
- enables interrupts
- restores the processor to the user mode
- restores hardware status and control state
The instruction Ii and the next instructions (Ii+1, …) are restarted
Synchronous Interrupts
A synchronous interrupt (exception) is caused by an instruction in a particular pipeline stage
In general, the instruction Ii cannot be completed and needs to be restarted
after the exception has been handled (in the pipeline this would require undoing the effect of
one or more partially executed instructions)
Precise Interrupts/Exceptions
An interrupt or exception is precise if there is a single instruction (the interrupt point) for which all preceding instructions have committed their state and no following instruction (including the interrupting instruction Ii) has modified any state.
This means, effectively, that we can restart execution at the interrupt point and “get the right
answer”
Exceptions in 5-stage Pipeline
Exceptions may occur at different stages in pipeline => Exceptions may be raised out of order
Example:
- Data page fault occurs in memory stage of first instruction
- Arithmetic exception occurs in execution stage of second instruction
Due to pipelined execution, the page fault occurs simultaneously with the arithmetic overflow, but the data page fault must be handled first, since it belongs to the earlier instruction
Exception handling:
▪ Write in Data Memory and Write Back in RF are disabled to guarantee a precise
interrupt model
▪ Hold exception flags and PC in pipeline until commit point (M stage)
▪ Exceptions in earlier pipe stages override later exceptions for a given instruction
▪ Inject external interrupts at commit point (override others)
Instruction Level Parallelism
The max throughput would be to complete one Instruction Per Clock cycle: IPC_ideal = 1 => CPI_ideal = 1
In a pipelined machine, actual CPI is derived as:
CPI_pipeline = CPI_ideal + Structural Stalls + Data Hazard Stalls + Control Stalls + Memory Stalls
Reducing any right-hand term brings CPI_pipeline closer to CPI_ideal and increases the Instructions Per Clock: IPC = 1 / CPI
Different types of hazards limit performance:
• Data Dependences
- True data dependence: Read After Write hazards
• Name Dependences (use same register or memory location)
- Anti-dependence: When Ij writes a register or memory location that instruction Ii
reads (it can generate a Write After Read hazard)
- Output Dependence: When Ii and Ij write the same register or memory location (it
can generate a Write After Write hazard)
Name dependences are not true data dependences since there is no value (no data
flow) being transmitted between instructions
If the name (register number or memory location) used in the instructions could be
changed (register renaming), the instructions do not conflict anymore
Register renaming can be done more easily if there are enough registers available in the ISA; it can be done either statically by the compiler or dynamically by the hardware
Program properties
Two properties are critical to preserve program correctness:
1. Data flow: Actual flow of data values among instructions that produces the correct
results and consumes them
2. Exception behavior: Preserving exception behavior means that any changes in the
ordering of instruction execution must not change how exceptions are raised in the
program
Multi-cycle pipelining
• ID stage split in 2 stages: Instr. Decode (ID) & Register Read (Issue)
• Multiple functional units with variable latency
• Long latency multi-cycle floating-point instructions
• Memory systems with variable access time: multi-cycle memory accesses due to data
cache misses (unpredictable statically)
• No more single commit point => out-of-order completion, and we must check for WAR & WAW hazards
Instruction level parallelism
Exploit potential overlap of execution among unrelated instructions
Multiple-issue pipeline
In a multiple-issue pipelined processor, the ideal CPI would be CPI_ideal < 1
If we consider for example a dual-issue processor, the max throughput would be to complete 2 Instructions Per Clock cycle: IPC_ideal = 2; CPI_ideal = 0.5
Two strategies to support ILP: dynamic scheduling (the hardware locates the parallelism at runtime) and static scheduling (the compiler identifies the parallelism).
Dynamic scheduling
• Problem: hazards due to data dependences that cannot be solved by forwarding cause stalls of the pipeline: no new instructions are fetched nor issued, even if they are not data dependent
• Solution: allow data-independent instructions behind a stall to proceed
• The hardware reorders the instruction execution to reduce stalls, while maintaining data flow and exception behavior
• Enables out-of-order execution and completion (commit)
Instructions are fetched and issued in program order (in-order-issue)
Execution begins as soon as operands are available (possibly, out of order execution)
Out-of-order execution introduces the possibility of WAR and WAW data hazards
Out-of-order execution implies out-of-order completion, unless there is a re-order buffer to get in-order completion
• Problem with out-of-order completion: we must preserve the exception behavior as in in-order execution
• Solution: ensure that no instruction can generate an exception until the processor
knows that the instruction raising the exception will be executed
An exception is imprecise if the processor state when an exception is raised does not look
exactly as if the instructions were executed in-order. Imprecise exceptions can occur because:
• The pipeline may have already completed instructions that are later in program order
than the instruction causing the exception
• The pipeline may have not yet completed some instructions that are earlier in program
order than the instruction causing the exception
Imprecise exception makes it difficult to restart execution after handling
Main limitations of dynamic scheduling:
• Very expensive logic is needed to decide dependences and independences, i.e., to decide which instructions can be issued every clock cycle
• It does not scale: it is almost impractical to make the issue width greater than 4 (we would have to slow down the clock)
Static scheduling -> VLIW processors
Static detection and resolution of dependences, which are avoided by code reordering.
Output of the compiler: code reordered into dependency-free code
Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). The size of a basic block (a straight-line code sequence with no branches in, except to the entry, and no branches out, except at the exit) is usually quite small, and the amount of parallelism available within a basic block is quite small.
Data dependence can further limit the amount of ILP we can exploit within a basic block to
much less than the average basic block size
To obtain substantial performance enhancements, we must exploit ILP across multiple basic
blocks (i.e. across branches such as in trace scheduling).
Main Limits of Static Scheduling:
• Unpredictable branches
• Variable memory latency (unpredictable cache misses)
• Code size explosion
• Compiler complexity
• Code portability
• Performance portability
Limits of Multiple-Issue processors
The issue width is the number of instructions that can be issued in a single cycle by a multiple
issue processor
It is too hard to decide which 8, or 16, instructions can execute every cycle (too many combinations!): it takes too long to compute, so the frequency of the processor would have to be decreased
There is also a limitation due to the intrinsic level of parallelism available in the program.
Scoreboard Dynamic Scheduling Algorithm
• Every instruction goes through the Scoreboard, where a record of data dependences is
constructed
• The Scoreboard then determines when the instruction can read its operand and begin
execution
• If the scoreboard decides the instruction cannot execute immediately, it monitors every
change and decides when the instruction can execute
• The scoreboard controls when the instruction can write its result into destination
register
Four Stages of Scoreboard Control
1. Issue: decode instruction and check for structural hazards & WAW hazards
2. Read Operands: wait until no RAW hazards, then read the operands. Check for structural hazards in reading the RF
3. Execution: the functional unit begins execution upon receiving the operands; when the result is ready, it notifies the scoreboard that it has completed execution
4. Write Result: once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If none, it writes the result. If a WAR hazard exists, it stalls the completing instruction (see the sketch below).
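A simplified sketch of this Write Result check (illustrative data structures, not the real scoreboard tables; Python):

    # The completing instruction may write its destination Fi only if no
    # earlier issued instruction still has to read Fi as a source (WAR check).
    def can_write_result(completing, earlier_pending):
        dest = completing["Fi"]
        for instr in earlier_pending:
            if dest in instr["unread_sources"]:
                return False                  # WAR hazard: stall the completing instr.
        return True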
Scoreboard optimizations
Tomasulo: Dynamic Scheduling algorithm
It introduces the Implicit Register Renaming to avoid WAR & WAW hazards to get high
performance at runtime without special compilers
It introduces some FU buffers called “Reservation Stations” in front of the FUs to keep pending
operands
The control logic & FU buffers are distributed with the Functional Units (vs. centralized in the scoreboard)
Registers in instructions are replaced by their values or pointers to reservation stations (RS) to
enable Implicit Register Renaming
Tomasulo stages
1. Issue (In-Order)
- Get an instruction I from the head of instruction queue (maintained in FIFO order to
ensure in-order issue)
- Check if an RS is empty (i.e., check for structural hazards in the RSs), otherwise stall. If the operands are not in the RF, keep track of the FUs that will produce them (Q pointers)
2. Execution (Out-of-Order): when both operands are ready, execute the operation on the FU; if an operand is not yet available, watch the Common Data Bus (CDB) for it (this resolves RAW hazards)
3. Write result (Out-of-Order): when the result is available, write it on the CDB, and from there into the RF and into all the RSs waiting for it; then mark the reservation station as available
Tomasulo: Implicit Register Renaming
How can Tomasulo overlap iterations of loops?
Register renaming provided by Reservation Stations (which buffer the operands of
instructions) to eliminate WAR and WAW hazards
• Multiple iterations use different physical destinations for registers (dynamic loop
unrolling without changing the code)
• Replace static register names from code with dynamic register “pointers”
• Effectively increases the size of Register File
• Permit instruction issue to advance past integer control flow operations
Crucial: integer unit must “get ahead” of floating-point unit so that we can issue multiple
iterations (by using branch prediction)
Example of compiler transformation called Loop Unrolling combined with Register Renaming by
using more registers specified in the ISA:
Unrolled Loop (unrolling factor 4) + Reg. Renaming + Code Rescheduling to minimize RAW stalls
Pros with respect to static loop unrolling:
• Static loop unrolling uses more instruction cache than standard Tomasulo
• Static loop unrolling uses a lot of ISA registers
• With static loop unrolling, the unrolling factor must be chosen carefully at compile time
Explicit Register Renaming: use a physical register file larger than the set of registers specified by the ISA
• On issue, each instruction that writes a result is allocated a new physical register from a freelist
• An ISA register → physical register mapping is maintained
• When a register is written, its mapping entry is replaced with a new register from the freelist
• A physical register becomes free when it is no longer used by any active instruction
Advantages:
ReOrder Buffer
We introduce the concept of HW-based speculation, which extends the ideas of dynamic
scheduling beyond branches
HW-based speculation combines 3 key concepts:
1. Dynamic branch prediction
2. Speculation to enable the execution of instructions before the control dependences are resolved, with the ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling beyond branches
• Buffer to hold the results of instructions that have finished execution but not yet
committed
• Buffer to pass results among instructions that have started speculatively after a branch,
so the renaming function of Reservation Stations is replaced by ROB entries
• Reservation Stations now are used only to buffer instructions and operands to FUs (to
reduce structural hazards)
• ROB supplies operands to other RSs between execution complete & commit
• Once the instruction commits, the result is written in the register file
• Originally (1988) introduced to maintain the precise interrupt model → Processors with
ROB can execute dynamically while maintaining a precise interrupt model because
instruction commit happens in order
The ROB is a circular buffer with the head pointer indicating the instruction that will commit
(leaving ROB) first and with the tail pointer (indicating next free entry)
Instructions are written in ROB in strict program order – when an instruction is issued, an entry
is allocated to the ROB in sequence (in-order)
Each entry must keep the status of an instruction: issued, execution started, execution
completed, ready to commit
Each entry includes a speculative status flag, indicating whether the instruction is being
speculatively executed or not
An instruction is ready to commit (retire) if and only if:
1. It has finished, and
2. it is not marked as speculative, and
3. all previous instructions have already retired.
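A sketch of this retire test over the circular buffer (field names are illustrative; Python):

    # rob: list of entries ordered from head (oldest) to tail; since only the
    # head can retire, condition 3 (all previous retired) holds by construction.
    def commit(rob, write_back):
        while rob and rob[0]["finished"] and not rob[0]["speculative"]:
            write_back(rob.pop(0))            # update RF/memory, free the entry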
Speculative Tomasulo Architecture with ROB
Pointers are directed towards the ROB entries instead of reservation stations
Implicit register renaming by using the ROB entries to eliminate WAR and WAW hazards and
enable dynamic loop unrolling
A register or memory location is updated only when the instruction reaches the head of ROB
There are Load buffers, while Store Buffers are completely replaced by the ReOrder Buffer.
By using the RoB to hold the result, it is easy to:
1. Undo speculated instructions on mispredicted branches
2. Manage exceptions precisely
Four Phases of Speculative Tomasulo Algorithm
1. Issue (or dispatch): get an instruction from the instruction queue
• If there are one free RS entry and one free ROB entry → the instruction is issued
• If its operands are available in the RF or in the ROB, they are sent to the RS; the number of the ROB entry allocated for the result is sent to the RS (to tag the result when it will be placed on the CDB)
• If the RS and/or the ROB is full: the instruction stalls
2. Execution (out-of-order): if one or more operands are not yet available, monitor the CDB while waiting for them to be computed (this resolves RAW hazards); when both operands are ready, execute the operation
3. Write result (out-of-order): when the result is available, write it on the CDB (tagged with its ROB entry number), and from there into the ROB and into any RSs waiting for it; mark the RS as available
4. Commit (in-order): update the RF or memory with the result at the head of the ROB
• When the instruction is at the head of the ROB and its result is ready, update the RF with the result (or store to memory) and remove the instruction from the ROB (commit is sometimes called "graduation")
• A mispredicted branch flushes the ROB
There are 3 different possible sequences:
a. Normal commit: instruction reaches the head of the ROB; result is present in
the buffer. Result is stored in the Register File, instruction is removed from
ROB
b. Store commit: as above, but result is stored in memory rather than in the RF
c. Instruction is a branch with incorrect prediction: it indicates that the speculation was wrong. The ROB is flushed and execution restarts at the correct successor of the branch. If the branch was correctly predicted, the branch is finished
• Instruction type field indicates whether instruction is a branch (no destination result), a
store (has memory address destination), or a load/ALU (register destination)
• Destination field supplies register number (for loads and ALU instructions) or memory
address (for stores) where results should be written
• Value field used to hold value of result until instruction commits
• Ready field indicates that instruction has completed execution, value is ready
• Speculative field indicates whether instruction is executed speculatively or not
Interrupt handling
Exceptions generated in connection with instruction execution are made precise by accepting the exception request only when the instruction becomes "ready to retire" (exceptions are processed in order)
Some basic concepts about cache memories
Main goal: to increase the performance of the computer through the memory system, in order to:
• Provide the user with the illusion of a memory that is simultaneously large and fast
• Provide the data to the processor at high frequency
Problem: Processor-memory performance gap
Solution: to exploit the principle of locality of references
• Temporal Locality: when there is a reference to one memory element, the trend is to refer again to the same memory element soon (e.g., instructions and data reused in loop bodies)
• Spatial Locality: when there is a reference to one memory element, the trend is to refer soon to other memory elements whose addresses are close by (e.g., sequences of instructions, or accesses to data organized as arrays or matrices)
Caches exploit both types of predictability:
- Exploit temporal locality by keeping the contents of recently accessed locations
- Exploit spatial locality by fetching blocks of data around recently accessed locations
Definitions
Hit Rate: Number of memory accesses that find the data in the upper level with respect to the
total number of memory accesses
Hit Rate = # hits / # memory accesses
Hit Time: time to access the data in the upper level of the hierarchy, including the time needed
to decide if the attempt of access will result in a hit or miss
Miss Rate: number of memory accesses not finding the data in the upper level with respect to
the total number of memory accesses
Miss Rate = # misses / # memory accesses
Hit Rate + Miss Rate = 1
Miss Penalty: time needed to access the lower level and to replace the block in the upper level
Miss Time = Hit Time + Miss Penalty
Average Memory Access time = AMAT= Hit Time + Miss Rate * Miss Penalty
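A worked example with made-up numbers (Python):

    # Hit time 1 cycle, miss rate 5%, miss penalty 100 cycles.
    hit_time, miss_rate, miss_penalty = 1, 0.05, 100
    AMAT = hit_time + miss_rate * miss_penalty   # 1 + 0.05 * 100 = 6 cycles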
Cache structure
Each entry in the cache includes:
1. Valid bit to indicate if this position contains valid data or not. At the bootstrap, all the
entries in the cache are marked as INVALID
2. Cache Tag contains the value that uniquely identifies the memory address corresponding to the stored data
3. Cache Data contains a copy of data (block or cache line)
Problem of block placement
Given the address of a block in the main memory, where can the block be placed in the cache (upper level)?
We need to find the correspondence between the memory address of the block and the cache
address of the block, such correspondence depends on the cache architecture:
• Byte offset in the word: to identify the given byte within the word: B bits
If the memory is not byte-addressable: B = 0
• Word offset in the block: to identify the given word within the block: K bits
If the block contains only one word: K = 0
• Index to identify the block: M bits
• Tag to be compared with the cache tag associated to the block selected by the index:
N - (M + K + B) bits, where N is the total number of bits of the memory address (see the sketch below)
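A sketch of this address decomposition for a hypothetical configuration (N = 32 bits, 4-byte words so B = 2, 4-word blocks so K = 2, 256 blocks so M = 8; Python):

    # Tag width: N - (M + K + B) = 32 - 12 = 20 bits.
    N, B, K, M = 32, 2, 2, 8
    def split_address(addr):
        byte_offset = addr & ((1 << B) - 1)
        word_offset = (addr >> B) & ((1 << K) - 1)
        index = (addr >> (B + K)) & ((1 << M) - 1)
        tag = addr >> (B + K + M)
        return tag, index, word_offset, byte_offset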
Fully Associative Cache
In a fully associative cache, the memory block can be placed in any position of the cache
→ All the cache blocks must be checked during the search of the block
The index does not exist in the memory address; besides the offsets, there are only the tag bits
Number of blocks = Cache Size / Block Size
Recap
Increasing associativity
Main advantage: it reduces conflict misses (lower miss rate). Drawback: it increases hit time, area and power consumption.
Block replacement
Main strategies used to choose the block to be replaced in associative caches: Random, LRU (Least Recently Used), FIFO.
Write policies
Write-Back: the information is written only to the block in the cache; the modified block is written to the main memory only when the block is replaced. Main advantages:
• The block can be written by the processor at the frequency at which the cache, and not the main memory, can accept it
• Multiple writes to the same block require only a single write to the main memory
Write-Through: the information is written to both the block in the cache and the block in the main memory.
Write Miss Options: Write Allocate vs No Write Allocate
What happens on a write miss?
Write allocate (or "fetch on write"): allocate a new cache line in the cache, then write (double write to the cache!)
- Usually means that you must do a "read miss" to fill in the rest of the cache line
- Alternative: per-word valid bits
No write allocate (or "write-around"): simply send the write data to the lower-level memory - don't allocate a new cache line!
Write-Back vs Write-Through | Write Allocate vs No Write Allocate
To manage a write miss, both options (Write Allocate and No Write Allocate) can be used with both write policies (Write-Back and Write-Through), but usually:
A Write-Back cache uses the Write Allocate option (hoping that the next writes to the block will again be done in the cache)
A Write-Through cache uses the No Write Allocate option (since the next writes to the block will go to memory anyway)
Hit and Miss: Read and Write
Memory Hierarchy
Programmers want unlimited amounts of memory with low latency
Fast memory technology (SRAM) is more expensive per bit than slower memory (DRAM)
Solution: organize memory system into a hierarchy
- Entire addressable memory space available in largest, slowest memory
- Incrementally smaller and faster memories, each containing a copy of a subset of
the memory below it
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
- This gives the illusion of a large, fast memory being presented to the processor
Classifying Cache Misses
Three major categories of cache misses:
1. Compulsory Misses: cold start misses or first reference misses
2. Capacity Misses: can be reduced by increasing cache size
3. Conflict Misses: can be reduced by increasing cache size and/or associativity
Compulsory Misses: the first access to a block is not in the cache, so the block must be loaded
in the cache from the MM
- there are compulsory misses even in an infinite cache: they are independent of the
cache size
Capacity Misses: if the cache cannot contain all the blocks needed during execution of a
program, capacity misses will occur due to blocks being replaced and later retrieved
- capacity misses decrease as capacity increases
Conflict Misses: if the block-placement strategy is set associative or direct mapped, conflict
misses will occur because a block can be replaced and later retrieved when other blocks map to
the same location in the cache.
- also called collision misses or interference misses
- conflict misses decrease as associativity increases
- fully associative caches avoid all conflict misses, but they consume a large area
Improving Cache Performance
AMAT= Hit Time + Miss Rate * Miss Penalty
How to improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the hit time
Reducing the miss rate
1. A way to reduce capacity misses: to increase the cache capacity
• Drawback: Increases hit time, area, power consumption and cost
2. Increasing the block size reduces the miss rate up to a point when the block size is too
large with respect to the cache size
• Larger block size will reduce compulsory misses taking advantage of spatial locality
• Drawbacks:
- Larger blocks increase miss penalty
- Larger blocks reduce the number of blocks so increase conflict misses (and even
capacity misses) if the cache is small
3. Increasing the associativity reduces the conflict misses
• Drawback: it increases the hit time
4. Reduce Misses (and Miss Penalty) via a “Victim Cache”
• Victim cache is a small fully associative cache used as a buffer to place data discarded
from cache to better exploit temporal locality
• Victim cache placed between cache and its refilling path towards the next lower level in
the hierarchy
• The victim cache is checked on a miss to see if it has the required data before going to the lower-level memory. If the block is found in the victim cache, the victim block and the cache block are swapped (see the sketch below)
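A minimal sketch of this check-and-swap on a miss (dicts stand in for the two structures, with a simplified eviction choice; Python):

    def lookup(cache, victim, addr):
        if addr in cache:
            return cache[addr]                # ordinary cache hit
        if addr in victim:                    # hit in the victim cache: swap blocks
            evicted_addr, evicted_data = cache.popitem()   # simplified eviction
            cache[addr] = victim.pop(addr)
            victim[evicted_addr] = evicted_data
            return cache[addr]
        return None                           # real miss: go to the lower level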
7. Reducing Misses by Software Pre-fetching Data
• Compiler-controlled pre-fetching (the compiler can help in reducing useless pre-
fetching): compiler inserts prefetch LOAD instructions to load data in registers/cache
before they are needed
• Data Pre-fetch
- Register Pre-fetch: Load data into register (HP PA-RISC loads)
- Cache Pre-fetch: Load data into cache (MIPS IV, PowerPC, SPARC v. 9)
• Issuing pre-fetch instructions takes time (instr. overhead)
- Is cost of pre-fetch issues < savings in reduced misses?
- Wider superscalar processors reduce the difficulty of finding issue bandwidth for the pre-fetch instructions (see the example below)
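A back-of-the-envelope version of that cost/benefit question, with made-up numbers (Python):

    prefetch_issues, avoided_misses, miss_penalty = 1000, 300, 50
    cost = prefetch_issues * 1                # one issue slot per pre-fetch
    savings = avoided_misses * miss_penalty   # 15000 cycles of misses avoided
    print(cost < savings)                     # True: pre-fetching pays off here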
Reducing the miss penalty
1. Giving priority to Read Misses over Writes
• Write-Through: the write buffer might hold the updated value of a location needed on a read miss (a RAW conflict through memory)
- Check the contents of the write buffer on a read miss: if there are no conflicts, let
the memory access continue sending the read miss before the write
- Otherwise, the read miss must wait until the write buffer is empty, but this might
increase the read miss penalty
• Write Back
- Let’s consider a read miss replacing a dirty block
- Instead of writing the dirty block to memory, then reading from memory, we could
copy the dirty block to a write buffer, then do the read miss, and then do the write
to memory
- CPU stalls are reduced, since the CPU can restart as soon as the read is done
2. Sub-block Placement
• Don't have to load the full block on a miss: move sub-blocks
• We need valid bits per sub-block to indicate validity
• Drawback: it does not exploit the spatial locality enough
3. Non-blocking Caches to reduce stalls on misses
• “Hit under Miss” reduces the effective miss penalty by working during a miss instead of
stalling CPUs on misses
- Requires an out-of-order execution CPU: the CPU must not stall on a cache miss. For example, the CPU can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
- This approach is a sort of "out-of-order" pipelined memory access
• “Hit under Multiple Miss” or “Miss under Miss” may further lower the effective miss
penalty by overlapping multiple misses
- Requires multiple memory banks to serve multiple misses (otherwise they cannot be supported)
- Significantly increases the complexity of the cache controller as there can be
multiple simultaneous memory accesses
Reducing the hit time
1. Fast Hit Times via Small and Simple L1 Caches
• Direct Mapped, small, and simple on-chip L1 cache
- Critical timing path: addressing tag memory, then comparing tags, then selecting
correct set
- Direct-mapped caches can overlap tag compare and transmission of data
- Lower associativity reduces power because fewer cache lines are accessed
2. Avoiding Address Translation during Indexing of the Cache
3. Pipelined Writes
• Basic idea: pipeline the Tag Check and the Cache Data Update as separate stages: the tag check of the current write overlaps the cache update of the previous write
• The "Delayed Write Buffer" must be checked on reads: either complete the write first or read directly from the write buffer
Cache optimization summary