COA-UNIT-III-FINAL (1)
COA-UNIT-III-FINAL (1)
ARCHITECTURE
UNIT-III
Course Outcome
CLR-3: Understand the concepts of Pipelining and basic processing units
CLO-3 : Analyze the detailed operation of Basic Processing units and the
performance of Pipelining
1
Topics Covered
• Fundamental concepts of basic processing unit
• Performing ALU operation
• Execution of complete instruction, Branch instruction
• Multiple bus organization
• Hardwired control,
• Generation of control signals
• Micro-programmed control, Microinstruction
• Micro-program Sequencing
• Micro instruction with Next address field
• Basic concepts of pipelining
• Pipeline Performance
• Pipeline Hazards-Data hazards, Methods to overcome Data hazards
• Instruction Hazards
• Hazards on conditional and Unconditional Branching
• Control hazards 2
PROCESSING UNIT
FUNCTIONS OF CPU:
• CPU carries out all forms of data processing tasks.
• It saves information, intermediate results and instructions.
• CPU monitors the functionality of all computer components.
COMPONENTS OF CPU:
• Register: Stores data and result and speeds up the operation
• Control unit: This unit monitors all computing processes but does not
execute actual data processing.
• Arithmetic Logic Unit (ALU): This does all the calculations and makes
the decisions.
3
FUNDAMENTAL CONCEPTS OF BASIC
PROCESSING UNIT
• Processor fetches one instruction at a time and perform the specified operation.
• Instructions are fetched from successive memory locations except for branch/ jump
instruction.
• The address of the next instruction to be executed is tracked by the Program Counter
(PC) register.
• Instruction Register (IR) contains instruction that is currently executed.
• Instruction execution happens in three phases:
✔ Fetch: Fetch the instruction from the specified memory
✔Decode: Determined the opcode and the operands
✔Execute: Run the instruction
4
EXECUTING AN INSTRUCTION
• Fetch the contents of memory location pointed by the PC. The
contents of this memory location is loaded to the IR-Fetch phase
IR🡨 [[PC]]
• Increment the PC by 4 (assume the word size as 4 )
PC🡨 [PC]+4
• Carry out the actions specified by the instruction in the IR-Execution
phase
• MDR: Two inputs and two outputs since data can be loaded from
memory or processor bus.
• MAR: Input line is connected to internal bus and output line to
external bus
• Control lines: connected to instruction decoder and control logic
block to issue control signals
• R0-R(n-1): General Purpose registers whose numbers vary between
processors.
• TEMP, Y and Z: temporary registers used by the processor during
instruction execution
Fig : Single bus organization of
• The registers, the ALU, and the interconnecting bus are collectively
datapath
referred to as the datapath. 5
Executing an Instruction
With few exceptions, an instruction can be executed by performing
one or more of the following operations in some specified
sequence:
❑Transfer a word of data from one processor register
to another or to the ALU.
❑Perform an arithmetic or a logic operation and
store the result in a processor register.
❑Fetch the contents of a given memory location and load them
into a processor register.
❑Store a word of data from a processor register into a given
memory location.
Register Transfers
❑ Instruction execution involves a sequence of steps in which data are
transferred from one register to another.
❑ For each register, two control signals are used to place the contents of
that register on the bus or to load the data on the bus into the register.
❑ The input and output of register Ri are connected to the bus
via switches controlled by the signals
Riout Riin
and
respectively.
❑When Riin is set to 1, the data on the bus are loaded into Ri.
❑Similarly, when Riout is set to 1, the contents of register Ri are placed on
the bus.
❑While Riout is equal to 0, the bus can be used for transferring data from
other registers.
Register Transfers
Internal processor
bus
Riin
Ri
Riout
Yin
Constant 4
Select MUX
A B
ALU
Zin
Zout
Figure 7.2. Input and output gating for the registers in Figure 7.1.
Performing an Arithmetic or Logic Operation
❑ The ALU is a combinational circuit that has no internal storage.
❑ ALU gets the two operands from MUX and bus. The result is
temporarily stored in register Z.
❑ What is the sequence of operations to add the contents of register
R1 to those of R2 and store the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
Performing an Arithmetic or Logic Operation
❑ In step 1, the output of register R1 and the input of register Y are enabled,
causing the contents of R1 to be transferred over the bus to Y.
❑ In step 2, the multiplexer's Select signal is set to SelectY, causing the
multiplexer to gate the contents of register Y to input A of the ALU.
❑ At the same time, the contents of register R2 are gated onto the bus and,
hence, to input B.
Performing an Arithmetic or Logic Operation
❑ The function performed by the ALU depends on the signals applied to
its control lines.
❑ In this case, the Add line is set to 1, causing the output of the ALU to
be the sum of the two numbers at inputs A and B.
❑ This sum is loaded into register Z because its input control signal is
activated.
❑ In step 3, the contents of register Z are transferred to the destination
register, R3.
❑ This last transfer cannot be carried out during step 2, because only one
register output can be connected to the bus during any clock cycle.
Fetching a Word from Memory
❑ To fetch a word of information from memory, the processor has to
specify the address of the memory location where this information
is stored and request a Read operation.
❑ This applies whether the information to be fetched represents an
instruction in a program or an operand specified by an instruction.
❑ The processor transfers the required address to the MAR, whose
output is connected to the address lines of the memory bus.
Fetching a Word from Memory
❑ At the same time, the processor uses the control lines of the
memory bus to indicate that a Read operation is needed.
❑ When the requested data are received from the memory they are
stored in register MDR, from where they can be transferred to other
registers in the processor.
❑ The connections for register MDR are illustrated in Figure 7.4 on
next slide.
❑ It has four control signals: MDRin and MDRout control the connection
to the internal bus, and MDRin E and MDRout E control the connection
to the external bus.
Fetching a Word from Memory
❑ Address into MAR; issue Read operation; data into MDR.
Memory-bus Internal process bus
data lines MDRoutE MDRout
MDR
MDRinE MDRin
Figure
Figure 7.4. Connection
7.4. and control
Connection andsignals forsignals
control register fogisterr
MDR. re MDR.
Fetching a Word from Memory
❑ As an example of a read operation, consider the instruction Move (R1), R2. The
actions needed to execute this instruction are:
❑ MAR ← [R1]
❑ Start a Read operation on the memory bus
❑ Wait for the MFC response from the memory
❑ Load MDR from the memory bus
❑ R2 ← [MDR]
❑ These actions may be carried out as separate steps, but some can be combined
into a single step.
❑ Each action can be completed in one clock cycle, except action 3 which requires
one or more clock cycles, depending on the speed of the addressed device.
Fetching a Word from Memory
❑ A Read control signal is activated at the same time MAR is loaded.
❑ The data received from the memory are loaded into MDR at the end of the clock
cycle in which the MFC signal is received.
❑ In the next clock cycle, MDRout is activated to transfer the data to register R2.
❑ This means that the memory read operation requires three steps, which can be
described by the signals being activated as follows:
1. R1out,MARin, Read
2. MDRin E, WMFC
3. MDRout R2in
Storing a Word in Memory
❑ The desired address is loaded into MAR.
❑ Then, the data to be written are loaded into MDR, and a Write command is
issued.
❑ Hence, executing the instruction Move R2,(Rl) requires the
following sequence:
1. R1out,MARin
2. R2out, MDRin, Write
3. MDRout E, WMFC
❑ The processor remains in step 3 until the memory operation is completed and
an MFC response is received.
Execution of a Complete Instruction
❑ Consider the instruction
Add (R3), R1
❑ Executing this instruction requires the following
actions:
❑ Fetch the instruction
❑ Fetch the first operand (the contents of the
memory location pointed to by R3)
❑ Perform the addition
❑ Load the result into R1
Execution of a Complete Instruction
Internal processor bus
PC
Instruction decoder
Step Action Address and control logic
lines
MAR
Constant 4 R0
5
6 MDR out , SelectY, Add, Zin Zout , R1in Select MUX
7 , End
Add
A B
ALU Sub R n-1
control ALU
lines
Carry-in
TEMP
Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1. XOR
Z
Step Action
5 End
Incrementer
single-bus structure of processing unit to
illustrate the basic ideas. PC
Register f ile
MUX
A
cycle.
Instruction
decoder
Step Action
Figure 7.9. Control sequence for the instruction. Add R4,R5,R6, for the three-bus
organization in Figure 7.8.
Multiple-Bus Organization
❑ In step 1, the contents of the PC are passed through the ALU, using the R=B
control signal, and loaded into the MAR to start a memory read operation.
❑ At the same time the PC is incremented by 4.
❑ In step 2, the processor waits for MFC and loads the data received into MDR,
then transfers them to IR in step 3.
❑ Finally, the execution phase of the instruction requires only one control step to
complete, step 4.
❑ By providing more paths for data transfer a significant reduction in the number
of clock cycles needed to execute an instruction is achieved.
Hardwired Control
35
Overview
• To execute instructions, the processor must have some means of
generating the control signals needed in the proper sequence.
• Two categories: hardwired control and microprogrammed control
• Hardwired system can operate at high speed; but with little flexibility.
36
7 Steps:
37
Control Unit Organization
CLK Control step
Clock
counter
External
inputs
Decoder/
IR encoder
Condition
codes
Control signals
Step decoder
T1 T2
T
n
INS1
INS2 External
inputs
Instruction
IR decoder Encoder
Condition
codes
INSm
Run End
Control signals
Figure 7.12.
Generation of the Zin
Generating End
❑ End = T7 • ADD + T5 • BR + (T5 • N + T4 • N) •
BRN +… Add
Branch<0
Branch
N N
T7 T5 T5
T4
End
Instruction Data
cache cache
Bus interface
Processor
System
bus
Main Input/
memory Output
46
Microprogrammed
Control
Microprogrammed Control Unit
• Control signals
• Group of bits used to select paths in multiplexers, decoders, arithmetic logic
units
• Control variables
• Binary variables specify microoperations
• Certain microoperations initiated while others idle
• Control word
• is a word whose individual bits represent the various control signals
48
Microprogrammed Control Unit
• Control memory
• The microroutines for all instructions in the instruction set of a
computer are stored in a special memory called the control
store/control memory
• Microinstructions
• A sequence of CWs corresponding to the control sequence of a
machine instruction constitutes the microroutine for that instruction,
and the individual control words in this microroutine are referred to as
microinstructions
Microprogram
• Sequence of microinstructions
49
Control Unit Implementation
•
Hardwired Memory Instruction code
Combinational . Control
Sequence Counter .
Logic Circuits signals
•
Microprogrammed CAR: Control Address Register
Memory Instruction code CDR: Control Data Register
50
Control Memory
• Read-only memory (ROM)
• Content of word in ROM at given address specifies microinstruction
• Each computer instruction initiates series of microinstructions
(microprogram) in control memory
• These microinstructions generate microoperations to
• Fetch instruction from main memory
• Evaluate effective address
• Execute operation specified by instruction
• Return control to fetch phase for next instruction
Control
Address Control word
memory
(microinstruction)
(ROM)
51
Microprogrammed Control
Organization
External Next Address Control
input CDR Control
Generator CAR Memory word
(sequencer) (ROM)
• Control memory
• Contains microprograms (set of microinstructions)
• Microinstruction contains
• Bits initiate microoperations
• Bits determine address of next microinstruction
• Control address register (CAR)
• Specifies address of next microinstruction
52
Microprogrammed Control Organization
• Next address generator (microprogram sequencer)
• Determines address sequence for control memory
• Microprogram sequencer functions
• Increment CAR by one
• Transfer external address into CAR
• Load initial address into CAR to start control operations
53
Microprogrammed Control Organization
• Control data register (CDR)- or pipeline register
• Holds microinstruction read from control memory
• Allows execution of microoperations specified by control word
simultaneously with generation of next microinstruction
• Control unit can operate without CDR
54
Microinstruction Sequencing:
A micro-program control unit can be viewed as consisting of two parts:
55
Microinstruction Sequencing:
A micro-program sequencer attached to a control memory inputs certain
bits of the microinstruction, from which it determines the next address for
control memory. A typical sequencer provides the following address-
sequencing capabilities:
Increment the present address for control memory.
Branches to an address as specified by the address field of the micro
instruction.
Branches to a given address if a specified status bit is equal to 1.
Transfer control to a new address as specified by an external source
(Instruction Register).
Has a facility for subroutine calls and returns.
56
Microinstruction Sequencing:
Depending on the current microinstruction condition flags, and the
contents of the instruction register, a control memory address must be
generated for the next micro instruction.
There are three general techniques based on the format of the address
information in the microinstruction:
57
Two address field
The simplest approach is to provide two address field in each
microinstruction and multiplexer is provided to select:
58
Two address field
59
Single address field
Two-address approach is simple but it requires more bits in the
microinstruction. With a simpler approach, we can have a single address
field in the micro instruction with the following options for the next address.
Address Field.
Based on OPcode in instruction register.
Next Sequential Address.
enter image description here
The address selection signals determine which option is selected. This
approach reduces the number of address field to one. In most cases (in case
of sequential execution) the address field will not be used. Thus the
microinstruction encoding does not efficiently utilize the entire
microinstruction.
60
Single address field
61
Variable Format
In this approach, there are two entirely different microinstruction
formats. One bit designates which format is being used. In this first
format, the remaining bits are used to activate control signals.
In the second format, some bits drive the branch logic module, and the
remaining bits provide the address. With the first format, the next
address is either the next sequential address or an address derived
from the instruction register. With the second format, either a
conditional or unconditional branch is specified.
62
Variable Format
63
Address Sequencing
• Address sequencing capabilities required in control unit
• Incrementing CAR
• Unconditional or conditional branch, depending on status bit conditions
• Mapping from bits of instruction to address for control memory
• Facility for subroutine call and return
64
Address Sequencing
Instruction code
Mapping
logic
Incrementer
select a status
bit
Microoperation
Branch address s
65
Microprogram Example
MUX
10 0
Computer AR
Configuration Address Memory
10 0 2048 x 16
PC
MUX
15 0
6 0 6 0 DR
SBR CAR
Microinstruction Format
3 3 3 2 2 7
F1 F2 F3 CD BR AD
67
Microinstruction Fields
F1 Microoperation F2 Microoperation Symbol
Symbol 000 None NOP
000 None NOP 001 AC ← AC - DR SUB
001 AC ← AC + DR 010 AC ← AC ∨ DR OR
ADD 011 AC ← AC ∧ DR AND
010 AC ← 0 CLRAC 100 DR ← M[AR] READ
011 AC ← AC + 1 101 DR ← AC ACTDR
INCAC 110 DR ← DR + 1 INCDR
100 AC ← DR DRTAC 111 DR(0-10) ← PC PCTDR
101 AR ← DR(0-10)
DRTAR
110 AR ← PC PCTAR
F3 Microoperation Symbol
111 M[AR] ← DR WRITE
000 None NOP
001 AC ← AC ⊕ DR XOR
010 AC ← AC’ COM
011 AC ← shl AC SHL
100 AC ← shr AC SHR
101 PC ← PC + 1INCPC
110 PC ← AR ARTPC
111 Reserved
68
Microinstruction Fields
BR Symbol Function
00 JMP CAR ← AD if condition = 1
CAR ← CAR + 1 if condition = 0
01 CALL CAR ← AD, SBR ← CAR + 1 if condition
=1
CAR ← CAR + 1 if condition = 0
10 RET CAR ← SBR (Return from subroutine)
11 MAP CAR(2-5) ← DR(11-14), CAR(0,1,6) ← 0
69
Symbolic Microinstruction
▪ Sample Format Label: Micro-ops CD BR
AD
▪ CD one of {U, I, S, Z}
U: Unconditional Branch
I: Indirect address bit
S: Sign of AC
Z: Zero value in AC
71
Symbolic Microprogram
• Control memory: 128 20-bit words
• First 64 words: Routines for 16 machine instructions
• Last 64 words: Used for other purpose (e.g., fetch routine and other
subroutines)
• Mapping: OP-code XXXX into 0XXXX00, first address for 16 routines are
0(0 0000 00), 4(0 0001 00), 8, 12, 16, 20, ..., 60
Partial Symbolic Microprogram
Label Microops CD BR AD
ORG 0
ADD: NOP I CALL INDRCT
READ U JMP NEXT
ADD U JMP FETCH
ORG 4
BRANCH: NOP S JMP OVER
NOP U JMP FETCH
OVER: NOP I CALL INDRCT
ARTPC U JMP FETCH
ORG 8
STORE: NOP I CALL INDRCT
ACTDR U JMP NEXT
WRITE U JMP FETCH
ORG 12
EXCHANGE: NOP I CALL INDRCT
READ U JMP NEXT
ACTDR, DRTAC U JMP NEXT
WRITE U JMP FETCH
ORG 64
FETCH: PCTAR U JMP NEXT
READ, INCPC U JMP NEXT
DRTAR U MAP
INDRCT: READ U JMP NEXT
DRTAR U RET
72
Binary Microprogram
Address Binary Microinstruction
Micro Routine Decimal Binary F1 F2 F3 CD
BR AD
ADD 0 0000000 000 000 000 01
01 1000011
1 0000001 000 100 000 00
00 0000010
2 0000010 001 000 000 00 00
1000000
3 0000011 000 000 000 00 00
1000000
BRANCH 4 0000100 000 000 000 10
00 0000110
5 0000101 000 000 000 00 00
1000000
6 0000110 000 000 000 01 01
1000011
7 0000111 000 000 110 00 00
1000000
STORE 8 0001000 000 000 000 01
01 1000011
9 0001001 000 101 000 00 00
0001010
10 0001010 111 000 000 00
00 1000000
11 0001011 000 000 000 00
00 1000000 73
EXCHANGE 12 0001100 000 000 000 01
Design of Control Unit
microoperation
F1 fields F2 F3
AND
ADD AC
Arithmetic
logic and DR
DRTAC shift unit
PCTAR
DRTAR
From From
PC DR(0-10) Load
AC
Select 0 1
Multiplexers
Load Clock
AR
74
Microprogram Sequencer
External
(MAP)
L
I0 3 2 1 0
Input Load
I1 S1 MUX1 SBR
logic
T S0
1 Incrementer
I MUX2 Test
S
Z Select
Clock CAR
Control memory
Microops CD BR AD
... ...
75
Input Logic for Microprogram Sequencer
1 L L(load SBR with PC)
From I MUX2 Test
CPU S T for subroutine Call
BR field Input
Z Select I0 logic S0 for next address
of CS I1
S1 selection
CD Field of CS
Input Logic
I1I0T Meaning Source of Address S 1S0
L
77
Step-1
• An initial address is loaded into the control address register when power
is turned on in the computer.
• This address is usually the address of the first microinstruction that
activates the instruction fetch routine.
• The fetch routine may be sequenced by incrementing the control
address register through the rest of its microinstructions.
• At the end of the fetch routine, the instruction is in the instruction register
of the computer.
78
Step-2
• The control memory next must go through the routine that determines
the effective address of the operand.
• A machine instruction may have bits that specify various addressing
modes, such as indirect address and index registers.
• The effective address computation routine in control memory can be
reached through a branch microinstruction, which is conditioned on the
status of the mode bits of the instruction.
• When the effective address computation routine is completed, the
address of the operand is available in the memory address register.
79
Step-3
81
Basic Concepts of pipelining
How to improve the performance of the processor?
1.By introducing faster circuit technology
2.Arrange the hardware in such a way that, more than one operation can be performed at the
same time.
What is Pipeining?
It is the process of arrangement of hardware elements in such way that, simultaneous
execution of more than one instruction takes place in a pipelined processor so as to increase
the overall performance.
What is Instruction Pipeining?
• The number of instruction are pipelined and the execution of current instruction is
overlapped by the execution of the subsequent instruction.
• It is a instruction level parallelism where execution of current instruction does not wait until
the previous instruction has executed completely.
82
Basic idea of Instruction Pipelining
Sequential Execution of a program
• The processor executes a program by fetching(Fi) and executing(Ei)
instructions one by one.
83
Hardware organization and instruction pipeline
• Consists of 2 hardware units one for fetching and another one for
execution as follows.
• Also has intermediate buffer to store the fetched instruction
84
2 stage pipeline
• Execution of instruction in pipeline manner is controlled by a clock.
• In the first clock cycle, fetch unit fetches the instruction I1 and store it in buffer
B1.
• In the second clock cycle, fetch unit fetches the instruction I2 , and execution unit
executes the instruction I1 which is available in buffer B1.
• By the end of the second clock cycle, execution of I1 gets completed and the
instruction I2 is available in buffer B1.
• In the third clock cycle, fetch unit fetches the instruction I3 , and execution unit
executes the instruction I2 which is available in buffer B1.
• In this way both fetch and execute units are kept busy always.
85
Contd…
86
Hardware organization for 4 stage pipeline
• Pipelined processor may process each instruction in 4 steps.
1. Fetch(F): Fetch the Instruction
2. Decode(D): Decode the Instruction
3. Execute (E) : Execute the Instruction
4. Write (W) : Write the result in the destination location
⮚4 distinct hardware units are needed as shown below.
87
Execution of instruction in 4 stage pipeline
• In the first clock cycle, fetch unit fetches the instruction I1 and store it in buffer
B1.
• In the second clock cycle, fetch unit fetches the instruction I2 , and decode unit
decodes instruction I1 which is available in buffer B1.
• In the third clock cycle fetch unit fetches the instruction I3 , and decode unit
decodes instruction I2 which is available in buffer B1 and execution unit executes
the instruction I1 which is available in buffer B2.
• In the fourth clock cycle fetch unit fetches the instruction I4 , and decode unit
decodes instruction I3 which is available in buffer B1, execution unit executes the
instruction I2 which is available in buffer B2 and write unit write the result of I1.
88
89
Contd…
90
Role of cache memory in Pipelining
• Each stage of the pipeline is controlled by a clock cycle whose period is that the
fetch, decode, execute and write steps of any instruction can each be completed
in one clock cycle.
• However the access time of the main memory may be much greater than the
time required to perform basic pipeline stage operations inside the processor.
• The use of cache memories solve this issue.
• If cache is included on the same chip as the processor, access time to cache is
equal to the time required to perform basic pipeline stage operations .
91
Pipeline Performance
• Pipelining increases the CPU instruction throughput - the number of instructions
completed per unit time.
• The increase in instruction throughput means that a program runs faster and has
lower total execution time.
• Example in 4 stage pipeline, the rate of instruction processing is 4 times that of
sequential processing.
• Increase in performance is proportional to no. of stages used.
• However, this increase in performance is achieved only if the pipelined operation
is continued without any interruption.
• But this is not the case always.
92
Contd…
• Consider the scenario, where one of the pipeline stage may require more clock cycle than the other.
• For example, consider the following figure where instruction I2 takes 3 cycles to completes its
execution(cycle 4,5,6)
• In cycle 5,6 the write stage must be told to do nothing, because it has no data to work with.
93
The Major Hurdle of Pipelining—Pipeline
Hazards
• These situations are called hazards, that prevent the next instruction
in the instruction stream from executing during its designated clock
cycle.
• Hazards reduce the performance from the ideal speedup gained by
pipelining.
108
Unconditional Branches
● If Sequence of instruction being executed in two stages
pipeline instruction I1 to I3 are stored at consecutive memory
address and instruction I2 is a branch instruction.
● If the branch is taken then the PC value is not known till the
end of I2.
● Next third instructions are fetched even though they are not
required
● Hence they have to be flushed after branch is taken and new
set of instruction have to be fetched from the branch address
109
Unconditional Branches
110
Branch Timing
● Branch penalty
The time lost as the result of branch instruction
● Reducing the penalty
The branch penalties can be reduced by proper
scheduling using compiler techniques
● For longer pipeline, the branch penalty may be much higher
● Reducing the branch penalty requires branch target address to be
computed earlier in the pipeline
● Instruction fetch unit must have dedicated hardware to identify a
branch instruction and compute branch target address as quickly as
possible after an instruction is fetched
111
Branch Timing
112
Branch Timing
113
Instruction Queue and Prefetching
• Either a cache miss or a branch instruction may stall the pipeline for
one or more clock cycle.
• To reduce the interruption many processor uses the instruction fetch
unit which fetches instruction and put them in a queue before it is
needed.
• Dispatch unit-Takes instruction from the front of the queue and
sends them to the execution unit, it also perform the decoding
operation
• Fetch unit keeps the instruction queue filled at all times.
• If there is delay in fetching the instruction, the dispatch unit
continues to issue the instruction from the instruction queue
114
Instruction Queue and Prefetching
115
Conditional Branches
● A conditional branch instruction introduces the added hazard caused by
the dependency of the branch condition on the result of a preceding
instruction.
● The decision to branch cannot be made until the execution of that
instruction has been completed.
116
Delayed Branch
● The location following the branch instruction is branch delay slot.
● The delayed branch technique can minimize the penalty arise due to
conditional branch instruction
● The instructions in the delay slots are always fetched. Therefore, we
would like to arrange for them to be fully executed whether or not the
branch is taken.
● The objective is to place useful instructions in these slots.
● The effectiveness of the delayed branch approach depends on how
often it is possible to reorder instructions.
117
Delayed Branch
118
Delayed Branch
119
Branch Prediction
● To predict whether or not a particular branch will be taken.
● Simplest form: assume branch will not take place and continue to fetch instructions
in sequential address order.
● Until the branch is evaluated, instruction execution along the predicted path must
be done on a speculative basis.
● Speculative execution: instructions are executed before the processor is certain that
they are in the correct execution sequence.
● Need to be careful so that no processor registers or memory locations are updated
until it is confirmed that these instructions should indeed be executed.
120
Incorrectly Predicted Branch
121
Branch Prediction
● Better performance can be achieved if we arrange for some branch
instructions to be predicted as taken and others as not taken.
● Use hardware to observe whether the target address is lower or higher
than that of the branch instruction.
● Let compiler include a branch prediction bit as 0 or 1. The fetch unit
checks this bit to predict whether the branch is taken or not taken
branch
● So far the branch prediction decision is always the same every time a
given instruction is executed – static branch prediction.
122
Branch Prediction
● Static Prediction
● Dynamic branch Prediction
123
Static Prediction
● Prediction is carried out by compiler and it is static because the
prediction is already known before the program is executed
124
Dynamic Branch Prediction
● Dynamic prediction in which the prediction decision may change
depending on the execution history
125
Branch Prediction Algorithm
▪ If the branch taken recently,the next time if the same branch is
executed,it is likely that the branch is taken
▪ State 1: LT : Branch is likely to be taken
▪ State 2: LNT : Branch is likely not to be taken
▪ 1.If the branch is taken,the machine moves to LT. otherwise it
remains in state LNT.
▪ 2.The branch is predicted as taken if the corresponding state
machine is in state LT, otherwise it is predicted as not taken
126
Branch Prediction Algorithm
127
4 State Algorithm
● ST-Strongly likely to be taken
○ LT-Likely to be taken
○ LNT-Likely not to be taken
○ SNT-Strongly not to be taken
● Step 1: Assume that the algorithm is initially set to LNT
● Step 2: If the branch is actually taken changes to ST, otherwise it is
changed to SNT.
● Step 3: when the branch instruction is encountered, the branch will
taken if the state is either LT or ST and begins to fetch instruction at
branch target address, otherwise it continues to fetch the instruction in
sequential manner
128
4 State Algorithm
● When in state SNT,the instruction fetch unit predicts that the
branch will not be taken
● If the branch is actually taken,that is if the prediction is
incorrect,the state changes to LNT
129
4 State Algorithm
130
INFLUENCE ON INSTRUCTION SETS
131
OVERVIEW
• Some instructions are much better suited to pipeline
execution than others.
• Addressing modes
• Conditional code flags
132
ADDRESSING MODES
• Addressing modes include simple ones and complex
ones.
• In choosing the addressing modes to be implemented in
a pipelined processor, we must consider the effect of
each addressing mode on instruction flow in the
pipeline:
- Side effects
- The extent to which complex addressing modes
cause the pipeline to stall
- Whether a given mode is likely to be used by
compilers
133
RECALL
Load X(R1), R2
Load (R1), R2
134
COMPLEX ADDRESSING MODE
Load (X(R1)), R2
T ime
Clock cycle 1 2 3 4 5 6 7
F orw ard
Ne xt instruction F D E W
135
SIMPLE ADDRESSING MODE
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
Add F D X + [R1] W
Load F D [X [R1]] W
+
Ne xt instruction F D E W
137
ADDRESSING MODES
• Good addressing modes should have:
- Access to an operand does not require more than one access
to the memory
- Only load and store instruction access memory
operands
- The addressing modes used do not have side
effects
• Register, register indirect, index
138
CONDITIONAL CODES
• If an optimizing compiler attempts to reorder instruction
to avoid stalling the pipeline when branches or data
dependencies between successive instructions occur, it
must ensure that reordering does not cause a change in
the outcome of a computation.
• The dependency introduced by the condition-code flags
reduces the flexibility available for the compiler to
reorder instructions.
139
CONDITIONAL CODES
Add R1,R2
Compare R3,R4
Branch=0 ...
a) A program fragment
Compare R3,R4
Add R1,R2
Branch=0 ...
b) Instructions reordered
Instruction reordering
140
CONDITIONAL CODES
Two conclusion:
⮚ To provide flexibility in reordering instructions, the
condition-code flags should be affected by as few
instruction as possible.
⮚ The compiler should be able to specify in which
instructions of a program the condition codes are
affected and in which they are not.
141