Processor Architecture

RISC architectures are better suited for pipelining than CISC architectures due to their simpler instruction sets and uniform instruction formats. Pipelining allows different stages of instruction processing to overlap, improving processor throughput. However, CISC's more complex instructions make it difficult to cleanly separate processing into standardized stages. While CISC can still benefit from pipelining, RISC is inherently better optimized for this technique.


1. CISC vs. RISC Architecture:
RISC architectures lend themselves to pipelining more readily than CISC architectures for several reasons. Because of their simplicity and small instruction set, RISC instructions are easy to split into stages. In a RISC architecture the fetch and decode cycle is also more predictable, since most instructions have the same length.

Pipelining requires that the whole fetch-to-execute cycle be split into stages in which each stage does not interfere with the next and every instruction has work to do at each stage. CISC instructions, being more complex, are harder to split into such stages: a stage that is essential for one instruction may not be required at all for another. The variation in instruction length in CISC also complicates the fetch and decode sections of a pipeline; a single-byte instruction following an 8-byte instruction must be handled so as not to slow down the whole pipeline. Finally, CISC architectures, as the name implies, have more complex instructions with complex addressing modes, which makes the whole cycle of processing an instruction more involved.

Because RISC architectures have simpler instruction sets than CISC, the number of gates involved in the fetch and execute cycle is far lower than in a CISC architecture. RISC architectures therefore tend to have smaller optimum pipeline lengths than more general processors for identical performance. RISC suits pipelining better than CISC; this does not mean, however, that a CISC architecture cannot gain from pipelining, or that a large number of pipeline stages is necessarily bad (although pipeline hazards in CISC would become a greater concern).

Other features typically found in RISC architectures are:

- Uniform instruction format (fixed-size instructions), using a single word with the opcode in the same bit positions in every instruction, demanding less decoding (see the decoding sketch after this list); instructions execute in a single clock cycle.
- The fetch and decode unit is much faster due to the simple and uniform instructions.
- A large number of general-purpose registers, to avoid storing variables in stack memory.
- Only load and store instructions reference memory.
- Fewer, simpler instructions rather than complex ones.
- Few, simple addressing modes, to simplify references to operands.
- Few simple data types. Some CISCs have byte-string instructions or support complex numbers; such features are unlikely to be found on a RISC.
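Because the opcode and register fields sit in the same bit positions in every instruction word, decoding reduces to a few fixed shifts and masks. The sketch below is a minimal illustration in C, assuming a hypothetical 32-bit format; the field widths are invented for this example and do not correspond to any real ISA:

#include <stdint.h>

/* Hypothetical fixed 32-bit instruction layout (illustrative only):    */
/*   bits 31-26 opcode | 25-21 rd | 20-16 rs1 | 15-11 rs2 | 10-0 imm    */
static inline uint32_t opcode(uint32_t insn) { return insn >> 26; }
static inline uint32_t rd(uint32_t insn)     { return (insn >> 21) & 0x1F; }
static inline uint32_t rs1(uint32_t insn)    { return (insn >> 16) & 0x1F; }
static inline uint32_t rs2(uint32_t insn)    { return (insn >> 11) & 0x1F; }
static inline uint32_t imm(uint32_t insn)    { return insn & 0x7FF; }

/* Every instruction is decoded by the same few operations, so the decode */
/* stage is short and its delay is identical for all instructions.        */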

2. Von Neumann and Harvard Architecture:

The von Neumann architecture is a design model for a stored-program digital computer that uses a processing unit and a single storage structure to hold both instructions and data. A single bus is used to transfer both instructions and data, which leads to the von Neumann bottleneck: it limits the throughput (data transfer rate) between the CPU and memory. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data, since the CPU is continuously forced to wait for needed data to be transferred to or from memory. Because CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem.

Figure 1: Von Neumann architecture

The von Neumann bottleneck is avoided in the Harvard architecture. The most basic characteristic of the Harvard architecture is that it has physically separate signals and storage for code and data memory. It is possible to access program memory and data memory simultaneously, creating potentially higher throughput and less of a bottleneck. Typically, code (or program) memory is read-only and data memory is read-write. In a computer using the Harvard architecture, the CPU can read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity because instruction fetches and data accesses do not contend for a single memory pathway.

Figure 2: Harvard architecture

2.1 SIMD Processing:

Some DSPs have multiple data memories in distinct address spaces to facilitate SIMD processing. SIMD exploits data-level parallelism by operating on a small to moderate number of data items in parallel. A true SIMD architecture contains a single control unit (CU) with multiple execution units (EUs) acting as arithmetic logic units. The EUs are slaves to the control unit, working in lock-step under the supervision of the CU; they cannot fetch or interpret any instructions. They are merely units with the capability of addition, subtraction, multiplication, division, rotation, shifting and so on. Each EU has access only to its own memory, so if an EU needs information held by a different EU, it must put in a request to the CU, and the CU must manage the transfer of the information. The advantage of this type of architecture is the ease of adding more data memory and EUs to the computer.
Figure 3: SIMD processing

Conventional data types used in the C programming language, such as char, int and float, are called scalar types. Data types used for SIMD operations are called vector types. Each vector type has a corresponding scalar type, as shown in Table 1.

Table 1: List of Vector Types


Vector Type Data
__vector unsigned char Sixteen unsigned 8-bit data
__vector signed char Sixteen signed 8-bit data
__vector unsigned short Eight unsigned 16-bit data
__vector signed short Eight signed 16-bit data
__vector unsigned int Four unsigned 32-bit data
__vector signed int Four signed 32-bit data
__vector unsigned long long Two unsigned 64-bit data
__vector signed long long Two signed 64-bit data
__vector float Four single-precision floating-point data
__vector double Two double-precision floating-point data

A vector is stored in a 128-bit (16-byte) fixed-length storage cell made up of 2 to 16 elements according to the data type. A vector can be interpreted as a series of scalars of the corresponding type (char, int, float and so on) packed into a 16-byte space. A vector literal is written as a parenthesized vector type followed by a curly-braced set of constant expressions; the elements of the vector are initialized to the corresponding expressions, and elements for which no expressions are specified default to 0. Vector literals may be used either in initialization statements or as constants in executable statements. Some examples of usage are shown below. When used in macros, the entire vector literal must be enclosed in parentheses.
Example (1): Use in a variable initialization statement

void func_a(void)
{
    __vector signed int va = (__vector signed int) { -2, -1, 1, 2 };
}

Example (2): Use as a constant in an executable statement

va = vec_add(va, ((__vector signed int) { 1, 2, 3, 4 }));
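The vec_add call in Example (2) performs four independent 32-bit additions in a single operation. The plain-C sketch below (no vector extensions assumed; array names are chosen for this illustration only) computes the same result one element at a time, which is exactly the work the SIMD unit carries out in parallel:

#include <stdio.h>

int main(void)
{
    int va[4] = { -2, -1, 1, 2 };   /* the values initialized in Example (1)   */
    int vb[4] = {  1,  2, 3, 4 };   /* the vector literal used in Example (2)  */
    int i;

    /* One SIMD add performs all four element additions at once;        */
    /* this loop produces the equivalent result, one element per step.  */
    for (i = 0; i < 4; i++)
        va[i] = va[i] + vb[i];

    for (i = 0; i < 4; i++)
        printf("%d ", va[i]);       /* prints: -1 1 4 6 */
    printf("\n");
    return 0;
}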

A drawback of the SIMD organization described above is the time the CU wastes in managing all memory exchanges. Other limitations of SIMD processing include:

- Not all algorithms can be vectorized. For example, a flow-control task such as code parsing would not benefit from SIMD; not all applications are naturally suited to SIMD processing.
- Compilers do not easily generate SIMD instructions from a typical C program.
- SIMD processors are specially designed, and tend to be expensive and to have long design cycles.
- Programming with a particular SIMD instruction set can involve numerous low-level challenges:
  o Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.
  o Specific instructions such as rotations or three-operand addition are not available in some SIMD instruction sets.
  o SIMD programs are architecture-specific: programmers must provide a non-vectorized version of the program for non-SIMD processors.

SIMD operations are widely used for 3D graphics and audio/video processing in multimedia applications. SIMD computers require less hardware and are easier to program than MIMD computers.

2.2 MIMD Processing:

The MIMD multiprocessing architecture is suitable for a wide variety of tasks in which completely independent, parallel execution of instructions touching different sets of data can be put to productive use. MIMD architectures are used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modelling, weather forecasting and multi-port high-speed backbone routers. A MIMD computer has multiple interconnected execution units (EUs), each of which has its own control unit (see Fig. 4). MIMD exploits instruction-level as well as data-level parallelism by processing multiple instructions and data items in parallel. Each execution unit works on its own data with its own instructions, and tasks executed by different execution units can start or finish at different times. They may write their results to a shared memory space. The units are not lock-stepped, as in SIMD computers, but run asynchronously and require synchronization at the program level.
Figure 4: MIMD processing

3. Processors with ILP:

A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. A CPU architecture designed to take advantage of instruction-level parallelism (ILP) improves performance by executing multiple instructions simultaneously. The maximum number of instructions that can be executed in parallel is called the degree of parallelism. Further improvement can be achieved by executing instructions in an order different from the order in which they appear in the program; this is called out-of-order execution.

As often implemented, these techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies. For example, if the first instruction's result is used as input to a second instruction, the two clearly cannot execute at the same time, and the second instruction cannot be executed before the first. Modern out-of-order processors devote substantial hardware resources to scheduling instructions and determining their interdependencies.

3.1 Superscalar:

A superscalar CPU implements a form of parallelism called instruction-level parallelism within a single processor. It therefore offers higher instruction throughput than would otherwise be possible at a given clock rate. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

- Instructions are issued from a sequential instruction stream.
- CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).
- The CPU decodes and dispatches multiple instructions per clock cycle.
Figure 5: Superscalar architecture

A superscalar processor executes more than one instruction during a single execution cycle by simultaneously dispatching multiple instructions to multiple functional units on the processor. This enables multiple operations to run concurrently on separate hardware. To do this, the fetch stages must be enhanced so that multiple instructions are available in the instruction queue. Dependency check logic analyzes the instruction queue and identifies independent instructions, which are decoded/dispatched in parallel and sent to the functional units (see Fig. 5). Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an integer unit, floating-point unit, branch unit or address-computation unit. While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are considered different performance-enhancement techniques.
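As a rough illustration of what the dependency check logic decides, the following C sketch (the instruction representation with one destination and two source register numbers is invented for this example) determines whether a younger instruction may be dispatched in the same cycle as the instruction ahead of it:

#include <stdbool.h>

/* Simplified decoded instruction: one destination and two source registers. */
struct insn {
    int dest;
    int src1;
    int src2;
};

/* Return true if 'younger' may issue in the same cycle as 'older':          */
/* it must not read the older instruction's result (read-after-write) and    */
/* must not write the same destination register (write-after-write).         */
static bool can_dual_issue(struct insn older, struct insn younger)
{
    if (younger.src1 == older.dest || younger.src2 == older.dest)
        return false;   /* read-after-write dependency */
    if (younger.dest == older.dest)
        return false;   /* write-after-write dependency */
    return true;
}

Real dispatch logic performs comparisons like this across every pair of instructions in the issue window, which is why its area and delay grow quickly with the degree of parallelism.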

The performance improvement available from superscalar techniques is limited by two key factors:

1. The degree of intrinsic parallelism in the instruction stream, i.e. the number of independent instructions available in the instruction queue.
2. The complexity and delay cost of the associated dependency-checking logic; the area and delay increase with the degree of parallelism and the instruction-set size.

3.2 VLIW:

VLIW microprocessors and superscalar implementations share some characteristics: multiple execution units and the ability to execute multiple operations simultaneously. The techniques used to achieve high performance, however, are very different, because the parallelism is explicit in VLIW instructions but must be discovered by hardware at run time in superscalar processors. A VLIW instruction is really a group of small sub-instructions, so the instructions themselves are very long (often 128 bits or more), hence the name VLIW (very long instruction word). Each instruction contains information for multiple parallel operations. The compiler is responsible for identifying multiple independent instructions in the program and grouping them into a single VLIW instruction. A VLIW architecture therefore does not require complex dependency-check logic in hardware and has a simple dispatch unit. This saves considerable chip area and allows a shorter clock cycle, so VLIW processors are faster and consume less chip area than their superscalar counterparts.
However, VLIW has the following disadvantages:
1. Since instruction grouping is determined by the compiler, if the compiler does not find enough parallelism to fill all slots of a VLIW instruction it must place explicit NOP (no-operation) instructions in the corresponding slots (see the sketch below). This causes VLIW programs to take more memory than equivalent programs for superscalar processors.
2. VLIW programs are usually incompatible between generations of VLIW processors. This is why they have been less successful in general-purpose computers such as workstations or desktop PCs, where customers demand software compatibility between generations of processors.
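To picture how the compiler fills a VLIW instruction, the sketch below assumes a hypothetical 128-bit bundle with four 32-bit operation slots and an all-zero NOP encoding; all of these choices are illustrative and do not describe any particular VLIW processor:

#include <stdint.h>

#define VLIW_SLOTS   4
#define NOP_ENCODING 0x00000000u    /* hypothetical no-operation encoding */

/* One 128-bit VLIW instruction: four 32-bit operations issued together. */
struct vliw_bundle {
    uint32_t slot[VLIW_SLOTS];
};

/* Pack up to VLIW_SLOTS independent operations into one bundle.          */
/* When fewer independent operations are available, the remaining slots   */
/* are padded with NOPs, which is why low parallelism inflates code size. */
static struct vliw_bundle pack_bundle(const uint32_t *ops, int count)
{
    struct vliw_bundle b;
    int i;
    for (i = 0; i < VLIW_SLOTS; i++)
        b.slot[i] = (i < count) ? ops[i] : NOP_ENCODING;
    return b;
}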

3.3 Pipelining:

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the rate of the slowest step, which is still much faster than issuing at the rate needed to perform all steps at once.

Architecturally, the execution hardware is divided into small stages with latches in between the stages. The latches hold the results of the previous stage, which are forwarded to the next stage on the arrival of the clock. Since the stage delay is smaller than the delay of the entire execution hardware, the pipeline can be operated at a faster clock. Fig. 7 shows a pipeline system of five stages (Fetch, Decode, Execute, Memory Access and Write Back) for processing instructions in a microprocessor. After the first five clock cycles, one instruction completes on every clock cycle. The speed of a pipeline can be measured in terms of CPI (clock cycles per instruction), which should ideally be 1. But due to latency and pipeline hazards (explained later) the value of CPI is greater than one in all practical cases.

Figure 7: Instruction processing in 5-stage pipeline CPU


Assume all stages in the pipeline have the same delay tC; then
Pipeline clock cycle time = tC
Let there be k stages and n instructions to be executed.
Total processing time: Tk = [k + (n - 1)] tC (pipelined processor)
T1 = n k tC (non-pipelined processor)
Speed-up factor: Sk = T1 / Tk = n k / [k + (n - 1)]
Efficiency: Ek = Sk / k = n / [k + (n - 1)]
Pipeline throughput: Hk = number of instructions executed per unit time
= n / {[k + (n - 1)] tC} ≈ 1 / tC (since n >> k)

As k increases, tC reduces but the factor [k + (n - 1)] changes very little, since n is much larger than k. Thus, pipeline throughput increases.
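As an illustration with assumed values: for k = 5 stages, n = 1000 instructions and tC = 2 ns, Tk = [5 + 999] × 2 ns ≈ 2.0 µs while T1 = 1000 × 5 × 2 ns = 10 µs, giving Sk ≈ 4.98, Ek ≈ 0.996 and Hk ≈ 498 million instructions per second, close to the ideal 1/tC = 500 MIPS.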

If the delays of the pipeline stages are unequal:

Let the delay of the slowest stage be tcw; this becomes the pipeline clock cycle time. The processing time to execute n instructions is then
Tk = [k + (n - 1)] tcw (pipelined)
T1 = n (d1 + d2 + ... + dk) (non-pipelined)

Ideally, if there are N stages in a pipelined architecture, then

Instruction execution time (pipelined) = Instruction execution time (non-pipelined) / N

Optimum Number of Stages:


The number of stages in a pipeline often depends on the trade-off between performance and
cost. The optimal choice for such a number can be determined by obtaining the peak value of
a performance to cost ratio (PCR).

PCR= maximum instruction throughput/cost

To illustrate, assume a non-pipelined processor with a total hardware delay of tnp. For a pipelined CPU with k stages, the per-stage delay is tnp/k. If the latch delay is tl, then the period of the pipeline clock will be
tc = (tnp / k) + tl

Thus, the maximum instruction throughput (or performance P) that can be obtained with such a pipeline is

P = 1 / tc = 1 / [(tnp / k) + tl]

The pipeline cost C can be expressed as the total cost of logic gates and latches used in all
stages. That is, C = Cp + k Cl where Cp is the cost of logic stages of processor and Cl is the
cost of latches per stage. (Note that the cost of gates and latches may be interpreted in
different ways; for example, the cost may refer to design complexity or the area required on
the chip or circuit board.) By substituting the values for maximum throughput and pipeline
cost in the PCR equation, the following formula can be obtained
PCR = 1 / {[(tnp / k) + tl] (Cp + k Cl)}
The optimum value of k is obtained by differentiating PCR w.r.t. k and equating to zero.
k0 = sqrt[(tnp Cp) / (tl Cl)]

Since the value k0 maximizes PCR, this value can be used as an optimal choice for the number of stages in a pipelined CPU.
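As an illustration with assumed values: if tnp = 10 ns, tl = 0.5 ns, Cp = 100 cost units and Cl = 5 cost units, then k0 = sqrt[(10 × 100) / (0.5 × 5)] = sqrt(400) = 20 stages.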

3.3.1 Typical Pipelining structure

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory
(flip flops). By breaking the logic into smaller pieces and inserting flip flops between the
pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock
period can be reduced. For example, the classic RISC pipeline is broken into five stages with
a set of flip flops between each stage.

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back

Figure 8: Five stage Pipeline operation

Figure 9: Architecture of Five-stage Pipeline CPU


1. Instruction fetch

Instruction fetch on these machines has a latency of one cycle. During instruction fetch, a 32-bit instruction is fetched from the code memory (ROM). At the same time, the PC predictor predicts the address of the next instruction by incrementing the PC by 4 (all instructions are 4 bytes long). This prediction is always wrong in the case of a taken branch, jump or exception. In a CPU with a cache, the fetch unit tries to read the instruction from the instruction cache; if this fails, a normal fetch from ROM (code memory) is performed.

2. Decode

Unlike earlier microcoded machines, most RISC machines have no microcode. Once fetched from the instruction cache, the instruction bits are shifted down the pipeline, so that simple combinational logic in each pipeline stage can produce the control signals for the datapath units (ALU, barrel shifter, multiplier, etc.) directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage.

If the decoded instruction is a branch or jump, the branch/jump condition is tested. Based on the result of this test, the PC is assigned either the incremented value or the branch target address.

3. Execute

Instructions on these simple RISC machines can be divided into three latency classes according to the type of operation:

- Register-Register Operation: add, subtract, compare, shift and logical operations. During the execute stage, the two arguments are fed to a simple ALU, which generates the result by the end of the execute stage.

- Memory Reference Operation: load and store instructions. During the execute stage, the ALU adds the two arguments (a register and a constant offset) to produce an effective address by the end of the cycle.

- Multi-Cycle Operation: integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations are fed to the multi-cycle multiply/divide unit. The rest of the pipeline is free to continue execution while the multiply/divide unit does its work. To avoid complicating the write-back stage and issue logic, multi-cycle instructions write their results to a separate set of registers.

4. Memory Access

During this stage, the result of a data-processing instruction produced by the execute stage is forwarded to the next stage. If the instruction is a load, data is read from the data memory; if the instruction is a store, the register data is written to data memory at the address computed by the execute stage. For data-processing instructions, the output of the execute stage is not an address but a result, which is passed on to be written to the destination register, bypassing the data memory access.

5. Write back

During this stage, both single-cycle and two-cycle instructions write their results into the register file: either the contents read from data memory or the result of the execute stage is written to the register file.
3.3.2 Pipeline Hazards:
> Data hazards: The data hazard, also referred to as the data dependency problem, comes about as a result of overlapping the execution of data-dependent instructions. In other words, a data hazard occurs when an instruction attempts to use data before that data is available in the register file (e.g. when an instruction depends on the result of a previous instruction).

For example, in Fig. 10(a) instruction i2 has a data dependency on instruction i1 because it uses the result of i1 (i.e., the contents of register R2) as input data. If the instructions were sent down the pipeline in the normal manner, i2 would be in the E2 stage before i1 passes through the W1 stage to write its result. i2 would therefore use the old contents of R2 to compute a new value for R5, leading to an invalid result. To produce a valid result, i2 must not enter the E2 stage until i1 has passed through the W1 stage. As shown in Fig. 10(b), the execution of i2 is therefore delayed for two clock cycles; in other words, instruction i2 is said to be stalled for two clock cycles. Often, when an instruction is stalled, the instructions positioned after the stalled instruction are also stalled.
The result needed by i2 is actually produced during the E1 stage of i1 but is written only in the W1 stage. If the hardware allows the result of E1 to be made available immediately (for example, through an additional write port into the register file or a direct bypass path), it can then be used in the E2 stage of i2. This is called forwarding (a sketch follows below). Forwarding avoids stalling the pipeline and helps improve instruction throughput. This advantage is achieved at the extra cost of a more complex, multi-ported register file.
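The forwarding path can be pictured as a multiplexer in front of the execute-stage input that selects the freshest copy of a register value. The C sketch below is a minimal illustration, assuming a single forwarding path from the previous instruction's execute latch; the function and parameter names are invented for this example:

/* Select an operand for the current execute stage: if the previous       */
/* instruction is about to write the register we need, take its result    */
/* directly from the pipeline latch instead of the stale register file.   */
static int forward_operand(int reg_needed,
                           int prev_dest_reg, int prev_alu_result,
                           const int *regfile)
{
    if (reg_needed == prev_dest_reg)
        return prev_alu_result;      /* forwarded from the execute latch */
    return regfile[reg_needed];      /* normal register file read        */
}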

> Control hazards occur whenever there is a change in the normal execution flow of the program. Events such as branches, interrupts, exceptions and returns from interrupt cause control hazards. A control hazard occurs because branches, interrupts and the like are not detected until the instruction is executed; by that time, the following instructions have already entered the pipeline and must be flushed out. This draining and refilling of the pipeline for each branch degrades the throughput of the pipeline processor. Note that the presence of a conditional branch instruction does not automatically cause the pipeline to drain and begin refilling: a branch that is not taken allows the sequential flow of instructions to continue uninterrupted in the pipeline. Only when a branch is taken does the problem arise.

Figure 11 shows the execution of such a sequence in the instruction pipeline when the target path is selected. In this figure, c denotes the branch penalty, that is, the number of cycles wasted whenever the target path is chosen.

Figure 11: Control hazards in pipeline operation

To show the effect of the branch penalty on the overall pipeline performance, the average number of cycles per instruction must be determined. Let tave denote the average number of cycles required for the execution of an instruction; then

tave = Pb * (average number of cycles per branch instruction) +
       (1 - Pb) * (average number of cycles per nonbranch instruction),

where Pb denotes the probability that a given instruction is a branch instruction. The average number of cycles per branch instruction can be determined by considering two cases: if the target path is chosen, 1 + c cycles (c = branch penalty) are needed for the execution; otherwise, there is no branch penalty and only one cycle is needed. If Pt denotes the probability that the target path is chosen, then

Average number of cycles per branch instruction = Pt (1 + c) + (1 - Pt)(1)

The average number of cycles per nonbranch instruction is 1. Thus

tave = Pb [Pt (1 + c) + (1 - Pt)(1)] + (1 - Pb)(1) = 1 + c Pb Pt.
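As an illustration with assumed values: if 20% of instructions are branches (Pb = 0.2), 60% of those branches take the target path (Pt = 0.6) and the branch penalty is c = 3 cycles, then tave = 1 + 3 × 0.2 × 0.6 = 1.36 cycles per instruction instead of the ideal 1.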

> Structural hazards occur when two instructions attempt to use the same resource at the same time. If execution of an instruction (such as multiply or divide) requires more than one clock cycle and the execution hardware is not fully pipelined or replicated, then the next instruction that tries to use the execution unit cannot be issued, and a structural hazard occurs. Another type of structural hazard arises from the design of the register file. If a register file does not have multiple write (read) ports, multiple writes (reads) to (from) registers cannot be performed simultaneously. For example, under certain situations two or more instructions in the pipeline might want to perform register writes in the same clock cycle; this is not possible if the register file has only one write port. The effect of structural hazards can be reduced fairly simply by implementing multiple execution units and using register files with multiple read/write ports.

3.3.3 Merits and Demerits of Pipelining:

Merits:

1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in
most cases.
2. Some data path circuits such as adders or multipliers can be made faster by introducing pipelining within these circuits.

Demerits:

1. Pipelined processors suffer from various pipeline hazards, which add extra latency.
2. The instruction latency in a pipelined processor is slightly higher than in a non-
pipelined equivalent. This is due to the fact that an instruction has to pass through
many stages and latches.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance
of a pipelined processor is much harder to predict and may vary more widely between
different programs.

References (Further Reading):

1. Computer Architecture: Single and Parallel Systems by Mehdi R. Zargham, PHI, 1996
2. Advanced Computer Architecture by Kai Hwang, TMH, 8th Reprint 2008
