
INSIDE THE CPU

Amazingly, to this day the fundamental design of mainstream microprocessors and computers is
still based on principles from the 1940s. The structure created by John Mauchly and J. Presper
Eckert (developers of the ENIAC), and formalized by mathematician and computer scientist John
Von Neumann, allowed memory to hold both program instructions and data. This stored program
concept, later called Von Neumann Architecture, no longer required hard-wiring of instructions,
enabling easy reprogramming and revolutionizing computing.

Today's CPUs also utilize a Harvard Architecture loosely based on the Harvard Mark I from the
1940s, which had separate memory blocks for instructions and data. For example, modern
processors generally divide the onboard L1 cache into separate instruction and data areas. (L2
and L3 caches, however, typically include both instructions and data, making them unified
caches.)

Programs at the lowest level—those communicating directly with the CPU—must be in machine
language, which represents instructions and data for the processor as binary numbers. The last
we looked, though, few people enjoy writing programs consisting of code like 1000101111010000.
Assembly language, a slightly higher-level tool, eases the process by representing operations
and data symbolically, resulting in directives like MOV AX, 9 instead of
101110000000100100000000. An assembler program then converts the symbolic code into
machine language. Higher-level languages like C perform the translation using compiler software.
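
To make the layers concrete, here's a minimal sketch; the C line is a hypothetical high-level
counterpart, and the encoding is the 16-bit form quoted above:

    /* One operation at three levels; the comments show what an assembler
       would accept and what it would emit for this illustrative example. */
    int main(void)
    {
        int ax = 9;  /* assembly:     MOV AX, 9                             */
                     /* machine code: 10111000 00001001 00000000 (B8 09 00) */
        return ax;
    }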

Many CPU instructions specify some operation, like add, multiply, or compare, to be carried out
on an operand—data the processor often stores in registers (temporary internal holding
locations), but may also fetch from cache or main memory. Different types of instructions may
vary in length. Current CPUs, like the Pentium 4 and Athlon64, decode variable-length x86
instructions into one or a few simpler, fixed-length internal instructions called micro-ops, which are
not accessible outside the CPU.
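
As a conceptual sketch only (these are not the actual micro-ops Intel or AMD generate, and the
load/store helpers are invented stand-ins), a memory-to-register instruction such as ADD [mem], EAX
might break into three simpler internal steps:

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual only: hypothetical helpers, not real micro-op semantics. */
    static uint32_t memory[1];

    static uint32_t load(int addr)              { return memory[addr]; }
    static void     store(int addr, uint32_t v) { memory[addr] = v; }

    int main(void)
    {
        uint32_t eax = 5;
        memory[0] = 10;

        /* ADD [mem], EAX as three simpler, fixed-length internal steps: */
        uint32_t tmp = load(0);   /* micro-op 1: fetch the memory operand */
        tmp = tmp + eax;          /* micro-op 2: perform the addition     */
        store(0, tmp);            /* micro-op 3: write the result back    */

        printf("[mem] = %u\n", (unsigned)memory[0]);  /* prints 15 */
        return 0;
    }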

Although compatible microprocessors must produce identical results for identical instructions,
internal designs and operation may be entirely different. For example, the AMD Athlon 64, unlike
the Pentium 4, initially decodes an instruction into one or more intermediate macro-ops (complex
instructions require several macro-ops), which are ultimately converted into one or more micro-
ops. Internal architecture may even differ within CPU families and in succeeding generations of
highly compatible processor families, such as the Intel 386, 486, and Pentium. Later generations
often add instruction-set enhancements such as MMX (Multimedia Extensions) and SSE
(Streaming SIMD Extensions) in the case of Intel processors. AMD64,
which extended the AMD x86 architecture to enable 64-bit instructions and addressing, is another
example.



AMD Athlon 64
In this die shot, you can clearly see that the 1MB L2 cache (1) takes up about half of the chip. The DDR memory
interface (2) runs almost all the way around the outside edges of the L2 cache; the left edge of the chip houses
the HyperTransport circuitry (3). The left half of the chip comprises the floating point unit (4), the load/store area
(5), and the data cache (6). Below these circuits sit the execution units (7) and the bus unit (8). Finally, the
bottom left quadrant of the die holds the fetch/scan/align/microcode logic (9), the instruction cache (10), and the
memory controller (11). The clock generator (12) is a tiny area just left of the memory controller.

Whatever their differences, though, processors execute instructions sequentially unless a
comparison result, an external interrupt, or some other event causes a branch. A program counter
in the CPU keeps track of which instruction is current and which to fetch next. The processor
fetches data from memory (either on-board caches or external system RAM), places it in
registers, performs calculations such as comparisons and string operations, and writes results to
either internal registers or external memory.

An input clock signal keeps everything operating at a specific rate. Within the processor, the
signal may be subdivided into multiple, higher-frequency clocks to synchronize higher-speed
operations. But the big trick is organizing the processor's architecture so instructions execute as
efficiently and quickly as possible, a talent that ultimately influences the CPU's cost, size, and
performance.

Pipelines: Hanging 5

Nearly every CPU today processes instructions in a collection of many discrete stages called a
pipeline, most of which include five basic functions—instruction fetch, instruction decode, operand
address generation and fetch, execution, and write-back (which places results in registers or
memory). As instructions move from one stage to another, the CPU can fetch more instructions
from memory, inserting them into the top of the pipeline. A simple processor with a single pipeline
that permits only one instruction at a time to enter the decode stage is called scalar. Instructions
at various stages of processing fill its pipeline, and for simple operations, one instruction
completes every clock cycle. More complex tasks can cause delays, which are called pipeline stalls.

There are many variations on the scalar pipelining theme, with varying numbers and types of
stages for different types of operations. For example, in most processors, pipelines that handle
integer operations are a different length than those for floating point.
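
Here's a minimal sketch of the idea in C; the stage names follow the five basic functions above,
but it models no particular CPU. Each clock, every in-flight instruction advances one stage while
a new one enters at the top:

    #include <stdio.h>

    enum { FETCH, DECODE, OPERAND, EXECUTE, WRITEBACK, NSTAGES };
    static const char *stage_name[NSTAGES] =
        { "fetch", "decode", "operand", "execute", "writeback" };

    int main(void)
    {
        int stage[NSTAGES];                 /* instruction ID per stage */
        for (int s = 0; s < NSTAGES; s++)
            stage[s] = -1;                  /* -1 marks an empty slot   */

        int next = 0, total = 7;
        for (int clock = 1; clock <= total + NSTAGES - 1; clock++) {
            /* Advance from the back so no instruction is overwritten. */
            for (int s = NSTAGES - 1; s > 0; s--)
                stage[s] = stage[s - 1];
            stage[FETCH] = (next < total) ? next++ : -1;

            printf("clock %2d:", clock);
            for (int s = 0; s < NSTAGES; s++)
                if (stage[s] >= 0)
                    printf("  %s=I%d", stage_name[s], stage[s]);
            printf("\n");
        }
        /* Once full, the pipe retires one instruction every clock. */
        return 0;
    }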


Typically, a modern processor will divide basic pipeline operations into multiple simpler stages.
This type of design is called superpipelined. The fetch and decode stages might each be divided
into two or more substages, for example. Because simpler stages incur less signal propagation
delay, clock speeds can push higher.

Clock frequency isn't the only factor affecting performance, but high gigahertz ratings draw users.
Intel has consistently lengthened pipelines in new generations of Pentiums, enabling higher
clock speeds. There's a flip side, though: high heat from faster processing can limit clock speed,
as demonstrated by the 90nm Pentium 4 generation, which recently hit a 3.8-GHz ceiling. Also,
shorter pipelines that run at the same frequency can execute more instructions per clock cycle.

Rather than gauge performance by clock speed, many CPU enthusiasts rely heavily on
instructions per clock (IPC). Although that's an important measure, for many types of operations a
CPU with a lower IPC rating but a longer pipeline, higher clock speed, and other
microarchitectural enhancements can be much faster than one with a shorter pipeline and a
better IPC rating but a slower clock. That said, with a long pipeline, mispredicting the branch a
program will take—which causes a lengthy pipeline reload—can have serious performance
penalties if it happens frequently.
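
As a rough, hypothetical illustration of the IPC-versus-clock tradeoff (the numbers are invented,
and real workloads involve stalls, branches, and memory effects):

    #include <stdio.h>

    int main(void)
    {
        /* Throughput is roughly IPC x clock; both figures are made up. */
        double long_pipe  = 1.0 * 3.8e9;   /* IPC 1.0 at 3.8 GHz */
        double short_pipe = 1.5 * 2.4e9;   /* IPC 1.5 at 2.4 GHz */

        printf("long pipe:  %.1f billion instructions/s\n", long_pipe  / 1e9);
        printf("short pipe: %.1f billion instructions/s\n", short_pipe / 1e9);
        return 0;  /* here the longer, faster-clocked pipe wins: 3.8 vs. 3.6 */
    }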

Superscalar Processors

A superscalar processor incorporates multiple pipelines, functional units (which actually process
and execute the low-level commands), or both, and can decode, issue, and execute multiple
instructions per clock cycle. The original Pentium had dual integer pipelines and could often issue
and execute two simple integer instructions simultaneously.

Today, complex superscalar processors can issue and execute six or more different types of
instructions at the same time. With current x86 chips, these are micro-ops. So, for example, the
Athlon 64 can issue up to nine micro-ops per cycle. In fact, some specialized processors
incorporate thousands of very small pipelines, which operate simultaneously on large groups of
similar data.

Way back in August 2000, Intel presented details of the original Pentium 4 (Willamette) pipeline
design, which has 20 stages. The pipeline reads instructions from a trace cache—a special L1
instruction cache that holds micro-ops generated from decoding x86 instructions.

Intel Pentium 4
The colorful 1MB L2 cache (1) on the P4 Prescott chip takes up far less space, proportionally, than its equivalent
on the AMD Athlon 64. The floating point unit (2), schedulers (3), and allocator (4) lie, one above the other, near
the left edge of the die. ALUs (arithmetic logic units), AGUs (address generation units), and the register file sit in
a block (5) just to the right of the floating point unit and just left of the 16KB data cache (6). Directly below the
data cache is the trace cache (7), and the microcode ROM (8) is to the left of that. At far right are the
DTLB (data translation lookaside buffer) (9), the branch prediction unit (10), and the ITLB (instruction TLB) (11).

The Athlon 64 implements a 12-stage pipeline for integer instructions and a 17-stage pipeline for
floating-point. Unfortunately, Intel hasn't published a block diagram of the 31-stage pipeline in the newer
P4 (Prescott), but we can assume its stages are further subdivided with added functionality. The
extra stages allow higher clock speeds, and Intel improved branch prediction and instruction
scheduling to help mitigate long-pipeline performance penalties.

Instruction Issue Policies and Designs

Designing a CPU so it can process instructions out of their original program order can accelerate
decoding and execution. Chip complexity increases, but so does the potential for performance
enhancement from parallel execution of independent instructions. Rather than waiting for one
instruction to process before moving to the next—causing long stalls when, for example, data
must be fetched from main memory—the CPU works on other instructions, resulting in dramatic
improvement.

CPU architects have created many policies for instruction-processing order, such as in-order
issue with in-order completion, in-order issue with out-of-order completion, and out-of-order issue
with out-of-order completion. In-order issue sends instructions to pipelines in program order. In-
order completion does the same when writing results. A processor can issue instructions in order
yet execute and complete them out of order.

Out-of-order dispatching requires an instruction buffer between the decode and execute stages of
the pipeline. Naturally, the CPU must ultimately maintain program integrity when writing data to
registers and memory. Increased out-of-order operation requires more buffers. Some CPUs have
a centralized instruction window to buffer decoded instructions destined for all execution units.
Other designs use reservation stations—small instruction windows before the input to individual
execution units.
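
As a rough sketch of the idea (the field names are invented, and real designs track far more
state), a reservation-station entry buffers a decoded instruction until its operands arrive:

    #include <stdio.h>

    /* Hypothetical reservation-station entry: a decoded instruction waits
       here until both source operands are ready, then issues to its
       execution unit, possibly out of program order. */
    typedef struct {
        int op;                      /* operation to perform             */
        int dest;                    /* destination (physical) register  */
        int src1, src2;              /* source values or producer tags   */
        int src1_ready, src2_ready;  /* operand availability flags       */
    } rs_entry;

    static int can_issue(const rs_entry *e)
    {
        return e->src1_ready && e->src2_ready;
    }

    int main(void)
    {
        rs_entry e = { .op = 0, .dest = 7, .src1 = 3, .src2 = 4,
                       .src1_ready = 1, .src2_ready = 0 };
        printf("issue? %s\n", can_issue(&e) ? "yes" : "no");  /* no  */
        e.src2_ready = 1;          /* the producing instruction finished */
        printf("issue? %s\n", can_issue(&e) ? "yes" : "no");  /* yes */
        return 0;
    }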

Present-day processors with multiple execution units are said to have decoupled superscalar
architectures, because the fetch and decode stages are fully independent of subsequent stages.
A decoder can send its results to a central buffer or reservation station even if the execution units
are processing other instructions. And if decoding stalls, execution units can obtain already-
translated instructions from the reservation stations.

When the reservation stations for multiple execution units have many queued instructions, they
can execute out of order among the units, increasing overall throughput. The processor can
evaluate a fairly large number of instructions and determine which are best suited for execution at
a given time, based on factors like resource and operand availability.

Out-of-order completion might seem to imply that results are written to memory out of program order, but they
are not. Temporary buffers on the chip hold the results, which are written to system memory in
program order. Processors may differ in their methods of tracking and tagging the instructions
coursing through a pipeline, but all CPUs ultimately update architectural registers (those visible
externally) and system memory in program order. And of course, all of this magic is done out of
the programmer's sight.

Dependencies Limit Performance

Simultaneously issuing and executing multiple instructions to multiple pipelines causes complex
problems. The most important are true data dependencies (or read-after-write dependencies),
resource conflicts, and procedural dependencies. Other data dependencies, called output (write-
after-write) and anti-dependencies (write-after-read), are commonly classified as false
dependencies and are easily removed with special techniques. The problems just detailed are
often referred to as pipeline hazards.

True data dependencies among instructions—when one instruction depends on the result of a
previous one—prevent their simultaneous issue and execution. The processor forces the first
instruction to complete before the second. Output dependencies occur frequently: if two
successive instructions perform separate calculations but write their results to the same register,
executing the second before the first would leave the wrong final value in the register. An
anti-dependency arises when a later instruction writes to a register that an earlier instruction
uses as a source operand; executing the later instruction first would overwrite the operand before
the earlier instruction reads it.
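
In C-flavored form (the r-variables are stand-ins for registers, not real x86 names), the three
hazards look like this:

    int main(void)
    {
        int r2 = 2, r3 = 3, r6 = 6, r7 = 7;
        int r1, r4, r5, r8;

        /* True (read-after-write): the second statement needs r1's new
           value, so the pair can't safely issue and execute together. */
        r1 = r2 + r3;
        r4 = r1 + 1;

        /* Output (write-after-write): both statements write r5; finishing
           out of order leaves the wrong final value. Renaming removes it. */
        r5 = r2 * r3;
        r5 = r6 + r7;

        /* Anti (write-after-read): the second statement overwrites r5
           while the first still reads it. Renaming removes this one too. */
        r8 = r5 + r2;
        r5 = r3 - r4;

        return r1 + r4 + r5 + r8;  /* use the results so nothing is elided */
    }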

Conflicts and dependencies among instructions limit the potential throughput of a superscalar
processor, but various software and hardware techniques can remove or at least reduce many
problems. Compilers can optimize instruction execution, but recompiling a large program for
many different processor architectures is an unpleasant alternative. Aggressively implementing
superscalar designs can help minimize many types of dependencies, improving performance and
diminishing the need for recompilation.

Register Renaming

For many computations, a CPU uses the data stored in its registers. Processors with a large
number of registers can use them to hold many program variables, which reduces the number of
cache and external memory accesses. RISC (reduced instruction-set computing) chips often
have 128 registers or more.

CISC (complex instruction-set computing) chips, like x86 Intel and AMD CPUs, have a mere eight
32-bit general-purpose architectural (or logical) registers: EAX, EBX, ECX, EDX, EDI, ESI, EBP,
and ESP. The new AMD64 and Intel EM64T 64-bit x86 extensions add eight 64-bit registers and
expand the existing 32-bit registers to 64-bit.

The limited number of registers makes x86 processors prone to dependencies between
successive instructions. From instruction to instruction, registers are often reused when
performing independent operations, wreaking havoc with superscalar processing efficiency.

Register renaming, a technique used in all current processors based on the x86 instruction set,
can eliminate this false dependency by dynamically substituting different physical registers when
dealing with closely-grouped independent operations. Internally, renaming requires more physical
registers than the logical registers that are visible to the outside world (external programs).

If this seems complicated, it is. Here's an example. Picture two physical registers that represent
two or more instances of a logical register called R3—call them R3a and R3b. This renaming lets
the processor concurrently execute two independent instructions that each originally used R3,
eliminating the false dependency and preserving program order and correctness, as long as
subsequent instructions receive the contents of the most recent R3. If, however, information
written to a particular architectural register by one instruction is immediately read by the next,
there is a true dependency between the two instructions, which renaming can't remove.
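
Here's the R3 example in sketch form; R3a and R3b are the invented physical-register names from
above, and the variables are purely illustrative:

    int main(void)
    {
        int a = 1, b = 2, c = 3, d = 4;

        /* Before renaming: both pairs share the single architectural R3,
           creating false (output and anti) dependencies between them. */
        int R3;
        R3 = a + b; int use1 = R3;
        R3 = c + d; int use2 = R3;

        /* After renaming: each pair gets its own physical register, so the
           pairs are independent and can execute concurrently. Any later
           reader of "R3" is mapped to R3b, the most recent instance. */
        int R3a = a + b; int use1r = R3a;
        int R3b = c + d; int use2r = R3b;

        return use1 + use2 + use1r + use2r;
    }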

Designers often use data forwarding (also referred to as data bypassing) in conjunction with
register renaming to help remove or reduce the performance penalties of true data dependencies.
Data forwarding permits the processor to send the result from a just-executed instruction to a
successive one before that result is written back to the architectural registers, thus reducing
the read-after-write delay. Note that more complex forwarding scenarios exist, and not all
sequences of instructions can benefit from data forwarding.

Resource Conflicts

When two instructions compete for specific processor resources—registers, cache memories, or
execution units, for example—the processor must manage the conflicts. Often a CPU design
simply replicates particular resources, such as arithmetic logic units (ALUs) and load/store units,
but doing this too much complicates chip design and can make manufacturing prohibitively
expensive.

Some processors include arbitration schemes to manage access to shared buses, and most x86
chips rely on register renaming to eliminate resource conflicts. But chip designers face many
architectural tradeoffs and must avoid driving cost, complexity, or die sizes too high.

Intel's version of simultaneous multithreading—Hyper-Threading—is present in many Pentium 4
processors, and does an amazing job at conflict management. It attains up to 40-percent
performance gains in certain multithreaded applications and multiprocessing scenarios, yet adds
less than 5 percent to processor die size, and consumes less than 5 percent more power.

With Hyper-Threading, two separate instruction streams can execute simultaneously in a single-
core P4. Read all about Hyper-Threading in gory detail on our Web site at:
go.extremetech.com/hyperthreading.

Branches and Procedural Dependencies

Branch instructions, which can force a program to divert from its serial flow of instructions,
present a unique obstacle. An unconditional branch, as the name suggests, always diverts
instruction flow. A conditional branch, by contrast, diverts program flow based on a comparison
or condition code.

Conditional branches include many varieties of jump and loop instructions. These instructions
generate procedural dependencies that can cause serious performance problems in superscalar
processors. (Scalar processors are also affected, but not as dramatically.)

When a conditional branch is not taken, program flow falls through the branch to the next
sequential instruction. In many standard x86 programs, from 10 to 20 percent of instructions are
conditional branches, and up to 10 percent are unconditional. Conditional jumps are actually
taken roughly 50 percent of the time. Loop instructions cause branches to be taken very often—
some studies show a 90-percent average.

Most high-performance processors try to predict branch outcomes. Without this capability, when a
pipelined processor is handling many instructions at various stages and a branch changes
program flow, the entire pipeline must be flushed and instructions must load from the new target
address. That causes a big performance hit, especially with long pipelines. Branch prediction lets
the processor fetch and begin executing instructions from the expected branch address long
before the branch occurs, reducing potential delays and improving performance considerably.
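
One classic scheme, sketched below, is a two-bit saturating counter per branch: it takes two wrong
guesses in a row to flip the prediction, which suits loops well. (This is a simplified textbook
design, not any specific CPU's predictor.)

    #include <stdio.h>

    int main(void)
    {
        int counter = 2;                       /* start "weakly taken"    */
        int outcomes[] = { 1,1,1,0,1,1,1,0 };  /* 1 = taken, like a loop  */
        int n = sizeof outcomes / sizeof *outcomes, hits = 0;

        for (int i = 0; i < n; i++) {
            int predict_taken = (counter >= 2);   /* states 2-3 say taken */
            if (predict_taken == outcomes[i])
                hits++;
            /* Nudge toward the actual outcome, saturating at 0 and 3. */
            if (outcomes[i]) { if (counter < 3) counter++; }
            else             { if (counter > 0) counter--; }
        }
        printf("%d of %d predictions correct\n", hits, n);  /* 6 of 8 */
        return 0;
    }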

CPU Building Blocks

We've reviewed basics of pipeline operations, but haven't yet discussed the key functional
elements and operational units inside CPUs. At the highest level of abstraction, all
microprocessors have functional similarities. Virtually all of them read and write data from internal
registers, internal caches, and external memory. Intel P4 chips connect to the outside world via
the bus interface unit (BIU). AMD Athlon chips use the on-chip HyperTransport and memory
controller interface units.

Most current Intel and AMD CPU block diagrams show processing flow from top to bottom,
through the BIU, HyperTransport, and memory controller interfaces. The L2 cache is shown
attached at the sides in P4 and Athlon 64 diagrams. Note that some arrows between blocks are
unidirectional and some bidirectional. Block diagrams are not pipeline diagrams, but still let you
visualize pipeline flow.

The CPU reads instructions and data from the unified on-board L2 cache and feeds them to the
separate and smaller L1 instruction and data caches. (Data can also be written to the L1 data and
L2 unified caches, but most programs don't generally modify or write program instruction code.)
Instructions are read from the instruction cache and decoded. Branch prediction logic, in concert
with L1 instruction fetches, will try to predict the outcome of a conditional branch and provide a
pointer to the predicted address. The processor then fetches instructions from that address and
feeds them into the pipeline for decoding.

The Pentium 4 and Athlon 64 use similar techniques to break complex x86 instructions into
simpler internal operations. As noted earlier, the P4 decodes x86 instructions into micro-ops,
which it writes into the trace cache as sequences of micro-ops, called traces, in program order.
One big advantage of the trace cache is that it lets the CPU contiguously store a sequence of
micro-ops prior to a branch, the branch instruction itself, and the branch target instructions in the
same trace line. The Athlon 64 fetches x86 instructions from L2 cache, stores them in L1, and
then sends them to initial decoders for conversion to macro-ops.

The next step is usually register renaming and scheduling of instructions for execution. Working
in concert with various control units, the schedulers make certain that instructions are ready for
execution. (In an Athlon 64, they also perform the final decode into micro-ops.) Specifically, the
schedulers check that all operands are available. A reorder-buffer mechanism keeps track of the
original program order so the processor can retire instructions, processed out of order, in program
order.

The CPU then dispatches instructions to various execution units, and sends the results to
registers, to other instructions, or both. Load and store units handle any memory reads and
writes. The processor makes sure that data is written to the architectural registers and to memory
in program order, and all is well.

So much goes on inside a processor that covering everything is impossible in one article. We've
oversimplified operations, and we haven't discussed the internals of arithmetic logic units,
floating-point-unit pipelines, bus protocols, and more. But now you've got the lay of the land, and
you're well prepared to venture deep into complex and fascinating territory.

Pentium D
Let's get the most exciting news out of the way first: Intel is launching its Pentium D processor
line today, and the price of the 2.8GHz Pentium D is listed as $241 (in quantities of 1000)! It's
likely that initial processors in the channel will be pricier, but once the early demand is filled, you
should be able to find a CPU for well under $300--maybe even approaching $250. To put this in
context, AMD's lowest-cost desktop dual-core processor, the Athlon 64 X2 4200+, costs about
$537 (in quantities of 1000). Meanwhile, Intel's Pentium D 840, which clocks at 3.2GHz, will cost
$530 in lots of 1,000 units.

We'll discuss pricing and its impact in the final section, but we felt compelled to point out that Intel
is really pushing affordable dual-core CPUs out the door. Of course, if the performance is
inadequate for daily computing tasks, then pricing is irrelevant. So we ran the Pentium D 820
through our extensive benchmark suite.

Intel also announced another processor, the Pentium 4 Model 670, running at 3.80GHz. The 670
fills a hole in the EM64T processor line. Prior to its release, the highest-clock-rate Intel
CPU was the Pentium 4 570J, which also runs at 3.80GHz. However, the 670 offers 2MB of L2
cache, hardware no-execute bit support, and EM64T capabilities. We'll also present performance
results for this CPU, but as you might expect, it's not that much better than the 3.73GHz Pentium
4 Extreme Edition--though it is a tad cheaper.

In addition to the Model 820, Intel is announcing two more CPUs based on the Smithfield core,
the Pentium D models 830 and 840. Note that the Pentium D model 840 is not the same CPU as
the Pentium Extreme Edition 840.

What are some of the other differences? The Pentium Extreme Edition 840 supports Hyper-
Threading, while none of the Pentium Ds currently support Hyper-Threading. Also, Intel is
shipping the Extreme Edition 840 with an unlocked multiplier, making overclocking easier, as we
saw with the 4GHz Velocity Micro system we recently reviewed. The key difference between the
Pentium 4 670 and the Pentium 4 Extreme Edition 3.73GHz processor is the front-side bus speed
and, hence, the clock multiplier. The 670 runs with an 18x multiplier, while the P4EE uses a 14x
multiplier. The P4EE has a slight bandwidth edge, but it seems not to matter much. All the new
CPUs support a hardware NX (no execute) bit.

We also compared these processors with the Athlon 64 X2 and the Intel Pentium Extreme Edition
840, and tossed in results from our previews of the 3.73GHz Pentium 4 Extreme Edition and
Pentium 4 660.

With these thoughts in mind, let's see just how well a low-cost dual-core processor performs.
