
Lecture 12: MEASURING CPU PERFORMANCE
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
•  Most processors execute instructions in a synchronous manner, using a clock that runs at a constant clock rate or frequency f.
•  Clock cycle time C is the reciprocal of the clock rate f:
       C = 1 / f
•  The clock rate f depends on two factors:
   a)  The implementation technology used.
   b)  The CPU organization used.
•  A machine instruction typically consists of a number of elementary micro-operations that vary in number and complexity depending on the instruction and the CPU organization used.

•  A micro-operation is an elementary hardware operation that can be carried out in one clock cycle.
   –  Register transfer operations, arithmetic and logic operations, etc.
•  Thus a single machine instruction may take one or more CPU cycles to complete.
   –  We can characterize an instruction by its Cycles Per Instruction (CPI).
•  Average CPI of a program:
   –  Average CPI of all instructions executed in the program on a given processor.
   –  Different instructions can have different CPIs.

•  For a given program compiled to run on a specific machine, we can define the following parameters:
   a)  The total number of instructions executed, or instruction count (IC).
   b)  The average number of cycles per instruction (CPI).
   c)  The clock cycle time (C) of the machine.
•  The total execution time can be computed as:
       Execution Time XT = IC x CPI x C
•  How do we evaluate and compare the performances of several machines?

By Measuring the Execution Times
•  One of the easiest methods to make the comparison.
•  We measure the execution times of a program on two machines (A and B), as XTA and XTB.
•  Performance can be defined as the reciprocal of execution time:
       PerfA = 1 / XTA
       PerfB = 1 / XTB
•  We can estimate the speedup of machine A over machine B as:
       Speedup = PerfA / PerfB = XTB / XTA

•  An example: A program is run on three different machines A, B and C, and execution times of 10, 25 and 75 are noted.
   –  A is 2.5 times faster than B
   –  A is 7.5 times faster than C
   –  B is 3.0 times faster than C
•  Simple for one program. But the main challenge is to extend the comparison when we have a set of programs.
   –  Shall be discussed later.
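These ratios are easy to check mechanically. Below is a minimal Python sketch (illustrative only; the helper name is ours, not from the lecture) that reproduces the three speedups:

def speedup(xt_slow, xt_fast):
    # Speedup of the faster machine over the slower one: XT_slow / XT_fast.
    return xt_slow / xt_fast

# Execution times of the same program on machines A, B and C.
xt = {"A": 10.0, "B": 25.0, "C": 75.0}

print(speedup(xt["B"], xt["A"]))   # 2.5 -> A is 2.5 times faster than B
print(speedup(xt["C"], xt["A"]))   # 7.5 -> A is 7.5 times faster than C
print(speedup(xt["C"], xt["B"]))   # 3.0 -> B is 3.0 times faster than C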


Example 1
•  A program is running on a machine with the following parameters:
   –  Total number of instructions executed = 50,000,000
   –  Average CPI for the program = 2.7
   –  CPU clock rate = 2.0 GHz (i.e. C = 0.5 x 10^-9 sec)
•  Execution time of the program: XT = IC x CPI x C
       XT = 50,000,000 x 2.7 x 0.5 x 10^-9 = 0.0675 sec

Factors Affecting Performance
                                         C     CPI    IC
   Hardware Technology (VLSI)            X
   Hardware Technology (Organization)    X      X
   Instruction set architecture                 X      X
   Compiler technology                          X      X
   Program                                      X      X
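The basic equation XT = IC x CPI x C is direct to code. A minimal Python sketch (the function name is assumed, not from the slides) reproducing Example 1:

def execution_time(ic, cpi, clock_rate_hz):
    # XT = IC x CPI x C, where the clock cycle time C = 1 / clock rate.
    return ic * cpi * (1.0 / clock_rate_hz)

print(execution_time(ic=50_000_000, cpi=2.7, clock_rate_hz=2.0e9))  # 0.0675 sec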

•  IC depends on:
   –  Program used, compiler, ISA
•  CPI depends on:
   –  Program used, compiler, ISA, CPU organization
•  C depends on:
   –  Technology used to implement the CPU
•  Unfortunately, it is very difficult to change one parameter in complete isolation from the others.
   –  The basic technologies are interdependent.

•  A tradeoff:
   –  RISC: increases the number of instructions/program, but decreases CPI and clock cycle time because the instructions, and hence the implementations, are simple.
   –  CISC: decreases the number of instructions/program, but increases CPI and clock cycle time because many instructions are more complex.
•  Overall, it has been found through experimentation that RISC architectures give better performance.

Example 2
•  Suppose that a machine A executes a program with an average CPI of 2.3. Consider another machine B (with the same instruction set and a better compiler) that executes the same program with 20% fewer instructions and with a CPI of 1.7 at 1.2 GHz.
   What should be the clock rate of A so that the two machines have the same performance?
   We must have:  ICA x CPIA x CA = ICB x CPIB x CB
   Hence:         ICA x 2.3 x CA = 0.80 x ICA x 1.7 x (1 / (1.2 x 10^9))
   We get:        CA = 0.49 x 10^-9 sec
   Thus, clock rate of A = 1 / CA = 2.04 GHz

Example 3
•  Consider the earlier example with IC = 50,000,000, average CPI = 2.7, and clock rate = 2.0 GHz.
   Suppose we use a new compiler on the same program, for which:
   –  New IC = 40,000,000
   –  New CPI = 3.0 (i.e. the new compiler is using more complex instructions)
   –  Also, we have a faster CPU implementation, with clock rate = 2.4 GHz (C = 0.4167 x 10^-9 sec).
   Speedup = XTold / XTnew
           = (50,000,000 x 2.7 x 0.5 x 10^-9) / (40,000,000 x 3.0 x 0.4167 x 10^-9)
           = 1.35  →  the new version is 35% faster
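Both examples reduce to a few lines of arithmetic. A minimal Python sketch (not from the slides; exact values shown, where the slides round intermediate results):

def execution_time(ic, cpi, clock_rate_hz):
    return ic * cpi / clock_rate_hz          # XT = IC x CPI x C, C = 1/f

# Example 2: ICA x 2.3 x CA = (0.80 x ICA) x 1.7 x CB  =>  CA = (0.80 x 1.7 / 2.3) x CB
cb = 1.0 / 1.2e9
ca = (0.80 * 1.7 / 2.3) * cb
print(1.0 / ca / 1e9)                        # ~2.03 GHz (2.04 GHz with CA rounded to 0.49 ns)

# Example 3: speedup of the recompiled program on the faster CPU.
xt_old = execution_time(50_000_000, 2.7, 2.0e9)
xt_new = execution_time(40_000_000, 3.0, 2.4e9)
print(xt_old / xt_new)                       # 1.35, i.e. 35% faster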


Instruction Types and CPI
•  Consider a program executing on a processor, with n types or classes of instructions (like load, store, ALU, branch, etc.).
       ICi  = number of instructions of type i executed
       CPIi = cycles per instruction for type i
•  The following expressions follow:
       CPU clock cycles = Σ (ICi x CPIi), summed over i = 1 to n
       Instruction Count (IC) = Σ ICi
       CPI = CPU clock cycles / IC = Σ ((ICi / IC) x CPIi)

Example 4
•  Consider an implementation of an ISA where the instructions can be classified into four types, with CPI values of 1, 2, 3 and 4 respectively. Two code sequences have the following instruction counts:

   Code Sequence   ICType1   ICType2   ICType3   ICType4
   CS-1               20        15         5         2
   CS-2               10        12        10         4

   CPU cycles for CS-1: 20x1 + 15x2 + 5x3 + 2x4 = 73    CPI for CS-1: 73 / 42 = 1.74
   CPU cycles for CS-2: 10x1 + 12x2 + 10x3 + 4x4 = 80   CPI for CS-2: 80 / 36 = 2.22
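The per-class sums of Example 4 can be expressed as a small helper. A minimal Python sketch (not from the slides):

def cycles_and_cpi(ic_by_type, cpi_by_type):
    # CPU clock cycles = sum(ICi x CPIi); average CPI = cycles / total IC.
    cycles = sum(ic * cpi for ic, cpi in zip(ic_by_type, cpi_by_type))
    return cycles, cycles / sum(ic_by_type)

cpi_by_type = [1, 2, 3, 4]
print(cycles_and_cpi([20, 15, 5, 2], cpi_by_type))   # (73, 1.738...)  CS-1
print(cycles_and_cpi([10, 12, 10, 4], cpi_by_type))  # (80, 2.222...)  CS-2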
Instruction Frequency and CPI
•  CPI can also be expressed in terms of the frequencies of the various instruction types that are executed in a program.
   –  Fi denotes the frequency of execution of instruction type i:
          Fi = ICi / IC
   –  The average CPI is then:
          CPI = Σ (Fi x CPIi), summed over i = 1 to n

Example 5
•  Suppose for an implementation of a RISC ISA there are four instruction types, with their frequency of occurrence (for a typical mix of programs) and CPI as shown in the table below.

   Type     Frequency   CPI
   Load       20 %       4
   Store       8 %       3
   ALU        60 %       1
   Branch     12 %       2

   CPI = (0.20 x 4) + (0.08 x 3) + (0.60 x 1) + (0.12 x 2) = 1.88
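The frequency-weighted form CPI = Σ (Fi x CPIi) is a one-liner. A minimal Python sketch (not from the slides):

def average_cpi(mix):
    # mix: list of (frequency, CPI) pairs; frequencies should sum to 1.
    return sum(f * cpi for f, cpi in mix)

mix = [(0.20, 4), (0.08, 3), (0.60, 1), (0.12, 2)]  # Load, Store, ALU, Branch
print(average_cpi(mix))                              # 1.88

The same helper reproduces Example 6 below: average_cpi([(0.60, 1.1), (0.10, 2.2), (0.30, 2.0)]) gives 1.48 for option (a), and average_cpi([(0.60, 1.3), (0.10, 1.6), (0.30, 2.0)]) gives 1.54 for option (b).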


Example 6
•  Suppose that a program is running on a machine with the following instruction types, CPI values, and frequencies of occurrence:

   Type   CPI   Frequency
   A      1.3     60 %
   B      2.2     10 %
   C      2.0     30 %

   The CPU designer gives two options: (a) reduce the CPI of instruction type A to 1.1, and (b) reduce the CPI of instruction type B to 1.6. Which one is better?

   Average CPI for (a): 0.60 x 1.1 + 0.10 x 2.2 + 0.30 x 2.0 = 1.48
   Average CPI for (b): 0.60 x 1.3 + 0.10 x 1.6 + 0.30 x 2.0 = 1.54

   Option (a) is better.

END OF LECTURE 12


Lecture 13: CHOICE OF BENCHMARKS
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
•  Basic concept:
   –  How to compare the performances of two or more computer systems?
   –  We need to execute some programs and measure the execution times.
•  The set of standard programs used for comparison is called a benchmark.
•  Various metrics have been proposed to carry out the evaluation.
   –  To be discussed next.

Some Early Metrics Used
a)  MIPS (Million Instructions Per Second)
   –  Computed as (IC / XT) x 10^-6
   –  Dependent on the instruction set, making it difficult to compare the MIPS of processors with different instruction sets.
   –  MIPS varies between programs running on the same processor.
   –  A higher MIPS rating may not mean better performance.
   –  Example: a machine with an optional floating-point co-processor.
      •  When the co-processor is used, the overall execution time is less, but more complex instructions are executed (i.e. smaller MIPS).
      •  Software routines take more time but give a higher MIPS value → a FALLACY.

•  The MIPS rating is only valid to compare the performance of two or more computers provided that the following conditions are satisfied:
   a)  The same program is used
   b)  The same ISA is used
   c)  The same compiler is used
•  In other words, the resulting programs used to obtain the MIPS rating are identical at the machine code level, with the same instruction count.

b)  MFLOPS (Million Floating Point Operations Per Second)
•  Simply counts the number of floating-point operations executed per second (in millions).
•  More suitable for applications that involve a lot of floating-point computations.
•  Here again, different machines implement different floating-point operations.
•  Different floating-point operations take different times.
   –  Division is much slower than addition.
•  A compiler performs no floating-point operations and thus has a MFLOPS rating of 0.
•  Hence, this metric is not very suitable for comparisons across machines and also across programs.

Example 1
•  Consider a processor with three instruction classes A, B and C, with the corresponding CPI values being 1, 2 and 3 respectively. The processor runs at a clock rate of 1 GHz. For a given program written in C, two compilers produce the following executed instruction counts:

                  Instruction Count (in millions)
                  ICA      ICB      ICC
   Compiler 1      7        2        1
   Compiler 2     12        1        1

   Compute the MIPS rating and the CPU time for the two program versions.


•  Useful relations:
       CPI = CPU Execution Cycles / Instruction Count
       MIPS = Clock Rate (MHz) / CPI
       CPU Time = Instruction Count x CPI / Clock Rate
•  Solution:
   –  For compiler 1:
          CPI1 = (7 x 1 + 2 x 2 + 1 x 3) / (7 + 2 + 1) = 14 / 10 = 1.40
          MIPS Rating1 = 1000 MHz / 1.40 = 714.3 MIPS
          CPU Time1 = ((7 + 2 + 1) x 10^6 x 1.40) / (1 x 10^9) = 0.014 sec
   –  For compiler 2:
          CPI2 = (12 x 1 + 1 x 2 + 1 x 3) / (12 + 1 + 1) = 17 / 14 = 1.21
          MIPS Rating2 = 1000 MHz / 1.21 = 826.4 MIPS
          CPU Time2 = ((12 + 1 + 1) x 10^6 x 1.21) / (1 x 10^9) = 0.017 sec
   The MIPS rating indicates that compiler 2 is faster, while in reality the reverse is true.

Example 2
•  A loop in C:
       for (k=0; k<1000; k++)
           A[k] = A[k] + s;
•  Equivalent MIPS32 code, where $t1 = address of s, $t3 = s, and $t2 points to A[0]:
             LW   $t3, 0($t1)
             ADDI $t6, $t2, 4000
       Loop: LW   $t4, 0($t2)
             ADD  $t5, $t4, $t3
             SW   $t5, 0($t2)
             ADDI $t2, $t2, 4
             BNE  $t6, $t2, Loop
•  The code is executed on a processor that runs at 1 GHz (C = 1 nsec).
•  There are four instruction types with CPI values as shown in the table:
       Instruction Type   CPI
       ALU                 2
       LOAD                5
       STORE               6
       BRANCH              3
•  We show some calculations next.
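The fallacy in Example 1 can be demonstrated numerically. A minimal Python sketch (not from the slides; exact values shown, where the slides round CPI2 to 1.21):

def mips_and_time(ic_millions, cpis, clock_hz):
    # ic_millions: executed instruction counts per class, in millions.
    ic = sum(ic_millions) * 1e6
    cycles = sum(n * 1e6 * cpi for n, cpi in zip(ic_millions, cpis))
    cpi = cycles / ic
    mips = clock_hz / 1e6 / cpi         # MIPS = clock rate (MHz) / CPI
    cpu_time = ic * cpi / clock_hz      # seconds
    return mips, cpu_time

cpis = [1, 2, 3]
print(mips_and_time([7, 2, 1], cpis, 1e9))   # (~714.3 MIPS, 0.014 sec)
print(mips_and_time([12, 1, 1], cpis, 1e9))  # (~823.5 MIPS, 0.017 sec): higher MIPS, yet slower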

•  The code has 2 instructions before the loop and 5 instructions in the body of the loop, which executes 1000 times.
   –  Total instruction count IC = 5 x 1000 + 2 = 5002.
•  Number of instructions executed and fraction Fi for each instruction type:
   –  ICALU = 1 + 2 x 1000 = 2001,   FALU = 2001 / 5002 = 0.4 = 40%
   –  ICLOAD = 1 + 1 x 1000 = 1001,  FLOAD = 1001 / 5002 = 0.2 = 20%
   –  ICSTORE = 1000,                FSTORE = 1000 / 5002 = 0.2 = 20%
   –  ICBRANCH = 1000,               FBRANCH = 1000 / 5002 = 0.2 = 20%
•  Total CPU clock cycles = 2001 x 2 + 1001 x 5 + 1000 x 6 + 1000 x 3 = 18,007 cycles
•  Average CPI = CPU clock cycles / IC = 18007 / 5002 = 3.6
•  Execution time = IC x CPI x C = 5002 x 3.6 x 1 x 10^-9 = 18.0 μsec

•  MIPS rating = Clock Rate (MHz) / CPI = 1000 MHz / 3.6 = 277.8 MIPS
•  The processor achieves its peak MIPS rating when executing a program that only has instructions of the type with the lowest CPI (i.e. ALU, which has CPI = 2).
   –  Peak MIPS rating = Clock Rate (MHz) / CPIALU = 1000 MHz / 2 = 500 MIPS
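The loop analysis above can be replayed as follows (a minimal Python sketch, not from the slides):

# 2 setup instructions plus a 5-instruction body executed 1000 times.
counts = {"ALU": 1 + 2 * 1000, "LOAD": 1 + 1 * 1000, "STORE": 1000, "BRANCH": 1000}
cpi_table = {"ALU": 2, "LOAD": 5, "STORE": 6, "BRANCH": 3}

ic = sum(counts.values())                               # 5002
cycles = sum(counts[t] * cpi_table[t] for t in counts)  # 18007
cpi = cycles / ic                                       # ~3.6
print(cycles * 1e-9)                                    # execution time at 1 GHz: ~18.0 usec
print(1000 / cpi)                                       # MIPS rating: ~277.8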

Choosing Programs for Benchmarking
•  Suppose you are trying to buy a new computer and you have several alternatives.
   –  How to decide which one will be best for you?
•  The best way to evaluate is to run the actual applications that you are expected to run (the actual target workload), and find out which computer runs them the fastest.
   –  Not possible for everyone to do this.
   –  We often rely on other methods that are standardized to give us a good measure of performance.

•  Different levels of programs used for benchmarking:
   a)  Real applications
   b)  Kernel benchmarks
   c)  Toy benchmarks
   d)  Synthetic benchmarks


(a) Real Applications
•  Here we select a specific mix or suite of programs that are typical of target applications or workloads (e.g. SPEC95, SPEC CPU2000, etc.).
•  SPEC (Standard Performance Evaluation Corporation) is the most popular and industry-standard set of CPU benchmarks.
•  Examples:
   –  SPECint95 consists of 8 integer programs.
   –  SPECfp95 consists of 10 floating-point intensive programs.
   –  SPEC CPU2000 consists of 12 integer programs (CINT2000) and 14 floating-point intensive programs (CFP2000).
   –  SPEC CPU2006 consists of 12 integer programs (CINT2006) and 17 floating-point intensive programs (CFP2006).

SPEC95 Programs (Integer)
   Benchmark   Description
   go          A game based on artificial intelligence
   m88ksim     A simulator for the Motorola 88k chip
   gcc         GNU C compiler generating SPARC code
   compress    Compression and decompression utility
   li          LISP interpreter
   ijpeg       Image compression and decompression utility
   perl        PERL interpreter
   vortex      A database program

SPEC95 Programs (Floating-Point)
   Benchmark   Description
   tomcatv     A mesh generation program
   swim        Shallow water modeling
   su2cor      Quantum physics Monte Carlo simulation
   hydro2d     Solving hydrodynamic Navier-Stokes equations
   mgrid       Multigrid solver on a 3D potential field
   applu       Solving parabolic/elliptical differential equations
   turb3d      Simulates turbulence in a cube
   apsi        Solver for distribution of pollutants
   fpppp       Quantum chemistry simulation
   wave5       Simulation of plasma physics

CINT2000 (Integer)
   1.  164.gzip : compression
   2.  175.vpr : FPGA placement / routing
   3.  176.gcc : C compiler
   4.  181.mcf : combinatorial optimization
   5.  186.crafty : chess playing
   6.  197.parser : word processing
   7.  252.eon : computer visualization
   8.  253.perlbmk : PERL interpreter
   9.  254.gap : group theory interpreter
   10. 255.vortex : object-oriented database
   11. 256.bzip2 : compression
   12. 300.twolf : VLSI place / route

CFP2000 (Floating-Point)
   1.  168.wupwise : quantum dynamics
   2.  171.swim : shallow water modeling
   3.  172.mgrid : multi-grid solver
   4.  173.applu : differential equation solver
   5.  177.mesa : 3D graphics library
   6.  178.galgel : fluid dynamics
   7.  179.art : neural networks
   8.  183.equake : seismic wave simulation
   9.  187.facerec : face recognition
   10. 188.ammp : computational chemistry
   11. 189.lucas : primality testing
   12. 191.fma3d : finite-element simulation
   13. 200.sixtrack : nuclear accelerator simulation
   14. 301.apsi : pollutant distribution

(b) Kernel Benchmarks
•  Here, key computationally-intensive pieces of code are extracted from real programs (e.g. Fast Fourier Transform, matrix factorization, etc.).
•  Unlike real programs, no user would actually run the kernel benchmarks.
   –  They are used solely to evaluate performance.
•  Kernels are best used to isolate the performance of specific features of a machine and evaluate them.
•  Examples: Livermore Loops, Linpack.
   –  Some compilers were reported to have been using benchmark-specific optimizations so as to give the machine a good rating.


(c) Toy Benchmarks
•  These are small pieces of code, typically between 10 and 100 lines.
•  They are convenient and can be run easily on any computer.
•  They have limited utility in benchmarking and hence are sparingly used.
•  Examples: Sieve of Eratosthenes, quicksort, etc.

(d) Synthetic Benchmarks
•  Somewhat similar in principle to kernel benchmarks.
   –  They try to match the average frequency of operations and operands of a large set of programs.
•  Synthetic benchmarks are further removed from reality than kernels, as kernel code is extracted from real programs, while synthetic code is created artificially to match an average execution profile.
   –  These are not real programs.
•  Examples: Whetstone, Dhrystone, etc.
•  Some drawbacks with synthetic benchmarks are discussed next.

•  Compiler and hardware optimizations can artificially inflate the performance of these benchmarks but not of real programs.
   –  Optimizing compilers can discard more than 25% of the Dhrystone code (e.g. loops that are executed only once).
   –  Compilers can optimize specific code sequences that appear in, say, the Whetstone benchmark:
          X = SQRT (EXP (ALOG (X) / T1))  →  X = EXP (ALOG (X) / (2 * T1))
      using the identity sqrt(e^x) = e^(x/2).

END OF LECTURE 13

Lecture 14: SUMMARIZING PERFORMANCE RESULTS
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
•  We have seen the need for using real programs for benchmarking.
   –  Benchmarks consist of a suite of programs.
•  How to consolidate all the run times and come up with a single metric that can be used for comparison?
   –  Machine X may run benchmark B1 faster, while machine Y may run benchmark B2 faster.
   –  Is X faster than Y, or is Y faster than X?
•  We shall be discussing several measures that are used for consolidation.


What about Reproducibility?
•  The actual run time of a program on a machine depends on many factors.
   –  Degree of multiprogramming, disk usage, compiler optimization, etc.
•  Reproducibility of the experiments is very important.
   –  Anyone should be able to run the experiment and get the same results.
   –  Benchmarks must therefore specify the execution environment very clearly.
•  Example: SPEC benchmarks mention details such as:
   –  Extensive description of the computer and the compiler flags.
   –  Hardware, software and baseline tuning parameters.

How to Summarize Performance Results?
•  The choice of a good benchmark suite that relates to real applications is essential to measuring performance.
•  For a single program, it is very easy to say which computer runs faster.
•  However, when there are multiple programs, the comparison may not be so straightforward. An example:

                 CPU A (in secs)   CPU B (in secs)   CPU C (in secs)
   Program P1          1                10                25
   Program P2        500               250                10
   Total Time        501               260                35

•  We can make the following statements, which may depict a confusing picture when considered together:
   –  A is 10 times faster than B for program P1.
   –  B is 2 times faster than A for program P2.
   –  A is 25 times faster than C for program P1.
   –  C is 50 times faster than A for program P2.
   –  B is 2.5 times faster than C for program P1.
   –  C is 25 times faster than B for program P2.

(a) Total Execution Time
•  The simplest approach to summarizing the relative performances is to look at the total execution times for the two programs.
   –  CPU A: 501 secs, CPU B: 260 secs, CPU C: 35 secs.
•  Based on this measure, we can make the following comments:
   –  B is 501 / 260 = 1.93 times faster than A for the two programs.
   –  C is 501 / 35 = 14.31 times faster than A for the two programs.
   –  C is 260 / 35 = 7.43 times faster than B for the two programs.
•  If the actual workload consists of running P1 and P2 an unequal number of times, this measure will not give the correct result.

(b) Weighted Execution Time
•  Arithmetic Mean:
   –  Defined as the average execution time over all the programs in the benchmark suite.
   –  If XTi denotes the execution time of the i-th program, and there are n programs, we can write:
          Arithmetic Mean = (1/n) Σ XTi, summed over i = 1 to n
•  If the programs constituting the workload do not run equally often, we can attach a weightage factor to every program and carry out the calculation accordingly.
   –  Thus, if 40% of the tasks in the workload are program P1, and 60% are program P2, we can define the corresponding weights as W1 = 0.4 and W2 = 0.6.
•  Two alternative approaches are possible:
   i.  Weighted Arithmetic Mean
   ii. Normalized Execution Time


•  Weighted Arithmetic Mean (WAM):
   –  Computed as the sum of the products of the weighting factors and the execution times.
   –  If Wi denotes the weighting factor of program i, we can write:
          Weighted Arithmetic Mean = Σ (Wi x XTi), summed over i = 1 to n

Example 1
                 CPU A (in secs)   CPU B (in secs)   CPU C (in secs)
   Program P1          1                10                25
   Program P2        500               250                10
   Total Time        501               260                35

•  If W1 = 0.50 and W2 = 0.50, we get WAMA = 250.5, WAMB = 130, WAMC = 17.5
•  If W1 = 0.90 and W2 = 0.10, we get WAMA = 50.9,  WAMB = 34,  WAMC = 23.5
•  If W1 = 0.10 and W2 = 0.90, we get WAMA = 450.1, WAMB = 226, WAMC = 11.5
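A minimal Python sketch (not from the slides) that regenerates the weighted arithmetic means for all three weightings:

def wam(weights, times):
    # Weighted Arithmetic Mean = sum(Wi x XTi).
    return sum(w * t for w, t in zip(weights, times))

times = {"A": [1, 500], "B": [10, 250], "C": [25, 10]}
for weights in ([0.5, 0.5], [0.9, 0.1], [0.1, 0.9]):
    print(weights, {cpu: wam(weights, xt) for cpu, xt in times.items()})
# [0.5, 0.5] -> A: 250.5, B: 130.0, C: 17.5, and so on for the other two weightings.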

(c) Normalized Execution Time
•  As an alternative, we can normalize all execution times to a reference machine, and then take the average of the normalized execution times.
   –  Followed in the SPEC benchmarks, where a SPARCstation is taken as the reference machine.
•  Here, XTRi denotes the execution time of the i-th program, normalized to the reference machine. The average normalized execution time can be expressed as either an arithmetic or a geometric mean:

       Normalized Arithmetic Mean = (1/n) Σ XTRi, summed over i = 1 to n
       Normalized Geometric Mean = (Π XTRi)^(1/n), product over i = 1 to n

Example 2
                 CPU A (in secs)   CPU B (in secs)   CPU C (in secs)
   Program P1          1                10                25
   Program P2        500               250                10
   Total Time        501               260                35

                      Normalized to A        Normalized to B        Normalized to C
                      A     B      C         A      B     C         A      B      C
   Program P1        1.0   10.0   25.0      0.1    1.0   2.5       0.04   0.4    1.0
   Program P2        1.0   0.5    0.02      2.0    1.0   0.04      50.0   25.0   1.0
   Arithmetic mean   1.0   5.25   12.51     1.05   1.0   1.27      25.02  12.7   1.0
   Geometric mean    1.0   2.24   0.71      0.45   1.0   0.32      1.41   3.16   1.0

•  Summary:
   –  In contrast to arithmetic means, geometric means of normalized execution times are consistent no matter which machine is the reference.
   –  Hence, the arithmetic mean should not be used to average normalized execution times.
   –  One drawback of geometric means is that they do not predict execution times.
      •  They can also encourage hardware and software designers to focus their attention on those benchmarks where performance is easiest to improve, rather than on the ones that are the slowest.
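The consistency claim is easy to verify. A minimal Python sketch (not from the slides) computing both normalized means for every choice of reference machine:

from math import prod

times = {"A": [1, 500], "B": [10, 250], "C": [25, 10]}

def norm_means(machine, reference):
    # Normalize each program's time to the reference machine, then average.
    ratios = [t / r for t, r in zip(times[machine], times[reference])]
    n = len(ratios)
    return sum(ratios) / n, prod(ratios) ** (1.0 / n)

for ref in times:
    print(ref, {m: norm_means(m, ref) for m in times})
# The geometric means preserve the same machine ratios (e.g. B/A = 2.24)
# whichever machine is the reference; the arithmetic means do not.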


END OF LECTURE 14

Lecture 15: AMDAHL'S LAW (PART 1)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduc)on What is Amadahl’s Law?


•  It can be used to find the maximum expected improvement of an overall
•  Amadahl’s law was established in 1967 by Gene system when only part of the system is improved.
Amadahl. •  It basically states that the performance improvement to be gained from
•  Basically provides an understanding on scaling, using some faster mode of execu4on is limited by the frac4on of the 4me
limita4ons and economics of parallel compu4ng. the faster mode can be used.
•  Forms the basis for quan4ta4ve principles in •  Very useful to check whether any proposed improvement can provide
computer system design. expected return.
•  Can be applied to other applica4on domains as well. –  Used by computer designers to enhance only those architectural features that
result in reasonable performance improvement.
Gene Amadahl –  Referred to as quan6ta6ve principles in design.

57 2 58 2

•  Amadahl’s law demonstrates the law of diminishing returns. •  Amadahl’s law concerns the speedup achievable from an improvement in
computa4on that affects a frac4on F of the computa4on, where the
•  An example: improvement has a speedup of S.
–  Suppose we are improving a part of the computer system that affects
only 25% of the overall task.
–  The improvement can be very liUle or extremely large. Before improvement 1 – F F
–  With “infinite” speedup, the 25% of the task can be done in “zero” 4me.
–  Maximum possible speedup = XTorig / XTnew = 1 / (1 – 0.25) = 1.33
Aaer improvement 1 – F F / S
We can never get a speedup of more than 1.33

59 2 60 2


•  Execution time before improvement: (1 – F) + F = 1
•  Execution time after improvement: (1 – F) + F / S
•  Speedup obtained:

       Speedup = 1 / ((1 – F) + F / S)

•  As S → ∞, Speedup → 1 / (1 – F)
   –  The fraction F limits the maximum speedup that can be obtained.

•  Illustration of the law of diminishing returns:
   –  Let F = 0.25. The table shows the speedup (= 1 / (1 – F + F / S)) for various values of S. As S → ∞, the speedup approaches 1 / (1 – 0.25) = 1.33.

       S    Speedup        S         Speedup
       1     1.00          50         1.32
       2     1.14          100        1.33
       5     1.25          1000       1.33
       10    1.29          100,000    1.33
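Amdahl's law is a one-line function. A minimal Python sketch (not from the slides) that regenerates the F = 0.25 table:

def amdahl_speedup(f, s):
    # Overall speedup when a fraction f of execution is sped up by a factor s.
    return 1.0 / ((1.0 - f) + f / s)

for s in (1, 2, 5, 10, 50, 100, 1000, 100_000):
    print(s, round(amdahl_speedup(0.25, s), 2))   # converges to 1.33 as s grows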

Design Alterna)ve using Amadahl’s law


•  Illustra4on of law of diminishing returns: 1 / (1 – 0.75) = 4.00
–  Let F = 0.75.
–  The table shows the speedup for various values of S.
Loop 1 500 lines 10% of total execu4on 4me
S Speedup S Speedup
1 1.00 50 3.77
2 1.60 100 3.88
5 2.50 1000 3.99 Loop 2 20 lines 90% of total execu4on 4me
10 3.08 100,000 4.00

63 2 64 2

•  Some examples:
   –  We make 10% of a program 90X faster: speedup = 1 / (0.9 + 0.1 / 90) = 1.11
   –  We make 90% of a program 10X faster: speedup = 1 / (0.1 + 0.9 / 10) = 5.26
   –  We make 25% of a program 25X faster: speedup = 1 / (0.75 + 0.25 / 25) = 1.32
   –  We make 50% of a program 20X faster: speedup = 1 / (0.5 + 0.5 / 20) = 1.90
   –  We make 90% of a program 50X faster: speedup = 1 / (0.1 + 0.9 / 50) = 8.47

Example 1
•  Suppose we are running a set of programs on a RISC processor, for which the following instruction mix is observed:

   Operation   Frequency   CPIi   Fi x CPIi   % Time
   Load          20 %       5       1.00        48 %
   Store          8 %       3       0.24        12 %
   ALU           60 %       1       0.60        29 %
   Branch        12 %       2       0.24        11 %

   Average CPI = 1.00 + 0.24 + 0.60 + 0.24 = 2.08
   (% Time of each type = (Fi x CPIi) / 2.08)

   We carry out a design enhancement by which the CPI of Load instructions reduces from 5 to 2. What will be the overall performance improvement?

•  Solution:
   –  Fraction enhanced:    F = 0.48 (the fraction of time taken by Load instructions)
   –  Fraction unaffected:  1 – F = 1 – 0.48 = 0.52
   –  Enhancement factor:   S = 5 / 2 = 2.5

       Speedup = 1 / ((1 – F) + F / S) = 1 / (0.52 + 0.48 / 2.5) = 1.40

•  Alternate way of calculation:
   –  Old CPI = 2.08
   –  New CPI = 0.20 x 2 + 0.08 x 3 + 0.60 x 1 + 0.12 x 2 = 1.48
   –  Speedup = XTorig / XTnew = (IC x CPIold x C) / (IC x CPInew x C)
              = CPIold / CPInew = 2.08 / 1.48 = 1.40

Example 2
•  The execution time of a program on a machine is found to be 50 seconds, out of which 42 seconds are consumed by multiply operations. It is required to make the program run 5 times faster. By how much must the speed of the multiplier be improved?
   –  Here, F = 42 / 50 = 0.84
   –  According to Amdahl's law:
          5 = 1 / (0.16 + 0.84 / S)
      or, 0.80 + 4.2 / S = 1
      or, S = 21

Example 2a
•  The same program, but now it is required to make it run 8 times faster. By how much must the speed of the multiplier be improved?
   –  Here, F = 42 / 50 = 0.84
   –  According to Amdahl's law:
          8 = 1 / (0.16 + 0.84 / S)
      or, 1.28 + 6.72 / S = 1
      or, S = –24
   –  No amount of speed improvement in the multiplier can achieve this. The maximum speedup achievable is 1 / (1 – F) = 6.25.
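Examples 2 and 2a amount to solving Amdahl's law for S. A minimal Python sketch (not from the slides; the algebra is target = 1 / ((1 – F) + F/S), so S = F / (1/target – (1 – F))):

def required_s(f, target):
    # Enhancement factor S needed for a desired overall speedup.
    denom = 1.0 / target - (1.0 - f)
    if denom <= 0:
        raise ValueError("unachievable: max speedup is %.2f" % (1.0 / (1.0 - f)))
    return f / denom

print(required_s(0.84, 5))   # 21.0 (Example 2)
print(required_s(0.84, 8))   # raises ValueError: max speedup is 6.25 (Example 2a)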

Example 3
•  Suppose we plan to upgrade the processor of a web server. The new CPU is 30 times faster on search queries than the old processor, and the old processor is busy with search queries 80% of the time. Estimate the speedup obtained by the upgrade.
   –  Here, F = 0.80 and S = 30
   –  Thus, speedup = 1 / (0.20 + 0.80 / 30) = 4.41

END OF LECTURE 15


Lecture 16: AMDAHL'S LAW (PART 2)
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Example 1
•  The total execution time of a typical program is made up of 60% CPU time and 40% I/O time. Which of the following alternatives is better?
   a)  Increase the CPU speed by 50%
   b)  Reduce the I/O time by half
   Assume that there is no overlap between CPU and I/O operations.
   (Diagram: the program as alternating CPU and I/O bursts.)

•  Increase the CPU speed by 50%:
   –  Here, F = 0.60 and S = 1.5
   –  Speedup = 1 / (0.40 + 0.60 / 1.5) = 1.25
•  Reduce the I/O time by half:
   –  Here, F = 0.40 and S = 2
   –  Speedup = 1 / (0.60 + 0.40 / 2) = 1.25
   Thus, both alternatives result in the same speedup.

Example 2
•  Suppose that a compute-intensive bioinformatics program is running on a given machine X, and takes 10 days to run. The program spends 25% of its time doing integer instructions, and 40% of its time doing I/O. Which of the following two alternatives provides the better tradeoff?
   a)  Use an optimizing compiler that reduces the number of integer instructions by 30% (assume all integer instructions take the same time).
   b)  Optimize the I/O subsystem so that the latency of I/O operations reduces from 10 μsec to 5 μsec (that is, a speedup of 2).

•  Alterna4ve (a): Amadahl’s Law Corollary 1


–  Here, F = 0.25 and S = 100 / 70
–  Speedup = 1 / (0.75 + 0.25 * 70 / 100) = 1.08 •  Make the common case fast.
•  Alterna4ve (b): –  Here “common” means most 4me consuming, and not “most frequent”.
–  Here, F = 0.40 and S = 2 –  According to Amadahl’s law, improving the “uncommon” case will not
–  Speedup = 1 / (0.60 + 0.40 / 2) = 1.25 result in much improvement.
–  The “common” case has to be determined through experimenta4on and
profiling.
•  When op4miza4ons are carried out, a case that was common earlier may
become uncommon later, or vice versa.

77 2 78 2


Amadahl’s Law Corollary 2 Amadahl’s Non-Corollary


Lnew = Lold * F / S + Lold * (1 – F)
•  Amadahl’s law for latency (L)
–  By defini4on, Lnew = Lold / Speedup •  Amadahl’s law does not bound slowdown.
–  By Amadahl’s law, Lnew = Lold * ((1 – F) + F / S) –  Things can get arbitrarily slow if we hurt the non-common case too much.
–  We can write: •  Example: Suppose F = 0.01 and Lold = 1
Lnew = Lold * F / S + Lold * (1 – F) –  Case 1: S= 0.001 (1000 4mes worse)
Lnew = Lold * 0.01 / 0.001 + Lold * 0.99 = 10 * Lold
–  Case 2: S = 0.00001 (105 4mes worse)
Lnew = Lold * 0.01 / 0.00001 + Lold * 0.99 = 1000 * Lold

79 2 80 2

Extension to Multiple Enhancements
•  Suppose we carry out multiple optimizations to a program:
   –  Optimization 1 speeds up a fraction F1 of the program by a factor S1
   –  Optimization 2 speeds up a fraction F2 of the program by a factor S2

       Before:  | 1 – F1 – F2 |    F1   |    F2   |
       After:   | 1 – F1 – F2 | F1 / S1 | F2 / S2 |

       Speedup = 1 / ((1 – F1 – F2) + F1 / S1 + F2 / S2)

•  In the calculation as shown, it is assumed that F1 and F2 are disjoint.
   –  S1 and S2 do not apply to the same portion of the execution.
•  If that is not so, we have to treat the overlap as a separate portion of the execution and measure its speedup independently.
   –  Fractions F1only, F2only, and F1&2, with speedups S1only, S2only, and S1&2:

       Before:  | 1 – F1only – F2only – F1&2 |     F1only      |     F1&2     |     F2only      |
       After:   | 1 – F1only – F2only – F1&2 | F1only / S1only | F1&2 / S1&2  | F2only / S2only |

       Speedup = 1 / ((1 – F1only – F2only – F1&2) + F1only / S1only + F2only / S2only + F1&2 / S1&2)

•  General expression:
   –  Assume m enhancements of fractions F1, F2, …, Fm by factors of S1, S2, …, Sm respectively:

       Speedup = 1 / ((1 – Σ Fi) + Σ (Fi / Si)), sums over i = 1 to m

Example 3
•  Consider an example of a memory system: a main memory and a fast memory called cache memory.
   (Diagram: CPU connected to Main Memory through a Cache Memory.)
   –  Frequently used parts of program/data are kept in the cache memory.
   –  Use of the cache memory speeds up memory accesses by a factor of 8.
   –  Without the cache, memory operations consume a fraction 0.40 of the total execution time.
   –  Estimate the speedup.

   Solution:
       Speedup = 1 / ((1 – F) + F / S) = 1 / ((1 – 0.4) + 0.4 / 8) = 1 / 0.65 = 1.54


Example 4
•  Now we consider two levels of cache memory, L1-cache and L2-cache. Assumptions:
   –  Without any cache, memory operations take 30% of the execution time.
   –  The L1-cache speeds up 80% of the memory operations by a factor of 4.
   –  The L2-cache speeds up 50% of the remaining 20% of the memory operations by a factor of 2.
   We want to find out the overall speedup.
•  Solution:
   –  Memory operations = 0.3
   –  FL1 = 0.3 x 0.8 = 0.24, with SL1 = 4
   –  FL2 = 0.3 x (1 – 0.8) x 0.5 = 0.03, with SL2 = 2

       Speedup = 1 / ((1 – FL1 – FL2) + FL1 / SL1 + FL2 / SL2)
               = 1 / ((1 – 0.24 – 0.03) + 0.24 / 4 + 0.03 / 2)
               = 1.24
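The general form is a direct translation of the formula above. A minimal Python sketch (not from the slides), checked against Examples 3 and 4:

def amdahl_multi(enhancements):
    # enhancements: list of (fraction, speedup) pairs with disjoint fractions.
    unaffected = 1.0 - sum(f for f, _ in enhancements)
    return 1.0 / (unaffected + sum(f / s for f, s in enhancements))

print(amdahl_multi([(0.40, 8)]))             # Example 3: ~1.54
print(amdahl_multi([(0.24, 4), (0.03, 2)]))  # Example 4: ~1.24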

END OF LECTURE 16