Lecture 12: MEASURING CPU PERFORMANCE
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA

Introduction
• Most processors execute instructions in a synchronous manner using a clock that runs at a constant clock rate or frequency f.
• Clock cycle time C is the reciprocal of the clock rate f:
  C = 1 / f
• The clock rate f depends on two factors:
  a) The implementation technology used.
  b) The CPU organization used.
• A machine instruction typically consists of a number of elementary micro-operations that vary in number and complexity depending on the instruction and the CPU organization used.
• A micro-operation is an elementary hardware operation that can be carried out in one clock cycle.
  – Register transfer operations, arithmetic and logic operations, etc.
• Thus a single machine instruction may take one or more CPU cycles to complete.
  – We can characterize an instruction by its Cycles Per Instruction (CPI).
• Average CPI of a program:
  – Average CPI of all instructions executed in the program on a given processor.
  – Different instructions can have different CPIs.

Execution Time
• For a given program compiled to run on a specific machine, we can define the following parameters:
  a) The total number of instructions executed, or instruction count (IC).
  b) The average number of cycles per instruction (CPI).
  c) The clock cycle time (C) of the machine.
• The total execution time can be computed as:
  XT = IC x CPI x C
• How do we evaluate and compare the performances of several machines?
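The relation XT = IC x CPI x C can be wrapped in a small helper for the later examples. A minimal sketch; the numbers below are illustrative, not values from the lecture:

```python
def execution_time(ic, cpi, clock_rate_hz):
    """XT = IC x CPI x C, with the cycle time C = 1 / f folded in."""
    return ic * cpi / clock_rate_hz

# Illustrative values: 50 million instructions, CPI 2.0, 1 GHz clock
print(execution_time(50_000_000, 2.0, 1e9))  # 0.1 (seconds)
```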
24/07/17
• IC depends on:
  – Program used, compiler, ISA
• CPI depends on:
  – Program used, compiler, ISA, CPU organization
• C depends on:
  – Technology used to implement the CPU
• Unfortunately, it is very difficult to change one parameter in complete isolation from the others.
  – Basic technologies are interdependent.

• A tradeoff:
  – RISC: increases the number of instructions/program, but decreases CPI and clock cycle time because the instructions, and hence the implementations, are simple.
  – CISC: decreases the number of instructions/program, but increases CPI and clock cycle time because many instructions are more complex.
• Overall, it has been found through experimentation that RISC architecture gives better performance.
Example 2
• Suppose that a machine A executes a program with an average CPI of 2.3. Consider another machine B (with the same instruction set and a better compiler) that executes the same program with 20% fewer instructions and with a CPI of 1.7 at 1.2 GHz. What should be the clock rate of A so that the two machines have the same performance?
  We must have: ICA x CPIA x CA = ICB x CPIB x CB
  Hence: ICA x 2.3 x CA = 0.80 x ICA x 1.7 x (1 / (1.2 x 10^9))
  We get: CA = 0.49 x 10^-9 sec
  Thus, clock rate of A = 1 / CA = 2.04 GHz

Example 3
• Consider the earlier example with IC = 50,000,000, average CPI = 2.7, and clock rate = 2.0 GHz. Suppose we use a new compiler on the same program, for which:
  – New IC = 40,000,000
  – New CPI = 3.0 (i.e. the new compiler is using more complex instructions)
  – Also we have a faster CPU implementation, with clock rate = 2.4 GHz.
  Speedup = XTold / XTnew
          = (50,000,000 x 2.7 x 0.5 x 10^-9) / (40,000,000 x 3.0 x 0.4167 x 10^-9)
          = 1.35 → 35% faster
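Both worked examples can be checked numerically. A sketch that reproduces the slides' arithmetic:

```python
# Example 2: IC_A x 2.3 x C_A = 0.80 x IC_A x 1.7 x (1 / 1.2e9); IC_A cancels
ca = 0.80 * 1.7 / (2.3 * 1.2e9)
print(ca)            # ~0.4928e-9 s (the slide rounds to 0.49 ns)
print(1 / ca / 1e9)  # ~2.03 GHz (2.04 GHz when using the rounded 0.49 ns)

# Example 3: speedup = XT_old / XT_new, with XT = IC x CPI / f
xt_old = 50_000_000 * 2.7 / 2.0e9
xt_new = 40_000_000 * 3.0 / 2.4e9
print(xt_old / xt_new)  # -> 1.35, i.e. 35% faster
```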
Instruction Frequency and CPI
• For a program containing n instruction types, let ICi be the number of executed instructions of type i, and CPIi the cycles per instruction for that type:
  Instruction Count (IC) = Σ(i=1..n) ICi
  CPU clock cycles = Σ(i=1..n) (ICi x CPIi)
  CPI = CPU clock cycles / IC = Σ(i=1..n) ((ICi / IC) x CPIi)
• CPI can also be expressed in terms of the frequencies of the various instruction types that are executed in a program.
  – Fi denotes the frequency of execution of instruction type i:
    Fi = ICi / IC
  – Then: CPI = Σ(i=1..n) (Fi x CPIi)

Example 5
• Suppose for an implementation of a RISC ISA there are four instruction types, with their frequency of occurrence (for a typical mix of programs) and CPI as shown in the table below.

  Type     Frequency   CPI
  Load     20 %        4
  Store    8 %         3
  ALU      60 %        1
  Branch   12 %        2

  CPI = (0.20 x 4) + (0.08 x 3) + (0.60 x 1) + (0.12 x 2) = 1.88
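Example 5's frequency-weighted CPI can be sketched in a few lines:

```python
def average_cpi(mix):
    """CPI = sum of Fi x CPIi over the instruction mix."""
    return sum(f * c for f, c in mix)

# (frequency, CPI) pairs from Example 5
mix = [(0.20, 4), (0.08, 3), (0.60, 1), (0.12, 2)]
print(average_cpi(mix))  # -> 1.88
```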
Example 6
• Suppose that a program is running on a machine with the following instruction types, CPI values, and frequencies of occurrence.

  Type   CPI   Frequency
  A      1.3   60 %
  B      2.2   10 %
  C      2.0   30 %

  The CPU designer gives two options: (a) reduce the CPI of instruction type A to 1.1, and (b) reduce the CPI of instruction type B to 1.6. Which one is better?
  Average CPI for (a): 0.60 x 1.1 + 0.10 x 2.2 + 0.30 x 2.0 = 1.48
  Average CPI for (b): 0.60 x 1.3 + 0.10 x 1.6 + 0.30 x 2.0 = 1.54
  Option (a) is better.

END OF LECTURE 12
Lecture 13: CHOICE OF BENCHMARKS

Introduction
• Basic concept:
  – How do we compare the performances of two or more computer systems?
  – We need to execute some programs and measure the execution times.
• A set of standard programs used for comparison is called a benchmark.
• Various metrics have been proposed to carry out the evaluation.
  – To be discussed next.
b) MFLOPS (Million Floating Point Operations Per Second)
• Simply counts the number of floating-point operations executed per second.
• More suitable for applications that involve a lot of floating-point computations.
• Here again, different machines implement different floating-point operations.
• Different floating-point operations take different times.
  – Division is much slower than addition.
• A compiler performs no floating-point operations and hence has a MFLOPS rating of 0.
• Hence, this metric is not very suitable for comparisons across machines and also across programs.

Example 1
• Consider a processor with three instruction classes A, B and C, with the corresponding CPI values being 1, 2 and 3 respectively. The processor runs at a clock rate of 1 GHz. For a given program written in C, two compilers produce the following executed instruction counts.

  Instruction Count (in millions)
               For ICA   For ICB   For ICC
  Compiler 1      7         2         1
  Compiler 2     12         1         1

  Compute the MIPS rating and the CPU time for the two program versions.
MIPS = Clock Rate (MHz) / CPI
CPI = CPU Execution Cycles / Instruction Count
CPU Time = Instruction Count x CPI / Clock Rate

• Solution:
  – For compiler 1:
    CPI1 = (7 x 1 + 2 x 2 + 1 x 3) / (7 + 2 + 1) = 14 / 10 = 1.40
    MIPS Rating1 = 1000 MHz / 1.40 = 714.3 MIPS
    CPU Time1 = ((7 + 2 + 1) x 10^6 x 1.40) / (1 x 10^9) = 0.014 sec
  – For compiler 2:
    CPI2 = (12 x 1 + 1 x 2 + 1 x 3) / (12 + 1 + 1) = 17 / 14 = 1.21
    MIPS Rating2 = 1000 MHz / 1.21 = 826.4 MIPS
    CPU Time2 = ((12 + 1 + 1) x 10^6 x 1.21) / (1 x 10^9) = 0.017 sec
  The MIPS rating indicates that compiler 2 is faster, while in reality the reverse is true.

Example 2
• A loop in C:
    for (k=0; k<1000; k++) {
        A[k] = A[k] + s;
    }
• MIPS32 code ($t1 = address of s, $t3 = s, $t2 points to A[0]):
          LW   $t3, 0($t1)
          ADDI $t6, $t2, 4000
    Loop: LW   $t4, 0($t2)
          ADD  $t5, $t4, $t3
          SW   $t5, 0($t2)
          ADDI $t2, $t2, 4
          BNE  $t6, $t2, Loop
• The code is executed on a processor that runs at 1 GHz (C = 1 nsec).
• There are four instruction types, with CPI values as shown in the table:

  Instruction Type   CPI
  ALU                2
  LOAD               5
  STORE              6
  BRANCH             3

• We show some calculations next.
• The code has 2 instructions before the loop and 5 instructions in the body of the loop, which executes 1000 times.
  – Total instruction count IC = 5 x 1000 + 2 = 5002.
• Number of instructions executed and fraction Fi for each instruction type:
  – ICALU = 1 + 2 x 1000 = 2001, FALU = 2001 / 5002 = 0.4 = 40%
  – ICLOAD = 1 + 1 x 1000 = 1001, FLOAD = 1001 / 5002 = 0.2 = 20%
  – ICSTORE = 1000, FSTORE = 1000 / 5002 = 0.2 = 20%
  – ICBRANCH = 1000, FBRANCH = 1000 / 5002 = 0.2 = 20%
• Total CPU clock cycles = 2001 x 2 + 1001 x 5 + 1000 x 6 + 1000 x 3 = 18,007 cycles
• Average CPI = CPU clock cycles / IC = 18007 / 5002 = 3.6
• Execution time = IC x CPI x C = 5002 x 3.6 x 1 x 10^-9 = 18.0 μsec
• MIPS rating = Clock Rate (MHz) / CPI = 1000 MHz / 3.6 = 277.8 MIPS
• The processor achieves its peak MIPS rating when executing a program that only has instructions of the type with the lowest CPI (i.e. ALU, which has CPI = 2).
  – Peak MIPS rating = Clock Rate (MHz) / CPIALU = 1000 MHz / 2 = 500 MIPS
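The counting argument can be replayed programmatically. A sketch mirroring the slide's numbers:

```python
# Instruction counts: 2 setup instructions + 5-instruction body x 1000 iterations
counts = {"ALU": 1 + 2 * 1000, "LOAD": 1 + 1 * 1000,
          "STORE": 1000, "BRANCH": 1000}
cpi_table = {"ALU": 2, "LOAD": 5, "STORE": 6, "BRANCH": 3}

ic = sum(counts.values())                               # 5002
cycles = sum(counts[t] * cpi_table[t] for t in counts)  # 18007
cpi = cycles / ic                                       # ~3.6
exec_time = ic * cpi * 1e-9                             # 18.0 usec at C = 1 ns
mips = 1000 / cpi                                       # ~277.8 MIPS
print(ic, cycles, round(cpi, 1), round(mips, 1))
```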
SPEC95 Programs (Floating-Point)

  Benchmark   Description
  tomcatv     A mesh generation program
  swim        Shallow water modeling
  su2cor      Quantum physics Monte Carlo simulation
  hydro2d     Solving hydrodynamic Navier-Stokes equations
  mgrid       Multigrid solver on 3D potential field
  applu       Solving parabolic/elliptical differential equations
  turb3d      Simulates turbulence in a cube
  apsi        Solver for distribution of pollutant
  fpppp       Quantum chemistry simulation
  wave5       Simulation of plasma physics

CINT2000 (Integer)
  1. 164.gzip : Compression
  2. 175.vpr : FPGA placement / routing
  3. 176.gcc : C compiler
  4. 181.mcf : Combinatorial optimization
  5. 186.crafty : Chess playing
  6. 197.parser : Word processing
  7. 252.eon : Computer visualization
  8. 253.perlbmk : PERL interpreter
  9. 254.gap : Group theory interpreter
  10. 255.vortex : Object-oriented database
  11. 256.bzip2 : Compression
  12. 300.twolf : VLSI place / route
Lecture 14: SUMMARIZING PERFORMANCE RESULTS

Introduction
• We have seen the need for using real programs for benchmarking.
  – Benchmarks consist of a suite of programs.
• How do we consolidate all the run times and come up with a single metric that can be used for comparison?
  – Machine X may run benchmark B1 faster, while machine Y may run benchmark B2 faster.
  – Is X faster than Y, or is Y faster than X?
• We shall discuss several measures that are used for consolidation.
Example 1
• Weighted Arithmetic Mean (WAM):
  – It is computed as the sum of the products of weighting factors and execution times.
  – If Wi denotes the weighting factor of program i, we can write:
    Weighted Arithmetic Mean = Σ(i=1..n) (Wi x XTi)

              CPU A (in secs)   CPU B (in secs)   CPU C (in secs)
  Program P1        1                10                25
  Program P2      500               250                10
  Total Time      501               260                35

• If W1 = 0.50 and W2 = 0.50, we get WAMA = 250.5, WAMB = 130, WAMC = 17.5
• If W1 = 0.90 and W2 = 0.10, we get WAMA = 50.9, WAMB = 34, WAMC = 23.5
• If W1 = 0.10 and W2 = 0.90, we get WAMA = 450.1, WAMB = 226, WAMC = 11.5
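The weighted-arithmetic-mean table can be reproduced with a short script. A sketch:

```python
def wam(weights, times):
    """Weighted arithmetic mean: sum of Wi x XTi."""
    return sum(w * t for w, t in zip(weights, times))

# Execution times of programs P1 and P2 on CPUs A, B, C (from the table)
times = {"A": [1, 500], "B": [10, 250], "C": [25, 10]}
for w in [(0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]:
    print(w, {cpu: wam(w, t) for cpu, t in times.items()})
```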
(c) Normalized Execution Time
• As an alternative, we can normalize all execution times to a reference machine, and then take the average of the normalized execution times.
  – This approach is followed in the SPEC benchmarks, where a SPARCstation is taken as the reference machine.
• Here, XTRi denotes the execution time of the i-th program, normalized to the reference machine.
• The average normalized execution time can be expressed as either an arithmetic or a geometric mean:
  Normalized Arithmetic Mean = (1/n) Σ(i=1..n) XTRi
  Normalized Geometric Mean = (Π(i=1..n) XTRi)^(1/n)
Example 2

              CPU A (in secs)   CPU B (in secs)   CPU C (in secs)
  Program P1        1                10                25
  Program P2      500               250                10
  Total Time      501               260                35

                    Normalized to A       Normalized to B       Normalized to C
                    A     B     C         A     B     C         A      B      C
  Program P1       1.0  10.0  25.0       0.1   1.0   2.5       0.04   0.4    1.0
  Program P2       1.0   0.5   0.02      2.0   1.0   0.04      50.0   25.0   1.0
  Arithmetic mean  1.0   5.25 12.51      1.05  1.0   1.27      25.02  12.7   1.0
  Geometric mean   1.0   2.24  0.71      0.45  1.0   0.32      1.41   3.16   1.0

• Summary:
  – In contrast to arithmetic means, geometric means of normalized execution times are consistent no matter which machine is the reference.
  – Hence, the arithmetic mean should not be used to average normalized execution times.
  – One drawback of the geometric mean is that it does not predict execution times.
    • It can also encourage hardware and software designers to focus their attention on those benchmarks where performance is easiest to improve, rather than on the ones that are the slowest.
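The consistency claim can be checked numerically: normalizing to each machine in turn changes the geometric means, but not the ratios between machines. A sketch (`math.prod` needs Python 3.8+):

```python
import math

def geometric_mean(xs):
    """n-th root of the product of n values."""
    return math.prod(xs) ** (1 / len(xs))

times = {"A": [1, 500], "B": [10, 250], "C": [25, 10]}

# Normalize to each reference machine in turn; the RATIOS between
# machines stay the same regardless of the reference chosen.
for ref in times:
    norm = {cpu: geometric_mean([t / r for t, r in zip(ts, times[ref])])
            for cpu, ts in times.items()}
    print("ref", ref, {cpu: round(v, 2) for cpu, v in norm.items()})
```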
END OF LECTURE 14

Lecture 15: AMDAHL'S LAW (PART 1)
• Amdahl's law concerns the speedup achievable from an improvement in a computation that affects a fraction F of the computation, where the improvement has a speedup of S.

  Before improvement:  | 1 – F |   F   |
  After improvement:   | 1 – F | F / S |

• Amdahl's law demonstrates the law of diminishing returns.
• An example:
  – Suppose we are improving a part of the computer system that affects only 25% of the overall task.
  – The improvement can be very little or extremely large.
  – With "infinite" speedup, the 25% of the task can be done in "zero" time.
  – Maximum possible speedup = XTorig / XTnew = 1 / (1 – 0.25) = 1.33
  We can never get a speedup of more than 1.33.
10
24/07/17
1 1 1.00 50 1.32
Speedup = 2 1.14 100 1.33
(1 – F) + F / S
5 1.25 1000 1.33
• As S à ∞, Speedup à 1 / (1 – F) 10 1.29 100,000 1.33
– The frac4on F limits the maximum speedup that can be obtained.
61 2 62 2
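The formula and its saturation behaviour can be sketched in a few lines:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Reproduce the table for F = 0.25
for s in (1, 2, 5, 10, 50, 100, 1000, 100_000):
    print(s, round(amdahl_speedup(0.25, s), 2))
# The speedup saturates at 1 / (1 - 0.25) = 1.33 as s grows
```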
• Some examples:
  – We make 10% of a program 90X faster: speedup = 1 / (0.9 + 0.1 / 90) = 1.11
  – We make 90% of a program 10X faster: speedup = 1 / (0.1 + 0.9 / 10) = 5.26
  – We make 25% of a program 25X faster: speedup = 1 / (0.75 + 0.25 / 25) = 1.32
  – We make 50% of a program 20X faster: speedup = 1 / (0.5 + 0.5 / 20) = 1.90
  – We make 90% of a program 50X faster: speedup = 1 / (0.1 + 0.9 / 50) = 8.47

Example 1
• Suppose we are running a set of programs on a RISC processor, for which the following instruction mix is observed:

  Operation   Frequency   CPIi   Wi x CPIi   % Time
  Load         20 %        5      1.00        0.48
  Store         8 %        3      0.24        0.12
  ALU          60 %        1      0.60        0.29
  Branch       12 %        2      0.24        0.11

  CPI = 2.08; the % Time of each type = (Wi x CPIi) / 2.08
Example 2
• The execution time of a program on a machine is found to be 50 seconds, out of which 42 seconds is consumed by multiply operations. It is required to make the program run 5 times faster. By how much must the speed of the multiplier be improved?
  – Here, F = 42 / 50 = 0.84
  – According to Amdahl's law:
    5 = 1 / (0.16 + 0.84 / S)
    or, 0.80 + 4.2 / S = 1
    or, S = 21

Example 2a
• For the same program, it is now required to make it run 8 times faster. By how much must the speed of the multiplier be improved?
  – Here, F = 42 / 50 = 0.84
  – According to Amdahl's law:
    8 = 1 / (0.16 + 0.84 / S)
    or, 1.28 + 6.72 / S = 1
    or, S = –24
  No amount of speed improvement in the multiplier can achieve this. Maximum speedup achievable: 1 / (1 – F) = 6.25
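The algebra in Examples 2 and 2a can be inverted directly: solving Speedup = 1 / ((1 – F) + F / S) for S gives S = F / (1/Speedup – (1 – F)), which becomes negative (i.e. infeasible) once the target exceeds 1 / (1 – F). A sketch:

```python
def required_component_speedup(f, target):
    """Solve target = 1/((1-f) + f/S) for S; None if the target is unreachable."""
    denom = 1.0 / target - (1.0 - f)
    return f / denom if denom > 0 else None

print(round(required_component_speedup(0.84, 5), 2))  # 21.0
print(required_component_speedup(0.84, 8))            # None: the limit is 1/0.16 = 6.25
```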
Lecture 16: AMDAHL'S LAW (PART 2)

Example 1
• The total execution time of a typical program is made up of 60% CPU time and 40% I/O time. Which of the following alternatives is better?
  a) Increase the CPU speed by 50%
  b) Reduce the I/O time by half
  Assume that there is no overlap between CPU and I/O operations.
• Increase CPU speed by 50%:
  – Here, F = 0.60 and S = 1.5
  – Speedup = 1 / (0.40 + 0.60 / 1.5) = 1.25
• Reduce the I/O time by half:
  – Here, F = 0.40 and S = 2
  – Speedup = 1 / (0.60 + 0.40 / 2) = 1.25
Both alternatives give the same overall speedup of 1.25.

Example 2
• Suppose that a compute-intensive bioinformatics program is running on a given machine X and takes 10 days to run. The program spends 25% of its time doing integer instructions, and 40% of its time doing I/O. Which of the following two alternatives provides a better tradeoff?
  a) Use an optimizing compiler that reduces the number of integer instructions by 30% (assume all integer instructions take the same time).
  b) Optimize the I/O subsystem to reduce the latency of I/O operations from 10 μsec to 5 μsec (that is, a speedup of 2).
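Both examples reduce to plugging F and S into Amdahl's law. A sketch (Example 2's solution slide is not reproduced in this text, so those two lines are straightforward working, not quoted lecture results):

```python
def speedup(f, s):
    """Amdahl's law: fraction f of execution sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: both alternatives give the same result
print(speedup(0.60, 1.5))   # CPU 50% faster  -> 1.25
print(speedup(0.40, 2.0))   # I/O time halved -> 1.25

# Example 2: 30% fewer integer instructions means that part runs in
# 0.7x the time, i.e. S = 1/0.7; the I/O option is the better tradeoff
print(round(speedup(0.25, 1 / 0.7), 2))  # 1.08
print(round(speedup(0.40, 2.0), 2))      # 1.25
```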
Extension to Multiple Enhancements
• Suppose we carry out multiple optimizations to a program:
  – Optimization 1 speeds up a fraction F1 of the program by a factor S1
  – Optimization 2 speeds up a fraction F2 of the program by a factor S2

  Before:  | 1 – F1 – F2 |   F1    |   F2    |
  After:   | 1 – F1 – F2 | F1 / S1 | F2 / S2 |

  Speedup = 1 / ((1 – F1 – F2) + F1 / S1 + F2 / S2)

• In the calculation as shown, it is assumed that F1 and F2 are disjoint.
  – S1 and S2 do not apply to the same portion of execution.
• If that is not so, we have to treat the overlap as a separate portion of execution and measure its speedup independently.
  – Define F1only, F2only, and F1&2, with speedups S1only, S2only, and S1&2.

  Before:  | 1 – F1only – F2only – F1&2 |     F1only      |      F1&2     |     F2only      |
  After:   | 1 – F1only – F2only – F1&2 | F1only / S1only | F1&2 / S1&2   | F2only / S2only |

  Speedup = 1 / ((1 – F1only – F2only – F1&2) + F1only / S1only + F2only / S2only + F1&2 / S1&2)
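The two-enhancement formula generalizes naturally to any list of disjoint (fraction, speedup) pairs. A sketch with illustrative numbers (not from the lecture):

```python
def multi_speedup(enhancements):
    """Overall speedup for a list of disjoint (fraction, speedup) enhancements."""
    untouched = 1.0 - sum(f for f, _ in enhancements)
    return 1.0 / (untouched + sum(f / s for f, s in enhancements))

# Illustrative: 20% of execution sped up 3x, a disjoint 50% sped up 2x
print(round(multi_speedup([(0.20, 3), (0.50, 2)]), 2))  # 1.62
```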
Example 4
• Now we consider two levels of cache memory, L1-cache and L2-cache. Assumptions:
  – Without the cache, memory operations take 30% of execution time.
  – The L1-cache speeds up 80% of the memory operations by a factor of 4.
  – The L2-cache speeds up 50% of the remaining 20% of memory operations by a factor of 2.
  We want to find out the overall speedup.
• Solution:
  – Memory operations = 0.3
  – FL1 = 0.3 x 0.8 = 0.24, SL1 = 4
  – FL2 = 0.3 x (1 – 0.8) x 0.5 = 0.03, SL2 = 2
  Speedup = 1 / ((1 – FL1 – FL2) + FL1 / SL1 + FL2 / SL2)
          = 1 / ((1 – 0.24 – 0.03) + 0.24 / 4 + 0.03 / 2)
          = 1.24
END OF LECTURE 16