CS3350B

Computer Architecture
Winter 2015

Lecture 7.2: Multicore TLP (1)

Marc Moreno Maza

www.csd.uwo.ca/Courses/CS3350b

[Adapted from lectures on Computer Organization and Design, Patterson & Hennessy, 4th or 5th edition, 2011]

Review: Multiprocessor Systems (MIMD)
• Multiprocessor (Multiple Instruction, Multiple Data): a computer system with at least 2 processors

[Figure: three processors, each with its own cache, connected by an interconnection network to memory and I/O]

  - Deliver high throughput for independent jobs via job-level parallelism on top of ILP
  - Improve the run time of a single program that has been specially crafted to run on a multiprocessor: a parallel processing program

Now use the term "core" for processor ("multicore"), because "multiprocessor microprocessor" is too redundant

Review

• Sequential software is slow software
  - SIMD and MIMD only path to higher performance
• Multiprocessor (Multicore) uses Shared Memory (single address space) (SMP)
• Cache coherency implements shared memory even with multiple copies in multiple caches
  - False sharing a concern
• MESI Protocol ensures cache consistency and has optimizations for common cases

Multiprocessors and You
• Only path to performance is parallelism
  - Clock rates flat or declining
  - SIMD: 2X width every 3-4 years
    - 128b wide now, 256b 2011, 512b in 2014?, 1024b in 2018?
    - Advanced Vector Extensions are 256 bits wide!
  - MIMD: Add 2 cores every 2 years: 2, 4, 6, 8, 10, …
• A key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increases, i.e., that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication

Example: Sum Reduction

• Sum 100,000 numbers on 100 processor SMP
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Phase I: Partition 1000 numbers per processor; initial summation on each processor

        sum[Pn] = 0;                                    // 0 ≤ Pn ≤ 99
        for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
            sum[Pn] = sum[Pn] + A[i];

• Phase II: Add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then quarter, …
  - Need to synchronize between reduction steps

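As a rough illustration of Phase I, the sketch below runs the partitioned summation sequentially in Python. The sizes are scaled down from the slide (4 processors with 5 numbers each instead of 100 with 1000), and the names `NUM_PROCS`, `CHUNK`, and `partial` are assumptions chosen for this example; `partial` plays the role of `sum[Pn]`.

```python
# Phase I of the sum reduction, run sequentially for illustration only.
# Sizes are scaled down from the slide's 100,000 numbers on 100 processors.
NUM_PROCS = 4          # hypothetical processor count (100 on the slide)
CHUNK = 5              # numbers per processor (1000 on the slide)

A = list(range(1, NUM_PROCS * CHUNK + 1))   # sample data: 1..20
partial = [0] * NUM_PROCS                   # plays the role of sum[Pn]

for Pn in range(NUM_PROCS):                 # on real hardware every Pn runs in parallel
    for i in range(CHUNK * Pn, CHUNK * (Pn + 1)):
        partial[Pn] = partial[Pn] + A[i]

print(partial)                              # one partial sum per processor
```

Each iteration of the outer loop touches only its own slice of `A` and its own slot of `partial`, which is why Phase I needs no synchronization.
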
Example: Sum Reduction
Second Phase: after each processor has computed its “local” sum.
This code runs simultaneously on each core.

    half = 100;
    repeat
        synch();
        /* Proc 0 sums extra element if there is one */
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
        half = half/2;    /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);

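The two phases can be tried out with ordinary threads. The following is a minimal sketch, not the lecture's implementation: it assumes Python's `threading.Barrier` as a stand-in for `synch()`, assumes the input length is divisible by the thread count, and uses a `while half > 1` loop so that a single thread also terminates (the slide's repeat/until form assumes at least two processors). The function name `parallel_sum` is hypothetical.

```python
import threading

def parallel_sum(values, num_procs):
    """Sum `values` with `num_procs` threads: local sums, then a tree reduction."""
    chunk = len(values) // num_procs          # assumes len(values) is divisible by num_procs
    partial = [0] * num_procs                 # plays the role of sum[Pn]
    barrier = threading.Barrier(num_procs)    # stand-in for synch()

    def worker(Pn):
        # Phase I: local sum over this thread's slice.
        partial[Pn] = sum(values[chunk * Pn: chunk * (Pn + 1)])
        # Phase II: every thread runs the same loop, as on the slide.
        half = num_procs
        while half > 1:
            barrier.wait()                    # all writes of the previous step are done
            if half % 2 != 0 and Pn == 0:
                partial[0] += partial[half - 1]   # thread 0 picks up the odd element
            half //= 2                        # dividing line on who sums
            if Pn < half:
                partial[Pn] += partial[Pn + half]

    threads = [threading.Thread(target=worker, args=(Pn,)) for Pn in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0]

print(parallel_sum(list(range(1, 101)), 10))
```

Every thread computes the same sequence of `half` values, so all of them reach the barrier the same number of times. Within one step the written slots (`Pn < half`) and the read slots (`Pn + half`) never overlap, which is what makes one barrier per step sufficient.
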
An Example with 10 Processors

sum[P0] sum[P1] sum[P2] sum[P3] sum[P4] sum[P5] sum[P6] sum[P7] sum[P8] sum[P9]

half = 10:  P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
half = 5:   P0 P1 P2 P3 P4
half = 2:   P0 P1
half = 1:   P0

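To see which additions each step performs, the reduction loop can be traced without running any threads. This is a small sketch under the same loop structure as the pseudocode; the helper name `reduction_schedule` and its `(dst, src)` pair format are assumptions made for this illustration.

```python
def reduction_schedule(num_procs):
    """Return a list of (half, additions) steps for the reduction loop.
    Each addition is a pair (dst, src) meaning sum[dst] += sum[src]."""
    steps = []
    half = num_procs
    while half > 1:
        adds = []
        if half % 2 != 0:
            adds.append((0, half - 1))        # P0 sums the extra element
        half //= 2
        adds.extend((Pn, Pn + half) for Pn in range(half))
        steps.append((half, adds))
    return steps

for half, adds in reduction_schedule(10):
    print(f"half = {half}: " + ", ".join(f"sum[{d}] += sum[{s}]" for d, s in adds))
```

For 10 processors the trace has three steps (half = 5, 2, 1); the step that starts from the odd value 5 is the one where P0 first adds the leftover `sum[4]`.
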
Threads

• Thread of execution: smallest unit of processing scheduled by operating system
• Threads have their own state or context:
  - Program counter, register file, stack pointer
• Threads share a memory address space
• Note: A “process” is a heavier-weight construct, which has its own address space. A process typically contains one or more threads.
  - Not to be confused with a processor, which is a physical device (i.e., a core)

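The split between per-thread context and shared address space can be observed directly. In the sketch below (names such as `shared`, `worker`, and `local_total` are illustrative), each thread keeps its own local variable on its own stack while all threads append to one list object. It relies on `list.append` being safe to call from several threads in CPython, which is an assumption about the interpreter rather than something the slides state.

```python
import threading

shared = []                      # lives in the process's address space, visible to all threads

def worker(name, count):
    local_total = 0              # local variable: each thread has its own copy on its own stack
    for i in range(count):
        local_total += i
    shared.append((name, local_total))   # every thread writes into the same list object

threads = [threading.Thread(target=worker, args=(f"T{k}", 10 * (k + 1))) for k in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))
```
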
Memory Model for Multi-threading

[Figure: Process]

Can be specified in a language with MIMD support, such as OpenMP and CilkPlus

Multithreading
• On a single processor, multithreading occurs by time-division multiplexing:
  - Processor switched between different threads
    - may be “pre-emptive” or “non pre-emptive”
  - Context switching happens frequently enough that user perceives threads as running at the same time
• On a multiprocessor, threads run at the same time, with each processor running a thread

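Time-division multiplexing on one processor can be mimicked with a toy scheduler. The sketch below is a non-pre-emptive (cooperative) round-robin built on Python generators, where each `yield` marks a voluntary context switch; the names `task` and `round_robin` are hypothetical, and a real operating system scheduler is considerably more involved.

```python
from collections import deque

def task(name, steps, log):
    """A toy thread: does `steps` units of work, yielding the CPU after each one."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield                      # voluntary context switch (non-pre-emptive)

def round_robin(tasks):
    """Run generator-based tasks on one simulated processor, one step at a time."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run the task until it yields
            ready.append(current)  # still has work: back of the queue
        except StopIteration:
            pass                   # task finished; drop it

log = []
round_robin([task("A", 3, log), task("B", 2, log)])
print(log)
```

The log interleaves the two tasks step by step, which is the effect a user perceives as both running "at the same time" on a single processor.
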
Multithreading vs. Multicore

• Basic idea: Processor resources are expensive and should not be left idle
  - For example: Long latency to memory on cache miss?
  - Hardware switches threads to bring in other useful work while waiting for cache miss
  - Cost of thread context switch must be much less than cache miss latency
• Put in redundant hardware so don’t have to save context on every thread switch:
  - PC, Registers, …
• Attractive for applications with abundant TLP

Data Races and Synchronization
• Two memory accesses form a data race if they are from different threads, to the same location, at least one is a write, and they occur one after another
• If there is a data race, result of program can vary depending on chance (which thread ran first?)
• Avoid data races by synchronizing writing and reading to get deterministic behavior
• Synchronization done by user-level routines that rely on hardware synchronization instructions

Question: Consider the following code when executed concurrently by two threads.
What possible values can result in *($s0)?

    # *($s0) = 100
    lw   $t0,0($s0)
    addi $t0,$t0,1
    sw   $t0,0($s0)

☐ 101 or 102
☐ 100, 101, or 102
☐ 100 or 101
☐

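One way to check the question is to enumerate every interleaving of the two threads' three instructions. The sketch below assumes each instruction executes atomically and in program order (a sequentially consistent machine), which is a simplification; `run` and `schedules` are names invented for this example.

```python
from itertools import permutations

def run(schedule, initial=100):
    """Execute one interleaving. `schedule` is a sequence of thread ids (0 or 1);
    each thread performs its three steps (lw, addi, sw) in program order."""
    mem = initial                 # the word at 0($s0)
    t0 = [None, None]             # each thread's private $t0
    pc = [0, 0]                   # next step for each thread
    for tid in schedule:
        if pc[tid] == 0:
            t0[tid] = mem         # lw   $t0,0($s0)
        elif pc[tid] == 1:
            t0[tid] += 1          # addi $t0,$t0,1
        else:
            mem = t0[tid]         # sw   $t0,0($s0)
        pc[tid] += 1
    return mem

schedules = set(permutations([0, 0, 0, 1, 1, 1]))   # all distinct interleavings
outcomes = {run(s) for s in schedules}
print(len(schedules), sorted(outcomes))
```

Under these assumptions the final value is 102 when one thread's store precedes the other thread's load, and 101 when both threads load the original 100 before either stores. The value 100 cannot survive, because every store writes at least 101.
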
Lock and Unlock Synchronization
• Lock used to create region (critical section) where only one thread can operate
• Given shared memory, use memory location as synchronization point: lock, semaphore or mutex
• Thread reads lock to see if it must wait, or OK to go into critical section (and set to locked)
  - 0 => lock is free / open / unlocked / lock off
  - 1 => lock is set / closed / locked / lock on

[Figure: Set the lock → Critical section (only one thread gets to execute this section of code at a time), e.g., change shared variables → Unset the lock]

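The set / critical section / unset pattern looks like this with a library lock. This is a minimal sketch using Python's `threading.Lock` as the lock; the shared variable `counter` and the function `add_many` are assumptions made for the example.

```python
import threading

lock = threading.Lock()     # the shared synchronization point
counter = 0                 # shared variable modified in the critical section

def add_many(n):
    global counter
    for _ in range(n):
        lock.acquire()      # set the lock (blocks while another thread holds it)
        try:
            counter += 1    # critical section: one thread at a time
        finally:
            lock.release()  # unset the lock

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Because the read-modify-write of `counter` happens only while the lock is held, no increment is lost, however the threads are scheduled.
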
Possible Lock Implementation

• Lock (a.k.a. busy wait)

    Get_lock:                          # $s0 -> addr of lock
          addiu $t1,$zero,1            # t1 = Locked value
    Loop: lw    $t0,0($s0)             # load lock
          bne   $t0,$zero,Loop         # loop if locked
    Lock: sw    $t1,0($s0)             # Unlocked, so lock

• Unlock

    Unlock:
          sw    $zero,0($s0)

• Any problems with this?

Possible Lock Problem

Time   Thread 1                        Thread 2
  |          addiu $t1,$zero,1
  |    Loop: lw    $t0,0($s0)
  |                                          addiu $t1,$zero,1
  |                                    Loop: lw    $t0,0($s0)
  |          bne   $t0,$zero,Loop
  |                                          bne   $t0,$zero,Loop
  |    Lock: sw    $t1,0($s0)
  v                                    Lock: sw    $t1,0($s0)

Both threads think they have set the lock!
Exclusive access not guaranteed!

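The failing interleaving can be replayed step by step. The sketch below models only the `lw`, `bne`, and `sw` of each thread (the `addiu` that loads the constant 1 is folded into the store), and the function `replay` and its schedule format are hypothetical names for this illustration.

```python
def replay(schedule):
    """Replay an interleaving of the lw / bne / sw lock attempt for two threads.
    Returns the set of thread ids that believe they hold the lock."""
    lock = 0                          # 0 = free, 1 = locked
    t0 = {1: None, 2: None}           # each thread's private $t0
    holders = set()
    for tid, op in schedule:
        if op == "lw":
            t0[tid] = lock            # load lock
        elif op == "bne":
            # branch back to Loop if locked; this replay assumes it falls through
            assert t0[tid] == 0, "thread would have looped"
        elif op == "sw":
            lock = 1                  # store the Locked value
            holders.add(tid)
    return holders

# The interleaving from the slide: both loads happen before either store.
slide_schedule = [(1, "lw"), (2, "lw"), (1, "bne"), (2, "bne"), (1, "sw"), (2, "sw")]
print(replay(slide_schedule))
```

The problem is the gap between the load that tests the lock and the store that sets it: both threads read 0 before either writes 1, so both fall through the branch.
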
Hardware-supported Synchronization
• Hardware support required to prevent interloper (either thread on other core or thread on same core) from changing the value
  - Atomic read/write memory operation
  - No other access to the location allowed between the read and write
• Could be a single instruction
  - e.g., atomic swap of register ↔ memory
  - or an atomic pair of instructions

Synchronization in MIPS

• Load linked: ll rt, off(rs)
  Loads rt with the contents at Mem(off+rs) and reserves the memory address off+rs by storing it in a special link register (Rlink)
• Store conditional: sc rt, off(rs)
  Checks if the reservation of the memory address is valid in the link register. If so, the contents of rt are written to Mem(off+rs) and rt is set to 1; otherwise no memory store is performed and 0 is written into rt.
  - Returns 1 (success) if location has not changed since the ll
  - Returns 0 (failure) if location has changed
• Note that sc clobbers the register value being stored (rt)!
  - Need to have a copy elsewhere if you plan on repeating on failure or using value later

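The reservation behaviour of `ll` and `sc` can be modelled in a few lines. The class below is a simplified sketch, not a description of real MIPS hardware (which typically tracks reservations in the cache rather than in a software table); the class name `Memory`, the per-CPU `link` dictionary, and the address `0x100` are all assumptions made for illustration.

```python
class Memory:
    """Toy shared memory with per-CPU link registers (a simplified LL/SC model)."""

    def __init__(self):
        self.words = {}      # address -> value
        self.link = {}       # cpu id -> reserved address (that cpu's Rlink)

    def ll(self, cpu, addr):
        """Load linked: return the word and reserve the address for this cpu."""
        self.link[cpu] = addr
        return self.words.get(addr, 0)

    def sc(self, cpu, addr, value):
        """Store conditional: store and return 1 if the reservation is still valid,
        otherwise store nothing and return 0."""
        if self.link.get(cpu) != addr:
            return 0
        self._write(addr, value)
        return 1

    def store(self, addr, value):
        """Ordinary store (sw); it also breaks reservations on that address."""
        self._write(addr, value)

    def _write(self, addr, value):
        self.words[addr] = value
        # Any write to addr invalidates every reservation on it, including the writer's.
        for cpu in [c for c, a in self.link.items() if a == addr]:
            del self.link[cpu]

mem = Memory()
mem.store(0x100, 7)

old = mem.ll(cpu=0, addr=0x100)
ok_undisturbed = mem.sc(cpu=0, addr=0x100, value=old + 1)   # nothing intervened

old = mem.ll(cpu=0, addr=0x100)
mem.store(0x100, 99)                                        # another core writes
ok_disturbed = mem.sc(cpu=0, addr=0x100, value=old + 1)     # reservation lost

print(ok_undisturbed, ok_disturbed, mem.words[0x100])
```

The first `sc` succeeds because nothing touched the word after the `ll`. The second fails and leaves the other core's value in place, which is the signal to retry.
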
Synchronization in MIPS Example

• Atomic swap (to test/set lock variable)
  Exchange contents of register and memory: $s4 ↔ Mem($s1)

    try: add $t0,$zero,$s4    # copy value
         ll  $t1,0($s1)       # load linked
         sc  $t0,0($s1)       # store conditional
         beq $t0,$zero,try    # loop if sc fails
         add $s4,$zero,$t1    # load value in $s4

  sc would fail if another thread executes sc between the ll and the sc

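The retry loop of the atomic swap can be exercised against a small software model of `ll`/`sc`. In this sketch the model (`mem`, `link`, `write`, `ll`, `sc`) is a simplification written for the example, and the `interloper` argument is a hypothetical hook that runs once between the `ll` and the `sc` to mimic another thread.

```python
mem = {"lock": 0}     # shared memory; the address name "lock" is illustrative
link = {}             # cpu id -> address reserved by that cpu's last ll

def write(addr, value):
    """Store a value and break every reservation on that address."""
    mem[addr] = value
    for cpu in [c for c, a in link.items() if a == addr]:
        del link[cpu]

def ll(cpu, addr):
    link[cpu] = addr
    return mem[addr]

def sc(cpu, addr, value):
    if link.get(cpu) != addr:
        return 0                      # reservation lost: no store
    write(addr, value)
    return 1

def atomic_swap(cpu, addr, s4, interloper=None):
    """Mirror of the slide's loop. Returns (old memory value, number of attempts)."""
    attempts = 0
    while True:
        attempts += 1
        t0 = s4                       # add $t0,$zero,$s4   copy value
        t1 = ll(cpu, addr)            # ll  $t1,0($s1)      load linked
        if interloper is not None:
            interloper()              # another thread sneaks in here
            interloper = None         # only on the first attempt
        t0 = sc(cpu, addr, t0)        # sc  $t0,0($s1)      t0 becomes the success flag
        if t0 != 0:                   # beq $t0,$zero,try   loop if sc fails
            return t1, attempts       # add $s4,$zero,$t1   old value goes to $s4

def other_thread():
    """A second cpu completing its own ll/sc pair on the same word."""
    ll(1, "lock")
    sc(1, "lock", 5)

first = atomic_swap(0, "lock", 9)                             # uncontended
second = atomic_swap(0, "lock", 1, interloper=other_thread)   # disturbed once
print(first, second, mem["lock"])
```

The copy into `t0` at the top of the loop matters: `sc` overwrites `t0` with the success flag, so the value to store has to be re-copied from `s4` on every retry.
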
Test-and-Set

• In a single atomic operation:
  - Test to see if a memory location is set (contains a 1)
  - Set it (to 1) if it isn’t (it contained a zero when tested)
  - Otherwise indicate that the Set failed, so the program can try again
• While accessing, no other instruction can modify the memory location, including other Test-and-Set instructions
• Useful for implementing lock operations

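Test-and-Set in MIPS
• Single atomic operation
• Example: MIPS sequence for implementing a T&S at ($s1)

    Try:    addiu $t0,$zero,1       # t0 = Locked value
            ll    $t1,0($s1)        # load linked: read the lock
            bne   $t1,$zero,Try     # lock already set: try again
            sc    $t0,0($s1)        # store conditional: try to set the lock
            beq   $t0,$zero,Try     # sc failed: try again
    Locked:
            # critical section
            sw    $zero,0($s1)      # unset the lock
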
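To gain some confidence that this sequence gives mutual exclusion, one can explore every interleaving of two threads running it. The sketch below is a small state-space search over a simplified model: each instruction is atomic, a successful `sc` breaks every reservation, the `addiu` is folded into the `sc`, and the unlock is left out so that the search only asks whether both threads can be past `Locked:` at once. All names (`step`, `reachable`, `IN_CS`) are assumptions for this illustration.

```python
from collections import deque

IN_CS = 4   # pc value meaning "past Locked:, inside the critical section"

def step(state, tid):
    """Advance thread `tid` by one instruction; return the new state.
    A thread is (pc, t1, ok, linked); the state is (lock word, both threads)."""
    mem, threads = state
    threads = [list(t) for t in threads]
    pc, t1, ok, linked = threads[tid]
    if pc == 0:                                   # ll  $t1,0($s1)
        t1, linked, pc = mem, True, 1
    elif pc == 1:                                 # bne $t1,$zero,Try
        pc = 0 if t1 != 0 else 2
    elif pc == 2:                                 # sc  $t0,0($s1)
        if linked:
            mem, ok = 1, 1
            for t in threads:
                t[3] = False                      # a write breaks every reservation
        else:
            ok = 0
        linked, pc = False, 3
    elif pc == 3:                                 # beq $t0,$zero,Try
        pc = 0 if ok == 0 else IN_CS
    threads[tid] = [pc, t1, ok, linked]           # a thread at IN_CS stays there
    return mem, tuple(tuple(t) for t in threads)

def reachable(initial):
    """Breadth-first search over all states reachable under any scheduling."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        for tid in (0, 1):
            nxt = step(state, tid)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

start = (0, ((0, 0, 0, False), (0, 0, 0, False)))   # lock free, both threads at Try
states = reachable(start)
both_in_cs = [s for s in states if s[1][0][0] == IN_CS and s[1][1][0] == IN_CS]
print(len(states), len(both_in_cs))
```

In this model no reachable state has both threads in the critical section. The thread that loses the race either sees the lock already set at its `ll`, or finds its reservation broken at its `sc`; in both cases it branches back to `Try`.
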
Summary

• Sequential software is slow software
  - SIMD and MIMD only path to higher performance
• Multiprocessor (Multicore) uses Shared Memory (single address space)
• Cache coherency implements shared memory even with multiple copies in multiple caches
  - False sharing a concern
• Synchronization via hardware primitives:
  - MIPS does it with Load Linked + Store Conditional
