02 Parallel Processing - C++ and OpenMP
PARALLEL PROCESSING/PROGRAMMING
CATALIN BOJA
[email protected]
BUCHAREST UNIVERSITY OF ECONOMIC STUDIES
DEPARTMENT OF ECONOMIC INFORMATICS & CYBERNETICS
MULTI-PROCESS VS MULTI-THREAD
https://computing.llnl.gov/tutorials/pthreads/ http://www.javamex.com/tutorials/threads/how_threads_work.shtml
• The thread model is an extension of the process model.
• Each process consists of multiple independent instruction streams (or threads) that are assigned computer resources by some scheduling procedure.
• Threads of a process share the address space of the process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.
• Each thread has its own run-time stack, registers, and program counter.
• Threads can communicate by reading/writing variables in the common address space.
• Thread operations include thread creation, termination, synchronization (joins, blocking), scheduling, data management, and process interaction.
• A thread does not maintain a list of created threads, nor does it know the thread that created it.
• Threads in the same process share: process instructions, most data, open files (descriptors), signals and signal handlers, the current working directory, user and group ids.
• Each thread has a unique: thread ID, set of registers, stack pointer, stack for local variables and return addresses, signal mask, priority, return value (errno).
• pthread functions return "0" if OK.
[email protected] 4
THREADS CONCEPTS
• Atomic operation – an operation that is executed at once (in the same processor time) by a thread, and not partially; the thread cannot execute only parts of it; it's all or nothing
• Race condition – 2 or more threads have simultaneous read/write access to the same resource (object, variable, file) and the operation is not atomic
• Critical section – a code sequence, executed by multiple threads at the same time, which accesses a shared resource
• Deadlock – 2 or more threads block each other, as each of them holds a lock on a resource needed by the others
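A minimal sketch (not from the slides) of a race condition in C++: two threads increment a shared counter, and because counter++ is a read-modify-write rather than an atomic operation, updates are lost.

#include <iostream>
#include <thread>

int counter = 0; // shared resource

void increment() {
    for (int i = 0; i < 100000; i++)
        counter++; // not atomic: read, add, write back - a race condition
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    // expected 200000, but usually prints less: interleaved updates are lost
    std::cout << counter << std::endl;
}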
[email protected] 6
THREADS CONCEPTS
[email protected] 7
MULTI-THREADS IN C++
#include <iostream>
#include <thread>

void thread_function() {
    std::cout << "Hello World !";
}

int main() {
    //start a thread
    std::thread t1(thread_function);
    //join the thread with main
    t1.join();
}

• Threads are created based on existing functions
• Creating a thread object will automatically start a new thread
• Joining the thread with the main one is done by calling join()
[email protected] 9
MULTI-THREADS IN C++ - SHARING DATA
[email protected] 10
MULTI-THREADS IN C++ - PROTECTING SHARED DATA
MULTI-THREADS IN C++ - MUTEX
[email protected] 12
MULTI-THREADS IN C++ - MUTEX AND RAII
MULTI-THREADS IN C++ - ADVANCED MUTEXES
[email protected] 15
MULTI-THREADS IN C++ - ATOMIC
[email protected] 16
MULTI-THREADS IN C++ - PROBLEMS
• Race conditions
• Shared variables
• Load balancing
• Cache coherence
• Cache false sharing (see the sketch after this list)
• Deadlocks
• Synchronization overhead
[email protected] 17
OPENMP – PROGRAMMING MODEL
[email protected] 19
OPENMP – WHAT IS NOT
[email protected] 20
OPENMP
An OpenMP program begins as a single process: the master thread (shown in red/grey in the pictures). The master thread executes sequentially until the first parallel region construct is encountered.

When a parallel region is encountered, the master thread:
• creates a group of threads (FORK);
• becomes the master of this group of threads and is assigned thread id 0 within the group.

The statements in the program that are enclosed by the parallel region construct are then executed in parallel among these threads.

JOIN: when the threads complete executing the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
[email protected] 21
OPENMP
[email protected] 22
OPENMP
I/O
OpenMP does not specify parallel I/O.
It is up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded
program.
Memory Model
Threads can “cache” their data and are not required to maintain exact consistency with real memory all of
the time.
When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is updated by all threads as needed.
[email protected] 23
OPENMP – HELLO WORLD
[email protected] 24
OPENMP
#include <omp.h>
int main ()
{
int var1, var2, var3;
// 1. Serial code
...
// 2. Beginning of parallel section.
// Fork a team of threads. Specify variable scoping
#pragma omp parallel private(var1, var2) shared(var3)
{
// 3. Parallel section executed by all threads
...
// 4. All threads join master thread and disband
}
// 5. Resume serial code . . .
}
[email protected] 25
OPENMP - C/C++ DIRECTIVE FORMAT
• General Rules:
• Case sensitive
• Only one directive-name may be specified per directive
• Each directive applies to at most one succeeding statement, which must be a structured block.
• Long directive lines can be “continued” on succeeding lines by escaping the newline character with a
backslash “\” at the end of a directive line.
[email protected] 26
OPENMP - C/C++ DIRECTIVE FORMAT
int is_parallel = 1;
int var_c = 10;
int var_b = 0;
int var_a; // must be initialized inside the parallel block
//....
#pragma omp parallel if (is_parallel == 1) num_threads(8) \
        shared (var_b) private (var_a) firstprivate (var_c) \
        default (none)
{
    var_a = 0; //must be initialized
    /* structured block */
}

• if (is_parallel == 1) num_threads(8) – if the value of the variable is_parallel is one, then 8 threads are created
• shared (var_b) – all threads share a single copy of variable var_b
• private (var_a) firstprivate (var_c) – each thread gets private copies of variables var_a and var_c; each private copy of var_c is initialized with the value of var_c in the main thread when the parallel directive is encountered; var_a must be initialized in the parallel block
• default (none) – the default state of a variable is specified as none (rather than shared); the compiler signals an error if not all variables are specified as shared or private
OPENMP – NUMBER OF THREADS
The number of threads in a parallel region is determined by the following factors, in order
of precedence:
1. Evaluation of the if clause
2. Setting of the num_threads() clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default – usually the number of cores on a node
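A short sketch of the precedence in practice: the num_threads clause (2.) overrides the omp_set_num_threads call (3.).

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4); // library call requests 4 threads

    #pragma omp parallel num_threads(2) // clause wins: team of 2
    {
        #pragma omp master
        printf("team size: %d\n", omp_get_num_threads()); // prints 2
    }
    return 0;
}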
[email protected] 32
OPENMP – SYNCHRONIZATION
[Two timing diagrams, reconstructed from the figure: (1) barrier – threads 1-4 reach the barrier at different times and each waits until all four have arrived, then execution continues; (2) critical section – threads 1-4 contend for the critical section, one executes it while the others wait, so the section is traversed one thread at a time.]
OPENMP - WORK SHARING CONSTRUCTS
Work-Sharing do/for Directives:
• Shares iterations of a loop across the group
• Represents "data parallelism"
• The for directive partitions parallel iterations across threads in C/C++
• do is the analogous directive in Fortran
• Implicit barrier at the end of the for loop

#pragma omp for [clause list]
/* for loop */

#include <stdio.h>
#include <omp.h>

int main() {
    int tid;
    omp_set_num_threads(3);

    #pragma omp parallel private(tid)
    {
        int i;
        tid = omp_get_thread_num();
        printf("Hello world from (%d)\n", tid);

        #pragma omp for
        for (i = 0; i <= 4; i++) {
            printf("Iteration %d by %d\n", i, tid);
        }
    } // all threads join the master thread and terminate
    return 0;
}
[email protected] 37
OPENMP - WORK SHARING CONSTRUCTS
Sequential code to add two vectors vs. parallel logic to add two vectors:
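The code on this slide is image-only in the extraction; a minimal sequential version, assuming a, b, c are arrays of N elements:

for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];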
[email protected] 38
OPENMP - WORK SHARING CONSTRUCTS
//OpenMP implementation 1 (not desired): manual partitioning of iterations
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { c[i] = b[i] + a[i]; }
}

//A worksharing for construct to add vectors:
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) { c[i] = b[i] + a[i]; }
}

for directive syntax:

#pragma omp for [clause list]
    schedule (type [, chunk])
    ordered
    private (variable list)
    firstprivate (variable list)
    shared (variable list)
    reduction (operator: variable list)
    collapse (n)
    nowait
/* for_loop */

Restrictions for the for loop that follows the omp for directive:
• It must NOT have a break statement
• The loop control variable must be an integer
• The initialization expression of the for loop must be an integer assignment
• The logical expression must be one of <, ≤, >, ≥
• The increment expression must have integer increments or decrements only
[email protected] 40
OPENMP – DATA ENVIRONMENT
[email protected] 42
OPENMP - WORK SHARING CONSTRUCTS
//A work-sharing for: computing the average value of a vector
{
    double avg = 0.0, A[MAX];
    int i;
    ...
    #pragma omp parallel for reduction (+: avg)
    for (i = 0; i < MAX; i++) { avg += A[i]; }
    avg /= MAX;
}

• avg is a local (private) variable in each thread of the parallel region
• After the for loop, the external avg variable becomes the sum of the local avg variables
[email protected] 46
OPENMP - MATRIX-VECTOR MULTIPLICATION
OPENMP - WORK SHARING CONSTRUCTS
• schedule (static [, chunk]) – Loop iterations are divided into pieces of size chunk and then statically assigned to threads (the mapping is fixed before the loop runs). If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.
• schedule (dynamic [, chunk]) – Loop iterations are divided into pieces of size chunk and then dynamically assigned to threads. When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
• schedule (guided [, chunk]) – For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size of k (k > 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). The default chunk size is 1.
[email protected] 49
OPENMP - WORK SHARING CONSTRUCTS
Static:
• Predictable
• Pre-determined by the programmer before the loop runs
• Reduces overhead at run-time

Dynamic:
• Unpredictable
• Determined at run-time
• Complex logic at run-time, which leads to an overhead
[email protected] 51
OPENMP - WORK SHARING CONSTRUCTS
Typical approach:
1. Find loop-intensive routines
2. Remove loop-carried dependencies
3. Implement a work-sharing construct
[email protected] 52
OPENMP – SECTIONS CONSTRUCT
[email protected] 60
OPENMP – RUNTIME ROUTINES
[email protected] 62
OPENMP - TASKS
[email protected] 63
PARALLEL PROGRAMMING PROBLEMS