Presentation2 HS OpenMP

OpenMP is an API that allows developers to write code that can run concurrently on multi-core CPUs. It uses compiler directives to specify parallel regions. OpenMP supports shared memory parallelism, thread creation and management, data sharing between threads, synchronization constructs, and parallelizing loops and tasks. It is portable, widely used in scientific computing, and provides a consistent model for parallel programming across platforms.


Parallel Programming with OpenMP
What is OpenMP
• OpenMP is an API (Application Programming Interface) that allows developers to write code that can be executed concurrently on multi-core CPUs and other shared-memory architectures.
• **1. Shared-Memory Parallelism:**
• **2. Compiler Directives:**
• **3. Parallel Regions:**
• **4. Thread Creation and Management:**
• **5. Data Sharing and Private Variables:**
• **6. Synchronization:**
• **7. Loop Parallelism:**
• **8. Task Parallelism:**
• **9. Performance Considerations:**
• **10. Portability:**
• **11. Error Handling:**
• **12. Examples of Use Cases:**
• **1. Shared-Memory Parallelism:**
• - OpenMP is designed for shared-memory parallelism, where multiple threads of execution share the same memory space within a single process.
• - It is well-suited for multi-core processors and SMP (Symmetric Multi-Processing) systems.

• **2. Compiler Directives:**
• - OpenMP uses compiler directives to specify parallel regions in the code.
• - These directives are pragmas (e.g., `#pragma omp`) that guide the compiler in generating parallel code.
• **3. Parallel Regions:**
• - A parallel region is a block of code that can be executed concurrently by multiple threads.
• - Threads are created automatically when entering a parallel region and terminated when exiting.
• - Parallel regions are created with `#pragma omp parallel`, optionally combined with work sharing as in `#pragma omp parallel for`.

• **4. Thread Creation and Management:**
• - OpenMP abstracts thread creation and management, making it easier for developers.
• - Threads are typically managed by the OpenMP runtime library, which handles tasks like thread creation, synchronization, and load balancing.
• **5. Data Sharing and Private Variables:**
• - OpenMP provides mechanisms for sharing data among threads, such as shared variables and thread-private variables.
• - Shared variables are accessible by all threads in a parallel region, while thread-private variables are unique to each thread.

• **6. Synchronization:**
• - OpenMP supports various synchronization constructs to coordinate threads, such as barriers (`#pragma omp barrier`) and critical sections (`#pragma omp critical`).
• - These constructs ensure that threads do not interfere with each other's execution when accessing shared resources.

• **7. Loop Parallelism:**
• - OpenMP is commonly used to parallelize loops using directives like `#pragma omp for`.
• - Parallel loops can be efficiently divided among threads, and loop iterations are executed concurrently.

• **8. Task Parallelism:**
• - OpenMP also supports task parallelism, allowing developers to specify tasks that can be executed independently.
• - Tasks can be created using `#pragma omp task` directives and can be used for more fine-grained parallelism (see the short sketch after this list).
• **9. Performance Considerations:**
• - Efficient use of OpenMP requires careful consideration of load balancing, data dependencies, and minimizing synchronization overhead.
• - Profiling and performance tuning tools can help identify bottlenecks and optimize parallel code.

• **10. Portability:**
• - OpenMP is supported by many compilers and is highly portable, making it easier to write parallel code that can run on various platforms.
• - It provides a consistent API for shared-memory parallelism across different systems.

• **11. Error Handling:**
• - OpenMP provides mechanisms for handling errors, such as runtime library functions for querying the number of threads and checking for errors during parallel execution.

• **12. Examples of Use Cases:**
• - OpenMP is commonly used in scientific computing, numerical simulations, data processing, and other applications where parallelism can be exploited to improve performance.
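• As the short sketch referenced in point 8, here is a minimal example of task parallelism; the fixed count of four tasks is only an assumption for illustration:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel        // create a team of threads
    {
        #pragma omp single      // one thread creates the tasks...
        {
            for (int i = 0; i < 4; i++) {
                #pragma omp task firstprivate(i)
                {
                    // ...and any thread in the team may execute each task
                    printf("task %d run by thread %d\n", i, omp_get_thread_num());
                }
            }
            #pragma omp taskwait   // wait until all tasks created above have finished
        }
    }
    return 0;
}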
Example
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread %d\n", thread_id);
    }
    return 0;
}

In this code, the `#pragma omp parallel` directive creates a parallel region, and each thread prints its own thread ID. The code will run with multiple threads, and each thread will execute the specified block in parallel.
• Output on a computer with two cores, and thus two threads:

• Hello from thread 0

• Hello from thread 1

• On a computer with 24 hardware threads, I got 24 hellos, one per thread. On my desktop I get (only) 8. How many do you get?
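• With GCC or Clang, OpenMP support is enabled with the -fopenmp compiler flag; without it the pragmas are ignored and the program runs serially. The number of threads can be controlled with the OMP_NUM_THREADS environment variable or with the omp_set_num_threads() runtime call (a small sketch of the runtime functions appears after the barrier example below).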
Code example: adding two vectors

#include <stdio.h>
#include <omp.h>
#define N 1000

int main() {
    int A[N], B[N], C[N]; // Input and output arrays

    // Initialize the input arrays A and B
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    // Parallelize the vector addition using OpenMP
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Print the result (C)
    printf("Resultant vector (C):\n");
    for (int i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }
    printf("\n");
    return 0;
}
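• Note: `#pragma omp parallel for` combines a parallel region and a work-sharing `for` construct in a single directive; the loop example later in these slides uses the two directives (`#pragma omp parallel` and `#pragma omp for`) separately, with the same effect.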
Some notes: private and shared vars
• In a parallel section variables can be private (each thread owns a copy of the variable) or shared among all threads. Shared variables must be used with care because they can cause race conditions.

• shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work-sharing region are shared except the loop iteration counter.

• private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and its value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
Some notes: private and shared vars
#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;

    // th_id is declared above.
    // It is specified as private, so each thread will have its own copy of th_id.
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
    }
    return 0;
}

Sharing variables is sometimes what you want; other times it is not, and it can lead to race conditions. Put differently, some variables need to be shared and some need to be private, and you, the programmer, have to specify which is which.
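• One way to make those choices explicit is the default(none) clause, which forces every variable used inside the region to appear in a shared(...) or private(...) clause. A minimal sketch (the variable names are only illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int th_id;        // will be private: each thread gets its own copy
    int total = 0;    // will be shared: visible to all threads

    // default(none) makes the compiler reject any variable whose
    // data-sharing attribute is not stated explicitly.
    #pragma omp parallel default(none) private(th_id) shared(total)
    {
        th_id = omp_get_thread_num();
        #pragma omp critical
        total += th_id;   // protected update of the shared variable
    }
    printf("sum of thread ids = %d\n", total);
    return 0;
}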
Some notes: synchronization
• critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.

• atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical. (A short sketch follows this list.)

• ordered: the structured block is executed in the order in which iterations would be executed in a sequential loop.

• barrier: each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.

• nowait: specifies that threads completing assigned work can proceed without waiting for all threads in the team to finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work-sharing construct.
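• A minimal sketch of the atomic clause mentioned above (the shared counter and the loop bound of 1000 are only assumptions for illustration); replacing the atomic directive with a critical section would also be correct, but typically slower:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;   // shared counter updated by all threads

    #pragma omp parallel for shared(hits)
    for (int i = 0; i < 1000; i++) {
        // Only the memory update below is atomic, not any surrounding code.
        #pragma omp atomic
        hits++;
    }
    printf("hits = %d (expected 1000)\n", hits);
    return 0;
}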
Some notes: synchronization
• on barriers: If we wanted all threads to be at a specific point in their execution before proceeding, we would use a barrier.
• A barrier basically tells each thread, "wait here until all other threads have reached this point...".

#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier   // <----------- master waits until all threads finish before printing
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return 0;
} //main

Some other runtime functions are:
• omp_get_num_threads
• omp_get_num_procs
• omp_set_num_threads
• omp_get_max_threads
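A minimal sketch of those runtime functions in use (the request for 4 threads is only an assumption for illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Processors available: %d\n", omp_get_num_procs());
    printf("Max threads by default: %d\n", omp_get_max_threads());

    omp_set_num_threads(4);   // request 4 threads for subsequent parallel regions

    #pragma omp parallel
    {
        // Inside a parallel region, omp_get_num_threads() reports the team size.
        if (omp_get_thread_num() == 0)
            printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}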
Parallelizing loops

//compute the sum of two arrays in parallel
#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void) {
    // static so the large arrays do not overflow the stack
    static float a[N], b[N], c[N];
    int i;

    /* Initialize arrays a and b */
    for (i = 0; i < N; i++) {
        a[i] = i * 2.0;
        b[i] = i * 3.0;
    }

    /* Compute values of array c = a+b in parallel. */
    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    /* Print one element to check the result. */
    printf("%f\n", c[10]);
    return 0;
}
Adding all elements of an array

//example4.c: add all elements in an array in parallel
//the array is distributed statically between threads
#include <stdio.h>
#include <omp.h>

int main() {
    const int N = 100;
    int a[N];

    //initialize
    for (int i = 0; i < N; i++)
        a[i] = i;

    //compute sum
    int local_sum, sum = 0;
    #pragma omp parallel private(local_sum) shared(sum)
    {
        local_sum = 0;
        #pragma omp for schedule(static,1)
        for (int i = 0; i < N; i++) {
            local_sum += a[i];
        }
        //each thread calculated its local_sum. All threads have to add to
        //the global sum. It is critical that this operation is atomic.
        #pragma omp critical
        sum += local_sum;
    }
    printf("sum=%d should be %d\n", sum, N*(N-1)/2);
    return 0;
}
Performance consideration
• Critical sections and atomic updates serialize execution and eliminate the concurrent execution of threads in those regions.
• If used unwisely, OpenMP code can be slower than serial code because of all the thread overhead.
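• One way to check whether parallelization pays off is to time the region with omp_get_wtime(). A minimal sketch, where the fill loop is just a stand-in for real work:

#include <stdio.h>
#include <omp.h>
#define N 10000000

int main(void) {
    static double a[N];

    double start = omp_get_wtime();   // wall-clock time in seconds

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;               // stand-in for real work
    }

    double elapsed = omp_get_wtime() - start;
    printf("parallel loop took %f seconds\n", elapsed);
    // Compare against a serial run (e.g., OMP_NUM_THREADS=1) to see
    // whether the thread overhead is worth it for this problem size.
    return 0;
}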
Some comments
Exercises (OpenMP-1)
ref: https://www.r-ccs.riken.jp/en/wp-content/uploads/sites/2/2021/08/RIKEN-iphcss_day1-rev1.pdf
Exercises (OpenMP-2):
ref: https://www.r-ccs.riken.jp/en/wp-content/uploads/sites/2/2021/08/RIKEN-iphcss_day1-rev1.pdf
Exercises (OpenMP-3):
ref: https://www.r-ccs.riken.jp/en/wp-content/uploads/sites/2/2021/08/RIKEN-iphcss_day1-rev1.pdf
Tested at Fugaku
• Job submission
$ pjsub hello_world_omp.pjsh

• Observing the submitted jobs
$ pjstat
