HPC Lab Manual-1
Practical No: 1
Title: Design and implement Parallel Breadth First Search and Depth First Search based on existing
algorithms using OpenMP. Use a Tree or an undirected graph for BFS and DFS.
Objective: Students should be able to write a program to implement Parallel Breadth First Search
and Depth First Search based on existing algorithms using OpenMP.
Prerequisite:
3. Concept of Parallelism
Theory:
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all the
neighbouring nodes at the current depth level before moving on to the next depth level. The algorithm
uses a queue data structure to keep track of the nodes that need to be visited, and marks each visited node
to avoid processing it again. The basic idea of the BFS algorithm is to visit all the nodes at a given level
before moving on to the next level, which ensures that all the nodes are visited in breadth-first order. BFS
is commonly used in many applications, such as finding the shortest path between two nodes, solving
puzzles, and searching through a tree or graph.
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
1. Take an empty queue and select a starting (root) node; mark it as visited.
2. Insert the starting node into the queue.
3. While the queue is not empty, remove the node at the front of the queue and process (print) it.
4. Insert all unvisited neighbours of the removed node into the queue and mark them as visited.
5. Repeat steps 3 and 4 until the queue becomes empty.
What is DFS?
DFS stands for Depth-First Search. It is a popular graph traversal algorithm that explores as far as
possible along each branch before backtracking. This algorithm can be used to detect cycles, find
connected components, or traverse a graph in a systematic way. The algorithm starts at the root
node and explores as far as possible along each branch before backtracking. The backtracking is
done to explore the next branch that has not been explored yet.
DFS can be implemented using either a recursive or an iterative approach. The recursive approach
is simpler to implement but can lead to a stack overflow error for very large graphs. The iterative
approach uses a stack to keep track of nodes to be explored and is preferred for larger graphs.
DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will
eventually reach a node that has already been visited, indicating that a cycle exists. A standard DFS
implementation puts each vertex of the graph into one of two categories: 1. Visited 2. Not Visited The
purpose of the algorithm is to mark each vertex as visited while avoiding cycles.
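As a point of reference, a minimal sequential DFS over an adjacency-list graph can be written recursively; the small graph and function names below are only illustrative, and the OpenMP-based iterative version used in this practical appears later in the code section.
#include <iostream>
#include <vector>
using namespace std;

// Recursive DFS over an adjacency list; marks and prints each vertex once.
void dfs_recursive(int u, const vector<vector<int>>& adj, vector<bool>& visited) {
    visited[u] = true;
    cout << u << " ";
    for (int v : adj[u]) {
        if (!visited[v]) {
            dfs_recursive(v, adj, visited);
        }
    }
}

int main() {
    // Small undirected graph: edges 0-1, 0-2, 1-3
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0}, {1}};
    vector<bool> visited(adj.size(), false);
    dfs_recursive(0, adj, visited);   // prints 0 1 3 2
    cout << endl;
    return 0;
}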
Concept of OpenMP
● OpenMP (Open Multi-Processing) is an application programming interface (API) that supports
multi-platform, shared-memory parallel programming in C, C++, and Fortran. It consists of a set of
compiler directives, library routines, and environment variables that control the run-time
behaviour of a program.
● OpenMP is widely used in scientific computing, engineering, and other fields that require high-
performance computing. It is supported by most modern compilers and is available on a wide range of
platforms, including desktops, servers, and supercomputers.
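A minimal sketch of how OpenMP is used in practice (the loop and array below are purely illustrative): compiling with -fopenmp enables the #pragma omp directives, and the loop iterations are divided among the available threads.
#include <iostream>
#include <omp.h>

int main() {
    const int n = 8;
    int a[n];

    // The iterations of this loop are distributed across the threads of the team.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = i * i;
    }

    // Report how many threads OpenMP can use on this machine.
    std::cout << "Max threads: " << omp_get_max_threads() << std::endl;
    for (int i = 0; i < n; i++) {
        std::cout << a[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
The example is compiled in the same way as the programs in this practical, e.g. g++ -fopenmp example.cpp -o example.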
How Parallel BFS Work:
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or
tree systematically in parallel. It is a popular parallel algorithm used for graph traversal in
distributed computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes
to be visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a
certain distance from the root node. Each thread processes the nodes in its local queue at the
current level, and then exchanges the nodes that are adjacent to the current level with other
threads or processors. This is done to ensure that the nodes at the next level are visited by the
next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication
phase. In the computation phase, each thread processes the nodes in its local queue, while in the
communication phase, the threads exchange the nodes that are adjacent to the current level with
other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified
node has been found. The result of the algorithm is the set of visited nodes or the shortest path
from the root node to the target node.
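The level-by-level (computation/communication) scheme described above can be sketched on a shared-memory machine with OpenMP as follows; the adjacency-list graph and the two frontier vectors are illustrative and are not part of the tree-based program given later in this practical.
#include <iostream>
#include <vector>
#include <omp.h>
using namespace std;

// Level-synchronous BFS: every vertex of the current frontier is expanded in parallel,
// and the vertices discovered by the threads form the frontier of the next level.
void parallel_bfs(int source, const vector<vector<int>>& adj) {
    vector<bool> visited(adj.size(), false);
    vector<int> frontier = {source};
    visited[source] = true;

    while (!frontier.empty()) {
        vector<int> next;                          // frontier of the next level
        #pragma omp parallel
        {
            vector<int> local_next;                // per-thread buffer (computation phase)
            #pragma omp for nowait
            for (int i = 0; i < (int)frontier.size(); i++) {
                for (int v : adj[frontier[i]]) {
                    #pragma omp critical           // shared visited[] updated by one thread at a time
                    if (!visited[v]) { visited[v] = true; local_next.push_back(v); }
                }
            }
            #pragma omp critical                   // merge the local buffers (communication phase)
            next.insert(next.end(), local_next.begin(), local_next.end());
        }
        for (int u : frontier) cout << u << " ";   // print the completed level
        frontier = next;
    }
    cout << endl;
}

int main() {
    // Small undirected graph: edges 0-1, 0-2, 1-3, 2-3
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
    parallel_bfs(0, adj);   // prints 0, then 1 and 2, then 3
    return 0;
}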
Conclusion:
In this way we can achieve parallelism while implementing Breadth First Search and Depth First Search.
#include<iostream>
#include<stdlib.h>
#include<queue>
#include<omp.h>
using namespace std;
class node
{
public:
node *left, *right;
int data;
};

class Breadthfs
{
public:
node *insert(node *, int);
void bfs(node *);
};

// insert a new value into the binary tree in level order
node *insert(node *root, int data)
{
if(!root)
{
root=new node;
root->left=NULL;
root->right=NULL;
root->data=data;
return root;
}
queue<node *> q;
q.push(root);
while(!q.empty())
{
node *temp=q.front();
q.pop();
if(temp->left==NULL)
{
temp->left=new node;
temp->left->left=NULL;
temp->left->right=NULL;
temp->left->data=data;
return root;
}
else
{
q.push(temp->left);
if(temp->right==NULL)
{
temp->right=new node;
temp->right->left=NULL;
temp->right->right=NULL;
temp->right->data=data;
return root;
}
else
{
q.push(temp->right);
}
}
}
return root;
}

// Parallel BFS: the nodes of each level are dequeued and printed by multiple threads
void bfs(node *head)
{
queue<node*> q;
q.push(head);
int qSize;
while (!q.empty())
{
qSize = q.size();
#pragma omp parallel for
//creates parallel threads
for (int i = 0; i < qSize; i++)
{
node* currNode;
#pragma omp critical
{
currNode = q.front();
q.pop();
cout<<"\t"<<currNode->data;
}
// enqueue the children of the current node so the next level can be processed
#pragma omp critical
{
if(currNode->left)
q.push(currNode->left);
if(currNode->right)
q.push(currNode->right);
}
}
}
}
int main(){
node *root=NULL;
int data;
char ans;
do
{
cout<<"\n enter data=>";
cin>>data;
root=insert(root,data);
cout<<"do you want to insert one more node? (y/n) ";
cin>>ans;
}while(ans=='y'||ans=='Y');
bfs(root);
return 0;
}
Run Commands:
1) g++ -fopenmp bfs.cpp -o bfs
2) ./bfs
Output:
Enter data => 8 Do you want to insert one more node? (y/n) n
5 3 7 2 1 8
Code to implement DFS using OpenMP:
#include<iostream>
#include <vector>
#include <stack>
#include <omp.h>
using namespace std;

const int MAX = 100000;
vector<int> graph[MAX]; // adjacency list of the undirected graph
bool visited[MAX];

// Iterative DFS; the neighbours of the current node are examined in parallel
void dfs(int start_node)
{
stack<int> s;
s.push(start_node);
while (!s.empty())
{
int curr_node =s.top();
s.pop();
if (!visited[curr_node])
{
visited[curr_node] =true;
if (visited[curr_node])
{
cout << curr_node <<" ";
}
#pragma omp parallel for
for (int i = 0; i < (int)graph[curr_node].size(); i++)
{
int adj_node = graph[curr_node][i];
if (!visited[adj_node])
{
// the shared stack must be updated by one thread at a time
#pragma omp critical
s.push(adj_node);
}
}
}
}
}
int main()
{
int n, m, start_node;
cout << "Enter No of Node, Edges, and start node:" ;
cin >> n >> m >> start_node;
//n: nodes, m: edges
cout << "Enter pairs of edges (u v):" << endl;
for (int i = 0; i < m; i++)
{
int u, v;
cin >> u >> v;
graph[u].push_back(v);
graph[v].push_back(u);
}
dfs(start_node);
/* for (int i = 0; i < n;i++)
{
if (visited[i])
{
cout << i << " ";
}
}*/
return 0;
}
Output:
Practical No: 2
Title: Write a program to implement Parallel Bubble Sort. Use existing algorithms and measure
the performance of sequential and parallel algorithms.
Objective: Students should be able to write a program to implement Parallel Bubble Sort and to
measure the performance of sequential and parallel algorithms.
Prerequisite:
Theory:
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if
they are in the wrong order. It is called "bubble" sort because the algorithm moves the larger
elements towards the end of the array in a manner that resembles the rising of bubbles in a liquid.
The basic steps of the algorithm are:
1. Start at the beginning of the array.
2. Compare the first two adjacent elements.
3. If the first element is greater than the second, swap them.
4. Move to the next pair of adjacent elements and repeat the comparison until the end of the array is reached.
5. If any swaps were made in steps 2-4, repeat the process from step 1.
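A minimal sequential version of these steps (the array contents are illustrative) serves as the baseline against which the parallel version is compared later in this practical.
#include <iostream>
#include <utility>
using namespace std;

// Sequential bubble sort: repeatedly swap adjacent out-of-order elements
// until a full pass makes no swaps.
void bubble_sort(int a[], int n) {
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (int j = 0; j < n - 1; j++) {
            if (a[j] > a[j + 1]) {
                swap(a[j], a[j + 1]);
                swapped = true;
            }
        }
    }
}

int main() {
    int a[] = {5, 3, 4, 1, 2};
    bubble_sort(a, 5);
    for (int x : a) cout << x << " ";   // prints 1 2 3 4 5
    cout << endl;
    return 0;
}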
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it
has the advantage of being easy to understand and implement, and it is useful for educational
purposes and for sorting small datasets.
Bubble Sort has limited practical use in modern software development due to its inefficient time
complexity of O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort
has some advantages and use cases that make it a valuable algorithm to understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to
understand and implement. It can be used to introduce the concept of sorting to beginners
and as a basis for more complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the
principles of sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting
algorithm, as its overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very
efficient. Since Bubble Sort only swaps adjacent elements that are in the wrong order, it
has a low number of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large
datasets, some of its techniques can be used in combination with other sorting algorithms
to optimise their performance. For example, Bubble Sort can be used to optimise the
performance of Insertion Sort by reducing the number of comparisons needed.
Example of Bubble sort
Let's say we want to sort a series of numbers 5, 3, 4, 1, and 2 so that they are arranged
in ascending order…
The sorting begins the first iteration by comparing the first two values. If the first value is greater
than the second, the algorithm pushes the first value to the index of the second value.
First Iteration of the Sorting
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the numbers
become 3, 5, 4, 1, and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next
two values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now
become 3, 4, 5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which
are 5 and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which
are 5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.
That’s the first iteration. And the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5,
3, 4, 1, and 2. As you might realize, 5 should be the last number if the numbers are sorted in
ascending order. This means the first iteration is complete.
The algorithm starts the second iteration with the last result of 3, 4, 1, 2, and 5. This time
around, 3 is smaller than 4, so no swapping happens. This means the numbers will remain the
same.
The algorithm proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the
numbers become 3, 1, 4, 2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2
and the numbers become 3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.
That’s how the algorithm continues to compare the numbers until they are arranged in
ascending order of 1, 2, 3, 4, and 5.
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes
advantage of parallel processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble
Sort algorithm. When all sublists have been sorted, they are merged together to form the
final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that
supports parallel processing in C++, Fortran, and other programming languages. OpenMP
provides a set of compiler directives that allow developers to specify which parts of the
code can be executed in parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is
divided into multiple iterations that are executed concurrently by multiple threads. Each
thread sorts a subset of the list, and the threads synchronise their work at the end of each
iteration to ensure that the elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort
algorithm, especially when sorting large datasets on multi-core processors. However, the
speedup is limited by the overhead of thread creation and synchronisation, and it may not
be worth the effort for small datasets or when using a single-core processor.
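A rough sketch of the sublist approach described above, assuming OpenMP and C++ (the chunking scheme, helper name, and final merge step are illustrative; the program given later in this practical instead uses the odd-even transposition variant).
#include <algorithm>
#include <iostream>
#include <vector>
#include <utility>
#include <omp.h>
using namespace std;

// Each thread bubble-sorts one contiguous chunk; the sorted chunks are then merged.
void parallel_bubble_chunks(vector<int>& a) {
    int nthreads = omp_get_max_threads();
    int n = a.size();
    int chunk = (n + nthreads - 1) / nthreads;

    #pragma omp parallel for
    for (int t = 0; t < nthreads; t++) {
        int lo = t * chunk;
        int hi = min(n, lo + chunk);
        for (int i = lo; i < hi; i++)                 // plain bubble sort on [lo, hi)
            for (int j = lo; j < hi - 1; j++)
                if (a[j] > a[j + 1]) swap(a[j], a[j + 1]);
    }

    // Merge the sorted chunks pairwise into one fully sorted sequence.
    for (int size = chunk; size < n; size *= 2)
        for (int lo = 0; lo + size < n; lo += 2 * size)
            inplace_merge(a.begin() + lo,
                          a.begin() + lo + size,
                          a.begin() + min(n, lo + 2 * size));
}

int main() {
    vector<int> a = {9, 4, 7, 1, 8, 2, 6, 3, 5, 0};
    parallel_bubble_chunks(a);
    for (int x : a) cout << x << " ";   // prints 0 1 2 3 4 5 6 7 8 9
    cout << endl;
    return 0;
}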
How to measure the performance of sequential and parallel algorithms?
To measure the performance of the sequential Bubble sort and parallel Bubble sort
algorithms, you can follow these steps:
1. Implement both the sequential and the parallel version of Bubble sort.
2. Generate a sufficiently large random input array so that the timing is meaningful.
3. Record the execution time of each version, for example with omp_get_wtime() (a sketch is given below).
4. Run each version several times and average the times to reduce measurement noise.
5. Compute the speedup as the sequential time divided by the parallel time.
When measuring the performance of the parallel Bubble sort algorithm, you will need to specify the
number of threads to use. You can experiment with different numbers of threads to find the optimal
value for your system.
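A minimal timing sketch using omp_get_wtime(); the std::sort calls below are only placeholders that keep the example self-contained and should be replaced by the sequential and parallel Bubble sort implementations from this practical.
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <omp.h>
using namespace std;

int main() {
    int n = 1000000;
    vector<int> seq_data(n), par_data;
    for (int i = 0; i < n; i++) seq_data[i] = rand() % n;   // random test input
    par_data = seq_data;

    double t1 = omp_get_wtime();
    sort(seq_data.begin(), seq_data.end());   // stand-in: replace with the sequential bubble sort
    double t2 = omp_get_wtime();
    sort(par_data.begin(), par_data.end());   // stand-in: replace with the parallel bubble sort
    double t3 = omp_get_wtime();

    cout << "Sequential time: " << (t2 - t1) << " s" << endl;
    cout << "Parallel time:   " << (t3 - t2) << " s" << endl;
    cout << "Speedup:         " << (t2 - t1) / (t3 - t2) << endl;
    return 0;
}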
In Ubuntu, you can use a variety of tools to check CPU utilization and memory consumption. Here
are some common tools:
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The
output will display a list of processes sorted by resource usage, with the most resource-
intensive processes at the top.
2. htop: htop is a more advanced version of top that provides additional features, such as
interactive process filtering and a color-coded display. To use it, open a terminal window and
type htop.
3. ps: The ps command provides a snapshot of system resource usage at a particular moment in
time. To use it, open a terminal window and type ps aux. This will display a list of all running
processes and their resource usage.
4. free: The free command provides information about system memory usage, including total,
used, and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU
utilization, memory usage, and disk activity. To use it, open a terminal window and type
vmstat.
Conclusion-
In this way we can implement Bubble Sort in a parallel way using OpenMP and also learn
how to measure the performance of serial and parallel algorithms.
Code to implement Parallel Bubble Sort using OpenMP:
#include<iostream>
#include<omp.h>
using namespace std;

void swap(int &a, int &b)
{
int test;
test=a;
a=b;
b=test;
}

// Parallel bubble sort (odd-even transposition): in each pass, disjoint pairs of
// adjacent elements are compared and swapped concurrently by the OpenMP threads.
void bubble(int *a, int n)
{
for(int i=0;i<n;i++)
{
int first=i%2;
#pragma omp parallel for shared(a,first)
for(int j=first;j<n-1;j+=2)
{
if(a[j]>a[j+1])
swap(a[j],a[j+1]);
}
}
}

int main()
{
int *a,n;
cout<<"\n Enter total no of elements=>";
cin>>n;
a=new int[n];
cout<<"\n Enter elements=>";
for(int i=0;i<n;i++) cin>>a[i];
bubble(a,n);
cout<<"\n Sorted array is=>";
for(int i=0;i<n;i++) cout<<a[i]<<endl;
return 0;
}
Output:
Enter total no of elements=> 6
Enter elements=> 4
6
3
1
2
5
Sorted array is=> 1
2
3
4
5
6
Code to implement Parallel Merge Sort using OpenMP:
#include<iostream>
#include<omp.h>
using namespace std;
void merge(int a[],int i1,int j1,int i2,int j2);

// Parallel merge sort: the two halves are sorted concurrently in OpenMP sections
void mergesort(int a[],int i,int j)
{
int mid;
if(i<j)
{
mid=(i+j)/2;
#pragma omp parallel sections
{
#pragma omp section
mergesort(a,i,mid);
#pragma omp section
mergesort(a,mid+1,j);
}
merge(a,i,mid,mid+1,j);
}
}

// merge the two sorted sub-arrays a[i1..j1] and a[i2..j2]
void merge(int a[],int i1,int j1,int i2,int j2)
{
int temp[1000];
int i=i1,j=i2,k=0;
while(i<=j1 && j<=j2)
{
if(a[i]<a[j])
temp[k++]=a[i++];
else
temp[k++]=a[j++];
}
while(i<=j1)
{
temp[k++]=a[i++];
}
while(j<=j2)
{
temp[k++]=a[j++];
}
for(i=i1,j=0;i<=j2;i++,j++)
{
a[i]=temp[j];
}
}
int main()
{
int *a,n,i;
cout<<"\n enter total no of elements=>";
cin>>n;
a= new int[n];
cout<<"\n enter elements=>";
for(i=0;i<n;i++)
cin>>a[i];
mergesort(a,0,n-1);
cout<<"\n sorted array is=>";
for(i=0;i<n;i++)
cout<<"\n"<<a[i];
return 0;
}
Output:
enter elements=>3
6
5
2
4
Practical No: 3
Title: Implement Min, Max, Sum and Average operations using Parallel Reduction.
Objective: Students should be able to perform min, max, sum, and average operations on
a large set of data using parallel reduction. The program defines four functions,
min_reduction, max_reduction, sum_reduction, and average_reduction, each parallelized with OpenMP.
Prerequisite:
Theory :
The parallel reduction algorithm works by dividing the input data into smaller chunks that can be
processed independently in parallel. Each thread or process computes the reduction operation on its
local chunk of data, producing a partial result. The partial results are then combined in a
hierarchical manner until a single result is obtained.
The most common parallel reduction algorithm is the binary tree reduction algorithm, which has a
logarithmic time complexity and can achieve optimal parallel efficiency. In this algorithm, the input data
is initially divided into chunks of size n, where n is the number of parallel threads or processes. Each
thread or process computes the reduction operation on its chunk of data, producing n partial results.
The partial results are then recursively combined in a binary tree structure, where each internal node
represents the reduction operation of its two child nodes. The tree structure is built in a bottom-up
manner, starting from the leaf nodes and ending at the root node. Each level of the tree reduces the
number of partial results by a factor of two, until a single result is obtained at the root node.
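A compact illustration of the tree-style combination step, assuming the per-thread partial results have already been computed (the array contents and size below are illustrative):
#include <iostream>
#include <vector>
#include <omp.h>
using namespace std;

int main() {
    // Partial results, e.g. one per thread or chunk after the local computation step.
    vector<int> partial = {3, 7, 2, 9, 4, 1, 6, 5};

    // Binary-tree combination: each level halves the number of partial results.
    for (int stride = 1; stride < (int)partial.size(); stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < (int)partial.size(); i += 2 * stride) {
            if (i + stride < (int)partial.size()) {
                partial[i] += partial[i + stride];   // combine two children into their parent
            }
        }
    }
    cout << "Sum = " << partial[0] << endl;   // root of the tree holds the final result
    return 0;
}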
The binary tree reduction algorithm can be implemented using various parallel programming
models, such as OpenMP, MPI, or CUDA. In OpenMP, the algorithm can be implemented using
the parallel and for directives for parallelizing the computation, and the reduction clause for
combining the partial results. In MPI, the algorithm can be implemented using the MPI_Reduce
function for performing the reduction operation, and the MPI_Allreduce function for distributing
the result to all processes. In CUDA, the algorithm can be implemented using the parallel
reduction kernel, which uses shared memory to store the partial results and reduce the memory
access latency.
Parallel reduction has many applications in scientific computing, machine learning, data analytics,
and computer graphics. It can be used to compute the sum, maximum, minimum, or average of
large datasets, to perform data filtering, feature extraction, or image processing, to solve
optimization problems, or to accelerate numerical simulations. Parallel reduction can also be
combined with other parallel algorithms, such as parallel sorting, searching, or matrix operations,
to achieve higher performance and scalability.
Conclusion :
In each function we use the OpenMP reduction clause, which gives every thread a private copy of
the variable being reduced and combines the per-thread minimum, maximum, or sum values into a
single result at the end of the parallel loop, so no explicit locking is needed.
Code to Implement Min, Max, Sum and Average operations using Parallel Reduction.
#include <iostream>
//#include <vector>
#include <omp.h>
#include <climits>
using namespace std;
void min_reduction(int arr[], int n) {
int min_value = INT_MAX;
#pragma omp parallel for reduction(min: min_value)
for (int i = 0; i < n; i++) {
if (arr[i] < min_value) {
min_value = arr[i];
}
}
cout << "Minimum value: " << min_value << endl;
}
void max_reduction(int arr[], int n) {
int max_value = INT_MIN;
#pragma omp parallel for reduction(max: max_value)
for (int i = 0; i < n; i++) {
if (arr[i] > max_value) {
max_value = arr[i];
}
}
cout << "Maximum value: " << max_value << endl;
}

void sum_reduction(int arr[], int n) {
int sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < n; i++) {
sum += arr[i];
}
cout << "Sum: " << sum << endl;
}

void average_reduction(int arr[], int n) {
int sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < n; i++) {
sum += arr[i];
}
cout << "Average: " << (double)sum / n << endl;
}

int main() {
int *arr,n;
cout<<"\n Enter total no of elements=>";
cin>>n;
arr=new int[n];
cout<<"\n Enter elements=>";
for(int i=0;i<n;i++)
{
cin>>arr[i];
}
min_reduction(arr, n);
max_reduction(arr, n);
sum_reduction(arr, n);
average_reduction(arr, n);
}
Output:
Compile: g++ -fopenmp min.cpp -o min
Run: ./min
Practical No: 4
Title: Write a CUDA program for Addition of two large Vectors and for Matrix Multiplication.
Objective: Students should be able to learn about parallel computing and about CUDA
(Compute Unified Device Architecture), and how it helps to boost high-performance
computations.
Prerequisite:
Contents of Theory :
1. CUDA: CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model developed by NVIDIA that uses the power of the GPU to accelerate computations.
2. CUDA architecture: The CUDA architecture consists of host and device components. The host
is the CPU, which manages device memory and launches kernel functions on the device (the GPU).
3. CUDA kernel function: A CUDA kernel function is a function that is executed on the
GPU. It is defined with the __global__ keyword and is called from the host code using a
launch configuration. Each kernel function runs in parallel on multiple threads, where
each thread performs the same operation on different data.
4. CUDA memory model: CUDA provides global, shared, and local memory. Global memory is
accessible to all threads, shared memory to the threads of a block, and local memory is private
to each thread.
5. CUDA thread organization: In CUDA, threads are organized into blocks, and blocks are
organized into a grid. Each thread is identified by a unique thread index, and each block is
identified by a unique block index.
CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and
programming model developed by NVIDIA. CUDA allows developers to use the power of the GPU
to accelerate computations. It is designed to be used with C, C++, and Fortran programming
languages. CUDA architecture consists of host and device components. The host is the CPU, and
the device is the GPU. The CPU is responsible for managing the GPU memory and launching the
kernel functions on the device.
A CUDA kernel function is a function that is executed on the GPU. It is defined with the __global__
keyword and is called from the host code using a launch configuration. Each kernel function runs in
parallel on multiple threads, where each thread performs the same operation on different data.
CUDA provides three types of memory: global, shared, and local. Global memory is allocated on
the device and can be accessed by all threads. Shared memory is allocated on the device and can be
accessed by threads within a block. Local memory is allocated on each thread and is used for
temporary storage.
CUDA threads are organized into blocks, and blocks are organized into a grid. Each thread is
identified by a unique thread index, and each block is identified by a unique block index.
CUDA devices have a hierarchical memory architecture consisting of multiple memory levels,
including registers, shared memory, L1 cache, L2 cache, and global memory.
CUDA supports various libraries, including cuBLAS for linear algebra, cuFFT for Fast Fourier
Transform, and cuDNN for deep learning.
CUDA programming requires a compatible NVIDIA GPU and an installation of the CUDA
Toolkit, which includes the CUDA compiler, libraries, and tools.
This program uses CUDA to add two large vectors of size 1000000. The vectors are initialized
on the host, and then copied to the device memory. A kernel function is defined to perform the
vector addition, and then launched on the device. The result is copied back to the host memory
and verified. Finally, the device and host memories are freed.
The kernel function calculates the row and column indices of the output matrix using the block index and
thread index. It then uses a for loop to calculate the sum of the products of the corresponding elements in
the input matrices. The result is stored in the output matrix.
Note that in this program, we use CUDA events to measure the elapsed time of the kernel function. This
is because the kernel function runs asynchronously on the GPU, so we need to use events to synchronize
the host and device and measure the time accurately.
4. Compile the CUDA program with the nvcc compiler that ships with the CUDA Toolkit:
nvcc program_name.cu -o program_name
5. This will generate an executable program named program_name. Run the CUDA
program: Finally, you can run the CUDA program by executing the executable file
generated in the previous step.
The command to run the program is:
./program_name
Conclusion:
Hence we have implemented Addition of two large vectors and Matrix Multiplication using CUDA.
Code to implement Addition of two large Vectors using CUDA:
#include <stdio.h>
#include <stdlib.h>
// CUDA kernel: each thread adds one pair of elements
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
int n = 1000000; // Vector size
int *a, *b, *c; // Host vectors
int *d_a, *d_b, *d_c; // Device vectors
int size = n * sizeof(int); // Size in bytes
// Allocate and initialize host memory
a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size);
for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }
// Allocate device memory and copy the input vectors to the device
cudaMalloc((void**)&d_a, size); cudaMalloc((void**)&d_b, size); cudaMalloc((void**)&d_c, size);
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
// Launch kernel
vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
// Copy the result back to the host and verify it
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
for (int i = 0; i < n; i++) {
if (c[i] != a[i] + b[i]) {
printf("Error at index %d\n", i);
break;
}
}
printf("Vector addition completed\n");
// Free device and host memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(a); free(b); free(c);
return 0;
}
Code to implement Matrix Multiplication using CUDA:
#include <stdio.h>
#define BLOCK_SIZE 16
__global__ void matrix_multiply(float *a, float *b, float *c, int n)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0;
if (row < n && col < n)
{
// sum of the products of the corresponding elements of the input matrices
for (int k = 0; k < n; ++k)
sum += a[row * n + k] * b[k * n + col];
c[row * n + col] = sum;
}
}

int main()
{
int n = 1024;
size_t size = n * n * sizeof(float);
float *a, *b, *c;
float *d_a, *d_b, *d_c;
cudaEvent_t start, stop;
float elapsed_time;
a = (float*)malloc(size); b = (float*)malloc(size); c = (float*)malloc(size);
// Initialize matrices
for (int i = 0; i < n * n; ++i)
{
a[i] = i % n;
b[i] = i % n;
}
cudaMalloc((void**)&d_a, size); cudaMalloc((void**)&d_b, size); cudaMalloc((void**)&d_c, size);
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 blocks((n + BLOCK_SIZE - 1) / BLOCK_SIZE, (n + BLOCK_SIZE - 1) / BLOCK_SIZE);
// Launch kernel and measure its execution time with CUDA events
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
matrix_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time, start, stop);
printf("Kernel execution time: %f ms\n", elapsed_time);
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(a); free(b); free(c);
return 0;
}
Group B
Mini Project: 1
Title - Implement parallel Quicksort using MPI and compare it with the serial version.
Approach: We will use Python and MPI to implement the parallel version of
Quicksort Algorithm and compare its performance with the serial version of the
algorithm.
Requirements:
Python3.x
mpi4py
Theory :
• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recursion: Solve the sub-problems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk. into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or
1.
Many studies [2] have revealed that in order to sort N items, it will take QuickSort
an average running time of O(N log N). The worst-case running time for QuickSort
will occur when the pivot is a unique minimum or maximum element, and as stated
in [2], the worst-case running time for QuickSort on N items is O(N^2). These
different running times can be influenced by the input distribution (uniform, sorted
or semi-sorted, unsorted, duplicates) and the choice of the pivot element. Here is a
simple pseudocode of the QuickSort algorithm adapted from Wikipedia [1].
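A compact C++ rendering of that algorithm (Lomuto partition scheme; the function names and the test array are illustrative):
#include <iostream>
#include <vector>
using namespace std;

// Lomuto partition: elements <= pivot end up left of the returned index.
int partition(vector<int>& a, int lo, int hi) {
    int pivot = a[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; j++) {
        if (a[j] <= pivot) swap(a[++i], a[j]);
    }
    swap(a[i + 1], a[hi]);
    return i + 1;
}

void quicksort(vector<int>& a, int lo, int hi) {
    if (lo < hi) {
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);   // recursively sort the left part
        quicksort(a, p + 1, hi);   // recursively sort the right part
    }
}

int main() {
    vector<int> a = {9, 4, 7, 1, 8, 2};
    quicksort(a, 0, (int)a.size() - 1);
    for (int x : a) cout << x << " ";   // prints 1 2 4 7 8 9
    cout << endl;
    return 0;
}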
We have made use of Open MPI as the backbone library for parallelizing the
QuickSort algorithm. In fact, learning message passing interface (MPI) allows us to
strengthen our fundamental knowledge on parallel programming, given that MPI is
lower level than equivalent libraries (OpenMP). As simple as its name means, the
basic idea behind MPI is that messages can be passed or exchanged among different
processes in order to perform a given task. An illustration can be a communication
and coordination by a master process which splits a huge task into chunks and shares
them to its slave processes. Open MPI is developed and maintained by a consortium
of academic, research and industry partners; it combines the expertise, technologies
and resources all across the high performance computing community [11]. As
elaborated in [4], MPI has two types of communication routines: point-to-point and collective
communication routines.
Algorithm
In general, the overall algorithm used here to perform QuickSort with MPI works as follows:
Steps:
1. Initialize MPI:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
2. On the root process, choose a pivot and partition the data into three lists:
pivot = arr[len(arr) // 2]
left, middle, right = [], [], []
for x in arr:
if x < pivot:
left.append(x)
elif x == pivot:
middle.append(x)
else:
right.append(x)
left_size = len(left)
middle_size = len(middle)
right_size = len(right)
3. Distribute the partitions among the processes:
comm.barrier()
comm.Scatter(left, chunk_left, root=0)
comm.Scatter(middle, chunk_middle, root=0)
comm.Scatter(right, chunk_right, root=0)
4. Generate a large dataset of numbers for testing:
import random
data = [random.randint(0, 10**6) for _ in range(10**6)]
5. Compare the performance of the serial and parallel versions of the algorithm:
if rank == 0:
Output:
Serial Quicksort Algorithm time: 1.5536 seconds
Parallel Quicksort Algorithm time: 1.3488 seconds
Mini Project : 2
Title - Implement Huffman Encoding on GPU
Theory -
Huffman Coding makes sure that there is no ambiguity when decoding the
generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters
a, b, c and d, and their corresponding variable length codes be 00, 01, 0 and 1. This
coding leads to ambiguity because code assigned to c is the prefix of codes
assigned to a and b. If the compressed bit stream is 0001, the de-compressed output
may be “cccd” or “ccb” or “acd” or “ab”.
Here are some specific optimizations that can be applied to each step:
1.Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each
character in parallel on different threads.
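As a sketch of this first step, character frequencies can be counted in parallel with per-thread local histograms that are merged at the end; shown here with OpenMP on the CPU, with an illustrative input string (the GPU source code below covers the later encoding step).
#include <iostream>
#include <string>
#include <vector>
#include <omp.h>
using namespace std;

int main() {
    string text = "Hello, world!";        // illustrative input text
    vector<long long> freq(256, 0);       // one counter per byte value

    #pragma omp parallel
    {
        vector<long long> local(256, 0);  // per-thread histogram: no sharing while counting
        #pragma omp for nowait
        for (int i = 0; i < (int)text.size(); i++) {
            local[(unsigned char)text[i]]++;
        }
        #pragma omp critical              // merge the local histograms into the global one
        for (int c = 0; c < 256; c++) freq[c] += local[c];
    }

    for (int c = 0; c < 256; c++) {
        if (freq[c] > 0) cout << (char)c << " : " << freq[c] << endl;
    }
    return 0;
}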
Source Code -
output_size += codes[input_text[i]].size();
}
output_size = (output_size + 7) / 8;
char* output_text = new char[output_size];
char* d_output_text;
cudaMalloc((void**)&d_output_text, output_size * sizeof(char));
cudaMemcpy(d_output_text, output_text, output_size * sizeof(char), cudaMemcpyHostToDevice);
encode_text<<<grid_size, block_size>>>(input_text, input_size, d_output_text, output_size, codes);
cudaMemcpy(output_text, d_output_text, output_size * sizeof(char), cudaMemcpyDeviceToHost);
Output -
Input text: Hello, world!
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000 11111010
Mini Project : 3
Title - Implement Parallelization of Database Query optimization
Theory -
Query processing is the process through which a Database Management System
(DBMS) parses, verifies, and optimizes a given query before creating low-level
code that the DB understands.
Query Processing in DBMS, like any other High-Level Language (HLL) where
code is first generated and then executed to perform various operations, has two
phases: compile-time and runtime.
The use of declarative query languages and query optimization is one of the main
factors contributing to the success of RDBMS technology. Any database allows
users to create queries to request specific data, and the database then uses effective
methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous
other academics. But the majority of their effort was on optimizing join operations
while taking into account the L2-cache and the parallel buffers of the shared main
memory.
The following techniques can be used to make a query parallel
• I/O parallelism
• Internal parallelism of queries
• Parallelism among queries
• Within-operation parallelism
• Parallelism in inter-operation
I/O parallelism :
This type of parallelism involves partitioning the relationships among the discs in
order to speed up the retrieval of relationships from the disc.
The inputted data is divided within, and each division is processed simultaneously.
After processing all of the partitioned data, the results are combined. Another
name for it is data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning
attribute and have the benefit of offering an even distribution of data across the
discs.
It should be mentioned that partitioning is beneficial for sequential scans: with the
full table stored on "n" discs, the table can be scanned in roughly 1/n of the time
required by a single-disc system. In I/O parallelism, there are four different methods of partitioning:
Hash partitioning :
A hash function is a quick mathematical operation. The partitioning properties are
hashed for each row in the original relationship.
Let’s say that the data is to be partitioned across 4 drives, numbered disk1, disk2,
disk3, and disk4. A row is stored on disk3 if the hash function returns 3 for that row.
Range partitioning :
Each disc receives a contiguous range of attribute values when range
partitioning is used. For instance, if we range-partition across three discs numbered
0, 1, and 2, tuples with a value less than 5 are written to disk0,
values from 5 to 40 are sent to disk1, and values above 40 are written to disk2.
It has several benefits, such as placing tuples whose attribute values fall
within a specified range on the same disc.
Round-robin partitioning :
Any order can be used to study the relationships in this method. It sends the ith
tuple to the disc number (i% n).
Therefore, new rows of data are received by discs in turn. For applications that want to
read the full relation sequentially for each query, this strategy assures an even
distribution of tuples across drives.
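A toy sketch of how the three tuple-placement rules differ (the disc count, key values, and hash function below are illustrative):
#include <iostream>
using namespace std;

int main() {
    const int n_discs = 4;
    int keys[] = {3, 17, 42, 55, 8};

    for (int i = 0; i < 5; i++) {
        int key = keys[i];
        int hash_disc  = key % n_discs;                         // hash partitioning: h(key) = key mod n
        int range_disc = (key < 5) ? 0 : (key <= 40 ? 1 : 2);   // range partitioning over 3 discs
        int rr_disc    = i % n_discs;                           // round-robin: i-th tuple -> disc (i mod n)
        cout << "key " << key
             << " -> hash disc " << hash_disc
             << ", range disc " << range_disc
             << ", round-robin disc " << rr_disc << endl;
    }
    return 0;
}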
Schema Partitioning :
Various tables inside a database are put on different discs using a
technique called schema partitioning.
Intra-query parallelism :
Using a shared-nothing paralleling architecture technique, intra-query parallelism
refers to the processing of a single query in a parallel process on many CPUs.
This employs two different strategies:
First method — In this method, a duplicate task can be executed on a small amount
of data by each CPU.
Second method — Using this method, the task can be broken up into various sectors,
with each CPU carrying out a separate subtask.
Inter-query parallelism
Each CPU executes numerous transactions when inter-query parallelism is used.
Parallel transaction processing is what it is known as. To support inter-query
parallelism, DBMS leverages transaction dispatching.
We can also employ a variety of techniques, such as effective lock management.
Without inter-query parallelism, each query runs sequentially, which slows down the overall running time.
In such circumstances, DBMS must be aware of the locks that various
transactions operating on various processes have acquired. When
simultaneous transactions don’t accept the same data, inter-query parallelism
on shared storage architecture works well.
Additionally, the throughput of transactions is boosted, and it is the simplest form
of parallelism in DBMS.
Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as
sorting, joins, and projections, in parallel.
Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query
expression. They come in two varieties:
Pipelined parallelism — In pipeline parallelism, a second operation consumes a row of
the first operation’s output before the first operation has finished producing the whole
set of rows in its output. Additionally, it is feasible to perform these two processes
concurrently on several CPUs, allowing one operation to consume tuples in parallel with
another operation producing them, thereby reducing the overall response time.
It is advantageous for systems with a limited number of CPUs and prevents the
storage of interim results on a disc.
Independent parallelism- In this form of parallelism, operations contained within
query phrases that are independent of one another may be carried out
concurrently. This form of parallelism is most helpful when only a lower degree of
parallelism is available.
Parallelization divides this stage into two parts: extraction of parallelism and
scheduling. Optimizing database queries is an important task in database
management systems to improve the performance of database operations.
Parallelization of database query optimization can significantly improve query
execution time by dividing the workload among multiple processors or nodes.
1. Partitioning: The first step is to partition the data into smaller subsets. The
partitioning can be done based on different criteria, such as range partitioning,
hash partitioning, or list partitioning. This can be done in parallel by assigning
different processors or nodes to handle different parts of the partitioning process.
2. Query optimization: Once the data is partitioned, the next step is to optimize the
queries. Query optimization involves finding the most efficient way to execute the
query by considering factors such as index usage, join methods, and filtering. This
can also be done in parallel by assigning different processors or nodes to handle
different parts of the query optimization process.
3. Query execution: After the queries are optimized, the final step is to execute
the queries. The execution can be done in parallel by assigning different
processors or nodes to handle different parts of the execution process. The results
can then be combined to generate the final result set.
Here's an example of how we can parallelize the query optimization process using
OpenMP:
//C++
// Partition the data and optimize the per-partition queries in parallel.
// partition_data() and optimize_query() are application-specific helper functions.
std::vector<std::vector<int>> partitions;
int num_partitions = omp_get_max_threads(); // omp_get_num_threads() returns 1 outside a parallel region
partitions.resize(num_partitions);
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++)
{
std::vector<int> partition = partition_data(data, i, num_partitions);
partitions[i] = partition; // each thread writes only its own slot, avoiding a data race
optimize_query(query, partition);
}
In this example, we first partition the data into smaller subsets using OpenMP
parallelism. Then we optimize each query in parallel by assigning different
processors or nodes to handle different parts of the optimization process. Finally,
we execute the queries in parallel by assigning different processors or nodes to
handle different parts of the execution process.
Mini Project : 4
Title - Implement Non-Serial Polyadic Dynamic Programming with GPU
Parallelization.
Theory -
A straightforward GPU implementation of non-serial polyadic dynamic programming (NPDP)
suffers from load imbalance, i.e. a non-optimal mapping between the sub-problems of NPDP and the
processing elements of the GPU.
NPDP exhibits non-uniformity in the number of subproblems as well as
computational complexity across the phases. In NPDP parallelization, phases are
computed sequentially whereas subproblems of each phase are computed
concurrently.
Therefore, it is essential to effectively map the subproblems of each phase
to the processing elements while implementing thread level parallelism. We
propose an adaptive Generalized Mapping Method (GMM) for NPDP
parallelization that utilizes the GPU for efficient mapping of subproblems onto
processing threads in each phase.
Input-size and targeted GPU decide the computing power and the best
mapping for each phase in NPDP parallelization. The performance of GMM is
compared with different conventional parallelization approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art
conventional parallelization approach and achieves a significant speedup of a
factor 30. We also summarize the general heuristics for achieving better gain in the
NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization
problems with multiple dimensions. Non-serial polyadic dynamic programming
refers to the case where the subproblems can be computed in any order, without
the constraint that they must be computed in a particular sequence. This makes it
possible to parallelize the computation on a GPU.
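As a small CPU-side illustration of this phase-wise scheme (using OpenMP rather than CUDA), the classic matrix-chain-ordering problem — a standard non-serial polyadic DP — computes the phases (chain lengths) sequentially while the subproblems inside each phase are computed in parallel; the matrix dimensions below are illustrative.
#include <climits>
#include <iostream>
#include <vector>
#include <omp.h>
using namespace std;

int main() {
    // Matrix chain A1..A4 with dimensions 10x20, 20x5, 5x30, 30x10.
    vector<int> p = {10, 20, 5, 30, 10};
    int n = (int)p.size() - 1;                    // number of matrices
    vector<vector<long long>> dp(n + 1, vector<long long>(n + 1, 0));

    // Phase = chain length: phases run sequentially, subproblems of a phase run in parallel.
    for (int len = 2; len <= n; len++) {
        #pragma omp parallel for
        for (int i = 1; i <= n - len + 1; i++) {
            int j = i + len - 1;
            long long best = LLONG_MAX;
            for (int k = i; k < j; k++) {         // non-serial polyadic recurrence
                long long cost = dp[i][k] + dp[k + 1][j]
                               + (long long)p[i - 1] * p[k] * p[j];
                if (cost < best) best = cost;
            }
            dp[i][j] = best;
        }
    }
    cout << "Minimum multiplication cost: " << dp[1][n] << endl;
    return 0;
}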
#include <iostream>
#include <cuda.h>
float* d_y;
cudaMalloc(&d_y, M * sizeof(float));
cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);
float* d_z;
cudaMalloc(&d_z, K * sizeof(float));