
High Performance Computing Lab Manual BE Computer

Practical No: 1

Title: Design and implement Parallel Breadth First Search and Depth First Search based on existing
algorithms using OpenMP. Use a Tree or an undirected graph for BFS and DFS

Objective : Students should be able to Write a program to implement Parallel Breadth First Search
and Depth First Search based on existing algorithms using OpenMP

Prerequisite:

1. Basics of a programming language

2. Concept of BFS and DFS

3. Concept of Parallelism

Theory:

What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all the
neighbouring nodes at the current depth level before moving on to the next depth level. The algorithm
uses a queue data structure to keep track of the nodes that need to be visited, and marks each visited node
to avoid processing it again. The basic idea of the BFS algorithm is to visit all the nodes at a given level
before moving on to the next level, which ensures that all the nodes are visited in breadth-first order. BFS
is commonly used in many applications, such as finding the shortest path between two nodes, solving
puzzles, and searching through a tree or graph.
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:

Step 1: Take an Empty Queue.


Step 2: Select a starting node (visiting a node) and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract the node from the Queue and insert its child
nodes (exploring a node) into the Queue.
Step 4: Print the extracted node.
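
For reference, here is a minimal sequential BFS sketch in C++ over a small adjacency-list graph that follows the four steps above (the graph and function names are illustrative assumptions; the OpenMP version on a binary tree is given later in this practical):

#include <iostream>
#include <queue>
#include <vector>
using namespace std;

// Sequential BFS over an adjacency list, following the four steps above.
void bfs_sequential(const vector<vector<int>> &graph, int start)
{
    vector<bool> visited(graph.size(), false);
    queue<int> q;                  // Step 1: take an empty queue
    q.push(start);                 // Step 2: insert the starting node
    visited[start] = true;
    while (!q.empty())             // Step 3: while the queue is not empty, extract a node
    {
        int node = q.front();
        q.pop();
        cout << node << " ";       // Step 4: print the extracted node
        for (int next : graph[node])   // insert its unvisited neighbours into the queue
        {
            if (!visited[next])
            {
                visited[next] = true;
                q.push(next);
            }
        }
    }
}

int main()
{
    // Small undirected example graph: edges 0-1, 0-2, 1-3
    vector<vector<int>> graph = {{1, 2}, {0, 3}, {0}, {1}};
    bfs_sequential(graph, 0);      // prints 0 1 2 3
    return 0;
}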


What is DFS?
DFS stands for Depth-First Search. It is a popular graph traversal algorithm that explores as far as
possible along each branch before backtracking. This algorithm can be used to find a path between
two vertices or to traverse a graph in a systematic way. The algorithm starts at the root node and
explores as far as possible along each branch before backtracking; the backtracking is done to
explore the next branch that has not been explored yet.

DFS can be implemented using either a recursive or an iterative approach. The recursive approach
is simpler to implement but can lead to a stack overflow error for very large graphs. The iterative
approach uses a stack to keep track of nodes to be explored and is preferred for larger graphs.

DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will
eventually reach a node that has already been visited, indicating that a cycle exists. A standard DFS
implementation puts each vertex of the graph into one of two categories: 1. Visited 2. Not Visited The
purpose of the algorithm is to mark each vertex as visited while avoiding cycles.
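
For comparison, a minimal sequential, stack-based DFS sketch in C++ (the graph and names are illustrative assumptions; the OpenMP version is given later in this practical):

#include <iostream>
#include <stack>
#include <vector>
using namespace std;

// Iterative DFS using an explicit stack; each vertex is either Visited or Not Visited.
void dfs_sequential(const vector<vector<int>> &graph, int start)
{
    vector<bool> visited(graph.size(), false);
    stack<int> s;
    s.push(start);
    while (!s.empty())
    {
        int node = s.top();
        s.pop();
        if (!visited[node])
        {
            visited[node] = true;
            cout << node << " ";
            for (int next : graph[node])   // push unvisited neighbours for later exploration
            {
                if (!visited[next])
                    s.push(next);
            }
        }
    }
}

int main()
{
    // Same example graph as above: edges 0-1, 0-2, 1-3
    vector<vector<int>> graph = {{1, 2}, {0, 3}, {0}, {1}};
    dfs_sequential(graph, 0);      // one valid DFS order: 0 2 1 3
    return 0;
}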

Concept of OpenMP
● OpenMP (Open Multi-Processing) is an application programming interface (API) that


supports shared-memory parallel programming in C, C++, and Fortran. It is used to write


parallel programs that can run on multicore processors, multiprocessor systems, and parallel
computing clusters.
● OpenMP provides a set of directives and functions that can be inserted into the source
code of a program to parallelize its execution. These directives are simple and easy to use,
and they can be applied to loops, sections, functions, and other program constructs. The
compiler then generates parallel code that can run on multiple processors concurrently.
● OpenMP programs are designed to take advantage of the shared-memory architecture of
modern processors, where multiple processor cores can access the same memory. OpenMP uses
a fork-join model of parallel execution, where a master thread forks multiple worker threads to
execute a parallel region of the code, and then waits for all threads to complete before continuing
with the sequential part of the code.

● OpenMP is widely used in scientific computing, engineering, and other fields that require high-
performance computing. It is supported by most modern compilers and is available on a wide range of
platforms, including desktops, servers, and supercomputers.
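
A minimal OpenMP example of this fork-join model (a sketch; the number of threads and the output order will vary by machine). It compiles with the same flag used later, g++ -fopenmp:

#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    // The master thread forks a team of worker threads for this region,
    // then joins them before the sequential part continues.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp critical
        cout << "Hello from thread " << id
             << " of " << omp_get_num_threads() << endl;
    }
    cout << "Back in the sequential part of the code." << endl;
    return 0;
}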
How Parallel BFS Work:

● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or
tree systematically in parallel. It is a popular parallel algorithm used for graph traversal in
distributed computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes
to be visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a
certain distance from the root node. Each thread processes the nodes in its local queue at the
current level, and then exchanges the nodes that are adjacent to the current level with other
threads or processors. This is done to ensure that the nodes at the next level are visited by the
next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication
phase. In the computation phase, each thread processes the nodes in its local queue, while in the
communication phase, the threads exchange the nodes that are adjacent to the current level with
other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified
node has been found. The result of the algorithm is the set of visited nodes or the shortest path


from the root node to the target node.


● Parallel BFS can be implemented using different parallel programming models, such as
OpenMP, MPI, CUDA, and others. The performance of the algorithm depends on the number
of threads or processors used, the size of the graph, and the communication overhead between
the threads or processors.

Conclusion:
In this way we can achieve parallelism while implementing Breadth First Search and Depth First Search.


Code to implement BFS using OpenMP:

#include<iostream>
#include<stdlib.h>
#include<queue>
using namespace std;

class node
{
public:

node *left, *right;


int data;

};

class Breadthfs
{

public:

node *insert(node *, int);


void bfs(node *);

};

node *insert(node *root, int data)


// inserts a node in tree
{

if(!root)
{

root=new node;
root->left=NULL;
root->right=NULL;
root->data=data;
return root;
}


queue<node *> q;
q.push(root);

while(!q.empty())
{

node *temp=q.front();
q.pop();

if(temp->left==NULL)
{

temp->left=new node;
temp->left->left=NULL;
temp->left->right=NULL;
temp->left->data=data;
return root;
}
else
{

q.push(temp->left);

if(temp->right==NULL)
{

temp->right=new node;
temp->right->left=NULL;
temp->right->right=NULL;
temp->right->data=data;
return root;
}
else
{

q.push(temp->right);
}
}
}
return root;
}

void bfs(node *head)


{

queue<node*> q;
q.push(head);

int qSize;

while (!q.empty())
{
qSize = q.size();
#pragma omp parallel for
//creates parallel threads
for (int i = 0; i < qSize; i++)
{
node* currNode;
#pragma omp critical
{
currNode = q.front();
q.pop();
cout<<"\t"<<currNode->data;

}// prints parent node


#pragma omp critical
{
if(currNode->left)// push parent's left node in queue
q.push(currNode->left);
if(currNode->right)
q.push(currNode->right);
}// push parent's right node in queue

}
}
}

int main(){

node *root=NULL;
int data;


char ans;

do
{
cout<<"\n enter data=>";
cin>>data;

root=insert(root,data);

cout<<"do you want insert one more node?";


cin>>ans;

}while(ans=='y'||ans=='Y');

bfs(root);

return 0;
}

Run Commands:
1) g++ -fopenmp bfs.cpp -o bfs
2) ./bfs

Output:

Enter data => 5


Do you want to insert one more node? (y/n) y

Enter data => 3


Do you want to insert one more node? (y/n) y

Enter data => 2


Do you want to insert one more node? (y/n) y

Enter data => 1


Do you want to insert one more node? (y/n) y

Enter data => 7


Do you want to insert one more node? (y/n) y

Enter data => 8

Do you want to insert one more node? (y/n) n


5 3 7 2 1 8
Code to implement DFS using OpenMP:
#include<iostream>
#include <vector>
#include <stack>
#include <omp.h>

using namespace std;

const int MAX =100000;


vector<int> graph[MAX];
bool visited[MAX];

void dfs(int node)


{
stack<int> s;
s.push(node);

while (!s.empty())
{
int curr_node =s.top();
s.pop();

if (!visited[curr_node])
{
visited[curr_node] = true;
cout << curr_node << " ";

#pragma omp parallel for
for (int i = 0; i < graph[curr_node].size(); i++)
{
int adj_node = graph[curr_node][i];
if (!visited[adj_node])
{
// the stack is shared between threads, so guard concurrent pushes
#pragma omp critical


s.push(adj_node);
}
}
}
}
}

int main()
{
int n, m, start_node;
cout << "Enter No of Node, Edges, and start node:" ;
cin >> n >> m >> start_node;
//n: node,m:edges

cout << "Enter Pair of edges:" ;


for (int i = 0; i < m;i++)
{
int u, v;
cin >> u >> v;
//u and v: Pair of edges
graph[u].push_back(v);
graph[v].push_back(u);
}

#pragma omp parallel for
for (int i = 0; i < n; i++)
{
visited[i] = false;
}

dfs(start_node);
/* for (int i = 0; i < n;i++)
{
if (visited[i])
{
cout << i << " ";
}
}*/
return 0;
}


Output:


Practical No: 2

Title: Write a program to implement Parallel Bubble Sort. Use existing algorithms and measure
the performance of sequential and parallel algorithms.

Objective: Students should be able to Write a program to implement Parallel Bubble Sort and can
measure the performance of sequential and parallel algorithms.

Prerequisite:

1. Basics of a programming language

2. Concept of Bubble Sort

3. Concept of Parallelism

Theory:

What is Bubble Sort?

Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if
they are in the wrong order. It is called "bubble" sort because the algorithm moves the larger
elements towards the end of the array in a manner that resembles the rising of bubbles in a liquid.

The basic algorithm of Bubble Sort is as follows:

1. Start at the beginning of the array


2. Compare the first two elements. If the first element is greater than the second element, swap
them.
3. Move to the next pair of elements and repeat step 2.
4. Continue the process until the end of the array is reached.

5. If any swaps were made in step 2-4, repeat the process from step 1.
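
A minimal sequential Bubble Sort in C++ following these steps (a sketch; it also serves as the sequential baseline when comparing performance later in this practical):

#include <iostream>
using namespace std;

// Sequential Bubble Sort: repeatedly swap adjacent elements that are in the wrong order.
void bubble_sequential(int a[], int n)
{
    for (int i = 0; i < n - 1; i++)            // at most n-1 passes over the array
    {
        bool swapped = false;
        for (int j = 0; j < n - 1 - i; j++)    // compare each pair of adjacent elements
        {
            if (a[j] > a[j + 1])
            {
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                swapped = true;
            }
        }
        if (!swapped)                          // step 5: stop when a full pass makes no swaps
            break;
    }
}

int main()
{
    int a[] = {5, 3, 4, 1, 2};                 // the example series used below
    bubble_sequential(a, 5);
    for (int i = 0; i < 5; i++) cout << a[i] << " ";   // prints 1 2 3 4 5
    return 0;
}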

The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it
has the advantage of being easy to understand and implement, and it is useful for educational
purposes and for sorting small datasets.

Bubble Sort has limited practical use in modern software development due to its inefficient time
complexity of O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort
has some advantages and use cases that make it a valuable algorithm to understand, such as:


1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to
understand and implement. It can be used to introduce the concept of sorting to beginners
and as a basis for more complex sorting algorithms.

2. Educational purposes: Bubble Sort is often used in academic settings to teach the
principles of sorting algorithms and to help students understand how algorithms work.

3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting
algorithm, as its overhead is relatively low.

4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very
efficient. Since Bubble Sort only swaps adjacent elements that are in the wrong order, it
has a low number of operations for a partially sorted dataset.

5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large
datasets, some of its techniques can be used in combination with other sorting algorithms
to optimise their performance. For example, Bubble Sort can be used to optimise the
performance of Insertion Sort by reducing the number of comparisons needed.
Example of Bubble sort
Let's say we want to sort a series of numbers 5, 3, 4, 1, and 2 so that they are arranged
in ascending order…
The sorting begins the first iteration by comparing the first two values. If the first value is greater
than the second, the algorithm swaps them, so the larger value moves one position toward the end.
First Iteration of the Sorting

Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the numbers
become 3, 5, 4, 1, and 2.


Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next
two values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now
become 3, 4, 5, 1, and 2.

Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which
are 5 and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.

Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which
are 5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.

That’s the first iteration. The numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5,
3, 4, 1, and 2. As you might realize, 5 (the largest value) has bubbled to the last position, which is
exactly where it belongs in ascending order, so the first iteration is complete.

Second Iteration of the Sorting and the Rest

The algorithm starts the second iteration with the last result of 3, 4, 1, 2, and 5. This time
around, 3 is smaller than 4, so no swapping happens. This means the numbers will remain the


same.

The algorithm proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the
numbers become 3, 1, 4, 2, and 5.

The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2
and the numbers become 3, 1, 2, 4, and 5.

4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.


That’s how the algorithm continues to compare the numbers until they are arranged in
ascending order of 1, 2, 3, 4, and 5.

How Parallel Bubble Sort Work

● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes
advantage of parallel processing to speed up the sorting process.

● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble
Sort algorithm. When all sublists have been sorted, they are merged together to form the
final sorted list.

● The parallelization of the algorithm is achieved using OpenMP, a programming API that
supports parallel processing in C++, Fortran, and other programming languages. OpenMP
provides a set of compiler directives that allow developers to specify which parts of the
code can be executed in parallel.

● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is
divided into multiple iterations that are executed concurrently by multiple threads. Each
thread sorts a subset of the list, and the threads synchronise their work at the end of each
iteration to ensure that the elements are properly ordered.

● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort
algorithm, especially when sorting large datasets on multi-core processors. However, the


speedup is limited by the overhead of thread creation and synchronisation, and it may not
be worth the effort for small datasets or when using a single-core processor.
How to measure the performance of sequential and parallel algorithms?

To measure the performance of sequential Bubble sort and parallel Bubble sort
algorithms, you can follow these steps:

1. Implement both the sequential and parallel Bubble sort algorithms.
2. Choose a range of test cases, such as arrays of different sizes and different
degrees of sortedness, to test the performance of both algorithms.
3. Use a reliable timer to measure the execution time of each algorithm on each test case.
4. Record the execution times and analyze the results.

When measuring the performance of the parallel Bubble sort algorithm, you will need to specify the
number of threads to use. You can experiment with different numbers of threads to find the optimal
value for your system.
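
A hedged sketch of steps 3 and 4 using omp_get_wtime() (the names bubble_sequential and bubble are assumptions referring to the sequential baseline sketched earlier and the parallel version given below; both sort copies of the same input):

#include <cstring>
#include <iostream>
#include <omp.h>
using namespace std;

void bubble_sequential(int a[], int n);   // sequential baseline (as sketched earlier)
void bubble(int a[], int n);              // parallel odd-even version (code given below)

void compare_times(int a[], int n)
{
    int *copy1 = new int[n], *copy2 = new int[n];
    memcpy(copy1, a, n * sizeof(int));    // give both versions identical input
    memcpy(copy2, a, n * sizeof(int));

    double t1 = omp_get_wtime();
    bubble_sequential(copy1, n);          // time the sequential version
    double t2 = omp_get_wtime();

    omp_set_num_threads(4);               // experiment with different thread counts
    bubble(copy2, n);                     // time the parallel version
    double t3 = omp_get_wtime();

    cout << "Sequential time: " << (t2 - t1) << " s" << endl;
    cout << "Parallel time:   " << (t3 - t2) << " s" << endl;

    delete[] copy1;
    delete[] copy2;
}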

How to check CPU utilisation and memory consumption in ubuntu

In Ubuntu, you can use a variety of tools to check CPU utilization and memory consumption. Here
are some common tools:

1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The
output will display a list of processes sorted by resource usage, with the most resource-
intensive processes at the top.

2. htop: htop is a more advanced version of top that provides additional features, such as
interactive process filtering and a color-coded display. To use it, open a terminal window and
type htop.

3. ps: The ps command provides a snapshot of system resource usage at a particular moment in
time. To use it, open a terminal window and type ps aux. This will display a list of all running
processes and their resource usage.

4. free: The free command provides information about system memory usage, including total,


used, and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU
utilization, memory usage, and disk activity. To use it, open a terminal window and type
vmstat.

Conclusion-
In this way we can implement Bubble Sort in a parallel way using OpenMP and also learn how to
measure the performance of serial and parallel algorithms.

Code to Implement parallel bubble sort using OpenMP


#include<iostream>
#include<stdlib.h>
#include<omp.h>
using namespace std;

void bubble(int *, int); // Pointer and Argument


void swap(int &, int &); // References as parameter


void bubble(int *a, int n) // array of integer a and size of array is n


{
for( int i = 0; i < n; i++ )
{
int first = i % 2;

#pragma omp parallel for shared(a,first)


for( int j = first; j < n-1; j += 2 )
{
if( a[ j ] > a[ j+1 ] )
{
swap( a[ j ], a[ j+1 ] );
}
}
}
}

void swap(int &a, int &b)


{

int test;
test=a;
a=b;
b=test;
}

int main()
{

int *a,n;
cout<<"\n Enter total no of elements=>";
cin>>n;
a=new int[n];
cout<<"\n Enter elements=>";
for(int i=0;i<n;i++)
{
cin>>a[i];
}

bubble(a,n);


cout<<"\n Sorted array is=>";


for(int i=0;i<n;i++)
{
cout<<a[i]<<endl;
}

return 0;
}

Output:
Enter total no of elements=> 6
Enter elements=> 4
6
3
1
2
5
Sorted array is=> 1
2
3
4
5
6

Code to Implement parallel Merge sort using OpenMP


#include<iostream>
#include<stdlib.h>
#include<omp.h>
using namespace std;

void mergesort(int a[],int i,int j);


void merge(int a[],int i1,int j1,int i2,int j2);

void mergesort(int a[],int i,int j)


{
int mid;
if(i<j)
{
mid=(i+j)/2;

#pragma omp parallel sections


{

#pragma omp section


{
mergesort(a,i,mid);
}

#pragma omp section


{
mergesort(a,mid+1,j);
}
}

merge(a,i,mid,mid+1,j);
}
}

void merge(int a[],int i1,int j1,int i2,int j2)


{
int temp[1000];
int i,j,k;
i=i1;
j=i2;
k=0;

while(i<=j1 && j<=j2)


{
if(a[i]<a[j])
{
temp[k++]=a[i++];
}
else
{
temp[k++]=a[j++];
}
}


while(i<=j1)
{
temp[k++]=a[i++];
}

while(j<=j2)
{
temp[k++]=a[j++];
}

for(i=i1,j=0;i<=j2;i++,j++)
{
a[i]=temp[j];
}
}

int main()
{
int *a,n,i;
cout<<"\n enter total no of elements=>";
cin>>n;
a= new int[n];

cout<<"\n enter elements=>";


for(i=0;i<n;i++)
{
cin>>a[i];
}
// start=.......
//#pragma omp…..
mergesort(a, 0, n-1);
// stop…….
cout<<"\n sorted array is=>";
for(i=0;i<n;i++)
{
cout<<"\n"<<a[i];
}
// Cout<<Stop-Start
return 0;
}


Output:

enter total no of elements=>5

enter elements=>3
6
5
2
4

sorted array is=>


2
3
4
5
6

Practical No: 3

Title: Implement Min, Max, Sum and Average operations using Parallel Reduction.

Objective: Students should be able to learn how to perform min, max, sum, and average operations on
a large set of data using parallel reduction techniques with OpenMP. The program defines four
functions: min_reduction, max_reduction, sum_reduction, and average_reduction.

Prerequisite:

1. Knowledge of parallel programming concepts and techniques, such as shared


memory, threads, and synchronisation.


2. Familiarity with a parallel programming library or framework, such as OpenMP,
MPI, or CUDA.
3. A suitable parallel programming environment, such as a multi-core CPU, a cluster
of computers, or a GPU.
4. A programming language that supports parallel programming constructs, such as
C, C++, Fortran, or Python.

Theory :

Parallel Reduction Operation :

Parallel reduction is a common technique used in parallel computing to perform a reduction


operation on a large dataset. A reduction operation combines a set of values into a single value,
such as computing the sum, maximum, minimum, or average of the values. Parallel reduction
exploits the parallelism available in modern multicore processors, clusters of computers, or GPUs
to speed up the computation.

The parallel reduction algorithm works by dividing the input data into smaller chunks that can be
processed independently in parallel. Each thread or process computes the reduction operation on its
local chunk of data, producing a partial result. The partial results are then combined in a
hierarchical manner until a single result is obtained.

The most common parallel reduction algorithm is the binary tree reduction algorithm, which has a
logarithmic time complexity and can achieve optimal parallel efficiency. In this algorithm, the input data
is initially divided into chunks of size n, where n is the number of parallel threads or processes. Each
thread or process computes the reduction operation on its chunk of data, producing n partial results.

The partial results are then recursively combined in a binary tree structure, where each internal node
represents the reduction operation of its two child nodes. The tree structure is built in a bottom-up
manner, starting from the leaf nodes and ending at the root node. Each level of the tree reduces the
number of partial results by a factor of two, until a single result is obtained at the root node.
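
As a small illustration with made-up values, summing the eight partial results 3, 1, 4, 1, 5, 9, 2 and 6 with a binary tree takes log2(8) = 3 levels: the pairs combine to 4, 5, 14 and 8; these combine to 9 and 22; and the root combines them into the final sum 31.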

The binary tree reduction algorithm can be implemented using various parallel programming
models, such as OpenMP, MPI, or CUDA. In OpenMP, the algorithm can be implemented using
the parallel and for directives for parallelizing the computation, and the reduction clause for
combining the partial results. In MPI, the algorithm can be implemented using the MPI_Reduce
function for performing the reduction operation, and the MPI_Allreduce function for distributing
the result to all processes. In CUDA, the algorithm can be implemented using the parallel
reduction kernel, which uses shared memory to store the partial results and reduce the memory
access latency.

Parallel reduction has many applications in scientific computing, machine learning, data analytics,
and computer graphics. It can be used to compute the sum, maximum, minimum, or average of


large datasets, to perform data filtering, feature extraction, or image processing, to solve
optimization problems, or to accelerate numerical simulations. Parallel reduction can also be
combined with other parallel algorithms, such as parallel sorting, searching, or matrix operations,
to achieve higher performance and scalability.

Conclusion :
In each function, a parallel loop with the OpenMP reduction clause combines the partial minimum,
maximum, or sum values computed by each thread into a single result; OpenMP handles the
synchronization of these partial values between threads.

Code to Implement Min, Max, Sum and Average operations using Parallel Reduction.
#include <iostream>
//#include <vector>
#include <omp.h>
#include <climits>
using namespace std;
void min_reduction(int arr[], int n) {
int min_value = INT_MAX;
#pragma omp parallel for reduction(min: min_value)
for (int i = 0; i < n; i++) {
if (arr[i] < min_value) {


min_value = arr[i];
}
}
cout << "Minimum value: " << min_value << endl;
}

void max_reduction(int arr[], int n) {


int max_value = INT_MIN;
#pragma omp parallel for reduction(max: max_value)
for (int i = 0; i < n; i++) {
if (arr[i] > max_value) {
max_value = arr[i];
}
}
cout << "Maximum value: " << max_value << endl;
}

void sum_reduction(int arr[], int n) {


int sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < n; i++) {
sum += arr[i];
}
cout << "Sum: " << sum << endl;
}

void average_reduction(int arr[], int n) {


int sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < n; i++) {
sum += arr[i];
}
cout << "Average: " << (double)sum / (n-1) << endl;
}

int main() {
int *arr,n;
cout<<"\n Enter total no of elements=>";
cin>>n;
arr=new int[n];
cout<<"\n Enter elements=>";
for(int i=0;i<n;i++)
{


cin>>arr[i];
}

// int arr[] = {5, 2, 9, 1, 7, 6, 8, 3, 4};


// int n = size(arr);

min_reduction(arr, n);
max_reduction(arr, n);
sum_reduction(arr, n);
average_reduction(arr, n);
}

Output:
Compile: g++ -fopenmp min.cpp -o min
Run: ./min

Enter total no of elements=>6


Enter elements=>2
8
6
9
5
4
Minimum value: 2
Maximum value: 9
Sum: 34
Average: 5.66667

Practical No: 4

Title: Write a CUDA Program for :

1. Addition of two large vectors


2. Matrix Multiplication using CUDA

Objective: Students should be able to learn about parallel computing and students should learn
about CUDA( Compute Unified Device Architecture) and how it helps to boost high
performance computations.


Prerequisite:

1. Basics of CUDA Architecture.

2. Basics of CUDA programming model.

3. CUDA kernel function.

4. CUDA thread organization

Contents of Theory :

1. CUDA architecture: CUDA is a parallel computing platform and programming model


developed by NVIDIA. It allows developers to use the power of GPU (Graphics
Processing Unit) to accelerate computations. CUDA architecture consists of host and
device components, where the host is the CPU and the device is the GPU.
2. CUDA programming model: CUDA programming model consists of host and device
codes. The host code runs on the CPU and is responsible for managing the GPU memory
and launching the kernel functions on the device. The device code runs on the GPU and
performs the computations.

3. CUDA kernel function: A CUDA kernel function is a function that is executed on the
GPU. It is defined with the __global__ keyword and is called from the host code using a
launch configuration. Each kernel function runs in parallel on multiple threads, where
each thread performs the same operation on different data.

4. Memory management in CUDA: In CUDA, there are three types of memory:


global, shared, and local. Global memory is allocated on the device and can be
accessed by all threads. Shared memory is allocated on the device and can be
accessed by threads within a block. Local memory is allocated on each thread and
is used for temporary storage.

5. CUDA thread organization: In CUDA, threads are organized into blocks, and blocks are
organized into a grid. Each thread is identified by a unique thread index, and each block is
identified by a unique block index.

6. Matrix multiplication: Matrix multiplication is a fundamental operation in linear algebra. It


involves multiplying two matrices and producing a third matrix. The resulting matrix has
dimensions equal to the number of rows of the first matrix and the number of columns of the
second matrix.
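
For reference, the operation that the CUDA kernel later parallelizes, written here as a plain sequential C++ triple loop (a sketch with small assumed matrices):

#include <iostream>
using namespace std;

// C = A x B, where A is (rows x inner) and B is (inner x cols).
// The result has the number of rows of A and the number of columns of B.
void matmul_sequential(const float *A, const float *B, float *C,
                       int rows, int inner, int cols)
{
    for (int r = 0; r < rows; r++)
    {
        for (int c = 0; c < cols; c++)
        {
            float sum = 0;
            for (int k = 0; k < inner; k++)
                sum += A[r * inner + k] * B[k * cols + c];
            C[r * cols + c] = sum;
        }
    }
}

int main()
{
    // A 2x3 matrix times a 3x2 matrix gives a 2x2 result.
    float A[] = {1, 2, 3, 4, 5, 6};
    float B[] = {7, 8, 9, 10, 11, 12};
    float C[4];
    matmul_sequential(A, B, C, 2, 3, 2);
    for (int i = 0; i < 4; i++) cout << C[i] << " ";   // prints 58 64 139 154
    return 0;
}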

CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and
programming model developed by NVIDIA. CUDA allows developers to use the power of the GPU
to accelerate computations. It is designed to be used with C, C++, and Fortran programming


languages. CUDA architecture consists of host and device components. The host is the CPU, and
the device is the GPU. The CPU is responsible for managing the GPU memory and launching the
kernel functions on the device.

A CUDA kernel function is a function that is executed on the GPU. It is defined with the __global__
keyword and is called from the host code using a launch configuration. Each kernel function runs in
parallel on multiple threads, where each thread performs the same operation on different data.

CUDA provides three types of memory: global, shared, and local. Global memory is allocated on
the device and can be accessed by all threads. Shared memory is allocated on the device and can be
accessed by threads within a block. Local memory is allocated on each thread and is used for
temporary storage.

CUDA threads are organized into blocks, and blocks are organized into a grid. Each thread is
identified by a unique thread index, and each block is identified by a unique block index.

CUDA devices have a hierarchical memory architecture consisting of multiple memory levels,
including registers, shared memory, L1 cache, L2 cache, and global memory.

CUDA supports various libraries, including cuBLAS for linear algebra, cuFFT for Fast Fourier
Transform, and cuDNN for deep learning.

CUDA programming requires a compatible NVIDIA GPU and an installation of the CUDA
Toolkit, which includes the CUDA compiler, libraries, and tools.

CUDA Program for Addition of Two Large Vectors:

This program uses CUDA to add two large vectors of size 1000000. The vectors are initialized
on the host, and then copied to the device memory. A kernel function is defined to perform the
vector addition, and then launched on the device. The result is copied back to the host memory
and verified. Finally, the device and host memories are freed.

CUDA Program for Matrix Multiplication:


This program multiplies two matrices of size n using CUDA. It first allocates host memory for the
matrices and initializes them. Then it allocates device memory and copies the matrices to the device. It
sets the kernel launch configuration and launches the kernel function matrix_multiply. The kernel
function performs the matrix multiplication and stores the result in matrix c. Finally, it copies the result
back to the host and frees the device and host memory.

The kernel function calculates the row and column indices of the output matrix using the block index and
thread index. It then uses a for loop to calculate the sum of the products of the corresponding elements in
the input matrices. The result is stored in the output matrix.


Note that in this program, we use CUDA events to measure the elapsed time of the kernel function. This
is because the kernel function runs asynchronously on the GPU, so we need to use events to synchronize
the host and device and measure the time accurately.

Execution of Program over CUDA Environment


Here are the steps to run a CUDA program for adding two large vectors:
1. Install CUDA Toolkit: First, you need to install the CUDA Toolkit on your system. You can
download the CUDA Toolkit from the NVIDIA website and follow the installation instructions
provided.
2. Set up CUDA environment: Once the CUDA Toolkit is installed, you need to set up the CUDA
environment on your system. This involves setting the PATH and LD_LIBRARY_PATH
environment variables to the appropriate directories.
3. Write the CUDA program: You need to write a CUDA program that performs the addition of two
large vectors. You can use a text editor to write the program and save it with a .cu extension.
4. Compile the CUDA program: You need to compile the CUDA program using the nvcc compiler
that comes with the CUDA Toolkit. The command to compile the program is:
nvcc -o program_name program_name.cu

5. This will generate an executable program named program_name. Run the CUDA
program: Finally, you can run the CUDA program by executing the executable file
generated in the previous step.
The command to run the program is:
./program_name

Conclusion:

Hence we have implemented Addition of two large vectors and Matrix Multiplication using CUDA.

CUDA Program for Addition of Two Large Vectors:

#include <stdio.h>
#include <stdlib.h>

// CUDA kernel for vector addition


__global__ void vectorAdd(int *a, int *b, int *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];


}
}

int main() {
int n = 1000000; // Vector size
int *a, *b, *c; // Host vectors
int *d_a, *d_b, *d_c; // Device vectors
int size = n * sizeof(int); // Size in bytes

// Allocate memory for host vectors


a = (int*) malloc(size);
b = (int*) malloc(size);
c = (int*) malloc(size);

// Initialize host vectors


for (int i = 0; i < n; i++) {
a[i] = i;
b[i] = i;
}

// Allocate memory for device vectors


cudaMalloc((void**) &d_a, size);
cudaMalloc((void**) &d_b, size);
cudaMalloc((void**) &d_c, size);

// Copy host vectors to device vectors


cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Define block size and grid size


int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;

// Launch kernel
vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

// Copy device result vector to host result vector


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Verify the result


for (int i = 0; i < n; i++) {
if (c[i] != 2*i) {
printf("Error: c[%d] = %d\n", i, c[i]);


break;
}
}

// Free device memory


cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Free host memory


free(a);
free(b);
free(c);
return 0;
}

To Install nvcc Program :


sudo apt-get install nvidia-cuda-toolkit

sudo apt-get install libglade2-0


Again run the below command -


sudo apt-get install nvidia-cuda-toolkit

CUDA Program for Matrix Multiplication:

#include <stdio.h>
#define BLOCK_SIZE 16
__global__ void matrix_multiply(float *a, float *b, float *c, int n)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0;


if (row < n && col < n)


{
for (int i = 0; i < n; ++i)
{
sum += a[row * n + i] * b[i * n + col];
}
c[row * n + col] = sum;
}
}

int main()
{
int n = 1024;
size_t size = n * n * sizeof(float);
float *a, *b, *c;
float *d_a, *d_b, *d_c;
cudaEvent_t start, stop;
float elapsed_time;

// Allocate host memory


a = (float*)malloc(size);
b = (float*)malloc(size);
c = (float*)malloc(size);

// Initialize matrices
for (int i = 0; i < n * n; ++i)
{
a[i] = i % n;
b[i] = i % n;
}

// Allocate device memory


cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

// Copy input data to device


cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Set kernel launch configuration


dim3 threads(BLOCK_SIZE, BLOCK_SIZE);


dim3 blocks((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);

// Launch kernel
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
matrix_multiply<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time, start, stop);

// Copy output data to host


cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Print elapsed time


printf("Elapsed time: %f ms\n", elapsed_time);

// Free device memory


cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

// Free host memory


free(a);
free(b);
free(c);
return 0;
}

Group B
Mini Project: 1

Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI

Application: Parallel Quicksort Algorithm using MPI


Objective: Sort a large dataset of numbers using the parallel Quicksort Algorithm with MPI and
compare its performance with the serial version of the algorithm.

Approach: We will use Python and MPI to implement the parallel version of
Quicksort Algorithm and compare its performance with the serial version of the
algorithm.

Requirements:

Python3.x
mpi4py
Theory :

Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of


the fastest sorting algorithms; it can be implemented in a recursive or iterative
fashion. The divide and conquer is a general algorithm design paradigm and key
steps of this strategy can be summarized as follows:

• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recursion: Solve the sub-problems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk. into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or
1.

Many studies [2] have revealed that, in order to sort N items, QuickSort takes an average running
time of O(N log N). The worst-case running time for QuickSort occurs when the pivot is a unique
minimum or maximum element, and as stated in [2], the worst-case running time for QuickSort on N
items is O(N²). These different running times can be influenced by the input distribution (uniform,
sorted or semi-sorted, unsorted, duplicates) and the choice of the pivot element. Here is a simple
pseudocode of the QuickSort algorithm adapted from Wikipedia [1].
We have made use of Open MPI as the backbone library for parallelizing the
QuickSort algorithm. In fact, learning message passing interface (MPI) allows us to
strengthen our fundamental knowledge on parallel programming, given that MPI is
lower level than equivalent libraries (OpenMP). As simple as its name means, the
basic idea behind MPI is that messages can be passed or exchanged among different
processes in order to perform a given task. An illustration can be a communication
and coordination by a master process which splits a huge task into chunks and shares
them to its slave processes. Open MPI is developed and maintained by a consortium
of academic, research and industry partners; it combines the expertise, technologies
and resources all across the high performance computing community [11]. As
elaborated in [4], MPI has two types of communication routines: point-to-point


communication routines and collective communication routines. Collective routines


as explained in the implementation section have been used in this study.

Algorithm

In general, the overall algorithm used here to perform QuickSort with MPI works as
followed:

i. Start and initialize MPI.


ii. Under the root process MASTER, get inputs:
a. Read the list of numbers L from an input file.
b. Initialize the main array globaldata with L.
c. Start the timer.

iii. Divide the input size SIZE by the number of participating


processes npes to get each chunk size local size.

iv. Distribute globaldata proportionally to all processes:


a. From MASTER scatter globaldata to all processes.
b. Each process receives in a sub data local data.
v. Each process locally sorts its local data of size localsize.
vi. Master gathers all sorted local data by other processes in globaldata.
1. Gather each sorted local data.
2. Free local data


Steps:

1. Initialize MPI:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

2. Define the serial version of Quicksort Algorithm:


def quicksort_serial(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort_serial(left) + middle + quicksort_serial(right)

3. Define the parallel version of Quicksort Algorithm:


def quicksort_parallel(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = []
    middle = []
    right = []

    for x in arr:
        if x < pivot:
            left.append(x)
        elif x == pivot:
            middle.append(x)
        else:
            right.append(x)

    left_size = len(left)
    middle_size = len(middle)


    right_size = len(right)

    # Get the size of each chunk
    chunk_size = len(arr) // size

    # Send the chunk to all the nodes
    chunk_left = []
    chunk_middle = []
    chunk_right = []

    comm.barrier()
    comm.Scatter(left, chunk_left, root=0)
    comm.Scatter(middle, chunk_middle, root=0)
    comm.Scatter(right, chunk_right, root=0)

    # Sort the chunks
    chunk_left = quicksort_serial(chunk_left)
    chunk_middle = quicksort_serial(chunk_middle)
    chunk_right = quicksort_serial(chunk_right)

    # Gather the chunks back to the root node
    sorted_arr = comm.gather(chunk_left, root=0)
    sorted_arr += chunk_middle
    sorted_arr += comm.gather(chunk_right, root=0)
    return sorted_arr

4. Generate the dataset and run the Quicksort Algorithms:

import random
import time

# Generate a large dataset of numbers
arr = [random.randint(0, 1000) for _ in range(1000000)]

# Time the serial version of Quicksort Algorithm
start_time = time.time()
quicksort_serial(arr)
serial_time = time.time() - start_time

# Time the parallel version of Quicksort Algorithm
start_time = time.time()
quicksort_parallel(arr)
parallel_time = time.time() - start_time

5. Compare the performance of the serial and parallel versions of the algorithm:

if rank == 0:

print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")


print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")

Output:
Serial Quicksort Algorithm time: 1.5536 seconds
Parallel Quicksort Algorithm time: 1.3488 seconds

Mini Project : 2
Title - Implement Huffman Encoding on GPU

Theory - Huffman Encoding is a lossless data compression algorithm that


works by assigning variable-length codes to the characters in a given text or
data stream based on their frequency of occurrence. This encoding scheme can
be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are Prefix Codes, means
the codes (bit sequences) are assigned in such a way that the code assigned to one
character is not the prefix of code assigned to any other character. This is how


Huffman Coding makes sure that there is no ambiguity when decoding the
generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters
a, b, c and d, and their corresponding variable length codes be 00, 01, 0 and 1. This
coding leads to ambiguity because code assigned to c is the prefix of codes
assigned to a and b. If the compressed bit stream is 0001, the de-compressed output
may be “cccd” or “ccb” or “acd” or “ab”.

Here's a possible implementation of Huffman Encoding on GPU:

1.Calculate the frequency of each character in the input text.


2,Construct a Huffman tree using the calculated frequencies. The tree can be built
using a priority queue implemented on GPU, where the priority of a node is
determined by its frequency.
3. Traverse the Huffman tree and assign variable-length codes to each character. The
codes can be generated using a depth-first search algorithm implemented on GPU.
4. Encode the input text using the generated Huffman codes.
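
A minimal sequential C++ sketch of steps 1 to 3 plus a simple encoding pass (CPU reference only, showing what the GPU version parallelizes; the structure and function names here are assumptions and differ from the CUDA fragment at the end of this project):

#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

struct HuffmanNode
{
    char ch; long freq;
    HuffmanNode *left, *right;
    HuffmanNode(char c, long f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct CompareFreq   // orders the priority queue as a min-heap on frequency
{
    bool operator()(const HuffmanNode *a, const HuffmanNode *b) const { return a->freq > b->freq; }
};

// Step 3: depth-first traversal assigning 0 to left edges and 1 to right edges.
void generate_codes(HuffmanNode *node, const string &prefix, unordered_map<char, string> &codes)
{
    if (!node) return;
    if (!node->left && !node->right) { codes[node->ch] = prefix; return; }
    generate_codes(node->left, prefix + "0", codes);
    generate_codes(node->right, prefix + "1", codes);
}

int main()
{
    string text = "Hello, world!";
    long freq[256] = {0};
    for (unsigned char c : text) freq[c]++;                    // Step 1: character frequencies

    priority_queue<HuffmanNode*, vector<HuffmanNode*>, CompareFreq> pq;
    for (int c = 0; c < 256; c++)
        if (freq[c] > 0) pq.push(new HuffmanNode((char)c, freq[c]));

    while (pq.size() > 1)                                      // Step 2: build the Huffman tree
    {
        HuffmanNode *a = pq.top(); pq.pop();
        HuffmanNode *b = pq.top(); pq.pop();
        HuffmanNode *parent = new HuffmanNode('\0', a->freq + b->freq);
        parent->left = a; parent->right = b;
        pq.push(parent);
    }

    unordered_map<char, string> codes;
    generate_codes(pq.top(), "", codes);

    for (char c : text) cout << codes[c] << " ";               // Step 4: encode the input text
    cout << endl;
    return 0;
}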

To optimize the implementation for GPU, we can use parallel programming


techniques such as CUDA, OpenCL, or HIP to parallelize the calculation of
character frequencies, construction of the Huffman tree, and generation of
Huffman codes.

Here are some specific optimizations that can be applied to each step:

1. Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each
character in parallel on different threads.
Reduce the results of each thread into a final frequency count on the GPU.

2. Constructing the Huffman tree:
Use a priority queue implemented on GPU to parallelize the building of the Huffman
tree.
Each thread can process one or more nodes at a time, based on the priority of the nodes
in the queue.

3. Generating Huffman codes:


Use parallelism to traverse the Huffman tree and generate Huffman codes for each
character in parallel.
Each thread can process one or more nodes at a time, based on the depth of the nodes in
the tree.


4. Encoding the input text:


Use parallelism to split the input text into chunks and encode each chunk in
parallel on different threads.
Merge the encoded chunks into a final output on the GPU.
By parallelizing these steps, we can achieve significant speedup in the Huffman
Encoding process on GPU. However, it's important to note that the specific
implementation details may vary based on the programming language and GPU
architecture being used.

Source Code -

// Count the frequency of each character in the input text
int freq_count[256] = {0};
int* d_freq_count;
cudaMalloc((void**)&d_freq_count, 256 * sizeof(int));
cudaMemcpy(d_freq_count, freq_count, 256 * sizeof(int), cudaMemcpyHostToDevice);
int block_size = 256;
int grid_size = (input_size + block_size - 1) / block_size;
count_frequencies<<<grid_size, block_size>>>(input_text, input_size, d_freq_count);
cudaMemcpy(freq_count, d_freq_count, 256 * sizeof(int), cudaMemcpyDeviceToHost);

// Build the Huffman tree
HuffmanNode* root = build_huffman_tree(freq_count);

// Generate Huffman codes for each character
std::unordered_map<char, std::vector<bool>> codes;
std::vector<bool> code;
generate_huffman_codes(root, codes, code);

// Encode the input text using the Huffman codes
int output_size = 0;
for (int i = 0; i < input_size; i++) {

    output_size += codes[input_text[i]].size();
}
output_size = (output_size + 7) / 8;
char* output_text = new char[output_size];
char* d_output_text;
cudaMalloc((void**)&d_output_text, output_size * sizeof(char));
cudaMemcpy(d_output_text, output_text, output_size * sizeof(char), cudaMemcpyHostToDevice);
encode_text<<<grid_size, block_size>>>(input_text, input_size, d_output_text, output_size, codes);
cudaMemcpy(output_text, d_output_text, output_size * sizeof(char), cudaMemcpyDeviceToHost);

// Print the output


//Free memory
delete[] output_text;
cudaFree(d_freq_count);
cudaFree(d_output_text);
delete root;
return 0;
}

Output -
Input text: Hello, world!
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101
01000000
11111010
Mini Project : 3
Title - Implement Parallelization of Database Query optimization

Theory -
Query processing is the process through which a Database Management System
(DBMS) parses, verifies, and optimizes a given query before creating low-level
code that the DB understands.

Query Processing in DBMS, like any other High-Level Language (HLL) where


code is first generated and then executed to perform various operations, has two
phases: compile-time and runtime.

The use of declarative query languages and query optimization is one of the main factors
contributing to the success of RDBMS technology. Any database allows users to create queries to
request specific data, and the database then uses effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous
other academics. But the majority of their effort was on optimizing join operations
while taking into account the L2-cache and the parallel buffers of the shared main
memory.
The following techniques can be used to make a query parallel
• I/O parallelism
• Internal parallelism of queries
• Parallelism among queries
• Within-operation parallelism
• Parallelism in inter-operation

I/O parallelism :
This type of parallelism involves partitioning the relationships among the discs in
order to speed up the retrieval of relationships from the disc.
The inputted data is divided within, and each division is processed simultaneously.
After processing all of the partitioned data, the results are combined. Another
name for it is data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning
attribute and have the benefit of offering an even distribution of data across the
discs.
It should be mentioned that partitioning is beneficial for sequential scans of a full table stored
across “n” discs: scanning the relation takes around 1/n of the time needed to scan it on a
single-disc system. In I/O parallelism, there are four different methods of partitioning:

Hash partitioning :
A hash function is a quick mathematical operation. The partitioning properties are
hashed for each row in the original relationship.
Let’s say that the data is to be partitioned across 4 drives, numbered disk1, disk2, disk3, and disk4.
Each row is then stored on the disc whose number the hash function returns; for example, a row is
stored on disk3 if the function returns 3.
Range partitioning :
Each disc receives continuous attribute value ranges while using range


partitioning. For instance, if we are range partitioning across three discs numbered 0, 1, and 2,
tuples with attribute values less than 5 are written to disk0, values from 5 to 40 are sent to disk1,
and values above 40 are written to disk2.
It has several benefits, such as putting shuffles on the disc that have attribute values
within a specified range.

Round-robin partitioning :
Any order can be used to study the relationships in this method. It sends the ith
tuple to the disc number (i% n).
Therefore, new rows of data are received by discs in turn. For applications that want to
read the full relation sequentially for each query, this strategy assures an even
distribution of tuples across drives.

Schema Partitioning :
Various tables inside a database are put on different discs using a
technique called schema partitioning.

Intra-query parallelism :
Using a shared-nothing paralleling architecture technique, intra-query parallelism
refers to the processing of a single query in a parallel process on many CPUs.
This employs two different strategies:
First method — In this method, a duplicate task can be executed on a small amount
of data by each CPU.
Second method — Using this method, the task can be broken up into various sectors,
with each CPU carrying out a separate subtask.
Inter-query parallelism
Each CPU executes numerous transactions when inter-query parallelism is used.
Parallel transaction processing is what it is known as. To support inter-query
parallelism, DBMS leverages transaction dispatching.
We can also employ a variety of techniques, such as effective lock management.
This technique runs each query sequentially, which slows down the running time.
In such circumstances, DBMS must be aware of the locks that various
transactions operating on various processes have acquired. When
simultaneous transactions don’t accept the same data, inter-query parallelism
on shared storage architecture works well.
Additionally, the throughput of transactions is boosted, and it is the simplest form
of parallelism in DBMS.

Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as


sorting, joins, projections, and so forth, in parallel. Intra-operation parallelism has a


very high parallelism level. Database systems naturally employ this kind of parallelism.
Consider the following SQL example: SELECT * FROM vehicles ORDER BY model_number; that is,
list the vehicles sorted by model number.
Since a relation might contain a high number of records, the relational
operation in the aforementioned query is sorting.
Because this operation can be done on distinct subsets of the relation in several
processors, it takes less time to sort the data.

Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query
expression. They come in two varieties:
Pipelined parallelism — In pipeline parallelism, a second operation consumes a row of
the first operation’s output before the first operation has finished producing the whole
set of rows in its output. Additionally, it is feasible to perform these two processes
concurrently on several CPUs, allowing one operation to consume tuples concurrently
with another operation and thereby reduce them.
It is advantageous for systems with a limited number of CPUs and prevents the
storage of interim results on a disc.
Independent parallelism- In this form of parallelism, operations contained within
query phrases that are independent of one another may be carried out
concurrently. This analogy is extremely helpful when dealing with parallelism of
a lower degree.

Execution Of a Parallel Query :

The relational model has been favoured over the earlier hierarchical and network models in commercial database technologies. Data independence and high-level query languages (e.g., SQL) are the key advantages that relational database systems (RDBMSs) have over their forerunners.
Programmer productivity is increased and routine optimization is encouraged. Additionally, distributed database management is made easier by the relational model's set-oriented structure. Thanks to a decade of development and tuning, RDBMSs now offer performance levels comparable to older systems.
They are therefore widely employed for processing commercial data in OLTP (online transaction processing) and decision-support systems. Parallel processing uses multiprocessor computers, with many processors working together, to run application programs and boost performance. It is most commonly used in scientific computing, where it speeds up the response times of numerical applications.


The development of parallel database systems is an example of how database management and parallel computing can work together. With parallel query optimization (PQO), a given SQL statement can be divided up so that its components run concurrently on several processors in a multi-processor machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed in parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML operations into many smaller chunks that can be executed by PQ slave processes on different CPUs in a single machine.
In the first phase shown in the figure, sorting and rewriting, the order of joins and the method for computing each join are fixed. The second phase, parallelization, turns the query tree into a parallel plan.

Parallelization divides this stage into two parts: extraction of parallelism and
scheduling. Optimizing database queries is an important task in database
management systems to improve the performance of database operations.
Parallelization of database query optimization can significantly improve query
execution time by dividing the workload among multiple processors or nodes.

Here's an overview of how parallelization can be applied to database query optimization:

1. Partitioning: The first step is to partition the data into smaller subsets. The
partitioning can be done based on different criteria, such as range partitioning,
hash partitioning, or list partitioning. This can be done in parallel by assigning
different processors or nodes to handle different parts of the partitioning process.

2. Query optimization: Once the data is partitioned, the next step is to optimize the
queries. Query optimization involves finding the most efficient way to execute the
query by considering factors such as index usage, join methods, and filtering. This
can also be done in parallel by assigning different processors or nodes to handle
different parts of the query optimization process.

3. Query execution: After the queries are optimized, the final step is to execute
the queries. The execution can be done in parallel by assigning different
processors or nodes to handle different parts of the execution process. The results
can then be combined to generate the final result set.

To implement parallelization of database query optimization, we can use parallel programming frameworks such as OpenMP or CUDA. These frameworks provide a set of APIs and tools to distribute the workload among multiple processors or nodes and to manage the synchronization and communication between them.

Here's an example of how we can parallelize the query optimization process using
OpenMP:

// C++ fragment: requires <vector> and <omp.h>. The variables data, query, queries,
// num_queries and the helpers partition_data, optimize_query, get_partition_id,
// execute_query and merge_results are assumed to be defined elsewhere.

// Partition the data and optimize the query on each partition in parallel
std::vector<std::vector<int>> partitions;
int num_partitions = omp_get_max_threads(); // thread count queried outside the parallel region
partitions.resize(num_partitions);          // pre-size so each thread writes its own slot
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++)
{
    partitions[i] = partition_data(data, i, num_partitions);
    optimize_query(query, partitions[i]);
}

// Execute the queries in parallel
#pragma omp parallel for
for (int i = 0; i < num_queries; i++)
{
    Query query = queries[i];
    int partition_id = get_partition_id(query, partitions);
    std::vector<int> partition = partitions[partition_id];
    std::vector<int> result = execute_query(query, partition);
    #pragma omp critical // merging updates shared state, so serialize it
    merge_results(result);
}

In this example, we first partition the data into smaller subsets using OpenMP
parallelism. Then we optimize each query in parallel by assigning different
processors or nodes to handle different parts of the optimization process. Finally,
we execute the queries in parallel by assigning different processors or nodes to
handle different parts of the execution process.
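Assuming the helper routines (partition_data, optimize_query, get_partition_id, execute_query, merge_results) are defined elsewhere, a complete program along these lines would typically be built with OpenMP enabled, for example: g++ -fopenmp parallel_query.cpp -o parallel_query (the file name here is only illustrative).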

Parallelization of database query optimization can significantly improve the performance of database operations and reduce query execution time. However, it requires careful consideration of the workload distribution, synchronization, and communication between processors or nodes.

Mini Project : 4
Title - Implement Non-Serial Polyadic Dynamic Programming with GPU
Parallelization.

Theory -

Parallelization of Non-Serial Polyadic Dynamic Programming (NPDP) on high-throughput manycore architectures, such as NVIDIA GPUs, suffers from load imbalance, i.e. a non-optimal mapping between the sub-problems of NPDP and the processing elements of the GPU.
NPDP exhibits non-uniformity in the number of subproblems as well as
computational complexity across the phases. In NPDP parallelization, phases are
computed sequentially whereas subproblems of each phase are computed
concurrently.
Therefore, it is essential to effectively map the subproblems of each phase
to the processing elements while implementing thread level parallelism. We
propose an adaptive Generalized Mapping Method (GMM) for NPDP
parallelization that utilizes the GPU for efficient mapping of subproblems onto
processing threads in each phase.
Input-size and targeted GPU decide the computing power and the best
mapping for each phase in NPDP parallelization. The performance of GMM is
compared with different conventional parallelization approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art
conventional parallelization approach and achieves a significant speedup of a
factor 30. We also summarize the general heuristics for achieving better gain in the
NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization
problems with multiple dimensions. Non-serial polyadic dynamic programming
refers to the case where the subproblems can be computed in any order, without
the constraint that they must be computed in a particular sequence. This makes it
possible to parallelize the computation on a GPU.

Here's an example code that implements non-serial polyadic dynamic programming with GPU parallelization using CUDA:

#include <iostream>
#include <cuda_runtime.h>

// Dimensions of the problem
// Note: the dp array holds N * M * K floats (4 GiB at 1024 x 1024 x 1024); reduce
// these sizes if the GPU (or host) does not have enough memory.
#define N 1024
#define M 1024
#define K 1024
// Threads per block along each dimension (8 x 8 x 8 = 512 threads per block)
#define BLOCK_SIZE 8

// GPU kernel: each thread computes one (i, j, k) subproblem
__global__ void compute_subproblem(float* dp, float* x, float* y, float* z)
{
    // Derive the subproblem indices from the block and thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= N || j >= M || k >= K)
        return;

    // Compute the value of the subproblem
    float value = x[i] * y[j] * z[k];
    // Compute the flattened index into the dp array
    int index = i * M * K + j * K + k;
    // Update the dp array with the computed value
    dp[index] = value;
}
int main()
{
// Allocate memory for the input arrays on the CPU
float* x = new float[N];
float* y = new float[M];
float* z = new float[K];
// Initialize the input arrays
for (int i = 0; i < N; i++)
{
x[i] = i;
}
for (int j = 0; j < M; j++)
{
y[j] = j;
}

for (int k = 0; k < K; k++)
{
z[k] = k;
}
// Allocate memory for the dp array on the GPU
float* d_dp;
cudaMalloc(&d_dp, N * M * K * sizeof(float));
// Copy the input arrays to the GPU
float* d_x;
cudaMalloc(&d_x, N * sizeof(float));
cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

float* d_y;
cudaMalloc(&d_y, M * sizeof(float));
cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);

float* d_z;
cudaMalloc(&d_z, K * sizeof(float));

cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);


// Compute the dp array on the GPU with a single 3D kernel launch:
// each thread handles one (i, j, k) subproblem
dim3 blocksPerGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE,
                   (M + BLOCK_SIZE - 1) / BLOCK_SIZE,
                   (K + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE);
compute_subproblem<<<blocksPerGrid, threadsPerBlock>>>(d_dp, d_x, d_y, d_z);
// Wait for the kernel to finish before copying results back
cudaDeviceSynchronize();
// Copy the dp array back to the CPU
float* dp = new float[N * M * K];
cudaMemcpy(dp, d_dp, N * M * K * sizeof(float), cudaMemcpyDeviceToHost);

// Print the result
std::cout << "dp[" << N-1 << "][" << M-1 << "][" << K-1 << "] = "
          << dp[(N-1)*M*K + (M-1)*K + (K-1)] << std::endl;

// Free memory on the GPU
cudaFree(d_dp);
cudaFree(d_x);
cudaFree(d_y);
cudaFree(d_z);

// Free memory on the CPU
delete[] x;
delete[] y;
delete[] z;
delete[] dp;

return 0;
}
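Assuming a CUDA-capable GPU (and host) with enough memory for the dp array, the example would typically be compiled with the NVIDIA compiler, for example: nvcc npdp.cu -o npdp (the file name here is only illustrative).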
