BE LP5 Manual 23-24
Course Objectives:
Course Outcomes:
CO3: Identify and apply the suitable algorithms to solve AI/ML problems.
CO4: Apply the technique of Deep Neural network for implementing Linear regression and
classification.
CO5: Apply the technique of Convolution (CNN) for implementing Deep Learning models.
CO6: Design and develop Recurrent Neural Network (RNN) for prediction.
CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 1 - 1 1 - 2 1 - - - - -
CO2 1 2 1 - - 1 - - - - - 1
CO3 - 1 1 1 1 1 - - - - - -
CO4 3 3 3 - 3 - - - - - - -
CO5 3 3 3 3 3 - - - - - - -
CO6 3 3 3 3 3 - - - - - - -
CO7 3 3 3 3 3 - - - - - - -
410250: High-Performance Computing Group A
Assignment No.: 1
Title of the Assignment: Design and implement Parallel Breadth First Search and Depth First Search
based on existing algorithms using OpenMP. Use a Tree or an undirected graph for BFS and DFS
Objective of the Assignment: Students should be able to Write a program to implement Parallel
Breadth First Search and Depth First Search based on existing algorithms using OpenMP
Prerequisite:
1. What is BFS?
2. What is DFS?
3. Concept of Parallelism
4. Concept of OpenMP
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all
the neighboring nodes at the current depth level before moving on to the next depth level. The
algorithm uses a queue data structure to keep track of the nodes that need to be visited, and marks
each visited node to avoid processing it again. The basic idea of the BFS algorithm is to visit all the
nodes at a given level before moving on to the next level, which ensures that all the nodes are visited
in breadth-first order. BFS is commonly used in many applications, such as finding the shortest path
between two nodes, solving puzzles, and searching through a tree or graph.
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Create an empty Queue and mark all the nodes as not visited.
Step 2: Select a starting node (visiting a node), mark it as visited, and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract a node from the Queue and insert its unvisited child
nodes (exploring a node) into the Queue. Repeat this step until the Queue becomes empty.
What is DFS?
DFS stands for Depth-First Search. It is a popular graph traversal algorithm that explores as far as
possible along each branch before backtracking. Unlike BFS, it does not guarantee the shortest path; it can be used to find a path
between two vertices or to traverse a graph systematically. The algorithm starts at the root node and
explores as far as possible along each branch before backtracking. The backtracking is done to explore
the next branch that has not been explored yet.
DFS can be implemented using either a recursive or an iterative approach. The recursive approach is
simpler to implement but can lead to a stack overflow error for very large graphs. The iterative
approach uses a stack to keep track of nodes to be explored and is preferred for larger graphs.
DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will
eventually reach a node that has already been visited, indicating that a cycle exists. A standard DFS
implementation puts each vertex of the graph into one of two categories: 1. Visited 2. Not Visited The
purpose of the algorithm is to mark each vertex as visited while avoiding cycles.
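As a small illustration of the recursive approach described above, here is a minimal C++ sketch; the adjacency-list representation and the graph contents are only illustrative:

#include <iostream>
#include <vector>
using namespace std;

// Recursive DFS over an adjacency list: visit a vertex, then recurse into each unvisited neighbour
void dfs(int u, const vector<vector<int>> &adj, vector<bool> &visited) {
    visited[u] = true;
    cout << u << " ";
    for (int v : adj[u])
        if (!visited[v])
            dfs(v, adj, visited);
}

int main() {
    // A small undirected graph with 5 vertices (edges chosen only for demonstration)
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0, 4}, {1}, {2}};
    vector<bool> visited(adj.size(), false);
    dfs(0, adj, visited);   // prints the vertices in depth-first order starting from vertex 0
    return 0;
}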
Concept of OpenMP
● OpenMP provides a set of directives and functions that can be inserted into the source code
of a program to parallelize its execution. These directives are simple and easy to use, and they can be
applied to loops, sections, functions, and other program constructs. The compiler then generates
parallel code that can run on multiple processors concurrently.
● OpenMP is widely used in scientific computing, engineering, and other fields that require high-performance computing; a short stand-alone OpenMP example is shown after the discussion of parallel BFS below.
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or
tree systematically in parallel. It is a popular parallel algorithm used for graph traversal in distributed
computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes to be
visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at
a certain distance from the root node. Each thread processes the nodes in its local queue at the current
level, and then exchanges the nodes that are adjacent to the current level with other threads or
processors. This is done to ensure that the nodes at the next level are visited by the next iteration of
the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication
phase. In the computation phase, each thread processes the nodes in its local queue, while in the
communication phase, the threads exchange the nodes that are adjacent to the current level with
other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified
node has been found. The result of the algorithm is the set of visited nodes or the shortest path from
the root node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as
OpenMP, MPI, CUDA, and others. The performance of the algorithm depends on the number of
threads or processors used, the size of the graph, and the communication overhead between the
threads or processors.
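Before looking at the full BFS program, the following minimal sketch shows the basic OpenMP usage referred to above: a single parallel for directive that distributes loop iterations among threads (the array size and the printed messages are only illustrative).

#include <cstdio>
#include <omp.h>

int main() {
    const int n = 8;
    int squares[n];

    // Compile with: g++ -fopenmp example.cpp
    // The iterations of this loop are divided among the threads of the team
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        squares[i] = i * i;
        printf("Thread %d computed squares[%d]\n", omp_get_thread_num(), i);
    }

    for (int i = 0; i < n; i++)
        printf("%d ", squares[i]);
    printf("\n");
    return 0;
}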
#include<iostream>
#include<queue>
#include<omp.h>
using namespace std;

class node {
public:
    node *left, *right;
    int data;
};

// Insert a new value into the binary tree in level order
node *insert(node *root, int data) {
    if(!root) {
        root = new node; root->left = root->right = NULL; root->data = data;
        return root;
    }
    queue<node*> q; q.push(root);
    while(!q.empty()) {
        node *temp = q.front(); q.pop();
        if(temp->left == NULL) {
            temp->left = new node; temp->left->data = data;
            temp->left->left = temp->left->right = NULL; return root;
        } else q.push(temp->left);
        if(temp->right == NULL) {
            temp->right = new node; temp->right->data = data;
            temp->right->left = temp->right->right = NULL; return root;
        } else q.push(temp->right);
    }
    return root;
}

// Breadth-first traversal; the nodes of each level are handed to an OpenMP thread team
void bfs(node *head) {
    queue<node*> q; q.push(head);
    while(!q.empty()) {
        int qSize = q.size();
        #pragma omp parallel for
        for(int i = 0; i < qSize; i++) {
            node *currNode;
            // the shared queue must be accessed by one thread at a time
            #pragma omp critical
            {
                currNode = q.front(); q.pop();
                cout << "\t" << currNode->data;
                if(currNode->left) q.push(currNode->left);
                if(currNode->right) q.push(currNode->right);
            }
        }
    }
}

int main() {
    node *root = NULL; int data; char ans;
    do {
        cout << "\nEnter data => "; cin >> data;
        root = insert(root, data);
        cout << "Do you want to insert one more node? (y/n) "; cin >> ans;
    } while(ans == 'y' || ans == 'Y');
    bfs(root);
    return 0;
}
Run Commands:
1) g++ -fopenmp bfs.cpp -o bfs
2) ./bfs
Output:
Enter data => 8 Do you want to insert one more node? (y/n) n
5 3 7 2 1 8
#include <iostream>
#include <vector>
#include <stack>
#include <omp.h>
using namespace std;
const int MAX = 100000; vector<int> graph[MAX]; bool visited[MAX];

// Iterative DFS; the adjacency list of the current node is scanned by an OpenMP thread team
void dfs(int start_node) {
    stack<int> s; s.push(start_node);
    while (!s.empty()) {
        int curr = s.top(); s.pop();
        if (visited[curr]) continue;
        visited[curr] = true; cout << curr << " ";
        #pragma omp parallel for
        for (int i = 0; i < (int)graph[curr].size(); i++) {
            int adj_node = graph[curr][i];
            // the shared stack must be protected when several threads push onto it
            #pragma omp critical
            { if (!visited[adj_node]) s.push(adj_node); }
        }
    }
}

int main() {
    int n, m, start_node;   // n: nodes, m: edges
    cout << "Enter No of Nodes, Edges, and start node: "; cin >> n >> m >> start_node;
    cout << "Enter Pairs of edges: ";
    for (int i = 0; i < m; i++) { int u, v; cin >> u >> v; graph[u].push_back(v); graph[v].push_back(u); }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) visited[i] = false;
    dfs(start_node);
    return 0;
}
Conclusion:
In this way we can achieve parallelism while implementing Breadth First Search and Depth First
Search
Assignment No.: 2
Title of the Assignment: Write a program to implement Parallel Bubble Sort. Use existing algorithms
and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to Write a program to implement Parallel Bubble
Sort and can measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Concept of Parallelism
2. Concept of OpenMP
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they
are in the wrong order. It is called "bubble" sort because the algorithm moves the larger elements
towards the end of the array in a manner that resembles the rising of bubbles in a liquid.
1. Start at the beginning of the array.
2. Compare the first two elements. If the first element is greater than the second element, swap them.
3. Move to the next pair of adjacent elements and compare them, swapping whenever they are out of order.
4. Continue in this way until the end of the array is reached; the largest element has now moved to the end.
5. If any swaps were made in steps 2-4, repeat the process from step 1.
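A minimal sequential C++ sketch of these steps, using the same numbers as the walkthrough later in this assignment:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Repeatedly sweep the array, swapping adjacent out-of-order elements,
// until a full sweep makes no swaps.
void bubbleSort(vector<int> &a) {
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (size_t i = 0; i + 1 < a.size(); i++) {
            if (a[i] > a[i + 1]) {
                swap(a[i], a[i + 1]);
                swapped = true;
            }
        }
    }
}

int main() {
    vector<int> a = {5, 3, 4, 1, 2};
    bubbleSort(a);
    for (int x : a) cout << x << " ";   // prints 1 2 3 4 5
    cout << endl;
    return 0;
}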
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has
the advantage of being easy to understand and implement, and it is useful for educational purposes
and for sorting small datasets. Bubble Sort has limited practical use in modern software development
due to its inefficient time complexity of O(n^2) which makes it unsuitable for sorting large datasets.
However, Bubble Sort has some advantages and use cases that make it a valuable algorithm to
understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand
and implement. It can be used to introduce the concept of sorting to beginners and as a basis for more
complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles of
sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as
its overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very
efficient. Since Bubble Sort only swaps adjacent elements that are in the wrong order, it has a low
number of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large datasets,
some of its techniques can be used in combination with other sorting algorithms to optimize their
performance. For example, Bubble Sort can be used to optimize the performance of Insertion Sort by
reducing the number of comparisons needed.
Let's say we want to sort a series of numbers 5, 3, 4, 1, and 2 so that they are arranged in ascending
order…
The sorting begins the first iteration by comparing the first two values. If the first value is greater than
the second, the algorithm pushes the first value to the index of the second value.
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the values become 3, 5, 4, 1, and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next two
values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now become 3,
4, 5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which are
5 and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which are
5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.
That’s the first iteration. And the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5, 3,
4, 1, and 2. As you might realize, 5 should be the last number if the numbers are sorted in ascending
order. This means the first iteration is now complete.
The second iteration begins by comparing 3 and 4; since 3 is smaller than 4, no swap occurs. The algorithm
then proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the numbers become 3, 1, 4, 2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2 and the
numbers become 3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.
That’s how the algorithm continues to compare the numbers until they are arranged in ascending
order of 1, 2, 3, 4, and 5.
Concept of OpenMP
● OpenMP provides a set of directives and functions that can be inserted into the source code
of a program to parallelize its execution. These directives are simple and easy to use, and they can be
applied to loops, sections, functions, and other program constructs.
● The compiler then generates parallel code that can run on multiple processors concurrently.
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes advantage
of parallel processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble Sort algorithm.
When all sublists have been sorted, they are merged together to form the final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that
supports parallel processing in C++, Fortran, and other programming languages. OpenMP provides a
set of compiler directives that allow developers to specify which parts of the code can be executed in
parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is
divided into multiple iterations that are executed concurrently by multiple threads. Each thread sorts
a subset of the list, and the threads synchronize their work at the end of each iteration to ensure that
the elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort algorithm,
especially when sorting large datasets on multi-core processors. However, the speedup is
limited by the overhead of thread creation and synchronization, and it may not be worth the effort for
small datasets or when using a single-core processor.
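As one concrete way of realizing this with OpenMP, the following sketch uses odd-even transposition sort, a standard parallel variant of Bubble Sort in which alternating phases compare disjoint pairs of adjacent elements, so each phase can run as a parallel loop; the array contents are only illustrative.

#include <iostream>
#include <vector>
#include <algorithm>
#include <omp.h>
using namespace std;

// Odd-even transposition sort: n phases; in each phase the compared pairs are disjoint,
// so the inner loop can safely be executed in parallel.
void parallelBubbleSort(vector<int> &a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;   // even phases compare (0,1),(2,3),...; odd phases compare (1,2),(3,4),...
        #pragma omp parallel for
        for (int i = start; i < n - 1; i += 2)
            if (a[i] > a[i + 1])
                swap(a[i], a[i + 1]);
    }
}

int main() {
    vector<int> a = {5, 3, 4, 1, 2, 9, 7, 6, 8, 0};
    parallelBubbleSort(a);
    for (int x : a) cout << x << " ";
    cout << endl;
    return 0;
}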
To measure the performance of sequential Bubble sort and parallel Bubble sort algorithms, you can
follow these steps:
1. Implement both the sequential and the parallel version of the Bubble sort algorithm.
2. Choose a range of test cases, such as arrays of different sizes and different degrees of
sortedness, to test the performance of both algorithms.
3. Use a reliable timer to measure the execution time of each algorithm on each test case.
When measuring the performance of the parallel Bubble sort algorithm, you will need to specify the
number of threads to use. You can experiment with different numbers of threads to find the optimal
value for your system.
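A minimal sketch of such a measurement using OpenMP's wall-clock timer omp_get_wtime(); the array size, the reverse-sorted fill, and the two sort routines are only illustrative stand-ins for the implementations being compared.

#include <cstdio>
#include <vector>
#include <algorithm>
#include <omp.h>
using namespace std;

// Plain sequential bubble sort
void sequentialBubbleSort(vector<int> &a) {
    for (size_t pass = 0; pass + 1 < a.size(); pass++)
        for (size_t i = 0; i + 1 < a.size() - pass; i++)
            if (a[i] > a[i + 1]) swap(a[i], a[i + 1]);
}

// Odd-even transposition sort (parallel variant of bubble sort)
void parallelBubbleSort(vector<int> &a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        #pragma omp parallel for
        for (int i = phase % 2; i < n - 1; i += 2)
            if (a[i] > a[i + 1]) swap(a[i], a[i + 1]);
    }
}

int main() {
    const int n = 20000;                            // size chosen only for demonstration
    vector<int> data(n);
    for (int i = 0; i < n; i++) data[i] = n - i;    // reverse-sorted (worst case) input

    vector<int> a = data, b = data;
    double t0 = omp_get_wtime();
    sequentialBubbleSort(a);
    double t1 = omp_get_wtime();
    parallelBubbleSort(b);
    double t2 = omp_get_wtime();
    printf("Sequential Bubble Sort time: %f seconds\n", t1 - t0);
    printf("Parallel Bubble Sort time:   %f seconds\n", t2 - t1);
    return 0;
}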
In Ubuntu, you can use a variety of tools to check CPU utilization and memory consumption. Here are
some common tools:
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The output will
display a list of processes sorted by resource usage, with the most resource-intensive processes at the
top.
2. htop: htop is a more advanced version of top that provides additional features, such as
interactive process filtering and a color-coded display. To use it, open a terminal window and type htop.
3. ps: The ps command provides a snapshot of system resource usage at a particular moment in
time. To use it, open a terminal window and type ps aux. This will display a list of all running processes
and their resource usage.
4. free: The free command provides information about system memory usage, including total,
used, and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU utilization,
memory usage, and disk activity. To use it, open a terminal window and type vmstat.
# Use the parallel construct to distribute the loop iterations among the threads
# Each thread sorts a portion of the array
# The ordered argument ensures that the threads wait for each other before moving on to the next iteration
# This guarantees that the array is fully sorted before the loop ends
if __name__ == '__main__':
Output:
# Split the array into two halves
mid = n // 2
left = arr[:mid]
right = arr[mid:]
# Use the parallel construct to distribute the work among the threads
# Each thread sorts a portion of the array
with omp.parallel(num_threads=omp.get_max_threads(), default_shared=False):
    left_sorted = parallel_merge_sort(left)
# Use the parallel construct to distribute the loop iterations among the threads
# Each thread merges a portion of the array
elif j == n2:
    merged_arr[k] = left_sorted[i]
    i += 1
else:
    merged_arr[k] = right_sorted[j]
    j += 1
return merged_arr

if __name__ == '__main__':
    start_time = time.time()
Output:
Assignment No.: 3
Title of the Assignment:Implement Min, Max, Sum and Average operations using Parallel Reduction.
Objective of the Assignment: Students should be able to learn about how to perform min, max, sum,
and average operations on a large set of data using parallel reduction technique in CUDA. The program
defines four kernel functions, reduce_min, reduce_max, reduce_sum, and reduce_avg.
Prerequisite:
2. Familiarity with a parallel programming library or framework, such as OpenMP, MPI, or CUDA.
Contents of Theory :
Parallel reduction is a common technique used in parallel computing to perform a reduction operation
on a large dataset. A reduction operation combines a set of values into a single value, such as
computing the sum, maximum, minimum, or average of the values. Parallel reduction exploits the
parallelism available in modern multicore processors, clusters of computers, or GPUs to speed up the
computation.
The parallel reduction algorithm works by dividing the input data into smaller chunks that can be
processed independently in parallel. Each thread or process computes the reduction operation on its
local chunk of data, producing a partial result. The partial results are then combined in a hierarchical
manner until a single result is obtained.
The most common parallel reduction algorithm is the binary tree reduction algorithm, which has a
logarithmic time complexity and can achieve optimal parallel efficiency. In this algorithm, the input
data is initially divided into chunks of size n, where n is the number of parallel threads or processes.
Each thread or process computes the reduction operation on its chunk of data, producing n partial
results.
The partial results are then recursively combined in a binary tree structure, where each internal node
represents the reduction operation of its two child nodes. The tree structure is built in a bottom-up
manner, starting from the leaf nodes and ending at the root node. Each level of the tree reduces the
number of partial results by a factor of two, until a single result is obtained at the root node.
The binary tree reduction algorithm can be implemented using various parallel programming models,
such as OpenMP, MPI, or CUDA. In OpenMP, the algorithm can be implemented using the parallel and
for directives for parallelizing the computation, and the reduction clause for combining the partial
results. In MPI, the algorithm can be implemented using the MPI_Reduce function for performing the
reduction operation, and the MPI_Allreduce function for distributing the result to all processes. In
CUDA, the algorithm can be implemented using the parallel reduction kernel, which uses shared
memory to store the partial results and reduce the memory access latency.
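As a small illustration of the MPI variant mentioned above, the following sketch lets every process contribute one local value and uses MPI_Reduce to obtain the global minimum and sum (and hence the average) on the root process; the per-process data is only illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process contributes one local value (illustrative data)
    int local_val = (rank + 3) * 7 % 20;

    int global_min, global_sum;
    // MPI performs the tree-structured combination of the partial results internally
    MPI_Reduce(&local_val, &global_min, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("min = %d, sum = %d, avg = %f\n",
               global_min, global_sum, (double)global_sum / size);

    MPI_Finalize();
    return 0;
}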
Parallel reduction has many applications in scientific computing, machine learning, data analytics, and
computer graphics. It can be used to compute the sum, maximum, minimum, or average of large
datasets, to perform data filtering, feature extraction, or image processing, to solve optimization
problems, or to accelerate numerical simulations. Parallel reduction can also be combined with other
parallel algorithms, such as parallel sorting, searching, or matrix operations, to achieve higher
performance and scalability.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Compute the minimum and the average of the dataset using OpenMP reduction clauses
void reduce_min_avg(int *data, int data_size, int *min_val_ptr, double *avg_val_ptr) {
    int min_val = data[0];
    long long sum = 0;
    #pragma omp parallel for reduction(min: min_val) reduction(+: sum)
    for (int i = 0; i < data_size; i++) {
        if (data[i] < min_val) min_val = data[i];
        sum += data[i];
    }
    // the final minimum value is the minimum value of the entire dataset
    *min_val_ptr = min_val;
    // the final average value is the sum of the entire dataset divided by its size
    *avg_val_ptr = (double)sum / data_size;
}

int main() {
    int data_size = 1000000;
    int *data = (int *)malloc(data_size * sizeof(int));
    for (int i = 0; i < data_size; i++) data[i] = rand() % 1000;

    int min_val; double avg_val;
    reduce_min_avg(data, data_size, &min_val, &avg_val);
    printf("Minimum value: %d\n", min_val);
    printf("Average value: %f\n", avg_val);

    free(data);
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Compute the maximum and the sum of the dataset using OpenMP reduction clauses
void reduce_max_sum(int *data, int size, int *max_val_ptr, long long *sum_val_ptr) {
    int max_val = data[0];
    long long sum_val = 0;
    #pragma omp parallel for reduction(max: max_val) reduction(+: sum_val)
    for (int i = 0; i < size; i++) {
        if (data[i] > max_val) max_val = data[i];
        sum_val += data[i];
    }
    *max_val_ptr = max_val;
    *sum_val_ptr = sum_val;
}

int main() {
    int size = 1000000;
    int *data = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) data[i] = rand() % 1000;

    int max_val; long long sum_val;
    reduce_max_sum(data, size, &max_val, &sum_val);
    printf("Maximum value: %d\n", max_val);
    printf("Sum: %lld\n", sum_val);

    free(data);
    return 0;
}
In this code, we use the #pragma omp parallel for directive to execute the loop that computes the
maximum and the sum in parallel. The reduction(max: max_val) and reduction(+: sum_val) clauses tell
OpenMP that each thread should keep a private copy of these variables and combine the per-thread
results into the final maximum and sum when the loop finishes, so no explicit critical section is needed
for the combination step. The minimum is computed in the same way with a reduction(min: min_val)
clause, and the average is obtained by dividing the final sum by the number of elements.
Conclusion :
In this way we are able to learn about parallel reduction and how to implement the Min, Max, Sum, and
Average operations using OpenMP.
Assignment No.: 4
Title of the Assignment: Write a CUDA Program for: 1. Addition of two large vectors 2. Matrix Multiplication
Objective of the Assignment: Students should be able to learn about parallel computing and students
should learn about CUDA( Compute Unified Device Architecture) and how it helps to boost high
performance computations.
Prerequisite:
Contents of Theory :
2. CUDA programming model: CUDA programming model consists of host and device codes.
The host code runs on the CPU and is responsible for managing the GPU memory and launching the
kernel functions on the device. The device code runs on the GPU and performs the computations.
3. CUDA kernel function: A CUDA kernel function is a function that is executed on the GPU. It is
defined with the __global__ keyword and is called from the host code using a launch configuration. Each
kernel function runs in parallel on multiple threads, where each thread performs the same operation
on different data.
4. Memory management in CUDA: In CUDA, there are three types of memory: global, shared,
and local. Global memory is allocated on the device and can be accessed by all threads. Shared
memory is allocated on the device and can be accessed by threads within a block. Local memory is
allocated on each thread and is used for temporary storage.
5. CUDA thread organization: In CUDA, threads are organized into blocks, and blocks are
organized into a grid. Each thread is identified by a unique thread index, and each block is identified
by a unique block index.
CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and
programming model developed by NVIDIA. CUDA allows developers to use the power of the GPU to
accelerate computations. It is designed to be used with the C, C++, and Fortran programming
languages. The CUDA architecture consists of host and device components. The host is the CPU, and the
device is the GPU. The CPU is responsible for managing the GPU memory and launching the kernel
functions on the device.
A CUDA kernel function is a function that is executed on the GPU. It is defined with the __global__ keyword
and is called from the host code using a launch configuration. Each kernel function runs in parallel on
multiple threads, where each thread performs the same operation on different data.
CUDA provides three types of memory: global, shared, and local. Global memory is allocated on the
device and can be accessed by all threads. Shared memory is allocated on the device and can be
accessed by threads within a block. Local memory is allocated on each thread and is used for
temporary storage.
CUDA threads are organized into blocks, and blocks are organized into a grid. Each thread is identified
by a unique thread index, and each block is identified by a unique block index.
CUDA devices have a hierarchical memory architecture consisting of multiple memory levels, including
registers, shared memory, L1 cache, L2 cache, and global memory.
CUDA supports various libraries, including cuBLAS for linear algebra, cuFFT for Fast Fourier Transform,
and cuDNN for deep learning.
CUDA programming requires a compatible NVIDIA GPU and an installation of the CUDA Toolkit, which
includes the CUDA compiler, libraries, and tools.
#include <stdlib.h>
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1000000;              // Vector size
    int *a, *b, *c;               // Host vectors
    int *d_a, *d_b, *d_c;         // Device vectors
    int size = n * sizeof(int);   // Size in bytes
    a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }
    cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
    // Copy host input vectors to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch kernel with enough blocks of 256 threads to cover all n elements
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    // Copy device result vector to host result vector
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(a); free(b); free(c);
    return 0;
}
This program uses CUDA to add two large vectors of size 1000000. The vectors are initialized on the
host, and then copied to the device memory. A kernel function is defined to perform the vector
addition, and then launched on the device. The result is copied back to the host memory and verified.
Finally, the device and host memories are freed.
This program multiplies two matrices of size n using CUDA. It first allocates host memory for the
matrices and initializes them. Then it allocates device memory and copies the matrices to the device.
It sets the kernel launch configuration and launches the kernel function matrix_multiply. The kernel
function performs the matrix multiplication and stores the result in matrix c. Finally, it copies the result
back to the host and frees the device and host memory.
The kernel function calculates the row and column indices of the output matrix using the block index
and thread index. It then uses a for loop to calculate the sum of the products of the corresponding
elements in the input matrices. The result is stored in the output matrix.
Note that in this program, we use CUDA events to measure the elapsed time of the kernel function.
This is because the kernel function runs asynchronously on the GPU, so we need to use events to
synchronize the host and device and measure the time accurately.
#include <stdio.h>
#include <stdlib.h>
__global__ void matrix_multiply(float *a, float *b, float *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y, col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0;
        for (int k = 0; k < n; k++) sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}

int main() {
    int n = 1024, size = n * n * sizeof(float);
    float *a, *b, *c, *d_a, *d_b, *d_c;
    a = (float*)malloc(size); b = (float*)malloc(size); c = (float*)malloc(size);
    for (int i = 0; i < n * n; i++) { a[i] = i % n; b[i] = i % n; }   // Initialize matrices
    cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Time the kernel with CUDA events, since it runs asynchronously on the GPU
    cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop);
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    cudaEventRecord(start); matrix_multiply<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop); cudaEventSynchronize(stop);
    float ms = 0; cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("Kernel time: %f ms\n", ms);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(a); free(b); free(c);
    return 0;
}
Conclusion:
Hence we have implemented Addition of two large vectors and Matrix Multiplication using CUDA.
Assignment No.: 5
Title of the Assignment: Implement HPC application for AI/ML domain.
Prerequisite:
1. Knowledge of AI/ML algorithms and models: A deep understanding of AI/ML algorithms and
models is essential to design and implement an HPC application that can efficiently perform large-scale
training and inference. This requires knowledge of statistical methods, linear algebra, optimization
techniques, and deep learning frameworks such as TensorFlow, PyTorch, and MXNet.
Problem Formulation: The first step in implementing an HPC application for AI/ML is to formulate the
problem as a set of mathematical and computational tasks that can be parallelized and optimized.
This involves defining the problem domain, selecting appropriate algorithms and models, and
determining the computational and memory requirements.
Hardware Selection: The next step is to select the appropriate hardware platform for the HPC
application. This involves considering the available hardware options, such as CPU, GPU, FPGA, and
ASIC, and selecting the most suitable option based on the performance, cost, power consumption, and
scalability requirements.
Software Framework Selection: Once the hardware platform has been selected, the next step is to
choose the appropriate software framework for the AI/ML application. This involves considering the
available options, such as TensorFlow, PyTorch, MXNet, and Caffe, and selecting the most suitable
framework based on the programming language, performance, ease of use, and community support.
Data Preparation and Preprocessing: Before training or inference can be performed, the data must be
prepared and preprocessed. This involves cleaning the data, normalizing and scaling the data, and
splitting the data into training, validation, and testing sets. The data must also be stored in a format
that is compatible with the selected software framework.
Model Training or Inference: The main computational task in an AI/ML application is model training or
inference. In an HPC application, this task is parallelized and optimized to take advantage of the
available hardware resources. This involves breaking the model into smaller tasks that can be
parallelized, using techniques such as data parallelism, model parallelism, or pipeline parallelism. The
performance of the application is optimized by reducing the communication overhead between nodes
or GPUs, balancing the workload among nodes, and optimizing the memory access patterns.
Model Evaluation: After the model has been trained or inference has been performed, the
performance of the model must be evaluated. This involves computing the accuracy, precision, recall,
and other metrics on the validation and testing sets. The performance of the HPC application is
evaluated by measuring the speedup, scalability, and efficiency of the parallelized tasks.
Optimization and Tuning: Finally, the HPC application must be optimized and tuned to achieve the best
possible performance. This involves profiling the code to identify bottlenecks and optimizing the code
using techniques such as loop unrolling, vectorization, and cache optimization. The performance of
the application is also affected by the choice of hyperparameters, such as the learning rate, batch size,
and regularization strength, which must be tuned using techniques such as grid search or Bayesian
optimization.
Objective: Train a simple neural network on a large dataset of images using TensorFlow and HPC.
Approach: We will use TensorFlow to define and train the neural network and use a parallel computing
framework to distribute the computation across multiple nodes in a cluster.
Requirements:
Steps:
Code:
import tensorflow as tf
from mpi4py import MPI

# Initialize MPI
comm = MPI.COMM_WORLD; rank, size = comm.Get_rank(), comm.Get_size()

# Define a simple feed-forward network (layer sizes are illustrative)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Load and normalize the MNIST data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Each rank trains on its own chunk of the training data
n = len(x_train); chunk_size = n // size
start = rank * chunk_size
end = n if rank == size - 1 else (rank + 1) * chunk_size

for epoch in range(5):
    history = model.fit(x_train[start:end], y_train[start:end], epochs=1, verbose=0)
    train_acc = history.history['accuracy'][0]
    # Compute the accuracy on the test data and reduce it across all nodes
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
    test_acc = comm.allreduce(test_acc, op=MPI.SUM)
    if rank == 0:
        print(f"Epoch {epoch + 1}: Train accuracy = {train_acc:.4f}, Test accuracy = {test_acc / size:.4f}")
Output:
Epoch 1: Train accuracy = 0.9773, Test accuracy = 0.9745
Epoch 2: Train accuracy = 0.9859, Test accuracy = 0.9835
Epoch 3: Train accuracy = 0.9887, Test accuracy = 0.9857
Epoch 4: Train accuracy = 0.9905, Test accuracy = 0.9876
Epoch 5: Train accuracy = 0.9919, Test accuracy = 0.9880
Conclusion:
Implementing an HPC application for the AI/ML domain involves formulating the problem, selecting
the hardware and software frameworks, preparing and preprocessing the data, parallelizing and
optimizing the model training or inference tasks, evaluating the model performance, and optimizing
and tuning the HPC application for maximum performance. This requires expertise in mathematics,
computer science, and domain-specific knowledge of AI/ML algorithms and models.
Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI
Objective: Sort a large dataset of numbers using parallel Quicksort Algorithm with MPI and compare
its performance with the serial version of the algorithm.
Approach: We will use Python and MPI to implement the parallel version of Quicksort Algorithm and
compare its performance with the serial version of the algorithm.
Requirements:
Theory :
Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of the fastest sorting
algorithms; it can be implemented in a recursive or iterative fashion. The divide and conquer is a
general algorithm design paradigm and key steps of this strategy can be summarized as follows:
• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recur: Solve the subproblems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or 1.
Many studies [2] have revealed that, in order to sort N items, QuickSort takes an average running
time of O(N log N). The worst-case running time for QuickSort occurs when the pivot is a unique
minimum or maximum element, and as stated in [2], the worst-case running time for QuickSort on N
items is O(N^2). These different running times can be influenced by the input distribution
(uniform, sorted or semi-sorted, unsorted, duplicates) and the choice of the pivot element. A compact
C++ sketch in the spirit of the simple pseudocode given in Wikipedia [1] is shown below.
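A minimal recursive QuickSort sketch, assuming a Lomuto-style partition; the array contents in main are only illustrative.

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Lomuto-style partition: place the pivot in its final position and return that index
int partition(vector<int> &a, int low, int high) {
    int pivot = a[high];
    int i = low - 1;
    for (int j = low; j < high; j++)
        if (a[j] < pivot)
            swap(a[++i], a[j]);
    swap(a[i + 1], a[high]);
    return i + 1;
}

// Divide: partition around the pivot; Recur: sort the two sides; Base case: ranges of size 0 or 1
void quicksort(vector<int> &a, int low, int high) {
    if (low < high) {
        int p = partition(a, low, high);
        quicksort(a, low, p - 1);
        quicksort(a, p + 1, high);
    }
}

int main() {
    vector<int> a = {29, 4, 66, 12, 3, 88, 41};
    quicksort(a, 0, (int)a.size() - 1);
    for (int x : a) cout << x << " ";
    cout << endl;
    return 0;
}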
We have made use of Open MPI as the backbone library for parallelizing the QuickSort algorithm. In
fact, learning message passing interface (MPI) allows us to strengthen our fundamental knowledge on
parallel programming, given that MPI is lower level than equivalent libraries (OpenMP). As simple as
its name means, the basic idea behind MPI is that messages can be passed or exchanged among
different processes in order to perform a given task. An illustration can be a communication and
coordination by a master process which splits a huge task into chunks and shares them to its slave
processes. Open MPI is developed and maintained by a consortium of academic, research and industry
partners; it combines the expertise, technologies and resources all across the high performance
computing community [11]. As elaborated in [4], MPI has two types of communication routines: point-
to-point communication routines and collective communication routines. Collective routines as
explained in the implementation section have been used in this study.
Algorithm :
In general, the overall algorithm used here to perform QuickSort with MPI works as follows:
i. The master process initializes the input array of size SIZE.
ii. The input size SIZE is broadcast to all participating processes.
iii. Divide the input size SIZE by the number of participating processes npes to get each chunk
size local_size.
iv. The master scatters the input data so that every process receives its own chunk of local_size elements.
v. Each process sorts its local chunk using QuickSort.
vi. Master gathers all sorted local data by other processes in globaldata.
vii. The master merges the sorted chunks in globaldata to obtain the final sorted output.
Steps:
from mpi4py import MPI
import random, time, heapq

# Step 1: Initialize MPI
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Serial quicksort, used both as the baseline and for sorting local chunks
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# The root process generates the data and splits it into one chunk per process
n = 1000000
if rank == 0:
    data = [random.randint(0, n) for _ in range(n)]
    chunks = [data[i::size] for i in range(size)]
else:
    data, chunks = None, None
comm.barrier()
start_time = time.time()
chunk = comm.scatter(chunks, root=0)

# Each process sorts its chunk; the sorted chunks are gathered and merged at the root
chunk = quicksort(chunk)
sorted_chunks = comm.gather(chunk, root=0)

# Compare the performance of the serial and parallel versions of the algorithm
if rank == 0:
    parallel_sorted = list(heapq.merge(*sorted_chunks))
    parallel_time = time.time() - start_time
    serial_start = time.time()
    serial_sorted = quicksort(data)
    serial_time = time.time() - serial_start
    print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")
    print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")
Output:
Serial Quicksort Algorithm time: 1.5536 seconds Parallel Quicksort Algorithm time: 1.3488 seconds
Mini Project : 2
Theory - Huffman Encoding is a lossless data compression algorithm that works by assigning variable-
length codes to the characters in a given text or data stream based on their frequency of occurrence.
This encoding scheme can be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are Prefix Codes, means the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not the prefix of code
assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity
when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d, and
their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to ambiguity because
code assigned to c is the prefix of codes assigned to a and b. If the compressed bit stream is 0001, the
de-compressed output may be “cccd” or “ccb” or “acd” or “ab”.
1. Calculate the frequency of each character in the input text or data stream.
2. Construct a Huffman tree using the calculated frequencies. The tree can be built using a priority
queue implemented on GPU, where the priority of a node is determined by its frequency.
3.Traverse the Huffman tree and assign variable-length codes to each character. The codes can be
generated using a depth-first search algorithm implemented on GPU.
To optimize the implementation for GPU, we can use parallel programming techniques such as CUDA,
OpenCL, or HIP to parallelize the calculation of character frequencies, construction of the Huffman
tree, and generation of Huffman codes.
Here are some specific optimizations that can be applied to each step:
1. Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each character in
parallel on different threads.
Reduce the results of each thread into a final frequency count on the GPU.
2. Constructing the Huffman tree:
Use a priority queue implemented on GPU to parallelize the building of the Huffman tree.
Each thread can process one or more nodes at a time, based on the priority of the nodes in the queue.
3. Generating Huffman codes:
Use parallelism to traverse the Huffman tree and generate Huffman codes for each character in
parallel.
Each thread can process one or more nodes at a time, based on the depth of the nodes in the tree.
4. Encoding the input text:
Use parallelism to split the input text into chunks and encode each chunk in parallel on different
threads.
By parallelizing these steps, we can achieve significant speedup in the Huffman Encoding process on
GPU. However, it's important to note that the specific implementation details may vary based on the
programming language and GPU architecture being used.
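As a small CPU-side illustration of the first step, the following OpenMP sketch counts character frequencies over parallel chunks and then reduces the per-thread tables; the input string is only illustrative, and a CUDA version would typically replace the per-thread tables with per-block shared-memory histograms combined using atomicAdd.

#include <cstdio>
#include <cstring>
#include <omp.h>

int main() {
    // Illustrative input; a real encoder would read the text or data stream to be compressed
    const char *input_text = "this is an example for huffman encoding";
    int n = (int)strlen(input_text);

    int freq_count[256] = {0};   // global frequency table, one slot per byte value

    // Each thread counts the frequencies of its chunk in a private table,
    // and the private tables are then reduced into the global one.
    #pragma omp parallel
    {
        int local_count[256] = {0};
        #pragma omp for
        for (int i = 0; i < n; i++)
            local_count[(unsigned char)input_text[i]]++;
        #pragma omp critical
        for (int c = 0; c < 256; c++)
            freq_count[c] += local_count[c];
    }

    for (int c = 0; c < 256; c++)
        if (freq_count[c] > 0)
            printf("'%c' : %d\n", c, freq_count[c]);
    return 0;
}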
Source Code -
// Count the frequency of each character in the input text
int freq_count[256] = {0};
// Encode the input text using the Huffman codes
int output_size = 0;
output_size = (output_size + 7) / 8;
codes);
std::cout << "Input text: " << input_text << std::endl;
std::cout << "Encoded text: ";
return 0;
Output -
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000
11111010
Mini Project : 3
Query processing is the process through which a Database Management System (DBMS) parses,
verifies, and optimizes a given query before creating low-level code that the DB understands.
Query Processing in DBMS, like any other High-Level Language (HLL) where code is first generated and
then executed to perform various operations, has two phases: compile-time and runtime.
The use of declarative query languages and query optimization is one of the main factors contributing
to the success of RDBMS technology. Any database allows users to create queries to request specific
data, and the database then uses effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous other academics. But
the majority of their effort was on optimizing join operations while taking into account the L2-cache
and the parallel buffers of the shared main memory.
• I/O parallelism
• Intra-query parallelism
• Inter-query parallelism
• Intra-operation parallelism
• Inter-operation parallelism
I/O parallelism :
This type of parallelism involves partitioning the relationships among the discs in order to speed up
the retrieval of relationships from the disc.
The inputted data is divided within, and each division is processed simultaneously. After processing all
of the partitioned data, the results are combined. Another name for it is data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning attribute and have
the benefit of offering an even distribution of data across the discs.
It should be mentioned that partitioning is beneficial for sequential scans of the full table: when the
table is stored on “n” discs, scanning the relation takes around 1/n of the time required on a single-disc
system. In I/O parallelism, there are four different
methods of partitioning:
Hash partitioning :
A hash function is a quick mathematical operation. The partitioning properties are hashed for each
row in the original relationship.
Let’s say that the data is to be partitioned across 4 drives, numbered disk1, disk2, disk3, and disk4. A
row is then stored on disk3 if the hash function returns 3 for that row.
Range partitioning : In range partitioning, each disc receives a continuous range of attribute values. For
instance, if we are range partitioning across three discs numbered 0, 1, and 2, we may assign tuples
whose partitioning-attribute value is less than 5 to disk0, values from 5 to 40 to disk1, and values above
40 to disk2.
It has several benefits, such as putting shuffles on the disc that have attribute values within a specified
range.
Round-robin partitioning :
Any order can be used to study the relationships in this method. It sends the ith tuple to the disc
number (i% n).
Therefore, new rows of data are received by discs in turn. For applications that want to read the full
relation sequentially for each query, this strategy assures an even distribution of tuples across drives.
Schema Partitioning :
Various tables inside a database are put on different discs using a technique called schema partitioning.
Intra-query parallelism :
First method — In this method, a duplicate task can be executed on a small amount of data by each
CPU.
Second method — Using this method, the task can be broken up into various sectors, with each CPU
carrying out a separate subtask.
Inter-query parallelism
Each CPU executes numerous transactions when inter-query parallelism is used. Parallel transaction
processing is what it is known as. To support inter-query parallelism, DBMS leverages transaction
dispatching.
We can also employ a variety of techniques, such as effective lock management. This technique runs
each query sequentially, which slows down the running time.
In such circumstances, DBMS must be aware of the locks that various transactions operating on various
processes have acquired. When simultaneous transactions don’t accept the same data, inter-query
parallelism on shared storage architecture works well.
Additionally, the throughput of transactions is boosted, and it is the simplest form of parallelism in
DBMS.
Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins,
projections, and so forth, in parallel. Intra-operation parallelism has a very high parallelism level.
Database systems naturally employ this kind of parallelism. Consider the following SQL example:
SELECT * FROM vehicles ORDER BY model_number;
Since a relation might contain a high number of records, the relational operation in the
aforementioned query is sorting.
Because this operation can be done on distinct subsets of the relation in several processors, it takes
less time to sort the data.
Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query expression. They come
in two varieties:
Pipelined parallelism — In pipeline parallelism, a second operation consumes a row of the first
operation’s output before the first operation has finished producing the whole set of rows in its output.
Additionally, it is feasible to perform these two processes concurrently on several CPUs, allowing one
operation to consume tuples concurrently with another operation and thereby reduce them.
It is advantageous for systems with a limited number of CPUs and prevents the storage of interim
results on a disc.
Independent parallelism- In this form of parallelism, operations contained within query phrases that
are independent of one another may be carried out concurrently. This analogy is extremely helpful
when dealing with parallelism of a lower degree.
The relational model has been favored over prior hierarchical and network models because of
commercial database technologies. Data independence and high-level query languages are the key
advantages that relational database systems (RDBMSs) have over their forerunners (e.g., SQL).
Additionally, distributed database management is made easier by the relational model’s set-oriented
structure. RDBMSs may now offer performance levels comparable to older systems thanks to a
decade of development and tuning.
They are therefore widely employed in the processing of commercial data for OLTP (online transaction
processing) or decision-support systems. Through the use of many processors working together,
parallel processing makes use of multiprocessor computers to run application programmes and boost
performance.
It is most commonly used in scientific computing, which it does by the speed of numerical applications’
responses.
The development of parallel database systems is an example of how database management and
parallel computing can work together. A given SQL statement can be divided up in the parallel database
system PQO such that its components can run concurrently on several processors in a multi-processor
machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed
in parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML
operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single
box.
The order of joins and the method for computing each join are fixed in the first phase, sorting and
rewriting. The second phase, parallelization, turns the query tree into a parallel plan.
Parallelization divides this stage into two parts: extraction of parallelism and scheduling. Optimizing
database queries is an important task in database management systems to improve the performance
of database operations. Parallelization of database query optimization can significantly improve query
execution time by dividing the workload among multiple processors or nodes.
1. Partitioning: The first step is to partition the data into smaller subsets. The partitioning can be
done based on different criteria, such as range partitioning, hash partitioning, or list partitioning. This
can be done in parallel by assigning different processors or nodes to handle different parts of the
partitioning process.
2. Query optimization: Once the data is partitioned, the next step is to optimize the queries.
Query optimization involves finding the most efficient way to execute the query by considering factors
such as index usage, join methods, and filtering. This can also be done in parallel by assigning different
processors or nodes to handle different parts of the query optimization process.
3. Query execution: After the queries are optimized, the final step is to execute the queries. The
execution can be done in parallel by assigning different processors or nodes to handle different parts
of the execution process. The results can then be combined to generate the final result set.
Here's an example of how we can parallelize the query optimization process using OpenMP:
// C++
// partitions and num_partitions are assumed to have been produced by the partitioning step
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    // Each thread optimizes the query against its own partition of the data
    optimize_query(query, partitions[i]);
}
In this example, we first partition the data into smaller subsets using OpenMP parallelism. Then we
optimize each query in parallel by assigning different processors or nodes to handle different parts of
the optimization process. Finally, we execute the queries in parallel by assigning different processors
or nodes to handle different parts of the execution process.
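A slightly fuller sketch of the three phases described above, where partition_data, optimize_query, and execute_query are hypothetical stand-ins for the DBMS-specific logic (their names, signatures, and printed messages are assumptions for illustration only):

#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the DBMS-specific routines
void partition_data(int part_id)                { printf("partitioning part %d\n", part_id); }
void optimize_query(const char *q, int part_id) { printf("optimizing on part %d: %s\n", part_id, q); }
void execute_query(const char *q, int part_id)  { printf("executing on part %d: %s\n", part_id, q); }

int main() {
    const char *query = "SELECT * FROM vehicles ORDER BY model_number";
    const int num_partitions = 8;

    // Phase 1: partition the data in parallel
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        partition_data(i);

    // Phase 2: optimize the query for each partition in parallel
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        optimize_query(query, i);

    // Phase 3: execute the optimized query on each partition in parallel;
    // the partial results would then be combined into the final result set
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        execute_query(query, i);
    return 0;
}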
Parallelization of database query optimization can significantly improve the performance of database
operations and reduce query execution time. However, it requires careful consideration of the
workload distribution, synchronization, and communication between processors or nodes.
Mini Project : 4
Theory -
Therefore, it is essential to effectively map the subproblems of each phase to the processing elements
while implementing thread level parallelism. We propose an adaptive Generalized Mapping Method
(GMM) for NPDP parallelization that utilizes the GPU for efficient mapping of subproblems onto
processing threads in each phase.
Input-size and targeted GPU decide the computing power and the best mapping for each phase in
NPDP parallelization. The performance of GMM is compared with different conventional parallelization
approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art conventional
parallelization approach and achieves a significant speedup of a factor 30. We also summarize the
general heuristics for achieving better gain in the NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization problems with multiple
dimensions. Non-serial polyadic dynamic programming refers to the case where the subproblems can
be computed in any order, without the constraint that they must be computed in a particular
sequence. This makes it possible to parallelize the computation on a GPU.
Here's an example code that implements non-serial polyadic dynamic programming with GPU
parallelization using CUDA:
#include <iostream>
#define N 256   // problem sizes kept small enough that the dp table fits in memory
#define M 256
#define K 256

// Each thread computes one independent subproblem dp[i][j][k] of the current phase
__global__ void compute_subproblem(float *dp, float *x, float *y, float *z) {
    int i = blockIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && j < M && k < K)
        dp[i * M * K + j * K + k] = x[i] * y[j] * z[k];   // Compute the value of the subproblem
}

int main() {
    // Allocate and initialize the input arrays on the CPU
    float *x = new float[N], *y = new float[M], *z = new float[K];
    for (int i = 0; i < N; i++) x[i] = i;
    for (int j = 0; j < M; j++) y[j] = j;
    for (int k = 0; k < K; k++) z[k] = k;
    float *d_dp, *d_x, *d_y, *d_z;
    cudaMalloc(&d_dp, N * M * K * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float)); cudaMalloc(&d_y, M * sizeof(float)); cudaMalloc(&d_z, K * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(16, 16, 1), grid((K + 15) / 16, (M + 15) / 16, N);
    compute_subproblem<<<grid, block>>>(d_dp, d_x, d_y, d_z);
    // Copy the dp array back to the CPU and print one entry
    float *dp = new float[N * M * K];
    cudaMemcpy(dp, d_dp, N * M * K * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "dp[" << N-1 << "][" << M-1 << "][" << K-1 << "] = " << dp[(N-1)*M*K + (M-1)*K + (K-1)] << std::endl;
    cudaFree(d_dp); cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    delete[] x; delete[] y; delete[] z; delete[] dp;
    return 0;
}
Assignment No.: 1
Title of the Assignment: Linear regression by using Deep Neural network (prediction of Boston house prices).
Objective of the Assignment: Students should be able to implement linear regression by using deep
neural networks. Students should know about neural networks and their importance over machine
learning models.
Prerequisite:
Linear Regression : Linear regression is a basic and commonly used type of predictive analysis. The
overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job
in predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of
the beta estimates–impact the outcome variable? These regression estimates are used to explain the
relationship between one dependent variable and one or more independent variables. The simplest
form of the regression equation with one dependent and one independent variable is defined by the
formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression
coefficient, and x = score on the independent variable.
The basic unit of the brain is known as a neuron; there are approximately 86 billion neurons in our
nervous system which are connected to 10^14-10^15 synapses. Each neuron receives a signal from
the synapses and gives output after processing the signal. This idea is drawn from the brain to build a
neural network.
Each neuron performs a dot product between the inputs and weights, adds biases, applies an
activation function, and gives out the outputs. When a large number of neurons are present together
to give out a large number of outputs, it forms a neural layer. Finally, multiple layers combine to form
a neural network.
Neural networks are formed when multiple neural layers combine with each other to give out a
network, or we can say that there are some layers whose outputs are inputs for other layers. The
most common type of layer to construct a basic neural network is the fully connected layer, in which
the adjacent layers are fully connected pairwise and neurons in a single layer are not connected to
each other.
Naming conventions. When we say an N-layer neural network, we do not count the input layer. Therefore, a
single-layer neural network describes a network with no hidden layers (input directly mapped to
output). In the case of our code, we’re going to use a single-layer neural network, i.e., we do not have
a hidden layer.
Output layer. Unlike all layers in a Neural Network, the output layer neurons most commonly do not
have an activation function (or you can think of them as having a linear identity activation function).
This is because the last output layer is usually taken to represent the class scores (e.g. in classification),
which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. In regression). Since
we’re performing regression using a single layer, we do not have any activation function.
Sizing neural networks. The two metrics that people commonly use to measure the size of neural
networks are the number of neurons, or more commonly the number of parameters. For example, a
single-layer network that maps 13 input features to one output neuron has 13 weights and 1 bias, i.e.
14 learnable parameters.
The Boston Housing Dataset is a popular dataset in machine learning and contains information about
various attributes of houses in Boston. The goal of using deep neural networks on this dataset is to
predict the median value of owner-occupied homes.
The Boston Housing Dataset contains 13 input variables or features, such as crime rate, average
number of rooms per dwelling, and distance to employment centers. The target variable is the median
value of owner-occupied homes. The dataset has 506 rows, which is not very large, but still sufficient
to train a deep neural network.
To implement a deep neural network on the Boston Housing Dataset, we can follow these steps:
Load the dataset: We can load the dataset using libraries like pandas or numpy.
Preprocess the data: We need to preprocess the data by scaling the input features so that they have
zero mean and unit variance. This step is important because it helps the neural network to converge
faster.
Split the dataset: We split the dataset into training and testing sets. We can use a 70/30 or 80/20 split
for training and testing, respectively.
Define the model architecture: We need to define the architecture of our deep neural network. We
can use libraries like Keras or PyTorch to define our model. The architecture can include multiple
hidden layers with various activation functions and regularization techniques like dropout.
Compile the model: We need to compile the model by specifying the loss function, optimizer, and
evaluation metrics. For regression problems like this, we can use mean squared error as the loss
function and adam optimizer.
Train the model: We can train the model using the training data. We can use techniques like early
stopping to prevent overfitting.
Evaluate the model: We can evaluate the model using the testing data. We can calculate the mean
squared error or the mean absolute error to evaluate the performance of the model.
Overall, using a deep neural network on the Boston Housing Dataset can result in accurate predictions
of the median value of owner-occupied homes. By following the above steps, we can implement a
deep neural network and fine-tune its hyperparameters to achieve better performance.
Practical Implementation of Boston Dataset and prediction using deep neural network.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

# Load the dataset from a CSV file
df = pd.read_csv('boston_housing.csv')
# Split the data into input and output variables ('MEDV' is assumed to be the target column)
X = df.drop('MEDV', axis=1).values
y = df['MEDV'].values
X = StandardScaler().fit_transform(X)   # scale inputs to zero mean and unit variance
print(X[:5])   # Display the first few rows of the scaled input features
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training set shape:', X_train.shape, y_train.shape); print('Testing set shape:', X_test.shape, y_test.shape)
# Define the model architecture (layer sizes are illustrative)
model = Sequential([Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
                    Dense(32, activation='relu'), Dense(1)])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=100, verbose=0)
# Plot the training and validation loss over epochs
plt.plot(history.history['loss']); plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.show()
# Evaluate the model on the testing set
loss, mae = model.evaluate(X_test, y_test)
Conclusion : In this way we are able to learn about the Deep Neural Network and its implementation
on the boston dataset.
Assignment No.: 2
Title of the Assignment: Classification using Deep neural network. Binary classification using Deep
Neural Networks. Example: Classify movie reviews into "positive" reviews and "negative" reviews,
just based on the text content of the reviews. Use the IMDB dataset.
Objective of the Assignment: Students should be able to implement deep neural networks on textual
data and they should know the basics of natural language processing and its applications in the real
world.
Prerequisite:
Classification using deep neural networks is a popular approach to solve various supervised learning
problems such as image classification, text classification, speech recognition, and many more. In this
approach, the neural network is trained on labeled data to learn a mapping between the input features
and the corresponding output labels.
Binary classification is a type of classification problem in which the task is to classify the input data into
one of the two classes. In the example of classifying movie reviews as positive or negative, the input
data is the text content of the reviews, and the output labels are either positive or negative.
The deep neural network used for binary classification consists of multiple layers of interconnected
neurons, which are capable of learning complex representations of the input data. The first layer of
the neural network is the input layer, which takes the input data and passes it to the hidden layers.
The hidden layers perform non-linear transformations on the input data to learn more complex
features. Each hidden layer consists of multiple neurons, which are connected to the neurons of the
previous and next layers. The activation function of the neurons in the hidden layers introduces non-
linearity into the network and allows it to learn complex representations of the input data.
The last layer of the neural network is the output layer, which produces the classification result. In
binary classification, the output layer consists of one neuron, which produces the probability of the
input data belonging to the positive class. The probability of the input data belonging to the negative
class can be calculated as (1 - probability of positive class).
The training of the neural network involves optimizing the model parameters to minimize the loss
function. The loss function measures the difference between the predicted output and the actual
output. In binary classification, the commonly used loss function is binary cross-entropy loss.
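As an illustration of this output layer and loss, a minimal sketch that computes binary cross-entropy directly with NumPy (the probabilities and labels below are made-up values for illustration, not taken from the IMDB example):
import numpy as np

# Made-up sigmoid outputs (probability of the positive class) and true labels
p = np.array([0.9, 0.2, 0.7])   # predicted P(positive)
y = np.array([1, 0, 0])          # actual labels (1 = positive, 0 = negative)

# Probability of the negative class is simply 1 - p
print(1 - p)

# Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # smaller values mean the predictions are closer to the true labels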
The IMDB dataset is a popular dataset used for binary classification of movie reviews. It contains
50,000 movie reviews, which are split into 25,000 reviews for training and 25,000 reviews for testing.
The reviews are preprocessed and encoded as sequences of integers, where each integer represents a
word in the review. The deep neural network can be trained on this dataset to classify the movie
reviews into positive or negative categories.
In summary, binary classification using deep neural networks involves designing a neural network
architecture with multiple layers of interconnected neurons, training the network on labeled data
using a suitable loss function, and using the trained network to classify new data. The IMDB dataset
provides a suitable example to implement and test this approach on movie review classification.
Dataset information :
The IMDB (Internet Movie Database) dataset is a popular dataset used for sentiment analysis,
particularly binary classification of movie reviews into positive or negative categories. It consists of
50,000 movie reviews, which are evenly split into a training set and a testing set, each containing
25,000 reviews.
The reviews are encoded as sequences of integers, where each integer represents a word in the review.
The words are indexed based on their frequency in the dataset, with the most frequent word assigned
the index 1, the second most frequent word assigned the index 2, and so on. The indexing is capped
at a certain number of words, typically the top 10,000 most frequent words, to limit the size of the
vocabulary.
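To make this integer encoding concrete, a short sketch (assuming Keras' built-in copy of the dataset and a 10,000-word vocabulary cap) that decodes one encoded review back into words:
from keras.datasets import imdb

# Load the dataset, keeping only the top 10,000 most frequent words
(x_train, y_train), _ = imdb.load_data(num_words=10000)

# word -> index mapping; indices 0-2 are reserved for padding/start/unknown tokens
word_index = imdb.get_word_index()
index_word = {index + 3: word for word, index in word_index.items()}

# Decode the first training review back into (approximate) text
decoded = ' '.join(index_word.get(i, '?') for i in x_train[0])
print(decoded[:200], '...')
print('Label:', y_train[0])  # 1 = positive, 0 = negative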
The reviews are preprocessed to remove punctuation and convert all letters to lowercase. The reviews are also padded or truncated to a fixed length, typically 250 words, so that all input sequences have the same length. Padding adds zeros to the sequence to bring it up to the fixed length, while truncating cuts the sequence off at the maximum length.
The reviews are labeled as positive or negative based on the overall sentiment expressed in the review. Reviews with a score of 7 or higher on a scale of 1-10 are labeled as positive, while reviews with a score of 4 or lower are labeled as negative. Reviews with more neutral scores are excluded from the dataset to ensure a clear distinction between the positive and negative categories.
The IMDB dataset is a popular benchmark dataset for sentiment analysis and has been used
extensively to evaluate various machine learning and deep learning models. Its popularity is attributed
to the large size of the dataset, the balanced distribution of positive and negative reviews, and the
preprocessed format of the reviews.
1. Load the IMDB dataset using Keras' built-in imdb.load_data() function. This function loads
the dataset and preprocesses it as sequences of integers, with the labels already converted to binary
(0 for negative, 1 for positive).
2. Pad or truncate the sequences to a fixed length of 250 words using Keras' pad_sequences()
function.
3. Define a deep neural network architecture, consisting of an embedding layer to learn the
word embeddings, followed by multiple layers of bidirectional LSTM (Long Short-Term Memory) cells,
and a final output layer with a sigmoid activation function to output the binary classification.
4. Compile the model using binary cross-entropy loss and the Adam optimizer.
5. Train the model on the training set and validate on the validation set.
6. Evaluate the trained model on the test set and compute the accuracy and loss.
import numpy as np
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Load the IMDB dataset (top 10,000 most frequent words)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Pad or truncate the sequences to a fixed length of 250 words
max_len = 250
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# Define the model: embedding -> two bidirectional LSTM layers -> sigmoid output
model = Sequential([
    Embedding(10000, 32, input_length=max_len),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(1, activation='sigmoid')
])

# Compile, train, and evaluate
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
loss, acc = model.evaluate(x_test, y_test, batch_size=128)
print(f'Test accuracy: {acc:.4f}, Test loss: {loss:.4f}')
This example implements a deep neural network with two layers of bidirectional LSTM cells, which are
capable of learning complex patterns in sequence data. The Embedding layer learns the word
embeddings from the input sequences, which are then fed into the LSTM layers. The output of the
LSTM layers is then fed into a dense output layer with a sigmoid activation function, which outputs the
binary classification.
The compile() method is used to compile the model with binary cross-entropy loss and the Adam
optimizer. The fit() method is used to train the model on the training set for 10 epochs with a
batch size of 128. The evaluate() method is used to evaluate the trained model on the test set and
compute the accuracy and loss.
This example demonstrates how deep neural networks can be used for binary classification on text
data, specifically for classifying movie reviews as positive or negative based on the text content.
Conclusion:
In this way we are able to learn about deep neural networks and their implementation on the IMDB dataset, and to learn the basics of sentiment analysis.
Assignment No.: 3
Title of the Assignment: Convolutional Neural Network (CNN). Use the MNIST Fashion Dataset and create a classifier to classify fashion clothing into categories.
Objective of the Assignment: Students should be able to implement a Convolutional Neural Network and classify clothing into categories using the MNIST Fashion dataset.
Prerequisite:
Convolutional Neural Networks (CNNs) are a class of artificial neural networks that are specially
designed to analyze and classify images, videos, and other types of multidimensional data. They are
widely used in computer vision tasks such as image classification, object detection, and image
segmentation.
The main idea behind CNNs is to perform convolutions, which are mathematical operations that apply
a filter to an image or other input data. The filter slides over the input data and performs a dot product
between the filter weights and the input values at each position, producing a new output value. By
applying different filters at each layer, the network learns to detect different features in the input data,
such as edges, shapes, and textures.
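A minimal sketch of this sliding dot product using NumPy (the 4x4 input and 2x2 edge-like filter below are made-up values for illustration, not taken from any particular network):
import numpy as np

# A tiny 4x4 "image" and a 2x2 filter (both made-up for illustration)
image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]])
kernel = np.array([[1, -1],
                   [1, -1]])  # responds to vertical edges

# Slide the filter over the image and take the dot product at each position
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        feature_map[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

print(feature_map)  # one output value per filter position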
CNNs typically consist of several layers that perform different operations on the input data. The most
common types of layers are:
Convolutional Layers: These layers perform convolutions on the input data using a set of filters. Each
filter produces a feature map, which represents the presence of a specific feature in the input data.
Pooling Layers: These layers reduce the spatial dimensions of the feature maps by taking the maximum
or average value within a small region of the feature map. This reduces the amount of computation
needed in the subsequent layers and makes the network more robust to small translations in the input
data.
Activation Layers: These layers apply a nonlinear activation function, such as ReLU (Rectified Linear
Unit), to the output of the previous layer. This introduces nonlinearity into the network and allows it
to learn more complex features.
Fully-Connected Layers: These layers connect all the neurons in the previous layer to all the neurons
in the current layer, similar to a traditional neural network. They are typically used at the end of the
network to perform the final classification.
The architecture of a CNN is typically organized in a series of blocks, each consisting of one or more
convolutional layers followed by pooling and activation layers. The output of the final block is then
passed through one or more fully-connected layers to produce the final output.
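As a rough illustration of this block structure in Keras (a sketch only; the layer sizes and input shape are arbitrary and not tied to any dataset in this manual):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense

# Two convolution + activation + pooling blocks followed by fully-connected layers
model = Sequential([
    Conv2D(32, (3, 3), input_shape=(64, 64, 3)),   # convolutional layer
    Activation('relu'),                             # activation layer
    MaxPooling2D((2, 2)),                           # pooling layer
    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),                  # fully-connected layer
    Dense(10, activation='softmax')                 # final classification
])
model.summary()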
CNNs are trained using backpropagation, which is a process that updates the weights of the network
based on the difference between the predicted output and the true output. This process is typically
done using a loss function, such as cross-entropy loss, which measures the difference between the
predicted output and the true output.In summary, CNNs are a powerful class of neural networks that
are specially designed for analyzing and classifying images and other types of multidimensional data.
They achieve this by performing convolutions on the input data using a set of filters, and by using
different types of layers to reduce the spatial dimensions of the feature maps, introduce nonlinearity,
and perform the final classification.
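A minimal sketch of one such weight update, using TensorFlow's GradientTape on a made-up single example (this shows the loss-gradient-update cycle only, not a full training loop; the tiny model and random input are assumptions for illustration):
import tensorflow as tf

# A tiny model, one made-up input image (28x28x1) and its true class label
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation='softmax')
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.uniform((1, 28, 28, 1))
y_true = tf.constant([3])

# Forward pass, loss, gradients of the loss w.r.t. the weights, weight update
with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = loss_fn(y_true, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print('loss:', float(loss))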
Dataset information :
The MNIST Fashion Dataset is a widely used benchmark dataset in the field of computer vision and
machine learning. It consists of 70,000 grayscale images of clothing items, including dresses, shirts,
sneakers, sandals, and more. The dataset is split into 60,000 training images and 10,000 test images,
with each image being a 28x28 pixel square.
The dataset is often used as a benchmark for classification tasks in computer vision, particularly for
image recognition and classification using neural networks. The dataset is considered relatively easy
compared to other image datasets such as ImageNet, but it is still a challenging task due to the
variability in the clothing items and the low resolution of the images.
The goal of the MNIST Fashion Dataset is to correctly classify the clothing items into one of the ten
categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.
The dataset was created as a replacement for the original MNIST handwritten digit dataset, which was
becoming too easy for machine learning algorithms to classify accurately. The MNIST Fashion Dataset
was created to provide a more challenging classification task while still being a relatively small dataset
that can be used for experimentation and testing.
The dataset has been used extensively in the field of computer vision, with researchers and developers
using it to test and evaluate new machine learning algorithms and models. The dataset has also been
used in educational settings to teach students about machine learning and computer vision.
One common approach to tackling the MNIST Fashion Dataset is to use convolutional neural networks
(CNNs), which are specifically designed to process images. CNNs consist of multiple layers, including
convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract
features from the images, while the pooling layers downsample the features to reduce the
computational complexity. The fully connected layers perform the final classification of the images.
Other approaches to tackling the MNIST Fashion Dataset include using other types of neural networks
such as recurrent neural networks (RNNs) and deep belief networks (DBNs), as well as using other
machine learning algorithms such as decision trees, support vector machines (SVMs), and k-nearest
neighbor (KNN) classifiers.
Overall, the MNIST Fashion Dataset is a valuable benchmark dataset in the field of computer vision
and machine learning, and its popularity is likely to continue as new algorithms and models are
developed and tested.
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Load and normalize the MNIST Fashion dataset
fashion_mnist = keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0
x_test = x_test[..., np.newaxis] / 255.0

# Define a small CNN classifier for the 10 clothing categories
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train and evaluate the model
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
test_loss, test_acc = model.evaluate(x_test, y_test)

# Make predictions
predictions = model.predict(x_test)
print('Test accuracy:', test_acc)
print('Predicted class of first test image:', np.argmax(predictions[0]))
Conclusion:
In this way we are able to implement a convolutional neural network (CNN) using the MNIST Fashion dataset.
Assignment No.: 4
Title of the Assignment: Recurrent Neural Network (RNN). Use the Google stock prices dataset and design a time series analysis and prediction system using RNN.
Objective of the Assignment: Students should be able to implement a Recurrent Neural Network and design a time series analysis and prediction system using RNN.
Prerequisite:
A recurrent neural network (RNN) is a type of neural network that is designed to work with
sequential data. Unlike traditional feedforward neural networks that only process input data in a
single pass, RNNs maintain an internal state or memory that allows them to process sequences of
input data.
This makes RNNs well-suited for tasks such as natural language processing, speech recognition, and
time series analysis.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Load the Google stock prices dataset (a 'Close' price column is assumed)
data = pd.read_csv('GOOG.csv')
dataset = MinMaxScaler().fit_transform(data[['Close']].values)

# Create the training and testing datasets (80/20 split)
training_data_len = int(len(dataset) * 0.8)
training_data = dataset[:training_data_len]
testing_data = dataset[training_data_len:]

# Turn the series into samples of `time_step` past prices -> next price
def create_dataset(dataset, time_step=1):
    X, Y = [], []
    for i in range(len(dataset) - time_step):
        X.append(dataset[i:i + time_step, 0])
        Y.append(dataset[i + time_step, 0])
    return np.array(X), np.array(Y)

time_step = 60
X_train, y_train = create_dataset(training_data, time_step)
X_test, y_test = create_dataset(testing_data, time_step)
X_train = X_train.reshape(X_train.shape[0], time_step, 1)
X_test = X_test.reshape(X_test.shape[0], time_step, 1)

# Define and train a simple LSTM model
model = Sequential()
model.add(LSTM(50, input_shape=(time_step, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=32)

# Predict on the test portion
predictions = model.predict(X_test)
Output:
Epoch 100/100
Mini Project : 1
Mini Project: Human Face Recognition
Deep learning is a subset of machine learning that is inspired by the structure and function of the
human brain. It has been successfully applied to various computer vision tasks, including human face
recognition. There are several theories related to human face recognition using deep learning:
Convolutional Neural Network (CNN) Theory: This theory proposes that humans recognize faces by
hierarchically processing the visual information, similar to how a convolutional neural network
operates. According to this theory, the brain first detects low-level features, such as edges and corners,
and then gradually builds up to higher-level features, such as facial contours and expressions.
Autoencoder Theory: This theory suggests that humans learn to recognize faces by compressing the
information into a lower-dimensional space and then reconstructing it back to its original form, similar
to how an autoencoder operates. According to this theory, the brain uses a hierarchical representation
of faces that allows for efficient processing and recognition.
Generative Adversarial Network (GAN) Theory: This theory proposes that humans recognize faces by
learning to discriminate between real and fake faces, similar to how a generative adversarial network
operates. According to this theory, the brain learns to distinguish between genuine facial features and
artifacts caused by noise or distortions in the image.
Attention Mechanism Theory: This theory suggests that humans selectively attend to specific facial
features, similar to how an attention mechanism operates in deep learning. According to this
theory, the brain focuses on salient features, such as the eyes and mouth, while ignoring less important
features, such as the nose or ears.
Overall, deep learning has shown great promise in advancing our understanding of human face
recognition and has led to the development of highly accurate face recognition systems.
Pretrained model :
A pre-trained model is a machine learning model that has already been trained on a large dataset and
saved to disk, typically using a supervised learning approach. The weights of the model are the learned
parameters that have been optimized during the training process to minimize the loss function, which
measures the difference between the predicted outputs and the true outputs.
Pre-trained models are useful in deep learning because they can be used as a starting point for transfer
learning, where the learned features of the pre-trained model are fine-tuned on a new dataset to
improve its performance on a specific task.
The VGG16 model is a deep convolutional neural network introduced by the Visual Geometry Group (VGG) at the University of Oxford for the 2014 ImageNet competition. It consists of 16 weight layers (13 convolutional layers and 3 fully connected layers). The model has a fixed input size of 224x224 pixels and takes an RGB image as input. VGG16 achieved state-of-the-art performance on the ImageNet classification task at the time and has become a popular choice for transfer learning in various computer vision applications. The pre-trained VGG16 model is available in several deep learning libraries, including TensorFlow, Keras, and PyTorch.
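A minimal sketch of this transfer-learning setup with VGG16 in Keras (the new classification head and the choice of 5 classes are placeholders for illustration, not part of the face-recognition code below):
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# Load VGG16 pre-trained on ImageNet, without its original classifier head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # freeze the pre-trained convolutional weights

# Attach a new, trainable classification head for the task at hand
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
outputs = Dense(5, activation='softmax')(x)  # 5 classes is a placeholder
model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])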
Code for human face recognition:
import os
import cv2
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Sequential
from keras.layers import Dense

dataset_path = 'path/to/dataset/'  # one sub-folder per person, e.g. 'alice/img1.jpg'

# Extract features from the dataset using the VGG16 model (without its classifier head)
vgg = VGG16(weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3))

def extract_features(directory):
    features = {}
    for subdir in os.listdir(directory):
        for file in os.listdir(directory + subdir):
            img_path = directory + subdir + '/' + file
            img = cv2.imread(img_path)
            img = cv2.cvtColor(cv2.resize(img, (224, 224)), cv2.COLOR_BGR2RGB)
            img = preprocess_input(np.expand_dims(img.astype('float32'), axis=0))
            features[subdir + '_' + file] = vgg.predict(img)[0]
    return features

features = extract_features(dataset_path)

# Split the extracted features into training and testing sets (every 5th image for testing)
train_features, train_labels, test_features, test_labels = [], [], [], []
for i, key in enumerate(features):
    if i % 5 != 0:
        train_features.append(features[key])
        train_labels.append(key.split('_')[0])
    else:
        test_features.append(features[key])
        test_labels.append(key.split('_')[0])

# Encode the person names as integer labels
names = sorted(set(train_labels + test_labels))
y_train = np.array([names.index(label) for label in train_labels])
y_test = np.array([names.index(label) for label in test_labels])
X_train, X_test = np.array(train_features), np.array(test_features)

# Small dense classifier trained on top of the VGG16 features
model = Sequential([
    Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(len(names), activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10,
          batch_size=32,
          validation_data=(X_test, y_test))
Mini Project : 2
Title - Implement Gender and Age Detection: predict if a person is a male or female and also their
age
Theory -
Collect and prepare the dataset: In this case, we can use the "UTKFace" dataset, which contains images of faces with their corresponding gender and age labels. We need to preprocess the data by resizing the images to a uniform size, shuffling the dataset, and limiting the age to a certain value (such as 100 years).
Split the dataset: Split the dataset into training, validation, and testing sets. The usual split ratio is 80%,
10%, and 10%, respectively.
Define data generators: Define data generators for training, validation, and testing sets using the
"ImageDataGenerator" class in Keras. This class provides data augmentation techniques that can
improve the model's performance, such as rotation, zoom, and horizontal flip.
Define the neural network model: Define a convolutional neural network (CNN) model that takes the
face images as input and outputs two values - the probability of being male and the predicted age. The
model can have multiple convolutional and pooling layers followed by some dense layers.
Compile the model: Compile the model with appropriate loss and metrics for each output (gender and
age). In this case, we can use binary cross-entropy loss for gender and mean squared error (MSE) for
age.
Train the model: Train the model using the fit method of the model object. We need to pass the data
generators for the training and validation sets, as well as the number of epochs and batch size.
Evaluate the model: Evaluate the model's performance on the testing set using the evaluate method
of the model object. This will give us the accuracy and mean absolute error (MAE) of the model.
Predict the gender and age of a sample image: Load a sample image and preprocess it. We can use the
"cv2" library to read the image, resize it to the same size as the training images, and normalize it. Then,
we can use the "predict" method of the model object to get the predicted gender and age.
Source Code -
import tensorflow as tf
import numpy as np
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, Model

img_width = 128
img_height = 128
batch_size = 32
epochs = 10

# df_train is assumed to be a pandas DataFrame built from the UTKFace file names,
# with columns 'image_path', 'gender' (0 = male, 1 = female) and 'age'
# Define data generators for training, validation, and testing sets
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# class_mode='multi_output' yields [gender, age] targets matching the two model outputs
train_generator = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    x_col='image_path',
    y_col=['gender', 'age'],
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='multi_output')

# CNN with a shared trunk and two heads: gender (sigmoid) and age (linear regression)
inputs = layers.Input(shape=(img_height, img_width, 3))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
gender_output = layers.Dense(1, activation='sigmoid', name='gender')(x)
age_output = layers.Dense(1, activation='linear', name='age')(x)
model = Model(inputs, [gender_output, age_output])

model.compile(loss={'gender': 'binary_crossentropy', 'age': 'mse'},
              optimizer='adam',
              metrics={'gender': 'accuracy', 'age': 'mae'})
model.fit(train_generator, epochs=epochs)

# Predict the gender and age of a sample image (the file name is a placeholder)
img = cv2.cvtColor(cv2.imread('sample_face.jpg'), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (img_width, img_height)) / 255.0
gender_prob, age_pred = model.predict(np.expand_dims(img, axis=0))
print('Predicted gender:', 'female' if gender_prob[0][0] > 0.5 else 'male')
print('Predicted age:', float(age_pred[0][0]))
Mini Project : 3
Title - Implement Colorizing Old B&W Images: color old black and white images to colorful images
Project title: Colorizing Old B&W Images using CNN.
Theory - Colorizing black and white images to colorful images involves a complex process that requires expertise and specialized software. However, here are some general steps involved in the process:
Scan the black and white image: The first step is to scan the black and white image and convert it into a digital format.
Preprocess the image: The image needs to be preprocessed to remove any scratches, dust, or other defects. This can be done using image editing software like Photoshop or GIMP.
Convert the image to grayscale: The black and white image needs to be converted to grayscale. This can be done using image editing software or programming languages like Python.
Collect training data: The next step is to collect training data for the colorization model. This can include a dataset of colorful images with their corresponding grayscale versions.
Train the colorization model: A deep learning model can be trained to colorize grayscale images using a dataset of colorful images. This model can be trained using software like TensorFlow, PyTorch, or Keras.
Apply the colorization model to the black and white image: Once the model is trained, it can be applied to the black and white image to generate a colorized version. This can be done using programming languages like Python.
Refine the colorized image: The colorized image may need some manual refinement to ensure that the colors are accurate and the image looks natural. This can be done using image editing software like Photoshop or GIMP.
Save the final image: Once the image has been colorized and refined, it can be saved in a digital format for printing or online use.
Note that the quality of the colorized image will depend on the quality of the original black and white image, the accuracy of the colorization model, and the manual refinement process.
Source Code -
import tensorflow as tf
import numpy as np
import cv2

# Load the old photograph and convert it to grayscale (the file name is a placeholder)
img = cv2.imread('old_photo.jpg')
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray = cv2.resize(img_gray, (256, 256)) / 255.0

# Add new dimensions to the array to make it compatible with the input shape of the model
img_gray = np.expand_dims(img_gray, axis=-1)   # channel dimension
img_gray = np.expand_dims(img_gray, axis=0)    # batch dimension
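The listing above stops after preparing the grayscale input. A minimal sketch of the remaining step, assuming a small Keras encoder-decoder network trained on pairs of grayscale and color images as described in the theory above (the architecture, image size, and the gray_images/color_images training arrays are illustrative assumptions, not a definitive colorization model):
from tensorflow import keras
from tensorflow.keras import layers

# Small encoder-decoder: grayscale (256x256x1) in, RGB (256x256x3) out
model = keras.Sequential([
    layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 1)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same', strides=2),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.UpSampling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')  # RGB in [0, 1]
])
model.compile(optimizer='adam', loss='mse')

# Training would use pairs of grayscale inputs and their color originals, e.g.:
# model.fit(gray_images, color_images, epochs=20, batch_size=16)

# Colorize the prepared grayscale image and bring it back to an 8-bit RGB image
colorized = model.predict(img_gray)[0]
colorized = (colorized * 255).astype('uint8')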