BE LP5 Manual 23-24
Course Objectives:
Course Outcomes:
CO3: Identify and apply the suitable algorithms to solve AI/ML problems.
CO4: Apply the technique of Deep Neural network for implementing Linear regression and
classification.
CO5: Apply the technique of Convolution (CNN) for implementing Deep Learning models.
CO6: Design and develop Recurrent Neural Network (RNN) for prediction.
CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 1 - 1 1 - 2 1 - - - - -
CO2 1 2 1 - - 1 - - - - - 1
CO3 - 1 1 1 1 1 - - - - - -
CO4 3 3 3 - 3 - - - - - - -
CO5 3 3 3 3 3 - - - - - - -
CO6 3 3 3 3 3 - - - - - - -
CO7 3 3 3 3 3 - - - - - - -
410250: High-Performance Computing Group A
Assignment No.: 1
Title of the Assignment: Design and implement Parallel Breadth First Search and Depth First Search
based on existing algorithms using OpenMP. Use a Tree or an undirected graph for BFS and DFS
Objective of the Assignment: Students should be able to Write a program to implement Parallel
Breadth First Search and Depth First Search based on existing algorithms using OpenMP
Prerequisite:
1. What is BFS?
2. What is DFS?
3. Concept of Parallelism
4. Concept of OpenMP
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all
the neighboring nodes at the current depth level before moving on to the next depth level. The
algorithm uses a queue data structure to keep track of the nodes that need to be visited, and marks
each visited node to avoid processing it again. The basic idea of the BFS algorithm is to visit all the
nodes at a given level before moving on to the next level, which ensures that all the nodes are visited
in breadth-first order. BFS is commonly used in many applications, such as finding the shortest path
between two nodes, solving puzzles, and searching through a tree or graph.
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Create an empty Queue and mark all the nodes as not visited.
Step 2: Select a starting node (visiting a node), mark it as visited, and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract a node from the Queue and insert its unvisited child
nodes (exploring a node) into the Queue. Repeat this step until the Queue becomes empty.
What is DFS?
DFS stands for Depth-First Search. It is a popular graph traversal algorithm that explores as far as
possible along each branch before backtracking. Unlike BFS, it does not guarantee the shortest path; it can be used to find a path
between two vertices or to traverse a graph systematically. The algorithm starts at the root node and
explores as far as possible along each branch before backtracking. The backtracking is done to explore
the next branch that has not been explored yet.
DFS can be implemented using either a recursive or an iterative approach. The recursive approach is
simpler to implement but can lead to a stack overflow error for very large graphs. The iterative
approach uses a stack to keep track of nodes to be explored and is preferred for larger graphs.
DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will
eventually reach a node that has already been visited, indicating that a cycle exists. A standard DFS
implementation puts each vertex of the graph into one of two categories: 1. Visited 2. Not Visited The
purpose of the algorithm is to mark each vertex as visited while avoiding cycles.
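As a small illustration of the recursive approach described above, here is a minimal C++ sketch; the adjacency-list representation and the graph contents are only illustrative:

#include <iostream>
#include <vector>
using namespace std;

// Recursive DFS over an adjacency list: visit a vertex, then recurse into each unvisited neighbour
void dfs(int u, const vector<vector<int>> &adj, vector<bool> &visited) {
    visited[u] = true;
    cout << u << " ";
    for (int v : adj[u])
        if (!visited[v])
            dfs(v, adj, visited);
}

int main() {
    // A small undirected graph with 5 vertices (edges chosen only for demonstration)
    vector<vector<int>> adj = {{1, 2}, {0, 3}, {0, 4}, {1}, {2}};
    vector<bool> visited(adj.size(), false);
    dfs(0, adj, visited);   // prints the vertices in depth-first order starting from vertex 0
    return 0;
}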
Concept of OpenMP
● OpenMP provides a set of directives and functions that can be inserted into the source code
of a program to parallelize its execution. These directives are simple and easy to use, and they can be
applied to loops, sections, functions, and other program constructs. The compiler then generates
parallel code that can run on multiple processors concurrently.
● OpenMP is widely used in scientific computing, engineering, and other fields that require high-performance computing; a short stand-alone OpenMP example is shown after the discussion of parallel BFS below.
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or
tree systematically in parallel. It is a popular parallel algorithm used for graph traversal in distributed
computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and then
assigning it to a thread or processor in the system. Each thread maintains a local queue of nodes to be
visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at
a certain distance from the root node. Each thread processes the nodes in its local queue at the current
level, and then exchanges the nodes that are adjacent to the current level with other threads or
processors. This is done to ensure that the nodes at the next level are visited by the next iteration of
the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication
phase. In the computation phase, each thread processes the nodes in its local queue, while in the
communication phase, the threads exchange the nodes that are adjacent to the current level with
other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified
node has been found. The result of the algorithm is the set of visited nodes or the shortest path from
the root node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as
OpenMP, MPI, CUDA, and others. The performance of the algorithm depends on the number of
threads or processors used, the size of the graph, and the communication overhead between the
threads or processors.
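Before looking at the full BFS program, the following minimal sketch shows the basic OpenMP usage referred to above: a single parallel for directive that distributes loop iterations among threads (the array size and the printed messages are only illustrative).

#include <cstdio>
#include <omp.h>

int main() {
    const int n = 8;
    int squares[n];

    // Compile with: g++ -fopenmp example.cpp
    // The iterations of this loop are divided among the threads of the team
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        squares[i] = i * i;
        printf("Thread %d computed squares[%d]\n", omp_get_thread_num(), i);
    }

    for (int i = 0; i < n; i++)
        printf("%d ", squares[i]);
    printf("\n");
    return 0;
}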
#include<iostream>
#include<queue>
#include<omp.h>
using namespace std;

class node {
public:
    node *left, *right;
    int data;
};

// Insert a new value into the binary tree in level order
node *insert(node *root, int data) {
    if(!root) {
        root = new node; root->left = root->right = NULL; root->data = data;
        return root;
    }
    queue<node*> q; q.push(root);
    while(!q.empty()) {
        node *temp = q.front(); q.pop();
        if(temp->left == NULL) {
            temp->left = new node; temp->left->data = data;
            temp->left->left = temp->left->right = NULL; return root;
        } else q.push(temp->left);
        if(temp->right == NULL) {
            temp->right = new node; temp->right->data = data;
            temp->right->left = temp->right->right = NULL; return root;
        } else q.push(temp->right);
    }
    return root;
}

// Breadth-first traversal; the nodes of each level are handed to an OpenMP thread team
void bfs(node *head) {
    queue<node*> q; q.push(head);
    while(!q.empty()) {
        int qSize = q.size();
        #pragma omp parallel for
        for(int i = 0; i < qSize; i++) {
            node *currNode;
            // the shared queue must be accessed by one thread at a time
            #pragma omp critical
            {
                currNode = q.front(); q.pop();
                cout << "\t" << currNode->data;
                if(currNode->left) q.push(currNode->left);
                if(currNode->right) q.push(currNode->right);
            }
        }
    }
}

int main() {
    node *root = NULL; int data; char ans;
    do {
        cout << "\nEnter data => "; cin >> data;
        root = insert(root, data);
        cout << "Do you want to insert one more node? (y/n) "; cin >> ans;
    } while(ans == 'y' || ans == 'Y');
    bfs(root);
    return 0;
}
Run Commands:
1) g++ -fopenmp bfs.cpp -o bfs
2) ./bfs
Output:
Enter data => 8 Do you want to insert one more node? (y/n) n
5 3 7 2 1 8
#include <iostream>
#include <vector>
#include <stack>
#include <omp.h>
using namespace std;
const int MAX = 100000; vector<int> graph[MAX]; bool visited[MAX];

// Iterative DFS; the adjacency list of the current node is scanned by an OpenMP thread team
void dfs(int start_node) {
    stack<int> s; s.push(start_node);
    while (!s.empty()) {
        int curr = s.top(); s.pop();
        if (visited[curr]) continue;
        visited[curr] = true; cout << curr << " ";
        #pragma omp parallel for
        for (int i = 0; i < (int)graph[curr].size(); i++) {
            int adj_node = graph[curr][i];
            // the shared stack must be protected when several threads push onto it
            #pragma omp critical
            { if (!visited[adj_node]) s.push(adj_node); }
        }
    }
}

int main() {
    int n, m, start_node;   // n: nodes, m: edges
    cout << "Enter No of Nodes, Edges, and start node: "; cin >> n >> m >> start_node;
    cout << "Enter Pairs of edges: ";
    for (int i = 0; i < m; i++) { int u, v; cin >> u >> v; graph[u].push_back(v); graph[v].push_back(u); }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) visited[i] = false;
    dfs(start_node);
    return 0;
}
Conclusion:
In this way we can achieve parallelism while implementing Breadth First Search and Depth First
Search
Assignment No.: 2
Title of the Assignment: Write a program to implement Parallel Bubble Sort. Use existing algorithms
and measure the performance of sequential and parallel algorithms.
Objective of the Assignment: Students should be able to Write a program to implement Parallel Bubble
Sort and can measure the performance of sequential and parallel algorithms.
Prerequisite:
1. Concept of Parallelism
2. Concept of OpenMP
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they
are in the wrong order. It is called "bubble" sort because the algorithm moves the larger elements
towards the end of the array in a manner that resembles the rising of bubbles in a liquid.
1. Start at the beginning of the array.
2. Compare the first two elements. If the first element is greater than the second element, swap them.
3. Move to the next pair of adjacent elements and compare them, swapping whenever they are out of order.
4. Continue in this way until the end of the array is reached; the largest element has now moved to the end.
5. If any swaps were made in steps 2-4, repeat the process from step 1.
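A minimal sequential C++ sketch of these steps, using the same numbers as the walkthrough later in this assignment:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Repeatedly sweep the array, swapping adjacent out-of-order elements,
// until a full sweep makes no swaps.
void bubbleSort(vector<int> &a) {
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (size_t i = 0; i + 1 < a.size(); i++) {
            if (a[i] > a[i + 1]) {
                swap(a[i], a[i + 1]);
                swapped = true;
            }
        }
    }
}

int main() {
    vector<int> a = {5, 3, 4, 1, 2};
    bubbleSort(a);
    for (int x : a) cout << x << " ";   // prints 1 2 3 4 5
    cout << endl;
    return 0;
}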
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has
the advantage of being easy to understand and implement, and it is useful for educational purposes
and for sorting small datasets. Bubble Sort has limited practical use in modern software development
due to its inefficient time complexity of O(n^2) which makes it unsuitable for sorting large datasets.
However, Bubble Sort has some advantages and use cases that make it a valuable algorithm to
understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand
and implement. It can be used to introduce the concept of sorting to beginners and as a basis for more
complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles of
sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as
its overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very
efficient. Since Bubble Sort only swaps adjacent elements that are in the wrong order, it has a low
number of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large datasets,
some of its techniques can be used in combination with other sorting algorithms to optimize their
performance. For example, Bubble Sort can be used to optimize the performance of Insertion Sort by
reducing the number of comparisons needed.
Let's say we want to sort a series of numbers 5, 3, 4, 1, and 2 so that they are arranged in ascending
order…
The sorting begins the first iteration by comparing the first two values. If the first value is greater than
the second, the algorithm pushes the first value to the index of the second value.
Step 1: In the case of 5, 3, 4, 1, and 2, 5 is greater than 3. So 5 takes the position of 3 and the values become 3, 5, 4, 1, and 2.
Step 2: The algorithm now has 3, 5, 4, 1, and 2 to compare, this time around, it compares the next two
values, which are 5 and 4. 5 is greater than 4, so 5 takes the index of 4 and the values now become 3,
4, 5, 1, and 2.
Step 3: The algorithm now has 3, 4, 5, 1, and 2 to compare. It compares the next two values, which are
5 and 1. 5 is greater than 1, so 5 takes the index of 1 and the numbers become 3, 4, 1, 5, and 2.
Step 4: The algorithm now has 3, 4, 1, 5, and 2 to compare. It compares the next two values, which are
5 and 2. 5 is greater than 2, so 5 takes the index of 2 and the numbers become 3, 4, 1, 2, and 5.
That’s the first iteration. And the numbers are now arranged as 3, 4, 1, 2, and 5 – from the initial 5, 3,
4, 1, and 2. As you might realize, 5 should be the last number if the numbers are sorted in ascending
order. This means the first iteration is now complete.
The second iteration begins by comparing 3 and 4; since 3 is smaller than 4, no swap occurs. The algorithm
then proceeds to compare 4 and 1. 4 is greater than 1, so 4 is swapped for 1 and the numbers become 3, 1, 4, 2, and 5.
The algorithm now proceeds to compare 4 and 2. 4 is greater than 2, so 4 is swapped for 2 and the
numbers become 3, 1, 2, 4, and 5.
4 is now in the right place, so no swapping occurs between 4 and 5 because 4 is smaller than 5.
That’s how the algorithm continues to compare the numbers until they are arranged in ascending
order of 1, 2, 3, 4, and 5.
Concept of OpenMP
● OpenMP provides a set of directives and functions that can be inserted into the source code
of a program to parallelize its execution. These directives are simple and easy to use, and they can be
applied to loops, sections, functions, and other program constructs.
● The compiler then generates parallel code that can run on multiple processors concurrently.
● Parallel Bubble Sort is a modification of the classic Bubble Sort algorithm that takes advantage
of parallel processing to speed up the sorting process.
● In parallel Bubble Sort, the list of elements is divided into multiple sublists that are sorted
concurrently by multiple threads. Each thread sorts its sublist using the regular Bubble Sort algorithm.
When all sublists have been sorted, they are merged together to form the final sorted list.
● The parallelization of the algorithm is achieved using OpenMP, a programming API that
supports parallel processing in C++, Fortran, and other programming languages. OpenMP provides a
set of compiler directives that allow developers to specify which parts of the code can be executed in
parallel.
● In the parallel Bubble Sort algorithm, the main loop that iterates over the list of elements is
divided into multiple iterations that are executed concurrently by multiple threads. Each thread sorts
a subset of the list, and the threads synchronize their work at the end of each iteration to ensure that
the elements are properly ordered.
● Parallel Bubble Sort can provide a significant speedup over the regular Bubble Sort algorithm,
especially when sorting large datasets on multi-core processors. However, the speedup is
limited by the overhead of thread creation and synchronization, and it may not be worth the effort for
small datasets or when using a single-core processor.
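As one concrete way of realizing this with OpenMP, the following sketch uses odd-even transposition sort, a standard parallel variant of Bubble Sort in which alternating phases compare disjoint pairs of adjacent elements, so each phase can run as a parallel loop; the array contents are only illustrative.

#include <iostream>
#include <vector>
#include <algorithm>
#include <omp.h>
using namespace std;

// Odd-even transposition sort: n phases; in each phase the compared pairs are disjoint,
// so the inner loop can safely be executed in parallel.
void parallelBubbleSort(vector<int> &a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;   // even phases compare (0,1),(2,3),...; odd phases compare (1,2),(3,4),...
        #pragma omp parallel for
        for (int i = start; i < n - 1; i += 2)
            if (a[i] > a[i + 1])
                swap(a[i], a[i + 1]);
    }
}

int main() {
    vector<int> a = {5, 3, 4, 1, 2, 9, 7, 6, 8, 0};
    parallelBubbleSort(a);
    for (int x : a) cout << x << " ";
    cout << endl;
    return 0;
}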
To measure the performance of sequential Bubble sort and parallel Bubble sort algorithms, you can
follow these steps:
1. Implement both the sequential and the parallel version of the Bubble sort algorithm.
2. Choose a range of test cases, such as arrays of different sizes and different degrees of
sortedness, to test the performance of both algorithms.
3. Use a reliable timer to measure the execution time of each algorithm on each test case.
When measuring the performance of the parallel Bubble sort algorithm, you will need to specify the
number of threads to use. You can experiment with different numbers of threads to find the optimal
value for your system.
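A minimal sketch of such a measurement using OpenMP's wall-clock timer omp_get_wtime(); the array size, the reverse-sorted fill, and the two sort routines are only illustrative stand-ins for the implementations being compared.

#include <cstdio>
#include <vector>
#include <algorithm>
#include <omp.h>
using namespace std;

// Plain sequential bubble sort
void sequentialBubbleSort(vector<int> &a) {
    for (size_t pass = 0; pass + 1 < a.size(); pass++)
        for (size_t i = 0; i + 1 < a.size() - pass; i++)
            if (a[i] > a[i + 1]) swap(a[i], a[i + 1]);
}

// Odd-even transposition sort (parallel variant of bubble sort)
void parallelBubbleSort(vector<int> &a) {
    int n = a.size();
    for (int phase = 0; phase < n; phase++) {
        #pragma omp parallel for
        for (int i = phase % 2; i < n - 1; i += 2)
            if (a[i] > a[i + 1]) swap(a[i], a[i + 1]);
    }
}

int main() {
    const int n = 20000;                            // size chosen only for demonstration
    vector<int> data(n);
    for (int i = 0; i < n; i++) data[i] = n - i;    // reverse-sorted (worst case) input

    vector<int> a = data, b = data;
    double t0 = omp_get_wtime();
    sequentialBubbleSort(a);
    double t1 = omp_get_wtime();
    parallelBubbleSort(b);
    double t2 = omp_get_wtime();
    printf("Sequential Bubble Sort time: %f seconds\n", t1 - t0);
    printf("Parallel Bubble Sort time:   %f seconds\n", t2 - t1);
    return 0;
}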
In Ubuntu, you can use a variety of tools to check CPU utilization and memory consumption. Here are
some common tools:
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The output will
display a list of processes sorted by resource usage, with the most resource-intensive processes at the
top.
2. htop: htop is a more advanced version of top that provides additional features, such as
interactive process filtering and a color-coded display. To use it, open a terminal window and type htop.
3. ps: The ps command provides a snapshot of system resource usage at a particular moment in
time. To use it, open a terminal window and type ps aux. This will display a list of all running processes
and their resource usage.
4. free: The free command provides information about system memory usage, including total,
used, and free memory. To use it, open a terminal window and type free -h.
5. vmstat: The vmstat command provides a variety of system statistics, including CPU utilization,
memory usage, and disk activity. To use it, open a terminal window and type vmstat.
# Use the parallel construct to distribute the loop iterations among the threads
# Each thread sorts a portion of the array
# The ordered argument ensures that the threads wait for each other before moving on to the next iteration
# This guarantees that the array is fully sorted before the loop ends
if __name__ == '__main__':
Output:
# Split the array into two halves
mid = n // 2
left = arr[:mid]
right = arr[mid:]
# Use the parallel construct to distribute the work among the threads
# Each thread sorts a portion of the array
with omp.parallel(num_threads=omp.get_max_threads(), default_shared=False):
    left_sorted = parallel_merge_sort(left)
# Use the parallel construct to distribute the loop iterations among the threads
# Each thread merges a portion of the array
elif j == n2:
    merged_arr[k] = left_sorted[i]
    i += 1
else:
    merged_arr[k] = right_sorted[j]
    j += 1
return merged_arr

if __name__ == '__main__':
    start_time = time.time()
Output:
Assignment No.: 3
Title of the Assignment:Implement Min, Max, Sum and Average operations using Parallel Reduction.
Objective of the Assignment: Students should be able to learn about how to perform min, max, sum,
and average operations on a large set of data using parallel reduction technique in CUDA. The program
defines four kernel functions, reduce_min, reduce_max, reduce_sum, and reduce_avg.
Prerequisite:
2. Familiarity with a parallel programming library or framework, such as OpenMP, MPI, or CUDA.
Contents of Theory :
Parallel reduction is a common technique used in parallel computing to perform a reduction operation
on a large dataset. A reduction operation combines a set of values into a single value, such as
computing the sum, maximum, minimum, or average of the values. Parallel reduction exploits the
parallelism available in modern multicore processors, clusters of computers, or GPUs to speed up the
computation.
The parallel reduction algorithm works by dividing the input data into smaller chunks that can be
processed independently in parallel. Each thread or process computes the reduction operation on its
local chunk of data, producing a partial result. The partial results are then combined in a hierarchical
manner until a single result is obtained.
The most common parallel reduction algorithm is the binary tree reduction algorithm, which has a
logarithmic time complexity and can achieve optimal parallel efficiency. In this algorithm, the input
data is initially divided into chunks of size n, where n is the number of parallel threads or processes.
Each thread or process computes the reduction operation on its chunk of data, producing n partial
results.
The partial results are then recursively combined in a binary tree structure, where each internal node
represents the reduction operation of its two child nodes. The tree structure is built in a bottom-up
manner, starting from the leaf nodes and ending at the root node. Each level of the tree reduces the
number of partial results by a factor of two, until a single result is obtained at the root node.
The binary tree reduction algorithm can be implemented using various parallel programming models,
such as OpenMP, MPI, or CUDA. In OpenMP, the algorithm can be implemented using the parallel and
for directives for parallelizing the computation, and the reduction clause for combining the partial
results. In MPI, the algorithm can be implemented using the MPI_Reduce function for performing the
reduction operation, and the MPI_Allreduce function for distributing the result to all processes. In
CUDA, the algorithm can be implemented using the parallel reduction kernel, which uses shared
memory to store the partial results and reduce the memory access latency.
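As a small illustration of the MPI variant mentioned above, the following sketch lets every process contribute one local value and uses MPI_Reduce to obtain the global minimum and sum (and hence the average) on the root process; the per-process data is only illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process contributes one local value (illustrative data)
    int local_val = (rank + 3) * 7 % 20;

    int global_min, global_sum;
    // MPI performs the tree-structured combination of the partial results internally
    MPI_Reduce(&local_val, &global_min, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("min = %d, sum = %d, avg = %f\n",
               global_min, global_sum, (double)global_sum / size);

    MPI_Finalize();
    return 0;
}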
Parallel reduction has many applications in scientific computing, machine learning, data analytics, and
computer graphics. It can be used to compute the sum, maximum, minimum, or average of large
datasets, to perform data filtering, feature extraction, or image processing, to solve optimization
problems, or to accelerate numerical simulations. Parallel reduction can also be combined with other
parallel algorithms, such as parallel sorting, searching, or matrix operations, to achieve higher
performance and scalability.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Compute the minimum and the average of the dataset using OpenMP reduction clauses
void reduce_min_avg(int *data, int data_size, int *min_val_ptr, double *avg_val_ptr) {
    int min_val = data[0];
    long long sum = 0;
    #pragma omp parallel for reduction(min: min_val) reduction(+: sum)
    for (int i = 0; i < data_size; i++) {
        if (data[i] < min_val) min_val = data[i];
        sum += data[i];
    }
    // the final minimum value is the minimum value of the entire dataset
    *min_val_ptr = min_val;
    // the final average value is the sum of the entire dataset divided by its size
    *avg_val_ptr = (double)sum / data_size;
}

int main() {
    int data_size = 1000000;
    int *data = (int *)malloc(data_size * sizeof(int));
    for (int i = 0; i < data_size; i++) data[i] = rand() % 1000;

    int min_val; double avg_val;
    reduce_min_avg(data, data_size, &min_val, &avg_val);
    printf("Minimum value: %d\n", min_val);
    printf("Average value: %f\n", avg_val);

    free(data);
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Compute the maximum and the sum of the dataset using OpenMP reduction clauses
void reduce_max_sum(int *data, int size, int *max_val_ptr, long long *sum_val_ptr) {
    int max_val = data[0];
    long long sum_val = 0;
    #pragma omp parallel for reduction(max: max_val) reduction(+: sum_val)
    for (int i = 0; i < size; i++) {
        if (data[i] > max_val) max_val = data[i];
        sum_val += data[i];
    }
    *max_val_ptr = max_val;
    *sum_val_ptr = sum_val;
}

int main() {
    int size = 1000000;
    int *data = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) data[i] = rand() % 1000;

    int max_val; long long sum_val;
    reduce_max_sum(data, size, &max_val, &sum_val);
    printf("Maximum value: %d\n", max_val);
    printf("Sum: %lld\n", sum_val);

    free(data);
    return 0;
}
In this code, we use the #pragma omp parallel for directive to execute the loop that computes the
maximum and the sum in parallel. The reduction(max: max_val) and reduction(+: sum_val) clauses tell
OpenMP that each thread should keep a private copy of these variables and combine the per-thread
results into the final maximum and sum when the loop finishes, so no explicit critical section is needed
for the combination step. The minimum is computed in the same way with a reduction(min: min_val)
clause, and the average is obtained by dividing the final sum by the number of elements.
Conclusion :
In this way we are able to learn about parallel reduction and how to implement the Min, Max, Sum, and
Average operations using OpenMP.
Assignment No.: 4
Title of the Assignment: Write a CUDA Program for: 1. Addition of two large vectors 2. Matrix Multiplication
Objective of the Assignment: Students should be able to learn about parallel computing and students
should learn about CUDA( Compute Unified Device Architecture) and how it helps to boost high
performance computations.
Prerequisite:
Contents of Theory :
2. CUDA programming model: CUDA programming model consists of host and device codes.
The host code runs on the CPU and is responsible for managing the GPU memory and launching the
kernel functions on the device. The device code runs on the GPU and performs the computations.
3. CUDA kernel function: A CUDA kernel function is a function that is executed on the GPU. It is
defined with the __global__ keyword and is called from the host code using a launch configuration. Each
kernel function runs in parallel on multiple threads, where each thread performs the same operation
on different data.
4. Memory management in CUDA: In CUDA, there are three types of memory: global, shared,
and local. Global memory is allocated on the device and can be accessed by all threads. Shared
memory is allocated on the device and can be accessed by threads within a block. Local memory is
allocated on each thread and is used for temporary storage.
5. CUDA thread organization: In CUDA, threads are organized into blocks, and blocks are
organized into a grid. Each thread is identified by a unique thread index, and each block is identified
by a unique block index.
CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and
programming model developed by NVIDIA. CUDA allows developers to use the power of the GPU to
accelerate computations. It is designed to be used with the C, C++, and Fortran programming
languages. The CUDA architecture consists of host and device components. The host is the CPU, and the
device is the GPU. The CPU is responsible for managing the GPU memory and launching the kernel
functions on the device.
A CUDA kernel function is a function that is executed on the GPU. It is defined with the __global__ keyword
and is called from the host code using a launch configuration. Each kernel function runs in parallel on
multiple threads, where each thread performs the same operation on different data.
CUDA provides three types of memory: global, shared, and local. Global memory is allocated on the
device and can be accessed by all threads. Shared memory is allocated on the device and can be
accessed by threads within a block. Local memory is allocated on each thread and is used for
temporary storage.
CUDA threads are organized into blocks, and blocks are organized into a grid. Each thread is identified
by a unique thread index, and each block is identified by a unique block index.
CUDA devices have a hierarchical memory architecture consisting of multiple memory levels, including
registers, shared memory, L1 cache, L2 cache, and global memory.
CUDA supports various libraries, including cuBLAS for linear algebra, cuFFT for Fast Fourier Transform,
and cuDNN for deep learning.
CUDA programming requires a compatible NVIDIA GPU and an installation of the CUDA Toolkit, which
includes the CUDA compiler, libraries, and tools.
#include <stdlib.h>
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1000000;              // Vector size
    int *a, *b, *c;               // Host vectors
    int *d_a, *d_b, *d_c;         // Device vectors
    int size = n * sizeof(int);   // Size in bytes
    a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }
    cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
    // Copy host input vectors to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch kernel with enough blocks of 256 threads to cover all n elements
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    // Copy device result vector to host result vector
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(a); free(b); free(c);
    return 0;
}
This program uses CUDA to add two large vectors of size 1000000. The vectors are initialized on the
host, and then copied to the device memory. A kernel function is defined to perform the vector
addition, and then launched on the device. The result is copied back to the host memory and verified.
Finally, the device and host memories are freed.
This program multiplies two matrices of size n using CUDA. It first allocates host memory for the
matrices and initializes them. Then it allocates device memory and copies the matrices to the device.
It sets the kernel launch configuration and launches the kernel function matrix_multiply. The kernel
function performs the matrix multiplication and stores the result in matrix c. Finally, it copies the result
back to the host and frees the device and host memory.
The kernel function calculates the row and column indices of the output matrix using the block index
and thread index. It then uses a for loop to calculate the sum of the products of the corresponding
elements in the input matrices. The result is stored in the output matrix.
Note that in this program, we use CUDA events to measure the elapsed time of the kernel function.
This is because the kernel function runs asynchronously on the GPU, so we need to use events to
synchronize the host and device and measure the time accurately.
#include <stdio.h>
#include <stdlib.h>
__global__ void matrix_multiply(float *a, float *b, float *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y, col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0;
        for (int k = 0; k < n; k++) sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}

int main() {
    int n = 1024, size = n * n * sizeof(float);
    float *a, *b, *c, *d_a, *d_b, *d_c;
    a = (float*)malloc(size); b = (float*)malloc(size); c = (float*)malloc(size);
    for (int i = 0; i < n * n; i++) { a[i] = i % n; b[i] = i % n; }   // Initialize matrices
    cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Time the kernel with CUDA events, since it runs asynchronously on the GPU
    cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop);
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    cudaEventRecord(start); matrix_multiply<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop); cudaEventSynchronize(stop);
    float ms = 0; cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("Kernel time: %f ms\n", ms);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(a); free(b); free(c);
    return 0;
}
Conclusion:
Hence we have implemented Addition of two large vectors and Matrix Multiplication using CUDA.
Assignment No.: 5
Title of the Assignment: Implement HPC application for AI/ML domain.
Prerequisite:
1. Knowledge of AI/ML algorithms and models: A deep understanding of AI/ML algorithms and
models is essential to design and implement an HPC application that can efficiently perform large-scale
training and inference. This requires knowledge of statistical methods, linear algebra, optimization
techniques, and deep learning frameworks such as TensorFlow, PyTorch, and MXNet.
Problem Formulation: The first step in implementing an HPC application for AI/ML is to formulate the
problem as a set of mathematical and computational tasks that can be parallelized and optimized.
This involves defining the problem domain, selecting appropriate algorithms and models, and
determining the computational and memory requirements.
Hardware Selection: The next step is to select the appropriate hardware platform for the HPC
application. This involves considering the available hardware options, such as CPU, GPU, FPGA, and
ASIC, and selecting the most suitable option based on the performance, cost, power consumption, and
scalability requirements.
Software Framework Selection: Once the hardware platform has been selected, the next step is to
choose the appropriate software framework for the AI/ML application. This involves considering the
available options, such as TensorFlow, PyTorch, MXNet, and Caffe, and selecting the most suitable
framework based on the programming language, performance, ease of use, and community support.
Data Preparation and Preprocessing: Before training or inference can be performed, the data must be
prepared and preprocessed. This involves cleaning the data, normalizing and scaling the data, and
splitting the data into training, validation, and testing sets. The data must also be stored in a format
that is compatible with the selected software framework.
Model Training or Inference: The main computational task in an AI/ML application is model training or
inference. In an HPC application, this task is parallelized and optimized to take advantage of the
available hardware resources. This involves breaking the model into smaller tasks that can be
parallelized, using techniques such as data parallelism, model parallelism, or pipeline parallelism. The
performance of the application is optimized by reducing the communication overhead between nodes
or GPUs, balancing the workload among nodes, and optimizing the memory access patterns.
Model Evaluation: After the model has been trained or inference has been performed, the
performance of the model must be evaluated. This involves computing the accuracy, precision, recall,
and other metrics on the validation and testing sets. The performance of the HPC application is
evaluated by measuring the speedup, scalability, and efficiency of the parallelized tasks.
Optimization and Tuning: Finally, the HPC application must be optimized and tuned to achieve the best
possible performance. This involves profiling the code to identify bottlenecks and optimizing the code
using techniques such as loop unrolling, vectorization, and cache optimization. The performance of
the application is also affected by the choice of hyperparameters, such as the learning rate, batch size,
and regularization strength, which must be tuned using techniques such as grid search or Bayesian
optimization.
Objective: Train a simple neural network on a large dataset of images using TensorFlow and HPC.
Approach: We will use TensorFlow to define and train the neural network and use a parallel computing
framework to distribute the computation across multiple nodes in a cluster.
Requirements:
Steps:
Code:
import tensorflow as tf
from mpi4py import MPI

# Initialize MPI
comm = MPI.COMM_WORLD; rank, size = comm.Get_rank(), comm.Get_size()

# Define a simple feed-forward network (layer sizes are illustrative)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Load and normalize the MNIST data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Each rank trains on its own chunk of the training data
n = len(x_train); chunk_size = n // size
start = rank * chunk_size
end = n if rank == size - 1 else (rank + 1) * chunk_size

for epoch in range(5):
    history = model.fit(x_train[start:end], y_train[start:end], epochs=1, verbose=0)
    train_acc = history.history['accuracy'][0]
    # Compute the accuracy on the test data and reduce it across all nodes
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
    test_acc = comm.allreduce(test_acc, op=MPI.SUM)
    if rank == 0:
        print(f"Epoch {epoch + 1}: Train accuracy = {train_acc:.4f}, Test accuracy = {test_acc / size:.4f}")
Output:
Epoch 1: Train accuracy = 0.9773, Test accuracy = 0.9745
Epoch 2: Train accuracy = 0.9859, Test accuracy = 0.9835
Epoch 3: Train accuracy = 0.9887, Test accuracy = 0.9857
Epoch 4: Train accuracy = 0.9905, Test accuracy = 0.9876
Epoch 5: Train accuracy = 0.9919, Test accuracy = 0.9880
Conclusion:
Implementing an HPC application for the AI/ML domain involves formulating the problem, selecting
the hardware and software frameworks, preparing and preprocessing the data, parallelizing and
optimizing the model training or inference tasks, evaluating the model performance, and optimizing
and tuning the HPC application for maximum performance. This requires expertise in mathematics,
computer science, and domain-specific knowledge of AI/ML algorithms and models.
Mini Project: Evaluate performance enhancement of parallel Quicksort Algorithm using MPI
Objective: Sort a large dataset of numbers using parallel Quicksort Algorithm with MPI and compare
its performance with the serial version of the algorithm.
Approach: We will use Python and MPI to implement the parallel version of Quicksort Algorithm and
compare its performance with the serial version of the algorithm.
Requirements:
Theory :
Similar to mergesort, QuickSort uses a divide-and-conquer strategy and is one of the fastest sorting
algorithms; it can be implemented in a recursive or iterative fashion. The divide and conquer is a
general algorithm design paradigm and key steps of this strategy can be summarized as follows:
• Divide: Divide the input data set S into disjoint subsets S1, S2, S3…Sk.
• Recur: Solve the subproblems associated with S1, S2, S3…Sk.
• Conquer: Combine the solutions for S1, S2, S3…Sk into a solution for S.
• Base case: The base case for the recursion is generally subproblems of size 0 or 1.
Many studies [2] have revealed that, in order to sort N items, QuickSort takes an average running
time of O(N log N). The worst-case running time for QuickSort occurs when the pivot is a unique
minimum or maximum element, and as stated in [2], the worst-case running time for QuickSort on N
items is O(N^2). These different running times can be influenced by the input distribution
(uniform, sorted or semi-sorted, unsorted, duplicates) and the choice of the pivot element. A compact
C++ sketch in the spirit of the simple pseudocode given in Wikipedia [1] is shown below.
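A minimal recursive QuickSort sketch, assuming a Lomuto-style partition; the array contents in main are only illustrative.

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

// Lomuto-style partition: place the pivot in its final position and return that index
int partition(vector<int> &a, int low, int high) {
    int pivot = a[high];
    int i = low - 1;
    for (int j = low; j < high; j++)
        if (a[j] < pivot)
            swap(a[++i], a[j]);
    swap(a[i + 1], a[high]);
    return i + 1;
}

// Divide: partition around the pivot; Recur: sort the two sides; Base case: ranges of size 0 or 1
void quicksort(vector<int> &a, int low, int high) {
    if (low < high) {
        int p = partition(a, low, high);
        quicksort(a, low, p - 1);
        quicksort(a, p + 1, high);
    }
}

int main() {
    vector<int> a = {29, 4, 66, 12, 3, 88, 41};
    quicksort(a, 0, (int)a.size() - 1);
    for (int x : a) cout << x << " ";
    cout << endl;
    return 0;
}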
We have made use of Open MPI as the backbone library for parallelizing the QuickSort algorithm. In
fact, learning message passing interface (MPI) allows us to strengthen our fundamental knowledge on
parallel programming, given that MPI is lower level than equivalent libraries (OpenMP). As simple as
its name means, the basic idea behind MPI is that messages can be passed or exchanged among
different processes in order to perform a given task. An illustration can be a communication and
coordination by a master process which splits a huge task into chunks and shares them to its slave
processes. Open MPI is developed and maintained by a consortium of academic, research and industry
partners; it combines the expertise, technologies and resources all across the high performance
computing community [11]. As elaborated in [4], MPI has two types of communication routines: point-
to-point communication routines and collective communication routines. Collective routines as
explained in the implementation section have been used in this study.
Algorithm :
In general, the overall algorithm used here to perform QuickSort with MPI works as follows:
i. The master process initializes the input array of size SIZE.
ii. The input size SIZE is broadcast to all participating processes.
iii. Divide the input size SIZE by the number of participating processes npes to get each chunk
size local_size.
iv. The master scatters the input data so that every process receives its own chunk of local_size elements.
v. Each process sorts its local chunk using QuickSort.
vi. Master gathers all sorted local data by other processes in globaldata.
vii. The master merges the sorted chunks in globaldata to obtain the final sorted output.
Steps:
from mpi4py import MPI
import random, time, heapq

# Step 1: Initialize MPI
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Serial quicksort, used both as the baseline and for sorting local chunks
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# The root process generates the data and splits it into one chunk per process
n = 1000000
if rank == 0:
    data = [random.randint(0, n) for _ in range(n)]
    chunks = [data[i::size] for i in range(size)]
else:
    data, chunks = None, None
comm.barrier()
start_time = time.time()
chunk = comm.scatter(chunks, root=0)

# Each process sorts its chunk; the sorted chunks are gathered and merged at the root
chunk = quicksort(chunk)
sorted_chunks = comm.gather(chunk, root=0)

# Compare the performance of the serial and parallel versions of the algorithm
if rank == 0:
    parallel_sorted = list(heapq.merge(*sorted_chunks))
    parallel_time = time.time() - start_time
    serial_start = time.time()
    serial_sorted = quicksort(data)
    serial_time = time.time() - serial_start
    print(f"Serial Quicksort Algorithm time: {serial_time:.4f} seconds")
    print(f"Parallel Quicksort Algorithm time: {parallel_time:.4f} seconds")
Output:
Serial Quicksort Algorithm time: 1.5536 seconds Parallel Quicksort Algorithm time: 1.3488 seconds
Mini Project : 2
Theory - Huffman Encoding is a lossless data compression algorithm that works by assigning variable-
length codes to the characters in a given text or data stream based on their frequency of occurrence.
This encoding scheme can be implemented on GPU to speed up the encoding process.
The variable-length codes assigned to input characters are Prefix Codes, means the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not the prefix of code
assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity
when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d, and
their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to ambiguity because
code assigned to c is the prefix of codes assigned to a and b. If the compressed bit stream is 0001, the
de-compressed output may be “cccd” or “ccb” or “acd” or “ab”.
1. Calculate the frequency of each character in the input text or data stream.
2. Construct a Huffman tree using the calculated frequencies. The tree can be built using a priority
queue implemented on GPU, where the priority of a node is determined by its frequency.
3.Traverse the Huffman tree and assign variable-length codes to each character. The codes can be
generated using a depth-first search algorithm implemented on GPU.
To optimize the implementation for GPU, we can use parallel programming techniques such as CUDA,
OpenCL, or HIP to parallelize the calculation of character frequencies, construction of the Huffman
tree, and generation of Huffman codes.
Here are some specific optimizations that can be applied to each step:
1. Calculating character frequencies:
Use parallelism to split the input text into chunks and count the frequencies of each character in
parallel on different threads.
Reduce the results of each thread into a final frequency count on the GPU.
2. Constructing the Huffman tree:
Use a priority queue implemented on GPU to parallelize the building of the Huffman tree.
Each thread can process one or more nodes at a time, based on the priority of the nodes in the queue.
3. Generating Huffman codes:
Use parallelism to traverse the Huffman tree and generate Huffman codes for each character in
parallel.
Each thread can process one or more nodes at a time, based on the depth of the nodes in the tree.
4. Encoding the input text:
Use parallelism to split the input text into chunks and encode each chunk in parallel on different
threads.
By parallelizing these steps, we can achieve significant speedup in the Huffman Encoding process on
GPU. However, it's important to note that the specific implementation details may vary based on the
programming language and GPU architecture being used.
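As a small CPU-side illustration of the first step, the following OpenMP sketch counts character frequencies over parallel chunks and then reduces the per-thread tables; the input string is only illustrative, and a CUDA version would typically replace the per-thread tables with per-block shared-memory histograms combined using atomicAdd.

#include <cstdio>
#include <cstring>
#include <omp.h>

int main() {
    // Illustrative input; a real encoder would read the text or data stream to be compressed
    const char *input_text = "this is an example for huffman encoding";
    int n = (int)strlen(input_text);

    int freq_count[256] = {0};   // global frequency table, one slot per byte value

    // Each thread counts the frequencies of its chunk in a private table,
    // and the private tables are then reduced into the global one.
    #pragma omp parallel
    {
        int local_count[256] = {0};
        #pragma omp for
        for (int i = 0; i < n; i++)
            local_count[(unsigned char)input_text[i]]++;
        #pragma omp critical
        for (int c = 0; c < 256; c++)
            freq_count[c] += local_count[c];
    }

    for (int c = 0; c < 256; c++)
        if (freq_count[c] > 0)
            printf("'%c' : %d\n", c, freq_count[c]);
    return 0;
}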
Source Code -
// Count the frequency of each character in the input text
int freq_count[256] = {0};
// Encode the input text using the Huffman codes
int output_size = 0;
output_size = (output_size + 7) / 8;
codes);
std::cout << "Input text: " << input_text << std::endl;
std::cout << "Encoded text: ";
return 0;
Output -
Encoded text: 01000110 11010110 10001011 10101110 11110100 11011111 00101101 01000000
11111010
Mini Project : 3
Query processing is the process through which a Database Management System (DBMS) parses,
verifies, and optimizes a given query before creating low-level code that the DB understands.
Query Processing in DBMS, like any other High-Level Language (HLL) where code is first generated and
then executed to perform various operations, has two phases: compile-time and runtime.
The use of declarative query languages and query optimization is one of the main factors contributing
to the success of RDBMS technology. Any database allows users to create queries to request specific
data, and the database then uses effective methods to locate the requested data.
A database optimization approach based on CMP has been studied by numerous other academics. But
the majority of their effort was on optimizing join operations while taking into account the L2-cache
and the parallel buffers of the shared main memory.
• I/O parallelism
• Intra-query parallelism
• Inter-query parallelism
• Intra-operation parallelism
• Inter-operation parallelism
I/O parallelism :
This type of parallelism involves partitioning the relationships among the discs in order to speed up
the retrieval of relationships from the disc.
The inputted data is divided within, and each division is processed simultaneously. After processing all
of the partitioned data, the results are combined. Another name for it is data partitioning.
Hash partitioning is best suited for point queries that are based on the partitioning attribute and have
the benefit of offering an even distribution of data across the discs.
It should be mentioned that partitioning is beneficial for sequential scans of the full table: when the
table is stored on “n” discs, scanning the relation takes around 1/n of the time required on a single-disc
system. In I/O parallelism, there are four different
methods of partitioning:
Hash partitioning :
A hash function is a quick mathematical operation. The partitioning properties are hashed for each
row in the original relationship.
Let’s say that the data is to be partitioned across 4 drives, numbered disk1, disk2, disk3, and disk4. A
row is then stored on disk3 if the hash function returns 3 for that row.
Range partitioning : In range partitioning, each disc receives a continuous range of attribute values. For
instance, if we are range partitioning across three discs numbered 0, 1, and 2, we may assign tuples
whose partitioning-attribute value is less than 5 to disk0, values from 5 to 40 to disk1, and values above
40 to disk2.
It has several benefits, such as putting shuffles on the disc that have attribute values within a specified
range.
Round-robin partitioning :
Any order can be used to study the relationships in this method. It sends the ith tuple to the disc
number (i% n).
Therefore, new rows of data are received by discs in turn. For applications that want to read the full
relation sequentially for each query, this strategy assures an even distribution of tuples across drives.
Schema Partitioning :
Various tables inside a database are put on different discs using a technique called schema partitioning.
Intra-query parallelism :
First method — In this method, a duplicate task can be executed on a small amount of data by each
CPU.
Second method — Using this method, the task can be broken up into various sectors, with each CPU
carrying out a separate subtask.
Inter-query parallelism
Each CPU executes numerous transactions when inter-query parallelism is used. Parallel transaction
processing is what it is known as. To support inter-query parallelism, DBMS leverages transaction
dispatching.
We can also employ a variety of techniques, such as effective lock management. This technique runs
each query sequentially, which slows down the running time.
In such circumstances, DBMS must be aware of the locks that various transactions operating on various
processes have acquired. When simultaneous transactions don’t accept the same data, inter-query
parallelism on shared storage architecture works well.
Additionally, the throughput of transactions is boosted, and it is the simplest form of parallelism in
DBMS.
Intra-operation parallelism :
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins,
projections, and so forth, in parallel. Intra-operation parallelism has a very high parallelism level.
Database systems naturally employ this kind of parallelism. Consider the following SQL example:
SELECT * FROM vehicles ORDER BY model_number;
Since a relation might contain a high number of records, the relational operation in the
aforementioned query is sorting.
Because this operation can be done on distinct subsets of the relation in several processors, it takes
less time to sort the data.
Inter-operation parallelism :
This term refers to the concurrent execution of many operations within a query expression. They come
in two varieties:
Pipelined parallelism — In pipeline parallelism, a second operation consumes a row of the first
operation’s output before the first operation has finished producing the whole set of rows in its output.
Additionally, it is feasible to perform these two processes concurrently on several CPUs, allowing one
operation to consume tuples concurrently with another operation and thereby reduce them.
It is advantageous for systems with a limited number of CPUs and prevents the storage of interim
results on a disc.
Independent parallelism- In this form of parallelism, operations contained within query phrases that
are independent of one another may be carried out concurrently. This analogy is extremely helpful
when dealing with parallelism of a lower degree.
The relational model has been favored over prior hierarchical and network models because of
commercial database technologies. Data independence and high-level query languages are the key
advantages that relational database systems (RDBMSs) have over their forerunners (e.g., SQL).
Additionally, distributed database management is made easier by the relational model’s set-oriented
structure. RDBMSs may now offer performance levels comparable to older systems thanks to a
decade of development and tuning.
They are therefore widely employed in the processing of commercial data for OLTP (online transaction
processing) or decision-support systems. Through the use of many processors working together,
parallel processing makes use of multiprocessor computers to run application programmes and boost
performance.
It is most commonly used in scientific computing, which it does by the speed of numerical applications’
responses.
The development of parallel database systems is an example of how database management and
parallel computing can work together. A given SQL statement can be divided up in the parallel database
system PQO such that its components can run concurrently on several processors in a multi-processor
machine.
Full table scans, sorting, sub-queries, data loading, and other common operations can all be performed
in parallel.
As a form of parallel database optimization, Parallel Query enables the division of SELECT or DML
operations into many smaller chunks that can be executed by PQ slaves on different CPUs in a single
box.
The order of joins and the method for computing each join are fixed in the first phase, sorting and
rewriting. The second phase, parallelization, turns the query tree into a parallel plan.
Parallelization divides this stage into two parts: extraction of parallelism and scheduling. Optimizing
database queries is an important task in database management systems to improve the performance
of database operations. Parallelization of database query optimization can significantly improve query
execution time by dividing the workload among multiple processors or nodes.
1. Partitioning: The first step is to partition the data into smaller subsets. The partitioning can be
done based on different criteria, such as range partitioning, hash partitioning, or list partitioning. This
can be done in parallel by assigning different processors or nodes to handle different parts of the
partitioning process.
2. Query optimization: Once the data is partitioned, the next step is to optimize the queries.
Query optimization involves finding the most efficient way to execute the query by considering factors
such as index usage, join methods, and filtering. This can also be done in parallel by assigning different
processors or nodes to handle different parts of the query optimization process.
3. Query execution: After the queries are optimized, the final step is to execute the queries. The
execution can be done in parallel by assigning different processors or nodes to handle different parts
of the execution process. The results can then be combined to generate the final result set.
Here's an example of how we can parallelize the query optimization process using OpenMP:
// C++
// partitions and num_partitions are assumed to have been produced by the partitioning step
#pragma omp parallel for
for (int i = 0; i < num_partitions; i++) {
    // Each thread optimizes the query against its own partition of the data
    optimize_query(query, partitions[i]);
}
In this example, we first partition the data into smaller subsets using OpenMP parallelism. Then we
optimize each query in parallel by assigning different processors or nodes to handle different parts of
the optimization process. Finally, we execute the queries in parallel by assigning different processors
or nodes to handle different parts of the execution process.
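A slightly fuller sketch of the three phases described above, where partition_data, optimize_query, and execute_query are hypothetical stand-ins for the DBMS-specific logic (their names, signatures, and printed messages are assumptions for illustration only):

#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the DBMS-specific routines
void partition_data(int part_id)                { printf("partitioning part %d\n", part_id); }
void optimize_query(const char *q, int part_id) { printf("optimizing on part %d: %s\n", part_id, q); }
void execute_query(const char *q, int part_id)  { printf("executing on part %d: %s\n", part_id, q); }

int main() {
    const char *query = "SELECT * FROM vehicles ORDER BY model_number";
    const int num_partitions = 8;

    // Phase 1: partition the data in parallel
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        partition_data(i);

    // Phase 2: optimize the query for each partition in parallel
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        optimize_query(query, i);

    // Phase 3: execute the optimized query on each partition in parallel;
    // the partial results would then be combined into the final result set
    #pragma omp parallel for
    for (int i = 0; i < num_partitions; i++)
        execute_query(query, i);
    return 0;
}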
Parallelization of database query optimization can significantly improve the performance of database
operations and reduce query execution time. However, it requires careful consideration of the
workload distribution, synchronization, and communication between processors or nodes.
Mini Project : 4
Theory -
Therefore, it is essential to effectively map the subproblems of each phase to the processing elements
while implementing thread level parallelism. We propose an adaptive Generalized Mapping Method
(GMM) for NPDP parallelization that utilizes the GPU for efficient mapping of subproblems onto
processing threads in each phase.
Input-size and targeted GPU decide the computing power and the best mapping for each phase in
NPDP parallelization. The performance of GMM is compared with different conventional parallelization
approaches.
For sufficiently large inputs, our technique outperforms the state-of-the-art conventional
parallelization approach and achieves a significant speedup of a factor 30. We also summarize the
general heuristics for achieving better gain in the NPDP parallelization.
Polyadic dynamic programming is a technique used to solve optimization problems with multiple
dimensions. Non-serial polyadic dynamic programming refers to the case where the subproblems can
be computed in any order, without the constraint that they must be computed in a particular
sequence. This makes it possible to parallelize the computation on a GPU.
Here's an example code that implements non-serial polyadic dynamic programming with GPU
parallelization using CUDA:
#include <iostream>
#define N 256   // problem sizes kept small enough that the dp table fits in memory
#define M 256
#define K 256

// Each thread computes one independent subproblem dp[i][j][k] of the current phase
__global__ void compute_subproblem(float *dp, float *x, float *y, float *z) {
    int i = blockIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && j < M && k < K)
        dp[i * M * K + j * K + k] = x[i] * y[j] * z[k];   // Compute the value of the subproblem
}

int main() {
    // Allocate and initialize the input arrays on the CPU
    float *x = new float[N], *y = new float[M], *z = new float[K];
    for (int i = 0; i < N; i++) x[i] = i;
    for (int j = 0; j < M; j++) y[j] = j;
    for (int k = 0; k < K; k++) z[k] = k;
    float *d_dp, *d_x, *d_y, *d_z;
    cudaMalloc(&d_dp, N * M * K * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float)); cudaMalloc(&d_y, M * sizeof(float)); cudaMalloc(&d_z, K * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, M * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, z, K * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(16, 16, 1), grid((K + 15) / 16, (M + 15) / 16, N);
    compute_subproblem<<<grid, block>>>(d_dp, d_x, d_y, d_z);
    // Copy the dp array back to the CPU and print one entry
    float *dp = new float[N * M * K];
    cudaMemcpy(dp, d_dp, N * M * K * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "dp[" << N-1 << "][" << M-1 << "][" << K-1 << "] = " << dp[(N-1)*M*K + (M-1)*K + (K-1)] << std::endl;
    cudaFree(d_dp); cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    delete[] x; delete[] y; delete[] z; delete[] dp;
    return 0;
}
Assignment No.: 1
Title of the Assignment: Linear regression by using Deep Neural network (prediction of Boston house prices).
Objective of the Assignment: Students should be able to implement linear regression by using deep
neural networks. Students should know about neural networks and their importance over machine
learning models.
Prerequisite:
Linear Regression : Linear regression is a basic and commonly used type of predictive analysis. The
overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job
in predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of
the beta estimates–impact the outcome variable? These regression estimates are used to explain the
relationship between one dependent variable and one or more independent variables. The simplest
form of the regression equation with one dependent and one independent variable is defined by the
formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression
coefficient, and x = score on the independent variable.
The basic unit of the brain is known as a neuron; there are approximately 86 billion neurons in our
nervous system which are connected to 10^14-10^15 synapses. Each neuron receives a signal from
the synapses and gives output after processing the signal. This idea is drawn from the brain to build a
neural network.
Each neuron performs a dot product between the inputs and weights, adds biases, applies an
activation function, and gives out the outputs. When a large number of neurons are present together
to give out a large number of outputs, it forms a neural layer. Finally, multiple layers combine to form
a neural network.
Neural networks are formed when multiple neural layers combine with each other to give out a
network, or we can say that there are some layers whose outputs are inputs for other layers. The
most common type of layer to construct a basic neural network is the fully connected layer, in which
the adjacent layers are fully connected pairwise and neurons in a single layer are not connected to
each other.
Naming conventions. When we say an N-layer neural network, we do not count the input layer. Therefore, a
single-layer neural network describes a network with no hidden layers (input directly mapped to
output). In the case of our code, we’re going to use a single-layer neural network, i.e., we do not have
a hidden layer.
Output layer. Unlike all layers in a Neural Network, the output layer neurons most commonly do not
have an activation function (or you can think of them as having a linear identity activation function).
This is because the last output layer is usually taken to represent the class scores (e.g. in classification),
which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. In regression). Since
we’re performing regression using a single layer, we do not have any activation function.
Sizing neural networks. The two metrics that people commonly use to measure the size of neural
networks are the number of neurons, or more commonly the number of parameters. For example, a
single-layer network that maps 13 input features to one output neuron has 13 weights and 1 bias, i.e.
14 learnable parameters.
The Boston Housing Dataset is a popular dataset in machine learning and contains information about
various attributes of houses in Boston. The goal of using deep neural networks on this dataset is to
predict the median value of owner-occupied homes.
The Boston Housing Dataset contains 13 input variables or features, such as crime rate, average
number of rooms per dwelling, and distance to employment centers. The target variable is the median
value of owner-occupied homes. The dataset has 506 rows, which is not very large, but still sufficient
to train a deep neural network.
To implement a deep neural network on the Boston Housing Dataset, we can follow these steps:
Load the dataset: We can load the dataset using libraries like pandas or numpy.
Preprocess the data: We need to preprocess the data by scaling the input features so that they have
zero mean and unit variance. This step is important because it helps the neural network to converge
faster.
Split the dataset: We split the dataset into training and testing sets. We can use a 70/30 or 80/20 split
for training and testing, respectively.
Define the model architecture: We need to define the architecture of our deep neural network. We
can use libraries like Keras or PyTorch to define our model. The architecture can include multiple
hidden layers with various activation functions and regularization techniques like dropout.
Compile the model: We need to compile the model by specifying the loss function, optimizer, and
evaluation metrics. For regression problems like this, we can use mean squared error as the loss
function and adam optimizer.
Train the model: We can train the model using the training data. We can use techniques like early
stopping to prevent overfitting.
Evaluate the model: We can evaluate the model using the testing data. We can calculate the mean
squared error or the mean absolute error to evaluate the performance of the model.
Overall, using a deep neural network on the Boston Housing Dataset can result in accurate predictions
of the median value of owner-occupied homes. By following the above steps, we can implement a
deep neural network and fine-tune its hyperparameters to achieve better performance.
Practical Implementation of Boston Dataset and prediction using deep neural network.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

# Load the dataset from a CSV file
df = pd.read_csv('boston_housing.csv')
# Split the data into input and output variables ('MEDV' is assumed to be the target column)
X = df.drop('MEDV', axis=1).values
y = df['MEDV'].values
X = StandardScaler().fit_transform(X)   # scale inputs to zero mean and unit variance
print(X[:5])   # Display the first few rows of the scaled input features
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training set shape:', X_train.shape, y_train.shape); print('Testing set shape:', X_test.shape, y_test.shape)
# Define the model architecture (layer sizes are illustrative)
model = Sequential([Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
                    Dense(32, activation='relu'), Dense(1)])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=100, verbose=0)
# Plot the training and validation loss over epochs
plt.plot(history.history['loss']); plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.show()
# Evaluate the model on the testing set
loss, mae = model.evaluate(X_test, y_test)
Conclusion : In this way we are able to learn about the Deep Neural Network and its implementation
on the boston dataset.
Assignment No.: 2
Title of the Assignment: Classification using Deep neural network. Binary classification using Deep
Neural Networks. Example: Classify movie reviews into "positive" reviews and "negative" reviews,
just based on the text content of the reviews. Use the IMDB dataset.
Objective of the Assignment: Students should be able to implement deep neural networks on textual
data and they should know the basics of natural language processing and its applications in the real
world.
Prerequisite:
Classification using deep neural networks is a popular approach to solve various supervised learning
problems such as image classification, text classification, speech recognition, and many more. In this
approach, the neural network is trained on labeled data to learn a mapping between the input features
and the corresponding output labels.
Binary classification is a type of classification problem in which the task is to classify the input data into
one of the two classes. In the example of classifying movie reviews as positive or negative, the input
data is the text content of the reviews, and the output labels are either positive or negative.
The deep neural network used for binary classification consists of multiple layers of interconnected
neurons, which are capable of learning complex representations of the input data. The first layer of
the neural network is the input layer, which takes the input data and passes it to the hidden layers.
The hidden layers perform non-linear transformations on the input data to learn more complex
features. Each hidden layer consists of multiple neurons, which are connected to the neurons of the
previous and next layers. The activation function of the neurons in the hidden layers introduces non-
linearity into the network and allows it to learn complex representations of the input data.
The last layer of the neural network is the output layer, which produces the classification result. In
binary classification, the output layer consists of one neuron, which produces the probability of the
input data belonging to the positive class. The probability of the input data belonging to the negative
class can be calculated as (1 - probability of positive class).
The training of the neural network involves optimizing the model parameters to minimize the loss
function. The loss function measures the difference between the predicted output and the actual
output. In binary classification, the commonly used loss function is binary cross-entropy loss.
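As an illustration of this output layer and loss, a minimal sketch that computes binary cross-entropy directly with NumPy (the probabilities and labels below are made-up values for illustration, not taken from the IMDB example):
import numpy as np

# Made-up sigmoid outputs (probability of the positive class) and true labels
p = np.array([0.9, 0.2, 0.7])   # predicted P(positive)
y = np.array([1, 0, 0])          # actual labels (1 = positive, 0 = negative)

# Probability of the negative class is simply 1 - p
print(1 - p)

# Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # smaller values mean the predictions are closer to the true labels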
The IMDB dataset is a popular dataset used for binary classification of movie reviews. It contains
50,000 movie reviews, which are split into 25,000 reviews for training and 25,000 reviews for testing.
The reviews are preprocessed and encoded as sequences of integers, where each integer represents a
word in the review. The deep neural network can be trained on this dataset to classify the movie
reviews into positive or negative categories.
In summary, binary classification using deep neural networks involves designing a neural network
architecture with multiple layers of interconnected neurons, training the network on labeled data
using a suitable loss function, and using the trained network to classify new data. The IMDB dataset
provides a suitable example to implement and test this approach on movie review classification.
Dataset information :
The IMDB (Internet Movie Database) dataset is a popular dataset used for sentiment analysis,
particularly binary classification of movie reviews into positive or negative categories. It consists of
50,000 movie reviews, which are evenly split into a training set and a testing set, each containing
25,000 reviews.
The reviews are encoded as sequences of integers, where each integer represents a word in the review.
The words are indexed based on their frequency in the dataset, with the most frequent word assigned
the index 1, the second most frequent word assigned the index 2, and so on. The indexing is capped
at a certain number of words, typically the top 10,000 most frequent words, to limit the size of the
vocabulary.
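To make this integer encoding concrete, a short sketch (assuming Keras' built-in copy of the dataset and a 10,000-word vocabulary cap) that decodes one encoded review back into words:
from keras.datasets import imdb

# Load the dataset, keeping only the top 10,000 most frequent words
(x_train, y_train), _ = imdb.load_data(num_words=10000)

# word -> index mapping; indices 0-2 are reserved for padding/start/unknown tokens
word_index = imdb.get_word_index()
index_word = {index + 3: word for word, index in word_index.items()}

# Decode the first training review back into (approximate) text
decoded = ' '.join(index_word.get(i, '?') for i in x_train[0])
print(decoded[:200], '...')
print('Label:', y_train[0])  # 1 = positive, 0 = negative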
The reviews are preprocessed to remove punctuation and convert all letters to lowercase. The reviews are also padded or truncated to a fixed length, typically 250 words, so that all input sequences have the same length. Padding adds zeros to the sequence to bring it up to the fixed length, while truncating cuts the sequence off at the maximum length.
The reviews are labeled as positive or negative based on the overall sentiment expressed in the review. Reviews with a score of 7 or higher on a scale of 1-10 are labeled as positive, while reviews with a score of 4 or lower are labeled as negative. Reviews with more neutral scores are excluded from the dataset to ensure a clear distinction between the positive and negative categories.
The IMDB dataset is a popular benchmark dataset for sentiment analysis and has been used
extensively to evaluate various machine learning and deep learning models. Its popularity is attributed
to the large size of the dataset, the balanced distribution of positive and negative reviews, and the
preprocessed format of the reviews.
1. Load the IMDB dataset using Keras' built-in imdb.load_data() function. This function loads
the dataset and preprocesses it as sequences of integers, with the labels already converted to binary
(0 for negative, 1 for positive).
2. Pad or truncate the sequences to a fixed length of 250 words using Keras' pad_sequences()
function.
3. Define a deep neural network architecture, consisting of an embedding layer to learn the
word embeddings, followed by multiple layers of bidirectional LSTM (Long Short-Term Memory) cells,
and a final output layer with a sigmoid activation function to output the binary classification.
4. Compile the model using binary cross-entropy loss and the Adam optimizer.
5. Train the model on the training set and validate on the validation set.
6. Evaluate the trained model on the test set and compute the accuracy and loss.
import numpy as np
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Load the IMDB dataset (top 10,000 most frequent words)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Pad or truncate the sequences to a fixed length of 250 words
max_len = 250
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# Define the model: embedding -> two bidirectional LSTM layers -> sigmoid output
model = Sequential([
    Embedding(10000, 32, input_length=max_len),
    Bidirectional(LSTM(32, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(1, activation='sigmoid')
])

# Compile, train, and evaluate
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
loss, acc = model.evaluate(x_test, y_test, batch_size=128)
print(f'Test accuracy: {acc:.4f}, Test loss: {loss:.4f}')
This example implements a deep neural network with two layers of bidirectional LSTM cells, which are
capable of learning complex patterns in sequence data. The Embedding layer learns the word
embeddings from the input sequences, which are then fed into the LSTM layers. The output of the
LSTM layers is then fed into a dense output layer with a sigmoid activation function, which outputs the
binary classification.
The compile() method is used to compile the model with binary cross-entropy loss and the Adam
optimizer. The fit() method is used to train the model on the training set for 10 epochs with a
batch size of 128. The evaluate() method is used to evaluate the trained model on the test set and
compute the accuracy and loss.
This example demonstrates how deep neural networks can be used for binary classification on text
data, specifically for classifying movie reviews as positive or negative based on the text content.
Conclusion:
In this way we are able to learn about deep neural networks and their implementation on the IMDB dataset, and to learn the basics of sentiment analysis.
Assignment No.: 3
Title of the Assignment: Convolutional Neural Network (CNN). Use the MNIST Fashion Dataset and create a classifier to classify fashion clothing into categories.
Objective of the Assignment: Students should be able to implement a Convolutional Neural Network and classify clothing into categories using the MNIST Fashion dataset.
Prerequisite:
Convolutional Neural Networks (CNNs) are a class of artificial neural networks that are specially
designed to analyze and classify images, videos, and other types of multidimensional data. They are
widely used in computer vision tasks such as image classification, object detection, and image
segmentation.
The main idea behind CNNs is to perform convolutions, which are mathematical operations that apply
a filter to an image or other input data. The filter slides over the input data and performs a dot product
between the filter weights and the input values at each position, producing a new output value. By
applying different filters at each layer, the network learns to detect different features in the input data,
such as edges, shapes, and textures.
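A minimal sketch of this sliding dot product using NumPy (the 4x4 input and 2x2 edge-like filter below are made-up values for illustration, not taken from any particular network):
import numpy as np

# A tiny 4x4 "image" and a 2x2 filter (both made-up for illustration)
image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]])
kernel = np.array([[1, -1],
                   [1, -1]])  # responds to vertical edges

# Slide the filter over the image and take the dot product at each position
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        feature_map[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

print(feature_map)  # one output value per filter position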
CNNs typically consist of several layers that perform different operations on the input data. The most
common types of layers are:
Convolutional Layers: These layers perform convolutions on the input data using a set of filters. Each
filter produces a feature map, which represents the presence of a specific feature in the input data.
Pooling Layers: These layers reduce the spatial dimensions of the feature maps by taking the maximum
or average value within a small region of the feature map. This reduces the amount of computation
needed in the subsequent layers and makes the network more robust to small translations in the input
data.
Activation Layers: These layers apply a nonlinear activation function, such as ReLU (Rectified Linear
Unit), to the output of the previous layer. This introduces nonlinearity into the network and allows it
to learn more complex features.
Fully-Connected Layers: These layers connect all the neurons in the previous layer to all the neurons
in the current layer, similar to a traditional neural network. They are typically used at the end of the
network to perform the final classification.
The architecture of a CNN is typically organized in a series of blocks, each consisting of one or more
convolutional layers followed by pooling and activation layers. The output of the final block is then
passed through one or more fully-connected layers to produce the final output.
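As a rough illustration of this block structure in Keras (a sketch only; the layer sizes and input shape are arbitrary and not tied to any dataset in this manual):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense

# Two convolution + activation + pooling blocks followed by fully-connected layers
model = Sequential([
    Conv2D(32, (3, 3), input_shape=(64, 64, 3)),   # convolutional layer
    Activation('relu'),                             # activation layer
    MaxPooling2D((2, 2)),                           # pooling layer
    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),                  # fully-connected layer
    Dense(10, activation='softmax')                 # final classification
])
model.summary()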
CNNs are trained using backpropagation, which is a process that updates the weights of the network
based on the difference between the predicted output and the true output. This process is typically
done using a loss function, such as cross-entropy loss, which measures the difference between the
predicted output and the true output.In summary, CNNs are a powerful class of neural networks that
are specially designed for analyzing and classifying images and other types of multidimensional data.
They achieve this by performing convolutions on the input data using a set of filters, and by using
different types of layers to reduce the spatial dimensions of the feature maps, introduce nonlinearity,
and perform the final classification.
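A minimal sketch of one such weight update, using TensorFlow's GradientTape on a made-up single example (this shows the loss-gradient-update cycle only, not a full training loop; the tiny model and random input are assumptions for illustration):
import tensorflow as tf

# A tiny model, one made-up input image (28x28x1) and its true class label
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation='softmax')
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.uniform((1, 28, 28, 1))
y_true = tf.constant([3])

# Forward pass, loss, gradients of the loss w.r.t. the weights, weight update
with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = loss_fn(y_true, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print('loss:', float(loss))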
Dataset information :
The MNIST Fashion Dataset is a widely used benchmark dataset in the field of computer vision and
machine learning. It consists of 70,000 grayscale images of clothing items, including dresses, shirts,
sneakers, sandals, and more. The dataset is split into 60,000 training images and 10,000 test images,
with each image being a 28x28 pixel square.
The dataset is often used as a benchmark for classification tasks in computer vision, particularly for
image recognition and classification using neural networks. The dataset is considered relatively easy
compared to other image datasets such as ImageNet, but it is still a challenging task due to the
variability in the clothing items and the low resolution of the images.
The goal of the MNIST Fashion Dataset is to correctly classify the clothing items into one of the ten
categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.
The dataset was created as a replacement for the original MNIST handwritten digit dataset, which was
becoming too easy for machine learning algorithms to classify accurately. The MNIST Fashion Dataset
was created to provide a more challenging classification task while still being a relatively small dataset
that can be used for experimentation and testing.
The dataset has been used extensively in the field of computer vision, with researchers and developers
using it to test and evaluate new machine learning algorithms and models. The dataset has also been
used in educational settings to teach students about machine learning and computer vision.
One common approach to tackling the MNIST Fashion Dataset is to use convolutional neural networks
(CNNs), which are specifically designed to process images. CNNs consist of multiple layers, including
convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract
features from the images, while the pooling layers downsample the features to reduce the
computational complexity. The fully connected layers perform the final classification of the images.
Other approaches to tackling the MNIST Fashion Dataset include using other types of neural networks
such as recurrent neural networks (RNNs) and deep belief networks (DBNs), as well as using other
machine learning algorithms such as decision trees, support vector machines (SVMs), and k-nearest
neighbor (KNN) classifiers.
Overall, the MNIST Fashion Dataset is a valuable benchmark dataset in the field of computer vision
and machine learning, and its popularity is likely to continue as new algorithms and models are
developed and tested.
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Load and normalize the MNIST Fashion dataset
fashion_mnist = keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0
x_test = x_test[..., np.newaxis] / 255.0

# Define a small CNN classifier for the 10 clothing categories
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train and evaluate the model
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
test_loss, test_acc = model.evaluate(x_test, y_test)

# Make predictions
predictions = model.predict(x_test)
print('Test accuracy:', test_acc)
print('Predicted class of first test image:', np.argmax(predictions[0]))
Conclusion:
In this way we are able to implement a convolutional neural network (CNN) using the MNIST Fashion dataset.
Assignment No.: 4
Title of the Assignment: Recurrent Neural Network (RNN). Use the Google stock prices dataset and design a time series analysis and prediction system using RNN.
Objective of the Assignment: Students should be able to implement a Recurrent Neural Network and design a time series analysis and prediction system using RNN.
Prerequisite:
A recurrent neural network (RNN) is a type of neural network that is designed to work with
sequential data. Unlike traditional feedforward neural networks that only process input data in a
single pass, RNNs maintain an internal state or memory that allows them to process sequences of
input data.
This makes RNNs well-suited for tasks such as natural language processing, speech recognition, and
time series analysis.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Load the Google stock prices dataset (a 'Close' price column is assumed)
data = pd.read_csv('GOOG.csv')
dataset = MinMaxScaler().fit_transform(data[['Close']].values)

# Create the training and testing datasets (80/20 split)
training_data_len = int(len(dataset) * 0.8)
training_data = dataset[:training_data_len]
testing_data = dataset[training_data_len:]

# Turn the series into samples of `time_step` past prices -> next price
def create_dataset(dataset, time_step=1):
    X, Y = [], []
    for i in range(len(dataset) - time_step):
        X.append(dataset[i:i + time_step, 0])
        Y.append(dataset[i + time_step, 0])
    return np.array(X), np.array(Y)

time_step = 60
X_train, y_train = create_dataset(training_data, time_step)
X_test, y_test = create_dataset(testing_data, time_step)
X_train = X_train.reshape(X_train.shape[0], time_step, 1)
X_test = X_test.reshape(X_test.shape[0], time_step, 1)

# Define and train a simple LSTM model
model = Sequential()
model.add(LSTM(50, input_shape=(time_step, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=32)

# Predict on the test portion
predictions = model.predict(X_test)
Output:
Epoch 100/100
Mini Project : 1
Mini Project: Human Face Recognition
Deep learning is a subset of machine learning that is inspired by the structure and function of the
human brain. It has been successfully applied to various computer vision tasks, including human face
recognition. There are several theories related to human face recognition using deep learning:
Convolutional Neural Network (CNN) Theory: This theory proposes that humans recognize faces by
hierarchically processing the visual information, similar to how a convolutional neural network
operates. According to this theory, the brain first detects low-level features, such as edges and corners,
and then gradually builds up to higher-level features, such as facial contours and expressions.
Autoencoder Theory: This theory suggests that humans learn to recognize faces by compressing the
information into a lower-dimensional space and then reconstructing it back to its original form, similar
to how an autoencoder operates. According to this theory, the brain uses a hierarchical representation
of faces that allows for efficient processing and recognition.
Generative Adversarial Network (GAN) Theory: This theory proposes that humans recognize faces by
learning to discriminate between real and fake faces, similar to how a generative adversarial network
operates. According to this theory, the brain learns to distinguish between genuine facial features and
artifacts caused by noise or distortions in the image.
Attention Mechanism Theory: This theory suggests that humans selectively attend to specific facial
features, similar to how an attention mechanism operates in deep learning. According to this
theory, the brain focuses on salient features, such as the eyes and mouth, while ignoring less important
features, such as the nose or ears.
Overall, deep learning has shown great promise in advancing our understanding of human face
recognition and has led to the development of highly accurate face recognition systems.
Pretrained model :
A pre-trained model is a machine learning model that has already been trained on a large dataset and
saved to disk, typically using a supervised learning approach. The weights of the model are the learned
parameters that have been optimized during the training process to minimize the loss function, which
measures the difference between the predicted outputs and the true outputs.
Pre-trained models are useful in deep learning because they can be used as a starting point for transfer
learning, where the learned features of the pre-trained model are fine-tuned on a new dataset to
improve its performance on a specific task.
The VGG16 model is a deep convolutional neural network introduced by the Visual Geometry Group (VGG) at the University of Oxford for the 2014 ImageNet competition. It consists of 16 weight layers (13 convolutional layers and 3 fully connected layers). The model has a fixed input size of 224x224 pixels and takes an RGB image as input. VGG16 achieved state-of-the-art performance on the ImageNet classification task at the time and has become a popular choice for transfer learning in various computer vision applications. The pre-trained VGG16 model is available in several deep learning libraries, including TensorFlow, Keras, and PyTorch.
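A minimal sketch of this transfer-learning setup with VGG16 in Keras (the new classification head and the choice of 5 classes are placeholders for illustration, not part of the face-recognition code below):
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# Load VGG16 pre-trained on ImageNet, without its original classifier head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # freeze the pre-trained convolutional weights

# Attach a new, trainable classification head for the task at hand
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
outputs = Dense(5, activation='softmax')(x)  # 5 classes is a placeholder
model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])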
Code for human face recognition:
import os
import cv2
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Sequential
from keras.layers import Dense

dataset_path = 'path/to/dataset/'  # one sub-folder per person, e.g. 'alice/img1.jpg'

# Extract features from the dataset using the VGG16 model (without its classifier head)
vgg = VGG16(weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3))

def extract_features(directory):
    features = {}
    for subdir in os.listdir(directory):
        for file in os.listdir(directory + subdir):
            img_path = directory + subdir + '/' + file
            img = cv2.imread(img_path)
            img = cv2.cvtColor(cv2.resize(img, (224, 224)), cv2.COLOR_BGR2RGB)
            img = preprocess_input(np.expand_dims(img.astype('float32'), axis=0))
            features[subdir + '_' + file] = vgg.predict(img)[0]
    return features

features = extract_features(dataset_path)

# Split the extracted features into training and testing sets (every 5th image for testing)
train_features, train_labels, test_features, test_labels = [], [], [], []
for i, key in enumerate(features):
    if i % 5 != 0:
        train_features.append(features[key])
        train_labels.append(key.split('_')[0])
    else:
        test_features.append(features[key])
        test_labels.append(key.split('_')[0])

# Encode the person names as integer labels
names = sorted(set(train_labels + test_labels))
y_train = np.array([names.index(label) for label in train_labels])
y_test = np.array([names.index(label) for label in test_labels])
X_train, X_test = np.array(train_features), np.array(test_features)

# Small dense classifier trained on top of the VGG16 features
model = Sequential([
    Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(len(names), activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10,
          batch_size=32,
          validation_data=(X_test, y_test))
Mini Project : 2
Title - Implement Gender and Age Detection: predict if a person is a male or female and also their
age
Theory -
Collect and prepare the dataset: In this case, we can use the "UTKFace" dataset, which contains images of faces with their corresponding gender and age labels. We need to preprocess the data by resizing the images to a uniform size, shuffling the dataset, and limiting the age to a certain value (such as 100 years).
Split the dataset: Split the dataset into training, validation, and testing sets. The usual split ratio is 80%,
10%, and 10%, respectively.
Define data generators: Define data generators for training, validation, and testing sets using the
"ImageDataGenerator" class in Keras. This class provides data augmentation techniques that can
improve the model's performance, such as rotation, zoom, and horizontal flip.
Define the neural network model: Define a convolutional neural network (CNN) model that takes the
face images as input and outputs two values - the probability of being male and the predicted age. The
model can have multiple convolutional and pooling layers followed by some dense layers.
Compile the model: Compile the model with appropriate loss and metrics for each output (gender and
age). In this case, we can use binary cross-entropy loss for gender and mean squared error (MSE) for
age.
Train the model: Train the model using the fit method of the model object. We need to pass the data
generators for the training and validation sets, as well as the number of epochs and batch size.
Evaluate the model: Evaluate the model's performance on the testing set using the evaluate method
of the model object. This will give us the accuracy and mean absolute error (MAE) of the model.
Predict the gender and age of a sample image: Load a sample image and preprocess it. We can use the
"cv2" library to read the image, resize it to the same size as the training images, and normalize it. Then,
we can use the "predict" method of the model object to get the predicted gender and age.
Source Code -
import tensorflow as tf
import numpy as np
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, Model

img_width = 128
img_height = 128
batch_size = 32
epochs = 10

# df_train is assumed to be a pandas DataFrame built from the UTKFace file names,
# with columns 'image_path', 'gender' (0 = male, 1 = female) and 'age'
# Define data generators for training, validation, and testing sets
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# class_mode='multi_output' yields [gender, age] targets matching the two model outputs
train_generator = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    x_col='image_path',
    y_col=['gender', 'age'],
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='multi_output')

# CNN with a shared trunk and two heads: gender (sigmoid) and age (linear regression)
inputs = layers.Input(shape=(img_height, img_width, 3))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
gender_output = layers.Dense(1, activation='sigmoid', name='gender')(x)
age_output = layers.Dense(1, activation='linear', name='age')(x)
model = Model(inputs, [gender_output, age_output])

model.compile(loss={'gender': 'binary_crossentropy', 'age': 'mse'},
              optimizer='adam',
              metrics={'gender': 'accuracy', 'age': 'mae'})
model.fit(train_generator, epochs=epochs)

# Predict the gender and age of a sample image (the file name is a placeholder)
img = cv2.cvtColor(cv2.imread('sample_face.jpg'), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (img_width, img_height)) / 255.0
gender_prob, age_pred = model.predict(np.expand_dims(img, axis=0))
print('Predicted gender:', 'female' if gender_prob[0][0] > 0.5 else 'male')
print('Predicted age:', float(age_pred[0][0]))
Mini Project : 3
Title - Implement Colorizing Old B&W Images: color old black and white images to colorful images
Project title: Colorizing Old B&W Images using CNN.
Theory - Colorizing black and white images to colorful images involves a complex process that requires expertise and specialized software. However, here are some general steps involved in the process:
Scan the black and white image: The first step is to scan the black and white image and convert it into a digital format.
Preprocess the image: The image needs to be preprocessed to remove any scratches, dust, or other defects. This can be done using image editing software like Photoshop or GIMP.
Convert the image to grayscale: The black and white image needs to be converted to grayscale. This can be done using image editing software or programming languages like Python.
Collect training data: The next step is to collect training data for the colorization model. This can include a dataset of colorful images with their corresponding grayscale versions.
Train the colorization model: A deep learning model can be trained to colorize grayscale images using a dataset of colorful images. This model can be trained using software like TensorFlow, PyTorch, or Keras.
Apply the colorization model to the black and white image: Once the model is trained, it can be applied to the black and white image to generate a colorized version. This can be done using programming languages like Python.
Refine the colorized image: The colorized image may need some manual refinement to ensure that the colors are accurate and the image looks natural. This can be done using image editing software like Photoshop or GIMP.
Save the final image: Once the image has been colorized and refined, it can be saved in a digital format for printing or online use.
Note that the quality of the colorized image will depend on the quality of the original black and white image, the accuracy of the colorization model, and the manual refinement process.
Source Code -
import tensorflow as tf
import numpy as np
import cv2

# Load the old photograph and convert it to grayscale (the file name is a placeholder)
img = cv2.imread('old_photo.jpg')
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray = cv2.resize(img_gray, (256, 256)) / 255.0

# Add new dimensions to the array to make it compatible with the input shape of the model
img_gray = np.expand_dims(img_gray, axis=-1)   # channel dimension
img_gray = np.expand_dims(img_gray, axis=0)    # batch dimension
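The listing above stops after preparing the grayscale input. A minimal sketch of the remaining step, assuming a small Keras encoder-decoder network trained on pairs of grayscale and color images as described in the theory above (the architecture, image size, and the gray_images/color_images training arrays are illustrative assumptions, not a definitive colorization model):
from tensorflow import keras
from tensorflow.keras import layers

# Small encoder-decoder: grayscale (256x256x1) in, RGB (256x256x3) out
model = keras.Sequential([
    layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(256, 256, 1)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same', strides=2),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.UpSampling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')  # RGB in [0, 1]
])
model.compile(optimizer='adam', loss='mse')

# Training would use pairs of grayscale inputs and their color originals, e.g.:
# model.fit(gray_images, color_images, epochs=20, batch_size=16)

# Colorize the prepared grayscale image and bring it back to an 8-bit RGB image
colorized = model.predict(img_gray)[0]
colorized = (colorized * 255).astype('uint8')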