0% found this document useful (0 votes)

3 views29 pages

Big Data Analytics Lab Manual(BE AI&DS)

The document is a lab manual for the Big Data Analytics Laboratory course at Sahyadri Valley College of Engineering and Technology. It provides detailed instructions for installing and configuring Apache Hadoop, running a MapReduce word count program, and developing a custom MapReduce program to calculate word frequency. The manual outlines prerequisites, procedures, and coding examples necessary for students to complete their practical assignments in the course.

Uploaded by

Lucifer

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views29 pages

Big Data Analytics Lab Manual(BE AI&DS)

Uploaded by

Lucifer

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 29

Maharia Charitable Trust's

SAHYADRI VALLEY COLLEGE OF ENGINEERING AND TECHNOLOGY

Accredited by NAAC with ‘B+’ Grade & NBA Approved by AICTE, New Delhi
Rajuri, Near Alephata, Kalyan - Nagar Highway, Taluka - Junnar, District – Pune 412411

ARTIFICIAL INTELLIGENCE & DATA SCIENCE

ENGINEERING DEPARTMENT

LAB MANUAL (2024-25)

SEM – VIII

Subject -Big Data Analytics Laboratory

417531(B)

Teaching Scheme: Examination Scheme:

Practical: 04Hrs/ Week Practical: 25 Marks
Credit: 02 Term Work: 50 Marks

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Set up and Configuration Hadoop Using CloudEra/ Google Cloud

EXP NO: 1 BigQuery. Databricks Lakehouse Platform. Snowflake. Amazon
Redshift.

AIM: To Install Apache Hadoop.

Hadoop software can be installed in three modes of

Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.

Hadoop-2.7.3 is comprised of four main layers:

➢ Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
➢ HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
data to disk.
➢ YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
➢ MapReduce is the original processing model for Hadoop clusters. It distributes work
within the cluster or map, then organizes and reduces the results from the nodes into a
response to a query. Many other processing models are available for the 2.x version of
Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.

Procedure:

we'll install Hadoop in stand-alone mode and run one of the example example MapReduce
programs it includes to verify the installation.

Prerequisites:

Step1: Installing Java 8 version.

Openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been successfully installed.
Note: To set the path for environment variables. i.e. JAVA_HOME

Step2: Installing Hadoop

With Java in place, we'll visit the Apache Hadoop Releases page to find the most
recent stable release. Follow the binary for the current release:

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Download Hadoop from www.hadoop.apache.org

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Procedure to Run Hadoop

1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS

If Apache Hadoop 2.2.0 is not already installed then follow the post Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS.

2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)

Run following commands.

Command Prompt
C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons

Namenode, Datanode, Resource Manager and Node Manager will be started in

few minutes and ready to execute Hadoop MapReduce job in the Single Node
(pseudo-distributed mode) cluster.

Resource Manager & Node Manager:

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Run wordcount MapReduce job

Now we'll run wordcount MapReduce job available

in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-
2.2.0.jar

Create a text file with some content. We'll pass this file as input to
the wordcount MapReduce job for counting words.
C:\file1.txt
Install Hadoop

Run Hadoop Wordcount Mapreduce Example

Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for
counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input

Copy the text file(say 'file1.txt') from local disk to the newly created 'input' directory in HDFS.

C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Check content of the copied file.

C:\hadoop>hdfs dfs -ls input

Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup 55 2014-02-03 13:19 input/file1.txt

C:\hadoop>bin\hdfs dfs -cat input/file1.txt

Install Hadoop
Run Hadoop Wordcount Mapreduce Example

Run the wordcount MapReduce job provided

in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar

C:\hadoop>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-

2.2.0.jar wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1391412385921_0002
14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application
application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:04 INFO mapreduce.Job: The url to track the job:
http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job: job_1391412385921_0002
14/02/03 13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running in
uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%
14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002 completed
successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=160142
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

HDFS: Number of bytes read=171

HDFS: Number of bytes written=59
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5657
Total time spent by all reduces in occupied slots (ms)=6128
Map-Reduce Framework
Map input records=2
Map output records=7
Map output bytes=82
Map output materialized bytes=89
Input split bytes=116
Combine input records=7
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=89
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=145
CPU time spent (ms)=1418
Physical memory (bytes) snapshot=368246784
Virtual memory (bytes) snapshot=513716224
Total committed heap usage (bytes)=307757056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=55
File Output Format Counters

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Bytes Written=59
http://abhijitg:8088/cluster

Result: We've installed Hadoop in stand-alone mode and verified it by running an

example program it provided.

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

EXP NO: 2 Develop a MapReduce program to calculate the frequency of a

given word in a given file.

AIM: To Develop a MapReduce program to calculate the frequency of a given word in a given file.

Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Convert into another set of data
(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples
(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Work Flow of Program

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Workflow of MapReduce consists of 5 steps

1. Splitting – The splitting parameter can be anything, e.g. splitting by space,
comma, semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process in parallel on different clusters. In order
to group them in “Reduce Phase” the similar KEY data should be on same cluster.
4. Reduce – it is nothing but mostly group by phase
5. Combining – The last phase where all the data (individual result set from each
cluster) is combine together to form a Result

Now Let’s See the Word Count Program in Java

Make sure that Hadoop is installed on your system with java idk

Steps to follow

Step 1. Open Eclipse> File > New > Java Project > (Name it – MRProgramsDemo) >
Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish
Step 3. Right Click on Package > New > Class (Name it - WordCount)
Step 4. Add Following Reference Libraries –

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Right Click on Project > Build Path> Add External Archivals

• /usr/lib/hadoop-0.20/hadoop-core.jar
• Usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar

Program: Step 5. Type following Program :

package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = value.toString();

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException,
InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}

Make Jar File

Right Click on Project> Export> Select export destination as Jar File > next> Finish

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

To Move this into Hadoop directly, open the terminal and enter the following
commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile

Run Jar file

(Hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile
PathToOutputDirectry)

[training@localhost ~]$ Hadoop jar MRProgramsDemo.jar

PackageDemo.WordCount wordCountFile MRDir1

Result: Open Result

[training@localhost ~]$ hadoop fs -ls MRDir1

Found 3 items
-rw-r--r-- 1 training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup
20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

EXP NO: 3 Develop a MapReduce program to find the grades of

students.

AIM: To Develop a MapReduce program to find the grades of student’s.

import java.util.Scanner;
public class JavaExample
{
public static void main(String args[])
{
/* This program assumes that the student has 6 subjects,
* thats why I have created the array of size 6. You can
* change this as per the requirement.
*/
int marks[] = new int[6];
int i;
float total=0, avg;
Scanner scanner = new Scanner(System.in);
for(i=0; i<6; i++) {
System.out.print("Enter Marks of Subject"+(i+1)+":");
marks[i] = scanner.nextInt();
total = total + marks[i];
}
scanner.close();
//Calculating average
here avg = total/6;
System.out.print("The student Grade is: ");
if(avg>=80)
{
System.out.print("A");
}
else if(avg>=60 && avg<80)
{
System.out.print("B");
}
else if(avg>=40 && avg<60)
{

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

System.out.print("C");
}
else
{

System.out.print("D");
}
}
}

Output:

Enter Marks of Subject1:40

Enter Marks of Subject2:80
Enter Marks of Subject3:80
Enter Marks of Subject4:40
Enter Marks of Subject5:60
Enter Marks of Subject6:60
The student Grade is: B

Result:
In this assignment, we successfully developed a MapReduce program to calculate the grades of
students based on their subject-wise marks. By leveraging the power of the MapReduce paradigm, we
efficiently processed and analyzed large volumes of student data in a distributed environment.

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

EXP NO: 4 To Develop a MapReduce program to implement Matrix

Multiplication.

AIM: To Develop a MapReduce program to implement Matrix Multiplication.

In mathematics, matrix multiplication or the matrix product is a binary operation that produces
a matrix from two matrices. The definition is motivated by linear equations and linear
transformations on vectors, which have numerous applications in applied mathematics, physics,
and engineering. In more detail, if A is an n × m matrix and B is an m × p matrix, their matrix
product AB is an n × p matrix, in which the m entries across a row of A are multiplied with the
m entries down a column of B and summed to produce an entry of AB. When two linear
transformations are represented by matrices, then the matrix product represents the composition
of the two transformations.

Algorithm for Map Function.

a. for each element mij of M do

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

produce (key,value) pairs as ((i,k), (M,j,mij), for k=1,2,3,.. upto the number of
columns of N
b. for each element njk of N do
produce (key,value) pairs as ((i,k),(N,j,Njk), for i = 1,2,3,.. Upto the number of
rows of M.
c. return Set of (key,value) pairs that each key (i,k), has list with values
(M,j,mij) and (N, j,njk) for all possible values of j.
Algorithm for Reduce Function.
d. for each key (i,k) do
e. sort values begin with M by j in listM sort values begin with N by j in listN
multiply mij and njk for jth value of each list
f. sum up mij x njk return (i,k), Σj=1 mij x njk

Step 1. Download the hadoop jar files with these links.

Download Hadoop Common Jar files: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop Mapreduce Jar File: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar

Step 2. Creating Mapper file for Matrix Multiplication.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;

class Element implements Writable {

int tag;
int index;
double value;
Element() {
tag = 0;
index = 0;
value = 0.0;
}
Element(int tag, int index, double value) {
this.tag = tag;
this.index = index;
this.value = value;
}
@Override
public void readFields(DataInput input) throws IOException {
tag = input.readInt();
index = input.readInt();
value = input.readDouble();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(tag);
output.writeInt(index);
output.writeDouble(value);
}
}
class Pair implements WritableComparable<Pair> {
int i;
int j;

Pair() {
i = 0;

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

j = 0;
}
Pair(int i, int j) {
this.i = i;
this.j = j;
}
@Override
public void readFields(DataInput input) throws IOException {
i = input.readInt();
j = input.readInt();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(i);
output.writeInt(j);
}
@Override
public int compareTo(Pair compare) {
if (i > compare.i) {
return 1;
} else if ( i < compare.i) {
return -1;
} else {
if(j > compare.j) {
return 1;
} else if (j < compare.j) {
return -1;
}
}
return 0;
}
public String toString() {
return i + " " + j + " ";
}
}
public class Multiply {
public static class MatriceMapperM extends Mapper<Object,Text,IntWritable,Element>
{

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] stringTokens = readLine.split(",");

int index = Integer.parseInt(stringTokens[0]);

double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(0, index, elementValue);
IntWritable keyValue = new
IntWritable(Integer.parseInt(stringTokens[1]));
context.write(keyValue, e);
}
}
public static class MatriceMapperN extends Mapper<Object,Text,IntWritable,Element> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] stringTokens = readLine.split(",");
int index = Integer.parseInt(stringTokens[1]);
double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(1,index, elementValue);
IntWritable keyValue = new
IntWritable(Integer.parseInt(stringTokens[0]));
context.write(keyValue, e);
}
}
public static class ReducerMxN extends Reducer<IntWritable,Element, Pair,
DoubleWritable> {
@Override
public void reduce(IntWritable key, Iterable<Element> values, Context context) throws
IOException, InterruptedException {
ArrayList<Element> M = new ArrayList<Element>();
ArrayList<Element> N = new ArrayList<Element>();
Configuration conf = context.getConfiguration();
for(Element element : values) {
Element tempElement = ReflectionUtils.newInstance(Element.class,
conf);

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

ReflectionUtils.copy(conf, element, tempElement);

if (tempElement.tag == 0) {
M.add(tempElement);
} else if(tempElement.tag == 1) {
N.add(tempElement);
}
}
for(int i=0;i<M.size();i++) {
for(int j=0;j<N.size();j++) {

Pair p = new Pair(M.get(i).index,N.get(j).index);

double multiplyOutput = M.get(i).value * N.get(j).value;

context.write(p, new DoubleWritable(multiplyOutput));

}
}
}
}
public static class MapMxN extends Mapper<Object, Text, Pair, DoubleWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] pairValue = readLine.split(" ");
Pair p = new
Pair(Integer.parseInt(pairValue[0]),Integer.parseInt(pairValue[1]));
DoubleWritable val = new
DoubleWritable(Double.parseDouble(pairValue[2]));
context.write(p, val);
}
}
public static class ReduceMxN extends Reducer<Pair, DoubleWritable, Pair,
DoubleWritable> {
@Override
public void reduce(Pair key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException {
double sum = 0.0;
for(DoubleWritable value : values) {

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

sum += value.get();
}
context.write(key, new DoubleWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJobName("MapIntermediate");
job.setJarByClass(Project1.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
MatriceMapperM.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class,
MatriceMapperN.class);
job.setReducerClass(ReducerMxN.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Element.class);
job.setOutputKeyClass(Pair.class);
job.setOutputValueClass(DoubleWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
Job job2 = Job.getInstance();
job2.setJobName("MapFinalOutput");
job2.setJarByClass(Project1.class);

job2.setMapperClass(MapMxN.class);
job2.setReducerClass(ReduceMxN.class);

job2.setMapOutputKeyClass(Pair.class);
job2.setMapOutputValueClass(DoubleWritable.class);

job2.setOutputKeyClass(Pair.class);
job2.setOutputValueClass(DoubleWritable.class);

job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.setInputPaths(job2, new Path(args[2]));

FileOutputFormat.setOutputPath(job2, new Path(args[3]));

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

job2.waitForCompletion(true);
}
}

Step 5. Compiling the program in particular folder named as operation

#!/bin/bash

rm -rf multiply.jar classes

module load hadoop/2.6.0

mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .

echo "end"

Step 6. Running the program in particular folder named as operation

export HADOOP_CONF_DIR=/home/$USER/cometcluster
module load hadoop/2.6.0
myhadoop-configure.sh
start-dfs.sh
start-yarn.sh

hdfs dfs -mkdir -p /user/$USER

hdfs dfs -put M-matrix-large.txt /user/$USER/M-matrix-large.txt
hdfs dfs -put N-matrix-large.txt /user/$USER/N-matrix-large.txt
hadoop jar multiply.jar edu.uta.cse6331.Multiply /user/$USER/M-matrix-large.txt
/user/$USER/N-matrix-large.txt /user/$USER/intermediate /user/$USER/output
rm -rf output-distr
mkdir output-distr
hdfs dfs -get /user/$USER/output/part* output-distr

stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Output:

module load hadoop/2.6.0

rm -rf output intermediate

hadoop --config $HOME jar multiply.jar edu.uta.cse6331.Multiply M-matrix-small.txt N-matrix-

small.txt intermediate output

Result:
In this assignment, we successfully developed a MapReduce program to perform matrix
multiplication using distributed processing. The implementation demonstrated how large-scale matrix
operations can be broken down into smaller, parallelizable tasks using the Map and Reduce functions.
By implementing matrix multiplication in a distributed environment, we learned how to:
• Structure complex computations in a key-value format.
• Handle data partitioning and combination in a parallelized manner.
• Optimize performance for large datasets using the MapReduce paradigm.

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Develop a MapReduce program to analyze Titanic ship data and to find the
EXP
average age of the people (both male and female) who died in the tragedy. How
NO:5
many persons are survived in each class.

AIM: Develop a MapReduce program to analyze Titanic ship data and to find the average
age of the people (both male and female) who died in the tragedy. How many persons are
survived in each class.

The titanic data will be..

Column 1 :PassengerI d Column 2 : Survived (survived=0 &died=1)

Column 3 :Pclass Column 4 : Name
Column 5 : Sex Column 6 : Age
Column 7 :SibSp Column 8 :Parch
Column 9 : Ticket Column 10 : Fare
Column 11 :Cabin Column 12 : Embarked

Description:
There have been huge disasters in the history of Map reduce, but the magnitude of the Titanic’s
disaster ranks as high as the depth it sank too. So much so that subsequent disasters have always
been described as “titanic in proportion” – implying huge losses.
Anyone who has read about the Titanic, know that a perfect combination of natural events and
human errors led to the sinking of the Titanic on its fateful maiden journey from Southampton to
New York on April 14, 1912.
There have been several questions put forward to understand the cause/s of the tragedy – foremost
among them is: What made it sink and even more intriguing How can a 46,000 ton ship sink to
the depth of 13,000 feet in a matter of 3 hours? This is a mind boggling question indeed!
There have been as many inquisitions as there have been questions raised and equally that many
types of analysis methods applied to arrive at conclusions. But this blog is not about analyzing
why or what made the Titanic sink – it is about analyzing the data that is present about the
Titanic publicly. It actually uses Hadoop MapReduce to analyze and arrive at:
• The average age of the people (both male and female) who died in the tragedy
using Hadoop MapReduce.
• How many persons survived – traveling class wise.
This blog is about analyzing the data of Titanic. This total analysis is performed in Hadoop
MapReduce.

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

This Titanic data is publically available and the Titanic data set is described below under
the heading Data Set Description.

Using that dataset we will perform some Analysis and will draw out some insights
like finding the average age of male and females died in Titanic, Number of males and
females died in each compartment.

DATA SET DESCRIPTION

Column 1 : PassengerI
Column 2 : Survived (survived=0 & died=1) Column 3 : Pclass
Column 4 : Name
Column 5 : Sex
Column 6 : Age
Column 7 : SibSp
Column 8 : Parch
Column 9 : Ticket
Column 10 : Fare
Column 11 : Cabin
Column 12 : Embarked

Mapper code:
public class Average_age {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

private Text gender = new Text();

private IntWritable age = new IntWritable();

public void map(LongWritable key, Text value, Context context ) throws

IOException, InterruptedException {

String line = value.toString();

String str[]=line.split(",");

if(str.length>6){

gender.set(str[4]);

if((str[1].equals("0")) ){

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

if(str[5].matches("\\d+")){

int i=Integer.parseInt(str[5]);

age.set(i);
}
}
}
context.write(gender, age)
}
}

Reducer Code:
public static class Reduce extends Reducer<Text,IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
int l=0;
for (IntWritable val : values) {
l+=1;
sum += val.get();
}
sum=sum/l;
context.write(key, new IntWritable(sum));
}
}

Configuration Code:

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(IntWritable.class);

https://github.com/kiran0541/Map-
Reduce/blob/master/Average%20age%20of%20male%20and%20female%20
people%20died%20in%20titanic

Subject Incharge: Prof. Gaikwad Shraddha B.

Big Data Analytics Laboratory 417531(B)

Way to to execute the Jar file to get the result of the first problem statement:

hadoop jar average.jar /TitanicData.txt /avg_out

Here ‘hadoop’ specifies we are running a Hadoop command and jar specifies
which type of application we are running and average.jar is the jar file which we
have created which consists the above source code and the path of the Input file
name in our case it is TitanicData.txt and the output file where to store the output
here we have given it as avg out.

Way to view the output:

hadoop dfs –cat /avg_out/part-r-00000

Here ‘hadoop’ specifies that we are running a Hadoop command
and ‘dfs‘ specifies that we are performing an operation related to Hadoop
Distributed File System and ‘- cat’ is used to view the contents of a file and
‘avg_out/part-r-00000’ is the file where the output is stored. Part file is
created by default by the TextInputFormat class of Hadoop.

Result:
In this assignment, we developed a MapReduce program to analyze the Titanic dataset and extract
valuable insights regarding the victims and survivors of the tragedy. Specifically, the program was designed
to:
• Calculate the average age of deceased passengers, categorized by gender (male and female).
• Count the number of survivors in each passenger class (1st, 2nd, and 3rd class).

Subject Incharge: Prof. Gaikwad Shraddha B.

CC CCS335 Unit 3 Key
No ratings yet
CC CCS335 Unit 3 Key
4 pages
BDA Manual
No ratings yet
BDA Manual
41 pages
bda lab s
No ratings yet
bda lab s
92 pages
BDA Journal
No ratings yet
BDA Journal
52 pages
BDA record
No ratings yet
BDA record
58 pages
Java report Group no.11.pdf
No ratings yet
Java report Group no.11.pdf
27 pages
ICT YEAR 5
No ratings yet
ICT YEAR 5
5 pages
CCS334 BDA Lab Manual
No ratings yet
CCS334 BDA Lab Manual
35 pages
Lab 7. EIGRP & OSPF Routing Protocol
No ratings yet
Lab 7. EIGRP & OSPF Routing Protocol
2 pages
Micro Controller Assignment No.1
No ratings yet
Micro Controller Assignment No.1
2 pages
Grade8 - Q1-4 - W2 PCO Part 1
No ratings yet
Grade8 - Q1-4 - W2 PCO Part 1
33 pages
SENFinal-19IF264
No ratings yet
SENFinal-19IF264
15 pages
SEN Final Reoprt
No ratings yet
SEN Final Reoprt
14 pages
Java Programming
No ratings yet
Java Programming
9 pages
Big Data Lab Manual
No ratings yet
Big Data Lab Manual
32 pages
Big Dataa-Lab-Manual
No ratings yet
Big Dataa-Lab-Manual
24 pages
Big Data Akshat
No ratings yet
Big Data Akshat
57 pages
Quickspecs: HP 240 G7 Notebook PC
No ratings yet
Quickspecs: HP 240 G7 Notebook PC
39 pages
Practical-1: Aim:-Make A Single Node Cluster in Hadoop. Solution
No ratings yet
Practical-1: Aim:-Make A Single Node Cluster in Hadoop. Solution
49 pages
Ajp Report Final
No ratings yet
Ajp Report Final
8 pages
2020300053_BDA_EXP1_CHINMAY
No ratings yet
2020300053_BDA_EXP1_CHINMAY
13 pages
ACN report final
No ratings yet
ACN report final
7 pages
CI ASssignments
No ratings yet
CI ASssignments
6 pages
bda 1
No ratings yet
bda 1
6 pages
PSI5 Specification v2.1 08-2012 PDF
No ratings yet
PSI5 Specification v2.1 08-2012 PDF
62 pages
Assignment of BDA ,BE,aids
No ratings yet
Assignment of BDA ,BE,aids
2 pages
Big Data Analytics Lab
No ratings yet
Big Data Analytics Lab
18 pages
Cloud PDF
No ratings yet
Cloud PDF
47 pages
9.3.1.2 Lab - Configure ASA Basic Settings and Firewall Using CLI PDF
50% (2)
9.3.1.2 Lab - Configure ASA Basic Settings and Firewall Using CLI PDF
26 pages
ASPNET Web Forms Internals
No ratings yet
ASPNET Web Forms Internals
56 pages
OSterm PAPER
No ratings yet
OSterm PAPER
5 pages
bda megh
No ratings yet
bda megh
50 pages
BDA Lab Manual
No ratings yet
BDA Lab Manual
34 pages
Big Data & Analytics Lab Manual
No ratings yet
Big Data & Analytics Lab Manual
51 pages
Tutorial-Counting Words in File (S) Using Mapreduce: Prerequisites
No ratings yet
Tutorial-Counting Words in File (S) Using Mapreduce: Prerequisites
11 pages
Big Data Analysis 3170722 Lab Manual
No ratings yet
Big Data Analysis 3170722 Lab Manual
68 pages
VIZIMAX - VIZIMAX Unified Communication Services
No ratings yet
VIZIMAX - VIZIMAX Unified Communication Services
2 pages
BDA practical (1)
No ratings yet
BDA practical (1)
18 pages
Bda Lab Manual 2024
No ratings yet
Bda Lab Manual 2024
45 pages
bda lab
No ratings yet
bda lab
4 pages
Big data analytics
No ratings yet
Big data analytics
50 pages
Hands-On Exercises With Big Data: Lab Sheet 1: Getting Started With Mapreduce and Hadoop
No ratings yet
Hands-On Exercises With Big Data: Lab Sheet 1: Getting Started With Mapreduce and Hadoop
14 pages
PIC LAB Manual
No ratings yet
PIC LAB Manual
13 pages
Hadoop Spark
No ratings yet
Hadoop Spark
34 pages
BDF Programs
No ratings yet
BDF Programs
32 pages
Error Detection and Correction.1
No ratings yet
Error Detection and Correction.1
31 pages
BDA Lab Manual_organized (2) (1) - Copy
No ratings yet
BDA Lab Manual_organized (2) (1) - Copy
69 pages
Installation Kodak I100
No ratings yet
Installation Kodak I100
13 pages
CS702_Big_Data_Programs
No ratings yet
CS702_Big_Data_Programs
58 pages
BDA Lab 8 Manual
No ratings yet
BDA Lab 8 Manual
7 pages
02-Wordcount Mapreduce
No ratings yet
02-Wordcount Mapreduce
5 pages
CS702 Big Data Programs
No ratings yet
CS702 Big Data Programs
59 pages
SCADAPack 535E Hardware Manual
No ratings yet
SCADAPack 535E Hardware Manual
146 pages
Developing A Simple Map-Reduce Program For Hadoop: Big Data Course CS6350 Professor: Dr. Latifur Khan
No ratings yet
Developing A Simple Map-Reduce Program For Hadoop: Big Data Course CS6350 Professor: Dr. Latifur Khan
22 pages
2302IT451 OS lab manual
No ratings yet
2302IT451 OS lab manual
68 pages
Firewall Exploration Lab: 2.1 Container Setup and Commands
No ratings yet
Firewall Exploration Lab: 2.1 Container Setup and Commands
14 pages
Hadoop and MR Programming: DR G Sudha Sadasivam Professor Cse, PSGCT
No ratings yet
Hadoop and MR Programming: DR G Sudha Sadasivam Professor Cse, PSGCT
71 pages
Example - (Map Function in Word Count)
No ratings yet
Example - (Map Function in Word Count)
6 pages
Big Data Lab Manual Printout Copy
No ratings yet
Big Data Lab Manual Printout Copy
51 pages
CSE488 Lab01
No ratings yet
CSE488 Lab01
6 pages
BIG data file
No ratings yet
BIG data file
28 pages
Learning DTrace Part4
No ratings yet
Learning DTrace Part4
13 pages
CS-702 (D) BigData
No ratings yet
CS-702 (D) BigData
61 pages
Cp5261 Da Lab Me-Cse 2021 - Edit
No ratings yet
Cp5261 Da Lab Me-Cse 2021 - Edit
88 pages
Workstation Z1 G9
No ratings yet
Workstation Z1 G9
58 pages
big data
No ratings yet
big data
28 pages
MapReduce Merged
No ratings yet
MapReduce Merged
18 pages
BDT Lab Manual
No ratings yet
BDT Lab Manual
48 pages
Flatbed Printer13
No ratings yet
Flatbed Printer13
4 pages
@bigdatalabfile 09
No ratings yet
@bigdatalabfile 09
35 pages
Prerequisites: Single Node Setup Cluster Setup
No ratings yet
Prerequisites: Single Node Setup Cluster Setup
5 pages
Unit IV Programming Model
No ratings yet
Unit IV Programming Model
30 pages
Comparison of Windows and Linux
No ratings yet
Comparison of Windows and Linux
35 pages
MI Practical Assignment Index
No ratings yet
MI Practical Assignment Index
3 pages
Bda Lab
No ratings yet
Bda Lab
94 pages
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
100% (1)
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
16 pages
BDA Practicalfile
No ratings yet
BDA Practicalfile
19 pages
Big Data Manual
No ratings yet
Big Data Manual
82 pages
Ba Lab Record-It b2022-26
No ratings yet
Ba Lab Record-It b2022-26
43 pages
File Systems in Linux and Freebsd: A Comparative Study: Kuo-Pao Yang, Katie Wallace
No ratings yet
File Systems in Linux and Freebsd: A Comparative Study: Kuo-Pao Yang, Katie Wallace
4 pages
Hadoop Administrator Training - Lab Hand Book
No ratings yet
Hadoop Administrator Training - Lab Hand Book
12 pages
Bigdata Lab
No ratings yet
Bigdata Lab
55 pages
Data Science
No ratings yet
Data Science
82 pages
Notes
No ratings yet
Notes
53 pages
How To Design A System To Scale To Your First 100 Million Users - by Anh T. Dang - Level Up Coding
No ratings yet
How To Design A System To Scale To Your First 100 Million Users - by Anh T. Dang - Level Up Coding
34 pages
Big Data Analytics IT
No ratings yet
Big Data Analytics IT
55 pages
Calling RFC From BODS
No ratings yet
Calling RFC From BODS
10 pages
Word Count Program With MapReduce and Java
No ratings yet
Word Count Program With MapReduce and Java
6 pages
CSEC It Jan 2004 Answers
No ratings yet
CSEC It Jan 2004 Answers
3 pages
Lab Manual Big Data Analytics Lab (LC-CSE-410G) : Department of Computer Science and Engineering
No ratings yet
Lab Manual Big Data Analytics Lab (LC-CSE-410G) : Department of Computer Science and Engineering
28 pages
Lesson Plan New Parts of Computer Walkthrough
No ratings yet
Lesson Plan New Parts of Computer Walkthrough
4 pages
Word Count Program With MapReduce and Java
No ratings yet
Word Count Program With MapReduce and Java
6 pages
16.1 Purpose of An Operating System (MT-L) PDF
No ratings yet
16.1 Purpose of An Operating System (MT-L) PDF
10 pages
Alcatel-Lucent Services Lab Guide v3-1
No ratings yet
Alcatel-Lucent Services Lab Guide v3-1
128 pages
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
From Everand
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
Wei Liu
No ratings yet
20697-2B 00
No ratings yet
20697-2B 00
19 pages
Professional Hadoop Solutions
From Everand
Professional Hadoop Solutions
Boris Lublinsky
4/5 (2)
Learn Hive in 24 Hours
From Everand
Learn Hive in 24 Hours
Alex Nordeen
No ratings yet