Big Data - ASSIGNMENT 2
Step 1)
Create a new directory named MapReduceTutorial.
Give permissions to the directory.
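For example, in a Linux shell (777 permissions are used here only to keep the tutorial simple):
sudo mkdir MapReduceTutorial
sudo chmod -R 777 MapReduceTutorial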
SalesMapper.java
package SalesCountry;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String valueString = value.toString();
        // Country sits at index 7 of the comma-separated record (see the input format noted below)
        String[] SingleCountryData = valueString.split(",");
        output.collect(new Text(SingleCountryData[7]), one);
    }
}
SalesCountryReducer.java
package SalesCountry;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // Sum up all occurrences of this country
        while (values.hasNext()) {
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}
SalesCountryDriver.java
package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {
    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        // Create a job configuration object and give the job a name
        JobConf job_conf = new JobConf(SalesCountryDriver.class);
        job_conf.setJobName("SalePerCountry");
        // Data types of output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);
        // Advertise Mapper and Reducer classes
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);
        // Input and output directories, taken from command-line arguments
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Step 2)
Export classpath
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:~/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"
Step 3)
Compile the Java files (these files are present in directory Final-MapReduceHandsOn).
Their class files will be put in the package directory.
This compilation will create a directory in the current directory named after the
package specified in the Java source files (i.e., SalesCountry in our case) and put all
compiled class files in it.
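For example, with the CLASSPATH from Step 2 exported:
javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java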
Step 4)
Create a new file Manifest.txt with the following content:
Main-Class: SalesCountry.SalesCountryDriver
Step 5)
Create a Jar file
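One possible command, packaging the classes compiled in Step 3 (the jar name ProductSalePerCountry.jar is our choice; any name works):
jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class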
Step 6)
Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 7)
Copy the File SalesJan2009.csv into ~/inputMapReduce
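For example, assuming SalesJan2009.csv has been downloaded to the home directory, copy it into a local folder and then load that folder into HDFS:
mkdir ~/inputMapReduce
cp SalesJan2009.csv ~/inputMapReduce/
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /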
Step 8)
Run MapReduce job
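For example, using the jar from Step 5 (the output directory /mapreduce_output_sales is our choice and must not already exist in HDFS):
$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales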
Step 9)
The result can be seen through command interface as,
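$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000
(here /mapreduce_output_sales is the output directory chosen in Step 8, and part-00000 is the default name of the first reducer output file)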
Please note that our input data is in the below format (where Country is at the 7th
index, with 0 as the starting index):
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
Here, the first two data types, 'Text' and 'IntWritable', are the data types of the
input key-value pair to the reducer.
The last two data types, 'Text' and 'IntWritable', are the data types of the output
generated by the reducer in the form of a key-value pair.
For a given country, the mapper emits pairs of the form:
<United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>.
So, to accept arguments of this form, the first two data types are used,
viz., Text and Iterator<IntWritable>. Text is the data type of the key
and Iterator<IntWritable> is the data type for the list of values for that key.
Then, using a 'while' loop, we iterate through the list of values associated with the key
and calculate the final frequency by summing up all the values.
while (values.hasNext()) {
    // replace type of value with the actual type of our value
    IntWritable value = (IntWritable) values.next();
    frequencyForCountry += value.get();
}
Now, we push the result to the output collector in the form of the key and the
obtained frequency count.
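output.collect(key, new IntWritable(frequencyForCountry));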
1. Here is a line specifying the package name, followed by code to import library packages.
2. Define a driver class which will create a new client job and configuration object and
advertise the Mapper and Reducer classes.
The driver class is responsible for setting up our MapReduce job to run in Hadoop. In
this class, we specify the job name, the data types of input/output, and the names of
the mapper and reducer classes.
3. In the below code snippet, we set the input and output directories, which are used to
consume the input dataset and produce the output, respectively, and then run the job.
FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
try {
    // Run the job
    JobClient.runJob(job_conf);
} catch (Exception e) {
    e.printStackTrace();
}