Hadoop Wordcount Program

This document provides an overview of the MapReduce paradigm and how it is implemented in Hadoop. It describes the basic data flow and components of a MapReduce job, including mappers that process input key-value pairs, reducers that aggregate output from mappers, and a driver program that coordinates execution. Mappers and reducers are specified as classes that implement map and reduce functions. The document uses word count as a running example to illustrate how text data can be processed to count word frequencies using MapReduce. It also briefly discusses chaining multiple MapReduce jobs together to solve more complex problems.


Hadoop: MapReduce

Typical problem solved by MapReduce

• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results

Outline stays the same; Map and Reduce change to fit the problem


Data Flow

1. Mappers read from HDFS
2. Map output is partitioned by key and sent to Reducers
3. Reducers sort input by key
4. Reduce output is written to HDFS

(Diagram: detailed Hadoop MapReduce data flow)
MapReduce Paradigm

Basic data type: the key-value pair (k, v).
For example, key = URL, value = HTML of the web page.

Programmer specifies two primary methods:

• Map: (k, v) ↦ <(k'1, v'1), (k'2, v'2), (k'3, v'3), …, (k'n, v'n)>
• Reduce: (k', <v'1, v'2, …, v'n'>) ↦ <(k', v''1), (k', v''2), …, (k', v''n'')>

All v' with the same k' are reduced together. (Remember the invisible "Shuffle and Sort" step.)


MapReduce Paradigm
● Shuffle
  Input to the Reducer is the grouped output of the Mappers. In this phase the framework fetches, for each Reducer, the relevant partition of the output of all the Mappers via HTTP.
● Sort
  The framework groups Reducer inputs by key in this stage (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously, i.e. while map outputs are being fetched they are merged.

The Hadoop framework handles the Shuffle and Sort step.


MapReduce Paradigm: wordcount flow

Key/value Pairs:
(file offset, line) → Map → (word, 1) → Reduce → (word, n)
The file offset is the position of the line within the file.
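
For example, with the (made-up) input line "to be or not to be", the pairs at each stage would be:

Input to Map:        (0, "to be or not to be")
Map output:          ("to",1) ("be",1) ("or",1) ("not",1) ("to",1) ("be",1)
After Shuffle/Sort:  ("be",<1,1>) ("not",<1>) ("or",<1>) ("to",<1,1>)
Reduce output:       ("be",2) ("not",1) ("or",1) ("to",2)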
WordCount in Web Pages

Input: files with one document per record

Specify a map function that takes a key/value pair


key = document URL, value = document contents

Output of map function is (potentially many) key/value pairs.


In our case, output (word, “1”) once per word in the document
WordCount in Web Pages
MapReduce library gathers together all pairs with the same key
(shuffle/sort)
Specify a reduce function that combines the values for a key
In our case, compute the sum

key = "be"   values = "1", "1"  →  sum = 2
key = "not"  values = "1"       →  sum = 1
key = "or"   values = "1"       →  sum = 1
key = "to"   values = "1", "1"  →  sum = 2

Output of reduce (usually 0 or 1 value) is paired with its key and saved
MapReduce Program

A MapReduce program consists of the following 3 parts:

● Driver → main() (triggers the map and reduce methods)
● Mapper
● Reducer

It is better to include the map, reduce and main methods in 3 different classes. A minimal skeleton of this structure is sketched below.
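
For orientation, here is a minimal sketch of how the three parts fit together in a single source file, as the later slides assume. It uses the old org.apache.hadoop.mapred API; the class name WordCount is an assumption consistent with the JobConf(WordCount.class) call in the driver shown later.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    //Mapper: detailed on the following slides
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // ... emit (word, 1) pairs ...
        }
    }

    //Reducer: detailed on the following slides
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            // ... sum the 1s and emit (word, count) ...
        }
    }

    //Driver: detailed on the following slides
    public static void main(String[] args) throws Exception {
        // ... configure a JobConf and submit the job ...
    }
}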
Mapper
public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Mapper
//Map class header
public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable>

● extends MapReduceBase: base class for Mapper and Reducer implementations. It provides default no-op implementations for a few methods; most non-trivial applications need to override some of them.

● implements Mapper: the interface Mapper<K1,V1,K2,V2> declares the map method.
◄ <K1,V1,K2,V2>: the first pair is the input key/value type, the second is the output key/value type.
◄ <LongWritable, Text, Text, IntWritable>: “The data types provided here are Hadoop-specific data types designed for operational efficiency, suited for massively parallel and lightning-fast read/write operations. All these data types are based on the Java data types themselves; for example, LongWritable is the equivalent of long in Java, IntWritable for int and Text for String.”
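
For illustration, the Writable wrappers convert to and from the plain Java types like this:

IntWritable one = new IntWritable(1);
int i = one.get();             // IntWritable -> int
LongWritable offset = new LongWritable(42L);
long l = offset.get();         // LongWritable -> long
Text word = new Text("hadoop");
String s = word.toString();    // Text -> String
word.set("mapreduce");         // String -> Text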
Mapper
//hadoop supported data types for the key/value pairs
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

//Map method header
public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

● LongWritable key, Text value: data types of the input key and value to the mapper.

● OutputCollector<Text, IntWritable> output: the output collector <K,V> “collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job”. K and V are the output key and value types of the mapper, e.g. <”the”, 1>.

● Reporter reporter: the reporter is used to report the task status internally in the Hadoop environment, to avoid timeouts (a small example follows).
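
For example, a long-running map task can periodically tell the framework that it is still alive (the status text below is just an illustration):

//inside the while loop of map():
reporter.progress();                      // signal liveness to avoid task timeouts
reporter.setStatus("processing " + key);  // optional human-readable status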
Mapper
public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

    //Convert the input line from Text to a String
    String line = value.toString();
    //Use a tokenizer to split the line into words
    StringTokenizer tokenizer = new StringTokenizer(line);
    //Iterate through each word and form key/value pairs
    while (tokenizer.hasMoreTokens()) {
        //Assign each word from the tokenizer (a String) to the Text 'word'
        word.set(tokenizer.nextToken());
        //Form a key/value pair for each word as <word, one> and push
        //it to the output collector
        output.collect(word, one);
    }
}
Reducer
//Class header similar to the one in Map
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    //Reduce method header similar to the one in map, with different
    //key/value data types. The data from map will be <"word",{1,1,..}>,
    //so the values arrive through an Iterator and we can go through
    //the set of values for each key.
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {

        //Initialize a variable 'sum' as 0
        int sum = 0;
        //Iterate through all the values with respect to a key and
        //sum them up
        while (values.hasNext()) {
            sum += values.next().get();
        }
        //Emit the final <word, count> pair for this key
        output.collect(key, new IntWritable(sum));
    }
}
Main (The Driver)
● Given the Mapper and Reducer code, a short main() starts the MapReduce job running.

● The Hadoop system picks up a number of values from the command line on its own.

● The main() then specifies a few key parameters of the problem in the JobConf object.

● JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution (such as which Map and Reduce classes to use and the formats of the input and output files).

● Other parameters, e.g. the number of machines to use, are optional and the system will determine good values for them if not specified.

● The framework then tries to faithfully execute the job as described by JobConf.
Main
public static void main(String[] args) throws Exception {

    //Create a JobConf object and assign a job name for identification purposes
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    //Set the configuration object with the data types of the output key and value
    //for map and reduce; if the map output types differ, JobConf provides
    //setMapOutputKeyClass/setMapOutputValueClass for them
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    //Provide the mapper and reducer class names
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); //set the Combiner class
    conf.setReducerClass(Reduce.class);

    //The default input format, TextInputFormat, will load data in as
    //(LongWritable, Text) pairs. The long value is the byte offset of the line
    //in the file.
    conf.setInputFormat(TextInputFormat.class);

    //The basic (default) instance is TextOutputFormat, which writes (key, value)
    //pairs on individual lines of a text file.
    conf.setOutputFormat(TextOutputFormat.class);

    //The HDFS input and output directories are fetched from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    //Submit the job to MapReduce; this returns only after the job has completed
    JobClient.runJob(conf);
}
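
With the classes compiled and packaged into a jar (the jar name and HDFS paths below are illustrative, not from the slides), the job would typically be launched as:

hadoop jar wordcount.jar WordCount /input/books /output/wordcount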
Chaining Jobs
● Not every problem can be solved with a MapReduce program, but fewer still are those which can be solved with a single MapReduce job.

● Many problems can be solved with MapReduce by writing several MapReduce steps which run in series to accomplish a goal:

Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 …

● You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job.

● Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, and so on (a sketch of this pattern follows).
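
A minimal sketch of this pattern, assuming two hypothetical jobs with their own Mapper/Reducer classes (FirstMapper, FirstReducer, SecondMapper, SecondReducer are made up for illustration) and an intermediate HDFS path passed on the command line:

//Assumes the same imports as the WordCount example
//(org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*)
public class ChainDriver {

    //Driver for job 1: reads the raw input, writes to an intermediate HDFS path
    private static void runFirstJob(String input, String intermediate) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);
        conf.setJobName("first-job");
        conf.setMapperClass(FirstMapper.class);      //hypothetical Mapper
        conf.setReducerClass(FirstReducer.class);    //hypothetical Reducer
        conf.setOutputKeyClass(Text.class);          //placeholder output types
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(intermediate));
        JobClient.runJob(conf);                      //blocks until job 1 completes
    }

    //Driver for job 2: reads the intermediate path, writes the final output
    private static void runSecondJob(String intermediate, String output) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);
        conf.setJobName("second-job");
        conf.setMapperClass(SecondMapper.class);     //hypothetical Mapper
        conf.setReducerClass(SecondReducer.class);   //hypothetical Reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(intermediate));
        FileOutputFormat.setOutputPath(conf, new Path(output));
        JobClient.runJob(conf);
    }

    public static void main(String[] args) throws Exception {
        runFirstJob(args[0], args[1]);    //Map1 -> Reduce1
        runSecondJob(args[1], args[2]);   //Map2 -> Reduce2 consumes job 1's output
    }
}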
How it Works?

• Purpose
– Simple serialization for keys, values, and other data

• Interface Writable
– Read and write binary format
– Convert to String for text formats
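
As a sketch of what the Writable contract looks like (the class name and field below are made up for illustration; built-in types such as IntWritable and Text already implement this interface):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

//A made-up value type holding a single count, to illustrate the Writable contract
public class CountWritable implements Writable {
    private int count;

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);              // write binary format
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();             // read binary format
    }

    public String toString() {
        return Integer.toString(count);   // convert to String for text formats
    }
}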
Exercises

• Get WordCount running!

• Extend WordCount to count Bigrams:
  – Bigrams are simply sequences of two consecutive words.
    • For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
  A possible starting point for the map method is sketched below.
