Hadoop Wordcount Program
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Reduce: (k', <v'1, v'2, …, v'n'>) ↦ <(k', v''1), (k', v''2), …, (k', v''n'')>
All v' with the same k' are reduced together. (Remember
The shuffle and sort phases occur simultaneously; i.e., while map outputs are being fetched, they are merged.
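A minimal plain-Java sketch of this idea, with no Hadoop dependency: each call to fetch() stands for one map output arriving at the reducer, and it is merged into the sorted, grouped view immediately rather than after all outputs have arrived. The class name ShuffleSortSketch and its methods are hypothetical and chosen only for illustration; Hadoop's real shuffle works on serialized, partitioned data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop classes): merges map outputs into sorted,
// grouped reducer input as each output "arrives".
class ShuffleSortSketch {

    // Sorted map from key to all values seen so far for that key.
    private final TreeMap<String, List<Integer>> merged = new TreeMap<>();

    // Called once per fetched map output; the merge happens right away.
    void fetch(List<Map.Entry<String, Integer>> mapOutput) {
        for (Map.Entry<String, Integer> pair : mapOutput) {
            merged.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
    }

    // The reducer's view: keys in sorted order, values grouped per key.
    TreeMap<String, List<Integer>> reducerInput() {
        return merged;
    }

    public static void main(String[] args) {
        ShuffleSortSketch shuffle = new ShuffleSortSketch();
        shuffle.fetch(List.of(Map.entry("to", 1), Map.entry("be", 1)));
        shuffle.fetch(List.of(Map.entry("or", 1), Map.entry("not", 1)));
        shuffle.fetch(List.of(Map.entry("to", 1), Map.entry("be", 1)));
        System.out.println(shuffle.reducerInput());
    }
}
```

Running main prints {be=[1, 1], not=[1], or=[1], to=[1, 1]}: all values with the same key end up together, and the keys are in sorted order.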
Key/value Pairs:
(fileoffset,line) → Map → (word,1) → Reduce → (word,n)
The file offset is the byte position of the line within the file.
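The flow above can be traced in a few lines of plain Java. This is only a sketch of the data flow, not Hadoop code: the class WordCountFlowSketch and its methods are hypothetical, the grouping step stands in for shuffle and sort, and the offset arithmetic assumes one byte per character and "\n" line endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount data flow; no Hadoop classes are used.
class WordCountFlowSketch {

    // Map: (fileoffset, line) -> list of (word, 1). The offset is not needed here.
    static List<Map.Entry<String, Integer>> map(long fileOffset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce: (word, [1, 1, ...]) -> n, the sum of the values.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(String fileContents) {
        // Grouping by key in a sorted map stands in for shuffle and sort.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            for (Map.Entry<String, Integer> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
            offset += line.length() + 1; // line plus newline, one byte per char assumed
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, values) -> counts.put(word, reduce(word, values)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run("to be or\nnot to be"));
    }
}
```

For the two-line input in main, the result is {be=2, not=1, or=1, to=2}.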
WordCount in Web Pages
key = “be”, values = “1”, “1” → 2
key = “not”, values = “1” → 1
key = “or”, values = “1” → 1
key = “to”, values = “1”, “1” → 2
        // Emit (word, 1) for every token in the current line.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    } // end of map()
} // end of Map class
Mapper
// Map class header
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
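To see how the tokenizer loop fits inside a map method with this key/value signature, here is a self-contained sketch that runs without Hadoop. The Collector interface is a hypothetical stand-in for Hadoop's OutputCollector, and plain long, String, and Integer replace LongWritable, Text, and IntWritable; it is an illustration under those assumptions, not the class from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Self-contained sketch of the map step. Collector is a hypothetical
// stand-in for Hadoop's OutputCollector so the code runs without Hadoop.
class MapSketch {

    interface Collector<K, V> {
        void collect(K key, V value);
    }

    // The key is the file offset of the line; the value is the line's text.
    static void map(long key, String value, Collector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            output.collect(tokenizer.nextToken(), 1); // emit (word, 1)
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        map(0L, "to be or not to be", (word, one) -> emitted.add(word + "=" + one));
        System.out.println(emitted);
    }
}
```

Running main prints [to=1, be=1, or=1, not=1, to=1, be=1]: one pair per token, in input order, with no aggregation yet. Summing is left to the reducer.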
● The Hadoop system picks up a number of values from the command line on its
own.
● main() then specifies a few key parameters of the problem in the
JobConf object.
● JobConf is the primary interface for a user to describe a map-reduce job to the
Hadoop framework for execution (such as which Map and Reduce classes to
use and the format of the input and output files).
● Other parameters, e.g. the number of machines to use, are optional, and the
system will determine good values for them if they are not specified.
● The framework then tries to faithfully execute the job as described by
JobConf.
Main
public static void main(String[] args) throws Exception {
● You can easily chain jobs together in this fashion by writing multiple
driver methods, one for each job.
● Call the first driver method, which uses JobClient.runJob() to run the
job and wait for it to complete. When that job has completed,
call the next driver method, which creates a new JobConf object
referring to different instances of Mapper and Reducer, etc.
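The chaining pattern can be sketched in plain Java as one method per job, called in sequence. Both driver methods below are hypothetical stand-ins: in a real Hadoop program each would build its own JobConf and block on JobClient.runJob(), and the second job would read the first job's output files rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of chaining two jobs with one driver method per job.
// No Hadoop classes are used; each driver simply returns when its "job" is done.
class JobChainSketch {

    // Driver 1 (hypothetical): counts words.
    static Map<String, Integer> runWordCountJob(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Driver 2 (hypothetical): groups words by count, consuming job 1's output.
    static Map<Integer, List<String>> runGroupByCountJob(Map<String, Integer> counts) {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        counts.forEach((word, n) ->
                byCount.computeIfAbsent(n, k -> new ArrayList<>()).add(word));
        return byCount;
    }

    public static void main(String[] args) {
        // The second driver is called only after the first has returned.
        Map<String, Integer> counts = runWordCountJob(List.of("to be or", "not to be"));
        System.out.println(runGroupByCountJob(counts));
    }
}
```

Running main prints {1=[not, or], 2=[be, to]}. The point is the ordering: the second job starts only once the first has finished and its output exists.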
How it Works?
• Purpose
– Simple serialization for keys, values, and other data
• Interface Writable
– Read and write binary format
– Convert to String for text formats
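A small sketch of the Writable idea using only java.io. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the local SimpleWritable interface below mirrors those two methods so the example runs without Hadoop on the classpath. WordCountPair and roundTrip are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the Writable idea using only java.io. SimpleWritable is a local
// stand-in that mirrors the two methods of Hadoop's Writable interface.
class WritableSketch {

    interface SimpleWritable {
        void write(DataOutput out) throws IOException;     // serialize to binary
        void readFields(DataInput in) throws IOException;  // deserialize from binary
    }

    // Hypothetical value type: a word together with its count.
    static class WordCountPair implements SimpleWritable {
        String word = "";
        int count;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();
            count = in.readInt();
        }

        // Text output formats use the String form.
        @Override
        public String toString() {
            return word + "\t" + count;
        }
    }

    // Writes the pair to bytes and reads it back into a fresh object.
    static WordCountPair roundTrip(WordCountPair original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        WordCountPair copy = new WordCountPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        WordCountPair pair = new WordCountPair();
        pair.word = "to";
        pair.count = 2;
        System.out.println(roundTrip(pair));
    }
}
```

Fields must be read back in the same order they were written; the binary format carries no field names.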
Exercises