MapReduce Introduction
ACM Fellow
Google Fellow
Calculate 30*50
Easy?
Which are,
Parallelization
Fault-tolerance
Data distribution
Load Balancing
A programming model
An implementation
Document: the quick brown fox the fox ate the mouse
Mapped: the, 1 | quick, 1 | brown, 1 | fox, 1 | the, 1 | fox, 1 | ate, 1 | the, 1 | mouse, 1
Reduced: the, 3 | quick, 1 | brown, 1 | fox, 2 | ate, 1 | mouse, 1
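The Mapped and Reduced steps of this word count can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's implementation: the names `map_fn`, `reduce_fn`, and `word_count` are assumptions chosen for the example, and the grouping step that a real framework performs across machines is done here with a dictionary.

```python
from collections import defaultdict


def map_fn(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]


def reduce_fn(word, counts):
    """Sum all the counts that were emitted for one word."""
    return word, sum(counts)


def word_count(document):
    # Group the mapped pairs by word (stand-in for the framework's shuffle).
    grouped = defaultdict(list)
    for word, one in map_fn(document):
        grouped[word].append(one)
    # Reduce each group to a single (word, total) pair.
    return dict(reduce_fn(word, ones) for word, ones in grouped.items())


document = "the quick brown fox the fox ate the mouse"
counts = word_count(document)
print(counts)
```

The user only writes the two small functions; everything between them (grouping by key) is what the framework is expected to provide.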
Input split 1: the quick brown fox
Input split 2: the fox ate the mouse
Map 1: the, 1 | quick, 1 | brown, 1 | fox, 1
Map 2: the, 1 | fox, 1 | ate, 1 | the, 1 | mouse, 1
Reduce 1 (Output): the, 3 | quick, 1 | brown, 1
Reduce 2 (Output): fox, 2 | ate, 1 | mouse, 1
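The Input, Map, Reduce, Output flow with more than one map task can be sketched as follows. The two input splits are assumptions reconstructed from the map outputs on the slide, and `map_task`, `shuffle`, and `reduce_task` are hypothetical names; everything runs in one process purely to show how pairs from separate map tasks meet at the reducers.

```python
from collections import defaultdict
from itertools import chain


def map_task(split):
    """One map task: emit (word, 1) for each word of its own split."""
    return [(word, 1) for word in split.split()]


def shuffle(map_outputs):
    """Collect the values from all map tasks under their key."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_outputs):
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """One reduce call: add up the counts for a single word."""
    return key, sum(values)


splits = ["the quick brown fox", "the fox ate the mouse"]
map_outputs = [map_task(split) for split in splits]
output = dict(reduce_task(k, v) for k, v in shuffle(map_outputs).items())
print(output)
```

Note that "the" is emitted by both map tasks, yet reaches a single reduce call, which is why its final count is 3.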
Figure: Source Web page 2, Source Web page 3, and Source Web page 4 each link to the Target (My web page).
Map: Source, Document Contents -> Target, Source pointing to the target
Reduce: Target -> all Sources pointing to the target
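The source-to-target example can be sketched the same way. This is an illustration under stated assumptions: the page names and HTML snippets are invented sample data, `map_fn` and `reduce_fn` are hypothetical names, and links are found with a deliberately simple regular expression rather than a real HTML parser.

```python
import re
from collections import defaultdict

# Naive link extractor; good enough for the sample data only.
LINK = re.compile(r'href="([^"]+)"')


def map_fn(source_url, contents):
    """Emit (target, source) for every link found in the source page."""
    return [(target, source_url) for target in LINK.findall(contents)]


def reduce_fn(target, sources):
    """Collect every source that points to the target."""
    return target, sorted(sources)


# Invented sample pages: three sources linking to one target.
pages = {
    "page2.example": '<a href="my.example">me</a>',
    "page3.example": '<a href="my.example">x</a> <a href="page2.example">y</a>',
    "page4.example": '<a href="my.example">z</a>',
}

grouped = defaultdict(list)
for source, contents in pages.items():
    for target, src in map_fn(source, contents):
        grouped[target].append(src)

reverse_links = dict(reduce_fn(t, s) for t, s in grouped.items())
print(reverse_links["my.example"])
```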
Figure: Execution overview
User Program: (1) Fork the Master and the Workers
Input Layer: Split 0, Split 1, Split 2, Split 3, Split 4
Map Layer: Workers (3) Read the input splits
Intermediate Files
Reduce Layer: Workers (5) Remote Read the intermediate files and (6) Write the output
Output Layer: O/P File 0, O/P File 1
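The layers in this figure can be imitated on one machine with a thread pool standing in for the workers. This is only a sketch under assumptions: `master`, `map_worker`, `reduce_worker`, and `partition` are hypothetical names, in-memory lists replace the intermediate files, and the five short splits are invented sample input. The step numbers in the comments refer to the labels visible in the figure.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

R = 2  # number of reduce tasks, i.e. output files (assumed value)


def partition(key):
    """Pick the reduce task for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % R


def map_worker(split):
    """(3) Read one split and place each pair in one of R buckets."""
    buckets = [[] for _ in range(R)]
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets


def reduce_worker(r, intermediate):
    """(5) Read bucket r from every map worker, (6) produce the output."""
    totals = defaultdict(int)
    for buckets in intermediate:
        for word, count in buckets[r]:
            totals[word] += count
    return dict(totals)


def master(splits, workers=3):
    # (1) Fork: the pool plays the role of the worker processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        intermediate = list(pool.map(map_worker, splits))
        outputs = list(
            pool.map(lambda r: reduce_worker(r, intermediate), range(R))
        )
    return outputs


splits = ["the quick", "brown fox", "the fox", "ate the", "mouse"]
output_files = master(splits)
print(output_files)
```

Because every map worker uses the same `partition` function, all pairs for a given word land in the same output file.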
o Automatic parallelization using Map & Reduce
How to parallelize the computation?
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing
Figure labels: Data; Worker1, Worker2, Worker3; User-defined Map/Reduce Instruction; Information
o Handling failures
o Preserve bandwidth
o Load balancing
Reduce
Worker
Partitioning
Combining
Counters
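A small sketch may help make these three terms concrete. It assumes a hash-mod-R partitioner, a combiner that pre-sums counts on the map side, and a job-wide counter object; the names `partition`, `combine`, and `counters` are hypothetical and do not come from any specific MapReduce API.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Partitioning: route a key to a reduce task with a stable hash mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combining: pre-sum (word, count) pairs before they leave the mapper."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())


# Counters: hypothetical job-wide statistics collected during the run.
counters = Counter()

map_output = [(word, 1) for word in "the fox ate the mouse".split()]
counters["map_output_pairs"] += len(map_output)

combined = combine(map_output)
counters["combined_pairs"] += len(combined)

print(combined)
print(dict(counters))
```

Here the combiner shrinks five intermediate pairs to four, which is the kind of saving that helps preserve bandwidth when many words repeat.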
No backup tasks: 1283 s vs. 891 s Normal Execution (44% increment in time). Stragglers take >300 s to finish.
Quick failure recovery: 933 s vs. 891 s Normal Execution (5% increment in time).
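The two percentages follow directly from the run times on the slide. The helper `percent_increase` is a hypothetical name used only for this arithmetic check.

```python
def percent_increase(baseline_s, measured_s):
    """Relative slowdown of a run compared with the baseline, in percent."""
    return 100.0 * (measured_s - baseline_s) / baseline_s


no_backup = percent_increase(891, 1283)      # run without backup tasks
with_failures = percent_increase(891, 933)   # run with failure recovery

print(f"{no_backup:.0f}% {with_failures:.0f}%")  # prints "44% 5%"
```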
Google Maps
Locating addresses
Map tiles rendering
Google PageRank
Localized Search
Used in,
Yahoo! Search
Facebook
Amazon
Twitter
Google