MapReduce Introduction

The document describes MapReduce, a programming model and associated implementation for processing and generating large data sets in a distributed computing environment. It addresses common complexities of distributed computing, such as parallelization, fault tolerance, load balancing, and bandwidth usage, through the Map and Reduce functions. MapReduce has been widely adopted, including by Google, Yahoo, Facebook, and Amazon, to solve problems involving large-scale clustering, searching, and analytics.

By: Jeffrey Dean & Sanjay Ghemawat

Presented by: Warunika Ranaweera


Supervised by: Dr. Nalin Ranasinghe

MapReduce: Simplified Data Processing on Large Clusters
In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04)
Also appears in the Communications of the ACM (2008)

Jeffrey Dean
Ph.D. in Computer Science, University of Washington
Google Fellow in the Systems and Infrastructure Group
ACM Fellow
Research areas: Distributed Systems and Parallel Computing

Sanjay Ghemawat
Ph.D. in Computer Science, Massachusetts Institute of Technology
Google Fellow
Research areas: Distributed Systems and Parallel Computing

Calculate 30*50
Easy?

30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60

A little bit harder?

Simple computation, but a huge data set
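Even the longer sum is trivial for one computer. A one-line check (a minimal Python sketch, assuming the slide's pattern of incrementing both factors):

    # 30*50 + 31*51 + 32*52 + ... + 40*60
    total = sum((30 + i) * (50 + i) for i in range(11))
    print(total)  # 21285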

A real-world example of a large computation:

20+ billion web pages * 20 kB per page = roughly 400 TB
One computer reads 30-35 MB/sec from disk
Nearly four months to read the web (400 TB at ~35 MB/s is about 11 million seconds, over 130 days)

Parallelize the task in a distributed computing environment:
the web-page problem is solved in 3 hours with 1000 machines

Complexities in Distributed Computing


o How to parallelize the computation?
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing

A platform to hide the messy details of distributed computing, which are:
Parallelization
Fault-tolerance
Data distribution
Load balancing

MapReduce is both:

A programming model

An implementation

Example: Word count

Document:
the quick brown fox
the fox ate the mouse

Mapped (each word paired with a count of 1):
(the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduced (counts accumulated per word):
(the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)

Eg: Word count using MapReduce

Input: the document, divided into two splits

Split 1: "the quick brown fox"
Map -> (the, 1) (quick, 1) (brown, 1) (fox, 1)

Split 2: "the fox ate the mouse"
Map -> (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduce -> (the, 3) (quick, 1) (brown, 1)
Reduce -> (fox, 2) (ate, 1) (mouse, 1)

Output: the combined (word, count) pairs

The Map function

Input: a text file (key = document name, value = document contents)
Output: intermediate key/value pairs, e.g. (fox, 1)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

The Reduce function

Input: a word and the list of counts emitted by Map, e.g. (fox, {1, 1})
Output: the accumulated count, e.g. (fox, 2)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
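To make the data flow concrete, here is a minimal single-machine Python sketch of the same word-count pipeline. It is illustrative only: the real implementation runs the map, shuffle, and reduce phases across a cluster, and the function names here are mine, not the paper's.

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name; value: document contents
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word; values: a list of counts
        return (key, sum(values))

    def mapreduce(documents):
        # Shuffle: group all intermediate values by key
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for word, count in map_fn(name, contents):
                intermediate[word].append(count)
        return [reduce_fn(w, counts) for w, counts in intermediate.items()]

    docs = {"doc1": "the quick brown fox", "doc2": "the fox ate the mouse"}
    print(mapreduce(docs))
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('ate', 1), ('mouse', 1)]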

Reverse Web-Link Graph

Example: several source web pages (Source 1 through Source 5) each contain a link to a target page (my web page).

Reverse Web-Link Graph

Map: for each link from a source page to a target page, emit (target, source):
(My Web, Source 1)
(Not My Web, Source 2)
(My Web, Source 3)
(My Web, Source 4)
(My Web, Source 5)

Reduce: for each target, collect the list of all source pages pointing to it:
(My Web, {Source 1, Source 3,.....})
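A minimal Python sketch of this example in the same single-machine style (the page names and link lists below are hypothetical):

    from collections import defaultdict

    def map_fn(source, targets):
        # Emit (target, source) for every outgoing link on a source page
        for target in targets:
            yield (target, source)

    pages = {
        "source1": ["myweb"],
        "source2": ["notmyweb"],
        "source3": ["myweb"],
    }

    # Shuffle + Reduce: group sources by the target they point to
    reverse_links = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_fn(source, targets):
            reverse_links[target].append(src)

    print(dict(reverse_links))
    # {'myweb': ['source1', 'source3'], 'notmyweb': ['source2']}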

Execution overview:

(1) Fork: the user program forks the master and the worker processes
(2) Assign: the master assigns map tasks and reduce tasks to idle workers
(3) Read: each map worker reads its input split (Split 0 ... Split 4)
(4) Local write: map workers write intermediate files to their local disks
(5) Remote read: reduce workers read the intermediate files from the map workers
(6) Write: reduce workers write the final output files (O/P File 0, O/P File 1)

Layers: Input -> Map -> Intermediate Files -> Reduce -> Output

Complexities in Distributed Computing, to be solved

o How to parallelize the computation? -> Automatic parallelization using Map & Reduce
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing

Restricted programming model

User-specified Map & Reduce functions

1000s of workers, different data sets: every worker runs the same user-defined Map/Reduce instructions, each over its own share of the data

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate with other nodes -> using a master node
o Handling failures
o Preserve bandwidth
o Load balancing

Master data structure

The master pushes information (meta-data) between workers: map workers report to the master, and the master forwards that information (e.g. the locations of intermediate files) to the reduce workers.

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Handling failures -> Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth
o Load balancing

No response from a worker task?

If an ongoing Map or Reduce task: re-execute it

If a completed Map task: re-execute it (its output sits on the failed machine's local disk and is now unreachable)

If a completed Reduce task: leave it untouched (its output is already in the global file system)

Master failure (unlikely): restart the computation

Straggler: a machine that takes an unusually long time to complete one of the last few tasks in the computation

Solution: redundant execution

Near the end of a phase, spawn backup copies of the in-progress tasks
The copy that finishes first "wins"
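A toy Python illustration of "first copy wins" on a single machine (this is only a sketch; in MapReduce the master schedules the backup copy on a different worker):

    import concurrent.futures
    import random
    import time

    def task(copy_id):
        # Simulate a task whose runtime varies; a straggler is just a slow copy
        time.sleep(random.uniform(0.1, 1.0))
        return copy_id

    with concurrent.futures.ThreadPoolExecutor() as pool:
        copies = [pool.submit(task, "original"), pool.submit(task, "backup")]
        done, _ = concurrent.futures.wait(
            copies, return_when=concurrent.futures.FIRST_COMPLETED)
        print("winner:", done.pop().result())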

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth -> Saves bandwidth through locality
o Load balancing

The same data set is replicated on several machines

If a task has its data locally, it need not fetch the data from other nodes, so scheduling tasks next to their data saves bandwidth

Complexities in Distributed Computing, solved

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance & backup tasks
o Saves bandwidth through locality
o Load balancing through granularity

Fine-granularity tasks: many more map tasks than machines

One worker runs several tasks

Idle workers are quickly assigned new work

Refinements (a sketch of the first two follows this list):

Partitioning

Combining

Skipping bad records

Local execution for debugging

Counters
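A minimal Python sketch of partitioning and combining (the constant R, the function names, and the sample data are mine; the paper's default partitioning function is hash(key) mod R, and a real implementation would use a hash that is stable across processes):

    from collections import Counter

    R = 4  # number of reduce tasks

    def partition(key, R):
        # Route an intermediate key to one of R reduce tasks
        return hash(key) % R

    def combine(pairs):
        # Pre-aggregate (word, 1) pairs on the map worker before writing
        # them to disk, so much less data crosses the network
        combined = Counter()
        for word, count in pairs:
            combined[word] += count
        return list(combined.items())

    pairs = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
    print(combine(pairs))        # [('the', 3), ('fox', 1)]
    print(partition("the", R))   # a reduce task index in [0, R)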

Effect of backup tasks and failures on execution time:

Normal execution: 891 s

No backup tasks: 1283 s, a 44% increase in time; very long tail, since stragglers take >300 s to finish

200 processes killed: 933 s, only a 5% increase over normal execution, showing quick failure recovery

Clustering for Google News and Google Product Search

Google Maps:
Locating addresses
Map tile rendering

Google PageRank

Localized search

Apache Hadoop MapReduce
Hadoop Distributed File System (HDFS)
Used in: Yahoo! Search, Facebook, Amazon, Twitter, Google
Higher-level languages/systems based on Hadoop: Pig and Hive

Amazon Elastic MapReduce
Available to the general public
Processes data in the cloud

A large variety of problems can be expressed as Map & Reduce

Restricted programming model

Easy to hide the details of distributed computing

Achieved scalability & programming efficiency
