MapReduce Introduction

The document describes MapReduce, a programming model and associated implementation for processing and generating large data sets in a distributed computing environment. It addresses common complexities of distributed computing, such as parallelization, fault tolerance, load balancing, and bandwidth usage, through the Map and Reduce functions. MapReduce has been widely adopted, including by Google, Yahoo, Facebook, and Amazon, to solve problems involving large-scale clustering, searching, and analytics.

By: Jeffrey Dean & Sanjay Ghemawat

Presented by: Warunika Ranaweera


Supervised by: Dr. Nalin Ranasinghe

MapReduce: Simplified Data Processing on Large Clusters
In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04)
Also appears in the Communications of the ACM (2008)

Jeffrey Dean
Ph.D. in Computer Science, University of Washington
Google Fellow in the Systems and Infrastructure Group
ACM Fellow
Research areas: Distributed Systems and Parallel Computing

Sanjay Ghemawat
Ph.D. in Computer Science, Massachusetts Institute of Technology
Google Fellow
Research areas: Distributed Systems and Parallel Computing

Calculate 30*50
Easy?

30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60

A little bit harder?

Simple computation, but a huge data set
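Even the longer sum is trivial for one computer. A one-line check (a minimal Python sketch, assuming the slide's pattern of incrementing both factors):

    # 30*50 + 31*51 + 32*52 + ... + 40*60
    total = sum((30 + i) * (50 + i) for i in range(11))
    print(total)  # 21285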

A real-world example of a large computation:

20+ billion web pages * 20 kB per page = roughly 400 TB
One computer reads 30-35 MB/sec from disk
Nearly four months to read the web (400 TB at ~35 MB/s is about 11 million seconds, over 130 days)

Parallelize the task in a distributed computing environment:
the web-page problem is solved in 3 hours with 1000 machines

Complexities in Distributed Computing


o How to parallelize the computation?
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing

A platform to hide the messy details of distributed computing, which are:
Parallelization
Fault-tolerance
Data distribution
Load balancing

MapReduce is both:

A programming model

An implementation

Example: Word count

Document:
the quick brown fox
the fox ate the mouse

Mapped (each word paired with a count of 1):
(the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduced (counts accumulated per word):
(the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)

Eg: Word count using MapReduce

Input: the document, divided into two splits

Split 1: "the quick brown fox"
Map -> (the, 1) (quick, 1) (brown, 1) (fox, 1)

Split 2: "the fox ate the mouse"
Map -> (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduce -> (the, 3) (quick, 1) (brown, 1)
Reduce -> (fox, 2) (ate, 1) (mouse, 1)

Output: the combined (word, count) pairs

The Map function

Input: a text file (key = document name, value = document contents)
Output: intermediate key/value pairs, e.g. (fox, 1)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

The Reduce function

Input: a word and the list of counts emitted by Map, e.g. (fox, {1, 1})
Output: the accumulated count, e.g. (fox, 2)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
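To make the data flow concrete, here is a minimal single-machine Python sketch of the same word-count pipeline. It is illustrative only: the real implementation runs the map, shuffle, and reduce phases across a cluster, and the function names here are mine, not the paper's.

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name; value: document contents
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word; values: a list of counts
        return (key, sum(values))

    def mapreduce(documents):
        # Shuffle: group all intermediate values by key
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for word, count in map_fn(name, contents):
                intermediate[word].append(count)
        return [reduce_fn(w, counts) for w, counts in intermediate.items()]

    docs = {"doc1": "the quick brown fox", "doc2": "the fox ate the mouse"}
    print(mapreduce(docs))
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('ate', 1), ('mouse', 1)]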

Reverse Web-Link Graph

Example: several source web pages (Source 1 through Source 5) each contain a link to a target page (my web page).

Reverse Web-Link Graph

Map: for each link from a source page to a target page, emit (target, source):
(My Web, Source 1)
(Not My Web, Source 2)
(My Web, Source 3)
(My Web, Source 4)
(My Web, Source 5)

Reduce: for each target, collect the list of all source pages pointing to it:
(My Web, {Source 1, Source 3,.....})
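A minimal Python sketch of this example in the same single-machine style (the page names and link lists below are hypothetical):

    from collections import defaultdict

    def map_fn(source, targets):
        # Emit (target, source) for every outgoing link on a source page
        for target in targets:
            yield (target, source)

    pages = {
        "source1": ["myweb"],
        "source2": ["notmyweb"],
        "source3": ["myweb"],
    }

    # Shuffle + Reduce: group sources by the target they point to
    reverse_links = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_fn(source, targets):
            reverse_links[target].append(src)

    print(dict(reverse_links))
    # {'myweb': ['source1', 'source3'], 'notmyweb': ['source2']}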

Execution overview:

(1) Fork: the user program forks the master and the worker processes
(2) Assign: the master assigns map tasks and reduce tasks to idle workers
(3) Read: each map worker reads its input split (Split 0 ... Split 4)
(4) Local write: map workers write intermediate files to their local disks
(5) Remote read: reduce workers read the intermediate files from the map workers
(6) Write: reduce workers write the final output files (O/P File 0, O/P File 1)

Layers: Input -> Map -> Intermediate Files -> Reduce -> Output

Complexities in Distributed Computing, to be solved

o How to parallelize the computation? -> Automatic parallelization using Map & Reduce
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing

Restricted programming model

User-specified Map & Reduce functions

1000s of workers, different data sets: every worker runs the same user-defined Map/Reduce instructions, each over its own share of the data

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate with other nodes -> using a master node
o Handling failures
o Preserve bandwidth
o Load balancing

Master data structure

The master pushes information (meta-data) between workers: map workers report to the master, and the master forwards that information (e.g. the locations of intermediate files) to the reduce workers.

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Handling failures -> Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth
o Load balancing

No response from a worker task?

If an ongoing Map or Reduce task: re-execute it

If a completed Map task: re-execute it (its output sits on the failed machine's local disk and is now unreachable)

If a completed Reduce task: leave it untouched (its output is already in the global file system)

Master failure (unlikely): restart the computation

Straggler: a machine that takes an unusually long time to complete one of the last few tasks in the computation

Solution: redundant execution

Near the end of a phase, spawn backup copies of the in-progress tasks
The copy that finishes first "wins"
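A toy Python illustration of "first copy wins" on a single machine (this is only a sketch; in MapReduce the master schedules the backup copy on a different worker):

    import concurrent.futures
    import random
    import time

    def task(copy_id):
        # Simulate a task whose runtime varies; a straggler is just a slow copy
        time.sleep(random.uniform(0.1, 1.0))
        return copy_id

    with concurrent.futures.ThreadPoolExecutor() as pool:
        copies = [pool.submit(task, "original"), pool.submit(task, "backup")]
        done, _ = concurrent.futures.wait(
            copies, return_when=concurrent.futures.FIRST_COMPLETED)
        print("winner:", done.pop().result())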

Complexities in Distributed Computing, solving..

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth -> Saves bandwidth through locality
o Load balancing

The same data set is replicated on several machines

If a task has its data locally, it need not fetch the data from other nodes, so scheduling tasks next to their data saves bandwidth

Complexities in Distributed Computing, solved

o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance & backup tasks
o Saves bandwidth through locality
o Load balancing through granularity

Fine-granularity tasks: many more map tasks than machines

One worker runs several tasks

Idle workers are quickly assigned new work

Refinements (a sketch of the first two follows this list):

Partitioning

Combining

Skipping bad records

Local execution for debugging

Counters
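A minimal Python sketch of partitioning and combining (the constant R, the function names, and the sample data are mine; the paper's default partitioning function is hash(key) mod R, and a real implementation would use a hash that is stable across processes):

    from collections import Counter

    R = 4  # number of reduce tasks

    def partition(key, R):
        # Route an intermediate key to one of R reduce tasks
        return hash(key) % R

    def combine(pairs):
        # Pre-aggregate (word, 1) pairs on the map worker before writing
        # them to disk, so much less data crosses the network
        combined = Counter()
        for word, count in pairs:
            combined[word] += count
        return list(combined.items())

    pairs = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
    print(combine(pairs))        # [('the', 3), ('fox', 1)]
    print(partition("the", R))   # a reduce task index in [0, R)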

Effect of backup tasks and failures on execution time:

Normal execution: 891 s

No backup tasks: 1283 s, a 44% increase in time; very long tail, since stragglers take >300 s to finish

200 processes killed: 933 s, only a 5% increase over normal execution, showing quick failure recovery

Clustering for Google News and Google Product Search

Google Maps:
Locating addresses
Map tile rendering

Google PageRank

Localized search

Apache Hadoop MapReduce
Hadoop Distributed File System (HDFS)
Used in: Yahoo! Search, Facebook, Amazon, Twitter, Google
Higher-level languages/systems based on Hadoop: Pig and Hive

Amazon Elastic MapReduce
Available to the general public
Processes data in the cloud

A large variety of problems can be expressed as Map & Reduce

Restricted programming model

Easy to hide the details of distributed computing

Achieved scalability & programming efficiency
