


International Journal of Engineering Trends and Technology (IJETT) – Volume 50 Number 5 August 2017

Apache Pig - A Data Flow Framework Based on Hadoop Map Reduce

Swarna C#1, Zahid Ansari*2

#Department of Computer Science and Engineering, P.A. College of Engineering, Mangaluru, India
*Department of Computer Science and Engineering, P.A. College of Engineering, Mangaluru, India
Abstract — Big Data is a technology phenomenon that arose from the increasing rate of data growth, complex new data types, and parallel advances in the technology stack. Big data can be structured, unstructured or semi-structured, which renders conventional data management methods ineffective. Hadoop is a framework for the analysis and transformation of very large data sets using the Map Reduce paradigm. An important characteristic of Hadoop is the splitting of data and computation across thousands of hosts and the running of applications in parallel, close to their data. Hadoop accomplishes this through HDFS and Map Reduce. Pig is an Apache open source project. It runs on Hadoop, making use of both HDFS and Map Reduce. Pig has two main components. The first component, Pig Latin, is a parallel dataflow language designed to fit between SQL and Map Reduce; it enables the user to define the reading, processing and storing of data in parallel. A Pig Latin script describes a directed acyclic graph in which data flows are represented as edges and operators as nodes. The second component is the runtime environment in which Pig Latin programs are executed.

Keywords — Big Data, Hadoop, Map Reduce, Pig, Pig Latin.

I. INTRODUCTION

The term 'Big Data' describes inventive techniques and technologies to capture, store, distribute, manage and analyse petabyte or larger-sized datasets with high velocity and varied structures [1]. Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers [2]. In 2004 Google invented a framework called Map Reduce, used mainly for parallel data processing in distributed computing environments. But Map Reduce is too low level and rigid, and it has many drawbacks: writing low-level Map Reduce code is slow, a lot of expertise is needed to optimize that code, prototyping is slow, a lot of custom code is required even for simple tasks, and more complex chains of map reduce jobs are hard to manage. So a new language called Pig Latin was developed, which combines a high-level declarative query style like SQL with low-level procedural programming like Map Reduce.

Pig Latin is implemented on Pig, open source software that runs on Hadoop. Pig Latin's main features include support for a flexible nested data model, extensive support for user-defined functions, and the ability to operate on input files without any schema information. Pig Latin also comes with a novel debugging environment that is particularly useful when dealing with massive data sets.

II. PIG COMPONENTS AND ARCHITECTURE

Pig is an Apache open source project: an engine for executing parallel data flows on Hadoop. It runs on Hadoop by making use of both HDFS and Map Reduce, the two components of Hadoop [3]. Pig was initially developed at Yahoo. The Pig programming language is designed to handle any reasonable type of data. Pig is made up of two components: the first is Pig Latin, the language itself, and the second is the runtime environment in which Pig Latin programs are executed [4]. Fig. 1 shows the components of Pig.

[Fig 1: Components of Pig]

Figure 1 also shows the various steps during execution: data is loaded from HDFS and converted into a series of map and reduce tasks, and the output is finally either stored into a file or dumped to the screen.
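As a minimal sketch of this load, process, store flow (the input path and schema here are invented for illustration), a Pig Latin fragment might read:

    -- Hypothetical input path and schema, for illustration only.
    raw  = LOAD 'data/records.txt' USING PigStorage(',')
           AS (id:int, name:chararray, score:float);
    good = FILTER raw BY score > 50.0;   -- becomes a map-side operation
    DUMP good;                           -- asks for output, triggering execution

In an interactive session, the DUMP at the end plays the same role as STORE in triggering compilation and execution.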


[Fig 2: Pig Architecture]

Fig. 2 describes the Pig architecture. Grunt is the interactive shell in which users enter Pig Latin. The parser converts Pig Latin into a logical plan, which is further optimized by the optimizer. The compiler then converts the optimized plan into a series of map reduce jobs, and these jobs are executed by the execution engine. Pig allows three modes of user interaction [7]:

• Interactive mode: the user enters Pig commands in an interactive shell known as Grunt. When the user asks for output through the STORE command, plan compilation and execution are triggered.
• Batch mode: the user submits a prewritten script containing a group of Pig commands, typically finishing with STORE. The semantics are identical to interactive mode; a sketch of such a script follows this list.
• Embedded mode: Pig Latin commands can be submitted through method invocations from a Java program, using a Java library provided by Pig. This option permits dynamic construction of Pig Latin programs as well as dynamic control flow, e.g. looping for a non-predetermined number of iterations, which Pig Latin does not currently support directly.
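As a sketch of batch mode (paths and field names are hypothetical), the following word-count script could be saved as wordcount.pig and submitted to Pig as a whole; nothing executes until the final STORE:

    -- wordcount.pig: hypothetical batch script.
    lines  = LOAD 'input/docs.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO 'output/wordcounts';  -- triggers plan compilation and execution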
III. PIG LATIN

This section describes the details of the Pig Latin language. We describe the Pig data model in Section A, and Pig Latin statements in the subsequent subsections. Pig Latin has the following key properties [15]:

• Ease of programming: complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
• Optimization opportunities: the way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Extensibility: users can create their own functions to do special-purpose processing.

A. Data Model

Data in Pig Latin is categorized into two kinds [16]: scalar and complex data types. Pig's scalar types are similar to the data types that appear in most programming languages. With the exception of bytearray, they are all represented in Pig interfaces by java.lang classes, making them easy to work with in UDFs. Table I describes the scalar data types.

TABLE I
SCALAR DATA TYPES

  Scalar data type   Description                        Example
  int                Four-byte signed integer           12
  long               Eight-byte signed integer          80000L
  float              Four-byte floating-point number    6.2f or 6.2e2f
  double             Eight-byte floating-point number   2.718 or 6.626e-34
  chararray          A string or character array        Hello
  bytearray          A blob or array of bytes           -

Pig's three complex data types are maps, tuples, and bags. All of these types can contain data of any type, including other complex types, so it is possible to have a map whose value field is a bag, which contains a tuple in which one of the fields is a map. Table II describes the complex data types.

TABLE II
COMPLEX DATA TYPES

  Complex data type   Description                                               Example
  tuple               An ordered set of fields                                  (1,'alice')
  bag                 A collection of tuples                                    {(1,'alice'),(2)}
  map                 A collection of data items, each with an associated key   ['a'#'pomegranate']
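As a sketch of how these types nest (the file and field names are invented, and the typed map value assumes a reasonably recent Pig release), a single LOAD can declare a schema that uses a tuple, a bag, and a map together:

    -- Hypothetical schema combining the complex types of Table II.
    students = LOAD 'students.dat' AS (
        info:tuple(id:int, name:chararray),                   -- tuple of scalars
        grades:bag{g:tuple(course:chararray, score:float)},   -- bag of tuples
        attrs:map[chararray]                                  -- map with chararray values
    );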


B. Pig Diagnostic Operators

Pig Latin provides four diagnostic operators: Describe, Dump, Explain and Illustrate. Describe, Explain and Illustrate allow the user to work with the logical plan, for debugging purposes. Dump is a sort of diagnostic operator too, because it should be used only for interactive debugging of small result sets, or in combination with Limit. Table III gives a brief description of these operators.

TABLE III
DIAGNOSTIC OPERATORS

  Operator     Description
  Describe     Returns the schema of the relation
  Dump         Dumps the results to the screen
  Explain      Displays execution plans
  Illustrate   Displays a step-by-step execution of a sequence of statements
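A sketch of these operators in a Grunt session (relation and file names are hypothetical):

    -- Hypothetical Grunt session, for illustration only.
    emps = LOAD 'emps.csv' USING PigStorage(',')
           AS (id:int, dept:chararray, pay:double);
    DESCRIBE emps;     -- prints the schema of the relation
    top = LIMIT emps 5;
    DUMP top;          -- writes a small result set to the screen
    EXPLAIN top;       -- displays the execution plans
    ILLUSTRATE top;    -- shows a step-by-step run over sample data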
C. Pig Commands

Apache Pig provides a number of built-in data processing operators. For input/output processing, the Load and Store commands are used; for filtering and transforming data, the Filter, Foreach/Generate and Stream commands are used; and there are further commands for grouping and joining data. The important data processing commands are described in Table IV.

TABLE IV
PIG COMMANDS

  Command             Description
  Load                Read data from the file system
  Store               Write data to the file system
  Dump                Write output to stdout
  Foreach, Generate   Apply an expression to each record and generate one or more records
  Filter              Apply a predicate to each record and remove records where it is false
  Group, Cogroup      Collect records with the same key from one or more inputs
  Join                Join two or more inputs based on a key
  Order               Sort records based on a key
  Distinct            Remove duplicate records
  Union               Merge two datasets
  Limit               Limit the number of records
  Split               Split data into two or more sets, based on filter conditions
  Cross               Create the cross product of two or more relations
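A sketch combining several of the commands in Table IV (all file and field names are hypothetical):

    -- Hypothetical pipeline using JOIN, GROUP, FOREACH, ORDER and LIMIT.
    users  = LOAD 'users.tsv'  AS (uid:int, country:chararray);
    clicks = LOAD 'clicks.tsv' AS (uid:int, url:chararray);
    joined = JOIN users BY uid, clicks BY uid;      -- key-based join of two inputs
    bycty  = GROUP joined BY users::country;        -- collect records by key
    stats  = FOREACH bycty GENERATE group AS country, COUNT(joined) AS n;
    ranked = ORDER stats BY n DESC;                 -- sort on a key
    top10  = LIMIT ranked 10;
    STORE top10 INTO 'output/top_countries';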
IV. IMPLEMENTATION

Pig Latin is fully implemented by the system Pig. Pig's architecture allows different systems to be plugged in as the execution platform for Pig Latin. The current implementation uses Hadoop, an open-source, scalable implementation of map-reduce [2], as the execution platform: Pig Latin programs are compiled into map-reduce jobs and executed using Hadoop. Pig, together with its Hadoop compiler, is an open-source project implemented by Apache and is available for general use [11].

A. Building a Logical Plan

As clients issue Pig Latin commands, the Pig interpreter first parses each command and verifies that the input files and bags it references are valid. For example, if the user enters p = COGROUP q BY ..., r BY ..., Pig verifies that the bags q and r have already been defined. Pig builds a logical plan for every bag the user defines: when a new bag is defined by a command, the logical plan for the new bag is constructed by combining the logical plans of the input bags with the current command. Thus, in the above example, the logical plan for p consists of a cogroup command having the logical plans for q and r as inputs. No processing is carried out while logical plans are constructed. Processing is activated only when the user invokes a STORE command on a bag; at that point, the logical plan for that bag is compiled into a physical plan and executed, as illustrated in Figure 3. This lazy style of execution is beneficial because it permits in-memory pipelining and other optimizations, such as filter reordering, across multiple Pig Latin commands.

[Figure 3: Pig Latin Workflow. A Pig Latin program is transformed into a logical plan, then into a physical plan, which is compiled into a map reduce plan whose execution produces the output.]

Pig is designed so that the parsing of Pig Latin and the construction of the logical plan are independent of the execution platform; only the compilation of the logical plan into a physical plan depends on which execution platform is chosen. Next, we describe the compilation into Hadoop map-reduce, the execution platform currently used by Pig.
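A sketch of this lazy evaluation, mirroring the p = COGROUP q, r example above (file names are hypothetical); the first three statements only extend the logical plan:

    -- Nothing is processed while the logical plan is built.
    q = LOAD 'q.dat' AS (k:int, v:chararray);
    r = LOAD 'r.dat' AS (k:int, w:chararray);
    p = COGROUP q BY k, r BY k;   -- logical plan for p combines the plans for q and r
    STORE p INTO 'output/p';      -- compilation to a physical plan and execution start here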


[Fig 4: Pig Latin Map Reduce Compilation. A chain of map-reduce jobs: LOAD and FILTER are pushed into map1 ahead of the first (CO)GROUP C1; each subsequent cogroup Ci, Ci+1 begins a new job with its own map and reduce stages (mapi, reducei, mapi+1, reducei+1).]

B. Map-Reduce Plan Compilation

Compilation of a Pig Latin logical plan into map-reduce jobs is straightforward. The map-reduce primitive essentially provides the capacity to do a large-scale group-by, where the map tasks assign keys for grouping and the reduce tasks process one group at a time. The Pig compiler begins by converting each (CO)GROUP command in the logical plan into a distinct map-reduce job with its own map and reduce functions. The map function for a (CO)GROUP command p first assigns keys to tuples based on the BY clause(s) of p; the reduce function initially performs no operation. The map-reduce boundary is thus the cogroup command. The sequence of FILTER and FOREACH commands from the LOAD to the first COGROUP operation C1 is pushed into the map function corresponding to C1 (see Figure 4). The commands that lie between subsequent COGROUP commands Ci and Ci+1 can be pushed into either
(a) the reduce function corresponding to Ci, or
(b) the map function corresponding to Ci+1.
Pig currently always follows option (a). Since grouping is often followed by aggregation, this approach reduces the amount of data that has to be materialized between map-reduce jobs. A sketch of this placement follows.
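As a sketch (the script and names are hypothetical), the comments below mark where each command would land under this compilation scheme:

    -- Hypothetical script annotated with its map-reduce placement.
    raw    = LOAD 'logs.txt' AS (user:chararray, bytes:long);  -- read in map of job 1
    big    = FILTER raw BY bytes > 0;                          -- pushed into map of job 1
    C1     = GROUP big BY user;                                -- map-reduce boundary (job 1)
    totals = FOREACH C1 GENERATE group, SUM(big.bytes);        -- reduce of job 1, per option (a)
    STORE totals INTO 'output/totals';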
If a COGROUP command has more than one input data set, the map function appends an extra field to each tuple to indicate the data set from which the tuple originated. The corresponding reduce function decodes this information and uses it to insert each tuple into the appropriate nested bag when the cogrouped tuples are generated.

Parallelism for LOAD is obtained because Pig operates over files residing in the Hadoop distributed file system. Parallelism for FILTER and FOREACH operations is also achieved, since for a given map-reduce job several map and reduce instances run in parallel. We also get parallelism for (CO)GROUP, since the output from the multiple map instances is repartitioned in parallel to the multiple reduce instances.

To implement the ORDER command, two map-reduce jobs are compiled. The first job samples the input to find quantiles of the sort key. The second job range-partitions the input according to the quantiles, followed by a local sort in the reduce phase, finally resulting in a globally sorted file.

The inflexibility of the map-reduce primitive causes some overheads when compiling Pig Latin into map-reduce jobs. For example, data must be materialized and replicated on the distributed file system between successive map-reduce jobs, and when dealing with multiple data sets an additional field must be added to every tuple to indicate the data set it came from. Since the Hadoop map-reduce implementation provides many desired properties such as parallelism, load balancing, and fault-tolerance, the associated overhead is often acceptable.


V. APPLICATIONS

Some of the important uses of Pig are described below:

• Pig is a powerful tool for querying data in a Hadoop cluster: Yahoo estimates that between 40% and 60% of its Hadoop workloads are generated from Pig Latin scripts [23].
• Pig is also used at Twitter (processing logs, mining tweet data); at AOL and MapQuest (for analytics and batch data processing); and at LinkedIn, where Pig is used to discover people you might know [23].
• With a continually increasing population, analyzing data related to crimes and crime rates is a major issue for governments in making strategic decisions to maintain law and order. The benefit of using Pig for such analysis is that fewer lines of code have to be written, which reduces overall development and testing time [20].
• Pig scripts have been used to build an efficient large-scale system for analyzing web log data through Map Reduce programming in the Hadoop framework [21].
• Pig has been used to evaluate the performance of a commercial RDBMS and Hadoop on astronomy simulation analysis tasks [22].

VI. CONCLUSIONS

This paper introduced Pig and its associated language Pig Latin, a data processing environment originally deployed at Yahoo. We have entered an era of Big Data, and Hadoop is a framework for the analysis and transformation of this Big Data using the Map Reduce paradigm. The Pig system compiles Pig Latin expressions into a sequence of map-reduce jobs and orchestrates the execution of these jobs on Hadoop. Pig's structure is amenable to substantial parallelization.
REFERENCES

[1] Bhosale, Harshawardhan S., and Devendra P. Gadekar. "A Review Paper on Big Data and Hadoop." International Journal of Scientific and Research Publications 4.10 (2014).
[2] Chavan, Vibhavari, and Rajesh N. Phursule. "Survey paper on big data." Int. J. Comput. Sci. Inf. Technol. 5.6 (2014): 7932-7939.
[3] Samak, Taghrid, Daniel Gunter, and Valerie Hendrix. "Scalable analysis of network measurements with Hadoop and Pig." Network Operations and Management Symposium (NOMS), 2012 IEEE. IEEE, 2012.
[4] Goyal, Vikas, and Deepak Soni. "Survey paper on big data analytics using Hadoop technologies."
[5] Wang, MingXue, Sidath B. Handurukande, and Mohamed Nassar. "RPig: A scalable framework for machine learning and advanced statistical functionalities." Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on. IEEE, 2012.
[6] Ouaknine, Keren, Michael Carey, and Scott Kirkpatrick. "The PigMix Benchmark on Pig, MapReduce, and HPCC Systems." Big Data (BigData Congress), 2015 IEEE International Congress on. IEEE, 2015.
[7] Samak, Taghrid, Daniel Gunter, and Valerie Hendrix. "Scalable analysis of network measurements with Hadoop and Pig." Network Operations and Management Symposium (NOMS), 2012 IEEE. IEEE, 2012.
[8] Gates, Alan F., et al. "Building a high-level dataflow system on top of Map-Reduce: the Pig experience." Proceedings of the VLDB Endowment 2.2 (2009): 1414-1425.
[9] Adnan, Muhammad, et al. "Minimizing big data problems using cloud computing based on Hadoop architecture." High-capacity Optical Networks and Emerging/Enabling Technologies (HONET), 2014 11th Annual. IEEE, 2014.
[10] Shang, Weiyi, Bram Adams, and Ahmed E. Hassan. "Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report." Journal of Systems and Software 85.10 (2012): 2195-2204.
[11] Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
[12] Olston, Christopher, et al. "Pig Latin: a not-so-foreign language for data processing." Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.
[13] Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
[14] Wang, Yaoguang, et al. "Improving MapReduce performance with partial speculative execution." Journal of Grid Computing 13.4 (2015): 587-604.
[15] Agarwal, Shafali, and Zeba Khanam. "Map Reduce: A Survey Paper on Recent Expansion." International Journal of Advanced Computer Science and Applications 6.8 (2015): 209-215.
[16] Olshannikova, Ekaterina, et al. "Conceptualizing Big Social Data." Journal of Big Data 4.1 (2017): 3.
[17] White, Tom, foreword by Doug Cutting. Hadoop: The Definitive Guide. ISBN 978-1-449-38973-4.
[18] Bhardwaj, Vibha, Rahul Johari, and Priti Bhardwaj. "Query execution evaluation in wireless network using MyHadoop." Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), 2015 4th International Conference on. IEEE, 2015.
[19] Tanimura, Yusuke, et al. "Extensions to the Pig data processing platform for scalable RDF data processing using Hadoop." Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on. IEEE, 2010.
[20] Jain, Arushi, and Vishal Bhatnagar. "Crime Data Analysis Using Pig with Hadoop." International Conference on Information Security & Privacy (ICISP 2015), 11-12 December 2015.
[21] Prasad, P. S. Durga, T. Vivekanandan, and A. Srinivasan. "A Methodology for WebLog Data analysis using Hadoop MapReduce and PIG." i-manager's Journal on Cloud Computing 3.1 (2015): 13.
[22] Loebman, Sarah, et al. "Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help?" Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.
[23] www.wikipedia.org, accessed 12/04/2017.
