Big Data Analytics

Big Data

Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. Its size and complexity are so great that no
traditional data management tool can store or process it efficiently.

Big Data Features/ Characteristics

The three core V’s are Volume, Velocity and Variety; Value and Veracity are often added as two further characteristics.

1. Volume
Volume refers to the huge amount of data that is collected and
generated every second in large organizations. This data comes from
different sources such as IoT devices, social media, videos, financial
transactions, and customer logs.

Storing and processing this huge amount of data used to be a problem.
Distributed systems such as Hadoop are now used to organize the data
collected from all these sources. The size of the data is crucial for
understanding its value, and volume also helps determine whether a
collection of data qualifies as Big Data.
2. Variety
Another one of the most important Big Data characteristics is its
variety. It refers to the different sources of data and their
nature. The sources of data have changed over the years. Earlier,
it was only available in spreadsheets and databases. Nowadays,
data is present in photos, audio files, videos, text files, and PDFs.

The variety of data is crucial for its storage and analysis.

3. Velocity
The term ‘velocity’ refers to the speed at which data is generated. How
fast the data is generated and processed to meet demand determines the
real potential of the data.

4. Value
Among the characteristics of Big Data, value is perhaps the most
important. No matter how fast or how much data is produced, it has to be
reliable and useful; otherwise, it is not good enough for processing or
analysis. Research suggests that poor-quality data can cost a company
almost 20% of its revenue.
Data scientists first convert raw data into information. Then this
data set is cleaned to retrieve the most useful data. Analysis and
pattern identification is done on this data set. If the process is a
success, the data can be considered to be valuable.

5. Veracity
This feature of Big Data is connected to the previous one. It
defines the degree of trustworthiness of the data. As most of
the data you encounter is unstructured, it is important to filter
out the unnecessary information and use the rest for processing.
The data can also show inconsistency at times, which hampers the ability
to handle and manage it effectively.

Big Data Challenges

1. Managing Voluminous Data


The speed at which big data is being created is quickly surpassing
the rate at which computing and storage systems are being developed,
and handling unstructured data is becoming quite problematic.
Unstructured data is data that cannot easily be stored in a traditional
column-row database or in a spreadsheet such as a Microsoft Excel table.
For these reasons, it is extremely difficult to search and analyze.

2. Lack of data professionals


To run these modern technologies and Big Data tools, companies
need skilled data professionals. These professionals will include
data scientists, data analysts and data engineers who are
experienced in working with the tools and making sense out of
huge data sets.
Companies face a shortage of Big Data professionals. This is because data
handling tools have evolved rapidly, but in most cases the professionals
have not.

3. Integrating data from a variety of sources


Big data entails handling data from many sources. The majority of
these sources use unique data collection methods and distinct
formats.
As such, it is not unusual to find inconsistencies even between data sets
describing the same variables, and reconciling them is quite challenging.
Data may contain wrong information, duplicates, and contradictions, and
data of inferior quality can hardly offer useful insights or help identify
precise opportunities for handling your business tasks.

4. Data Security
Securing these huge sets of data is one of the daunting challenges
of Big Data. Often companies are so busy understanding, storing and
analyzing their data sets that they push data security to later stages.
This is not a smart move, as unprotected data repositories can become
breeding grounds for malicious hackers.

5. Too Many Options!


Just like any type of growing technology, the market for big data
analytics has so many options that users find it confusing to settle
for one or a few that they need.
With too many choices, it can be challenging to identify the big data
technologies that are in line with your goals. This confusion affects
even individuals and businesses that are relatively well versed in data
matters.
The options overload goes beyond the right programs to deploy.
Data scientists use many strategies and methods to collect,
safeguard, and analyze data. Hence, there isn’t a one-size-fits-all
approach.

6. Organizational Resistance
Organizational resistance – even in other areas of business – has
been around since forever. Nothing new here! It is a problem that
companies can anticipate, and as such, decide the best way to deal
with the problem.
If it’s already happening in your organization, you should know that
it is not something out of the ordinary. Of the utmost importance
is to determine the best way to handle the situation to ensure big
data success.

7. Big Data Handling Costs


The management of big data demands a lot of expense, right from the
adoption stage. For instance, if your company chooses an on-premises
solution, it must be ready to spend money on new hardware, electricity,
new recruitments such as developers and administrators, and so on.
Additionally, you will be required to meet the costs of developing,
setting up, configuring and maintaining new software, even though the
frameworks needed are open source.
On the other hand, organizations that settle for a cloud-based solution
will spend on cloud services and on hiring new staff (developers and
administrators), as well as on the development, setup and maintenance of
the frameworks needed.

8. Upscaling Problems
One of the things that big data is known for is its dramatic
growth. Unfortunately, this feature is known as one of the most
serious big data challenges.
Even if the design of your solution is well thought out and flexible when
it comes to upscaling, the real problem does not lie in introducing new
processing and storage capacity.
The real problem is the complexity of upscaling in a way that ensures that
the performance of the system does not decline and that you don’t go
overboard with the budget.

Problems with Traditional Large-Scale System


1. Big Data Is Too Big for Traditional Storage
Perhaps the most obvious of the big data challenges is its
enormous scale.
Traditional data storage systems can, in theory, handle large amounts of
data. But when tasked with delivering the efficiency and insights we need,
many simply can’t keep up with the demands of modern data.

2. The Relational Database Conundrum


Relational SQL databases are trusty, timeworn methods to house,
read, and write data. But these databases can struggle to operate
efficiently, even before they’ve met maximum capacity. A
relational database containing large quantities of data can become
slow for many reasons. Inserting, updating, deleting, and
performing other operations can take longer depending on the
number of relationships they have to other tables.
Simply put: The more data that is in a relational database, the
longer each operation takes.

3. Big Data Is Too Complex for Traditional Storage


Another major challenge for traditional storage when it comes to big data
is the complexity of data types. Traditional data is “structured”: you can
organize it in tables with rows and columns that bear a straightforward
relation to one another.
Big data doesn’t always fit neatly into the relational rows and
columns of a traditional data storage system. It’s largely
unstructured, consisting of myriad file types and often including
images, videos, audio, and social media content. That’s why
traditional storage solutions are unsuitable for working with big
data: They can’t properly categorize it.

4. Big Data Is Too Fast for Traditional Storage


Traditional data storage systems are designed for steady data retention.
You can add more data regularly and then perform analysis on the
new data set. But big data grows almost instantaneously, and
analysis often needs to occur in real time. An RDBMS isn’t
designed for rapid fluctuations.
The high-velocity nature of big data is not kind to traditional
storage systems, which can be a root cause of project failure or
unrealized ROI.

Sources of Big Data


Technology today allows us to collect data at an astounding rate, in terms
of both volume and variety. There are various sources that generate data,
but in the context of big data the primary sources are as follows:

1. Social networks: Arguably, the primary source of all big data that we
   know of today is the social networks that have proliferated over the
   past 5-10 years. This is by and large unstructured data, represented by
   millions of social media postings and other data generated on a
   second-by-second basis through user interactions on the web across the
   world. Increasing access to the internet across the world has further
   fuelled the growth of data in social networks.

2. Media: Largely a result of the growth of social networks, media
   represents the millions, if not billions, of audio and visual uploads
   that take place on a daily basis. Videos uploaded on YouTube, music
   recordings on SoundCloud, and pictures posted on Instagram are prime
   examples of media, whose volume continues to grow in an unrestrained
   manner.

3. Data warehouses: Companies have long invested in specialized data
   storage facilities commonly known as data warehouses. A DW is
   essentially a collection of historical data that a company wishes to
   maintain and catalog for easy retrieval, whether for internal use or
   for regulatory purposes. As industries gradually shift toward storing
   data in platforms such as Hadoop and NoSQL, more and more companies
   are moving data from their pre-existing data warehouses to these newer
   technologies. Company emails, accounting records, databases, and
   internal documents are some examples of DW data that is now being
   offloaded onto Hadoop or Hadoop-like platforms that leverage multiple
   nodes to provide a highly available and fault-tolerant platform.
4. Sensors: A more recent phenomenon in the space of big data has been
   the collection of data from sensor devices. Sensors have always
   existed, and industries such as oil and gas have used drilling sensors
   for measurements at oil rigs for many decades; however, the advent of
   wearable Internet of Things devices such as Fitbit and the Apple Watch
   means that each individual can now stream data at a rate that only a
   few oil rigs could match just 10 years back.

Types of Big Data


Big Data is present in three basic forms. They are –

1. Structured data
As the name suggests, this kind of data is structured and is well-
defined. It has a consistent order that can be easily understood
by a computer or a human. This data can be stored, analyzed, and
processed using a fixed format. Usually, this kind of data has its
own data model.
You will find this kind of data in databases, where it is neatly
stored in columns and rows.
For example, a database consisting of all the details of employees
of a company is a type of structured data set.
2. Unstructured data
Any set of data that is not structured or well-defined is called
unstructured data. This kind of data is unorganized and difficult
to handle, understand and analyze. It does not follow a consistent
format and may vary at different points of time. Most of the data
you encounter comes under this category.
For example, unstructured data are your comments, tweets,
shares, posts, and likes on social media. The videos you watch on
YouTube and text messages you send via WhatsApp all pile up as a
huge heap of unstructured data.

3. Semi-structured data
This kind of data is somewhat structured but not completely. It may seem
unstructured at first and does not obey the formal structure of data
models such as an RDBMS. For example, CSV files are also considered
semi-structured data.

Google File System

Google File System (GFS or GoogleFS) is a proprietary distributed file
system developed by Google to provide efficient, reliable access to data
using large clusters of commodity hardware.
GFS is optimized for Google's core data storage and usage needs (primarily
the search engine), which generate enormous amounts of data that must be
retained.
A GFS cluster consists of multiple nodes. These nodes are divided into two
types: one Master node and multiple Chunkservers. Chunkservers are cheap,
off-the-shelf commodity machines that can go down for any number of
reasons, so it is important for each chunkserver to send heartbeat
messages to the Master so that the Master knows it is still alive. Suppose
a chunkserver holding some chunks dies: it stops sending heartbeats, and
the replica count of those chunks drops from the initial three to two.
Using the two remaining replicas, the GFS Master ensures that one more
copy is created on a chunkserver that is relatively free, so that the
replica count is brought back to three. In this way, even though commodity
hardware fails all the time, the GFS Master ensures that there are always
the required number of copies of each file.
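
To make the re-replication idea concrete, here is a small conceptual sketch in Java. This is not Google's actual code (GFS is proprietary); it simply models a master that tracks which chunkservers hold which chunks, with a hypothetical copyChunk() helper standing in for the real copy RPC.

import java.util.*;

// Conceptual sketch only: how a GFS-style master could restore the replica count
// for chunks that were held by a chunkserver that stopped sending heartbeats.
class ChunkMaster {
    static final int TARGET_REPLICAS = 3;                        // default replication in GFS
    Map<Long, Set<String>> chunkToServers = new HashMap<>();     // chunk label -> servers holding it

    // Called once a chunkserver has missed its heartbeats and is declared dead.
    void onChunkserverDeath(String deadServer, List<String> freeServers) {
        for (Map.Entry<Long, Set<String>> entry : chunkToServers.entrySet()) {
            Set<String> replicas = entry.getValue();
            if (!replicas.remove(deadServer)) continue;          // this chunk was not on the dead server
            while (replicas.size() < TARGET_REPLICAS && !freeServers.isEmpty()) {
                String source = replicas.iterator().next();      // copy from a surviving replica
                String target = freeServers.remove(0);           // pick a relatively free chunkserver
                copyChunk(entry.getKey(), source, target);       // hypothetical RPC that clones the chunk
                replicas.add(target);
            }
        }
    }

    void copyChunk(long chunkLabel, String from, String to) { /* placeholder for the actual copy */ }
}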
Each file is divided into fixed-size chunks, and the chunkservers store
these chunks. Each chunk is 64 MB and is assigned a globally unique 64-bit
label by the Master node at creation time, which helps in identifying
chunks.
Each chunk is replicated several times throughout the network; by default
it is replicated three times, but this is configurable.
The Master server does not usually store the actual chunks, but
rather all the metadata associated with the chunks, such as the
tables mapping the 64-bit labels to chunk locations and the files
they make up (mapping from files to chunks), the locations of the
copies of the chunks, what processes are reading or writing to a
particular chunk, etc.
It is important to note that the GFS client does not read data from the
GFS Master; it only reads metadata about the file. The actual reading and
writing of the file is done directly between the client and the GFS
chunkservers, which further reduces the load on the GFS Master.

The Master server grants permission for modification to a
process for a finite period of time during which no other process
will be granted permission by the Master server to modify the
chunk. The modifying chunkserver, which is always the primary
chunk holder, then propagates the changes to the chunkservers
with the backup copies. The changes are not saved until all
chunkservers acknowledge, thus guaranteeing the completion and
atomicity of the operation.
Programs access the chunks by first querying the Master server
for the locations of the desired chunks; if the chunks are not
being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and
receives the data from the chunkserver directly.
The Hadoop Distributed File System (HDFS) is based on the
Google File System (GFS) and provides a distributed file system
that is designed to run on commodity hardware.

GFS features include:

Fault tolerance
Critical data replication
Automatic and efficient data recovery
High aggregate throughput
Reduced client-master interaction because of the large chunk size
High availability

DFS or Distributed File System:


Distributed File System talks about managing data, i.e. files or
folders across multiple computers or servers. In other words, DFS
is a file system that allows us to store data over multiple nodes or
machines in a cluster and allows multiple users to access data.
Even though the files are stored across the network, DFS
organizes, and displays data in such a manner that a user sitting on
a machine will feel like all the data is stored in that very machine.

What is HDFS?
Hadoop Distributed file system or HDFS is a Java based
distributed file system that allows you to store large data across
multiple nodes in a Hadoop cluster. So, if you install Hadoop, you
get HDFS as an underlying storage system for storing the data in
the distributed environment.
Let’s take an example to understand it. Imagine that you have ten machines,
each with a 1 TB hard drive. If you install Hadoop as a platform on top of
these ten machines, you get HDFS as a storage service. The Hadoop
Distributed File System is distributed in such a way that every machine
contributes its individual storage for storing any kind of data.
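
Because HDFS is Java based, it also exposes a Java client API. The following is a minimal sketch (not taken from these notes) showing how a program might write and then read a small file through the standard org.apache.hadoop.fs.FileSystem API; the NameNode address and the path are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");     // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");         // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");                       // write a tiny file
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());                 // read it back
        }
        fs.close();
    }
}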

Advantages Of HDFS

1. Distributed Storage:

When you access Hadoop Distributed file system from any of the
ten machines in the Hadoop cluster, you will feel as if you have
logged into a single large machine which has a storage capacity of
10 TB (total storage over ten machines). What does it mean? It
means that you can store a single large file of 10 TB which will be
distributed over the ten machines (1 TB each). So, it is not limited
to the physical boundaries of each individual machine.

2. Distributed & Parallel Computation:

Because the data is divided across the machines, it allows us to
take advantage of Distributed and Parallel Computation. Let’s
understand this concept by the above example. Suppose, it takes
43 minutes to process 1 TB file on a single machine. So, now tell
me, how much time will it take to process the same 1 TB file when
you have 10 machines in a Hadoop cluster with similar
configuration – 43 minutes or 4.3 minutes? 4.3 minutes, Right!
What happened here? Each of the nodes is working with a part of
the 1 TB file in parallel. Therefore, the work which was taking 43
minutes before, gets finished in just 4.3 minutes now as the work
got divided over ten machines.
3. Horizontal Scalability:

Last but not the least, let us talk about the horizontal scaling or
scaling out in Hadoop. There are two types of scaling: vertical and
horizontal. In vertical scaling (scale up), you increase the
hardware capacity of your system. In other words, you procure
more RAM or CPU and add it to your existing system to make it
more robust and powerful. But there are challenges associated
with vertical scaling or scaling up:

● There is always a limit to which you can increase your
hardware capacity. So, you can’t keep on increasing the RAM
or CPU of the machine.
● In vertical scaling, you stop your machine first. Then you
increase the RAM or CPU to make it a more robust hardware
stack. After you have increased your hardware capacity, you
restart the machine. This down time when you are stopping
your system becomes a challenge.
In case of horizontal scaling (scale out), you add more nodes to
existing cluster instead of increasing the hardware capacity of
individual machines. And most importantly, you can add more
machines on the go i.e. Without stopping the system. Therefore,
while scaling out we don’t have any down time or green zone,
nothing of such sort. At the end of the day, you will have more
machines working in parallel to meet your requirements.

Features of HDFS

An overview of the features of HDFS:

● Cost: HDFS, in general, is deployed on commodity hardware like the
desktop/laptop you use every day. So, it is very economical in terms of
the cost of ownership of the project. Since we are using low-cost
commodity hardware, you don’t need to spend a huge amount of money to
scale out your Hadoop cluster. In other words, adding more nodes to your
HDFS is cost effective.
● Variety and Volume of Data: When we talk about HDFS
then we talk about storing huge data i.e. Terabytes &
petabytes of data and different kinds of data. So, you can
store any type of data into HDFS, be it structured,
unstructured or semi structured.
● Reliability and Fault Tolerance: When you store data on
HDFS, it internally divides the given data into data blocks
and stores it in a distributed fashion across your Hadoop
cluster. The information regarding which data block is
located on which of the data nodes is recorded in the
metadata. NameNode manages the meta data and the
DataNodes are responsible for storing the data.
Name node also replicates the data i.e. maintains multiple
copies of the data. This replication of the data makes HDFS
very reliable and fault tolerant. So, even if any of the nodes
fails, we can retrieve the data from the replicas residing on
other data nodes. By default, the replication factor is 3.
Therefore, if you store a 1 GB file in HDFS, it will finally
occupy 3 GB of space. The name node periodically updates the
metadata and keeps the replication factor consistent.
● Data Integrity: Data integrity refers to whether the data stored in
HDFS is correct or not. HDFS constantly checks the integrity of stored
data against its checksum. If it finds any fault, it reports it to the
name node. The name node then creates additional new replicas and deletes
the corrupted copies.
● High Throughput: Throughput is the amount of work done in
a unit time. It talks about how fast you can access the data
from the file system. Basically, it gives you an insight about
the system performance. As you have seen in the above
example where we used ten machines collectively to enhance
computation. There we were able to reduce the processing
time from 43 minutes to a mere 4.3 minutes as all the
machines were working in parallel. Therefore, by processing
data in parallel, we decreased the processing time
tremendously and thus, achieved high throughput.
● Data Locality: Data locality talks about moving processing
unit to data rather than the data to processing unit. In our
traditional system, we used to bring the data to the
application layer and then process it. But now, because of the
architecture and huge volume of the data, bringing the data
to the application layer will reduce the network performance
to a noticeable extent. So, in HDFS, we bring the
computation part to the data nodes where the data is
residing. Hence, you are not moving the data, you are bringing
the program or processing part to the data.

HDFS Architecture:

Apache HDFS or the Hadoop Distributed File System is a block-structured
file system where each file is divided into blocks of a pre-determined
size. These blocks are stored across a cluster of one or several machines.

Apache Hadoop HDFS Architecture follows a Master/Slave architecture, where
a cluster comprises a single NameNode (master node) and all the other
nodes are DataNodes (slave nodes). HDFS can be deployed on a broad
spectrum of machines that support Java. Though one can run several
DataNodes on a single machine, in the practical world these DataNodes are
spread across various machines.

There are 5 building blocks / daemons of Hadoop:

1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker

NameNode:

NameNode is the master node in the Apache Hadoop HDFS Architecture that
maintains and manages the blocks present on the DataNodes (slave nodes).
NameNode is a very highly available server that controls access to files
by clients.

Functions of NameNode:
● It is the master daemon that maintains and manages the
DataNodes (slave nodes)
● It records the metadata of all the files stored in the cluster,
e.g. The location of blocks stored, the size of the files,
permissions, hierarchy, etc. There are two files associated
with the metadata:
○ FsImage: It contains the complete state of the file
system namespace since the start of the NameNode.
○ EditLogs: It contains all the recent modifications made
to the file system with respect to the most recent
FsImage.
● It records each change that takes place to the file system
metadata. For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog.
● It regularly receives a Heartbeat and a block report from all
the DataNodes in the cluster to ensure that the DataNodes
are live.
● It keeps a record of all the blocks in HDFS and in which
nodes these blocks are located.
● The NameNode is also responsible to take care of the
replication factor of all the blocks which we will discuss in
detail later in this HDFS tutorial blog.
● In case of DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage and manages the communication traffic to
the DataNodes.

DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is
commodity hardware, that is, an inexpensive system which is not of high
quality or high availability.

Functions of DataNode:
● These are slave daemons or processes which run on each slave machine.
● The actual data is stored on DataNodes.
● The DataNodes perform the low-level read and write requests from the
file system’s clients.
● They send heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a
process called Secondary NameNode. The Secondary NameNode
works concurrently with the primary NameNode as a helper
daemon. And don’t be confused about the Secondary NameNode
being a backup NameNode because it is not.

Functions of Secondary NameNode:


● The Secondary NameNode constantly reads the file system metadata from
the RAM of the NameNode and writes it to the hard disk or the file
system.
● It is responsible for combining the EditLogs with the FsImage from the
NameNode.
● It downloads the EditLogs from the NameNode at regular intervals and
applies them to the FsImage. The new FsImage is copied back to the
NameNode and is used the next time the NameNode is started.
Hence, Secondary NameNode performs regular checkpoints in
HDFS. Therefore, it is also called CheckpointNode.

JobTracker and TaskTracker are two essential processes involved in
MapReduce execution in MRv1 (Hadoop version 1). Both processes are now
deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager,
ApplicationMaster and NodeManager daemons.
Job Tracker –
● JobTracker process runs on a separate node and not usually
on a DataNode.
● JobTracker is an essential Daemon for MapReduce execution
in MRv1. It is replaced by
ResourceManager/ApplicationMaster in MRv2.
● JobTracker receives the requests for MapReduce execution
from the client.
● JobTracker talks to the NameNode to determine the
location of the data.
● JobTracker finds the best TaskTracker nodes to execute
tasks based on the data locality (proximity of the data) and
the available slots to execute a task on a given node.
● JobTracker monitors the individual TaskTrackers and submits the overall
status of the job back to the client.
● The JobTracker process is critical to the Hadoop cluster in terms of
MapReduce execution.
● When the JobTracker is down, HDFS will still be functional, but
MapReduce execution cannot be started and the existing MapReduce jobs
will be halted.

TaskTracker –
● TaskTracker runs on DataNodes, usually on all of them.
● TaskTracker is replaced by Node Manager in MRv2.
● Mapper and Reducer tasks are executed on DataNodes
administered by TaskTrackers.
● TaskTrackers will be assigned Mapper and Reducer tasks to
execute by JobTracker.
● TaskTracker will be in constant communication with the
JobTracker signalling the progress of the task in execution.
● TaskTracker failure is not considered fatal. When a
TaskTracker becomes unresponsive, JobTracker will assign
the task executed by the TaskTracker to another node.

Blocks:
Now, as we know, the data in HDFS is scattered across the DataNodes as
blocks. Let’s have a look at what a block is and how it is formed.

Blocks are nothing but the smallest contiguous locations on your hard
drive where data is stored. In general, in any file system, you store data
as a collection of blocks. Similarly, HDFS stores each file as blocks
which are scattered throughout the Apache Hadoop cluster. The default size
of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x),
which you can configure as per your requirement.
It is not necessary that each file in HDFS is stored in an exact multiple
of the configured block size (128 MB, 256 MB, etc.). Let’s take an example
where I have a file “example.txt” of size 514 MB. Suppose that we are
using the default block size of 128 MB. Then, how many blocks will be
created? Five, right? The first four blocks will be 128 MB each, but the
last block will be of 2 MB size only.

Now, you must be wondering why we need such a huge block size, i.e. 128 MB.

Well, whenever we talk about HDFS, we talk about huge data sets, i.e.
terabytes and petabytes of data. If we had a block size of, say, 4 KB, as
in a Linux file system, we would have too many blocks and therefore too
much metadata. Managing this number of blocks and this metadata would
create huge overhead, which is something we don’t want.

Replication Management:
HDFS provides a reliable way to store huge data in a distributed
environment as data blocks. The blocks are also replicated to
provide fault tolerance. The default replication factor is 3 which
is again configurable. So, as you can see in the figure below where
each block is replicated three times and stored on different
DataNodes (considering the default replication factor):

Therefore, if you are storing a file of 128 MB in HDFS using the


default configuration, you will end up occupying a space of 384 MB
(3*128 MB) as the blocks will be replicated three times and each
replica will be residing on a different DataNode.
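
As a small, hedged illustration (the path below is hypothetical), the replication factor of an individual file can be inspected or changed through the same FileSystem Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.csv");                 // hypothetical file
        short current = fs.getFileStatus(file).getReplication();    // read the current factor
        System.out.println("Current replication factor: " + current);
        fs.setReplication(file, (short) 2);                         // ask HDFS to keep only 2 replicas of this file
    }
}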

Note: The NameNode collects block reports from the DataNodes periodically
to maintain the replication factor. Therefore, whenever a block is
over-replicated or under-replicated, the NameNode deletes or adds replicas
as needed.

Rack Awareness:
A rack is nothing but a collection of 30-40 DataNodes or
machines in a Hadoop cluster located in a single data center or
location. These DataNodes in a rack are connected to the
NameNode through traditional network design via a network
switch. A large Hadoop cluster will have multiple racks.

The process of making Hadoop aware of which machine is part of which rack,
and of how these racks are connected to each other within the Hadoop
cluster, is what defines rack awareness. In a Hadoop cluster, the NameNode
keeps the rack IDs of all the DataNodes and uses this rack information to
choose the closest DataNode while storing data blocks. In simple terms,
knowing how the different data nodes are distributed across the racks,
i.e. knowing the cluster topology, is called rack awareness in Hadoop.
Rack awareness is important as it ensures data reliability and helps to
recover data in case of a rack failure.

Hadoop keeps multiple copies for all data that is present in HDFS.
If Hadoop is aware of the rack topology, each copy of data can be
kept in a different rack. By doing this, in case an entire rack
suffers a failure for some reason, the data can be retrieved from
a different rack.

Hadoop Ecosystem
Hadoop Ecosystem is a platform or a suite which provides various
services to solve the big data problems. It includes Apache
projects and various commercial tools and solutions. There are
four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and
Hadoop Common.

Following are the components that collectively form a Hadoop ecosystem:

● HDFS: Hadoop Distributed File System


● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLlib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing cluster
● Oozie: Job Scheduling

MapReduce
MapReduce is one of the core building blocks of processing in
Hadoop framework. MapReduce is a programming framework that
allows us to perform distributed and parallel processing on large
data sets in a distributed environment.

● MapReduce consists of two distinct tasks – Map and Reduce.


● As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
● So, the first is the map job, where a block of data is read
and processed to produce key-value pairs as intermediate
outputs.
● The output of a Mapper or map job (key-value pairs) is input
to the Reducer.
● The reducer receives the key-value pair from multiple map
jobs.
● Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pair) into a smaller set of tuples or
key-value pairs which is the final output.
Let us understand more about MapReduce and its components.
MapReduce majorly has the following three Classes. They are,

Mapper Class

The first stage in data processing using MapReduce is the Mapper class.
Here, the RecordReader processes each input record and generates the
respective key-value pair. Hadoop’s Mapper stores this intermediate data
on the local disk.

● Input Split
It is the logical representation of data. It represents a block
of work that contains a single map task in the MapReduce
Program.

● RecordReader
It interacts with the Input split and converts the obtained
data in the form of Key-Value Pairs.

Reducer Class
The Intermediate output generated from the mapper is fed to the
reducer which processes it and generates the final output which is
then saved in the HDFS.

Driver Class
The major component in a MapReduce job is the Driver class. It is
responsible for setting up a MapReduce job to run in Hadoop. In it, we
specify the names of the Mapper and Reducer classes along with the data
types and the job name.
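
As a hedged illustration of what a driver looks like, here is a minimal word-count driver written against the standard org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are our own; a sketch of those two classes follows the word-count walkthrough below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The driver wires the job together: name, mapper, reducer, output types, input and output paths.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the directory holding example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}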
MapReduce Tutorial: A Word Count Example of MapReduce
Let us understand, how a MapReduce works by taking an example
where I have a text file called example.txt whose contents are as
follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using
MapReduce. So, we will be finding the unique words and the number of
occurrences of those unique words.

● First, we divide the input into three splits. This will distribute the
work among all the map nodes.
● Then, we tokenize the words in each of the mappers and give
a hardcoded value (1) to each of the tokens or words. The
rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
● Now, a list of key-value pair will be created where the key is
nothing but the individual words and value is one. So, for the
first line (Dear Bear River) we have 3 key-value pairs – Dear,
1; Bear, 1; River, 1. The mapping process remains the same on
all the nodes.
● After the mapper phase, a partition process takes place
where sorting and shuffling happen so that all the tuples with
the same key are sent to the corresponding reducer.
● So, after the sorting and shuffling phase, each reducer will
have a unique key and a list of values corresponding to that
very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
● Now, each Reducer counts the values which are present in that list of
values. For example, the reducer gets the list of values [1,1] for the
key Bear. It then counts the number of ones in the list and gives the
final output as – Bear, 2.
● Finally, all the output key-value pairs are collected and written to
the output file. (A Java sketch of this mapper and reducer follows this
list.)
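
The walkthrough above maps directly onto a short Mapper and a short Reducer. This is a minimal sketch using the standard Hadoop mapreduce API, matching the WordCountDriver shown earlier; the class names are our own.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in the input line, emit (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString(), " ,");
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // e.g. (Bear, 1)
        }
    }
}

// Reduce phase: for every word, sum its list of 1s, e.g. Bear, [1,1] -> (Bear, 2).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}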

Advantages of MapReduce
The two biggest advantages of MapReduce are:

1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes
and each node works with a part of the job simultaneously.
So, MapReduce is based on Divide and Conquer paradigm
which helps us to process the data using different machines.
As the data is processed by multiple machines in parallel instead of a
single machine, the time taken to process the data is reduced by a
tremendous amount.

2. Data Locality:
Instead of moving data to the processing unit, we are moving
the processing unit to the data in the MapReduce Framework.
In the traditional system, we used to bring data to the
processing unit and process it. But, as the data grew and
became very huge, bringing this huge amount of data to the
processing unit posed the following issues:

● Moving huge data to the processing unit is costly and deteriorates the
network performance.
● Processing takes time as the data is processed by a
single unit which becomes the bottleneck.
● The master node can get over-burdened and may fail.

Now, MapReduce allows us to overcome the above issues by bringing the
processing unit to the data. The data is distributed among multiple nodes,
where each node processes the part of the data residing on it. This gives
us the following advantages:

● It is very cost-effective to move the processing unit to the data.
● The processing time is reduced as all the nodes are
working with their part of the data in parallel.
● Every node gets a part of the data to process and
therefore, there is no chance of a node getting
overburdened.

Record Reader
● In Hadoop, jobs run in the form of map and reduce tasks.
● Mappers and reducers can do their job only with key-value pairs.
● The record reader is an interface between the input split and the
mapper. By default it reads one line at a time from the input split,
converts each line into a key-value pair, and passes these key-value
pairs to the mapper as input.

Combiner
● It works between mapper and reducer
● The output produced by the Mapper is the intermediate
output in terms of key-value pairs which is massive in size. If
we directly feed this huge output to the Reducer, then that
will result in increasing the Network Congestion.
● So, to minimize this network congestion, we can place a combiner
between the Mapper and the Reducer. These combiners are also known as
semi-reducers.
● It is not necessary to add a combiner to your Map-Reduce
program, it is optional.
● Combiner is also a class in our java program like Map and
Reduce class that is used in between this Map and Reduce
classes.
● Combiner helps us to produce abstract details or a summary
of very large datasets.
● When we process very large datasets using Hadoop, a combiner is very
useful and results in an enhancement of overall performance.

How does combiner work?


Consider an example in which two Mappers contain different data: the main
text file is divided into two different Mappers, and each mapper is
assigned a different line of our data. Since we have two lines of data, we
have two Mappers to handle one line each. The Mappers produce the
intermediate key-value pairs, where the key is a particular word and the
value is its count. For example, for the data “Geeks For Geeks For”, the
key-value pairs are shown below.

// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)

The key-value pairs generated by the Mapper are known as the intermediate
key-value pairs, or intermediate output, of the Mapper. We can minimize
the number of these key-value pairs by introducing a combiner for each
Mapper in our program. In our case, we have four key-value pairs generated
by each Mapper. Since these intermediate key-value pairs are not ready to
be fed directly to the Reducer (doing so would increase network
congestion), the combiner combines them before sending them to the
Reducer, merging pairs that share the same key. For the data “Geeks For
Geeks For”, the combiner will partially reduce them by merging the pairs
with the same key and generate the new key-value pairs shown below.

// Partially reduced key-value pairs with combiner


(Geeks,2)
(For,2)

With the help of the combiner, the Mapper output is partially reduced in
size (fewer key-value pairs) and can then be passed to the Reducer for
better performance. The Reducer will again reduce the output obtained from
the combiners and produce the final output, which is stored in HDFS (the
Hadoop Distributed File System).
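
In the Java API, enabling a combiner is usually a one-line change in the driver: a class with the same signature as the reducer (often the reducer itself, as in the word-count sketch earlier) is registered with setCombinerClass.

// Inside the driver, after setting the mapper and reducer:
job.setCombinerClass(WordCountReducer.class);   // runs a local "mini reduce" on each mapper's output

This works for word count because summing counts is associative and commutative; as noted under the disadvantages below, Hadoop gives no guarantee about how many times (if at all) the combiner actually runs.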

Advantage of combiners

● Reduces the time taken for transferring the data from the Mapper to the
Reducer.
● Reduces the size of the intermediate output generated by
the Mapper.
● Improves performance by minimizing Network congestion.

Disadvantage of combiners

● The intermediate key-value pairs generated by the Mappers are stored on
the local disk, and the combiners run on them later to partially reduce
the output, which results in expensive disk input-output.
● A MapReduce job cannot depend on the combiner function, because there
is no guarantee that it will be executed.

Partitioner

HashPartitioner is the default Partitioner. It computes a hash value for
the key and assigns the partition based on this result. (A Java sketch of
this logic follows the list below.)
Need of Partitioner
● Hadoop partitioning specifies that all the values for each key are
grouped together and makes sure that all the values of a single key go to
the same reducer. This allows even distribution of the map output over
the reducers.
● Partitioner in a MapReduce job redirects the mapper output
to the reducer by determining which reducer handles the
particular key.
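
To make the hashing idea concrete, here is a minimal sketch of partitioning logic in Java; the default HashPartitioner in Hadoop follows essentially this formula, and the class name below is our own.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every occurrence of the same key to the same reducer by hashing the key.
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take it modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).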
Understanding Hadoop API for MapReduce
framework (Old and New)
Pig

I would like you to ask yourself a question – “while MapReduce was there
for Big Data Analytics, why did Apache Pig come into the picture?”

The sweet and simple answer to this is:

Approximately 10 lines of Pig code is equal to 200 lines of MapReduce
code.

Writing MapReduce jobs in Java is not an easy task for everyone. Thus,
Apache Pig emerged as a boon for programmers who are not comfortable with
Java or Python. Even someone who knows Java and is good with MapReduce
will often prefer Apache Pig because of the ease of working with it. Let
us take a look now.

Programmers face difficulty writing MapReduce tasks as it requires Java or
Python programming knowledge. For them, Apache Pig is a savior.

Apache Pig is an open-source, high-level dataflow system developed by
Yahoo which is used to analyze large data sets. Pig gives you a way to
write simple queries, and these queries are converted into MapReduce
programs by the Apache Pig tool. This MapReduce is executed over the
Hadoop cluster and the result is then sent back to the client. Pig is
designed to provide an abstraction over MapReduce, reducing the complexity
of writing a MapReduce program. We can perform data manipulation
operations very easily in Hadoop using Apache Pig.

The features of Apache pig are:

● It is a dataflow language
● Pig enables programmers to write complex data
transformations without knowing Java.
● Apache Pig has two main components – the Pig Latin language
and the Pig Run-time Environment, in which Pig Latin
programs are executed.
● For Big Data Analytics, Pig gives a simple data flow language
known as Pig Latin which has functionalities similar to SQL
like join, filter, limit etc.
● Developers who work with scripting languages and SQL can leverage Pig
Latin easily, which gives them ease of programming with Apache Pig. Pig
Latin provides various built-in operators like join, sort, filter, etc.
to read, write, and process large data sets. Thus it is evident that Pig
has a rich set of operators.
● Programmers write scripts using Pig Latin to analyze data and
these scripts are internally converted to Map and Reduce
tasks by Pig MapReduce Engine. Before Pig, writing
MapReduce tasks was the only way to process the data
stored in HDFS.
● If a programmer wants to write custom functions which are unavailable
in Pig, Pig allows them to write User Defined Functions (UDFs) in any
language of their choice, like Java, Python, Ruby, Jython, JRuby etc.,
and embed them in a Pig script. This provides extensibility to Apache
Pig.
● Pig can process any kind of data, i.e. structured, semi-
structured or unstructured data, coming from various
sources. Apache Pig handles all kinds of data.
● Approximately, 10 lines of pig code is equal to 200 lines of
MapReduce code.
● It can handle inconsistent schema (in case of unstructured
data).
● Apache Pig extracts the data, performs operations on that
data and dumps the data in the required format in HDFS i.e.
ETL (Extract Transform Load).
● Apache Pig automatically optimizes the tasks before
execution, i.e. automatic optimization.
● It allows programmers and developers to concentrate upon
the whole operation irrespective of creating mapper and
reducer functions separately.
Pig Components
1. Pig Latin: made up of a series of operations or transformations that
are applied to input data to produce output
2. Pig execution: the different ways in which you can submit a query to
Apache Pig:
a. Script: Pig commands contained in a file (.pig)
b. Grunt: an interactive shell for running Pig commands
c. Embedded: running Pig scripts from Java

Pig Architecture
Pig Latin Scripts

● Pig Latin is a language used to process large amounts of data.


● Pig Latin Scripts are written by the programmer using Pig
Latin language and built-in operators
● We submit Pig scripts to the Apache Pig execution
environment

Grunt Shell
● It is an interactive shell environment used by the
programmers to write and execute Pig Latin scripts

Parser
● After writing Pig Latin scripts, they are given to the parser
which is the next component
● The parser will then do the syntax checking, type checking
and it constructs a logical plan in the form of Directed
Acyclic Graph (DAG)
● The nodes of the DAG represent logical/relational operators and the
edges represent the data flow (i.e. how the data flows between the jobs)

Optimizer
● The DAG (or logical plan) that is output from the parser will
be provided as input to the optimizer
● The optimizer will optimize the given logical plan by applying
some projections and some other operations and it then
generates the optimized logical plan

Compiler
● The optimized logical plan is given as input to the next
component, which is the compiler
● The compiler compiles the given logical plan and generates a
series of map-reduce jobs

Execution Engine
● Finally, as shown in the figure, these MapReduce jobs are
submitted for execution to the execution engine.
● The execution engine executes the MapReduce jobs by taking
help from the hdfs where the data is residing
● The MapReduce jobs are executed and give the required result. The
result can be displayed on the screen using the “DUMP” statement and can
be stored in HDFS using the “STORE” statement.
To sum it up: the user first writes Pig Latin scripts, typically in the
Grunt shell, and submits them for execution. The script is given to the
parser, and the parser does three things: syntax checking, type checking,
and construction of the DAG. The DAG produced by the parser, which may
also be called the logical plan, is given as input to the optimizer. The
optimizer further optimizes this logical plan and produces an optimized
logical plan, which is given as input to the compiler. The compiler
compiles the logical plan into a series of MapReduce jobs. These MapReduce
jobs are given to the execution engine, which executes them by taking the
data from the Hadoop Distributed File System; after execution we get the
result. This entire flow is the Pig execution model, and at a high level
we call it the Pig architecture.

NOTE: Whatever scripts you write ultimately change into MapReduce jobs;
the underlying abstraction in Hadoop is MapReduce, and these MapReduce
jobs are executed by the execution engine.

Pig Latin Data Model


The data model of Pig Latin enables Pig to handle all types of data.
Pig Latin can handle both atomic data types like int, float, long,
double etc. and complex data types like tuple, bag and map.
Atomic /Scalar Data type
● Atomic or scalar data types are the basic data types which
are used in all the languages like string, int, float, long,
double, char[], byte[].
● These are also called the primitive data types.
● The value of each cell in a field (column) is an atomic data
type.
● Example: Scalar data types are − ‘1’, ‘Linkin Park’,
‘7’, ‘California’ etc.

Now we will talk about complex data types in Pig Latin i.e. Tuple,
Bag and Map.

Tuple
● Tuple is an ordered set of fields which may contain different
data types for each field.
● You can understand it as the records stored in a row in a
relational database.
● A tuple is a set of cells from a single row.
● A tuple is represented by ‘()’ symbol.
● Since tuples are ordered, we can access fields in each tuple
using indexes of the fields
● Example of tuple − (1, Linkin Park, 7, California)

Bag
● A bag is a collection of a set of tuples and these tuples are
subset of rows or entire rows of a table.
● A bag can contain duplicate tuples, and it is not mandatory
that they need to be unique.
● The bag has a flexible schema i.e. tuples within the bag can
have different number of fields.
● A bag is represented by ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California),
(Metallica, 8), (Mega Death, Los Angeles)}

Map
● A map is a set of key-value pairs used to represent data elements.
● The key must be of type chararray and should be unique, like a column
name, so that it can be indexed and the value associated with it can be
accessed on the basis of the key.
● The value can be of any data type.
● Maps are represented by the ‘[]’ symbol, and key and value are
separated by the ‘#’ symbol.
● Example of maps− [band#Linkin Park, members#7 ],
[band#Metallica, members#8 ]

Going with the Pig Latin Application Flow

At its core, Pig Latin is a dataflow language, where you define a data
stream and a series of transformations that are applied to the data as it
flows through your application. This is in contrast to a control flow
language (like C or Java), where you write a series of instructions. In
control flow languages, we use constructs like loops and conditional logic
(like an if statement). You won’t find loops and if statements in Pig
Latin.
The following Listing specifies Sample Pig Code to illustrate the
data processing dataflow.
A = LOAD 'data_file.txt';
...
B = GROUP ... ;
...
C= FILTER ...;
...
DUMP B;
..
STORE C INTO 'Results';

Some of the text in this example actually looks like English. Looking at
each line in turn, you can see the basic flow of a Pig program. This code
can either be part of a script or issued on the interactive shell called
Grunt.

Load:
● You first load (LOAD) the data you want to manipulate. As in
a typical MapReduce job, that data is stored in HDFS.
● For a Pig program to access the data, you first tell Pig what
file or files to use. For that task, you use the LOAD
'data_file' command.
● Here, 'data_file' can specify either an HDFS file or a
directory. If a directory is specified, all files in that
directory are loaded into the program.
● If the data is stored in a file format that isn’t natively
accessible to Pig, you can optionally add the USING function
to the LOAD statement to specify a user-defined function
that can read in (and interpret) the data.

Transform:
● You run the data through a set of transformations which are
translated into a set of Map and Reduce tasks.
● The transformation logic is where all the data manipulation
happens.
● You can FILTER out rows that aren’t of interest, JOIN two
sets of data files, GROUP data to build aggregations, ORDER
results, and do much, much more.

Dump:
● Finally, you dump (DUMP) the results to the screen
● or
● Store (STORE) the results in a file somewhere.

Working through the ABCs of Pig Latin

Pig Latin is the language for Pig programs. Pig translates the Pig
Latin script into MapReduce jobs that can be executed within
Hadoop cluster. When coming up with Pig Latin, the development
team followed three key design principles:

● Keep it simple.
○ Pig Latin provides a streamlined method for interacting
with Java MapReduce.
○ It’s an abstraction, in other words, that simplifies the
creation of parallel programs on the Hadoop cluster for
data flows and analysis.
○ Complex tasks may require a series of interrelated data
transformations — such series are encoded as data flow
sequences.
○ Writing data transformation and flows as Pig Latin
scripts instead of Java MapReduce programs makes
these programs easier to write, understand, and
maintain because
a) you don’t have to write the job in Java,
b) you don’t have to think in terms of MapReduce,
and
c) you don’t need to come up with custom code to
support rich data types.

Pig Latin provides a simpler language to exploit your Hadoop cluster,
thus making it easier for more people to leverage the power of Hadoop
and become productive sooner.
● Make it smart.
○ You may recall that the Pig Latin Compiler does the work
of transforming a Pig Latin program into a series of
Java MapReduce jobs.
○ The trick is to make sure that the compiler can optimize
the execution of these Java MapReduce jobs
automatically, allowing the user to focus on semantics
rather than on how to optimize and access the data.
○ The cost-based optimizer chooses the most efficient
approach for execution.

● Don’t limit development.


○ Make Pig extensible so that developers can add
functions to address their particular business problems.

Pig Operators
Evaluating Local and Distributed Mode of running Pig
Scripts

Apache Pig Execution Modes


You can run Apache Pig in two modes, namely Local Mode and MapReduce
(HDFS) Mode.
Local Mode
● In this mode, all the files are installed and run from your
local host and local file system.
● There is no need of Hadoop or HDFS.
● This mode is generally used for testing purpose.

MapReduce Mode (also known as Hadoop mode)


● MapReduce mode is where we load or process the data that
exists in the Hadoop File System (HDFS) using Apache Pig.
● In this mode, whenever we execute the Pig Latin statements
to process the data, a MapReduce job is invoked in the back-
end to perform a particular operation on the data that exists
in the HDFS.

If you have a terabyte of data that you want to perform operations on and
you want to develop a program interactively, you may soon find things
slowing down considerably, and you may have to keep growing your storage.
Local mode allows you to work with a subset of your data in a more
interactive manner so that you can figure out the logic (and work
out the bugs) of your Pig program. After you have things set up as
you want them and your operations are running smoothly, you can
then run the script against the full data set using
MapReduce mode.

Checking Out the Pig Script Interfaces


Pig programs can be packaged in three different ways:
● Script: This method is nothing more than a file containing Pig
Latin commands, identified by the .pig suffix (FlightData.pig,
for example). Ending your Pig program with the .pig extension
is a convention but not required. The commands are
interpreted by the Pig Latin compiler and executed in the
order determined by the Pig optimizer.
● Grunt: Grunt acts as a command interpreter where you can
interactively enter Pig Latin at the Grunt command line and
immediately see the response. This method is helpful for
prototyping during initial development and with what-if
scenarios (a short Grunt session is sketched after this list).
● Embedded: Pig Latin statements can be executed within
Java, Python, or JavaScript programs.
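As a rough sketch of the Grunt approach (the data file and field
names are hypothetical), an interactive session in local mode might
look like this:

$ pig -x local
grunt> flights = LOAD 'flights.tsv' AS (carrier:chararray, miles:int);
grunt> carriers = FOREACH flights GENERATE carrier;
grunt> DUMP carriers;
grunt> quit;

Here LOAD without a USING clause falls back to the default
tab-delimited PigStorage loader.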

Pig scripts, Grunt shell Pig commands, and embedded Pig programs
can run in either Local mode or MapReduce mode.
To specify whether a script or Grunt shell is executed locally or in
Hadoop mode, just pass the -x flag to the pig command.

The following is an example of how you'd specify running your Pig
script in local mode:
pig -x local milesPerCarrier.pig

Here's how you'd run the Pig script in Hadoop mode, which is the
default if you don't specify the flag:
pig -x mapreduce milesPerCarrier.pig

By default, when you specify the pig command without any
parameters, it starts the Grunt shell in Hadoop mode. If you want
to start the Grunt shell in local mode, just add the -x local flag to
the command. Here is an example: pig -x local

Scripting with Pig Latin

Hadoop is a rich and quickly evolving ecosystem with a growing set
of new applications. Rather than try to keep up with all the
requirements for new capabilities, Pig is designed to be extensible
via user-defined functions, also known as UDFs. UDFs can be
written in a number of programming languages, including Java,
Python, and JavaScript. Developers are also posting and sharing a
growing collection of UDFs online. (Look for Piggy Bank and
DataFu, to name just two examples of such online collections.)
Some of the Pig UDFs that are part of these repositories are
LOAD/STORE functions (XML, for example), date/time
functions, and text, math, and statistics functions.
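As a hedged sketch of how one of these shared UDFs might be used
(the jar name, data file, and the UPPER class path are assumptions
based on the Piggy Bank repository, not something defined in this
material):

-- assumes piggybank.jar has been downloaded or built locally
REGISTER piggybank.jar;
DEFINE Upper org.apache.pig.piggybank.evaluation.string.UPPER();
carriers = LOAD 'carriers.txt' AS (name:chararray);
upper_names = FOREACH carriers GENERATE Upper(name);
DUMP upper_names;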

Pig can also be embedded in host languages such as Java, Python,
and JavaScript, which allows you to integrate Pig with your
existing applications. It also helps overcome limitations in the Pig
language. One of the most commonly cited limitations is that
Pig doesn't support control flow statements: if/else, while loops,
for loops, and condition statements. Pig natively supports data flow
but needs to be embedded within another language to provide
control flow. There are tradeoffs, however, to embedding Pig in a
control-flow language: for example, if a Pig statement is embedded
in a loop, every iteration of the loop that runs the Pig
statement causes a separate MapReduce job to run.

Hive

Apache Hive is a data warehousing tool (query engine) in the
Hadoop Ecosystem, which provides an SQL-like language for querying
and analyzing Big Data. The motivation behind the development of
Hive is a frictionless learning path for SQL developers and
analysts. Hive is not only a saviour for people from non-
programming backgrounds, but it also reduces the work of
programmers who spend long hours writing MapReduce programs.

Apache Hive is a data warehouse system built on top of Hadoop
and is used for analyzing structured and semi-structured data.
Hive abstracts the complexity of Hadoop MapReduce. Basically, it
provides a mechanism to project structure onto the data and
perform queries written in HQL (Hive Query Language) that are
similar to SQL statements. Internally, these HQL queries get
converted into MapReduce jobs by the Hive compiler. Therefore,
you don’t need to worry about writing complex MapReduce
programs to process your data using Hadoop. It is targeted
towards users who are comfortable with SQL. Apache Hive
supports Data Definition Language (DDL), Data Manipulation
Language (DML) and User Defined Functions (UDF).

Advantages of Hive
● Useful for people who aren't from a programming background,
as it eliminates the need to write complex MapReduce
programs.
● Extensible and scalable to cope with the growing volume
and variety of data, without affecting the performance of the
system.
● It is an efficient ETL (Extract, Transform, Load) tool.
● Hive supports any client application written in Java, PHP,
Python, C++ or Ruby by exposing its Thrift server. (You can
use these client-side languages embedded with SQL for
accessing a database such as DB2, etc.)
● As the metadata information of Hive is stored in an RDBMS,
it significantly reduces the time to perform semantic checks
during query execution.

Apache Hive: Architecture and its components


The following architecture explains the flow of query submission
into Hive.

The Hive Architecture can be categorized into the following
components:

● Hive Clients: Hive supports applications written in many
languages like Java, C++, Python etc. using JDBC, Thrift and
ODBC drivers. Hence, one can always write a Hive client
application in a language of their choice.
● Hive Services: Apache Hive provides various services like
CLI, Web Interface etc. to perform queries.
● Processing framework and Resource Management:
Internally, Hive uses the Hadoop MapReduce framework as its de
facto engine to execute the queries.
● Distributed Storage: As Hive is installed on top of Hadoop,
it uses the underlying HDFS for the distributed storage.

Now, let us explore the first two major components in the Hive
Architecture:

1. Hive Clients:
Hive allows writing applications in various languages, including
Java, Python, and C++. It supports different types of clients such
as:
● Thrift Server: It is a cross-language service provider
platform that serves requests from all those programming
languages that support Thrift.
● JDBC Driver: It is used to establish a connection between
Hive and Java applications.
● ODBC Driver: It allows applications that support the
ODBC protocol to connect to Hive.

2. Hive Services:
Hive provides many services as shown in the image above. Let us
have a look at each of them:
● Hive CLI (Command Line Interface): The Hive CLI
(Command Line Interface) is a shell where we can execute
Hive queries and commands.
● Apache Hive Web Interface: The Hive Web UI is just an
alternative to the Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
● Hive Server: Hive Server is built on Apache Thrift and is
therefore also referred to as the Thrift Server. It accepts
requests from different clients and provides them to the Hive Driver.
● Apache Hive Driver: Compilation, optimization and execution
happen here. It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. Then, the
driver passes the query to the compiler where parsing, type
checking and semantic analysis takes place with the help of
schema present in the metastore. In the next step, an
optimized logical plan is generated in the form of a DAG
(Directed Acyclic Graph) of map-reduce tasks and HDFS
tasks. Finally, the execution engine executes these tasks in
the order of their dependencies, using Hadoop.
● Metastore: You can think of the metastore as a central
repository for storing all the Hive metadata. Hive metadata
includes various types of information, like the structure of tables
and partitions, along with the columns, column types,
serializers and deserializers, etc.

Different components present in hive


1. Shell: it is the place where we write our queries and execute
them
2. Metastore: it is the place where all the details related to our
tables (like the schema) are stored
3. Execution engine: it is the component which runs the tasks
produced by the compiler, in the order of their dependencies,
on Hadoop
4. Driver: it is the component which manages the query and
executes its plan in the form of directed acyclic graphs (DAGs)
5. Compiler: it compiles the query that we write into an
execution plan that can be run to provide the output

Data types in Hive

Data types are very important elements in the Hive query language
and in data modeling. For defining table column types, we must
know about the data types and their usage.

The following gives a brief overview of the data type categories
present in Hive (a short example follows this list):
● Numeric Types
● String Types
● Date/Time Types
● Complex Types
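As a minimal sketch (the table and column names are hypothetical),
a table definition touching each of these categories might look
like this:

hive> CREATE TABLE student_info (
          id INT,
          name STRING,
          gpa DOUBLE,
          admission_date DATE,
          subjects ARRAY<STRING>,
          contact MAP<STRING, STRING>
      );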
Creating and Managing Databases and Tables (hive)

In Hive, a database is considered a catalog or namespace of
tables. So, we can maintain multiple tables within a database,
where a unique name is assigned to each table. Hive also provides a
default database named default.

❖ Create database

➢ To check the list of existing databases, use the
following command:
hive> show databases;

➢ Create a new database by using the following
command:
hive> create database demo;

➢ To retrieve the information associated with the
database:
hive> describe database extended demo;

❖ Drop database

➢ Drop the database by using the following command:
hive> drop database demo;

In Hive, you are not allowed to directly drop a database that
contains tables. In such a case, we can drop the database
either by dropping its tables first or by using the Cascade keyword
with the command.
Let's see the cascade command used to drop the database:-
hive> drop database if exists demo cascade;
This command automatically drops the tables present in the
database first.

❖ Create Table
In Hive, we can create a table by using conventions
similar to SQL. It offers a wide range of flexibility in
where the data files for tables are stored. It provides two
types of tables: -
1. Internal table
2. External table

Internal table:
The internal tables are also called managed tables, as the
lifecycle of their data is controlled by Hive. They are the
default table type. The internal tables are not flexible enough to
share with other tools like Pig. If we try to drop an internal
table, Hive deletes both the table schema and the data.

● Create an internal table by using the following
command: -
hive> create table demo.employee (Id int, Name string ,
Salary float)
row format delimited
fields terminated by ',' ;
● To see the metadata of the created table, use the
following command: -
hive> describe demo.employee

External Table
The external table allows us to create a table and access data
stored externally. The external keyword is used to specify
the external table, whereas the location keyword is used to
determine the location of the loaded data.

As the table is external, the data is not stored in the Hive
warehouse directory. Therefore, if we try to drop the table, the
metadata of the table will be deleted, but the data still
exists.

● Create an external table using the following
command:
hive> create external table emplist (Id int, Name
string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

● Use the following command to retrieve the data:
select * from emplist;
Internal tables are less safe in a shared environment. Imagine you
are working with a team and all your team members have access to
Hive/Hue, and the table already exists in your Hive warehouse. If
it is an internal table and someone drops it, the data is erased
along with the table. If it is an external table, dropping it only
removes the table definition (metadata) from Hive; the underlying
data files remain in their external location.

❖ Load Data
In Hive, we can easily load data from a file into a table.

➢ Let's load the data of the file into the table by
using the following command: -
load data local inpath
'/home/codegyani/hive/emp_details' into table
demo.employee;

❖ Drop Table
Hive facilitates us to drop a table by using the SQL drop
table command. Let's follow the below steps to drop the
table from the database.

➢ Let's check the list of existing databases by using
the following command:
hive> show databases;
➢ Now select the database from which we want to
delete the table by using the following command: -
hive> use demo;

➢ Let's check the list of existing tables in the
corresponding database.
hive> show tables;

➢ Now, drop the table by using the following command:
hive> drop table new_employee;

❖ Alter Table
In Hive, we can perform modifications in the existing table
like changing the table name, column name, comments, and
table properties. It provides SQL like commands to alter the
table.

➢ Rename a table
If we want to change the name of an existing table, we
can rename that table by using the following signature: -
Alter table old_table_name rename to new_table_name;

➢ Adding column
In Hive, we can add one or more columns in an existing
table by using the following signature: -
Alter table table_name add columns(column_name
datatype);

➢ Change Column
In Hive, we can rename a column, change its type and
position. Here, we are changing the name of the column
by using the following signature: -
Alter table table_name change old_column_name
new_column_name datatype;

➢ Delete or Replace Column


Hive allows us to delete one or more columns by
replacing them with the new columns. Thus, we cannot
drop the column directly.
alter table employee_data replace columns( id string,
first_name string, age int);

Hive DML Commands

To understand the Hive DML commands, let's see the employee
and employee_department tables first.
● LOAD DATA

hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt'
OVERWRITE INTO TABLE Employee;

● SELECTS and FILTERS

hive> SELECT E.EMP_ID FROM Employee E WHERE
E.Address='US';

● GROUP BY

hive> SELECT E.Address, count(E.EMP_ID) FROM Employee E
GROUP BY E.Address;

Hive Sort By vs Order By


Hive sort by and order by commands are used to fetch data
in sorted order. The main differences between sort by and
order by commands are given below.
● Sort by

hive> SELECT E.EMP_ID FROM Employee E SORT BY
E.empid;
○ May use multiple reducers for final output.
○ Only guarantees ordering of rows within a reducer.
○ May give partially ordered result.

● Order by

hive> SELECT E.EMP_ID FROM Employee E ORDER BY
E.empid;
○ Uses single reducer to guarantee total order in output.
○ LIMIT can be used to minimize sort time.

Joins

● Inner joins

Select * from employee join employeedepartment ON
(employee.empid=employeedepartment.empId);

● Left outer joins


Select e.empId, empName, department from employee e Left
outer join employeedepartment ed on(e.empId=ed.empId);

● Right outer joins

Select e.empId, empName, department from employee e
Right outer join employeedepartment ed
on(e.empId=ed.empId);

● Full outer joins

Select e.empId, empName, department from employee e
FULL outer join employeedepartment ed
on(e.empId=ed.empId);

Partitioning in Hive
The partitioning in Hive means dividing the table into some parts
based on the values of a particular column like date, course, city
or country. The advantage of partitioning is that since the data is
stored in slices, the query response time becomes faster.

As we know, Hadoop is used to handle huge amounts of data, so it
is always required to use the best approach to deal with it.
Partitioning in Hive is a good example of this.
Let's assume we have data on 10 million students studying at an
institute. Now, we have to fetch the students of a particular
course. If we use a traditional approach, we have to scan
the entire data set, which leads to performance degradation. In
such a case, we can adopt a better approach, i.e., partitioning in
Hive, and divide the data into different datasets based on
particular columns.
The partitioning in Hive can be executed in two ways -
1. Static partitioning
2. Dynamic partitioning

By default, Hive uses static partitioning, and the partition mode
is set to strict.

In static partitioning (manual partitioning), it is required to
pass the values of the partition columns manually while loading
the data into the table; hence, the data files thus created do not
contain the partition columns.
(Here we passed the partition column values 'java', 'hadoop' and
'python' manually for the course column, as sketched below.)
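A minimal sketch of static partitioning (the table, columns, and
file paths are hypothetical):

hive> CREATE TABLE student (id INT, name STRING)
      PARTITIONED BY (course STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/user/java_students.csv'
      INTO TABLE student PARTITION (course = 'java');
hive> LOAD DATA LOCAL INPATH '/home/user/hadoop_students.csv'
      INTO TABLE student PARTITION (course = 'hadoop');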

In dynamic partitioning, we just need to load the data once, and
all three partitions are created automatically; the value of the
partitioned column exists within the table data itself.
(Here, three MapReduce jobs ran, one each for python, java and
hadoop.)
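As a hedged sketch of dynamic partitioning (the staging table and
columns are hypothetical; the two SET properties are standard Hive
configuration settings):

hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> INSERT INTO TABLE student PARTITION (course)
      SELECT id, name, course FROM student_staging;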
Bucketing in Hive
The bucketing in Hive is a data organizing technique. It is similar
to partitioning in Hive with an added functionality that it divides
large datasets into more manageable parts known as buckets. So,
we can use bucketing in Hive when the implementation of
partitioning becomes difficult. However, we can also divide
partitions further in buckets.
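A minimal sketch of a bucketed (and partitioned) table, with
hypothetical names (hive.enforce.bucketing applies to older Hive
releases; newer releases enforce bucketing automatically):

hive> SET hive.enforce.bucketing = true;
hive> CREATE TABLE student_bucketed (id INT, name STRING)
      PARTITIONED BY (course STRING)
      CLUSTERED BY (id) INTO 4 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';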
