Big Data Analytics
1. Volume
Volume refers to the huge amount of data that is collected and
generated every second in large organizations. This data comes
from different sources such as IoT devices, social media, videos,
financial transactions, and customer logs.
3. Velocity
The term ‘velocity’ refers to the speed at which data is generated.
How fast the data is generated and processed to meet demand
determines the real potential of the data.
4. Value
Among the characteristics of Big Data, value is perhaps the most
important. No matter how fast the data is produced or how much of
it there is, it has to be reliable and useful; otherwise, the data is
not good enough for processing or analysis. Research suggests that
poor-quality data can lead to almost a 20% loss in a company’s revenue.
Data scientists first convert raw data into information. This data
set is then cleaned to retain the most useful data, and analysis and
pattern identification are performed on it. If the process is a
success, the data can be considered valuable.
5. Veracity
This characteristic of Big Data is connected to the previous one. It
defines the degree of trustworthiness of the data. As most of
the data you encounter is unstructured, it is important to filter
out the unnecessary information and use the rest for processing.
The data can also be inconsistent at times, which hampers the
ability to handle and manage it effectively.
Challenges of Big Data
4. Data Security
Securing these huge data sets is one of the daunting challenges
of Big Data. Companies are often so busy understanding, storing,
and analyzing their data sets that they push data security to
later stages. This is not a smart move, as unprotected data
repositories can become breeding grounds for malicious hackers.
6. Organizational Resistance
Organizational resistance – even in other areas of business – has
been around forever. Nothing new here! It is a problem that
companies can anticipate and, as such, can decide in advance how
best to deal with it.
If it’s already happening in your organization, you should know that
it is not something out of the ordinary. Of the utmost importance
is to determine the best way to handle the situation to ensure big
data success.
8. Upscaling Problems
One of the things that big data is known for is its dramatic
growth. Unfortunately, this feature is known as one of the most
serious big data challenges.
Even if the design of your solution is well thought out and
therefore flexible when it comes to upscaling, the real problem
does not lie in introducing new processing and storage capacity.
The real problem is the complexity of upscaling in a way that
ensures that the performance of the system does not decline and
that you don’t go overboard with the budget.
Types of Big Data
1. Structured data
As the name suggests, this kind of data is structured and well-
defined. It has a consistent order that can be easily understood
by a computer or a human. It can be stored, analyzed, and
processed using a fixed format, and it usually has its own data
model.
You will find this kind of data in databases, where it is neatly
stored in columns and rows.
For example, a database consisting of all the details of employees
of a company is a type of structured data set.
2. Unstructured data
Any data set that is not structured or well-defined is called
unstructured data. This kind of data is unorganized and difficult
to handle, understand, and analyze. It does not follow a consistent
format and may vary at different points in time. Most of the data
you encounter comes under this category.
For example, your comments, tweets, shares, posts, and likes on
social media are unstructured data. The videos you watch on
YouTube and the text messages you send via WhatsApp all pile up
as a huge heap of unstructured data.
3. Semi-structured data
This kind of data is somewhat structured, but not completely. It
may seem unstructured at first and does not obey the formal
structure of data models such as those used in an RDBMS. For
example, CSV files are considered semi-structured data.
Features of GFS (Google File System)
● Fault tolerance
● Critical data replication
● Automatic and efficient data recovery
● High aggregate throughput
● Reduced client and master interaction because of the large
chunk size
● High availability
What is HDFS?
Hadoop Distributed File System (HDFS) is a Java-based
distributed file system that allows you to store large data sets
across multiple nodes in a Hadoop cluster. So, if you install Hadoop,
you get HDFS as the underlying storage system for storing data in
the distributed environment.
Let’s take an example to understand it. Imagine that you have ten
machines, each with a 1 TB hard drive. If you install Hadoop as a
platform on top of these ten machines, you get HDFS as a storage
service. The Hadoop Distributed File System is distributed in such
a way that every machine contributes its individual storage for
storing any kind of data.
Advantages Of HDFS
1. Distributed Storage:
When you access Hadoop Distributed file system from any of the
ten machines in the Hadoop cluster, you will feel as if you have
logged into a single large machine which has a storage capacity of
10 TB (total storage over ten machines). What does it mean? It
means that you can store a single large file of 10 TB which will be
distributed over the ten machines (1 TB each). So, it is not limited
to the physical boundaries of each individual machine.
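To make this concrete, here is a minimal Java sketch using the HDFS FileSystem API to write and then read a file through this single logical namespace. The NameNode address, path, and file contents are illustrative assumptions, not values from these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this usually comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/example.txt");
        // Write: HDFS splits the file into blocks and spreads them over the DataNodes
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }
        // Read: the client sees one file, regardless of where the blocks live
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}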
Last but not least, let us talk about horizontal scaling, or
scaling out, in Hadoop. There are two types of scaling: vertical and
horizontal. In vertical scaling (scaling up), you increase the
hardware capacity of your system. In other words, you procure
more RAM or CPU and add it to your existing system to make it
more robust and powerful. But there are challenges associated
with vertical scaling or scaling up, such as the downtime required
while upgrading the machine and the hard limit on how much
hardware a single machine can hold.
Features of HDFS
HDFS Architecture:
NameNode:
Functions of NameNode:
● It is the master daemon that maintains and manages the
DataNodes (slave nodes)
● It records the metadata of all the files stored in the cluster,
e.g. The location of blocks stored, the size of the files,
permissions, hierarchy, etc. There are two files associated
with the metadata:
○ FsImage: It contains the complete state of the file
system namespace since the start of the NameNode.
○ EditLogs: It contains all the recent modifications made
to the file system with respect to the most recent
FsImage.
● It records each change that takes place to the file system
metadata. For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog.
● It regularly receives a Heartbeat and a block report from all
the DataNodes in the cluster to ensure that the DataNodes
are live.
● It keeps a record of all the blocks in HDFS and in which
nodes these blocks are located.
● The NameNode is also responsible for maintaining the
replication factor of all the blocks, which we will discuss in
more detail later in this section.
● In case of DataNode failure, the NameNode chooses new
DataNodes for new replicas, balances disk usage, and manages
the communication traffic to the DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a
DataNode runs on commodity hardware, that is, an inexpensive
system that is not required to be of high quality or high availability.
Functions of DataNode:
● These are slave daemons or processes that run on each slave
machine.
● The actual data is stored on the DataNodes.
● The DataNodes serve the low-level read and write
requests from the file system’s clients.
● They send heartbeats to the NameNode periodically to
report the overall health of HDFS; by default, this frequency
is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or
process called the Secondary NameNode. The Secondary NameNode
works concurrently with the primary NameNode as a helper
daemon; it periodically merges the EditLogs with the FsImage
(a process called checkpointing). Don’t confuse the Secondary
NameNode with a backup NameNode, because it is not one.
TaskTracker –
● TaskTracker runs on DataNodes, usually on all of them.
● TaskTracker is replaced by Node Manager in MRv2.
● Mapper and Reducer tasks are executed on DataNodes
administered by TaskTrackers.
● TaskTrackers will be assigned Mapper and Reducer tasks to
execute by JobTracker.
● TaskTracker will be in constant communication with the
JobTracker signalling the progress of the task in execution.
● TaskTracker failure is not considered fatal. When a
TaskTracker becomes unresponsive, JobTracker will assign
the task executed by the TaskTracker to another node.
Blocks:
Now, as we know, data in HDFS is scattered across the
DataNodes as blocks. Let’s have a look at what a block is and how
it is formed.
Blocks are nothing but the smallest contiguous locations on your
hard drive where data is stored. In general, in any file
system, you store data as a collection of blocks. Similarly,
HDFS stores each file as blocks which are scattered throughout
the Apache Hadoop cluster. The default size of each block is 128
MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you
can configure as per your requirement.
It is not necessary that each file in HDFS is stored in an exact
multiple of the configured block size (128 MB, 256 MB, etc.). Let’s
take an example with a file “example.txt” of size 514 MB. Suppose
that we are using the default block size, which is 128 MB. Then how
many blocks will be created? Five. The first four blocks will be
128 MB each, but the last block will be only 2 MB in size.
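As a quick illustration, the Java sketch below asks HDFS where the blocks of such a file actually live; the block boundaries and replica hosts come from the NameNode’s metadata. The path and configuration are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
        // One BlockLocation per block: for a 514 MB file with 128 MB blocks,
        // this loop prints five entries (4 x 128 MB + 1 x 2 MB)
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}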
Now, you must be wondering why we need such a huge block size,
i.e. 128 MB.
Well, whenever we talk about HDFS, we talk about huge data sets,
i.e. terabytes and petabytes of data. If we had a block size of,
say, 4 KB, as in a Linux file system, we would have too many
blocks and therefore too much metadata. Managing this number of
blocks and the associated metadata would create huge overhead,
which is something we don’t want.
Replication Management:
HDFS provides a reliable way to store huge data in a distributed
environment as data blocks. The blocks are also replicated to
provide fault tolerance. The default replication factor is 3, which
is again configurable. So, with the default replication factor, each
block is replicated three times and stored on different DataNodes.
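The replication factor can be set cluster-wide (the dfs.replication property) or per file; the short Java sketch below changes it for one existing file. The path and the new factor are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/example.txt");
        // Ask the NameNode to keep 2 replicas of each block of this file
        // instead of the default 3; re-replication happens in the background
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}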
Rack Awareness:
A rack is nothing but a collection of 30-40 DataNodes or
machines in a Hadoop cluster located in a single data center or
location. These DataNodes in a rack are connected to the
NameNode through traditional network design via a network
switch. A large Hadoop cluster will have multiple racks.
Hadoop keeps multiple copies for all data that is present in HDFS.
If Hadoop is aware of the rack topology, each copy of data can be
kept in a different rack. By doing this, in case an entire rack
suffers a failure for some reason, the data can be retrieved from
a different rack.
Hadoop Ecosystem
The Hadoop Ecosystem is a platform or suite which provides various
services to solve big data problems. It includes Apache
projects and various commercial tools and solutions. There are
four major elements of Hadoop: HDFS, MapReduce, YARN, and
Hadoop Common.
MapReduce
MapReduce is one of the core building blocks of processing in the
Hadoop framework. It is a programming framework that allows us
to perform parallel processing on large data sets in a distributed
environment.
Mapper Class
● Input Split
It is the logical representation of data. It represents a block
of work that contains a single map task in the MapReduce
program.
● RecordReader
It interacts with the input split and converts the obtained
data into key-value pairs.
Reducer Class
The intermediate output generated by the mapper is fed to the
reducer, which processes it and generates the final output, which
is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver class. It is
responsible for setting up a MapReduce job to run in Hadoop. In it,
we specify the Mapper and Reducer classes along with the
input/output data types and the job name.
MapReduce Tutorial: A Word Count Example of MapReduce
Let us understand, how a MapReduce works by taking an example
where I have a text file called example.txt whose contents are as
follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
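Below is a minimal Java sketch of this word-count job (the class names, input/output paths, and tokenization are illustrative): the mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit (word, 1) for every word
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString(), " ,");
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum all the 1s emitted for the same word
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the sample line above, the reducer would produce counts such as (Car, 3), (Bear, 2), (River, 2), (Dear, 1) and (Deer, 1), plus (and, 1), since “and” is also treated as a token here.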
Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes
and each node works on a part of the job simultaneously.
So, MapReduce is based on the divide-and-conquer paradigm,
which helps us process the data using different machines.
As the data is processed by multiple machines in parallel
instead of by a single machine, the time taken to process the
data is reduced by a tremendous amount.
2. Data Locality:
Instead of moving data to the processing unit, we move
the processing unit to the data in the MapReduce framework.
In the traditional system, we used to bring the data to the
processing unit and process it there. But as the data grew and
became very huge, bringing this huge amount of data to the
processing unit posed issues such as network congestion, long
transfer times, and the central node becoming a bottleneck.
Record Reader
● In Hadoop, jobs run in the form of map and reduce tasks.
● Mappers and reducers can do their job only with key-value
pairs.
● The record reader is the interface between the input split and
the mapper; it reads one line at a time (by default) from the
input split, converts those lines into key-value pairs, and
passes these key-value pairs to the mapper as input.
Combiner
● It works between mapper and reducer
● The output produced by the Mapper is the intermediate
output, in the form of key-value pairs, and it can be massive in
size. If we feed this huge output directly to the Reducer, it
will increase network congestion.
● So, to minimize this network congestion, we can place a
combiner between the Mapper and the Reducer. These combiners
are also known as semi-reducers.
● It is not necessary to add a combiner to your Map-Reduce
program, it is optional.
● Combiner is also a class in our java program like Map and
Reduce class that is used in between this Map and Reduce
classes.
● Combiner helps us to produce abstract details or a summary
of very large datasets.
● When we process very large datasets using Hadoop, a
combiner is very helpful and results in an enhancement of
overall performance.
// Key-value pairs generated by the mapper for the data “Geeks For Geeks For”
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
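To enable a combiner, the driver registers a combiner class; a common choice, when the reduce logic is associative and commutative as in word count, is to reuse the reducer itself. The WordCountReducer name below refers to the sketch shown earlier and is illustrative.

// In the driver (see the WordCount sketch above):
job.setCombinerClass(WordCountReducer.class);

// With the combiner enabled, the pairs above are pre-aggregated on the
// mapper's node before being sent over the network:
// (Geeks,2)
// (For,2)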
Advantage of combiners
● A combiner reduces the amount of intermediate data shuffled
from the mappers to the reducers, which lowers network
congestion and speeds up the job.
Disadvantage of combiners
● Hadoop does not guarantee how many times a combiner will run
(it may run zero or more times), so the job logic must not
depend on it; the combiner also operates on intermediate
key-value pairs stored on local disk, which adds disk I/O.
Partitioner
● The partitioner decides which reducer each intermediate key
is sent to; by default, Hadoop hashes the key (HashPartitioner)
so that all values for the same key reach the same reducer.
Apache Pig
● It is a dataflow language
● Pig enables programmers to write complex data
transformations without knowing Java.
● Apache Pig has two main components – the Pig Latin language
and the Pig Run-time Environment, in which Pig Latin
programs are executed.
● For Big Data Analytics, Pig gives a simple data flow language
known as Pig Latin which has functionalities similar to SQL
like join, filter, limit etc.
● Developers who work with scripting languages and SQL can
leverage Pig Latin. This gives developers ease of
programming with Apache Pig. Pig Latin provides various built-
in operators like join, sort, filter, etc. to read, write, and
process large data sets. Thus, it is evident that Pig has a rich
set of operators.
● Programmers write scripts using Pig Latin to analyze data and
these scripts are internally converted to Map and Reduce
tasks by Pig MapReduce Engine. Before Pig, writing
MapReduce tasks was the only way to process the data
stored in HDFS.
● If a programmer wants to write custom functions that are
unavailable in Pig, Pig allows them to write User Defined
Functions (UDFs) in a language of their choice like Java,
Python, Ruby, Jython, JRuby, etc. and embed them in a Pig
script. This provides extensibility to Apache Pig.
● Pig can process any kind of data, i.e. structured, semi-
structured or unstructured data, coming from various
sources. Apache Pig handles all kinds of data.
● Approximately 10 lines of Pig code can do the work of about
200 lines of MapReduce code.
● It can handle inconsistent schema (in case of unstructured
data).
● Apache Pig extracts the data, performs operations on that
data and dumps the data in the required format in HDFS i.e.
ETL (Extract Transform Load).
● Apache Pig automatically optimizes the tasks before
execution, i.e. automatic optimization.
● It allows programmers and developers to concentrate upon
the whole operation irrespective of creating mapper and
reducer functions separately.
Pig Components
1. Pig Latin: it is made up of a series of operations or
transformations that are applied to the input data to produce
output
2. Pig execution: the different ways in which you can
submit queries to Apache Pig
a. Script: Pig commands placed in a file (.pig)
b. Grunt: an interactive shell for running Pig commands
c. Embedded: running Pig scripts from Java (see the sketch
after the Load/Transform/Dump discussion below)
Pig Architecture
Pig Latin Scripts
Grunt Shell
● It is an interactive shell environment used by the
programmers to write and execute Pig Latin scripts
Parser
● After writing Pig Latin scripts, they are given to the parser,
which is the next component
● The parser does syntax checking and type checking,
and it constructs a logical plan in the form of a Directed
Acyclic Graph (DAG)
● The nodes of the DAG represent the logical/relational operators
and the edges represent the data flow (i.e. how data flows
between the jobs)
Optimizer
● The DAG (or logical plan) output by the parser is provided
as input to the optimizer
● The optimizer optimizes the given logical plan by applying
logical optimizations such as projection and pushdown, and it
then generates the optimized logical plan
Compiler
● The optimized logical plan is given as input to the next
component, which is the compiler
● The compiler compiles the given logical plan and generates a
series of map-reduce jobs
Execution Engine
● Finally, these MapReduce jobs are submitted for execution to
the execution engine.
● The execution engine executes the MapReduce jobs with the
help of HDFS, where the data resides
● The MapReduce jobs are executed and give the required
result. The result can be displayed on the screen using the
“DUMP” statement or stored in HDFS using the “STORE”
statement.
To sum it up: the user first writes Pig Latin scripts, typically in
the Grunt shell, and submits them for execution. The script goes to
the parser, which does three things: syntax checking, type checking,
and construction of the DAG. The DAG produced by the parser, which
can also be called the logical plan, becomes the input to the
optimizer. The optimizer optimizes this logical plan and produces an
optimized logical plan, which is then given to the compiler. The
compiler compiles it into a series of MapReduce jobs, and these jobs
are handed to the execution engine, which runs them by reading the
data from the Hadoop Distributed File System. After execution we
get the result. This entire flow is the Pig execution model, and at
a high level it is called the Pig architecture.
Now we will talk about complex data types in Pig Latin i.e. Tuple,
Bag and Map.
Tuple
● A tuple is an ordered set of fields, which may contain
different data types for each field.
● You can think of it as a record stored in a row of a
relational database.
● A tuple is a set of cells from a single row.
● A tuple is represented by the ‘()’ symbol.
● Since tuples are ordered, we can access fields in each tuple
using the indexes of the fields
● Example of a tuple − (1, Linkin Park, 7, California)
Bag
● A bag is a collection of tuples; these tuples may be a
subset of the rows or entire rows of a table.
● A bag can contain duplicate tuples; it is not mandatory
that they be unique.
● A bag has a flexible schema, i.e. tuples within the bag can
have different numbers of fields.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California),
(Metallica, 8), (Mega Death, Los Angeles)}
Map
● A map is a set of key-value pairs used to represent data
elements.
● The key must be a chararray and should be unique, like a
column name, so that it can be indexed and the value associated
with it can be accessed on the basis of the key.
● The value can be of any data type.
● Maps are represented by the ‘[]’ symbol, and key and value
are separated by the ‘#’ symbol.
● Example of maps − [band#Linkin Park, members#7],
[band#Metallica, members#8]
Load:
● You first load (LOAD) the data you want to manipulate. As in
a typical MapReduce job, that data is stored in HDFS.
● For a Pig program to access the data, you first tell Pig what
file or files to use. For that task, you use the LOAD
'data_file' command.
● Here, 'data_file' can specify either an HDFS file or a
directory. If a directory is specified, all files in that
directory are loaded into the program.
● If the data is stored in a file format that isn’t natively
accessible to Pig, you can optionally add the USING clause
to the LOAD statement to specify a user-defined function
that can read in (and interpret) the data.
Transform:
● You run the data through a set of transformations which are
translated into a set of Map and Reduce tasks.
● The transformation logic is where all the data manipulation
happens.
● You can FILTER out rows that aren’t of interest, JOIN two
sets of data files, GROUP data to build aggregations, ORDER
results, and do much, much more.
Dump:
● Finally, you dump (DUMP) the results to the screen, or
store (STORE) the results in a file somewhere.
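The same Load/Transform/Dump-or-Store flow can be driven from Java using Pig’s embedded mode (PigServer). The sketch below is illustrative only: the input path, field names, and filter condition are assumptions, not values from these notes.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs on the Hadoop cluster; ExecType.LOCAL would run locally
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load: read a comma-separated file from HDFS (hypothetical path and schema)
        pig.registerQuery("records = LOAD '/user/demo/flights.csv' "
                + "USING PigStorage(',') AS (carrier:chararray, miles:int);");

        // Transform: keep only long flights and group them by carrier
        pig.registerQuery("long_flights = FILTER records BY miles > 1000;");
        pig.registerQuery("by_carrier = GROUP long_flights BY carrier;");

        // Store: write the result back to HDFS (DUMP would print it instead)
        pig.store("by_carrier", "/user/demo/long_flights_by_carrier");

        pig.shutdown();
    }
}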
Pig Latin is the language for Pig programs. Pig translates the Pig
Latin script into MapReduce jobs that can be executed within
Hadoop cluster. When coming up with Pig Latin, the development
team followed three key design principles:
● Keep it simple.
○ Pig Latin provides a streamlined method for interacting
with Java MapReduce.
○ It’s an abstraction, in other words, that simplifies the
creation of parallel programs on the Hadoop cluster for
data flows and analysis.
○ Complex tasks may require a series of interrelated data
transformations — such series are encoded as data flow
sequences.
○ Writing data transformations and flows as Pig Latin
scripts instead of Java MapReduce programs makes
these programs easier to write, understand, and
maintain because
a) you don’t have to write the job in Java,
b) you don’t have to think in terms of MapReduce,
and
c) you don’t need to come up with custom code to
support rich data types.
Pig Operators
Evaluating Local and Distributed Mode of running Pig
Scripts
Pig scripts, Grunt shell Pig commands, and embedded Pig programs
can run in either local mode or MapReduce mode.
To specify whether a script or a Grunt shell session executes
locally or in Hadoop mode, just pass the -x flag to the pig command.
Here’s how you’d run the Pig script in Hadoop mode, which is the
default if you don’t specify the flag:
pig -x mapreduce milesPerCarrier.pig
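To run the same script in local mode (a single JVM working against the local file system, which is handy for development and testing), you would pass -x local instead:
pig -x local milesPerCarrier.pig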
Hive
Advantages of Hive
● Useful for people who aren’t from a programming background,
as it eliminates the need to write complex MapReduce
programs.
● Extensible and scalable to cope with the growing volume
and variety of data, without affecting the performance of the
system.
● It is an efficient ETL (Extract, Transform, Load) tool.
● Hive supports any client application written in Java, PHP,
Python, C++ or Ruby by exposing its Thrift server. (You can
use these client-side languages embedded with SQL for
accessing a database such as DB2, etc.)
● As the metadata information of Hive is stored in an RDBMS,
it significantly reduces the time to perform semantic checks
during query execution.
Now, let us explore the first two major components in the Hive
Architecture:
1. Hive Clients:
Hive allows writing applications in various languages, including
Java, Python, and C++. It supports different types of clients such
as:
● Thrift Server: a cross-language service provider
platform that serves requests from all programming
languages that support Thrift.
● JDBC Driver: used to establish a connection between
Hive and Java applications (see the sketch after this list).
● ODBC Driver: allows applications that support the
ODBC protocol to connect to Hive.
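A minimal Java sketch of a JDBC client, assuming a HiveServer2 instance reachable at a hypothetical host on the default port 10000; the user, table name, and query are also illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical host; 10000 is the default HiveServer2 port
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM employees LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}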
2. Hive Services:
Hive provides many services as shown in the image above. Let us
have a look at each of them:
● Hive CLI (Command Line Interface): The Hive CLI
(Command Line Interface) is a shell where we can execute
Hive queries and commands.
● Apache Hive Web Interface: The Hive Web UI is an
alternative to the Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
● Hive Server: The Hive server is built on Apache Thrift and
is therefore also referred to as the Thrift Server. It accepts
requests from different clients and forwards them to the Hive
Driver.
● Apache Hive Driver: Compilation, optimization and execution
happens here. It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. Then, the
driver passes the query to the compiler where parsing, type
checking and semantic analysis takes place with the help of
schema present in the metastore. In the next step, an
optimized logical plan is generated in the form of a DAG
(Directed Acyclic Graph) of map-reduce tasks and HDFS
tasks. Finally, the execution engine executes these tasks in
the order of their dependencies, using Hadoop.
● Metastore: You can think of the metastore as a central
repository for storing all the Hive metadata. Hive metadata
includes various types of information, such as the structure of
tables and partitions, along with column names, column types,
serializers and deserializers, etc.
Data types are very important elements in the Hive query language
and in data modeling. To define table column types, we must know
the data types and their usage.
❖ Create database
❖ Drop database
❖ Create Table
In Hive, we can create a table using conventions similar to
SQL. Hive offers a great deal of flexibility in where the
data files for tables are stored. It provides two types of
tables:
1. Internal table
2. External table
Internal table:
Internal tables are also called managed tables, as the
lifecycle of their data is controlled by Hive. They are the
default table type. Internal tables are not flexible enough to
be shared with other tools like Pig. If we drop an internal
table, Hive deletes both the table schema and the data.
External Table
An external table allows us to create and access a table and
its data stored externally. The EXTERNAL keyword is used to
specify an external table, whereas the LOCATION keyword is
used to specify where the data lives.
❖ Load Data
In Hive, we can easily load data from any file into a table.
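As an illustration of the Create Table and Load Data steps, the sketch below creates an internal and an external table and loads a file into the internal one, reusing the JDBC connection pattern shown earlier; every host, table name, path, and column here is a made-up example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {
            // Internal (managed) table: dropping it deletes schema AND data
            stmt.execute("CREATE TABLE employees (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            // External table: Hive only tracks the schema; the data stays at LOCATION
            stmt.execute("CREATE EXTERNAL TABLE employees_ext (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/user/demo/employees_ext'");
            // Load a file that already sits in HDFS into the internal table
            stmt.execute("LOAD DATA INPATH '/user/demo/employees.csv' "
                    + "INTO TABLE employees");
        }
    }
}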
❖ Drop Table
Hive allows us to drop a table from the database by using the
SQL DROP TABLE command.
❖ Alter Table
In Hive, we can perform modifications to an existing table,
such as changing the table name, column names, comments, and
table properties. Hive provides SQL-like commands to alter the
table.
➢ Rename a table
If we want to change the name of an existing table, we
can rename that table by using the following signature: -
Alter table old_table_name rename to new_table_name;
➢ Adding column
In Hive, we can add one or more columns in an existing
table by using the following signature: -
Alter table table_name add columns(column_name
datatype);
➢ Change Column
In Hive, we can rename a column, change its type and
position. Here, we are changing the name of the column
by using the following signature: -
Alter table table_name change old_column_name
new_column_name datatype;
● GROUP BY
● ORDER BY
Joins
● Inner joins
Partitioning in Hive
Partitioning in Hive means dividing a table into parts based on the
values of a particular column, such as date, course, city, or
country. The advantage of partitioning is that, since the data is
stored in slices, the query response time becomes faster.
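For example, here is a minimal sketch (again via the JDBC pattern above, with hypothetical names and paths) that creates a table partitioned by course, so queries filtering on course only scan the matching slice.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {
            // The partition column is not part of the regular column list;
            // each distinct course value gets its own directory (slice) in HDFS
            stmt.execute("CREATE TABLE students (id INT, name STRING) "
                    + "PARTITIONED BY (course STRING)");
            // Load data into one specific partition
            stmt.execute("LOAD DATA INPATH '/user/demo/hadoop_students.csv' "
                    + "INTO TABLE students PARTITION (course = 'hadoop')");
        }
    }
}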