Unit-2 Introduction To Hadoop
Introducing Hadoop, need of Hadoop, limitations of RDBMS, RDBMS versus Hadoop, Distributed Computing Challenges, History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop Distributors, HDFS, Processing Data with Hadoop, Managing Resources and Applications with Hadoop YARN, Interacting with Hadoop Ecosystem.
Hadoop is an open source framework that is meant for storage and processing of big data in a distributed manner.
Open Source – Hadoop is an open source framework which means it is available free of cost. Also, the users are
allowed to change the source code as per their requirements.
Distributed Processing – Hadoop supports distributed processing of data, which makes processing faster. The data in Hadoop HDFS is stored in a distributed manner, and MapReduce is responsible for the parallel processing of that data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each block (default) at different nodes.
Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of machine. So, the data
stored in Hadoop environment is not affected by the failure of the machine.
Scalability – It is compatible with other hardware, and we can easily add new hardware to the nodes.
High Availability – The data stored in Hadoop is available to access even after the hardware failure. In case of
hardware failure, the data can be accessed from another node.
HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. The large data files running on
a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails.
The key aspects of HDFS are:
Storage component of Hadoop that stores data in a distributed manner across several nodes.
Map Reduce: MapReduce is the Hadoop layer that is responsible for data processing. It provides a framework for writing applications that process the unstructured and structured data stored in HDFS.
It is responsible for the parallel processing of high volume of data by dividing data into independent tasks. The
processing is done in two phases Map and Reduce.
The Map is the first phase of processing, in which the complex logic code is specified, and the Reduce is the second phase of processing, in which lightweight operations such as aggregation and summation are specified.
Explain features of HDFS. Discuss the design of the Hadoop Distributed File System and its concepts in detail.
HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. The large data files running on
a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails.
The key aspects of HDFS are:
Distributes data across several nodes: divides large file into blocks and stores in various data nodes.
High Throughput Access: Provides access to data blocks which are nearer to the client.
HDFS Daemons:
NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to perform I/O tasks.
Blocks: HDFS breaks a large file into smaller pieces called blocks, and the NameNode keeps track of the blocks of a file. (A rack is a collection of DataNodes within the cluster.)
File System Namespace: The NameNode is the bookkeeper of HDFS. It keeps track of how files are broken down into blocks and which DataNodes store these blocks. The file system namespace is the collection of files in the cluster.
FsImage: The file system namespace, including the mapping of blocks to files and the file properties, is stored in a file called FsImage.
EditLog: The NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.
DataNode
There are multiple DataNodes per cluster. Each slave machine in the cluster has a DataNode daemon for reading and writing the HDFS blocks of the actual files on the local file system.
During pipeline reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends a “heartbeat” message to the NameNode to ensure connectivity between the NameNode and the DataNode.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed, and its blocks are re-replicated on other DataNodes.
Fig. Interaction between NameNode and DataNode.
Secondary NameNode
The Secondary NameNode takes snapshots of the HDFS metadata at intervals specified in the Hadoop configuration.
In case of NameNode failure, the Secondary NameNode can be configured manually to bring up the cluster, i.e., we make the Secondary NameNode the NameNode.
Anatomy of File Read:
The client opens the file that it wishes to read by calling open() on the DFS (DistributedFileSystem).
The DFS communicates with the NameNode to get the locations of the data blocks. The NameNode returns the addresses of the DataNodes that the data blocks are stored on.
Subsequent to this, the DFS returns an FSDataInputStream to the client to read from the file.
Client then calls read() on the stream DFSInputStream, which has addresses of DataNodes for the first few block of the
file.
Client calls read() repeatedly to stream the data from the DataNode.
When the end of the block is reached, DFSInputStream closes the connection with the DataNode. It repeats the steps
to find the best DataNode for the next block and subsequent blocks.
When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
Fig. File Read Anatomy
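The read path described above can be exercised directly with the Hadoop FileSystem Java API. The following is a minimal sketch; the NameNode address hdfs://localhost:9000 and the file path /user/hadoop/sample.txt are assumptions used only for illustration. open() contacts the NameNode for block locations and returns the FSDataInputStream that streams the blocks from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // loads core-site.xml from the classpath, if present
            conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed NameNode address for this sketch
            FileSystem fs = FileSystem.get(conf);                // client-side view of HDFS (the "DFS")

            Path file = new Path("/user/hadoop/sample.txt");     // assumed file path
            // open() asks the NameNode for block locations and returns an FSDataInputStream
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {     // read() streams data from the DataNodes
                    System.out.println(line);
                }
            }                                                    // close() ends the connection
            fs.close();
        }
    }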
Anatomy of File Write:
An RPC call to the NameNode happens through the DFS to create a new file.
As the client writes data, the data is split into packets by DFSOutputStream, which then writes them to an internal queue, called the data queue. The DataStreamer consumes the data queue.
Data streamer streams the packets to the first DataNode in the pipeline. It stores packet and forwards it to the
second DataNode in the pipeline.
In addition to the internal queue, DFSOutputStream also maintains an “ack queue” of packets that are waiting to be acknowledged by the DataNodes.
When the client finishes writing the file, it calls close() on the stream.
Fig. File Write Anatomy
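A corresponding write sketch using the same FileSystem API is given below; the file path and the written text are assumptions. create() triggers the RPC to the NameNode, and the data written to the returned FSDataOutputStream is packetized and pipelined to the DataNodes as described above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/hadoop/output.txt");     // assumed file path
            // create() makes an RPC to the NameNode to create the file in the namespace
            try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite if it exists
                // data written here is split into packets by DFSOutputStream and
                // streamed through the DataNode pipeline
                out.writeUTF("Hello HDFS");
            }                                                    // close() flushes the remaining packets
            fs.close();
        }
    }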
Data Replication: There is absolutely no need for a client application to track all the blocks. HDFS directs the client to the nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first DataNode in the pipeline. Then this DataNode takes over
and forwards the data to the next node in the pipeline. This process continues for all the data blocks, and
subsequently all the replicas are written to the disk.
Fig. Replica Placement Strategy
Creating a directory: a directory is created in HDFS with the hadoop fs -mkdir <path> shell command (a sketch using the Java FileSystem API is given below).
o/p: 1 1 60
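The same operations can be performed through the Java FileSystem API; the directory path below is an assumption. mkdirs() creates the directory, and getContentSummary() returns the directory count, file count and content size, which is the format printed by hadoop fs -count (so an output of the form “1 1 60” would mean 1 directory, 1 file and 60 bytes).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMkdirExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/user/hadoop/sample");          // assumed directory path

            fs.mkdirs(dir);                                      // equivalent of: hadoop fs -mkdir /user/hadoop/sample

            // equivalent of: hadoop fs -count /user/hadoop/sample
            ContentSummary cs = fs.getContentSummary(dir);
            System.out.println(cs.getDirectoryCount() + " "      // number of directories
                    + cs.getFileCount() + " "                    // number of files
                    + cs.getLength());                           // total size in bytes
            fs.close();
        }
    }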
Processing Data with Hadoop (MapReduce Programming):
The input data set is split into independent chunks. Map tasks process these independent chunks completely in parallel.
The Reduce task provides the reduced output by combining the outputs of the various mappers. There are two daemons associated with MapReduce programming: JobTracker and TaskTracker.
JobTracker:
Whenever code is submitted to a cluster, the JobTracker creates the execution plan by deciding which task to assign to which node.
It also monitors all the running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries.
There is one JobTracker process running per Hadoop cluster. The JobTracker process runs in its own Java Virtual Machine process.
Fig. Job Tracker and Task Tracker interaction
TaskTracker:
This daemon is responsible for executing the individual tasks that are assigned by the JobTracker.
The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and resubmits the task to another available node in the cluster.
MapReduce divides a data analysis task into two parts – Map and Reduce. In the example given below, there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node and the reducer combines the output from the
mappers to produce the reduced result set.
Steps:
First, the input data set is split into pieces on which the map tasks will work. Next, the framework creates a master and several slave processes and executes the worker processes remotely.
Several map tasks work simultaneously and read pieces of data that were assigned to each map task.
Map worker uses partitioner function to divide the data into regions.
When the map slaves complete their work, the master instructs the reduce slaves to begin their work.
When all the reduce slaves complete their work, the master transfers the control to the user program.
Mapper Class: This class overrides the Map function based on the problem statement.
Reducer Class: This class overrides the Reduce function based on the problem statement.
NOTE: Based on marks given write MapReduce example if necessary with program.
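For reference, a minimal word-count program sketch using the standard org.apache.hadoop.mapreduce API is given below; the class names are illustrative, and the input and output HDFS paths are taken from the command line. The Mapper emits (word, 1) pairs for every word in its input split, and the Reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Such a job is typically packaged into a jar and submitted with: hadoop jar wordcount.jar WordCount <input path> <output path>.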
Limitations of Hadoop 1.0: HDFS and MapReduce are core components, while other components are built around the
core.
Not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.
HDFS Limitation: The NameNode can quickly become overwhelmed as the load on the system increases. In Hadoop 2.x this problem is resolved.
Hadoop 2: Hadoop 2.x is a YARN-based architecture. It is a general processing platform, and YARN is not constrained to MapReduce only. One can run multiple applications in Hadoop 2.x, with all applications sharing common resource management.
Hadoop 2.x can be used for various types of processing such as Batch, Interactive, Online, Streaming, Graph and
others.
In Hadoop 2.x, HDFS has two major components:
a) Namespace: Takes care of file-related operations such as creating and modifying files and directories.
b) Block storage service: It handles DataNode cluster management and replication.
HDFS 2 Features:
Horizontal scalability: HDFS Federation uses multiple independent NameNodes for horizontal scalability. The DataNodes are used as common storage for blocks and are shared by all the NameNodes. All DataNodes in the cluster register with each NameNode in the cluster.
High availability: High availability of the NameNode is obtained with the help of a Passive Standby NameNode.
The Active-Passive NameNode pair handles failover automatically. All namespace edits are recorded to shared NFS (Network File System) storage, and there is a single writer at any point of time.
The Passive NameNode reads the edits from the shared storage and keeps its metadata information up to date.
In case of Active NameNode failure, the Passive NameNode automatically becomes the Active NameNode and then starts writing to the shared storage.
Fig. Active and Passive NameNode Interaction (the Active NameNode writes edit logs to the shared storage, and the Passive NameNode reads them)
Hadoop 1.x versus Hadoop 2.x:
1. Hadoop 1.x supports only the MapReduce (MR) processing model and does not support non-MR tools; Hadoop 2.x allows working in MR as well as other distributed computing models such as Spark and HBase coprocessors.
2. In Hadoop 1.x, MR does both processing and cluster-resource management; in Hadoop 2.x, YARN does cluster resource management and processing is done using different processing models.
3. Hadoop 1.x has limited scaling of nodes (limited to 4000 nodes per cluster); Hadoop 2.x has better scalability (scalable up to 10000 nodes per cluster).
4. Hadoop 1.x works on the concept of slots, where a slot can run either a Map task or a Reduce task only; Hadoop 2.x works on the concept of containers, which can run generic tasks.
5. Hadoop 1.x has a single NameNode to manage the entire namespace; Hadoop 2.x can have multiple NameNode servers managing multiple namespaces.
6. Hadoop 1.x has a Single Point of Failure (SPOF) because of the single NameNode; Hadoop 2.x overcomes the SPOF with a standby NameNode, and in the case of NameNode failure it is configured for automatic recovery.
7. In Hadoop 1.x, the MR API is compatible with Hadoop 1.x: a program written for Hadoop 1.x executes in Hadoop 1.x without any additional files; in Hadoop 2.x, a program written for Hadoop 1.x requires additional files to execute.
8. Hadoop 1.x is limited as a platform for event processing, streaming and real-time operations; Hadoop 2.x can serve as a platform for a wide variety of data analytics, making it possible to run event processing, streaming and real-time operations.
9. Hadoop 1.x does not support Microsoft Windows; Hadoop 2.x added support for Microsoft Windows.
Explain in detail about YARN?
The fundamental idea behind the YARN (Yet Another Resource Negotiator) architecture is to split the JobTracker's responsibility of resource management and job scheduling/monitoring into separate daemons.
1. Global Resource Manager: The main responsibility of Global Resource Manager is to distribute resources among
various applications.
2. NodeManager:
This is a per-machine slave daemon. The NodeManager's responsibility is to launch the application containers for application execution.
The NodeManager monitors resource usage such as memory, CPU, disk, network, etc.
3. Per-Application ApplicationMaster: The per-application ApplicationMaster is an application-specific entity. Its responsibility is to negotiate the resources required for execution from the ResourceManager.
It works along with the NodeManager for executing and monitoring the component tasks.
4. Container: The basic unit of allocation, which replaces the fixed map/reduce slots and allows fine-grained resource allocation across multiple resource types (e.g., Container_1: 1 GB memory, 6 CPUs).
Fig. YARN Architecture
A client submits an application to the ResourceManager, and the ResourceManager launches the ApplicationMaster by assigning it a container.
On successful container allocations, the application master launches the container by providing the container launch
specification to the NodeManager.
During the application execution, the client that submitted the job directly communicates with the Application Master
to get status, progress updates.
Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
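To illustrate this flow from the client side, the following is a minimal sketch using the YarnClient API. The application name, the shell command used as the ApplicationMaster, and the requested resources (1024 MB, 1 vcore) are assumptions chosen only to keep the example short; a real ApplicationMaster would register with the ResourceManager and request further containers for its tasks.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnSubmitExample {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());    // reads yarn-site.xml (ResourceManager address, etc.)
            yarnClient.start();

            // Ask the ResourceManager for a new application id
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("demo-app");   // assumed application name

            // Container launch specification for the ApplicationMaster
            // (here just a shell command, purely for illustration)
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                    // local resources (jars, files)
                    Collections.emptyMap(),                    // environment variables
                    Collections.singletonList("sleep 60"),     // command that starts the AM
                    null, null, null);
            appContext.setAMContainerSpec(amContainer);

            // Resources requested for the ApplicationMaster container
            appContext.setResource(Resource.newInstance(1024, 1));   // 1024 MB memory, 1 vcore
            appContext.setQueue("default");

            // Submit: the ResourceManager allocates a container and launches the AM in it
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId + " state: "
                    + yarnClient.getApplicationReport(appId).getYarnApplicationState());

            yarnClient.stop();
        }
    }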
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop’s distributed column based database. It supports structured data storage for large tables.
Hive: It is Hadoop's data warehouse; it enables the analysis of large data sets using a language very similar to SQL. So, one can access data stored in a Hadoop cluster by using Hive.
Pig: Pig is an easy-to-understand data flow language. It helps with the analysis of large data sets, which is quite the order of the day with Hadoop, without writing code in the MapReduce paradigm.
ZooKeeper: It is an open source application that configures and synchronizes distributed systems.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web-based tool for provisioning, managing and monitoring Apache Hadoop clusters.
Hadoop Common: It is a set of common utilities and libraries which handle other Hadoop modules. It makes sure that
the hardware failures are managed by Hadoop cluster automatically.
Hadoop YARN: It allocates resources which in turn allow different users to execute various applications without
worrying about the increased workloads.
HDFS: It is the Hadoop Distributed File System, which stores data in the form of small blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability.
Hadoop MapReduce: It executes tasks in a parallel fashion by distributing the data as small blocks.
Standalone, or local mode: This is one of the least commonly used environments; it is only for running and debugging MapReduce programs. This mode does not use HDFS, nor does it launch any of the Hadoop daemons.
Pseudo-distributed mode (cluster of one): This mode runs all daemons on a single machine. It is most commonly used in development environments.
Fully distributed mode: This mode is most commonly used in production environments. It runs all daemons on a cluster of machines rather than a single one.
core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings that are common to MapReduce and HDFS.
mapred-site.xml – This configuration file specifies the framework name for MapReduce by setting mapreduce.framework.name.
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block
permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
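The properties defined by these XML files can also be read or overridden programmatically through the Configuration class. The sketch below is only an illustration; the values set here, and the assumption that the site files are on the classpath, are examples rather than recommended settings.

    import org.apache.hadoop.conf.Configuration;

    public class ConfigExample {
        public static void main(String[] args) {
            // new Configuration() loads core-default.xml and core-site.xml from the classpath;
            // HDFS, MapReduce and YARN components add hdfs-site.xml, mapred-site.xml and
            // yarn-site.xml on top of this.
            Configuration conf = new Configuration();

            // core-site.xml: core settings such as the default file system URI
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));

            // mapred-site.xml: the framework MapReduce jobs run on ("yarn" in Hadoop 2.x)
            conf.set("mapreduce.framework.name", "yarn");

            // hdfs-site.xml: HDFS settings such as the block replication factor
            conf.set("dfs.replication", "3");    // 3 replicas, the commonly used default

            // yarn-site.xml: ResourceManager / NodeManager settings, e.g. the RM host
            conf.set("yarn.resourcemanager.hostname", "localhost");   // assumed hostname

            System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
        }
    }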
Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
Master Map Reduce: It decides and schedules computation task on slave nodes.
NOTE: Based on marks for the question explain hdfs daemons and mapreduce daemons.