Unit 2
History of Hadoop
• 2004: Google publishes the "Map-reduce" technique for large-scale data processing ("It is an important technique!").
• Doug Cutting extended Apache Nutch with this technique; that work later grew into Hadoop.
HADOOP
• Fault-tolerance
Why Distributed processing?
• 1 Machine: 4 I/O channels, each channel 100 Mbps
• 10 Machines: 4 I/O channels per machine, each channel 100 Mbps
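A quick back-of-the-envelope comparison (the 40-minute scan time is only an illustrative assumption):
• 1 machine: 4 channels × 100 Mbps = 400 Mbps of aggregate I/O bandwidth.
• 10 machines: 40 channels × 100 Mbps = 4,000 Mbps of aggregate I/O bandwidth, ten times as much.
• So a full scan of a dataset that would take about 40 minutes on the single machine finishes in roughly 4 minutes when the data is spread across the 10-machine cluster.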
Hadoop Key Characteristics
• Reliable and Robust
• Economical
• Accessible
• Flexible
• Scalable
Apache Hadoop Ecosystem
Core components of Hadoop
• MapReduce and HDFS are the core components of Hadoop.
What is HDFS DataNode?
• A DataNode is a slave node in HDFS that stores the actual data blocks.
• The NameNode receives heartbeats and block reports from all DataNodes, which confirm that each DataNode is alive.
• If a DataNode fails, the NameNode chooses new DataNodes to hold new replicas of its blocks.
Advantages of HDFS Block
• The benefits of HDFS blocks are listed below (a short worked example follows this list):
• The blocks are of fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk.
• The block concept simplifies storage management on the DataNodes.
• The DataNodes do not need to be concerned with block metadata such as file permissions; the NameNode maintains the metadata of all the blocks.
• If the size of a file is less than the HDFS block size, the file does not occupy the complete block's storage.
• Because a file is chunked into blocks, a file larger than any single disk can be stored: the data blocks are distributed and stored on multiple nodes in the Hadoop cluster.
• Blocks are easy to replicate between the DataNodes, which provides fault tolerance and high availability. The Hadoop framework replicates each block across multiple nodes (the default replication factor is 3).
• In case of a node failure or block corruption, the same block can be read from another node holding a replica.
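A quick worked example (assuming the common default block size of 128 MB; the 300 MB file is hypothetical): a 300 MB file is split into three blocks of 128 MB, 128 MB, and 44 MB. The last block occupies only 44 MB of disk space rather than a full 128 MB, and with the default replication factor of 3 each of the three blocks is stored on three different DataNodes, giving nine block copies in total.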
Rack
• A rack is a collection of around 40-50 machines (DataNodes) connected to the same network switch.
• If the rack's switch or network goes down, the whole rack becomes unavailable.
• Rack awareness is the concept of choosing the closest node based on rack information.
• To ensure that all the replicas of a block are not stored on the same rack, the NameNode follows a rack awareness algorithm when placing replicas, which reduces latency and provides fault tolerance.
• Suppose the replication factor is 3; then, according to the rack awareness algorithm:
• The first replica will get stored on the local rack.
• The second replica will get stored on the other DataNode in the same rack.
• The third replica will get stored on a different rack.
• Consider, for example, a Hadoop cluster with 3 racks, each containing 4 DataNodes.
• Suppose there are 3 file blocks (Block 1, Block 2, Block 3) to place in this cluster.
• Hadoop replicates file blocks to provide high availability and fault tolerance. With the default replication factor of 3, Hadoop places the replicas of the blocks across the racks in a way that also achieves good network bandwidth.
• To do this, Hadoop follows these rack awareness policies:
• There should not be more than one replica of a block on the same DataNode.
• No more than two replicas of a single block are allowed on the same rack.
• The number of racks used inside a Hadoop cluster should be smaller than the number of replicas.
• Continuing the example above: Block 1 is stored on the first DataNode of Rack 1, and its two replicas are stored on DataNodes 5 and 6 of another rack, summing up to 3 copies.
• Similarly, the replicas of the two other blocks are distributed across different racks while following the above policies.
• Benefits of implementing rack awareness in a Hadoop cluster:
• With the rack awareness policies, the data is stored in different racks, so the failure of a single rack does not lose the data.
• Rack awareness helps to maximize network bandwidth, because block transfers are kept within a rack where possible.
• It also improves cluster performance and provides high data availability.
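Rack awareness is usually configured by the administrator rather than detected automatically. One common mechanism is to point Hadoop at a topology script through the net.topology.script.file.name property in core-site.xml; the script maps a DataNode's IP address or hostname to a rack identifier such as /rack1. The file path below is an illustrative assumption, not a fixed convention.

<!-- core-site.xml (sketch): enable rack awareness via a user-supplied topology script -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>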
Write Operation
• When a client wants to write a file to HDFS, it communicates to the
NameNode for metadata.
• The NameNode responds with the number of blocks, their locations, the replicas, and other details.
• Based on information from NameNode, the client directly interacts with
the DataNode.
• The client first sends block A to DataNode 1 along with the IP of the other
two DataNodes where replicas will be stored.
• When Datanode 1 receives block A from the client, DataNode 1 copies the
same block to DataNode 2 of the same rack.
• As both DataNodes are in the same rack, the block is transferred via the rack switch.
• DataNode 2 then copies the same block to DataNode 4 on a different rack. As the two DataNodes are in different racks, the block is transferred via an out-of-rack switch.
• When the DataNodes receive the block from the client, they send a write confirmation to the NameNode.
• The same process is repeated for each block of the file.
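A minimal Java sketch of the client side of this write path, using the public FileSystem API (the NameNode URI, target path, and payload below are illustrative assumptions; block placement and replication are handled by the framework, not by client code):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/hadoop/example.txt");    // hypothetical target path
        try (FSDataOutputStream stream = fs.create(out, true)) {
            // The client only writes bytes; HDFS splits them into blocks and
            // pipelines each block to the DataNodes chosen by the NameNode.
            stream.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}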
Read Operation
• To read from HDFS, the client first communicates with the NameNode for
metadata.
• The Namenode responds with the locations of DataNodes containing
blocks.
• After receiving the DataNode locations, the client directly interacts with those DataNodes.
• The client starts reading data in parallel from the DataNodes, based on the information received from the NameNode.
• The data will flow directly from the DataNode to the client.
• When the client or application has received all the blocks of the file, it combines them back into the original file.
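A matching read-side sketch, again using the public FileSystem API (the NameNode URI and file path are assumptions; the block-by-block fetching from DataNodes happens inside the returned input stream):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/user/hadoop/example.txt");     // hypothetical source path
        try (FSDataInputStream stream = fs.open(in);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            // open() asks the NameNode for block locations; the stream then
            // reads each block directly from a DataNode holding a replica.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}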
Concept of Blocks in HDFS
Architecture
• When a heartbeat message reappears or a new heartbeat message is received, the respective DataNode sending the message is added back to the cluster.
• Monitoring
• Rebalancing
• Metadata replication
Command line interface
• HDFS can be used by interacting with it from the command line.
• There are numerous interfaces; the command line is one of the easiest.
• There are two properties that are set in the distributed mode.
• The principal one is fs.default.name, set to hdfs://localhost/, which is used to set the default Hadoop file system.
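A minimal configuration sketch for this setting (fs.default.name is the older name for what recent Hadoop releases call fs.defaultFS; pairing it with dfs.replication as the second commonly set property is an assumption):

<!-- core-site.xml: default file system used by the hadoop/hdfs commands -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor, often set to 1 on a single-node setup -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>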
Using HDFS Files
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSExample {
    public static void main(String[] args) throws IOException {
        String exampleF = "example.txt";
        // Create a Configuration object and obtain the FileSystem for it
        Configuration config = new Configuration();
        FileSystem fsys = FileSystem.get(config);
        // Create an object for the Path class
        Path fp = new Path(exampleF);
        if (!fsys.exists(fp)) {
            // Create a new, empty file if it does not exist yet
            boolean created = fsys.createNewFile(fp);
        }
        if (fsys.isFile(fp)) {
            // Open the file for reading
            FSDataInputStream fin = fsys.open(fp);
            fin.close();
        }
        // create() overwrites the file and returns a stream for writing
        FSDataOutputStream fout = fsys.create(fp);
        fout.close();
        // Delete the file (false = non-recursive)
        boolean deleted = fsys.delete(fp, false);
    }
}
HDFS Commands
• Hadoop FS Command Line
• The Hadoop FS command line is a simple way to
access and interface with HDFS.
• Below are some basic HDFS commands in Linux,
including operations like creating directories, moving
files, deleting files, reading files, and
listing directories.
Fsck command
HDFS Command to check the health of the Hadoop file system.
Command: hdfs fsck /
count
• HDFS Command to count the number of directories, files, and bytes under
the paths that match the specified file pattern.
• Usage: hdfs dfs -count <path>
• Command: hdfs dfs -count /user
cp
• HDFS Command to copy files from source to destination. This command
allows multiple sources as well, in which case the destination must be a
directory.
• Usage: hdfs dfs -cp <src> <dest>
• Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
• Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
/user/hadoop/dir
get
• HDFS Command to copy files from hdfs to the local file system.
• Usage: hdfs dfs -get <src> <localdst>
• Command: hdfs dfs -get /user/test /home/edureka
mkdir:
To create a directory, similar to the Unix mkdir command.
Options:
-p : Do not fail if the directory already exists
$ hadoop fs -mkdir [-p] <paths>
ls:
List directories present under a specific directory in HDFS, similar to
Unix ls command. The -lsr command can be used for recursive listing
of directories and files.
Options:
-d : List the directories as plain files
-h : Format the sizes of files in a human-readable manner instead of as a
number of bytes
-R : Recursively list the contents of directories
$ hadoop fs -ls [-d] [-h] [-R]
cat:
Display contents of a file, similar to Unix cat command.
$ hadoop fs -cat /user/data/sampletext.txt
chmod:
Change the permission of a file, similar to Linux shell’s command but
with a few exceptions.
<MODE> Same as the mode used for the shell's chmod command; the only
letters recognized are 'rwxXt'
<OCTALMODE> Mode specified in 3 or 4 digits. It is not possible to
specify only part of the mode, unlike the shell command.
Options:
-R : Modify the files recursively
$ hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH
chown:
Change owner and group of a file, similar to Linux shell’s command but
with a few exceptions.
Options:
-R : Modify the files recursively
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
mv:
Move files from one directory to another within HDFS, similar to Unix
mv command.
$ hadoop fs -mv /user/hadoop/sample1.txt /user/text/
rm:
Remove a file from HDFS, similar to the Unix rm command.
This command does not delete directories by default.
For a recursive delete, use -rm -r.
Options:
-r : Recursively remove directories and files
-skipTrash : Bypass the trash and immediately delete the source
-f : Do not report an error if the file does not exist
$ hadoop fs -rm [-r] [-skipTrash] [-f] /user/hadoop/sample1.txt
HDFS package
org.apache.hadoop.io
Generic i/o code for use when reading and writing data to the network, to
databases, and to files.
Interface Summary
RawComparator<T> : A Comparator that operates directly on byte representations of objects.
Stringifier<T> : Offers two methods, one to convert an object to a string representation and one to restore the object from its string representation.
Writable : A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.
Class Summary
BloomMapFile : Extends MapFile and provides very much the same functionality.
BooleanWritable : A WritableComparable for booleans.
BytesWritable : A byte sequence that is usable as a key or value.
ByteWritable : A WritableComparable for a single byte.
CompressedWritable : A base class for Writables which store themselves compressed and lazily inflate on field access.
DataOutputOutputStream : OutputStream implementation that wraps a DataOutput.
DefaultStringifier<T> : The default implementation of the Stringifier interface; stringifies objects using base64 encoding of their serialized form.
VersionMismatchException : Thrown by VersionedWritable.readFields(DataInput) when the version of an object being read does not
match the current implementation version as returned by VersionedWritable.getVersion().
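As an illustration of the Writable contract from this package, here is a hedged sketch of a custom Writable (the class name and fields are made up for the example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A hypothetical record type that Hadoop code could serialize to HDFS or over the network.
public class PageViewWritable implements Writable {
    private long timestamp;
    private int viewCount;

    // Writable implementations need a no-argument constructor for deserialization.
    public PageViewWritable() { }

    public PageViewWritable(long timestamp, int viewCount) {
        this.timestamp = timestamp;
        this.viewCount = viewCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeLong(timestamp);
        out.writeInt(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in exactly the same order as write().
        timestamp = in.readLong();
        viewCount = in.readInt();
    }
}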
HDFS High Availability
• High availability refers to the availability of system or data in the wake of
component failure in the system .
• High Availability was a new feature added to Hadoop 2.x to solve the
Single point of failure problem in the older versions of Hadoop.
• Hadoop HDFS follows a master-slave architecture in which the NameNode is the master node and maintains the filesystem tree, so HDFS cannot be used without the NameNode.
• This makes the NameNode a single point of failure and a potential bottleneck; the HDFS high availability feature addresses this issue.
The high availability feature in Hadoop ensures the availability of the
Hadoop cluster without any downtime, even in unfavourable conditions
like NameNode failure, DataNode failure, machine crash, etc.
Features of HDFS
• Data replication
• Data integrity
• Data resilience
are the three key features of HDFS.
HDFS ensures data integrity throughout the
cluster with the help of the following features:
• Maintaining transaction logs
• Validating checksums
• Creating data blocks
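For example, the checksum that HDFS maintains for a file's blocks can be inspected from the command line (the file path below is a hypothetical example):
Command: hdfs dfs -checksum /user/hadoop/example.txt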