
Unit 1 - BDS - DS307

The document provides a comprehensive overview of Big Data, including its definition, characteristics (5Vs: Volume, Velocity, Variety, Value, and Veracity), types (structured, semi-structured, and unstructured), and applications across various sectors such as healthcare, finance, and telecommunications. It also discusses the limitations and advantages of Big Data technologies, particularly focusing on Hadoop as a key framework for processing large datasets. Additionally, it highlights the importance of Big Data in enhancing decision-making and operational efficiencies in modern businesses.

Big Data Systems

A. Sriram
Asst Prof , Dept of CSE
Unit I: Introduction to Big Data
• Introduction to Big Data
• Characteristics of Big Data
• Traits of Big Data
• Challenges of Conventional Systems
• Sources of Big Data
• Evolution of Analytic Scalability
• Analytic Processes & Tools
• Analysis vs. Reporting
• Modern Data Analytic Tools
• Statistical Concepts:
• Sampling Distributions, Re-Sampling
• What is Data?
- The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

• What is Big Data?

[Figure: "Small Data", "Data", "Big Data"]
Introduction to “Big Data”

Definition:
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
• So what is Big Data?
- Big data refers to massive datasets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
- What counts as “big” is relative. If big data is measured by the volume of transactions and transaction history, then hundreds of terabytes (1 terabyte is about 10^12 bytes) may be considered “big data” for a pharmaceutical company,
- while a transaction volume in petabytes (1 petabyte is about 10^15 bytes) may be considered small for a government agency.
Unit        Abbreviation   Capacity
Bit         b              1 or 0 (on or off)
Byte        B              8 bits
Kilobyte    KB             1024 bytes
Megabyte    MB             1024 kilobytes
Gigabyte    GB             1024 megabytes
Terabyte    TB             1024 gigabytes
Petabyte    PB             1024 terabytes
Exabyte     EB             1024 petabytes
Zettabyte   ZB             1024 exabytes
Yottabyte   YB             1024 zettabytes
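The unit table can be turned into a small conversion helper. The following is a minimal Python sketch, assuming the 1024-based units listed in the table; the function names `to_bytes` and `human_readable` are illustrative and do not come from the slides.

```python
# Binary (1024-based) units, in the same order as the table.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a size such as (500, "TB") to a number of bytes."""
    return value * 1024 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    size = float(num_bytes)
    for unit in UNITS:
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


print(to_bytes(1, "TB"))         # 1099511627776 bytes, roughly 1.1 x 10^12
print(human_readable(10 ** 15))  # 10^15 bytes expressed in 1024-based units
```

Note that a 1024-based terabyte is slightly larger than 10^12 bytes, which is why the two conventions give slightly different figures for the same quantity.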


• Size of Big Data & Hadoop, 2017-22

• Big Data Statistics, Growth & Facts 2020
Why Big Data:
• Wal-Mart handles more than one million customer transactions every hour
• Facebook handles 40 billion photos from its user base
• Facebook generates 10 TB daily
• Twitter generates 7 TB daily
• Instagram users like nearly 1.73 million photos every minute
• YouTube users upload 300 hours of new video content every minute
• Skype users make nearly 110,000 new calls every minute
• Amazon receives 4,300 new visitors every minute
• Uber passengers take 694 rides every minute
• Netflix subscribers stream nearly 77,000 hours of video every minute
• IBM claims 90% of today’s stored data was generated in just
Characteristics / Attributes of Big Data: (3Vs)
• Volume of Data:
- A huge volume of information is generated through social media.

• Velocity:
- The velocity at which the information gets uploaded is also high.

• Variety (format):
- This information can be in the form of pictures, videos and unstructured text via social media.

• Big data has five major components: (5Vs)
Volume, Velocity, Variety, Value and Veracity
1. Volume of Data:
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
2. Velocity:
• Clickstreams and ad impressions capture user behavior at millions of events per second
• High-frequency stock trading algorithms reflect market changes within microseconds
• Machine-to-machine processes exchange data between billions of devices
• Infrastructure and sensors generate massive log data in real time
• Online gaming systems support millions of concurrent users, each producing multiple
3. Variety:
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
4. Value:
- Value is the major issue that we need to concentrate on. What matters is not just the amount of data that we store or process, but the amount of valuable, reliable and trustworthy data that needs to be stored, processed, and analyzed to find insights.

5. Veracity:
- Data veracity is the degree of accuracy or truthfulness of a data set. In the context of Big Data, it is not just the quality of the data that is important, but how trustworthy the source, the type, and the processing of the data are.

Types of Big Data
• Big Data is generally categorized into three different varieties:
• Structured Data
• Semi-Structured Data
• Unstructured Data
i) Structured Data
• Structured Data has a dedicated data model and a well-defined structure. It follows a consistent order and is designed so that it can be easily accessed and used by a person or a computer. Structured data is usually stored in well-defined columns, typically in databases.

• Example: Database Management Systems (DBMS)
ii) Semi-Structured Data

• Semi-Structured Data can be considered another form of Structured Data. It inherits a few properties of Structured Data, but the major part of this kind of data lacks a definite structure, and it does not obey the formal structure of data models such as an RDBMS.
• Example: Comma-Separated Values (CSV) file.
iii) Unstructured Data

• Unstructured Data is a completely different type of data, which neither has a structure nor follows the formal structural rules of data models. It does not even have a consistent format, and it is found to vary all the time. Occasionally, however, it may carry information related to date and time.

• Example: Audio files, images, etc.
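The three categories can be contrasted in a few lines of Python. This is only an illustrative sketch using the standard library: the table name, field names, and sample values are made up for the example and do not come from the slides.

```python
import csv
import io
import sqlite3

# Structured: rows in a relational table with a fixed, typed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO patients VALUES (1, 'Asha', 42)")
structured = conn.execute("SELECT name, age FROM patients").fetchall()

# Semi-structured: a CSV file has named fields, but no enforced types,
# and individual values may simply be missing.
csv_text = "id,name,age\n1,Asha,42\n2,Ravi,\n"
semi_structured = list(csv.DictReader(io.StringIO(csv_text)))

# Unstructured: free text (or audio/image bytes) with no fields at all.
unstructured = "Patient reported mild headache after the second dose."

print(structured)                           # typed values straight from the table
print(semi_structured[1])                   # every value is a string; age is empty
print(len(unstructured.split()), "words")   # only generic processing is possible
```

The structured query returns typed values directly, the CSV rows need interpretation by the reader, and the free text has to be analyzed before any field can be extracted from it.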


Find the Anomalies in Above Figure?
Applications of Big Data:
• Big Data is considered the most valuable and powerful fuel that can run the massive IT industries of the 21st century. It is one of the most widespread technologies, used in almost every business sector.
• Travel and Tourism: one of the biggest users of Big Data technology. It has enabled us to predict the requirements for travel facilities in many places, improving business through dynamic pricing and much more.
• Financial and Banking Sectors: extensively use Big Data technology. Big data analytics can aid banks in understanding customer behavior based on the inputs received from their investment patterns, shopping trends, motivation to invest, and personal or financial backgrounds.
• Healthcare Sector: Big Data has already started to create a huge difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel are now able to provide personalized healthcare services to individual patients.

• Telecommunication and Multimedia: one of the primary users of Big Data. Zettabytes of data are generated every day, and handling data at that scale requires Big Data technologies.

• Government and Military: also use Big Data technology at a high rate. Consider the amount of data a government generates in its records; in the military, a typical fighter jet needs to process petabytes of data during its flight.
Advantages of Big Data:
• Big Data technology has given us multiple advantages, a few of which are discussed here.
• Big Data has enabled predictive analysis, which can save organizations from operational risks.
• Predictive analysis has helped organizations grow their business by analyzing customer needs.
• Big Data has enabled many multimedia platforms to share data. Ex: YouTube, Instagram
• Medical and healthcare sectors can keep patients under constant observation.
• Big Data has changed the face of customer-based companies.
Compare Hadoop vs traditional systems
Limitations and Solutions of Existing Data Analytics Architecture

Advantages
Data analytics helps an organization make better decisions
Increases the efficiency of work
Keeps you updated on changes in customer behaviour
Enables personalization of products and services
Improves the quality of products and services

Limitations
Lack of alignment within teams
Lack of commitment and patience
Low quality of data
Privacy concerns
Complexity & Bias
Data warehouse v/s Data Lakes
What is Hadoop

• Hadoop is an open-source framework from Apache, used to store, process, and analyze data that is very large in volume.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.
• Moreover, it can be scaled up just by adding nodes to the cluster.
Hadoop Architecture
Distributed Processing System
• The Hadoop Distributed File System (HDFS) is the distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
• Both the NameNode and the DataNodes can run on commodity machines. HDFS is developed in Java, so any machine that supports Java can run the NameNode and DataNode software.
NameNode
• It is the single master server in the HDFS cluster.
• As it is a single node, it can become a single point of failure.
• It manages the file system namespace by executing operations such as opening, renaming, and closing files.
• It simplifies the architecture of the system.
DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.
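The division of labour between the NameNode (metadata) and the DataNodes (block storage) can be illustrated with a toy model. This is a simplified Python sketch, not the HDFS API: the class names mirror the slide terminology, but the tiny block size, the replication factor of 2, and the round-robin placement are assumptions made for the example. For brevity the write is routed through the NameNode object, whereas real HDFS clients send block data to DataNodes directly.

```python
from itertools import cycle

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 2   # number of copies kept of each block (assumed for this toy)


class DataNode:
    """Stores block contents, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}                 # filename -> [(block_id, [node names])]
        self._next_node = cycle(datanodes)  # simple round-robin placement

    def write(self, filename, data):
        entries = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{i // BLOCK_SIZE}"
            targets = [next(self._next_node) for _ in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            entries.append((block_id, [n.name for n in targets]))
        self.block_map[filename] = entries

    def read(self, filename):
        by_name = {n.name: n for n in self.datanodes}
        # Read each block from the first DataNode listed for it.
        return b"".join(by_name[nodes[0]].blocks[block_id]
                        for block_id, nodes in self.block_map[filename])


nodes = [DataNode(f"dn{i}") for i in range(3)]
namenode = NameNode(nodes)
namenode.write("log.txt", b"hello big data")
print(namenode.block_map["log.txt"])   # metadata only: block ids and locations
print(namenode.read("log.txt"))        # file reassembled from its blocks
```

Because every block lives on two DataNodes, losing one DataNode does not lose data, while losing the block map held by the NameNode would make the blocks unusable, which is why the NameNode is described as a single point of failure.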
History of Hadoop
There are three components of Hadoop:
• Hadoop HDFS - the Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
• Hadoop MapReduce - the processing unit of Hadoop.
• Hadoop YARN - the resource management unit of Hadoop.
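To give a feel for what the MapReduce processing model does, here is a word count written as plain Python functions. It is a single-machine sketch of the map, shuffle, and reduce phases, not Hadoop's Java API; the function names and sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data needs big storage", "data drives decisions"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the reduce calls run in parallel per key, which is what lets the same program scale by adding nodes.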
HDFS Architecture
Hadoop – Daemons and Their Features
A daemon is a background process. Hadoop daemons are the set of processes that run on Hadoop. Hadoop is a framework written in Java, so all of these processes are Java processes.
Apache Hadoop 2 consists of the following daemons:
NameNode
DataNode
Secondary NameNode
Resource Manager
Node Manager
The NameNode, Secondary NameNode, and Resource Manager run on the master system, while the Node Manager and DataNode run on the slave machines.

1. NameNode
The NameNode runs on the master system. Its primary purpose is to manage all the metadata. Metadata describes the files stored in HDFS (Hadoop Distributed File System). Data is stored in a Hadoop cluster in the form of blocks, so the metadata records which DataNode holds each block of a file. Information about the transactions happening in the cluster (when, or by whom, data was read or written) is also kept as metadata. The metadata is held in memory.
2. DataNode
The DataNode runs on the slave systems. The NameNode instructs the DataNodes where to store data. A DataNode is a program that runs on a slave system and serves read/write requests from clients. Because the data itself is stored on the DataNodes, they should have large storage capacity.

3. Secondary NameNode
The Secondary NameNode takes periodic checkpoints of the NameNode's metadata (hourly by default); it is not a backup of the data itself. Each checkpoint is stored in a file named fsimage. If the NameNode fails or crashes, this file can be transferred to a new system, where a new master is created from the checkpointed metadata and the cluster is brought back up. Changes made after the last checkpoint may be lost.
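The checkpoint idea can be sketched as folding a log of recent namespace changes into the last saved image. HDFS does keep an edit log alongside fsimage, but the Python below is only a conceptual model: the operation names, tuple layout, and paths are hypothetical and do not reflect the real on-disk formats.

```python
def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot with every logged edit applied in order."""
    image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = []                        # new file with no blocks yet
        elif op == "add_block":
            image[path] = image[path] + [args[0]]   # append a block id to the file
        elif op == "rename":
            image[args[0]] = image.pop(path)        # move the entry to the new path
        elif op == "delete":
            image.pop(path, None)
    return image


fsimage = {"/logs/a.txt": ["blk_1"]}   # namespace as of the last checkpoint
edit_log = [                           # changes recorded since then
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
]

new_fsimage = apply_edits(fsimage, edit_log)   # the checkpoint step
edit_log = []                                  # edits are now folded into the image
print(new_fsimage)
```

Restarting from `new_fsimage` restores the namespace without replaying a long log, which is the practical benefit of checkpointing regularly.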

4. Resource Manager
The Resource Manager, also known as the global master daemon, runs on the master system. It manages the resources for the applications running in a Hadoop cluster. The Resource Manager mainly consists of two components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting requests from clients and for allocating a container on one of the slaves in which to host the Application Master. The Scheduler allocates resources to the applications running in the cluster; it does not monitor or track the status of applications.
Data Warehouse vs. Big Data
Tools Used in Big Data Analytics
• Some of the tools used in Big Data analytics:
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of data
• Spark - used for real-time processing and analyzing large amounts of data
• Storm - an open-source real-time computational system
Big Data Industry Applications
• Some of the sectors where Big Data is actively used:
• Ecommerce
• Marketing
• Education
• Healthcare
• Media and entertainment
• Banking
• Telecommunications
• Government
Why Big Data

• More data leads to more accurate analysis, which leads to greater confidence in decision making, which in turn leads to greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
