Unit 1 - BDS - DS307
A. Sriram
Asst. Prof., Dept. of CSE
Unit I: Introduction to Big Data
Introduction to Big Data
Characteristics of Big Data
Traits of Big Data
Challenges of Conventional Systems
Sources of Big Data
Evolution of Analytic Scalability
Analytic Processes & Tools
Analysis vs Reporting
Modern Data Analytic Tools
Statistical Concepts: Sampling Distributions, Re-Sampling
What is Data?
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Data ranges from "small data" to "big data".
Introduction to "Big Data"
Definition (Gartner):
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
So what is Big Data?
Big data refers to the massive datasets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
"Big" is relative: if big data is measured by the volume of transactions and transaction history, then hundreds of terabytes (10^12 bytes) may be considered "big data" for a pharmaceutical company, while a volume of transactions in petabytes (10^15 bytes) may be considered small for a government agency.
Unit       Shortened   Capacity
Byte       B           8 bits
Kilobyte   KB          10^3 bytes
Megabyte   MB          10^6 bytes
Gigabyte   GB          10^9 bytes
Terabyte   TB          10^12 bytes
Petabyte   PB          10^15 bytes
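The scale of these units can be sketched in a few lines of Python, using the decimal (SI) prefixes that match the 10^12 and 10^15 figures above; the dictionary and function names here are illustrative, not from any library:

```python
# Decimal (SI) byte units, matching the terabyte/petabyte figures above.
UNITS = {
    "KB": 10**3,   # kilobyte
    "MB": 10**6,   # megabyte
    "GB": 10**9,   # gigabyte
    "TB": 10**12,  # terabyte
    "PB": 10**15,  # petabyte
}

def to_bytes(value, unit):
    """Convert a quantity like (300, 'TB') into a raw byte count."""
    return value * UNITS[unit]

# A petabyte is 1,000x a terabyte -- which is why "hundreds of terabytes"
# can be big for one organization yet petabytes small for another.
print(to_bytes(1, "PB") // to_bytes(1, "TB"))  # -> 1000
```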
Big Data Statistics, Growth & Facts 2020
Why Big Data:
Wal-Mart handles more than one million customer transactions every hour.
Facebook hosts over 40 billion photos from its user base and generates 10 TB of data daily.
Twitter generates 7 TB of data daily.
Every minute, roughly: Instagram users like nearly 1.73 million photos, YouTube users upload 300 hours of new video content, Skype users make nearly 110,000 new calls, Amazon receives 4,300 new visitors, Uber passengers take 694 rides, and Netflix subscribers stream nearly 77,000 hours of video.
IBM claims 90% of today's stored data was generated in just the last two years.
Characteristics/Attributes of Big Data (the Vs):
Volume:
- a huge volume of information is generated, for example through social media.
Velocity:
- the speed at which information is generated and uploaded is also high.
Variety (format):
- this information can be in the form of pictures, videos and unstructured text via social media.
Veracity:
- data veracity is the degree of accuracy or truthfulness of a data set. In the context of big data, it is not just the quality of the data that is important, but how trustworthy the source, the type, and the processing of the data are.
Advantages
Data analytics helps an organization make better decisions
Increases the efficiency of work
Analytics keeps you updated on your customers' behavioural changes
Personalization of products and services
Improved quality of products and services
Limitations
Lack of alignment within teams
Lack of commitment and patience
Low quality of data
Privacy concerns
Complexity and bias
Data Warehouse vs Data Lakes
What is Hadoop?
1. NameNode
The NameNode runs on the master system. Its primary purpose is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). Data is stored in the form of blocks in a Hadoop cluster, so the metadata records which DataNode, i.e., which location, holds each block of a file. All information regarding the logs of transactions happening in the Hadoop cluster (when and who read or wrote the data) is also kept as metadata. Metadata is stored in memory.
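As a rough illustration of the idea, not Hadoop's actual implementation, the file-to-block-location mapping a NameNode keeps in memory can be sketched as a dictionary; the class, file, block and node names below are all invented for the example:

```python
# Toy sketch of NameNode-style metadata: which DataNode holds each
# block of each file. Illustrative only -- not Hadoop's real structures.
class ToyNameNode:
    def __init__(self):
        self.metadata = {}  # filename -> list of (block_id, datanode) pairs

    def register_file(self, filename, blocks):
        """Record where each block of a file is stored."""
        self.metadata[filename] = blocks

    def locate(self, filename):
        """Answer a client's question: which DataNodes hold this file?"""
        return self.metadata.get(filename, [])

nn = ToyNameNode()
nn.register_file("logs.txt", [("blk_1", "datanode-1"), ("blk_2", "datanode-3")])
print(nn.locate("logs.txt"))
```

A client never reads file data from the NameNode itself; it only asks for block locations and then talks to the DataNodes directly.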
2. DataNode
DataNodes run on the slave systems. The NameNode instructs the DataNodes on where to store the data. A DataNode is a program running on a slave system that serves read/write requests from clients. Since the data itself is stored on the DataNodes, they should have large storage capacity to hold more data.
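Continuing in the same illustrative style, a DataNode's role of serving block-level read/write requests can be sketched like this (block IDs and node names are made up; real DataNodes store blocks as files on local disks):

```python
# Toy sketch of a DataNode serving block read/write requests.
class ToyDataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> raw bytes

    def write_block(self, block_id, data):
        """Store a block, as instructed via the NameNode's placement."""
        self.blocks[block_id] = data

    def read_block(self, block_id):
        """Serve a client's read request for one block."""
        return self.blocks[block_id]

dn = ToyDataNode("datanode-1")
dn.write_block("blk_1", b"first chunk of logs.txt")
print(dn.read_block("blk_1"))
```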
3. Secondary NameNode
The Secondary NameNode takes periodic (hourly) backups of the metadata. If the Hadoop cluster fails or crashes, these hourly backups, or checkpoints, stored in a file named fsimage, are transferred to a new system. The metadata is loaded onto that new system, a new master is created with it, and the cluster is made to run correctly again.
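The checkpoint idea, snapshotting the in-memory metadata to an fsimage file so a new master can reload it, can be sketched with a plain dictionary and JSON; apart from the file name, nothing here reflects Hadoop's real (binary) fsimage format:

```python
import json
import os
import tempfile

# Toy checkpoint: dump in-memory metadata to an "fsimage" file, then
# reload it the way a new master would after a crash.
metadata = {"logs.txt": [["blk_1", "datanode-1"], ["blk_2", "datanode-3"]]}

checkpoint_dir = tempfile.mkdtemp()
fsimage_path = os.path.join(checkpoint_dir, "fsimage")

with open(fsimage_path, "w") as f:
    json.dump(metadata, f)      # the hourly checkpoint

with open(fsimage_path) as f:
    recovered = json.load(f)    # new master rebuilds its metadata

print(recovered == metadata)  # -> True
```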
4. Resource Manager
The Resource Manager, also known as the global master daemon, runs on the master system. It manages the resources for the applications running in a Hadoop cluster and consists of two main components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's request and for allocating memory on the slaves in the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster; per-application monitoring is then handled by each application's ApplicationMaster.
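A drastically simplified picture of the Scheduler's job, granting container requests while cluster capacity remains, might look like the sketch below; the class name, memory sizes, and application IDs are invented, and real YARN scheduling (queues, fairness, locality) is far richer:

```python
# Toy capacity check in the spirit of a YARN scheduler: grant container
# requests while the cluster has memory left. Vastly simplified.
class ToyScheduler:
    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb

    def request_container(self, app_id, memory_mb):
        """Grant the request if capacity remains, else reject it."""
        if memory_mb <= self.free_memory_mb:
            self.free_memory_mb -= memory_mb
            return True   # container granted
        return False      # application must wait and ask again

sched = ToyScheduler(total_memory_mb=8192)
print(sched.request_container("app_1", 4096))  # granted
print(sched.request_container("app_2", 8192))  # rejected: only 4096 MB left
```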
Data Warehouse vs Big Data
Tools Used in Big Data Analytics
Some of the tools used in Big Data analytics:
Hadoop - helps in storing and analyzing data
MongoDB - used on datasets that change frequently
Talend - used for data integration and management
Cassandra - a distributed database used to handle large chunks of data
Spark - used for real-time processing and analyzing large amounts of data
Storm - an open-source real-time computation system
Big Data Industry Applications
Some of the sectors where Big Data is actively used:
Ecommerce
Marketing
Education
Healthcare
Media and entertainment
Banking
Telecommunications
Government