Unit 1 - BDS - DS307
A. Sriram
Asst. Prof., Dept. of CSE
Unit I: Introduction to Big Data
Introduction to Big Data
Characteristics of Big Data
Traits of Big Data
Challenges of Conventional Systems
Sources of Big Data
Evolution of Analytic Scalability
Analytic Processes & Tools
Analysis vs Reporting
Modern Data Analytic Tools
Statistical Concepts: Sampling Distributions, Re-Sampling
What is Data?
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Data ranges from "small data" to "big data".
Introduction to "Big Data"
Definition (Gartner):
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
So what is Big Data?
Big data refers to the massive datasets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
"Big" is relative: if big data is measured by the volume of transactions and transaction history, then hundreds of terabytes (10^12 bytes) may be considered "big data" for a pharmaceutical company, while a volume of transactions in petabytes (10^15 bytes) may be considered small for a government agency.
Unit       Shortened   Capacity
Byte       B           8 bits
Kilobyte   KB          10^3 bytes
Megabyte   MB          10^6 bytes
Gigabyte   GB          10^9 bytes
Terabyte   TB          10^12 bytes
Petabyte   PB          10^15 bytes
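The scale of these units can be sketched in a few lines of Python, using the decimal (SI) prefixes that match the 10^12 and 10^15 figures above; the dictionary and function names here are illustrative, not from any library:

```python
# Decimal (SI) byte units, matching the terabyte/petabyte figures above.
UNITS = {
    "KB": 10**3,   # kilobyte
    "MB": 10**6,   # megabyte
    "GB": 10**9,   # gigabyte
    "TB": 10**12,  # terabyte
    "PB": 10**15,  # petabyte
}

def to_bytes(value, unit):
    """Convert a quantity like (300, 'TB') into a raw byte count."""
    return value * UNITS[unit]

# A petabyte is 1,000x a terabyte -- which is why "hundreds of terabytes"
# can be big for one organization yet petabytes small for another.
print(to_bytes(1, "PB") // to_bytes(1, "TB"))  # -> 1000
```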
Big Data Statistics, Growth & Facts 2020
Why Big Data:
Wal-Mart handles more than one million customer transactions every hour.
Facebook hosts over 40 billion photos from its user base and generates 10 TB of data daily.
Twitter generates 7 TB of data daily.
Every minute, roughly: Instagram users like nearly 1.73 million photos, YouTube users upload 300 hours of new video content, Skype users make nearly 110,000 new calls, Amazon receives 4,300 new visitors, Uber passengers take 694 rides, and Netflix subscribers stream nearly 77,000 hours of video.
IBM claims 90% of today's stored data was generated in just the last two years.
Characteristics/Attributes of Big Data (the Vs):
Volume:
- a huge volume of information is generated, for example through social media.
Velocity:
- the speed at which information is generated and uploaded is also high.
Variety (format):
- this information can be in the form of pictures, videos and unstructured text via social media.
Veracity:
- data veracity is the degree of accuracy or truthfulness of a data set. In the context of big data, it is not just the quality of the data that is important, but how trustworthy the source, the type, and the processing of the data are.
Advantages
Data analytics helps an organization make better decisions
Increases the efficiency of work
Analytics keeps you updated on your customers' behavioural changes
Personalization of products and services
Improved quality of products and services
Limitations
Lack of alignment within teams
Lack of commitment and patience
Low quality of data
Privacy concerns
Complexity and bias
Data Warehouse vs Data Lakes
What is Hadoop?
1. NameNode
The NameNode runs on the master system. Its primary purpose is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). Data is stored in the form of blocks in a Hadoop cluster, so the metadata records which DataNode, i.e., which location, holds each block of a file. All information regarding the logs of transactions happening in the Hadoop cluster (when and who read or wrote the data) is also kept as metadata. Metadata is stored in memory.
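As a rough illustration of the idea, not Hadoop's actual implementation, the file-to-block-location mapping a NameNode keeps in memory can be sketched as a dictionary; the class, file, block and node names below are all invented for the example:

```python
# Toy sketch of NameNode-style metadata: which DataNode holds each
# block of each file. Illustrative only -- not Hadoop's real structures.
class ToyNameNode:
    def __init__(self):
        self.metadata = {}  # filename -> list of (block_id, datanode) pairs

    def register_file(self, filename, blocks):
        """Record where each block of a file is stored."""
        self.metadata[filename] = blocks

    def locate(self, filename):
        """Answer a client's question: which DataNodes hold this file?"""
        return self.metadata.get(filename, [])

nn = ToyNameNode()
nn.register_file("logs.txt", [("blk_1", "datanode-1"), ("blk_2", "datanode-3")])
print(nn.locate("logs.txt"))
```

A client never reads file data from the NameNode itself; it only asks for block locations and then talks to the DataNodes directly.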
2. DataNode
DataNodes run on the slave systems. The NameNode instructs the DataNodes on where to store the data. A DataNode is a program running on a slave system that serves read/write requests from clients. Since the data itself is stored on the DataNodes, they should have large storage capacity to hold more data.
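Continuing in the same illustrative style, a DataNode's role of serving block-level read/write requests can be sketched like this (block IDs and node names are made up; real DataNodes store blocks as files on local disks):

```python
# Toy sketch of a DataNode serving block read/write requests.
class ToyDataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> raw bytes

    def write_block(self, block_id, data):
        """Store a block, as instructed via the NameNode's placement."""
        self.blocks[block_id] = data

    def read_block(self, block_id):
        """Serve a client's read request for one block."""
        return self.blocks[block_id]

dn = ToyDataNode("datanode-1")
dn.write_block("blk_1", b"first chunk of logs.txt")
print(dn.read_block("blk_1"))
```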
3. Secondary NameNode
The Secondary NameNode takes periodic (hourly) backups of the metadata. If the Hadoop cluster fails or crashes, these hourly backups, or checkpoints, stored in a file named fsimage, are transferred to a new system. The metadata is loaded onto that new system, a new master is created with it, and the cluster is made to run correctly again.
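The checkpoint idea, snapshotting the in-memory metadata to an fsimage file so a new master can reload it, can be sketched with a plain dictionary and JSON; apart from the file name, nothing here reflects Hadoop's real (binary) fsimage format:

```python
import json
import os
import tempfile

# Toy checkpoint: dump in-memory metadata to an "fsimage" file, then
# reload it the way a new master would after a crash.
metadata = {"logs.txt": [["blk_1", "datanode-1"], ["blk_2", "datanode-3"]]}

checkpoint_dir = tempfile.mkdtemp()
fsimage_path = os.path.join(checkpoint_dir, "fsimage")

with open(fsimage_path, "w") as f:
    json.dump(metadata, f)      # the hourly checkpoint

with open(fsimage_path) as f:
    recovered = json.load(f)    # new master rebuilds its metadata

print(recovered == metadata)  # -> True
```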
4. Resource Manager
The Resource Manager, also known as the global master daemon, runs on the master system. It manages the resources for the applications running in a Hadoop cluster and consists of two main components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's request and for allocating memory on the slaves in the Hadoop cluster to host the ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster; per-application monitoring is then handled by each application's ApplicationMaster.
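A drastically simplified picture of the Scheduler's job, granting container requests while cluster capacity remains, might look like the sketch below; the class name, memory sizes, and application IDs are invented, and real YARN scheduling (queues, fairness, locality) is far richer:

```python
# Toy capacity check in the spirit of a YARN scheduler: grant container
# requests while the cluster has memory left. Vastly simplified.
class ToyScheduler:
    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb

    def request_container(self, app_id, memory_mb):
        """Grant the request if capacity remains, else reject it."""
        if memory_mb <= self.free_memory_mb:
            self.free_memory_mb -= memory_mb
            return True   # container granted
        return False      # application must wait and ask again

sched = ToyScheduler(total_memory_mb=8192)
print(sched.request_container("app_1", 4096))  # granted
print(sched.request_container("app_2", 8192))  # rejected: only 4096 MB left
```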
Data Warehouse vs Big Data
Tools Used in Big Data Analytics
Some of the tools used in Big Data analytics:
Hadoop - helps in storing and analyzing data
MongoDB - used on datasets that change frequently
Talend - used for data integration and management
Cassandra - a distributed database used to handle large chunks of data
Spark - used for real-time processing and analyzing large amounts of data
Storm - an open-source real-time computation system
Big Data Industry Applications
Some of the sectors where Big Data is actively used:
Ecommerce
Marketing
Education
Healthcare
Media and entertainment
Banking
Telecommunications
Government