
Lecture-1&2

Big Data (KCS-061)


Unit 1: Introduction to Big Data
• Types of digital data
• History of Big Data innovation
• Introduction to Big Data platform, drivers for Big Data
• Big Data architecture and characteristics
• 5 Vs of Big Data
• Big Data technology components
• Big Data importance and applications
• Big Data features – security, compliance, auditing and protection
• Big Data privacy and ethics
• Big Data Analytics
• Challenges of conventional systems
• Intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools
Introduction
• Today, data undoubtedly is an invaluable asset of any enterprise (big
or small). Even though professionals work with data all the time, the
understanding, management and analysis of data from
heterogeneous sources remains a serious challenge.
• Data growth has seen exponential acceleration since the advent of
the computer and internet.
What is Big Data?
• The term “big data” refers to data that is so large, fast or complex
that it’s difficult or impossible to process using traditional methods
and it demands cost-effective, innovative forms of information
processing for enhanced insight and decision making.
Why is Big Data important?
• The importance of big data doesn’t only revolve around how much
data you have, but what you do with it.

• You can take data from any source and analyze it to find answers that
enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making.

• When you combine big data with high-powered analytics, you can accomplish these business-related tasks effectively.
Big Data Applications
• Its applications include:
• Healthcare
• Academia
• Banking
• Manufacturing
• IT
• Retail
• Transportation
• Media and Entertainment
• Today, numerous companies are using big data analytics.
Types of digital data
• Structured
• Unstructured
• Semi-structured
• Usually, data is in an unstructured format, which makes extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is either
unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the
whole enterprise data.
Formats of Digital Data
(Figure omitted: percentage distribution of the three forms of data.)
• Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search-engine algorithms. For instance, an employee table in a company database is structured: employee details, job positions, salaries, and so on are present in an organized manner.

• Unstructured
This refers to data that lacks any specific form or structure whatsoever, which makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data. Structured and unstructured are two important types of big data.
• Semi-structured
This data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that has not been organized under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data. That completes the types of data; their characteristics are discussed later.
Example: data in an XML file
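
To make the contrast concrete, here is a minimal Python sketch of the three formats. The records and the XML snippet are hypothetical illustrations, not data from any real system.

```python
# Hypothetical records illustrating the three formats of digital data.
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields, directly mappable to database rows.
table = io.StringIO("emp_id,name,salary\n101,Asha,52000\n102,Ravi,61000\n")
for row in csv.DictReader(table):
    print(row["name"], row["salary"])

# Unstructured: free-form text (e.g. an email body) with no schema to query.
email_body = "Hi team, the quarterly numbers look good. Let's meet on Friday."
print("words:", len(email_body.split()))

# Semi-structured: no rigid schema, but tags segregate individual elements.
doc = ET.fromstring("<employee><name>Asha</name><salary>52000</salary></employee>")
print(doc.find("name").text, doc.find("salary").text)
```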
History of Big Data innovation
• Roughly 90% of the available data has been created in the last two years, and the term Big Data has been around since 2005, when it was coined at O'Reilly Media.
• The evolution of Big Data includes a number of preliminary steps for its foundation.
• Such steps to the modern conception of Big Data involve the development of computers,
smart phones, the internet, and sensory (Internet of Things) equipment to provide data.
Credit cards also played a role, by providing increasingly large amounts of data, and
certainly social media changed the nature of data volumes in novel and still developing
ways.
The Foundations of Big Data:
•1937 - The first major data project was ordered by Franklin D. Roosevelt's administration in the USA. After the Social Security Act became law in 1937, the government had to keep track of contributions from 26 million Americans and more than 3 million employers.
•1943 - The first data-processing machine appeared in 1943 and was developed by the British to decipher Nazi codes during World War II. This device, named Colossus, searched for patterns in intercepted messages at a rate of 5,000 characters per second.
•1965 - The United States government decided to build the first data center to store over 742 million tax returns and 175 million sets of fingerprints, transferring all those records onto magnetic computer tape stored in a single location.
• 1989 - British computer scientist Tim Berners-Lee invented what eventually became the World Wide Web. He wanted to facilitate the sharing of information via a 'hypertext' system.

• 1995 - A supercomputer is built that can do as much work in a second as a calculator operated by a single person could do in 30,000 years.

• 2005 - The term Big Data was coined by Roger Mougalas back in 2005. However, the application of big data and the quest to understand the available data is something that has existed for a long time. 2005 is also the year that Hadoop was created at Yahoo!, built on top of Google's MapReduce. Its goal was to index the entire World Wide Web, and nowadays open-source Hadoop is used by many organizations to crunch through huge amounts of data.
• 2010 - Eric Schmidt, speaking at the Techonomy conference in Lake Tahoe, California, states that "there were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days."

• 2011 - A McKinsey report states that by 2018 the US will face a shortfall of between 140,000 and 190,000 professional data scientists, and that issues including privacy, security and intellectual property will have to be resolved before the full value of Big Data can be realized.

• 2014 - The rise of the mobile machines: for the first time, more people use mobile devices than office or home computers to access digital data. 88% of business executives surveyed by GE and Accenture report that big data analytics is a top priority for their business.
Big Data Platforms
• A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.

• Below are some Big Data platforms and tools:


1) Microsoft Azure
Users can analyze data stored on Microsoft’s Cloud platform, Azure, with a
broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, that streamlines
data cluster analysis and integrates seamlessly with Azure's other data tools.
2) Cloudera:
Rooted in Apache’s Hadoop, Cloudera can handle massive amounts of data. Clients
routinely store more than 50 petabytes in Cloudera’s Data Warehouse, which can
manage data including machine logs, text, and more. Meanwhile, Cloudera’s
DataFlow—previously Hortonworks’ DataFlow—analyzes and prioritizes data in
real time.

3) Google Cloud:
Google Cloud offers lots of big data management tools, each with its own
speciality. BigQuery warehouses petabytes of data in an easily queried
format. Cloud Dataflow analyzes ongoing data streams and batches of historical
data side by side. With Google Data Studio, clients can turn varied data into custom
graphics.
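
As a hedged illustration of BigQuery's "easily queried format", here is a minimal sketch using the google-cloud-bigquery Python client; the public sample table is an assumption for illustration, and Google Cloud credentials are assumed to be configured in the environment.

```python
# Minimal BigQuery query sketch (pip install google-cloud-bigquery).
# Assumes project and credentials are already configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL over a public sample table (table name is illustrative).
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.word, row.total)
```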
4) Talend:
Talend’s trio of big data integration platforms includes a free basic platform and
two paid subscription platforms, all rooted in open-source tools like Apache Spark.
The paid platforms, though—one designed for existing data, the other for real-time
data streams—come with more power and tech support. Both can clean and parse
data, delete duplicate data and detect fraud automatically, among other functions.

5)Tableau:
The Tableau platform—available on-premises or in the Cloud—allows users to find correlations, trends and unexpected interdependencies between data sets.
The Data Management add-on further enhances the platform, allowing for more
granular data cataloging and the tracking of data lineage.
6) MapR:
MapR’s platform, which they term "dataware," has attracted customers like
American Express and Samsung with its massive capacity (exabytes!) and robust
security measures. It is a dashboard for managing big data spread across various
platforms, clouds, servers and edge-computing devices.

7) Amazon Web Services:


Best known as AWS, Amazon’s cloud-based platform comes with 11 analytics tools
that are designed for everything from data prep and warehousing to SQL queries
and data lake design. All the resources scale with your data as it grows in a secure
cloud-based environment. Features include customizable encryption and the
option of a virtual private cloud.
8) IBM Cloud:
IBM’s full-stack cloud comes with 170 built-in tools, including more than 20 for
customizable big data management. Users can opt for a NoSQL or SQL database, or
store their data as JSON documents, among other database designs. The
platform can also run in-memory analysis and integrate open-source tools like
Apache Spark.

9) Alibaba Cloud:
The leading public cloud provider in China, Alibaba operates in 19 regions
worldwide, including the U.S. Its popular cloud platform offers a variety of
database formats and big data tools, including data warehousing, analytics for
streaming data and speedy Elasticsearch, which can scan petabytes of data
scattered across hundreds of servers in real time.
Drivers for Big Data
• Big Data is no longer just a buzzword; it is a proven phenomenon and not likely to die away soon.
• Two factors have combined to make Big Data especially appealing
• One is that so many potentially valuable data resources have come into
existence. These sources include the telemetry generated by today's smart
devices, the digital footprints left by people who are increasingly living their lives
online, and the rich sources of information commercially available from
specialized data vendors.
• The other factor contributing to Big Data's appeal is the emergence of powerful technologies for effectively exploiting it. IT organizations can now take advantage of tools such as Hadoop and NoSQL databases to rationalize, analyze and visualize Big Data in ways that enable them to quickly separate actionable insight from the massive chaff of raw input.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.
Most big data architectures include some or all of the following components:
•Data sources: All big data solutions start with one or more data sources. Examples
include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
•Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
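
As a minimal sketch of landing a raw file in one such store, here is a hedged example using the azure-storage-blob Python package; the connection string, container name, and paths are placeholders, not values from the slides.

```python
# Land a raw web-server log in Azure Blob Storage, acting as a data lake
# (pip install azure-storage-blob). All names below are placeholders.
from azure.storage.blob import BlobServiceClient

conn_str = "<storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="datalake",
                               blob="raw/logs/2024-01-01/webserver.log")

with open("webserver.log", "rb") as data:
    blob.upload_blob(data, overwrite=True)  # raw files stay in source format
```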
• Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
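
For instance, a filter-and-aggregate batch job of the kind described could look like the following PySpark sketch (Spark being one of the options named above); the paths and column names are illustrative assumptions.

```python
# A long-running batch job: read raw log files, filter and aggregate,
# write the prepared output to new files (paths/columns are placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch").getOrCreate()

logs = spark.read.json("datalake/raw/logs/2024-01-01/")
page_hits = (logs
             .filter(F.col("status") == 200)    # keep successful requests
             .groupBy("page")
             .agg(F.count("*").alias("hits")))  # aggregate hits per page
page_hits.write.mode("overwrite").parquet("datalake/curated/page-hits/")
```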

• Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out
processing, reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hubs, and Kafka.
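
Kafka is one of the options named above; a minimal producer sketch with the kafka-python package might look like this (broker address, topic, and payload are placeholders).

```python
# Publish device readings to a Kafka topic that buffers messages for
# scale-out stream processing (pip install kafka-python).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"device_id": "sensor-42", "temperature": 21.7}     # hypothetical event
producer.send("iot-readings", value=reading)
producer.flush()  # block until the message is actually delivered
```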
• Stream processing: After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise preparing the data for
analysis. The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams.
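
The slides name Azure Stream Analytics; as an illustration of the same filter-aggregate-sink pattern, here is a sketch using Spark Structured Streaming instead (a substitution, not the tool described above), consuming the hypothetical Kafka topic from the previous sketch. Running it requires the Spark-Kafka connector package on the classpath.

```python
# Continuously aggregate an unbounded Kafka stream and write to a sink.
# Topic and broker names are the hypothetical ones from the ingestion sketch.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("stream-prep").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType()))

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "iot-readings")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Running average temperature per device, recomputed as data arrives.
averages = readings.groupBy("device_id").agg(F.avg("temperature").alias("avg_temp"))

(averages.writeStream
         .outputMode("complete")   # emit the full updated aggregate each trigger
         .format("console")        # stand-in for a real output sink
         .start()
         .awaitTermination())
```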

• Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data store. Azure Synapse
Analytics provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which
can also be used to serve data for analysis.
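
As a minimal sketch of the serving side, the curated output from the earlier batch sketch could be exposed through Spark SQL (one of the options named above); the paths and table name remain hypothetical.

```python
# Expose prepared data as a queryable table and serve an analytical query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving").getOrCreate()

spark.read.parquet("datalake/curated/page-hits/") \
     .createOrReplaceTempView("page_hits")

spark.sql("""
    SELECT page, hits
    FROM page_hits
    ORDER BY hits DESC
    LIMIT 10
""").show()
```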
• Analysis and reporting: To empower users to analyze the data, the architecture
may include a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It might also support self-service
BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to
leverage their existing skills with Python or R.
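
A notebook-style exploration of the served data might be as simple as the following pandas sketch; the CSV export and its columns are hypothetical.

```python
# Quick interactive exploration of an exported result set in a notebook.
import pandas as pd

hits = pd.read_csv("page_hits.csv")     # hypothetical export of the serving query
print(hits.describe())                  # summary statistics of numeric columns
print(hits.nlargest(10, "hits")[["page", "hits"]])  # ten busiest pages
```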

• Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie with Sqoop.
When to use this architecture
Consider this architecture style when you need to:
•Store and process data in volumes too large for a traditional database.
•Transform unstructured data for analysis and reporting.
•Capture, process, and analyze unbounded streams of data in real time, or with low
latency.
•Use Azure Machine Learning or Microsoft Cognitive Services.
Big Data Characteristics (the 5 Vs)
• Volume
• The name 'Big Data' itself relates to enormous size.
• Volume refers to the sheer amount of data.
• Size plays a crucial role in determining the value of data: whether a particular dataset can actually be considered Big Data depends on its volume.
• Hence, volume is a necessary characteristic to consider when dealing with Big Data.
• Variety
• It refers to the nature of data: structured, semi-structured and unstructured.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured or unstructured.
• Structured data: organized data, generally with a defined length and format.
• Semi-structured data: partially organized data that does not conform to the formal structure of a database schema. Log files are an example of this type of data.
• Unstructured data: unorganized data that does not fit neatly into the traditional row-and-column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data, which cannot be stored in the form of rows and columns.
• Velocity
• Velocity refers to the high speed at which data accumulates.
• In Big Data, velocity describes how data flows in from sources like machines, networks, social media, mobile phones, etc.
• There is a massive and continuous flow of data. Velocity determines how fast data is generated and how fast it must be processed to meet demand.
• Sampling data can help in dealing with velocity, as the sketch below shows.
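
One concrete sampling technique (an illustration, not one named in the slides) is reservoir sampling, which maintains a fixed-size uniform random sample of a stream that arrives too fast, or is too large, to store in full:

```python
# Algorithm R: keep k items chosen uniformly at random from a stream of
# unknown length, using O(k) memory no matter how fast the data arrives.
import random

def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # replace with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```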
• Veracity
• It refers to inconsistency and uncertainty in data: available data can be messy, and its quality and accuracy are difficult to control.
• Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
• Example: data in bulk can create confusion, whereas too little data may convey only half or incomplete information.
• Value
• After taking the four Vs into account, there comes one more V, which stands for Value. Bulk data with no value is of no good to the company unless it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, value can be regarded as the most important of the 5 Vs.
Thank You
