Big Data Analytics M1
Module 1
Introduction to Big Data
Data
• Data is a set of values that represent a concept or concepts. It can be raw
information, such as numbers or text, or it can be more complex, such as images,
graphics, or videos.
Characteristics of Data
Composition: deals with the structure of the data, that is, its sources, its types, and whether it is static or real-time streaming.
Condition: deals with the state of the data, that is, “can this data be used as is for analysis?” or “does it require cleansing, enhancement, and enrichment first?”
Context: deals with questions such as “Where was this data generated?”, “Why was it generated?”, and so on.
In simple terms, the characteristics of data include:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Validity
• Uniqueness
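Several of these dimensions can be measured directly. The sketch below is a minimal illustration using only the standard library; the record set and field names are invented for the example.

```python
# Hypothetical sketch: measuring some data-quality dimensions on a small
# record set. The records and fields below are made up for illustration.

records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},   # incomplete email
    {"id": 2, "email": "b@example.com", "age": 29},   # duplicate id
    {"id": 3, "email": "c@example.com", "age": -5},   # invalid age
]

def completeness(rows, field):
    """Fraction of rows where the field is present and non-null."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def uniqueness(rows, field):
    """Fraction of values in the field that are distinct."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, predicate):
    """Fraction of rows whose field satisfies a validity rule."""
    return sum(1 for r in rows if predicate(r.get(field))) / len(rows)

print(completeness(records, "email"))  # 0.75
print(uniqueness(records, "id"))       # 0.75
print(validity(records, "age", lambda a: a is not None and a >= 0))  # 0.75
```

Accuracy, consistency, and timeliness usually need an external reference (a ground truth, a second source, or timestamps) and so cannot be computed from the records alone.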
Characteristics of Big Data
The characteristics of big data are commonly summarized as the five Vs:
• Volume – the sheer scale of the data generated
• Velocity – the speed at which data arrives and must be processed
• Variety – the range of data types and sources, structured and unstructured
• Veracity – the trustworthiness and quality of the data
• Value – the usefulness of the insights the data can yield
Evolution of Big Data
• 1970s and before – Mainframes: basic data storage; data has a fixed structure.
• 1980s and 1990s – Relational databases: data stored with structure and relationships.
• 2000s and beyond – Structured, unstructured, and multimedia data in the form of the WWW.
There are a lot of milestones in the evolution of Big Data which are described below:
Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
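The programming model Hadoop popularized, MapReduce, can be illustrated in a single process. This is only a sketch of the model, not Hadoop's actual API: real Hadoop distributes the map and reduce tasks across a cluster, while here the three phases run sequentially on toy input.

```python
# Single-process sketch of the MapReduce model that Hadoop popularized.
# The three phases (map, shuffle, reduce) run in one process for illustration.
from collections import defaultdict

documents = ["big data needs big storage", "data storage and data processing"]

# Map: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the grouped values for each key.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["data"])  # 3
```

The point of the model is that map and reduce are independent per key, which is what lets Hadoop spread the work over many machines.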
NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
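The schema flexibility of a document-oriented NoSQL store can be sketched with an in-memory dictionary. This toy store and its documents are invented for illustration and stand in for a real database client:

```python
# Illustrative sketch of document-store flexibility: records in the same
# collection need not share a fixed schema. (An in-memory dict stands in
# for a real NoSQL database here.)

collection = {}  # maps document id -> document (a free-form dict)

def put(doc_id, doc):
    collection[doc_id] = doc

def get(doc_id):
    return collection.get(doc_id)

# Two documents with different fields coexist in one collection, which a
# rigid relational schema would only allow via NULL-filled columns.
put("u1", {"name": "Asha", "email": "asha@example.com"})
put("u2", {"name": "Ravi", "followers": 1200, "tags": ["sports", "news"]})

print(get("u2")["followers"])  # 1200
```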
Today, technologies such as cloud computing and machine learning are widely used by companies to reduce maintenance and infrastructure costs and to extract meaningful insights from big data.
Challenges with Big Data
Soft state refers to a system design principle where the state of a system or its data is
allowed to change over time, even without direct user interaction.
1. Sensing
2. Collection
3. Wrangling
4. Analysis
5. Storage
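The five stages above can be chained into a single pipeline. The sketch below is purely illustrative; the function names, the toy source, and the data are assumptions, not a real framework:

```python
# Purely illustrative sketch: the five stages chained as one pipeline.
# Every name and value here is invented for the example.

def sense():                       # 1. identify a data source
    return "temp.csv"

def collect(source):               # 2. gather raw records from it
    return [" 21.5 ", "22.0", "bad", "23.5 "]

def wrangle(raw):                  # 3. clean: strip whitespace, drop bad rows
    cleaned = []
    for value in raw:
        try:
            cleaned.append(float(value.strip()))
        except ValueError:
            pass
    return cleaned

def analyze(values):               # 4. derive an insight (here, the mean)
    return sum(values) / len(values)

def store(result, destination):    # 5. persist the result
    destination["mean_temp"] = result

db = {}
store(analyze(wrangle(collect(sense()))), db)
print(db["mean_temp"])  # approximately 22.33
```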
Role / Elements of Big Data Ecosystem
1. Sensing
Sensing refers to the process of identifying data sources for your project.
This evaluation includes asking such questions as:
• Is the data accurate?
• Is the data recent and up to date?
• Is the data complete? Is the data valid? Can it be trusted?
Key pieces of the data ecosystem leveraged in this stage include:
• Internal data sources: Spreadsheets and other resources that originate from within
your organization.
• External data sources: Databases, spreadsheets, and websites that originate from outside
your organization.
• Software: Custom software that exists for the sole purpose of data sensing.
• Algorithms: A set of steps or rules that automates the process of evaluating data for
accuracy and completeness before it is used.
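One form such an algorithm can take is a fixed list of rules applied to each record before it is accepted. This is a hedged sketch; the rules and field names are invented for illustration:

```python
# Sketch of a rule-based evaluation step: each record is checked against a
# fixed set of rules before use. Rules and fields are made up for the example.

RULES = [
    ("id is present",       lambda r: r.get("id") is not None),
    ("email has an @",      lambda r: "@" in (r.get("email") or "")),
    ("age is non-negative", lambda r: isinstance(r.get("age"), int) and r["age"] >= 0),
]

def evaluate(record):
    """Return the names of the rules the record violates (empty = accepted)."""
    return [name for name, check in RULES if not check(record)]

good = {"id": 7, "email": "x@example.com", "age": 41}
bad  = {"id": None, "email": "not-an-email", "age": -3}

print(evaluate(good))  # []
print(evaluate(bad))   # all three rule names
```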
2. Collection
Once a potential data source has been identified, data must be collected. Data collection
can be completed through manual or automated processes.