Unit 1 - ETI (BDA)
Topics
● What is Big Data
● Why Big Data
The history of Big Data analytics can be traced back to the early days of computing, when organizations
first began using computers to store and analyze large amounts of data. However, it was not until the
late 1990s and early 2000s that Big Data analytics really began to take off, as organizations increasingly
turned to computers to help them make sense of the rapidly growing volumes of data being generated
by their businesses.
Today, Big Data analytics has become an essential tool for organizations of all sizes across a wide range
of industries. By harnessing the power of Big Data, organizations are able to gain insights into their
customers, their businesses, and the world around them that were simply not possible before.
As the field of Big Data analytics continues to evolve, we can expect to see even more amazing and
transformative applications of this technology in the years to come.
Data can be categorized into two main types:
➢ Structured Data: This type of data is highly organized and formatted in a way that is easily searchable and can be processed by machines. Examples include databases, spreadsheets, and tables.
➢ Unstructured Data: This type of data lacks a predefined data model or structure and is not easily searchable in traditional databases. Examples include text documents, images, videos, and social media posts.
Characteristics of Data
1. Accuracy
2. Relevance
3. Completeness
4. Consistency
5. Timeliness
6. Accessibility
7. Granularity
8. Consolidation
9. Validity
10. Security
11. Scalability
12. Volatility
Together, these characteristics determine the quality, usefulness, and reliability of data in various contexts.
Big Data
● Big Data is a collection of data that is huge in volume and keeps growing
exponentially over time.
● Its size and complexity are so great that traditional data management tools
cannot store or process it efficiently.
● In short, Big Data is still data, but of an enormous size.
Big Data analytics provides many advantages: it supports better decision
making, helps prevent fraudulent activities, and more.
Classification of analytics
1. Descriptive Analytics: This summarizes past data into a form that people can
easily read. This helps in creating reports, like a company’s revenue, profit, sales,
and so on. Also, it helps in the tabulation of social media metrics.
2. Diagnostic Analytics: This is done to understand what caused a problem in the
first place. Techniques like drill-down, data mining, and data recovery are all
examples. Organizations use diagnostic analytics because it provides
in-depth insight into a particular problem.
3. Predictive Analytics: This type of analytics looks at historical and present
data to make predictions about the future. Predictive analytics uses data mining, AI,
and machine learning to analyze current data and forecast what is likely to happen.
It works on predicting customer trends, market trends, and so on.
4. Prescriptive Analytics: This type of analytics prescribes the solution to a
particular problem. Prescriptive analytics works with both descriptive and
predictive analytics. Most of the time, it relies on AI and machine learning.
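Returning to descriptive analytics (type 1 above), here is a minimal sketch using pandas; the revenue figures and column names are invented for illustration, not taken from any real dataset.

```python
# Descriptive analytics: summarize past data into readable figures.
# The sales numbers below are illustrative only.
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [12000, 15000, 11000, 18000],
})

print(sales["revenue"].describe())              # count, mean, std, min, max, ...
print("Total revenue:", sales["revenue"].sum())
```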
Why is Big Data analytics important
● Data is the oil of today's world. With the right tools, technologies, and
algorithms, we can use data and convert it into a distinct business advantage.
● Data science can help you detect fraud using advanced machine learning algorithms.
● It helps you prevent significant monetary losses.
● It allows you to build intelligent capabilities into machines.
● You can perform sentiment analysis to gauge customer brand loyalty.
● It enables you to make better and faster decisions.
● It helps you recommend the right product to the right customer to enhance your business.
Data Science Roles
• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager
Big Data Terminologies
In the realm of Big Data, several terminologies and concepts are commonly used
to describe various aspects of data processing, storage, and analytics. Here are
some key terminologies used in Big Data environments:
Big Data:
● Refers to large and complex datasets that are challenging to process
using traditional data management tools.
Volume, Velocity, Variety:
● The three Vs of Big Data describe its main characteristics: Volume (the
sheer size of data), Velocity (the speed at which data is generated and
processed), and Variety (the diversity of data types and sources).
Structured Data:
● Data that is organized in a tabular format with a fixed schema. Examples include relational
databases.
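As a small illustration, the sketch below stores rows under a fixed schema using Python's built-in sqlite3; the table name and rows are hypothetical.

```python
# Structured data: rows that conform to a fixed, machine-readable schema.
# Uses Python's built-in sqlite3; the table and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))

# The fixed schema makes the data easy to search and query.
for row in conn.execute("SELECT id, name, city FROM customers"):
    print(row)
conn.close()
```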
Unstructured Data:
● Data that lacks a predefined data model or structure. Examples include text, images, and
videos.
Semi-structured Data:
● Data that is partially organized and may have some level of structure. Examples include JSON
and XML files.
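For instance, a JSON record carries keys and nesting (some structure) but no enforced schema; the record below is made up for illustration.

```python
# Semi-structured data: JSON has keys and nesting, but no fixed schema,
# so two records may legally differ in shape. Illustrative record.
import json

record = '{"user": "asha", "tags": ["sports", "news"], "profile": {"age": 30}}'
data = json.loads(record)

print(data["user"])            # asha
print(data["profile"]["age"])  # 30
# Another record could omit "profile" entirely or add new keys.
```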
Hadoop:
● An open-source framework for distributed storage and processing of large datasets. It
includes the Hadoop Distributed File System (HDFS) and MapReduce programming model.
MapReduce:
● A programming model for processing and generating large datasets in parallel across a
distributed cluster.
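The classic example is word count. The sketch below simulates the map, shuffle, and reduce phases in plain Python on one machine; a real MapReduce job would distribute these phases across a Hadoop cluster.

```python
# Word count: the canonical MapReduce example, simulated in-memory.
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```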
Apache Spark:
● An open-source, distributed computing system that provides fast and general-purpose data
processing for Big Data.
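The same word count expressed with Spark's RDD API looks like the sketch below; it assumes pyspark is installed and runs against a local session.

```python
# Word count using PySpark's RDD API on a local Spark session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["big data is big", "data is everywhere"])
counts = (lines.flatMap(lambda line: line.split())   # map each line to words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(counts.collect())
spark.stop()
```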
NoSQL:
● Stands for "Not Only SQL" and refers to a category of databases that do not strictly adhere to
the traditional relational database model. Examples include MongoDB and Cassandra.
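A minimal MongoDB sketch with pymongo is shown below; it assumes a MongoDB server is reachable on localhost:27017, and the database and collection names are hypothetical.

```python
# NoSQL: documents in a collection need no fixed schema.
# Assumes a local MongoDB server; names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["demo_db"]["posts"]

# These two documents deliberately differ in shape.
posts.insert_one({"user": "asha", "text": "hello", "likes": 3})
posts.insert_one({"user": "ravi", "text": "hi", "tags": ["intro"]})

print(posts.find_one({"user": "asha"}))
client.close()
```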
Data Lake:
● A centralized repository that allows organizations to store structured and
unstructured data at any scale. It enables data exploration and analysis.
Data Warehouse:
● A centralized repository for storing and analyzing structured data from different
sources. Data warehouses are designed for efficient querying and reporting.
ETL (Extract, Transform, Load):
● The process of extracting data from various sources, transforming it into a
suitable format, and loading it into a target system, such as a data warehouse.
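A toy end-to-end ETL pipeline might look like the sketch below: extract from a CSV source, transform (trim and cast values), and load into sqlite3 standing in for a warehouse; the data and names are illustrative.

```python
# Toy ETL: CSV source -> cleaning step -> sqlite3 "warehouse".
import csv, io, sqlite3

raw = "name,amount\nasha, 100 \nravi,250\n"   # illustrative source data

# Extract: read records from the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace and cast amounts to integers.
clean = [(r["name"].strip(), int(r["amount"].strip())) for r in rows]

# Load: write the cleaned rows into the target system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
print(conn.execute("SELECT * FROM sales").fetchall())
```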
Machine Learning:
● A field of artificial intelligence (AI) that involves the development of algorithms
that enable systems to learn and make predictions or decisions based on data.
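As a minimal sketch with scikit-learn, the model below learns a linear trend from past (x, y) pairs and predicts the next value; the numbers are invented for illustration.

```python
# Supervised learning in miniature: fit on historical data, then predict.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # e.g. month numbers (illustrative)
y = [10, 20, 30, 40]       # e.g. units sold (illustrative)

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))  # the learned trend predicts about 50
```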
Data Mining:
● The process of discovering patterns, trends, and insights from large datasets
using various techniques.
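One classic technique is frequent-pattern (market-basket) mining. The sketch below just counts co-occurring item pairs across a handful of made-up transactions; real data mining would work at far larger scale.

```python
# Tiny pattern discovery: count which item pairs co-occur in baskets.
from collections import Counter
from itertools import combinations

transactions = [                      # illustrative shopping baskets
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))     # most frequent co-occurring pairs
```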
Data Governance:
● The overall management of the availability, usability, integrity, and security of data
within an organization.
Data Security:
● The practice of protecting data from unauthorized access, disclosure, alteration,
and destruction.
Data Privacy:
● Concerned with protecting individuals' personal information and ensuring that data
is handled in compliance with privacy regulations.
Streaming Analytics:
● Analyzing and processing real-time data streams as they are generated, allowing
for immediate insights and actions.
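A minimal sketch of the idea: keep a sliding window over an incoming stream and act on each event as it arrives. Here the stream is a simple generator; a real system would read from a source such as Kafka.

```python
# Streaming sketch: sliding-window average with an immediate action.
from collections import deque

def stream():                          # stand-in for a real event source
    yield from [10, 12, 55, 11, 13, 60, 12]

window = deque(maxlen=3)               # keep the last 3 readings
for reading in stream():
    window.append(reading)
    avg = sum(window) / len(window)
    if reading > 2 * avg:              # act immediately on a spike
        print(f"spike detected: {reading} (window avg {avg:.1f})")
```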
Lambda Architecture:
● A data processing architecture that combines batch processing and stream
processing to handle large-scale data.
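A toy rendering of the idea: a batch view precomputed over the full master dataset, a speed layer holding deltas for events that arrived since the last batch run, and a query that merges the two. All names and numbers are illustrative.

```python
# Lambda idea in miniature: batch view + real-time deltas, merged at query time.
from collections import Counter

master_dataset = ["click", "view", "click"]   # all historical events
batch_view = Counter(master_dataset)          # batch layer: recomputed periodically

realtime_view = Counter()                     # speed layer: recent events only
for event in ["click", "view"]:               # events since the last batch run
    realtime_view[event] += 1

def query(key):
    # Serving layer: merge the precomputed view with real-time deltas.
    return batch_view[key] + realtime_view[key]

print(query("click"))  # 3 = 2 from batch + 1 real-time
```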
Edge Computing:
● Processing and analyzing data closer to the source (at the edge of the network)
rather than in a centralized data center.