
Big Data Analytics

Types of Digital Data


 Structured
 Semi-Structured
 Unstructured
Structured Data
 Structured data is one of the types of big data, characterized by its organized and systematic format.
 Structured data conforms to a clear framework, typically represented as tables with rows and columns.
 It is suitable for traditional database systems and facilitates efficient storage, retrieval, and analysis.

Examples:
 Tables in relational databases.
 Spreadsheets.
 Formatted dates and times, and information like account numbers.
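
To make the "tables, rows, and columns" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and its values are hypothetical, not from the slides.

import sqlite3

# Structured data: a fixed schema of typed columns (account_no, opened, balance)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_no TEXT, opened DATE, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('ACC-001', '2024-01-15', 2500.00)")
conn.execute("INSERT INTO accounts VALUES ('ACC-002', '2024-03-02', 190.50)")

# The organized format supports precise, rapid querying
for row in conn.execute("SELECT account_no, balance FROM accounts WHERE balance > 500"):
    print(row)   # ('ACC-001', 2500.0)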
Structured Data
Merits:
 The organized format helps to define data fields and establish relationships for efficient retrieval.
 Structured Query Language (SQL) enables precise and rapid querying, which accelerates data analysis.
 Promotes data consistency and accuracy while minimizing errors and discrepancies that could arise during data entry or processing.
 Seamless data migration between systems and platforms, allowing interoperability and integration across diverse applications.
 Quantitative analysis, statistical calculations, and aggregation are easier with structured data.

Limitations:
 Rigidity: The predefined structure can be limiting when dealing with complex, dynamic, or evolving data types.
 Data Loss: The structured approach might force oversimplification, leading to the omission of potentially valuable information and
overlooking fine grained detail.
 Scalability Challenges: As data volumes grow exponentially, maintaining structural integrity while scaling becomes increasingly challenging due to performance bottlenecks.
Semi-Structured Data
 Semi-structured data is one of the types of big data that represents a middle ground between the
structured and unstructured data categories.
 It combines elements of organization and flexibility, allowing for data to be partially structured while
accommodating variations in format and content.
 This type of data is often represented with tags, labels, or hierarchies, which provide a level of
organization without strict constraints.

Examples:
 XML Documents
 JSON Data
 NoSQL Databases
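
As a small illustration of "partially structured" data, the sketch below parses a hypothetical JSON document: the two records share tags (keys), but one record has fields and nesting the other lacks, and the parser tolerates this variation.

import json

raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi",
   "phones": ["+91-9800000000", "+91-9900000000"],
   "address": {"city": "Pune"}}
]
"""

for record in json.loads(raw):
    # Tags provide organization; a missing key is a variation, not an error
    print(record["name"], record.get("email", "no email on file"))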
Semi-Structured Data
Merits:
 Semi-structured data is flexible and can represent complex relationships and hierarchical structures. It can accommodate changes to data formats without requiring major alterations to the underlying processing systems.
 Semi-structured data can be stored in ways that optimize space utilization and retrieval efficiency.

Limitations:
 Data Integrity: The flexible nature of semi-structured data can lead to issues related to data integrity,
consistency, and validation.
 Query Complexity: Analyzing and querying semi-structured data might require more complex and
specialized techniques compared to structured data.
 Migration: Migrating or integrating semi-structured data across different systems can be challenging due to the lack of a fixed schema.
Unstructured Data
 Unstructured data is one of the types of big data that represents a
diverse and often unorganized collection of information.
 It lacks a consistent structure, making it more challenging to
organize and analyze.
 This data type encompasses a wide array of content, including
text, images, audio, video, and more, often originating from
sources like social media, emails, and multimedia platforms.
Example:
 Social media posts.
 Customer reviews and feedback, found on e-commerce platforms, review sites,
and surveys.
 Medical images, such as X-rays, MRIs, and CT scans, are examples of
unstructured data.
Unstructured Data
Merits:
 Unstructured data can capture more information and qualitative aspects that structured data might
overlook.
 The diverse nature of unstructured data mirrors real-world scenarios more closely, and can be valuable
for decision-making and trend analysis.
 Unstructured data fuels innovation in fields like natural language processing, image recognition, and
machine learning.
Limitations:
 Data Complexity: The lack of a predefined structure complicates data organization, storage, and retrieval.
 Data Noise: Unstructured data can include noise, irrelevant information, and outliers.
 Scalability: As unstructured data volumes grow, managing and processing this data becomes resource-intensive.
Introduction to Big Data
 Big Data: Extracting meaningful information by analyzing the huge amounts of complex, variously formatted data generated at high speed, which cannot be handled or processed by traditional systems.
Sources of Big Data
 Social Media: In today's world, a large share of the world's population is engaged with social media like Facebook, WhatsApp, Twitter, YouTube, and Instagram. Each activity on such media, like uploading a photo or video, sending a message, commenting, or liking, creates data.
 Sensors placed in various places: Sensors placed around a city gather data on temperature, humidity, etc. Cameras placed beside roads gather information about traffic conditions and create data. Security cameras placed in sensitive areas like airports, railway stations, and shopping malls create a lot of data.
 Customer Satisfaction Feedback: Customer feedback on the products or services of various companies, given on their websites, creates data. For example, retail commerce sites like Amazon, Walmart, Flipkart, and Myntra gather customer feedback on the quality of their products and delivery times. Telecom companies and other service providers seek out customer experiences with their services. All of these create a lot of data.
Sources of Big Data
 IoT Appliances: Electronic devices connected to the internet create data through their smart functionality; examples are smart TVs, smart washing machines, smart coffee machines, smart ACs, etc. This is machine-generated data created by sensors kept in various devices. For example, a smart printing machine is connected to the internet, and a number of such machines connected to a network can transfer data between each other. If anyone loads a file into one printing machine, the system stores the file content, and another printing machine on another floor or in another building can print a hard copy. Such transfers between printing machines generate data.
 E-commerce: The many records stored from e-commerce transactions, business transactions, banking, and the stock market are considered one of the sources of big data. Payments through credit cards, debit cards, or other electronic means are all recorded as data.
 Global Positioning System (GPS): GPS in vehicles helps monitor vehicle movement and shorten the path to a destination, cutting fuel and time consumption.
The 3 V's of Big Data

Volume
• The amount of data generated by an organization or individual.
• Google processes 20 Petabytes a day and receives 2,000,000 search queries per minute.

Variety
• The combination of structured, semi-structured, and unstructured data.
• Text, sensor data, audio and video streams, and log files.

Velocity
• The frequency at which data is captured and shared.
• Facebook processes 2.5 Petabytes and receives 34,722 likes per minute.
• eBay has 6.6 Petabytes of user data.
Why Big Data?

More data → more accurate analysis → more confidence in decision making → greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Big Data Analytics
 Big data analytics is the often complex process of examining big data to uncover information -- such as
hidden patterns, correlations, market trends and customer preferences -- that can help organizations
make informed business decisions.
 On a broad scale, data analytics technologies and techniques give organizations a way to analyze data
sets and gather new information. Business intelligence (BI) queries answer basic questions about
business operations and performance.
Big Data Analytics
 Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analysis powered by analytics systems.
 An example of big data analytics can be found in the healthcare industry, where millions of patient
records, medical claims, clinical results, care management records and other data must be collected,
aggregated, processed and analyzed.
 Big data analytics is used for accounting, decision-making, predictive analytics and many other
purposes. This data varies greatly in type, quality and accessibility, presenting significant challenges but
also offering tremendous benefits.
Big Data Analytics - Examples
Big Data is one of the most powerful innovations in almost every industry, playing a key role in planning future products and services. Approximately 98% of businesses are expected to be investing in Big Data by 2024. Within just a decade it has grown to such a level that it has entered almost every aspect of our lifestyle: shopping, transportation, healthcare, and routine choices.
90% of the world's data has been created in the last 2 years, and businesses are spending more than $180 billion a year on big data analysis.
Big Data Analytics - Examples
1. Example of Big Data in Health Care
 Predictive Analytics for Patient Care

A central use of big data analytics here is predictive modeling to foresee patient health trends. Mount Sinai Medical Center in New York, for instance, applies Big Data to predict patient admissions. By assessing past patient records, the hospital is able to plan appropriate staffing and ensure resources are used efficiently, which in turn results in improved patient care and efficient operations.
 Genomic Research

Genomics generates immense amounts of data. The 1000 Genomes Project, which aims to establish a
comprehensive resource on human genetic variation, relies on Big Data technologies to process and
analyze genetic information from numerous samples, paving the way for personalized medicine and
advanced genetic research.
Big Data Analytics - Examples
2. Example of Big Data in Finance
 Fraud Detection

Financial institutions implement Big Data to deal with fraud. One illustrative example is JPMorgan Chase, which uses advanced analytics to track transactions in real time. Algorithms analyze patterns and consistently flag fraudulent activity, which in turn reduces fraud cases by a large percentage and saves the institution millions of dollars every year.
 Risk Management

Big Data allows firms to obtain detailed information about market dynamics. Companies such as Goldman Sachs use Big Data analytics to evaluate market situations, anticipate stock movements, and better manage risk in portfolios, justifying investment decisions with more intelligence.
Big Data Analytics - Examples

3. Example of Big Data in Retail


 Personalized Shopping Experience

Big data is used by companies like Amazon and Walmart to provide each shopper with a more customized shopping experience. By examining customer choices and earlier purchases, these firms make product recommendations and offer bonuses, which in turn promote customer satisfaction and loyalty.
 Inventory Management

Walmart uses Big Data technology to track inventory successfully. By analyzing sales data, weather patterns, and local events, Walmart makes certain that the right products are in stock at the right time, preventing both overstock and stockouts and thereby improving the whole supply chain operation.
Big Data Analytics - Examples

4. Example of Big Data in Transportation and Logistics


 Route Optimization

A company like UPS, for instance, uses Big Data analytics to map out the most efficient delivery routes. The ORION system, developed by the company, analyzes various data sources such as traffic flow and weather patterns to establish the most efficient path for the fleet. This has led to substantial cost reduction and energy savings.
 Predictive Maintenance

Airlines like Delta apply Big Data to predictive maintenance for aircraft. By analyzing data from sensors on planes, these companies can predict component failures, which makes it possible to reduce downtime and enhance safety standards.
Big Data Analytics - Examples

5. Example of Big Data in Education


 Student Performance Monitoring

Institutions like Arizona State University use Big Data to monitor student performance and improve educational outcomes. The university analyzes data on student engagement, attendance, and academic results to identify at-risk students early, so it can offer them tailored interventions.
 Personalized Learning

Platforms such as Coursera and Khan Academy utilize Big Data to deliver personalized learning to students. By analyzing students' interaction data, these platforms adapt course materials to match individual students' learning speeds and preferences, improving the educational experience.
Big Data Analytics - Examples
6. Example of Big Data in Agriculture
 Precision Farming

The main use of Big Data in modern agriculture is precision farming. Organizations like John Deere gather data from sensors and GPS navigation systems to fine-tune planting, fertilization, and harvesting practices. This ensures top yields without overusing resources, in turn contributing to sustainable agriculture.
 Weather Prediction

Agriculturalists rely on Big Data for precise weather forecasting. Tools such as Climate FieldView use data resources and weather predictions to offer farmers precise forecasts that help them plan their planting and harvesting with accuracy.
Big Data Analytics - Examples
7. Example of Big Data in Entertainment
 Content Recommendation

Streaming services like Netflix and Spotify use Big Data to recommend content to their users. By analyzing
viewing and listening habits, these platforms can suggest movies, shows, and music that match user
preferences, enhancing the user experience and engagement.
 Audience Analysis

The film industry uses Big Data for audience analysis. Studios analyze social media, box office results, and
demographic data to predict the success of films and plan marketing strategies accordingly. For example,
Warner Bros. uses predictive analytics to determine the potential success of a movie based on data from
previous releases.
Big Data Analytics - Examples
Summary
 Using analytics to understand customer behavior in order to optimize the customer experience
 Predicting future trends in order to make better business decisions
 Improving marketing campaigns by understanding what works and what doesn't
 Increasing operational efficiency by understanding where bottlenecks are and how to fix them
 Detecting fraud and other forms of misuse sooner
Big Data Challenges
1. Scale
Storage is a major concern: systems must be able to scale rapidly and elastically to handle growing data.
2. Security
Most NoSQL big data platforms have poor security mechanisms when it comes to safeguarding big data.
3. Schema
Rigid schemas have no place; the technology should be able to fit the big data, not the other way around.
4. Continuous Availability
5. Consistency
6. Partition tolerant
7. Data Quality
What is Data Analysis?
 Data Analysis is the collection, transformation, and organization of data to draw conclusions, make predictions about the future, and make informed, data-driven decisions. The professional who performs data analysis is called a Data Analyst.
 There is a huge demand for Data Analysts as data is expanding rapidly nowadays. Data Analysis is used to find possible solutions to business problems. An advantage of being a Data Analyst is the ability to work in any field they love: healthcare, agriculture, IT, finance, or business. Data-driven decision-making is an important part of Data Analysis and makes the analysis process much easier. There are six steps in the Data Analysis process.
Steps for Data analysis Process
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data
Steps for Data analysis Process
1. Define the Problem or Research Question
In the first step of the process, the data analyst is given a problem or business task. The analyst has to understand the task and the stakeholders' expectations for the solution. A stakeholder is a person who has invested money and resources in the project.
The analyst must be able to ask different questions in order to find the right solution to the problem, and has to find the root cause of the problem in order to fully understand it. The analyst must make sure to avoid distractions while analyzing the problem, and communicate effectively with stakeholders and other colleagues to completely understand what the underlying problem is. Questions to ask yourself in the Ask phase are:
What are the problems that are being mentioned by my stakeholders?
What are their expectations for the solutions?
Steps for Data analysis Process
2. Collect Data
The second step is to Prepare or Collect the Data. This step includes collecting data and storing it for further analysis. The analyst has to collect data relevant to the given task from multiple sources, internal or external.
Internal data is the data available in the organization that you work for while external data is the data
available in sources other than your organization. The data that is collected by an individual from their
own resources is called first-party data.
The data that is collected and sold is called second-party data. Data that is collected from outside sources
is called third-party data. The common sources from where the data is collected are Interviews, Surveys,
Feedback, Questionnaires. The collected data can be stored in a spreadsheet or SQL database.
Steps for Data analysis Process
3. Data Cleaning
The third step is Clean and Process Data. After the data is collected from multiple sources, it is time
to clean the data. Clean data means data that is free from misspellings, redundancies, and irrelevance.
Clean data largely depends on data integrity.
There might be duplicate data, or the data might not be in a consistent format; such unnecessary data is removed and cleaned. There are different functions provided by SQL and Excel to clean the data. This is one of the most important steps in Data Analysis, as clean and formatted data helps in finding trends and solutions.
The most important part of the Process phase is to check whether your data is biased or not. Bias is an act of favoring a particular group or community while ignoring the rest. Bias must be avoided, as it might distort the overall data analysis. The data analyst must make sure to include every group while the data is being collected.
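
As an illustration of this step, here is a minimal pandas sketch (the product column and its values are hypothetical) that removes duplicates and fixes inconsistent formatting:

import pandas as pd

df = pd.DataFrame({
    "product": ["Laptop", "laptop ", "Phone", "Phone"],
    "price":   [55000, 55000, 20000, 20000],
})

# Normalize formatting so "Laptop" and "laptop " are treated as the same value
df["product"] = df["product"].str.strip().str.title()

# Remove redundant rows
df = df.drop_duplicates()
print(df)   # two unique rows remain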
Steps for Data analysis Process
4. Analyzing the Data
The fourth step is to Analyze. The cleaned data is used to identify trends and relationships. This step also involves performing calculations and combining data for better results.
The tools used for performing calculations are Excel or SQL. These tools provide in-built functions to
perform calculations or sample code is written in SQL to perform calculations.
Using Excel, we can create pivot tables and perform calculations while SQL creates temporary tables to
perform calculations. Programming languages are another way of solving problems. They make it much
easier to solve problems by providing packages. The most widely used programming languages for data
analysis are R and Python.
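
For example, the pivot-table idea mentioned above can be sketched in pandas; the store/month/sales columns are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [410, 380, 520, 495],
})

# Pivot: one row per store, one column per month, values aggregated by sum
pivot = df.pivot_table(values="sales", index="store", columns="month", aggfunc="sum")
print(pivot)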
Steps for Data analysis Process
5. Data Visualization
The fifth step is visualizing the data. Nothing is more compelling than a visualization. The data now
transformed has to be made into a visual (chart, graph). The reason for making data visualizations is that
there might be people, mostly stakeholders that are non-technical.
Visualizations are made for a simple understanding of complex data. Tableau and Looker are the two
popular tools used for compelling data visualizations. Tableau is a simple drag and drop tool that helps in
creating compelling visualizations.
Looker is a data visualization tool that connects directly to the database and creates visualizations. Tableau and Looker are both widely used by data analysts for creating visualizations. R and Python also have packages that provide beautiful data visualizations.
R has a package named ggplot2 which offers a variety of data visualizations. A presentation is then given based on the data findings. Sharing the insights with team members and stakeholders will help in making better decisions.
Steps for Data analysis Process
6. Presenting the Data
Presenting the data involves transforming raw information into a format that is easily comprehensible and
meaningful for various stakeholders. This process encompasses the creation of visual representations,
such as charts, graphs, and tables, to effectively communicate patterns, trends, and insights gleaned from
the data analysis.
The goal is to facilitate a clear understanding of complex information, making it accessible to both
technical and non-technical audiences. Effective data presentation involves thoughtful selection of
visualization techniques based on the nature of the data and the specific message intended. It goes
beyond mere display to storytelling, where the presenter interprets the findings, emphasizes key points,
and guides the audience through the narrative that the data unfolds. Whether through reports,
presentations, or interactive dashboards, the art of presenting data involves balancing simplicity with
depth, ensuring that the audience can easily grasp the significance of the information presented and use it
for informed decision-making.
Analytical Models
The four main analytical models organizations can deploy are:
 Descriptive
 Diagnostic
 Predictive
 Prescriptive.
Analytical Models
Descriptive analytics
Descriptive analytics answer the question: What happened?
This is the most common type of analytics found in business. It generally uses historical data from a single
internal source to pinpoint when an event occurred.
For example:
 How many sales did we make in the last week/day/hour?
 Which customers required the most help from our customer service team?
 How many people viewed our website?
 Which product had the most defects?

Descriptive analytics are often displayed on dashboards and in reports, which are convenient ways to consume data and inform decisions. Descriptive analytics account for most of the statistics we use, including basic aggregation (e.g. counts or sums of values filtered from a column), averages, and percentages.
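
A brief sketch of such basic aggregation in pandas, assuming a hypothetical orders table:

import pandas as pd

orders = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [120.0, 80.0, 200.0, 150.0, 90.0],
})

# "What happened?": counts, sums, and averages over historical records
print(orders.groupby("region")["amount"].agg(["count", "sum", "mean"]))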
Analytical Models
Diagnostic analytics
Diagnostic analytics help us to answer the next question: Why did it happen?
To do this, analysts dive deeper into an organization's historical data, combining multiple sources in search of
patterns, trends, and correlations.
Why would you use diagnostic analytics?
 Identify anomalies: Analysts use the results from descriptive analysis to identify areas that need further
investigation and raise questions that can’t be answered by simply looking at the data. For example: Why
have sales increased in a region that had no change in marketing?
 Drill down into data: To explain anomalies, analysts must find patterns outside existing data sets to identify
correlations. They might need to use techniques such as data mining, and use data from external sources.
 Determine causal relationships: Having identified anomalies and searched for patterns that could be
correlated, analysts use more advanced statistical techniques to determine whether these are related.
Analytical Models
Predictive analytics
As an organization increases its analytical maturity and embarks on predictive analytics, it shifts its focus from understanding
historical events to creating insights about a current or future state. Predictive analytics is at the intersection of classical statistical
analysis and modern artificial intelligence (AI) techniques. It tries to answer the question: What will happen next?
It’s impossible to predict exactly what will happen in the future, but by employing predictive analytics, organizations identify the
likelihood of possible outcomes and can increase the chance of taking the best course of action. We see predictive analytics used in
many sectors.
For example:
 Aerospace – Predictive analytics are used to predict the effect of specific maintenance operations on aircraft reliability, fuel use,
availability, and uptime.
 Financial services – Predictive analytics are used to develop credit-risk models and forecast financial market trends.
 Manufacturing – Predictive analytics are used to predict the location and rate of machine failures, and to optimise ordering and
delivery of raw materials based on projected future demands.
 Online retail – Systems monitor customer behavior, and predictive models determine whether providing additional product recommendations will lead to a purchase.
Analytical Models
Prescriptive analytics
Prescriptive analytics is the most complex type of analytics. It combines internal data, external sources, and machine-
learning techniques to provide the most effective outcomes. In prescriptive analytics, a decision-making process is
applied to descriptive and predictive models to find the combinations of existing conditions and possible decisions that
are likely to have the most effect in the future. This process is both complex and resource intensive but, when done well,
can provide immense value to an organization.
Applications of prescriptive analytics include:
 risk management[2]
 improving healthcare[3]
 guided marketing, selling and pricing[4].

As the most complex form of analytics, prescriptive analytics not only pose technical challenges, but are also influenced by external factors such as government regulation, market risk, and existing organizational behavior. If you are considering deploying prescriptive analytics, be sure you have a solid business case that identifies why machine-driven decision-making is right for your organization.
The difference between Traditional data and Big data
Exploratory Statistical Analysis
1. Understand Data Distribution:
 • Identify patterns, trends, and relationships within the data.
 • Assess data variability and distribution.
2. Identify Data Anomalies:
 • Spot outliers, missing values, and inconsistent entries.
3. Hypothesize Relationships:
 • Test assumptions and generate hypotheses about variable interactions.
4. Reduce Complexity:
 • Summarize large datasets into manageable insights.
Exploratory Statistical Analysis
 Statistical Methods in ESA
1. Descriptive Statistics
• Measures of Central Tendency:
• Mean: Average value.
• Median: Middle value.
• Mode: Most frequent value.
Consider a scenario where a retail company is analyzing the daily sales (in dollars) of a product across different stores over a
month. The dataset (in dollars) is as follows:
200, 250, 300, 250, 500, 250, 800, 1000, 250, 300
 1. Mean (Average)
 The mean provides the average sales per day.

Mean = (200 + 250 + 300 + 250 + 500 + 250 + 800 + 1000 + 250 + 300) / 10 = 4100 / 10 = 410

So, the mean daily sales are $410.
Exploratory Statistical Analysis
2. Median
The median is the middle value when the data is sorted in ascending order. It represents the "center" of
the data distribution.
Step 1: Sort the data:
[200,250,250,250,250,300,300,500,800,1000]

Step 2: Find the middle value(s):
• If the dataset size is odd, the median is the middle value.
• If the dataset size is even, the median is the average of the two middle values.
 Here, the dataset has 10 values (even). The two middle values are the 5th and 6th values, 250 and 300:

Median = (250 + 300) / 2 = 275

So, the median daily sales are $275.
Exploratory Statistical Analysis
 3. Mode
 The mode is the value that appears most frequently in the dataset.

In the current dataset:


[200,250,300,250,500,250,800,1000,250,300]
250 occurs 4 times, more than any other value.
 So, the mode is $250.
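
A minimal Python sketch, using only the standard-library statistics module, reproduces all three values for the sales dataset above:

import statistics

sales = [200, 250, 300, 250, 500, 250, 800, 1000, 250, 300]

print(statistics.mean(sales))    # 410
print(statistics.median(sales))  # 275.0 (average of 250 and 300)
print(statistics.mode(sales))    # 250 (occurs 4 times)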
Exploratory Statistical Analysis
 Statistical Methods in ESA

1. Descriptive Statistics
• Measures of Dispersion:
• Range: Difference between maximum and minimum.
• Variance: Spread of data points from the mean.
• Standard Deviation: Average distance from the mean.

• Shape and Distribution:


• Skewness: Asymmetry of the data distribution.
• Kurtosis: Sharpness of the peak of the distribution.
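
As a short illustration on the same sales data, here is a NumPy/SciPy sketch (scipy.stats provides skew and kurtosis; treating the data as a sample, with ddof=1, is an assumption of the example):

import numpy as np
from scipy.stats import skew, kurtosis

sales = np.array([200, 250, 300, 250, 500, 250, 800, 1000, 250, 300])

print(sales.max() - sales.min())  # range: 800
print(sales.var(ddof=1))          # sample variance
print(sales.std(ddof=1))          # sample standard deviation (~274.7)
print(skew(sales))                # positive: a few large sales pull the tail right
print(kurtosis(sales))            # excess kurtosis: peakedness relative to normal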
Exploratory Statistical Analysis
 Statistical Methods in ESA
 2. Inferential Statistics
• Confidence intervals to estimate parameters.
• Hypothesis testing (e.g., t-tests, chi-square tests).
• p-values to determine statistical significance.
 3. Correlation and Covariance
• Measure relationships between variables:
• Pearson correlation: Linear relationships.
• Spearman correlation: Non-linear relationships.
• Covariance: Direction of the relationship between variables.
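
A compact sketch of these ideas with SciPy, on hypothetical data (the variables x and y and their relationship are invented for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)        # hypothetical metric
y = 2 * x + rng.normal(0, 5, 200)  # a second variable constructed to depend on x

# Inferential statistics: t-test comparing two halves of the same sample;
# a large p-value (no significant difference) is expected here
t_stat, p_value = stats.ttest_ind(x[:100], x[100:])
print(t_stat, p_value)

# Correlation and covariance
print(stats.pearsonr(x, y))   # strong positive linear relationship
print(stats.spearmanr(x, y))  # monotonic (rank-based) relationship
print(np.cov(x, y))           # 2x2 covariance matrix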
Exploratory Statistical Analysis
 Statistical Methods in ESA
 4. Outlier Detection
• Statistical techniques like Z-scores, IQR (Interquartile Range), or robust statistical measures (e.g., Tukey's fences); a short sketch follows this list.
 5. Dimensionality Reduction
• Principal Component Analysis (PCA): Reduces dimensionality while retaining significant variance.
• Singular Value Decomposition (SVD): Useful for sparse or large matrices.
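
Here is the promised sketch of the Z-score and IQR approaches, reusing the sales data from earlier; the thresholds of 3 and 1.5 are conventional choices, not prescribed by the slides:

import numpy as np

sales = np.array([200, 250, 300, 250, 500, 250, 800, 1000, 250, 300])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (sales - sales.mean()) / sales.std(ddof=1)
print(sales[np.abs(z) > 3])   # empty here; no point is that extreme in this small sample

# IQR method (Tukey's fences): flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
print(sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)])   # [800 1000]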
Exploratory Statistical Analysis
 Visualization in ESA
 Visualization complements statistical methods by making results interpretable and accessible. Popular techniques include:

• Univariate Analysis:
• Histograms
• Boxplots

• Bivariate and Multivariate Analysis:


• Scatterplots
• Pairplots
• Heatmaps for correlation matrices

• Time Series Analysis:


• Line graphs for trends
• Autocorrelation plots
 Visualization tools like Python’s Matplotlib, Seaborn, or big data-specific tools like Apache Zeppelin can handle large
datasets effectively.
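
For instance, a brief Matplotlib/Seaborn sketch covering one plot from each group; the DataFrame and its columns are hypothetical:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90),
    "sales": rng.normal(400, 120, 90),
    "visits": rng.normal(1000, 200, 90),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(df["sales"], bins=15)          # univariate: histogram
axes[0, 1].boxplot(df["sales"])                # univariate: boxplot
axes[1, 0].scatter(df["visits"], df["sales"])  # bivariate: scatterplot
axes[1, 1].plot(df["date"], df["sales"])       # time series: line graph
plt.tight_layout()
plt.show()

sns.heatmap(df[["sales", "visits"]].corr(), annot=True)  # correlation heatmap
plt.show()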
Exploratory Statistical Analysis
 Tools for ESA in Big Data
1. Python Libraries:
 • Pandas: Data manipulation and summary statistics.
 • NumPy: Mathematical operations and arrays.
 • SciPy: Statistical methods and hypothesis testing.
 • Statsmodels: Advanced statistical modeling.
2. R Programming:
 • Comprehensive statistical functions and visualization.
3. Big Data Frameworks:
 • Apache Spark MLlib: Scalable statistical analysis.
 • Hadoop with Hive or Pig: Aggregation and summarization.
4. SQL-based Solutions
Missing Values
Missing values are common in big data analytics due to the diverse sources, formats, and collection methods
of data. Handling missing values is crucial to ensure the accuracy and reliability of analysis and models.
 Types of Missing Data
1. Missing Completely at Random (MCAR):
 • Missing values occur entirely by chance and are independent of other variables.
 • Example: A sensor malfunction causes random data points to be absent.
2. Missing at Random (MAR):
 • Missing values are related to other observed variables but not the missing variable itself.
 • Example: Customers with higher incomes are less likely to disclose their income.
3. Missing Not at Random (MNAR):
 • Missing values depend on the unobserved data itself.
 • Example: Patients with more severe conditions are less likely to report their health status.
Best Practices for Handling Missing Values
 Understand the Context:
Identify why data is missing and classify the type (MCAR, MAR, MNAR).
 Assess the Impact:
Determine how missing values influence the analysis or model.
 Use Appropriate Methods:
Select strategies that align with the dataset size, complexity, and analysis goals.
 Automate the Process:
Use scalable tools and frameworks to handle missing data in real-time or batch processing.
 Document Decisions:
Keep track of methods and assumptions used for handling missing values.
Handling Missing Values - Example
A retail company is analyzing customer data with missing values:
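
A minimal pandas sketch with a small hypothetical customer DataFrame shows common strategies: counting gaps, mean/median imputation for numeric columns, mode imputation for categorical ones, and dropping sparse rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age":     [34, np.nan, 45, 29, np.nan],
    "income":  [52000, 61000, np.nan, 48000, 75000],
    "segment": ["gold", None, "silver", "gold", "silver"],
})

print(df.isna().sum())  # missing-value count per column

df["age"] = df["age"].fillna(df["age"].mean())                  # numeric: mean
df["income"] = df["income"].fillna(df["income"].median())       # numeric: median
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])   # categorical: mode

df = df.dropna(thresh=3)  # alternative: drop rows with fewer than 3 known values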
Outliers Detection and Treatment
Outliers are data points that significantly deviate from the majority of the data. They can be caused by
data entry errors, measurement errors, or legitimate but rare phenomena. In big data analytics, outlier
detection and treatment are critical for improving data quality and ensuring the robustness of models.
Types of Outliers
1. Univariate Outliers: Outliers in a single variable.
 • Example: An unusually high age in a customer dataset.
2. Multivariate Outliers: Outliers in the relationship between multiple variables.
 • Example: A customer with low income but extraordinarily high spending.
3. Contextual Outliers: Outliers specific to the context, such as time or location.
 • Example: Sudden temperature spikes during a stable weather period.
Outliers Detection and Treatment
Treatment of Outliers
1. Remove Outliers:
 • Remove data points if they are errors or irrelevant for analysis.
 • Example: Discard sensor readings exceeding physical limits (e.g., negative temperatures).
2. Transform Data:
 • Apply transformations to reduce the impact of outliers:
 • Logarithmic or square root transformations for right-skewed data.
 • Winsorization: Cap extreme values at a specified percentile.
3. Impute Outliers:
 • Replace outliers with statistically estimated values.
 • Methods: replace with the mean, median, or mode; interpolation for time-series data.
4. Analyze Separately:
 • Retain and analyze outliers as separate cases if they represent meaningful anomalies (e.g., fraud detection, extreme customer behavior).
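
A minimal sketch of two of these treatments on the sales data used earlier, assuming SciPy is available for winsorization:

import numpy as np
from scipy.stats.mstats import winsorize

sales = np.array([200, 250, 300, 250, 500, 250, 800, 1000, 250, 300])

# Transform: a log transformation compresses the long right tail
log_sales = np.log(sales)

# Winsorize: cap the lowest and highest 10% of values at the nearest remaining value
capped = winsorize(sales, limits=[0.1, 0.1])
print(capped)   # 200 -> 250 and 1000 -> 800: extremes pulled inward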
Outliers Detection and Treatment
Visualization Methods
• Box Plots: Display the spread of data and identify outliers visually.
• Scatter Plots: Useful for bivariate relationships.
• Heatmaps: Identify multivariate anomalies using correlation matrices.

Machine Learning Methods


 Clustering:
• Use clustering algorithms (e.g., K-Means, DBSCAN) to identify points that do not belong to any cluster.
 Isolation Forest:
• A tree-based approach to isolate anomalies by randomly partitioning data.
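
As one concrete instance of these machine learning methods, here is a short scikit-learn sketch using Isolation Forest; the two-feature dataset (income vs. spending) is hypothetical, with two anomalies injected deliberately:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(loc=[50, 50], scale=5, size=(200, 2))   # normal customers
X = np.vstack([X, [[20, 95], [90, 5]]])                # injected anomalies

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)   # -1 = anomaly, 1 = normal
print(X[labels == -1])          # the injected points should surface here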
Standardizing Data Labels and Categorization
Standardizing data labels and categorization is an essential step in big data analytics to ensure consistency,
accuracy, and interoperability across large and diverse datasets. It helps in preparing the data for effective
analysis, visualization, and machine learning applications.
Importance of Standardization
1. Consistency: Ensures that data from multiple sources uses uniform naming and classification conventions.
2. Interoperability: Enables data integration across platforms, teams, or organizations.
3. Accuracy: Reduces ambiguity and errors caused by inconsistent labels or categories.
4. Scalability: Facilitates automated processing in large datasets.
5. Improved Analysis: Allows efficient grouping, aggregation, and summarization of data.
Standardizing Data Labels and Categorization
 Normalize Labels
 • Case Normalization: Convert labels to lowercase or uppercase.
   Example: "Category", "CATEGORY", and "category" → "category".
 • Whitespace Trimming: Remove leading/trailing spaces.
 • Remove Special Characters: Replace or remove non-alphanumeric characters.
 Handle Synonyms and Variants
 • Map synonyms or variations to a single, standardized label.
   Example: "Male", "M", "man" → "Male".
 Use Codebooks or Ontologies
 • Develop or adopt standard taxonomies, dictionaries, or ontologies to define categories and their relationships.
   Example: Use ICD codes for medical diagnoses or NAICS for industry classification.
 Categorize Data
 • Group continuous variables into discrete categories, if applicable.
   Example: Age groups: [0-18]: Child, [19-60]: Adult, >60: Senior.
 • Assign multi-level hierarchies for complex datasets.
   Example: Geography: Country → State → City.
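
A minimal pandas sketch combining several of these steps (label normalization, synonym mapping, and age binning); the column names and values are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "gender":   [" Male", "M", "man", "FEMALE", "f "],
    "category": ["Category", "CATEGORY", "category", "Cat-A", "cat a"],
    "age":      [12, 25, 67, 40, 8],
})

# Normalize labels: case, whitespace, special characters
df["category"] = (df["category"].str.lower()
                                .str.strip()
                                .str.replace(r"[^a-z0-9 ]", " ", regex=True))

# Map synonyms and variants to a single standardized label
gender_map = {"male": "Male", "m": "Male", "man": "Male",
              "female": "Female", "f": "Female"}
df["gender"] = df["gender"].str.lower().str.strip().map(gender_map)

# Group a continuous variable into the discrete categories from the slide
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 60, 120],
                         labels=["Child", "Adult", "Senior"])
print(df)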
