Big Data Analytics
Examples:
Tables in relational databases.
Spreadsheets.
Formatted dates or times, and information such as account numbers.
Structured Data
Merits:
The organized format helps to define data fields and establish relationships for efficient retrieval.
Structured Query Language (SQL) enables precise and rapid querying, which accelerates data analysis (see the sketch after this list).
Promotes data consistency and accuracy while minimizing errors and discrepancies that could arise during data entry or processing.
Seamless data migration between systems and platforms, allowing interoperability and integration across diverse applications.
Quantitative analysis, statistical calculations, and aggregation are easier with structured data.
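To make the SQL merit above concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, columns, and values are invented purely for illustration.

```python
import sqlite3

# Minimal in-memory example of structured data: a fixed schema plus SQL querying.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_no TEXT, owner TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("ACC-001", "Asha", 1200.50), ("ACC-002", "Ravi", 80.00), ("ACC-003", "Mei", 560.75)],
)

# Precise, rapid querying: aggregate over a well-defined column.
for row in conn.execute("SELECT COUNT(*), AVG(balance) FROM accounts WHERE balance > 100"):
    print(row)   # (2, 880.625)
conn.close()
```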
Limitations:
Rigidity: The predefined structure can be limiting when dealing with complex, dynamic, or evolving data types.
Data Loss: The structured approach might force oversimplification, leading to the omission of potentially valuable information and overlooking fine-grained detail.
Scalability Challenges: As data volumes grow exponentially, maintaining structural integrity while scaling becomes increasingly challenging due to performance bottlenecks.
Semi-Structured Data
Semi-structured data is one of the types of big data that represents a middle ground between the
structured and unstructured data categories.
It combines elements of organization and flexibility, allowing for data to be partially structured while
accommodating variations in format and content.
This type of data is often represented with tags, labels, or hierarchies, which provide a level of
organization without strict constraints.
Examples:
XML Documents
JSON Data
NoSQL Databases
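As a concrete illustration, the snippet below parses a hypothetical JSON order record with Python's standard json module: the keys and nesting provide organization, while individual records can still vary in the fields they carry.

```python
import json

# A hypothetical JSON record: partially structured (named fields, nesting),
# but individual records may add or omit fields without breaking the format.
raw = '''
{
  "order_id": "A-1001",
  "customer": {"name": "Priya", "loyalty_tier": "gold"},
  "items": [
    {"sku": "X-1", "qty": 2},
    {"sku": "Y-9", "qty": 1, "gift_wrap": true}
  ]
}
'''
order = json.loads(raw)

# Tags/keys give enough organization to navigate the hierarchy...
print(order["customer"]["name"])          # Priya
# ...while the schema stays flexible: not every item has "gift_wrap".
print([item.get("gift_wrap", False) for item in order["items"]])   # [False, True]
```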
Semi-Structured Data
Merits:
Semi-structured data is flexible and can represent complex relationships and hierarchical structures. It can accommodate changes to data formats without requiring major alterations to the underlying processing systems.
Semi-structured data can be stored in ways that optimize space utilization and retrieval efficiency.
Limitations:
Data Integrity: The flexible nature of semi-structured data can lead to issues related to data integrity,
consistency, and validation.
Query Complexity: Analyzing and querying semi-structured data might require more complex and
specialized techniques compared to structured data.
Migration: Migrating or integrating semi-structured data across different systems can be challenging due to variations in structure and format.
Unstructured Data
Unstructured data is one of the types of big data that represents a
diverse and often unorganized collection of information.
It lacks a consistent structure, making it more challenging to
organize and analyze.
This data type encompasses a wide array of content, including
text, images, audio, video, and more, often originating from
sources like social media, emails, and multimedia platforms.
Example:
Social media posts.
Customer reviews and feedback, found on e-commerce platforms, review sites,
and surveys.
Medical images, such as X-rays, MRIs, and CT scans, are examples of
unstructured data.
Unstructured Data
Merits:
Unstructured data can capture more information and qualitative aspects that structured data might
overlook.
The diverse nature of unstructured data mirrors real-world scenarios more closely, and can be valuable
for decision-making and trend analysis.
Unstructured data fuels innovation in fields like natural language processing, image recognition, and
machine learning.
Limitations:
Data Complexity: The lack of a predefined structure complicates data organization, storage, and retrieval.
Data Noise: Unstructured data can include noise, irrelevant information, and outliers.
Scalability: As unstructured data volumes grow, managing and processing this data becomes resource-intensive.
Introduction to Big Data
Big Data: This term refers to extracting meaningful insights by analyzing the huge volumes of complex, variously formatted data generated at high speed, which cannot be handled or processed by traditional systems.
Sources of Big Data
Social Media: In today's world, a large share of the population is engaged with social media platforms like Facebook, WhatsApp, Twitter, YouTube, and Instagram. Each activity on these platforms, such as uploading a photo or video, sending a message, commenting, or liking, creates data.
Sensors placed in various locations: Sensors placed across a city gather data on temperature, humidity, etc. Cameras placed beside roads gather information about traffic conditions and create data. Security cameras placed in sensitive areas like airports, railway stations, and shopping malls create a lot of data.
Customer Satisfaction Feedback: Customer feedback on the products or services of various companies, collected on their websites, creates data. For example, retail sites like Amazon, Walmart, Flipkart, and Myntra gather customer feedback on product quality and delivery time. Telecom companies and other service providers seek customers' experience with their service. All of this creates a lot of data.
Sources of Big Data
IoT Appliances: Electronic devices connected to the internet, such as smart TVs, smart washing machines, smart coffee machines, and smart ACs, create data to support their smart functionality. This is machine-generated data created by sensors embedded in various devices. For example, a smart printing machine is connected to the internet, and a number of such machines connected to a network can transfer data among each other. If someone loads a file on one printing machine, the system stores the file's content, and another printing machine on a different floor or in another building can print a hard copy of that file. Such data transfer between printing machines generates data.
E-commerce: In e-commerce transactions, business transactions, banking, and the stock market, the large volumes of records stored are considered one of the sources of big data. Payments through credit cards, debit cards, or other electronic means are all recorded as data.
Global Positioning System (GPS): GPS in a vehicle helps monitor its movement and shorten the route to a destination, cutting fuel and time consumption.
The 3V’s of Big Data
More data → more accurate analysis → more confidence in decision making → greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Big Data Analytics
Big data analytics is the often complex process of examining big data to uncover information -- such as
hidden patterns, correlations, market trends and customer preferences -- that can help organizations
make informed business decisions.
On a broad scale, data analytics technologies and techniques give organizations a way to analyze data
sets and gather new information. Business intelligence (BI) queries answer basic questions about
business operations and performance.
Big Data Analytics
Big data analytics is a form of advanced analytics, which involves complex applications with elements
such as predictive models, statistical algorithms and what-if analysis powered by analytics systems.
An example of big data analytics can be found in the healthcare industry, where millions of patient
records, medical claims, clinical results, care management records and other data must be collected,
aggregated, processed and analyzed.
Big data analytics is used for accounting, decision-making, predictive analytics and many other
purposes. This data varies greatly in type, quality and accessibility, presenting significant challenges but
also offering tremendous benefits.
Big Data Analytics - Examples
Big Data is one of the most powerful innovations in almost every industry. It plays a key role in planning future products, services, and more. Approximately 98% of businesses are expected to be investing in Big Data by 2024. Within just a decade it has grown to such a level that it has entered almost every aspect of our lifestyle, like shopping, transportation, healthcare, and routine choices.
90% of the world's data has been created in the last 2 years, and businesses are spending more than $180 billion a year on big data analysis.
Big Data Analytics - Examples
Big Data in Health Care
Predictive Analytics for Patient Care
A central use of big data analytics here is predictive modeling to foresee patient health trends. Mount Sinai Medical Center in New York, for example, applies Big Data to predict patient admissions. By assessing past patient records, the hospital can plan appropriate staffing and ensure resources are used efficiently, which in turn results in improved patient care and efficient operations.
Genomic Research
Genomics generates immense amounts of data. The 1000 Genomes Project, which aims to establish a
comprehensive resource on human genetic variation, relies on Big Data technologies to process and
analyze genetic information from numerous samples, paving the way for personalized medicine and
advanced genetic research.
Big Data Analytics - Examples
2. Example of Big Data in Finance
Fraud Detection
Financial institutions implement Big Data to deal with fraud. One illustrative example is JPMorgan Chase, which uses advanced analytics to track transactions in real time. Algorithms analyze patterns and consistently flag fraudulent activity, which reduces fraud cases by a large percentage and saves the institution millions of dollars every year.
Risk Management
Big Data allows firms to obtain detailed information about market dynamics. Companies such as Goldman Sachs use Big Data analytics to evaluate market conditions, anticipate stock movements, and better manage risk in portfolios, making investment decisions more intelligent.
Big Data Analytics - Examples
3. Example of Big Data in Retail
Personalized Shopping
Big data is used by companies like Amazon and Walmart to provide each shopper with a more customized shopping experience. By examining customer choices and earlier purchases, these firms offer recommendations and bonuses, which in turn promote customer satisfaction and loyalty.
Inventory Management
Walmart uses Big Data technology for successfully tracking inventory. The info which Walmart collects by
analyzing sales data, weather patterns, and local events makes certain that the right products are in stock
at the right time, and prevents any overstock and stockouts, thereby improving the whole supply chain
operation.
Big Data Analytics - Examples
4. Example of Big Data in Logistics
Route Optimization
A company like UPS, for instance, uses Big Data analytics to map out the most efficient delivery routes. Its ORION system analyzes various data sources, such as traffic flow and weather patterns, to establish the most suitable path for the fleet. This has led to substantial cost reductions and energy savings.
Predictive Maintenance
Airlines like Delta apply Big Data to predictive aircraft maintenance. By analyzing data from sensors on planes, these companies can predict component failures, reducing downtime and enhancing safety standards.
Big Data Analytics - Examples
5. Example of Big Data in Education
Student Performance Monitoring
Institutions like Arizona State University also use Big Data to monitor student performance and improve educational outcomes. The university analyzes data on student engagement, attendance, and academic results to identify struggling students early, so that it can offer them tailored interventions.
Personalized Learning
Platforms such as Coursera and Khan Academy utilize Big Data to deliver personalized learning to students. By analyzing students' interaction data, these platforms adapt course materials to individual students' learning speeds and preferences, improving their educational experience.
Big Data Analytics - Examples
6. Example of Big Data in Agriculture
Precision Farming
The main use of Big Data in modern agriculture is precision farming. Organizations like John Deere gather data from numerous sensors and GPS navigation systems to fine-tune planting, fertilization, and harvesting practices. This ensures top yields without overusing resources, in turn contributing to sustainable agriculture.
Weather Prediction
Agriculturalists rely on Big Data for precise weather forecasting. Tools such as Climate FieldView use data
resources and weather predictions to offer farmers precision forecasts that help them plan their planting
and harvesting with accuracy.
Big Data Analytics - Examples
7. Example of Big Data in Entertainment
Content Recommendation
Streaming services like Netflix and Spotify use Big Data to recommend content to their users. By analyzing
viewing and listening habits, these platforms can suggest movies, shows, and music that match user
preferences, enhancing the user experience and engagement.
Audience Analysis
The film industry uses Big Data for audience analysis. Studios analyze social media, box office results, and
demographic data to predict the success of films and plan marketing strategies accordingly. For example,
Warner Bros. uses predictive analytics to determine the potential success of a movie based on data from
previous releases.
Big Data Analytics - Examples
Summary
Using analytics to understand customer behavior in order to optimize the customer experience
Predicting future trends in order to make better business decisions
Improving marketing campaigns by understanding what works and what doesn't
Increasing operational efficiency by understanding where bottlenecks are and how to fix them
Detecting fraud and other forms of misuse sooner
Big Data Challenges
1. Scale
Storage is one major concern that needs to be addressed to handle the need for scaling rapidly and
elastically.
2. Security
Most NoSQL big data platforms have poor security mechanisms when it comes to safeguarding big data.
3. Schema
Rigid schemas have no place; the technology should be able to fit the big data, not the other way around.
4. Continuous Availability
5. Consistency
6. Partition tolerance
7. Data Quality
What is Data Analysis?
The collection, transformation, and organization of data to draw conclusions, make predictions for the future, and make informed data-driven decisions is called Data Analysis. The professional who handles data analysis is called a Data Analyst.
There is a huge demand for Data Analysts as data is expanding rapidly nowadays. Data Analysis is used to find possible solutions to a business problem. The advantage of being a Data Analyst is that they can work in any field they love: healthcare, agriculture, IT, finance, or business. Data-driven decision-making is an important part of Data Analysis, and it makes the analysis process much easier. There are six steps for Data Analysis.
Steps for Data analysis Process
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data
Steps for Data analysis Process
1. Define the Problem or Research Question
In the first step of the process, the data analyst is given a problem or business task. The analyst has to understand the task and the stakeholders' expectations for the solution. A stakeholder is a person who has invested their money and resources in a project.
The analyst must be able to ask different questions in order to find the right solution to their problem. The
analyst has to find the root cause of the problem in order to fully understand the problem. The analyst
must make sure that he/she doesn’t have any distractions while analyzing the problem. Communicate
effectively with the stakeholders and other colleagues to completely understand what the underlying
problem is. Questions to ask yourself for the Ask phase are:
What are the problems that are being mentioned by my stakeholders?
What are their expectations for the solutions?
Steps for Data analysis Process
2. Collect Data
The second step is to Prepare or Collect the Data. This step includes collecting data and storing it for further analysis. The analyst has to collect the data needed for the given task from multiple sources, which may be internal or external.
Internal data is the data available in the organization that you work for while external data is the data
available in sources other than your organization. The data that is collected by an individual from their
own resources is called first-party data.
The data that is collected and sold is called second-party data. Data that is collected from outside sources
is called third-party data. The common sources from where the data is collected are Interviews, Surveys,
Feedback, Questionnaires. The collected data can be stored in a spreadsheet or SQL database.
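A minimal sketch of this step, assuming pandas is available: a CSV file collected from an external source (the file and table names are illustrative) is loaded and then stored in a SQL database for later analysis.

```python
import sqlite3
import pandas as pd

# Hypothetical example: survey responses collected as a CSV file (external source)
# are loaded and then stored in a SQL database for the next steps.
responses = pd.read_csv("survey_responses.csv")        # file name is illustrative

with sqlite3.connect("analysis.db") as conn:
    responses.to_sql("survey_responses", conn, if_exists="replace", index=False)
    # The stored table can later be pulled back for the cleaning step.
    stored = pd.read_sql("SELECT * FROM survey_responses", conn)

print(stored.head())
```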
Steps for Data analysis Process
3. Data Cleaning
The third step is Clean and Process Data. After the data is collected from multiple sources, it is time
to clean the data. Clean data means data that is free from misspellings, redundancies, and irrelevance.
Clean data largely depends on data integrity.
There might be duplicate data, or the data might not be in a usable format; therefore, the unnecessary data is removed and cleaned. SQL and Excel provide different functions to clean the data. This is one of the most important steps in Data Analysis, as clean and formatted data helps in finding trends and solutions.
The most important part of the Process phase is to check whether your data is biased or not. Bias is an act
of favoring a particular group/community while ignoring the rest. Biasing is a big no-no as it might affect
the overall data analysis. The data analyst must make sure to include every group while the data is being
collected.
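A minimal cleaning sketch with pandas, using invented feedback records that show the problems described above (duplicates, missing values, text dates).

```python
import pandas as pd

# Hypothetical raw feedback data with duplicates, missing values, and text dates.
df = pd.DataFrame({
    "customer": ["Anu", "Anu", "Raj", "Mei", None],
    "rating":   [5, 5, 4, None, 3],
    "date":     ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"],
})

df = df.drop_duplicates()                                    # remove redundant rows
df = df.dropna(subset=["customer"])                          # drop rows missing key fields
df["rating"] = df["rating"].fillna(df["rating"].median())    # impute a missing rating
df["date"] = pd.to_datetime(df["date"])                      # convert text dates to datetime

print(df)
```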
Steps for Data analysis Process
4. Analyzing the Data
The fourth step is to Analyze. The cleaned data is used for analyzing and identifying trends. This step also involves performing calculations and combining data for better results.
The tools used for performing calculations are Excel or SQL. These tools provide in-built functions to
perform calculations or sample code is written in SQL to perform calculations.
Using Excel, we can create pivot tables and perform calculations while SQL creates temporary tables to
perform calculations. Programming languages are another way of solving problems. They make it much
easier to solve problems by providing packages. The most widely used programming languages for data
analysis are R and Python.
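A small pandas sketch of this step, assuming illustrative sales data: a pivot table summarizes revenue by region and product, much like the Excel pivot tables or SQL temporary tables mentioned above.

```python
import pandas as pd

# Hypothetical cleaned sales data; a pivot table summarizes it for analysis.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 90.0],
})

pivot = sales.pivot_table(values="revenue", index="region",
                          columns="product", aggfunc="sum", fill_value=0)
print(pivot)
# product      A     B
# North    120.0  80.0
# South    350.0  90.0
```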
Steps for Data analysis Process
5. Data Visualization
The fifth step is visualizing the data. Nothing is more compelling than a visualization. The data now
transformed has to be made into a visual (chart, graph). The reason for making data visualizations is that
there might be people, mostly stakeholders that are non-technical.
Visualizations are made for a simple understanding of complex data. Tableau and Looker are the two
popular tools used for compelling data visualizations. Tableau is a simple drag and drop tool that helps in
creating compelling visualizations.
Looker is a data viz tool that directly connects to the database and creates visualizations. Tableau and
Looker are both equally used by data analysts for creating a visualization. R and Python have some
packages that provide beautiful data visualizations.
R has a package named ggplot2, which offers a wide variety of data visualizations. A presentation is given based on the data findings. Sharing the insights with team members and stakeholders will help in making better decisions.
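A minimal matplotlib sketch of turning a finding into a chart a non-technical stakeholder can read; the categories and figures are illustrative only.

```python
import matplotlib.pyplot as plt

# Illustrative finding: revenue by region, as a simple bar chart.
regions = ["North", "South", "East", "West"]
revenue = [120, 440, 260, 310]

plt.bar(regions, revenue)
plt.title("Revenue by region")
plt.ylabel("Revenue (in $1000s)")
plt.savefig("revenue_by_region.png")   # or plt.show() for interactive use
```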
Steps for Data analysis Process
6. Presenting the Data
Presenting the data involves transforming raw information into a format that is easily comprehensible and
meaningful for various stakeholders. This process encompasses the creation of visual representations,
such as charts, graphs, and tables, to effectively communicate patterns, trends, and insights gleaned from
the data analysis.
The goal is to facilitate a clear understanding of complex information, making it accessible to both
technical and non-technical audiences. Effective data presentation involves thoughtful selection of
visualization techniques based on the nature of the data and the specific message intended. It goes
beyond mere display to storytelling, where the presenter interprets the findings, emphasizes key points,
and guides the audience through the narrative that the data unfolds. Whether through reports,
presentations, or interactive dashboards, the art of presenting data involves balancing simplicity with
depth, ensuring that the audience can easily grasp the significance of the information presented and use it
for informed decision-making.
Analytical Models
The four main analytical models organizations can deploy are:
Descriptive
Diagnostic
Predictive
Prescriptive.
Analytical Models
Descriptive analytics
Descriptive analytics answer the question: What happened?
This is the most common type of analytics found in business. It generally uses historical data from a single
internal source to pinpoint when an event occurred.
For example:
How many sales did we make in the last week/day/hour?
Which customers required the most help from our customer service team?
How many people viewed our website?
Which product had the most defects?
Descriptive analytics are often displayed on dashboards and in reports, which are convenient ways to
consume data and inform decisions. Descriptive analytics account for most of the statistics we use,
including basic aggregation (e.g. count or sum of values filtered from a column or data), averages, and other summary measures.
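A small pandas sketch of a descriptive question from the list above ("How many sales did we make in the last week?"); the order data and cut-off date are invented.

```python
import pandas as pd

# "What happened?" -- how many sales did we make in the last 7 days?
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime(["2024-03-01", "2024-03-20", "2024-03-22",
                            "2024-03-24", "2024-03-25"]),
    "amount": [40.0, 25.0, 60.0, 15.0, 80.0],
})

today = pd.Timestamp("2024-03-26")
last_week = orders[orders["date"] >= today - pd.Timedelta(days=7)]

print(len(last_week))             # count of sales in the last week -> 4
print(last_week["amount"].sum())  # total revenue in the last week -> 180.0
```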
Analytical Models
Diagnostic analytics
Diagnostic analytics help us to answer the next question: Why did it happen?
To do this, analysts dive deeper into an organization's historical data, combining multiple sources in search of
patterns, trends, and correlations.
Why would you use diagnostic analytics?
Identify anomalies: Analysts use the results from descriptive analysis to identify areas that need further
investigation and raise questions that can’t be answered by simply looking at the data. For example: Why
have sales increased in a region that had no change in marketing?
Drill down into data: To explain anomalies, analysts must find patterns outside existing data sets to identify
correlations. They might need to use techniques such as data mining, and use data from external sources.
Determine causal relationships: Having identified anomalies and searched for patterns that could be
correlated, analysts use more advanced statistical techniques to determine whether these are related.
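A minimal sketch of the drill-down idea, assuming pandas: two illustrative sources (sales and weather) are combined and checked for correlation; correlation alone does not establish causation, which is why more advanced techniques are mentioned above.

```python
import pandas as pd

# "Why did it happen?" -- combine two illustrative sources and look for a pattern
# behind an anomaly (a spike in sales in weeks 3-4).
sales = pd.DataFrame({"week": [1, 2, 3, 4, 5], "units": [100, 105, 160, 150, 110]})
weather = pd.DataFrame({"week": [1, 2, 3, 4, 5], "avg_temp_c": [18, 19, 30, 29, 20]})

combined = sales.merge(weather, on="week")
# Strong positive correlation here (close to 1.0); causation still needs further analysis.
print(combined["units"].corr(combined["avg_temp_c"]))
```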
Analytical Models
Predictive analytics
As an organization increases its analytical maturity and embarks on predictive analytics, it shifts its focus from understanding
historical events to creating insights about a current or future state. Predictive analytics is at the intersection of classical statistical
analysis and modern artificial intelligence (AI) techniques. It tries to answer the question: What will happen next?
It’s impossible to predict exactly what will happen in the future, but by employing predictive analytics, organizations identify the
likelihood of possible outcomes and can increase the chance of taking the best course of action. We see predictive analytics used in
many sectors.
For example:
Aerospace – Predictive analytics are used to predict the effect of specific maintenance operations on aircraft reliability, fuel use,
availability, and uptime.
Financial services – Predictive analytics are used to develop credit-risk models and forecast financial market trends.
Manufacturing – Predictive analytics are used to predict the location and rate of machine failures, and to optimise ordering and
delivery of raw materials based on projected future demands.
Online retail – Systems monitor customer behavior, and predictive models determine whether providing additional product recommendations or offers is likely to lead to a purchase.
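A toy predictive sketch for the online-retail case, assuming scikit-learn is available; the features, labels, and session data are entirely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# "What will happen next?" -- will a browsing session end in a purchase?
# Features: [pages_viewed, minutes_on_site]; labels: 1 = purchased.
X = np.array([[2, 1], [10, 8], [3, 2], [15, 12], [1, 1], [12, 9]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

new_session = np.array([[8, 6]])
print(model.predict_proba(new_session)[0, 1])   # estimated probability of purchase
```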
Analytical Models
Prescriptive analytics
Prescriptive analytics is the most complex type of analytics. It combines internal data, external sources, and machine-
learning techniques to provide the most effective outcomes. In prescriptive analytics, a decision-making process is
applied to descriptive and predictive models to find the combinations of existing conditions and possible decisions that
are likely to have the most effect in the future. This process is both complex and resource intensive but, when done well,
can provide immense value to an organization.
Applications of prescriptive analytics include:
risk management[2]
improving healthcare[3]
guided marketing, selling and pricing[4].
As the most complex form of analytics, prescriptive analytics not only pose technical challenges, but are also influenced
by external factors such as government regulation, market risk, and existing organizational behavior. If you are
considering deploying prescriptive analytics, be sure you have a solid business case that identifies why machine-learning-driven decisions will benefit the organization.
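A deliberately simplified prescriptive sketch: a stand-in predictive model scores candidate actions, and the action with the best expected outcome is recommended; the pricing model and numbers are invented.

```python
# Prescriptive sketch: score each candidate decision with a predictive model
# (here a stand-in function) and recommend the one with the best expected outcome.

def predicted_revenue(price_discount: float) -> float:
    # Stand-in for a fitted predictive model: demand rises as discount grows,
    # but margin per unit falls.
    demand = 1000 * (1 + 2 * price_discount)
    margin = 20 * (1 - price_discount)
    return demand * margin

candidate_discounts = [0.0, 0.1, 0.25, 0.4]
best = max(candidate_discounts, key=predicted_revenue)
print(best, predicted_revenue(best))   # recommends the 25% discount in this toy model
```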
The difference between Traditional data and Big data
Exploratory Statistical Analysis
1. Understand Data Distribution:
   1. Identify patterns, trends, and relationships within the data.
   2. Assess data variability and distribution.
2. Hypothesize Relationships:
   1. Test assumptions and generate hypotheses about variable interactions.
3. Reduce Complexity:
   1. Summarize large datasets into manageable insights.
Exploratory Statistical Analysis
Statistical Methods in ESA
1. Descriptive Statistics
• Measures of Central Tendency:
• Mean: Average value.
• Median: Middle value.
• Mode: Most frequent value.
Consider a scenario where a retail company is analyzing the daily sales (in dollars) of a product across different stores over a
month. The dataset (in dollars) is as follows:
200, 250, 300, 250, 500, 250, 800, 1000, 250, 300
1. Mean (Average)
The mean provides the average sales per day.
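The three measures can be computed directly on the dataset above with Python's standard statistics module:

```python
import statistics

# The daily sales figures from the scenario above.
sales = [200, 250, 300, 250, 500, 250, 800, 1000, 250, 300]

print(statistics.mean(sales))     # 410   -> average sales per day
print(statistics.median(sales))   # 275.0 -> middle value, robust to the two large days
print(statistics.mode(sales))     # 250   -> most frequent daily figure
```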
1. Descriptive Statistics
• Measures of Dispersion:
• Range: Difference between maximum and minimum.
• Variance: Spread of data points from the mean.
• Standard Deviation: Average distance from the mean.
• Univariate Analysis:
• Histograms
• Boxplots
Standard plotting libraries, or big data-specific tools like Apache Zeppelin, can handle large datasets effectively.
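The dispersion measures and univariate plots listed above, sketched on the same daily-sales data with the statistics module and matplotlib:

```python
import statistics
import matplotlib.pyplot as plt

sales = [200, 250, 300, 250, 500, 250, 800, 1000, 250, 300]

print(max(sales) - min(sales))        # range: 800
print(statistics.pvariance(sales))    # population variance: 67900
print(statistics.pstdev(sales))       # population standard deviation: ~260.6

# Univariate views: a histogram of the distribution and a boxplot that flags outliers.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(sales, bins=5)
ax1.set_title("Daily sales histogram")
ax2.boxplot(sales)
ax2.set_title("Daily sales boxplot")
plt.savefig("sales_univariate.png")
```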
Exploratory Statistical Analysis
Visualization in ESA
• Bivariate and Multivariate Analysis:
• Scatterplots
• Pairplots
• Heatmaps for correlation matrices
Standard plotting libraries, or big data-specific tools like Apache Zeppelin, can handle large datasets effectively.
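A small sketch of these bivariate and multivariate views, using invented data with pandas and matplotlib (a scatterplot plus a correlation-matrix heatmap):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: sales depend on ad spend; footfall is unrelated noise.
rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=50)
sales = 3 * ad_spend + rng.normal(0, 20, size=50)
footfall = rng.uniform(100, 500, size=50)
df = pd.DataFrame({"ad_spend": ad_spend, "sales": sales, "footfall": footfall})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bivariate view: scatterplot of two variables.
ax1.scatter(df["ad_spend"], df["sales"])
ax1.set_xlabel("ad_spend")
ax1.set_ylabel("sales")

# Multivariate view: heatmap of the correlation matrix.
corr = df.corr()
im = ax2.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax2.set_xticks(range(len(corr.columns)))
ax2.set_xticklabels(corr.columns, rotation=45)
ax2.set_yticks(range(len(corr.columns)))
ax2.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax2)

plt.tight_layout()
plt.savefig("bivariate_and_heatmap.png")
```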
Exploratory Statistical Analysis
Tools for ESA in Big Data
1. Python Libraries:
1. Pandas: Data manipulation and summary statistics.
2. NumPy: Mathematical operations and arrays.
3. SciPy: Statistical methods and hypothesis testing.
4. Statsmodels: Advanced statistical modeling.
2. R Programming:
1. Comprehensive statistical functions and visualization.
4. SQL-based Solutions:
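Building on the Python libraries listed above, a quick hypothesis-testing sketch with SciPy; the two groups of store sales are randomly generated purely for illustration.

```python
import numpy as np
from scipy import stats

# Do two store groups differ in average daily sales? (illustrative data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=400, scale=50, size=200)
group_b = rng.normal(loc=415, scale=50, size=200)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)           # a small p-value suggests the group means differ

# Summary statistics for one group (count, min/max, mean, variance, skew, kurtosis).
print(stats.describe(group_a))
```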
Missing Values
Missing values are common in big data analytics due to the diverse sources, formats, and collection methods
of data. Handling missing values is crucial to ensure the accuracy and reliability of analysis and models.
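A minimal pandas sketch of handling missing values in records merged from several sources; the sensor readings are invented.

```python
import numpy as np
import pandas as pd

# Illustrative records merged from several sources, with gaps in different fields.
df = pd.DataFrame({
    "sensor_id": ["s1", "s2", "s3", "s4"],
    "temperature": [21.5, np.nan, 23.0, 22.0],
    "humidity": [40.0, 42.0, np.nan, np.nan],
})

print(df.isna().sum())            # count missing values per column

# Two common strategies: impute a numeric gap, or drop rows still missing a key field.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
df = df.dropna(subset=["humidity"])
print(df)
```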
Types of Missing Data
Outliers Detection and Treatment
2. Transform Data:
1. Apply transformations to reduce the impact of outliers:
1. Logarithmic or square root transformations for right-skewed data.
2. Winsorization: Cap extreme values at a specified percentile.
3. Impute Outliers:
1. Replace outliers with statistically estimated values.
2. Methods:
1. Replace with the mean, median, or mode.
2. Interpolation for time-series data.
4. Analyze Separately:
1. Retain and analyze outliers as separate cases if they represent meaningful anomalies (e.g., fraud detection, extreme customer behavior).
Outliers Detection and Treatment
Visualization Methods
• Box Plots: Display the spread of data and identify outliers visually.
• Scatter Plots: Useful for bivariate relationships.
• Heatmaps: Identify multivariate anomalies using correlation matrices.
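A short sketch combining detection and treatment, assuming pandas: the common 1.5 × IQR rule flags extreme days in the daily-sales data used earlier, and winsorization caps them at the computed bounds.

```python
import pandas as pd

# IQR-rule sketch for detecting and treating outliers; 1.5 x IQR is a convention, not a fixed rule.
sales = pd.Series([200, 250, 300, 250, 500, 250, 800, 1000, 250, 300])

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]
print(outliers.tolist())                      # flagged extreme days: [800, 1000]

# Treatment option (Winsorization): cap values at the computed bounds.
winsorized = sales.clip(lower=lower, upper=upper)
print(winsorized.tolist())
```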