EDA Unit V
UNIT V
EXPLORATORY DATA ANALYSIS
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and
Beyond – Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time-
based indexing – Visualizing – Grouping – Resampling.
Introducing a Third Variable in Exploratory Data Analysis (EDA)
Introduction
In Exploratory Data Analysis (EDA), examining relationships between two variables can provide
valuable insights. However, introducing a third variable can often reveal deeper, more nuanced
patterns and relationships that are not immediately apparent. This can help in identifying
potential interactions, confounding factors, or mediators in the data.
Importance of Introducing a Third Variable
1. Understanding Interactions: A third variable can help identify interactions between the
primary variables being analyzed. For example, the relationship between exercise and
weight loss might be influenced by diet.
2. Controlling for Confounders: Introducing a third variable can help control for
confounding factors that may distort the observed relationship between two variables.
3. Identifying Mediators: A third variable can act as a mediator, explaining the mechanism
through which one variable affects another.
Methods for Introducing a Third Variable
1. Stratification: Dividing the data into subgroups based on the third variable and analyzing
the primary relationship within each subgroup.
2. Multivariate Plots: Using multivariate visualizations such as 3D plots, color-coded
scatterplots, or facet grids to incorporate the third variable into the analysis.
3. Statistical Models: Employing statistical models like multiple regression, ANOVA, or
logistic regression to include the third variable and assess its impact on the primary
relationship.
Example: Analyzing the Relationship Between Study Hours, Exam Scores, and Sleep Hours
Consider a dataset with the following variables:
Study Hours: Number of hours a student studies per week.
Exam Scores: Scores obtained by students in an exam.
Sleep Hours: Average number of hours a student sleeps per night.
Step-by-Step Analysis
1. Initial Bivariate Analysis
o Examine the relationship between Study Hours and Exam Scores using a
scatterplot.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
    'Study Hours': [10, 15, 20, 25, 30, 35, 40],
    'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
    'Sleep Hours': [6, 7, 6, 8, 7, 6, 7]
})

# Scatterplot of Study Hours vs Exam Scores
sns.scatterplot(data=data, x='Study Hours', y='Exam Scores')
plt.title('Study Hours vs Exam Scores')
plt.show()
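To bring the third variable into the same view, one option is a color-coded scatterplot that encodes Sleep Hours as the point color. A minimal sketch, assuming the sample DataFrame above:
python
# Scatterplot with Sleep Hours (the third variable) encoded as color
sns.scatterplot(data=data, x='Study Hours', y='Exam Scores', hue='Sleep Hours', palette='viridis')
plt.title('Study Hours vs Exam Scores, Colored by Sleep Hours')
plt.show()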
Regression Analysis: The multiple regression model provides quantitative insights into
the effect of both Study Hours and Sleep Hours on Exam Scores. The coefficients and p-
values indicate the significance and strength of these relationships.
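A minimal sketch of such a model, assuming the sample DataFrame above and ordinary least squares from statsmodels:
python
import statsmodels.api as sm

# Multiple regression: Exam Scores explained by Study Hours and Sleep Hours
X = sm.add_constant(data[['Study Hours', 'Sleep Hours']])
y = data['Exam Scores']
model = sm.OLS(y, X).fit()

# Coefficients and p-values for each predictor
print(model.summary())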
Conclusion
Introducing a third variable in EDA enriches the analysis by revealing hidden patterns,
controlling for confounders, and understanding interactions. Utilizing multivariate visualizations
and statistical models allows for a comprehensive examination of complex relationships, leading
to more informed and accurate conclusions.
Causal Explanations
Causal explanations in EDA aim to identify and understand the cause-and-effect
relationships between variables. Unlike simple associations or correlations, causal explanations
provide insights into how and why changes in one variable lead to changes in another.
Establishing causality is crucial for making informed decisions and implementing effective
interventions in various fields, including healthcare, economics, social sciences, and more.
Importance of Causal Explanations
1. Understanding Mechanisms: Identifying the underlying mechanisms through which one
variable affects another.
2. Effective Interventions: Designing and implementing strategies or policies that can
effectively address the root causes of issues.
3. Predictive Accuracy: Improving the accuracy of predictive models by incorporating
causal relationships.
4. Policy Making: Providing evidence-based support for policy decisions and interventions.
Methods for Establishing Causality
1. Randomized Controlled Trials (RCTs): The gold standard for establishing causality,
involving random assignment of subjects to treatment and control groups.
2. Natural Experiments: Using naturally occurring events or circumstances to study causal
effects.
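One common observational approach, which is only a sketch and not a substitute for the designs above, is regression adjustment: include the suspected confounder as a covariate and compare the estimate with and without it. The variable names and simulated data below are assumptions for illustration.
python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: 'exercise' affects 'weight_loss', with 'diet_quality' as a confounder
rng = np.random.default_rng(0)
n = 200
diet_quality = rng.normal(size=n)
exercise = 0.5 * diet_quality + rng.normal(size=n)
weight_loss = 2.0 * exercise + 1.5 * diet_quality + rng.normal(size=n)
df = pd.DataFrame({'exercise': exercise, 'diet_quality': diet_quality, 'weight_loss': weight_loss})

# Naive model (confounded) vs adjusted model that controls for diet_quality
naive = sm.OLS(df['weight_loss'], sm.add_constant(df[['exercise']])).fit()
adjusted = sm.OLS(df['weight_loss'], sm.add_constant(df[['exercise', 'diet_quality']])).fit()
print("Naive estimate:", round(naive.params['exercise'], 2))
print("Adjusted estimate:", round(adjusted.params['exercise'], 2))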
Three-Variable Contingency Tables and Beyond
Three-variable contingency tables, also known as three-way tables, extend the concept of two-way tables to analyze the relationship between three categorical variables simultaneously. These tables allow researchers to explore more complex interactions and dependencies among variables, providing a deeper understanding of the data.
Example: Gender, Smoking Status, and Exercise Frequency
Step-by-Step Analysis
1. Create the Data
o Construct a dataset containing each subject's Gender, Smoking Status, and Exercise Frequency.
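The dataset below is a hypothetical sketch with eight subjects, chosen so that the Age Group step later in this example (which adds eight values) lines up with it:
python
import pandas as pd

# Hypothetical sample data with three categorical variables (eight subjects)
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Smoking Status': ['Smoker', 'Non-Smoker', 'Non-Smoker', 'Smoker',
                       'Non-Smoker', 'Non-Smoker', 'Smoker', 'Smoker'],
    'Exercise Frequency': ['Rarely', 'Rarely', 'Regular', 'Regular',
                           'Occasional', 'Rarely', 'Occasional', 'Regular']
})
print(data)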
2. Create the Contingency Table
o Use the pd.crosstab function to create the three-variable contingency table.
python
# Create three-variable contingency table
contingency_table = pd.crosstab(index=[data['Gender'], data['Smoking Status']],
columns=data['Exercise Frequency'])
print(contingency_table)
3. Interpreting the Table
o Each cell represents the count of observations for a specific combination of
Gender, Smoking Status, and Exercise Frequency.
Beyond Three-Variable Contingency Tables
While three-variable contingency tables provide valuable insights, they can become complex and
difficult to interpret as the number of variables increases. Therefore, additional techniques and
tools are often used to analyze higher-dimensional data.
1. Higher-Dimensional Tables: Extending to four or more variables, though interpretation
becomes increasingly challenging.
2. Multivariate Analysis: Using statistical techniques like logistic regression, factor
analysis, or cluster analysis to handle multiple variables simultaneously.
3. Data Visualization: Employing advanced visualization techniques such as heatmaps,
mosaic plots, or parallel coordinates to represent high-dimensional data.
Example: Higher-Dimensional Analysis
Suppose we add a fourth variable, Age Group (Youth, Adult, Senior), to the previous example.
We can create a four-variable contingency table or use logistic regression to analyze the data.
Step-by-Step Higher-Dimensional Analysis
1. Create the Data
o Add the fourth variable to the dataset.
python
data['Age Group'] = ['Youth', 'Adult', 'Senior', 'Youth', 'Adult', 'Senior', 'Youth', 'Adult']
print(data)
2. Higher-Dimensional Contingency Table
o Create a four-variable contingency table using pd.crosstab.
python
# Create four-variable contingency table
contingency_table_4d = pd.crosstab(index=[data['Gender'], data['Smoking Status'],
data['Age Group']], columns=data['Exercise Frequency'])
print(contingency_table_4d)
3. Multivariate Analysis with Logistic Regression
o Use logistic regression to analyze the relationship between Exercise Frequency
and the other variables.
python
import statsmodels.api as sm
# Binary target: 1 if the subject exercises rarely, 0 otherwise
y = (data['Exercise Frequency'] == 'Rarely').astype(int)
# Encode the remaining categorical predictors as dummy variables
X = pd.get_dummies(data.drop(columns=['Exercise Frequency']), drop_first=True).astype(float)
# Add constant term for the intercept
X = sm.add_constant(X)
# Fit the logistic regression model (with a sample this small, the fit is illustrative only)
logit_model = sm.Logit(y, X).fit()
# Print the model summary
print(logit_model.summary())
Interpretation
Contingency Tables: Provide a clear count of observations for each combination of
variables, useful for preliminary analysis.
Fundamentals of Time Series Analysis (TSA)
Objectives:
● To understand the concept of time series analysis
● To know the different characteristics of time series data
● To identify different methods of time series analysis
● To have an overview of univariate and multivariate time series
What Is Time Series Analysis?
Time-series analysis is a method of analyzing a collection of data points over a period of time.
Instead of recording data points intermittently or randomly, time series analysts record data
points at consistent intervals over a set period of time.
While time-series data is information gathered over time, several related data types describe how and when that information was gathered. For example:
Time series data: A collection of observations on the values that a variable takes at various points in time.
Cross-sectional data: Data on one or more variables collected at a single point in time.
Pooled data: A combination of cross-sectional and time-series data.
At each point in time the observed value Yt is treated as a random variable with a probability distribution that determines which values Y can take and with what probability. A simple representation is
Yt = μt + εt
where μt is the deterministic (mean) component and εt is the random error at time t.
Common types of time series analysis include:
Curve Fitting: Plots the data on a curve to investigate the relationships between variables in the data.
Descriptive Analysis: Identifies patterns in time-series data, such as trends, cycles, and seasonal variation.
Explanative Analysis: Attempts to understand the data and the cause-and-effect relationships within it.
Segmentation: Splits the data into segments to reveal the underlying properties of the source data.
ARIMA
ARIMA is an acronym for Autoregressive Integrated Moving Average. The Box-Jenkins method
is another name for this method.
Now you will explore the ARIMA parameters in detail:
Autoregressive Component (p): AR stands for autoregressive. When p = 0, there is no autocorrelation in the series; when p = 1, the model uses the autocorrelation of the series up to one lag.
Moving Average (q): The moving average order is denoted by q. When q = 1, the model includes one lagged error term.
Integration (d): The order of differencing is denoted by d. When d = 0, the series is already stationary; when d = 1, the series is non-stationary and can be made stationary by taking the first difference.
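As a brief sketch, an ARIMA model with a chosen (p, d, q) order can be fit with statsmodels; the series and the order below are assumptions for illustration.
python
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd
import numpy as np

# Hypothetical monthly series purely for illustration
idx = pd.date_range('2020-01-01', periods=36, freq='MS')
series = pd.Series(np.arange(36) + np.random.normal(0, 2, 36), index=idx)

# ARIMA(p, d, q) = (1, 1, 1): one AR lag, one differencing step, one MA term
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())

# Forecast the next 6 periods
print(model.forecast(steps=6))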
Time series analysis is a statistical technique that deals with time-ordered data. The primary
goal is to understand the underlying structure and function that produced the observations and to
forecast future values based on historical data. Time series data are ubiquitous in various fields
such as finance, economics, environmental science, and many others.
Characteristics of Time Series Data
1. Trend: Long-term movement in the data. It represents the general direction in which the
data is moving over time.
2. Seasonality: Regular, repeating patterns or cycles in data at fixed intervals.
3. Cyclic Patterns: Irregular fluctuations that are not of fixed period but occur over long
time frames.
4. Noise: Random variations that are unexplained by the model.
Steps in Time Series Analysis
1. Data Collection: Gather the data in a time-ordered sequence.
2. Data Cleaning: Handle missing values, outliers, and anomalies.
3. Data Visualization: Plot the data to identify patterns such as trends and seasonality.
4. Decomposition: Separate the time series into trend, seasonal, and residual components.
5. Modeling: Fit statistical models to the data for understanding and forecasting.
6. Validation: Evaluate the model's performance using validation techniques.
7. Forecasting: Predict future values using the developed model.
Decomposition of Time Series
Decomposition involves breaking down a time series into its constituent components:
Trend Component (T): Represents the long-term progression of the series.
Seasonal Component (S): Captures the repeating short-term cycle.
Residual Component (R): The remaining part after removing trend and seasonality.
In an additive model the observed value is Yt = Tt + St + Rt; in a multiplicative model it is Yt = Tt × St × Rt.
Example: Decomposition using Python
python
import pandas as pd
import numpy as np
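# A minimal sketch completing the decomposition example; the series below is an
# assumed synthetic monthly series with an upward trend, yearly seasonality, and noise
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

idx = pd.date_range('2018-01-01', periods=48, freq='MS')
values = np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + np.random.normal(0, 1, 48)
series = pd.Series(values, index=idx)

# Decompose into trend, seasonal, and residual components and plot them
decomposition = seasonal_decompose(series, model='additive', period=12)
decomposition.plot()
plt.show()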
Data Cleaning
Data cleaning prepares a time series for analysis by addressing missing values, outliers, and noise. Common steps include:
1. Handling Missing Values:
o Imputation: Filling missing values with a specific value such as the mean, median, or mode.
o Interpolation: Estimating missing values based on surrounding data points.
2. Outlier Detection and Removal: Identifying and treating anomalies or unusual data
points that can skew the analysis.
3. Smoothing: Reducing noise to highlight underlying patterns using methods like moving
averages.
Example: Data Cleaning using Python
python
import pandas as pd
import numpy as np
# Sample time series data with missing values and outliers
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = pd.Series([1, np.nan, 3, 4, 5, np.nan, 7, 100, 9, 10], index=date_rng)
# Handling missing values
data_filled = data.interpolate()
# Removing outliers
data_filled[data_filled > 20] = np.nan # Assume values greater than 20 are outliers
data_cleaned = data_filled.interpolate()
print(data_cleaned)
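The example above covers imputation, interpolation, and outlier handling; smoothing can be added with a moving average. A short sketch, assuming the cleaned series from above and a 3-day window:
python
# Smooth the cleaned series with a 3-day centered moving average
data_smoothed = data_cleaned.rolling(window=3, center=True).mean()
print(data_smoothed)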
Time-Based Indexing
Time-based indexing involves setting the index of a DataFrame or Series to a datetime object,
allowing for more convenient and efficient time series operations.
Example: Time-Based Indexing using Python
python
# Create DataFrame with time-based indexing
df = pd.DataFrame({
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}, index=pd.date_range(start='2020-01-01', end='2020-01-10'))
print(df)
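With a datetime index in place, rows can be selected and sliced by date labels. A short sketch using the DataFrame above:
python
# Select a single day by its date label
print(df.loc['2020-01-03'])

# Slice a range of dates (both endpoints are included)
print(df.loc['2020-01-03':'2020-01-07'])

# Partial-string indexing: every row in January 2020
print(df.loc['2020-01'])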
Visualizing in Time Series Analysis
Visualization is a crucial aspect of time series analysis as it helps to understand the data's
structure, trends, patterns, and anomalies over time. Effective visualizations can reveal insights
that might not be apparent through statistical analysis alone.
Here are some common visualization techniques used in time series analysis:
1. Line Plots
Description: Line plots are the most basic and commonly used method for visualizing
time series data. They plot data points sequentially over time with lines connecting the
points.
Usage: Ideal for displaying trends and patterns over time.
Example:
python
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
date_range = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = pd.Series(range(100), index=date_range)
# Plotting
data.plot()
plt.title('Line Plot of Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
2. Seasonal Plots
Description: Seasonal plots help to visualize seasonal patterns by plotting data for each
period (e.g., month, year) on the same graph.
Usage: Useful for identifying and comparing seasonal patterns across different periods.
Example:
python
import seaborn as sns
# Rebuild the data as a DataFrame with explicit 'date' and 'value' columns
data = pd.DataFrame({'date': date_range, 'value': range(100)})
data['month'] = data['date'].dt.month
# Seasonal plot
sns.lineplot(data=data, x='month', y='value')
plt.title('Seasonal Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
3. Autocorrelation Plots (ACF)
Description: Autocorrelation plots display the correlation of the time series with its own
past values (lags).
Usage: Useful for identifying the degree of correlation between time steps and for
checking the randomness of data.
Example:
python
from statsmodels.graphics.tsaplots import plot_acf
# Plotting ACF
plot_acf(data['value'])
plt.title('Autocorrelation Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
4. Heatmaps
Description: Heatmaps represent data in a matrix form, where values are represented by
different colors. Time series heatmaps can show patterns across two dimensions, such as
time of day and day of week.
Usage: Useful for visualizing seasonal and cyclical patterns across multiple dimensions.
Example:
python
data['day_of_week'] = data['date'].dt.dayofweek
data['hour'] = data['date'].dt.hour  # with daily sample data every hour is 0; an hourly series gives a richer heatmap
# Creating a pivot table
heatmap_data = data.pivot_table(index='day_of_week', columns='hour', values='value')
# Plotting heatmap
sns.heatmap(heatmap_data, cmap='coolwarm')
plt.title('Heatmap of Time Series Data')
plt.xlabel('Hour')
plt.ylabel('Day of Week')
plt.show()
5. Box Plots
Description: Box plots (or whisker plots) display the distribution of data based on a five-
number summary: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
Usage: Useful for comparing distributions across different time periods.
Example:
python
sns.boxplot(x='month', y='value', data=data)
plt.title('Box Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
6. Decomposition Plots
Description: Decomposition plots break down a time series into its component parts:
trend, seasonal, and residual components.
Usage: Useful for understanding the underlying structure of the time series data.
Example:
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series
decomposition = seasonal_decompose(data['value'], model='additive', period=12)
# Plotting decomposition
decomposition.plot()
plt.show()
Importance of Visualization in Time Series Analysis
Trend Identification: Visualizations help identify long-term trends in the data.
Seasonal Patterns: They reveal recurring patterns and seasonality.
Anomaly Detection: Visualizing data can highlight outliers or anomalies.
Data Cleaning: Helps in identifying missing values and incorrect data points.
Model Selection: Assists in choosing appropriate models for forecasting and analysis
based on observed patterns.
Effective visualization is essential for gaining insights and making informed decisions based on
time series data. By utilizing these visualization techniques, analysts can better understand and
interpret their data.
Grouping
Grouping in Time Series Analysis (TSA)
Grouping in Time Series Analysis is a powerful technique used to aggregate data based on
specific time intervals or other criteria. This method helps in summarizing the data, identifying
patterns, and gaining insights over different periods.
Key Concepts of Grouping in TSA
1. Time-based Grouping:
o Grouping data based on time intervals such as hourly, daily, weekly, monthly, or
yearly.
o Useful for identifying trends and seasonal patterns over specific periods.
2. Custom Grouping:
o Grouping data by criteria other than fixed calendar intervals, for example business hours versus non-business hours, or weekdays versus weekends.
o Useful when the relevant pattern is defined by domain knowledge rather than by the clock alone; a sketch of such a grouping is shown below.
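The sketch below illustrates both kinds of grouping with pandas, assuming a hypothetical hourly series and defining business hours as 09:00 to 17:59; the column names and thresholds are assumptions for illustration.
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical hourly series covering two weeks
rng = pd.date_range('2020-01-01', periods=24 * 14, freq='H')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(rng))}, index=rng)

# Time-based grouping: average value per day
daily_mean = df.groupby(pd.Grouper(freq='D'))['value'].mean()
print(daily_mean.head())

# Custom grouping: business hours (09:00-17:59) vs non-business hours
df['business_hours'] = df.index.hour.isin(range(9, 18)).astype(int)
grouped = df.groupby('business_hours')['value'].mean()

# Bar chart comparing the two groups
grouped.plot(kind='bar')
plt.title('Average Value: Business vs Non-Business Hours')
plt.xlabel('Group')
plt.ylabel('Average Value')
plt.xticks([0, 1], ['Non-Business Hours', 'Business Hours'])
plt.show()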
Benefits of Grouping in TSA
1. Data Summarization:
o Grouping helps in summarizing large datasets into meaningful statistics such as
mean, sum, count, etc.
o It simplifies the data, making it easier to analyze and interpret.
2. Pattern Identification:
o By grouping data over different time intervals, one can identify trends and
seasonal patterns.
o It helps in understanding the behavior of the time series data.
3. Anomaly Detection:
o Grouping data can reveal anomalies or outliers that deviate from the expected
pattern.
o This is crucial for identifying unusual events or behaviors in the data.
Grouping is an essential technique in time series analysis that enables better data
management and analysis. Whether it’s time-based grouping or custom grouping, using tools like
Pandas makes it straightforward to implement and visualize the results. This approach provides
valuable insights into the data, aiding in more informed decision-making.
Resampling
Resampling is a powerful technique in time series analysis that involves changing the frequency
of your time series data. This can include:
Downsampling: Reducing the frequency of the data (e.g., converting daily data to
monthly data).
Upsampling: Increasing the frequency of the data (e.g., converting monthly data to daily
data).
Here's an example of how to perform resampling in pandas:
When upsampling, the new time points have no values of their own; a common choice is the forward fill method, which fills the resulting NaN values by propagating the last valid observation forward.
Example
Here is the complete example:
python
import pandas as pd
import numpy as np
# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
# Create a DataFrame with this date range
df = pd.DataFrame(date_rng, columns=['date'])
# Add a random value column
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Set the date column as the index
df.set_index('date', inplace=True)
print("Original DataFrame:")
print(df)
# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print("\nDownsampled DataFrame (3-day frequency, mean):")
print(df_downsampled)
# Upsampling to hourly frequency, filling NaN values using forward fill method
df_upsampled = df.resample('H').ffill()
print("\nUpsampled DataFrame (hourly frequency, forward fill):")
print(df_upsampled)
Output
Original DataFrame:
            data
date
2020-01-01    81
2020-01-02    43
2020-01-03    23
2020-01-04    76
2020-01-05    21
2020-01-06    34
2020-01-07    84
2020-01-08    45
2020-01-09    15
2020-01-10    57

Downsampled DataFrame (3-day frequency, mean):
                 data
date
2020-01-01  49.000000
2020-01-04  43.666667
2020-01-07  48.000000
2020-01-10  57.000000

Upsampled DataFrame (hourly frequency, forward fill):
                     data
date
2020-01-01 00:00:00    81
2020-01-01 01:00:00    81
2020-01-01 02:00:00    81
2020-01-01 03:00:00    81
2020-01-01 04:00:00    81
...                   ...
2020-01-09 20:00:00    15
2020-01-09 21:00:00    15
2020-01-09 22:00:00    15
2020-01-09 23:00:00    15
2020-01-10 00:00:00    57

[217 rows x 1 columns]
These examples show how you can resample your time series data to different frequencies using
pandas, providing flexibility in your data analysis.
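Other aggregation and fill strategies can be substituted. A short sketch, reusing the same DataFrame assumptions as the example above:
python
# Downsample with a different aggregation (sum instead of mean)
print(df.resample('3D').sum())

# Upsample and fill the new rows by linear interpolation instead of forward fill
print(df.resample('H').interpolate())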
Unit V Completed