
Rajalakshmi Institute of Technology

(An Autonomous Institution), Affiliated to Anna University, Chennai


Department of Computer Science and Business Systems
CCS346-Exploratory Data Analysis

UNIT V
EXPLORATORY DATA ANALYSIS
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and
Beyond – Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time-
based indexing – Visualizing – Grouping – Resampling.
Introducing a Third Variable in Exploratory Data Analysis (EDA)
Introduction
In Exploratory Data Analysis (EDA), examining relationships between two variables can provide
valuable insights. However, introducing a third variable can often reveal deeper, more nuanced
patterns and relationships that are not immediately apparent. This can help in identifying
potential interactions, confounding factors, or mediators in the data.
Importance of Introducing a Third Variable
1. Understanding Interactions: A third variable can help identify interactions between the
primary variables being analyzed. For example, the relationship between exercise and
weight loss might be influenced by diet.
2. Controlling for Confounders: Introducing a third variable can help control for
confounding factors that may distort the observed relationship between two variables.
3. Identifying Mediators: A third variable can act as a mediator, explaining the mechanism
through which one variable affects another.
Methods for Introducing a Third Variable
1. Stratification: Dividing the data into subgroups based on the third variable and analyzing
the primary relationship within each subgroup.
2. Multivariate Plots: Using multivariate visualizations such as 3D plots, color-coded
scatterplots, or facet grids to incorporate the third variable into the analysis.
3. Statistical Models: Employing statistical models like multiple regression, ANOVA, or
logistic regression to include the third variable and assess its impact on the primary
relationship.
Example: Analyzing the Relationship Between Study Hours, Exam Scores, and Sleep Hours
Data
Consider a dataset with the following variables:
 Study Hours: Number of hours a student studies per week.
 Exam Scores: Scores obtained by students in an exam.
 Sleep Hours: Average number of hours a student sleeps per night.
Step-by-Step Analysis
1. Initial Bivariate Analysis
o Examine the relationship between Study Hours and Exam Scores using a
scatterplot.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
'Study Hours': [10, 15, 20, 25, 30, 35, 40],
'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
'Sleep Hours': [6, 7, 6, 8, 7, 6, 7]
})

# Scatterplot for Study Hours vs. Exam Scores


sns.scatterplot(x='Study Hours', y='Exam Scores', data=data)
plt.title('Study Hours vs. Exam Scores')
plt.show()
2. Introducing the Third Variable (Sleep Hours)
o Use color-coding to incorporate Sleep Hours into the scatterplot.
python
# Scatterplot with Sleep Hours as hue
sns.scatterplot(x='Study Hours', y='Exam Scores', hue='Sleep Hours', palette='viridis', data=data)
plt.title('Study Hours vs. Exam Scores (Color-coded by Sleep Hours)')
plt.show()
3. Stratification by Sleep Hours
o Create separate scatterplots for different levels of Sleep Hours to see if the
relationship between Study Hours and Exam Scores changes.
python
# Facet grid for stratified analysis
g = sns.FacetGrid(data, col='Sleep Hours', col_wrap=3, height=4)
g.map(sns.scatterplot, 'Study Hours', 'Exam Scores')
g.add_legend()
plt.show()
4. Statistical Analysis
o Use multiple regression to assess the impact of both Study Hours and Sleep Hours
on Exam Scores.
python
import statsmodels.api as sm
# Prepare the data for regression
X = data[['Study Hours', 'Sleep Hours']]
X = sm.add_constant(X) # Adds a constant term to the predictor
y = data['Exam Scores']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Print the regression results
print(model.summary())
Interpretation
 Scatterplots: The color-coded scatterplot and facet grid help visualize how Sleep Hours
might influence the relationship between Study Hours and Exam Scores.
 Regression Analysis: The multiple regression model provides quantitative insights into
the effect of both Study Hours and Sleep Hours on Exam Scores. The coefficients and p-
values indicate the significance and strength of these relationships.
Conclusion
Introducing a third variable in EDA enriches the analysis by revealing hidden patterns,
controlling for confounders, and understanding interactions. Utilizing multivariate visualizations
and statistical models allows for a comprehensive examination of complex relationships, leading
to more informed and accurate conclusions.
Causal Explanations
Causal explanations in EDA aim to identify and understand the cause-and-effect
relationships between variables. Unlike simple associations or correlations, causal explanations
provide insights into how and why changes in one variable lead to changes in another.
Establishing causality is crucial for making informed decisions and implementing effective
interventions in various fields, including healthcare, economics, social sciences, and more.
Importance of Causal Explanations
1. Understanding Mechanisms: Identifying the underlying mechanisms through which one
variable affects another.
2. Effective Interventions: Designing and implementing strategies or policies that can
effectively address the root causes of issues.
3. Predictive Accuracy: Improving the accuracy of predictive models by incorporating
causal relationships.
4. Policy Making: Providing evidence-based support for policy decisions and interventions.
Methods for Establishing Causality
1. Randomized Controlled Trials (RCTs): The gold standard for establishing causality,
involving random assignment of subjects to treatment and control groups.
2. Natural Experiments: Using naturally occurring events or circumstances to study causal
effects.
3. Quasi-Experimental Designs: Approaches like difference-in-differences, instrumental
variables, and regression discontinuity that attempt to infer causality when randomization
is not possible.
4. Longitudinal Studies: Tracking the same subjects over time to observe how changes in
one variable lead to changes in another.
5. Causal Inference Methods: Statistical techniques like propensity score matching,
Granger causality tests, and structural equation modeling.
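As a brief illustration of the causal inference methods listed above, the following sketch runs a Granger causality test with statsmodels on two synthetic series. The series, the lag order, and the variable names are illustrative assumptions rather than part of the course example.
python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic example: x leads y by one period, plus noise (illustrative data only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.roll(x, 1) + rng.normal(scale=0.5, size=200)
df_gc = pd.DataFrame({'y': y[1:], 'x': x[1:]})  # drop the first, wrapped-around point

# Test whether past values of x help predict y (column order: [effect, candidate cause])
results = grangercausalitytests(df_gc[['y', 'x']], maxlag=2)
Small p-values for the reported F-tests suggest that past values of x improve predictions of y; as with all such tests, this supports rather than proves a causal claim.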
Example: The Impact of Study Hours on Exam Scores
Data
Consider a dataset with the following variables:
 Study Hours: Number of hours a student studies per week.
 Exam Scores: Scores obtained by students in an exam.
 Previous Academic Performance: A measure of students' performance in prior exams.
Step-by-Step Analysis
1. Initial Correlation Analysis
o Examine the correlation between Study Hours and Exam Scores.
python
import pandas as pd
# Sample data
data = pd.DataFrame({
'Study Hours': [10, 15, 20, 25, 30, 35, 40],
'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
'Previous Academic Performance': [45, 50, 55, 60, 65, 70, 75]
})
# Calculate correlation
correlation = data['Study Hours'].corr(data['Exam Scores'])
print(f"Correlation between Study Hours and Exam Scores: {correlation}")
2. Controlling for Confounding Variables
o Use multiple regression to control for Previous Academic Performance.
python
import statsmodels.api as sm

# Prepare the data for regression


X = data[['Study Hours', 'Previous Academic Performance']]
X = sm.add_constant(X) # Adds a constant term to the predictor
y = data['Exam Scores']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Print the regression results
print(model.summary())
3. Assessing Causality
o If Study Hours remain a significant predictor of Exam Scores after controlling for
Previous Academic Performance, this strengthens the causal argument.
python
# Interpretation of regression results
if model.pvalues['Study Hours'] < 0.05:
    print("Study Hours have a significant impact on Exam Scores, controlling for Previous Academic Performance.")
else:
    print("Study Hours do not have a significant impact on Exam Scores after controlling for Previous Academic Performance.")
4. Using Instrumental Variables
o Identify an instrumental variable (IV) that affects Study Hours but not directly
Exam Scores, to further validate causality.
python
# Example: Assuming 'Parental Encouragement' as an instrumental variable
data['Parental Encouragement'] = [3, 4, 2, 5, 3, 4, 5]
# First stage regression: Study Hours on Parental Encouragement
first_stage = sm.OLS(data['Study Hours'], sm.add_constant(data['Parental Encouragement'])).fit()
# Predicted Study Hours from the first stage
data['Predicted Study Hours'] = first_stage.predict(sm.add_constant(data['Parental Encouragement']))
# Second stage regression: Exam Scores on Predicted Study Hours
iv_model = sm.OLS(data['Exam Scores'], sm.add_constant(data['Predicted Study Hours'])).fit()
# Print the IV regression results
print(iv_model.summary())
Interpretation
 Correlation Analysis: The initial correlation provides a preliminary indication of a
relationship between Study Hours and Exam Scores.
 Multiple Regression: Controlling for Previous Academic Performance helps isolate the
effect of Study Hours on Exam Scores.
 Instrumental Variables: Using an IV like Parental Encouragement helps address
endogeneity issues, providing a more robust causal inference.
Causal explanations are a crucial component of EDA, enabling researchers and practitioners
to move beyond mere associations and uncover the true cause-and-effect relationships in their
data. By employing a combination of experimental designs, statistical controls, and advanced
inference techniques, we can gain deeper insights and make more informed decisions based on
data.
Three-Variable Contingency Tables and Beyond

Three-variable contingency tables, also known as three-way tables, extend the concept of two-
way tables to analyze the relationship between three categorical variables simultaneously. These
tables allow researchers to explore more complex interactions and dependencies among
variables, providing a deeper understanding of the data.
Importance of Three-Variable Contingency Tables


1. Exploring Interactions: Understanding how the relationship between two variables
changes when considering a third variable.
2. Identifying Conditional Dependencies: Determining if the association between two
variables depends on the level of a third variable.
3. Controlling for Confounders: Adjusting for a third variable to see if the association
between the primary variables is genuine or confounded.
Constructing a Three-Variable Contingency Table
To construct a three-variable contingency table, we categorize data based on three variables.
Each cell in the table represents a combination of these variables and contains the count or
frequency of observations.
Example: Analyzing the Relationship Between Gender, Smoking Status, and Exercise Frequency
Consider a dataset with the following variables:
 Gender: Male, Female
 Smoking Status: Smoker, Non-Smoker
 Exercise Frequency: Regular, Occasional, Rarely
Step-by-Step Construction
1. Create the Data
o Prepare the dataset with the three variables.
python
import pandas as pd
# Sample data
data = pd.DataFrame({
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'Smoking Status': ['Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker',
'Smoker', 'Non-Smoker', 'Smoker'],
'Exercise Frequency': ['Regular', 'Occasional', 'Rarely', 'Regular', 'Occasional', 'Rarely',
'Regular', 'Occasional']
})
print(data)
2. Create the Contingency Table
o Use the pd.crosstab function to create the three-variable contingency table.
python
# Create three-variable contingency table
contingency_table = pd.crosstab(index=[data['Gender'], data['Smoking Status']],
columns=data['Exercise Frequency'])
print(contingency_table)
3. Interpreting the Table
o Each cell represents the count of observations for a specific combination of
Gender, Smoking Status, and Exercise Frequency.
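To check whether the association between Smoking Status and Exercise Frequency holds within a single level of the third variable, a chi-square test can be applied to one stratum of the table. The sketch below uses scipy.stats.chi2_contingency, which is an assumption beyond the pandas code above; with only eight rows the test is purely illustrative.
python
from scipy.stats import chi2_contingency

# Restrict to one stratum of the third variable (Gender == 'Male')
male_data = data[data['Gender'] == 'Male']

# Two-way table of the remaining variables within that stratum
male_table = pd.crosstab(male_data['Smoking Status'], male_data['Exercise Frequency'])

# Chi-square test of independence within the stratum
chi2, p_value, dof, expected = chi2_contingency(male_table)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.3f}")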
Beyond Three-Variable Contingency Tables
While three-variable contingency tables provide valuable insights, they can become complex and
difficult to interpret as the number of variables increases. Therefore, additional techniques and
tools are often used to analyze higher-dimensional data.
1. Higher-Dimensional Tables: Extending to four or more variables, though interpretation
becomes increasingly challenging.
2. Multivariate Analysis: Using statistical techniques like logistic regression, factor
analysis, or cluster analysis to handle multiple variables simultaneously.
3. Data Visualization: Employing advanced visualization techniques such as heatmaps,
mosaic plots, or parallel coordinates to represent high-dimensional data.
Example: Higher-Dimensional Analysis
Suppose we add a fourth variable, Age Group (Youth, Adult, Senior), to the previous example.
We can create a four-variable contingency table or use logistic regression to analyze the data.
Step-by-Step Higher-Dimensional Analysis
1. Create the Data
o Add the fourth variable to the dataset.
python
data['Age Group'] = ['Youth', 'Adult', 'Senior', 'Youth', 'Adult', 'Senior', 'Youth', 'Adult']
print(data)
2. Higher-Dimensional Contingency Table
o Create a four-variable contingency table using pd.crosstab.
python
# Create four-variable contingency table
contingency_table_4d = pd.crosstab(index=[data['Gender'], data['Smoking Status'],
data['Age Group']], columns=data['Exercise Frequency'])
print(contingency_table_4d)
3. Multivariate Analysis with Logistic Regression
o Use logistic regression to analyze the relationship between Exercise Frequency
and the other variables.
python
import statsmodels.api as sm
# Encode categorical variables
data_encoded = pd.get_dummies(data, drop_first=True)
# Define the target and predictors
y = data_encoded['Exercise Frequency_Rarely'].astype(float)
# Drop the remaining Exercise Frequency dummies from the predictors
# (drop_first=True has already removed the 'Occasional' dummy)
X = data_encoded.drop(columns=['Exercise Frequency_Rarely', 'Exercise Frequency_Regular']).astype(float)
# Add constant term for the intercept
X = sm.add_constant(X)
# Note: with only eight observations this model is illustrative and may not converge
# Fit the logistic regression model
logit_model = sm.Logit(y, X).fit()
# Print the model summary
print(logit_model.summary())
Interpretation
 Contingency Tables: Provide a clear count of observations for each combination of
variables, useful for preliminary analysis.
 Logistic Regression: Offers a more nuanced understanding of the relationships,
especially when dealing with more than three variables. The model coefficients and p-
values indicate the significance and strength of these relationships.
Three-variable contingency tables are a powerful tool in EDA for exploring complex
interactions and dependencies among categorical variables. While they provide valuable insights,
higher-dimensional data often require advanced multivariate analysis and visualization
techniques to fully understand the relationships. Utilizing these methods allows for a deeper and
more comprehensive exploration of the data, leading to more accurate and actionable insights.

Fundamentals of Time Series Analysis

Objective:
● To understand the concept of time series analysis
● To know the different characteristics of time series data
● To identify different methods of time series analysis
● To have an overview of univariate and multivariate time series
What Is Time Series Analysis?
Time-series analysis is a method of analyzing a collection of data points over a period of time.
Instead of recording data points intermittently or randomly, time series analysts record data
points at consistent intervals over a set period of time.
While time-series data is information gathered over time, various types of information describe
how and when that information was gathered. For example:
 Time series data: It is a collection of observations on the values that a variable takes at
various points in time.
 Cross-sectional data: Data from one or more variables that were collected simultaneously.
 Pooled data: It is a combination of cross-sectional and time-series data.
The variable follows a probability distribution that describes which values Y_t can take and
with what probability those values occur.
Y_t = μ_t + ε_t
where each instance of Y_t is the result of the signal μ_t, and ε_t is the noise term.
Why Do We Need Time-Series Analysis?
Time series analysis has a range of applications in statistics, sales, economics, and many more
areas. The common point is the technique used to model the data over a given period of time.
The reasons for doing time series analysis are as follows:
 Features: Time series analysis can be used to track features like trend, seasonality, and
variability.
 Forecasting: Time series analysis can aid in the prediction of stock prices. It is used if you
would like to know if the price will rise or fall and how much it will rise or fall.
 Inferences: You can predict the value and draw inferences from data using Time series
analysis.
What is forecasting?
Forecasting is a technique for making predictions about the future by using historical
data as inputs and analyzing trends. However, forecasting does not tell the future definitively;
it only indicates probabilities, so the results should always be double-checked before making a
decision.
What is a time series analysis and what are the benefits?
A time series analysis focuses on a series of data points ordered in time. This is one of the
most widely used data science analyses and is applied in a variety of industries.

Time Series Analysis Example


Non-stationary data—that is, data that is constantly fluctuating over time or is affected by time—
is analyzed using time series analysis. Because currency and sales are always changing,
industries like finance, retail, and e-commerce frequently use time series analysis. Stock market
analysis, especially when combined with automated trading algorithms, is an excellent example
of time series analysis in action.
Time series analysis can be used in -
 Rainfall measurements
 Automated stock trading
 Industry forecast
 Temperature readings
 Sales forecasting
Consider an example of railway passenger data over a period of time.
On the X-axis, we have years, and on the Y-axis, you have the number of passengers.

The following observations can be derived from the given data.


1. Trend: Over time, an increasing or decreasing pattern has been observed. The total number of
passengers has risen over time.
2. Seasonality: Cyclic patterns are the ones that repeat after a certain interval of time. In the case
of the railway passenger, you can see a cyclic pattern with a high and low point that is visible
throughout the interval.
Time Series Analysis Types
Some of the types of time series analysis include:
 Classification: It identifies and assigns categories to the data.
 Curve Fitting: It plots data on a curve to investigate the relationships between variables in the
data.
 Descriptive Analysis: Patterns in time-series data, such as trends, cycles, and seasonal
variation, are identified.
 Explanative Analysis: It attempts to understand the data and the cause-and-effect relationships
within it.
 Segmentation: It splits the data into segments to reveal the source data's underlying
properties.

ARIMA
ARIMA is an acronym for Autoregressive Integrated Moving Average. The Box-Jenkins method
is another name for this method.
Now you will explore the ARIMA parameters in detail:
 Autoregressive Component: AR stands for autoregressive, and its order is denoted by p. When p
is 0, no autoregressive (lagged-value) terms are included; when p is 1, the model uses
autocorrelation up to one lag.
 Moving Average: The moving-average order is denoted by q. When q is 1, the model includes one
lagged forecast-error term.
 Integration: Integration is denoted by d. When the value of d is 0, the series is stationary.
When the value of d is 1, the series is not stationary, and you can make it stationary by taking
the difference.
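To make the role of d concrete, the following sketch checks stationarity with an Augmented Dickey-Fuller test and then differences the series once. The synthetic random-walk series and the usual 0.05 significance threshold are assumptions made for the example.
python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# A random walk is non-stationary, so one difference (d = 1) should be enough
np.random.seed(0)
series = pd.Series(np.cumsum(np.random.randn(200)))

# ADF test on the original series: a large p-value suggests non-stationarity
print("p-value before differencing:", adfuller(series)[1])

# First difference and re-test: a small p-value suggests the differenced series is stationary
print("p-value after differencing:", adfuller(series.diff().dropna())[1])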
Time series analysis is a statistical technique that deals with time-ordered data. The primary
goal is to understand the underlying structure and function that produced the observations and to
forecast future values based on historical data. Time series data are ubiquitous in various fields
such as finance, economics, environmental science, and many others.
Characteristics of Time Series Data
1. Trend: Long-term movement in the data. It represents the general direction in which the
data is moving over time.
2. Seasonality: Regular, repeating patterns or cycles in data at fixed intervals.
3. Cyclic Patterns: Irregular fluctuations that are not of fixed period but occur over long
time frames.
4. Noise: Random variations that are unexplained by the model.
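These components can be made concrete with a small synthetic series. The sketch below builds an artificial daily signal from an assumed linear trend, a weekly seasonal cycle, and random noise, purely for illustration.
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Artificial daily series: linear trend + weekly seasonality + random noise
dates = pd.date_range(start='2021-01-01', periods=180, freq='D')
trend = np.linspace(0, 10, 180)                            # long-term upward movement
seasonality = 3 * np.sin(2 * np.pi * np.arange(180) / 7)   # repeating weekly cycle
noise = np.random.normal(0, 1, 180)                        # unexplained random variation
synthetic = pd.Series(trend + seasonality + noise, index=dates)

synthetic.plot(title='Synthetic Series: Trend + Seasonality + Noise')
plt.show()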
Steps in Time Series Analysis
1. Data Collection: Gather the data in a time-ordered sequence.
2. Data Cleaning: Handle missing values, outliers, and anomalies.
3. Data Visualization: Plot the data to identify patterns such as trends and seasonality.
4. Decomposition: Separate the time series into trend, seasonal, and residual components.
5. Modeling: Fit statistical models to the data for understanding and forecasting.
6. Validation: Evaluate the model's performance using validation techniques.
7. Forecasting: Predict future values using the developed model.
Decomposition of Time Series
Decomposition involves breaking down a time series into its constituent components:
 Trend Component (T): Represents the long-term progression of the series.
 Seasonal Component (S): Captures the repeating short-term cycle.
 Residual Component (R): The remaining part after removing trend and seasonality.
Example: Decomposition using Python
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Generate sample data
np.random.seed(0)
date_rng = pd.date_range(start='1/1/2020', end='1/1/2022', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, len(df))) + np.random.normal(0, 0.1, len(df))
df.set_index('date', inplace=True)
# Decompose the time series
result = seasonal_decompose(df['data'], model='additive')
# Plot the decomposition
result.plot()
plt.show()
Common Time Series Models
1. Autoregressive (AR) Model: Predicts future values based on past values.
2. Moving Average (MA) Model: Predicts future values based on past errors.
3. ARIMA Model: Combines AR and MA models with differencing to make the series
stationary.
4. Seasonal ARIMA (SARIMA) Model: Extends ARIMA to capture seasonality.
5. Exponential Smoothing: Applies weighted averages of past observations, giving more
weight to recent observations.
6. Prophet: Developed by Facebook, suitable for forecasting time series with strong
seasonal effects and multiple seasonality.
Example: ARIMA Model using Python
python
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model
model = ARIMA(df['data'], order=(1, 1, 1))
model_fit = model.fit()
# Summary of the model
print(model_fit.summary())
# Forecast future values
forecast = model_fit.forecast(steps=10)
print(forecast)
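The exponential smoothing model from the list above can be fitted in a similar way. The sketch below applies Holt-Winters exponential smoothing from statsmodels to the same df['data'] series used for decomposition; the additive trend, additive seasonality, and the weekly seasonal period are illustrative assumptions.
python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit Holt-Winters exponential smoothing (additive trend and weekly seasonality assumed)
hw_model = ExponentialSmoothing(df['data'], trend='add', seasonal='add', seasonal_periods=7).fit()

# Forecast the next 10 periods
hw_forecast = hw_model.forecast(steps=10)
print(hw_forecast)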
Evaluating Time Series Models
1. Mean Absolute Error (MAE): Average of absolute errors.
2. Mean Squared Error (MSE): Average of squared errors.
3. Root Mean Squared Error (RMSE): Square root of MSE.
4. Mean Absolute Percentage Error (MAPE): Average of absolute percentage errors.
Example: Evaluating Model using Python
python
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Hold out the last 10 observations and refit the model on the rest, so the
# forecast is compared against values the model has not already seen
train = df['data'].iloc[:-10]
actual = df['data'].iloc[-10:]
model_fit = ARIMA(train, order=(1, 1, 1)).fit()
predicted = model_fit.forecast(steps=10)
# Calculate MAE and RMSE
mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"MAE: {mae}, RMSE: {rmse}")
Time series analysis is a powerful tool for understanding and forecasting data that varies
over time. By decomposing the data, fitting appropriate models, and evaluating their
performance, we can make accurate predictions and uncover insights that inform decision-
making in various domains.
Data Cleaning
Data cleaning is a crucial step in time series analysis to ensure the accuracy and reliability of the
results. This involves:
1. Handling Missing Values: Imputing or interpolating missing data points to maintain
continuity.
o Imputation: Filling missing values with a specific value like mean, median, or
mode.
o Interpolation: Estimating missing values based on surrounding data points.
2. Outlier Detection and Removal: Identifying and treating anomalies or unusual data
points that can skew the analysis.
3. Smoothing: Reducing noise to highlight underlying patterns using methods like moving
averages.
Example: Data Cleaning using Python
python
import pandas as pd
import numpy as np
# Sample time series data with missing values and outliers
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = pd.Series([1, np.nan, 3, 4, 5, np.nan, 7, 100, 9, 10], index=date_rng)
# Handling missing values
data_filled = data.interpolate()
# Removing outliers
data_filled[data_filled > 20] = np.nan # Assume values greater than 20 are outliers
data_cleaned = data_filled.interpolate()
print(data_cleaned)
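Smoothing, the third cleaning step listed above, is not shown in the example; a minimal sketch continuing from data_cleaned, with an arbitrarily chosen 3-day centered window, is:
python
# Smooth the cleaned series with a centered 3-day moving average
data_smoothed = data_cleaned.rolling(window=3, center=True).mean()
print(data_smoothed)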
Time-Based Indexing
Time-based indexing involves setting the index of a DataFrame or Series to a datetime object,
allowing for more convenient and efficient time series operations.
Example: Time-Based Indexing using Python
python
# Create DataFrame with time-based indexing
df = pd.DataFrame({
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}, index=pd.date_range(start='2020-01-01', end='2020-01-10'))
print(df)
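One benefit of a datetime index is label-based selection by date. A brief sketch on the df just created (the particular dates and weekday filter are arbitrary choices) is:
python
# Select a sub-period directly by date labels
print(df.loc['2020-01-03':'2020-01-06'])

# Filter using datetime attributes of the index, e.g. Mondays only
print(df[df.index.dayofweek == 0])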
Visualizing in Time Series Analysis
Visualization is a crucial aspect of time series analysis as it helps to understand the data's
structure, trends, patterns, and anomalies over time. Effective visualizations can reveal insights
that might not be apparent through statistical analysis alone.
Here are some common visualization techniques used in time series analysis:
1. Line Plots
 Description: Line plots are the most basic and commonly used method for visualizing
time series data. They plot data points sequentially over time with lines connecting the
points.
 Usage: Ideal for displaying trends and patterns over time.
 Example:
python
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
date_range = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = pd.Series(range(100), index=date_range)
# Plotting
data.plot()
plt.title('Line Plot of Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
2. Seasonal Plots
 Description: Seasonal plots help to visualize seasonal patterns by plotting data for each
period (e.g., month, year) on the same graph.
 Usage: Useful for identifying and comparing seasonal patterns across different periods.
 Example:
python
import seaborn as sns
# Assuming `data` is a pandas DataFrame with a datetime index and a 'value' column
data = pd.DataFrame({'date': date_range, 'value': range(100)})
data['month'] = data['date'].dt.month
# Seasonal plot
sns.lineplot(data=data, x='month', y='value')
plt.title('Seasonal Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
3. Autocorrelation Plots (ACF)
 Description: Autocorrelation plots display the correlation of the time series with its own
past values (lags).
 Usage: Useful for identifying the degree of correlation between time steps and for
checking the randomness of data.
 Example:
python
from statsmodels.graphics.tsaplots import plot_acf
# Plotting ACF
plot_acf(data['value'])
plt.title('Autocorrelation Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
4. Heatmaps
 Description: Heatmaps represent data in a matrix form, where values are represented by
different colors. Time series heatmaps can show patterns across two dimensions, such as
time of day and day of week.
 Usage: Useful for visualizing seasonal and cyclical patterns across multiple dimensions.
 Example:
python
# Note: a meaningful hour dimension requires sub-daily timestamps; with the
# daily sample data above, every observation falls in hour 0
data['day_of_week'] = data['date'].dt.dayofweek
data['hour'] = data['date'].dt.hour
# Creating a pivot table of average value by day of week and hour
heatmap_data = data.pivot_table(index='day_of_week', columns='hour', values='value')
# Plotting heatmap
sns.heatmap(heatmap_data, cmap='coolwarm')
plt.title('Heatmap of Time Series Data')
plt.xlabel('Hour')
plt.ylabel('Day of Week')
plt.show()
5. Box Plots
 Description: Box plots (or whisker plots) display the distribution of data based on a five-
number summary: minimum, first quartile (Q1), median, third quartile (Q3), and
maximum.
 Usage: Useful for comparing distributions across different time periods.
 Example:
python
sns.boxplot(x='month', y='value', data=data)
plt.title('Box Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
6. Decomposition Plots
 Description: Decomposition plots break down a time series into its component parts:
trend, seasonal, and residual components.
 Usage: Useful for understanding the underlying structure of the time series data.
 Example:
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series
decomposition = seasonal_decompose(data['value'], model='additive', period=12)
# Plotting decomposition
decomposition.plot()
plt.show()
Importance of Visualization in Time Series Analysis
 Trend Identification: Visualizations help identify long-term trends in the data.
 Seasonal Patterns: They reveal recurring patterns and seasonality.
 Anomaly Detection: Visualizing data can highlight outliers or anomalies.
 Data Cleaning: Helps in identifying missing values and incorrect data points.
 Model Selection: Assists in choosing appropriate models for forecasting and analysis
based on observed patterns.
Effective visualization is essential for gaining insights and making informed decisions based on
time series data. By utilizing these visualization techniques, analysts can better understand and
interpret their data.
Grouping
Grouping in Time Series Analysis (TSA)
Grouping in Time Series Analysis is a powerful technique used to aggregate data based on
specific time intervals or other criteria. This method helps in summarizing the data, identifying
patterns, and gaining insights over different periods.
Key Concepts of Grouping in TSA
1. Time-based Grouping:
o Grouping data based on time intervals such as hourly, daily, weekly, monthly, or
yearly.
o Useful for identifying trends and seasonal patterns over specific periods.
2. Custom Grouping:
o Grouping based on custom criteria such as weekdays vs. weekends, business
hours vs. non-business hours, etc.
o Helps in understanding the impact of external factors on the time series data.
Grouping with Pandas
Pandas is a powerful library for handling and analyzing time series data. The groupby method in
Pandas is commonly used for grouping data. Below are examples illustrating various grouping
techniques.
Example 1: Time-based Grouping
Hourly Grouping
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Creating a sample time series data
date_range = pd.date_range(start='2023-01-01', periods=1000, freq='H')
data = pd.Series(np.random.randn(1000), index=date_range)
# Grouping by hour
hourly_group = data.groupby(data.index.hour).mean()
# Plotting the grouped data
hourly_group.plot(kind='bar')
plt.title('Average Value by Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Value')
plt.show()
Monthly Grouping
python
# Grouping by month
monthly_group = data.groupby(data.index.month).sum()
# Plotting the grouped data
monthly_group.plot(kind='bar')
plt.title('Total Value by Month')
plt.xlabel('Month')
plt.ylabel('Total Value')
plt.show()
Example 2: Custom Grouping
Weekday vs. Weekend
python
# Adding a column to indicate if the day is a weekday or weekend
data_frame = data.to_frame(name='value')
data_frame['is_weekend'] = data_frame.index.weekday >= 5
# Grouping by weekday vs. weekend
weekend_group = data_frame.groupby('is_weekend')['value'].mean()
# Plotting the grouped data
weekend_group.plot(kind='bar')
plt.title('Average Value: Weekday vs. Weekend')
plt.xlabel('Is Weekend')
plt.ylabel('Average Value')
plt.xticks([0, 1], ['Weekday', 'Weekend'])
plt.show()
Business Hours vs. Non-Business Hours
python
# Adding a column to indicate if the time is within business hours (9am-5pm)
data_frame['is_business_hours'] = data_frame.index.hour.isin(range(9, 17))
# Grouping by business hours vs. non-business hours
business_hours_group = data_frame.groupby('is_business_hours')['value'].mean()
# Plotting the grouped data
business_hours_group.plot(kind='bar')
plt.title('Average Value: Business Hours vs. Non-Business Hours')
plt.xlabel('Is Business Hours')
plt.ylabel('Average Value')
plt.xticks([0, 1], ['Non-Business Hours', 'Business Hours'])
plt.show()
Benefits of Grouping in TSA
1. Data Summarization:
o Grouping helps in summarizing large datasets into meaningful statistics such as
mean, sum, count, etc.
o It simplifies the data, making it easier to analyze and interpret.
2. Pattern Identification:
o By grouping data over different time intervals, one can identify trends and
seasonal patterns.
o It helps in understanding the behavior of the time series data.
3. Anomaly Detection:
o Grouping data can reveal anomalies or outliers that deviate from the expected
pattern.
o This is crucial for identifying unusual events or behaviors in the data.
Grouping is an essential technique in time series analysis that enables better data
management and analysis. Whether it’s time-based grouping or custom grouping, using tools like
Pandas makes it straightforward to implement and visualize the results. This approach provides
valuable insights into the data, aiding in more informed decision-making.
Resampling
Resampling is a powerful technique in time series analysis that involves changing the frequency
of your time series data. This can include:
 Downsampling: Reducing the frequency of the data (e.g., converting daily data to
monthly data).
 Upsampling: Increasing the frequency of the data (e.g., converting monthly data to daily
data).
Here's an example of how to perform resampling in pandas:
Example Dataset
Let's create a sample time series dataset:
python
import pandas as pd
import numpy as np
# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
# Create a DataFrame with this date range
df = pd.DataFrame(date_rng, columns=['date'])
# Add a random value column
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Set the date column as the index
df.set_index('date', inplace=True)
print(df)
Downsampling
To downsample the data from daily to a lower frequency (e.g., every three days), use the resample
method:
python
# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print(df_downsampled)
This will output the mean value of the data for every 3 days.
Upsampling
To upsample the data from daily to a higher frequency (e.g., hourly), use the resample method
and then fill the resulting NaN values:
python
# Upsampling to hourly frequency, filling NaN values using forward fill method
df_upsampled = df.resample('H').ffill()

print(df_upsampled)
This will fill the NaN values by propagating the last valid observation forward.
Example
Here is the complete example:
python
import pandas as pd
import numpy as np
# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
# Create a DataFrame with this date range
df = pd.DataFrame(date_rng, columns=['date'])
# Add a random value column
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Set the date column as the index
df.set_index('date', inplace=True)
print("Original DataFrame:")
print(df)
# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print("\nDownsampled DataFrame (3-day frequency, mean):")
print(df_downsampled)
# Upsampling to hourly frequency, filling NaN values using forward fill method
df_upsampled = df.resample('H').ffill()
print("\nUpsampled DataFrame (hourly frequency, forward fill):")
print(df_upsampled)
Output
Original DataFrame:
2020-01-01 81
2020-01-02 43
2020-01-03 23
2020-01-04 76
2020-01-05 21
2020-01-06 34
2020-01-07 84
2020-01-08 45
2020-01-09 15
2020-01-10 57
Downsampled DataFrame (3-day frequency, mean):
2020-01-01 49.000000
2020-01-04 43.666667
2020-01-07 48.000000
2020-01-10 57.000000
Upsampled DataFrame (hourly frequency, forward fill):
2020-01-01 00:00:00 81
2020-01-01 01:00:00 81
2020-01-01 02:00:00 81
2020-01-01 03:00:00 81
2020-01-01 04:00:00 81
...
2020-01-09 20:00:00 15
2020-01-09 21:00:00 15
2020-01-09 22:00:00 15
2020-01-09 23:00:00 15
2020-01-10 00:00:00 57
[217 rows x 1 columns]
These examples show how you can resample your time series data to different frequencies using
pandas, providing flexibility in your data analysis.
Unit V Completed
