Sample Sales Data Analysis
Submission Date:
SURYANSHU KUMAR
2023000776
Table of Contents
1. Project Title Page
2. Table of Contents
3. Introduction
4. Requirements
5. Code Structure
6. Challenges & Solutions
7. Conclusion & Future Work
8. References
Introduction
Objectives
The primary objectives of this analysis are:
To perform descriptive, bivariate, and multivariate statistical analyses on the Sample Sales Data.
To derive insights into sales patterns, customer behavior, and shipping performance.
To identify factors influencing sales and customer satisfaction.
Scope and Limitations
Scope: The analysis applies descriptive statistics, hypothesis testing, correlation analysis, regression analysis, and principal component analysis (PCA).
Limitations: Missing values and outliers in the dataset may affect the results. The findings apply only to the data provided and may not generalize beyond it.
Requirements
Software & Libraries
Python 3.x
Libraries:
o pandas
o numpy
o matplotlib
o seaborn
o scipy
o statsmodels
o scikit-learn
Hardware Requirements
Standard computing hardware capable of running
Python and the aforementioned libraries.
Installation Instructions
To install the required libraries, execute:
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn
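After installation, it can help to confirm that each library is actually available. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of the original project.

```python
from importlib import metadata

# Distribution names as passed to pip in the install command above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "scipy", "statsmodels", "scikit-learn"]


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```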
Code Structure
a. Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.formula.api import ols
from sklearn.decomposition import PCA
b. Inputs (Data)
Dataset: Sample Sales Data
Source: Kaggle Dataset
c. Process (Methods)
Data Loading and Cleaning
# Load the dataset
df = pd.read_csv('sample_sales_data.csv')
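The cleaning step is described later in this report as removing incomplete records. A minimal sketch of that step is given here on a small synthetic frame; the values are illustrative only, the helper name drop_incomplete is hypothetical, and the column names Sales, Quantity, and Discount are assumed from the code used elsewhere in this report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales data (illustrative values, not real records).
raw = pd.DataFrame({
    "Sales":    [120.0, np.nan, 75.5, 310.0, 42.0],
    "Quantity": [2, 1, 3, np.nan, 5],
    "Discount": [0.0, 0.1, 0.2, 0.0, 0.3],
})


def drop_incomplete(df, required=("Sales", "Quantity", "Discount")):
    """Remove rows that have a missing value in any of the required columns."""
    return df.dropna(subset=list(required)).reset_index(drop=True)


# Count missing values per column before cleaning.
print(raw.isna().sum())

cleaned = drop_incomplete(raw)
print(f"rows before: {len(raw)}, rows after: {len(cleaned)}")
```

Dropping rows is the simplest option, but it discards information. Whether it is acceptable depends on how many records are affected.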
1. Descriptive/Univariate Analysis
Summaries:
# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())
Plots:
# Histogram
df['Sales'].hist()
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()
# Boxplot
sns.boxplot(x=df['Sales'])
plt.title('Sales Boxplot')
plt.show()
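The points drawn beyond the whiskers of a boxplot follow, by default, the 1.5 x IQR rule. The sketch below lists such values explicitly so that potential outliers can be inspected. The helper name iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]


# Illustrative values only; one entry is far larger than the rest.
sales = pd.Series([100, 120, 130, 150, 160, 180, 200, 2500], name="Sales")
print(iqr_outliers(sales))
```

On the real data this could be called as iqr_outliers(df['Sales']), assuming the Sales column is numeric.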
# Correlation heatmap
# Coerce every column to numeric; non-numeric entries become NaN
df_numeric = df.apply(pd.to_numeric, errors='coerce')
# Drop columns that contain no numeric values at all
df_numeric = df_numeric.dropna(axis=1, how='all')
corr_matrix = df_numeric.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
            fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
Normality Tests:
# Shapiro-Wilk test (null hypothesis: Sales is normally distributed)
stat, p = stats.shapiro(df['Sales'].dropna())
print('Statistics=%.3f, p=%.3f' % (stat, p))
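To read the Shapiro-Wilk output, the p-value is compared with a significance level. The sketch below wraps that decision in a hypothetical helper, shapiro_decision, and runs it on synthetic right-skewed data. The 5% level and the lognormal sample are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def shapiro_decision(values, alpha=0.05):
    """Run Shapiro-Wilk and report whether normality is rejected at level alpha."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # the test cannot handle missing values
    stat, p = stats.shapiro(values)
    return stat, p, bool(p < alpha)


# Synthetic, strongly right-skewed sample (illustrative only).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=5.0, sigma=1.0, size=300)

stat, p, reject = shapiro_decision(skewed)
print(f"W={stat:.3f}, p={p:.3g}, reject normality: {reject}")
```

A rejected normality test does not by itself invalidate a t-test on a large sample, but it is a reason to inspect the distribution before relying on methods that assume normality.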
Hypothesis Tests:
# One-sample t-test (null hypothesis: the mean of Sales equals 500)
t_stat, p_val = stats.ttest_1samp(df['Sales'].dropna(), popmean=500)
print('t-statistic=%.3f, p-value=%.3f' % (t_stat, p_val))
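The one-sample t-statistic is t = (mean - popmean) / (s / sqrt(n)), where s is the sample standard deviation and n the sample size. The sketch below computes it by hand and compares the result with scipy. The helper name one_sample_t and the sample values are hypothetical.

```python
import numpy as np
from scipy import stats


def one_sample_t(values, popmean):
    """Two-sided one-sample t-test computed directly from the formula."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sd = values.std(ddof=1)                # sample standard deviation
    t = (mean - popmean) / (sd / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value
    return t, p


# Illustrative values only.
sample = [520.0, 610.0, 480.0, 700.0, 555.0, 630.0]

t_manual, p_manual = one_sample_t(sample, popmean=500)
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=500)
print(f"manual: t={t_manual:.3f}, p={p_manual:.3f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.3f}")
```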
2. Bivariate Analysis
Correlation:
# Correlation matrix (numeric columns only; non-numeric columns raise an error in pandas 2.x)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()
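A correlation matrix shows the strength of each pairwise relationship but not its significance. For a single pair of variables, scipy's pearsonr returns both the coefficient and a p-value. The sketch below uses synthetic data; the relationship between Quantity and Sales is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data in which Sales rises with Quantity plus random noise.
rng = np.random.default_rng(42)
quantity = rng.integers(1, 20, size=200)
sales = 50.0 * quantity + rng.normal(0, 100, size=200)
demo = pd.DataFrame({"Quantity": quantity, "Sales": sales})

# Pearson correlation coefficient and two-sided p-value.
r, p = stats.pearsonr(demo["Quantity"], demo["Sales"])
print(f"r={r:.3f}, p={p:.3g}")

# The coefficient matches the corresponding entry of the pandas correlation matrix.
print(demo.corr().loc["Quantity", "Sales"])
```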
3. Multivariate Analysis
Multiple Regression:
# Multiple regression
model = ols('Sales ~ Quantity + Discount', data=df).fit()
print(model.summary())
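Principal Component Analysis:
PCA is listed in the scope of this analysis, and the PCA class is among the imports. The following is a minimal sketch of how it might be applied, not the project's actual code. It assumes the numeric columns Sales, Quantity, and Discount, uses synthetic data, and standardizes the columns first with StandardScaler, which is not among the original imports.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric sales columns (illustrative only).
rng = np.random.default_rng(1)
n = 150
quantity = rng.integers(1, 20, size=n).astype(float)
discount = rng.uniform(0.0, 0.3, size=n)
sales = 40.0 * quantity * (1.0 - discount) + rng.normal(0, 30, size=n)
demo = pd.DataFrame({"Sales": sales, "Quantity": quantity, "Discount": discount})

# Standardize so that each column contributes on the same scale.
scaled = StandardScaler().fit_transform(demo)

# Project onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Share of total variance captured by each component.
print(pca.explained_variance_ratio_)

# Loadings: the weight of each original variable in each component.
loadings = pd.DataFrame(pca.components_.T, index=demo.columns,
                        columns=["PC1", "PC2"])
print(loadings)
```

Standardizing matters here because Sales is measured on a much larger scale than Discount and would otherwise dominate the first component.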
Visualization
Statistical Test Results:
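Shapiro-Wilk test: Statistics=0.927, p=0.000 (p < 0.001, so normality of Sales is rejected at the 5% level)
One-sample t-test: t-statistic=20.791, p-value=0.000 (p < 0.001, so the mean of Sales differs significantly from 500; the positive t-statistic places it above 500)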
Regression Analysis:
PCA Results:
EFA Results:
Challenges & Solutions
Challenges
Data Quality: Missing values and potential outliers.
Assumptions: Ensuring that the assumptions of the statistical tests are met.
Solutions
Data Cleaning: Handled missing values by removing incomplete records.
Validation: Used normality tests and visualizations to check the assumptions of the statistical tests.
References
Kaggle Dataset: Sample Sales Data
Python Libraries Documentation:
o pandas
o numpy
o matplotlib
o seaborn
o scipy
o statsmodels
o scikit-learn