
Data Analysis

The document outlines a step-by-step guide for performing data analysis using Python libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn on a sales dataset. It covers essential tasks including data loading, exploration, cleaning, visualization, feature engineering, and building a predictive model for Total_Sales. The example demonstrates practical applications of data manipulation, visualization techniques, and machine learning model evaluation.

Uploaded by

Messih Grmay
Copyright
© All Rights Reserved

Data Analysis

Pandas, together with Matplotlib, Seaborn, and Scikit-learn, is a common toolkit for loading,
analyzing, and visualizing datasets in Python. Below is a step-by-step example of how to
perform data analysis with these libraries.

Let's walk through an example of performing data analysis on a CSV dataset that contains
information about customer sales transactions.

Step 1: Install Necessary Libraries


bash
pip install pandas matplotlib seaborn scikit-learn

Step 2: Import Libraries


python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Step 3: Load the Dataset

We'll load a sample dataset into Pandas using the read_csv method. Assume the dataset is a
CSV file named sales_data.csv, with columns like Date, Product, Price, Quantity,
Total_Sales, and Customer_ID.

python
# Load dataset
df = pd.read_csv('sales_data.csv')

# Display first few rows of the dataset
print(df.head())

Sample Data (sales_data.csv):

Date        Product  Price  Quantity  Total_Sales  Customer_ID
2021-01-01  Widget   25     2         50           101
2021-01-02  Gadget   15     3         45           102
2021-01-03  Widget   25     5         125          103
2021-01-04  Widget   25     3         75           101
2021-01-05  Gadget   15     4         60           102
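To follow along without the file on disk, the five sample rows can be loaded from an in-memory string. This is only a minimal sketch: the inline CSV text mirrors the sample table, the comma-separated layout is an assumption about the real file, and parse_dates is used so that Date arrives as a datetime column.

```python
import io

import pandas as pd

# The five sample rows written out as CSV text.
# (Assumption: the real sales_data.csv uses the same comma-separated layout.)
csv_text = """Date,Product,Price,Quantity,Total_Sales,Customer_ID
2021-01-01,Widget,25,2,50,101
2021-01-02,Gadget,15,3,45,102
2021-01-03,Widget,25,5,125,103
2021-01-04,Widget,25,3,75,101
2021-01-05,Gadget,15,4,60,102
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.head())
print(df.dtypes)
```

Parsing the dates at load time is an alternative to converting the column afterwards with pd.to_datetime; either approach should give the same result for ISO-formatted dates like these.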

Step 4: Basic Data Exploration

Before starting analysis, it’s important to explore and clean the data.

python
# Data summary and info
df.info()             # Check data types and non-null counts (prints directly)
print(df.describe())  # Get summary statistics

# Check for missing values
print(df.isnull().sum())

# Convert 'Date' column to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Remove duplicate rows
df.drop_duplicates(inplace=True)
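Exploration can also include simple consistency checks. In the sample rows, Total_Sales equals Price multiplied by Quantity; assuming that relationship is meant to hold for the whole file, the sketch below counts duplicates before dropping them and flags rows where the total disagrees. The rows here are hypothetical and were made up to trigger both checks.

```python
import pandas as pd

# Hypothetical rows: the last one repeats the first, and the third has an
# inconsistent total (25 * 5 is 125, not 120).
df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01"],
    "Product": ["Widget", "Gadget", "Widget", "Widget"],
    "Price": [25, 15, 25, 25],
    "Quantity": [2, 3, 5, 2],
    "Total_Sales": [50, 45, 120, 50],
    "Customer_ID": [101, 102, 103, 101],
})

# Count exact duplicate rows before dropping them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Flag rows where Total_Sales disagrees with Price * Quantity.
expected_total = df["Price"] * df["Quantity"]
mismatched = df[df["Total_Sales"] != expected_total]

print(f"Duplicate rows removed: {n_duplicates}")
print(mismatched)
```

Whether a mismatch is an error depends on the data: discounts or taxes could legitimately make the total differ, so flagged rows are candidates for review rather than automatic removal.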

Step 5: Data Cleaning (if necessary)

If the dataset contains missing values, we can handle them as follows:

python
# Fill missing values (if any); assign the result back to the column
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].mean())

# Drop rows with missing target variable (e.g., 'Total_Sales')
df.dropna(subset=['Total_Sales'], inplace=True)
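The effect of these two cleaning steps is easier to see on a small frame with known gaps. The sketch below uses hypothetical data and fills Quantity with the median rather than the mean; the median is swapped in here only as an illustration of an alternative that is less sensitive to outliers, not as the method this guide prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Quantity and one missing Total_Sales.
df = pd.DataFrame({
    "Price": [25, 15, 25, 25, 15],
    "Quantity": [2, np.nan, 5, 3, 4],
    "Total_Sales": [50, 45, 125, np.nan, 60],
})

# Report how many values are missing per column before changing anything.
print(df.isnull().sum())

# Fill the missing Quantity with the column median and assign the result back.
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())

# Rows without a target value cannot be used for training, so drop them.
df = df.dropna(subset=["Total_Sales"]).reset_index(drop=True)

print(df)
```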

Step 6: Data Visualization

Data visualization helps to better understand trends, relationships, and distributions in the
dataset.

Example 1: Sales Distribution by Product

python
# Bar plot showing total sales for each product
product_sales = df.groupby('Product')['Total_Sales'].sum().sort_values()
product_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.show()

Example 2: Scatter Plot for Price vs. Total Sales

python
# Scatter plot to analyze the relationship between Price and Total Sales
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Price', y='Total_Sales', data=df)
plt.title('Price vs Total Sales')
plt.show()

Example 3: Sales Trends Over Time

python
# Line plot to show sales trends over time
df_grouped = df.groupby('Date')['Total_Sales'].sum()
df_grouped.plot(kind='line', figsize=(10, 6), color='green')
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.show()
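When the plots need to go into a report or run on a machine without a display, they can be drawn side by side and saved instead of shown. The sketch below is one way to do that, not part of the original walkthrough: the Agg backend, the two-panel layout, and the in-memory buffer (standing in for a file path) are all choices made for this illustration, and the data is the sample table rebuilt by hand.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, selected before pyplot is imported
import matplotlib.pyplot as plt
import pandas as pd

# The sample rows, rebuilt by hand so the example runs on its own.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-04", "2021-01-05"]),
    "Product": ["Widget", "Gadget", "Widget", "Widget", "Gadget"],
    "Total_Sales": [50, 45, 125, 75, 60],
})

product_sales = df.groupby("Product")["Total_Sales"].sum().sort_values()
daily_sales = df.groupby("Date")["Total_Sales"].sum()

# Draw the bar chart and the trend line on two axes of one figure.
fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(12, 4))

product_sales.plot(kind="bar", color="skyblue", ax=ax_bar)
ax_bar.set_title("Total Sales by Product")
ax_bar.set_xlabel("Product")
ax_bar.set_ylabel("Total Sales")

daily_sales.plot(kind="line", color="green", ax=ax_line)
ax_line.set_title("Sales Trend Over Time")
ax_line.set_xlabel("Date")
ax_line.set_ylabel("Total Sales")

fig.tight_layout()

# Save as PNG; pass a file name instead of the buffer to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```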

Step 7: Feature Engineering

To create new features or variables for predictive models:

python
# Extract year and month from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Feature engineering: Calculate profit assuming a 30% profit margin
df['Profit'] = df['Total_Sales'] * 0.30
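Date and customer columns can yield further features in the same spirit. The sketch below adds the day of the week and a running total of each customer's spending; the column names Day_Of_Week, Day_Name, and Customer_Cumulative_Sales are hypothetical, chosen for this illustration, and the data is a hand-built subset of the sample columns.

```python
import pandas as pd

# The relevant sample columns, rebuilt by hand so the example runs on its own.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-04", "2021-01-05"]),
    "Total_Sales": [50, 45, 125, 75, 60],
    "Customer_ID": [101, 102, 103, 101, 102],
})

# Day of the week as a number (Monday=0 ... Sunday=6) and as a name.
df["Day_Of_Week"] = df["Date"].dt.dayofweek
df["Day_Name"] = df["Date"].dt.day_name()

# Running total of each customer's spending, accumulated in date order.
df = df.sort_values("Date").reset_index(drop=True)
df["Customer_Cumulative_Sales"] = df.groupby("Customer_ID")["Total_Sales"].cumsum()

print(df)
```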

Step 8: Build a Simple Predictive Model (Example: Predicting Total Sales)

Let’s build a simple machine learning model to predict Total_Sales based on features like
Price, Quantity, and Product.

1. Split Data into Training and Testing Sets


python
# Convert 'Product' into numerical category
df['Product'] = df['Product'].astype('category').cat.codes

# Features and target variable
X = df[['Price', 'Quantity', 'Product']]
y = df['Total_Sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
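Integer category codes give the products an arbitrary numeric order, which a linear model will treat as meaningful. One-hot encoding is a common alternative; the sketch below swaps it in using pd.get_dummies on hand-built sample rows. It is offered as a variation to consider, not as the encoding this guide uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The sample rows, rebuilt by hand so the example runs on its own.
df = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Widget", "Widget", "Gadget"],
    "Price": [25, 15, 25, 25, 15],
    "Quantity": [2, 3, 5, 3, 4],
    "Total_Sales": [50, 45, 125, 75, 60],
})

# One indicator column per product; drop_first removes the redundant
# first category (here "Gadget"), leaving a single Product_Widget column.
encoded = pd.get_dummies(df, columns=["Product"], drop_first=True, dtype=int)

# Features and target variable
X = encoded.drop(columns=["Total_Sales"])
y = encoded["Total_Sales"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X.head())
```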

2. Train the Model (Linear Regression Example)

python
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)
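After fitting, the learned weights can be read from the coef_ and intercept_ attributes of LinearRegression. The sketch below fits a model on synthetic, exactly linear data (an assumption made purely so the expected weights are known in advance) to show how to inspect them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that is exactly linear: y = 2*x1 + 3*x2 + 1.
# (Made up for illustration; it is not the sales dataset.)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature column, plus a single intercept.
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

On the sales data, the same attributes would show how much the predicted Total_Sales changes per unit of Price, Quantity, and the product code.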

3. Evaluate the Model

python
# Calculate Mean Absolute Error (MAE) to evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
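MAE is one of several regression metrics in Scikit-learn. The sketch below also reports root mean squared error (RMSE) and the R² score. It runs on synthetic data generated under the assumption that Total_Sales equals Price times Quantity, as in the sample rows; because that relationship is multiplicative, a linear model can only approximate it, so some error is expected.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic transactions (made up for illustration, not the real dataset).
rng = np.random.default_rng(42)
price = rng.choice([15, 25], size=200)
quantity = rng.integers(1, 10, size=200)
X = np.column_stack([price, quantity])
y = price * quantity  # assumed relationship: Total_Sales = Price * Quantity

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# MAE: average absolute error, in the same units as Total_Sales.
mae = mean_absolute_error(y_test, y_pred)
# RMSE: like MAE but penalizes large errors more heavily.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# R^2: share of the variance in y explained by the model (1.0 is a perfect fit).
r2 = r2_score(y_test, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```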

Step 9: Save the Results

You can save the model or processed data for future use:

python
# Save the processed dataset to a new CSV file
df.to_csv('processed_sales_data.csv', index=False)

# Save the trained model using pickle
import pickle
with open('sales_prediction_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
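A saved model is only useful if it can be loaded again. The sketch below round-trips a small model through pickle and checks that the reloaded copy makes the same predictions. The temporary directory and the tiny training set (the sample rows' Price, Quantity, and Total_Sales) are choices made for this illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Price and Quantity from the sample rows, with Total_Sales as the target.
X = np.array([[25, 2], [15, 3], [25, 5], [25, 3], [15, 4]])
y = np.array([50, 45, 125, 75, 60])
model = LinearRegression().fit(X, y)

# Write the model to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sales_prediction_model.pkl")
    with open(model_path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(model_path, "rb") as model_file:
        loaded_model = pickle.load(model_file)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X), loaded_model.predict(X))
print(f"Reloaded model matches original: {same_predictions}")
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.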

Example Summary:

In this example, we loaded a sales dataset, performed data exploration, cleaning, and
visualization, and then built a machine learning model to predict Total_Sales. Along the way,
we used:

• Pandas: for data manipulation and cleaning.
• Matplotlib and Seaborn: for data visualization (scatter plots, line plots, and bar charts).
• Scikit-learn: for machine learning, including data splitting, model training, and evaluation.

This is just a simple demonstration. In real-world scenarios, the data analysis process can involve
more complex transformations, more advanced machine learning models, and more sophisticated
visualizations.
