Data Analysis

The document outlines a step-by-step guide for performing data analysis using Python libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn on a sales dataset. It covers essential tasks including data loading, exploration, cleaning, visualization, feature engineering, and building a predictive model for Total_Sales. The example demonstrates practical applications of data manipulation, visualization techniques, and machine learning model evaluation.

Uploaded by

Messih Grmay

Data Analysis

Pandas is a widely used Python library for loading, cleaning, and analyzing tabular data.
Below is a step-by-step example of how to perform data analysis using Pandas, alongside
Matplotlib and Seaborn for visualization and Scikit-learn for predictive modeling.

Let's walk through an example of performing data analysis on a CSV dataset that contains
information about customer sales transactions.

Step 1: Install Necessary Libraries

bash
pip install pandas matplotlib seaborn scikit-learn

Step 2: Import Libraries

python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Step 3: Load the Dataset

We'll load a sample dataset into Pandas using the read_csv method. Assume the dataset is a
CSV file named sales_data.csv, with columns like Date, Product, Price, Quantity,
Total_Sales, and Customer_ID.

python
# Load dataset
df = pd.read_csv('sales_data.csv')

# Display the first few rows of the dataset
# (print is needed when running as a script rather than in a notebook)
print(df.head())

Sample Data (sales_data.csv):

Date        Product  Price  Quantity  Total_Sales  Customer_ID
2021-01-01  Widget   25     2         50           101
2021-01-02  Gadget   15     3         45           102
2021-01-03  Widget   25     5         125          103
2021-01-04  Widget   25     3         75           101
2021-01-05  Gadget   15     4         60           102
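
To follow along without a sales_data.csv file on disk, the five sample rows above can be loaded from an in-memory string. This is only a minimal sketch: the SAMPLE_CSV name is a hypothetical one chosen for illustration, and the comma-separated layout is an assumption about how the file is stored.

```python
import io

import pandas as pd

# The five sample rows shown above, as CSV text (SAMPLE_CSV is a hypothetical name)
SAMPLE_CSV = """Date,Product,Price,Quantity,Total_Sales,Customer_ID
2021-01-01,Widget,25,2,50,101
2021-01-02,Gadget,15,3,45,102
2021-01-03,Widget,25,5,125,103
2021-01-04,Widget,25,3,75,101
2021-01-05,Gadget,15,4,60,102
"""

# read_csv accepts any file-like object, so StringIO stands in for the file
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)
# Check that Total_Sales equals Price * Quantity in every sample row
print((df['Price'] * df['Quantity'] == df['Total_Sales']).all())
```

In the sample rows, Total_Sales is always Price multiplied by Quantity, which is worth keeping in mind when interpreting the later plots and the predictive model.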

Step 4: Basic Data Exploration

Before starting analysis, it’s important to explore and clean the data.

python
# Data summary and info
df.info()             # Check data types and null counts (info() prints directly)
print(df.describe())  # Get summary statistics

# Check for missing values
print(df.isnull().sum())

# Convert 'Date' column to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Count duplicate rows, then remove them
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
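
The exploration steps can be tried on a tiny frame to see what each call reports. The sketch below uses made-up rows (not the original dataset) with one exact duplicate, counts duplicates before dropping them, and parses dates with errors='coerce', which is an optional choice that turns unparseable values into NaT rather than raising.

```python
import pandas as pd

# Small illustrative frame with one exact duplicate row (made-up data)
df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-02'],
    'Product': ['Widget', 'Gadget', 'Gadget'],
    'Total_Sales': [50, 45, 45],
})

# Count exact duplicate rows before removing them
n_duplicates = int(df.duplicated().sum())
print(f'Duplicate rows: {n_duplicates}')

# errors='coerce' turns unparseable dates into NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Drop the duplicates and renumber the remaining rows
df = df.drop_duplicates().reset_index(drop=True)
print(df.dtypes)
```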

Step 5: Data Cleaning (if necessary)

In case there are missing or inconsistent values in the dataset, we can handle them:

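python
# Fill missing values (if any) with the column mean.
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df.
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].mean())

# Drop rows with missing target variable (e.g., 'Total_Sales')
df.dropna(subset=['Total_Sales'], inplace=True)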
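
As a rough illustration of the cleaning step, the sketch below applies it to four made-up rows. It drops rows without a target before computing the fill value, which is one possible ordering (an assumption here, not the order used above), so that discarded rows do not influence the mean.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing Quantity, one missing Total_Sales
df = pd.DataFrame({
    'Quantity': [2.0, np.nan, 4.0, 3.0],
    'Total_Sales': [50.0, 45.0, 60.0, np.nan],
})

# Drop rows without a target first so they do not influence the fill value
df = df.dropna(subset=['Total_Sales'])

# Assign the result back instead of calling fillna(inplace=True) on a column
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].mean())

print(df)
```

Mean imputation is only one option; the median or a per-product mean may suit skewed data better.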

Step 6: Data Visualization

Data visualization helps to better understand trends, relationships, and distributions in the
dataset.

Example 1: Sales Distribution by Product

python
# Bar plot showing total sales for each product
product_sales = df.groupby('Product')['Total_Sales'].sum().sort_values()
product_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.show()

Example 2: Scatter Plot for Price vs. Total Sales

python
# Scatter plot to analyze the relationship between Price and Total Sales
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Price', y='Total_Sales', data=df)
plt.title('Price vs Total Sales')
plt.show()

Example 3: Sales Trends Over Time

python
# Line plot to show sales trends over time
df_grouped = df.groupby('Date')['Total_Sales'].sum()
df_grouped.plot(kind='line', figsize=(10, 6), color='green')
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.show()
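
plt.show() needs an interactive display. When the analysis runs as a script or on a server, one alternative is to render the figure to a file or buffer instead. The sketch below assumes the non-interactive Agg backend and recomputes the per-product totals from the five sample rows by hand; the in-memory buffer is used only so the example writes nothing to disk.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Product totals from the five sample rows (Widget: 50+125+75, Gadget: 45+60)
product_sales = pd.Series({'Gadget': 105, 'Widget': 250}, name='Total_Sales')

fig, ax = plt.subplots(figsize=(6, 4))
product_sales.plot(kind='bar', color='skyblue', ax=ax)
ax.set_title('Total Sales by Product')
ax.set_xlabel('Product')
ax.set_ylabel('Total Sales')
fig.tight_layout()

# Write the PNG to an in-memory buffer; pass a file path instead to save to disk
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
plt.close(fig)

print(f'PNG size: {buffer.getbuffer().nbytes} bytes')
```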

Step 7: Feature Engineering

In case you want to create new features or variables for predictive models:

python
# Extract year and month from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Feature engineering: Calculate profit assuming a 30% profit margin
df['Profit'] = df['Total_Sales'] * 0.30
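
The same feature-engineering steps can be checked on a few rows. In this sketch the Day_Name column and the PROFIT_MARGIN constant are additions for illustration (neither name appears in the original steps); the 30% margin itself is the assumption stated above.

```python
import pandas as pd

# Three of the sample rows, with Date already parsed as datetime
df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
    'Total_Sales': [50, 45, 125],
})

# Calendar features via the .dt accessor
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day_Name'] = df['Date'].dt.day_name()  # extra feature, not in the original steps

# PROFIT_MARGIN is a hypothetical constant holding the assumed 30% margin
PROFIT_MARGIN = 0.30
df['Profit'] = df['Total_Sales'] * PROFIT_MARGIN

print(df)
```

Keeping the margin in a named constant makes the assumption easy to find and change later.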

Step 8: Build a Simple Predictive Model (Example: Predicting Total Sales)

Let’s build a simple machine learning model to predict Total_Sales based on features like
Price, Quantity, and Product.

1. Split Data into Training and Testing Sets

python
# Convert 'Product' into numerical category
df['Product'] = df['Product'].astype('category').cat.codes

# Features and target variable
X = df[['Price', 'Quantity', 'Product']]
y = df['Total_Sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

2. Train the Model (Linear Regression Example)

python
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

3. Evaluate the Model

python
# Calculate Mean Absolute Error (MAE) to evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
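
Integer category codes give the products an arbitrary numeric order, which a linear model treats as meaningful. One common alternative is one-hot encoding with pd.get_dummies. The sketch below is a variation on the steps above, not the original method: it runs on synthetic data generated purely for illustration and also reports R² alongside MAE. Because Total_Sales in the sample data is Price times Quantity, an additive linear model can only approximate it, so some error is expected.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not from the original dataset)
rng = np.random.default_rng(42)
n_rows = 200
product = rng.choice(['Widget', 'Gadget'], size=n_rows)
df = pd.DataFrame({
    'Product': product,
    'Price': np.where(product == 'Widget', 25, 15),
    'Quantity': rng.integers(1, 10, size=n_rows),
})
df['Total_Sales'] = df['Price'] * df['Quantity']

# One-hot encode Product instead of using integer category codes
X = pd.get_dummies(df[['Price', 'Quantity', 'Product']], columns=['Product'], dtype=int)
y = df['Total_Sales']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MAE: {mae:.2f}')
print(f'R^2: {r2:.3f}')
```

With only five rows, as in the sample file, a 20% test split leaves a single test row, so metrics on the real sample would not be meaningful; a larger dataset is assumed here.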

Step 9: Save the Results

You can save the model or processed data for future use:

python
import pickle

# Save the processed dataset to a new CSV file
df.to_csv('processed_sales_data.csv', index=False)

# Save the trained model using pickle
with open('sales_prediction_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
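
A saved model is only useful if it can be loaded again. The sketch below round-trips a small stand-in model through pickle and checks that the reloaded copy gives the same prediction. The training numbers are made up, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on made-up numbers (y = 2x exactly)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Save to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'sales_prediction_model.pkl')
    with open(model_path, 'wb') as model_file:
        pickle.dump(model, model_file)

    # Reload the model; only unpickle files from a source you trust
    with open(model_path, 'rb') as model_file:
        loaded_model = pickle.load(model_file)

# The reloaded model should reproduce the original predictions
X_new = np.array([[5.0]])
print(np.allclose(model.predict(X_new), loaded_model.predict(X_new)))
```

Pickled models are generally tied to the library version that produced them, so it helps to record the scikit-learn version alongside the file.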

Example Summary:

In this example, we loaded a sales dataset, performed data exploration, cleaning, and
visualization, and then built a machine learning model to predict Total_Sales. Along the way,
we used:

• Pandas: for data manipulation and cleaning.
• Matplotlib and Seaborn: for data visualization (scatter plots, line plots, and bar charts).
• Scikit-learn: for machine learning, including data splitting, model training, and evaluation.

This is just a simple demonstration. In real-world scenarios, the data analysis process can involve
more complex transformations, more advanced machine learning models, and more sophisticated
visualizations.

