0% found this document useful (0 votes)
4 views

A Real World Scenario Solution using pandas

Uploaded by

asharyg5752
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

A Real World Scenario Solution using pandas

Uploaded by

asharyg5752
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Analyzing E-commerce Sales Data

Scenario:
An e-commerce company wants to analyze its 2023 sales performance. Data comes from multiple sources:
1. Transactions data (CSV) with missing values
2. Product catalog (Excel) with category information
3. Customer data (JSON) with signup dates
Key Objectives:
1. Clean and merge datasets
2. Calculate monthly sales trends
3. Identify top-selling products
4. Analyze customer purchasing patterns
Store both combining monthly sales and monthly profit, product performance, customer loyalty to separate
CSV files.

Solution:
import pandas as pd

import numpy as np

# 1. Load Data

transactions = pd.read_csv('transactions.csv', parse_dates=['order_date'])

products = pd.read_excel('product_catalog.xlsx')

customers = pd.read_json('customers.json', convert_dates=['signup_date'])

# 2. Clean Transactions Data

transactions['quantity'] = transactions['quantity'].fillna(1)

transactions['unit_price'] = transactions.groupby('product_id')['unit_price'].transform(lambda x: x.fillna(x.median()))

transactions['order_id'] = transactions['order_id'].astype('category')

transactions['customer_id'] = transactions['customer_id'].astype('int32')

# 3. Merge Datasets

merged_data = transactions.merge(products[['product_id', 'category', 'cost_price']], on='product_id', how='left').merge(

customers[['customer_id', 'signup_date', 'tier']], on='customer_id', how='left')

# 4. Feature Engineering

Instructor: Shazmina Gull Course: Tools and Techniques in Data Science FoC, IUB, RYK Campus
merged_data['total_sales'] = merged_data['quantity'] * merged_data['unit_price']

merged_data['profit'] = (merged_data['unit_price'] - merged_data['cost_price']) * merged_data['quantity']

# 5. Store Merged Data

merged_data.to_parquet('processed_data.parquet', index=False)

# 6. Monthly Sales Analysis

monthly_sales = merged_data.resample('M', on='order_date')['total_sales'].sum()

monthly_profit = merged_data.resample('M', on='order_date')['profit'].sum()

pd.DataFrame({'total_sales': monthly_sales, 'total_profit': monthly_profit}).to_csv('monthly_sales_report.csv')

# 7. Product Performance Analysis

product_performance = merged_data.groupby('product_id').agg({'total_sales': 'sum', 'quantity': 'sum', 'profit':


'mean'}).sort_values('total_sales', ascending=False)

product_performance.to_excel('top_products.xlsx')

# 8. Customer Segmentation

customer_loyalty = merged_data.groupby('customer_id').agg({'order_id': 'nunique', 'total_sales': 'sum', 'signup_date':


'first'}).rename(columns={'order_id': 'purchase_count'})

customer_loyalty['cohort'] = customer_loyalty['signup_date'].dt.to_period('M')

customer_loyalty['lifetime_months'] = (pd.Period('2023-12', freq='M') - customer_loyalty['cohort']).apply(lambda x: x.n)

customer_loyalty.to_csv('customer_segments.csv')

# 9. Data Validation

negative_profit = merged_data[merged_data['profit'] < 0]

if not negative_profit.empty:

negative_profit.to_csv('negative_profit_transactions.csv', index=False)

print(f"Saved {len(negative_profit)} negative profit transactions to file")

# 10. Verify data completeness

missing_categories = merged_data[merged_data['category'].isna()]

if not missing_categories.empty:

missing_categories.to_csv('missing_category_records.csv', index=False)

print("Saved records with missing categories to file")


Instructor: Shazmina Gull Course: Tools and Techniques in Data Science FoC, IUB, RYK Campus
print("Processing complete. Results saved to:")

print("- processed_data.parquet : Full cleaned dataset")

print("- monthly_sales_report.csv : Monthly sales summary")

print("- top_products.xlsx : Product performance analysis")

print("- customer_segments.csv : Customer loyalty metrics")

print("- negative_profit_transactions.csv : Data quality issues (if any)")

print("- missing_category_records.csv : Data quality issues (if any)")

Instructor: Shazmina Gull Course: Tools and Techniques in Data Science FoC, IUB, RYK Campus

You might also like