A Real World Scenario Solution using pandas
Scenario:
An e-commerce company wants to analyze its 2023 sales performance. Data comes from multiple sources:
1. Transactions data (CSV) with missing values
2. Product catalog (Excel) with category information
3. Customer data (JSON) with signup dates
Key Objectives:
1. Clean and merge datasets
2. Calculate monthly sales trends
3. Identify top-selling products
4. Analyze customer purchasing patterns
Save the combined monthly sales and monthly profit, the product performance, and the customer loyalty
results to separate CSV files.
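As a minimal sketch of the first objective, the snippet below cleans and merges three tiny in-memory frames that stand in for the real files. The column names `product_id`, `unit_price` and `order_date`, the product codes, and all values are assumptions made for illustration; the actual files may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the CSV, Excel and JSON sources.
transactions = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 1, 3],
    "product_id": ["P1", "P2", "P1", "P9"],  # P9 is absent from the catalog
    "quantity": [2, np.nan, 1, 3],           # one missing quantity
    "order_date": ["2023-01-15", "2023-01-20", "2023-02-03", "2023-02-10"],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "category": ["Books", "Toys"],
    "unit_price": [10.0, 25.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2022-11-01", "2023-01-05", "2023-02-01"],
})

# Clean: treat a missing quantity as a single unit and parse the dates.
transactions["quantity"] = transactions["quantity"].fillna(1)
transactions["order_date"] = pd.to_datetime(transactions["order_date"])
customers["signup_date"] = pd.to_datetime(customers["signup_date"])

# Merge: left joins keep every transaction even when a lookup fails.
merged_data = (
    transactions
    .merge(products, on="product_id", how="left")
    .merge(customers, on="customer_id", how="left")
)
merged_data["total_sales"] = merged_data["quantity"] * merged_data["unit_price"]

print(merged_data[["order_id", "category", "quantity", "total_sales"]])
```

Using `how="left"` keeps the unmatched transaction in the result with an empty `category`, which is the kind of row a later missing-category check can pick up.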
Solution:
import pandas as pd
import numpy as np
# 1. Load Data
products = pd.read_excel('product_catalog.xlsx')
# 2. Clean Data
transactions['quantity'] = transactions['quantity'].fillna(1)
transactions['order_id'] = transactions['order_id'].astype('category')
transactions['customer_id'] = transactions['customer_id'].astype('int32')
# 3. Merge Datasets
# 4. Feature Engineering
Instructor: Shazmina Gull Course: Tools and Techniques in Data Science FoC, IUB, RYK Campus
merged_data['total_sales'] = merged_data['quantity'] * merged_data['unit_price']
merged_data.to_parquet('processed_data.parquet', index=False)
product_performance.to_excel('top_products.xlsx')
# 8. Customer Segmentation
customer_loyalty['cohort'] = customer_loyalty['signup_date'].dt.to_period('M')
customer_loyalty.to_csv('customer_segments.csv')
# 9. Data Validation
if not negative_profit.empty:
    negative_profit.to_csv('negative_profit_transactions.csv', index=False)
missing_categories = merged_data[merged_data['category'].isna()]
if not missing_categories.empty:
    missing_categories.to_csv('missing_category_records.csv', index=False)
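The listing does not show how the monthly trend, `product_performance` or `negative_profit` are built. One possible way to derive them is sketched below on a toy frame. The columns `order_date`, `product_id` and `profit`, the variable name `monthly`, and the sample values are assumptions and are not taken from the course files.

```python
import pandas as pd

# Hypothetical merged data; column names and values are illustrative only.
merged_data = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-15", "2023-01-20", "2023-02-03", "2023-02-10"]),
    "product_id": ["P1", "P2", "P1", "P3"],
    "total_sales": [20.0, 25.0, 10.0, 35.0],
    "profit": [8.0, 5.0, 4.0, -6.0],
})

# Monthly trend: group by calendar month and sum sales and profit.
monthly = (
    merged_data
    .groupby(merged_data["order_date"].dt.to_period("M"))[["total_sales", "profit"]]
    .sum()
)

# Top-selling products: total sales per product, largest first.
product_performance = (
    merged_data.groupby("product_id")["total_sales"]
    .sum()
    .sort_values(ascending=False)
)

# Validation: rows with negative profit deserve a closer look.
negative_profit = merged_data[merged_data["profit"] < 0]

print(monthly)
print(product_performance.head(3))
print(len(negative_profit), "transaction(s) with negative profit")
```

Grouping on `dt.to_period("M")` gives one row per calendar month, so the result can be written out with `to_csv` in the same way as the other outputs.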