My first ETL pipeline
Objective:
The goal of this project is to process and analyse retail sales data using PySpark and Spark SQL. Trainees
will perform ETL (Extract, Transform, Load) operations, conduct exploratory data analysis (EDA), and
generate insights from the data by following the Medallion Architecture (Bronze, Silver and Gold layers).
The project helps trainees gain hands-on experience with PySpark, data cleaning, data transformation,
and SQL queries.
Project Scope:
1. Data Extraction:
Load the three datasets (customers.csv, products.csv, and sales.csv) into PySpark DataFrames and
understand the structure and schema of each dataset.
Code:
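A minimal sketch of this step (assuming the three CSV files sit in the working directory and contain a header row) could look like this:

from pyspark.sql import SparkSession

# Start a Spark session for the pipeline.
spark = SparkSession.builder.appName("RetailSalesETL").getOrCreate()

# Bronze layer: read the three raw CSV files into DataFrames.
# header=True takes column names from the first row; inferSchema=True guesses column types.
customers_df = spark.read.csv("customers.csv", header=True, inferSchema=True)
products_df = spark.read.csv("products.csv", header=True, inferSchema=True)
sales_df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Inspect the structure and schema of each dataset.
for name, df in [("customers", customers_df), ("products", products_df), ("sales", sales_df)]:
    print(f"=== {name} ===")
    df.printSchema()
    df.show(5, truncate=False)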
Results:
2. Data Transformation:
o Handle missing values in the customers.csv dataset (e.g., fill nulls in city with "Unknown" and age
with the average age).
o Join the three datasets (customers, products, and sales) into a single DataFrame using
customer_id and product_id.
o Enrich the data by calculating a new column total_revenue (quantity * price) and deriving a
sale_month column from the date.
Code:
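Building on the extraction sketch above, the transformations could look roughly like the following; column names such as city, age, date, quantity and price are assumptions rather than the exact schema:

from pyspark.sql import functions as F

# Silver layer: clean, join and enrich the raw data.

# Fill missing cities with "Unknown" and missing ages with the average age
# (rounded to an int so the fill value matches an integer age column).
avg_age = customers_df.select(F.avg("age")).first()[0]
customers_clean = customers_df.fillna({"city": "Unknown", "age": int(avg_age)})

# Join the three datasets on customer_id and product_id.
joined_df = (
    sales_df
    .join(customers_clean, on="customer_id", how="inner")
    .join(products_df, on="product_id", how="inner")
)

# Enrich: total_revenue = quantity * price, sale_month derived from the sale date
# (assumes the date column parses with to_date, e.g. yyyy-MM-dd).
enriched_df = (
    joined_df
    .withColumn("total_revenue", F.col("quantity") * F.col("price"))
    .withColumn("sale_month", F.date_format(F.to_date("date"), "yyyy-MM"))
)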
Result:
3. Data Filtering:
o Filter the dataset to keep only sales with quantity > 1 and total_revenue > 50.
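Applied to the enriched DataFrame from the transformation sketch, this filter is a single step:

from pyspark.sql import functions as F

# Keep only sales with more than one unit sold and revenue above 50.
filtered_df = enriched_df.filter((F.col("quantity") > 1) & (F.col("total_revenue") > 50))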
4. Data Loading:
o Derive basic insights such as total revenue, the top products by revenue, and total sales by
store type.
o Conduct customer analysis to find the top 10 customers contributing the most to revenue and
calculate the average age of customers by product category.
o Analyse monthly sales trends and revenue contribution by product category over time.
Code:
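A rough sketch of these aggregations over the filtered DataFrame; column names such as product_name, store_type, name and category are assumed for illustration and may differ from the actual schema:

from pyspark.sql import functions as F

# Gold layer: aggregated insights for business analysis.

# Total revenue across all sales.
total_revenue = filtered_df.agg(F.sum("total_revenue").alias("total_revenue")).first()[0]

# Top products by revenue.
top_products = (
    filtered_df.groupBy("product_name")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy(F.desc("revenue"))
)

# Total sales by store type.
sales_by_store_type = (
    filtered_df.groupBy("store_type")
    .agg(F.sum("total_revenue").alias("revenue"), F.count("*").alias("num_sales"))
)

# Top 10 customers contributing the most to revenue.
top_customers = (
    filtered_df.groupBy("customer_id", "name")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy(F.desc("revenue"))
    .limit(10)
)

# Average age of customers by product category.
avg_age_by_category = (
    filtered_df.groupBy("category")
    .agg(F.avg("age").alias("avg_age"))
)

# Monthly sales trend and revenue contribution by category over time.
monthly_trend = (
    filtered_df.groupBy("sale_month", "category")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy("sale_month", "category")
)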
Result:
1. Monthly sales:
2. Top products:
3. Top customers:
5. SQL Queries:
o Use Spark SQL to identify products that contributed at least 10% of the total revenue.
o Identify cities with more than 100 unique customers (if any).
Code, Result:
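One way to express both questions in Spark SQL is to register the filtered DataFrame as a temporary view and query it; product_name and city are again assumed column names:

# Expose the enriched data to Spark SQL as a temporary view.
filtered_df.createOrReplaceTempView("sales_enriched")

# Products contributing at least 10% of total revenue.
high_share_products = spark.sql("""
    WITH totals AS (
        SELECT SUM(total_revenue) AS grand_total FROM sales_enriched
    ),
    product_rev AS (
        SELECT product_name, SUM(total_revenue) AS product_revenue
        FROM sales_enriched
        GROUP BY product_name
    )
    SELECT p.product_name,
           p.product_revenue,
           ROUND(p.product_revenue / t.grand_total, 4) AS revenue_share
    FROM product_rev p CROSS JOIN totals t
    WHERE p.product_revenue >= 0.10 * t.grand_total
    ORDER BY p.product_revenue DESC
""")
high_share_products.show()

# Cities with more than 100 unique customers (if any).
big_cities = spark.sql("""
    SELECT city, COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales_enriched
    GROUP BY city
    HAVING COUNT(DISTINCT customer_id) > 100
    ORDER BY unique_customers DESC
""")
big_cities.show()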
Conclusion:
This project was a great opportunity to apply PySpark and Spark SQL in a real-world retail context.
By following the Medallion Architecture, I gained practical experience in designing a structured data
pipeline — from raw data ingestion to enriched insights ready for business analysis.