My first ETL pipeline

This project focuses on processing and analyzing retail sales data using PySpark and Spark SQL, involving ETL operations and exploratory data analysis. Trainees will load datasets, handle data transformations, filter data, and perform SQL queries to generate insights. The project emphasizes practical experience in data cleaning, modular ETL workflows, and leveraging Delta Lake for data storage.

Retail Sales Data Processing and Analysis using PySpark

Objective:

The goal of this project is to process and analyse retail sales data using PySpark and Spark SQL. Trainees
will perform ETL (Extract, Transform, Load) operations, conduct exploratory data analysis (EDA), and
generate insights from the data. The project will help trainees gain hands-on experience with PySpark,
data cleaning, data transformation, and SQL queries, following the Medallion Architecture with its
Bronze, Silver, and Gold layers.

Project Scope:

1. Data Extraction:

Load three datasets (customers.csv, products.csv, and sales.csv) into PySpark DataFrames.

Understand the structure and schema of each dataset.

Code:

Bronze - Raw Data Ingestion

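A minimal sketch of this ingestion step (the data/ path prefix is an assumed placeholder; adjust it to your environment):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the pipeline
spark = SparkSession.builder.appName("RetailSalesETL").getOrCreate()

# Load the three raw datasets into DataFrames (path prefix is a placeholder)
customers_df = spark.read.csv("data/customers.csv", header=True, inferSchema=True)
products_df = spark.read.csv("data/products.csv", header=True, inferSchema=True)
sales_df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

# Inspect the structure and schema of each dataset
for name, df in [("customers", customers_df), ("products", products_df), ("sales", sales_df)]:
    print(f"--- {name} ---")
    df.printSchema()
    df.show(5)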
Results:
2. Data Transformation:

o Handle missing values in the customers.csv dataset (e.g., fill nulls in city with "Unknown" and age
with the average age).

o Drop rows with negative quantity or price in the sales.csv dataset.

o Join the three datasets (customers, products, and sales) into a single DataFrame using
customer_id and product_id.

o Enrich the data by calculating a new column total_revenue (quantity * price) and deriving a
sale_month column from the date.

Code:
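A sketch of the Silver-layer transformations, reusing the DataFrames loaded above (details beyond the stated requirements, such as the date being parseable by to_date, are assumptions):

from pyspark.sql import functions as F

# Fill missing city values with "Unknown" and missing ages with the average age
avg_age = customers_df.agg(F.avg("age")).first()[0]
customers_clean = customers_df.fillna({"city": "Unknown", "age": int(avg_age)})

# Drop sales rows with negative quantity or price
sales_clean = sales_df.filter((F.col("quantity") >= 0) & (F.col("price") >= 0))

# Join the three datasets on customer_id and product_id
joined_df = (sales_clean
             .join(customers_clean, "customer_id")
             .join(products_df, "product_id"))

# Enrich: total_revenue (quantity * price) and sale_month derived from the date
# (assumes date strings are in a format to_date understands, e.g. yyyy-MM-dd)
enriched_df = (joined_df
               .withColumn("total_revenue", F.col("quantity") * F.col("price"))
               .withColumn("sale_month", F.month(F.to_date("date"))))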
Result:

3. Data Filtering:

o Filter the dataset to keep only sales with quantity > 1 and total_revenue > 50.

4. Data Loading:

o Save the transformed data as a Parquet file for further analysis.
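
A combined sketch of the filtering and loading steps (the output path is an assumed placeholder):

# 3. Keep only sales with quantity > 1 and total_revenue > 50
filtered_df = enriched_df.filter((F.col("quantity") > 1) & (F.col("total_revenue") > 50))

# 4. Persist the transformed data as Parquet for further analysis
filtered_df.write.mode("overwrite").parquet("data/silver/retail_sales")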

5. Exploratory Data Analysis (EDA):

o Perform basic insights such as calculating total revenue, identifying top products by revenue, and
analysing total sales by store type.

o Conduct customer analysis to find the top 10 customers contributing the most to revenue and
calculate the average age of customers by product category.

o Analyse monthly sales trends and revenue contribution by product category over time.

Code:
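A sketch of the EDA queries (product_name, store_type, category, and customer_name are assumed column names; adjust them to the actual schemas):

# Total revenue across all filtered sales
filtered_df.agg(F.sum("total_revenue").alias("total_revenue")).show()

# Top products by revenue
(filtered_df.groupBy("product_name")
 .agg(F.sum("total_revenue").alias("revenue"))
 .orderBy(F.desc("revenue"))
 .show(10))

# Total sales by store type
(filtered_df.groupBy("store_type")
 .agg(F.sum("total_revenue").alias("revenue"))
 .show())

# Top 10 customers by revenue contribution
(filtered_df.groupBy("customer_id", "customer_name")
 .agg(F.sum("total_revenue").alias("revenue"))
 .orderBy(F.desc("revenue"))
 .show(10))

# Average customer age by product category
(filtered_df.groupBy("category")
 .agg(F.avg("age").alias("avg_age"))
 .show())

# Monthly sales trend, and revenue by category over time
(filtered_df.groupBy("sale_month")
 .agg(F.sum("total_revenue").alias("revenue"))
 .orderBy("sale_month")
 .show())
(filtered_df.groupBy("sale_month", "category")
 .agg(F.sum("total_revenue").alias("revenue"))
 .orderBy("sale_month", "category")
 .show())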
Result:

1. Monthly sales:

2. Top products:

3. Top customers:

6. SQL Queries:

o Use Spark SQL to identify products that contributed at least 10% of the total revenue.

o Identify cities with more than 100 unique customers (if any).

Code, Result:
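A sketch of the two Spark SQL queries, run against a temporary view registered from the final DataFrame:

# Register the final DataFrame as a temporary view for Spark SQL
filtered_df.createOrReplaceTempView("retail_sales")

# Products contributing at least 10% of total revenue
spark.sql("""
    SELECT product_name,
           SUM(total_revenue) AS product_revenue
    FROM retail_sales
    GROUP BY product_name
    HAVING SUM(total_revenue) >= 0.10 * (SELECT SUM(total_revenue) FROM retail_sales)
""").show()

# Cities with more than 100 unique customers (if any)
spark.sql("""
    SELECT city,
           COUNT(DISTINCT customer_id) AS unique_customers
    FROM retail_sales
    GROUP BY city
    HAVING COUNT(DISTINCT customer_id) > 100
""").show()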
Conclusion:

This project was a great opportunity to apply PySpark and Spark SQL in a real-world retail context.
By following the Medallion Architecture, I gained practical experience in designing a structured data
pipeline — from raw data ingestion to enriched insights ready for business analysis.

It deepened my understanding of:

- Data cleaning and transformation techniques

- Building modular ETL workflows

- Leveraging Delta Lake for scalable and reliable data storage

- Applying SQL for business-driven analytics
