My first ETL pipeline
Objective:
The goal of this project is to process and analyse retail sales data using PySpark and Spark SQL. Trainees
will perform ETL (Extract, Transform, Load) operations, conduct exploratory data analysis (EDA), and
generate insights from the data by following the Medallion Architecture (Bronze, Silver and Gold layers).
The project helps trainees gain hands-on experience with PySpark, data cleaning, data transformation,
and SQL queries.
Project Scope:
1. Data Extraction:
Load the three datasets (customers.csv, products.csv, and sales.csv) into PySpark DataFrames and
understand the structure and schema of each dataset.
Code:
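A minimal sketch of this step (assuming the three CSV files sit in the working directory and contain a header row) could look like this:

from pyspark.sql import SparkSession

# Start a Spark session for the pipeline.
spark = SparkSession.builder.appName("RetailSalesETL").getOrCreate()

# Bronze layer: read the three raw CSV files into DataFrames.
# header=True takes column names from the first row; inferSchema=True guesses column types.
customers_df = spark.read.csv("customers.csv", header=True, inferSchema=True)
products_df = spark.read.csv("products.csv", header=True, inferSchema=True)
sales_df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Inspect the structure and schema of each dataset.
for name, df in [("customers", customers_df), ("products", products_df), ("sales", sales_df)]:
    print(f"=== {name} ===")
    df.printSchema()
    df.show(5, truncate=False)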
Results:
2. Data Transformation:
o Handle missing values in the customers.csv dataset (e.g., fill nulls in city with "Unknown" and age
with the average age).
o Join the three datasets (customers, products, and sales) into a single DataFrame using
customer_id and product_id.
o Enrich the data by calculating a new column total_revenue (quantity * price) and deriving a
sale_month column from the date.
Code:
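Building on the extraction sketch above, the transformations could look roughly like the following; column names such as city, age, date, quantity and price are assumptions rather than the exact schema:

from pyspark.sql import functions as F

# Silver layer: clean, join and enrich the raw data.

# Fill missing cities with "Unknown" and missing ages with the average age
# (rounded to an int so the fill value matches an integer age column).
avg_age = customers_df.select(F.avg("age")).first()[0]
customers_clean = customers_df.fillna({"city": "Unknown", "age": int(avg_age)})

# Join the three datasets on customer_id and product_id.
joined_df = (
    sales_df
    .join(customers_clean, on="customer_id", how="inner")
    .join(products_df, on="product_id", how="inner")
)

# Enrich: total_revenue = quantity * price, sale_month derived from the sale date
# (assumes the date column parses with to_date, e.g. yyyy-MM-dd).
enriched_df = (
    joined_df
    .withColumn("total_revenue", F.col("quantity") * F.col("price"))
    .withColumn("sale_month", F.date_format(F.to_date("date"), "yyyy-MM"))
)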
Result:
3. Data Filtering:
o Filter the dataset to keep only sales with quantity > 1 and total_revenue > 50.
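Applied to the enriched DataFrame from the transformation sketch, this filter is a single step:

from pyspark.sql import functions as F

# Keep only sales with more than one unit sold and revenue above 50.
filtered_df = enriched_df.filter((F.col("quantity") > 1) & (F.col("total_revenue") > 50))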
4. Data Loading:
o Derive basic insights such as total revenue, the top products by revenue, and total sales by
store type.
o Conduct customer analysis to find the top 10 customers contributing the most to revenue and
calculate the average age of customers by product category.
o Analyse monthly sales trends and revenue contribution by product category over time.
Code:
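A rough sketch of these aggregations over the filtered DataFrame; column names such as product_name, store_type, name and category are assumed for illustration and may differ from the actual schema:

from pyspark.sql import functions as F

# Gold layer: aggregated insights for business analysis.

# Total revenue across all sales.
total_revenue = filtered_df.agg(F.sum("total_revenue").alias("total_revenue")).first()[0]

# Top products by revenue.
top_products = (
    filtered_df.groupBy("product_name")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy(F.desc("revenue"))
)

# Total sales by store type.
sales_by_store_type = (
    filtered_df.groupBy("store_type")
    .agg(F.sum("total_revenue").alias("revenue"), F.count("*").alias("num_sales"))
)

# Top 10 customers contributing the most to revenue.
top_customers = (
    filtered_df.groupBy("customer_id", "name")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy(F.desc("revenue"))
    .limit(10)
)

# Average age of customers by product category.
avg_age_by_category = (
    filtered_df.groupBy("category")
    .agg(F.avg("age").alias("avg_age"))
)

# Monthly sales trend and revenue contribution by category over time.
monthly_trend = (
    filtered_df.groupBy("sale_month", "category")
    .agg(F.sum("total_revenue").alias("revenue"))
    .orderBy("sale_month", "category")
)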
Result:
1. Monthly sales:
2. Top products:
3. Top customers:
5. SQL Queries:
o Use Spark SQL to identify products that contributed at least 10% of the total revenue.
o Identify cities with more than 100 unique customers (if any).
Code, Result:
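One way to express both questions in Spark SQL is to register the filtered DataFrame as a temporary view and query it; product_name and city are again assumed column names:

# Expose the enriched data to Spark SQL as a temporary view.
filtered_df.createOrReplaceTempView("sales_enriched")

# Products contributing at least 10% of total revenue.
high_share_products = spark.sql("""
    WITH totals AS (
        SELECT SUM(total_revenue) AS grand_total FROM sales_enriched
    ),
    product_rev AS (
        SELECT product_name, SUM(total_revenue) AS product_revenue
        FROM sales_enriched
        GROUP BY product_name
    )
    SELECT p.product_name,
           p.product_revenue,
           ROUND(p.product_revenue / t.grand_total, 4) AS revenue_share
    FROM product_rev p CROSS JOIN totals t
    WHERE p.product_revenue >= 0.10 * t.grand_total
    ORDER BY p.product_revenue DESC
""")
high_share_products.show()

# Cities with more than 100 unique customers (if any).
big_cities = spark.sql("""
    SELECT city, COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales_enriched
    GROUP BY city
    HAVING COUNT(DISTINCT customer_id) > 100
    ORDER BY unique_customers DESC
""")
big_cities.show()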
Conclusion:
This project was a great opportunity to apply PySpark and Spark SQL in a real-world retail context.
By following the Medallion Architecture, I gained practical experience in designing a structured data
pipeline — from raw data ingestion to enriched insights ready for business analysis.