DMV Lab 12
DMV Lab 12
*************************************************************************************
Part II
Assignment 12
*************************************************************************************
Title: Data Aggregation
Problem Statement: Analyzing Sales Performance by Region in a Retail Company.
Dataset: ""Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail
company. It includes attributes such as transaction date, product category, quantity sold,
and sales amount. The goal is to perform data aggregation to analyze the sales performance
by region and identify the top-performing regions.
Tasks to Perform:
1. Import the "Retail_Sales_Data.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for aggregating sales data, such as region, sales amount,
and product category.
4. Group the sales data by region and calculate the total sales amount for each region.
5. Create bar plots or pie charts to visualize the sales distribution by region.
6. Identify the top-performing regions based on the highest sales amount.
7. Group the sales data by region and product category to calculate the total sales amount
for each combination.
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across
different regions and product categories.
Theory:
Data aggregation is the process of combining data from multiple sources or summarizing
data from a single source to produce a more concise and meaningful representation of the
data. It can be used to identify trends, patterns, and relationships in the data that would not
be apparent if the data was analyzed individually.
Data aggregation can be performed on a variety of data types, including numerical data,
categorical data, and text data. Some common aggregation operations include:
Identifying the top-performing sales regions: A retail company can aggregate sales
data by region to identify the regions that are generating the most revenue.
Tracking website traffic: A website owner can aggregate website traffic data by
source to identify the most effective marketing channels.
Data aggregation is a powerful tool that can be used to gain insights from data and
make better decisions.
Benefits of data aggregation:
Increased accuracy: Data aggregation can help businesses to increase the accuracy
of their data by reducing the number of errors that occur when data is manually
processed.
It is csv file, we can use pandas. The read_csv is used to read it into a DataFrame:
In [5]:
import
2. pandas as pd
3.
# 4.Import Dataset
df=pd.read_csv('Retail_Sales_Data.csv');
5.
df.head();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99457 entries, 0 to 99456
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 invoice_no 99457 non-null object
1 customer_id 99457 non-null object
2 gender 99457 non-null object
3 age 99457 non-null int64
4 category 99457 non-null object
5 quantity 99457 non-null int64
6 price 99457 non-null float64
7 payment_method 99457 non-null object
8 invoice_date 99457 non-null object
9 shopping_mall 99457 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 7.6+ MB
3. Identify the relevant variables for aggregating sales data, such as region, sales
amount, and product category.
Aggregating sales data involves summarizing and grouping information to derive
meaningful insights. The choice of relevant variables depends on the specific goals of
analysis and the nature of the business. However, some common variables for aggregating
sales data include:
1. Shopping Mall:
Mall ID or Name: To identify and group sales data based on the shopping mall
where the transactions occurred.
2. Price:
Unit Price: The price of each unit of the product sold.
Total Sales: The total revenue generated from the sale of products.
3. Category:
Product Category: Categorization of products based on their type (e.g.,
electronics, clothing, food).
Sales by Category: Aggregating sales data based on the product categories.
4. Group the sales data by region and calculate the total sales amount for each
region.
Grouping sales data by region and shopping mall and calculating the total sales amount for
each region serves several important purposes in business analysis and decision-making.
In[14]:
#Group by shopping mall and calculate total sales amount
sales_by_region = data.groupby("shopping_mall")["price"].sum()
sales_by_region
Out[14]:
shopping_mall
Cevahir AVM 3433671.84
Emaar Square Mall 3390408.31
Forum Istanbul 3336073.82
Istinye Park 6717077.54
Kanyon 13710755.24
Mall of Istanbul 13851737.62
Metrocity 10249980.07
Metropol AVM 6937992.99
Viaport Outlet 3414019.46
Zorlu Center 3509649.02
Name: price, dtype: float64
5. Create bar plots or pie charts to visualize the sales distribution by region.
Pie charts represent data in a circular graph, where each slice (or sector) of the pie
corresponds to a particular category or group. In the context of sales distribution by region,
each slice represents a different region, and the size of each slice indicates the proportion
of sales attributed to that region.
In[15]:
#Create a pie plot to visualize sales distribution by region
plt.figure(figsize=(6, 6))
plt.pie(sales_by_region, labels=sales_by_region.index,
autopct="%1.1f%%", startangle=140)
plt.title("Sales Distribution by Region")
plt.axis("equal")
plt.show()
Out[15]:
In[16]:
plt.figure()
plt.bar(sales_by_region.index, sales_by_region.values)
plt.xlabel('Region')
plt.ylabel('Total Sales Amount')
plt.title('Total Sales Amount by Region')
plt.show()
Out[16]:
Out[17]:
Top-performing regions:
shopping_mall
Mall of Istanbul 13851737.62
Kanyon 13710755.24
Metrocity 10249980.07
Metropol AVM 6937992.99
Istinye Park 6717077.54
Name: price, dtype: float64
7. Group the sales data by region and product category to calculate the total sales amount for
each combination.
In[18]:
#Group by region of shopping mall and product category, calculate
total sales
sales_by_region_category = data.groupby(["shopping_mall",
"category"])["price"].sum()
Out[17]:
shopping_mall category
Cevahir AVM Books 11998.80
Clothing 1554414.40
Cosmetics 88394.84
Food & Beverage 11992.39
Shoes 884050.41
...
Zorlu Center Food & Beverage 11589.68
Shoes 953670.13
Souvenir 8398.68
Technology 803250.00
Toys 54691.84
Name: price, Length: 80, dtype: float64
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across
different regions and product categories.
A stacked bar plot is a type of bar chart that represents individual data values as bars, with
each bar divided into segments that represent different categories or groups. In the context
of comparing sales amounts across different regions and product categories, a stacked bar
plot can be a useful visualization tool.
A stacked bar plot to compare sales amounts:
1. Plotting:
Use a plotting library (such as Matplotlib in Python or ggplot2 in R) to create a
stacked bar plot.
The x-axis should represent the regions, and the y-axis should represent the total
sales amount.
Each bar is divided into segments, with each segment representing a different
product category.
The height of each segment corresponds to the sales amount for that specific
product category in the respective region.
2. Color Coding:
Assign different colors to each product category to make it visually clear which part
of the bar corresponds to which category.
The stacked nature of the bars helps in comparing the total sales amounts across
regions while also understanding the contribution of each product category within a
region.
3. Legend and Labels:
Include a legend to explain the color-coding of the product categories.
Label the axes and provide a title to make the plot more informative.
Use following code to visualize stacked bar plot to compare sales across regions and
categories
In[19]:
#Create a stacked bar plot to compare sales across regions and
categories
sales_by_region_category.unstack().plot(kind="bar", stacked=True,
figsize=(12, 8))
plt.title("Sales Comparison by Region and Product Category")
plt.xlabel("Region")
plt.ylabel("Total Sales Amount")
plt.legend(title="Category")
plt.show()
Out[19]:
Conclusion:
We have implemented data aggregation, we were able to analyze the sales performance of
the retail company by region and identify the top-performing regions. We also identified
the top-selling product categories.
Dated Sign
Performance Innovation Completion Total of Subject
Teacher
3 1 1 5