0% found this document useful (0 votes)
39 views

DMV Lab 12

Aids
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views

DMV Lab 12

Aids
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Department of AI & DS Engineering Computer Lab I

*************************************************************************************
Part II
Assignment 12
*************************************************************************************
Title: Data Aggregation
Problem Statement: Analyzing Sales Performance by Region in a Retail Company.
Dataset: ""Retail_Sales_Data.csv"
Description: The dataset contains information about sales transactions in a retail
company. It includes attributes such as transaction date, product category, quantity sold,
and sales amount. The goal is to perform data aggregation to analyze the sales performance
by region and identify the top-performing regions.
Tasks to Perform:
1. Import the "Retail_Sales_Data.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Identify the relevant variables for aggregating sales data, such as region, sales amount,
and product category.
4. Group the sales data by region and calculate the total sales amount for each region.
5. Create bar plots or pie charts to visualize the sales distribution by region.
6. Identify the top-performing regions based on the highest sales amount.
7. Group the sales data by region and product category to calculate the total sales amount
for each combination.
8. Create stacked bar plots or grouped bar plots to compare the sales amounts across
different regions and product categories.

Theory:

Data aggregation is the process of combining data from multiple sources or summarizing
data from a single source to produce a more concise and meaningful representation of the
data. It can be used to identify trends, patterns, and relationships in the data that would not
be apparent if the data was analyzed individually.

Data aggregation can be performed on a variety of data types, including numerical data,
categorical data, and text data. Some common aggregation operations include:

 Sum: Calculates the sum of all values in a column or group of columns.


 Mean: Calculates the average of all values in a column or group of columns.
 Median: Calculates the middle value in a sorted list of values.
 Mode: Calculates the most frequently occurring value in a column or group of
columns.
 Count: Counts the number of non-null values in a column or group of columns.
Examples of how data aggregation:

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

 Identifying the top-performing sales regions: A retail company can aggregate sales
data by region to identify the regions that are generating the most revenue.
 Tracking website traffic: A website owner can aggregate website traffic data by
source to identify the most effective marketing channels.
 Data aggregation is a powerful tool that can be used to gain insights from data and
make better decisions.
Benefits of data aggregation:

 Improved efficiency: Data aggregation can help businesses to improve their


efficiency by automating tasks such as report generation and data analysis.

 Increased accuracy: Data aggregation can help businesses to increase the accuracy
of their data by reducing the number of errors that occur when data is manually
processed.

 Enhanced decision-making: Data aggregation can help businesses to make better


decisions by providing them with a more complete and accurate view of their data.

1. Import the "Retail_Sales_Data.csv" dataset.

It is csv file, we can use pandas. The read_csv is used to read it into a DataFrame:
In [5]:
import
2. pandas as pd
3.
# 4.Import Dataset
df=pd.read_csv('Retail_Sales_Data.csv');
5.
df.head();

2. Explore the dataset to understand its structure and content.


 data.head(): Assuming that data is a pandas DataFrame, head() is a method that is
used to display the first few rows of the DataFrame. It provides a quick way to
inspect the structure and content of the dataset. by default, shows the first 5 rows of
the DataFrame. You can specify the number of rows you want to display by passing a
number inside the parentheses, like data.head(10) to show the first 10 rows.
 info() method in Pandas is used to get a concise summary of a DataFrame, including
information about its columns, data types, non-null values, and memory usage.
When you call df.info(), it prints a summary report to the console.

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

#Explore the dataset


data.head()

#Explore the dataset


data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99457 entries, 0 to 99456
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 invoice_no 99457 non-null object
1 customer_id 99457 non-null object
2 gender 99457 non-null object
3 age 99457 non-null int64
4 category 99457 non-null object
5 quantity 99457 non-null int64
6 price 99457 non-null float64
7 payment_method 99457 non-null object
8 invoice_date 99457 non-null object
9 shopping_mall 99457 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 7.6+ MB

3. Identify the relevant variables for aggregating sales data, such as region, sales
amount, and product category.
Aggregating sales data involves summarizing and grouping information to derive
meaningful insights. The choice of relevant variables depends on the specific goals of
analysis and the nature of the business. However, some common variables for aggregating
sales data include:
1. Shopping Mall:
 Mall ID or Name: To identify and group sales data based on the shopping mall
where the transactions occurred.
2. Price:
 Unit Price: The price of each unit of the product sold.
 Total Sales: The total revenue generated from the sale of products.

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

3. Category:
 Product Category: Categorization of products based on their type (e.g.,
electronics, clothing, food).
 Sales by Category: Aggregating sales data based on the product categories.

#Identify relevant variables & make list of relevant variables


relevant_columns = ["shopping_mall", "price", "category"]

4. Group the sales data by region and calculate the total sales amount for each
region.

Grouping sales data by region and shopping mall and calculating the total sales amount for
each region serves several important purposes in business analysis and decision-making.
In[14]:
#Group by shopping mall and calculate total sales amount
sales_by_region = data.groupby("shopping_mall")["price"].sum()
sales_by_region

Out[14]:
shopping_mall
Cevahir AVM 3433671.84
Emaar Square Mall 3390408.31
Forum Istanbul 3336073.82
Istinye Park 6717077.54
Kanyon 13710755.24
Mall of Istanbul 13851737.62
Metrocity 10249980.07
Metropol AVM 6937992.99
Viaport Outlet 3414019.46
Zorlu Center 3509649.02
Name: price, dtype: float64

5. Create bar plots or pie charts to visualize the sales distribution by region.

Pie charts represent data in a circular graph, where each slice (or sector) of the pie
corresponds to a particular category or group. In the context of sales distribution by region,
each slice represents a different region, and the size of each slice indicates the proportion
of sales attributed to that region.
In[15]:
#Create a pie plot to visualize sales distribution by region
plt.figure(figsize=(6, 6))
plt.pie(sales_by_region, labels=sales_by_region.index,
autopct="%1.1f%%", startangle=140)
plt.title("Sales Distribution by Region")
plt.axis("equal")
plt.show()

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

Out[15]:

In[16]:
plt.figure()
plt.bar(sales_by_region.index, sales_by_region.values)
plt.xlabel('Region')
plt.ylabel('Total Sales Amount')
plt.title('Total Sales Amount by Region')
plt.show()
Out[16]:

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

6. Identify the top-performing regions based on the highest sales amount.


In[17]:

#Identify top-performing regions


top_regions = sales_by_region.sort_values(ascending=False).head(5)
print("Top-performing regions:")
print(top_regions)

Out[17]:
Top-performing regions:
shopping_mall
Mall of Istanbul 13851737.62
Kanyon 13710755.24
Metrocity 10249980.07
Metropol AVM 6937992.99
Istinye Park 6717077.54
Name: price, dtype: float64

7. Group the sales data by region and product category to calculate the total sales amount for
each combination.
In[18]:
#Group by region of shopping mall and product category, calculate
total sales
sales_by_region_category = data.groupby(["shopping_mall",
"category"])["price"].sum()

Out[17]:
shopping_mall category
Cevahir AVM Books 11998.80
Clothing 1554414.40
Cosmetics 88394.84
Food & Beverage 11992.39
Shoes 884050.41
...
Zorlu Center Food & Beverage 11589.68
Shoes 953670.13
Souvenir 8398.68
Technology 803250.00
Toys 54691.84
Name: price, Length: 80, dtype: float64

8. Create stacked bar plots or grouped bar plots to compare the sales amounts across
different regions and product categories.

A stacked bar plot is a type of bar chart that represents individual data values as bars, with
each bar divided into segments that represent different categories or groups. In the context

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

of comparing sales amounts across different regions and product categories, a stacked bar
plot can be a useful visualization tool.
A stacked bar plot to compare sales amounts:
1. Plotting:
 Use a plotting library (such as Matplotlib in Python or ggplot2 in R) to create a
stacked bar plot.
 The x-axis should represent the regions, and the y-axis should represent the total
sales amount.
 Each bar is divided into segments, with each segment representing a different
product category.
 The height of each segment corresponds to the sales amount for that specific
product category in the respective region.
2. Color Coding:
 Assign different colors to each product category to make it visually clear which part
of the bar corresponds to which category.
 The stacked nature of the bars helps in comparing the total sales amounts across
regions while also understanding the contribution of each product category within a
region.
3. Legend and Labels:
 Include a legend to explain the color-coding of the product categories.
 Label the axes and provide a title to make the plot more informative.

Use following code to visualize stacked bar plot to compare sales across regions and
categories
In[19]:
#Create a stacked bar plot to compare sales across regions and
categories
sales_by_region_category.unstack().plot(kind="bar", stacked=True,
figsize=(12, 8))
plt.title("Sales Comparison by Region and Product Category")
plt.xlabel("Region")
plt.ylabel("Total Sales Amount")
plt.legend(title="Category")
plt.show()

Out[19]:

Matoshri College of Engineering & Research Centre, Nashik


Department of AI & DS Engineering Computer Lab I

Conclusion:

We have implemented data aggregation, we were able to analyze the sales performance of
the retail company by region and identify the top-performing regions. We also identified
the top-selling product categories.

Dated Sign
Performance Innovation Completion Total of Subject
Teacher
3 1 1 5

Matoshri College of Engineering & Research Centre, Nashik

You might also like