Implement K-Means Clustering.: Preprocessing

Uploaded by

Mikasa Jaeger


4/27/23, 11:08 AM sales_data - Jupyter Notebook

Implement K-Means clustering.

In [1]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Importing the required libraries.

In [2]: from sklearn.cluster import KMeans, k_means #For clustering

from sklearn.decomposition import PCA #Linear Dimensionality reduction.

In [3]: df = pd.read_csv("sales_data_sample.csv") #Loading the dataset.

Preprocessing
In [4]: df.head()

Out[4]:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  ...
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1         2  ...
1        10121               34      81.35                5  2765.90    5/7/2003 0:00  Shipped       2         5  ...
2        10134               41      94.74                2  3884.34    7/1/2003 0:00  Shipped       3         7  ...
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3         8  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10  ...

5 rows × 25 columns

In [5]: df.shape

Out[5]: (2823, 25)

In [6]: df.describe()

Out[6]:
        ORDERNUMBER  QUANTITYORDERED    PRICEEACH  ORDERLINENUMBER         SALES       QTR_ID     MONTH_ID     YEAR_I
count   2823.000000      2823.000000  2823.000000      2823.000000   2823.000000  2823.000000  2823.000000  2823.0000
mean   10258.725115        35.092809    83.658544         6.466171   3553.889072     2.717676     7.092455  2003.8150
std       92.085478         9.741443    20.174277         4.225841   1841.865106     1.203878     3.656633     0.6996
min    10100.000000         6.000000    26.880000         1.000000    482.130000     1.000000     1.000000  2003.0000
25%    10180.000000        27.000000    68.860000         3.000000   2203.430000     2.000000     4.000000  2003.0000
50%    10262.000000        35.000000    95.700000         6.000000   3184.800000     3.000000     8.000000  2004.0000
75%    10333.500000        43.000000   100.000000         9.000000   4508.000000     4.000000    11.000000  2004.0000
max    10425.000000        97.000000   100.000000        18.000000  14082.800000     4.000000    12.000000  2005.0000

localhost:8888/notebooks/Downloads/sales_data.ipynb#Implement-K-Means-clustering/-hierarchical-clustering-on-sales_data_sample.csv-datase… 1/8

In [7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 375.0+ KB

In [8]: df.isnull().sum()

Out[8]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64
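
The counts in Out[8] are easier to judge as a share of the 2,823 rows reported in Out[5]. A minimal sketch using only the four non-zero counts copied from that output; the names `null_counts` and `null_share` are illustrative and do not appear in the notebook:

```python
# Non-zero null counts copied from Out[8]; the row count comes from Out[5].
null_counts = {
    "ADDRESSLINE2": 2521,
    "STATE": 1486,
    "POSTALCODE": 76,
    "TERRITORY": 1074,
}
n_rows = 2823

# Share of missing values per column, largest first.
null_share = {col: count / n_rows for col, count in null_counts.items()}
for col, share in sorted(null_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<13} {share:.1%}")
```

On the full DataFrame, `df.isnull().mean()` should give the same fractions for every column in one call.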


In [9]: df.dtypes

Out[9]: ORDERNUMBER int64

QUANTITYORDERED int64
PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
ORDERDATE object
STATUS object
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
PRODUCTLINE object
MSRP int64
PRODUCTCODE object
CUSTOMERNAME object
PHONE object
ADDRESSLINE1 object
ADDRESSLINE2 object
CITY object
STATE object
POSTALCODE object
COUNTRY object
TERRITORY object
CONTACTLASTNAME object
CONTACTFIRSTNAME object
DEALSIZE object
dtype: object

In [10]: df_drop = ['ADDRESSLINE1', 'ADDRESSLINE2', 'STATUS','POSTALCODE', 'CITY', 'TERRITORY', 'PHONE', 'STATE',

df = df.drop(df_drop, axis=1) #Dropping the unnecessary categorical columns along with columns having null

In [11]: df.isnull().sum()

Out[11]: QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
COUNTRY 0
DEALSIZE 0
dtype: int64
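
The `df_drop` list in In [10] is cut off in this export, but the full set of dropped columns can be inferred by comparing the column names in Out[8] with those in Out[11]. The sketch below does that comparison; the variable names are illustrative, and the order in which the original list named the columns is not recoverable:

```python
# Column names copied from Out[8] (before the drop) and Out[11] (after the drop).
before = [
    "ORDERNUMBER", "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", "SALES",
    "ORDERDATE", "STATUS", "QTR_ID", "MONTH_ID", "YEAR_ID", "PRODUCTLINE", "MSRP",
    "PRODUCTCODE", "CUSTOMERNAME", "PHONE", "ADDRESSLINE1", "ADDRESSLINE2", "CITY",
    "STATE", "POSTALCODE", "COUNTRY", "TERRITORY", "CONTACTLASTNAME",
    "CONTACTFIRSTNAME", "DEALSIZE",
]
after = [
    "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", "SALES", "ORDERDATE",
    "QTR_ID", "MONTH_ID", "YEAR_ID", "PRODUCTLINE", "MSRP", "PRODUCTCODE",
    "COUNTRY", "DEALSIZE",
]
# Columns that are still readable in the truncated In [10] line.
visible = [
    "ADDRESSLINE1", "ADDRESSLINE2", "STATUS", "POSTALCODE", "CITY",
    "TERRITORY", "PHONE", "STATE",
]

dropped = [c for c in before if c not in after]
hidden = [c for c in dropped if c not in visible]
print(len(dropped), "columns dropped in total")
print("not visible in In [10]:", hidden)
```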

In [12]: df.dtypes

Out[12]: QUANTITYORDERED int64

PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
ORDERDATE object
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
PRODUCTLINE object
MSRP int64
PRODUCTCODE object
COUNTRY object
DEALSIZE object
dtype: object

In [13]: # Checking the categorical columns.

In [14]: df['COUNTRY'].unique()

Out[14]: array(['USA', 'France', 'Norway', 'Australia', 'Finland', 'Austria', 'UK',

'Spain', 'Sweden', 'Singapore', 'Canada', 'Japan', 'Italy',
'Denmark', 'Belgium', 'Philippines', 'Germany', 'Switzerland',
'Ireland'], dtype=object)


In [15]: df['PRODUCTLINE'].unique()

Out[15]: array(['Motorcycles', 'Classic Cars', 'Trucks and Buses', 'Vintage Cars',

'Planes', 'Ships', 'Trains'], dtype=object)

In [16]: df['DEALSIZE'].unique()

Out[16]: array(['Small', 'Medium', 'Large'], dtype=object)

In [17]: productline = pd.get_dummies(df['PRODUCTLINE']) #Converting the categorical columns.

Dealsize = pd.get_dummies(df['DEALSIZE'])

In [18]: df = pd.concat([df,productline,Dealsize], axis = 1)
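
To see what In [17] and In [18] produce, here is a small sketch on a toy frame. The category values come from Out[15] and Out[16], but the four rows are made up, and `toy` and `encoded` are illustrative names:

```python
import pandas as pd

# Toy frame: category values taken from Out[15] and Out[16], rows invented.
toy = pd.DataFrame({
    "PRODUCTLINE": ["Motorcycles", "Classic Cars", "Ships", "Classic Cars"],
    "DEALSIZE": ["Small", "Medium", "Large", "Small"],
})

# dtype="uint8" mirrors the uint8 columns listed in Out[22];
# pandas 2.x returns bool columns when dtype is not given.
productline = pd.get_dummies(toy["PRODUCTLINE"], dtype="uint8")
dealsize = pd.get_dummies(toy["DEALSIZE"], dtype="uint8")

# Same concat as In [18]: the indicator columns are appended on the right.
encoded = pd.concat([toy, productline, dealsize], axis=1)
print(encoded.drop(columns=["PRODUCTLINE", "DEALSIZE"]))
```

Each original value becomes one indicator column, so every row has exactly one 1 per encoded variable.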

In [19]: df_drop = ['COUNTRY','PRODUCTLINE','DEALSIZE'] #Dropping Country too as there are a lot of countries.
df = df.drop(df_drop, axis=1)

In [20]: df['PRODUCTCODE'] = pd.Categorical(df['PRODUCTCODE']).codes #Converting the datatype.
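
A short sketch of what `pd.Categorical(...).codes` returns, using hypothetical product codes that are not taken from the dataset:

```python
import pandas as pd

# Hypothetical product codes, for illustration only.
codes_in = pd.Series(["B_20", "A_10", "B_20", "C_30"])

cat = pd.Categorical(codes_in)
print(list(cat.categories))  # distinct values, held in sorted order
print(cat.codes)             # position of each value within that order
print(cat.codes.dtype)       # a small integer type, matching int8 in Out[22]
```

The integers are labels assigned by sort order, not magnitudes. K-Means still treats them as numeric distances, so this encoding is probably best read as a convenience for getting an all-numeric frame.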

In [21]: df.drop('ORDERDATE', axis=1, inplace=True) #Dropping the Orderdate as Month is already included.

In [22]: df.dtypes #All the datatypes are converted into numeric

Out[22]: QUANTITYORDERED int64

PRICEEACH float64
ORDERLINENUMBER int64
SALES float64
QTR_ID int64
MONTH_ID int64
YEAR_ID int64
MSRP int64
PRODUCTCODE int8
Classic Cars uint8
Motorcycles uint8
Planes uint8
Ships uint8
Trains uint8
Trucks and Buses uint8
Vintage Cars uint8
Large uint8
Medium uint8
Small uint8
dtype: object

Plotting the Elbow Plot to determine the number of clusters.

In [23]: distortions = [] # Within-cluster sum of squares (inertia) for each k
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_) #Appending the inertia to the distortions


In [24]: plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

As k increases, the inertia decreases.

Observations: An elbow can be observed at k = 3, after which the curve decreases gradually.
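
The quantity plotted on the y-axis is the within-cluster sum of squares. scikit-learn documents `inertia_` as the sum of squared distances of samples to their closest cluster center. A minimal NumPy sketch on four made-up points shows why it drops as clusters are added; the helper `wcss` is illustrative and not part of the notebook:

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data: two tight groups far apart on the x axis.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wcss(points, np.array([0, 0, 0, 0]))
two_clusters = wcss(points, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)
```

With every point in one cluster the centroid sits between the two groups, so each point is far from it. Splitting into two clusters puts a centroid inside each group and the total falls sharply, which is the kind of drop the elbow plot looks for.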
In [25]: X_train = df.values #Returns a numpy array.

In [26]: X_train.shape

Out[26]: (2823, 19)

In [27]: model = KMeans(n_clusters=3,random_state=2) #Number of cluster = 3

model = model.fit(X_train) #Fitting the values to create a model.
predictions = model.predict(X_train) #Predicting the cluster values (0,1,or 2)

In [28]: unique,counts = np.unique(predictions,return_counts=True)

In [29]: counts = counts.reshape(1,3)

In [30]: counts_df = pd.DataFrame(counts,columns=['Cluster1','Cluster2','Cluster3'])

In [31]: counts_df.head()

Out[31]:
Cluster1 Cluster2 Cluster3

0 1083 1367 373
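
In [28] to In [30] could be condensed with `np.bincount`. This is a sketch only: the real `predictions` array is not part of this export, so a stand-in is rebuilt from the counts in Out[31].

```python
import numpy as np
import pandas as pd

# Stand-in for `predictions`, rebuilt from the counts shown in Out[31].
predictions = np.repeat([0, 1, 2], [1083, 1367, 373])

# bincount returns the number of occurrences of each label 0, 1, 2.
counts = np.bincount(predictions)
counts_df = pd.DataFrame([counts], columns=["Cluster1", "Cluster2", "Cluster3"])
print(counts_df)
print(counts.sum())  # should equal the number of rows in X_train
```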

Visualization
In [32]: pca = PCA(n_components=2) #Converting all the features into 2 columns to make it easy to visualize using P

In [33]: reduced_X = pd.DataFrame(pca.fit_transform(X_train),columns=['PCA1','PCA2']) #Creating a DataFrame.


In [34]: reduced_X.head()

Out[34]:
PCA1 PCA2

0 -682.488323 -42.819535

1 -787.665502 -41.694991

2 330.732170 -26.481208

3 193.040232 -26.285766

4 1651.532874 -6.891196
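
PCA1 spans thousands while PCA2 spans tens, which suggests the first component is dominated by the unscaled SALES column. Consistent with that reading, the projected centroids in Out[38] lie close to the SALES centroids in Out[36] minus the SALES mean in Out[6]. The sketch below reproduces the effect on synthetic data with a hand-written PCA; the data and all variable names are assumptions for illustration, not the notebook's values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: one wide-ranging column and one narrow column.
wide = rng.normal(3500.0, 1800.0, size=500)   # roughly the scale of SALES in Out[6]
narrow = rng.normal(35.0, 10.0, size=500)     # roughly the scale of QUANTITYORDERED
X = np.column_stack([wide, narrow])

# PCA by hand: centre the data, then take the singular value decomposition.
centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance carried by each component.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
projected = centered @ components.T

print(explained_ratio)
print(np.abs(components[0]))  # weights of the first component on (wide, narrow)
```

When one column has a far larger variance than the rest, the first component aligns almost entirely with it. Standardising the columns before clustering and PCA is a common way to avoid this, although the notebook does not do so.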

In [35]: #Plotting the normal Scatter Plot

plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])

Out[35]: <matplotlib.collections.PathCollection at 0x24749910>

In [36]: model.cluster_centers_ #Finding the centroids. (3 centroids in total. Each array contains a centroid for

Out[36]: array([[ 3.72031394e+01, 9.52120960e+01, 6.44967682e+00,

4.13868425e+03, 2.72022161e+00, 7.09879963e+00,
2.00379409e+03, 1.13248384e+02, 5.04469067e+01,
3.74884580e-01, 1.15420129e-01, 9.41828255e-02,
8.21791320e-02, 1.84672207e-02, 1.16343490e-01,
1.98522622e-01, 2.08166817e-17, 1.00000000e+00,
-6.66133815e-16],
[ 3.08302853e+01, 7.00755230e+01, 6.67300658e+00,
2.12409474e+03, 2.71762985e+00, 7.09509876e+00,
2.00381127e+03, 7.84784199e+01, 6.24871982e+01,
2.64813460e-01, 1.21433797e-01, 1.29480614e-01,
1.00219459e-01, 3.87710315e-02, 9.21726408e-02,
2.53108998e-01, 2.08166817e-17, 6.21799561e-02,
9.37820044e-01],
[ 4.45871314e+01, 9.98931099e+01, 5.75603217e+00,
7.09596863e+03, 2.71045576e+00, 7.06434316e+00,
2.00389008e+03, 1.45823056e+02, 3.14959786e+01,
5.33512064e-01, 1.07238606e-01, 7.23860590e-02,
2.14477212e-02, 1.07238606e-02, 1.31367292e-01,
1.23324397e-01, 4.20911528e-01, 5.79088472e-01,
1.66533454e-16]])
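
The raw array is hard to read without column names. The sketch below labels a few of the coordinates from Out[36], rounded to two decimals, using the column order in Out[22], where position 0 is QUANTITYORDERED, position 3 is SALES, and the last three are Large, Medium and Small. The names `centers` and `summary` are illustrative:

```python
import pandas as pd

# Selected centroid coordinates copied from Out[36], rounded to two decimals.
centers = pd.DataFrame(
    {
        "QUANTITYORDERED": [37.20, 30.83, 44.59],
        "SALES": [4138.68, 2124.09, 7095.97],
        "Large": [0.00, 0.00, 0.42],
        "Medium": [1.00, 0.06, 0.58],
        "Small": [0.00, 0.94, 0.00],
    },
    index=["cluster 0", "cluster 1", "cluster 2"],
)

# Order clusters by average SALES and name the most common deal size in each.
summary = centers.sort_values("SALES")
summary["dominant_deal_size"] = summary[["Large", "Medium", "Small"]].idxmax(axis=1)
print(summary[["SALES", "dominant_deal_size"]])
```

In the notebook itself, `pd.DataFrame(model.cluster_centers_, columns=df.columns)` should give the full labelled table, since `X_train` was taken from `df.values`.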


In [37]: reduced_centers = pca.transform(model.cluster_centers_) #Transforming the centroids into 3 in x and y coor

In [38]: reduced_centers

Out[38]: array([[ 5.84994044e+02, -4.36786931e+00],

[-1.43005891e+03, 2.60041009e+00],
[ 3.54247180e+03, 3.15185487e+00]])

In [39]: plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])
plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300) #Plotting the centri

Out[39]: <matplotlib.collections.PathCollection at 0x53aa880>

In [40]: reduced_X['Clusters'] = predictions #Adding the Clusters to the reduced dataframe.

In [41]: reduced_X.head()

Out[41]:
PCA1 PCA2 Clusters

0 -682.488323 -42.819535 1

1 -787.665502 -41.694991 1

2 330.732170 -26.481208 0

3 193.040232 -26.285766 0

4 1651.532874 -6.891196 0


In [42]: #Plotting the clusters

plt.figure(figsize=(14,10))
# taking the cluster number and first column taking the same cluster number
plt.scatter(reduced_X[reduced_X['Clusters'] == 0].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 0].loc[
plt.scatter(reduced_X[reduced_X['Clusters'] == 1].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 1].loc[
plt.scatter(reduced_X[reduced_X['Clusters'] == 2].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 2].loc[

plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300)

Out[42]: <matplotlib.collections.PathCollection at 0x2430c448>
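
The three `plt.scatter` lines in In [42] are cut off in this export. One way to draw the same kind of figure is a loop over the cluster labels, sketched here on the five rows shown in Out[41] and the projected centroids from Out[38]. The colours, legend and axis labels are assumptions, not the notebook's settings:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# First five rows of reduced_X from Out[41] and the projected centroids from Out[38].
reduced_X = pd.DataFrame({
    "PCA1": [-682.488323, -787.665502, 330.732170, 193.040232, 1651.532874],
    "PCA2": [-42.819535, -41.694991, -26.481208, -26.285766, -6.891196],
    "Clusters": [1, 1, 0, 0, 0],
})
reduced_centers = np.array([
    [5.84994044e+02, -4.36786931e+00],
    [-1.43005891e+03, 2.60041009e+00],
    [3.54247180e+03, 3.15185487e+00],
])

fig, ax = plt.subplots(figsize=(14, 10))
# One scatter call per cluster label; the colours are an arbitrary choice.
for label, colour in zip(range(3), ["tab:blue", "tab:orange", "tab:green"]):
    subset = reduced_X[reduced_X["Clusters"] == label]
    ax.scatter(subset["PCA1"], subset["PCA2"], color=colour, label=f"Cluster {label}")
# Centroids drawn last so the markers sit on top of the points.
ax.scatter(reduced_centers[:, 0], reduced_centers[:, 1], color="black", marker="x", s=300)
ax.set_xlabel("PCA1")
ax.set_ylabel("PCA2")
ax.legend()
```

Inside a notebook the backend line can be dropped, and the figure is displayed when the cell finishes.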
