Implement K-Means Clustering
Preprocessing
In [4]: df.head()
Out[4]:
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES        ORDERDATE   STATUS  QTR_ID  MONTH_ID  ...
0        10107               30      95.70                2  2871.00   2/24/2003 0:00  Shipped       1         2  ...
3        10145               45      83.26                6  3746.70   8/25/2003 0:00  Shipped       3         8  ...
4        10159               49     100.00               14  5205.27  10/10/2003 0:00  Shipped       4        10  ...
5 rows × 25 columns
In [5]: df.shape
In [6]: df.describe()
Out[6]:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES QTR_ID MONTH_ID YEAR_I
localhost:8888/notebooks/Downloads/sales_data.ipynb#Implement-K-Means-clustering/-hierarchical-clustering-on-sales_data_sample.csv-datase… 1/8
4/27/23, 11:08 AM sales_data - Jupyter Notebook
In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 375.0+ KB
In [8]: df.isnull().sum()
Out[8]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64
In [9]: df.dtypes
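Cell In [10] is not captured in this export. Comparing Out[8] with Out[11], twelve columns are gone by the time the null counts are checked again, including all four columns that had missing values. The following is a minimal sketch of a drop that would produce that column set, assuming pandas `DataFrame.drop`; the two-row `demo` frame is a made-up stand-in for the real data, not the notebook's `df`.

```python
import pandas as pd

# Columns listed in Out[8] that no longer appear in Out[11].
cols_to_drop = [
    "ORDERNUMBER", "STATUS", "CUSTOMERNAME", "PHONE",
    "ADDRESSLINE1", "ADDRESSLINE2", "CITY", "STATE",
    "POSTALCODE", "TERRITORY", "CONTACTLASTNAME", "CONTACTFIRSTNAME",
]

# All 25 column names as reported by df.info().
all_cols = [
    "ORDERNUMBER", "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", "SALES",
    "ORDERDATE", "STATUS", "QTR_ID", "MONTH_ID", "YEAR_ID", "PRODUCTLINE",
    "MSRP", "PRODUCTCODE", "CUSTOMERNAME", "PHONE", "ADDRESSLINE1",
    "ADDRESSLINE2", "CITY", "STATE", "POSTALCODE", "COUNTRY", "TERRITORY",
    "CONTACTLASTNAME", "CONTACTFIRSTNAME", "DEALSIZE",
]

# Hypothetical placeholder frame: two empty rows with the real column names.
demo = pd.DataFrame([[None] * len(all_cols)] * 2, columns=all_cols)

# Drop by name; the remaining columns match the list shown in Out[11].
demo = demo.drop(columns=cols_to_drop)
print(demo.columns.tolist())
```

Dropping by explicit name keeps the step readable and fails loudly (with a `KeyError`) if a column name is misspelled.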
In [11]: df.isnull().sum()
Out[11]: QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
COUNTRY 0
DEALSIZE 0
dtype: int64
In [12]: df.dtypes
In [14]: df['COUNTRY'].unique()
In [15]: df['PRODUCTLINE'].unique()
In [16]: df['DEALSIZE'].unique()
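Cells In [17] and In [18] are missing from this export. Since K-Means works on numeric features, categorical columns such as PRODUCTLINE and DEALSIZE are commonly one-hot encoded before the original columns are dropped. The sketch below shows one way to do that with `pd.get_dummies`; it is an assumption about the missing step, and the category values in `demo` are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame; category labels are made up for illustration.
demo = pd.DataFrame({
    "SALES": [2871.00, 3746.70, 5205.27],
    "PRODUCTLINE": ["line_a", "line_a", "line_b"],
    "DEALSIZE": ["small", "medium", "medium"],
})

cat_cols = ["PRODUCTLINE", "DEALSIZE"]

# One indicator column per category value, prefixed with the source column name.
dummies = pd.get_dummies(demo[cat_cols])

# Replace the original text columns with their indicator columns.
encoded = pd.concat([demo.drop(columns=cat_cols), dummies], axis=1)
print(encoded.columns.tolist())
```

A column with many distinct values, such as COUNTRY, would add one indicator column per value, which is a common reason to drop it instead of encoding it.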
In [19]: df_drop = ['COUNTRY','PRODUCTLINE','DEALSIZE'] # Dropping COUNTRY too, as there are a lot of countries.
df = df.drop(df_drop, axis=1)
In [21]: df.drop('ORDERDATE', axis=1, inplace=True) # Dropping ORDERDATE, as the month is already captured in MONTH_ID.
In [24]: plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
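The plotting cell relies on `K` and `distortions`, but the cell that builds them is not part of this export. As a rough sketch of how such values can be produced, the code below runs plain Lloyd's algorithm in NumPy for each candidate k and records the within-cluster sum of squared distances. This is a from-scratch illustration, not the notebook's code: the helper `kmeans_inertia`, the range 1 to 9, and the synthetic `X_demo` are all assumptions. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans` model.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points; keep it if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Synthetic stand-in for X_train: three well-separated 2-D blobs.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 2))
    for c in ((0, 0), (5, 5), (0, 5))
])

K = range(1, 10)  # candidate cluster counts (assumed range)
distortions = [kmeans_inertia(X_demo, k) for k in K]
print(distortions)
```

The "elbow" is the k after which the distortion stops falling sharply; on data with three clear groups the curve typically flattens after k = 3.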
In [26]: X_train.shape
In [31]: counts_df.head()
Out[31]:
Cluster1 Cluster2 Cluster3
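The values of `counts_df` did not survive the export, only its column names. A one-row frame of cluster sizes like this can be built from the predicted labels, for example as sketched here. The `predictions` array is illustrative, and mapping label 0 to `Cluster1` is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative labels only; the real ones come from the fitted model.
predictions = np.array([1, 1, 0, 0, 0, 2])

# minlength=3 keeps a slot for a cluster even if it received no points.
counts = np.bincount(predictions, minlength=3)

# One row, one column per cluster (label 0 -> Cluster1, and so on).
counts_df = pd.DataFrame([counts], columns=["Cluster1", "Cluster2", "Cluster3"])
print(counts_df)
```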
Visualization
In [32]: pca = PCA(n_components=2) #Converting all the features into 2 columns to make it easy to visualize using P
In [34]: reduced_X.head()
Out[34]:
PCA1 PCA2
0 -682.488323 -42.819535
1 -787.665502 -41.694991
2 330.732170 -26.481208
3 193.040232 -26.285766
4 1651.532874 -6.891196
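To make the reduction step less of a black box, here is a small NumPy sketch of what a two-component PCA does: center the data, find the principal directions with an SVD, and project onto the first two. It is an illustration under stated assumptions, not the notebook's code; `pca_fit`, `pca_transform`, and `X_demo` are hypothetical names, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np

def pca_fit(X, n_components=2):
    """Return the feature means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the given principal directions."""
    return (X - mean) @ components.T

# Synthetic stand-in for X_train: 5 features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) * np.array([100.0, 10.0, 1.0, 1.0, 1.0])

mean, components = pca_fit(X_demo)
reduced = pca_transform(X_demo, mean, components)
print(reduced.shape)
```

Because PCA centers but does not rescale the features, a feature with a much larger numeric range tends to dominate the first component. Any other points that should appear in the same 2-D plot, such as cluster centers, need to be projected with the same `mean` and `components` rather than with a PCA refitted on them.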
In [36]: model.cluster_centers_ # Finding the centroids. (3 centroids in total. Each array contains a centroid for
In [38]: reduced_centers
In [39]: plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])
plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300) #Plotting the centri
In [41]: reduced_X.head()
Out[41]:
PCA1 PCA2 Clusters
0 -682.488323 -42.819535 1
1 -787.665502 -41.694991 1
2 330.732170 -26.481208 0
3 193.040232 -26.285766 0
4 1651.532874 -6.891196 0
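The Clusters column holds, for each row, the index of the cluster the model assigned it to. In K-Means that is the index of the nearest centroid. The sketch below shows that assignment rule on a toy example; `assign_clusters`, `points`, and `centers` are hypothetical, and in the notebook the labels come from the fitted model on the full feature set, not from the two PCA columns.

```python
import numpy as np

def assign_clusters(X, centers):
    """Return, for each row of X, the index of its nearest center."""
    # Squared Euclidean distance from every point to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy data: two points near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.5, 0.2], [10.0, 10.0], [9.5, 10.5]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(points, centers)
print(labels)  # [0 0 1 1]
```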