Data Mining Mini Project
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. You are given the task to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).
Univariate Analysis:
Observation:
• Based on the descriptive summary, the data looks clean.
• For most variables, the mean and median are nearly equal.
• The 90th percentile was added to the summary to check spread; the data looks evenly distributed.
• The standard deviation is high for the spending variable.
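As a minimal sketch of these univariate checks (the file name bank_marketing_part1_Data.csv is an assumption; the sample is loaded into a pandas DataFrame df):

```python
import pandas as pd

# Load the credit card usage sample (file name is an assumption).
df = pd.read_csv("bank_marketing_part1_Data.csv")

# Descriptive summary including the 90th percentile to inspect spread;
# comparing mean with 50% (median) shows near-symmetry for most variables.
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.9]).T)

# Skewness per numeric column: positive values indicate a right tail.
print(df.skew(numeric_only=True))
```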
Bi-Variate Analysis:
Observations
- The average credit_limit is around 3.258 (in $10,000s).
- The distributions have a right tail for all variables except probability_of_full_payment, which has a left tail.
Multivariate analysis:
Observation:
- Strong positive correlation between:
• spending & advance_payments
• advance_payments & current_balance
• credit_limit & spending
• spending & current_balance
• credit_limit & advance_payments
• max_spent_in_single_shopping & current_balance
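A sketch of the correlation check behind these pairs, reusing df from the sketch above; the strongly correlated pairs stand out with high coefficients:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations between all numeric variables.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()
```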
1.2 Do you think scaling is necessary for clustering in this case? Justify.
• Scaling is necessary because the variables are on different scales.
• Variables such as spending and advance_payments have larger magnitudes and would otherwise carry more weight in the distance calculations.
• Scaling brings all values into roughly the same range.
• I used the z-score to standardize the data, so most values fall on the same relative scale of about -3 to +3.
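A minimal sketch of the z-score standardization (assuming all columns of df are numeric):

```python
from scipy.stats import zscore

# Standardize every column to mean 0 and standard deviation 1;
# most standardized values then fall roughly between -3 and +3.
scaled_df = df.apply(zscore)
print(scaled_df.describe().T[["mean", "std", "min", "max"]])
```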
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
Dendrogram (Ward linkage):
Observation:
• Both methods give broadly similar results, with only the minor variation we expect.
• Based on the dendrogram, 3 or 4 clusters look reasonable. After further analysis, I went with a 3-group cluster solution for the hierarchical clustering, given this dataset.
• In a real-world setting, more variables could have been captured, such as tenure, BALANCE_FREQUENCY, balance, purchases, purchase installments, and others.
• The three-group cluster solution gives a clear pattern of high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
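A sketch of the Ward-linkage clustering and the 3-group cut described above, using scaled_df from the scaling sketch:

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

# Ward linkage on the z-scored data.
wardlink = linkage(scaled_df, method="ward")

# Truncated dendrogram: only the last 10 merged clusters are drawn.
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.title("Dendrogram (Ward linkage)")
plt.show()

# Cut the tree into the chosen 3-group solution.
df["h_cluster"] = fcluster(wardlink, 3, criterion="maxclust")
print(df["h_cluster"].value_counts())
```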
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score.
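As a sketch, the elbow values (within-cluster sum of squares) and silhouette scores can be computed over a range of k; the range 2-7 is an illustrative choice:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# WSS (inertia) for the elbow curve and silhouette score for each k.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(scaled_df)
    sil = silhouette_score(scaled_df, km.labels_)
    print(f"k={k}  WSS={km.inertia_:.1f}  silhouette={sil:.3f}")
```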
4-cluster solution:
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
3-group cluster via K-Means:
3-group cluster via hierarchical clustering:
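A minimal sketch of how these cluster profiles can be tabulated, assuming the 3-group hierarchical labels were stored in df["h_cluster"] as in the earlier sketch:

```python
# Average of each original (unscaled) variable per cluster, plus cluster
# sizes; the spending levels separate the high/medium/low groups.
profile = df.groupby("h_cluster").mean(numeric_only=True)
profile["freq"] = df["h_cluster"].value_counts()
print(profile.round(2))
```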
Problem 2: CART, Random Forest, Artificial Neural Network
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
Univariate Analysis:
Age:
• Minimum: 8, Maximum: 84
• Mean: 38.09, Median: 36.0, Standard deviation: 10.46
• The distribution appears right-skewed.
• Many outliers are present on both the lower and upper ends.
Commision:
• Minimum: 0.0, Maximum: 210.21
• Mean: 14.53, Median: 4.63, Standard deviation: 25.48
• The distribution is right-skewed.
• Many outliers are present on the upper end.
Duration:
• Minimum: -1 (a negative duration is invalid and likely a data-entry error), Maximum: 4580
• Mean: 70.00, Median: 26.5, Standard deviation: 134.05
• The distribution is right-skewed.
• Many outliers are present on the upper end.
Sales:
• Minimum: 0.0, Maximum: 539.0
• Mean: 60.25, Median: 33.0, Standard deviation: 70.73
• The distribution is right-skewed.
• Many outliers are present on the upper end.
There are outliers in all the variables, but for Sales and Commision these can be genuine business values. Random Forest and CART can handle outliers, so the outliers are not treated for now and the data is kept as it is (an IQR check is sketched below).
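A sketch of the outlier check via the IQR rule (the file name insurance_part2_data.csv is an assumption; Commision is spelled as in the dataset):

```python
import pandas as pd

# Load the insurance data (file name is an assumption).
df2 = pd.read_csv("insurance_part2_data.csv")

# IQR rule: points beyond 1.5 * IQR from the quartiles count as outliers.
for col in ["Age", "Commision", "Duration", "Sales"]:
    q1, q3 = df2[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((df2[col] < lo) | (df2[col] > hi)).sum()
    print(f"{col}: {n_out} outliers outside [{lo:.2f}, {hi:.2f}]")
```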
Bi-Variate Analysis:
Bi-variate plots were examined for the categorical variables: Agency_Code, Type, Channel, Product Name, and Destination.
Multivariate Analysis:
There is not much correlation between the variables; the strongest pairwise correlation is between Sales and Commision, at 0.77.
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network.
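A minimal sketch of the split and the three models (the Claimed target column, the 70:30 split, and the hyperparameters are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical predictors; "Claimed" is the target.
X = pd.get_dummies(df2.drop("Claimed", axis=1), drop_first=True)
y = df2["Claimed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# CART and Random Forest work on the raw features.
cart = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_train, y_train)

# The neural network needs scaled inputs.
scaler = StandardScaler().fit(X_train)
ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000,
                    random_state=1).fit(scaler.transform(X_train), y_train)
```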
CART Model:
Train Data (ROC curve)
Train Data:
• AUC: 82%
• Accuracy: 79%
• Precision: 70%
• f1-Score: 60%
Test Data:
• AUC: 80%
• Accuracy: 77%
• Precision: 80%
• f1-Score: 84%
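As a sketch, metrics of the kind reported above can be computed on the test split (shown for CART; the positive label "Yes" is an assumption about the Claimed column):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             roc_auc_score)

# Test-set metrics for the CART model.
pred = cart.predict(X_test)
proba = cart.predict_proba(X_test)[:, 1]  # probability of the "Yes" class
print("AUC:      ", roc_auc_score(y_test, proba))
print("Accuracy: ", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, pos_label="Yes"))
print("F1:       ", f1_score(y_test, pred, pos_label="Yes"))
```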
RF Model:
Train Data:
• AUC: 86%
• Accuracy: 80%
• Precision: 72%
• f1-Score: 66%
Test Data
• AUC: 82%
• Accuracy: 78%
• Precision: 68%
• f1-Score: 62%
The training and test set results are similar, and with the overall measures reasonably high, this is a good model.
Agency_Code is again the most important variable for predicting a claim.
NN Model:
Train Data (ROC curve)
Train Data:
• AUC: 82%
• Accuracy: 78%
• Precision: 68%
• f1-Score: 59%
Test Data:
• AUC: 80%
• Accuracy: 77%
• Precision: 67%
• f1-Score: 57%
The training and test set results are similar, and with the overall measures reasonably high, this is a good model.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
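A sketch of the side-by-side comparison, reusing the fitted models and the "Yes" label assumption from the earlier sketches:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             roc_auc_score)

# Test-set metrics for all three models in one frame
# (the ANN sees scaled inputs, as during training).
rows = {}
for name, model, X_te in [("CART", cart, X_test), ("RF", rf, X_test),
                          ("ANN", ann, scaler.transform(X_test))]:
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    rows[name] = [roc_auc_score(y_test, proba),
                  accuracy_score(y_test, pred),
                  precision_score(y_test, pred, pos_label="Yes"),
                  f1_score(y_test, pred, pos_label="Yes")]

comparison = pd.DataFrame(rows, index=["AUC", "Accuracy", "Precision", "F1"])
print(comparison.round(2))
```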
CONCLUSION:
The RF model is selected, as its accuracy, precision, recall, and f1 score are better than those of the other two models (CART and NN).
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations.
I strongly recommend collecting more real-time unstructured data and, where possible, more historical data.
Such insights come from examining the insurance data: drawing relations between variables such as day of the incident, time, and age group, and associating them with external information such as location, behavior patterns, weather information, and airline/vehicle types.