
BUSINESS REPORT

On Data Mining (Clustering, CART-RF-ANN)


By - Zohaib Imam.

Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. You are given the task to identify the segments based on credit card usage.

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).

Initial Observation (EDA):


• There are 7 variables and 210 records.
• All the variables are numeric.
• There are no missing values.
• There are no duplicate rows.

Univariate Analysis:
Observation:
• Based on the descriptive summary, the data looks good.
• For most of the variables, the mean and median are nearly equal.
• Including the 90th percentile shows the variation; the values look evenly distributed.
• The standard deviation is highest for the spending variable.

• Minimum spending is 10.59
• Maximum spending is 21.18
• Mean value is 14.85
• Median value is 14.36
• Standard deviation is 2.91
• spending seems to be right skewed
• There are no outliers present in spending
• Minimum advance_payments is 12.41
• Maximum advance_payments is 17.25
• Mean value is 14.56
• Median value is 14.32
• Standard deviation is 1.31
• advance_payments seems to be right skewed
• There are no outliers present in advance_payments

• Minimum probability_of_full_payment is 0.8081
• Maximum probability_of_full_payment is 0.9183
• Mean value is 0.871
• Median value is 0.8735
• Standard deviation is 0.0236
• probability_of_full_payment seems to be slightly left skewed
• There are 3 outliers on the lower side of probability_of_full_payment
• Mean value of current_balance is 5.63
• Median value of current_balance is 5.52
• Standard deviation of current_balance is 0.443
• current_balance seems to be right skewed
• There are no outliers present

• Minimum credit_limit: 2.63
• Maximum credit_limit: 4.033
• Mean value: 3.26
• Median value: 3.237
• Standard deviation: 0.378
• credit_limit seems to be right skewed
• There are no outliers present
• Minimum max_spent_in_single_shopping is 4.519
• Maximum max_spent_in_single_shopping is 6.55
• Mean value is 5.41
• Median value is 5.223
• Standard deviation is 0.491
• max_spent_in_single_shopping seems to be left skewed
• There are no outliers present

• Minimum min_payment_amt: 0.7651
• Maximum min_payment_amt: 8.456
• Mean value: 3.70
• Median value: 3.599
• Standard deviation: 1.50
• min_payment_amt seems to be very slightly right skewed
• There are 2 outliers on the upper side of min_payment_amt

Bi-Variate Analysis:
Observations
- The average credit_limit is around 3.26 (in 10,000s).
- The distribution is right skewed for all variables except probability_of_full_payment, which is left skewed.
Multivariate analysis:
Observation:
- Strong positive correlation between:
• spending & advance_payments
• advance_payments & current_balance
• credit_limit & spending
• spending & current_balance
• credit_limit & advance_payments
• max_spent_in_single_shopping & current_balance
1.2 Do you think scaling is necessary for clustering in this case? Justify.
• Scaling needs to be done, as the variables are on different scales.
• spending and advance_payments have larger magnitudes and would otherwise get more weight in the distance calculations.
• Scaling brings all the values into the same relative range.
• I used the z-score to standardize the data to roughly the same scale of -3 to +3.
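As a minimal sketch, assuming the features live in a pandas DataFrame named df (a hypothetical name), the z-score standardization could look like this:

```python
# Minimal sketch of z-score standardization; `df` is an assumed name for
# the DataFrame holding the 7 numeric features.
import pandas as pd
from scipy.stats import zscore

df_scaled = df.apply(zscore)          # (x - mean) / std, column by column
print(df_scaled.describe().round(2))  # means ~0, std ~1, values roughly in [-3, +3]
```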

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
Dendrogram (Ward linkage):
Observation:
• Both methods give nearly the same cluster means, with only minor variation, which is expected.
• Based on the dendrogram, 3 or 4 clusters look reasonable. After further analysis of the dataset, I went with a 3-cluster solution for the hierarchical clustering.
• In a real setting, more variables would likely be captured, such as tenure, BALANCE_FREQUENCY, balance, purchases, purchase installments, and others.
• The three-cluster solution gives a clear pattern of high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
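A sketch of the Ward-linkage clustering behind the dendrogram above, assuming the scaled DataFrame df_scaled from the scaling step:

```python
# Sketch: Ward-linkage hierarchical clustering on the scaled data.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(df_scaled, method='ward')        # Ward minimizes within-cluster variance
dendrogram(Z, truncate_mode='lastp', p=10)   # condensed view of the last 10 merges
plt.title('Dendrogram (Ward linkage)')
plt.show()

h_labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```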
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score.

From the silhouette scores, the optimal number of clusters could be 3 or 4.
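A sketch of how the elbow curve (within-cluster sum of squares) and silhouette scores could be computed, assuming df_scaled from above; random_state=1 is an assumption carried over from the classifiers later in the report:

```python
# Sketch: WSS (elbow) and silhouette score for candidate cluster counts.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=1).fit(df_scaled)
    sc = silhouette_score(df_scaled, km.labels_)
    print(f'k={k}  WSS={km.inertia_:.1f}  silhouette={sc:.3f}')
```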


K-Means Clustering & Cluster Information:
3-Cluster:

4-Cluster:

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
3 group cluster via Kmeans:
3 group cluster via hierarchical clustering:
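The profile tables above can be reproduced with a sketch like the following, assuming df (unscaled), df_scaled, and a 3-cluster K-Means fit; the variable names are assumptions:

```python
# Sketch: attach K-Means labels to the unscaled data and profile the groups.
from sklearn.cluster import KMeans

df['cluster'] = KMeans(n_clusters=3, random_state=1).fit_predict(df_scaled)
profile = df.groupby('cluster').mean().round(2)
profile['freq'] = df['cluster'].value_counts().sort_index()
print(profile)  # high / medium / low spending groups emerge from the means
```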

Cluster Group Profiles:


Group 1: High Spending
Group 3: Medium Spending
Group 2: Low Spending

Promotional strategies for each cluster:


Group 1: High Spending Group
• Giving reward points might increase their purchases.
• max_spent_in_single_shopping is high for this group, so they can be offered discounts/offers on their next transactions upon full payment.
• Increase their credit limit to encourage greater spending.
• Offer loans against the credit card, as these are customers with a good repayment record.
• Tie up with luxury brands, which will drive more one-time maximum spending.

Group 3: Medium Spending Group


• They are potential target customers who pay their bills, make purchases, and maintain a comparatively good credit score. So, we can increase their credit limit or lower their interest rate.
• Promote premium cards/loyalty cards to increase transactions.
• Encourage greater spending through tie-ups with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2: Low Spending Group


• These customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
• Increase their spending through tie-ups with grocery stores and utilities (electricity, phone, gas, and others).
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).

Initial Observation (EDA):


• 10 variables in total.
• Age, Commision, Duration, and Sales are numeric variables.
• The rest are categorical variables.
• 3000 records with no missing values.
• 9 independent variables and one target variable: Claimed.
• Duration has a negative value, which is not possible; it is a wrong entry.
• For Commision and Sales, the mean and median differ significantly.
• The categorical variables have at most 5 unique values.
• The data looks good at first glance.
• Although there are 139 duplicate records, they could belong to different customers; since there is no customer ID or other unique identifier, they are not dropped.
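These checks could be reproduced with a short sketch; the file name insurance_part2_data.csv and the DataFrame name ins are assumptions:

```python
# Sketch of the initial checks listed above.
import pandas as pd

ins = pd.read_csv('insurance_part2_data.csv')  # assumed file name
print(ins.shape)                    # expect (3000, 10)
print(ins.dtypes)                   # 4 numeric columns, the rest categorical
print(ins.isnull().sum().sum())     # expect 0 missing values
print(ins.duplicated().sum())       # duplicate rows (kept: no unique customer ID)
print((ins['Duration'] < 0).sum())  # negative durations are wrong entries
```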
Univariate Analysis:

• Minimum Age: 8
• Maximum Age: 84
• Mean value: 38.09
• Median value: 36.0
• Standard deviation: 10.46
• Age seems to be right skewed
• Many outliers are present on both the lower and upper sides
• Minimum Commision: 0.0
• Maximum Commision: 210.21
• Mean value: 14.53
• Median value: 4.63
• Standard deviation: 25.48
• Commision is right skewed
• Many outliers are present on the upper side
• Minimum Duration: -1
• Maximum Duration: 4580
• Mean value: 70.00
• Median value: 26.5
• Standard deviation: 134.05
• Duration is right skewed
• Many outliers are present on the upper side
• Minimum Sales: 0.0
• Maximum Sales: 539.0
• Mean value: 60.25
• Median value: 33.0
• Standard deviation: 70.73
• Sales is right skewed
• Many outliers are present on the upper side

There are outliers in all the variables, but for Sales and Commision these can be genuine business values. Random Forest and CART can handle outliers, so they are not treated for now; we keep the data as it is.
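For reference, a sketch of an IQR-based outlier count that backs the observation above (assumes ins from the earlier sketch):

```python
# Sketch: count values outside the 1.5*IQR whiskers for each numeric column.
for col in ['Age', 'Commision', 'Duration', 'Sales']:
    q1, q3 = ins[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((ins[col] < q1 - 1.5 * iqr) | (ins[col] > q3 + 1.5 * iqr)).sum()
    print(col, outliers, 'outliers')
```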
Bi-Variate Analysis:

Agency_Code
Type
Channel
Product Name
Destination
Multivariate Analysis:
There is not much collinearity between the variables. Among them, Sales & Commision have the highest correlation, at 0.77.
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network.
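A sketch of the split, assuming the categorical columns are integer-coded first; the 70:30 ratio matches the train/test confusion-matrix totals (2100/900) later in the report, while stratification is an assumption:

```python
# Sketch: integer-code the categoricals and split 70:30.
import pandas as pd
from sklearn.model_selection import train_test_split

for col in ins.select_dtypes(include='object').columns:
    ins[col] = pd.Categorical(ins[col]).codes  # integer-code categoricals

X = ins.drop('Claimed', axis=1)
y = ins['Claimed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
```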

Decision Tree Classifier


The best parameters reported across the tuning iterations, and the resulting estimators, were:

• {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}
  DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_split=450, random_state=1)
• {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 20, 'min_samples_split': 150}
  DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, min_samples_split=150, random_state=1)
• {'criterion': 'gini', 'max_depth': 3.5, 'min_samples_leaf': 44, 'min_samples_split': 250}
  DecisionTreeClassifier(max_depth=3.5, min_samples_leaf=44, min_samples_split=250, random_state=1)
• {'criterion': 'gini', 'max_depth': 4.85, 'min_samples_leaf': 44, 'min_samples_split': 260}
  DecisionTreeClassifier(max_depth=4.85, min_samples_leaf=44, min_samples_split=260, random_state=1)
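A sketch of the kind of grid search that would produce parameter sets like those above; the grid values are assumptions, and integer max_depth values are used since current scikit-learn versions reject fractional depths:

```python
# Sketch: hyperparameter search for the decision tree (grid values assumed).
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 10],
    'min_samples_leaf': [20, 44, 50],
    'min_samples_split': [150, 250, 260, 450],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
best_dt = grid.best_estimator_
print(grid.best_params_)
```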

Variable Importance-Decision Tree Classifier


Predicted Classes and Probs
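The importance table and prediction outputs above would come from calls like these (best_dt from the grid-search sketch):

```python
# Sketch: variable importance and predicted classes/probabilities for CART.
import pandas as pd

imp = pd.Series(best_dt.feature_importances_, index=X.columns)
print(imp.sort_values(ascending=False))

pred_class = best_dt.predict(X_test)        # predicted classes
pred_prob = best_dt.predict_proba(X_test)   # class probabilities
```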

Random Forest Classifier


• {'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8, 'min_samples_split': 46, 'n_estimators': 350}
  RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8, min_samples_split=46, n_estimators=350, random_state=1)
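A sketch of fitting the tuned Random Forest with the parameters reported above (variable names carried over from the earlier sketches):

```python
# Sketch: tuned Random Forest fit plus its variable importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
                            min_samples_split=46, n_estimators=350,
                            random_state=1)
rf.fit(X_train, y_train)
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False))
```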

Predicted Classes and Probs


Variable Importance via Random Forest

Neural Network Classifier


MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1, tol=0.01)
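A sketch of the neural network fit; MLPs are scale-sensitive, so the inputs are standardized first (the scaling step is an assumption, as the report does not state it):

```python
# Sketch: standardize inputs, then fit the MLP reported above.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

nn = MLPClassifier(hidden_layer_sizes=200, max_iter=2500,
                   random_state=1, tol=0.01)
nn.fit(X_train_s, y_train)
```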

Predicted Classes and Probs


2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score,
classification reports for each model.
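All of the metrics below follow from calls like these (shown for the CART model best_dt; the same pattern applies to RF and NN):

```python
# Sketch: evaluation on train and test sets for one model.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

for name, X_, y_ in [('Train', X_train, y_train), ('Test', X_test, y_test)]:
    preds = best_dt.predict(X_)
    probs = best_dt.predict_proba(X_)[:, 1]
    print(name, 'AUC:', round(roc_auc_score(y_, probs), 2),
          'Accuracy:', round(accuracy_score(y_, preds), 2))
    print(confusion_matrix(y_, preds))
    print(classification_report(y_, preds))
```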

CART Model:
Train Data (ROC curve)

Test Data (ROC curve)


Confusion Matrix (Train data)

array ([[1309, 144],

[ 307, 340]], dtype=int64)

Confusion Matrix (Test data)

array ([[553, 70],

[136, 141]], dtype=int64)

Train Data:
• AUC: 82%
• Accuracy: 79%
• Precision: 70%
• f1-Score: 60%

Test Data:
• AUC: 80%
• Accuracy: 77%
• Precision: 67%
• f1-Score: 58%

Random Forest Model:

Train Data (ROC curve)


Test Data (ROC curve)

Confusion Matrix (Train data)

array ([[1297, 156],

[ 255, 392]], dtype=int64)

Confusion Matrix (Test data)

array ([[550, 73],

[121, 156]], dtype=int64)

Train Data
• AUC: 86%
• Accuracy: 80%
• Precision: 72%
• f1-Score: 66%

Test Data
• AUC: 82%
• Accuracy: 78%
• Precision: 68%
• f1-Score: 62%
The training and test set results are quite similar, and with the overall measures high, this is a good model.
The variable importance ranking again shows the same variable at the top for predicting the claim status.

NN Model:
Train Data (ROC curve)

Test Data (ROC curve)


Confusion Matrix (Train data)
array ([[1298, 155],

[ 315, 332]], dtype=int64)

Confusion Matrix (Test data)


array ([[553, 70],

[138, 139]], dtype=int64)

Train Data:
• AUC: 82%
• Accuracy: 78%
• Precision: 68%
• f1-Score: 59%

Test Data:
• AUC: 80%
• Accuracy: 77%
• Precision: 67%
• f1-Score: 57%

The training and test set results are quite similar, and with the overall measures high, this is a good model.

2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.

Comparison of the performance metrics from the 3 models.


ROC Curve for the 3 models on the Training data.

ROC Curve for the 3 models on the Test data.
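A sketch of overlaying the three ROC curves on the test data (model and variable names from the earlier sketches; the NN uses the standardized inputs):

```python
# Sketch: ROC curves for CART, RF, and NN on the test set.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

for name, model, X_ in [('CART', best_dt, X_test),
                        ('RF', rf, X_test),
                        ('NN', nn, X_test_s)]:
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_)[:, 1])
    plt.plot(fpr, tpr, label=name)
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
plt.legend(); plt.show()
```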

CONCLUSION:
We select the RF model, as its accuracy, precision, recall, and f1-score are better than those of the other two models (CART and NN).
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations.

I strongly recommend that we collect more real-time unstructured data, and more historical data if possible.

This can be understood by looking at the insurance data, drawing relations between variables such as the day of the incident, time, and age group, and associating them with external information such as location, behavior patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefits customers, leading to an increase in conversions, which subsequently raises profits.
• As per the data, 90% of the insurance is sold through the online channel.
• Another interesting fact: almost all the offline business has a claim associated with it; we need to find out why.
• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model, we are getting 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via the Agency than via Airlines, yet the trend shows more claims are processed on the Airline side. We may need to deep dive into the process to understand the workflow and why.

Key performance indicators (KPIs) for insurance claims are:

• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk-transfer solutions in areas like non-damage business interruption and reputational damage.
