Data Analytics Report - Case Study - Employee Attrition
Joy’s law states that “no matter who you are, most of the smartest people work for someone else”.
Employee attrition is a critical problem for any Human Resources department. In this era of
competition, it is imperative to understand the factors that lead to employee attrition and employee
retention. Some of those factors are obvious, while others are hidden.
Can our data on employee attrition give insights into why people are leaving the company? Can it
tell us which segments of employees are leaving? Can it tell us where we should focus?
Answers to some of these questions will help a CHRO take steps in the right direction and improve
employee morale and engagement to reduce attrition.
In this data analytics report we take you through a methodical framework developed by us at
bigintsolutions.com and dive deep into each step: understanding the data, visualizing it, and
identifying the factors influencing attrition.
1. Explore employee attrition data through various statistical and visualization techniques
2. Find out factors influencing attrition
3. Create a model to predict attrition
4. Provide final conclusions
1. Data Exploration
2. Distribution Analysis with respect to the variable of interest
3. Model Development
4. Model Analysis and Conclusion
Language used: R
Tools used: R Studio, Power BI
Live demo of Employee Attrition Analytics Power BI Report:
https://bigintsolutions.com/showcase/employee-attrition-analytics/
Disclaimer: This report is intended to showcase our approach to data analytics problems. By no
means is the model performance to be taken as final. The purpose is not to fine-tune and provide
the best model, but only to showcase the process and approach.
About the Founder
Ranbeer Makin is a passionate data analytics professional with 11+ years of industry experience
working with enterprises and providing them with solutions to large-scale data problems for driving
business decisions. He has helped customers architect, design and create next-generation
data analytics, machine learning, and AI platforms. He is a thought leader and an experienced
professional with a demonstrated history in the marketing and healthcare domains. Skilled in Data
Engineering and Product Management, he has extensive hands-on experience in Data Analytics, Data
Warehousing, Data Visualization, and DevOps techniques and tools.
Prior to starting his entrepreneurial journey, Ranbeer was Director Engineering/CTO for a healthcare
start-up, helping them develop a predictive analytics and patient care engagement platform. He
started his career with application development in C++, where he and his team architected and
developed a search rewards platform for a popular search engine. He then spent 8+ years in the data
and BI space, where he worked on a big data marketing and analytics platform and was responsible for
the delivery of large-scale projects. He is a certified agile software practitioner from the Project
Management Institute (PMI-ACP).
Connect with him on
LinkedIn: https://www.linkedin.com/in/ranbeer/
Contact: https://www.BIGINTSolutions.com/Contact
Table of Contents
Data Exploration
What kind of variables are we going to work with?
What is the basic statistics of our data?
What is the structure of our data?
How is our data distributed?
Do we have missing data?
What is the detailed profile of our variables?
Which variables to remove?
Distribution Analysis
Univariate Analysis
Attrition Distribution by Age
Attrition Distribution by Education Field
Attrition Distribution by Job Role
Attrition Distribution by Business Travel
Attrition Distribution by Department
Attrition Distribution by Gender
Attrition Distribution by Marital Status
Attrition Distribution by OverTime
Attrition Distribution by NumCompaniesWorked
Attrition Distribution by DistanceFromHome
Attrition Distribution by JobLevel
Attrition Distribution by PercentSalaryHike
Attrition Distribution by StockOptionLevel
Attrition Distribution by TotalWorkingYears
Attrition Distribution by TrainingTimesLastYear
Attrition Distribution by YearsInCurrentRole
Attrition Distribution by YearsAtCompany
Attrition Distribution by YearsSinceLastPromotion
Attrition Distribution by YearsWithCurrManager
Attrition Distribution by Rates
Attrition Distribution by Job Satisfaction
Attrition Distribution by Education
Attrition Distribution by Job Involvement
Attrition Distribution by Environmental Satisfaction
Attrition Distribution by other factors
Multivariate Analysis
Why are young people leaving the company?
Why is attrition so high in Sales Rep Job Role?
What is the characteristics of employees who are doing overtime?
Advanced Distribution Analysis
Box-Whisker Plot
Model Development
Top factors influencing attrition
Feature Engineering
Fitting a model
Model analysis
Error analysis
Tuning
ROC/AUC
Conclusion
Next Steps
References
Contact Us
Data Exploration
The first step in our framework is “Data Exploration”. In this step we answer the following
questions:
1. What kind of data are we going to work with?
2. Do we have missing data?
3. What profile of data do we have?
4. Should we remove obvious data points?
For this step, we use the “DataExplorer” library to provide basic data statistics.
Install library:
install.packages("DataExplorer")
Data report:
library("DataExplorer")
attrition <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
create_report(attrition)
The create_report function generates a detailed report for our data. Let’s see what this report
contains.
The data on employee attrition comprises 1,470 observations of 35 features, i.e. there are 1,470
rows and 35 columns. Of the 35 columns, 9 are discrete and 26 are continuous, and there are 0
missing values out of 51,450 data points.
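The headline numbers above can be reproduced with a few base R calls. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical stand-ins for the attrition data); it assumes that character columns count as discrete and numeric columns as continuous.

```r
# Hypothetical stand-in for the attrition data frame.
toy <- data.frame(
  Age       = c(41, 49, 37, 33),
  Attrition = c("Yes", "No", "Yes", "No"),
  OverTime  = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

n_rows     <- nrow(toy)
n_cols     <- ncol(toy)
n_points   <- n_rows * n_cols                  # data points = rows x columns
n_discrete <- sum(sapply(toy, is.character))   # assumed: character = discrete
n_contin   <- sum(sapply(toy, is.numeric))     # assumed: numeric = continuous
n_missing  <- sum(is.na(toy))                  # total number of NA cells
```

For the real dataset the same product gives 1,470 x 35 = 51,450 data points.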
What is the structure of our data?
You can also see the structure of your data in a tabular format using
str(attrition)
## $ JobLevel : num 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : num 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num 19479 24907 2396 23159 16632 ...
## $ NumCompaniesWorked : num 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num 5 7 0 0 2 6 0 0 8 7 …
The report also gives univariate distribution through histograms and bar charts. Here’s a sample of
the charts that the library generated.
Histogram
Bar Chart (by frequency)
We will see in subsequent sections how to do a detailed analysis of our data with respect to our
variable of interest (i.e., attrition).
The report generated by the DataExplorer package shows that there are no missing values in our
dataset. That’s a good thing, but in general we may have to deal with missing data.
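When a dataset does have gaps, a per-column count is a quick first check. The sketch below uses a hypothetical data frame with two missing values injected on purpose; dropping incomplete rows is only one of several possible treatments and is shown here as an assumption, not as the approach of this report.

```r
# Hypothetical data frame with two NA values injected on purpose.
toy <- data.frame(
  Age           = c(41, NA, 37, 33),
  MonthlyIncome = c(5993, 5130, NA, 2909),
  OverTime      = c("Yes", "No", "Yes", "Yes")
)

missing_per_col <- colSums(is.na(toy))          # number of NA values per column
complete_rows   <- toy[complete.cases(toy), ]   # keep only rows without any NA
```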
What is the detailed profile of our variables?
To understand the spread of our data, it is important to understand each variable’s
characteristics. We do this with the “summary” function.
Code:
summary(attrition)
How to read the above summary?
Employee ages range from 18 to 60 years. The average age is approximately 37 years and the
median is 36 years, so the workforce is fairly young.
There are 237 positive cases (Attrition = Yes) and 1,233 negative cases (Attrition = No) in our
dataset, i.e. about 16% of the employees left.
The distance from home to office ranges from 1 to 29 units, with an average close to 9 units.
EmployeeCount is 1 for every employee (min = 1st quartile = median = mean = 3rd quartile =
max = 1), so it cannot have any effect on employee attrition and we can eliminate this column
from further analysis.
From the summary we learn:
1. Data spread based on min, max, median, mean, 1st quartile and 3rd quartile (this is best
represented as a box-whisker plot, which we will see in subsequent sections)
2. Variables which we can safely remove, such as EmployeeCount, EmployeeNumber,
StandardHours, Over18
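Columns that hold a single value in every row can also be found programmatically instead of by reading the summary. Here is a minimal sketch on a hypothetical data frame (toy and its values are made up for illustration):

```r
# Hypothetical stand-in with three constant columns.
toy <- data.frame(
  Age           = c(41, 49, 37, 33),
  EmployeeCount = c(1, 1, 1, 1),
  StandardHours = c(80, 80, 80, 80),
  Over18        = c("Y", "Y", "Y", "Y")
)

# A column with one distinct value cannot help explain attrition.
is_constant   <- sapply(toy, function(col) length(unique(col)) == 1)
constant_cols <- names(toy)[is_constant]
toy_reduced   <- toy[, !is_constant, drop = FALSE]
```

A check like this would not flag EmployeeNumber, which differs on every row; an identifier of that kind is dropped for a different reason, namely that it carries no pattern to learn from.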
However, certain variables such as Education and JobInvolvement are discrete factor variables stored
in numeric form. We would like to convert them into labelled categories.
Code:
# Map the 1..4 (or 1..5) rating codes to readable labels.
attrition$EnvironmentSatisfaction <- cut(attrition$EnvironmentSatisfaction,
    breaks=c(0,1,2,3,4),
    labels=c("Low","Medium","High","Very High"),
    right=TRUE)
attrition$Education <- cut(attrition$Education,
    breaks=c(0,1,2,3,4,5),
    labels=c("Below College","College","Bachelor","Master","Doctor"),
    right=TRUE)
attrition$JobInvolvement <- cut(attrition$JobInvolvement,
    breaks=c(0,1,2,3,4),
    labels=c("Low","Medium","High","Very High"),
    right=TRUE)
attrition$JobSatisfaction <- cut(attrition$JobSatisfaction,
    breaks=c(0,1,2,3,4),
    labels=c("Low","Medium","High","Very High"),
    right=TRUE)
attrition$RelationshipSatisfaction <- cut(attrition$RelationshipSatisfaction,
    breaks=c(0,1,2,3,4),
    labels=c("Low","Medium","High","Very High"),
    right=TRUE)
attrition$PerformanceRating <- cut(attrition$PerformanceRating,
    breaks=c(0,1,2,3,4),
    labels=c("Low","Good","Excellent","Outstanding"),
    right=TRUE)
attrition$WorkLifeBalance <- cut(attrition$WorkLifeBalance,
    breaks=c(0,1,2,3,4),
    labels=c("Bad","Good","Better","Best"),
    right=TRUE)
Let’s run the summary function again and see the output.
summary(attrition)
Based on the above analysis, we will drop some of the variables as they are obvious candidates for
removal.
Code:
attrition <- subset(attrition, select = -c(EmployeeCount,
EmployeeNumber, StandardHours, Over18))
summary(attrition)
Distribution Analysis
Now that we know the basic statistics and structure of our data, it’s time to analyze our dataset
with respect to our variable of interest. We will see univariate and multivariate analysis in this
section. Univariate analysis is done in R and multivariate analysis is done in Power BI.
Univariate Analysis
Our variable of interest is attrition, so we would like to analyze each of the variables with respect to
attrition %.
Here’s the general code to generate such a plot for each variable.
library(ggplot2)
library(ggthemes)
library(dplyr)
theme(legend.position="right", legend.direction="vertical",
legend.title = element_text("Attrition"),
axis.text.x = element_text(angle = 45, hjust = 1, vjust =
1))+
labs(x=variableText, y="% Employees") +
ggtitle(paste("Attrition distribution by", variableText, sep=" "))
p4
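The quantity behind each of these plots is the share of leavers within each level of a variable. The sketch below computes it in base R on a hypothetical data frame; it assumes "attrition %" means the within-group rate, which is how the figures in the following sections read.

```r
# Hypothetical data: 3 employees with overtime, 5 without.
toy <- data.frame(
  OverTime  = c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  Attrition = c("Yes", "Yes", "No",  "No", "No", "Yes", "No", "No")
)

# Share of leavers within each level of the grouping variable, in percent.
attrition_pct <- tapply(toy$Attrition == "Yes", toy$OverTime, mean) * 100

# barplot(attrition_pct, ylab = "% Employees") would draw it with base graphics.
```

Here 1 of 5 employees without overtime left (20%) against 2 of 3 with overtime (about 66.7%).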
Attrition Distribution by Age
The chart shows that attrition is highest between 18 and 25 years of age. From there up to 45 years,
attrition tends to decrease. Employees above 55 years, however, have a comparatively high attrition
of around 16%.
Attrition Distribution by Education Field
Attrition is highest among employees whose education field is Human Resources and lowest among
those in the Other category, i.e. a field other than Human Resources, Life Sciences, Marketing,
Medical and Technical Degree. Ranked by attrition %, the education fields are:
Human Resources > Technical Degree > Marketing > Life Sciences > Medical > Other
Attrition Distribution by Job Role
Attrition is highest for Sales Representatives (~40%) and lowest for Research Directors. Ranked by
attrition %, the job roles are:
Sales Representative > Laboratory Technician > Human Resources > Sales Executive > Research
Scientist > Healthcare Representative > Manufacturing Director > Manager > Research Director
Attrition Distribution by Business Travel
Non-travelers have the lowest attrition, whereas employees who need to travel frequently have the
highest, at around 25%.
Attrition Distribution by Department
Attrition is highest in the Sales department and lowest in the Research and Development
department. This is consistent with the attrition by job role, where we saw that Sales
Representatives have almost 40% attrition and Sales Executives 17.5%.
Attrition Distribution by Marital Status
Attrition is highest among single employees (25.5%) and lowest among divorced employees.
Attrition Distribution by OverTime
Self-evident: employees who work overtime show higher attrition.
Attrition Distribution by NumCompaniesWorked
Employees with the most work experience, i.e. those who have worked for 7-9 companies, have the
highest attrition, and employees with the least work experience (from a few months to close to 2
years) have the second highest. The gap between highly experienced employees and new joiners is
3.1 percentage points.
Attrition Distribution by DistanceFromHome
Employees who travel 1 to 11 units have attrition of roughly 13% to 15%. Among them, those who
travel 1-3 units to work have the lowest attrition (12.9%), whereas those who travel 3 to 11 units
have almost 15%. Employees living 11 to 15 units from their workplace have the highest attrition,
22.5%. Employees living 20-29 units from their workplace also have comparatively high attrition,
21.4%. Overall, employees living farther than 11 units from their workplace have a high rate of
attrition.
Attrition Distribution by JobLevel
Attrition is highest at job level 1, at 26.3%. At level 2 it drops to 9.7%, a fall of 16.6 percentage
points. At level 3 there is again a slight increase of around 5 percentage points. Attrition is lowest
at levels 4 and 5: by then employees have a lot of experience, and we have already seen that overall
attrition decreases with age.
Attrition Distribution by PercentSalaryHike
The chart suggests that the salary hike range has no visible effect on attrition.
Attrition Distribution by StockOptionLevel
Employees at the lowest stock option level have the highest attrition, 24.4%, whereas employees at
stock option level 2 or 3 have comparatively low attrition.
Attrition Distribution by TotalWorkingYears
Employees who have just started working or have worked for up to almost 6 years have the highest
attrition, around 29% (28.8%). Employees who have worked for about 11 years have attrition of
around 15%, and those who have worked for more than 11 years have even lower attrition, 10.1%.
Attrition Distribution by YearsInCurrentRole
Attrition Distribution by YearsAtCompany
As the number of years at the company increases, attrition % decreases, from 34.9% to 8.9%.
Employees who have just joined or have been with the company for up to 2 years are the most likely
to leave.
Attrition Distribution by YearsSinceLastPromotion
The chart suggests that the number of years since the last promotion has no visible effect on attrition.
Attrition Distribution by Rates
Attrition Distribution by Job Satisfaction
As one would expect, employees with low job satisfaction tend to leave their job and role for a
better opportunity. Attrition follows a decreasing trend with job satisfaction: as satisfaction
increases, the attrition rate falls, down to 11.3%.
Attrition Distribution by Education
Employees whose education is below college level have the highest attrition, at 18.2%, and those
with a doctorate have the lowest, at 10.4%.
Attrition Distribution by Job Involvement
Here again, the higher the involvement in the job, the lower the rate of attrition. When involvement
is low the rate is 33.7%, whereas when involvement is very high the rate drops to 9%.
Attrition Distribution by Environmental Satisfaction
As expected, low environmental satisfaction goes with high attrition. For medium, high or very high
satisfaction the rate lies between 13.5% and 15%, a spread of only 1.5 percentage points, i.e. the
rate is essentially flat.
In this section we saw that univariate analysis helps in understanding the effect of individual
variables on attrition. We measured attrition % with respect to each variable. We found how distance
from home and years at the company affect attrition, and we saw how years with current manager
and overtime relate to attrition.
This analysis is useful but not comprehensive. It does not let us check whether several factors
affect attrition together. We will analyze this using multivariate analysis.
Multivariate Analysis
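Based on the univariate analysis done in the previous section, we can ask several deeper questions
to strengthen our understanding, such as why young people are leaving the company, why attrition
is so high in the Sales Representative job role, and what characterizes employees who work overtime.
To answer such questions we use Power BI to analyze multiple factors at once. This tool
allows us to interact with our data and gives a deeper understanding of the causes of attrition.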
For a live demo of the report, you can check our company website:
https://bigintsolutions.com/showcase/employee-attrition-analytics/
Why are young people leaving the company?
After selecting the age range for young employees (18-35), we can see:
• More than 30% of those who left are divorced
• 50% of the attrition comes from employees living far from the office (more than 7 miles)
• 71% of the attrition comes from employees whose job involvement is low
• This group has the lowest job satisfaction
As you can see above, Power BI allows us to measure attrition with respect to several factors at
once. We can select filters and see how the data changes based on that selection.
We invite you to explore the report hosted on our website and see for yourself.
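The same kind of slicing can be approximated in R by combining several filters. The sketch below is not the Power BI report itself; it uses a hypothetical data frame to show the idea of filtering on age and attrition together and then breaking the result down by another factor.

```r
# Hypothetical stand-in for the attrition data.
toy <- data.frame(
  Age           = c(22, 29, 34, 31, 45, 52, 27, 38),
  Attrition     = c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"),
  MaritalStatus = c("Single", "Divorced", "Single", "Married",
                    "Married", "Single", "Single", "Divorced")
)

# Apply two filters at once: young employees (18-35) who left.
young_leavers <- subset(toy, Age >= 18 & Age <= 35 & Attrition == "Yes")

# Share of each marital status among those leavers, in percent.
share_pct <- prop.table(table(young_leavers$MaritalStatus)) * 100
```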
Advanced Distribution Analysis
Box-Whisker Plot
We will plot the age distribution of the dataset, split into positive and negative attrition cases.
Code to achieve this:
Output:
In simpler terms, the employees who leave are mostly young, with a mean age of around 34; the
youngest quarter of them is aged 18-27 years.
A visualization like this makes it easy to see the spread of the data.
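A box-whisker plot is built from the five-number summary of each group. The following is a minimal base R sketch on hypothetical ages (the values are made up and do not reproduce the figures quoted above):

```r
# Hypothetical ages for 5 leavers and 7 stayers.
toy <- data.frame(
  Age       = c(19, 24, 28, 31, 35, 26, 33, 38, 42, 47, 51, 36),
  Attrition = c(rep("Yes", 5), rep("No", 7))
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group:
# these are the values a box-whisker plot draws.
five_num <- tapply(toy$Age, toy$Attrition, fivenum)

# boxplot(Age ~ Attrition, data = toy) would draw the plot itself.
```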
Model Development
Ok, it’s time to start developing our model. So far we have analyzed our dataset through various
techniques including basic statistics, data profiling, and univariate and multivariate distribution
analysis, and we have seen one example of advanced distribution analysis.
We have a good (implicit) understanding of the variables involved and their importance. In this
section, we will look at how to statistically select the top factors influencing our variable of
interest, do a bit of feature engineering, fit a model, and dive deep into the model results.
Top factors influencing attrition
There are various ways to determine the factors influencing attrition. We can use PCA (Principal
Component Analysis), RFE (Recursive Feature Elimination), or the Random Forest algorithm. For this
report, we will go with the Random Forest algorithm.
Code:
#Feature importance
library(randomForest)
set.seed(100)
rf<-randomForest(Attrition~.,data=attrition, importance=T)
varImpPlot(rf, type=1)
Output:
The higher the mean decrease in accuracy, the higher the importance of the variable in
the model. In the plot shown above, OverTime is the most important variable.
There is another measure of importance, MeanDecreaseGini, which we do not discuss in this
report.
Note: We have not yet done any feature engineering, so the variable importance results may
change after that.
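To make "mean decrease in accuracy" concrete: the idea is to shuffle one variable and measure how much accuracy drops. The sketch below is a hand-rolled illustration of that idea, not the randomForest implementation; it swaps in a logistic regression (glm) so it runs with base R only, scores on the training rows rather than on out-of-bag samples, and uses hypothetical variables signal and noise.

```r
set.seed(100)
n <- 400
toy <- data.frame(
  signal = rnorm(n),   # hypothetical predictor that drives the outcome
  noise  = rnorm(n)    # hypothetical predictor unrelated to the outcome
)
toy$left <- rbinom(n, 1, plogis(3 * toy$signal))

# Stand-in model: logistic regression instead of a random forest.
fit <- glm(left ~ signal + noise, data = toy, family = binomial)

accuracy <- function(data) {
  predicted <- as.integer(predict(fit, newdata = data, type = "response") > 0.5)
  mean(predicted == data$left)
}

baseline <- accuracy(toy)

# Importance of a variable = drop in accuracy after shuffling its values.
importance <- sapply(c("signal", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(shuffled)
})
```

Shuffling signal should cost far more accuracy than shuffling noise, which is the same reading as a longer bar in the variable importance plot.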
Code:
rf.10<-randomForest(Attrition~.,data=attrition, importance = T)
varImpPlot(rf.10, type=1, sort = T, n.var = 10,
    main= "Top -10 Variable Importance")
We see above that our selection of important features changes. We will analyze this further after
doing some feature engineering.
Feature Engineering
1. Convert “string” columns into their numeric form. Machine learning algorithms usually
do not work well with string columns.
2. Convert those columns into factor columns (since those columns are factors)
3. Normalize numeric columns like MonthlyIncome, PercentSalaryHike, DistanceFromHome
etc.
These steps are a subset of the steps involved in “feature engineering”. Feature engineering is
essential for selecting features that can maximize the performance of our machine learning algorithm.
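The three steps above can be sketched in base R as follows. The data frame and the suffix N on the new columns are hypothetical, and min-max scaling is an assumption: the report does not state which normalization it applies.

```r
# Hypothetical stand-in for a few attrition columns.
toy <- data.frame(
  OverTime      = c("Yes", "No", "Yes", "No"),
  Education     = c(1, 3, 2, 3),
  MonthlyIncome = c(2000, 4000, 6000, 8000),
  stringsAsFactors = FALSE
)

# 1. String column -> numeric codes (levels are sorted: "No" = 1, "Yes" = 2).
toy$OverTimeN <- as.integer(factor(toy$OverTime))

# 2. Numeric codes that are really categories -> factors.
toy$OverTimeN <- as.factor(toy$OverTimeN)
toy$Education <- as.factor(toy$Education)

# 3. Normalize a numeric column; min-max scaling to [0, 1] is one common choice.
rng <- range(toy$MonthlyIncome)
toy$MonthlyIncomeN <- (toy$MonthlyIncome - rng[1]) / (rng[2] - rng[1])
```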
Code:
#Feature Engineering
#2. Convert numeric columns which are actually factors into factors
attrition$Education<-as.factor(attrition$Education)
attrition$EnvironmentSatisfaction<-
as.factor(attrition$EnvironmentSatisfaction)
attrition$JobInvolvement<-as.factor(attrition$JobInvolvement)
attrition$JobSatisfaction<-as.factor(attrition$JobSatisfaction)
attrition$PerformanceRating<-as.factor(attrition$PerformanceRating)
attrition$RelationshipSatisfaction<-
as.factor(attrition$RelationshipSatisfaction)
attrition$WorkLifeBalance<-as.factor(attrition$WorkLifeBalance)
attrition$StockOptionLevel<-as.factor(attrition$StockOptionLevel)
RelationshipSatisfactionC,
WorkLifeBalanceC,
Age,
AgeBins,
MonthlyRate,
HourlyRate,
DailyRate))
rf.2<-randomForest(AttritionN~.,data=attrition.fe1)
varImpPlot(rf.2)
library(caret)
TrainingTimesLastYear,
YearsAtCompany,
YearsInCurrentRole,
YearsSinceLastPromotion,
YearsWithCurrManager
))
summary(attrition.n.t)
TrainingTimesLastYear,
YearsAtCompany,
YearsInCurrentRole,
YearsSinceLastPromotion,
YearsWithCurrManager))
Let’s see our feature importance chart after performing feature engineering.
rf.3<-randomForest(AttritionN~.,data=attrition.fe3, importance= T)
varImpPlot(rf.3, type=1)
Did you see any change from our baseline feature importance chart?
We are now ready to fit our model on the above dataset. We have all variables in numeric format,
factor variables as type factor, and numeric variables normalized. Here’s a screenshot of our
dataset.
Fitting a model
To fit a model, we take the above dataset and fit a Random Forest.
We first split the data into train and test sets in a 70/30 ratio. We fit the model on the train
data, improve model accuracy on the train data, and then evaluate it on the test data.
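A 70/30 split can be drawn with sample(). The following is a minimal sketch on a hypothetical data frame; the names train and test follow the code in this report, while toy and the seed are assumptions.

```r
set.seed(100)
toy <- data.frame(id = 1:100, x = rnorm(100))   # hypothetical data

# Draw 70% of the row indices at random; the remaining 30% form the test set.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```

Because attrition cases are the minority class, a stratified split that preserves the Yes/No ratio in both parts may be preferable in practice.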
Code:
set.seed(100)
rf.4 <- randomForest(AttritionN~.,data=train, importance=T)
print(rf.4)
Call:
randomForest(formula = AttritionN ~ ., data = train, importance = T)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
Model analysis
In this section, we will dive deep into model analysis and provide approaches to tune a model. We
need to tune a model with respect to a metric. We can choose either accuracy or out-of-bag (OOB)
error as the metric to optimize. In this report, we will select OOB error to analyze and optimize.
Error analysis
Code:
# Two panels side by side: error curves on the left, legend on the right
layout(matrix(c(1,2),nrow=1),
    widths=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rf.4)
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1),type="n", axes=FALSE, xlab="", ylab="")
legend("top", colnames(rf.4$err.rate),col=1:4,cex=0.8,fill=1:4)
How to read the above chart?
If you want to see the error rate at a more granular level, then we can plot “log” of error rate.
Tuning
In this section, we will tune parameters of the Random Forest algorithm. The two main ones are
the number of trees (ntree) and mtry, the number of variables tried at each split.
For this report, we will fix ntree at 500 and optimize mtry.
Code:
set.seed(120)
res <- tuneRF(x=subset(train, select = -c(AttritionN)),
y=train$AttritionN,
stepFactor = 1.5,
plot=TRUE,
ntreeTry = 500,
trace = TRUE,
improve = 0.0001,
importance=TRUE)
print(res)
mtry OOBError
4.OOB 4 0.1422805
5.OOB 5 0.1382442
7.OOB 7 0.1352170
10.OOB 10 0.1382442
The mtry value for which the model gives the best OOB error is 7. Let’s put mtry as 7 in our model.
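The best value can also be picked programmatically instead of reading it off the table. The sketch below rebuilds the tuning result by hand from the OOB errors printed above, so it runs without the fitted model; with the real res object only the last line would be needed.

```r
# OOB errors copied from the tuning output above.
res <- cbind(
  mtry     = c(4, 5, 7, 10),
  OOBError = c(0.1422805, 0.1382442, 0.1352170, 0.1382442)
)

# Row with the smallest OOB error gives the mtry to use.
best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]
```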
Code:
set.seed(120)
rf.6 <- randomForest(AttritionN~.,data=train, importance=T, mtry=7,
ntree=500)
print(rf.6)
Call:
randomForest(formula = AttritionN ~ ., data = train, importance = T,
mtry = 7, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 7
The OOB error rate went from 13.82% (mtry = 5) to 13.52% (mtry = 7), a marginal improvement. How
does the model perform on test data?
Code:
set.seed(120)
rf.7 <- randomForest(AttritionN~.,data=test, importance=T, mtry=7,
ntree=500)
print(rf.7)
Call:
randomForest(formula = AttritionN ~ ., data = test, importance = T,
mtry = 7, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 7
A forest fitted on the test data with the same parameters has an OOB error of 13.57%, which is very
close to the figure for the train data. When evaluating model performance we often face overfitting
or underfitting, either because we have too little data to train on or because we have not chosen
optimal parameters. These problems can be addressed iteratively by adding more data, pruning
features, and optimizing mtry and ntree. A discussion of this topic is out of scope for this report.
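A complementary check is to score the model fitted on the train data against the unseen test rows and read the error off a confusion matrix. The sketch below illustrates this on hypothetical data and swaps in a logistic regression (glm) so it runs with base R only; it is not the code used for the figures in this report.

```r
set.seed(120)
n <- 300
toy <- data.frame(x = rnorm(n))                       # hypothetical predictor
toy$left <- factor(ifelse(rbinom(n, 1, plogis(2 * toy$x)) == 1, "Yes", "No"))

train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on train only; glm stands in for the random forest here.
fit <- glm(left ~ x, data = train, family = binomial)

# Score the unseen test rows and compare with the true labels.
prob_yes  <- predict(fit, newdata = test, type = "response")
predicted <- factor(ifelse(prob_yes > 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusion <- table(actual = test$left, predicted = predicted)
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
```

With the tuned forest, the equivalent scoring step would be predict(rf.6, newdata = test), compared against test$AttritionN.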
We will now see the AUC of our model and plot ROC chart. This will be the last part in analyzing our
model performance.
ROC/AUC
The performance of a classification model is usually also measured by another metric called “AUC”
(area under the curve). Let’s see the AUC for our model and plot the ROC chart.
Code:
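library(ROCR)
# Without newdata, predict() returns the out-of-bag class probabilities
rf.prob <- predict(rf.6, type="prob")
rf.pred <- prediction(rf.prob[,2], train$AttritionN)
# The AUC value is stored in the y.values slot of the performance object
auc <- performance(rf.pred, "auc")@y.values[[1]]
rf.perf <- performance(rf.pred, "tpr", "fpr")
plot(rf.perf, main="ROC curve")
legend(.8, .3, round(auc, 2), title="AUC")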
The AUC for our model is close to 0.84 (84%), which is decent. The larger the area under the
curve, the better the model separates leavers from stayers.
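One way to read the AUC is as the probability that a randomly chosen leaver gets a higher score than a randomly chosen stayer. The sketch below computes it from ranks (the Mann-Whitney form) on hypothetical scores and labels, without any package; it assumes there are no tied scores.

```r
# Hypothetical scores (predicted probability of leaving) and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   1,   0,   0)

n_pos <- sum(label == 1)
n_neg <- sum(label == 0)

# Rank-sum form of the AUC: sum of the ranks of the positives, minus the
# smallest possible rank sum, divided by the number of positive-negative pairs.
auc <- (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

In this toy example 13 of the 16 leaver-stayer pairs are ordered correctly, so the AUC is 13/16 = 0.8125.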
Conclusion
In this report, we saw various steps to approach a data analytics problem. We presented various
statistical and visualization techniques to analyze the data. We also presented univariate and
multivariate analysis for our problem and provided steps to create, analyze and fine-tune a model.
For feature engineering and model fitting we used R programming, and for multivariate and
interactive analysis we used Power BI.
We also dove deep into several questions related to employee attrition. The questions we answered
were “why are young people leaving the company”, “why is attrition so high in the Sales Rep role”,
and “what are the characteristics of people doing overtime”.
Next Steps
We can do further work on model selection and tuning. We decided to use the Random Forest
algorithm, but what is the basis for choosing it? A general approach that we follow is to
try multiple algorithms, measure the performance of each, and go with the
algorithm giving the best performance. We could have evaluated algorithms like SVM, K-NN, and
Neural Networks.
We can remove variables that are highly correlated with one another. This reduces the
dimensionality of our dataset and may improve model performance.
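Candidate pairs can be found from the correlation matrix of the numeric columns. The sketch below uses hypothetical data in which one column is nearly a copy of another; the 0.8 cutoff is an assumption, not a value taken from this report.

```r
set.seed(100)
n <- 200
years <- rnorm(n, mean = 7, sd = 3)
toy <- data.frame(
  YearsAtCompany     = years,
  YearsInCurrentRole = years + rnorm(n, sd = 0.3),  # nearly a copy of the above
  DistanceFromHome   = rnorm(n, mean = 9, sd = 5)
)

corr <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.8.
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]]
)
```

One variable from each listed pair is then a candidate for removal.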
We can enhance our dataset to provide an end-to-end view of the employee attrition problem. We can
include hiring data (by job role, department, time to hire), competitor data (where employees are
going), and industry data to compare attrition with industry averages.
References
We referred to several posts on StackOverflow community, R-bloggers website,
machinelearningmastery, and other community websites to pick the best R code to solve this
problem.
Contact Us
If you have any questions on this report, or you have a data analytics problem for us to work on, feel
free to reach us through our website. We will get back to you within 12-24 business hours.
https://www.bigintsolutions.com/Contact
Thank you.