
Data Analytics Report - Case Study - Employee Attrition

This data analytics report analyzes employee attrition data using statistical and visualization techniques to understand the key drivers of attrition. The report explores the data, identifies factors influencing attrition through univariate and multivariate analysis, develops a predictive model for attrition, and provides conclusions. The dataset contains employee records from IBM with variables like age, department, job satisfaction, overtime hours, etc. Distribution analysis identifies that attrition is higher among younger employees, sales reps, and those doing overtime. A predictive model is developed and analyzed to identify the top factors influencing employee attrition.

Uploaded by

Krishna

Report version: 1.

Data Analytics Report


Case Study - Employee Attrition Analysis

“Joy’s law” states that “no matter who you are, most of the smartest people work for someone else”.
Employee attrition is a critical problem for any Human Resources department. In this era of
competition, it is imperative to understand the factors that lead to employee attrition and employee
retention. Some of those factors are obvious, while others are hidden.

Can our data on employee attrition give insights into why people are leaving the company? Can it
tell us which segments of employees are leaving? Can it tell us where we should focus?

Answers to some of these questions will help a CHRO take steps in the right direction and improve
employee morale and engagement to reduce attrition.

In this data analytics report we take you through a methodical framework developed by us at
bigintsolutions.com and dive deep into each step: understanding the data, visualizing it, and
identifying the factors influencing attrition.

In this report we will:

1. Explore employee attrition data through various statistical and visualization techniques
2. Find out factors influencing attrition
3. Create a model to predict attrition
4. Provide final conclusions

Our data analytics framework has 4 major steps:

1. Data Exploration
2. Distribution Analysis with respect to the variable of interest
3. Model Development
4. Model Analysis and Conclusion

Dataset: The dataset has been taken from an IBM resource.


Link to the dataset: https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx

Language used: R
Tools used: R Studio, Power BI
Live demo of Employee Attrition Analytics Power BI Report:
https://bigintsolutions.com/showcase/employee-attrition-analytics/

Disclaimer: This report is intended to showcase our approach to data analytics problems. By no
means is the model performance to be taken as final. The purpose is not to fine-tune and provide
the best model, but only to showcase the process and approach.

About the Founder

Ranbeer Makin is a passionate data analytics professional with 11+ years of industry experience
working with enterprises and providing solutions to large-scale data problems that drive
business decisions. He has helped customers architect, design and create next-generation
data analytics, machine learning, and AI platforms. He is a thought leader and an experienced
professional with a demonstrated history in the marketing and healthcare domains. Skilled in Data
Engineering and Product Management, he has extensive hands-on experience in Data Analytics, Data
Warehousing, Data Visualization, and DevOps techniques and tools.

Prior to starting his entrepreneurial journey, Ranbeer was Director Engineering/CTO for a healthcare
start-up, helping it develop a predictive analytics and patient care engagement platform. He
started his career in application development in C++, where he and his team architected and
developed a search rewards platform for a popular search engine. He then spent 8+ years in the data
and BI space, where he worked on a big data marketing and analytics platform and was responsible
for the delivery of large-scale projects. He is a certified agile software practitioner from the
Project Management Institute (PMI-ACP).

Ranbeer is a Computer Science graduate from IIIT Hyderabad with specialization in data engineering.

Connect with him on LinkedIn: https://www.linkedin.com/in/ranbeer/

Contact: https://www.BIGINTSolutions.com/Contact
Table of Contents
Data Exploration
    What kind of variables are we going to work with?
        What are the basic statistics of our data?
        What is the structure of our data?
        How is our data distributed?
    Do we have missing data?
    What is the detailed profile of our variables?
    Which variables to remove?
Distribution Analysis
    Univariate Analysis
        Attrition Distribution by Age
        Attrition Distribution by Education Field
        Attrition Distribution by Job Role
        Attrition Distribution by Business Travel
        Attrition Distribution by Department
        Attrition Distribution by Gender
        Attrition Distribution by Marital Status
        Attrition Distribution by OverTime
        Attrition Distribution by NumCompaniesWorked
        Attrition Distribution by DistanceFromHome
        Attrition Distribution by JobLevel
        Attrition Distribution by PercentSalaryHike
        Attrition Distribution by StockOptionLevel
        Attrition Distribution by TotalWorkingYears
        Attrition Distribution by TrainingTimesLastYear
        Attrition Distribution by YearsInCurrentRole
        Attrition Distribution by YearsAtCompany
        Attrition Distribution by YearsSinceLastPromotion
        Attrition Distribution by YearsWithCurrManager
        Attrition Distribution by Rates
        Attrition Distribution by Job Satisfaction
        Attrition Distribution by Education
        Attrition Distribution by Job Involvement
        Attrition Distribution by Environmental Satisfaction
        Attrition Distribution by other factors
    Multivariate Analysis
        Why are young people leaving the company?
        Why is attrition so high in Sales Rep Job Role?
        What are the characteristics of employees who are doing overtime?
Advanced Distribution Analysis
    Box-Whisker Plot
Model Development
    Top factors influencing attrition
    Feature Engineering
    Fitting a model
    Model analysis
        Error analysis
        Tuning
        ROC/AUC
Conclusion
Next Steps
References
Contact Us

Data Exploration
The first step in our framework is “Data Exploration”. In this step we answer the following
questions:
1. What kind of data are we going to work with?
2. Do we have missing data?
3. What profile of data do we have?
4. Should we remove obvious data points?

Let’s deep dive.

What kind of variables are we going to work with?

For this step, we use the “DataExplorer” library to provide basic data statistics.

Install library:
install.packages("DataExplorer")

Data report:

library("DataExplorer")
# Assumes the downloaded workbook has been saved as a CSV file in the working directory
attrition <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
create_report(attrition)

The create_report function generates a detailed report for our data. Let’s see what this report
contains.

What are the basic statistics of our data?

The data on employee attrition comprises 1,470 observations of 35 features, i.e. there are 1,470
rows and 35 columns. Of the 35 columns, 9 are discrete and 26 are continuous, and there are 0
missing values among the 51,450 data points (1,470 × 35).
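
The same headline numbers can be cross-checked with base R. The following is a minimal sketch on a small made-up data frame; `toy` and `basic_stats` are hypothetical names, and treating character/factor columns as "discrete" is our assumption about how the report classifies them.

```r
# Toy stand-in for the attrition data (made-up values, for illustration only)
toy <- data.frame(
  Age       = c(41, 49, 37, 33),
  Attrition = c("Yes", "No", "Yes", "No"),
  OverTime  = c("Yes", "No", NA, "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows, columns, column types and missing values
basic_stats <- function(df) {
  is_discrete <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
  list(
    rows        = nrow(df),
    columns     = ncol(df),
    discrete    = sum(is_discrete),
    continuous  = sum(!is_discrete),
    data_points = nrow(df) * ncol(df),
    missing     = sum(is.na(df))
  )
}

stats <- basic_stats(toy)
str(stats)
```

Calling `basic_stats(attrition)` on the real data should reproduce the figures quoted above if the assumption about discrete columns holds.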

What is the structure of our data?

You can also see the structure of your data in a tabular format using
str(attrition)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1470 obs. of 35 variables:
## $ Age : num 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : num 1102 279 1373 1392 591 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : num 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : num 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : num 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : num 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : num 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : num 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : num 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : num 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num 19479 24907 2396 23159 16632 ...
## $ NumCompaniesWorked : num 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num 5 7 0 0 2 6 0 0 8 7 ...

How is our data distributed?

The report also gives univariate distribution through histograms and bar charts. Here’s a sample of
the charts that the library generated.

Histogram

From the histograms, we can analyze the distribution of our dataset. For example:

1. Most employees are in the 25-45 age range
2. Most employees live within 10 miles of the office
3. Monthly income for most employees is towards the lower end of the range (less than 10K)

Bar Chart (by frequency)

A similar analysis can be done using bar charts based on frequency.

In subsequent sections we will see how to do a detailed analysis of our data with respect to our
variable of interest (i.e., attrition).

Do we have missing data?

From the report generated by the DataExplorer package, we see that there are no missing
values in our dataset. That’s a good thing, but in general we may have missing data.
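
When a dataset does contain gaps, a per-column missing-value profile is a useful first check. Below is a minimal base-R sketch on a made-up data frame (`toy`, `missing_profile` and `complete_rows` are hypothetical names); it is not part of the DataExplorer report itself.

```r
# Made-up data with a few deliberately missing values
toy <- data.frame(
  Age           = c(41, NA, 37, 33),
  MonthlyIncome = c(5993, 5130, NA, NA),
  OverTime      = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
missing_profile <- data.frame(
  variable    = names(toy),
  n_missing   = colSums(is.na(toy)),
  pct_missing = round(100 * colMeans(is.na(toy)), 1),
  row.names   = NULL
)

# Worst-affected columns first
missing_profile <- missing_profile[order(-missing_profile$n_missing), ]
print(missing_profile)

# Number of rows that have no missing value in any column
complete_rows <- sum(complete.cases(toy))
```

On the attrition data every `n_missing` entry would be 0, which matches the report's finding.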

What is the detailed profile of our variables?

To understand the spread of our data, it is important to understand each variable’s
characteristics. We do this with the “summary” function.

Code:
summary(attrition)

How to read the above summary?

• The ages of the employees range from 18 to 60 years. The average age is approximately 37 years
and the median is 36 years, i.e. most of the employees are relatively young.
• There are 237 positive cases (Attrition = Yes) and 1,233 negative cases in our dataset.
• The distance from home to the office ranges from 1 to 29 units, with an average close to 9
units.
• EmployeeCount has the same value in every record (min = 1st quartile = median = mean =
3rd quartile = max = 1), so it cannot have any effect on employee attrition and we can
eliminate this column from further analysis.
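
The individual readings above can also be pulled out one variable at a time. This is a small sketch on made-up values (`toy` is hypothetical): `summary()` gives the spread of a numeric column, while `table()` gives the class counts of a character column such as Attrition.

```r
# Made-up values, for illustration only
toy <- data.frame(
  Age       = c(18, 30, 36, 42, 60),
  Attrition = c("Yes", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Min, 1st quartile, median, mean, 3rd quartile and max of a numeric column
age_summary <- summary(toy$Age)
print(age_summary)

# Positive and negative cases, as counts and as percentages
class_counts <- table(toy$Attrition)
class_share  <- round(100 * prop.table(class_counts), 1)
print(class_counts)
print(class_share)
```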

From the above results, you can clearly identify:

1. The data spread based on min, max, median, mean, 1st quartile and 3rd quartile (this is best
represented as a box-whisker plot, which we will see in subsequent sections)
2. Variables which we can safely remove, such as EmployeeCount, EmployeeNumber,
StandardHours and Over18

However, certain variables such as Education and JobInvolvement are discrete factor
variables stored in numeric form. We would like to convert them into labeled factors.

Code:

# cut() with right = TRUE maps the integer codes 1, 2, 3, ... onto the intervals
# (0,1], (1,2], (2,3], ... and attaches the corresponding label to each code
attrition$EnvironmentSatisfaction <- cut(attrition$EnvironmentSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$Education <- cut(attrition$Education,
    breaks = c(0, 1, 2, 3, 4, 5),
    labels = c("Below College", "College", "Bachelor", "Master", "Doctor"),
    right = TRUE)

attrition$JobInvolvement <- cut(attrition$JobInvolvement,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$JobSatisfaction <- cut(attrition$JobSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$RelationshipSatisfaction <- cut(attrition$RelationshipSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$PerformanceRating <- cut(attrition$PerformanceRating,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Good", "Excellent", "Outstanding"),
    right = TRUE)

attrition$WorkLifeBalance <- cut(attrition$WorkLifeBalance,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Bad", "Good", "Better", "Best"),
    right = TRUE)

Let’s run the summary function again and see the output.

summary(attrition)

You can see that those variables are now labeled factors.

Which variables to remove?

Based on the above analysis, we will drop some of the variables as they are obvious candidates for
removal.

Code:
attrition <- subset(attrition, select = -c(EmployeeCount,
EmployeeNumber, StandardHours, Over18))
summary(attrition)
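
The candidates for removal were picked by reading the summary. As a hedged alternative, the sketch below flags them programmatically on a made-up data frame: columns with a single distinct value, and columns with a distinct value in every row (identifier-like). All names here (`toy`, `constant_cols`, `id_like_cols`) are hypothetical, and the identifier heuristic can also flag a genuinely continuous variable, so its output should be reviewed by hand.

```r
# Made-up data mimicking the four dropped columns plus one useful column
toy <- data.frame(
  EmployeeCount  = c(1, 1, 1, 1),
  EmployeeNumber = c(1, 2, 4, 5),
  StandardHours  = c(80, 80, 80, 80),
  Over18         = c("Y", "Y", "Y", "Y"),
  Age            = c(41, 49, 37, 37),
  stringsAsFactors = FALSE
)

# Number of distinct values per column
n_unique <- vapply(toy, function(x) length(unique(x)), integer(1))

constant_cols <- names(n_unique)[n_unique == 1]          # same value in every row
id_like_cols  <- names(n_unique)[n_unique == nrow(toy)]  # distinct value in every row

# Drop both groups and keep the rest
to_drop <- union(constant_cols, id_like_cols)
cleaned <- toy[, setdiff(names(toy), to_drop), drop = FALSE]
print(names(cleaned))
```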

Distribution Analysis
Now that we know the basic statistics and structure of our data, it’s time to analyze our dataset
with respect to our variable of interest. We will see univariate and multivariate analysis in this
section. Univariate analysis is done in R and multivariate analysis is done in Power BI.

Univariate Analysis

Our variable of interest is attrition, so we would like to analyze each of the variables with respect to
attrition %.

Here is generic code to generate such a plot for each variable, shown here for Age.

library(ggplot2)
library(ggthemes)
library(dplyr)

variableText <- "Age"

# Bucket ages into bins: [min,25), [25,35), [35,45), [45,55), [55,max]
breaks <- c(min(attrition$Age), 25, 35, 45, 55, max(attrition$Age))
attrition$AgeBins <- cut(attrition$Age, breaks, include.lowest = TRUE, right = FALSE)

# Share of Attrition = Yes/No within each bin, plus a text label for the chart
proportion <- attrition %>%
  group_by(AgeBins, Attrition) %>%
  tally() %>%
  group_by(AgeBins) %>%
  mutate(pct = n / sum(n),
         label_y = paste0(round(pct * 100, 1), "%")) %>%
  ungroup()

# Stacked bar chart; each label is centred inside its own segment
p4 <- ggplot(proportion, aes(x = factor(AgeBins), y = pct, fill = Attrition)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  geom_text(aes(label = label_y),
            position = position_stack(vjust = 0.5), color = "white", size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme(legend.position = "right", legend.direction = "vertical",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  labs(x = variableText, y = "% Employees", fill = "Attrition") +
  ggtitle(paste("Attrition distribution by", variableText))

p4
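
The percentages shown in such a chart can also be computed as a plain table, which makes it easy to repeat the analysis for any variable. The helper below is a minimal base-R sketch; `attrition_rate_by` and the toy data are hypothetical and not part of the original code.

```r
# Hypothetical helper: attrition % for each level of one grouping variable
attrition_rate_by <- function(df, var, target = "Attrition", positive = "Yes") {
  counts <- table(df[[var]], df[[target]])
  rates  <- prop.table(counts, margin = 1)   # row-wise proportions
  data.frame(
    group         = rownames(counts),
    n             = as.vector(rowSums(counts)),
    attrition_pct = round(100 * as.vector(rates[, positive]), 1),
    row.names     = NULL
  )
}

# Made-up data: 4 employees with overtime, 6 without
toy <- data.frame(
  OverTime  = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  Attrition = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

result <- attrition_rate_by(toy, "OverTime")
print(result)
```

On the real data, a call such as `attrition_rate_by(attrition, "JobRole")` would give the numbers behind the corresponding chart, assuming the Attrition column holds "Yes"/"No" values.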
Attrition Distribution by Age

It is evident from the above chart that attrition is highest in the 18-25 age group. From there
up to 45 years, attrition tends to decrease. Employees above 55 years, however, show a
comparatively high attrition of around 16%.

Attrition Distribution by Education Field

Attrition is highest among employees whose education field is Human Resources and lowest among
those whose field is “Other”, i.e. not Life Sciences, Marketing, Medical or Technical Degree.
Ranking the education fields by attrition %, we get:
Human Resources > Technical Degree > Marketing > Life Sciences > Medical > Other

Attrition Distribution by Job Role

Attrition is highest among sales representatives (~40%) and lowest among research directors.
Ranking the job roles by attrition %, we get:
Sales Representative > Laboratory Technician > Human Resources > Sales Executive > Research
Scientist > Healthcare Representative > Manufacturing Director > Manager > Research Director

Attrition Distribution by Business Travel

Here we see that non-travelers have the lowest attrition, whereas employees who need to travel
frequently have the highest attrition, around 25%.

Attrition Distribution by Department

Attrition is highest in the Sales department and lowest in the Research & Development
department. We could already infer this from the attrition by job role, where we saw
that sales representatives have almost 40% attrition and sales executives 17.5%.

Attrition Distribution by Gender

Attrition among men is slightly higher than among women; the difference is 2.2 percentage points.

Attrition Distribution by Marital Status

Attrition is highest among single employees (25.5%) and lowest among divorced employees.

Attrition Distribution by OverTime

Self-evident.

Attrition Distribution by NumCompaniesWorked

Here we can see that employees who have worked for 7-9 companies show the highest attrition,
while employees with the least work experience (from a few months to close to 2 years) show the
second highest. The gap between these highly experienced employees and new joiners is 3.1
percentage points.

Attrition Distribution by DistanceFromHome

Employees who live 1 to 11 units from work have attrition of roughly 13% to 15%: those within
1-3 units have the lowest attrition (12.9%), whereas those within 3-11 units have almost 15%.
Employees living 11 to 15 units from their workplace have the highest attrition, 22.5%.
Employees living 20-29 units away also have comparatively high attrition, 21.4%. Overall,
employees living more than 11 units from their workplace have a high rate of attrition.

Attrition Distribution by JobLevel

Attrition is highest at the very first job level, 26.3%. At the second level it decreases to
about 10% (9.7%), a fall of 16.6 percentage points. There is then a slight increase of around 5
percentage points at level 3. Finally, attrition is lowest at levels 4 and 5, as by then the
employees have a lot of experience, and we have already seen that overall attrition decreases
with age.

Attrition Distribution by PercentSalaryHike

From the above, different salary hike ranges appear to have no effect on attrition.

Attrition Distribution by StockOptionLevel

Employees at the lowest stock option level have the highest attrition (24.4%), whereas employees
at stock option levels 2 or 3 have comparatively low attrition.

Attrition Distribution by TotalWorkingYears

Employees who have just started working or have worked for up to about 6 years have the highest
attrition, around 29% (28.8%). Employees who have worked for about 11 years have attrition of
around 15%, and those who have worked for more than 11 years have an even lower attrition of
10.1%.

Attrition Distribution by TrainingTimesLastYear

Attrition Distribution by YearsInCurrentRole

Attrition Distribution by YearsAtCompany

As the number of years at the company increases, attrition % decreases, from 34.9% to 8.9%.
Employees who have just joined or have been with the company for less than 2 years are the most
likely to leave.

Attrition Distribution by YearsSinceLastPromotion

From the above, the number of years since the last promotion appears to have no effect on attrition.

Attrition Distribution by YearsWithCurrManager

Attrition Distribution by Rates

Attrition Distribution by Job Satisfaction

It is somewhat obvious that if job satisfaction is low, employees will tend to leave their
current job and job role for a better opportunity. We see that attrition follows a decreasing
trend with job satisfaction, i.e. as job satisfaction increases, the attrition rate decreases,
down to 11.3%.

Attrition Distribution by Education

Employees whose education is below college level have the highest attrition, at around 18%
(18.2%), while those at the Doctor level have the lowest, at 10.4%.

Attrition Distribution by Job Involvement

Here we again see that the higher the involvement in the job, the lower the rate of attrition,
following a decreasing trend. When involvement is low the rate is 33.7%, whereas when involvement
is very high the rate drops to 9%.

Attrition Distribution by Environmental Satisfaction

It is obvious that low environmental satisfaction would lead to high attrition, whereas if the
satisfaction is medium, high or very high the rate lies between 13.5% and 15%, a margin of 1.5
percentage points, i.e. the rate remains roughly the same across these levels.

Attrition Distribution by other factors

In this section we saw that univariate analysis is helpful in understanding the effect of
individual variables on attrition. We measured attrition % with respect to each variable. We found
how distance from home or years at the company impact attrition. We also saw how years with
current manager and overtime relate to attrition.

This analysis is good, but it is not comprehensive. It does not allow us to check whether multiple
factors impact attrition together. We will analyze this using multivariate analysis.

Multivariate Analysis

Based on the univariate analysis done in the previous section, we can ask several deeper
questions to strengthen our understanding:

1. Why are young people leaving the company?
2. Why is attrition so high for the Human Resources and Technical Degree education fields?
3. Why is attrition so high in the Sales Rep job role?
4. Why is attrition high among employees who travel frequently (Travel_Frequently)?
5. Why are employees who have not been promoted not leaving the company (attrition is very
low)?
6. What are the characteristics of employees who are doing overtime?

To answer such questions we use Power BI to analyze multiple factors at once. This tool
lets us interact with our data and gives a deeper understanding of the causes of attrition.

For a live demo of the report, you can check our company website:
https://bigintsolutions.com/showcase/employee-attrition-analytics/

Full report screenshot:

Let’s deep dive into our questions.

Why are young people leaving the company?

After selecting the age range for young employees (18-35), we can see:

• 21.95% attrition among young employees
• Monthly income less than the median income (4K)
• Most young employees who are leaving are Sales Representatives, Laboratory Technicians or Sales Executives
• 33% of them have a bad work-life balance

Why is attrition so high in Sales Rep Job Role?

After selecting ‘Job Role’ as “Sales Representative”, we see:

• 39.76% attrition in that job role
• More than 30% of those who left are divorced
• 50% of the attrition is because of distance from home (more than 7 miles)
• 71% of the attrition is because their job involvement is low
• This group has the lowest job satisfaction

What are the characteristics of employees who are doing overtime?

After selecting OverTime as “Yes”, we see:

• 30.53% attrition when OverTime is “Yes”
• More than 50% of those who left are single
• 40% of them are young (age 18-36)
• 35% of those who left have been with their current manager for more than 2 years
• 45% of those who left have high or very high job involvement

As you can see above, Power BI allows us to measure attrition with respect to various factors at
once. We can select filters and see how the data changes based on that selection.

We invite you to explore the report hosted on our website and see for yourself.
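
For readers who prefer to stay in R, the filter-then-measure pattern used in the Power BI report can be approximated with `subset()` and `aggregate()`. This is only a sketch on made-up data, not a replacement for the interactive report; `toy`, `selected`, `leavers` and `cross` are hypothetical names.

```r
# Made-up data, for illustration only
toy <- data.frame(
  OverTime      = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  MaritalStatus = c("Single", "Single", "Married", "Single",
                    "Married", "Single", "Married", "Divorced"),
  Attrition     = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Step 1: apply the filter (the equivalent of selecting OverTime = "Yes")
selected <- subset(toy, OverTime == "Yes")

# Step 2: attrition rate within the selection
attrition_pct <- 100 * mean(selected$Attrition == "Yes")

# Step 3: profile of the employees who left within the selection
leavers <- subset(selected, Attrition == "Yes")
single_share_pct <- 100 * mean(leavers$MaritalStatus == "Single")

# Two factors at once: attrition rate for every OverTime x MaritalStatus combination
cross <- aggregate(left ~ OverTime + MaritalStatus,
                   data = transform(toy, left = Attrition == "Yes"),
                   FUN = mean)
print(cross)
```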

Advanced Distribution Analysis


As mentioned in the previous section, there are visual techniques with which we can analyze the
spread of our data effortlessly. In this section, we will take a look at one such visual: the
box-whisker plot.

Box-Whisker Plot

We will plot the age distribution of the dataset, split by positive and negative cases.

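Code to achieve this:

p <- ggplot(attrition, aes(x = Attrition, y = Age, color = Attrition)) +
  geom_boxplot() +
  scale_color_manual(values = c("#00BFC4", "#F8766D")) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0); the diamond marks the mean
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4)

p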

Output:

How to read the above chart?

In simple terms, employees who leave are mostly young, with a mean age of around 34; the youngest
quarter of them are between 18 and 27 years old.
A visualization like this makes the spread of the data easy to see.
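
The numbers that a box-whisker plot draws can also be read off directly. The sketch below uses made-up ages (`toy` is hypothetical): `tapply()` with `summary()` gives the quartiles and mean per group, and `boxplot.stats()` returns the whisker, hinge and median values for one group.

```r
# Made-up ages, for illustration only
toy <- data.frame(
  Age       = c(19, 24, 28, 31, 35, 29, 36, 40, 45, 52),
  Attrition = c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Min, quartiles, median, mean and max of Age within each Attrition group
age_by_group <- tapply(toy$Age, toy$Attrition, summary)
print(age_by_group)

# Values drawn by the box plot for the "Yes" group:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_yes <- boxplot.stats(toy$Age[toy$Attrition == "Yes"])$stats
print(box_yes)
```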

Model Development
OK, it’s time to start developing our model. So far we have analyzed our dataset through various
techniques including basic statistics, data profiling, and univariate and multivariate
distribution analysis, and have seen one example of advanced distribution analysis.

We have a good (implicit) understanding of the variables involved and their importance. In this
section, we will look at how to statistically select the top factors influencing our variable of
interest, do a bit of feature engineering, fit a model, and dive deep into the model results.

Top factors influencing attrition

There are various ways to determine the factors influencing attrition, such as PCA (Principal
Component Analysis), RFE (Recursive Feature Elimination), or the variable importance of a Random
Forest. For this report, we will use the Random Forest algorithm.
Code:

#Feature importance
library(randomForest)

set.seed(100)
rf<-randomForest(Attrition~.,data=attrition, importance=T)
varImpPlot(rf, type=1)

Output:

From the above chart you can see:

• The higher the mean decrease accuracy score, the more important the variable is in the model.
In the plot shown above, OverTime is the most important variable.
• There is another measure of importance, MeanDecreaseGini, which we are not discussing in this
report.
• Note: We have not yet done any feature engineering, so the variable importance results may
change after that.
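
The scores behind a variable importance plot can also be read as a table. The sketch below is an illustration on R's built-in iris data rather than the attrition dataset (`rf.demo` is a hypothetical name); it uses `importance()` from the randomForest package with `type = 1`, which selects the mean decrease in accuracy.

```r
library(randomForest)

set.seed(100)
# iris is a stand-in dataset here, not the attrition data
rf.demo <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1 selects the permutation-based measure (MeanDecreaseAccuracy)
imp <- importance(rf.demo, type = 1)

# Sort from most to least important, as the plot does
imp.sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(imp.sorted)
```

For the report's model, the analogous call would presumably be `importance(rf, type = 1)`.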

How to check top 10 factors influencing attrition?

Code:

rf.10<-randomForest(Attrition~.,data=attrition, importance = T)
# Keep the title on one line so no line break ends up inside the string
varImpPlot(rf.10, type=1, sort = T, n.var = 10,
           main = "Top 10 Variable Importance")

As the plot shows, our selection of important features changes. We will analyze this further after
doing some feature engineering.

Feature Engineering

In this section we will perform the following steps:

1. Convert “string” columns into their numeric form. Usually, machine learning algorithms do
not work well with string columns.
2. Convert categorical columns into factor columns: both the newly coded columns from step 1
and numeric columns that are really categories (such as Education or JobSatisfaction).
3. Normalize numeric columns like MonthlyIncome, PercentSalaryHike, DistanceFromHome,
etc.

These steps are a subset of the steps involved in “feature engineering”. Feature engineering is
essential for selecting features that can maximize the performance of our machine learning algorithm.
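
The three steps above can be sketched on a tiny example in base R. The `toy` data frame and its values are hypothetical; only the column names are taken from the report. The normalization is written out by hand as a min-max rescaling to the [0, 1] range.

```r
# Hypothetical toy rows; only the column names match the report's dataset
toy <- data.frame(
  Gender        = c("Male", "Female", "Female", "Male"),
  Education     = c(1, 3, 2, 3),
  MonthlyIncome = c(2000, 5000, 3500, 8000),
  stringsAsFactors = FALSE
)

# Step 1: string -> integer code (levels are sorted alphabetically,
# so Female = 1 and Male = 2), stored as a factor
toy$GenderN <- as.factor(as.numeric(as.factor(toy$Gender)))

# Step 2: numeric column that is really a category -> factor
toy$Education <- as.factor(toy$Education)

# Step 3: min-max normalization to the [0, 1] range
rng <- range(toy$MonthlyIncome)
toy$MonthlyIncomeN <- (toy$MonthlyIncome - rng[1]) / (rng[2] - rng[1])

str(toy)
```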

Code:

#Feature Engineering

#1. Convert string columns into numeric


attrition$GenderN <-
as.factor(as.numeric(as.factor(attrition$Gender)))
attrition$JobRoleN <-
as.factor(as.numeric(as.factor(attrition$JobRole)))
attrition$MaritalStatusN <-
as.factor(as.numeric(as.factor(attrition$MaritalStatus)))
attrition$OverTimeN <-
as.factor(as.numeric(as.factor(attrition$OverTime)))
attrition$DepartmentN <-
as.factor(as.numeric(as.factor(attrition$Department)))
attrition$EducationFieldN <-
as.factor(as.numeric(as.factor(attrition$EducationField)))
attrition$BusinessTravelN <-
as.factor(as.numeric(as.factor(attrition$BusinessTravel)))
attrition$AttritionN <-
as.factor(as.numeric(as.factor(attrition$Attrition)))
attrition$AgeBinsN <-
as.factor(as.numeric(as.factor(attrition$AgeBins)))

#2. Convert numeric columns which are actually factors into factors
attrition$Education<-as.factor(attrition$Education)
attrition$EnvironmentSatisfaction<-
as.factor(attrition$EnvironmentSatisfaction)
attrition$JobInvolvement<-as.factor(attrition$JobInvolvement)
attrition$JobSatisfaction<-as.factor(attrition$JobSatisfaction)
attrition$PerformanceRating<-as.factor(attrition$PerformanceRating)
attrition$RelationshipSatisfaction<-
as.factor(attrition$RelationshipSatisfaction)
attrition$WorkLifeBalance<-as.factor(attrition$WorkLifeBalance)
attrition$StockOptionLevel<-as.factor(attrition$StockOptionLevel)

#3. Remove variables that are not required

attrition.fe1 <- subset(attrition, select = -c(
  Attrition, Gender, JobRole, MaritalStatus, OverTime, Department,
  EducationField, BusinessTravel,
  EducationC, EnvironmentSatisfactionC, JobInvolvementC, JobSatisfactionC,
  PerformanceRatingC, RelationshipSatisfactionC, WorkLifeBalanceC,
  Age, AgeBins, MonthlyRate, HourlyRate, DailyRate))

rf.2<-randomForest(AttritionN~.,data=attrition.fe1)
varImpPlot(rf.2)

#4. Normalize numeric variables

library(caret)

attrition.n <- subset(attrition.fe1, select = c(
  MonthlyIncome, NumCompaniesWorked, PercentSalaryHike,
  TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany,
  YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager))

preprocessParams <- preProcess(attrition.n, method=c("range"))

attrition.n.t <- predict(preprocessParams, attrition.n)

summary(attrition.n.t)

#put these back in the data frame

attrition.fe2 <- subset(attrition.fe1, select = -c(
  MonthlyIncome, NumCompaniesWorked, PercentSalaryHike,
  TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany,
  YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager))

attrition.fe3 <- cbind(attrition.fe2, attrition.n.t)


Let’s see our feature importance chart after performing feature engineering.

rf.3<-randomForest(AttritionN~.,data=attrition.fe3, importance= T)
varImpPlot(rf.3, type=1)

Did you see any change from our baseline feature importance chart?

We are now ready to fit our model on the above dataset. All string variables are now numerically
coded, categorical variables are typed as factors, and numeric variables are normalized. Here’s a
screenshot of our dataset.

Our variable of interest is “AttritionN”.

Fitting a model

To fit a model, we will take the above dataset and fit a Random Forest.

We first need to split the data into train and test sets in a 70/30 split. We fit the model on the
train data, improve model accuracy on the train data, and then evaluate it on the test data.

Code:

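# Assign each row to train (1) or test (2) with probability 0.7 / 0.3.
# Note: no seed is set before sample(), so the split differs between runs.
ind <- sample(2, nrow(attrition.fe3), replace = TRUE, prob = c(0.7, 0.3))
train <- attrition.fe3[ind==1,]
test <- attrition.fe3[ind==2,]

set.seed(100)
rf.4 <- randomForest(AttritionN~.,data=train, importance=T)
print(rf.4)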

Call:
 randomForest(formula = AttritionN ~ ., data = train, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 13.82%
Confusion matrix:
    1  2 class.error
1 812  9  0.01096224
2 128 42  0.75294118

The OOB error is 13.82% if we take all the variables.
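
The figures in this output can be reproduced from the confusion matrix by hand, which also shows what the overall error hides. The sketch below re-enters the printed matrix (rows are actual classes, columns are predicted classes); the variable names are illustrative. Class 2 corresponds to “Yes”, so the last quantity is the share of actual leavers that the model identifies.

```r
# OOB confusion matrix as printed above (rows = actual, columns = predicted)
conf <- matrix(c(812, 9,
                 128, 42),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

# Overall OOB error: misclassified rows over all rows
oob.error <- (conf["1", "2"] + conf["2", "1"]) / sum(conf)

# Per-class error: misclassified rows over the rows of that actual class
class.error <- 1 - diag(conf) / rowSums(conf)

# Share of actual class 2 ("Yes") cases that the model catches
recall.yes <- conf["2", "2"] / sum(conf["2", ])

print(round(c(oob = oob.error, class.error, recall.yes = recall.yes), 4))
```

Although the overall error is about 13.8%, the class 2 error is about 75%: the model identifies only 42 of the 170 employees who left.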

Model analysis

In this section, we will take a deep dive into model analysis and provide approaches to tune a
model. A model must be tuned with respect to a metric. We can choose either accuracy or out-of-bag
(OOB) error as the metric to optimize. In this report, we will select OOB error to analyze and
optimize.

Error analysis

Let’s plot the error rate against the number of trees.

Code:

layout(matrix(c(1,2),nrow=1),
       widths=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rf.4)
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1),type="n", axes=F, xlab="", ylab="")
#err.rate has three columns here: OOB, 1 and 2
legend("top", colnames(rf.4$err.rate),col=1:3,cex=0.8,fill=1:3)
How to read the above chart?

• The plot shows the error rate against the number of trees
• The red line is for negative cases (“No” or 1), and the green line is for positive cases (“Yes” or 2)

If you want to see the error rate at a more granular level, you can plot the “log” of the error rate.
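
A minimal sketch of such a log-scale plot, shown on R's built-in iris data rather than the attrition dataset (`rf.demo` is a hypothetical name). It assumes that extra arguments to `plot()` on a randomForest object are passed through to the underlying plotting call, as in the package's own documentation example; error rates of exactly zero cannot be drawn on a log axis and are dropped with a warning.

```r
library(randomForest)

set.seed(100)
# iris is a stand-in dataset here, not the attrition data
rf.demo <- randomForest(Species ~ ., data = iris, ntree = 500)

# Same error-versus-trees plot, with the error axis on a log scale
plot(rf.demo, log = "y", main = "Error rate (log scale)")

# One legend entry per column of err.rate (OOB plus one per class)
n.series <- ncol(rf.demo$err.rate)
legend("topright", colnames(rf.demo$err.rate),
       col = 1:n.series, lty = 1:n.series, cex = 0.8)
```

For the report's model, the analogous call would presumably be `plot(rf.4, log = "y")`.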

Tuning

In this section, we will look at two parameters of the randomForest algorithm: the number of trees
(ntree) and mtry.

mtry: Number of variables randomly sampled as candidates at each split.

ntree: Number of trees to grow.

For this report, we will fix ntree at 500 and optimize mtry.

Code:

set.seed(120)
res <- tuneRF(x=subset(train, select = -c(AttritionN)),
y=train$AttritionN,
stepFactor = 1.5,
plot=TRUE,
ntreeTry = 500,
trace = TRUE,
improve = 0.0001,
importance=TRUE)

print(res)

       mtry  OOBError
4.OOB     4 0.1422805
5.OOB     5 0.1382442
7.OOB     7 0.1352170
10.OOB   10 0.1382442

The mtry value that gives the lowest OOB error is 7. Let’s set mtry to 7 in our model.
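
Reading the best value off the table can also be done programmatically. The sketch below re-enters the printed tuneRF result by hand (`res.demo` is an illustrative name) and picks the row with the lowest OOB error; in the actual workflow the same two lines could presumably be applied to `res` directly.

```r
# tuneRF result as printed above, re-entered by hand for illustration
res.demo <- matrix(c(4, 0.1422805,
                     5, 0.1382442,
                     7, 0.1352170,
                     10, 0.1382442),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("4.OOB", "5.OOB", "7.OOB", "10.OOB"),
                                   c("mtry", "OOBError")))

# Row with the lowest OOB error, and the mtry value it belongs to
best.row <- which.min(res.demo[, "OOBError"])
best.mtry <- res.demo[best.row, "mtry"]
print(best.mtry)
```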

Code:

set.seed(120)
rf.6 <- randomForest(AttritionN~.,data=train, importance=T, mtry=7,
ntree=500)
print(rf.6)

Call:
 randomForest(formula = AttritionN ~ ., data = train, importance = T,
              mtry = 7, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 13.52%
Confusion matrix:
    1  2 class.error
1 812  9  0.01096224
2 125 45  0.73529412

The OOB error rate went from 13.82% to 13.52%, a marginal improvement. How does the model
perform on test data?

Code:
set.seed(120)
rf.7 <- randomForest(AttritionN~.,data=test, importance=T, mtry=7,
ntree=500)
print(rf.7)

Call:
 randomForest(formula = AttritionN ~ ., data = test, importance = T,
              mtry = 7, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 13.57%
Confusion matrix:
    1  2 class.error
1 411  1 0.002427184
2  64  3 0.955223881

The OOB error here is 13.57%, which is very close to the figure on the train data. Note that the
code above fits a new forest on the test rows, so this is the OOB error of that second forest rather
than the error of rf.6 on unseen data. Often when evaluating model performance we may face
overfitting and underfitting problems: either we have too little data to train on, or we have not
chosen optimal parameters. These problems can be solved iteratively by adding more data, pruning
features, and optimizing mtry and ntree. The discussion on this topic is out of scope for this report.
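
One common way to score a fitted forest on held-out rows is to call `predict()` with `newdata`, rather than fitting a second forest on the test set. A minimal sketch, using two iris species as a stand-in binary problem (`rf.demo`, `train.demo` and `test.demo` are illustrative names, not objects from this report); for the report's objects the analogous call would presumably be `predict(rf.6, newdata = test)`.

```r
library(randomForest)

set.seed(120)
# Stand-in binary problem: two of the three iris species
dat <- droplevels(subset(iris, Species != "setosa"))

# Seeded 70/30 split on row indices
idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train.demo <- dat[idx, ]
test.demo <- dat[-idx, ]

# Fit on the train rows only
rf.demo <- randomForest(Species ~ ., data = train.demo, ntree = 500)

# Score the held-out rows with the already fitted model
pred <- predict(rf.demo, newdata = test.demo)
conf <- table(actual = test.demo$Species, predicted = pred)
test.error <- 1 - sum(diag(conf)) / sum(conf)

print(conf)
print(test.error)
```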

We will now see the AUC of our model and plot ROC chart. This will be the last part in analyzing our
model performance.

ROC/AUC

Usually, we also measure the performance of a classifier by another metric called “AUC”, the area
under the ROC curve. Let’s see the AUC for our model and plot the ROC chart.

Code:
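library(ROCR)

# OOB class probabilities for the training rows (no newdata supplied).
# Variables are named so that they do not mask predict(), prediction()
# and performance().
rf.prob <- predict(rf.6, type="prob")
rocr.pred <- prediction(rf.prob[,2], train$AttritionN)

# Area under the ROC curve
auc <- performance(rocr.pred, "auc")
auc <- round(unlist(slot(auc, "y.values")), 4)

# ROC curve: true positive rate against false positive rate
roc <- performance(rocr.pred, "tpr", "fpr")
plot(roc, main="ROC Curve for Random Forest", col=2, lwd=2)
abline(a=0, b=1, lwd=2, lty=2, col="gray")

legend(.8, .3, auc, title="AUC")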

The AUC for our model is close to 0.84 (84%), which is decent. The larger the area under the
curve, the better the model separates the two classes.
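
To make the meaning of AUC concrete: it equals the fraction of positive/negative pairs in which the positive case receives the higher score. The sketch below computes it two ways in base R on hypothetical scores and labels (all values are made up for illustration), using the rank-sum form and a direct pairwise count.

```r
# Hypothetical scores and labels (1 = positive, 0 = negative)
score <- c(0.90, 0.80, 0.35, 0.60, 0.40, 0.20)
label <- c(1, 1, 1, 0, 0, 0)

n.pos <- sum(label == 1)
n.neg <- sum(label == 0)

# Rank-sum (Mann-Whitney) form of the AUC
r <- rank(score)
auc.demo <- (sum(r[label == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)

# Direct check: fraction of positive/negative pairs ranked correctly
pairs.ok <- outer(score[label == 1], score[label == 0], ">")
auc.pairs <- mean(pairs.ok)

print(c(rank.form = auc.demo, pair.form = auc.pairs))
```

Here 7 of the 9 pairs are ordered correctly, so both forms give 7/9.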

Conclusion
In this report, we walked through the steps of approaching a data analytics problem. We presented
various statistical and visualization techniques to analyze the data, covered univariate and
multivariate analysis for our problem, and provided steps to create, analyze and fine-tune a model.
For feature engineering and model fitting we used R, and for multivariate and interactive analysis
we used Power BI.

We also took a deep dive into several questions related to employee attrition: “why are young
people leaving the company?”, “why is attrition so high in the Sales Rep role?”, and “what are the
characteristics of people doing overtime?”

Next Steps
We can do further work on model selection and tuning. We decided to use the Random Forest
algorithm, but what is the basis for choosing it? A general approach is to try multiple algorithms,
measure the performance of each, and go with the one giving the best performance. We could have
evaluated algorithms like SVM, K-NN, and Neural Networks.

We can remove certain variables which are “correlated”. This can reduce the dimensionality of our
dataset and improve model performance.
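
A minimal sketch of how correlated variables might be spotted, using base R on the built-in mtcars data as a stand-in for the attrition dataset. The 0.8 cutoff and the variable names are assumptions chosen for illustration, not a rule from this report.

```r
# mtcars is a stand-in dataset here, not the attrition data
num <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Absolute pairwise correlations; blank out the diagonal and upper
# triangle so each pair is reported once
cm <- abs(cor(num))
cm[upper.tri(cm, diag = TRUE)] <- NA

# Pairs above an illustrative threshold of 0.8 (an assumption, not a rule)
high <- which(cm > 0.8, arr.ind = TRUE)
high.pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high.pairs)
```

From each reported pair, one variable is a candidate for removal; which one to drop is a judgment call that depends on the problem.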

We can enhance our dataset to provide an end-to-end view of the employee attrition problem. We
can include hiring data (by job role, department, time to hire), competitor data (where employees
are going), and industry data to compare attrition with industry averages.

References
We referred to several posts on StackOverflow community, R-bloggers website,
machinelearningmastery, and other community websites to pick the best R code to solve this
problem.

Contact Us
If you have any questions on this report, or you have a data analytics problem for us to work on, feel
free to reach us through our website. We will get back to you within 12-24 business hours.

https://www.bigintsolutions.com/Contact

Thank you.
