Data Analytics Report - Case Study - Employee Attrition

This data analytics report analyzes employee attrition data using statistical and visualization techniques to understand the key drivers of attrition. The report explores the data, identifies factors influencing attrition through univariate and multivariate analysis, develops a predictive model for attrition, and provides conclusions. The dataset contains employee records from IBM with variables like age, department, job satisfaction, overtime hours, etc. Distribution analysis identifies that attrition is higher among younger employees, sales reps, and those doing overtime. A predictive model is developed and analyzed to identify the top factors influencing employee attrition.

Uploaded by

Krishna

Report version: 1.

Data Analytics Report

Case Study - Employee Attrition Analysis

Joy’s law states that “no matter who you are, most of the smartest people work for someone else”.
Employee attrition is a critical problem for any Human Resources department. In this era of
competition, it is imperative to understand the factors that drive employee attrition and
employee retention. Some of those factors are obvious, while others are hidden.

Can our data on employee attrition give insights into why people are leaving the company? Can it
tell us which segments of employees are leaving? Can it tell us where we should focus?

Answers to some of these questions will help a CHRO take steps in the right direction and improve
employee morale and engagement to reduce attrition.

In this data analytics report we take you through a methodical framework developed by us at
bigintsolutions.com, and dive deep into each step: understanding the data, visualizing it and seeing
the factors influencing attrition.

In this report we will:

1. Explore employee attrition data through various statistical and visualization techniques
2. Find out factors influencing attrition
3. Create a model to predict attrition
4. Provide final conclusions

Our data analytics framework has 4 major steps:

1. Data Exploration
2. Distribution Analysis with respect to the variable of interest
3. Model Development
4. Model Analysis and Conclusion

Dataset: The dataset is taken from an IBM resource.

Link to the dataset: https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx

Language used: R
Tools used: R Studio, Power BI
Live demo of Employee Attrition Analytics Power BI Report:
https://bigintsolutions.com/showcase/employee-attrition-analytics/

Disclaimer: This report is intended to showcase our approach to data analytics problems. By no
means should the model performance be taken as final. The purpose is not to fine-tune and provide
the best model, only to showcase the process and approach.
About the Founder

Ranbeer Makin is a passionate data analytics professional with 11+ years of industry experience
working with enterprises and providing them solutions to large-scale data problems for driving
business decisions. He has helped customers in architecting, designing and creating next-generation
data analytics, machine learning, and AI platforms. He is a thought leader and an experienced
professional with a demonstrated history in the marketing and healthcare domains. Skilled in Data
Engineering and Product Management, he has extensive hands-on experience in Data Analytics, Data
Warehousing, Data Visualization, and DevOps techniques and tools.

Prior to starting his entrepreneurial journey, Ranbeer was Director of Engineering/CTO for a
healthcare start-up, helping them develop a predictive analytics and patient care engagement
platform. He started his career with application development in C++, where he and his team
architected and developed a search rewards platform for a popular search engine. He then spent 8+
years in the data and BI space, where he worked on a big data marketing and analytics platform and
was responsible for delivery of large-scale projects. He is a certified agile software practitioner
from the Project Management Institute (PMI-ACP).

Ranbeer is a Computer Science graduate from IIIT Hyderabad with specialization in data engineering.

Connect with him on
LinkedIn: https://www.linkedin.com/in/ranbeer/

Contact: https://www.BIGINTSolutions.com/Contact
Table of Contents
Data Exploration
  What kind of variables are we going to work with?
    What are the basic statistics of our data?
    What is the structure of our data?
    How is our data distributed?
  Do we have missing data?
  What is the detailed profile of our variables?
  Which variables to remove?
Distribution Analysis
  Univariate Analysis
    Attrition Distribution by Age
    Attrition Distribution by Education Field
    Attrition Distribution by Job Role
    Attrition Distribution by Business Travel
    Attrition Distribution by Department
    Attrition Distribution by Gender
    Attrition Distribution by Marital Status
    Attrition Distribution by OverTime
    Attrition Distribution by NumCompaniesWorked
    Attrition Distribution by DistanceFromHome
    Attrition Distribution by JobLevel
    Attrition Distribution by PercentSalaryHike
    Attrition Distribution by StockOptionLevel
    Attrition Distribution by TotalWorkingYears
    Attrition Distribution by TrainingTimesLastYear
    Attrition Distribution by YearsInCurrentRole
    Attrition Distribution by YearsAtCompany
    Attrition Distribution by YearsSinceLastPromotion
    Attrition Distribution by YearsWithCurrManager
    Attrition Distribution by Rates
    Attrition Distribution by Job Satisfaction
    Attrition Distribution by Education
    Attrition Distribution by Job Involvement
    Attrition Distribution by Environmental Satisfaction
    Attrition Distribution by other factors
  Multivariate Analysis
    Why are young people leaving the company?
    Why is attrition so high in Sales Rep Job Role?
    What are the characteristics of employees who work overtime?
Advanced Distribution Analysis
  Box-Whisker Plot
Model Development
  Top factors influencing attrition
  Feature Engineering
  Fitting a model
  Model analysis
    Error analysis
    Tuning
    ROC/AUC
Conclusion
Next Steps
References
Contact Us

Data Exploration
The first step in our framework is “Data Exploration”. In this step we answer the following
questions:
1. What kind of data are we going to work with?
2. Do we have missing data?
3. What profile of data do we have?
4. Should we remove obvious data points?

Let’s deep dive.

What kind of variables are we going to work with?

For this step, we use the “DataExplorer” library to provide basic data statistics.

Install library:
install.packages("DataExplorer")

Data report:

library("DataExplorer")
attrition <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
create_report(attrition)

The create_report function will generate a detailed report for our data. Let’s see what this report
contains.

What are the basic statistics of our data?

The data on employee attrition comprises 1470 observations of 35 features, i.e. there are 1,470
rows and 35 columns. Of all the 35 columns, 9 are discrete, 26 are continuous, and there are 0
missing values out of 51,450 data points.
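
The counts quoted above can also be checked directly in base R. The following is a minimal sketch on a small made-up data frame: `toy` and its values are hypothetical stand-ins for `attrition`, and treating character/factor columns as discrete and numeric columns as continuous is our assumption about how the report classifies them.

```r
# Hypothetical stand-in for the attrition data frame (values are made up)
toy <- data.frame(
  Age       = c(41, 49, 37, 33),
  Attrition = c("Yes", "No", "Yes", "No"),
  OverTime  = c("Yes", "No", "Yes", "Yes"),
  DailyRate = c(1102, 279, 1373, 1392),
  stringsAsFactors = FALSE
)

n_rows <- nrow(toy)
n_cols <- ncol(toy)

# Assumption: character/factor columns count as discrete, numeric as continuous
is_discrete  <- vapply(toy, function(x) is.character(x) || is.factor(x), logical(1))
n_discrete   <- sum(is_discrete)
n_continuous <- sum(!is_discrete)

n_points  <- n_rows * n_cols   # total number of data points
n_missing <- sum(is.na(toy))   # total number of missing values

cat(n_rows, "rows,", n_cols, "columns,", n_discrete, "discrete,",
    n_continuous, "continuous,", n_missing, "missing of", n_points, "\n")
```

On the full dataset the same arithmetic gives 1,470 x 35 = 51,450 data points.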

What is the structure of our data?

You can also see the structure of your data in a tabular format using
str(attrition)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1470 obs. of 35 variables:

## $ Age : num 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely"
"Travel_Frequently" ...
## $ DailyRate : num 1102 279 1373 1392 591 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development"
"Research & Development" ...
## $ DistanceFromHome : num 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : num 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : num 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : num 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : num 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : num 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : num 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician"
"Research Scientist" ...
## $ JobSatisfaction : num 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num 19479 24907 2396 23159 16632 ...
## $ NumCompaniesWorked : num 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num 5 7 0 0 2 6 0 0 8 7 …

How is our data distributed?

The report also gives univariate distribution through histograms and bar charts. Here’s a sample of
the charts that the library generated.

Histogram

From the histograms, we can do distribution analysis of our dataset. Example:

1. Most employees are between 25 and 45 years old
2. Most employees live within 10 miles of the office
3. Monthly income for most employees is in the lower range (less than 10K)

Bar Chart (by frequency)

A similar analysis can be done using bar charts based on frequency.

In subsequent sections we will see how to do a detailed analysis of our data with respect to our
variable of interest (i.e., attrition).

Do we have missing data?

From the report generated by the DataExplorer package, we see that there are no missing
values in our dataset. That’s a good thing, but in general we may have missing data.
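
For the general case where data is missing, a per-column check might look like the sketch below. It uses only base R on a made-up data frame (`toy` and its gaps are hypothetical, since the attrition dataset itself has none).

```r
# Hypothetical data frame with a few gaps (values are made up)
toy <- data.frame(
  Age           = c(41, NA, 37, 33, 27),
  MonthlyIncome = c(5993, 5130, NA, NA, 3468),
  OverTime      = c("Yes", "No", "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

# Number and share of missing values per column
missing_count <- colSums(is.na(toy))
missing_share <- colMeans(is.na(toy))
print(data.frame(missing_count, missing_share))

# Number of rows that are complete in every column
n_complete <- sum(complete.cases(toy))
```

Columns with a high share of missing values are candidates for imputation or removal; which one is appropriate depends on the variable.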

What is the detailed profile of our variables?

In order to understand the spread of our data, it is important to understand each variable’s
characteristics. We do this with the “summary” function.

Code:
summary(attrition)

How to read the above summary?

• Employee ages range from 18 to 60 years. The mean age is approximately 37 years and the
median is 36 years, i.e. most of the employees are young.
• There are 237 positive cases (Attrition = Yes) and 1,233 negative cases in our dataset.
• The distance from home to the office ranges from 1 to 29 units, with a mean close to 9
units.
• EmployeeCount is 1 in every row (min = 1st quartile = median = mean = 3rd quartile = max = 1),
so it cannot have any effect on employee attrition and we can eliminate this column from
further analysis.

From the above results, you can clearly identify:

1. Data spread based on min, max, median, mean, 1st quartile and 3rd quartile (this is best
represented as a box-whisker plot, which we will see in subsequent sections)
2. Variables which we can safely remove, such as EmployeeCount, EmployeeNumber,
StandardHours and Over18
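
Removal candidates of this kind can also be flagged programmatically rather than read off the summary. The sketch below is one possible approach, not the method used in this report: it counts distinct values per column on a made-up data frame (`toy` is hypothetical) and flags constant columns and identifier-like columns.

```r
# Hypothetical stand-in for the attrition data (values are made up)
toy <- data.frame(
  Age            = c(41, 49, 41, 33),
  EmployeeCount  = c(1, 1, 1, 1),
  EmployeeNumber = c(1, 2, 4, 5),
  Over18         = c("Y", "Y", "Y", "Y"),
  StandardHours  = c(80, 80, 80, 80),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_unique <- vapply(toy, function(x) length(unique(x)), integer(1))

# A column with a single distinct value cannot explain anything
constant_cols <- names(n_unique)[n_unique == 1]

# A column with one distinct value per row behaves like an identifier
# (a continuous measure could also be all-unique, so review these by eye)
id_like_cols <- names(n_unique)[n_unique == nrow(toy)]

print(constant_cols)
print(id_like_cols)
```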

However, certain variables like Education, JobInvolvement etc. are discrete factor variables
stored in numeric form. We would like to convert them into labeled factors.

Code:

attrition$EnvironmentSatisfaction <- cut(attrition$EnvironmentSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$Education <- cut(attrition$Education,
    breaks = c(0, 1, 2, 3, 4, 5),
    labels = c("Below College", "College", "Bachelor", "Master", "Doctor"),
    right = TRUE)

attrition$JobInvolvement <- cut(attrition$JobInvolvement,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$JobSatisfaction <- cut(attrition$JobSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$RelationshipSatisfaction <- cut(attrition$RelationshipSatisfaction,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Medium", "High", "Very High"),
    right = TRUE)

attrition$PerformanceRating <- cut(attrition$PerformanceRating,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Low", "Good", "Excellent", "Outstanding"),
    right = TRUE)

attrition$WorkLifeBalance <- cut(attrition$WorkLifeBalance,
    breaks = c(0, 1, 2, 3, 4),
    labels = c("Bad", "Good", "Better", "Best"),
    right = TRUE)

Let’s run the summary function again and see the output.

summary(attrition)

You can see that those variables are now in their labeled form.
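
Since these ratings are integer codes, binning with cut() is not the only option. The sketch below compares it with a direct level-to-label mapping through factor(); `rating` is a made-up vector, and the factor() variant is shown as an alternative, not as what this report uses.

```r
# Hypothetical ratings on a 1-4 scale (values are made up)
rating <- c(2, 3, 4, 4, 1)
labels <- c("Low", "Medium", "High", "Very High")

# The approach used in this report: bin with cut()
via_cut <- cut(rating, breaks = c(0, 1, 2, 3, 4), labels = labels, right = TRUE)

# An alternative for integer codes: map levels to labels directly
via_factor <- factor(rating, levels = 1:4, labels = labels)

print(table(via_cut))
print(table(via_factor))
```

Both give the same labels for whole-number codes. The factor() form turns any unexpected code into NA instead of silently placing it in a bin, which can make data errors easier to spot.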

Which variables to remove?

Based on the above analysis, we will drop some of the variables, as they are obvious candidates for
removal.

Code:
attrition <- subset(attrition, select = -c(EmployeeCount,
EmployeeNumber, StandardHours, Over18))
summary(attrition)

Distribution Analysis
Now that we know the basic statistics and structure of our data, it’s time to analyze our dataset
with respect to our variable of interest. We will see univariate and multivariate analysis in this
section. Univariate analysis is done in R and multivariate analysis is done in Power BI.

Univariate Analysis

Our variable of interest is attrition, so we would like to analyze each of the variables with respect to
attrition %.

Here’s the general code to generate such a plot for each variable, using Age as the example.

library(ggplot2)
library(ggthemes)
library(dplyr)

breaks <- c(min(attrition$Age), 25, 35, 45, 55, max(attrition$Age))

# bucketing data points into bins
AgeBins <- cut(attrition$Age, breaks, include.lowest = TRUE, right = FALSE)
attrition <- cbind(attrition, AgeBins)
variableText <- "Age"

# share of each attrition outcome within every age bin
proportion <- attrition %>%
  group_by(AgeBins, Attrition) %>%
  tally() %>%
  group_by(AgeBins) %>%
  mutate(pct = n / sum(n))

# text label for each bar segment
proportion <- proportion %>%
  group_by(AgeBins) %>%
  mutate(label_y = paste0(round(pct * 100, 1), "%"))

p4 <- ggplot(proportion, aes(y = pct, x = factor(AgeBins), fill = Attrition)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  geom_text(aes(label = label_y), vjust = 1.5, color = "white", size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme(legend.position = "right", legend.direction = "vertical",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  # the legend title is set through labs(); element_text() only styles text
  labs(x = variableText, y = "% Employees", fill = "Attrition") +
  ggtitle(paste("Attrition distribution by", variableText, sep = " "))

p4

Attrition Distribution by Age

The above chart makes it evident that attrition is highest among employees aged 18-25. From there
up to 45 years, attrition tends to decrease. Employees above 55 years, however, face a
comparatively high attrition of around 16%.
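
The percentages read off such a chart can also be tabulated directly, without plotting. The following is a minimal base R sketch on made-up data (`toy` and its values are hypothetical), using the same left-closed age bins as the plotting code.

```r
# Hypothetical ages and outcomes (values are made up)
toy <- data.frame(
  Age       = c(19, 22, 24, 28, 31, 33, 38, 41, 47, 58),
  Attrition = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes")
)

# Same bucketing idea as the plotting code: left-closed bins, top bin closed
breaks <- c(min(toy$Age), 25, 35, 45, 55, max(toy$Age))
toy$AgeBins <- cut(toy$Age, breaks, include.lowest = TRUE, right = FALSE)

# Row-wise proportions: share of "No" and "Yes" within each age bin
counts <- table(toy$AgeBins, toy$Attrition)
rates  <- prop.table(counts, margin = 1)
print(round(100 * rates, 1))
```

Swapping `Age` for another numeric column (with suitable breaks) or tabulating a categorical column directly gives the attrition % for any other variable.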

Attrition Distribution by Education Field

We see here that attrition is highest among employees whose education field is Human Resources and
lowest among those from a field other than Life Sciences, Marketing, Medical and Technical
Degree. We can rank the education fields by attrition % as follows:
Human Resources > Technical Degree > Marketing > Life Sciences > Medical > Other

Attrition Distribution by Job Role

Attrition is highest among Sales Representatives, at ~40%, and lowest among Research Directors. We
can rank the job roles by attrition % as follows:
Sales Representative > Laboratory Technician > Human Resources > Sales Executive > Research
Scientist > Healthcare Representative > Manufacturing Director > Manager > Research Director

Attrition Distribution by Business Travel

Here we see that non-travelers face the least attrition, whereas employees who travel
frequently have the highest attrition, at around 25%.

Attrition Distribution by Department

Attrition is highest in the Sales department and lowest in the Research & Development
department. We could already infer this from the attrition by job role, where we saw that
Sales Representatives have almost 40% attrition and Sales Executives 17.5%.

Attrition Distribution by Gender

Attrition among male employees is slightly higher than among female employees; the difference is 2.2
percentage points.

Attrition Distribution by Marital Status

Attrition is highest among employees who are single, at 25.5%, and lowest among employees who are
divorced.

Attrition Distribution by OverTime

The chart is self-evident: attrition is markedly higher among employees who work overtime.

Attrition Distribution by NumCompaniesWorked

Here we can see that employees with high work experience, i.e. those who have worked for 7-9
companies, face the highest attrition, and employees with the least work experience (from a few
months to close to 2 years) have the second-highest attrition. The attrition margin between highly
experienced employees and new joiners is 3.1%.

Attrition Distribution by DistanceFromHome

Employees who travel 1 to 11 units face attrition of roughly 13% to 15%. Among them, employees
who travel 1-3 units to work face the least attrition, 12.9%, whereas those who travel 3 to 11
units face almost 15% attrition. Employees living 11 to 15 units from their workplace face the
most attrition, 22.5%. Employees living 20-29 units from their workplace also have comparatively
high attrition, 21.4%. Overall, employees living farther than 11 units from their workplace have
a high rate of attrition.

Attrition Distribution by JobLevel

Attrition is highest at the first job level, 26.3%. At the second level it decreases to 9.7%, a
fall of 16.6 percentage points. There is then a slight increase of around 5 percentage points at
level 3. Lastly, attrition is lowest at levels 4 and 5, as by then the employees have gained a lot
of experience, and we have already seen that overall attrition decreases with age.

Attrition Distribution by PercentSalaryHike

From the above, the salary hike range appears to have no effect on attrition.

Attrition Distribution by StockOptionLevel

Employees with the lowest stock option level face the most attrition, 24.4%, whereas employees with
stock option level 2 or 3 have comparatively low attrition.

Attrition Distribution by TotalWorkingYears

Employees who have just started working or have worked for up to about 6 years have the highest
attrition, around 28.8%, while employees who have worked for about 11 years have attrition of
around 15%, and those who have worked for more than 11 years have an even lower attrition of
10.1%.

Attrition Distribution by TrainingTimesLastYear

Attrition Distribution by YearsInCurrentRole

Attrition Distribution by YearsAtCompany

As the number of years at the company increases, and with it experience in the company, attrition %
decreases from 34.9% to 8.9%. Employees who have just joined or have been with the company for up
to 2 years are the most likely to leave.

Attrition Distribution by YearsSinceLastPromotion

From the above, the number of years since the last promotion appears to have no effect on attrition.

Attrition Distribution by YearsWithCurrManager

Attrition Distribution by Rates

Attrition Distribution by Job Satisfaction

It is somewhat obvious that employees with low job satisfaction will tend to leave their current
job and job role for a better opportunity. We see that attrition follows a decreasing trend with
job satisfaction, i.e. as job satisfaction increases, the attrition rate decreases, down to 11.3%.

Attrition Distribution by Education

Employees whose education is below college level have the highest attrition, at 18.2%, and
employees with a doctorate have the lowest, at 10.4%.

Attrition Distribution by Job Involvement

Here we again see that the higher the job involvement, the lower the rate of attrition, following a
decreasing trend. When involvement is low the rate is 33.7%, whereas when involvement is very
high the rate drops to 9%.

Attrition Distribution by Environmental Satisfaction

Low environmental satisfaction leads to high attrition, as expected. When satisfaction is medium,
high or very high, the rate lies between 13.5% and 15%, a margin of only 1.5 percentage points,
i.e. the rate is practically the same across these three levels.

Attrition Distribution by other factors

In this section we saw that univariate analysis is helpful in understanding the effect of individual
variables on attrition. We measured attrition % with respect to each variable. We found how distance
from home and years at the company impact attrition. We also saw how years with the current
manager and overtime relate to attrition.

This analysis is good, but it is not comprehensive. It does not allow us to check whether multiple
factors impact attrition together. We will analyze this using multivariate analysis.

Multivariate Analysis

Based on the univariate analysis done in the previous section, we can ask several deeper questions
to strengthen our understanding.

1. Why are young people leaving the company?
2. Why is attrition so high for the Human Resources and Technical Degree education fields?
3. Why is attrition so high in the Sales Rep job role?
4. Why is attrition high for employees who travel frequently (BusinessTravel = Travel_Frequently)?
5. Why are employees who have not been promoted not leaving the company (attrition is very
low)?
6. What are the characteristics of employees who work overtime?

To answer such questions we use Power BI to analyze multiple factors at once. This tool
allows us to interact with our data and gives a deeper understanding of the causes of attrition.

For a live demo of the report, you can check our company website:
https://bigintsolutions.com/showcase/employee-attrition-analytics/

Full report screenshot:

Let’s deep dive into our questions.

Why are young people leaving the company?

After selecting the age range for young employees (18-35), we see:

• 21.95% attrition among young employees
• Their monthly income is less than the median income (4K)
• Most young employees who are leaving are Sales Representatives, Laboratory Technicians and
Sales Executives
• 33% of them have bad work-life balance

Why is attrition so high in Sales Rep Job Role?

After selecting ‘Job Role’ as “Sales Representative”, we see:

• 39.76% attrition in that job role
• More than 30% of those who left are divorced
• 50% of those who left live more than 7 miles from the office
• 71% of those who left have low job involvement
• This group has the lowest job satisfaction

What are the characteristics of employees who work overtime?

After selecting OverTime as “Yes”, we see:

• 30.53% attrition when OverTime is “Yes”
• More than 50% of those who left are single
• 40% of them are young (age 18-36)
• 35% of those who left have been with their current manager for more than 2 years
• 45% of those who left have high or very high job involvement

As you can see above, Power BI allows us to measure attrition with respect to various factors at
once. We can select filters and see how the data changes based on that selection.
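
The same filter-then-measure idea can be sketched in R for readers without Power BI. This is an illustration on made-up data (`toy` and its values are hypothetical), not a reproduction of the Power BI report.

```r
# Hypothetical stand-in for the attrition data (values are made up)
toy <- data.frame(
  OverTime      = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  MaritalStatus = c("Single", "Single", "Married", "Divorced",
                    "Single", "Married", "Married", "Divorced"),
  Attrition     = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# "Select a filter": keep only employees who work overtime
segment <- subset(toy, OverTime == "Yes")

# Attrition rate inside the selected segment
segment_rate <- mean(segment$Attrition == "Yes")

# Profile of the leavers in that segment, here by marital status
leavers <- subset(segment, Attrition == "Yes")
leaver_profile <- prop.table(table(leavers$MaritalStatus))

print(segment_rate)
print(leaver_profile)
```

Changing the condition inside subset() corresponds to picking a different filter in the report.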

We invite you to play with the report hosted on our website and see for yourself.

Advanced Distribution Analysis

As mentioned in the previous section, there are visual techniques that let us analyze the spread of
our data effortlessly. In this section, we will take a look at one such visual: the box-whisker
plot.

Box-Whisker Plot

We will plot the age distribution of the dataset, split by positive and negative cases.

Code to achieve this:

library(ggplot2)

p <- ggplot(attrition, aes(x = Attrition, y = Age, color = Attrition)) +
  geom_boxplot() +
  scale_color_manual(values = c("#00BFC4", "#F8766D")) +
  # fun.y was deprecated in ggplot2 3.3.0; fun is its replacement
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4)

p

Output:

How to read the above chart?

In simpler terms, employees who leave are mostly young, with a mean age of around 34. The first
quartile spans 18-27 years.
A visualization like this makes the spread of the data very easy to see.
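
The numbers behind such a chart can be computed directly. The sketch below derives the quartiles that define each box and the group means that the extra point marks, on made-up data (`toy` and its values are hypothetical).

```r
# Hypothetical ages by outcome (values are made up)
toy <- data.frame(
  Attrition = c(rep("Yes", 5), rep("No", 5)),
  Age       = c(19, 24, 30, 36, 51, 28, 35, 40, 45, 52)
)

# Quartiles per group: the lower edge, middle line and upper edge of each box
quartiles <- tapply(toy$Age, toy$Attrition, quantile, probs = c(0.25, 0.5, 0.75))

# Mean per group: what the additional point on each box marks
group_mean <- tapply(toy$Age, toy$Attrition, mean)

print(quartiles)
print(group_mean)
```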

Model Development
OK, it’s time to start developing our model. So far we have analyzed our dataset through various
techniques including basic statistics, data profiling, univariate and multivariate distribution
analysis, and have seen one example of advanced distribution analysis.

We have a good (implicit) understanding of the variables involved and their importance. In this
section, we will look at how to statistically select the top factors influencing our variable of
interest, do a bit of feature engineering, fit a model and dive deep into the model results.

Top factors influencing attrition

There are various ways to determine the factors influencing attrition. We can use PCA (Principal
Component Analysis), RFE (Recursive Feature Elimination) or the Random Forest algorithm. For this
report, we will go with the Random Forest algorithm to determine the factors influencing attrition.
Code:

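#Feature importance
library(randomForest)

# If columns were read as character (as in the str() output above), convert
# them to factors first; randomForest needs a factor response to classify
attrition[] <- lapply(attrition, function(x) if (is.character(x)) as.factor(x) else x)

set.seed(100)
rf <- randomForest(Attrition ~ ., data = attrition, importance = TRUE)
varImpPlot(rf, type = 1)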

Output:

From the above chart you can see:

• The higher the mean decrease in accuracy, the more important the variable is to the model. In
the plot shown above, OverTime is the most important variable.
• Importance can also be measured by MeanDecreaseGini, which we do not discuss in this report.
• Note: We have not yet done any feature engineering, so the variable importance results may
change after that.
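
Mean decrease in accuracy is a permutation-based measure: shuffle one predictor and see how much prediction accuracy drops. The sketch below illustrates the idea in base R on simulated data, with a logistic regression swapped in for the random forest so that it runs without extra packages. The data, the variable names `x1` and `x2`, and the helper `accuracy()` are all assumptions made for illustration. randomForest itself computes the measure tree by tree on out-of-bag samples.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)                                  # informative predictor
x2 <- rnorm(n)                                  # pure noise
y  <- as.integer(x1 + rnorm(n, sd = 0.5) > 0)   # outcome driven by x1 only
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Stand-in model (assumption): logistic regression instead of a random forest
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Share of rows classified correctly at a 0.5 probability cut-off
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$y)
}

base.acc <- accuracy(fit, dat)

# Permute one column at a time and record the drop in accuracy
drop.acc <- sapply(c("x1", "x2"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  base.acc - accuracy(fit, shuffled)
})

print(drop.acc)
```

A large drop for a variable means the model relied on it. A drop near zero means the variable contributed little.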

How to check top 10 factors influencing attrition?

Code:

rf.10 <- randomForest(Attrition~., data=attrition, importance=TRUE)
varImpPlot(rf.10, type=1, sort=TRUE, n.var=10,
           main="Top-10 Variable Importance")

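Compared with the previous chart, the selection of important features changes; rf.10 is a newly
fitted forest, and results vary between runs unless the seed is reset. We will analyze this further
after doing some feature engineering.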

Feature Engineering

In this section we will perform certain steps:

1. Convert “string” columns into their numeric form. Machine learning algorithms usually do not
work well with raw string columns.
2. Store the encoded columns, and numeric columns that are really categorical (such as ratings),
as factor columns.
3. Remove variables that are no longer required.
4. Normalize numeric columns such as MonthlyIncome, PercentSalaryHike, and TotalWorkingYears.

These steps are a subset of what “feature engineering” involves. Feature engineering is essential
for selecting features that maximize the performance of our machine learning algorithm.
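
Two of these transformations, encoding a string column and normalizing a numeric one, can be sketched in base R on a toy data frame (the `toy` data and the helper `min.max()` are assumptions for illustration). Encoding through `as.factor()` assigns integer codes in alphabetical order of the levels. Min-max scaling maps each value to the [0, 1] range, which is what `method = "range"` in caret's `preProcess()` does by default.

```r
toy <- data.frame(
  OverTime      = c("Yes", "No", "No", "Yes"),
  MonthlyIncome = c(2000, 4000, 6000, 10000),
  stringsAsFactors = FALSE
)

# String -> factor -> integer code; levels sort alphabetically, so No = 1 and Yes = 2
toy$OverTimeN <- as.factor(as.numeric(as.factor(toy$OverTime)))

# Min-max scaling: (x - min) / (max - min)
min.max <- function(x) (x - min(x)) / (max(x) - min(x))
toy$MonthlyIncomeScaled <- min.max(toy$MonthlyIncome)

print(toy)
```

The same alphabetical ordering is why the encoded target ends up with “No” as 1 and “Yes” as 2.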

Code:

#Feature Engineering

#1. Convert string columns into numeric codes, stored as factors in new "N" columns
string.cols <- c("Gender", "JobRole", "MaritalStatus", "OverTime", "Department",
                 "EducationField", "BusinessTravel", "Attrition", "AgeBins")

for (col in string.cols) {
  attrition[[paste0(col, "N")]] <-
    as.factor(as.numeric(as.factor(attrition[[col]])))
}

#2. Convert numeric columns which are actually factors into factors
factor.cols <- c("Education", "EnvironmentSatisfaction", "JobInvolvement",
                 "JobSatisfaction", "PerformanceRating",
                 "RelationshipSatisfaction", "WorkLifeBalance", "StockOptionLevel")

for (col in factor.cols) {
  attrition[[col]] <- as.factor(attrition[[col]])
}

#3. Remove not required variables

attrition.fe1 <- subset(attrition, select = -c(Attrition, Gender, JobRole,
                                               MaritalStatus,
                                               OverTime,
                                               Department,
                                               EducationField,
                                               BusinessTravel,
                                               EducationC,
                                               EnvironmentSatisfactionC,
                                               JobInvolvementC,
                                               JobSatisfactionC,
                                               PerformanceRatingC,
                                               RelationshipSatisfactionC,
                                               WorkLifeBalanceC,
                                               Age,
                                               AgeBins,
                                               MonthlyRate,
                                               HourlyRate,
                                               DailyRate))

rf.2<-randomForest(AttritionN~.,data=attrition.fe1)
varImpPlot(rf.2)

#4. Normalize numeric variables

library(caret)

num.cols <- c("MonthlyIncome", "NumCompaniesWorked", "PercentSalaryHike",
              "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany",
              "YearsInCurrentRole", "YearsSinceLastPromotion",
              "YearsWithCurrManager")

attrition.n <- attrition.fe1[, num.cols]

# method = "range" rescales each column to the interval [0, 1]
preprocessParams <- preProcess(attrition.n, method=c("range"))

attrition.n.t <- predict(preprocessParams, attrition.n)

summary(attrition.n.t)

#put these back in the data frame

attrition.fe2 <- attrition.fe1[, !(names(attrition.fe1) %in% num.cols)]

attrition.fe3 <- cbind(attrition.fe2, attrition.n.t)

Let’s see our feature importance chart after performing feature engineering.

rf.3 <- randomForest(AttritionN~., data=attrition.fe3, importance=TRUE)
varImpPlot(rf.3, type=1)

Did you see any change from our baseline feature importance chart?

We are now ready to fit our model on this dataset: all variables are in numeric form, categorical
variables are stored as factors, and numeric variables are normalized. Here’s a screenshot of our
dataset.

Our variable of interest is “AttritionN”.

Fitting a model

To fit a model, we take the above dataset and apply the Random Forest algorithm.

We first split the data into training and test sets in an approximately 70/30 ratio. We fit the
model on the training data, tune it there, and then evaluate it on the test data.
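
The split in this report assigns each row to train or test independently with probabilities 0.7 and 0.3, so the realized proportions are only approximately 70/30 (the outputs in this section show 991 training rows and 479 test rows). Where an exact partition is wanted, one alternative, which is not the approach used in this report, is to sample row indices without replacement. A minimal sketch on a placeholder data frame `df`, which is an assumption for illustration:

```r
set.seed(100)                    # set the seed before sampling for a reproducible split
df <- data.frame(id = 1:1470)    # placeholder with the same row count as the dataset

# Draw exactly 70% of the row indices without replacement
train.idx  <- sample(nrow(df), size = round(0.7 * nrow(df)))
train.demo <- df[train.idx, , drop = FALSE]
test.demo  <- df[-train.idx, , drop = FALSE]

c(train = nrow(train.demo), test = nrow(test.demo))
```

Either approach works for this analysis. The exact variant simply makes the train and test sizes predictable.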

Code:

# Assign each row to group 1 (train) or 2 (test) with probability 0.7 / 0.3
ind <- sample(2, nrow(attrition.fe3), replace = TRUE, prob = c(0.7, 0.3))
train <- attrition.fe3[ind==1,]
test <- attrition.fe3[ind==2,]

set.seed(100)
rf.4 <- randomForest(AttritionN~.,data=train, importance=T)
print(rf.4)

Call:
 randomForest(formula = AttritionN ~ ., data = train, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 13.82%

Confusion matrix:
    1  2 class.error
1 812  9  0.01096224
2 128 42  0.75294118

The OOB error is 13.82% if we take all the variables.
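
The reported figures can be reproduced from the confusion matrix by hand. In the sketch below the matrix is typed in from the output above (rows are actual classes, columns are predicted classes); the object name `cm` is an assumption for illustration.

```r
# Confusion matrix copied from print(rf.4): rows = actual, columns = predicted
cm <- matrix(c(812,  9,
               128, 42),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

# Overall OOB error: misclassified cases divided by all cases
oob.error <- (cm["1", "2"] + cm["2", "1"]) / sum(cm)

# Per-class error: share of each actual class that was misclassified
class.error <- 1 - diag(cm) / rowSums(cm)

round(oob.error, 4)    # 0.1382
round(class.error, 4)  # 0.011 for class 1, 0.7529 for class 2
```

Class 2 (attrition “Yes”) is misclassified about 75% of the time, so the low overall error mostly reflects the much larger class 1.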

Model analysis

In this section, we take a closer look at the model and at ways to tune it. A model must be tuned
with respect to a metric; we can choose either accuracy or out-of-bag (OOB) error. In this report, we
analyze and optimize the OOB error.

Error analysis

Let’s plot the error rate with number of trees.

Code:

layout(matrix(c(1,2),nrow=1),
       widths=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rf.4)
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1),type="n", axes=FALSE, xlab="", ylab="")
legend("top", colnames(rf.4$err.rate),col=1:3,cex=0.8,fill=1:3)

How to read the above chart?

• The plot shows the error rate against the number of trees.

• The black line is the overall OOB error, the red line is the error for negative cases (“No”, or 1),
and the green line is the error for positive cases (“Yes”, or 2).

To see the error rate at a more granular level, we can plot it on a log scale.
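
A log-scaled error plot can be sketched with base R's `matplot()`. The error matrix `err` below is simulated for illustration (its values are assumptions, shaped like the three columns of `rf.4$err.rate`); with a fitted forest, `rf.4$err.rate` would take its place.

```r
ntree <- 500
trees <- seq_len(ntree)

# Simulated error curves that settle as trees are added (illustrative values only)
err <- cbind(OOB = 0.14 + 0.10 / sqrt(trees),
             "1" = 0.01 + 0.05 / sqrt(trees),
             "2" = 0.75 + 0.15 / sqrt(trees))

pdf(NULL)  # draw to a null device so the sketch also runs without a display
matplot(trees, err, type = "l", log = "y", lty = 1, col = 1:3,
        xlab = "trees", ylab = "Error (log scale)")
legend("right", legend = colnames(err), col = 1:3, lty = 1, cex = 0.8)
invisible(dev.off())
```

On a log axis the small class 1 error is no longer squeezed against the bottom of the chart, which makes its changes easier to see.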

Tuning

In this section, we will tune parameters of the Random Forest algorithm, specifically the number
of trees and the mtry parameter.

mtry: Number of variables randomly sampled as candidates at each split.

ntree: Number of trees to grow.

For this report, we will fix ntree at 500 and optimize mtry.

Code:

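set.seed(120)
res <- tuneRF(x=subset(train, select = -c(AttritionN)),
              y=train$AttritionN,
              stepFactor = 1.5,   # mtry is scaled up or down by this factor at each step
              plot=TRUE,
              ntreeTry = 500,
              trace = TRUE,
              improve = 0.0001,   # minimum relative OOB improvement to keep searching
              importance=TRUE)

print(res)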

       mtry  OOBError
4.OOB     4 0.1422805
5.OOB     5 0.1382442
7.OOB     7 0.1352170
10.OOB   10 0.1382442

The mtry value that gives the lowest OOB error is 7. Let’s set mtry to 7 in our model.
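
Rather than reading the best value off the printed table, it can be picked programmatically. A minimal sketch, with the `res` matrix typed in from the output above so that the block runs on its own (the name `best.mtry` is an assumption for illustration):

```r
# tuneRF() result copied from the printed output
res <- matrix(c( 4, 0.1422805,
                 5, 0.1382442,
                 7, 0.1352170,
                10, 0.1382442),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("4.OOB", "5.OOB", "7.OOB", "10.OOB"),
                              c("mtry", "OOBError")))

# The row with the smallest OOB error gives the mtry to use
best.mtry <- res[which.min(res[, "OOBError"]), "mtry"]
best.mtry  # 7
```

The candidate values 4, 7, and 10 come from repeatedly scaling the starting value of 5 by the step factor of 1.5 and rounding.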

Code:

set.seed(120)
rf.6 <- randomForest(AttritionN~.,data=train, importance=T, mtry=7,
                     ntree=500)
print(rf.6)

Call:
 randomForest(formula = AttritionN ~ ., data = train, importance = T,
     mtry = 7, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 13.52%

Confusion matrix:
    1  2 class.error
1 812  9  0.01096224
2 125 45  0.73529412

The OOB error rate went from 13.82% to 13.52%, a marginal improvement. How does the model
perform on the test data?

Code:
# Note: this fits a new forest on the test rows and reports that forest's OOB error
set.seed(120)
rf.7 <- randomForest(AttritionN~.,data=test, importance=T, mtry=7,
                     ntree=500)
print(rf.7)

Call:
 randomForest(formula = AttritionN ~ ., data = test, importance = T,
     mtry = 7, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 13.57%

Confusion matrix:
    1 2 class.error
1 411 1 0.002427184
2  64 3 0.955223881

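The OOB error on the test data is 13.57%, which is very close to the figure for the training data.
Note that rf.7 is a new forest fitted on the test rows, so this is that forest's own OOB error;
scoring the tuned model rf.6 on unseen data would instead use predict(rf.6, newdata = test). When
evaluating model performance we often face overfitting or underfitting problems, either because we
have too little training data or because we have not chosen optimal parameters. These problems can
be addressed iteratively by adding more data, pruning features, and optimizing mtry and ntree. That
discussion is out of scope for this report.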
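
Overall error can hide poor performance on the minority class. The sketch below takes the confusion matrix printed above and compares the model with a naive baseline that always predicts class 1 (no attrition). The names `cm`, `model.error`, `baseline.error`, and `recall.yes` are assumptions for illustration.

```r
# Confusion matrix copied from print(rf.7): rows = actual, columns = predicted
cm <- matrix(c(411, 1,
                64, 3),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

model.error    <- 1 - sum(diag(cm)) / sum(cm)
baseline.error <- sum(cm["2", ]) / sum(cm)        # always predicting class 1 misses every class 2 case
recall.yes     <- cm["2", "2"] / sum(cm["2", ])   # share of actual leavers that were identified

round(c(model = model.error, baseline = baseline.error, recall = recall.yes), 4)
```

Here the model's 13.57% error is only slightly below the 13.99% of the baseline, and it identifies 3 of the 67 employees who left, which is worth keeping in mind when reading the headline error rate.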

We will now see the AUC of our model and plot ROC chart. This will be the last part in analyzing our
model performance.

ROC/AUC

The performance of a classifier is also commonly measured by another metric, “AUC”, the area
under the ROC curve. Let’s compute the AUC for our model and plot the ROC chart.

Code:
library(ROCR)

# Without newdata, predict() returns out-of-bag class probabilities for the training rows
oob.prob <- predict(rf.6, type="prob")
pred.obj <- prediction(oob.prob[,2], train$AttritionN)

auc <- performance(pred.obj, "auc")

roc.perf <- performance(pred.obj, "tpr", "fpr")

plot(roc.perf, main="ROC Curve for Random Forest", col=2, lwd=2)

abline(a=0, b=1, lwd=2, lty=2, col="gray")
auc <- unlist(slot(auc, "y.values"))
auc <- round(auc, 4)

legend(.8, .3, auc, title="AUC")

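The AUC for our model is close to 0.84 (84%), which indicates decent discriminating ability. The
larger the area under the curve, the better; 0.5 corresponds to random guessing. Because predict()
is called without new data, this AUC is based on out-of-bag predictions for the training data.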
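
AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it in base R from that definition, using the rank-sum (Mann-Whitney) formula. The scores and labels are made-up values, and `auc.rank()` is a hypothetical helper, not part of ROCR.

```r
# AUC from ranks: (sum of positive ranks - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
auc.rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks are used for tied scores
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Made-up predicted probabilities and true classes (1 = positive, 0 = negative)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    0,   1,   0)

auc.rank(scores, labels)  # 0.75: 12 of the 16 positive-negative pairs are ordered correctly
```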

Conclusion
In this report, we walked through the steps of approaching a data analytics problem. We presented
statistical and visualization techniques for analyzing the data, carried out univariate and
multivariate analysis, and provided steps to create, analyze, and fine-tune a model.
For feature engineering and model fitting we used R, and for multivariate and interactive analysis
we used Power BI.

We also took a close look at several questions related to employee attrition: “why are young
people leaving the company?”, “why is attrition so high in the Sales Rep role?”, and “what are the
characteristics of people doing overtime?”

Next Steps
We can do further work on model selection and tuning. We chose the Random Forest algorithm, but
on what basis? A general approach is to try multiple algorithms, measure the performance of each,
and go with the one that performs best. We could have evaluated algorithms such as SVM, K-NN, and
Neural Networks.

We can remove variables that are strongly correlated with one another. This reduces the
dimensionality of our dataset and may improve model performance.
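
A minimal sketch of how correlated numeric variables might be flagged, using base R's `cor()` on simulated data. The column names mirror the dataset, but the values are generated for illustration and the 0.8 cut-off is an arbitrary assumption.

```r
set.seed(42)
n <- 200
years.at.company <- rpois(n, 7)

# Simulated columns (assumption): manager tenure is built to track company tenure
toy <- data.frame(
  YearsAtCompany       = years.at.company,
  YearsWithCurrManager = pmax(0, years.at.company - rpois(n, 1)),
  PercentSalaryHike    = sample(11:25, n, replace = TRUE)   # unrelated to tenure
)

cor.mat <- cor(toy)
cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA        # keep each pair only once
high <- which(abs(cor.mat) > 0.8, arr.ind = TRUE)     # pairs above the cut-off

data.frame(var1 = rownames(cor.mat)[high[, "row"]],
           var2 = colnames(cor.mat)[high[, "col"]],
           r    = cor.mat[high])
```

For each flagged pair, one of the two variables is a candidate for removal before refitting the model.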

We can enhance our dataset to provide an end-to-end view of the employee attrition problem. We
can include hiring data (by job role, department, time to hire), competitor data (where employees
are going), and industry data to compare attrition with industry averages.

References
We referred to several posts on StackOverflow community, R-bloggers website,
machinelearningmastery, and other community websites to pick the best R code to solve this
problem.

Contact Us
If you have any questions on this report, or you have a data analytics problem for us to work on, feel
free to reach us through our website. We will get back to you within 12-24 business hours.

https://www.bigintsolutions.com/Contact

Thank you.
