
PREDICTIVE MODELING

PROJECT REPORT
Contents

Problem 1 (Linear Regression)

1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA, duplicate values). Perform univariate and bivariate analysis.

1.2 Impute null values if present, and also check for values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels of the ordinal variables and take action accordingly. Explain why you are combining these sub-levels with appropriate reasoning.

1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Perform checks for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning.

1.4 Inference: Based on these predictions, what are the business insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.

Problem 2 (Logistic Regression and LDA)

2.1 Data ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform univariate and bivariate analysis. Do exploratory data analysis.

2.2 Do not scale the data. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply logistic regression and LDA (linear discriminant analysis).

2.3 Performance metrics: Check the performance of predictions on the train and test sets using accuracy, confusion matrix, ROC curve and ROC_AUC score for each model. Final model: compare both models and write an inference on which model is best/optimized.

2.4 Inference: Based on these predictions, what are the insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
Problem 1: Linear Regression
INTRODUCTION
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a
diamond). The company earns different profits in different price slots. You have to help the company
predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish
between higher-profit and lower-profit stones and improve its profit share. Also, provide them with
the 5 attributes that are most important.

DATA DESCRIPTION

Carat   - Weight of the cubic zirconia (in carats). Numeric.
Cut     - Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal. Categorical (ordinal).
Colour  - Colour of the cubic zirconia, with D being the worst and J the best. Categorical (ordinal).
Clarity - Refers to the absence of inclusions and blemishes, in order from best to worst (IF = flawless, I1 = level 1 inclusions): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1. Categorical (ordinal).
Depth   - Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter. Numeric.
Table   - Width of the cubic zirconia's table expressed as a percentage of its average diameter. Numeric.
Price   - Price of the cubic zirconia (the target variable). Numeric.
X       - Length of the cubic zirconia in mm. Numeric.
Y       - Width of the cubic zirconia in mm. Numeric.
Z       - Height of the cubic zirconia in mm. Numeric.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA, duplicate values). Perform Univariate and
Bivariate Analysis.

Solution.

Summary of the dataset

The dataset contains 26,967 rows and 11 columns. There are 2 integer-type features, 6 float-type
features and 3 object-type features. 'price' is the target variable and all others are predictor
variables. The first column ("Unnamed: 0") is only a serial number, so we can remove it. Except for
the column 'depth', every column has a non-null count of 26,967.
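
As a minimal sketch of this step (the file name "cubic_zirconia.csv" and the lower-case column names are assumptions; adjust them to the actual dataset), these checks can be reproduced with pandas:

import pandas as pd

# Load the data (file name assumed)
df = pd.read_csv("cubic_zirconia.csv")

# Drop the serial-number index column
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

print(df.shape)               # (26967, 10) after dropping the index column
print(df.dtypes)              # integer, float and object columns
print(df.isnull().sum())      # 'depth' shows 697 missing values
print(df.duplicated().sum())  # number of exact duplicate rows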

EXPLORATORY DATA ANALYSIS


Step 1: Check and remove any duplicates in the dataset
Step 2: Check and treat any missing values in the dataset
Step 3: Outlier Treatment
Step 4: Univariate Analysis
Step 5: Bi-variate Analysis

Step 1: Check and remove any duplicates in the dataset. After checking for duplicate values in the
dataset, it is confirmed that there are no duplicates, hence no treatment is required to remove them.

Step 2: Check and treat any missing values in the dataset


Step 3: Outlier treatment. Using boxplots we confirm and visualise the presence of outliers in the
dataset, and then proceed to treat the outliers present.

Below we see that the outliers have been treated accordingly.
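
The report does not spell out the exact treatment used; a common boxplot-based approach, shown here only as a sketch, is to cap values beyond the 1.5*IQR whiskers (the list of treated columns is an assumption):

def cap_outliers_iqr(data, cols):
    # Cap values outside the 1.5*IQR whiskers to the whisker limits
    data = data.copy()
    for col in cols:
        q1, q3 = data[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        data[col] = data[col].clip(lower=lower, upper=upper)
    return data

numeric_cols = ["carat", "depth", "table", "x", "y", "z", "price"]  # assumed column names
df = cap_outliers_iqr(df, numeric_cols)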


Step 4: Univariate Analysis

The dataset shows a significant number of outliers in one or a few of the variables. Skewness is
measured for every attribute present, and after performing the univariate analysis we can see that
the distributions of some quantitative features such as "carat" and the target feature "price" are
heavily right-skewed.

Step 5: Bi-variate Analysis

➢ It involves the analysis of two variables (often denoted X and Y) for the purpose of
determining the empirical relationship between them.
➢ It can be inferred that most features correlate with the price of the diamond. The notable
exception is "depth", which has a negligible correlation (<1%).
OBSERVATIONS BASED ON EDA

The inferences drawn from the above Exploratory Data analysis:

Observation-1: 'price' is the target variable while all others are predictors. The dataset
contains 26,967 rows and 11 columns, with 2 integer-type features, 6 float-type features and
3 object-type features. The first column ("Unnamed: 0") is only a serial number, so we can
remove it.

Observation-2: In the given dataset the mean and median values do not differ much. The
minimum value of "x", "y" and "z" is zero, which indicates faulty values: dimensionless or
two-dimensional stones are not possible, so we filter those rows out as clearly faulty data
entries. There are three object data types: 'cut', 'colour' and 'clarity'.

Observation-3: There are 697 missing values in the depth column. There are also some
duplicate rows present (33 duplicate rows out of 26,958), which is nearly 0.12% of the total
data, so in this case we have dropped the duplicated rows.

Observation-4: There is a significant number of outliers in some variables, i.e. data points
that are far from the rest of the dataset and that would affect the outcome of our regression
model, so we treat the outliers. The distributions of some quantitative features such as
"carat" and the target feature "price" are heavily right-skewed.

Observation-5: Most features do correlate with the price of the diamond. The notable
exception is "depth", which has a negligible correlation (<1%). Observation on 'cut': the
Premium cut diamonds are the most expensive, followed by the Very Good cut.

1.2 Impute null values if present, and also check for values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Check for the
possibility of combining the sub-levels of the ordinal variables and take action
accordingly. Explain why you are combining these sub-levels with appropriate
reasoning.

Solution.
➢ We start by checking the dataset for any null values. As seen in Figure 8, there are a total of
697 null values, all in the depth column.
➢ The median is then computed for the affected attribute so that it can be used to replace the
null values present in the dataset.
➢ In Figure 9 below we can see that the null values have been replaced by the computed median.
➢ After treating the null values and dropping the faulty rows, the shape of the dataset becomes
26,925 rows and 10 columns.
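
A minimal sketch of this cleaning step (column names assumed as before):

# Median imputation for the 697 missing 'depth' values
df["depth"] = df["depth"].fillna(df["depth"].median())

# Rows where any physical dimension is zero are faulty entries, so drop them
df = df[(df["x"] > 0) & (df["y"] > 0) & (df["z"] > 0)]

# Drop the exact duplicate rows noted in the EDA
df = df.drop_duplicates()

print(df.shape)  # approximately (26925, 10) after cleaning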
Is scaling necessary in this case?

No, it is not strictly necessary; we will get an equivalent solution whether or not we apply a
linear scaling. Scaling is still recommended for regression techniques because it helps gradient
descent converge faster and reach the global minimum. When the number of features becomes
large it also helps the model run quickly; otherwise, if scaling is not done in pre-processing,
the starting point can be very far from the minimum.

For now we will build the model without scaling, and later we will compare the output of the
regression model on scaled data.

1.3 Encode the data (having string values) for Modelling. Split the data into train and
test (70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and
check the performance of Predictions on Train and Test sets using Rsquare, RMSE &
Adj Rsquare. Compare these models and select the best one with appropriate
reasoning.
Solution.

Train-Test Split:

➢ Copy all the predictor variables into the X data frame and copy the target into the y data
frame, then split X and y into a training set and a test set.
➢ For this we use the sklearn package: we split X and y in a 70:30 ratio, then invoke the linear
regression function and find the best-fit model on the training data.
➢ The intercept for our model is -3171.9504473076336.
➢ The intercept (often labelled the constant) is the expected mean value of Y when all the X
values are 0; when the X values can never be zero, the intercept has no intrinsic meaning.
➢ In the present case, when the other predictor variables (carat, cut, colour, clarity, and so on)
are zero, then C = -3172 in Y = m1X1 + m2X2 + ... + mnXn + C + e, which means the predicted
price is -3172. This doesn't make any sense, so to deal with it we apply a z-score
transformation, which makes the intercept nearly zero.
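
A minimal sketch of the encoding, split and fit described above (the ordinal mappings, column names and random_state are assumptions, not taken from the report):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Ordinal encodings, with the orderings taken from the data description
cut_order = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
color_order = {c: i for i, c in enumerate("DEFGHIJ", start=1)}
clarity_order = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}

df["cut"] = df["cut"].map(cut_order)
df["color"] = df["color"].map(color_order)
df["clarity"] = df["clarity"].map(clarity_order)

X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
print("Intercept :", lr.intercept_)
print("Train R2  :", r2_score(y_train, lr.predict(X_train)))
print("Test  R2  :", r2_score(y_test, lr.predict(X_test)))
print("Train RMSE:", np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print("Test  RMSE:", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))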

R square on training data : 0.9311935886926559

R square on testing data : 0.931543712584074

➢ R-square is the percentage of the response variable variation that is explained by the linear
model, computed as:
R-square = Explained Variation / Total Variation
➢ It is always between 0 and 100%: 0% indicates that the model explains none of the variability
of the response data around its mean, and 100% indicates that the model explains all of the
variability of the response data around its mean.
➢ In this regression model the R-square values on the training and test data are
0.9311935886926559 and 0.931543712584074 respectively.
➢ The RMSE on the training and test data is 907.1312415459143 and 911.8447345328437
respectively.
➢ From the scatter plot we see that the relationship is linear and there is a very strong
correlation between the predicted y and the actual y.
➢ It also shows a lot of spread, which indicates some unexplained variance in the output.
➢ As the training data and test data scores are almost in line, we can conclude that this model
is a right-fit model.

                Training Data            Test Data
R-square        0.9311935886926559       0.931543712584074
RMSE            907.1312415459143        911.8447345328436

Applying the z-score (standardisation)

➢ We initiate the linear regression function again and find the best-fit model on the standardised
training data, then explore the coefficients for each of the attributes.
➢ The intercept for our model is now -5.879615251304736e-16 and the coefficient of
determination is 0.9315051288558229.
➢ By applying the z-score the intercept has changed from -3171.950447307667 to
-5.879615251304736e-16: the coefficients have changed and the bias has become nearly zero,
but the overall accuracy is still the same.
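
A minimal sketch of this standardisation step, continuing from the split above (scipy's zscore is used here for illustration; for a strict pipeline the scaling statistics should be fitted on the training set only):

from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

# Standardise the predictors and the target, then refit
X_train_z = X_train.apply(zscore)
y_train_z = zscore(y_train)

lr_z = LinearRegression().fit(X_train_z, y_train_z)
print("Intercept after scaling:", lr_z.intercept_)        # of the order of 1e-16
print("R2 after scaling       :", lr_z.score(X_train_z, y_train_z))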

Check Multi-collinearity using VIF


• We can observe very strong multicollinearity in the dataset; ideally the VIF values should be
within 1 to 5.
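
A minimal sketch of the VIF check, continuing from the split above (statsmodels' variance_inflation_factor is assumed to be the method used):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant so the VIFs are computed against an intercept
X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))  # values well above 5 indicate strong multicollinearity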
Linear Regression using statsmodels

➢ Assuming the null hypothesis is true, i.e. that in the population from which this sample is
drawn the regression coefficients are zero, we ask what the probability is of obtaining the
coefficients estimated above in this sample.
➢ As the overall p-value is less than alpha, we reject H0 and accept Ha: at least one regression
coefficient is not 0.
➢ For example, the p-value for the 'depth' variable is 0.449, which is much higher than 0.05,
meaning this dimension is of little use. We can say that attributes with a p-value greater than
0.05 are poor predictors of price.
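
A minimal sketch of the statsmodels check, continuing from the split above:

import statsmodels.api as sm

# Ordinary least squares to read per-coefficient p-values and adjusted R-squared
X_train_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_sm).fit()
print(ols_model.summary())          # R-squared, Adj. R-squared, F-statistic, p-values
print(ols_model.pvalues["depth"])   # around 0.449 here, so 'depth' is not significant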

Root Mean Squared Error (Training) ------ RMSE: 907.1312415459133

Root Mean Squared Error (Test) ----------- RMSE: 911.8447345328433

1.4 Inference: Based on these predictions, what are the business insights and
recommendations?

Please explain and summarise the various steps performed in this project. There
should be proper business interpretation and actionable insights present.

Solution.

Inference:
From the linear plot we can see a very strong correlation between the predicted y and the actual y.
But there is a lot of spread, which indicates some noise in the dataset, i.e. unexplained variance in
the output.

Linear regression Performance Metrics:


Intercept for the model: -3171.950447307667
R-square on training data: 0.9311935886926559
R-square on testing data: 0.931543712584074
RMSE on training data: 907.1312415459143
RMSE on testing data: 911.8447345328436

As the training data and testing data scores are almost in line, we can conclude this model is a
right-fit model.

Impact of scaling:
We can observe that by applying the z-score the intercept became -5.87961525130473e-16,
whereas earlier it was -3171.950447307667. The coefficients have changed and the bias became
nearly zero, but the overall accuracy is still the same.

Multicollinearity: We can observe that there is very strong multicollinearity present in the
dataset.

From statsmodels: we can see that R-squared (0.931) and Adj. R-squared (0.931) are the same,
and the overall p-value is less than alpha.
➢ Finally, we can conclude that the best 5 attributes, i.e. the ones most important for predicting
the price, are 'carat', 'cut', 'colour', 'clarity' and the width 'y'.
➢ When 'carat' increases by 1 unit, the diamond price increases by 8901.94 units, keeping all
other predictors constant.
➢ When 'cut' increases by 1 unit, the diamond price increases by 109.19 units, keeping all other
predictors constant.
➢ When 'colour' increases by 1 unit, the diamond price increases by 272.92 units, keeping all
other predictors constant.
➢ When 'clarity' increases by 1 unit, the diamond price increases by 436.44 units, keeping all
other predictors constant.
➢ When 'y' increases by 1 unit, the diamond price increases by 1464.83 units, keeping all other
predictors constant.
➢ The p-value for the 'depth' variable is 0.449, which is much greater than 0.05. That means this
attribute is of no use.
➢ There are also some negative coefficients. 'x', i.e. the length of the cubic zirconia in mm, has a
negative coefficient of -1417.9089 with a p-value less than 0.05, so we can conclude that stones
with a greater length are lower-profit stones.
➢ Similarly, the 'z' variable has a negative coefficient of -711.23 with a p-value less than 0.05, so
we can conclude that stones with a greater 'z' are lower-profit stones.

Recommendations:
➢ The Gem Stones company should consider the features 'carat', 'cut', 'colour', 'clarity' and the
width 'y' as the most important for predicting the price, in order to distinguish between
higher-profit and lower-profit stones and so obtain a better profit share.
➢ As we can see from the model, the higher the width ('y') of the stone, the higher the price, so
stones with a higher width ('y') should be considered higher-profit stones.
➢ The 'Premium' cut diamonds are the most expensive, followed by the 'Very Good' cut; these
should be considered higher-profit stones.
➢ Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two categories should
also be considered higher-profit stones.
➢ For 'x', i.e. the length of the stone, the greater the length, the lower the price; so stones with a
greater length ('x') are lower in profitability.
➢ The greater the 'z', i.e. the height of the stone, the lower the price. This is because if a
diamond's height is too large, the diamond appears 'dark', as it no longer returns an attractive
amount of light. Stones with a greater 'z' are therefore also lower in profitability.
Problem 2: Logistic Regression and LDA
INTRODUCTION
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company predict whether an employee will opt for the package or
not on the basis of the information given in the dataset. Also, find out the important factors on the
basis of which the company can focus on particular employees to sell its packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.

Solution.

Here we load all the necessary libraries for model building and read the head and tail of the dataset
to check whether the data has been read in properly.
• There are no null values in the dataset.
• We have integer and object data.

The data we have is integer and continuous, and the holiday package variable is our target.

Salary, age, educ (education), number of young children, number of older children, and whether the
employee has been to a foreign country are the given attributes we have to examine to help the
company predict whether a person will opt for the holiday package or not.

NULL VALUES

There are no null values in the dataset


CHECK FOR DUPLICATES IN THE GIVEN DATASET
Number of duplicate rows = 0
Unique values for categorical variables

Percentage of employees that are interested in the holiday package: 45.9%


UNIVARIATE ANALYSIS
SKEWNESS

• We can see that most of the distributions are right-skewed, except for educ.
• The salary distribution has the maximum number of outliers.
• There are some outliers in educ, number of young children and number of older children.

CATEGORICAL UNIVARIATE ANALYSIS

Most of the employees have not been to a foreign country.

The employees who prefer the holiday package are slightly fewer than those who don't.
• As we can observe, people with salaries below 150,000 prefer the holiday package.
• Employees aged roughly 50 to 60 seem not to take the holiday package, whereas people aged
30 to 50 with a salary of less than 50,000 have opted more for the holiday package.

BIVARIATE ANALYSIS
DATA DISTRIBUTION

There is hardly any correlation between the variables, and the data appears roughly normal. There
is no large difference in the data distributions between the two holiday-package classes; I don't see
two clearly different distributions in the dataset provided.
CHECKING FOR CORRELATION

There is hardly any correlation between the variables, so there is no collinearity.

After treating the outliers, the data looks like this:

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).

Solution.

Encoding the data (having string variables)


Head of the dataset

Here we have done one-hot encoding to create dummy variables; in the head shown above, all
values of foreign_yes are 0.
The logistic regression model predicts better results when the string values are encoded.
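
A minimal sketch of the encoding (df2 stands for the Problem 2 data frame; the name is an assumption):

import pandas as pd

# One-hot encode the object (string) columns; drop_first avoids a redundant dummy level
cat_cols = df2.select_dtypes(include="object").columns
df2 = pd.get_dummies(df2, columns=cat_cols, drop_first=True)
print(df2.head())  # e.g. a 'foreign_yes' dummy replaces the original 'foreign' column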
Train/Test split
We will split the data in a 70:30 ratio.

Applying Logistic Regression


Applying GridSearchCV for Logistic Regression

The grid search method is used with logistic regression to find the optimal solver and the
parameters for solving.
Using grid search we found the parameters penalty = 'l2', solver = 'liblinear' and
tolerance = 1e-06.
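
A minimal sketch of the split and grid search (the target column name, parameter grid, scoring metric and random_state are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

target_col = "Holliday_Package_yes"          # assumed name of the encoded target column
X = df2.drop(columns=[target_col])
y = df2[target_col].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

param_grid = {
    "penalty": ["l2"],
    "solver": ["liblinear", "lbfgs", "newton-cg"],
    "tol": [1e-4, 1e-6],
}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
print(grid.best_params_)   # the report found penalty='l2', solver='liblinear', tol=1e-06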

Predictions on the training and test sets


# Class predictions from the tuned logistic regression model
ytrain_predict = best_model.predict(X_train)
ytest_predict = best_model.predict(X_test)

Getting the probabilities on the test set

Performance Metrics will be discussed in 2.3


LDA (linear discriminant analysis)
DATASET HEAD

DATASET HEAD AFTER DATA PROCESSING

Build LDA Model

PROBABILITY PREDICTION

Performance Metrics will be discussed in 2.3
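
A minimal sketch of the LDA model, using the same encoded, unscaled train/test split as above:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

# Class predictions and class-membership probabilities
lda_train_pred = lda_model.predict(X_train)
lda_test_pred = lda_model.predict(X_test)
lda_test_proba = lda_model.predict_proba(X_test)   # probabilities used for the ROC curve in 2.3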


2.3 Performance Metrics: Check the performance of predictions on the train and test
sets using accuracy, confusion matrix, ROC curve and ROC_AUC score for each model.
Final model: compare both models and write an inference on which model is
best/optimized.

Solution.

PERFORMANCE METRICS FOR LOGISTIC REGRESSION


Confusion matrix on the training data

Here we see that the precision for class 1 is 0.63, recall is 0.45, accuracy is 0.63 and the F1 score is 0.63.

Confusion matrix on the test data

Here we see that the precision for class 1 is 0.69, recall is 0.45, accuracy is 0.66 and the F1 score is 0.55.

Accuracy - Training Data


0.6344262295081967
AUC and ROC for the training data

Accuracy - Test Data


0.6564885496183206

AUC and ROC for the testing data

Metrics for train data

lr_train_precision 0.65

lr_train_recall 0.45

lr_train_f1 0.53

Metrics for test data

lr_test_precision 0.69

lr_test_recall 0.45

lr_test_f1 0.55
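
A minimal sketch of how these metrics can be produced for either model (best_model or lda_model from the sketches above):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

def report(model, X, y, label):
    # Accuracy, confusion matrix, classification report and AUC/ROC for one split
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    print(label, "AUC:", roc_auc_score(y, proba))
    fpr, tpr, _ = roc_curve(y, proba)
    plt.plot(fpr, tpr, label=label)

report(best_model, X_train, y_train, "LogReg train")
report(best_model, X_test, y_test, "LogReg test")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend(); plt.show()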

PERFORMANCE METRICS FOR LDA (linear discriminant analysis)


MODEL SCORE

0.6327868852459017
CLASSIFICATION REPORT TRAIN DATA

Here we see that the precision for class 1 is 0.65, recall is 0.44, accuracy is 0.63 and the F1 score is 0.52.

confusion_matrix for train data


array([[263,  66],
       [158, 123]])

Model score for test data


0.6564885496183206

Classification report for test data

Confusion matrix for test data


array([[118,  24],
       [ 66,  54]])
CHANGING THE CUT-OFF VALUE TO CHECK FOR THE OPTIMAL VALUE
THAT GIVES BETTER ACCURACY AND F1 SCORE
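
A minimal sketch of this cut-off sweep, applied here to the LDA model from the sketch above (the range of candidate cut-offs is an assumption):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

proba_train = lda_model.predict_proba(X_train)[:, 1]
for cutoff in np.arange(0.30, 0.71, 0.05):
    pred = (proba_train >= cutoff).astype(int)
    print(round(cutoff, 2),
          "accuracy:", round(accuracy_score(y_train, pred), 3),
          "f1:", round(f1_score(y_train, pred), 3))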

AUC and ROC for the training data


Comparing both models, we find the results are much the same, but LDA works better when the
target variable is categorical.
The AUC/ROC results for the two models are almost equivalent, so it is very difficult to
differentiate between them; the scores are also almost at par with each other, and both models
perform at par with each other.
Since LDA works better with a categorical target, we will pick it in this situation.

2.4 Inference: Based on these predictions, what are the insights and
recommendations?

Please explain and summarise the various steps performed in this project. There
should be proper business interpretation and actionable insights present.

Solution.
We were given a problem in which we had to find out whether employees will opt for a holiday
package or not.
We modelled the data using logistic regression and LDA.
We found that the results from both methods are much the same; predictions were made using both
models.

While doing EDA we found that:

• Most of the employees who are above 50 don't opt for holiday packages. It seems they are not
interested in holiday packages at all.

• Employees in the age range of 30 to 50 do opt for holiday packages. It seems younger people
believe in spending on holiday packages, so age plays a very important role in deciding whether
they will opt for the package or not.

• Also, people who have a salary of less than 50,000 opt for holiday packages, so salary is also a
deciding factor for the holiday package.

• Education also plays an important role in deciding on the holiday packages.

• To improve the customer base we need to look into these factors.

Recommendations

As we already have a customer base aged 30 to 50, we need to look for options to target older
people and people who earn more than 150,000.
• As we know, most older people prefer to visit religious places, so it would be better to target
those places and provide them with packages in which they can visit religious places.

• We can also look into the family dynamics of the older people: if they have elder children (e.g.
aged 30 to 40) who can use the holiday packages, the deal should include a family package.

• People who earn more than 150,000 don't spend much on these holiday packages; they tend to
go for lavish holidays, so we can provide them with customized packages according to their
wishes, such as fancy hotels, longer vacations and personal cars during the holiday, in order to
attract such employees.

• In addition, for people who earn more than 150,000 we can provide extra facilities according
to their own wishes at the time.

In this project we started with EDA, descriptive statistics and the null value condition check; we
performed univariate and bivariate analysis, did exploratory data analysis and treated the outliers.
Then we moved on to the modelling: we encoded the data (having string values) for modelling, split
the data into train and test (70:30), and finally applied logistic regression and LDA (linear
discriminant analysis).
