3.1 Ouverture
Introduction
Ouverture
Classification problems occur often. Some examples include:
1. A person arrives at the emergency room with a set of symptoms that
could possibly be attributed to one of three medical conditions. Which of
the three conditions does the individual have?
2. An online banking service must be able to determine whether or not a
transaction being performed on the site is fraudulent, on the basis of the
user’s IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and
without a given disease, a biologist would like to figure out which DNA
mutations are deleterious (disease-causing) and which are not.
Just as in the regression setting, in the classification setting we have a set of
training observations (x1, y1), (x2, y2), . . . , (xn, yn) that can be used to build a
classifier.
Remark
We want our classifier to perform well not only on the training data, but also
on test observations that were not used to train the classifier.
[Figure: two classes of units in the (X1, X2) plane separated by a decision boundary; the units in the area of Class 2 are mostly correctly classified, while one unit in Class 1 is wrongly classified in Class 2.]
The training error rate is

δ = (1/n) ∑_{i=1}^{n} I(yi ≠ ŷi),

where
! ŷi is the predicted class label for the ith observation using the estimated
model,
! I(yi ≠ ŷi) is an indicator function that equals 1 if yi ≠ ŷi and 0 otherwise.
Thus the error rate represents the proportion of incorrect classifications, or
misclassifications.
Obviously, the error rate can also be measured on the test set T:

δT = (1/m) ∑_{j=1}^{m} I(yj ≠ ŷj).
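As a quick illustration, the error rate is just the mean of the 0/1 indicators; a minimal R sketch with hypothetical label vectors:

## Minimal sketch: error rate as the mean of the indicators I(y != yhat).
## The two label vectors below are hypothetical, for illustration only.
y    <- factor(c("No", "No", "Yes", "No", "Yes"))   # true labels
yhat <- factor(c("No", "Yes", "Yes", "No", "No"))   # predicted labels
delta <- mean(y != yhat)   # proportion of misclassified observations
delta                      # 0.4 here (2 mistakes out of 5)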
Introductory concepts
library(ISLR)
attach(Default)
str(Default)
'data.frame': 10000 obs. of 4 variables:
$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 ...
$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 ...
$ balance: num 730 817 1074 529 786 ...
$ income : num 44362 12106 31767 35704 38463 ...
[Figure: scatter plot of income versus balance for the Default data.]
! the plot shows that default is associated with high balance
[Figure: boxplots of balance by default status (No/Yes).]
[Figure: boxplots of income by default status (No/Yes).]
A possible solution?
Consider again the linear regression model
[Figure: scatter plot of observations with a fitted linear regression line.]
In other words, for a binary (two-level) qualitative response, we could adopt a
dummy-variable approach for the response variable Y, i.e. Y takes values in
{0, 1}.
Could we do better?
Introduction
! Consider again the Example about Brand Preference for Orange Juice.
Rather than modeling this response Y directly, try to predict P(Y = 1),
the probability a customer buys Citrus Hill (CH) juice.
! Thus, we can model P(Y = 1) using a function that gives outputs
between 0 and 1.
! How should we model the relationship between p(x) = P(Y = 1|x) and x?
Y Probability
1 P(Y = 1) = p
0 P(Y = 0) = 1 − p
Y Probability
1 P(Y = 1) = p
0 P(Y = 0) = 1 − p
P(Y = 1) = p(x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x}).
P(Y = 1) = p(x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x}).
[Figure: the logistic function P(Y = 1) plotted for x ∈ (−3, 3): an S-shaped curve taking values between 0 and 1.]
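A sketch of this curve can be drawn in R; the parameter values b0 = 0 and b1 = 2 below are illustrative assumptions, not estimates:

## Sketch: plot the logistic function for assumed values b0 = 0, b1 = 2.
b0 <- 0; b1 <- 2
curve(exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x)),
      from = -3, to = 3, ylab = "P(Y=1)", ylim = c(0, 1))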
! For low balances we now predict the probability of default as close to,
but never below, zero.
! Likewise, for high balances we predict a default probability close to, but
never above, one.
! The logistic function will always produce an S-shaped curve of this form,
and so regardless of the value of x, we will obtain a sensible prediction.
! We also see that the logistic model is better able to capture the range of
probabilities than is the linear regression model.
[Figure: boxplots of balance and income by default status.]
! If we fit a linear regression to the Default data, then for very low balances
we predict a negative probability, and for high balances we predict a
probability above 1!
[Figure: linear regression fit of the probability of default versus balance; when balance < 500 the predicted P(default) is negative.]
! Now the probability of default is close to, but never below, zero for low
balances, and close to, but never above, one for high balances.
[Figure: logistic regression fit of the probability of default versus balance; the S-shaped curve stays between 0 and 1.]
Let us write

p(x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x})

and then

1 − p(x) = 1 − e^{β0 + β1 x} / (1 + e^{β0 + β1 x}) = 1 / (1 + e^{β0 + β1 x}).

The quantity

p(x) / (1 − p(x)) = e^{β0 + β1 x}

is called the odds, and can take on any value between 0 and ∞.
Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of Y = 1 compared to the probability of Y = 0, respectively.
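A small numerical illustration of the odds, with arbitrarily chosen probabilities:

p <- c(0.1, 0.5, 0.8, 0.9)   # hypothetical values of p(x)
p / (1 - p)                  # odds: 0.111 1.000 4.000 9.000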
The estimated value of the linear predictor at xi is η̂i = η̂(xi) = β̂0 + β̂1 xi, and
the fitted value of the logistic regression model is

p̂(xi) = e^{η̂i} / (1 + e^{η̂i}).
Interpreting β1 1/2
In a linear regression model, β1 gives the average change in Y associated
with a one-unit increase in X. Consider now the interpretation of the
parameters in the logistic regression model.
! Consider first the fitted value of the linear predictor at x:
η̂(x) = β̂0 + β̂1 x.
Interpreting β1 2/2
! Then η̂(x + 1) − η̂(x) = β̂1: increasing x by one unit changes the log odds
by β̂1 or, equivalently, multiplies the odds by e^{β̂1}.
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
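Output of this form is produced by a fit along the following lines (the model formula is an assumption consistent with the coefficients shown):

glm.fit <- glm(default ~ balance, data = Default, family = binomial)
summary(glm.fit)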
Algorithm note
Making Predictions
ei = yi − ŷi ,   i = 1, 2, . . . , n.
Deviance
The goodness of fit of the logistic regression model can be assessed by the
deviance, a measure based on the log-likelihood, defined by

D = 2 ∑_{i=1}^{n} [ yi ln(yi / π̂i) + (1 − yi) ln((1 − yi) / (1 − π̂i)) ]

where π̂i = p̂(xi) is the fitted probability for the ith observation.
In this case, the null deviance is 2920.6 on 9999 degrees of freedom. Including the
independent variable (balance) decreased the deviance to 1596.5 on 9998
degrees of freedom, a significant reduction in deviance.
The residual deviance has thus been reduced by 1324.1 at the cost of one degree of
freedom.
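As a sketch, the drop in deviance can be compared with a chi-squared distribution on one degree of freedom, glm.fit being the model fitted above:

## P-value for the reduction in deviance (null minus residual) on 1 df.
with(glm.fit, pchisq(null.deviance - deviance,
                     df.null - df.residual, lower.tail = FALSE))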
Since β̂1 is positive, students tend to have higher default probabilities than
non-students:

P(default = Yes|student = Yes) = exp(−3.50413 + 0.40489 × 1) / (1 + exp(−3.50413 + 0.40489 × 1)) = 0.0431

P(default = Yes|student = No) = exp(−3.50413 + 0.40489 × 0) / (1 + exp(−3.50413 + 0.40489 × 0)) = 0.0292
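The two probabilities can be reproduced with plogis(), the logistic CDF, which computes e^η/(1 + e^η):

plogis(-3.50413 + 0.40489 * 1)   # student:     about 0.0431
plogis(-3.50413 + 0.40489 * 0)   # non-student: about 0.0292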
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Plugging these estimates into

p(x) = e^{β0 + β1 x1 + β2 x2 + ··· + βp xp} / (1 + e^{β0 + β1 x1 + β2 x2 + ··· + βp xp})
we can make predictions.
For example, a student with a credit card balance of $1,500 and an income of
$40,000 has an estimated probability of default (with income expressed in
thousands of dollars and rounded coefficients) of

p̂(1500, 40000, Yes) = e^{−10.869 + 0.00574×1500 + 0.003×40 − 0.6468×1} / (1 + e^{−10.869 + 0.00574×1500 + 0.003×40 − 0.6468×1}) = 0.058.
A non-student with the same balance and income has an estimated probability
of default of

p̂(1500, 40000, No) = e^{−10.869 + 0.00574×1500 + 0.003×40 − 0.6468×0} / (1 + e^{−10.869 + 0.00574×1500 + 0.003×40 − 0.6468×0}) = 0.105.
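The same computation in R, again with income in thousands of dollars as in the formula above:

plogis(-10.869 + 0.00574 * 1500 + 0.003 * 40 - 0.6468 * 0)   # about 0.105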
Consider the two models with one predictor (student) and three predictors
(student, balance and income):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
and
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
! In the model with three predictors, the p-values associated with
balance and the dummy variable for student status are very small,
indicating that each of these variables is associated with the probability
of default.
! However, the coefficient for the dummy variable is negative, indicating
that students are less likely to default than non-students.
[Figure: default rate as a function of credit card balance.]
! The solid lines show the average default rates for students (orange) and
non-students (blue), respectively, as a function of credit card balance.
! But the horizontal broken lines near the base of the plot, which show
the default rates for students and non-students averaged over all values
of balance and income, suggest the opposite effect: the overall
student default rate is higher than the non-student default rate.
! Consequently, there is a positive coefficient for student in the
single-variable logistic regression output.
[Figure: boxplots of credit card balance by student status (No/Yes).]
! In other words, students are more likely to have large credit card
balances, which tend to be associated with high default rates.
This simple example illustrates the dangers and subtleties associated with
performing regressions involving only a single predictor when other predictors
may also be relevant.
As in the linear regression setting, the results obtained using one predictor
may be quite different from those obtained using multiple predictors,
especially when there is correlation among the predictors.
In general, this phenomenon is known as confounding.
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When the true classifications are known for the data, we can compute the
confusion matrix C, which is a contingency table containing the information
about actual and predicted classifications.
In particular, the entry chg denotes the relative frequency of units coming from
class h that are classified in class g. The trace of C, namely tr(C), gives the
proportion of correctly classified units, and hence 1 − tr(C) is the
misclassification error.
In the case of a binary classifier, one can make two types of errors:
! it can incorrectly assign an individual who defaults to the no default
category, or
! it can incorrectly assign an individual who does not default to the default
category.
It is often of interest to determine which of these two types of errors are being
made.
true values
predicted values     No    Yes   Total
No                 9628    228    9856
Yes                  39    105     144
Total              9667    333   10000

and the misclassification error is

δ = (228 + 39)/10000 · 100 = 2.67%.
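A sketch of how such a table and the error rate δ can be obtained in R, assuming a fitted logistic model glm.fit as above and a 0.5 threshold:

glm.prob <- predict(glm.fit, type = "response")   # fitted probabilities
glm.pred <- ifelse(glm.prob > 0.5, "Yes", "No")   # predicted classes
addmargins(table(glm.pred, default))              # confusion matrix
mean(glm.pred != default) * 100                   # error rate, in %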
Preliminary concepts
The class-conditional density of X in class k is denoted by fk(x) = p(x|Y = k).
Assumption
In linear discriminant analysis we assume the homoscedastic model, i.e.
X|Y = k ∼ N(μk, σ²) for k = 1, . . . , K, with a variance σ² common to all classes.
Consider two univariate Gaussian class densities with π1 = π2 = 0.5.
[Figure: two normal density curves plotted for x ∈ (−6, 6); with equal priors, the Bayes decision boundary lies where the curves cross.]
In general the boundary between the two classes is the point h which is the
solution of the equation δ1(x) = δ2(x), i.e.

ln π1 + x·μ1/σ² − μ1²/(2σ²) = ln π2 + x·μ2/σ² − μ2²/(2σ²),
leading to

h = [(μ1² − μ2²) + 2σ² ln(π2/π1)] / (2(μ1 − μ2)).

In particular, when π1 = π2 the boundary reduces to the midpoint h = (μ1 + μ2)/2.
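A sketch of this boundary as an R function, using the expression above; with equal priors and means of ±1.25 (illustrative values) it returns the midpoint:

## Decision boundary between two univariate Gaussian classes with
## common variance sigma2 and priors pi1, pi2.
boundary <- function(mu1, mu2, sigma2, pi1, pi2) {
  ((mu1^2 - mu2^2) + 2 * sigma2 * log(pi2 / pi1)) / (2 * (mu1 - mu2))
}
boundary(-1.25, 1.25, 1, 0.5, 0.5)   # 0, the midpoint (mu1 + mu2)/2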
The linear discriminant analysis (LDA) method approximates the Bayes clas-
sifier through the estimates for πk, μk (k = 1, . . . , K), and σ²:

μ̂k = (1/nk) ∑_{i: yi = k} xi

σ̂² = (1/(n − K)) ∑_{k=1}^{K} ∑_{i: yi = k} (xi − μ̂k)²

π̂k = nk / n

where nk denotes the number of training observations in class k.
[Figure: simulated training data for the two classes, x ∈ (−4, 4).]
From the training data (with π1 = π2 = 0.5) we estimate the parameters and
then the underlying mixture distribution.
[Figure: fitted class densities superimposed on the simulated data, x ∈ (−4, 4).]
Given the estimated decision boundary h∗:
! assign x to Ω1 if x ≤ h∗,
! assign x to Ω2 if x > h∗.
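A rough sketch of this simulated exercise, using lda() from MASS; the class means of ±1.25 and the 50 observations per class are assumptions for illustration, not values taken from the slides:

library(MASS)
set.seed(1)
x   <- c(rnorm(50, mean = -1.25), rnorm(50, mean = 1.25))   # simulated data
cls <- factor(rep(c("Omega1", "Omega2"), each = 50))        # true classes
lda.sim <- lda(cls ~ x)              # estimates pi_k, mu_k and sigma^2
table(predict(lda.sim)$class, cls)   # training confusion matrix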
Once we get the decision boundary, denote by Ω̂1, Ω̂2 the estimated classes
and then compute the confusion matrix:

                  True class
Estimated class   Ω1   Ω2   tot
Ω̂1                46    4    50
Ω̂2                 1   49    50

so the training misclassification error is (4 + 1)/100 = 5%.
[Figure: two bivariate Gaussian density functions over (x1, x2).]
Two multivariate Gaussian density functions are shown, with p = 2. Left: the
two predictors are uncorrelated. Right: the two variables have a correlation
of 0.7.
The boundaries between classes are the sets of x where δk(x) = δl(x) for k ≠ l,
i.e.

x′Σ⁻¹μk − (1/2)μk′Σ⁻¹μk + ln πk = x′Σ⁻¹μl − (1/2)μl′Σ⁻¹μl + ln πl,   k ≠ l.
[Figure: three Gaussian classes in the (X1, X2) plane.]
! The three ellipses represent regions that contain 95% of the probability
for each of the three classes,
! the dashed lines are the Bayes decision boundaries.
Remark
Obviously, the population decision boundary (Bayes rule) and the sample
decision boundary (LDA) are different:
[Figure: Bayes and LDA decision boundaries for the three-class example in the (X1, X2) plane.]
The analysis of the confusion matrix shows that LDA gives an overall
misclassification error equal to (23 + 252)/10000 · 100 = 2.75%, but we need to
pay attention to the different situations:
! a total of 104 people were predicted to default and, of these, 81 actually
defaulted and 23 did not;
! in particular, only 23 out of the 9667 individuals who did not default
were incorrectly labeled;
! of the 333 individuals who defaulted, 252 were missed by LDA;
! from the perspective of a credit card company that is trying to identify
high-risk individuals, an error rate of 252/333 = 75.7% among
individuals who default may well be unacceptable.
lda.predict=predict(lda.fit,Default)
pred.class=lda.predict$class
table(pred.class,default)
default
pred.class No Yes
No 9644 252
Yes 23 81
addmargins(table(pred.class,default))
default
pred.class No Yes Sum
No 9644 252 9896
Yes 23 81 104
Sum 9667 333 10000
Misclassifications
There are two kinds of misclassifications:
False positive
some misclassifications will result from incorrectly assigning a customer who
does not default to the default class,
False negative
and others will result from incorrectly assigning a customer who defaults to the
non-default class.
The two errors are not equivalent (it depends on the case study). For
example, a credit card company
! might particularly wish to avoid incorrectly classifying an individual who
will default,
! whereas incorrectly classifying an individual who will not default, though
still to be avoided, is less problematic.
The Bayes classifier works by assigning an observation to the class for which
the posterior probability is greatest. In the two-class case, this amounts to
assigning an observation to the default class if

P(default = Yes|X = x) > 0.5.

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for
the posterior probability of default in order to assign an observation to the
default class.
However, if we are concerned about incorrectly predicting the default status
for individuals who default, then we can consider lowering this threshold.
In other words, perhaps we shouldn't use 0.5 as the threshold for predicting de-
fault?
Company policy
But a credit card company may consider this slight increase in the total error
rate to be a small price to pay for more accurate identification of individuals
who do indeed default.
[Figure: error rates for the Default data as a function of the posterior-probability threshold.]
lda.predict=predict(lda.fit,Default)
post.class=lda.predict$posterior[,2]
table((post.class>=0.2),default)
default
No Yes
FALSE 9432 138
TRUE 235 195
addmargins(table((post.class>=0.2),default))
default
No Yes Sum
FALSE 9432 138 9570
TRUE 235 195 430
Sum 9667 333 10000
Sensitivity
Sensitivity is the percentage of true defaulters that are identified.
In this case, the sensitivity is 81/333 · 100 = 24.3%.
Specificity
Specificity is the percentage of non-defaulters that are correctly
identified.
In this case, the specificity is (1 − 23/9667) · 100 = 99.8%.
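From the confusion matrix above, sensitivity and specificity are simple ratios; a minimal sketch:

TP <- 81; FN <- 252; TN <- 9644; FP <- 23
TP / (TP + FN)   # sensitivity: 81/333    ~ 0.243
TN / (TN + FP)   # specificity: 9644/9667 ~ 0.998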
                          Predicted class
                      − or Null         + or Non-null     Total
True   − or Null      True Neg. (TN)    False Pos. (FP)   TN+FP
class  + or Non-null  False Neg. (FN)   True Pos. (TP)    FN+TP
       Total          TN+FN             FP+TP
Thus:

sensitivity = TPR = TP / (TP + FN),

specificity = TN / (FP + TN),   FPR = 1 − specificity = FP / (FP + TN).
Thinking in terms of hypothesis testing, if we test
H0: negative (e.g.: no default)
against
H1: positive (e.g.: default)
we have that
! false positives are associated with the error of the first kind,
! while the false negatives are associated with the error of the second kind.
Formally, for a fixed cutoff c, the true positive rate depends on c, say TPR(c),
and it is the probability of a positive test among the diseased population; the
false positive rate, say FPR(c), is the probability of a positive test among the
healthy population.
! Since the cutoff c is not usually fixed in advance, we can plot
TPR(c) against FPR(c) for all possible values of c.
! This is exactly what the ROC curve is: FPR(c) on the x axis and TPR(c)
on the y axis.
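A sketch of an ROC curve and its AUC using the pROC package (one option among several); post.class are the LDA posterior probabilities computed earlier:

library(pROC)
roc.obj <- roc(response = default, predictor = post.class)
plot(roc.obj)    # ROC curve
auc(roc.obj)     # area under the curve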
[Figure: ROC curve, true positive rate versus false positive rate, with threshold values marked along the curve.]
We consider the ROC curve on the test set and compare the AUC for
different models:
! Logistic Regression → AUC=0.9530,
! LDA → AUC=0.9538.
The two models are in practice equivalent, here.
LDA assumes that the observations within each class are drawn from a
multivariate Gaussian distribution with a class-specific mean vector µk
(k = 1, . . . , K) and a covariance matrix Σ that is common to all K classes.
Quadratic discriminant analysis (QDA) provides an alternative approach. Like
LDA, the QDA classifier results from assuming that the observations from
each class are drawn from a Gaussian distribution, and plugging estimates
for the parameters into Bayes’ theorem in order to perform prediction.
However, unlike LDA, QDA assumes that each class has its own covariance
matrix. That is, it assumes that an observation from the kth class is of the
form X ∼ Np(µk , Σk ), where Σk is a covariance matrix for the kth class.
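QDA is fitted in the same way as LDA; a sketch on the Default data, where the choice of predictors is an assumption for illustration:

library(MASS)
qda.fit  <- qda(default ~ balance + student, data = Default)
qda.pred <- predict(qda.fit, Default)$class
mean(qda.pred != Default$default)   # training misclassification rate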
Why does it matter whether or not we assume that the K classes share a
common covariance matrix? In other words, when is LDA preferable to QDA,
and vice versa?
Since QDA allows for different variances among classes, the resulting
boundaries become quadratic, thus:
! QDA will work best when the variances are very different between
classes and we have enough observations to accurately estimate them,
! LDA will work best when the variances are similar among classes or we
don't have enough data to accurately estimate them.
Consequently, LDA is a much less flexible classifier than QDA, and so has
substantially lower variance.
Roughly speaking, LDA tends to be a better bet than QDA if there are relatively
few training observations and so reducing variance is crucial.
In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern, or if the assumption of a
common covariance matrix for the K classes is clearly untenable.
[Figure: Bayes and LDA decision boundaries in the (X1, X2) plane.]
Two Gaussian classes with common correlation of 0.7 between X1 and X2.
As a result, the Bayes decision boundary is linear and is accurately
approximated by the LDA decision boundary.
The QDA decision boundary is inferior, because it suffers from higher
variance without a corresponding decrease in bias.
[Figure: Bayes and QDA decision boundaries in the (X1, X2) plane.]
A situation in which the orange class has a correlation of 0.7 between the
variables and the blue class has a correlation of −0.7.
Now the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this boundary than does LDA.
Both the logistic regression log odds and the LDA log posterior odds are linear
functions of x. Hence, both logistic regression and LDA produce linear decision
boundaries.