
Data Analysis and Statistical Learning: 03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

3.1 Ouverture


Introduction

! Supervised learning problems can be divided into regression and classification problems.
! Regression covers situations where the response Y is continuous/numerical, e.g.
  ! Predicting the value of the Dow in 6 months.
  ! Predicting the value of a given house based on various inputs.
! Classification covers situations where the response Y is categorical, e.g.
  ! Will the Dow be up (U) or down (D) in 6 months?
  ! Is this email a SPAM or not?


Ouverture
Classification problems occur often. Some examples include:
1. A person arrives at the emergency room with a set of symptoms that
could possibly be attributed to one of three medical conditions. Which of
the three conditions does the individual have?
2. An online banking service must be able to determine whether or not a
transaction being performed on the site is fraudulent, on the basis of the
user’s IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and
without a given disease, a biologist would like to figure out which DNA
mutations are deleterious (disease-causing) and which are not.
Just as in the regression setting, in the classification setting we have a set of training observations (x1, y1), (x2, y2), . . . , (xn, yn) that can be used to build a classifier.

Remark
We want our classifier to perform well not only on the training data, but also
on test observations that were not used to train the classifier.


Basic ideas in classification: a summary


[Figure: scatter plot of units from Class 1 and Class 2 in the (X1, X2) plane, separated by a decision boundary. Most units lie on the correct side of the boundary (units in Class 1 and units in Class 2 correctly classified); a few units in Class 1 fall in the area of Class 2 and are wrongly classified in Class 2, and vice versa.]


The classification setting


In classification, the response Y takes value on a finite set C = {1, 2, . . . , K}
of categories.
For a classification problem we can use the error rate δL
$$\delta_L = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i)$$
where
! ŷi is the predicted class label for the ith observation using the estimated model,
! I(yi ≠ ŷi) is an indicator function that equals 1 if yi ≠ ŷi and 0 otherwise.
Thus the error rate represents the proportion of incorrect classifications, or misclassifications.
Obviously, the error rate can also be measured on the test set T:
$$\delta_T = \frac{1}{m}\sum_{j=1}^{m} I(y_j \neq \hat y_j).$$

A good classifier is one for which the test error is smallest.
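
As a minimal sketch in R (with hypothetical vectors y of true labels and y.hat of predicted labels), the error rate is just the proportion of mismatches:

# hypothetical true and predicted class labels
y <- c(1, 0, 1, 1, 0, 1)
y.hat <- c(1, 0, 0, 1, 0, 1)
mean(y != y.hat)   # error rate: proportion of misclassified observations (1/6 here)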



Introductory concepts

Let Ω = {ω1, ω2, . . . , ωn} be a set of objects.


! Clustering: refers to grouping objects into classes according to their
”similarity”, see e.g. cluster analysis;
! Discrimination: refers to the process of deriving classification rules
from samples of already classified objects;
! Classification: refers to applying the rules to new objects of unknown
class;


Example A: Credit Card Default data 1/23

Consider the Default data:

library(ISLR)
attach(Default)
str(Default)
’data.frame’: 10000 obs. of 4 variables:
$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 ...
$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 ...
$ balance: num 730 817 1074 529 786 ...
$ income : num 44362 12106 31767 35704 38463 ...

We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance.


Example A: Credit Card Default data 2/23


plot(balance,income, pch=1, col="cadetblue3", cex=0.7)

[Figure: scatter plot of income (0-60000) against balance (0-2500).]

! this plot does not give much information


Example A: Credit Card Default data 2/23


plot(balance,income,col=c("cadetblue3","chocolate")[default],
pch=c(1,3)[default], cex=0.7)
legend("topright", legend=c("No Default", "Default"), pch=c(1,3),
col=c("cadetblue3","chocolate"), cex=0.8)
[Figure: scatter plot of income against balance, with defaulters ("Default") and non-defaulters ("No Default") shown with different symbols and colours.]
! the plot shows that default is associated with high balance

Example A: Credit Card Default data 2/23


boxplot(balance ~ default, col=c("cadetblue1","chocolate"),
xlab="Default", ylab="Balance")

[Figure: boxplots of Balance for default = No and default = Yes.]

! the boxplot confirms that default is associated with high balance


Example A: Credit Card Default data 2/23


boxplot(income ~ default, col=c("cadetblue1","chocolate"),
xlab="Default", ylab="Income")

[Figure: boxplots of Income for default = No and default = Yes.]

! At first glance, default is not associated with income


Example B: Brand Preference for Orange Juice 1/3

! We would like to predict what customers prefer to buy: Citrus Hill or Minute Maid orange juice?
! The Y (Purchase) variable is categorical: 0 (Minute Maid) or 1 (Citrus Hill).
! The X (LoyalCH) variable is a numerical value (between 0 and 1) which specifies how loyal the customer is to Citrus Hill (CH) orange juice.
! Can we use Linear Regression when Y is categorical?



Example B: Brand Preference for Orange Juice 2/3

! When Y only takes on values of 0 and 1, why is standard linear regression inappropriate?
! How do we interpret values of Y greater than 1?
! How do we interpret values of Y between 0 and 1?


Example B: Brand Preference for Orange Juice 3/3

! The regression line β0 + β1 x can take on any value between negative and positive infinity.
! In the orange juice classification problem, Y can only take on two possible values: 0 or 1.
! Therefore the regression line almost always predicts the wrong value for Y in classification problems.


A possible solution?
Consider again the linear regression model
[Figure: scatter plot of the binary observations (o) against x with the fitted linear regression line.]

and assume
$$\hat Y = \begin{cases} 1 & \text{if } \hat\beta_0 + \hat\beta_1 x > 0.5 \\ 0 & \text{if } \hat\beta_0 + \hat\beta_1 x \le 0.5. \end{cases}$$

In other words, for a binary (two-level) qualitative response, we can use a dummy variable approach for the response, i.e. Y takes values in {0, 1}.
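
A rough sketch of this rule in R (reusing the Default data attached earlier; the dummy coding and the 0.5 threshold are the assumptions stated above):

y01 <- as.numeric(default == "Yes")        # dummy coding of the binary response
lm.fit <- lm(y01 ~ balance)                # ordinary linear regression on the dummy
y.hat <- as.numeric(fitted(lm.fit) > 0.5)  # classify as 1 when the fitted value exceeds 0.5
mean(y01 != y.hat)                         # training error rate of this crude classifier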

Could we do better?

3.2 Logistic Regression


Introduction

! Consider again the Example about Brand Preference for Orange Juice.
Rather than modeling this response Y directly, try to predict P(Y = 1),
the probability a customer buys Citrus Hill (CH) juice.
! Thus, we can model P(Y = 1) using a function that gives outputs
between 0 and 1.
! How should we model the relationship between p(x) = P(Y = 1|x) and
x?



The Logistic Regression Model 1/3

We will assume that the response variable is a random variable having a


Bernoulli distribution with probability distribution as follows:

Y Probability
1 P(Y = 1) = p
0 P(Y = 0) = 1 − p

Then, we can consider the logistic response function

$$P(Y = 1) = p(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}.$$


The Logistic Regression Model 1/3


Then, we can consider the logistic response function

$$P(Y = 1) = p(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}.$$

[Figure: plot of the logistic response function P(Y = 1) against x; an S-shaped curve between 0 and 1.]
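
A quick way to draw this S-shaped curve in R, with illustrative values β0 = 0 and β1 = 1 (other values only shift and stretch the curve), is:

b0 <- 0; b1 <- 1                                        # illustrative coefficients
curve(exp(b0 + b1*x) / (1 + exp(b0 + b1*x)),
      from = -3, to = 3, xlab = "x", ylab = "P(Y=1)")   # logistic response function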


The Logistic Regression Model 2/3

! For low balances we now predict the probability of default as close to,
but never below, zero.
! Likewise, for high balances we predict a default probability close to, but
never above, one.
! The logistic function will always produce an S-shaped curve of this form,
and so regardless of the value of x, we will obtain a sensible prediction.
! We also see that the logistic model is better able to capture the range of
probabilities than is the linear regression model.


The Logistic Regression Model 3/3

! Logistic regression is very similar to linear regression.
! We come up with β̂0 and β̂1 to estimate β0 and β1.
! We have similar problems and questions as in linear regression, e.g.
  - Is β1 equal to 0?
  - How sure are we about our guesses for β0 and β1?

[Figure: fitted logistic curve of Pr(CH) against LoyalCH; if LoyalCH is about 0.6 then Pr(CH) ≈ 0.7.]


Example A: Credit Card Default data 3/23

! We would like to be able to predict customers that are likely to default.


! Possible X variables are:
! Annual Income (income),
! Monthly credit card balance (balance).
! The Y variable (Default) is categorical: Yes or No.
! How do we check the relationship between Y and X1 , X2 ?


Example A: Credit Card Default data 4/23

[Figure: scatter plot of Income against Balance, and boxplots of Balance and of Income by Default status (No/Yes).]


Example A: Credit Card Default data 5/23

[Figure: probability of default against balance with the fitted linear regression line; rug marks show the observed defaults (top) and non-defaults (bottom).]

! If we fit a linear regression to the Default data, then for very low balances we predict a negative probability, and for high balances we predict a probability above 1! In particular, when balance < 500 the predicted P(default) is negative.


Example A: Credit Card Default data 6/23

[Figure: probability of default against balance with the fitted logistic regression curve; rug marks show the observed defaults (top) and non-defaults (bottom).]

! Now the probability of default is close to, but not less than, zero for low balances, and close to, but not above, 1 for high balances.


The Odds 1/2

Let us write
$$p(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$
and then
$$1 - p(x) = 1 - \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} = \frac{1}{1+e^{\beta_0+\beta_1 x}}.$$

The quantity
$$\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1 x}$$
is called the odds, and can take on any value between 0 and ∞.
Values of the odds close to 0 and ∞ indicate very low and very high probabilities of Y = 1 compared to the probability of Y = 0, respectively.


The Odds 2/2

Consider the Default data.


For example, on average 1 in 5 people with an odds of 1/4 will default, since p(x) = 0.2 implies an odds of
$$\frac{0.2}{1-0.2} = \frac{0.2}{0.8} = 0.25.$$
Likewise, on average nine out of every ten people with an odds of 9 will default, since p(x) = 0.9 implies an odds of
$$\frac{0.9}{1-0.9} = \frac{0.9}{0.1} = 9.$$
Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.


The log-odds or logit


By taking the logarithm of both sides of
$$\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1 x}$$
we arrive at
$$\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x$$
and this quantity is called the log-odds or logit.
In general, let
$$\eta(x) = \beta_0 + \beta_1 x$$
be the linear predictor, where η(x) is defined by the transformation
$$\eta(x) = \log\frac{p(x)}{1-p(x)}$$
and this transformation is the logit transformation of the probability p(x). In particular, we see that the logistic regression model
$$p(x) = \frac{e^{\eta(x)}}{1+e^{\eta(x)}} = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$
has a logit η(x) that is linear in x.

Estimating the Regression Coefficients 1/2

The coefficients β0 and β1 must be estimated based on the available training data.
In fitting logistic regression models, we use the more general method of maximum likelihood.
Given a training set (x1, y1), (x2, y2), . . . , (xn, yn), with xi ∈ R and yi ∈ {0, 1}, we have to maximize the function
$$L(\beta_0,\beta_1) = \prod_{i:\,y_i=1} p(x_i) \prod_{j:\,y_j=0} \bigl(1-p(x_j)\bigr) = \prod_{i=1}^{n} p(x_i)^{y_i}\,[1-p(x_i)]^{1-y_i}.$$


Estimating the Regression Coefficients 2/2


In practice, algorithms maximize the log-likelihood function
$$\ell(\beta_0,\beta_1) = \log L(\beta_0,\beta_1) = \log \prod_{i=1}^{n} p(x_i)^{y_i}\,[1-p(x_i)]^{1-y_i}$$
$$= \sum_{i=1}^{n} \log\left( p(x_i)^{y_i}\,[1-p(x_i)]^{1-y_i} \right)$$
$$= \sum_{i=1}^{n} y_i \log p(x_i) + (1-y_i)\log[1-p(x_i)]$$
$$= \sum_{i=1}^{n} y_i \log\left(\frac{p(x_i)}{1-p(x_i)}\right) + \sum_{i=1}^{n} \log[1-p(x_i)].$$

The estimated value of the linear predictor at xi is η̂i = η̂(xi) = β̂0 + β̂1 xi and the fitted value of the logistic regression model is
$$\hat y_i = \hat p(x_i) = \frac{\exp(\hat\eta_i)}{1+\exp(\hat\eta_i)} = \frac{\exp(\hat\beta_0+\hat\beta_1 x_i)}{1+\exp(\hat\beta_0+\hat\beta_1 x_i)} = \frac{1}{1+\exp[-(\hat\beta_0+\hat\beta_1 x_i)]}.$$
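
As a small numerical sketch (hypothetical data and trial coefficient values, not the estimates computed later), the log-likelihood above can be evaluated directly:

x <- c(0.2, 1.5, 2.3, 0.7); y <- c(0, 1, 1, 0)   # hypothetical training data
b0 <- -1; b1 <- 1.2                              # trial coefficient values
p <- exp(b0 + b1*x) / (1 + exp(b0 + b1*x))       # p(x_i) under (b0, b1)
sum(y*log(p) + (1 - y)*log(1 - p))               # log-likelihood at (b0, b1)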


Interpreting β1 1/2
In a linear regression model, β1 gives the average change in Y associated with a one-unit increase in X. Consider now the interpretation of the parameters in the logistic regression model.
! Consider first the fitted value of the linear predictor value at x:
η̂(x) = β#0 + β#1 x.

! The fitted value at x + 1 is


η̂(x + 1) = β#0 + β#1 (x + 1)
and the difference in the two predicted values is
η̂(x + 1) − η̂(x) = β#1 .

! Remember that η̂(x) is the log-odds when the regressor variable is


equal to x
η̂(x) = β#0 + β#1 x = log(oddsx )
and η̂(x + 1) is the log-odds when the regressor variable is equal to x + 1
η̂(x + 1) = β#0 + β#1 (x + 1) = log(oddsx+ 1 )

Interpreting β1 2/2

! Therefore, the difference in the two fitted values is
$$\hat\eta(x+1) - \hat\eta(x) = \log(\text{odds}_{x+1}) - \log(\text{odds}_{x}) = \log\left(\frac{\text{odds}_{x+1}}{\text{odds}_{x}}\right) = \hat\beta_1.$$
And then we get the odds ratio:
$$\frac{\text{odds}_{x+1}}{\text{odds}_{x}} = e^{\hat\beta_1}.$$

Interpretation of the odds ratio

The estimated variation in the odds ratio associated with a one-unit change in the value of the predictor variable is $e^{\hat\beta_1}$.
In general, the estimated variation in the odds ratio associated with a change of d units in the predictor variable is $e^{d\hat\beta_1}$.


Example A: Credit Card Default data 7/23


glm.fit=glm(default~balance, family=binomial)
summary(glm.fit)
Call:
glm(formula = default ~ balance, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8


Example A: Credit Card Default data 8/23


The table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
shows the coefficient estimates and related information that result from fitting a logistic
regression model on the Default data in order to predict the probability of default=Yes
using balance.
We see that β!1 = 0.0055; this indicates that an increase in balance is associated with an
increase in the probability of default. To be precise, a one-unit increase in balance is
associated with an increase in the log odds of default by 0.0055 units.
Many aspects of the logistic regression output shown are similar to the linear regression
output. For example, we can measure the accuracy of the coefficient estimates by computing
their standard errors using the z value and the p-value.
Since the p-value associated with balance is very small, we can reject H0 . In other words, we
conclude that there is indeed an association between balance and probability of default.
Moreover, since β̂1 is positive, we can conclude that if the balance increases, then the probability of default will increase as well.
The estimated intercept is typically not of interest; its main purpose is to adjust the average
fitted probabilities to the proportion of ones in the data.
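
In R, the estimated odds ratios can be obtained from the fitted model above (a sketch; glm.fit is the object created by the glm call):

exp(coef(glm.fit)["balance"])        # odds ratio for a one-unit increase in balance
exp(100 * coef(glm.fit)["balance"])  # odds ratio for an increase of d = 100 units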


Algorithm note

The log-likelihood function has only one maximum value.
The log-likelihood function can be maximized through some numerical optimization algorithm, like Fisher scoring.
Fisher's scoring algorithm is a variant of Newton's method for solving maximum likelihood problems numerically.
For the Credit Card data we see that Fisher's scoring algorithm needed 8 iterations to perform the fit.
This tells us that the model did indeed converge, and had no trouble doing so.



Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute the probability of default for any given credit card balance. Suppose an individual has an average balance of $1,000. The probability of default is given by
$$\hat p(1{,}000) = \frac{\exp(\hat\beta_0 + \hat\beta_1 x)}{1+\exp(\hat\beta_0 + \hat\beta_1 x)} = \frac{\exp(-10.6513 + 0.0055 \times 1{,}000)}{1+\exp(-10.6513 + 0.0055 \times 1{,}000)} = 0.00576.$$
The predicted probability ŷ of default for an individual with a balance of x = $1,000 is less than 1%.
In contrast, for a balance of x = $2,000, the probability is much higher and equals ŷ = 0.586 (58.6%).
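
The same predictions can be obtained with predict() on the fitted model (a sketch, assuming glm.fit is the logistic regression of default on balance fitted above):

predict(glm.fit, newdata = data.frame(balance = c(1000, 2000)),
        type = "response")   # predicted probabilities of default: about 0.006 and 0.586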


Diagnostic checking in Logistic Regression

Residuals can be used for diagnostic checking and investigating model adequacy in logistic regression. The ordinary residuals are defined as usual:
$$e_i = y_i - \hat y_i, \qquad i = 1, 2, \ldots, n.$$
In linear regression, the ordinary residuals are components of the residual sum of squares; that is, if the residuals are squared and summed, the residual sum of squares results.
In logistic regression, the quantity analogous to the residual sum of squares is the deviance.
This leads to the deviance residual, defined as
$$d_i = \pm\left\{ 2\left[ y_i \ln\left(\frac{y_i}{\hat y_i}\right) + (1-y_i)\ln\left(\frac{1-y_i}{1-\hat y_i}\right) \right] \right\}^{1/2}.$$
In particular:
! when yi = 0, then $d_i = -\sqrt{-2\ln(1-\hat y_i)}$;
! when yi = 1, then $d_i = \sqrt{-2\ln \hat y_i}$.
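
In R, both types of residuals can be extracted from a fitted glm object (a sketch, using the glm.fit object of the running example):

e <- residuals(glm.fit, type = "response")   # ordinary residuals y_i - y.hat_i
d <- residuals(glm.fit, type = "deviance")   # deviance residuals d_i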



Deviance
The goodness of fit of the logistic regression model can be assessed by the deviance, which is a measure based on the log-likelihood and is defined by
$$D = 2\sum_{i=1}^{n}\left[ y_i \ln\left(\frac{y_i}{\hat y_i}\right) + (1-y_i)\ln\left(\frac{1-y_i}{1-\hat y_i}\right) \right]$$
where y ln(y/ŷ) = 0 if y = 0 and (1 − y) ln[(1 − y)/(1 − ŷ)] = 0 if y = 1.

The deviance is a measure of goodness of fit of a generalized linear model. Or rather, it's a measure of badness of fit, where higher numbers indicate worse fit.
R reports two forms of deviance:
! the null deviance, which shows how well the response variable is predicted by a model that includes only the intercept (grand mean);
! the residual deviance, which shows how well the response is predicted by the model when the predictors are included.
If the Null Deviance is really small, it means that the Null Model explains the data pretty well. Likewise with the Residual Deviance.
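
In R, the two deviances can be read directly off the fitted object (a sketch based on the glm.fit object above):

glm.fit$null.deviance   # null deviance (intercept-only model)
glm.fit$deviance        # residual deviance (model including the predictors)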



Example A: Credit Card Default data 11/23

Null deviance: 2920.6 on 9999 degrees of freedom


Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8

In this case, we have a value of 2920.6 on 9999 degrees of freedom. Including the independent variable (balance) decreased the deviance to 1596.5 points on 9998 degrees of freedom, a significant reduction in deviance.
The Residual Deviance has been reduced by 1324.1 with a loss of one degree of freedom.


Qualitative Predictors in Logistic Regression


We can predict whether an individual defaults by checking if she is a student or not. Thus we can use a dummy variable student coded as (Student = 1, Non-student = 0).
glm.fit=glm(default~student, family=binomial)
summary(glm.fit)

Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***

Since β̂1 is positive, this indicates that students tend to have higher default probabilities than non-students:
$$P(\text{default = Yes} \mid \text{student = Yes}) = \frac{\exp(-3.50413 + 0.40489 \times 1)}{1+\exp(-3.50413 + 0.40489 \times 1)} = 0.0431$$
$$P(\text{default = Yes} \mid \text{student = No}) = \frac{\exp(-3.50413 + 0.40489 \times 0)}{1+\exp(-3.50413 + 0.40489 \times 0)} = 0.0292$$


Multiple Logistic Regression

Assume we are provided with p predictors X1, X2, . . . , Xp. We can fit a multiple logistic regression just like regular regression:
$$p(x) = \frac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}$$
where x = (x1, x2, . . . , xp)′.
Also in this case we use the maximum likelihood method to estimate β0, β1, . . . , βp.


Example A: Credit Card Default data 12/23


glm.fit=glm(default~balance+income+student, family=binomial)
summary(glm.fit)
Call:
glm(formula = default ~ balance+income+student, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8

Example A: Credit Card Default data 13/23


Prediction
By substituting estimates for the regression coefficients from
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***

into
$$p(x) = \frac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}$$
we can make predictions.
For example, a student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of
$$\hat p(1500, 40000, \text{Yes}) = \frac{e^{-10.869 + 0.00574\times 1500 + 0.003\times 40 - 0.6468\times 1}}{1+e^{-10.869 + 0.00574\times 1500 + 0.003\times 40 - 0.6468\times 1}} = 0.058$$


Example A: Credit Card Default data 13/23


Prediction
By substituting estimates for the regression coefficients from
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***

into
$$p(x) = \frac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_p x_p}}$$
we can make predictions.
A non-student with the same balance and income has an estimated probability of default of
$$\hat p(1500, 40000, \text{No}) = \frac{e^{-10.869 + 0.00574\times 1500 + 0.003\times 40 - 0.6468\times 0}}{1+e^{-10.869 + 0.00574\times 1500 + 0.003\times 40 - 0.6468\times 0}} = 0.105$$


Example A: Credit Card Default data 14/23


An apparent contradiction

Consider the two models with one predictor (student) and three predictors
(student, balance and income):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***

and

Estimate Std. Error t value Pr(>|t|)


(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***


Example A: Credit Card Default data 14/23


An apparent contradiction
Consider the two models with one predictor (student) and three predictors
(student, balance and income):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***
Negative
and
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
Positive


Example A: Credit Card Default data 14/23


An apparent contradiction
Consider the two models with one predictor (student) and three predictors
(student, balance and income):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***

and
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
! In the model with three predictors, the p-values associated with
balance and the dummy variable for student status are very small,
indicating that each of these variables is associated with the probability
of default.
! However, the coefficient for the dummy variable is negative, indicating
that students are less likely to default than non-students.


Example A: Credit Card Default data 14/23


An apparent contradiction
Consider the two models with one predictor (student) and three predictors
(student, balance and income):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 ***

and

Estimate Std. Error t value Pr(>|t|)


(Intercept) -3.50413 0.07071 -49.55 <2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
! In contrast, the coefficient for the dummy variable is positive for the
model with one predictor.
! How is it possible for student status to be associated with an increase in
probability of default in one case and a decrease in probability of default
in the other case?


Example A: Credit Card Default data 15/23


An apparent contradiction - Explanation

[Figure: average default rate as a function of credit card balance, for students (orange) and non-students (blue); horizontal broken lines show the overall default rate of each group.]

! The solid lines show the average default rates for students (orange) and
non-students (blue), respectively as a function of credit card balance.


Example A: Credit Card Default data 15/23


An apparent contradiction - Explanation

[Figure: default rate against credit card balance, as above.]

! The negative coefficient for student in the multiple logistic regression


indicates that for a fixed value of balance and income, a student is
less likely to default than a non-student.


Example A: Credit Card Default data 15/23


An apparent contradiction - Explanation

[Figure: default rate against credit card balance, as above.]

! We note that the student default rate is at or below that of the


non-student default rate for every value of balance.


Example A: Credit Card Default data 15/23


An apparent contradiction - Explanation

[Figure: default rate against credit card balance, as above.]

! We note that the student default rate is at or below that of the


non-student default rate for every value of balance.
! But the horizontal broken lines near the base of the plot, which show
the default rates for students and non-students averaged over all values
of balance and income, suggest the opposite effect: the overall
student default rate is higher than the non-student default rate.


Example A: Credit Card Default data 15/23


An apparent contradiction - Explanation

[Figure: default rate against credit card balance, as above.]

! But the horizontal broken lines near the base of the plot, which show
the default rates for students and non-students averaged over all values
of balance and income, suggest the opposite effect: the overall
student default rate is higher than the non-student default rate.
! Consequently, there is a positive coefficient for student in the single
variable logistic regression output


Example A: Credit Card Default data 16/23


An apparent contradiction - Explanation

[Figure: boxplots of credit card balance by student status (No/Yes).]


Example A: Credit Card Default data 16/23


An apparent contradiction - Explanation

[Figure: credit card balance by student status, as above.]

! The variables student and balance are correlated.


! Students tend to hold higher levels of debt, which is in turn associated
with higher probability of default.


Example A: Credit Card Default data 16/23


An apparent contradiction - Explanation

[Figure: credit card balance by student status, as above.]

! The variables student and balance are correlated.


! Students tend to hold higher levels of debt, which is in turn associated
with higher probability of default.

! In other words, students are more likely to have large credit card
balances, which tend to be associated with high default rates.


Example A: Credit Card Default data 16/23


An apparent contradiction - Explanation

[Figure: credit card balance by student status, as above.]

! Thus, even though an individual student with a given credit card


balance will tend to have a lower probability of default than a
non-student with the same credit card balance, the fact that students on
the whole tend to have higher credit card balances means that overall,
students tend to default at a higher rate than non-students.


Example A: Credit Card Default data 16/23


An apparent contradiction - Explanation

[Figure: credit card balance by student status, as above.]

! This is an important distinction for a credit card company that is trying to


determine to whom they should offer credit. A student is riskier than a
non-student if no information about the student’s credit card balance is
available.
! However, that student is less risky than a non-student with the same
credit card balance!


Example A: Credit Card Default data 17/23


An apparent contradiction - Conclusion

This simple example illustrates the dangers and subtleties associated with
performing regressions involving only a single predictor when other predictors
may also be relevant.
As in the linear regression setting, the results obtained using one predictor
may be quite different from those obtained using multiple predictors,
especially when there is correlation among the predictors.
In general, this phenomenon is known as confounding.


Model selection criteria 1/2


Model selection criteria are methods for assessing the quality of your model through comparison of related models.
In general, model selection criteria are an extension of the maximum likelihood and have the following form:
$$M = -2L(\hat\psi) + C_K$$
where C_K is a penalty term which increases with the number K of parameters of the model.
Much like adjusted R-squared, its intent is to prevent you from including irrelevant predictors.
In particular,
! if the model is too simple (that is, the fit to data is poor), then M has a large value due to the small likelihood;
! on the contrary, for overparameterized models M is large due to the complexity term C_K.


Model selection criteria 2/2

Among model selection criteria, one often considers the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), defined as
$$\mathrm{AIC} = -2L(\hat\psi) + 2K,$$
$$\mathrm{BIC} = -2L(\hat\psi) + K\log n.$$

We remark that in the literature we can find the BIC also defined as
$$\mathrm{BIC} = 2L(\hat\psi) - K\log n.$$
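
In R these criteria are available directly for a fitted glm object (a sketch using the glm.fit object of the running example):

AIC(glm.fit)   # Akaike Information Criterion
BIC(glm.fit)   # Bayesian Information Criterion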



Confusion matrix 1/2

When the true classifications are known for the data, we can compute the confusion matrix C, which is a contingency table containing the information about actual and predicted classifications.
In particular, the entry c_hg denotes the relative frequency of units coming from class h and classified in class g. The trace of C, namely tr(C), gives the proportion of correctly classified units, and then 1 − tr(C) is the misclassification error.


Confusion matrix 2/2

In the case of a binary classifier, one can make two types of errors:
! it can incorrectly assign an individual who defaults to the no default
category, or
! it can incorrectly assign an individual who does not default to the default
category.
It is often of interest to determine which of these two types of errors are being
made.


Example A: Credit Card Default data 17/23


Confusion matrix - all data

Fit the model on the entire dataset (N = 10000).

true values
predicted values No Yes Total
No 9628 228 9856
Yes 39 105 144
Total 9667 333 10000

and

δ = (228 + 39)/10000 · 100 = 2.67%.

115 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 18/23


Confusion matrix - training and test data A
Fit the model on the training data (nL, around 70% of the data) and evaluate
the fitted model on the test data.
Results of the fitted model on the training data:

true values
predicted values    No    Yes   Total
No                 6739   159   6898
Yes                  28    75    103
Total              6767   234   7001

δ = (159 + 28)/7001 · 100 = 2.67%.

Results of the fitted model on the test data:

true values
predicted values    No    Yes   Total
No                 2890    72   2962
Yes                  10    27     37
Total              2900    99   2999

δ = (72 + 10)/2999 · 100 = 2.73%.
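The exact split used for these tables is not reported; an illustrative sketch of a 70/30 split with the Default data from the ISLR package (random indices, so the counts will not reproduce the tables above exactly):

set.seed(1)
n=nrow(Default)
train=sample(1:n, size=round(0.7*n))
glm.tr=glm(default~balance, family=binomial, data=Default, subset=train)
probs.test=predict(glm.tr, newdata=Default[-train,], type="response")
pred.test=ifelse(probs.test > 0.5, "Yes", "No")
C.test=table(pred.test, Default$default[-train])
1 - sum(diag(C.test))/sum(C.test)   # test misclassification error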

116 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 19/23


Confusion matrix - training and test data B
Fit the model on the training data (nL, around 70% of the data) and evaluate
the fitted model on the test data.
Results of the fitted model on the training data:

true values
predicted values    No    Yes   Total
No                 6741   164   6905
Yes                  26    70     96
Total              6767   234   7001

δ = (164 + 26)/7001 · 100 = 2.71%.

Results of the fitted model on the test data:

true values
predicted values    No    Yes   Total
No                 2888    64   2952
Yes                  12    35     47
Total              2900    99   2999

δ = (64 + 12)/2999 · 100 = 2.53%.

117 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Labo activity with R

Labo activity 3.R

118 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

3.3 Discriminant Analysis

119 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary concepts

Assume we are provided with a set of n objects Ω = {ω1 , ω2 , . . . , ωn }.


A vector representation x = x(ω) of numerical and/or categorical features,
with values in some space X , is associated with ω ∈ Ω. x is the pattern of ω.
Assume that Ω is partitioned into disjoint subsets Ω1 , . . . , ΩK , i.e.
Ω = Ω1 ∪ · · · ∪ ΩK , where K ≥ 2 is the number of classes.
An object ω is classified as belonging to Ωk (k = 1, . . . , K) based on its
pattern x by means of a decision rule.
The decision rule can classify the object ω in either the true class (right
classification) or in a wrong class (wrong classification).
In other words, assume that there is a relationship between x and a qualitative
response Y taking values on a finite set C = {1, 2, . . . , K}, where the
elements of C are unordered labels.

124 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia


Classification through prior probability

Assume we know the probability

πk = P(ω ∈ Ωk ) = P(Ωk ) = P(Y = k)

that an object ω ∈ Ω comes from the class Ωk (k = 1, . . . , K). Since this is an


overall probability, independent of x, and known before we have taken any
observation, it is called prior probability.
If we had no further information, then the best decision rule would be
classify ω as coming from the class Ωk if P(Y = k) > P(Y = j) for all j ̸= k,
where equal probabilities can be settled arbitrarily.

127 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Classification through the Bayes’s theorem 1/2

Usually, we do have further information, namely the pattern x. In this case,


we can compare the probabilities of belonging to each class at x and classify
according to whichever is the larger:
classify ω as coming from the class Ωk if P(Y = k|x) > P(Y = j|x) for all j ̸= k.
This fundamental rule is known as the Bayes minimum error rule. In contrast to
the prior probabilities P(Y = k), now the probabilities P(Y = k|x) depend on x.
For this reason, they are known as posterior or a posteriori probabilities.
Unfortunately, we rarely know these a posteriori probabilities.

130 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Classification through the Bayes’s theorem 2/2


A more convenient formulation is obtained by applying Bayes’ theorem:

P(Y = k|x) = P(Y = k) p(x|Y = k) / p(x) = πk p(x|Y = k) / p(x).
Assume that x is the realization of some random vector X and let

fk (x) = p(x|Y = k)

be the discrete probability density of X in the class Ωk . Thus, p(x) is a mixture


distribution with discrete probability density
p(x) = ∑_{k=1}^{K} πk fk(x)

and finally

P(Y = k|x) = P(Ωk|x) = πk fk(x) / ∑_{h=1}^{K} πh fh(x) = pk(x).
Analogous result holds for continuous random vectors.
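A small numerical sketch of this formula with K = 2 Gaussian classes (the means, variances and priors below are illustrative choices):

x=1.0
prior=c(0.5, 0.5)                    # pi_1, pi_2
f=c(dnorm(x, mean=-2, sd=1),         # f_1(x)
    dnorm(x, mean= 2, sd=1))         # f_2(x)
prior*f/sum(prior*f)                 # posterior probabilities p_k(x)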

133 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis (LDA) for p= 1 - 1/7


Consider first a classification problem involving (univariate) random Gaussian
variables:
X|Ω1 ∼ N(µ1, σ1²)  ⇒  f1(x) = (2πσ1²)^{−1/2} exp{−(x − µ1)²/(2σ1²)}
X|Ω2 ∼ N(µ2, σ2²)  ⇒  f2(x) = (2πσ2²)^{−1/2} exp{−(x − µ2)²/(2σ2²)}
. . .
X|ΩK ∼ N(µK, σK²)  ⇒  fK(x) = (2πσK²)^{−1/2} exp{−(x − µK)²/(2σK²)}

Assumption
In linear discriminant analysis we assume the homoscedastic model, i.e.

σ1² = σ2² = · · · = σK² = σ².

135 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 2/7

In this case, the posterior probability pk (x) is given by


pk(x) = [πk (2πσ²)^{−1/2} exp{−(x − µk)²/(2σ²)}] / [∑_{h=1}^{K} πh (2πσ²)^{−1/2} exp{−(x − µh)²/(2σ²)}],   k = 1, . . . , K.

The Bayes classifier involves assigning an observation X = x to the class for


which pk (x) is largest.
The denominator does not depend on the response k, so we can write it as a
constant:
C = 1 / [∑_{h=1}^{K} πh (2πσ²)^{−1/2} exp{−(x − µh)²/(2σ²)}].

137 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 3/7


Thus

pk(x) = C × πk fk(x) = C × [πk/√(2πσ²)] exp{−(x − µk)²/(2σ²)}

and we absorb everything that does not depend on k into a constant C′:

pk(x) = C′ πk exp{−(x − µk)²/(2σ²)}.

Taking the logarithm of both sides:

ln pk(x) = ln[C′ πk exp{−(x − µk)²/(2σ²)}] = ln C′ + ln πk − (x − µk)²/(2σ²)
         = ln C′ + ln πk − x²/(2σ²) + x·µk/σ² − µk²/(2σ²).

Omitting the terms that do not depend on k, the observation x is assigned to
the class for which

δk(x) = ln πk + x·µk/σ² − µk²/(2σ²)
is maximum.
139 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5.
[Figure: the two class-conditional densities f1(x) and f2(x)]

140 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5.
The resulting density p(x) = π1 f1(x) + π2 f2(x) is given by

[Figure: the mixture density p(x) = π1 f1(x) + π2 f2(x)]

141 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:
Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)
with π1 = π2 = 0.5. The resulting density p(x) = π1 f1 (x) + π2 f2 (x) is given by
[Figure: the mixture density p(x) with its two class-conditional components]

In general the boundary between the two classes is the point h which is the
solution of the equation δ1 (x) = δ2 (x), i.e.
ln π1 + x·µ1/σ² − µ1²/(2σ²) = ln π2 + x·µ2/σ² − µ2²/(2σ²).
142 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5. The resulting density p(x) = π1 f1 (x) + π2 f2 (x) is given by


[Figure: the mixture density p(x) with its two class-conditional components]

leading to

h = [(µ1² − µ2²) + 2σ² ln(π2/π1)] / [2(µ1 − µ2)].
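A quick numerical check of this boundary (a sketch; the variable names are illustrative):

mu1=-2; mu2=2; sigma2=1
pi1=0.5; pi2=0.5
h=((mu1^2 - mu2^2) + 2*sigma2*log(pi2/pi1))/(2*(mu1 - mu2))
h   # 0 for equal priors; about -0.3466 when pi1=0.2 and pi2=0.8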
143 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:
Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)
with π1 = π2 = 0.5. The resulting density p(x) = π1 f1 (x) + π2 f2 (x) is given
by

[Figure: the mixture density p(x) with the decision boundary at h]

and in our case


h = 0.

144 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 4/7


Consider two classes:
Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)
with π1 = π2 = 0.5. The resulting density p(x) = π1 f1 (x) + π2 f2 (x) is given
by
[Figure: the mixture density p(x) with the decision boundary h = 0]

Thus the rule is


! if x ≤ h = 0, the observation is assigned to Ω1
! otherwise x is assigned to Ω2 .
145 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 5/7


Consider two classes:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

where now π1 = 0.2 and π2 = 1 − π1 = 0.8.


[Figure: the mixture density p(x) = 0.2 f1(x) + 0.8 f2(x)]

146 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 5/7


Consider two classes:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

where now π1 = 0.2 and π2 = 1 − π1 = 0.8.


[Figure: the mixture density p(x) with the shifted decision boundary]

In this case we get


h = −0.3466

147 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 5/7


Consider two classes:
Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)
where now π1 = 0.2 and π2 = 1 − π1 = 0.8.
[Figure: the mixture density p(x) with the shifted decision boundary]

and thus the rule is


! if x ≤ h = −0.3466, the observation is assigned to Ω1
! otherwise x is assigned to Ω2 .

148 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 6/7


In practice, even if we are quite certain of our assumption that X is drawn
from a Gaussian distribution within each class, we still have to estimate the
parameters µ1 , . . . , µK , π1 , . . . , πK , and σ 2 .
Assume we have a sample of size n1 from Ω1 , n2 from Ω2 , . . ., nK from ΩK ,
with n = n1 + n2 + · · · + nK .

The linear discriminant analysis (LDA) method approximates the Bayes classifier
through the estimates for πk, µk (k = 1, . . . , K), and σ²:
µ̂k = (1/nk) ∑_{i: yi = k} xi,

σ̂² = (1/(n − K)) ∑_{k=1}^{K} ∑_{i: yi = k} (xi − µ̂k)²,

π̂k = nk / n,

where π̂k is the estimated proportion of the training observations belonging to


the class k.
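A minimal sketch of these estimates for p = 1 on simulated data (x is the numeric predictor, y the class label; the data are hypothetical and generated here only for illustration):

set.seed(1)
y=rep(1:2, each=50)
x=c(rnorm(50, mean=-2), rnorm(50, mean=2))
n=length(x); K=2
mu.hat=tapply(x, y, mean)                   # class-specific sample means
pi.hat=as.numeric(table(y))/n               # estimated prior proportions
sigma2.hat=sum((x - mu.hat[y])^2)/(n - K)   # pooled variance estimate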
150 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 7/7


Consider two samples of size n1 = n2 = 50 from:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5.

[Figure: histogram of the simulated sample]

151 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 7/7


Consider two samples of size n1 = n2 = 50 from:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5.
Estimate the parameters:
[Figure: histogram of the sample with the estimated class-conditional densities]

152 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 7/7


Consider two samples of size n1 = n2 = 50 from:

Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)

with π1 = π2 = 0.5.
and then estimate the underlying mixture distribution:
[Figure: histogram of the sample with the estimated mixture density]

153 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 7/7


Consider two samples of size n1 = n2 = 50 from:
Ω1 : X|Ω1 ∼N(− 2, 1) and Ω2 : X|Ω2 ∼N(2, 1)
with π1 = π2 = 0.5.
Finally we compute the decision boundary h∗ = 0.061:
[Figure: estimated mixture density with the estimated decision boundary h∗]

and the rule is

! assign x to Ω1 if x ≤ h∗ ,
! assign x to Ω2 if x > h∗ .
154 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p= 1 - 7/7

Once we get the decision boundary, denote by Ω̂1 , Ω̂2 the estimated classes
and then compute the confusion matrix

True class
Estimated class Ω1 Ω2 tot
Ω̂1 46 4 50
Ω̂2 1 49 50

and the misclassification error rate is


δ = (4 + 1)/100 = 5%.
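A sketch of how a table of this kind can be obtained with lda() from the MASS package on simulated data (x and y as in the previous sketch; a different random sample gives slightly different counts):

library(MASS)
dat=data.frame(x=x, y=factor(y))
lda.sim=lda(y~x, data=dat)
C=table(estimated=predict(lda.sim)$class, true=dat$y)
C
1 - sum(diag(C))/sum(C)   # misclassification error rate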

155 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Why LDA rather than Logistic Regression?

! Logistic regression is unstable when the classes are well separated;


! In the case where n is small, and the distribution of predictors X is
approximately normal, then LDA is more stable than Logistic
Regression;
! LDA is more popular when we have more than two response classes.

156 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: Multivariate Gaussian distribution


Let X = (X1, . . . , Xp)′ be a random vector with values in Rp. We say that X has a
multivariate Gaussian distribution with mean vector E(X) = µ and covariance
matrix Cov(X) = Σ, and write X ∼ Np(µ, Σ), if it has density

f(x) = 1/((2π)^{p/2} |Σ|^{1/2}) exp{−(1/2)(x − µ)′ Σ⁻¹ (x − µ)}.

[Figure: surfaces of the two bivariate Gaussian densities over (x1, x2)]

Two multivariate Gaussian density functions are shown, with p= 2. Left: The
two predictors are uncorrelated. Right: The two variables have a correlation
of 0.7.
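The density can be evaluated directly in R without additional packages; a sketch with an illustrative mean vector and covariance matrix (the function name is hypothetical):

dmvnorm.manual=function(x, mu, Sigma) {
  p=length(mu)
  q=t(x - mu) %*% solve(Sigma) %*% (x - mu)   # quadratic form (x-mu)' Sigma^{-1} (x-mu)
  as.numeric(exp(-q/2)/((2*pi)^(p/2)*sqrt(det(Sigma))))
}
dmvnorm.manual(c(0,0), mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), 2, 2))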
157 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p> 1 - 1/3


In the case of p> 1 predictors, the LDA classifier assumes that the
observations in the kth class are drawn from a multivariate Gaussian
distribution Np(µk , Σ), where µk is a class-specific mean vector, and Σ is a
covariance matrix that is common to all K classes.
Plugging the density function for the kth class into Bayes’ theorem, as in the
univariate case, and performing a little bit of algebra reveals that the Bayes
classifier assigns an observation X = x to the class for which

δk(x) = x′Σ⁻¹µk − (1/2)µk′Σ⁻¹µk + ln πk
is largest. δk (x) is a linear function of x and is the vector/matrix version of the
case with p= 1.
The decision boundaries are given by the set of x ∈ Rp such that

δk (x) = δl (x) k ̸= l

i.e.
x′Σ⁻¹µk − (1/2)µk′Σ⁻¹µk + ln πk = x′Σ⁻¹µl − (1/2)µl′Σ⁻¹µl + ln πl,   k ̸= l.

159 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p> 1 - 2/3


An example is shown in the left-hand panel of Figure 4.6. Three equally sized
Gaussian classes are shown with class-specific mean vectors and a common
covariance matrix.

[Figure: three Gaussian classes in the (X1, X2) plane]

! The three ellipses represent regions that contain 95% of the probability
for each of the three classes,
! the dashed lines are the Bayes decision boundaries.

160 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Linear Discriminant Analysis for p> 1 - 3/3


Once we estimate the unknown parameters µ1 , . . . , µK , π1 , . . . , πK , and Σ, we
get the estimated decision surfaces given by the LDA:
x′Σ̂⁻¹µ̂k − (1/2)µ̂k′Σ̂⁻¹µ̂k + ln π̂k = x′Σ̂⁻¹µ̂l − (1/2)µ̂l′Σ̂⁻¹µ̂l + ln π̂l,   k ̸= l.

Remark
Obviously, the population decision boundary (Bayes rule) and the sample
decision boundary (LDA) are different:
[Figure: Bayes and LDA decision boundaries in the (X1, X2) plane]

161 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 20/23

lda.fit=lda(default∼balance+student) # lda is in the package MASS


lda.fit
Call:
lda(default∼balance + student)

Prior probabilities of groups:


No Yes
0.9667 0.0333
Group means:
balance studentYes
No 803.9438 0.2914037
Yes 1747.8217 0.3813814
Coefficients of linear discriminants:
LD1
balance 0.002244397
studentYes -0.249059498

162 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Confusion matrix 1/2

True default status


No Yes Total
Predicted No 9644 252 9896
default status Yes 23 81 104
Total 9667 333 10000

The analysis of the confusion matrix shows that LDA gives an overall
misclassification error equal to (23 + 252)/10000 · 100 = 2.75% but we need to
pay attention to the different situations:
! a total of 104 people are predicted to default and, of these, 81 actually
defaulted and 23 did not;
! In particular only 23 out of 9667 of the individuals who did not default
were incorrectly labeled;

163 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Confusion matrix 2/2


True default status
No Yes Total
Predicted No 9644 252 9896
default status Yes 23 81 104
Total 9667 333 10000

The analysis of the confusion matrix shows that LDA gives an overall
misclassification error equal to (23 + 252)/10000 · 100 = 2.75% but we need to
pay attention to the different situations:
! a total of 104 people would default and, of these people, 81 actually
defaulted and 23 did not;
! In particular only 23 out of 9667 of the individuals who did not default
were incorrectly labeled;
! of the 333 individuals who defaulted, 252 were missed by LDA.
! From the perspective of a credit card company that is trying to identify
high-risk individuals, an error rate of 252/333 = 75.7% among
individuals who default may well be unacceptable.
164 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 21/23

lda.predict=predict(lda.fit,Default)
pred.class=lda.predict$class
table(pred.class,default)

default
pred.class No Yes
No 9644 252
Yes 23 81

addmargins(table(pred.class,default))

default
pred.class No Yes Sum
No 9644 252 9896
Yes 23 81 104
Sum 9667 333 10000

165 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Misclassifications
There are two kinds of misclassifications:
False positive
some misclassifications will result from incorrectly assigning a customer who
does not default to the default class,

False negative
and others will result from incorrectly assigning a customer who defaults to the
non-default class.
The two errors are not equivalent (it depends on the case study). For
example a credit card company
! might particularly wish to avoid incorrectly classifying an individual who
will default,
! whereas incorrectly classifying an individual who will not default, though
still to be avoided, is less problematic.

166 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Threshold and decision boundary 1/3

The Bayes classifier works by assigning an observation to the class for which
the posterior probability is greatest. In the two-class case, this amounts to
assigning an observation to the default class if

P(default = Yes|X = x) > 0.5

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for
the posterior probability of default in order to assign an observation to the
default class.
However, if we are concerned about incorrectly predicting the default status
for individuals who default, then we can consider lowering this threshold.
In other words, perhaps we shouldn’t use 0.5 as the threshold for predicting
default?

167 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Threshold and decision boundary 2/3


For instance, we might label any customer with a posterior probability of
default above 20% to the default class. In other words, we could instead
assign an observation to this class if

P(default = Yes|X = x) > 0.2

True default status


No Yes Total
Predicted No 9432 138 9570
default status Yes 235 195 430
Total 9667 333 10000

! The overall misclassification error is now
(235 + 138)/10000 · 100 = 3.73%, larger than 2.75%;
! Of the 333 individuals who default, LDA correctly predicts all but 138
(rather than 252), or 41.4%. This is a vast improvement over the error
rate of 75.7% that resulted from using the threshold of 50%;

168 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Threshold and decision boundary 2/3


For instance, we might label any customer with a posterior probability of
default above 20% to the default class. In other words, we could instead
assign an observation to this class if
P(default = Yes|X = x) > 0.2
True default status
No Yes Total
Predicted No 9432 138 9570
default status Yes 235 195 430
Total 9667 333 10000

! However, this improvement comes at a cost: now 235 individuals who


do not default are incorrectly classified. As a result, the overall error rate
has increased slightly to 3.73%.

Company policy
But a credit card company may consider this slight increase in the total error
rate to be a small price to pay for more accurate identification of individuals
who do indeed default.

169 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Threshold and decision boundary 3/3


Deciding the trade-off
The trade-off that results from modifying the threshold value for the posterior
probability of default. Various error rates are shown as a function of the
threshold value:

[Figure: error rate as a function of the threshold value]

! Using a threshold of 0.5 minimizes the overall error rate, shown as a


black solid line. This is to be expected, since the Bayes classifier uses a
threshold of 0.5 and is known to have the lowest overall error rate;
! when a threshold of 0.5 is used, the error rate among the individuals
who default is quite high (fraction of defaulters missed: blue dashed line);
! the orange dotted line denotes the fraction of non-defaulters incorrectly classified.
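A sketch of how curves of this kind can be computed, assuming post.class contains the LDA posterior probabilities of default (as in the R code shown two slides below) and default is the true status:

thr=seq(0.01, 0.5, by=0.01)
overall=defaulters=nondefaulters=numeric(length(thr))
for (i in seq_along(thr)) {
  pred=ifelse(post.class >= thr[i], "Yes", "No")
  overall[i]=mean(pred != default)                       # overall error rate
  defaulters[i]=mean(pred[default == "Yes"] != "Yes")    # fraction of defaulters missed
  nondefaulters[i]=mean(pred[default == "No"] != "No")   # non-defaulters incorrectly classified
}
matplot(thr, cbind(overall, defaulters, nondefaulters), type="l",
        xlab="Threshold", ylab="Error Rate")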

170 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Threshold and decision boundary 3/3


Deciding the trade-off
The trade-off that results from modifying the threshold value for the posterior
probability of default. Various error rates are shown as a function of the
threshold value:

[Figure: error rate as a function of the threshold value]

In summary: as the threshold is reduced, the error rate among individuals


who default decreases steadily, but the error rate among the individuals who
do not default increases.
How can we decide which threshold value is best?
Such a decision must be based on domain knowledge, such as detailed
information about the costs associated with default.
171 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 22/23

lda.predict=predict(lda.fit,Default)
post.class=lda.predict$posterior[,2]
table((post.class>=0.2),default)

default
No Yes
FALSE 9432 138
TRUE 235 195

addmargins(table((post.class>=0.2),default))

default
No Yes Sum
FALSE 9432 138 9570
TRUE 235 195 430
Sum 9667 333 10000

172 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Sensitivity and Specificity 1/3


True default status
No Yes Total
Predicted No 9644 252 9896
default status Yes 23 81 104
Total 9667 333 10000

The terms sensitivity and specificity characterize the performance of a


classifier or screening test.

Sensitivity
Sensitivity is the percentage of true defaulters that are identified.
In this case, the sensitivity is 81/333 · 100 = 24.32%.

Specificity
Specificity is the percentage of non-defaulters that are correctly identified.
In this case, the specificity is (1 − 23/9667) · 100 = 99.8%.
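A sketch computing both quantities from the confusion matrix (pred.class and default as in the earlier R code):

C=table(pred.class, default)
sensitivity=C["Yes","Yes"]/sum(C[,"Yes"])   # 81/333
specificity=C["No","No"]/sum(C[,"No"])      # 9644/9667
c(sensitivity, specificity)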
173 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Sensitivity and Specificity 2/3


Let us introduce:
! true positive rate (TPR): equal to sensitivity,
! false positive rate (FPR): equal to 1-specificity.

Predicted class
− or Null + or Non-null Total
True −or Null True Neg. (TN) False Pos. (FP) TN+FP
class + or Non-null False Neg. (FN) True Pos. (TP) FN+TP
Total TN+FN FP+TP

Thus:

TPR = TP/(FN + TP),

specificity = TN/(FP + TN),    FPR = 1 − TN/(FP + TN) = FP/(FP + TN).

174 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Sensitivity and Specificity 2/3


Let us introduce:
! true positive rate (TPR): equal to sensitivity,
! false positive rate (FPR): equal to 1-specificity.

Predicted class
− or Null + or Non-null Total
True −or Null True Neg. (TN) False Pos. (FP) TN+FP
class + or Non-null False Neg. (FN) True Pos. (TP) FN+TP
Total TN+FN FP+TP

! To make the connection with the epidemiology literature, we think of ”+”


as the ”disease” state that we are trying to detect, and ”− ” as the
”non-disease” state.

! To make the connection to the classical hypothesis testing literature, we


think of ”− ” as the null hypothesis and ”+” as the alternative (non-null)
hypothesis.

175 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Sensitivity and Specificity 3/3


More in general

Name                        Definition     Synonyms
False Positive Rate         FP/(TN+FP)     Type I error, 1 − specificity
True Positive Rate          TP/(FN+TP)     1 − Type II error, power, sensitivity
Positive Predictive Value   TP/(FP+TP)     Precision, 1 − false discovery proportion
Negative Predictive Value   TN/(TN+FN)

Thus, in a context of hypothesis testing with

H0: negative (e.g., non-default)

against

H1: positive (e.g., default)

we have that
! false positives are associated with the error of the first kind (Type I error),
! while false negatives are associated with the error of the second kind (Type II error).

176 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Receiving Operating Characteristics (ROC) curve 1/4


The Receiver Operating Characteristic (ROC) curve is used to assess the
accuracy of a continuous measurement for predicting a binary outcome.
In medicine, ROC curves have a long history of use for evaluating diagnostic
tests in radiology and general diagnostics. ROC curves have also been used
for a long time in signal detection theory.
As we stated above, the accuracy of a diagnostic test can be evaluated by
considering the two possible types of errors: false positives, and false
negatives. Consider a continuous measurement that we denote in general as M.
Convention dictates that a positive test result is defined as M exceeding some fixed
threshold c: M > c.

In reference to the binary outcome that we denote as D:


! a good outcome of the test is when the test is positive among an
individual who truly has a disease (D = 1);
! a bad outcome is when the test is positive among an individual who
does not have the disease (D = 0).

177 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Receiving Operating Characteristics (ROC) curve 2/4

Formally, for a fixed cutoff c, the true positive rate depends on c, say TPR(c),
and it is the probability of a test positive among the diseased population:

TPR(c) = P(M > c|D = 1),

and the false positive rate, say FPR(c), is the probability of a test positive
among the healthy population:

FPR(c) = P(M > c|D = 0).

! Since the cutoff c is not usually fixed in advance, we can plot the
TPR(c) against the FPR(c) for all possible values of c.
! This is exactly what the ROC curve is: FPR(c) on the x axis and TPR(c)
on the y axis.

178 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Receiving Operating Characteristics (ROC) curve 3/4


The overall performance of a classifier, summarized over all possible
thresholds, is given by the area under the ROC curve (AUC).
An ideal ROC curve will hug the top left corner, so the larger the AUC the
better the classifier.

[Figure: an ideal ROC curve (true positive rate vs. false positive rate)]

In the ideal case, AUC = 1.


179 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Receiving Operating Characteristics (ROC) curve 4/4

We expect a classifier that performs no better than chance to have an AUC of


0.5 (when evaluated on an independent test set not used in model training).
A good ROC curve is close to the ideal ROC curve, indicating a high true
positive rate and a low false positive rate.
ROC curves are useful for comparing different classifiers, since they take into
account all possible thresholds.
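A sketch of an ROC curve and its AUC computed by hand, sweeping the threshold over the LDA posterior probabilities (post.class and default as before; in practice a dedicated package is often used instead):

c.grid=sort(unique(post.class), decreasing=TRUE)
TPR=sapply(c.grid, function(cc) mean(post.class[default == "Yes"] > cc))
FPR=sapply(c.grid, function(cc) mean(post.class[default == "No"] > cc))
plot(FPR, TPR, type="l", xlab="False positive rate", ylab="True positive rate")
abline(0, 1, lty=3)                        # "no information" classifier
x=c(0, FPR, 1); y=c(0, TPR, 1)
sum(diff(x)*(head(y,-1) + tail(y,-1))/2)   # AUC via the trapezoidal rule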

180 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 23/23

[Figure: ROC curve (true positive rate vs. false positive rate)]

A ROC curve for the LDA classifier on the Default data.


The dotted line represents the ”no information” classifier; this is what we
would expect if student status and credit card balance are not associated with
probability of default.

181 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Credit Card Default data 23/23

[Figure: ROC curve (true positive rate vs. false positive rate)]

We consider the ROC curve on the test set and compare the AUC for
different models:
! Logistic Regression → AUC=0.9530,
! LDA → AUC=0.9538.
The two models are in practice equivalent here.
182 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Quadratic Discriminant Analysis 1/2

LDA assumes that the observations within each class are drawn from a
multivariate Gaussian distribution with a class-specific mean vector µk
(k = 1, . . . , K) and a covariance matrix Σ that is common to all K classes.
Quadratic discriminant analysis (QDA) provides an alternative approach. Like
LDA, the QDA classifier results from assuming that the observations from
each class are drawn from a Gaussian distribution, and plugging estimates
for the parameters into Bayes’ theorem in order to perform prediction.
However, unlike LDA, QDA assumes that each class has its own covariance
matrix. That is, it assumes that an observation from the kth class is of the
form X ∼ Np(µk , Σk ), where Σk is a covariance matrix for the kth class.

183 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Quadratic Discriminant Analysis 2/2

Under this assumption, the Bayes classifier assigns an observation X = x to


the class for which
δk(x) = −(1/2)(x − µk)′Σk⁻¹(x − µk) − (1/2) ln|Σk| + ln πk
      = −(1/2)x′Σk⁻¹x + x′Σk⁻¹µk − (1/2)µk′Σk⁻¹µk − (1/2) ln|Σk| + ln πk
is largest.
Unlike in LDA, the quantity x enters δk(x) through a quadratic function. This is
where QDA gets its name.
In practice, QDA works like LDA except that it estimates a separate covariance
matrix for each class.
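As a sketch, QDA is available through qda() in the MASS package, with the same interface as lda() (the resulting table is not reported in these slides):

library(MASS)
qda.fit=qda(default~balance+student, data=Default)
qda.pred=predict(qda.fit, Default)$class
table(qda.pred, Default$default)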

184 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparing LDA to QDA 1/5

Why does it matter whether or not we assume that the K classes share a
common covariance matrix? In other words, why would one prefer LDA to
QDA, or vice versa?
Since QDA allows for different variances among classes, the resulting
boundaries become quadratic, thus:
! QDA will work best when the variances are very different between
classes and we have enough observations to accurately estimate the
variances,
! LDA will work best when the variances are similar among classes or we
don’t have enough data to accurately estimate them.

In other words, the answer lies in the bias-variance trade-off.

185 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparing LDA to QDA 2/5

When there are p predictors, estimating a covariance matrix requires


estimating p(p+ 1)/2 parameters.
QDA estimates a separate covariance matrix for each class, for a total of
K · p(p + 1)/2 parameters. With 50 predictors this is some multiple of 1,275,
which is a lot of parameters.
By instead assuming that the K classes share a common covariance matrix,
the LDA model becomes linear in x, which means there are Kp linear
coefficients to estimate.

Consequently, LDA is a much less flexible classifier than QDA, and so has
substantially lower variance.

This can potentially lead to improved prediction performance.

186 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparing LDA to QDA 3/5

But there is a trade-off: if LDA’s assumption that the K classes share a


common covariance matrix is badly off, then LDA can suffer from high bias.

Roughly speaking, LDA tends to be a better bet than QDA if there are relatively
few training observations and so reducing variance is crucial.

In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern, or if the assumption of a
common covariance matrix for the K classes is clearly untenable.

187 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparing LDA to QDA 4/5 - Σ1 = Σ2

[Figure: LDA and QDA decision boundaries in the (X1, X2) plane when Σ1 = Σ2]
Two Gaussian classes with common correlation of 0.7 between X1 and X2 .
As a result, the Bayes decision boundary is linear and is accurately
approximated by the LDA decision boundary.
The QDA decision boundary is inferior, because it suffers from higher
variance without a corresponding decrease in bias.
188 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparing LDA to QDA 5/5 - Σ1 ̸= Σ2

[Figure: LDA and QDA decision boundaries in the (X1, X2) plane when Σ1 ̸= Σ2]
A situation in which the orange class has a correlation of 0.7 between the
variables and the blue class has a correlation of − 0.7.
Now the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this boundary than does LDA.

189 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparison between Logistic Regression and LDA 1/2


Though their motivations differ, the logistic regression and LDA methods are
closely connected.
Consider the two-class setting with p= 1 predictor, and let p1 (x) and
p2 (x) = 1 − p1 (x) be the probabilities that the observation X = x belongs to
class 1 and class 2, respectively.
In the LDA framework, we can see that the log odds is given by
ln[p1(x)/(1 − p1(x))] = ln[p1(x)/p2(x)] = c0 + c1 x,

where c0 and c1 are functions of µ1 , µ2 and σ 2 .


For logistic regression, we have
ln[p1(x)/(1 − p1(x))] = β0 + β1 x.

Both expressions are linear functions of x. Hence, both logistic regression and
LDA produce linear decision boundaries.
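For p = 1 and two classes the LDA intercept and slope can be written explicitly; a sketch using the estimates from the earlier simulated example (mu.hat, sigma2.hat, pi.hat; class 1 is treated as the "positive" class):

c1=(mu.hat[1] - mu.hat[2])/sigma2.hat
c0=log(pi.hat[1]/pi.hat[2]) - (mu.hat[1]^2 - mu.hat[2]^2)/(2*sigma2.hat)
# compare with coef(glm(I(y == 1) ~ x, family=binomial)) fitted on the same data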

190 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Comparison between Logistic Regression and LDA 2/2


The only difference between the two approaches lies in the fact that β0 and β1
are estimated using maximum likelihood, whereas c0 and c1 are computed
using the estimated mean and variance from a normal distribution.
This same connection between LDA and logistic regression also holds for
multidimensional data with p> 1.
Since logistic regression and LDA differ only in their fitting procedures, one
might expect the two approaches to give similar results.
This is often, but not always, the case:
! LDA assumes that the observations are drawn from a Gaussian
distribution with a common covariance matrix in each class, and so can
provide some improvements over logistic regression when this
assumption approximately holds.
! Conversely, logistic regression can outperform LDA if these Gaussian
assumptions are not met.

191 / 192
Data Analysis and Statistical Learning:03 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Labo activity with R

Labo activity 3.R

192 / 192
