

WHAT IS STATISTICAL LEARNING?
Chapter 02

Some slides/pictures taken from:


https://www.linkedin.com/pulse/accuracy-bias-variance-tradeoff-yair-rajwan-ms-dsc

Outline
➢ What is statistical learning?
➢ Why estimate f?
➢ How do we estimate f?
➢ The trade-off between prediction accuracy and model interpretability
➢ Supervised vs. unsupervised learning
➢ Regression vs. classification problems
➢ Assessing quality of fit
➢ The bias-variance trade-off
➢ Nearest neighbors for classification
What is Statistical Learning?

[Figure: scatter plots of Sales vs. TV, Radio, and Newspaper budgets, each panel with its own fitted line]

Suppose that, given the budgets for the three media, our aim is to build a model to predict sales accurately.
Shown are scatter plots of Sales vs. TV, Radio, and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three variables?
Perhaps we can do better using a model:
Sales ≈ f(TV, Radio, Newspaper)
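As a concrete sketch, such a linear model could be fit as below. This assumes the data sit in a file named Advertising.csv with columns TV, Radio, Newspaper, and Sales (a hypothetical path and layout; the slides do not specify one).

```python
# Minimal sketch: fit Sales ~ f(TV, Radio, Newspaper) with a linear model.
# "Advertising.csv" and its column names are assumptions, not from the slides.
import pandas as pd
from sklearn.linear_model import LinearRegression

ads = pd.read_csv("Advertising.csv")            # assumed data file
X = ads[["TV", "Radio", "Newspaper"]]           # predictors (media budgets)
y = ads["Sales"]                                # response

model = LinearRegression().fit(X, y)            # estimate f with a linear model
new = pd.DataFrame([[100.0, 20.0, 10.0]], columns=X.columns)
print(model.predict(new))                       # predicted Sales for new budgets
```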

What is Statistical Learning?


➢Suppose we observe Yi and Xi = (Xi1, ..., Xip) for i = 1, ..., n.
➢We believe that there is a relationship between Y and at least one of the X's.
• Here Y = Sales is a response or target (output) that we wish to predict.
• TV is a feature, or input, or predictor; we name it X1. Likewise, we name Radio X2, and so on.
➢We can generally model the relationship as:

Yi = f(Xi) + εi
➢Where f is an unknown function and ε is a random error
with mean zero.
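A minimal simulation sketch of this model, with an illustrative choice of f and noise level (both are assumptions for the example, since the true f is unknown in practice):

```python
# Simulate Y = f(X) + eps with a known f and mean-zero random error.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 1.0, n)        # predictor values
def f(x):                           # the "true" f (illustrative choice)
    return np.sin(2 * np.pi * x)
eps = rng.normal(0.0, 0.1, n)       # random error with mean zero
y = f(x) + eps                      # observed responses
```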

A Simple Example
[Figure: simulated (x, y) data; x ranges from 0.0 to 1.0, y from −0.10 to 0.10]

A Simple Example
[Figure: the same simulated data with the true curve f and one error εi marked]

Simulated Data: Income vs. Education and Seniority

Red points are simulated values of income from the model
income = f(education, seniority) + ε
f is the blue surface.

Why Do We Estimate f?
➢Statistical Learning, and this course, are all about how to
estimate f.
➢The term statistical learning refers to using the data to
“learn” f.
➢Why do we care about estimating f?
➢There are two reasons for estimating f:
➢ Prediction
➢ Inference
Inference may involve testing relevant subject matter theory (e.g., economic theory) and suggesting policy based on this inference.

1. Prediction
➢If we can produce a good estimate for f (and the variance
of ε is not too large) we can make accurate predictions for
the response, Y, based on a new (unseen) value of X.

Example: Direct Mailing Prediction


➢Interested in predicting how much money an individual
will donate based on observations from 90,000 people on
which we have recorded over 400 different characteristics
(x variables).
➢Don’t care too much about each individual characteristic.
➢Just want to know: For a given individual should I send
out a mailing?

2. Inference
➢Alternatively, we may also be interested in the type of
relationship between Y and the X's.
➢For example,
➢ Which predictors actually affect the response?
➢ Is the relationship positive or negative?
➢ Is the relationship a simple linear one, or is it more complicated?

Example: Housing Price Inference


➢Wish to predict median house price based on 14
variables.
➢Probably want to understand which factors have the
biggest effect on the response and how big the effect is.
➢For example, how much impact does a river view have on the house value?

How Do We Estimate f ?
➢We will assume we have observed a set of training data

{(X1, Y1), (X2, Y2), ..., (Xn, Yn)}


➢We must then use the training data and a statistical
method to estimate f.
➢Statistical Learning Methods:
➢ Parametric Methods
➢ Non-parametric Methods

Parametric Methods
➢Parametric methods reduce the problem of estimating f down to one of estimating a set of parameters.

➢They involve a two-step model-based approach

STEP 1:
Make an assumption about the functional form of f, i.e., come up with a model. The most common example is a linear model, i.e.,

f(Xi) = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip

However, in this course we will examine far more complicated, and flexible, models for f. In a sense, the more flexible the model, the more realistic it is.

Parametric Methods (cont.)


STEP 2:
Use the training data to fit the model, i.e., estimate f, or equivalently the unknown parameters β0, β1, β2, ..., βp.

➢The most common approach for estimating the parameters in a linear model is ordinary least squares (OLS).
➢However, this is only one way.
➢We will see in this course that there are often alternative and superior approaches.
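A minimal OLS sketch on simulated data, using numpy's least-squares solver (the data-generating numbers below are illustrative assumptions):

```python
# Estimate beta0, beta1 by ordinary least squares via numpy's lstsq.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)     # truth: beta0 = 2, beta1 = 0.5

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                              # roughly [2.0, 0.5]
```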
Example of a Parametric Method:

A linear model f̂L(X) = β̂0 + β̂1 X gives a reasonable fit here.

[Figure: scatter plot of (x, y) data with the fitted line f̂L; x from 1 to 6, y from −2 to 3]

A quadratic model f̂Q(X) = β̂0 + β̂1 X + β̂2 X² fits slightly better.


[Figure: the same scatter plot with the fitted quadratic curve f̂Q; x from 1 to 6, y from −2 to 3]
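A sketch comparing the two fits on illustrative simulated data (the plotted data set itself is not available, so the curved truth below is an assumption):

```python
# Fit f_L (degree 1) and f_Q (degree 2) and compare training MSE.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 6, 60)
y = 0.2 * (x - 3.5) ** 2 + rng.normal(0, 0.5, 60)   # curved truth (assumed)

lin = np.polyfit(x, y, deg=1)     # f_L: linear fit
quad = np.polyfit(x, y, deg=2)    # f_Q: quadratic fit

def mse(coefs):
    return np.mean((y - np.polyval(coefs, x)) ** 2)

print(mse(lin), mse(quad))        # the quadratic fits slightly better
```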

Example: A Linear Regression Estimate


• Even if the error standard deviation is low, we will still get a bad estimate of f if we use the wrong model, e.g., the fit of a linear model (a plane) to the income data:

f̂ = β̂0 + β̂1 × Education + β̂2 × Seniority

Parametric models assume a global structure on the (x, y) relationship.

Non-parametric Methods
➢They do not make explicit assumptions about the
functional form of f.
➢Advantage: they can accurately fit a wider range of possible shapes of f.
➢Disadvantage: a very large number of observations is required to obtain an accurate estimate of f.
Example of Non-parametric Method: K Nearest Neighbor Method

• Suppose we want to predict Y at X = 4.
• We can use the average of the responses at points near x:
f̂(x) = Ave(Y | X ∈ N(x))
where N(x) is some neighborhood of x.
• The resulting method is called the nearest neighbor method.
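A direct sketch of this average, taking N(x) to be the k closest training points (the tiny data set and k = 3 are assumptions for illustration):

```python
# Nearest-neighbor regression: f_hat(x) = average of Y over the k points
# whose X values are closest to x.
import numpy as np

def knn_predict(x0, x_train, y_train, k=3):
    dist = np.abs(x_train - x0)          # distances in one dimension
    nearest = np.argsort(dist)[:k]       # indices of the k closest points
    return y_train[nearest].mean()       # average their responses

x_train = np.array([1.0, 2.0, 3.5, 4.2, 5.0, 6.1])
y_train = np.array([1.1, 1.9, 3.2, 4.4, 5.2, 6.0])
print(knn_predict(4.0, x_train, y_train))   # prediction at X = 4
```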


Another example of a non-parametric method: a thin-plate spline estimate

• Splines are a special type of piecewise polynomial. Non-linear regression methods are more flexible and can potentially provide more accurate predictions.

Non-parametric models assume a local structure on the (x, y) relationship.

Trade-off Between Prediction Accuracy and Model Interpretability
➢Why not just use a more flexible method if it is more realistic?
➢There are two reasons:
Reason 1:
A simple method such as linear regression produces a model
which is much easier to interpret (the Inference part is better).
For example, in a linear model, βj is the average increase in Y for a one-unit increase in Xj, holding all other variables constant.

Reason 2:
Even if you are only interested in prediction, so the first reason is not relevant, it is often possible to get more accurate predictions with a simple model instead of a complicated one. This seems counterintuitive, but it has to do with the fact that it is harder to fit a more flexible model.

Good training fit but poor test prediction


• Non-linear regression methods can also be too flexible and produce poor estimates of f on test data. Here the fitted model makes no errors on the training data! This is also known as overfitting.

Supervised vs. Unsupervised Learning


➢We can divide all learning problems into Supervised and
Unsupervised situations
➢Supervised Learning:
➢ Supervised Learning is where both the predictors, Xi, and the
response, Yi, are observed.
➢ Regression and classification are the two main classes of approaches.
➢ Most of this course will also deal with supervised learning.

➢Unsupervised Learning:
➢ In this situation only the Xi’s are observed.
➢ A common example is market segmentation where we try to divide
potential customers into groups based on their characteristics.
➢ A common approach is clustering.
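A clustering sketch in the spirit of the market-segmentation example, on made-up customer features (k-means is one common choice; the slides do not prescribe an algorithm):

```python
# Group unlabeled points into three clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Three synthetic "customer" groups in a 2-D feature space
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in [(0, 0), (3, 3), (0, 4)]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])         # cluster assignments for the first few points
print(km.cluster_centers_)     # estimated group centers
```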

A Simple Clustering Example

A clustering data set involving three groups. Each group is shown using a different colored symbol.
Left: The three groups are well-separated. In this setting, a clustering approach should successfully
identify the three groups. Right: There is some overlap among the groups. Now the clustering task is
more challenging.

Regression vs. Classification


➢Supervised learning problems can be further divided into
regression and classification problems.
➢Regression covers situations where Y is quantitative (continuous/numerical), e.g.:
➢ Predicting the value of the KSE100 Index in 6 months.
➢ Predicting the value of a given house based on various features.
➢Classification covers situations where Y is categorical, e.g.:
➢ Will the KSE 100 Index be up (U) or down (D) in 6 months?
➢ Is this email a SPAM or not?

Different Approaches
➢We will deal with both types of problems in this course.
➢Some methods work well on both types of problems, e.g., neural networks.
➢Other methods work best on Regression, e.g., Linear
Regression, or on Classification, e.g., k-Nearest
Neighbors.
• Note: There are other learning methods including online
learning, reinforcement learning etc.

Measuring Quality of Fit (in the Regression Context)
➢Suppose we have a regression problem.
➢One common measure of accuracy in regression is the mean squared error (MSE). On the training data,

MSE_TR = (1/n) ∑_{i=1}^{n} [yi − f̂(xi)]²

➢where f̂(xi) is the prediction our method gives for the i-th observation in our training data.
➢The MSE on test data (MSE_TE) is defined similarly.
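A sketch computing MSE_TR and MSE_TE for one fitted model, with an assumed train/test split of simulated data:

```python
# Compute training and test MSE for a cubic polynomial fit.
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
x_tr, y_tr, x_te, y_te = x[:100], y[:100], x[100:], y[100:]

coefs = np.polyfit(x_tr, y_tr, deg=3)          # fit on training data only
print(mse(y_tr, np.polyval(coefs, x_tr)))      # MSE_TR
print(mse(y_te, np.polyval(coefs, x_te)))      # MSE_TE
```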
A Problem
➢In either case our method has generally been designed to make the MSE small on the training data we are looking at, e.g., with linear regression we choose the line such that the MSE is minimized.

➢What we really care about is how well the method works on new data. We call this new data "Test Data".

➢There is no guarantee that the method with the smallest training MSE will have the smallest test (i.e., new-data) MSE.
Training vs. Test MSE’s
➢In general, the more flexible a method is, the lower its training MSE will be, i.e., it will "fit" or explain the training data very well.

➢However, the test MSE may in fact be higher for a more flexible method than for a simple approach.
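A sketch reproducing this qualitative pattern with polynomial degree as the flexibility knob (the data-generating process is an illustrative assumption):

```python
# Training MSE falls steadily with flexibility; test MSE is U-shaped.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 300)
x_tr, y_tr, x_te, y_te = x[:150], y[:150], x[150:], y[150:]

for deg in [1, 3, 5, 10, 15]:                   # increasing flexibility
    c = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(c, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(c, x_te)) ** 2)
    print(deg, round(tr, 3), round(te, 3))      # train keeps falling; test turns up
```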
Examples with Different Levels of Flexibility: Example 1

LEFT: Black = truth; Orange = linear estimate; Blue = smoothing spline; Green = more flexible smoothing spline.
RIGHT: Red = test MSE; Grey = training MSE; Dashed = minimum possible test MSE (irreducible error).
Examples with Different Levels of Flexibility: Example 2

LEFT: Black = truth; Orange = linear estimate; Blue = smoothing spline; Green = more flexible smoothing spline.
RIGHT: Red = test MSE; Grey = training MSE; Dashed = minimum possible test MSE (irreducible error).
Examples with Different Levels of Flexibility: Example 3

LEFT: Black = truth; Orange = linear estimate; Blue = smoothing spline; Green = more flexible smoothing spline.
RIGHT: Red = test MSE; Grey = training MSE; Dashed = minimum possible test MSE (irreducible error).
Bias-Variance Trade-off
➢The previous graphs of test versus training MSE illustrate a very important trade-off that governs the choice of statistical learning methods.

➢There are always two competing forces that govern the choice of learning method, i.e., bias and variance.
Bias of Learning Methods
➢Bias refers to the error that is introduced by modeling a real-life problem (which may be quite complicated) with a much simpler model:

Bias(f̂(x)) = E[f̂(x)] − f(x)

➢For example, linear regression assumes that there is a linear relationship between Y and X. It is unlikely that, in real life, the relationship is exactly linear, so some bias will be present, e.g., in the previous plots 1 and 3.

➢The more flexible/complex a method is, the less bias it will generally have.
Variance of Learning Methods
➢Variance refers to how much your estimate of f would change if you had a different training data set:

Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²]


➢Generally, the more flexible a method is, the more variance it has.
The Trade-off
➢It can be shown that, for any given X = x0, the expected test mean squared error (MSE) for a new y0 at x0 equals

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)

➢Typically, as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off: the expected test MSE may go up or down!
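In a simulation, where f is known, the bias and variance of f̂(x0) can be estimated directly by refitting on many fresh training sets; a sketch (all numbers are illustrative assumptions):

```python
# Estimate Bias and Var of f_hat(x0) over repeated training sets.
import numpy as np

rng = np.random.default_rng(6)
def f(x):
    return np.sin(2 * np.pi * x)

x0, n_sims = 0.3, 500
preds = []
for _ in range(n_sims):                      # a fresh training set each time
    x = rng.uniform(0, 1, 50)
    y = f(x) + rng.normal(0, 0.2, 50)
    c = np.polyfit(x, y, deg=2)              # a fairly inflexible fit
    preds.append(np.polyval(c, x0))
preds = np.array(preds)
print("bias:", preds.mean() - f(x0))         # E[f_hat(x0)] - f(x0)
print("variance:", preds.var())              # Var[f_hat(x0)]
```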
Test MSE, Bias and Variance
The Classification Setting
➢For a regression problem, we used the MSE to assess the accuracy of the statistical learning method.
➢For a classification problem we can use the error rate, i.e.,

Error Rate = (1/n) ∑_{i=1}^{n} I(yi ≠ ŷi)

➢I(yi ≠ ŷi) is an indicator function that equals 1 if yi ≠ ŷi and 0 otherwise.

➢Thus, the error rate represents the fraction of incorrect classifications, or misclassifications.
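A one-line sketch of the error rate on made-up labels:

```python
# Error rate = fraction of observations with y_i != y_hat_i.
import numpy as np

y = np.array([0, 1, 1, 0, 1, 0, 0, 1])       # true classes (made up)
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # predicted classes (made up)

error_rate = np.mean(y != y_hat)             # average of the indicator
print(error_rate)                            # 0.25 here (2 of 8 wrong)
```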
Bayes Error Rate
➢The Bayes error rate refers to the lowest possible error rate that could be achieved if we somehow knew the "true" probability distribution of the data.
➢The Bayes classifier assigns each observation to the most likely class given its predictor values, i.e., we simply assign a test observation with predictor vector x0 to the class j for which Pr(Y = j | X = x0) is largest.
➢On test data, no classifier (or learning method) can get lower
error rates than the Bayes error rate.

➢For a two-class problem, the Bayes classifier corresponds to predicting class one if Pr(Y = 1 | X = x0) > 0.5, and class two otherwise. Of course, in real-life problems the Bayes error rate can't be calculated exactly.
Bayes Optimal Classifier (predict orange
or blue class at each x1, x2 pair)
K-Nearest Neighbors (KNN)
➢K-Nearest Neighbors is a flexible approach to estimate
the Bayes Classifier.
➢For any given X we find the k closest points to X in the training data (e.g., using the Euclidean distance) and examine their corresponding Y values.
➢If the majority of the Y's are orange we predict orange, otherwise blue.
➢The smaller k is, the more flexible the method will be (the average of a small number of values can be quite erratic, i.e., it follows the error, while the average of a large number of values is smoother because the errors average out).
KNN classifies the test observation x0 to the class with the largest estimated probability, i.e.,

Pr(Y = j | X = x0) = (1/K) ∑_{i ∈ N0} I(yi = j)

where N0 denotes the K training observations nearest to x0.
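A KNN classification sketch matching this rule, on simulated two-class data (the data and k = 3 are illustrative assumptions):

```python
# Estimate Pr(Y = j | X = x0) by the fraction of the K nearest neighbors
# in class j, then predict the class with the largest estimate.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)            # 0 = "blue", 1 = "orange"

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
x0 = np.array([[1.0, 1.0]])
print(knn.predict_proba(x0))                 # estimated class probabilities
print(knn.predict(x0))                       # class with the largest probability
```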
KNN Example with k = 3

For the observation X, the predicted class is blue. Why?


Simulated Data: K = 10
K = 1 and K = 100
Training vs. Test Error Rates on the Simulated Data

➢Notice that training error rates keep going down as k decreases, or equivalently as the flexibility (1/k) increases.

➢However, the test error rate at first decreases but then starts to increase again.
A Fundamental Picture
➢In general, training errors will always decline.

➢However, test errors will decline at first (as reductions in bias dominate) but will then start to increase again (as increases in variance dominate).

We must always keep this picture in mind when choosing a learning method. More flexible/complicated is not always better!
Exercises: Ch. 2