Chapter 2: Statistical Learning
Outline
➢What Is Statistical Learning?
➢ Why estimate f?
➢ How do we estimate f?
➢ The trade-off between prediction accuracy and model interpretability
➢ Supervised vs. unsupervised learning
➢ Regression vs. classification problems
➢ Assessing quality of fit
➢ The bias-variance trade-off
➢ Nearest neighbors for classification
What is Statistical Learning?
[Figure: scatter plots of Sales against the TV, Radio, and Newspaper advertising budgets.]
Suppose that, given the budgets for these three media, our aim is to build a model that predicts sales accurately.
Shown are scatter plots of Sales vs TV, Radio and Newspaper, with a
blue linear-regression line fit separately to each.
Can we predict Sales using these three variables?
Perhaps we can do better using a model:
Sales ≈ f(TV, Radio, Newspaper)
More generally, we assume the relationship

Yᵢ = f(Xᵢ) + εᵢ

➢where f is an unknown function and ε is a random error term with mean zero.
A Simple Example
[Figure: simulated observations of y plotted against x.]
[Figure: the same data with the true function f overlaid; the vertical deviations of the points from f are the errors εᵢ.]
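To make the model concrete, here is a minimal Python sketch that simulates data from Yᵢ = f(Xᵢ) + εᵢ; the particular f and noise level are made-up choices for illustration, not anything from the course data:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A hypothetical "true" function; in practice f is unknown
    return np.sin(2 * x) / 20

n = 100
x = rng.uniform(0, 5, size=n)
eps = rng.normal(loc=0.0, scale=0.02, size=n)  # random error with mean zero
y = f(x) + eps                                 # observed responses: signal plus noise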
Why Do We Estimate f?
➢Statistical Learning, and this course, are all about how to
estimate f.
➢The term statistical learning refers to using the data to
“learn” f.
➢Why do we care about estimating f?
➢There are two reasons for estimating f:
➢ Prediction
➢ Inference
Inference may involve testing relevant subject-matter theory (e.g., economic theory) and suggesting policy based on this inference.
1. Prediction
➢If we can produce a good estimate for f (and the variance of ε is not too large), we can make accurate predictions for the response, Y, based on a new (unseen) value of X.
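➢To see why the variance of ε matters, recall the standard decomposition of the expected squared prediction error for Ŷ = f̂(X) (treating f̂ and X as fixed):

E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε)

➢The first term is the reducible error, which a better estimate of f can shrink; the second is the irreducible error, a floor on prediction accuracy no matter how well we estimate f.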
2. Inference
➢Alternatively, we may also be interested in the type of
relationship between Y and the X's.
➢For example,
➢ Which predictors actually affect the response?
➢ Is the relationship positive or negative?
➢ Is the relationship a simple linear one, or is it more complicated?
How Do We Estimate f?
➢We will assume we have observed a set of training data, {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, which we will use to estimate f.
Parametric Methods
➢Parametric methods reduce the problem of estimating f down to one of estimating a set of parameters.
STEP 1:
Make some assumption about the functional form of f, i.e., come up with a model. The most common example is a linear model, i.e.,

f(Xᵢ) = β₀ + β₁Xᵢ₁ + β₂Xᵢ₂ + … + βₚXᵢₚ
However, in this course we will examine far more complicated, and flexible, models for f. In a sense, the more flexible the model, the more realistic it is.
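The natural second step is then to use the training data to fit the model, i.e., to estimate the parameters β₀, β₁, …, βₚ; for the linear model this is most commonly done by least squares. A minimal sketch in Python, using made-up one-predictor data for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = np.sin(2 * x) / 20 + rng.normal(0, 0.02, size=100)

# Design matrix with an intercept column: each row is (1, x_i)
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize ||y - X @ beta||^2 over beta
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

f_hat = lambda x_new: beta[0] + beta[1] * x_new  # the fitted linear model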
[Figure: two scatter plots of a simulated data set, y against x.]
Non-parametric Methods
➢They do not make explicit assumptions about the functional form of f.
➢Advantages: they can accurately fit a wider range of possible shapes of f.
➢Disadvantages: a very large number of observations is required to obtain an accurate estimate of f.
Example of a non-parametric method: the K-nearest neighbors (KNN) method (see the sketch below).
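A minimal sketch of KNN in a regression setting, where f(x) is estimated by averaging the responses of the K training points closest to x (K = 5 and the simulated data are illustrative choices):

import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    # Average the responses of the k training points nearest to x_new
    dists = np.abs(x_train - x_new)   # one predictor, so distance is |x_i - x_new|
    nearest = np.argsort(dists)[:k]   # indices of the k closest neighbors
    return y_train[nearest].mean()

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 5, size=200)
y_train = np.sin(2 * x_train) / 20 + rng.normal(0, 0.02, size=200)

print(knn_predict(x_train, y_train, x_new=2.5, k=5))

Note that no functional form for f was assumed anywhere; the estimate is driven entirely by the local data.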
Reason 2:
Even if you are only interested in prediction, so the first reason is not relevant, it is often possible to get more accurate predictions with a simple model than with a complicated one. This seems counterintuitive, but it has to do with the fact that a more flexible model is harder to fit, and it can end up chasing patterns in the training data that do not generalize.
➢Unsupervised Learning:
➢ In this situation, only the Xᵢ's are observed; there is no response variable Y.
➢ A common example is market segmentation, where we try to divide potential customers into groups based on their characteristics.
➢ A common approach is clustering (see the sketch below).
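A minimal sketch of clustering with k-means via scikit-learn, on synthetic two-dimensional data standing in for customer characteristics (the group centers and sizes are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic groups of "customers" in two dimensions
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (3, 3), (0, 4)]
])

# Note: no labels are supplied; only the X's are used
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignments for the first 10 observations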
[Figure: a clustering data set involving three groups, each shown using a different colored symbol. Left: the three groups are well separated; in this setting, a clustering approach should successfully identify the three groups. Right: there is some overlap among the groups, and the clustering task is more challenging.]
Different Approaches
➢We will deal with both types of problems in this course.
➢Some methods work well on both types of problem, e.g., neural networks.
➢Other methods work best on regression, e.g., linear regression, or on classification, e.g., k-nearest neighbors.
• Note: There are other learning settings as well, including online learning, reinforcement learning, etc.
Measuring Quality of Fit (in the Regression Context)
➢Suppose we have a regression problem.
➢One common measure of accuracy in regression is the mean squared error (MSE), i.e.,

MSE_TR = (1/n) Σᵢ₌₁ⁿ [yᵢ − f̂(xᵢ)]²

➢where f̂(xᵢ) is the prediction our method gives for the iᵗʰ observation in our training data.
➢The MSE on test data (MSE_TE) is defined similarly, using observations that were not used to fit the method (see the sketch below).
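A minimal sketch of computing MSE_TR and MSE_TE in Python; the data, the train/test split, and the linear fit are all illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = np.sin(2 * x) / 20 + rng.normal(0, 0.02, size=200)

# Hold out the last 50 observations as test data
x_tr, y_tr = x[:150], y[:150]
x_te, y_te = x[150:], y[150:]

# Fit a straight line by least squares on the training data only
f_hat = np.poly1d(np.polyfit(x_tr, y_tr, deg=1))

mse_tr = np.mean((y_tr - f_hat(x_tr)) ** 2)  # training MSE
mse_te = np.mean((y_te - f_hat(x_te)) ** 2)  # test MSE
print(mse_tr, mse_te)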
A Problem
➢In either case, our method has generally been designed to make the MSE small on the training data at hand, e.g., with linear regression we choose the line such that the training MSE is minimized.
➢However, what we really care about is performance on new test data: a small training MSE does not guarantee a small test MSE.
Examples with Different Levels of Flexibility: Example 1
[Figure. Left: black = truth; orange = linear estimate; blue = smoothing spline; green = smoothing spline (more flexible). Right: red = test MSE; grey = training MSE; dashed = minimum possible test MSE (irreducible error).]
Examples with Different Levels of Flexibility: Example 2
[Figure: same layout and legend as Example 1.]
Examples with Different Levels of Flexibility: Example 3
[Figure: same layout and legend as Example 1.]
Bias-Variance Trade-off
➢The previous graphs of test versus training MSE illustrate a very important trade-off that governs the choice of statistical learning methods.
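➢For a test point x₀, the expected test MSE admits the standard bias-variance decomposition

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

➢More flexible methods tend to have lower bias but higher variance, so the test MSE is typically minimized at an intermediate level of flexibility.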