Chapter 2: Big Data
Model accuracy
Department of Statistics
Universidad Carlos III de Madrid
Outline
1. What
2. Why
3. How
4. Accuracy and interpretability
5. Regression vs Classification
6. Model accuracy
The advertising budgets are input variables while sales is an output variable.
[Figure: scatter plots of Sales against advertising budgets for TV, Radio, and Newspaper.]
[Figure: Income versus Years of Education (scatter with fitted curve), and a 3D surface of Income as a function of Years of Education and Seniority.]
Why estimate f ?
There are two main reasons: prediction and inference.
Inference: we are often interested in understanding the way that Y is
affected as X1, X2, ..., Xp change. Our goal is not necessarily to make
predictions for Y; instead, we want to understand the relationship
between X and Y, or more specifically, to understand how Y changes as
a function of X1, X2, ..., Xp.
For example, linear models allow for relatively simple and interpretable
models, but may not yield predictions as accurate as some other
approaches.
How do we estimate f ?
We will always assume that we have observed a set of n different data
points, called the training data. There are two main types of
statistical learning methods: parametric methods, which assume a
functional form for f, such as the linear model

f(X) = β0 + β1 X1 + . . . + βp Xp,

and non-parametric methods, which make no explicit assumption about the
functional form of f.
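As a sketch, the parametric (linear) form above can be fitted by least squares. The simulated data below, including the true coefficients and noise level, are illustrative assumptions, not values from the slides:

```python
# Fit the linear model f(X) = beta_0 + beta_1 X_1 + ... + beta_p X_p
# by least squares on simulated data (toy setup).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])           # assumed true coefficients
y = 1.0 + X @ beta_true + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # close to [1.0, 2.0, -1.0, 0.5]
```

With n = 100 observations and moderate noise, the estimated coefficients land close to the true ones, which is the appeal of a parametric method when its assumed form is (approximately) correct.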
[Figure: two 3D fits of Income as a function of Years of Education and Seniority: a rigid linear (plane) fit and a more flexible smooth fit.]
[Figure: interpretability versus flexibility tradeoff. Methods range from high interpretability and low flexibility (Subset Selection, Lasso, Least Squares) to low interpretability and high flexibility (Bagging, Boosting).]
Regression vs Classification
Random variables can be classified as quantitative (numerical) or
qualitative (categorical). Problems with a quantitative response are
referred to as regression problems, while those with a qualitative
response are classification problems.
In the regression setting, the most common measure of accuracy is the
mean squared error (MSE),

MSE = (1/n) Σi (yi − f̂(xi))².

The MSE above is computed using the training data, and so should more
accurately be referred to as the training MSE. But in general we do not
really care how well the method works on the training data. Rather, we
are interested in the accuracy of the predictions that we obtain for
previously unseen test data:

Ave(y0 − f̂(x0))²,

the average squared prediction error for these test observations (x0, y0).
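The distinction between training and test MSE can be sketched with a small simulation; the true function sin(2x) and the noise level below are invented for illustration:

```python
# Compute training MSE and test MSE for one fitted model (toy example).
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    return np.sin(2 * x)          # assumed true regression function

x_train = rng.uniform(0, 3, 50)
y_train = f_true(x_train) + rng.normal(scale=0.3, size=50)
x_test = rng.uniform(0, 3, 50)    # fresh, previously unseen observations
y_test = f_true(x_test) + rng.normal(scale=0.3, size=50)

# Fit a cubic polynomial by least squares.
coefs = np.polyfit(x_train, y_train, deg=3)
train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
print(train_mse, test_mse)
```

The test MSE is computed on observations the model never saw, which is exactly the quantity we care about in practice.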
[Figure: left, simulated data (Y versus X) with fits of increasing flexibility; right, training MSE and test MSE as a function of flexibility.]
As model flexibility increases, training MSE will decrease, but the test
MSE may not. When a given method yields a small training MSE but a
large test MSE, we are said to be overfitting the data.
When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
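The overfitting pattern described above can be demonstrated with a toy simulation. The true curve sin(2πx), the noise level, and the chosen polynomial degrees are all illustrative assumptions:

```python
# As polynomial degree (flexibility) grows, training MSE keeps falling,
# while test MSE eventually rises: the high-degree fit chases noise.
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)
x_test = np.linspace(x_train.min(), x_train.max(), 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=200)

train_mse, test_mse = {}, {}
for deg in (1, 3, 15):
    c = np.polyfit(x_train, y_train, deg)
    train_mse[deg] = np.mean((y_train - np.polyval(c, x_train)) ** 2)
    test_mse[deg] = np.mean((y_test - np.polyval(c, x_test)) ** 2)
    print(f"degree {deg:2d}: train {train_mse[deg]:.3f}, test {test_mse[deg]:.3f}")
```

The degree-15 fit passes close to every training point (tiny training MSE) but wiggles wildly between them, so its test MSE is worse than that of the moderate degree-3 fit.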
[Figure: a smoother true f with fits of different flexibility (left) and the corresponding training and test MSE curves (right).]
Here the truth is smoother, so the smoother fit and linear model do really
well.
[Figure: a wiggly true f with low noise; fits of different flexibility (left) and mean squared error versus flexibility (right).]
Here the truth is wiggly and the noise is low, so the more flexible fits do
the best.
[Figure: squared bias (Bias), variance (Var), and test MSE as a function of flexibility for the three previous examples.]
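Behind curves like these sits the standard decomposition E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε), which can be checked by simulation. The toy setup below (true f(x) = x², a deliberately rigid linear fit, noise sd 0.3) is an assumption for illustration:

```python
# Verify the bias-variance decomposition at one test point x0 by
# refitting the model on many fresh training sets.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2              # assumed true regression function
x0, sigma = 0.5, 0.3

preds = []
for _ in range(2000):
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    c = np.polyfit(x, y, 1)       # deliberately too rigid: straight line
    preds.append(np.polyval(c, x0))
preds = np.array(preds)

variance = preds.var()                      # Var(fhat(x0))
bias2 = (preds.mean() - f(x0)) ** 2         # Bias(fhat(x0))^2
expected_mse = variance + bias2 + sigma ** 2

# Direct Monte Carlo estimate of E[(y0 - fhat(x0))^2] for comparison.
y0 = f(x0) + rng.normal(scale=sigma, size=preds.size)
direct_mse = np.mean((y0 - preds) ** 2)
print(variance, bias2, expected_mse, direct_mse)
```

The directly simulated expected test MSE matches the sum of the three components, and it can never fall below the irreducible error Var(ε) = σ².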
As in the regression setting, we are most interested in the error rates that
result from test observations that were not used in training.
The test error rate is minimized, on average, by the Bayes Classifier that
assigns each observation to the most likely class, given its predictor
values. That is, we should simply assign a test observation with predictor
vector x0 to the class j with largest Pr(Y = j|X = x0 ).
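A minimal sketch of the Bayes classifier on a toy problem where the class-conditional densities are assumed known (two Gaussian classes with equal priors; this setup is invented for illustration, since in practice these densities are unknown):

```python
# Bayes classifier for a 1-D two-class problem with known densities:
# class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), equal priors 0.5 / 0.5.
from math import exp, sqrt, pi

def normal_pdf(x, mu):
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)

def bayes_classify(x0):
    # Posterior-proportional scores prior * density; pick the larger.
    p0 = 0.5 * normal_pdf(x0, -1.0)
    p1 = 0.5 * normal_pdf(x0, +1.0)
    return 0 if p0 >= p1 else 1

print(bayes_classify(-0.3), bayes_classify(0.3))  # 0 1
```

In this symmetric setup the Bayes decision boundary is the point x = 0, where the two posterior probabilities are equal, mirroring the 0.5/0.5 boundary described below.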
[Figure: simulated two-class data (orange and blue) in (X1, X2) space with the Bayes decision boundary shown as a purple dashed line.]
The purple dashed line is called the Bayes decision boundary, where
Pr(Y = orange|X ) = Pr(Y = blue|X ) = 0.5.
The Bayes classifier produces the lowest possible test error rate, called
the Bayes error rate.
Since the Bayes classifier chooses the class for which Pr(Y = j|X = x0) is
largest, the error rate at X = x0 is 1 − maxj Pr(Y = j|X = x0), and the
overall Bayes error rate is

1 − E[ maxj Pr(Y = j|X) ],

where the expectation averages the probability over all possible values of
X.
In practice the Bayes classifier is unattainable, because the conditional
distribution of Y given X is unknown. The K-nearest neighbours (KNN)
classifier estimates it: given a test observation x0, KNN identifies the
K training points closest to x0, denoted N0, and estimates

Pr(Y = j|X = x0) = (1/K) Σi∈N0 I(yi = j).

Finally, KNN applies the Bayes rule and classifies the test observation
x0 to the class with the largest estimated probability.
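The KNN rule can be sketched in a few lines; the toy training points below are an invented example:

```python
# Minimal KNN classifier: estimate Pr(Y = j | X = x0) by the fraction of
# the K nearest neighbours in class j, then take the most frequent class.
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    dists = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of K closest points
    votes = np.bincount(y_train[nearest])         # class counts among them
    return int(np.argmax(votes))

# Toy two-class training set: a cluster near (0, 0) and one near (1, 1).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.1, 0.1])))  # 0
print(knn_predict(X, y, np.array([1.0, 0.9])))  # 1
```

The choice of K controls flexibility: K = 1 gives a highly flexible, jagged decision boundary, while a large K gives a smoother, more rigid one.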
[Figure: illustration of the KNN classifier on a small two-class example: the nearest neighbours of a test point determine its predicted class.]
[Figures: KNN decision boundaries on the simulated two-class data in (X1, X2) space, compared with the Bayes decision boundary, for different values of K, including a very small and a very large K.]
[Figure: training error rate and test error rate as a function of 1/K; the test error rate is U-shaped.]
Final comments
Choosing the correct level of flexibility is critical to the success of
any statistical learning method. The bias-variance tradeoff, and the
resulting U-shape in the test error, can make this a difficult task.
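One practical response, sketched below, is to estimate the test error on a held-out validation set and pick the flexibility level that minimises it. The data-generating process and candidate degrees are invented for illustration:

```python
# Choose polynomial degree (flexibility) by minimising held-out MSE.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)
x_tr, y_tr = x[:100], y[:100]      # training half
x_val, y_val = x[100:], y[100:]    # validation half

val_mse = {}
for deg in range(1, 10):
    c = np.polyfit(x_tr, y_tr, deg)
    val_mse[deg] = np.mean((y_val - np.polyval(c, x_val)) ** 2)

best = min(val_mse, key=val_mse.get)
print(best, val_mse[best])
```

The validation MSE traces out the U-shape: too-rigid fits (degree 1) and too-flexible fits (high degree) both do worse than an intermediate choice.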