Chapter 2: Big Data

Chapter 2 of 'Big Data for Business' discusses statistical learning, focusing on the relationship between input variables (predictors) and output variables (responses) for improving predictions and understanding relationships. It covers the importance of model accuracy, the distinction between regression and classification, and the trade-offs between accuracy and interpretability in statistical methods. Additionally, it highlights the bias-variance trade-off and the challenges of selecting the best statistical learning method for different data sets.


Big data for Business

CHAPTER 2: STATISTICAL LEARNING

Department of Statistics
Universidad Carlos III de Madrid

Bachelor in Business Administration


Bachelor in Finance and Accounting

Outline: 1. What 2. Why 3. How 4. Accuracy and interpretability 5. Regression vs Classification 6. Model accuracy

What Is Statistical Learning?


Suppose that we are statistical consultants hired by a client to provide
advice on how to improve sales of a particular product. The Advertising
data set consists of the Sales of that product in 200 different markets,
along with advertising budgets for three media: TV, Radio, and Newspaper.

The advertising budgets are input variables while sales is an output variable.
[Figure: scatterplots of Sales versus the TV, Radio, and Newspaper advertising budgets.]



In general, we suppose that we observe a quantitative response Y and p
different predictors, X1 , X2 , . . . , Xp and there is some relationship
between Y and X = (X1, X2, . . . , Xp), which can be written as:

Y = f(X) + ε

Here f is some fixed but unknown function of X = (X1, X2, . . . , Xp), and
ε is a random error term, which is independent of X and has mean zero.
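As a concrete sketch of this model, one can simulate responses satisfying Y = f(X) + ε. Everything here (the function f, the noise level, and the data) is invented purely for illustration; in practice f is unknown:

```python
import random
import statistics

random.seed(0)

def f(x):
    # Hypothetical "true" regression function; in practice f is unknown.
    return 3.0 + 0.5 * x

n = 200
X = [random.uniform(0, 10) for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]  # mean-zero error, independent of X
Y = [f(x) + e for x, e in zip(X, eps)]            # Y = f(X) + epsilon

# Because the error term has mean zero, the residuals Y - f(X) average near zero.
print(statistics.mean(y - f(x) for x, y in zip(X, Y)))
```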

As another example, we plot income versus years of education for 30
individuals in the Income1 data set.
[Figure: Income versus Years of Education for the 30 individuals in the Income1 data set.]



Now we plot income versus years of education and seniority for 30
individuals in the Income2 data set.

[Figure: 3-D plot of Income as a function of Years of Education and Seniority.]

Here f is a two-dimensional surface that must be estimated based on the
observed data.

Why estimate f ?
There are two main reasons: prediction and inference.

Prediction: We can predict Y using

Ŷ = f̂(X)

where one is not particularly interested in the exact form of f̂, provided
it yields accurate predictions. The accuracy of Ŷ depends on two kinds of error:

Reducible error: We can potentially improve the accuracy of f̂ by using
the most appropriate statistical learning technique to estimate f.

Irreducible error: No matter how well we estimate f, we cannot reduce
the error introduced by ε.

Assuming X and f̂ are fixed:

E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε)

where the first term is the reducible error and Var(ε) is the irreducible error.
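The decomposition can be checked numerically. In the sketch below, f, f̂, the test point x0, and the noise level are all hypothetical; the Monte Carlo average of (Y − Ŷ)² should come out close to the sum of the reducible and irreducible pieces:

```python
import random

random.seed(1)

def f(x):       # hypothetical true function (unknown in practice)
    return 3.0 + 0.5 * x

def f_hat(x):   # a hypothetical, slightly-off estimate of f
    return 2.5 + 0.6 * x

x0, sigma = 4.0, 1.0                  # fixed test point; Var(eps) = sigma^2 = 1
reducible = (f(x0) - f_hat(x0)) ** 2  # (f(X) - f_hat(X))^2

# Monte Carlo estimate of E[(Y - Y_hat)^2] at X = x0.
m = 100_000
mse = sum((f(x0) + random.gauss(0, sigma) - f_hat(x0)) ** 2 for _ in range(m)) / m

print(reducible + sigma ** 2, mse)  # the two numbers should nearly agree
```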


Inference: We are often interested in understanding the way that Y is
affected as X1, X2, . . . , Xp change. Here the goal is not necessarily to
make predictions for Y, but instead to understand the relationship
between X and Y, or more specifically, to understand how Y changes as
a function of X1, X2, . . . , Xp.

Depending on whether our ultimate goal is prediction, inference, or a


combination of the two, different methods for estimating f may be
appropriate.

For example, linear models allow for relatively simple and interpretable
models, but may not yield predictions as accurate as some other
approaches.

In contrast, highly non-linear approaches can potentially provide quite


accurate predictions for Y , but this comes at the expense of a less
interpretable model for which inference is more challenging.


How do we estimate f ?
We will always assume that we have observed a set of n different data
points, these are called training data. There are two main types of
statistical learning methods:

Parametric methods reduce the problem of estimating f to one of
estimating a set of parameters. For example, we may
assume that f is linear in X:

f(X) = β0 + β1 X1 + · · · + βp Xp

so that we only need to estimate (β0, β1, . . . , βp).


Non-parametric methods do not make explicit assumptions about the
functional form of f. However, they can overfit the data, in which
case they will not yield accurate estimates of the response on new
observations that were not part of the original training data set.
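The two approaches can be contrasted on simulated data. Everything below (the linear truth, the sample size, the helper `knn_predict`) is invented for the sketch: the parametric fit estimates just two numbers, while the non-parametric K-nearest-neighbour average assumes nothing about the form of f:

```python
import random

random.seed(2)

# Simulated training data from a hypothetical linear truth f(x) = 1 + 2x.
n = 100
X = [random.uniform(0, 10) for _ in range(n)]
Y = [1.0 + 2.0 * x + random.gauss(0, 1) for x in X]

# Parametric: assume f(X) = b0 + b1*X and estimate (b0, b1) by least squares.
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

# Non-parametric: K-nearest-neighbour averaging assumes no functional form for f.
def knn_predict(x0, k=5):
    nearest = sorted(range(n), key=lambda i: abs(X[i] - x0))[:k]
    return sum(Y[i] for i in nearest) / k

x0 = 5.0  # the true value f(5) is 11
print(b0 + b1 * x0, knn_predict(x0))  # both predictions land near 11
```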


[Figure: two 3-D fits of Income as a function of Years of Education and Seniority.]

Left: A linear model fit by least squares to the Income2 data.

Right: A rough thin-plate spline fit to the Income2 data.

Trade-Off between accuracy and interpretability


• There are advantages and disadvantages to parametric and
non-parametric methods for statistical learning.
1. If we are mainly interested in inference, then restrictive models are
much more interpretable.
2. Sometimes, however, we are only interested in prediction and the
interpretability of the predictive model is simply not of interest.
• In the second case, we might expect that it will be better to use the
most flexible model available. However, this is not in general correct
because highly flexible methods lead to overfitting.
[Figure: a schematic of the trade-off, ordering methods from high interpretability and low flexibility to low interpretability and high flexibility: Subset Selection and the Lasso; Least Squares; Generalized Additive Models; Trees; Bagging and Boosting; Support Vector Machines.]

Regression vs Classification
Random variables can be classified as:

Quantitative take on numerical values


Qualitative take on values in one of K different classes or categories.

Supervised learning problems are divided in:

Regression problems have a quantitative response


Classification problems have a qualitative response

The distinction is not always sharp. Logistic regression (Chapter 4) has
a qualitative (two-class, or binary) response and is therefore a
classification method. But since it estimates class probabilities, it can be
thought of as a regression method as well.

Some statistical methods, such as K-nearest neighbors (Chapters 2 and


4) and boosting (Chapter 8), can be used in the case of either
quantitative or qualitative responses.

Assessing Model Accuracy


No one method dominates all others over all possible data sets. Selecting
the best approach can be one of the most challenging parts of performing
statistical learning in practice.

In the regression setting, the most commonly used measure is the mean
squared error (MSE):

MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))²

where the sum runs over the n training observations.

The MSE is computed using the training data, and so should more
accurately be referred to as the training MSE. But in general, we do not
really care how well the method works on the training data. Rather, we
are interested in the accuracy of the predictions that we obtain for
previously unseen test data:

Ave[(y0 − f̂(x0))²],

the average squared prediction error for these test observations (x0 , y0 ).
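A minimal sketch of computing both quantities, under an invented data-generating process and a deliberately crude estimate f̂ (the constant training mean); all names and numbers here are hypothetical:

```python
import math
import random

random.seed(3)

truth = math.sin  # hypothetical true f (invented for the sketch)

def simulate(n):
    return [(x, truth(x) + random.gauss(0, 0.3))
            for x in (random.uniform(0, 6) for _ in range(n))]

train, test = simulate(50), simulate(1000)

def mse(f_hat, data):
    # Ave (y - f_hat(x))^2 over the observations (x, y) in `data`.
    return sum((y - f_hat(x)) ** 2 for x, y in data) / len(data)

# A deliberately crude estimate: predict the training mean everywhere.
ybar = sum(y for _, y in train) / len(train)
f_bar = lambda x: ybar

print(mse(f_bar, train))  # training MSE
print(mse(f_bar, test))   # test MSE on previously unseen observations
```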

Measuring the Quality of Fit

[Figure: simulated data with three fitted curves (left) and MSE versus flexibility (right).]

Left: Black curve is the truth. Three estimates of f: the linear regression
fit (orange) and two smoothing spline fits (blue and green). Right: Training
MSE (grey), test MSE (red) and minimum possible test MSE (dashed).
Squares represent the training and test MSE for the three fits.

Many statistical methods specifically estimate coefficients so as to


minimize the training set MSE. For these methods, the training set MSE
can be quite small, but the test MSE is often much larger.

As model flexibility increases, training MSE will decrease, but the test
MSE may not. When a given method yields a small training MSE but a
large test MSE, we are said to be overfitting the data.

When we overfit the training data, the test MSE will be very large
because the supposed patterns that the method found in the training
data simply don’t exist in the test data.
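Overfitting is easy to reproduce with K-nearest-neighbour regression on invented data: with k = 1 every training point is its own nearest neighbour, so the training MSE is exactly zero, yet the test MSE is worse than for a smoother fit. The data-generating process and helper names below are hypothetical:

```python
import random

random.seed(5)

def simulate(n):
    # Hypothetical data: linear truth 0.5x plus unit-variance noise.
    return [(x, 0.5 * x + random.gauss(0, 1))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = simulate(60), simulate(1000)

def knn_fit(x0, k):
    # K-nearest-neighbour regression fitted on the training set.
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    return sum((y - knn_fit(x, k)) ** 2 for x, y in data) / len(data)

for k in (1, 10):
    print(f"k={k}  training MSE={mse(train, k):.2f}  test MSE={mse(test, k):.2f}")
```

With k = 1 the method interpolates the training noise, so its training MSE is zero while its test MSE exceeds that of the k = 10 fit.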


[Figure: simulated data (left) and training/test MSE versus flexibility (right) for a second example.]

Here the truth is smoother, so the smoother fit and the linear model do
really well.

[Figure: simulated data (left) and training/test MSE versus flexibility (right) for a third example.]

Here the truth is wiggly and the noise is low, so the more flexible fits do
the best.

The Bias-Variance Trade-Off


The expected test MSE, for a given value x0, can always be decomposed
into the sum of three fundamental quantities:

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)

Variance refers to the amount by which fˆ would change if we estimated


it using a different training data set.

Bias refers to the error that is introduced by approximating a real-life


problem, which may be extremely complicated, by a much simpler model.

As a general rule, as we use more flexible methods, the variance will


increase and the bias will decrease.

As we increase the flexibility of a class of methods, the bias tends to


initially decrease faster than the variance increases. However, at some
point increasing flexibility starts to significantly increase the variance.
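The trade-off can be seen by refitting two hypothetical estimators on many simulated training sets (the estimators, names, and data-generating process below are all invented): a rigid fit has large squared bias but small variance at x0, while a very flexible fit behaves the other way around:

```python
import random

random.seed(6)

truth = lambda x: 0.5 * x   # hypothetical true f
x0, sigma, n = 8.0, 1.0, 50

def fit_mean(train):
    # Rigid estimate: predict the overall training mean everywhere.
    return sum(y for _, y in train) / len(train)

def fit_1nn(train):
    # Flexible estimate: predict the response of the single nearest training point.
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def bias2_and_var(fit, reps=2000):
    # Refit on many independent training sets and study the spread of f_hat(x0).
    preds = []
    for _ in range(reps):
        train = [(x, truth(x) + random.gauss(0, sigma))
                 for x in (random.uniform(0, 10) for _ in range(n))]
        preds.append(fit(train))
    mean = sum(preds) / reps
    var = sum((p - mean) ** 2 for p in preds) / reps
    return (mean - truth(x0)) ** 2, var

rigid = bias2_and_var(fit_mean)
flexible = bias2_and_var(fit_1nn)
print(rigid)     # large squared bias, small variance
print(flexible)  # small squared bias, larger variance
```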

The Bias-Variance Trade-Off for the three examples


[Figure: test MSE, squared bias, and variance plotted against flexibility for each of the three examples.]

In a real-life situation in which f is unobserved, it is not possible to


explicitly compute the test MSE, bias, or variance for a statistical
learning method. Nevertheless, one should always keep the bias-variance
trade-off in mind.

The Classification Setting

Suppose that y1, . . . , yn are qualitative. Then the analogue of the
training MSE is the training error rate, the proportion of mistakes
made on the training observations:

(1/n) Σᵢ I(yᵢ ≠ ŷᵢ)
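In code, with invented labels, the error rate is just the fraction of indicator mismatches:

```python
# Training error rate: the proportion of observations whose predicted
# class label differs from the true label.
y_true = [0, 1, 1, 0, 1]   # hypothetical true labels (invented for the sketch)
y_hat  = [0, 1, 0, 0, 0]   # hypothetical classifier output

error_rate = sum(1 for a, b in zip(y_true, y_hat) if a != b) / len(y_true)
print(error_rate)  # 2 of the 5 labels disagree, so the rate is 0.4
```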

As in the regression setting, we are most interested in the error rates that
result from test observations that were not used in training.

The test error rate is minimized, on average, by the Bayes Classifier that
assigns each observation to the most likely class, given its predictor
values. That is, we should simply assign a test observation with predictor
vector x0 to the class j with largest Pr(Y = j|X = x0 ).


The Bayes classifier


[Figure: simulated two-class data (orange and blue) in the (X1, X2) plane.]

The purple dashed line is called the Bayes decision boundary, where
Pr(Y = orange|X) = Pr(Y = blue|X) = 0.5.


The Bayes classifier produces the lowest possible test error rate, called
the Bayes error rate.

The Bayes classifier will choose the class for which Pr(Y = j|X = x0 ) is
largest, then the error rate at X = x0 will be 1 − maxj Pr(Y = j|X = x0 ).

In general, the overall Bayes error rate is given by:

1 − E[maxⱼ Pr(Y = j|X)]

where the expectation averages the probability over all possible values of
X.

The Bayes error rate corresponds to the irreducible error in the


classification setting.
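For a hypothetical model in which the class probabilities are known exactly (they never are in practice), both the Bayes classifier and its error rate can be written down directly. Here we invent Pr(Y = 1 | X = x) = x with X ~ Uniform(0, 1), so E[max(x, 1 − x)] = 3/4 and the Bayes error rate is exactly 1/4:

```python
import random

random.seed(7)

# Hypothetical two-class model (invented): X ~ Uniform(0, 1) and
# Pr(Y = 1 | X = x) = x, hence Pr(Y = 0 | X = x) = 1 - x.
def bayes_classify(x):
    # Assign to the most likely class given X = x.
    return 1 if x >= 0.5 else 0

print(bayes_classify(0.7), bayes_classify(0.3))

# Bayes error rate = 1 - E[max_j Pr(Y = j | X)], estimated by Monte Carlo.
m = 100_000
estimate = 1 - sum(max(x, 1 - x) for x in (random.random() for _ in range(m))) / m
print(estimate)  # close to the exact value 0.25
```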


K-Nearest Neighbors (KNN)


Given a positive integer K and a test observation x0, the KNN classifier
first identifies the K points in the training data that are closest to x0,
denoted by N0, and estimates the conditional probability

P̂r(Y = j|X = x0) = (1/K) Σᵢ∈N₀ I(yᵢ = j)

Finally, KNN applies the Bayes rule and classifies the test observation x0
to the class with the largest estimated probability.
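A from-scratch sketch of this rule on one-dimensional invented data (the training set and helper names are hypothetical):

```python
import random

random.seed(8)

# Hypothetical training data (invented): class 1 is likelier for larger x,
# with Pr(Y = 1 | X = x) = x on [0, 1].
train = [(x, 1 if random.random() < x else 0)
         for x in (random.random() for _ in range(200))]

def knn_prob(x0, j, k=20):
    # Estimated Pr(Y = j | X = x0): the fraction of the K nearest
    # neighbours of x0 (the set N0) whose response equals j.
    n0 = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(1 for _, y in n0 if y == j) / k

def knn_classify(x0, k=20):
    # Assign x0 to the class with the largest estimated probability.
    return max((0, 1), key=lambda j: knn_prob(x0, j, k))

print(knn_classify(0.95), knn_classify(0.05))
```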

[Figure: illustration of KNN on a small simulated training set.]

K-Nearest Neighbors (KNN)


[Figure: the KNN decision boundary using K = 10 on simulated two-class data. The Bayes decision boundary is shown as a purple dashed line.]



The choice of K has a drastic effect on the KNN classifier obtained.

[Figure: KNN decision boundaries obtained with K = 1 (left) and K = 100 (right).]


[Figure: training and test error rates plotted against 1/K.]

The training error rate consistently declines as the flexibility increases.


However, the test error exhibits a characteristic U-shape, declining at first
(with a minimum at approximately K = 10) before increasing again when
the method becomes excessively flexible and overfits.
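The same behaviour can be reproduced on invented one-dimensional data (all names and the data-generating process below are hypothetical): the training error at K = 1 is exactly zero, because each training point is its own nearest neighbour, while the test error stays well above zero:

```python
import random

random.seed(9)

def simulate(n):
    # Hypothetical two-class data (invented): Pr(Y = 1 | X = x) = x on [0, 1].
    return [(x, 1 if random.random() < x else 0)
            for x in (random.random() for _ in range(n))]

train, test = simulate(200), simulate(1000)

def knn_classify(x0, k):
    neighbours = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return 1 if 2 * sum(y for _, y in neighbours) >= k else 0

def error_rate(data, k):
    # Proportion of observations in `data` that the K-NN classifier gets wrong.
    return sum(1 for x, y in data if knn_classify(x, k) != y) / len(data)

for k in (1, 10, 100):
    print(f"k={k:3d}  train error={error_rate(train, k):.3f}  "
          f"test error={error_rate(test, k):.3f}")
```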

Final comments

In both the regression and classification settings, choosing the correct


level of flexibility is critical to the success of any statistical learning
method.

The bias-variance trade-off, and the resulting U-shape in the test error,
can make this a difficult task.

In Chapter 5, we return to this topic and discuss various methods for


estimating test error rates and thereby choosing the optimal level of
flexibility for a given statistical learning method.
