Multiple Regression A
Outline
Formulation of Multiple Regression Model
Measures of Variation
𝑅² and Adjusted 𝑅²
Inferences about the Parameters
Overall Model Significance
A Particular 𝑋-Variable Significance
Assumptions and Residual Analysis
Prediction of New Observations
Multiple Regression Model - Formulation
A multiple regression model relates one response variable to two or more predictor variables through a linear function
𝑌𝑖 = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ⋯ + 𝛽𝐾𝑋𝐾𝑖 + 𝜀𝑖
where 𝛽0 is the population intercept and 𝛽1, …, 𝛽𝐾 are the population slope coefficients
Multiple Regression Model - Formulation
Coefficients in a multiple regression net out the impact of
each predictor in the regression equation
The estimated slope coefficient, 𝑏𝑖 , measures the change
in the average value of 𝑌 as a result of a one-unit increase
in 𝑋𝑖 , holding all other covariates constant
Multiple Regression Model - Formulation
Suppose the true relationship is 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀
and we fit a simple linear regression 𝐸(𝑌|𝑋1 ) = 𝛼0 + 𝛼1 𝑋1
Taking the conditional expectation of the true model given 𝑋1:
𝐸(𝑌|𝑋1) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝐸(𝑋2|𝑋1)
Then 𝛼1 = 𝛽1 provided that either 𝛽2 = 0 or 𝐸(𝑋2|𝑋1) does not change with 𝑋1 (i.e., 𝑋1 and 𝑋2 are unrelated); otherwise the simple-regression slope absorbs part of the effect of 𝑋2
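A quick numerical illustration of this omitted-variable effect, using toy simulated data (not the slides' dataset; numpy assumed available):

```python
# Omitted-variable demonstration (illustrative simulated data):
# when X2 is correlated with X1, the simple-regression slope alpha1
# differs from the multiple-regression coefficient beta1.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                    # X2 correlated with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)    # beta1 = 2, beta2 = 3

# Simple regression of Y on X1 alone: slope = Cov(Y, X1) / Var(X1)
alpha1 = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)

# alpha1 estimates beta1 + beta2 * Cov(X2, X1)/Var(X1) = 2 + 3*0.8 = 4.4
print(alpha1)
```

Here the simple-regression slope targets 𝛽1 + 𝛽2 Cov(𝑋2, 𝑋1)/Var(𝑋1), so it recovers 𝛽1 only when 𝛽2 = 0 or the two predictors are unrelated.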
Multiple Regression Model - Estimation
[Figure: observations (𝑋1𝑖, 𝑋2𝑖, 𝑌𝑖) scattered around the fitted regression plane 𝑌̂𝑖 = 𝒃𝟎 + 𝒃𝟏𝑿𝟏𝒊 + 𝒃𝟐𝑿𝟐𝒊; the residual 𝒆𝒊 is the vertical distance from each point to the plane]
Multiple Regression Model - Estimation
There is no simple formula for the LSE/MLE without using matrix
algebra -> we just resort to software to help us
FYI: we can write the regression equation in matrix form
𝒀 = 𝑿𝜷 + 𝜺, where 𝒀, 𝜺 are 𝑛 by 1 vectors, 𝑿 is an 𝑛 by (𝐾 + 1) matrix, and 𝜷 is a (𝐾 + 1) by 1 vector
so that each row represents the regression equation for
each observation
SSE can be expressed as 𝜺⊤𝜺 = (𝒀 − 𝑿𝜷)⊤(𝒀 − 𝑿𝜷)
Minimizing yields the LSE as 𝒃 = (𝑿⊤𝑿)⁻¹𝑿⊤𝒀
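The matrix formulas above can be sketched in a few lines; the numbers are made up for illustration (numpy assumed):

```python
# Minimal sketch of the matrix-form LSE b = (X'X)^{-1} X'Y on toy data;
# any dataset with an intercept column works the same way.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = 1.0 + 2.0 * X1 + 0.5 * X2 + np.array([0.1, -0.2, 0.0, 0.2, -0.1])

# Design matrix: column of ones (intercept) plus the K = 2 predictors
X = np.column_stack([np.ones(len(Y)), X1, X2])

b = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y, solved stably

# Error-variance estimate: S_e^2 = MSE = SSE / (n - K - 1)
resid = Y - X @ b
n, K = len(Y), 2
mse = resid @ resid / (n - K - 1)
print(b, mse)
```

In practice one would call a fitted-model routine (e.g. `lm` in R), but the normal-equations solve is all that is happening underneath.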
Multiple Regression Model - Estimation
The LSE/MLE exhibits good statistical properties
Unbiasedness: 𝐸(𝒃) = 𝜷 (mean equals true value)
Efficiency: Var(𝒃) = 𝝈𝝐²(𝑿⊤𝑿)⁻¹ is the minimum among all unbiased estimators
Consistency: 𝒃 converges to (gets closer and closer to) 𝜷 as the sample size increases to infinity
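A minimal Monte Carlo sketch of the unbiasedness property, on a toy fixed design (numpy assumed):

```python
# Averaging the LSE b over many simulated samples should come close to
# the true beta -- a quick check of E(b) = beta (illustrative setup).
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 2.0, -0.5])        # true (intercept, beta1, beta2)
n, reps = 50, 2000
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])     # fixed design across replications

est = np.zeros((reps, 3))
for r in range(reps):
    y = X @ beta + rng.normal(size=n)    # fresh error draw each replication
    est[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(est.mean(axis=0))                  # close to [1.0, 2.0, -0.5]
```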
Multiple Regression Model - Estimation
The estimate for the error variance 𝜎𝜀² is the same as before
𝑆𝑒² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸 / (𝑛 − 𝐾 − 1)
Residual degrees of freedom = 𝑛 − 𝐾 − 1
𝐾 is the number of 𝑋 variables in the model
𝐾 + 1 is the number of estimates 𝑏𝑖 in the model, including the intercept 𝑏0
Roughly speaking, residual d.f. is the effective sample size left after accounting for the plug-in estimates
Example
Recall the example in the last topic: the real estate agent wishes to examine the effects of the gross floor area and age of the building on the apartment price
Price = 𝛽0 + 𝛽1 GrossFA + 𝛽2 Age + 𝜀
where
Price = Price in million HK$
GrossFA = Gross floor area in ft²
Age = Age of building in years
Source: HKEA Transaction Records, http://www.hkea.com.hk/private/TransServ
Data: Transactions of residential apartments in Tseung Kwan O during 1 – 8 April 2014
Example Cont’d
Example – The Model &
Interpretation of Coefficients Cont’d
The estimated multiple regression equation
𝑌̂ = 2.3562 + 0.0042𝑋1 − 0.0404𝑋2
where 𝑌 = Price in million HK$
𝑋1 = Gross floor area in ft²
𝑋2 = Age of building in years
The estimated slope coefficients
𝑏1 = 0.0042 says that the estimated average price of an apartment increases by HK$4,200 for each additional square foot of gross floor area, for buildings of the same age
𝑏2 = −0.0404 says that the estimated average price of an apartment decreases by HK$40,400 for each additional year of building age, for apartments of the same size
Confidence Interval (CI)
Confidence interval estimate for a particular slope
coefficient
𝑏𝑖 ± 𝑡𝛼/2, 𝑛−𝐾−1 × 𝑆𝑏𝑖
In R, use the “confint” function
99% CI for 𝛽𝐺𝑟𝑜𝑠𝑠𝐹𝐴, using 𝑡0.005, 77 = 2.6412 with d.f. = 𝑛 − 𝐾 − 1 = 77
= 0.0042 ± 2.6412 × 0.0005 = [0.0028, 0.0055]
99% CI for 𝛽𝐴𝑔𝑒
= −0.0404 ± 2.6412 × 0.0157 = [−0.0819, 0.0012]
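The same 99% CI can be reproduced from the reported estimate and standard error (scipy assumed; the lower endpoint may differ from the slide in the last digit, since the slide presumably rounds from unrounded inputs):

```python
# 99% CI for beta_GrossFA from b1 = 0.0042, S_b1 = 0.0005, d.f. = 77
from scipy import stats

b1, se1, df = 0.0042, 0.0005, 77
t_crit = stats.t.ppf(1 - 0.01 / 2, df)   # t_{0.005,77}, about 2.6412
lo, hi = b1 - t_crit * se1, b1 + t_crit * se1
print(round(t_crit, 4), round(lo, 4), round(hi, 4))
```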
Model Evaluation
Evaluate the model with
𝑅 2 and adjusted 𝑅 2
t-test for a particular 𝑋-variable significance
F-test for overall model significance
Residual analysis
Analysis of Variance
Again we want to compare the fitted model against the
null model
Total variation of the 𝑌-variable is made up of two parts
𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸
where
𝑆𝑆𝑇 = Σ𝑖 (𝑌𝑖 − 𝑌̄)² (total sum of squares)
𝑆𝑆𝑅 = Σ𝑖 (𝑌̂𝑖 − 𝑌̄)² (regression sum of squares)
𝑆𝑆𝐸 = Σ𝑖 (𝑌𝑖 − 𝑌̂𝑖)² (error sum of squares)
with sums over 𝑖 = 1, …, 𝑛
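The decomposition is easy to verify numerically on any least-squares fit that includes an intercept (toy numbers; numpy assumed):

```python
# Numerical check of SST = SSR + SSE for a least-squares fit with intercept
import numpy as np

X1 = np.array([700.0, 800.0, 650.0, 900.0, 750.0, 850.0])
X2 = np.array([10.0, 5.0, 20.0, 3.0, 15.0, 8.0])
Y  = np.array([5.1, 6.0, 4.2, 7.1, 5.0, 6.4])

X = np.column_stack([np.ones(len(Y)), X1, X2])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
Yhat = X @ b

SST = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum((Yhat - Y.mean()) ** 2)
SSE = np.sum((Y - Yhat) ** 2)
print(SST, SSR + SSE)                    # the two agree
```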
Measures of Variation – 𝑅² Cont’d
[Figure: Venn diagram of the variations of Price, GrossFA and Age]
The blue part: SSE, the variation of Price attributable to factors other than GrossFA and Age
The grey, orange and purple parts: SSR, the variation of the 𝑌-variable that is explained by the regression equation with all 𝑋-variables GrossFA and Age
𝑅² = 𝑆𝑆𝑅/𝑆𝑆𝑇 = 𝑆𝑆𝑅/(𝑆𝑆𝑅 + 𝑆𝑆𝐸)
Measures of Variation – 𝑅² Cont’d
What is the net effect of adding a new 𝑋-variable?
𝑅² never decreases, even if the new 𝑋-variable explains an insignificant proportion of the variation of the 𝑌-variable
Is it fair to use 𝑅² for comparing models with different numbers of 𝑋-variables?
A degree of freedom is lost, as a slope coefficient has to be estimated for the new 𝑋-variable
Did the new 𝑋-variable add enough explanatory power to offset the loss of one degree of freedom?
Measures of Variation – Adjusted 𝑅²
Adjusted 𝑅² = 1 − [𝑆𝑆𝐸/(𝑛 − 𝐾 − 1)] / [𝑆𝑆𝑇/(𝑛 − 1)] = 1 − (𝑛 − 1)/(𝑛 − 𝐾 − 1) × (1 − 𝑅²)
Measures the proportion of variation of the 𝑌𝑖 values that is explained by the regression equation with the independent variables 𝑋1, 𝑋2, …, 𝑋𝐾, after adjusting for the sample size (𝑛) and the number of 𝑋-variables used (𝐾)
Smaller than 𝑅², and can be negative
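A small sketch of the adjusted-𝑅² identity; the 𝑅² values fed in below are purely illustrative, not the example's output:

```python
# Adjusted R^2 from R^2, n and K, via 1 - (n-1)/(n-K-1) * (1 - R^2)
def adjusted_r2(r2, n, K):
    return 1 - (n - 1) / (n - K - 1) * (1 - r2)

# With n = 80 observations and K = 2 predictors (hypothetical R^2 = 0.63):
print(adjusted_r2(0.63, 80, 2))
# A weak model in a small sample can give a negative adjusted R^2:
print(adjusted_r2(0.05, 10, 3))
```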
Example Cont’d
Compare the model using GrossFA only to the model
using both GrossFA and Age, which one fits better?
Inferences about the Parameters –
A Particular 𝑋-Variable Significance
t-test for a particular 𝑋-variable significance
𝐻0 : 𝛽𝑖 = 0 (𝑋𝑖 has no linear relationship with 𝑌)
𝐻1 : 𝛽𝑖 ≠ 0 (𝑋𝑖 is linearly related to 𝑌)
t-statistic 𝑡 = (𝑏𝑖 − 𝛽𝑖)/𝑆𝑏𝑖, where 𝑆𝑏𝑖 = standard error of 𝑏𝑖; under 𝐻0 it follows a 𝑡 distribution with 𝑛 − 𝐾 − 1 d.f.
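Using the example's Age coefficient (𝑏2 = −0.0404, 𝑆𝑏2 = 0.0157, 77 d.f.), the t-test can be reproduced as follows (scipy assumed):

```python
# t-test for one coefficient: t = b2 / S_b2 under H0: beta2 = 0
from scipy import stats

b2, se2, df = -0.0404, 0.0157, 77
t_stat = b2 / se2
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value
print(round(t_stat, 3), round(p_value, 4))  # p close to the slides' 0.0122
```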
Example Cont’d
At 5% level of significance, which 𝑋-variable(s), GrossFA or
Age is significantly affecting the apartment price?
According to the t-test results, the p-values for GrossFA and Age are 4.61 × 10−12 and 0.0122 respectively. Both are smaller than 5%, indicating that both variables are significantly related to the apartment price.
Inferences about the Parameters –
Overall Model Significance
F-test for the overall model significance
𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝐾 = 0
(none of the 𝑋-variables affects 𝑌)
𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0
(at least one 𝑋-variables affects 𝑌)
F-statistic F = 𝑀𝑆𝑅/𝑀𝑆𝐸 = [𝑆𝑆𝑅/𝐾] / [𝑆𝑆𝐸/(𝑛 − 𝐾 − 1)]
p-value = 𝑃(𝐹𝐾,(𝑛−𝐾−1) ≥ F)
𝐹 distribution with 𝐾, (𝑛 − 𝐾 − 1) d.f.
Reject 𝐻0 if F > C. V. = 𝐹𝛼,𝐾,(𝑛−𝐾−1) or p-value < 𝛼
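The F-test decision for the example can be checked directly (scipy assumed; 𝑛 = 80 follows from d.f. = 77 with 𝐾 = 2):

```python
# F-test decision: F = 65.94 against the critical value F_{0.05, 2, 77}
from scipy import stats

K, n = 2, 80
F = 65.94
cv = stats.f.ppf(0.95, K, n - K - 1)     # critical value, about 3.115
p_value = stats.f.sf(F, K, n - K - 1)    # P(F_{2,77} >= 65.94), essentially 0
print(round(cv, 3), F > cv, p_value < 0.05)
```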
Example Cont’d
Is the overall model significant?
𝐻0 : 𝛽𝐺𝑟𝑜𝑠𝑠𝐹𝐴 = 𝛽𝐴𝑔𝑒 = 0
𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0
F = 65.94
p-value < 2.2 × 10−16
At 𝛼 = 5% with d.f. = (2, 77), C.V. = 3.115
Since F = 65.94 > 3.115 and p-value < 5%, reject 𝐻0: the model is significant overall
Example Cont’d
What is the expected price for an apartment with gross floor area 950 ft² and 2 years old?
The point prediction given by the estimated multiple regression equation is
𝑌̂ = 2.3562 + 0.0042𝑋1 − 0.0404𝑋2
= 2.3562 + 0.0042 × 950 − 0.0404 × 2 = 6.2654 (million HK$)
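The arithmetic above in a couple of lines:

```python
# Point prediction from the fitted equation for GrossFA = 950, Age = 2
b0, b1, b2 = 2.3562, 0.0042, -0.0404
gross_fa, age = 950, 2
y_hat = b0 + b1 * gross_fa + b2 * age
print(round(y_hat, 4))                   # 6.2654 (million HK$)
```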
Prediction of New Observations
Given the same covariate values
the point prediction for a single response and the mean
response are the same
but the confidence intervals are different!
Again, the standard errors are related by 𝑆𝑃² = 𝑆𝑒² + 𝑆𝑚²
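A sketch of how the two standard errors combine; the numbers are hypothetical, only to show that 𝑆𝑃 always exceeds 𝑆𝑚, so the prediction interval is wider than the confidence interval:

```python
# S_P^2 = S_e^2 + S_m^2: the prediction-interval standard error combines
# the error SD with the standard error of the estimated mean response.
import math

S_e = 0.65    # estimated error SD, S_e = sqrt(MSE)  (hypothetical value)
S_m = 0.12    # standard error of the mean response  (hypothetical value)
S_P = math.sqrt(S_e**2 + S_m**2)
print(round(S_P, 4))   # larger than S_m, so the PI is wider than the CI
```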
Example Cont’d
For an apartment of gross floor area 950 ft² and 2 years old
Construct a 95% confidence interval for the mean
apartment price , and
Construct a 95% prediction interval for the apartment price
Regression Assumptions
Linearity of regression equation
𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝐾 𝑋𝐾𝑖 is a linear function
Error normality
𝜀𝑖 has a normal distribution for all 𝑖
Constant variances of errors
Var 𝜀𝑖 |𝑋1𝑖 , 𝑋2𝑖 , ⋯ , 𝑋𝐾𝑖 = 𝜎𝜀2
Error independence
𝜀𝑖 are independent for all 𝑖
Residual Analysis
Check the regression assumptions by examining the
residuals, 𝑒𝑖 = 𝑌𝑖 − 𝑌̂𝑖
Plot
Residuals against 𝑌̂
As 𝑌̂ shows the combined effects of all 𝑋-variables
Residuals against each 𝑋-variable so as to identify any
violations of assumptions specifically related to a particular
𝑋-variable
Residuals against index for checking error independence
Histogram of the residuals for examining error normality
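A numeric counterpart to these residual checks, on simulated data loosely shaped like the apartment example (numpy assumed). The least-squares fit guarantees the zero mean and the orthogonality to each 𝑋-variable shown below; normality and constant variance still have to be judged from the plots:

```python
# For an LSE fit with an intercept, residuals average to zero and are
# uncorrelated with each X-variable (toy simulated data for illustration).
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.uniform(400, 1200, n)          # e.g. gross floor area
X2 = rng.uniform(0, 40, n)              # e.g. building age
Y = 2.0 + 0.004 * X1 - 0.04 * X2 + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), X1, X2])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ b

print(np.mean(resid))                    # ~ 0
print(resid @ X1, resid @ X2)            # ~ 0: orthogonal to each X
```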