Multiple Regression A

The document outlines the formulation and estimation of multiple regression models, including the significance of predictor variables and the evaluation of model performance through measures like R² and adjusted R². It discusses the assumptions underlying the model, residual analysis, and the interpretation of coefficients using an example related to real estate prices. Additionally, it covers hypothesis testing for individual predictors and overall model significance using t-tests and F-tests.

Part A: General Concepts

Outline
 Formulation of Multiple Regression Model
 Measures of Variation
 𝑅 2 and Adjusted 𝑅 2
 Inferences about the Parameters
 Overall Model Significance
 A Particular 𝑋-Variable Significance
 Assumptions and Residual Analysis
 Prediction of New Observations

Multiple Regression Model - Formulation
 A multiple regression model relates one response variable to two or more predictor variables through a linear function
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝐾 𝑋𝐾𝑖 + 𝜀𝑖

where 𝑌𝑖 is the response, 𝛽0 is the population intercept, 𝛽1, …, 𝛽𝐾 are the population slope coefficients, 𝑋1𝑖, …, 𝑋𝐾𝑖 are the predictors, and 𝜀𝑖 is the random error

 𝑏0, 𝑏1, 𝑏2, …, 𝑏𝐾 are used to represent the sample intercept and sample slope coefficients
Multiple Regression Model - Formulation
 Equivalently, the linear regression model can be written as
 𝐸(𝑌𝑖 | 𝑋1𝑖, 𝑋2𝑖, ⋯, 𝑋𝐾𝑖) = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝐾 𝑋𝐾𝑖
 𝑉𝑎𝑟(𝑌𝑖 | 𝑋1𝑖, 𝑋2𝑖, ⋯, 𝑋𝐾𝑖) = 𝜎𝜀²
 𝑌𝑖 | 𝑋1𝑖, 𝑋2𝑖, ⋯, 𝑋𝐾𝑖 are independent and normally distributed
 In other words, 𝑌𝑖 | 𝑋1𝑖, 𝑋2𝑖, ⋯, 𝑋𝐾𝑖 are independent 𝑁(𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝐾 𝑋𝐾𝑖, 𝜎𝜀²)

Multiple Regression Model - Formulation
 Coefficients in a multiple regression net out the impact of
each predictor in the regression equation
 The estimated slope coefficient, 𝑏𝑖 , measures the change
in the average value of 𝑌 as a result of a one-unit increase
in 𝑋𝑖 , holding all other covariates constant

Multiple Regression Model - Formulation
 Suppose the true relationship is 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀
 and we fit a simple linear regression 𝐸(𝑌|𝑋1) = 𝛼0 + 𝛼1 𝑋1
 Since 𝐸(𝑌|𝑋1) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝐸(𝑋2|𝑋1), we have 𝛼1 = 𝛽1 provided that either
 𝑋2 is not related to 𝑌, i.e. 𝛽2 = 0, or
 𝑋2 is not related to 𝑋1, so that 𝐸(𝑋2|𝑋1) = 0
 Otherwise, in general, 𝛼1 ≠ 𝛽1‼
 Known as the confounding effect
 We may draw a faulty conclusion on the relationship between 𝑌 and 𝑋1 if we ignore the confounding variable 𝑋2 (Simpson's paradox)
Multiple Regression Model - Estimation

[Figure: observations (𝑋1𝑖, 𝑋2𝑖, 𝑌𝑖) plotted against the fitted regression plane 𝑌̂𝑖 = 𝑏0 + 𝑏1 𝑋1𝑖 + 𝑏2 𝑋2𝑖; the residual 𝑒𝑖 is the vertical distance between 𝑌𝑖 and 𝑌̂𝑖]
Multiple Regression Model - Estimation
 There is no simple formula for the LSE/MLE without matrix algebra, so in practice we rely on software
 FYI: we can write the regression equation in matrix form
 𝒀 = 𝑿𝜷 + 𝜺, where 𝒀, 𝜺 are 𝑛 by 1 vectors, 𝑿 is an 𝑛 by (𝐾 + 1) matrix, and 𝜷 is a (𝐾 + 1) by 1 vector
 so that each row represents the regression equation for one observation
 SSE can be expressed as 𝜺⊤𝜺 = (𝒀 − 𝑿𝜷)⊤(𝒀 − 𝑿𝜷)
 Minimizing yields the LSE as 𝒃 = (𝑿⊤𝑿)⁻¹𝑿⊤𝒀
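The matrix formulas above can be checked numerically. A minimal sketch in Python/NumPy (the course software is R; the simulated data, coefficients, and variable names below are illustrative assumptions, not from the slides) — it also computes the error-variance estimate 𝑆𝑒² = 𝑆𝑆𝐸/(𝑛 − 𝐾 − 1) introduced two slides later:

```python
import numpy as np

# Simulate n observations from Y = beta0 + beta1*X1 + beta2*X2 + error
rng = np.random.default_rng(0)
n, K = 200, 2
beta_true = np.array([2.0, 0.5, -1.5])            # (beta0, beta1, beta2), assumed
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])   # n by (K+1) design matrix
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# LSE: b = (X'X)^{-1} X'Y  (solving the normal equations is numerically
# safer than forming the explicit inverse)
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Error-variance estimate: Se^2 = MSE = SSE / (n - K - 1)
resid = Y - X @ b
SSE = resid @ resid
MSE = SSE / (n - K - 1)
```

With 200 observations, `b` lands close to `beta_true` and `MSE` close to 0.3² = 0.09, illustrating the unbiasedness and consistency properties on the next slide.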

Multiple Regression Model - Estimation
 The LSE/MLE exhibits good statistical properties
 Unbiasedness: 𝐸(𝒃) = 𝜷 (the mean equals the true value)
 Efficiency: 𝑉𝑎𝑟(𝒃) = 𝜎𝜀²(𝑿⊤𝑿)⁻¹ is the minimum among all unbiased estimators
 Consistency: 𝒃 converges to (gets closer and closer to) 𝜷 as the sample size increases to infinity
 (Asymptotic) normality: 𝒃 − 𝜷 ∼ 𝑁(𝟎, 𝜎𝜀²(𝑿⊤𝑿)⁻¹); this conclusion is exact under the normality assumption, and otherwise holds for large samples
Multiple Regression Model - Estimation
 The estimate for the error variance 𝜎𝜀² is the same as before
𝑆𝑒² = 𝑀𝑆𝐸 = 𝑆𝑆𝐸 / (𝑛 − 𝐾 − 1)
 Residual degrees of freedom = 𝑛 − 𝐾 − 1
 𝐾 is the number of 𝑋-variables in the model
 𝐾 + 1 is the number of estimates 𝑏𝑖 in the model, including the intercept 𝑏0
 Roughly speaking, the residual d.f. is the effective sample size left after accounting for the plug-in estimates

Example
 Recall the example in the last topic, the real estate agent
wishes to examine the effects of the gross floor area and
age of the building on the apartment price

where
Price = Price in million HK$
GrossFA = Gross floor area in ft2
Age = Age of building in years
 Source: HKEA Transaction Records, http://www.hkea.com.hk/private/TransServ
 Data: Transactions of residential apartments in Tseung Kwan O during 1 – 8 April 2014
Example Cont’d

[Regression output for the fitted model]
Example – The Model & Interpretation of Coefficients Cont’d
 The estimated multiple regression equation
𝑌෠ = 2.3562 + 0.0042𝑋1 − 0.0404𝑋2
where 𝑌 = Price in million HK$
𝑋1 = Gross floor area in ft2
𝑋2 = Age of building in years
 The estimated slope coefficients
 𝑏1 = 0.0042 says that the estimated average price of an apartment increases by 𝐻𝐾$4,200 for each additional square foot of gross floor area, for buildings of the same age
 𝑏2 = −0.0404 says that the estimated average price of an apartment decreases by 𝐻𝐾$40,400 for each additional year of building age, for apartments of the same size

Confidence Interval (CI)
 Confidence interval estimate for a particular slope
coefficient
𝑏𝑖 ± 𝑡𝛼Τ2,𝑛−𝐾−1 𝑆𝑏𝑖
 In R, use the “confint” function
 99% CI for 𝛽𝐺𝑟𝑜𝑠𝑠𝐹𝐴 (from the t table, d.f. = 77, 𝑡0.005,77 = 2.6412)
= 0.0042 ± 2.6412 × 0.0005 = [0.0028, 0.0055]
 99% CI for 𝛽𝐴𝑔𝑒
= −0.0404 ± 2.6412 × 0.0157 = [−0.0819, 0.0012]
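The same interval can be reproduced outside R. A sketch in Python/SciPy using the GrossFA numbers from this slide (the slides themselves use R's `confint`; the variable names here are illustrative):

```python
from scipy.stats import t

# 99% CI for a slope: b_i ± t_{alpha/2, n-K-1} * S_{b_i}
# Values taken from the real-estate example on the slide (d.f. = 77)
b_grossfa, se_grossfa = 0.0042, 0.0005
df = 77
alpha = 0.01

t_crit = t.ppf(1 - alpha / 2, df)   # upper alpha/2 quantile, ~2.6412 as on the slide
ci = (b_grossfa - t_crit * se_grossfa,
      b_grossfa + t_crit * se_grossfa)
```

In R, `confint(model, level = 0.99)` returns the same intervals directly from the fitted model object.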

Model Evaluation
Evaluate the model with
 𝑅 2 and adjusted 𝑅 2
 t-test for a particular 𝑋-variable significance
 F-test for overall model significance
 Residual analysis

Analysis of Variance
 Again we want to compare the fitted model against the
null model
 Total variation of the 𝑌-variable is made up of two parts
𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸
where
𝑆𝑆𝑇 = Σ𝑖=1ⁿ (𝑌𝑖 − 𝑌̄)²
𝑆𝑆𝑅 = Σ𝑖=1ⁿ (𝑌̂𝑖 − 𝑌̄)²
𝑆𝑆𝐸 = Σ𝑖=1ⁿ (𝑌𝑖 − 𝑌̂𝑖)²
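The decomposition SST = SSR + SSE can be verified numerically. A minimal Python/NumPy sketch on simulated data (the data and coefficients are illustrative assumptions; the identity relies on the model including an intercept):

```python
import numpy as np

# Simulated data from a two-predictor linear model with an intercept
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Least-squares fit and the three sums of squares
b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b
Y_bar = Y.mean()

SST = np.sum((Y - Y_bar) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
SSR = np.sum((Y_hat - Y_bar) ** 2)
```

Up to floating-point rounding, `SST` equals `SSR + SSE`: the residuals of a least-squares fit with an intercept are orthogonal to the fitted values, so the cross term vanishes.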

Measures of Variation – 𝑅² Cont’d

𝑅² = 𝑆𝑆𝑅 / 𝑆𝑆𝑇 = 𝑆𝑆𝑅 / (𝑆𝑆𝑅 + 𝑆𝑆𝐸)

[Figure: Venn diagram of the variation of Price — SSR is the total variation of the 𝑌-variable that is explained by the regression equation with all 𝑋-variables, GrossFA and Age; SSE is the variation attributable to factors other than GrossFA and Age]
Measures of Variation – 𝑅² Cont’d
 What is the net effect of adding a new 𝑋-variable?
 𝑅² increases, even if the new 𝑋-variable explains an insignificant proportion of the variation of the 𝑌-variable
 Is it fair to use 𝑅² for comparing models with different numbers of 𝑋-variables?
 A degree of freedom is lost, as a slope coefficient has to be estimated for the new 𝑋-variable
 Does the new 𝑋-variable add enough explanatory power to offset the loss of one degree of freedom?

Measures of Variation – Adjusted 𝑅²
 Adjusted 𝑅² = 1 − [𝑆𝑆𝐸/(𝑛 − 𝐾 − 1)] / [𝑆𝑆𝑇/(𝑛 − 1)] = 1 − [(𝑛 − 1)/(𝑛 − 𝐾 − 1)](1 − 𝑅²)
 Measures the proportion of variation of the 𝑌𝑖 values that is explained by the regression equation with the independent variables 𝑋1, 𝑋2, …, 𝑋𝐾, after adjusting for the sample size (𝑛) and the number of 𝑋-variables used (𝐾)
 Smaller than 𝑅², and can be negative
 Penalizes the excessive use of 𝑋-variables
 Useful for comparing models with different numbers of 𝑋-variables
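The penalty is easy to see numerically. A short Python sketch of the formula (the 𝑅² values, 𝑛, and 𝐾 below are illustrative assumptions, not from the real-estate example):

```python
def adjusted_r2(r2, n, K):
    """Adjusted R^2 = 1 - (n-1)/(n-K-1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - K - 1) * (1 - r2)

# Adding an X-variable always raises R^2, but when the gain is tiny
# the adjusted R^2 can fall:
a2 = adjusted_r2(0.630, n=80, K=2)   # two predictors
a3 = adjusted_r2(0.632, n=80, K=3)   # a third predictor that adds little
```

Here `a3 < a2` even though the raw 𝑅² rose, so the two-predictor model is preferred; and with 𝑅² near zero the formula can indeed go negative, e.g. `adjusted_r2(0.0, 10, 3)` is −0.5.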

Example Cont’d
 Compare the model using GrossFA only to the model
using both GrossFA and Age, which one fits better?

Inferences about the Parameters –
A Particular 𝑋-Variable Significance
 t-test for a particular 𝑋-variable significance
𝐻0 : 𝛽𝑖 = 0 (𝑋𝑖 has no linear relationship with 𝑌)
𝐻1 : 𝛽𝑖 ≠ 0 (𝑋𝑖 is linearly related to 𝑌)
t-statistic: t = (𝑏𝑖 − 𝛽𝑖)/𝑆𝑏𝑖 with 𝛽𝑖 = 0 under 𝐻0, where 𝑆𝑏𝑖 = standard error of 𝑏𝑖
p-value = 2𝑃(𝑡𝑛−𝐾−1 ≥ |t|), where 𝑡𝑛−𝐾−1 denotes the 𝑡 distribution with (𝑛 − 𝐾 − 1) d.f.
Reject 𝐻0 if |t| > C.V. = 𝑡𝛼/2,(𝑛−𝐾−1) or p-value < 𝛼
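The t-statistic and two-sided p-value can be reproduced by hand. A Python/SciPy sketch using the Age coefficient from the slides' real-estate example (R's `summary(model)` reports these values directly; the variable names here are illustrative):

```python
from scipy.stats import t

# Slope estimate and standard error for Age, from the slides' example
b_age, se_age = -0.0404, 0.0157
df = 77                               # n - K - 1

t_stat = b_age / se_age               # tests H0: beta_Age = 0
p_value = 2 * t.sf(abs(t_stat), df)   # two-sided p-value, ~0.0122 as on the next slide
```

Since the p-value is below 𝛼 = 5%, 𝐻0 is rejected and Age is judged significant, matching the conclusion on the following slide.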

Example Cont’d
 At 5% level of significance, which 𝑋-variable(s), GrossFA or
Age is significantly affecting the apartment price?
According to the t-test results, the p-values for GrossFA and Age are 4.61 × 10⁻¹² and 0.0122 respectively; both are smaller than 5%, indicating that both are significantly related to the apartment price.

Inferences about the Parameters –
Overall Model Significance
 F-test for the overall model significance
𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝐾 = 0
(none of the 𝑋-variables affects 𝑌)
𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0
(at least one 𝑋-variable affects 𝑌)
F-statistic: F = 𝑀𝑆𝑅/𝑀𝑆𝐸 = [𝑆𝑆𝑅/𝐾] / [𝑆𝑆𝐸/(𝑛 − 𝐾 − 1)]
p-value = 𝑃(𝐹𝐾,(𝑛−𝐾−1) ≥ F), where 𝐹𝐾,(𝑛−𝐾−1) denotes the 𝐹 distribution with (𝐾, 𝑛 − 𝐾 − 1) d.f.
Reject 𝐻0 if F > C.V. = 𝐹𝛼,𝐾,(𝑛−𝐾−1) or p-value < 𝛼
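The critical value and p-value for this test can be computed without an F table. A Python/SciPy sketch using the numbers from the slides' example (R's `summary(model)` prints the F-statistic and p-value directly; variable names here are illustrative):

```python
from scipy.stats import f

# Overall F-test, values from the slides' real-estate example
F_stat = 65.94
K, df_resid = 2, 77                   # d.f. = (K, n - K - 1)

crit = f.ppf(0.95, K, df_resid)       # 5% critical value, ~3.115 as on the next slide
p_value = f.sf(F_stat, K, df_resid)   # P(F_{2,77} >= 65.94), vanishingly small
reject_H0 = (F_stat > crit)           # equivalently, p_value < 0.05
```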
Example Cont’d
 Is the overall model significant?
𝐻0 : 𝛽𝐺𝑟𝑜𝑠𝑠𝐹𝐴 = 𝛽𝐴𝑔𝑒 = 0
𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0
F = 65.94
p-value < 2.2 × 10⁻¹⁶
At 𝛼 = 5%, d.f. = (2, 77), C.V. = 3.115
F > C.V., so reject 𝐻0; the overall model is significant.
Prediction of New Observations
 Check
 Do the given 𝑋-values have the same measurement scales as the corresponding observed 𝑋-variables?
 How does the model perform on the validation set?
 The model can provide point prediction as well as interval
prediction
 Due to the complexity of calculating interval predictions
manually, use the “predict” function in R with “interval”
option set to “confidence” or “prediction”

Example Cont’d
 What is the expected price for an apartment with gross floor area 950 ft2 that is 2 years old?
 The point prediction given by the estimated multiple regression
equation is
𝑌෠ = 2.3562 + 0.0042𝑋1 − 0.0404𝑋2
= 2.3562 + 0.0042 × 950 − 0.0404 × 2 = 6.2654

where 𝑌 = Price in million HK$


𝑋1 = Gross floor area in ft2
𝑋2 = Age of building in years

 The expected price for an apartment with gross floor area


950 ft2 and 2 years old is 𝐻𝐾$6,265,400
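The arithmetic above is a single linear combination, sketched here in Python (coefficients taken from the estimated equation on the slide; in R the same number comes from `predict(model, newdata)`):

```python
# Estimated coefficients from the slides: Y_hat = b0 + b1*GrossFA + b2*Age
b0, b1, b2 = 2.3562, 0.0042, -0.0404

# New apartment: 950 ft^2 gross floor area, 2 years old
grossfa, age = 950, 2
price_hat = b0 + b1 * grossfa + b2 * age   # in million HK$
```

This reproduces the slide's point prediction of 6.2654 million HK$.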

Prediction of New Observations
 Given the same covariate values
 the point prediction for a single response and the mean
response are the same
 but the confidence intervals are different!
 Again, the standard errors are related by 𝑆𝑃² = 𝑆𝑒² + 𝑆𝑚²

Example Cont’d
 For an apartment of gross floor area 950 ft2 and 2 years
old
 Construct a 95% confidence interval for the mean apartment price, and
 Construct a 95% prediction interval for the apartment price

Regression Assumptions
 Linearity of regression equation
 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝐾 𝑋𝐾𝑖 is a linear function
 Error normality
 𝜀𝑖 has a normal distribution for all 𝑖
 Constant variance of errors
 𝑉𝑎𝑟(𝜀𝑖 | 𝑋1𝑖, 𝑋2𝑖, ⋯, 𝑋𝐾𝑖) = 𝜎𝜀²
 Error independence
 𝜀𝑖 are independent for all 𝑖

Residual Analysis
 Check the regression assumptions by examining the
residuals, 𝑒𝑖 = 𝑌𝑖 − 𝑌෠𝑖
 Plot
 Residuals against 𝑌̂, as 𝑌̂ shows the combined effects of all 𝑋-variables
 Residuals against each 𝑋-variable, so as to identify any violations of assumptions specifically related to a particular 𝑋-variable
 Residuals against index for checking error independence
 Histogram of the residuals for examining error normality
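Computing the series behind these plots is straightforward. A Python/NumPy sketch on simulated data (the slides would use R's `resid(model)` and `plot`; the data here are illustrative, and the plotting step itself is left to a graphics library):

```python
import numpy as np

# Fit a regression on simulated data and form the residuals e_i = Y_i - Yhat_i
rng = np.random.default_rng(2)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.8, -0.4]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b
resid = Y - Y_hat

# Series to plot: (Y_hat, resid), (X[:, j], resid) for each j,
# (index, resid), and a histogram of resid for checking normality
```

A useful sanity check: with an intercept in the model, least-squares residuals average exactly to zero and are orthogonal to every column of 𝑿, so any visible trend in these plots signals an assumption violation rather than an artifact of the fit.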

