IST2024 Lecture02
Motivating question:
https://www.kaggle.com/hjmjerry/gender-discrimination/version/1
SIMPLE LINEAR REGRESSION ANALYSIS
The sample regression equation for these data is ŷ = 329630 − 32587x, where y represents salary (Sal95) and x is the publication rate (Prate).
MULTIPLE LINEAR REGRESSION MODEL
Often we can make better predictions by using more than one independent (explanatory)
variable.
Multiple linear regression models use two or more independent variables to predict the
value of a dependent variable.
The relationship between a response variable Y and several predictor variables 𝑋1 , 𝑋2 , …, 𝑋𝑘
is postulated as a linear model
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
Y = Sal95
X1 = Publication rate
X2 = Experience
X3 = Gender
MULTIPLE REGRESSION:
THE LEAST SQUARES METHOD
Based on the available data, we wish to estimate the parameters 𝛽0 , 𝛽1 , 𝛽2 , …, 𝛽𝑘
and write the fitted least squares regression equation
ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk.
As in the case of simple linear regression, we use the least squares method, i.e., we minimize the sum of squared errors ∑(Yi − Ŷi)².
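As a hedged sketch, the fitted coefficients can be computed with ordinary least squares in Python; the file name Lawsuit.csv and the column names Sal95, Prate, Exper, Gender are assumptions based on the Kaggle data set linked above.

import numpy as np
import pandas as pd

# Assumed file and column names for the Kaggle gender-discrimination data.
df = pd.read_csv("Lawsuit.csv")
X = np.column_stack([np.ones(len(df)), df["Prate"], df["Exper"], df["Gender"]])
y = df["Sal95"].to_numpy()

# Least squares: choose b to minimize sum((y - X @ b)**2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # b0 (intercept), b1, b2, b3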
Compare the R-square value of this multiple regression model to the R-square value when only publication rate is taken as the predictor variable. What do you notice?
ADJUSTED R-SQUARED
The adjusted R-squared is a modified version of R-squared that has been adjusted for
the number of predictors in the model and the sample size.
R²adj = 1 − (1 − R²) × (n − 1)/(n − k − 1)
where n is the sample size and k is the number of independent (predictor) variables in
the regression equation.
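A minimal sketch of this formula in Python (the numbers in the example call are purely hypothetical):

def adjusted_r2(r2, n, k):
    # 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.6705, 100, 3))  # ~0.6602 for these hypothetical n and k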
The adjusted R-squared:
▪ Compares the explanatory power of regression models that contain different numbers
of predictors.
▪ Increases only if the new term improves the model more than would be expected by
chance.
▪ Decreases when a predictor improves the model by less than expected by chance.
ADJUSTED R-SQUARED:
ILLUSTRATION
Based on the table below, which variables should be included in the regression model?
R² always increases or stays the same as you include more terms in the model. Adjusted R² can be thought of as a penalized version of R², with the penalty increasing with each parameter added to the model.
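The penalty is easy to see on synthetic data: adding a pure-noise predictor cannot lower R² but can lower adjusted R². A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)            # predictor unrelated to y
y = 2 + 3 * x1 + rng.normal(size=n)

def r_squared(Xcols, y):
    X = np.column_stack([np.ones(len(y))] + Xcols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def adj(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_small = r_squared([x1], y)
r2_big = r_squared([x1, noise], y)
print(r2_big >= r2_small)                      # always True for nested models
print(adj(r2_small, n, 1), adj(r2_big, n, 2))  # adjusted R^2 may drop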
ADJUSTED R-SQUARED:
EXAMPLE
From the SAS output, the adjusted R-square value for the multiple regression model is 0.6679.
Therefore, 66.79% of the variation in salary is explained by the multiple regression model, adjusted for the number of independent variables and the sample size.
MULTIPLE CORRELATION
COEFFICIENT
The strength of the linear relationship between Y and the set of explanatory
variables 𝑋1 , 𝑋2 , …, 𝑋𝑘 acting together as predictors in the model can be
assessed through:
▪ The examination of the scatter plot of the observed Y versus the predicted Ŷ.
▪ The multiple correlation coefficient between Y and Ŷ, which is given by
R = √R² = Cor(Y, Ŷ) = ∑(yi − ȳ)(ŷi − ȳ) / √( ∑(yi − ȳ)² × ∑(ŷi − ȳ)² )
The value of R always falls between 0 and 1. The larger the multiple correlation
R, the better the predictions of y by the set of explanatory variables.
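A small numeric sketch of the formula (the observed and fitted values below are made up; in practice ŷ comes from the fitted model, so its mean equals ȳ):

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical observed values
yhat = np.array([1.2, 1.9, 3.1, 3.8, 5.0])   # hypothetical fitted values

num = np.sum((y - y.mean()) * (yhat - y.mean()))
den = np.sqrt(np.sum((y - y.mean()) ** 2) * np.sum((yhat - y.mean()) ** 2))
R = num / den
print(R, R ** 2)  # R and the corresponding R^2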
MULTIPLE CORRELATION
COEFFICIENT: EXAMPLE
The multiple correlation coefficient R = √0.6705 = 0.8188.
[Scatter plot of observed (y) versus predicted (ŷ) values.]
COEFFICIENT OF VARIATION
The coefficient of variation is the size of the standard deviation relative to the mean. It is a unitless measure, so
it can be used to compare data that have different units of measurement or different magnitudes of
measurement.
In the output example above, Coeff Var = (Root MSE / ȳ) × 100 = 30.2886.
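As a sketch, with hypothetical stand-ins for the Root MSE and response mean read off the output:

root_mse = 52000.0   # hypothetical Root MSE
y_bar = 171700.0     # hypothetical mean of the response
print(root_mse / y_bar * 100)  # ~30.29 for these made-up values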
DICHOTOMOUS CATEGORICAL
PREDICTOR VARIABLES
When categorical variables with two levels (dichotomous) are entered as
predictor variables, interpretation of regression weights depends upon how the
variable is coded.
If the dichotomous variable is coded as 0 and 1, the regression weight is added to or subtracted from the predicted value of Y, depending on whether it is positive or negative.
If the dichotomous variable is coded as -1 and 1, then if the regression weight is
positive, it is subtracted from the group coded as -1 and added to the group
coded as 1. If the regression weight is negative, then addition and subtraction is
reversed.
Dichotomous variables can be included in hypothesis tests for R² change like any other variable.
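A tiny sketch of the two coding schemes (hypothetical intercept 50 and weight 4), mirroring the rules above:

b0, b = 50.0, 4.0

# 0/1 coding: the weight shifts only the group coded 1.
print(b0 + b * 0, b0 + b * 1)    # 50.0 54.0

# -1/+1 coding: the weight shifts both groups, in opposite directions.
print(b0 + b * -1, b0 + b * 1)   # 46.0 54.0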
DICHOTOMOUS CATEGORICAL
PREDICTOR VARIABLES: EXAMPLE
ŷ = −272014 − 33754x1 + 5266.19x2 + 15998x3
The value for the dichotomous predictor variable Gender (𝑋3 ) is “1” if male, “0” if female.
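Because Gender enters the model linearly, b3 is exactly the estimated male-female difference in predicted salary at any fixed publication rate and experience: (b0 + b1x1 + b2x2 + 15998·1) − (b0 + b1x1 + b2x2 + 15998·0) = 15998.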
INFERENCES
TESTING THE COLLECTIVE INFLUENCE
OF THE EXPLANATORY VARIABLES:
OVERALL F-TEST
Do the explanatory (independent) variables collectively have a statistically significant effect on the response (dependent)
variable?
𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0
[There is no linear relationship between the dependent variable and the independent variables.]
𝐻1 : At least one 𝛽𝑗 ≠ 0, j = 1, 2, 3, …, k
[There is a linear relationship between the dependent variable and at least one of the independent variables.]
From the SAS output, F = 179.49 with corresponding p-value < 0.0001.
Since the p-value is < 0.0001, H0 is rejected at the 0.05 level of significance (α = 0.05).
At least one of the explanatory variables has a significant effect on the response variable.
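In Python, the same overall F-test can be read off a fitted model; a sketch assuming statsmodels and the column names used earlier:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("Lawsuit.csv")  # assumed file name
fit = smf.ols("Sal95 ~ Prate + Exper + Gender", data=df).fit()
print(fit.fvalue, fit.f_pvalue)  # overall F statistic and its p-value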
INFERENCE FOR INDIVIDUAL
REGRESSION COEFFICIENTS
Suppose the p-value is small for the F-test (test for collective influence).
We REJECT the null hypothesis that the explanatory variables collectively do not have an influence on
the response variable.
This does not imply that every explanatory variable has an effect on y (controlling for the other
explanatory variables in the model).
It implies at least one of them has an effect.
Which partial effects are nonzero? What is the size of those effects?
We can test for the significance of the individual regression coefficients 𝛽𝑗 (j = 1, 2, …, k)
𝐻0 : 𝛽𝑗 = 0
𝐻1 : 𝛽𝑗 ≠ 0
The test statistic is t0 = (bj − βj)/s.e.(bj), which under H0: βj = 0 reduces to t0 = bj/s.e.(bj). This test statistic has a t-distribution with df = n − k − 1.
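A generic sketch of this test with made-up numbers:

from scipy import stats

bj, se = 2.5, 0.8   # hypothetical estimate and standard error
n, k = 50, 3        # hypothetical sample size and number of predictors

t0 = bj / se        # test statistic under H0: beta_j = 0
p = 2 * stats.t.sf(abs(t0), df=n - k - 1)
print(t0, p)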
INFERENCE FOR INDIVIDUAL
REGRESSION COEFFICIENTS: EXAMPLE 1
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
The point estimate of β1 is b1 = −33754, with standard error s.e.(b1) = 1676.65491. The test statistic is t0 = b1/s.e.(b1) = −33754/1676.65491 = −20.13.
Since the p-value is < 0.0001, H0 is rejected at significance level α = 0.05. There is strong evidence that publication rate is related to salary, controlling for experience and gender.
Controlling for experience and gender, we are 95% confident that the change in mean salary per one-unit increase
in publication rate falls between -$37056 and -$30452.
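The interval can be reconstructed as b1 ± t* × s.e.(b1); a sketch in Python, where df = 257 (i.e., n = 261, k = 3) is an assumption that reproduces the printed endpoints:

from scipy import stats

b1, se = -33754.0, 1676.65491
tcrit = stats.t.ppf(0.975, df=257)       # assumed df = n - k - 1
print(b1 - tcrit * se, b1 + tcrit * se)  # approximately (-37056, -30452)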
INFERENCE FOR INDIVIDUAL
REGRESSION COEFFICIENTS: EXAMPLE 2
𝐻0 : 𝛽2 = 0
𝐻1 : 𝛽2 ≠ 0
The point estimate of β2 is b2 = 5266.1921, with standard error s.e.(b2) = 553.22504. The test statistic is t0 = b2/s.e.(b2) = 9.52.
Since the p-value is < 0.0001, H0 is rejected. There is strong evidence that experience is related to salary, controlling for publication rate and gender.
Controlling for _____________, we are 95% confident that the change in mean __________ per one-unit increase in
______________ falls between $4176.76 and $6355.62.
PARTIAL CORRELATION
COEFFICIENTS
Partial correlation coefficients describe the partial association between Y and a
particular predictor, while controlling for other variables of interest.
For example, the partial correlation between Y and X1, controlling for X2, is
r_yx1·x2 = (r_yx1 − r_yx2 · r_x1x2) / √( (1 − r²_yx2)(1 − r²_x1x2) )
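A minimal sketch of this formula (the three pairwise correlations passed in are hypothetical):

from math import sqrt

def partial_corr(r_yx1, r_yx2, r_x1x2):
    # Partial correlation of y and x1, controlling for x2.
    return (r_yx1 - r_yx2 * r_x1x2) / sqrt((1 - r_yx2 ** 2) * (1 - r_x1x2 ** 2))

print(partial_corr(0.60, 0.40, 0.30))  # ~0.55 for these hypothetical inputs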
REGRESSION
DIAGNOSTICS