
Department of Computing and Information Systems

School of Engineering and Technology


Sunway University

IST 2024 Applied Statistics
March 2021 Semester

LECTURE 2: MULTIPLE LINEAR REGRESSION MODEL


LINEAR REGRESSION: AN OVERVIEW

▪ Simple linear regression: one explanatory variable.
▪ Multiple linear regression: several explanatory variables.


CASE STUDY DATA SET (RECAP)

The accompanying figure shows part of a dataset on department, gender, clinical/research emphasis, board certification status, publication rate, years of experience, rank and salary-related variables for doctors attached to a medical college in the U.S.

More details on the dataset are available from the data source:
https://www.kaggle.com/hjmjerry/gender-discrimination/version/1

Arising question: How does publication rate (Prate) affect the salary (Sal95) of the doctors?
SIMPLE LINEAR REGRESSION ANALYSIS

The sample regression equation for these data is $\hat{y} = 329630 - 32587x$, where $y$ represents salary (Sal95) and $x$ is the publication rate (Prate).

Are there any other explanatory variables for the response variable salary (Sal95)?
MULTIPLE LINEAR REGRESSION MODEL

Often we can make better predictions by using more than one independent (explanatory)
variable.
Multiple linear regression models use two or more independent variables to predict the
value of a dependent variable.
The relationship between a response variable $Y$ and several predictor variables $X_1, X_2, \ldots, X_k$ is postulated as the linear model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon$

The constants $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are called the model regression coefficients or parameters. The term $\epsilon$ is a random disturbance or error term.
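
Below is a minimal sketch of fitting such a model in Python with statsmodels, using synthetic data in place of the doctors dataset (the variable names and simulated ranges are assumptions; the true coefficients are seeded with the fitted values shown later in these slides). The later sketches in this lecture reuse these arrays.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200                                       # arbitrary sample size
prate = rng.uniform(2, 9, n)                  # X1: publication rate (assumed range)
exper = rng.integers(1, 30, n).astype(float)  # X2: years of experience
# Simulated salaries; coefficients echo the SAS fitted values shown later.
sal95 = 280906 - 34651 * prate + 5758 * exper + rng.normal(0, 25000, n)

X = sm.add_constant(np.column_stack([prate, exper]))  # prepend intercept column
fit = sm.OLS(sal95, X).fit()
print(fit.params)                             # least squares estimates b0, b1, b2
```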
SCATTER PLOT MATRIX

[Scatter plot matrix of Y = Sal95, X1 = publication rate, X2 = experience, X3 = gender]
MULTIPLE REGRESSION: THE LEAST SQUARES METHOD

Based on the available data, we wish to estimate the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ and write the fitted least squares regression equation

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

As in the case of simple linear regression, we use the least squares method (the method of minimizing the sum of squared errors $\sum (Y_i - \hat{Y}_i)^2$).

$b_0, b_1, \ldots, b_k$ are sample estimates of the model parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$.


MULTIPLE REGRESSION: THE LEAST SQUARES METHOD

The least squares estimates $b_0, b_1, \ldots, b_k$ are given by the solution of a system of linear equations known as the normal equations.
The multiple regression equation represents a plane (two predictors) or a hyperplane (more than two predictors).
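
As a sketch of what solving the normal equations looks like numerically, the estimates come from $(X'X)b = X'y$, reusing the synthetic arrays from the first sketch:

```python
# Design matrix with an intercept column, then solve (X'X) b = X'y.
X = np.column_stack([np.ones(n), prate, exper])
b = np.linalg.solve(X.T @ X, X.T @ sal95)          # b0, b1, b2
# np.linalg.lstsq(X, sal95, rcond=None)[0] gives the same solution more stably.
```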
INTERPRETATIONS OF (PARTIAL) REGRESSION COEFFICIENTS

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon$

The regression coefficient $\beta_j$ (j = 1, 2, …, k):
▪ Can be interpreted as the change in Y corresponding to a unit change in $X_j$ when all other predictor variables are held constant.
▪ Describes the effect of the explanatory variable $X_j$ while controlling for the effects of the other explanatory variables in the model.
▪ Is also called the partial regression coefficient.
THE MULTIPLE LINEAR REGRESSION EQUATION: EXAMPLE

Model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$, where Y = Sal95 (in dollars), X1 = publication rate, and X2 = experience (in years).

From the SAS output, $b_0 = 280906$, $b_1 = -34651$ and $b_2 = 5758.32$. The sample regression equation for these data is

$\hat{y} = 280906 - 34651 x_1 + 5758.32 x_2$

−$34,651 is interpreted as the change in salary (Y) corresponding to a unit change in publication rate (X1) when experience (X2) is held constant.
When experience (X2) increases by one year, salary (Y) increases by $5,758.32, with publication rate (X1) held constant.
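
A small sketch of this interpretation, wrapping the fitted equation in a plain function (the inputs below are hypothetical values):

```python
# The fitted equation from the SAS output, as a function.
def predict_sal95(x1, x2):
    return 280906 - 34651 * x1 + 5758.32 * x2

# Holding experience fixed, a one-unit increase in publication rate moves
# the prediction by exactly b1 = -34651 dollars:
print(predict_sal95(5.0, 10) - predict_sal95(4.0, 10))   # -34651.0
```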
MULTIPLE REGRESSION: ERRORS

For a case (or subject) with predicted response $\hat{y}_i$ and observed response $y_i$, the residual (or error) is $(y_i - \hat{y}_i)$.
The sum of squared errors (SSE) or residual sum of squares is defined as $\sum (y_i - \hat{y}_i)^2$.

[Figure: the original dataset with predicted values $\hat{y}_i$ from $\hat{y} = 280906 - 34651 x_1 + 5758.32 x_2$, observed values $y_i$, and residuals $y_i - \hat{y}_i$]
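
Continuing the synthetic fit from the first sketch, the residuals and SSE fall out directly:

```python
# Residuals and SSE for the two-predictor fit.
y_hat = fit.fittedvalues          # predicted responses
resid = fit.resid                 # y_i - y_hat_i for each case
sse = np.sum(resid ** 2)          # sum of squared errors (residual SS)
```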
COEFFICIENT OF MULTIPLE DETERMINATION, R²

The coefficient of multiple determination is given by

$R^2 = \frac{SSR}{SST}$

$R^2$ may be interpreted as the proportion of the total variability in the response variable Y that can be accounted for by the set of predictor variables $X_1, X_2, \ldots, X_k$.
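
A sketch of the computation via the decomposition SST = SSR + SSE, continuing the running example:

```python
# R^2 = SSR/SST, with SSR obtained as SST - SSE.
sst = np.sum((sal95 - sal95.mean()) ** 2)    # total sum of squares
ssr = sst - sse                              # regression sum of squares
r_squared = ssr / sst
assert np.isclose(r_squared, fit.rsquared)   # matches statsmodels' value
```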
COEFFICIENT OF MULTIPLE DETERMINATION: EXAMPLE

From the SAS output, the R-Square value is 0.6705. Therefore, 67.05% of the variation in salary is explained by the variation in publication rate and experience.

Compare this R-Square value to the R-Square value when only publication rate is taken as the predictor variable. What do you notice?
ADJUSTED R-SQUARED

The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model and the sample size.

$R^2_{adj} = 1 - \frac{n-1}{n-k-1}(1 - R^2)$

where n is the sample size and k is the number of independent (predictor) variables in the regression equation.

The adjusted R-squared:
▪ Compares the explanatory power of regression models that contain different numbers of predictors.
▪ Increases only if the new term improves the model more than would be expected by chance.
▪ Decreases when a predictor improves the model by less than expected by chance.
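
A one-line sketch of the formula, continuing the running example (which has k = 2 predictors):

```python
# Adjusted R^2 from R^2, the sample size n and the number of predictors k.
k = 2
r2_adj = 1 - (n - 1) / (n - k - 1) * (1 - r_squared)
assert np.isclose(r2_adj, fit.rsquared_adj)   # statsmodels agrees
```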
ADJUSTED R-SQUARED: ILLUSTRATION

Based on the table below, which variables should be included in the regression model?

The R² always increases or stays the same as you include more terms in the model. The adjusted R² can be thought of as a penalized version of R², with the penalty increasing with each parameter added to the model.
ADJUSTED R-SQUARED: EXAMPLE

From the SAS output, the adjusted R-Square value for the multiple regression model is 0.6679.
Therefore, 66.79% of the variation in salary is explained by the multiple regression model, adjusted for the number of independent variables and sample size.
MULTIPLE CORRELATION COEFFICIENT

The strength of the linear relationship between Y and the set of explanatory variables $X_1, X_2, \ldots, X_k$ acting together as predictors in the model can be assessed through:
▪ The examination of the scatter plot of the observed $Y$ versus the predicted $\hat{Y}$.
▪ The multiple correlation coefficient between $Y$ and $\hat{Y}$, which is given by

$R = \sqrt{R^2} = Cor(Y, \hat{Y}) = \frac{\sum (y_i - \bar{y})(\hat{y}_i - \bar{y})}{\sqrt{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{y})^2}}$

The value of R always falls between 0 and 1. The larger the multiple correlation R, the better the predictions of y by the set of explanatory variables.
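
A sketch checking the identity R = Cor(y, ŷ) = √R² on the running example:

```python
# Multiple correlation as the ordinary correlation between y and y_hat.
R = np.corrcoef(sal95, y_hat)[0, 1]
assert np.isclose(R, np.sqrt(r_squared))   # equals sqrt(R^2) for OLS with intercept
```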
MULTIPLE CORRELATION COEFFICIENT: EXAMPLE

The multiple correlation coefficient $R = \sqrt{0.6705} = 0.8188$.

[Figure: scatter plot of observed (y) versus predicted (ŷ) salaries]

COEFFICIENT OF VARIATION

The coefficient of variation is the size of the standard deviation relative to the mean. It is a unitless measure, so it can be used to compare data that have different units of measurement or different magnitudes of measurement.

In the output example above, Coeff Var = $\frac{\text{Root MSE}}{\bar{Y}} \times 100 = 30.2886$
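
A sketch of the same quantity on the running example, where root MSE is $\sqrt{SSE/(n-k-1)}$:

```python
# Coeff Var as SAS reports it: root MSE relative to the mean response, in %.
mse = sse / (n - k - 1)                         # mean squared error
coeff_var = np.sqrt(mse) / sal95.mean() * 100
```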
DICHOTOMOUS CATEGORICAL PREDICTOR VARIABLES

When categorical variables with two levels (dichotomous) are entered as predictor variables, the interpretation of the regression weights depends on how the variable is coded.
If the dichotomous variable is coded as 0 and 1, the regression weight is added to or subtracted from the predicted value of Y, depending on whether it is positive or negative.
If the dichotomous variable is coded as −1 and 1, then if the regression weight is positive, it is subtracted for the group coded as −1 and added for the group coded as 1. If the regression weight is negative, the addition and subtraction are reversed.
A dichotomous variable can be included in hypothesis tests for $R^2$ change like any other variable.
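
A minimal sketch of adding a 0/1-coded dichotomous predictor to the running synthetic example (the random assignment here is an assumption; the real dataset records gender per doctor):

```python
# X3 coded 1 = male, 0 = female, as on the next slide.
gender = rng.integers(0, 2, n).astype(float)
X3 = sm.add_constant(np.column_stack([prate, exper, gender]))
fit3 = sm.OLS(sal95, X3).fit()
# The weight fit3.params[3] is added to the prediction only for cases coded 1.
```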
DICHOTOMOUS CATEGORICAL PREDICTOR VARIABLES: EXAMPLE

$\hat{y} = 272014 - 33754 x_1 + 5266.19 x_2 + 15998 x_3$

$15,998 is interpreted as the change (increase) in Sal95 (Y) when the doctor is male (X3 = 1), with publication rate (X1) and number of years of experience (X2) held constant.

The value of the dichotomous predictor variable Gender ($X_3$) is "1" if male, "0" if female.
INFERENCES
TESTING THE COLLECTIVE INFLUENCE OF THE EXPLANATORY VARIABLES: OVERALL F-TEST

Do the explanatory (independent) variables collectively have a statistically significant effect on the response (dependent) variable?

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
[There is no linear relationship between the dependent variable and the independent variables.]
$H_1$: At least one $\beta_j \neq 0$, j = 1, 2, 3, …, k
[There is a linear relationship between the dependent variable and at least one of the independent variables.]

Other equivalent tests:
▪ $H_0$: Population multiple correlation = 0 versus $H_1$: Population multiple correlation > 0
▪ $H_0$: Population R-squared = 0 versus $H_1$: Population R-squared > 0

The test statistic is $F = \frac{MSR}{MSE}$. This test statistic has an F-distribution with df1 = k and df2 = n − (k + 1), where k is the number of explanatory variables and n is the sample size.
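
A sketch of the statistic and its p-value on the running example, reusing the sums of squares computed earlier:

```python
# Overall F test: F = MSR/MSE with df1 = k, df2 = n - k - 1.
from scipy import stats

msr = ssr / k
mse = sse / (n - k - 1)
F = msr / mse
p_value = stats.f.sf(F, k, n - k - 1)    # upper-tail probability
# statsmodels reports the same numbers as fit.fvalue and fit.f_pvalue.
```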
TESTING THE COLLECTIVE INFLUENCE OF THE EXPLANATORY VARIABLES: EXAMPLE

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_1$: At least one $\beta_j \neq 0$

From the SAS output, F = 179.49 with corresponding p-value < 0.0001.
Since the p-value is < 0.0001, $H_0$ is rejected at the 0.05 level of significance (α = 0.05).
At least one of the explanatory variables has a significant effect on the response variable.
INFERENCE FOR INDIVIDUAL REGRESSION COEFFICIENTS

Suppose the p-value is small for the F-test (test for collective influence).
▪ We REJECT the null hypothesis that the explanatory variables collectively do not have an influence on the response variable.
▪ This does not imply that every explanatory variable has an effect on Y (controlling for the other explanatory variables in the model).
▪ It implies at least one of them has an effect.

Which partial effects are nonzero? What is the size of those effects?
We can test for the significance of the individual regression coefficients $\beta_j$ (j = 1, 2, …, k):

$H_0: \beta_j = 0$
$H_1: \beta_j \neq 0$

The test statistic is $t_0 = \frac{b_j - \beta_j}{s.e.(b_j)}$, which reduces to $\frac{b_j}{s.e.(b_j)}$ under $H_0$. This test statistic has a t-distribution with df = n − k − 1.
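
A sketch of the t statistic, two-sided p-value and 95% confidence interval for one coefficient, using the two-predictor fit from the running example:

```python
# Individual t test for the Prate coefficient (index 1, after the intercept).
b1, se_b1 = fit.params[1], fit.bse[1]     # estimate and its standard error
df = n - k - 1
t0 = b1 / se_b1                            # tests H0: beta_1 = 0
p = 2 * stats.t.sf(abs(t0), df)            # two-sided p-value
ci = b1 + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, df) * se_b1
```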
INFERENCE FOR INDIVIDUAL REGRESSION COEFFICIENTS: EXAMPLE 1

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Note that $\beta_1$ is the partial regression coefficient for $X_1$ (Prate).

The point estimate of $\beta_1$ is $b_1 = -33754$ with standard error s.e. = 1676.65491. The test statistic is

$t_0 = \frac{b_1}{s.e.(b_1)} = \frac{-33754}{1676.65491} = -20.13$

Since the p-value is < 0.0001, $H_0$ is rejected at significance level α = 0.05. There is strong evidence that publication rate is related to the salary, controlling for experience and gender.
Controlling for experience and gender, we are 95% confident that the change in mean salary per one-unit increase in publication rate falls between −$37,056 and −$30,452.
INFERENCE FOR INDIVIDUAL REGRESSION COEFFICIENTS: EXAMPLE 2

$H_0: \beta_2 = 0$
$H_1: \beta_2 \neq 0$

Note that $\beta_2$ is the partial regression coefficient for $X_2$ (Experience).

The point estimate of $\beta_2$ is $b_2 = 5266.1921$ with standard error s.e. = 553.22504. The test statistic is $t_0 = 9.52$.
Since the p-value is < 0.0001, $H_0$ is rejected. There is strong evidence that experience is related to the salary of the doctors, controlling for publication rate and gender.
Controlling for _____________, we are 95% confident that the change in mean __________ per one-unit increase in ______________ falls between $4176.76 and $6355.62.
PARTIAL CORRELATION COEFFICIENTS

Partial correlation coefficients describe the partial association between Y and a particular predictor, while controlling for other variables of interest.
For example, the partial correlation between Y and $X_1$, controlling for $X_2$, is

$r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{(1 - r_{yx_2}^2)(1 - r_{x_1 x_2}^2)}}$

When one variable is controlled, the partial correlation is known as a first-order partial correlation.
The square of the partial correlation $r_{yx_1 \cdot x_2}$ represents the proportion of the variation in Y that is explained by $X_1$, out of that left unexplained by $X_2$:

$r_{yx_1 \cdot x_2}^2 = \frac{R^2 - r_{yx_2}^2}{1 - r_{yx_2}^2}$
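
A sketch of the first formula on the running synthetic arrays, building the partial correlation from the three pairwise correlations:

```python
# First-order partial correlation r_{y x1 . x2} from pairwise correlations.
r_yx1 = np.corrcoef(sal95, prate)[0, 1]
r_yx2 = np.corrcoef(sal95, exper)[0, 1]
r_x1x2 = np.corrcoef(prate, exper)[0, 1]
r_partial = (r_yx1 - r_yx2 * r_x1x2) / np.sqrt(
    (1 - r_yx2**2) * (1 - r_x1x2**2))
```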
PARTIAL CORRELATION COEFFICIENTS: EXAMPLE

Controlling for publication rate and gender, 26.07% of the variation in salary is explained by experience.

Can you interpret the rest of the partial correlation coefficients?
LINEAR REGRESSION: AN OVERVIEW

▪ Simple linear regression
▪ Multiple linear regression
▪ Regression diagnostics
