
Applied Statistics

Lecture 15 Multiple Regression

Instructor: Dr. Shuli Liu


Department of Economics, BJTU
Email: [email protected]
Office: SD306
Office Hours: Thu 2:00 pm – 5:00 pm
Week 11



Outline

• Multiple Regression Model


• Least Squares Method
• Multiple Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and Prediction
• Categorical Independent Variables
• Residual Analysis
• Logistic Regression



Multiple Regression

• In this part we continue our study of regression analysis by considering situations involving two or more independent variables.
• This subject area, called multiple regression analysis, enables us to consider more factors and thus obtain better estimates than are possible with simple linear regression.



Multiple Regression Model

Multiple Regression Model


• The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is:

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and ε is a random variable called the error term

Multiple Regression Equation


• The equation that describes how the mean value of y is related to x1, x2, . . . , xp is:

E(y) = β0 + β1x1 + β2x2 + . . . + βpxp



Estimated Multiple Regression Equation

Estimated Multiple Regression Equation

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

A simple random sample is used to compute the sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.



Estimation Process



Least Squares Method

• Least Squares Criterion

min Σ(yi − ŷi)²

• Computation of Coefficient Values


The formulas for the regression coefficients b0, b1, b2, . . . bp involve the
use of matrix algebra. We will rely on computer software packages to
perform the calculations.
The emphasis will be on how to interpret the computer output rather
than on how to make the multiple regression computations.
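The slides defer these calculations to software. As a rough sketch of what such a package does under the hood, the coefficients can be obtained in Python with numpy's least squares routine (the data values below are made-up placeholders, not the survey data from the example):

```python
import numpy as np

# Illustrative placeholder data (not the survey data from the slides):
# x1 = years of experience, x2 = aptitude test score, y = salary ($1000s)
x1 = np.array([4.0, 7.0, 1.0, 5.0, 8.0, 10.0, 3.0, 6.0])
x2 = np.array([78.0, 100.0, 86.0, 82.0, 86.0, 84.0, 75.0, 83.0])
y  = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 26.6, 30.0])

# Design matrix: a column of ones for b0, then one column per independent variable
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: choose b to minimize sum((y - X @ b)**2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)
```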



Multiple Regression: Example

• Example: Programmer Salary Survey


A software firm collected data for a sample of 20 computer
programmers. A suggestion was made that regression analysis could be
used to determine if salary was related to the years of experience and
the score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.



Multiple Regression: Example (con’t)

• Example: Programmer Salary Survey



Multiple Regression: Example (con’t)

• Example: Programmer Salary Survey


Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience
(x1) and the score on the programmer aptitude test (x2) by the following
regression model:
y = β0 + β1x1 + β2x2 + ε

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test



Multiple Regression: Example (con’t)

• Example: Programmer Salary Survey


Solving for the Estimates of β0, β1, β2



Multiple Regression: Example (con’t)

• Example: Programmer Salary Survey


Regression Equation Output

Estimated Regression Equation


SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)

(Note: Predicted salary will be in thousands of dollars.)
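Given the estimated equation above, a point prediction is a direct plug-in. A minimal sketch (the inputs of 5 years and a score of 80 are chosen purely for illustration):

```python
def predict_salary(exper: float, score: float) -> float:
    """Point estimate of annual salary ($1000s) from the estimated regression equation."""
    return 3.174 + 1.404 * exper + 0.251 * score

print(predict_salary(exper=5, score=80))   # about 30.3, i.e. roughly $30,300 per year
```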



Multiple Regression: Example (con’t)

• Example: Programmer Salary Survey


Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient
as follows:

bi represents an estimate of the change in y corresponding to a one-unit increase in xi when all other independent variables are held constant.

b1 = 1.404: Salary is expected to increase by $1,404 for each additional year of experience (when the score on the programmer aptitude test is held constant).

b2 = 0.251: Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when years of experience is held constant).



Multiple Coefficient of Determination

• Relationship Among SST, SSR, SSE

SST = SSR + SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error



Multiple Coefficient of Determination (con’t)

• Example: Programmer Salary Survey


Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = 0.83418



Adjusted Multiple Coefficient of Determination

• Adding independent variables, even ones that are not statistically significant, causes the prediction errors to become smaller, thus reducing the sum of squares due to error, SSE.
• Because SSR = SST – SSE, when SSE becomes smaller, SSR becomes larger, causing R2 = SSR/SST to increase.
• The adjusted multiple coefficient of determination compensates for the number of independent variables in the model.

Ra2 = 1 − (1 − R2)(n − 1)/(n − p − 1)

Ra2 = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
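Both quantities are easy to reproduce from SSR, SST, n, and p; a quick check in Python using the figures quoted on these slides:

```python
# Values from the slides: SSR = 500.3285, SST = 599.7855, n = 20 programmers, p = 2 predictors
SSR, SST, n, p = 500.3285, 599.7855, 20, 2

R2 = SSR / SST                                  # multiple coefficient of determination
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)   # penalizes for the number of predictors

print(round(R2, 6), round(R2_adj, 6))           # about 0.834179 and 0.814671
```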
Assumptions About the Error Term ε

• The error term ε is a random variable with mean of zero.
• The variance of ε, denoted by σ2, is the same for all values of the independent variables.
• The values of ε are independent.
• The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + … + βpxp.



Testing for Significance

• In simple linear regression, the F and t tests provide the same conclusion.
• In multiple regression, the F and t tests have different purposes.
F Test
• The F test is used to determine whether a significant relationship exists
between the dependent variable and the set of all the independent
variables.
• The F test is referred to as the test for overall significance.
t Test
• If the F test shows an overall significance, the t test is used to determine
whether each of the individual independent variables is significant.
• A separate t test is conducted for each of the independent variables in the
model. We refer to each of these t tests as a test for individual significance.



Testing for Significance (con’t)

• F Test

Hypotheses H0: 1 = 2 = . . . = p = 0
Ha: One or more of the parameters is not equal to zero

Test Statistics F = MSR/MSE

Reject H0 if p-value < a or if F F , where F is


Rejection Rule based on an F distribution with p d.f. in the
numerator and n - p - 1 d.f. in the denominator.



Testing for Significance (con’t)

• F Test: example

Hypotheses H0: 1 = 2 = 0
Ha: One or both of the parameters is not equal to zero.

For  = .05 and d.f. = 2, 17; F.05 = 3.59


Rejection Rule
Reject H0 if p-value < .05 or F > 3.59
ANOVA Output
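The critical value used in the rejection rule can be verified with scipy (a small check, not part of the original slides):

```python
from scipy import stats

alpha, p, n = 0.05, 2, 20
F_crit = stats.f.ppf(1 - alpha, dfn=p, dfd=n - p - 1)   # F distribution with (2, 17) d.f.
print(round(F_crit, 2))                                 # 3.59, as quoted above
```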



Testing for Significance (con’t)

• t Test

Hypotheses H0: i = 0 Ha: i ≠ 0

𝑏𝑖
Test Statistics 𝑡=
𝑠𝑏 𝑖

Rejection Rule Reject H0 if p-value < a or if t < -tor t > t


where t is based on a t distribution with n - p – 1
degrees of freedom.



Testing for Significance (con’t)

• t Test: example

Hypotheses      H0: βi = 0     Ha: βi ≠ 0

Rejection Rule  For α = .05 and d.f. = 17, t.025 = 2.11
                Reject H0 if p-value < .05, or if t < −2.11 or t > 2.11

Regression Equation Output
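As with the F test, the critical value t.025 = 2.11 can be verified with scipy:

```python
from scipy import stats

alpha, p, n = 0.05, 2, 20
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)   # two-tailed, 17 degrees of freedom
print(round(t_crit, 2))                             # 2.11, matching t.025 above
```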



Testing for Significance: Multicollinearity

• The term multicollinearity refers to the correlation among the independent variables.
• When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
• If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
• Every attempt should be made to avoid including independent variables that are highly correlated.
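A quick way to screen for the problem is to inspect the pairwise correlations among the independent variables before fitting the model. A sketch with hypothetical columns (a fuller analysis might also compute variance inflation factors):

```python
import numpy as np

# Hypothetical predictor matrix: one row per observation, one column per independent variable
X = np.array([
    [4.0,  78.0],
    [7.0, 100.0],
    [1.0,  86.0],
    [5.0,  82.0],
    [8.0,  86.0],
    [10.0, 84.0],
])

r = np.corrcoef(X, rowvar=False)        # correlation matrix of the predictors
print(r)

# Flag predictor pairs whose correlation exceeds the |r| > .7 rule of thumb
pairs = np.argwhere(np.abs(np.triu(r, k=1)) > 0.7)
print(pairs)
```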



Using the Estimated Regression Equation
for Estimation and Prediction

• The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.
• We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.
• The formulas required to develop interval estimates for the mean value of y and for an individual value of y are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval estimates.
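statsmodels is one such package; a hedged sketch of how it reports both intervals (the data below are synthetic placeholders, not the survey data):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic placeholder data
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 10, 20), rng.uniform(70, 100, 20)
y = 3 + 1.4 * x1 + 0.25 * x2 + rng.normal(0, 2, 20)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# Intervals for a new case with x1 = 5, x2 = 80
x_new = sm.add_constant(np.array([[5.0, 80.0]]), has_constant="add")
pred = results.get_prediction(x_new)
# mean_ci_* columns give the confidence interval for the mean of y,
# obs_ci_* columns give the prediction interval for an individual y
print(pred.summary_frame(alpha=0.05))
```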



Residual Analysis

• For simple linear regression the residual plot against ŷ and the residual plot against x provide the same information.
• In multiple regression analysis it is preferable to use the residual plot against ŷ to determine if the model assumptions are satisfied.



Residual Analysis (con’t)

• Standardized Residual Plot Against ŷ


• Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2).
• Providing insight about the assumption that the error term has a normal
distribution.
• The computation of the standardized residuals in multiple regression
analysis is too complex to be done by hand.
• Excel’s Regression tool can be used.
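Outside of Excel, most regression libraries expose the same quantity; for instance, a fitted statsmodels OLS result provides standardized (internally studentized) residuals. A sketch with synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(20, 2))                 # synthetic predictors
y = 3 + X @ np.array([1.4, 0.25]) + rng.normal(0, 2, 20)

results = sm.OLS(y, sm.add_constant(X)).fit()
std_resid = results.get_influence().resid_studentized_internal

print(np.where(np.abs(std_resid) > 2)[0])            # indices flagged as potential outliers
```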



Residual Analysis (con’t)

• Standardized Residual Plot Against ŷ


• Residual Output



Categorical Independent Variables

• In many situations we must work with categorical independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
• For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.
• In this case, x2 is called a dummy or indicator variable.



Categorical Independent Variables (con’t)

• Example: Programmer Salary Survey


As an extension of the problem involving the computer programmer salary
survey, suppose that management also believes that the annual salary is related
to whether the individual has a graduate degree in computer science or
information systems.
The years of experience, the score on the programmer aptitude test, whether
the individual has a relevant graduate degree, and the annual salary ($1000) for
each of the sampled 20 programmers are shown on the next slide.



Categorical Independent Variables (con’t)

• Example: Programmer Salary Survey: data



Categorical Independent Variables (con’t)

• Example: Programmer Salary Survey: data


Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:

ŷ = annual salary ($1000s)


x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree
1 if individual does have a graduate degree
(x3 is a dummy variable)
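In code, the dummy variable is simply an extra 0/1 column in the design matrix; a minimal sketch with hypothetical values (not the survey data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical observations: experience, aptitude score, graduate degree (1 = yes), salary ($1000s)
exper  = np.array([4.0, 7.0, 1.0, 5.0, 8.0, 10.0, 3.0, 6.0])
score  = np.array([78.0, 100.0, 86.0, 82.0, 86.0, 84.0, 75.0, 83.0])
degree = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 26.6, 30.0])

X = sm.add_constant(np.column_stack([exper, score, degree]))
results = sm.OLS(salary, X).fit()
print(results.params)   # b0, b1, b2, b3; b3 estimates the salary shift for degree holders
```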



Categorical Independent Variables (con’t)

• Example: Programmer Salary Survey: data


ANOVA Output

R2 = 507.896/599.7855 = 0.8468 Previously, R2 = 0.8342


Ra2 = 1 − (1 − .8468)(20 − 1)/(20 − 3 − 1) = 0.8181          Previously, Adjusted R2 = 0.815



Categorical Independent Variables (con’t)

• Example: Programmer Salary Survey: data


Regression Equation Output



More Complex Categorical Variables

• If a categorical variable has k levels, k − 1 dummy variables are required, with each dummy variable being coded as 0 or 1.
• For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
• Care must be taken in defining and interpreting the dummy variables.
• For example, a variable indicating level of education could be represented by x1 and x2 in the same way (one way to generate such a coding is sketched below).
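One possible way to generate such a coding in Python is pandas.get_dummies, which drops one level to serve as the baseline (a sketch using the A/B/C example above):

```python
import pandas as pd

# A categorical variable with k = 3 levels needs k - 1 = 2 dummy variables
df = pd.DataFrame({"group": ["A", "B", "C", "A", "C"]})

# drop_first=True treats A as the baseline: A -> (0, 0), B -> (1, 0), C -> (0, 1)
dummies = pd.get_dummies(df["group"], prefix="x", drop_first=True, dtype=int)
print(dummies)
```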



Modeling Curvilinear Relationship

• Example: Sales of Laboratory Scales


A manufacturer of laboratory scales wants to investigate the relationship
between the length of employment of their salespeople and the number of
scales sold.
The table on the next slide gives the number of months each salesperson has
been employed by the firm (x) and the number of scales sold (y) by 15
randomly selected salespersons.



Modeling Curvilinear Relationship (con’t)

• Excel’s Chart tools can be used to develop a scatter diagram and fit a
straight line to bivariate data.
• The estimated regression equation and the coefficient of determination
for simple linear regression can also be developed.
• The results of using Excel’s Chart tools to fit a line to the data are shown
below.



Modeling Curvilinear Relationship (con’t)

• The scatter diagram indicates a possible curvilinear relationship between the length of time employed and the number of scales sold.
• So, we develop a multiple regression model with two independent variables: x and x².

y = β0 + β1x + β2x² + ε

• This model is often referred to as a second-order polynomial or a quadratic model.
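As a quick sketch outside of Excel, the quadratic term is simply a second column of the design matrix fit by least squares (the data values below are made up, not the scale-sales data):

```python
import numpy as np

# Made-up data: months employed (x) and scales sold (y)
x = np.array([41.0, 106.0, 76.0, 10.0, 22.0, 12.0, 85.0, 111.0, 40.0, 51.0])
y = np.array([275.0, 296.0, 317.0, 138.0, 189.0, 161.0, 314.0, 299.0, 280.0, 269.0])

# Second-order model: y = b0 + b1*x + b2*x^2, fit by least squares
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)
```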



Modeling Curvilinear Relationship (con’t)

• Excel’s Chart tools can be used to fit a polynomial curve to the data.
(Dialog box is below.)
• To get the dialog box, position the mouse pointer over any data point in
the scatter diagram and right-click.
• The estimated multiple regression equation and multiple coefficient of
determination for this second-order model are also obtained.



Modeling Curvilinear Relationship (con’t)

• Excel’s Chart tools output does not provide any means for testing the
significance of the results, so we need to use Excel’s Regression tool.
• We will treat the values of x2 as a second independent variable (called
MonthSq below).
• Second Independent Variable (MonthSq) Added



Modeling Curvilinear Relationship (con’t)

• Excel’s Regression Tool Output

We should be pleased with the fit provided by the estimated multiple regression equation.

The overall model is significant (p-value for the F test is 8.75E-07)


Modeling Curvilinear Relationship (con’t)

• Excel’s Regression Tool Output

We can conclude that adding MonthSq to the model is significant.



Exercise

• Problem One
Regression analysis was applied between sales data (y in $1000s) and
advertising data (x in $100s) and the following information was obtained.

The F statistic computed from the above data is:
A. 3    B. 45    C. 48    D. 50



Exercise (con’t)

• Problem Two
Regression analysis was applied between sales data (y in $1000s) and
advertising data (x in $100s) and the following information was obtained.

The t statistic for testing the significance of the slope is:
A. 1.80    B. 1.96    C. 6.71    D. 0.56



Part 3
Regression Analysis: Model Building



Outline: Part 3

• General Linear Model


• Determining When to Add or Delete Variables
• Variable Selection Procedures
• Multiple Regression Approach to Experimental Design
• Autocorrelation and the Durbin-Watson Test



General Linear Model

• Models in which the parameters (0, 1, . . . , p ) all have exponents of


one are called linear models.
• A general linear model involving p independent variables is

y = b0 + b1z1 + b2z2 + … bpzp + e

• Each of the independent variables z is a function of x1, x2,..., xk (the


variables for which data have been collected).



General Linear Model (con’t)

• The simplest case is when we have collected data for just one variable x1 and want to estimate y by using a straight-line relationship. In this case z1 = x1.
• This model is called a simple first-order model with one predictor variable.

y = β0 + β1x1 + ε



Modeling Curvilinear Relationships

• To account for a curvilinear relationship, we might set z1 = x1 and z2 = x1².
• This model is called a second-order model with one predictor variable.

y = β0 + β1x1 + β2x1² + ε



Interaction

• If the original data set consists of observations for y and two independent variables x1 and x2 we might develop a second-order model with two predictor variables.

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε

• In this model, the variable z5 = x1x2 is added to account for the potential effects of the two variables acting together.
• This type of effect is called interaction.
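In software, the interaction column is built by multiplying the two predictor columns, just as the squared terms are built by squaring them; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(0, 10, 30), rng.uniform(0, 5, 30)      # synthetic predictors
y = 2 + 0.8 * x1 + 1.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 1, 30)

# z-columns of the general linear model: x1, x2, x1^2, x2^2, and the interaction x1*x2
Z = np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
b, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(b)
```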



Transformation Involving the Dependent Variable

• Often the problem of non-constant variance can be corrected by transforming the dependent variable to a different scale.
• Most statistical packages provide the ability to apply logarithmic
transformations using either the base-10 (common log) or the base e =
2.71828... (natural log).
• Another approach, called a reciprocal transformation, is to use 1/y as
the dependent variable instead of y.



Nonlinear Models That Are Intrinsically Linear

• Models in which the parameters (0, 1, . . . , p ) have exponents other


than one are called nonlinear models.
• In some cases we can perform a transformation of variables that will
enable us to use regression analysis with the general linear model.
• The exponential model involves the regression equation:

𝐸 ( 𝑦 ) =𝛽 0 𝛽 1𝑥

• We can transform this nonlinear model to a linear model by taking the


logarithm of both sides.
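Taking logs gives log E(y) = log β0 + x log β1, which is linear in x, so ordinary least squares on log(y) recovers the parameters after exponentiating; a sketch with synthetic data:

```python
import numpy as np

# Synthetic data roughly following E(y) = 2 * 1.3**x
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 25)
y = 2.0 * 1.3 ** x * np.exp(rng.normal(0, 0.05, x.size))

# Fit the linearized model: log(y) = log(b0) + x * log(b1)
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
b0, b1 = np.exp(coef)              # back-transform to the original scale
print(round(b0, 2), round(b1, 3))  # close to 2 and 1.3
```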



Determining When to Add or Delete Variables

• To test whether the addition of x2 to a model involving x1 (or the deletion of x2 from a model involving x1 and x2) is statistically significant we can perform an F Test.
• The F Test is based on a determination of the amount of reduction in the error sum of squares resulting from adding one or more independent variables to the model.

F = [ (SSE(reduced) − SSE(full)) / (number of extra terms) ] / MSE(full)

F = [ (SSE(x1) − SSE(x1, x2)) / 1 ] / [ SSE(x1, x2) / (n − p − 1) ]
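The same statistic can be computed directly from the error sums of squares of the reduced and full models; a sketch (the numbers passed in are illustrative only, not from the slides):

```python
from scipy import stats

def partial_f_test(sse_reduced: float, sse_full: float, extra_terms: int, n: int, p_full: int):
    """F test for the reduction in SSE gained by adding `extra_terms` variables."""
    df_full = n - p_full - 1
    F = ((sse_reduced - sse_full) / extra_terms) / (sse_full / df_full)
    p_value = stats.f.sf(F, extra_terms, df_full)   # upper-tail probability
    return F, p_value

print(partial_f_test(sse_reduced=1805.0, sse_full=985.0, extra_terms=1, n=25, p_full=2))
```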



Determining When to Add or Delete Variables

• The p-value criterion can also be used to determine whether it is advantageous to add one or more independent variables to a multiple regression model.
• The p-value associated with the computed F statistic can be compared to the level of significance α.
• It is difficult to determine the p-value directly from the tables of the F distribution, but computer software packages, such as Minitab or Excel, provide the p-value.



End

