4.1 Multiple Regression Models

This document discusses multiple regression models, focusing on estimating the relationship between a dependent variable Y and independent variables Xs using linear and polynomial forms. It explains key concepts such as the least squares method, regression analysis of variance, and the significance of R2 and adjusted R2 in assessing model fit. Additionally, it provides examples of regression analysis applied to real data, illustrating how to interpret results and make predictions.



CHAPTER 4 MULTIPLE REGRESSION MODELS

Basic Concept
1. Regression analysis is the process of estimating the functional relationship (that is, of using
statistical methods to obtain an equation) between a random variable Y (called the dependent
variable) and one or more non-random variables Xs (called the independent variables).
2. In multiple regression analysis, we assume the relationship between the dependent variable Y and
the independent variables Xs is linear, as in the following form:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where the ε's (random errors) are assumed to be independently and normally distributed with mean
equal to zero and a constant (but unknown) variance σ².
With these assumptions, the expected value of Y is therefore:
E(Y) = E(β0 + β1X1 + β2X2 + … + βkXk + ε) = β0 + β1X1 + β2X2 + … + βkXk
Note that the regression model may be seemingly non-linear, as in the following forms:
• Y = β0 + β1X1 + β2X1² + ε (polynomial model)
• Y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε (polynomial model)
• Y = β0 X1^β1 X2^β2 ε (multiplicative model)
• Y = β0 + β1(1/X1) + ε (reciprocal model)

A non-linear function that can be transformed into linear form (see the third and fourth examples
above) is called virtually linear (or intrinsically linear); the sketch below illustrates the idea for the
multiplicative model.
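To make the idea of intrinsic linearity concrete, here is a minimal Python sketch (not part of the original notes; the data values are hypothetical) that linearizes the multiplicative model by taking logarithms, so that ln Y = ln β0 + β1 ln X1 + β2 ln X2 + ln ε is linear in the transformed variables and can be fitted by least squares:

import numpy as np

# Hypothetical, noise-free data generated from Y = 2 * X1^1.5 * X2^0.5
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 2.0 * x1 ** 1.5 * x2 ** 0.5

# After the log transform the model is linear: ln y = b0 + b1*ln(x1) + b2*ln(x2)
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(np.exp(b[0]), b[1], b[2])   # recovers approximately (2.0, 1.5, 0.5)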
3. Since the coefficients (the β's) of the above equations are unknown, we need to estimate them
based on the data collected for X and Y. In regression analysis, the method used to estimate
these coefficients is called "the least squares method", so called because the resulting
mathematical function (called the least squares equation, least squares line, or simply regression
line) has the smallest sum of squared estimation errors (minimum SSE = Σ(yi – ŷi)²).

4. The least squares equation can be written as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

where ŷ (the fitted value of y) is used to estimate E(y), and b0, b1, …, bk (the fitted coefficients)
are used to estimate β0, β1, …, βk.

Note that the bi's are obtained by solving the following k + 1 equations (called normal equations)
simultaneously:

∂SSE/∂b0 = 0, ∂SSE/∂b1 = 0, …, ∂SSE/∂bk = 0

or, equivalently, in matrix notation, (X′X)b = X′y (a numerical sketch is given below).

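As a numerical illustration of the least squares method (not part of the original notes; the data values below are hypothetical), the following Python sketch solves the normal equations (X′X)b = X′y directly with numpy and reports the minimized SSE:

import numpy as np

# Hypothetical data: n = 5 observations, k = 2 predictors
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 2.0],
              [5.0, 4.0]])

# Add the intercept column so the design matrix has k + 1 columns
X_design = np.column_stack([np.ones(len(y)), X])

# Solving the k + 1 normal equations (X'X) b = X'y gives the least
# squares estimates b0, b1, ..., bk
b = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

y_hat = X_design @ b
sse = np.sum((y - y_hat) ** 2)   # minimized sum of squared errors
print(b, sse)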

5. Following is the typical information provided by any regression software.


• The Regression Analysis of Variance Table

SOURCE DF SS MS F P
Regression k SSR MSR = SSR/k MSR/MSE p-value
Error n-k-1 SSE MSE = SSE/(n-k-1)
Total n-1 SST

• n = number of observations and k = number of independent variables (predictors)


• SS stands for "Sum of Squares". Hence,
SST = Σ(yi – ȳ)² stands for "Sum of Squares Total",
SSE = Σ(yi – ŷi)² stands for "Sum of Squares Error", and
SSR = SST – SSE stands for "Sum of Squares Regression".

So, the idea here is that total squared variations around the mean of Y (SST) can be
partitioned into two parts: those that are due to random errors (SSE) and those that can
be accounted for by the regression equation (SSR). If a large portion of SST can be
explained by the regression equation (namely, SSR is large relative to SST), then it is an
indication that there may exist a strong linear relationship between Y and Xs (see also the
discussion of R2 below).
• MS stands for "Mean Square". Hence, MSR is "Mean Square Regression" and MSE is
"Mean Square Error".
• MSE, also called s², can be used to approximate σ². Consequently, it can be used to
calculate the confidence interval for E(y) or the prediction interval for y.
• The F-ratio, or the p-value for the F-ratio, can be used to test the overall usefulness of the
model.
Theoretically, we know that (1) if the assumptions about the ε's hold and (2) if there exists no
linear relationship between Y and any of the independent variables Xi (i.e., β1 = β2 = … = βk
= 0), then the sample ratio MSR/MSE will follow an F distribution with k degrees of
freedom in the numerator and n – k – 1 degrees of freedom in the denominator.
Hence, to determine whether there exists any linear relationship between Y and any of the
independent variables Xi, we test the following hypothesis:
H0: There exists no linear relationship between Y and any of the Xi's
H1: There exists a linear relationship between Y and at least one Xi
Or, equivalently,
H0: β1 = β2 = … = βk = 0
H1: At least one of the βi's ≠ 0
Test Statistic: F = MSR/MSE. Decision Rule: Reject H0 if F > F(α; k, n – k – 1).
If the null hypothesis is true, we do not expect the value of the sample statistic F to be too
extreme. The p-value of the test statistic printed in the regression output tells us how
unusual the value of F is if the null hypothesis is true. If the p-value is very small, it is an
indication that the null hypothesis may be false. We will therefore reject H0 and accept
H1. By accepting H1, we conclude that Y is linearly related to at least one Xi.
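The following Python sketch (an illustration added here, not part of the original notes) shows how the ANOVA quantities and the F-test p-value can be computed from the observed values y and the fitted values ŷ; it assumes numpy arrays as inputs and uses scipy for the F distribution:

import numpy as np
from scipy import stats

def anova_table(y, y_hat, k):
    """Return SST, SSR, SSE, MSR, MSE, F and its p-value."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    sse = np.sum((y - y_hat) ** 2)          # error sum of squares
    ssr = sst - sse                         # regression sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    f = msr / mse
    p_value = stats.f.sf(f, k, n - k - 1)   # P(F with k, n-k-1 df exceeds f)
    return sst, ssr, sse, msr, mse, f, p_value

# Hypothetical observed and fitted values for a model with k = 2 predictors
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([10.5, 12.2, 14.6, 18.8, 23.9])
print(anova_table(y, y_hat, k=2))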


• The estimated values of the β's and other related statistics


This information allows us to obtain the regression equation. It also allows us to test
whether there exists any linear relationship between the dependent variable Y and each of the
independent variables Xi, and to construct confidence intervals for the βi's.
• Theoretically, we know that (1) if the assumptions about the ε's hold and (2) if there is no
linear relationship between Xi and Y (i.e., βi = 0), then the sample statistic t = bi / s(bi) will
follow a t distribution with n – k – 1 degrees of freedom. Here, s(bi) is the estimated
standard error of bi.
• Hence, to determine whether there exists any linear relationship between Y and Xi, we test
the following hypothesis:
H0: βi = 0 (There is no linear relationship between Y and Xi.)
H1: βi ≠ 0 (There is a linear relationship between Y and Xi.)

Test Statistic: t = bi / s(bi). Decision Rule: Reject H0 if |t| > t(α/2; n – k – 1).

If the null hypothesis is true, we will not expect the sample statistic t to be either too small
or too large (say, for example, more than two standard errors above or below the mean 0).
The p-value of the test statistic printed in the regression output tells us how unusual the
value of t is if the null hypothesis is true. If the p-value is very small, it is an indication
that the null hypothesis may be false. We will therefore reject H0 and accept
H1. By accepting H1, we conclude that there exists a linear relationship between Y and Xi.

• To construct a confidence interval for βi, we use the formula: bi ± t(α/2; n – k – 1) × s(bi)
(see the sketch below).
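A small Python sketch of the t test and the confidence interval for a single coefficient is given below (illustrative only; the estimate and standard error are hypothetical placeholders, and scipy supplies the t distribution):

from scipy import stats

b_i, se_b_i = 0.540, 0.086   # hypothetical estimate and its standard error
n, k, alpha = 10, 1, 0.05

df = n - k - 1
t_stat = b_i / se_b_i
p_value = 2 * stats.t.sf(abs(t_stat), df)        # two-sided p-value
t_crit = stats.t.ppf(1 - alpha / 2, df)          # t(alpha/2; n-k-1)
ci = (b_i - t_crit * se_b_i, b_i + t_crit * se_b_i)
print(t_stat, p_value, ci)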
• R² (Coefficient of Multiple Determination) and adjusted R² both measure the total
percentage of the variation of Y around the mean of Y that can be explained by the regression
equation. Alternatively, one can interpret R² as "the percentage of prediction error reduced by
using the regression line, instead of the mean of Y, to predict Y". Hence, the higher the value
of R², the better the model fits the data. However, since R² always increases as the set of
independent variables expands, we sometimes use "adjusted R²" to measure the fit of the
model when the sample size relative to the number of independent variables is small.

• The relationship between R², adjusted R², SSE, and F (see the numerical check after this list)

• As more independent variables are added to the model (keeping the original independent
variables in the model), R² always increases and SSE always decreases. The fact that R² =
1 – SSE/SST implies that R² and SSE always move in opposite directions. One can think
of this result intuitively in the following way: the prediction tends to get better when
more information becomes available.

• Adjusted R² = 1 – (1 – R²)(n – 1)/(n – k – 1). Thus, as n → ∞ (keeping k constant),
adjusted R² approaches R².

• F = (R²/k) / [(1 – R²)/(n – k – 1)]. Thus, F and R² are positively related (i.e., a large (small) value
of F implies a large (small) value of R², and vice versa).

• SST remains constant regardless of the number of independent variables in the model.
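The following sketch (added here for illustration) checks these relations numerically, using the sums of squares from the quadratic model fitted later in these notes (SST = 846402, SSE = 15333, n = 10, k = 2); it reproduces the R² ≈ 98.2%, adjusted R² ≈ 97.7%, and F ≈ 189.7 shown in that output:

# Numerical check of the R-squared / adjusted R-squared / F relations
n, k = 10, 2
sst, sse = 846402.0, 15333.0           # sums of squares from the ANOVA table

r2 = 1 - sse / sst                                  # about 0.982
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # about 0.977
f = (r2 / k) / ((1 - r2) / (n - k - 1))             # about 189.7
print(r2, adj_r2, f)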


A Simple Example
In the following example, we will use regression analysis to find the relationship between the size of a
house and its electricity usage. Data collected from 10 houses are given below (File: Electricity-
HomeSize.MTW), where X = size of the house (in square ft) and Y = electricity usage (in kWh).

Y X
1182 1290
1172 1350
1264 1470
1493 1600
1571 1710
1711 1840
1804 1980
1840 2230
1956 2400
1954 2930

A scatter plot of Y against X shows that the two variables seem to have quite a strong positive linear relationship.

The regression output is given below.

Regression Analysis: Y versus X

The regression equation is


Y = 579 + 0.540 X

Predictor Coef SE Coef T P


Constant 578.9 167.0 3.47 0.008
X 0.54030 0.08593 6.29 0.000

S = 133.4 R-Sq = 83.2% R-Sq(adj) = 81.1%

Analysis of Variance

Source DF SS MS F P
Regression 1 703957 703957 39.54 0.000
Residual Error 8 142445 17806
Total 9 846402

From the output, we can see that the R² of the model is fairly high (83.2%). In addition, the very small p-
value (0.000) for t (or F) suggests that we can reject H0: β1 = 0 (and conclude that Y and X are
linearly related). Overall, everything seems to indicate that this is a fairly satisfactory model.
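The same fit can be reproduced outside MINITAB; the sketch below (an illustration, not part of the original notes) uses numpy on the ten observations listed above and recovers the intercept ≈ 579, the slope ≈ 0.540, and R² ≈ 0.832 reported in the output:

import numpy as np

x = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], float)
y = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954], float)

X = np.column_stack([np.ones_like(x), x])     # intercept plus X
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # b[0] ~ 579, b[1] ~ 0.540

y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(b, 1 - sse / sst)                       # R-sq ~ 0.832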


We’ll try to see if we can do better. This time, let’s fit a model with the squared term of ‘sq-ft’ (a
multiple regression model with two variables). In other words, we will fit the following polynomial
model:

Y = β0 + β1X + β2X² + ε

To do so, we need to create another column and the values in this column will be equal to X². The
data and the regression output are shown below.
Y X X2
1182 1290 1664100
1172 1350 1822500
1264 1470 2160900
1493 1600 2560000
1571 1710 2924100
1711 1840 3385600
1804 1980 3920400
1840 2230 4972900
1956 2400 5760000
1954 2930 8584900

Regression Analysis: Y versus X, X-sq

The regression equation is


Y = - 1216 + 2.40 X -0.000450 X-sq

Predictor Coef SE Coef T P


Constant -1216.1 242.8 -5.01 0.002
X 2.3989 0.2458 9.76 0.000
X-sq -0.00045004 0.00005908 -7.62 0.000

S = 46.80 R-Sq = 98.2% R-Sq(adj) = 97.7%

Analysis of Variance

Source DF SS MS F P
Regression 2 831070 415535 189.71 0.000
Residual Error 7 15333 2190
Total 9 846402

Source DF Seq SS
X 1 703957
X-sq 1 127112

Unusual Observations
Obs X Y Fit SE Fit Residual St Resid
10 2930 1954.0 1949.2 44.7 4.8 0.35 X

X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI


1 1141.9 29.3 ( 1072.5, 1211.3) ( 1011.2, 1272.6)

Values of Predictors for New Observations

New Obs X X-sq


1 1300 1690000


Judging from R² and the p-values for F and t, this second model is clearly a much better
model than the first one, and we will therefore accept it.
Since we have decided to accept the second model, we conclude that the relationship between the
electricity usage and the size of the house is the following:
y = –1216.1 + 2.3989x – 0.00045x².
We can then use it to make predictions. For example, the point prediction for the electricity usage of a
1300 square foot house would be:
y = –1216.1 + 2.3989(1300) – 0.00045(1300)² = 1141.9
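For illustration (this is not part of the original notes), the quadratic fit and the point prediction at x = 1300 can also be reproduced with numpy:

import numpy as np

x = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], float)
y = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954], float)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # intercept, X, X-squared
b, *_ = np.linalg.lstsq(X, y, rcond=None)           # ~ (-1216.1, 2.3989, -0.00045)

x_new = 1300.0
y_pred = b[0] + b[1] * x_new + b[2] * x_new ** 2
print(y_pred)                                       # ~ 1141.9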

Interval Estimation for Y


Sometimes, in addition to point estimation, we may want to do interval estimation for Y. To do this,
recall that, in regression analysis, we assume the relationship between Y and Xs is linear, as in the
following form:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where the ε's (epsilons) are random errors. We assume the ε's are independently and normally distributed
with mean equal to zero and a constant (but unknown) variance σ².

Since the point estimate for y is ŷ, and the standard deviation of ε can be estimated by
√MSE (usually denoted as s and called the Standard Error of Estimate), the 95% prediction
interval for the value of y will then be approximately equal to
ŷ ± 2s.
So, going back to the previous example, the point estimate of y is 1141.9 when x is 1300, and
the 95% prediction interval for y when x is 1300 will be approximately:
1141.9 ± 2 × 46.8 = [1048.3, 1235.5]
You should understand, however, that this is only an approximation. It should work
quite well if the sample size is sufficiently large.
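The rough interval above is just the arithmetic ŷ ± 2s with s = √MSE; a tiny Python sketch of the calculation (illustrative only) is:

import math

mse = 2190.0                     # MSE from the quadratic model's ANOVA table
s = math.sqrt(mse)               # standard error of estimate, about 46.8
y_hat_new = 1141.9               # point prediction at x = 1300

lower, upper = y_hat_new - 2 * s, y_hat_new + 2 * s
print(round(lower, 1), round(upper, 1))   # approximately 1048.3 and 1235.5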
Specialized statistical packages such as MINITAB, SPSS, and SAS can all calculate the
interval precisely. For example, in the regression analysis printout above, we can find the following:
1. The 95% confidence interval for E(y), the average electricity usage of a 1300 sq ft house, is
(1072.5, 1211.3).
2. The 95% prediction interval for y, the electricity usage of a 1300 sq ft house, is (1011.2, 1272.6).
Note that the approximation we made above is not too far from this interval, even though the
sample size is only 10.
