4.1 Multiple Regression Models

This document discusses multiple regression models, focusing on estimating the relationship between a dependent variable Y and independent variables Xs using linear and polynomial forms. It explains key concepts such as the least squares method, regression analysis of variance, and the significance of R2 and adjusted R2 in assessing model fit. Additionally, it provides examples of regression analysis applied to real data, illustrating how to interpret results and make predictions.



CHAPTER 4 MULTIPLE REGRESSION MODELS

Basic Concept
1. Regression analysis is the process of estimating the functional relationship (that is, of using
statistical methods to obtain an equation) between a random variable Y (called the dependent
variable) and one or more non-random variables Xs (called the independent variables).
2. In multiple regression analysis, we assume the relationship between the dependent variable Y and
the independent variables Xs is linear, as in the following form:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where the ε's (random errors) are assumed to be independently and normally distributed with mean
equal to zero and a constant (but unknown) variance σ².
With these assumptions, the expected value of Y is therefore:
E(Y) = E(β0 + β1X1 + β2X2 + … + βkXk + ε) = β0 + β1X1 + β2X2 + … + βkXk
Note that the regression model may be seemingly non-linear, as in the following forms:
• Y = β0 + β1X1 + β2X1² + ε (polynomial model)
• Y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε (polynomial model)
• Y = β0 X1^β1 X2^β2 ε (multiplicative model)
• Y = β0 + β1(1/X1) + ε (reciprocal model)

A non-linear function that can be transformed into linear form (see the third and fourth examples
above) is called virtually linear (or intrinsically linear); the sketch below illustrates the idea for the
multiplicative model.
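To make the idea of intrinsic linearity concrete, here is a minimal Python sketch (not part of the original notes; the data values are hypothetical) that linearizes the multiplicative model by taking logarithms, so that ln Y = ln β0 + β1 ln X1 + β2 ln X2 + ln ε is linear in the transformed variables and can be fitted by least squares:

import numpy as np

# Hypothetical, noise-free data generated from Y = 2 * X1^1.5 * X2^0.5
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 2.0 * x1 ** 1.5 * x2 ** 0.5

# After the log transform the model is linear: ln y = b0 + b1*ln(x1) + b2*ln(x2)
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(np.exp(b[0]), b[1], b[2])   # recovers approximately (2.0, 1.5, 0.5)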
3. Since the coefficients (the β's) of the above equations are unknown, we need to estimate them
based on the data collected for X and Y. In regression analysis, the method used to estimate
these coefficients is called "the least squares method", so called because the resulting
mathematical function (called the least squares equation, least squares line, or simply regression
line) has the smallest sum of squared estimation errors (minimum SSE = Σ(yi – ŷi)²).

4. The least squares equation can be written as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

where ŷ (the fitted value of y) is used to estimate E(y), and b0, b1, …, bk (the fitted coefficients)
are used to estimate β0, β1, …, βk.

Note that the bi's are obtained by solving the following k + 1 equations (called normal equations)
simultaneously:

∂SSE/∂b0 = 0, ∂SSE/∂b1 = 0, …, ∂SSE/∂bk = 0

or, equivalently, in matrix notation, (X′X)b = X′y (a numerical sketch is given below).

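As a numerical illustration of the least squares method (not part of the original notes; the data values below are hypothetical), the following Python sketch solves the normal equations (X′X)b = X′y directly with numpy and reports the minimized SSE:

import numpy as np

# Hypothetical data: n = 5 observations, k = 2 predictors
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 2.0],
              [5.0, 4.0]])

# Add the intercept column so the design matrix has k + 1 columns
X_design = np.column_stack([np.ones(len(y)), X])

# Solving the k + 1 normal equations (X'X) b = X'y gives the least
# squares estimates b0, b1, ..., bk
b = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

y_hat = X_design @ b
sse = np.sum((y - y_hat) ** 2)   # minimized sum of squared errors
print(b, sse)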

5. Following is the typical information provided by any regression software.


• The Regression Analysis of Variance Table

SOURCE DF SS MS F P
Regression k SSR MSR = SSR/k MSR/MSE p-value
Error n-k-1 SSE MSE = SSE/(n-k-1)
Total n-1 SST

• n = number of observations and k = number of independent variables (predictors)


• SS stands for "Sum of Squares". Hence,
SST = Σ(yi – ȳ)² stands for "Sum of Squares Total",
SSE = Σ(yi – ŷi)² stands for "Sum of Squares Error", and
SSR = SST – SSE stands for "Sum of Squares Regression".

So, the idea here is that total squared variations around the mean of Y (SST) can be
partitioned into two parts: those that are due to random errors (SSE) and those that can
be accounted for by the regression equation (SSR). If a large portion of SST can be
explained by the regression equation (namely, SSR is large relative to SST), then it is an
indication that there may exist a strong linear relationship between Y and Xs (see also the
discussion of R2 below).
• MS stands for "Mean Square". Hence, MSR is "Mean Square Regression" and MSE is
"Mean Square Error".
• MSE, also called s², can be used to approximate σ². Consequently, it can be used to
calculate the confidence interval for E(y) or the prediction interval for y.
• The F-ratio, or the p-value for the F-ratio, can be used to test the overall usefulness of the
model.
Theoretically, we know that (1) if the assumptions about the ε's hold and (2) if there exists no
linear relationship between Y and any of the independent variables Xi (i.e., β1 = β2 = … = βk
= 0), then the sample ratio MSR/MSE will follow an F distribution with k degrees of
freedom in the numerator and n – k – 1 degrees of freedom in the denominator.
Hence, to determine whether there exists any linear relationship between Y and any of the
independent variables Xi, we test the following hypothesis:
H0: There exists no linear relationship between Y and any of the Xi's
H1: There exists a linear relationship between Y and at least one Xi
Or, equivalently,
H0: β1 = β2 = … = βk = 0
H1: At least one of the βi's ≠ 0
Test Statistic: F = MSR/MSE. Decision Rule: Reject H0 if F > F(α; k, n – k – 1).
If the null hypothesis is true, we do not expect the value of the sample statistic F to be too
extreme. The p-value of the test statistic printed in the regression output tells us how
unusual the value of F is if the null hypothesis is true. If the p-value is very small, it is an
indication that the null hypothesis may be false. We will therefore reject H0 and accept
H1. By accepting H1, we conclude that Y is linearly related to at least one Xi.
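The following Python sketch (an illustration added here, not part of the original notes) shows how the ANOVA quantities and the F-test p-value can be computed from the observed values y and the fitted values ŷ; it assumes numpy arrays as inputs and uses scipy for the F distribution:

import numpy as np
from scipy import stats

def anova_table(y, y_hat, k):
    """Return SST, SSR, SSE, MSR, MSE, F and its p-value."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    sse = np.sum((y - y_hat) ** 2)          # error sum of squares
    ssr = sst - sse                         # regression sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    f = msr / mse
    p_value = stats.f.sf(f, k, n - k - 1)   # P(F with k, n-k-1 df exceeds f)
    return sst, ssr, sse, msr, mse, f, p_value

# Hypothetical observed and fitted values for a model with k = 2 predictors
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([10.5, 12.2, 14.6, 18.8, 23.9])
print(anova_table(y, y_hat, k=2))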


• The estimated values of the β's and other related statistics


This information allows us to obtain the regression equation. It also allows us to test
whether there exists any linear relationship between the dependent variable Y and each of the
independent variables Xi, and to construct confidence intervals for the βi's.
• Theoretically, we know that (1) if the assumptions about the ε's hold and (2) if there is no
linear relationship between Xi and Y (i.e., βi = 0), then the sample statistic t = bi / s(bi) will
follow a t distribution with n – k – 1 degrees of freedom. Here, s(bi) is the estimated
standard error of bi.
• Hence, to determine whether there exists any linear relationship between Y and Xi, we test
the following hypothesis:
H0: βi = 0 (There is no linear relationship between Y and Xi.)
H1: βi ≠ 0 (There is a linear relationship between Y and Xi.)

Test Statistic: t = bi / s(bi). Decision Rule: Reject H0 if |t| > t(α/2; n – k – 1).

If the null hypothesis is true, we will not expect the sample statistic t to be either too small
or too large (say, for example, more than two standard errors above or below the mean 0).
The p-value of the test statistic printed in the regression output tells us how unusual the
value of t is if the null hypothesis is true. If the p-value is very small, it is an indication
that the null hypothesis may be false. We will therefore reject H0 and accept
H1. By accepting H1, we conclude that there exists a linear relationship between Y and Xi.

• To construct a confidence interval for βi, we use the formula: bi ± t(α/2; n – k – 1) × s(bi)
(see the sketch below).
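A small Python sketch of the t test and the confidence interval for a single coefficient is given below (illustrative only; the estimate and standard error are hypothetical placeholders, and scipy supplies the t distribution):

from scipy import stats

b_i, se_b_i = 0.540, 0.086   # hypothetical estimate and its standard error
n, k, alpha = 10, 1, 0.05

df = n - k - 1
t_stat = b_i / se_b_i
p_value = 2 * stats.t.sf(abs(t_stat), df)        # two-sided p-value
t_crit = stats.t.ppf(1 - alpha / 2, df)          # t(alpha/2; n-k-1)
ci = (b_i - t_crit * se_b_i, b_i + t_crit * se_b_i)
print(t_stat, p_value, ci)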
• R² (Coefficient of Multiple Determination) and adjusted R² both measure the total
percentage of the variation of Y around the mean of Y that can be explained by the regression
equation. Alternatively, one can interpret R² as "the percentage of prediction error reduced by
using the regression line, instead of the mean of Y, to predict Y". Hence, the higher the value
of R², the better the model fits the data. However, since R² always increases as the set of
independent variables expands, we sometimes use "adjusted R²" to measure the fit of the
model when the sample size relative to the number of independent variables is small.

• The relationship between R², adjusted R², SSE, and F (see the numerical check after this list)

• As more independent variables are added to the model (keeping the original independent
variables in the model), R² always increases and SSE always decreases. The fact that R² =
1 – SSE/SST implies that R² and SSE always move in opposite directions. One can think
of this result intuitively in the following way: the prediction tends to get better when
more information becomes available.

• Adjusted R² = 1 – (1 – R²)(n – 1)/(n – k – 1). Thus, as n → ∞ (keeping k constant),
adjusted R² approaches R².

• F = (R²/k) / [(1 – R²)/(n – k – 1)]. Thus, F and R² are positively related (i.e., a large (small) value
of F implies a large (small) value of R², and vice versa).

• SST remains constant regardless of the number of independent variables in the model.
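The following sketch (added here for illustration) checks these relations numerically, using the sums of squares from the quadratic model fitted later in these notes (SST = 846402, SSE = 15333, n = 10, k = 2); it reproduces the R² ≈ 98.2%, adjusted R² ≈ 97.7%, and F ≈ 189.7 shown in that output:

# Numerical check of the R-squared / adjusted R-squared / F relations
n, k = 10, 2
sst, sse = 846402.0, 15333.0           # sums of squares from the ANOVA table

r2 = 1 - sse / sst                                  # about 0.982
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # about 0.977
f = (r2 / k) / ((1 - r2) / (n - k - 1))             # about 189.7
print(r2, adj_r2, f)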


A Simple Example
In the following example, we will use regression analysis to find the relationship between the size of a
house and its electricity usage. Data collected from 10 houses are given below (File: Electricity-
HomeSize.MTW), where X = size of the house (in square ft) and Y = electricity usage (in kWh).

Y X
1182 1290
1172 1350
1264 1470
1493 1600
1571 1710
1711 1840
1804 1980
1840 2230
1956 2400
1954 2930

A scatter plot of Y against X shows that the two variables seem to have quite a strong positive linear relationship.

The regression output is given below.

Regression Analysis: Y versus X

The regression equation is


Y = 579 + 0.540 X

Predictor Coef SE Coef T P


Constant 578.9 167.0 3.47 0.008
X 0.54030 0.08593 6.29 0.000

S = 133.4 R-Sq = 83.2% R-Sq(adj) = 81.1%

Analysis of Variance

Source DF SS MS F P
Regression 1 703957 703957 39.54 0.000
Residual Error 8 142445 17806
Total 9 846402

From the output, we can see that the R² of the model is fairly high (83.2%). In addition, the very small p-
value (0.000) for t (or F) suggests that we can reject H0: β1 = 0 (and conclude that Y and X are
linearly related). Overall, everything seems to indicate that this is a fairly satisfactory model.
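The same fit can be reproduced outside MINITAB; the sketch below (an illustration, not part of the original notes) uses numpy on the ten observations listed above and recovers the intercept ≈ 579, the slope ≈ 0.540, and R² ≈ 0.832 reported in the output:

import numpy as np

x = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], float)
y = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954], float)

X = np.column_stack([np.ones_like(x), x])     # intercept plus X
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # b[0] ~ 579, b[1] ~ 0.540

y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(b, 1 - sse / sst)                       # R-sq ~ 0.832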


We’ll try to see if we can do better. This time, let’s fit a model with the squared term of ‘sq-ft’ (a
multiple regression model with two variables). In other words, we will fit the following polynomial
model:

Y = β0 + β1X + β2X² + ε

To do so, we need to create another column and the values in this column will be equal to X². The
data and the regression output are shown below.
Y X X2
1182 1290 1664100
1172 1350 1822500
1264 1470 2160900
1493 1600 2560000
1571 1710 2924100
1711 1840 3385600
1804 1980 3920400
1840 2230 4972900
1956 2400 5760000
1954 2930 8584900

Regression Analysis: Y versus X, X-sq

The regression equation is


Y = - 1216 + 2.40 X -0.000450 X-sq

Predictor Coef SE Coef T P


Constant -1216.1 242.8 -5.01 0.002
X 2.3989 0.2458 9.76 0.000
X-sq -0.00045004 0.00005908 -7.62 0.000

S = 46.80 R-Sq = 98.2% R-Sq(adj) = 97.7%

Analysis of Variance

Source DF SS MS F P
Regression 2 831070 415535 189.71 0.000
Residual Error 7 15333 2190
Total 9 846402

Source DF Seq SS
X 1 703957
X-sq 1 127112

Unusual Observations
Obs X Y Fit SE Fit Residual St Resid
10 2930 1954.0 1949.2 44.7 4.8 0.35 X

X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI


1 1141.9 29.3 ( 1072.5, 1211.3) ( 1011.2, 1272.6)

Values of Predictors for New Observations

New Obs X X-sq


1 1300 1690000


Judging from R² and the p-values for F and t, this second model is clearly a much better
model than the first one, and we will therefore accept it.
Since we have decided to accept the second model, we conclude that the relationship between the
electricity usage and the size of the house is the following:
y = –1216.1 + 2.3989x – 0.00045x².
We can then use it to make predictions. For example, the point prediction for the electricity usage of a
1300 square foot house would be:
y = –1216.1 + 2.3989(1300) – 0.00045(1300)² = 1141.9
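For illustration (this is not part of the original notes), the quadratic fit and the point prediction at x = 1300 can also be reproduced with numpy:

import numpy as np

x = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], float)
y = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954], float)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # intercept, X, X-squared
b, *_ = np.linalg.lstsq(X, y, rcond=None)           # ~ (-1216.1, 2.3989, -0.00045)

x_new = 1300.0
y_pred = b[0] + b[1] * x_new + b[2] * x_new ** 2
print(y_pred)                                       # ~ 1141.9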

Interval Estimation for Y


Sometimes, in addition to point estimation, we may want to do interval estimation for Y. To do this,
recall that, in regression analysis, we assume the relationship between Y and Xs is linear, as in the
following form:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where the ε's (epsilons) are random errors. We assume the ε's are independently and normally distributed
with mean equal to zero and a constant (but unknown) variance σ².

Since the point estimate for y is ŷ, and the standard deviation of ε can be estimated by
√MSE (usually denoted as s and called the Standard Error of Estimate), the 95% prediction
interval for the value of y will then be approximately equal to
ŷ ± 2s.
So, going back to the previous example, the point estimate of y is 1141.9 when x is 1300, and
the 95% prediction interval for y when x is 1300 will be approximately:
1141.9 ± 2 × 46.8 = [1048.3, 1235.5]
You should understand, however, that this is only an approximation. It should work
quite well if the sample size is sufficiently large.
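The rough interval above is just the arithmetic ŷ ± 2s with s = √MSE; a tiny Python sketch of the calculation (illustrative only) is:

import math

mse = 2190.0                     # MSE from the quadratic model's ANOVA table
s = math.sqrt(mse)               # standard error of estimate, about 46.8
y_hat_new = 1141.9               # point prediction at x = 1300

lower, upper = y_hat_new - 2 * s, y_hat_new + 2 * s
print(round(lower, 1), round(upper, 1))   # approximately 1048.3 and 1235.5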
Specialized statistical packages such as MINITAB, SPSS, and SAS can all calculate the
interval precisely. For example, in the regression analysis printout above, we can find the following:
1. The 95% confidence interval for E(y), the average electricity usage of a 1300 sq ft house, is
(1072.5, 1211.3).
2. The 95% prediction interval for y, the electricity usage of a 1300 sq ft house, is (1011.2, 1272.6).
Note that the approximation we made above is not too far from this interval, even though the
sample size is only 10.
