Chapter 3
Multiple Regression
[ Cross-Sectional Data ]
Simple Regression
A statistical model that uses a single quantitative independent variable X to predict a quantitative dependent variable Y; i.e., it considers the relation between one explanatory variable and the response variable:
Yi = β0 + β1Xi + εi
Multiple Regression
A statistical model that uses two or more quantitative and qualitative explanatory variables (X1, ..., Xp) to predict a quantitative dependent variable Y.
Caution: the model must include at least two explanatory variables.
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y:
Yi = β0 + β1X1i + β2X2i + ... + βpXpi + εi
Simple vs. Multiple
Simple regression:
• β1 represents the unit change in Y per unit change in X.
• Does not take into account any other variable besides the single independent variable.
• r2: proportion of variation in Y predictable from X.
Multiple regression:
• βi represents the unit change in Y per unit change in Xi.
• Takes into account the effect of the other independent variables.
• R2: proportion of variation in Y predictable from the set of X's.
Goal
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.
Multiple Regression Models
• Linear models: linear (first-order), dummy variable, interaction, polynomial, square root, log, reciprocal.
• Non-linear models: exponential.
The Multiple Linear Regression Model
Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi
• The coefficients of the multiple regression model
are estimated using sample data with k
independent variables
Ŷi = b0 + b1X1i + b2X2i + ... + bkXki
where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, ..., bk are the estimated slope coefficients.
• Interpretation of the slopes (referred to as net regression coefficients):
– b1 = the change in the mean of Y per unit change in X1, taking into account the effect of X2 (i.e., net of X2).
– b0 = the Y intercept, interpreted the same as in simple regression.
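As a concrete illustration, the estimated coefficients b0, b1, ..., bk can be obtained by solving the normal equations (X'X)b = X'y. Below is a minimal sketch in pure Python; the data values are made up so that y is generated exactly as 2 + 3·x1 + 0.5·x2, which lets OLS recover those coefficients exactly:

```python
def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    n, k = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Augmented matrix [X'X | X'y], forward elimination with partial pivoting
    A = [XtX[i] + [Xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (A[r][k] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    return b

# Made-up data generated from y = 2 + 3*x1 + 0.5*x2 (no error term);
# the first column of ones estimates the intercept b0.
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6]]
y = [6.0, 8.5, 13.0, 15.5, 20.0]
b0, b1, b2 = fit_ols(X, y)
```

In practice a statistical package does this step; the sketch only shows what "estimated using sample data with k independent variables" means numerically.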
ASSUMPTIONS
• Linear regression model: The regression model is linear in the
parameters, though it may or may not be linear in variables.
• The X variables are independent of the error term. This means we require zero covariance between ui and each X variable:
cov(ui, X1i) = cov(ui, X2i) = ... = cov(ui, Xki) = 0
• Zero mean value of disturbance ui. Given the value of Xi, the
mean, or the expected value of the random disturbance term ui is
zero.
E(ui)= 0 for each i
• Homoscedasticity or constant variance of ui. This implies that the variance of the error term is the same, regardless of the value of X:
var(ui) = σ2
• No auto-correlation between the disturbance terms.
Interval estimate for a slope coefficient: bi ± t(n−k−1) · Sbi
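The interval estimate for a slope coefficient, bi ± t(n−k−1) · Sbi, is straightforward to compute once the estimate and its standard error are known. A small sketch; the coefficient, standard error, and critical value below are made-up illustrations (t = 2.086 is the two-sided 95% critical value with 20 degrees of freedom):

```python
def t_interval(b_i, s_bi, t_crit):
    """Confidence interval for a slope: b_i +/- t_crit * s_bi."""
    margin = t_crit * s_bi
    return (b_i - margin, b_i + margin)

# Hypothetical slope estimate b_i = 1.5 with standard error 0.3
low, high = t_interval(b_i=1.5, s_bi=0.3, t_crit=2.086)
```

If the interval excludes zero, the slope is statistically significant at that level.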
Dummy independent Variables
Describing Qualitative Information
• In regression analysis the dependent variable can be
influenced by variables that are essentially qualitative in
nature,
such as sex, race, color, religion, nationality, geographical
region, political upheavals, and party affiliation.
• One way we could “quantify” such attributes is by
constructing artificial variables that take on values of 1 or 0,
1 indicating the presence (or possession) of that attribute and 0
indicating the absence of that attribute.
• Variables that assume such 0 and 1 values are called dummy/
indicator/ binary/ categorical/ dichotomous variables.
Example 1:
Yi = β1 + β2Di + ui
where Y = annual salary of a college professor, and
Di = 1 if male college professor
   = 0 otherwise
Example 2: suppose health care expenditure Y is regressed on income Xi and on education, entered through the dummies D2 (high school) and D3 (college), with the "less than high school education" category as the base category. Therefore, the intercept will reflect the intercept for this category:
Yi = α1 + α2D2i + α3D3i + βXi + ui
The mean health care expenditure functions for the three levels of education, namely, less than high school, high school, and college, are:
E(Yi | D2 = 0, D3 = 0, Xi) = α1 + βXi
E(Yi | D2 = 1, D3 = 0, Xi) = (α1 + α2) + βXi
E(Yi | D2 = 0, D3 = 1, Xi) = (α1 + α3) + βXi
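Encoding the education categories as dummies can be sketched as follows. The coefficient values below are invented purely to show how the dummies shift the intercept by group, with "less than high school" as the base category:

```python
def dummies(education):
    """Map an education level to (D2, D3); base = 'less than high school'."""
    return (1 if education == "high school" else 0,
            1 if education == "college" else 0)

def mean_expenditure(education, x, a1=2.0, a2=0.5, a3=1.5, beta=0.2):
    # E(Y | D2, D3, X) = a1 + a2*D2 + a3*D3 + beta*X  (made-up coefficients)
    d2, d3 = dummies(education)
    return a1 + a2 * d2 + a3 * d3 + beta * x

# At the same income x = 10, the three groups differ only in the intercept
base = mean_expenditure("less than high school", 10)  # a1 + beta*10
hs   = mean_expenditure("high school", 10)            # (a1 + a2) + beta*10
col  = mean_expenditure("college", 10)                # (a1 + a3) + beta*10
```

Note that only two dummies are needed for three categories; including a third would duplicate the intercept (the dummy variable trap).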
Assumptions and Procedures to Conduct Multiple
Linear Regression
When you choose to analyse your data using multiple
regression, part of the process involves checking to make sure
that the data you want to analyse can actually be analysed using
multiple regression.
You need to do this because it is only appropriate to use
multiple regression if your data "passes" eight assumptions that
are required for multiple regression to give you a valid result.
In practice, checking these eight assumptions just adds a little more time to your analysis, but it is not a difficult task.
Let's take a look at these eight assumptions:
Assumption #1:
Your dependent variable should be measured on a continuous scale.
Assumption #2:
You should have two or more independent variables, which can be either continuous or categorical (dummy).
Assumption #3:
You should have independence of observations (i.e.,
independence of residuals), which you can easily check using the
Durbin-Watson statistic.
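The Durbin-Watson statistic itself is simple to compute from the residuals: values near 2 suggest no first-order autocorrelation, values near 0 or 4 suggest positive or negative autocorrelation. A sketch with invented residuals:

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals show strong negative autocorrelation, so DW is near 4
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
```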
Assumption #4:
There needs to be a linear relationship between
(a) the dependent variable and each of your independent variables, and
(b) the dependent variable and the independent variables collectively.
Assumption #5:
Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.
Assumption #6:
Your data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly
correlated with each other.
This leads to problems in understanding which independent variable contributes to the variance explained in the dependent variable.
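With just two independent variables, multicollinearity can be gauged from their correlation r, since the variance inflation factor is VIF = 1 / (1 − r²); values above roughly 5 to 10 are usually taken as a warning sign. A sketch with invented data chosen so that r works out to exactly 0.5:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

vif = vif_two_predictors([1, 2, 3], [1, 3, 2])  # r = 0.5 here, so VIF = 4/3
```

With more than two predictors, each VIF comes from regressing that predictor on all the others, which statistical packages report directly.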
Assumption #7:
There should be no significant outliers, high-leverage points, or highly influential points.
These can distort the regression output and reduce both the predictive accuracy and the statistical significance of your results.
Assumption #8:
Finally, you need to check that the residuals (errors) are
approximately normally distributed.
Two common methods to check this assumption include using:
(a) a histogram and a Normal P-P Plot; or (b) a Normal Q-Q Plot of
the studentized residuals.
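A quick numerical complement to those plots is the sample skewness of the residuals, which should be near zero for symmetric, roughly normal errors. A sketch with invented residuals:

```python
import math

def skewness(e):
    """Sample skewness: mean of cubed standardized deviations."""
    n = len(e)
    m = sum(e) / n
    s = math.sqrt(sum((v - m) ** 2 for v in e) / n)
    return sum(((v - m) / s) ** 3 for v in e) / n

# A perfectly symmetric set of residuals has skewness 0
sk = skewness([-2.0, -1.0, 0.0, 1.0, 2.0])
```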
You can check assumptions #3, #4, #5, #6, #7 and #8 using
SPSS Statistics.
Assumptions #1 and #2 should be checked first, before
moving onto assumptions #3, #4, #5, #6, #7 and #8.
Just remember that if you do not check these assumptions correctly, the results you get when running multiple regression might not be valid.
This is why we are concerned with the assumptions and procedures of multiple regression, to help you get this right.
Given the assumptions and data on a dependent variable Y and a set of potential explanatory variables (X1, ..., XK), the following are suggested steps for conducting a multiple linear regression:
1. Select variables that you believe are linearly related to the dependent variable.
2. Use statistical software to generate the coefficients and the statistics used to assess the model.
3. Diagnose violations of the required conditions/assumptions. If there are problems, attempt to remedy them.
4. Assess the model’s fit.
Three statistics that perform this function are
the standard error of estimate,
the coefficient of determination, and
the F-test of the analysis of variance.
5. If we are satisfied with the model's fit and the required conditions are met, we can interpret and test the coefficients.
6. We use the model to predict a value of the dependent variable
or estimate the expected value of the dependent variable.
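The three fit statistics in step 4 can all be computed directly from the observed and fitted values. A sketch with made-up numbers, assuming k = 2 predictors and n = 4 observations:

```python
import math

def fit_statistics(y, y_hat, k):
    """Return (standard error of estimate, R-squared, F-statistic)."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)            # total variation
    sse = sum((v - f) ** 2 for v, f in zip(y, y_hat))  # unexplained variation
    ssr = sst - sse                                    # explained variation
    se = math.sqrt(sse / (n - k - 1))                  # standard error of estimate
    r2 = 1 - sse / sst                                 # coefficient of determination
    f = (ssr / k) / (sse / (n - k - 1))                # ANOVA F statistic
    return se, r2, f

# Invented observed values and fitted values from a hypothetical model
se, r2, f = fit_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], k=2)
```

A small standard error, an R² near 1, and a large F statistic together indicate a well-fitting model.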
Regression Output Interpretation
Example
In a study of consumer demand (Qd), multiple regression analysis
is done to examine the relationship between quantity demanded and
four potential predictors.
The four independent variables are: price, income, tax, and price of related goods.
The output for this example is interpreted as follows:
Model Fit
[Model Summary table output not reproduced]
a. Predictors: (Constant), price of related goods, income of the consumer, commodity tax, price of the product