Chapter 3

This document discusses regression analysis techniques, including simple and multiple linear regression. It defines simple regression as using one independent variable to predict a dependent variable, while multiple regression uses two or more independent variables. The key assumptions of multiple regression are outlined, such as having a continuous dependent variable, independence of observations, linear relationships between variables, homoscedasticity, no multicollinearity, no outliers, and normally distributed residuals. Steps for conducting multiple regression are also summarized, including checking that the data meets these assumptions.


Regression Analysis

Multiple Regression
[ Cross-Sectional Data ]
Simple Regression
 A statistical model that utilizes one quantitative independent variable "X" to predict the quantitative dependent variable "Y."
 i.e., considers the relation between a single explanatory variable and a response variable:
Yi = β0 + β1Xi + εi
Multiple Regression
 A statistical model that utilizes two or more quantitative and qualitative explanatory variables (X1, …, Xk) to predict a quantitative dependent variable Y.
Caution: there must be at least two explanatory variables.
 Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y:
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi
Simple vs. Multiple
• In simple regression, β represents the unit change in Y per unit change in X; in multiple regression, βi represents the unit change in Y per unit change in Xi.
• Simple regression does not take into account any variable besides the single independent variable; multiple regression takes into account the effect of the other independent variables.
• r2: proportion of variation in Y predictable from X; R2: proportion of variation in Y predictable from the set of X's.
Goal
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.
Multiple Regression Models
• Linear models: linear, dummy variable, interaction
• Non-linear models: polynomial, square root, log, reciprocal, exponential
The Multiple Linear Regression Model Building
 Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).
 Multiple regression model with k independent variables:
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi
where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.
• The coefficients of the multiple regression model are estimated using sample data with k independent variables:
Ŷi = b0 + b1X1i + b2X2i + … + bkXki
where Ŷi is the estimated (or predicted) value of Y, b0 the estimated intercept, and b1, …, bk the estimated slope coefficients.
• Interpretation of the slopes (referred to as net regression coefficients):
– b1 = the change in the mean of Y per unit change in X1, taking into account the effect of X2 (or net of X2).
– b0 = the Y-intercept; its interpretation is the same as in simple regression.
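These estimated coefficients can be obtained by ordinary least squares. A minimal sketch with NumPy on made-up data (the variable names and values are illustrative, not from the chapter):

```python
import numpy as np

# Illustrative data: n = 6 observations, k = 2 explanatory variables,
# generated from an exact linear relation so the fit is easy to check.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = 1.0 + 2.0 * X1 + 0.5 * X2

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates b = (b0, b1, b2)
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(b, 3))  # recovers the true coefficients (1, 2, 0.5)
```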
ASSUMPTIONS
• Linear regression model: The regression model is linear in the
parameters, though it may or may not be linear in variables.
• The X variables are independent of the error term. This means that we require zero covariance between ui and each X variable:
cov(ui, X1i) = cov(ui, X2i) = … = cov(ui, Xki) = 0
• Zero mean value of disturbance ui. Given the value of Xi, the
mean, or the expected value of the random disturbance term ui is
zero.
E(ui)= 0 for each i
• Homoscedasticity or constant variance of ui . This implies that
the variance of the error term is the same, regardless of the value
of X.
var (ui) = σ2
• No auto-correlation between the disturbance terms:
cov(ui, uj) = 0 for i ≠ j
 This implies that the observations are sampled independently.
• The number of observations n must be greater than the number of parameters to be estimated.
• There must be variation in the values of the X variables.
Technically, var(X) must be a positive number.
• No exact collinearity between the X variables. i.e.
 No multicollinearity: No exact linear relationship exists between
any of the explanatory variables.
Estimation of parameters and standard errors
The coefficient of determination and test of model adequacy
Are Individual Variables Significant?
• Use t-tests of individual variable slopes
• Shows if there is a linear relationship between the variable Xi
and Y; Hypotheses:
• H0: βi = 0 (no linear relationship)

• H1: βi ≠ 0 (linear relationship does exist between Xi and Y)


bi  0
• Test Statistic: tn  k  1 
Sb
i

• Confidence interval for the population slope βi

bi  tnk 1Sbi
Dummy Independent Variables
Describing Qualitative Information
• In regression analysis the dependent variable can be
influenced by variables that are essentially qualitative in
nature,
such as sex, race, color, religion, nationality, geographical
region, political upheavals, and party affiliation.
• One way we could “quantify” such attributes is by
constructing artificial variables that take on values of 1 or 0,
 1 indicating the presence (or possession) of that attribute and 0
indicating the absence of that attribute.
• Variables that assume such 0 and 1 values are called dummy/
indicator/ binary/ categorical/ dichotomous variables.
Example 1:
Yi = α1 + α2Di + ui
where Yi = annual salary of a college professor
Di = 1 if male college professor
   = 0 otherwise (i.e., female professor)
The model may enable us to find out whether sex makes any difference in a college professor's salary, assuming, of course, that all other variables such as age, degree attained, and years of experience are held constant.
Mean salary of female college professor: E(Yi | Di = 0) = α1
Mean salary of male college professor: E(Yi | Di = 1) = α1 + α2
α2 tells by how much the mean salary of a male college professor differs from the mean salary of his female counterpart.
A test of the null hypothesis that there is no sex discrimination (H0: α2 = 0) can easily be made by finding out whether, on the basis of the t test, the estimated α2 is statistically significant.
Example 2:
Yi = α1 + α2Di + βXi + ui
where Xi = years of teaching experience
Mean salary of female college professor: E(Yi | Xi, Di = 0) = α1 + βXi
Mean salary of male college professor: E(Yi | Xi, Di = 1) = (α1 + α2) + βXi
The male and female college professors' salary functions in relation to the years of teaching experience have the same slope (β) but different intercepts:
 Male intercept = α1 + α2
 Female intercept = α1
 Difference = α2
Note: If a qualitative variable has ‘m’ categories, introduce only ‘m-1’
dummy variables.
The group, category, or classification that is assigned the value of 0 is
often referred to as the base, benchmark, control, comparison, reference,
or omitted category.
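The m − 1 rule can be sketched in code: for a qualitative variable with m = 3 education categories (as in Example 3 below), only two dummies are created, with "less than high school" as the omitted base category. A hypothetical encoding, not code from the chapter:

```python
# Education has m = 3 categories, so introduce only m - 1 = 2 dummies.
# "less than high school" is the omitted (base) category: both dummies 0.
def encode_education(level):
    d2 = 1 if level == "high school" else 0
    d3 = 1 if level == "college" else 0
    return (d2, d3)

print(encode_education("less than high school"))  # → (0, 0)
print(encode_education("high school"))            # → (1, 0)
print(encode_education("college"))                # → (0, 1)
```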
Example 3: qualitative variable with more than two classes
we want to regress the annual expenditure on health care by an
individual on the income and education of the individual.
Yi = α1 + α2D2i + α3D3i + βXi + ui
where Yi = annual expenditure on health care
Xi = annual income
D2 = 1 if high school education
   = 0 otherwise
D3 = 1 if college education
   = 0 otherwise
We take the "less than high school education" category as the base category. Therefore, the intercept α1 will reflect the intercept for this category.
• the mean health care expenditure functions for the three
levels of education, namely, less than high school, high
school, and college:
E (Yi | D2  0, D3  0, X i )   1  X i

E (Yi | D2  1, D3  0, X i )  ( 1   2 )  X i

E (Yi | D2  0, D3  1, X i )  ( 1   3 )  X i
Assumptions and Procedures to Conduct Multiple
Linear Regression
When you choose to analyse your data using multiple
regression, part of the process involves checking to make sure
that the data you want to analyse can actually be analysed using
multiple regression.
 You need to do this because it is only appropriate to use
multiple regression if your data "passes" eight assumptions that
are required for multiple regression to give you a valid result.
 In practice, checking for these eight assumptions just adds a
little bit more time to your analysis but it is not a difficult task.
 let's take a look at these eight assumptions:
Assumption #1:
 Your dependent variable should be measured on a continuous scale.
Assumption #2:
 You should have two or more independent variables, which can be either continuous or categorical (dummy).
Assumption #3:
 You should have independence of observations (i.e.,
independence of residuals), which you can easily check using the
Durbin-Watson statistic.
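The Durbin-Watson statistic can be computed directly from the residuals; values near 2 suggest no first-order autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation. A minimal sketch on made-up residuals:

```python
def durbin_watson(residuals):
    # DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4,
    # with values near 2 indicating no first-order autocorrelation.
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals (negative autocorrelation) push DW toward 4
print(round(durbin_watson([1, -1, 1, -1, 1, -1]), 2))  # → 3.33
```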
Assumption #4:
There needs to be a linear relationship between
 (a) the dependent variable and each of your independent variables, and
 (b) the dependent variable and the independent variables collectively.
Assumption #5:
 Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.
Assumption #6:
 Your data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly
correlated with each other.
 This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable.
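A common numerical check for multicollinearity is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing X_j on the other independent variables; values above about 10 are often taken as a warning sign. A sketch with NumPy on made-up data (the data and threshold are illustrative):

```python
import numpy as np

def vif(X, j):
    # Regress column j of X on the remaining columns (plus an intercept)
    # and return 1 / (1 - R^2) for that auxiliary regression.
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)              # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])
print(vif(X, 0) > 10, vif(X, 2) < 10)  # → True True
```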
Assumption #7:
There should be no significant outliers, high leverage points or
highly influential points.
These can distort the output that a statistics package produces and reduce the predictive accuracy of your results, as well as their statistical significance.
Assumption #8:
Finally, you need to check that the residuals (errors) are
approximately normally distributed.
Two common methods to check this assumption include using:
(a) a histogram and a Normal P-P Plot; or (b) a Normal Q-Q Plot of
the studentized residuals.
You can check assumptions #3, #4, #5, #6, #7 and #8 using
SPSS Statistics.
Assumptions #1 and #2 should be checked first, before
moving onto assumptions #3, #4, #5, #6, #7 and #8.
 Just remember that if you do not run the statistical tests on
these assumptions correctly, the results you get when running
multiple regression might not be valid.
This is why we focus on the assumptions and procedures of multiple regression, to help you get this right.
Given the assumptions and data on a dependent variable Y and a set of potential explanatory variables (X1, …, Xk), the following is a suggested procedure for conducting multiple linear regression:
1. Select variables that you believe are linearly related to the dependent variable.
2. Use a computer and software to generate the coefficients and the statistics used to assess the model.
3. Diagnose violations of the required conditions/assumptions.
 If there are problems, attempt to remedy them.
4. Assess the model’s fit.
 Three statistics that perform this function are
 the standard error of estimate,
 the coefficient of determination, and
 the F-test of the analysis of variance.
5. If we are satisfied with the model's fit and the required conditions are met, we can interpret the coefficients and test them.
6. Use the model to predict a value of the dependent variable or estimate its expected value.
Regression Output Interpretation
Example
 In a study of consumer demand (Qd), multiple regression analysis is done to examine the relationship between quantity demanded and four potential predictors.
The four independent variables are: price, income, tax, and price of related goods.
The output for this example is interpreted as follows:
 Model Fit
Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate


1 .991a .982 .978 .334
a. Predictors: (Constant), price of related goods, income of the consumer, commodity tax, price of the
product
The multiple correlation coefficient (R) is 0.991.
This indicates a strong positive correlation between the set of independent variables and the dependent variable.
For multiple regression this statistic ranges from 0 to 1, and it does not by itself indicate the statistical significance of the correlation.
The R2 is 0.982.
This means that the independent variables explain 98.2% of the
variation in the dependent variable.
The adjusted R-square, a measure of explanatory power, is 0.978.
 This statistic is not generally interpreted because it is neither a
percentage (like the R2), nor a test of significance (the F-statistic).
The standard error of the regression is 0.334, which is an estimate of the variation of the observed quantity demanded, in quantity terms, about the regression line.
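These three model-fit statistics can be reproduced from the sums of squares reported in the ANOVA table (SSR = 93.124, SSE = 1.676, SST = 94.800, with n = 20 observations and k = 4 predictors):

```python
import math

SSR, SSE, SST = 93.124, 1.676, 94.800  # regression, residual, total SS
n, k = 20, 4                            # observations, predictors

r2 = SSR / SST                                        # R Square
adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))    # Adjusted R Square
se = math.sqrt(SSE / (n - k - 1))                     # Std. Error of the Estimate

print(round(r2, 3), round(adj_r2, 3), round(se, 3))   # → 0.982 0.978 0.334
```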
ANOVAb
Model Sum of Squares(SS) df Mean Square F Sig.
1 Regression 93.124 4 23.281 208.359 .000a
Residual 1.676 15 .112
Total 94.800 19

a. Predictors: (Constant), price of related goods, income of the consumer, commodity tax, price
of the product

b. Dependent Variable: quantity demanded

The p value for the F statistic is < .01.


 This means that at least one of the independent variables is a
significant predictor of the dependent variable (quantity demanded).
This indicates rejection of the null hypothesis.
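The F statistic itself is the ratio of the mean squares, each sum of squares divided by its degrees of freedom:

```python
SSR, SSE = 93.124, 1.676
df_reg, df_resid = 4, 15   # k and n - k - 1 from the ANOVA table

MSR = SSR / df_reg         # mean square regression
MSE = SSE / df_resid       # mean square residual
F = MSR / MSE

# ≈ 208.36; the table reports 208.359 (the tiny gap comes from
# rounding in the reported sums of squares)
print(round(MSR, 3), round(MSE, 3), round(F, 2))  # → 23.281 0.112 208.36
```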
Interpreting Parameter Values (Model Coefficients)
The results of the estimated regression line include the estimated
coefficients, the standard error of the coefficients, the calculated t-
statistic, the corresponding p-value, and the bounds of both the 95%
and the 90% confidence intervals.
Coefficientsa
Unstandardized (B, Std. Error) and standardized (Beta) coefficients, with 95% confidence interval bounds for B:

Model                   B        Std. Error   Beta    t       Sig.   Lower Bound   Upper Bound
Constant                7.309    1.290                5.665   .000   4.559         10.060
price                   -.471**  .167         -.361   -2.825  .013   -.827         -.116
income                  .027**   .009         .275    2.844   .012   .007          .046
tax                     -.615*   .159         -.375   -3.856  .002   -.954         -.275
price of related goods  .013     .095         .010    .138    .892   -.189         .215
Finally, the above table will help us determine whether quantity
demanded and explanatory variables are significantly related, and
the direction and strength of their relationship.
 The prediction equation is based on the unstandardized coefficients and written as:
Qd = 7.3 − 0.47 price + 0.03 income − 0.62 tax
Results of the multiple linear regression model showed that out of
the 4 explanatory variables that were entered to the model, 3 of
them, namely price of the product, income of the consumer and
commodity tax were found to be statistically significant.
 Results of the statistically significant variables are discussed as
follows:
The Constant is the predicted value of quantity demanded when all
of the independent variables have a value of zero.
 The b coefficient associated with price (−0.47) is negative, indicating an inverse relationship in which a higher price of the product is associated with lower quantity demanded.
 For the independent variable price, the probability of the t
statistic (-2.825) for the b coefficient is less than the level of
significance of 0.05.
 We reject the null hypothesis that the slope associated with
price is equal to zero and conclude that there is a statistically
significant relationship between price and quantity demanded.
 A unit increase/decrease in the price of the product leads to a 0.47 decrease/increase in quantity demanded, ceteris paribus.
 The income variable is found to be positively and
significantly (at 5% probability level) related to quantity
demanded.
 The result confirmed that holding other independent variables
constant, for every one unit increase/decrease in income,
quantity demanded will increase/decrease by 0.03 units.
 The tax coefficient is highly significant (at the 1% probability level) and carries a negative sign.
 The slope of tax is −0.62. This means that for every one unit increase/decrease in tax on a commodity, quantity demanded will decrease/increase by 0.62 units, ceteris paribus.
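As a quick sketch of step 6 of the procedure (prediction), the fitted equation can be evaluated at chosen predictor values; the input values below are illustrative, not from the chapter:

```python
def predict_qd(price, income, tax):
    # Fitted equation from the regression output:
    # Qd = 7.3 - 0.47*price + 0.03*income - 0.62*tax
    return 7.3 - 0.47 * price + 0.03 * income - 0.62 * tax

# Illustrative predictor values (not from the chapter)
print(round(predict_qd(price=5.0, income=100.0, tax=2.0), 2))  # → 6.71
```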
