
Data Management & Analysis

(PBH 711)

Lecture 5: Linear Regression Models


March 22, 2024
REVIEW

• Requirements for simple linear regression

• Assumptions of linear regression models


Relationship Between Two Variables
(Bivariate Relationship)

• To understand a statistical relationship between two variables, we measure both variables on the same individuals.

• Often the relationship between two variables can be strongly influenced by other variables.
Simple Linear Regression Model

Y = β0 + β1X + ε

where:
  Y  = dependent variable
  X  = independent or "explanatory" variable
  β0 = constant or intercept parameter
  β1 = slope parameter
  ε  = error term
Simple Linear Regression Model

The figure above is borrowed from the web (Stephen Baraka)


Linear regression graph

Y(age at marriage) = β0 + β1 X(education)

Regression of age at marriage on years of education
β0 = intercept
β1 = slope

[Figure: fitted regression line; x-axis: education in single years (0–15), y-axis: age at marriage (16–20)]
Example: Simple Linear Regression Model
in STATA
• Open the dataset provided in the last class (BDHS IR file, 2017–18)

• Our objective is to explore the association between education and age at marriage

• What is the outcome variable here, and how may we find it in the dataset?

• What is the independent variable here, and how may we obtain it in the dataset?
Example: Simple Linear Regression Model
in STATA
• Once we have identified the variables, what would be our first step in examining the bivariate association?

• What would we do to find the strength of the linear relationship between the two variables?

• How may we identify the change in the average age at marriage for a one-year increase in education?

• How, and for what values, may we predict the average age at marriage for an individual who completed 12 years of education?
Example: Simple Linear Regression Model
in STATA
• use ZZIR62FL.dta, clear /* open the DHS IR model dataset */

• summarize v133 v511 /* v133 = education in single years; v511 = age at first marriage/cohabitation */

• scatter v511 v133 /* visual check of the bivariate relationship */

• pwcorr v511 v133, star(0.05) /* strength of the linear relationship */

• regress v511 v133 /* fit the simple linear regression */
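Stata's regress does the fitting above; as a check on what least squares actually computes, the slope, intercept, and a prediction for 12 years of education can be reproduced from the closed-form formulas. A minimal Python sketch on toy data (the numbers are hypothetical, not the BDHS file):

```python
# Least-squares estimates for simple linear regression, from the
# closed-form formulas b1 = Sxy/Sxx and b0 = ybar - b1*xbar.
# Toy data (hypothetical): years of education vs. age at marriage.
educ = [0, 5, 8, 10, 12, 16]
age_m = [16.0, 17.0, 17.5, 18.0, 19.0, 20.0]

n = len(educ)
xbar = sum(educ) / n
ybar = sum(age_m) / n

sxx = sum((x - xbar) ** 2 for x in educ)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(educ, age_m))

b1 = sxy / sxx            # slope: change in mean age per extra year of education
b0 = ybar - b1 * xbar     # intercept: predicted age at zero years of education

# Predicted average age at marriage for someone with 12 years of education
yhat_12 = b0 + b1 * 12
print(b0, b1, yhat_12)
```

The same quantities come out of `regress v511 v133` in Stata; the point here is only that the coefficients are simple functions of the sums of squares.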


Example: Simple Linear Regression Model
diagnostics in STATA (Linearity)

predict res_std, rstandard /* standardized residuals after -regress- */

hist res_std, kdensity saving(rno.gph, replace) /* distribution of residuals */

qnorm res_std, saving(qno.gph, replace) /* quantile-normal plot */


Example: Simple Linear Regression Model
diagnostics in STATA (Normality)
predict resid, residual

label var resid "Residuals of educ and age at M"

histogram resid, normal /*graphical check*/

swilk resid /* Statistical test*/


Example: Simple Linear Regression Model
diagnostics in STATA (Const. Variance)

rvfplot /*graphical check*/

/* "rvfplot" must be entered after the regression is run – Stata uses
the stored estimates from the most recent -regress- to create this plot */

estat hettest /* Statistical test*/


Simple Versus Multiple Regression
• Simple regression is bivariate linear regression in which one dependent variable, y, is predicted by one independent variable, x.

• There are situations in which a dependent variable may be predicted using more than one independent variable: BMI may be predicted using variables such as height, age, eating habits, and genetic factors.
Simple Versus Multiple Regression

• Regression analysis with two or more independent variables is called multiple regression analysis.

• Multiple regression is similar in principle to simple regression. However, it is more complex conceptually and computationally.
Multiple Regression Models
A Description of the Model and its Coefficients
• In multiple regression analysis, the dependent variable, y, is sometimes referred to as the response variable.

• Unlike in simple regression, the slope coefficients, βi, associated with the independent variables are called partial coefficients.

• βi represents the increase or decrease that will occur in the (average) value of y for a one-unit increase in that independent variable, holding all other variables constant.
Multiple Regression Model with Two Independent
Variables
Determining the Multiple Regression Equation
• The procedure for determining the multiple regression coefficients, bi, is similar to that of simple regression

• They are derived so as to minimize the sum of squared errors for the model

• The analysis is referred to as least squares analysis

• Calculus minimization and maximization techniques, treating the coefficients as the unknowns, are applied, leading to k+1 equations in k+1 unknowns
Determining the Multiple Regression Equation

• Thus a regression with two independent variables will generate three (3) simultaneous equations in three unknowns (b0, b1, b2)

• The process of solving these equations by hand is tedious and time-consuming. Many software programs exist that do this more efficiently.
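To make the least-squares machinery concrete, the k+1 = 3 normal equations (X'X)b = X'y can be set up and solved in a few lines. A minimal Python sketch on toy data constructed so that y = 1 + 2·x1 + 3·x2 exactly (hypothetical numbers, not the real-estate data):

```python
# Solve the three normal equations (X'X)b = X'y for a model with
# two predictors, using Gaussian elimination with back-substitution.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
y  = [6.0, 5.0, 10.0, 9.0]   # built as y = 1 + 2*x1 + 3*x2 exactly

# Design matrix rows: (1, x1, x2)
X = [[1.0, a, b] for a, b in zip(x1, x2)]

# Build X'X (3x3) and X'y (3x1)
XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)]
       for i in range(3)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]

# Gaussian elimination on the augmented 3x4 matrix
A = [row[:] + [v] for row, v in zip(XtX, Xty)]
for col in range(3):
    # partial pivoting: bring the largest entry in this column up
    piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(col + 1, 3):
        f = A[r][col] / A[col][col]
        A[r] = [a - f * b for a, b in zip(A[r], A[col])]

# Back-substitution recovers b0, b1, b2
coef = [0.0] * 3
for i in range(2, -1, -1):
    coef[i] = (A[i][3] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]

b0, b1, b2 = coef
print(b0, b1, b2)
```

Because the toy data lie exactly on a plane, the solution recovers the true coefficients; with real data the same equations give the least-squares fit through a cloud of points.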
Determining the Multiple Regression Equation
Example: Hand/spreadsheet Calculation
Real Estate Data
A Scatter Plot of Real Estate Data
Response Surface or Plane for First-Order Two-Predictor Multiple Regression Model
• Note that most of the points are not on the surface. As was
the case with simple regression, an error in the fit of the
model is usually present.
Minitab Output for the Real Estate Example
Predicting the Price of a Home
Evaluating the Multiple Regression Model
Testing the Overall Model for the Real Estate Example

H0: β1 = β2 = 0
Ha: At least one of the regression coefficients is ≠ 0

F.01,2,20 = 5.85
FCal = 28.63 > 5.85, so reject H0.

MSR = SSR / k        MSE = SSE / (n − k − 1)        F = MSR / MSE

ANOVA
                    df        SS        MS        F      p
Regression           2   8189.723   4094.86   28.63   .000
Residual (Error)    20   2861.017    143.1
Total               22  11050.74
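The F statistic in the table follows directly from the sums of squares; a quick Python check of the arithmetic:

```python
# Reproduce the overall F test from the real-estate ANOVA table:
# MSR = SSR/k, MSE = SSE/(n-k-1), F = MSR/MSE.
ssr, sse = 8189.723, 2861.017
k, n = 2, 23                 # two predictors, 23 observations

msr = ssr / k                # mean square regression (2 df)
mse = sse / (n - k - 1)      # mean square error (20 df)
f_stat = msr / mse           # compare with F(.01, 2, 20) = 5.85
print(msr, mse, f_stat)
```

Since 28.63 far exceeds the critical value 5.85, the model as a whole is significant at the .01 level.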
Significance Test of the Regression
Coefficients for the Real Estate Example

H0: β1 = 0        Ha: β1 ≠ 0
H0: β2 = 0        Ha: β2 ≠ 0

t.025,20 = 2.086

|tCal| = 5.63 > 2.086 for x1 and |−2.92| > 2.086 for x2, so reject H0 for both coefficients.

                Coefficients    Std Dev     t Stat      p
x1 (Sq. Feet)       0.0177      0.003146     5.63     .000
x2 (Age)           −0.666       0.2280      −2.92     .008
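Each t statistic is just the coefficient divided by its standard error; a quick Python check against the table:

```python
# Coefficient t tests: t = b_i / SE(b_i), compared with the
# critical value t(.025, 20) = 2.086 from the slide.
coef_sqfeet, se_sqfeet = 0.0177, 0.003146
coef_age,    se_age    = -0.666, 0.2280

t_sqfeet = coef_sqfeet / se_sqfeet   # square footage: well above 2.086
t_age    = coef_age / se_age         # age: |t| also exceeds 2.086
print(t_sqfeet, t_age)
```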
Residuals and Sum of Squares Error
for the Real Estate Example
Minitab Residual Diagnostics
for the Real Estate Problem
Residual Plots for Price

[Figure: four Minitab residual plots — normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, and residuals versus the order of the data.]
Definition and Interpretation of Adjusted R2

• Sometimes additional independent variables add no significant information to the regression model

• Adjusted R2 takes into consideration both the additional information each new variable brings to the regression model and the changed degrees of freedom of the regression

• The gap between adjusted R2 and R2 increases as nonsignificant independent variables are added to the regression model

• The gap shrinks as n increases (for a fixed number of predictors)
Calculating Adjusted R2 Real Estate Example
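Using the real-estate ANOVA numbers (n = 23, k = 2), adjusted R2 can be computed from the degrees-of-freedom-corrected formula adj R2 = 1 − [SSE/(n−k−1)] / [SSyy/(n−1)]; a quick Python check:

```python
# Adjusted R-squared penalizes extra predictors through degrees
# of freedom: adj R2 = 1 - [SSE/(n-k-1)] / [SSyy/(n-1)].
# Numbers from the real-estate example (n = 23 observations, k = 2).
sse, ss_yy = 2861.017, 11050.74
n, k = 23, 2

r2     = 1 - sse / ss_yy                              # plain R-squared
adj_r2 = 1 - (sse / (n - k - 1)) / (ss_yy / (n - 1))  # adjusted version
print(r2, adj_r2)
```

The adjusted value is a bit below R2; the gap would widen if nonsignificant predictors were added.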
SSE and Standard Error of the Estimate
ANOVA
                    df        SS        MS        F      P
Regression           2    8189.7    4094.9    28.63   .000
Residual (Error)    20    2861.0     143.1
Total               22   11050.7

Se = √(SSE / (n − k − 1)) = √(2861 / (23 − 2 − 1)) = 11.96

where: n = number of observations
       k = number of independent variables
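The standard error of the estimate is the square root of the mean square error; a quick Python check of the value above:

```python
from math import sqrt

# Standard error of the estimate: s_e = sqrt(SSE / (n - k - 1)).
# Real-estate example: SSE = 2861, n = 23 observations, k = 2 predictors.
sse, n, k = 2861.0, 23, 2
s_e = sqrt(sse / (n - k - 1))   # typical size of a residual, in price units
print(s_e)
```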
SSE and Standard Error of the Estimate

[Figure: partition of the total variation SSyy into SSR and SSE]

ANOVA
                    df        SS        MS        F      p
Regression           2    8189.7    4094.89   28.63   .000
Residual (Error)    20    2861.0     143.1
Total               22   11050.7

R² = SSR / SSyy = 8189.723 / 11050.74 = .741
R² = 1 − SSE / SSyy = 1 − 2861.017 / 11050.74 = .741
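Because the total variation partitions as SSyy = SSR + SSE, the two formulas for R2 must agree; a quick Python check:

```python
# R-squared two equivalent ways, using the identity SSyy = SSR + SSE:
#   R2 = SSR/SSyy   and   R2 = 1 - SSE/SSyy.
ssr, sse = 8189.723, 2861.017
ss_yy = ssr + sse          # total sum of squares

r2_a = ssr / ss_yy         # explained share of total variation
r2_b = 1 - sse / ss_yy     # one minus the unexplained share
print(ss_yy, r2_a, r2_b)
```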
Interpretation of
Coefficient of Multiple Determination
• The coefficient of multiple determination (R2) is analogous to r2 in simple regression, but its computation and interpretation involve all of the predictors jointly.

• R2 represents the proportion of total variation in the dependent variable (around its mean as a predictor) accounted for collectively by the independent variables in the regression.

• R2 does not tell you which independent variables are contributing, or how much each contributes, to explaining the total variation in Y. It considers them altogether.
Interpretation of
Coefficient of Multiple Determination

• As is the case with simple regression r2, 0 ≤ R2 ≤ 1, and it has the same meaning:
‒ That is, 0 means that the independent variables collectively explain none of the variation in Y, so that the mean of Y would be as good an estimator or better. 1 means that all the variation in Y is collectively explained by the independent variables, and thus all the points lie on the regression surface or plane.
Interpreting Multiple Regression
Example Computer Output
Using Regression Analysis: Some Caveats

• Influential Observations – outliers are atypical data points that can "influence" (for the worse) the nature of the regression equation.

• Extrapolation – for predictive purposes such as forecasting over a period of time, extrapolating the equation outside the range of the independent variable(s) is risky.

• Nonlinearity – regression analysis assumes that the relationship between the dependent and independent variables is a straight line; this may not always be the case. The sample relationship may indicate linearity even when the overall relationship does not.
Using Regression Analysis: Some Caveats

• Collinearity – occurs when two or more independent variables are highly correlated among themselves. Multicollinearity (collinearity) inflates the variance of the regression coefficients, making them unstable.

• Small Samples in Relation to the Number of Variables – if the sample size is not substantially larger than the number of variables, it is possible to get a large R2 value even when there is no real relationship.
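A standard way to quantify the collinearity caveat above is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors; with only two predictors this reduces to their squared correlation. A Python sketch on toy, near-collinear data (hypothetical numbers; VIF is not on the slide but is the usual diagnostic):

```python
# Variance inflation factor for the two-predictor case:
# VIF = 1 / (1 - r^2), where r is the correlation between x1 and x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 7.0]   # roughly 2*x1, so highly correlated

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
sxx = sum((a - m1) ** 2 for a in x1)
syy = sum((b - m2) ** 2 for b in x2)

r2 = sxy ** 2 / (sxx * syy)   # squared correlation between the predictors
vif = 1 / (1 - r2)
print(r2, vif)                 # a VIF well above 10 signals a problem
```

In Stata, `estat vif` after `regress` reports the same quantity for each predictor.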
Exercise: Multiple Linear Regression Model
• use the Stata auto dataset (sysuse auto, clear) and consider predicting price based on mpg, weight, and length

• explore linearity and do a correlation analysis

• run the regression analysis and identify the significant variables that predict price

• how may you interpret the adjusted R-squared and the model coefficients?
