Simple and Multiple Linear Regression
Chapter - 1
Business Analytics
1. Covariance
2. Correlation Coefficient
3. Regression
What is Covariance?
• Covariance is a statistical measure of the degree to which two variables move together.
• The sample covariance is given by:

cov_xy = Σ (xi − x̄)(yi − ȳ) / (n − 1)
What is Correlation Coefficient?
• Correlation is a measure of the strength of the linear relationship between two variables.
• The correlation coefficient is given by:
ρ_xy = cov_xy / (σ_x · σ_y)   (population)      r_xy = cov_xy / (s_x · s_y)   (sample)
• Population correlation is denoted by ρ (rho)
• Sample correlation is denoted by r. It is an estimate of ρ, in the same way that X̄ (the sample mean) is an estimate of μ and S² (the sample variance) is an estimate of σ² (the population variance).
[Scatter plots illustrating correlations of r = −1, r = −0.6, r = 0, r = +0.3, and r = +1]
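The two measures are easy to compute directly. Below is a minimal R sketch with made-up toy values for x and y (the data and variable names are illustrative, not from the chapter), showing that the correlation coefficient is simply the covariance rescaled by the two standard deviations.

    # Toy data (hypothetical values, for illustration only)
    x <- c(2, 4, 6, 8, 10)
    y <- c(5, 9, 8, 14, 16)

    # Sample covariance: sum((x - mean(x)) * (y - mean(y))) / (n - 1)
    cov_xy <- cov(x, y)

    # Correlation = covariance divided by the product of the standard deviations
    r_manual  <- cov_xy / (sd(x) * sd(y))
    r_builtin <- cor(x, y)

    print(c(covariance = cov_xy, r_manual = r_manual, r_builtin = r_builtin))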
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at
least one independent variable
• Explain the impact of changes in an independent variable on the
dependent variable
• Dependent variable:
The variable we wish to predict or explain
• Independent variable:
The variable used to predict or explain the dependent variable
[Diagram: regression relates the dependent variable to the independent variable(s)]
Types of Relationships
• Positive linear relationship: the values of Y generally increase linearly as X increases.
• Negative linear relationship: as X increases, the values of Y generally decrease.
• Positive curvilinear relationship: the values of Y increase as X increases, but the increase tapers off beyond certain values of X.
• Negative curvilinear relationship: an exponential relationship between X and Y in which Y decreases very rapidly as X first increases, but then decreases much less rapidly as X increases further.
Types of Relationships… Continued
[Additional scatter plots illustrating relationship patterns between X and Y]
Linear Regression Model
Linear Regression Equation:

Yi = a + bi·Xi + ei

(Yi = dependent variable, a = population Y-intercept, bi = population slope coefficient, Xi = independent variable, ei = random error term)
where:
Y: the variable that we are trying to predict
Xi: the variables that we are using to predict Y
a: the intercept (the estimated average value of Y when the value
of X is zero)
bi: the slope(s) (the estimated change in the average value of Y for a one-unit increase in X)
ei: the regression residual or the error term
Simple Linear Regression Model (continued)
[Diagram of the simple linear regression model Yi = a + bXi + ei: the observed value of Y for Xi, the predicted value of Y for Xi, the random error ei for this Xi value, slope = b, and intercept = a]
Simple Linear Regression Equation (Prediction Line)
Ŷi = a + b·Xi

where Ŷi is the estimated (or predicted) Y value for observation i, a is the estimate of the regression intercept, b is the estimate of the regression slope, and Xi is the value of X for observation i.
Example: Using Excel Data Analysis Function
1. Choose Data
2. Choose Data Analysis
3. Choose Regression
4. Enter the Y range and X range and the desired options
Example: Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df    SS           MS           F         Significance F
Regression    1    18934.9348   18934.9348   11.0848   0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000
[Scatter plot: House Price ($1000s) versus Square Feet]
[Scatter plot with fitted regression line: House Price ($1000s) versus Square Feet; intercept = 98.248, slope = 0.10977]
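For readers who prefer R to Excel's Data Analysis tool, a minimal sketch of the same kind of fit is shown below. The data frame, its values, and the column names houseprice and squarefeet are hypothetical stand-ins, not the ten houses behind the Excel output above.

    # Hypothetical example data (NOT the sample summarized in the Excel output)
    houses <- data.frame(
      squarefeet = c(1100, 1400, 1600, 1700, 1875, 2350),
      houseprice = c(199, 245, 312, 279, 308, 405)   # in $1000s
    )

    # Fit the simple linear regression: houseprice = a + b * squarefeet
    fit <- lm(houseprice ~ squarefeet, data = houses)

    summary(fit)   # R-squared, adjusted R-squared, residual std. error, coefficient table
    anova(fit)     # Regression and Residual df, SS, MS, F, and its p-value
    coef(fit)      # intercept a and slope b of the prediction line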
Coefficient of determination – R Square
• R² always lies between 0 and 1 (0 ≤ R² ≤ 1).
• The higher the value, the better the fit. The coefficient of determination is symbolized by R² because it is the square of the coefficient of correlation, symbolized by r:

R² = r²

where:
R² = Coefficient of determination
r = Simple correlation coefficient
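Continuing the hypothetical houses data frame and fit object from the R sketch above, the identity R² = r² for simple linear regression can be checked numerically:

    r  <- cor(houses$squarefeet, houses$houseprice)   # simple correlation coefficient
    R2 <- summary(fit)$r.squared                      # coefficient of determination

    all.equal(R2, r^2)   # TRUE, up to floating-point rounding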
R Square
• R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by the variation in x.
• 0 < R² < 1: a weaker linear relationship between x and y; some, but not all, of the variation in y is explained by the variation in x.
• R² = 0: no linear relationship between x and y; none of the variation in y is explained by the variation in x.
Standard error of the estimate: sε = 41.33032 (the "Standard Error" reported in the Excel Regression Statistics above).
• Residuals:
The residuals are the differences between the actual values of the variable you're predicting and the predicted values from your regression.
• Significance Stars:
The stars are shorthand for significance levels, with the number of asterisks displayed according to the computed p-value. The more punctuation there is next to your variables, the better.
Blank = bad, Dots = pretty good, Stars = good,
More Stars = very good
Explanation for Model Summary Statistics
Standard Error:
Measure of the variability in the estimate for the coefficient.
Lower means better but this number is relative to the value of the
coefficient.
Residual Std. Error / Degrees of Freedom :
The Residual Std. Error is just the standard deviation of your
residuals. The Degrees of Freedom is the difference between the number
of observations included in your training sample and the number of
variables used in your model (intercept counts as a variable).
R-squared:
Metric for evaluating the goodness of fit of your model. Higher is
better with 1 being the best. Corresponds with the amount of variability
in what you're predicting that is explained by the model.
Explanation for Model Summary Statistics
• Adjusted R-squared :
The adjusted R-squared is a modified version of R-squared that
has been adjusted for the number of predictors in the model. The
adjusted R-squared increases only if the new term improves the model
more than would be expected by chance.
• p-value:
The p-value for each coefficient tests the null hypothesis that the coefficient equals zero (i.e., that the predictor has no effect). In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable.
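Continuing the hypothetical fit object from the earlier sketch, these summary statistics can also be extracted programmatically in R (the component names below are R's own; the model object is the assumed one from above):

    s <- summary(fit)

    s$coefficients    # Estimate, Std. Error, t value, and Pr(>|t|) for each term
    s$sigma           # residual standard error (std. deviation of the residuals)
    s$r.squared       # R-squared
    s$adj.r.squared   # adjusted R-squared
    # print(s) additionally shows the significance stars next to each p-value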
Assumptions of Regression
1. Linearity
2. Autocorrelation
Yi = b0 + b1·X1i + b2·X2i + … + bk·Xki + εi
where:
Yi = ith observation of dependent variable Y
Xki = ith observation of kth independent variable X
b0 = intercept term
bk = slope coefficient of kth independent variable
εi = error term of ith observation
n = number of observations
k = total number of independent variables
Multiple Regression
• Now, taking it one step further, we add the amount of radio play to the equation.
• This turns into a multiple regression analysis: with more than one predictor, the regression line becomes a regression plane (a 3-D model).
• It looks more complicated, but the
principles remain the same as linear
regression
Multiple Regression
Neighborhood  SellingPrice ($1000s)  HouseSize (100s sq ft)  Bedrooms  Bathrooms  GarageSize
Northwest 290 21 4 2 2
South 95 11 2 1 0
Northeast 170 19 3 2 2
Northwest 375 38 5 4 3
West 350 24 4 3 2
South 125 10 2 2 0
West 310 31 4 4 2
West 275 25 3 2 2
Northwest 340 27 5 3 3
Northeast 215 22 4 3 2
Northwest 295 20 4 3 2
South 190 24 4 3 2
Northwest 385 36 5 4 3
West 430 32 5 4 2
South 185 14 3 2 1
South 175 18 4 2 2
Northeast 190 19 4 2 2
Northwest 330 29 4 4 3
West 405 33 5 4 3
Northeast 170 23 4 2 2
West 365 34 5 4 3
Northwest 280 25 4 2 2
South 135 17 3 1 1
Northeast 205 21 4 3 2
West 260 26 4 3 2
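A minimal R sketch for fitting a multiple regression to data like the table above is shown below. It assumes the rows have been saved to a file named homes.csv with the column names used in this chapter (SellingPrice, HouseSize, Bedrooms, Bathrooms, GarageSize); the file name is an assumption, not part of the original example.

    # Read the housing data (assumed file name and column headers)
    homes <- read.csv("homes.csv")

    # Scatterplot matrix of selected numeric columns, as in the figure below
    pairs(homes[, c("SellingPrice", "HouseSize", "Bedrooms")])

    # Multiple regression: one response, several predictors
    full_fit <- lm(SellingPrice ~ HouseSize + Bedrooms + Bathrooms + GarageSize,
                   data = homes)
    summary(full_fit)   # coefficients, p-values, R-squared, adjusted R-squared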
[Scatterplot matrix: SellingPrice plotted against HouseSize and Bedrooms]
• We begin by testing whether the explanatory variables collectively
have an effect on the response variable.
• We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).
• In R we can perform partial F-tests by fitting both the reduced and full
models separately and thereafter comparing them using the anova
function.
• Ex. Suppose we include the variables bathrooms and GarageSize in our
model and are interested in testing whether the number of bathrooms
and GarageSize are significant after taking HouseSize into
consideration.
• Now, when comparing the models, our hypotheses become H0: the coefficients of Bathrooms and GarageSize are both equal to zero, versus Ha: at least one of these coefficients is not zero (see the R sketch below).
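A minimal sketch of this partial F-test in R, reusing the hypothetical homes data frame assumed in the earlier sketch (the Bathrooms and GarageSize column names follow the chapter's wording):

    # Reduced model: HouseSize only
    reduced <- lm(SellingPrice ~ HouseSize, data = homes)

    # Full model: adds Bathrooms and GarageSize
    full <- lm(SellingPrice ~ HouseSize + Bathrooms + GarageSize, data = homes)

    # Partial F-test comparing the SSE of the reduced and full models
    anova(reduced, full)   # small p-value rejects H0 that both added coefficients are zero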
• Ex. Obtain a 95% confidence interval for the mean sales price of
houses whose HouseSize is 2,150 square feet.
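A sketch of this interval in R using predict() on the HouseSize-only model from the sketch above. Because HouseSize in the table is recorded in hundreds of square feet, 2,150 square feet corresponds to HouseSize = 21.5 (an assumption based on the data's scale):

    # 95% confidence interval for the MEAN selling price at 2,150 sq ft
    predict(reduced,
            newdata  = data.frame(HouseSize = 21.5),
            interval = "confidence", level = 0.95)

    # For the price of one individual new house, use interval = "prediction" (wider)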
[Figure: fitted regression line Ŷ = b0 + b1X showing a low variance of the residual terms at small values of X and a high variance of the residual terms at large values of X, i.e. heteroskedasticity]
Detecting Heteroskedasticity
• Heteroskedasticity can be detected either by viewing scatter plots of the residuals, as discussed in the previous case, or by the Breusch-Pagan chi-square test.
• In the Breusch-Pagan chi-square test, the squared residuals are regressed on the independent variables to check whether the independent variables explain a significant proportion of the squared residuals or not.
• If the independent variables explain a significant proportion of the squared residuals, then we conclude that conditional heteroskedasticity is present; otherwise it is not.
Detecting Heteroskedasticity
• The Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is the number of independent variables:

BP = n × R²resid

where:
n = number of observations
R²resid = coefficient of determination when the squared residuals are regressed on the independent variables
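A minimal sketch of the test in R, reusing the hypothetical full model from the earlier sketch; the manual version mirrors the n × R²resid statistic described above, and bptest() from the lmtest package gives a packaged equivalent:

    # Manual Breusch-Pagan: regress the squared residuals on the independent variables
    aux <- lm(resid(full)^2 ~ HouseSize + Bathrooms + GarageSize, data = homes)
    bp_stat <- nobs(full) * summary(aux)$r.squared           # n * R^2 of the auxiliary fit
    p_value <- pchisq(bp_stat, df = 3, lower.tail = FALSE)   # k = 3 independent variables
    c(bp_stat = bp_stat, p_value = p_value)

    # Packaged version (install.packages("lmtest") once beforehand)
    library(lmtest)
    bptest(full, studentize = FALSE)   # classic Breusch-Pagan form, matching n * R^2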
• Imagine you are watching a rocket take off nearby and measuring the
distance it has traveled once each second. In the first couple of
seconds your measurements may be accurate to the nearest
centimeter, say. However, 5 minutes later as the rocket recedes into
space, the accuracy of your measurements may only be good to 100 m,
because of the increased distance, atmospheric distortion and a
variety of other factors. The data you collect would exhibit
heteroscedasticity.
Serial Correlation
• Serial correlations are often found in repeating patterns, when the level of a variable affects its future level.
• In finance, serial correlation is used by technical analysts to determine
how well the past price of a security predicts the future price.
• For example, if we are predicting the growth of stock dividends, an
overestimate in one year is likely to lead to overestimates in
succeeding years.
Appendix
Measures of Variation
• Total variation is made up of two parts: SST = SSR + SSE, where SST = Σ(Yi − Ȳ)² is the total sum of squares, SSR = Σ(Ŷi − Ȳ)² is the regression sum of squares, and SSE = Σ(Yi − Ŷi)² is the error (residual) sum of squares. (From the Excel ANOVA output above: 32,600.5 = 18,934.9 + 13,665.6.)
where:
Ȳ = Mean value of the dependent variable
Yi = Observed value of the dependent variable
Ŷi = Predicted value of Y for the given Xi value
Coefficient of determination (R2) & Adjusted R2
• The coefficient of determination (R²) can also be used to test the significance of the coefficients collectively, apart from using the F-test.
Ra² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)
where
n = Number of Observations
k = Number of Independent Variables
Ra2 = Adjusted R2
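As a quick check, plugging the values from the simple-regression Excel output earlier in the chapter (R² = 0.58082, n = 10, k = 1) into this formula reproduces the reported Adjusted R Square:

    R2 <- 0.58082   # R Square from the Excel output
    n  <- 10        # Observations
    k  <- 1         # one independent variable (square feet)

    1 - (n - 1) / (n - k - 1) * (1 - R2)   # ~0.5284, matching "Adjusted R Square" 0.52842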
Analysis of Variance (ANOVA)
– Analysis of variance is a statistical method for analyzing the
variability of the data by breaking the variability into its
constituents.
– A typical ANOVA table looks like:
Source of Variability   DoF         Sum of Squares   Mean Sum of Squares
Regression              k           SSR              MSR = SSR / k
Residual (Error)        n − k − 1   SSE              MSE = SSE / (n − k − 1)
Total                   n − 1       SST

Standard Error of Estimate (SEE) = √(SSE / (n − 2)) = √MSE for simple regression (k = 1).
(For the Excel output above: SEE = √1708.1957 ≈ 41.33, matching the reported Standard Error.)