
Business Analytics

Chapter - 1

Business Analytics

Simple and Multiple Linear Regression


Agenda

1. Covariance

2. Correlation Coefficient

3. Regression

What is Covariance?
• Covariance is a statistical measure of the degree to which two variables move together.

• The sample covariance is calculated as


cov_{xy} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

What is Correlation Coefficient?
• Correlation is a measure of the strength of linear relationship
between two variables
• The correlation coefficient is given by:
\rho_{xy} = \frac{cov_{xy}}{\sigma_x \sigma_y}
• Population correlation is denoted by ρ (rho)
• Sample correlation is denoted by r. It is an estimate of ρ in the same way that \bar{X} (the sample mean) is an estimate of μ (the population mean) and s² (the sample variance) is an estimate of σ² (the population variance).


What is Correlation Coefficient?
Features of ρ and r :
• Unit free and ranges between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
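A minimal R sketch, reusing the made-up x and y from the covariance example, showing that the correlation is just the covariance scaled by the two standard deviations and is therefore unit free:

x <- c(2, 4, 6, 8, 10)
y <- c(3, 7, 5, 11, 14)
cov(x, y) / (sd(x) * sd(y))   # correlation from the formula above
cor(x, y)                     # identical value; always between -1 and 1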
Trends of Approximate r Values
• Can we fit a linear equation between X and Y to predict the values of Y based on X?
[Scatter plots illustrating approximate r values: r = -1, r = -0.6, r = 0, r = +0.3, r = +1]
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at
least one independent variable
• Explain the impact of changes in an independent variable on the
dependent variable
• Dependent variable:
The variable we wish to predict or explain
• Independent variable:
The variable used to predict or explain the dependent variable
Regression

• A statistical measure that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables)

• It is used to understand a phenomenon, make predictions, and/or test hypotheses

• It is one of the most commonly used tools for business analysis

• It is easy to use and applies to many situations


Regression
• It can also be used to predict the values of the dependent variable given values of the independent variable(s)

• It also explains the impact of changes in an independent variable on the dependent variable
Dependent & Independent Variable

Dependent variable

• The variable we wish to predict or explain

• It can also be called the "measured variable", "explained variable", or "response variable"

Independent variable

• The variable(s) used to predict or explain the dependent variable

• Also called "explanatory variable(s)", "manipulated variable(s)", "controlled variable(s)", or "predictor variable(s)"


Regression Analysis Form and Types. . .
• Linear Regression: straight-line relationship
– Form: y=mx+b
• Non-linear: implies curved relationships
– logarithmic relationships

• Simple Linear Regression: Single explanatory variable


– Only one independent variable, X
– Relationship between X and Y is described by a linear function
– Changes in Y are assumed to be related to changes in X
Regression Analysis Form and Types. . .
• Multiple Linear Regression: Includes any number of
explanatory variables
- Two or more independent variables, Xs
- Relationship between Xs and Y is described by a
linear function
- Changes in Y are assumed to be related to changes in the Xs
Types of Relationships…
[Scatter plots of the four relationship types]
• Positive Linear: the values of Y generally increase linearly as X increases.
• Negative Linear: as X increases, the values of Y generally decrease.
• Positive Curvilinear: the values of Y increase as X increases, but the increase tapers off beyond certain values of X.
• Negative Curvilinear: an exponential relationship between X and Y; Y decreases very rapidly as X first increases, but then decreases much less rapidly as X increases further.
Types of Relationships… Continued

[Scatter plots contrasting strong relationships, weak relationships, and no relationship]
Linear Regression Model
Linear Regression Equation:

Y_i = a + b_i X_i + e_i

(dependent variable Y_i, population Y-intercept a, population slope coefficient b_i, independent variable X_i, random error term e_i)
where:
• Y_i : the dependent variable, the variable that we are trying to predict
• X_i : the variable(s) that we are using to predict Y
• a : the intercept (the estimated average value of Y when the value of X is zero)
• b_i : the slope(s) (the estimated change in the average value of Y for a one-unit increase in X)
• e_i : the regression residual or the error term
Simple Linear Regression Model (continued)

Y_i = a + b X_i + e_i

[Plot of the fitted line with intercept = a and slope = b; for a given X_i, the observed value of Y differs from the predicted value on the line by the random error e_i]
Simple Linear Regression Equation (Prediction Line)

The simple linear regression equation provides an estimate of the population regression line:

\hat{Y}_i = a + b X_i

where \hat{Y}_i is the estimated (or predicted) Y value for observation i, a is the estimate of the regression intercept, b is the estimate of the regression slope, and X_i is the value of X for observation i.
The Least Squares Method

The values of a and b are obtained by finding the values that minimize the sum of the squared differences between Y and \hat{Y}:

\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (a + b X_i))^2
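A minimal R sketch of the least-squares estimates on made-up data; lm() minimizes the same criterion, so the two results agree:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope minimizing the squared errors
a <- mean(y) - b * mean(x)                                       # intercept
c(intercept = a, slope = b)
coef(lm(y ~ x))                                                  # same values from R's built-in fit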


Simple Linear Regression Example. .
• A real estate agent wishes to examine the relationship between the
selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
– Dependent variable (y): House price in $1000s
– Independent variable (x): Area in Square feet
House Price in $1000s (y)    Area in Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
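A sketch of fitting this sample in R with lm(); the names price and sqft are only for this illustration:

price <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)            # house price in $1000s (y)
sqft  <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)  # area in square feet (x)
fit_house <- lm(price ~ sqft)
summary(fit_house)   # intercept and slope should match the Excel output below (about 98.248 and 0.1098)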
Linear Regression – Preparing MS Excel

Example: Using Excel Data Analysis Function
1. Choose Data
2. Choose Data Analysis
3. Choose Regression
Then enter the Y range and X range and desired options.
Example: Excel Output

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df    SS           MS           F         Significance F
Regression   1     18934.9348   18934.9348   11.0848   0.01039
Residual     8     13665.5652   1708.1957
Total        9     32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
Example: Scatter Plot
House price model: Scatter Plot
[Scatter plot of House Price ($1000s) against Area in Square Feet for the 10 sample houses]


Graphical Representation of Output

House price model: scatter plot and regression line
[Fitted line drawn through the scatter of House Price ($1000s) vs. Square Feet, with intercept = 98.248 and slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)


Interpretation of the Intercept, a and b

house price = 98.24833 + 0.10977 (square feet)

a is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values). Here, no houses had 0 square feet, so a = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

b measures the estimated change in the average value of Y as a result of a one-unit change in X. Here, b = 0.10977 tells us that the average value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size.
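As a quick worked example of using the fitted equation: for a 2,000-square-foot house the predicted selling price is 98.24833 + 0.10977 × 2,000 ≈ 317.79 (in $1000s), i.e. roughly $317,800.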
Coefficient of determination – R Square
• The coefficient of determination (denoted by R2) is a key output of
regression analysis. It is interpreted as the proportion of the variance
in the dependent variable that is predictable from the independent
variable. (Goodness of fit)

• Coefficient of determination is used in trend analysis.

• It is computed as a value between 0 and 1

0 ≤ R² ≤ 1
Coefficient of determination – R Square

• The higher the value, the better the fit. The coefficient of determination is symbolized by R² because it is the square of the coefficient of correlation, symbolized by r.

R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}

Note: In the single independent variable case, the coefficient of determination is

R^2 = r^2

where:
R² = Coefficient of determination
r = Simple correlation coefficient
R Square

R² = 1
[Scatter plots where every point lies exactly on the regression line]
Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x
R Square

0 < R² < 1
[Scatter plots where points lie near, but not exactly on, the regression line]
Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x
R Square

R² = 0
[Scatter plot with no linear pattern]
No linear relationship between x and y: the value of Y does not depend on x (none of the variation in y is explained by variation in x)
R Square – House Price example

R^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082

58.08% of the variation in house prices is explained by variation in square feet.
(SSR = Sum of Squares due to regression)

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df    SS           MS           F         Significance F
Regression   1     18934.9348   18934.9348   11.0848   0.01039
Residual     8     13665.5652   1708.1957
Total        9     32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.5772    232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.1858

Standard Error
• Standard error is a statistical term that measures the accuracy with which a sample represents a population.
• In statistics, a sample mean deviates from the actual mean of a population; this deviation is the standard error.

For the house price model, the standard error of the estimate is s_ε = 41.33032, highlighted in the Regression Statistics below (the ANOVA and coefficient tables are identical to the output shown earlier).

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10
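As a check, this value is the square root of the mean squared error from the ANOVA table (consistent with the SEE formula in the Appendix): s_ε = \sqrt{13665.5652 / 8} = \sqrt{1708.1957} ≈ 41.33.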


Explanation for Model Summary Statistics:
• Estimated Coefficient :-
The estimated coefficient is the value of slope calculated by the
regression. It might seem a little confusing that the Intercept also has a value, but just think of it as a slope that is always multiplied by 1.

• Residuals :-
The residuals are the difference between the actual values of the
variable you're predicting and predicted values from your regression.

• Significance Stars :-
The stars are shorthand for significance levels, with the number of
asterisks displayed according to the p-value computed. The more
punctuation there is next to your variables, the better.
Blank = bad, Dots = pretty good, Stars = good,
More Stars = very good
Explanation for Model Summary Statistics
 Standard Error:
Measure of the variability in the estimate for the coefficient.
Lower means better but this number is relative to the value of the
coefficient.
 Residual Std. Error / Degrees of Freedom :
The Residual Std. Error is just the standard deviation of your
residuals. The Degrees of Freedom is the difference between the number
of observations included in your training sample and the number of
variables used in your model (intercept counts as a variable).
 R-squared:
Metric for evaluating the goodness of fit of your model. Higher is
better with 1 being the best. Corresponds with the amount of variability
in what you're predicting that is explained by the model.
Explanation for Model Summary Statistics
• Adjusted R-squared :
The adjusted R-squared is a modified version of R-squared that
has been adjusted for the number of predictors in the model. The
adjusted R-squared increases only if the new term improves the model
more than would be expected by chance.
• p-value :
The p-value tests whether a coefficient is significantly different from zero. A predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable.
Assumptions of Regression
1. Linearity 2. Autocorrelation

Durbin Watson statistic is used to test


autocorrelation.

H0: Residuals are not correlated.


H1: Autocorrelation is present.

• The possible range is 0 <= D <= 4


• D should be close to 2 if H0 is true.
• D less than 2 may signal positive
autocorrelation, D greater than 2 may
signal negative autocorrelation.
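A minimal sketch of computing the Durbin-Watson statistic by hand in R on made-up data (packaged versions such as lmtest's dwtest() could be used instead, if installed):

set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)        # independent errors, so no autocorrelation
fit <- lm(y ~ x)
e <- residuals(fit)
sum(diff(e)^2) / sum(e^2)           # D statistic; should be close to 2 when H0 (no autocorrelation) holds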
Assumptions of Regression
3. Normality of Errors 4. Homoscedasticity
Multiple Linear Regression Model
• In simple linear regression, the dependent variable was assumed to depend on only one (independent) variable
• In the general multiple linear regression model, the dependent variable derives its value from two or more variables
• The multiple linear regression model takes the following form:

Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i

where:
Y_i = ith observation of dependent variable Y
X_{ki} = ith observation of kth independent variable X
b_0 = intercept term
b_k = slope coefficient of kth independent variable
ε_i = error term of ith observation
n = number of observations
k = total number of independent variables
Multiple Regression
• Now, taking it one step further and
adding amount of radio play to the
equation
• This turns into multiple regression
analysis with more predictors
creating a regression plane (or a 3d
model) with the line turning into a
plane
• It looks more complicated, but the
principles remain the same as linear
regression
Multiple Regression

• Dependent variable: the single variable being explained/predicted by the regression model
• Independent variable: the explanatory variable(s) used to predict the dependent variable
• Coefficients (β): values, computed by the regression tool, reflecting explanatory to dependent variable relationships
• Residuals (ε): the portion of the dependent variable that isn't explained by the model; the model's under- and over-predictions
Multiple Regression
• The betas tell the relationship between a
particular predictor and the outcome
• The betas also define the shape of the
plane
In this instance:
• beta 0 represents where the plane hits the y-axis (the value of the outcome when both predictors are zero)
• b1 represents the slope of the side associated with radio play
• b2 represents the slope of the side associated with advertising
budget
• This can go on for multiple dimensions with each of the predictors
defining the shape
Assumptions of Multiple Regression Model
– There exists a linear relationship between the dependent and
independent variables.
– The expected value of the error term, conditional on the
independent variables is zero.
– The error terms are homoskedastic, i.e. the variance of the error
terms is constant for all the observations.
– The expected value of the product of error terms is always zero,
which implies that the error terms are uncorrelated with each
other.
– The error term is normally distributed.
– The independent variables don't have any linear relationships with each other.
Multicollinearity
• A significant problem faced in the Regression Analysis is when the
independent variables or the linear combinations of the
independent variables are correlated with each other.

• This correlation among the independent variables is called multicollinearity, which creates problems in conducting the t-test for statistical significance.

• Multicollinearity is evident when the t-test concludes that the


coefficients are not statistically different from zero but the F-test is
significant and the coefficient of determination (R2) is high.
Multicollinearity
• High correlation among the independent variables suggests the presence of multicollinearity, but lower values of correlation do not rule out its presence.

• The most common method of correcting multicollinearity is to systematically remove independent variables until the multicollinearity is minimized.
MR - House Price Example
• Data was collected on 100 houses recently sold in a city. It consisted
of the selling price (in $000), house size (in 00 square feet), the number
of bedrooms, the number of bathrooms, and GarageSize (cars)
• Use Excel and R to perform Multiple Linear Regression.
Data
Segment of City Selling Price ($000) House Size (00 sq. ft.) Number of Bedrms. Number of Bathrms. Garage Size (cars)

Northwest 290 21 4 2 2
South 95 11 2 1 0
Northeast 170 19 3 2 2
Northwest 375 38 5 4 3
West 350 24 4 3 2
South 125 10 2 2 0
West 310 31 4 4 2
West 275 25 3 2 2
Northwest 340 27 5 3 3
Northeast 215 22 4 3 2
Northwest 295 20 4 3 2
South 190 24 4 3 2
Northwest 385 36 5 4 3
West 430 32 5 4 2
South 185 14 3 2 1
South 175 18 4 2 2
Northeast 190 19 4 2 2
Northwest 330 29 4 4 3
West 405 33 5 4 3
Northeast 170 23 4 2 2
West 365 34 5 4 3
Northwest 280 25 4 2 2
South 135 17 3 1 1
Northeast 205 21 4 3 2
West 260 26 4 3 2
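A sketch of fitting the full model in R; the file name and column names below are assumptions made for this illustration (the file would hold the rows listed above):

houses <- read.csv("house_sales.csv")   # hypothetical file containing the data above
fit_full <- lm(SellingPrice ~ HouseSize + Bedrooms + Bathrooms + GarageSize, data = houses)
summary(fit_full)   # on these data the coefficients should resemble the Attempt 1 Excel output below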
Business Analytics

Multiple Linear Regression


Using Excel
Attempt 1
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.898964
R Square            0.808137
Adjusted R Square   0.769764
Standard Error      45.87155
Observations        25

ANOVA
             df    SS         MS         F          Significance F
Regression   4     177260     44315      21.06027   6.14E-07
Residual     20    42083.98   2104.199
Total        24    219344

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -59.4161       54.6072          -1.0881   0.28951   -173.325    54.4925
House Size (00 sq. ft.)   6.50587        3.24687          2.00374   0.05883   -0.26697    13.2787
Number of Bedrms.         29.1013        26.2148          1.11011   0.28012   -25.5819    83.7844
Number of Bathrms.        26.4004        18.8077          1.40370   0.17574   -12.8317    65.6325
Garage Size (cars)        -10.8032       27.329           -0.395    0.69680   -67.8105    46.2040

• Garage Size (cars) is the independent variable with the highest p-value (.697) > .05
• Cars is removed from the model
• Multiple regression is performed again on the remaining independent variables
Attempt 2
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.89813
R Square            0.806638
Adjusted R Square   0.779015
Standard Error      44.94059
Observations        25

ANOVA
             df    SS         MS         F          Significance F
Regression   3     176931.2   58977.07   29.20153   1.1E-07
Residual     21    42412.79   2019.657
Total        24    219344

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -47.3416       44.3467          -1.0675   0.2979    -139.57     44.8824
House Size (00 sq. ft.)   6.02021        2.9445           2.04459   0.05363   -0.1031     12.1436
Number of Bedrms.         23.0353        20.823           1.10625   0.28113   -20.268     66.3389
Number of Bathrms.        27.0286        18.3601          1.47214   0.15581   -11.153     65.2105

• Bedrooms is the independent variable with the highest p-value (.281) > .05
• Bedrooms is removed from the model
• Multiple regression is performed again on the remaining independent variables
Attempt 3
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.891835
R Square            0.79537
Adjusted R Square   0.776767
Standard Error      45.16858
Observations        25

ANOVA
             df    SS         MS         F         Significance F
Regression   2     174459.6   87229.79   42.7555   2.63E-08
Residual     22    44884.42   2040.201
Total        24    219344

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -12.349        31.239           -0.3953   0.69642   -77.1352    52.4371
House Size (00 sq. ft.)   7.94652        2.38644          3.32987   0.00304   2.99735     12.8957
Number of Bathrms.        30.3444        18.2056          1.66676   0.10974   -7.4118     68.1006

• Bathrooms is the independent variable with the highest p-value (.110) > .05
• Bathrooms is removed from the model
• Regression is performed again on the remaining independent variable
Attempt 4
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.877228
R Square            0.76953
Adjusted R Square   0.759509
Standard Error      46.88202
Observations        25

ANOVA
             df    SS         MS         F          Significance F
Regression   1     168791.7   168791.7   76.79599   8.67E-09
Residual     23    50552.25   2197.924
Total        24    219344

                          Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept                 -9.8669        32.3874          -0.3047   0.76337    -76.865     57.1315
House Size (00 sq. ft.)   11.3384        1.2938           8.76333   8.67E-09   8.66184     14.0149

• House size is the only independent variable remaining in the model
• The estimated regression equation is: y = -9.8669 + 11.3383 (House Size)
• The Adjusted R Square value is .760
ANOVA - Analysis of Variance
• The reason for doing an ANOVA is to see if there is any difference
between groups on some variable.
• ANOVA is available for both parametric (score data) and non-
parametric (ranking/ordering) data.
Business Analytics

Multiple Linear Regression


Using R
• Before fitting our regression model we want to investigate how the
variables are related to one another. We can do this graphically by
constructing scatter plots of all pair-wise combinations of variables in
the data frame.
This can be done by using R
• Initially consider only two explanatory variables – HouseSize and
Bedrooms.
[Scatter plot matrix of SellingPrice, HouseSize, and Bedrooms showing all pairwise relationships]
• We begin by testing whether the explanatory variables collectively
have an effect on the response variable.

• If we can reject this hypothesis, we continue by testing whether the


individual regression coefficients are significant while controlling for
the other variables in the model.

Our hypothesis is,

H0: The variables HouseSize and Bedrooms collectively have no


effect on Selling Price.
v/s
H1: At least one variable has significant effect on the Selling Price.
Model Building in R
• Bedrooms is the independent variable with the highest p-value
(.19712) > .05 and hence removed from the model.
• Multiple regression is performed again on the remaining
independent variable HouseSize.
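A sketch of this model-building step in R (same hypothetical file and column names as before):

houses <- read.csv("house_sales.csv")
fit2 <- lm(SellingPrice ~ HouseSize + Bedrooms, data = houses)
summary(fit2)                           # Bedrooms has the highest p-value (about .197 per the slide)
fit1 <- update(fit2, . ~ . - Bedrooms)  # drop Bedrooms and refit on HouseSize alone
summary(fit1)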
Interpretation

• The output shows that F = 76.8 (p = 8.67e-09), indicating that we should clearly reject the null hypothesis that the variable HouseSize has no effect on Selling Price.

• From the output, R² = 0.7695 and adjusted R² = 0.7595.
ANOVA - Partial F-test

• Sometimes we are interested in simultaneously testing whether a


certain subset of the coefficients are equal to 0
(e.g. Beta 3 = Beta 4 = 0).

• We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).

• In R we can perform partial F-tests by fitting both the reduced and full
models separately and thereafter comparing them using the anova
function.
• Ex. Suppose we include the variables bathrooms and GarageSize in our
model and are interested in testing whether the number of bathrooms
and GarageSize are significant after taking HouseSize into
consideration.
• Now when comparing the models, our hypothesis becomes,

H0: The variables bathrooms and GarageSize collectively have no


effect on Selling Price after taking HouseSize into consideration.
v/s
H1: At least one variable has significant effect on the Selling Price
after taking HouseSize into consideration.
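A sketch of this partial F-test in R using the anova() comparison (same hypothetical file and column names as before):

houses  <- read.csv("house_sales.csv")
reduced <- lm(SellingPrice ~ HouseSize, data = houses)
full    <- lm(SellingPrice ~ HouseSize + Bathrooms + GarageSize, data = houses)
anova(reduced, full)   # compares the SSE of the two models; the slide reports F = 1.3808, p = 0.2733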
Analysis of Variance Table
Interpretation
• The output shows the results of the partial F-test. Since F = 1.3808 (p-
value=0.2733) we cannot reject the null hypothesis (Beta 3 = Beta 4 =
0) at the 5% level of significance.
• It appears that the variables Bathrooms and GarageSize do not
contribute significant information to the Selling Price once the variable
HouseSize has been taken into consideration.
• Hence, the estimated regression equation is

y = -9.867 + 11.338 * HouseSize


Confidence and Prediction Intervals
• We often use our regression models to estimate the mean response or predict future values of the response variable for certain values of the explanatory variables.
• The function predict() can be used to make both confidence intervals for the mean response and prediction intervals.
• To make confidence intervals for the mean response use the option interval="confidence". To make a prediction interval use the option interval="prediction".
• By default this makes 95% confidence and prediction intervals. If you instead want to make a 99% confidence or prediction interval use the option level=0.99.
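A sketch of both interval types with predict() (same hypothetical file and column names as before; since HouseSize is recorded in hundreds of square feet, 2,150 sq. ft. corresponds to HouseSize = 21.5):

houses <- read.csv("house_sales.csv")
fit <- lm(SellingPrice ~ HouseSize, data = houses)
new_house <- data.frame(HouseSize = 21.5)
predict(fit, newdata = new_house, interval = "confidence")   # 95% CI for the mean selling price
predict(fit, newdata = new_house, interval = "prediction")   # 95% PI for an individual house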
Confidence Interval

• Ex. Obtain a 95% confidence interval for the mean sales price of
houses whose HouseSize is 2,150 square feet.

• A 95% confidence interval is (18677.18, 30057.99)


Prediction Interval
• Ex. Obtain a 95% prediction interval for the sales price of a particular
house whose HouseSize is 2,150 square feet.

• A 95% prediction interval is given by (18676.35, 30058.82).


• Note that this is almost the same as the confidence interval, indicating that the variation about the mean is very small.
Significance Test
• Getting a correlation coefficient is generally only half the story; you
will want to know if the relationship is significant. The cor() function in
R can be extended to provide the significance testing required. The
function is cor.test()
• As above you need to read your data into R from a .CSV file and attach
the factors so that they are all stored in memory.
• To run a correlation test we type:
> cor.test(var1, var2, method = "method")
• Use it for Housing data.
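For example, a sketch of applying it to the housing data (same hypothetical file and column names as before):

houses <- read.csv("house_sales.csv")
ct <- cor.test(houses$HouseSize, houses$SellingPrice, method = "pearson")
ct   # reports r, the t statistic, and the p-value for the significance of the correlation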
Significance Test
• The default method is "pearson", so you may omit it if that is what you want. If you specify "kendall" or "spearman" you will get the corresponding significance test.

• As usual with R, it is a good idea to assign a variable name to your result in case you want to perform additional operations.
Model Misspecifications
• Apart from checking the previously discussed problems in the
regression, we should check for the correct form of the regression as
well.
• The following three misspecifications can be present in the regression model:

 Functional form of regression is misspecified:


- The important variables could have been omitted from the
regression model
- Some regression variables may need the transformation (like
conversion to the logarithmic scale)
- Pooling of data from incorrect pools
Model Misspecifications
 The variables can be correlated with the error term in time-
series models:
- Lagged dependent variables are used as independent
variables with serially correlated errors
- A function of dependent variables is used as an
independent variable because of incorrect dating of the
variables
- Independent variables are sometimes measured with
error
Model Misspecifications
• Other Time-Series Misspecification which leads to the non-stationarity
of the variables:
- Existence of relationships in time-series that results in patterns
- Random walk relationships among the time series

• These misspecifications in the regression model result in biased and inconsistent regression coefficients, which in turn lead to incorrect confidence intervals and to Type I or Type II errors.
• Non-stationarity means that the properties (like mean and variance) of the variables are not constant.
Heteroscedasticity
• In statistics, a collection of random variables is heteroscedastic when the variability of the observations differs rather than being constant.
• Heteroscedasticity often occurs when there is a large difference
among the sizes of the observations.
Heteroscedasticity
Example
Predicting human weight from height. (It is unlikely we'd need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, which is certainly not true.
Unconditional & Conditional Heteroskedasticity

– Presence of heteroskedasticity in the data is the violation of the


assumption about the constant variance of the residual term.
– Heteroskedasticity takes the following two forms, unconditional and
conditional.
– Unconditional heteroskedasticity is present when the variance of the residual terms is not related to the values of the independent variable.
– Unconditional heteroskedasticity doesn't pose any problem in the regression analysis, as the variance doesn't change systematically.
Unconditional & Conditional Heteroskedasticity
– Conditional heteroskedasticity poses problems in regression analysis, as the residuals are systematically related to the independent variables.

[Scatter plot around the fitted line \hat{Y} = b_0 + b_1 X showing low variance of the residual terms at small values of X and high variance at large values of X]
Detecting Heteroskedasticity
• Heteroskedasticity can be detected either by viewing the scatter
plots as discussed in the previous case or by Breusch-Pagan chi-
square test.
• In Breusch-Pagan chi-square test, the residuals are regressed with
the independent variables to check whether the independent
variable explains a significant proportion of the squared residual or
not.
• If the independent variables explain a significant proportion of the
squared residuals then we conclude that the conditional
heteroskedasticity is present otherwise not.
Detecting Heteroskedasticity
• Breusch-Pagan test statistic follows a chi-square distribution with k
degrees of freedom, where k is the number of independent variables.

BP chi-square test statistic = n × R²_resid

where:
n : number of observations
R²_resid : coefficient of determination when the squared residuals are regressed on the independent variables

Conditional heteroskedasticity can be corrected by using White-corrected standard errors, which are also called heteroskedasticity-consistent standard errors.
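A minimal sketch of the Breusch-Pagan idea in R on made-up heteroskedastic data (lmtest's bptest() offers a packaged version of this test):

set.seed(2)
x <- runif(100, 1, 10)
y <- 3 + 2 * x + rnorm(100, sd = x)          # error variance grows with x
fit <- lm(y ~ x)
aux <- lm(residuals(fit)^2 ~ x)              # regress the squared residuals on the independent variable
bp  <- length(x) * summary(aux)$r.squared    # BP statistic = n times the auxiliary R-squared
pchisq(bp, df = 1, lower.tail = FALSE)       # chi-square p-value with k = 1 degree of freedom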
Heteroscedasticity
• A classic example of heteroscedasticity is that of income versus
expenditure on meals. As one's income increases, the variability of
food consumption will increase. A poorer person will spend a rather
constant amount by always eating inexpensive food; a wealthier
person may occasionally buy inexpensive food and at other times eat
expensive meals. Those with higher incomes display a greater
variability of food consumption.
Correcting for Heteroskedasticity
– There are two methods for correcting the effects of conditional
heteroskedasticity
• Robust Standard Errors
Correct the standard errors of the linear regression model's
estimated coefficients to account for conditional
heteroskedasticity
• Generalized Least Squares
Modifies the original equation in an attempt to eliminate
heteroskedasticity.
– Statistical packages are available for computing robust standard errors.
Heteroscedasticity

• Imagine you are watching a rocket take off nearby and measuring the
distance it has traveled once each second. In the first couple of
seconds your measurements may be accurate to the nearest
centimeter, say. However, 5 minutes later as the rocket recedes into
space, the accuracy of your measurements may only be good to 100 m,
because of the increased distance, atmospheric distortion and a
variety of other factors. The data you collect would exhibit
heteroscedasticity.
Serial Correlation
• Serial correlations are often found in repeating patterns, when the level of a variable affects its future level.
• In finance, serial correlation is used by technical analysts to determine
how well the past price of a security predicts the future price.
• For example, if we are predicting the growth of stock dividends, an
overestimate in one year is likely to lead to overestimates in
succeeding years.
Business Analytics

Appendix
Measures of Variation
• Total variation is made up of two parts:

SST = SSR + SSE
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

SST = \sum (Y_i - \bar{Y})^2     SSR = \sum (\hat{Y}_i - \bar{Y})^2     SSE = \sum (Y_i - \hat{Y}_i)^2

where:
\bar{Y} = mean value of the dependent variable
Y_i = observed value of the dependent variable
\hat{Y}_i = predicted value of Y for the given X_i value
Coefficient of determination (R2) & Adjusted R2
• Coefficient of determination(R2) can also be used to test the
significance of the coefficients collectively apart from using F-test.

R^2 = \frac{SST - SSE}{SST} = \frac{RSS}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}

• The drawback of using the coefficient of determination is that its value always increases as the number of independent variables is increased, even if the marginal contribution of the incoming variable is statistically insignificant.
• To take care of the above drawback, coefficient of determination is
adjusted for the number of independent variables taken. This adjusted
measure of coefficient of determination is called adjusted R2
Coefficient of determination (R2) & Adjusted R2

• Adjusted R² is given by the following formula:

R_a^2 = 1 - \left[ \left( \frac{n - 1}{n - k - 1} \right) (1 - R^2) \right]

where
n = number of observations
k = number of independent variables
R_a^2 = adjusted R²
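For instance, for the house-size-only model in the multiple regression example (n = 25, k = 1, R² = 0.76953): R_a^2 = 1 − (24/23)(1 − 0.76953) ≈ 0.7595, matching the Adjusted R Square reported in the Attempt 4 output.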
Analysis of Variance (ANOVA)
– Analysis of variance is a statistical method for analyzing the variability of the data by breaking the variability into its constituents.
– A typical ANOVA table looks like:

Source of Variability     DoF         Sum of Squares    Mean Sum of Squares
Regression (Explained)    k           RSS               MSR = RSS / k
Error (Unexplained)       n - k - 1   SSE               MSE = SSE / (n - k - 1)
Total                     n - 1       SST = RSS + SSE

From the above summary (ANOVA table) we can calculate:

Standard Error of Estimate (SEE) = \sqrt{MSE} = \sqrt{\frac{SSE}{n - k - 1}}
Analysis of Variance (ANOVA)

Coefficient of determination (R²) = \frac{\text{Explained Variation (RSS)}}{\text{Total Variation (SST)}} = \frac{\text{Total Variation (SST)} - \text{Unexplained Variation (SSE)}}{\text{Total Variation (SST)}}
F-Statistic
• An F-test explains how well the dependent variable is explained by
the independent variables collectively.
• In the case of multiple independent variables, the F-test tells us whether a single variable explains a significant part of the variation in the dependent variable or whether all the independent variables explain the variability collectively.
• F-statistic is given as:

F = \frac{MSR}{MSE} = \frac{RSS / k}{SSE / (n - k - 1)}
Where:
MSR: Mean Regression sum of squares
MSE: Mean Squared Error
n: Number of observations
k: Number of independent variables
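For example, for the simple house-price regression shown earlier (k = 1, n = 10): F = MSR / MSE = 18934.9348 / 1708.1957 ≈ 11.08, matching the F value in the Excel ANOVA table, with numerator df = 1 and denominator df = 8.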
F-Statistic contd…….

– Decision rule for F-test: Reject H0 if the F-statistic > Fc (Critical


Value)
– The numerator of F-statistic has degrees of freedom of "k" and the
denominator has the degrees of freedom of "n-k-1"
– If H0 is rejected, then at least one of the slope coefficients is significantly different from zero.

F-test is always a single tailed test while testing the hypothesis


that the coefficients are simultaneously equal to zero
Thank You
