
Topic 7 Regression (cont)

Vincent Hoang (2022), Lecture 11


Camm et al. (2016), Chapter 7
Outline
1. Independent categorical variables
2. Non-linear relationship
3. Interaction terms between independent variables
4. A note on multicollinearity
5. Assessment of model assumptions
Multiple linear regression
• A linear regression model with more than one independent variable
is called a multiple linear regression model.
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
◦ Y is the dependent variable, Xi are the independent (explanatory) variables;
◦ βi are the regression coefficients for the independent variables, ε is the error
term.

• Using sample data, we estimate the regression coefficients bi


Ŷ = b0 + b1X1 + b2X2 + ⋯ + bkXk    (8.11)
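As a sketch of how the coefficients bi in (8.11) are obtained by least squares, the fit can be reproduced in Python (numpy used here purely for illustration; the two predictors and their values are made up, not the course's salary data):

```python
import numpy as np

# Hypothetical sample: Y as a function of two predictors (e.g. beginning
# salary X1 and experience X2); values are illustrative only.
X1 = np.array([30, 35, 40, 45, 50, 55], dtype=float)
X2 = np.array([1, 3, 2, 5, 4, 6], dtype=float)
Y = 5 + 1.2 * X1 + 2.0 * X2  # exact linear data, so the fit recovers b0, b1, b2

# Design matrix with a column of ones for the intercept b0
A = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(b)  # estimated b0, b1, b2
```

Tools such as Excel's Data Analysis Toolpak carry out exactly this computation behind the scenes.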
Regression with Independent Categorical Variables
• A few examples:
◦ Employee income depends on holding a managerial / supervisory position
◦ Employee income depends on having (or not having) an MBA degree
◦ The decision to approve a loan depends on whether the applicant owns a house
◦ The price (value) of a property depends on the type of the property

• These independent variables are categorical.


• In regression analysis, we include them using dummy variables.
Case study
• Consider Salary Dataset
• Preliminary data understanding (reflecting on the Data Understanding and
Modelling sections of your group assessments as well):
◦ Types of data / variables: do you understand them?
◦ Descriptive statistics
◦ Detection of outliers
◦ Correlation matrix among variables (using Data Analysis Toolpak, built-in Excel add-in)
◦ Plot and diagram analysis
◦ Hypothesis testing
◦ Confidence interval estimate
Dummy variable for two categories
• A dummy variable indicates whether an observation belongs to a particular
category in the data.
◦ Gender: the dummy variable takes the value 1 if female and 0 if not female.
◦ Run the analysis: current salary is a function of (beginning salary, …, gender)

Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
◦ Suppose X1 is the dummy variable for gender: X1 takes the value one (female) or zero (not female)
For Female employees (X1 = 1): Ŷ = (b0 + b1) + b2X2 + ⋯ + bkXk

For Male employees (X1 = 0): Ŷ = b0 + b2X2 + ⋯ + bkXk

[Chart: current salary vs experience, showing separate intercepts for female and male employees]
Four dummy variables for five categories

• Consider “Departments” variable


◦ How many levels / categories?
◦ Administration – Production - Sales – Research & Development - Warehouse
◦ How many dummy variables should we use?
◦ Determine the “Reference” or “Base” category
◦ The remaining four are benchmarked against this reference category
◦ How to interpret each of these dummy variables?
If Warehouse is the base category, then Production is compared with Warehouse; similarly, Administration and Support Service is compared with Warehouse.
Indirectly, we can then compare Production with Administration and Support Service.

[Chart: current salary vs experience by department]
General rules for dummy variable number
• When a categorical variable has k (> 2) levels/categories, we need
to add (k−1) additional dummy variables to the model.
◦ The variable “Department” has five levels
◦ Choose one level, such as “Research and Development” as the reference level
◦ Four dummy variables are needed
◦ Each dummy compares to the reference level.
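The k−1 rule can be illustrated with pandas (a hypothetical Department column; note that with drop_first=True pandas uses the alphabetically first level as the base, here Administration, whereas the slides pick Research and Development or Warehouse — any level can serve):

```python
import pandas as pd

# Hypothetical "Department" column with five levels; drop_first=True keeps
# k - 1 = 4 dummy columns and treats the dropped level as the base category.
df = pd.DataFrame({"Department": ["Administration", "Production", "Sales",
                                  "Research & Development", "Warehouse"]})
dummies = pd.get_dummies(df["Department"], drop_first=True)

print(list(dummies.columns))  # four dummy columns; "Administration" is the base
```

In Excel the same effect is achieved with =IF() formulas or RealStatistics' categorical coding, as shown on the next slides.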
How to prepare data in RealStatistics
• In Excel RealStatistics:
◦ Categorical Coding Regression | Real Statistics Using Excel (real-statistics.com)

Excel Function: =CHOOSECOLS()


How to transform variables into dummy variables
In Excel RealStatistics:

Excel Function: =IF ()


Case: Months employed and Sales
Data: Reynolds

[Scatter plot: Scales sold (0–500) vs Months employed (0–120)]
Case: Months employed and Sales
Data: Reynolds

[Residual plot: residuals (−80 to 100) vs predicted value (0–400), showing a curved pattern]

→ A curvilinear relationship may provide a better fit to the data


Regression Models with Nonlinear Terms
• Curvilinear models may be appropriate when scatter charts or residual plots
show nonlinear relationships.
• A second order polynomial might be used

Y = β0 + β1X + β2X² + ε

• Here β1 represents the linear effect of X on Y, and β2 represents the curvilinear effect.
Case: Months employed and Sales
Data: Reynolds Quadratic regression model
[Residual plot for the quadratic model: residuals (−80 to 60) vs predicted value (0–350)]
Case: Months employed and Sales
Data: Reynolds Quadratic regression model

Let’s estimate the maximum sales!

The knot / breakpoint: the value of the independent variable at which the relationship between the independent variable and the dependent variable changes.
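For a fitted quadratic Ŷ = b0 + b1X + b2X², the maximum occurs where the derivative b1 + 2·b2·X equals zero, i.e. at X* = −b1/(2·b2) when b2 < 0. A minimal sketch with made-up coefficients (not the actual Reynolds estimates):

```python
# Hypothetical quadratic fit Y-hat = b0 + b1*X + b2*X**2
# (coefficients are illustrative, not the Reynolds regression output)
b0, b1, b2 = 20.0, 5.0, -0.25

# Setting dY/dX = b1 + 2*b2*X to zero gives the maximizing X (since b2 < 0)
x_star = -b1 / (2 * b2)                     # months employed at the peak
y_star = b0 + b1 * x_star + b2 * x_star**2  # peak predicted sales

print(x_star, y_star)  # 10.0 45.0
```

The same arithmetic can of course be done in an Excel cell once the two coefficients are read off the regression output.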
Interaction terms
• An interaction occurs when the effect of one variable depends on another variable, i.e. the relationship between the dependent variable and one independent variable differs across the values of a second independent variable.
• We can test for an interaction by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
Case: Advertising expenditure – Price and Sales
Data: Tyler
• The difference in mean sales between advertising expenditures of $50,000 and $100,000 depends on the price of the product.

• At higher selling prices, the effect of increased advertising expenditure diminishes.

[Line chart: mean sales (332–808) vs price ($0.00–$3.50) for the two advertising expenditure levels]
Case: Advertising expenditure – Price and Sales
Data: Tyler

• Let’s consider the effect when price increases by $1

• Let’s consider the effect when Advertising expenditure increases by $1,000


Categorical by Continuous
For example: income depends on MBA and Age, but you believe that the effect of Age on income differs between two groups: those with and those without an MBA degree.

Salary = intercept + (coefficient value) × Age + (coefficient value) × MBA + (coefficient value) × MBA × Age + ε

• If MBA = 1, then we have
Salary = [intercept + (coefficient value)] + [(coefficient value) + (coefficient value)] × Age + ε

• If MBA = 0, then
Salary = intercept + (coefficient value) × Age + (coefficient value) × 0 × Age + ε
Salary = intercept + (coefficient value) × Age + ε
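A minimal numeric sketch of the two regimes above, with made-up coefficient values (intercept 40, Age slope 1, MBA shift 10, interaction 0.5 — purely illustrative):

```python
# Categorical-by-continuous interaction: the slope of Age on salary
# differs by MBA status (all coefficient values are hypothetical).
intercept, b_age, b_mba, b_inter = 40.0, 1.0, 10.0, 0.5

def predicted_salary(age, mba):
    """mba is 0 or 1; the effective Age slope is b_age + b_inter when mba == 1."""
    return intercept + b_age * age + b_mba * mba + b_inter * mba * age

# Slope of Age for each group = difference of predictions one year apart
slope_no_mba = predicted_salary(31, 0) - predicted_salary(30, 0)
slope_mba = predicted_salary(31, 1) - predicted_salary(30, 1)

print(slope_no_mba, slope_mba)  # 1.0 1.5
```

The gap between the two slopes (0.5) is exactly the interaction coefficient, which is what the t-test on the interaction term evaluates.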


Categorical by Categorical
• For examples:
◦ MBA with Gender
◦ Gender with Managerial / Supervisory Position

• The coefficient of the interaction term of two categorical variables captures the difference for observations that belong to both groups. This coefficient acts as a constant that shifts the model when the observation falls into both of the two categories.
Continuous by Continuous
For example
◦ Experience and Years of Education
Take the MLR model:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
The partial effect of X1 is given by:
β1 + β3X2
And the partial effect of X2 is given by:
β2 + β3X1
Clearly, the partial effect of one of the continuous variables in the interaction term depends on the
size of the term it is interacted with.
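The two partial-effect formulas can be sketched directly (coefficient values are illustrative, e.g. X1 = experience and X2 = years of education):

```python
# Partial effects in Y = b0 + b1*X1 + b2*X2 + b3*X1*X2
# (hypothetical coefficient values for illustration only)
b1, b2, b3 = 2.0, 3.0, 0.5

def partial_effect_x1(x2):
    """dY/dX1 = b1 + b3*X2: the effect of X1 depends on the level of X2."""
    return b1 + b3 * x2

def partial_effect_x2(x1):
    """dY/dX2 = b2 + b3*X1: the effect of X2 depends on the level of X1."""
    return b2 + b3 * x1

print(partial_effect_x1(4.0))   # 2.0 + 0.5*4  = 4.0
print(partial_effect_x2(10.0))  # 3.0 + 0.5*10 = 8.0
```

Note that neither b1 nor b2 alone is "the" effect of its variable once the interaction is included; each must be reported at chosen values of the other variable.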
Co- and multi-collinearity problems
• Multicollinearity occurs when we have two or more independent variables that are highly correlated with one another.
• If those variables capture similar things, for example income and disposable income, or body size and weight, then we should consider using only one (the better) variable.
• Multicollinearity affects the statistical significance of t-tests but does not have much impact on predictive power.
• If you have two independent variables that are highly correlated with one another, think about:
◦ whether they measure the same thing
◦ whether they could be the results of a chain or a fork

• Avoiding multiple variables that capture the same thing helps us avoid (1) multicollinearity problems (i.e., for situational assessment) and (2) over-fitting the model in predictive analytics.
From causal diagrams to mathematical
equations
• The simplest form of empirical model would be a regression model as below

• This equation fails to capture the actual relationships among the independent variables (x1 to x6)
Shortcomings
• Consider X1, X2 and X4: the associations among these variables are clear, hence we say this model suffers from a multicollinearity problem.
• Also, we cannot use standard
significance tests to reliably
determine which independent
variables exert the most
influence.
A solution (not discussed further in this unit)
• It is possible to use a structural equation model (SEM) via a two-stage regression (or more advanced models).

• Stage 1

• Stage 2: use the estimated value of the independent variable obtained from the stage 1 regression.
Checking outliers and the assumptions of least-squares linear regression using residual plots
• Linearity
◦ examine scatter diagram (should appear linear)
◦ examine residual plot (should appear random)

• Normality of Errors (residuals)


◦ view a histogram of standard residuals
◦ regression is robust to departures from normality

• Homoscedasticity: variation about the regression line is constant.


◦ examine the residual plot

• Independence of Errors: successive observations should not be related.


◦ This is important when the independent variable is time.
Data plot vs R-squared

• Models with higher R-squared and adjusted R-squared are often better.
Predicted vs Actual Plots
• The left-side model explains more of the variation in revenue than the right-side model.
Predicted vs Residual Plot
• The right-side chart is
the “residual plot”.
• We can ask the Excel add-in to generate these plots.
• We can use these
plots to assess model
assumptions.
Ideally, the plot
should look like
these
• The more symmetrically
distributed, tending to cluster
towards the middle of the plot,
the better.

• The more clustered around the


lower single digits of the y-axis
(e.g., 0.5 or 1.5, not 30 or 150),
the better.

• There aren’t any clear patterns.


These do not
look “good”
• They are not evenly distributed
vertically (y-axis).

• They have many outliers.

• They have a clear shape or


pattern.

• These indicate room for improvement → there might be better models.
Checking linearity – Residual plots
These two charts show non-linear relationships
Violating Homoscedasticity

Heteroscedasticity means different (hetero) variance (skedasticity): the variance of the residuals changes as the predicted value changes.
Testing Heteroskedasticity
The heteroskedasticity of the residuals can be
tested using the White test (also called the
heteroskedasticity test). The null hypothesis is that
the residuals are homoscedastic (same variance)
and the alternative hypothesis is
heteroskedasticity.
Heteroskedasticity is an issue because it makes the usual standard errors of the coefficient estimates unreliable.
It can be addressed by including more independent variables or by using robust standard errors.
Presence of Outliers
Among these four, which is the best?
Checking Normality
It is best to check whether the distribution of the residuals is normal.
This can be accomplished by plotting the distribution of the standardized residuals, or by performing a normality test (e.g., Shapiro-Wilk).
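The Shapiro-Wilk test is available in scipy. A sketch on deterministic "residuals" taken at normal quantiles, so the test should clearly fail to reject normality (H0: the data are normally distributed):

```python
import numpy as np
from scipy import stats

# Deterministic residuals placed at the quantiles of a standard normal
# distribution, so they are as close to normal as a finite sample can be.
residuals = stats.norm.ppf(np.linspace(0.01, 0.99, 99))

stat, p_value = stats.shapiro(residuals)
print(p_value)  # large p-value: no evidence against normality
```

On real regression output you would pass the model's residuals instead; a small p-value would then indicate a departure from normality worth inspecting in the histogram.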
Practicing: Salary data
By applying a systematic model-building approach:
1. Consider causal graphs
2. Descriptive analysis and checking for outliers in both the Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of the coefficients and the p-value for each coefficient.
5. If a p-value > 10%, consider removing that variable and run step 4 again. You should check adjusted R-squared again.
6. Once most (or all) X variables are statistically significant and the signs of the coefficients are consistent with expectations, you are closer to a good model.
7. Check all assumptions (next week's learning)
Comparing models 3 and 4
Comparing models 1 and 4
VIF and regression
• The VIF represents how much an independent
variable is contributing to multicollinearity based
on how correlated it is with other independent
variables

• VIF of 1 means the independent variable is uncorrelated with the other independent variables

• VIF between 1 and 5 means moderate multicollinearity

• VIF above 5 means high multicollinearity
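VIF values can be computed with statsmodels' variance_inflation_factor. A sketch with two deliberately near-collinear synthetic predictors (not the course data), which should both show a VIF well above 5:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two highly correlated predictors: x2 is x1 plus a little noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
X = np.column_stack([np.ones(100), x1, x2]) # include the constant column

# VIF of column j = 1 / (1 - R^2) from regressing it on the other columns
vif_x1 = variance_inflation_factor(X, 1)
vif_x2 = variance_inflation_factor(X, 2)

print(vif_x1, vif_x2)  # both far above 5: high multicollinearity
```

Dropping one of the two near-duplicate variables, as recommended earlier, would bring the remaining VIF back toward 1.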
