
Topic 7 Regression (cont)

Vincent Hoang (2022), Lecture 11


Camm et al. (2016), Chapter 7
Outline
1. Independent categorical variables
2. Non-linear relationship
3. Interaction terms between independent variables
4. A note on multicollinearity
5. Assessment of model assumptions
Multiple linear regression
• A linear regression model with more than one independent variable
is called a multiple linear regression model.
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
◦ Y is the dependent variable, Xi are the independent (explanatory) variables;
◦ βi are the regression coefficients for the independent variables, ε is the error
term.

• Using sample data, we estimate the regression coefficients bi


Ŷ = b0 + b1X1 + b2X2 + ⋯ + bkXk    (8.11)
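As a sketch of how the coefficients bi in (8.11) are obtained by least squares, the fit can be reproduced in Python (numpy used here purely for illustration; the two predictors and their values are made up, not the course's salary data):

```python
import numpy as np

# Hypothetical sample: Y as a function of two predictors (e.g. beginning
# salary X1 and experience X2); values are illustrative only.
X1 = np.array([30, 35, 40, 45, 50, 55], dtype=float)
X2 = np.array([1, 3, 2, 5, 4, 6], dtype=float)
Y = 5 + 1.2 * X1 + 2.0 * X2  # exact linear data, so the fit recovers b0, b1, b2

# Design matrix with a column of ones for the intercept b0
A = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(b)  # estimated b0, b1, b2
```

Tools such as Excel's Data Analysis Toolpak carry out exactly this computation behind the scenes.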
Regression with Independent Categorical Variables
• A few examples:
◦ Employee income depends on holding a managerial / supervisory position
◦ Employee income depends on having (or not having) an MBA degree
◦ The decision to approve a loan depends on whether the applicant owns a house
◦ The price (value) of a property depends on the type of the property

• These independent variables are categorical.


• In regression analysis, we include them using dummy variables.
Case study
• Consider Salary Dataset
• Preliminary data understanding (reflecting on the Data Understanding and
Modelling sections of your group assessments as well):
◦ Types of data / variables: do you understand them?
◦ Descriptive statistics
◦ Detection of outliers
◦ Correlation matrix among variables (using Data Analysis Toolpak, built-in Excel add-in)
◦ Plot and diagram analysis
◦ Hypothesis testing
◦ Confidence interval estimate
Dummy variable for two categories
• A dummy variable indicates whether an observation belongs to a particular
category in the data.
◦ Gender: the dummy variable takes the value 1 if female and 0 if not female.
◦ Run the analysis: current salary is a function of (beginning salary, …, gender)

Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε    (8.10)
◦ Suppose X1 is the dummy variable for gender: X1 takes the value one (female) or zero (not female)
For Female employees (X1 = 1): Ŷ = (b0 + b1) + b2X2 + ⋯ + bkXk

For Male employees (X1 = 0): Ŷ = b0 + b2X2 + ⋯ + bkXk

[Chart: current salary vs experience, showing separate intercepts for female and male employees]
Four dummy variables for five categories

• Consider “Departments” variable


◦ How many levels / categories?
◦ Administration – Production - Sales – Research & Development - Warehouse
◦ How many dummy variables should we use?
◦ Determine the “Reference” or “Base” category
◦ The remaining four are benchmarked against this reference category
◦ How to interpret each of these dummy variables?
If Warehouse is the base category, then Production is compared with Warehouse; similarly, Administration and Support Service is compared with Warehouse.
Indirectly, we can then compare Production with Administration and Support Service.

[Chart: current salary vs experience by department]
General rules for dummy variable number
• When a categorical variable has k (> 2) levels/categories, we need
to add (k−1) additional dummy variables to the model.
◦ The variable “Department” has five levels
◦ Choose one level, such as “Research and Development” as the reference level
◦ Four dummy variables are needed
◦ Each dummy compares to the reference level.
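The k−1 rule can be illustrated with pandas (a hypothetical Department column; note that with drop_first=True pandas uses the alphabetically first level as the base, here Administration, whereas the slides pick Research and Development or Warehouse — any level can serve):

```python
import pandas as pd

# Hypothetical "Department" column with five levels; drop_first=True keeps
# k - 1 = 4 dummy columns and treats the dropped level as the base category.
df = pd.DataFrame({"Department": ["Administration", "Production", "Sales",
                                  "Research & Development", "Warehouse"]})
dummies = pd.get_dummies(df["Department"], drop_first=True)

print(list(dummies.columns))  # four dummy columns; "Administration" is the base
```

In Excel the same effect is achieved with =IF() formulas or RealStatistics' categorical coding, as shown on the next slides.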
How to prepare data in RealStatistics
• In Excel RealStatistics:
◦ Categorical Coding Regression | Real Statistics Using Excel (real-statistics.com)

Excel Function: =CHOOSECOLS()


How to transform variables into dummy variables
In Excel RealStatistics:

Excel Function: =IF ()


Case: Months employed and Sales
Data: Reynolds

[Scatter plot: Scales sold (0–500) vs Months employed (0–120)]
Case: Months employed and Sales
Data: Reynolds

[Residual plot: residuals (−80 to 100) vs predicted value (0–400), showing a curved pattern]

→ A curvilinear relationship may provide a better fit to the data


Regression Models with Nonlinear Terms
• Curvilinear models may be appropriate when scatter charts or residual plots
show nonlinear relationships.
• A second order polynomial might be used

Y = β0 + β1X + β2X² + ε

• Here β1 represents the linear effect of X on Y, and β2 represents the curvilinear effect.
Case: Months employed and Sales
Data: Reynolds Quadratic regression model
[Residual plot for the quadratic model: residuals (−80 to 60) vs predicted value (0–350)]
Case: Months employed and Sales
Data: Reynolds Quadratic regression model

Let’s estimate the maximum sales!

The knot / breakpoint: the value of the independent variable at which the relationship between the independent variable and the dependent variable changes.
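For a fitted quadratic Ŷ = b0 + b1X + b2X², the maximum occurs where the derivative b1 + 2·b2·X equals zero, i.e. at X* = −b1/(2·b2) when b2 < 0. A minimal sketch with made-up coefficients (not the actual Reynolds estimates):

```python
# Hypothetical quadratic fit Y-hat = b0 + b1*X + b2*X**2
# (coefficients are illustrative, not the Reynolds regression output)
b0, b1, b2 = 20.0, 5.0, -0.25

# Setting dY/dX = b1 + 2*b2*X to zero gives the maximizing X (since b2 < 0)
x_star = -b1 / (2 * b2)                     # months employed at the peak
y_star = b0 + b1 * x_star + b2 * x_star**2  # peak predicted sales

print(x_star, y_star)  # 10.0 45.0
```

The same arithmetic can of course be done in an Excel cell once the two coefficients are read off the regression output.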
Interaction terms
• An interaction occurs when the effect of one variable depends on another variable, i.e. the relationship between the dependent variable and one independent variable differs across the values of a second independent variable.
• We can test for an interaction by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
Case: Advertising expenditure – Price and Sales
Data: Tyler
• The difference in mean sales between advertising expenditures of $50,000 and $100,000 depends on the price of the product.

• At higher selling prices, the effect of increased advertising expenditure diminishes.

[Line chart: mean sales (332–808) vs price ($0.00–$3.50) for the two advertising expenditure levels]
Case: Advertising expenditure – Price and Sales
Data: Tyler

• Let’s consider the effect when price increases by $1

• Let’s consider the effect when Advertising expenditure increases by $1,000


Categorical by Continuous
For example: income depends on MBA and Age, but you believe that the effect of Age on income differs between two groups: those with and those without an MBA degree.

Salary = intercept + (coefficient value) × Age + (coefficient value) × MBA + (coefficient value) × MBA × Age + ε

• If MBA = 1, then we have
Salary = [intercept + (coefficient value)] + [(coefficient value) + (coefficient value)] × Age + ε

• If MBA = 0, then
Salary = intercept + (coefficient value) × Age + (coefficient value) × 0 × Age + ε
Salary = intercept + (coefficient value) × Age + ε
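A minimal numeric sketch of the two regimes above, with made-up coefficient values (intercept 40, Age slope 1, MBA shift 10, interaction 0.5 — purely illustrative):

```python
# Categorical-by-continuous interaction: the slope of Age on salary
# differs by MBA status (all coefficient values are hypothetical).
intercept, b_age, b_mba, b_inter = 40.0, 1.0, 10.0, 0.5

def predicted_salary(age, mba):
    """mba is 0 or 1; the effective Age slope is b_age + b_inter when mba == 1."""
    return intercept + b_age * age + b_mba * mba + b_inter * mba * age

# Slope of Age for each group = difference of predictions one year apart
slope_no_mba = predicted_salary(31, 0) - predicted_salary(30, 0)
slope_mba = predicted_salary(31, 1) - predicted_salary(30, 1)

print(slope_no_mba, slope_mba)  # 1.0 1.5
```

The gap between the two slopes (0.5) is exactly the interaction coefficient, which is what the t-test on the interaction term evaluates.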


Categorical by Categorical
• For examples:
◦ MBA with Gender
◦ Gender with Managerial / Supervisory Position

• The coefficient of the interaction term of two categorical variables captures the difference for observations that belong to both groups. This coefficient acts as a constant that shifts the model when the observation falls into both of the two categories.
Continuous by Continuous
For example
◦ Experience and Years of Education
Take the MLR model:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
The partial effect of X1 is given by:
β1 + β3X2
And the partial effect of X2 is given by:
β2 + β3X1
Clearly, the partial effect of one of the continuous variables in the interaction term depends on the
size of the term it is interacted with.
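The two partial-effect formulas can be sketched directly (coefficient values are illustrative, e.g. X1 = experience and X2 = years of education):

```python
# Partial effects in Y = b0 + b1*X1 + b2*X2 + b3*X1*X2
# (hypothetical coefficient values for illustration only)
b1, b2, b3 = 2.0, 3.0, 0.5

def partial_effect_x1(x2):
    """dY/dX1 = b1 + b3*X2: the effect of X1 depends on the level of X2."""
    return b1 + b3 * x2

def partial_effect_x2(x1):
    """dY/dX2 = b2 + b3*X1: the effect of X2 depends on the level of X1."""
    return b2 + b3 * x1

print(partial_effect_x1(4.0))   # 2.0 + 0.5*4  = 4.0
print(partial_effect_x2(10.0))  # 3.0 + 0.5*10 = 8.0
```

Note that neither b1 nor b2 alone is "the" effect of its variable once the interaction is included; each must be reported at chosen values of the other variable.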
Co- and multi-collinearity problems
• Multicollinearity occurs when we have two or more independent variables that are highly correlated with one another.
• If those variables capture similar things, for example income and disposable income, or body size and weight, then we should consider using only one (the better) variable.
• Multicollinearity affects the statistical significance of t-tests but does not have much impact on predictive power.
• If you have two independent variables that are highly correlated with one another, think about:
◦ whether they measure the same thing
◦ whether they could be the results of a chain or a fork

• Avoiding multiple variables that capture the same thing helps us avoid (1) multicollinearity problems (i.e., for situational assessment) and (2) over-fitting the model in predictive analytics.
From causal diagrams to mathematical
equations
• The simplest form of empirical model would be a regression model as below

• This equation fails to capture the actual relationships among the independent variables (x1 to x6)
Shortcomings
• Consider X1, X2 and X4: the associations among these variables are clear, hence we say this model suffers from a multicollinearity problem.
• Also, we cannot use standard
significance tests to reliably
determine which independent
variables exert the most
influence.
A solution (not discussed further in this unit)
• It is possible to use a structural equation model (SEM) via a two-stage regression (or more advanced models).

• Stage 1

• Stage 2: use the estimated value of the independent variable obtained from the stage 1 regression.
Checking outliers and the assumptions of least-squares linear regression using residual plots
• Linearity
◦ examine scatter diagram (should appear linear)
◦ examine residual plot (should appear random)

• Normality of Errors (residuals)


◦ view a histogram of standard residuals
◦ regression is robust to departures from normality

• Homoscedasticity: variation about the regression line is constant.


◦ examine the residual plot

• Independence of Errors: successive observations should not be related.


◦ This is important when the independent variable is time.
Data plot vs R-squared

• Models with higher R-squared and adjusted R-squared are often better.
Predicted vs Actual Plots
• The left-side model explains more of the variation in revenue than the right-side model.
Predicted vs Residual Plot
• The right-side chart is
the “residual plot”.
• We can ask the Excel add-in to generate these plots.
• We can use these
plots to assess model
assumptions.
Ideally, the plot
should look like
these
• The more symmetrically
distributed, tending to cluster
towards the middle of the plot,
the better.

• The more clustered around the


lower single digits of the y-axis
(e.g., 0.5 or 1.5, not 30 or 150),
the better.

• There aren’t any clear patterns.


These do not
look “good”
• They are not evenly distributed
vertically (y-axis).

• They have many outliers.

• They have a clear shape or


pattern.

• These indicate room for improvement → there might be better models.
Checking linearity – Residual plots
These two charts show non-linear relationships
Violating Homoscedasticity

Heteroscedasticity means different (hetero) variance (skedasticity): the variance of the residuals changes as the predicted value changes.
Testing Heteroskedasticity
The heteroskedasticity of the residuals can be
tested using the White test (also called the
heteroskedasticity test). The null hypothesis is that
the residuals are homoscedastic (same variance)
and the alternative hypothesis is
heteroskedasticity.
Heteroskedasticity is an issue because it makes the usual standard errors of the coefficient estimates unreliable.
It can be addressed by including more independent variables or by using robust standard errors.
Presence of Outliers
Among these four, which is the best?
Checking Normality
It is best to check whether the distribution of the residuals is normal.
This can be accomplished by plotting the distribution of the standardized residuals, or by performing a normality test (e.g., Shapiro-Wilk).
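The Shapiro-Wilk test is available in scipy. A sketch on deterministic "residuals" taken at normal quantiles, so the test should clearly fail to reject normality (H0: the data are normally distributed):

```python
import numpy as np
from scipy import stats

# Deterministic residuals placed at the quantiles of a standard normal
# distribution, so they are as close to normal as a finite sample can be.
residuals = stats.norm.ppf(np.linspace(0.01, 0.99, 99))

stat, p_value = stats.shapiro(residuals)
print(p_value)  # large p-value: no evidence against normality
```

On real regression output you would pass the model's residuals instead; a small p-value would then indicate a departure from normality worth inspecting in the histogram.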
Practicing: Salary data
By applying a systematic model-building approach:
1. Consider causal graphs
2. Descriptive analysis and checking for outliers in both the Y and X variables
3. Correlation matrix of all available variables
4. Construct a model with all available independent variables and examine the value of the coefficients and the p-value for each coefficient.
5. If a p-value > 10%, consider removing that variable and run step 4 again. You should check adjusted R-squared again.
6. Once most (or all) X variables are statistically significant and the signs of the coefficients are consistent with expectations, you are closer to a good model.
7. Check all assumptions (next week's learning)
Comparing models 3 and 4
Comparing models 1 and 4
VIF and regression
• The VIF represents how much an independent
variable is contributing to multicollinearity based
on how correlated it is with other independent
variables

• VIF of 1 means the independent variable is uncorrelated with the other independent variables

• VIF between 1 and 5 means moderate multicollinearity

• VIF above 5 means high multicollinearity
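VIF values can be computed with statsmodels' variance_inflation_factor. A sketch with two deliberately near-collinear synthetic predictors (not the course data), which should both show a VIF well above 5:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two highly correlated predictors: x2 is x1 plus a little noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
X = np.column_stack([np.ones(100), x1, x2]) # include the constant column

# VIF of column j = 1 / (1 - R^2) from regressing it on the other columns
vif_x1 = variance_inflation_factor(X, 1)
vif_x2 = variance_inflation_factor(X, 2)

print(vif_x1, vif_x2)  # both far above 5: high multicollinearity
```

Dropping one of the two near-duplicate variables, as recommended earlier, would bring the remaining VIF back toward 1.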
