Topic 7 Regression (Cont.)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$   (8.10)
◦ Consider $X_1$, a dummy variable for gender: $X_1$ takes a value of zero (not female) or one (female)
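As a minimal illustration (not the course data), the sketch below includes a 0/1 gender dummy alongside experience in a regression fitted with Python's statsmodels; the column names and numbers are made up.

```python
# Hypothetical illustration of a binary dummy in a regression (not the course dataset).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "salary":     [42, 48, 55, 61, 47, 52, 66, 58],   # made-up salaries (in $1,000s)
    "experience": [1, 3, 6, 9, 2, 4, 11, 7],          # years of experience
    "female":     [1, 0, 1, 0, 1, 0, 1, 0],           # dummy: 1 = female, 0 = not female
})

# salary = b0 + b1*experience + b2*female + error
model = smf.ols("salary ~ experience + female", data=df).fit()
print(model.params)   # b2 is the constant shift in salary for female employees
```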
[Figure: current salary vs. experience, with separate fitted lines for female and other employees]
Four dummy variables for five categories
General rules for the number of dummy variables
• When a categorical variable has k (> 2) levels/categories, we need
to add (k−1) additional dummy variables to the model.
◦ The variable “Department” has five levels
◦ Choose one level, such as “Research and Development” as the reference level
◦ Four dummy variables are needed
◦ Each dummy compares to the reference level.
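To make the (k−1) rule concrete, here is a small Python/pandas sketch: "Research and Development" is kept as the reference level and four dummies are created for the other levels. The other department labels are hypothetical.

```python
# Sketch: turning a 5-level categorical variable into 4 dummy variables.
import pandas as pd

# Hypothetical department labels; "Research and Development" is listed first
# so that it becomes the reference level.
departments = pd.Series(
    ["Research and Development", "Sales", "HR", "Finance", "IT", "Sales"],
    name="Department",
    dtype=pd.CategoricalDtype(
        categories=["Research and Development", "Finance", "HR", "IT", "Sales"]
    ),
)

# drop_first=True keeps k-1 = 4 dummies; each is compared to the dropped reference level.
dummies = pd.get_dummies(departments, drop_first=True)
print(dummies)
```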
How to prepare data in Real Statistics
• In Excel, with the Real Statistics add-in:
◦ Categorical Coding Regression | Real Statistics Using Excel (real-statistics.com)
Case: Months employed and Sales (Data: Reynolds)
[Scatter plot: scales sold vs. months employed]
Case: Months employed and Sales (Data: Reynolds)
[Residual plot: residuals vs. predicted values]
$Y = \beta_0 + \beta_1 X + \beta_2 X^2$
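A quadratic term can be added as just another regressor. The sketch below, using hypothetical months-employed and scales-sold numbers (not the actual Reynolds data), fits such a model with statsmodels.

```python
# Sketch: quadratic regression Y = b0 + b1*X + b2*X^2 + error (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
months = rng.uniform(0, 120, size=40)                                      # months employed
scales = 50 + 6 * months - 0.03 * months**2 + rng.normal(0, 20, size=40)   # scales sold

df = pd.DataFrame({"months": months, "scales": scales})

# I(months**2) tells the formula interface to treat months squared as one regressor.
quad = smf.ols("scales ~ months + I(months**2)", data=df).fit()
print(quad.params)
```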
[Residual plot for the quadratic model: residuals vs. predicted values]
Case: Months employed and Sales (Data: Reynolds), quadratic regression model
[Scatter plot with Price on the horizontal axis]
Case: Advertising expenditure – Price and Sales
Data: Tyler
Salary = intercept + (coefficient value) × Age + (coefficient value) × MBA + (coefficient value) × MBA × Age + ε
• If MBA = 0, then
Salary = intercept + (coefficient value) × Age + (coefficient value) × 0 × Age + ε
so the MBA term and the interaction term drop out, leaving Salary = intercept + (coefficient value) × Age + ε.
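This Salary model can be estimated in one line with a regression formula; the sketch below is a hypothetical illustration in Python (statsmodels), where `Age * MBA` expands to Age + MBA + Age:MBA and the data are made up.

```python
# Sketch: continuous-by-categorical interaction (made-up salary data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "Salary": [55, 61, 70, 82, 58, 75, 90, 101],
    "Age":    [25, 30, 35, 45, 27, 33, 40, 50],
    "MBA":    [0, 0, 0, 0, 1, 1, 1, 1],          # dummy: 1 = has an MBA
})

# Salary = b0 + b1*Age + b2*MBA + b3*(MBA x Age) + error
fit = smf.ols("Salary ~ Age * MBA", data=df).fit()
print(fit.params)

# When MBA = 0, predicted salary is b0 + b1*Age;
# when MBA = 1, both the intercept (b0 + b2) and the slope (b1 + b3) shift.
```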
• The coefficient on the interaction of two categorical (dummy) variables captures the additional difference for observations that fall into both groups at once, i.e., the intersection of the two categories. It acts as a constant that shifts the model when an observation belongs to both categorical groups simultaneously.
Continuous by Continuous
For example
◦ Experience and Years of Education
Take the MLR model:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
The partial effect of $X_1$ is given by:
$\beta_1 + \beta_3 X_2$
And the partial effect of $X_2$ is given by:
$\beta_2 + \beta_3 X_1$
Clearly, the partial effect of one of the continuous variables in the interaction term depends on the value of the variable it is interacted with.
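To see this numerically, here is a minimal sketch (simulated experience and education data, hypothetical coefficients) that fits the interaction model with statsmodels and evaluates $\beta_1 + \beta_3 X_2$ at a few education levels.

```python
# Sketch: the partial effect of experience depends on years of education (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
experience = rng.uniform(0, 30, n)
education = rng.uniform(10, 20, n)
wage = 5 + 0.4 * experience + 1.2 * education + 0.05 * experience * education + rng.normal(0, 2, n)

df = pd.DataFrame({"wage": wage, "experience": experience, "education": education})
fit = smf.ols("wage ~ experience * education", data=df).fit()

b = fit.params
for edu in (12, 16, 20):
    # Partial effect of experience: beta_1 + beta_3 * education
    effect = b["experience"] + b["experience:education"] * edu
    print(f"Partial effect of experience at education = {edu}: {effect:.3f}")
```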
Co- and multi-collinearity problems
• Multicollinearity occurs when we have two or more independent variables that are highly correlated
with one another.
• If those variables capture similar things, for example income and disposable income, or body size and weight, then we should consider using only one (the better) variable.
• Multicollinearity affects the statistical significance of individual t-tests but does not have much impact on predictive power.
• If you have two independent variables that are highly correlated with one another, think about:
◦ whether they measure the same thing
◦ whether they could be the result of a chain or a fork in the causal diagram
• Avoiding multiple variables that capture the same thing helps us (1) avoid multicollinearity problems (i.e., when the model is used for situational assessment) and (2) avoid "over-fitting the model" in predictive analytics. A quick diagnostic for this is sketched below.
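One common way to check for multicollinearity, not shown in the slides, is the variance inflation factor (VIF). The sketch below computes VIFs with statsmodels on simulated columns where disposable income is nearly a copy of income; values well above about 5 to 10 are usually read as a warning sign.

```python
# Sketch: variance inflation factors (VIF) for candidate predictors (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 300
income = rng.normal(50, 10, n)
disposable_income = 0.8 * income + rng.normal(0, 1, n)   # nearly a copy of income
age = rng.normal(40, 8, n)

X = sm.add_constant(pd.DataFrame({
    "income": income,
    "disposable_income": disposable_income,
    "age": age,
}))

# VIF of each predictor; the constant column is skipped in the printout.
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, round(variance_inflation_factor(X.values, i), 1))
```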
From causal diagrams to mathematical equations
• The simplest form of empirical model is a regression model, as below.
• Stage 1
• Models with higher R-squared and adjusted R-squared are often better.
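As a small check on this point, the sketch below fits two hypothetical models on simulated data: adding an irrelevant variable never lowers R-squared, but adjusted R-squared penalizes it.

```python
# Sketch: comparing models by R-squared and adjusted R-squared (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise = rng.normal(size=n)                 # unrelated to y
y = 2 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "noise": noise})

m1 = smf.ols("y ~ x1 + x2", data=df).fit()
m2 = smf.ols("y ~ x1 + x2 + noise", data=df).fit()

# R-squared never decreases when a variable is added; adjusted R-squared penalizes it.
print("m1:", round(m1.rsquared, 3), round(m1.rsquared_adj, 3))
print("m2:", round(m2.rsquared, 3), round(m2.rsquared_adj, 3))
```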
Predicted vs Actual Plots
• The left-side model explains more of the variation in revenue than the right-side model.
Predicted vs Residual Plot
• The right-side chart is
the “residual plot”.
• We can ask the Excel add-in to generate these plots.
• We can use these
plots to assess model
assumptions.
Ideally, the plot should look like these examples
• The more symmetrically the residuals are distributed, tending to cluster towards the middle of the plot, the better.
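Outside Excel, the same diagnostic plots can be drawn with a few lines of Python (statsmodels and matplotlib); the data below are simulated for illustration.

```python
# Sketch: predicted-vs-actual and residual plots for a fitted model (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 80)
y = 3 + 2 * x + rng.normal(0, 2, 80)
df = pd.DataFrame({"x": x, "y": y})

fit = smf.ols("y ~ x", data=df).fit()
predicted = fit.fittedvalues
residuals = fit.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: predicted vs. actual; points near the 45-degree line indicate a good fit.
ax1.scatter(predicted, df["y"])
ax1.set_xlabel("Predicted")
ax1.set_ylabel("Actual")

# Right: residuals vs. predicted; ideally a symmetric band centered on zero.
ax2.scatter(predicted, residuals)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted value")
ax2.set_ylabel("Residual")

plt.tight_layout()
plt.show()
```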