DECS Cheat Sheet
Mutually Exclusive / Disjoint Events - the events have no common elements and cannot happen at the same time. P(A or B) = P(A) + P(B)
Not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)
Conditional Probability - The probability that event A happens, given event B has occurred
P(A|B) = P(A and B)/P(B)
If these are independent events, then P(A|B) = P(A)
Independent Events - when the outcome of one event does not change the likelihood of the other; P(A and B) = P(A)*P(B)
Bayes’ Rule
P(B|A) = P(A|B)*P(B) / [P(A|B)*P(B) + P(A|Bc)*P(Bc)]
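A minimal numeric sketch in Stata, with hypothetical values P(A|B) = 0.9, P(B) = 0.01, P(A|Bc) = 0.05:
. display (0.9*0.01) / (0.9*0.01 + 0.05*0.99)   // P(B|A) ≈ 0.154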
Probability Trees
Random Variables
An uncertain numerical outcome
E(aX) = a*E(X)
E(X+Y) = E(X) + E(Y)
Total Revenue Example: X sells for $500 and Y sells for $1200, so E(Revenue) = E(500X + 1200Y) = 500E(X) + 1200E(Y)
Percentage Returns Example: We invest 60% of the portfolio in X and 40% in Y, so E(Return) = E(0.60X + 0.40Y) = 0.60E(X) + 0.40E(Y)
Correlation
A scale-free measure of the strength of the linear relationship between X and Y; more intuitive than covariance because it always lies between -1 and +1
Corr(X,Y) = Cov(X,Y)/(SD(X)∙SD(Y))
Normal Distribution
A normal random variable is completely described by its mean and standard deviation
- Adding or subtracting a constant shifts the mean but doesn't affect the SD: X + b ~ Normal(μ + b, σ)
- Multiplying/dividing by a constant scales both the mean and the SD: aX ~ Normal(aμ, aσ) (for a > 0)
- Combining: Y = aX + b ~ Normal(aμ + b, aσ)
P(X ≤ 2) = NORM.DIST( 2, 5, 3.6, 1) or [x, μ, σ, Cumulative?]
P(X ≥ 20) = 1 - NORM.DIST(20, 10, 3, 1)
How short would a call have to be to be in the quickest 2% of all calls?
=NORM.INV(.02, 10, 3) or (p, μ, σ)
Sampling Distributions
Suppose we know the average of n trials. Then we are 95% confident that
the true population mean, μ, falls within the range: x̄ ± t∙s/√n, where s is the sample standard deviation and t is the critical value with n - 1 degrees of freedom
You can get the t-value from Excel: T.INV.2T(two-tailed probability, n-1), e.g. T.INV.2T(0.05, n-1) for 95% confidence
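A minimal Stata sketch of the same interval, using a hypothetical variable y:
. summarize y
. display r(mean) - invttail(r(N)-1, 0.025)*r(sd)/sqrt(r(N))   // lower 95% bound
. display r(mean) + invttail(r(N)-1, 0.025)*r(sd)/sqrt(r(N))   // upper 95% bound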
Other tests
=T.DIST(t, n-1, 1) if testing Ha: μ < μ0 (left-tail p-value)
=T.DIST.2T(ABS(t), n-1) if testing Ha: μ ≠ μ0 (note: T.DIST.2T takes only two arguments, and t must be non-negative, hence ABS)
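A Stata alternative, sketched with a hypothetical variable y and null value 10:
. ttest y == 10   // output reports the two-sided p-value plus both one-sided p-values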
The Regression MODEL
Y = β0 + β1X1 + ε
β0 - Y-intercept
β1 - Slope / rate of change between X and Y
R-squared
Used to measure regression performance, but is NOT AS IMPORTANT AS IT SEEMS. It only measures the percentage of the variance of
Y that can be explained by the value of X.
- If high, it means that the regression equation does a much better job predicting y than simply using the sample mean of y
- It is based on observations included in the regression -- only useful if the model that generates future observations is the same as the
model that generated your data.
Prediction Interval
When we want to be 95% confident about the outcome of one particular instance, use a prediction interval; the confidence interval, by contrast, is for the mean. KPREDINT estimates a single observation (LEVEL questions only) and also gives the prediction interval - e.g. kpredint creates a prediction interval for the sales in any one specific store with X = x. The prediction interval is always wider than the CI because SE(indiv pred)² = SE(mean)² + SE(regression)². The actual value of Y in a line of data (e.g. one specific store) equals the mean of Y for stores with the same X-variable levels plus an error term, which captures the effect of other variables impacting that specific data point.
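A hedged sketch, assuming kpredint accepts the same expression syntax as the klincom examples below (hypothetical variables sales and sqft):
. regress sales sqft
. kpredint _b[_cons] + _b[sqft]*2000   // 95% prediction interval for one store with sqft = 2000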
Linear Combinations
klincom - lets you run hypotheses about linear combinations of regression coefficients. The output is the mean of y across the population with that specific x.
KLINCOM: Estimate the mean of a variable while controlling for other factors - for a LEVEL question (include the constant) or a CHANGE question (don't include the constant) related to the mean, OR for a CHANGE question related to a SINGLE OBS. E.g. "klincom _b[_cons] + _b[VARNAME]*(value level)"
To determine if a mean Y value is above e.g. 10: "klincom _b[_cons] + _b[VARNAME]*(value) - 10"; the output's p-value tests whether the mean differs from 10
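A minimal worked sketch with hypothetical variables (sales regressed on price), testing whether mean sales at price = 50 exceed 10:
. regress sales price
. klincom _b[_cons] + _b[price]*50 - 10   // a positive, significant estimate supports mean > 10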
Benchmarking
Using the regression model to benchmark certain situations. Can use residuals as rankings to see how performance pans out given what
was expected.
- Run the regression
- predict residuals, residuals
- sort residuals (smallest to largest)
- gsort -residuals (largest to smallest)
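A hedged end-to-end sketch with hypothetical variables (profit regressed on size, store IDs in store):
. regress profit size
. predict resid, residuals
. gsort -resid
. list store resid in 1/5   // the five biggest over-performers given what was expected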
Multiple Regression
To solve for omitted variable bias and isolate the effect of a specific variable on the dependent variable, we include control variables. We use this essentially to replicate the conditions of a Randomized Controlled Trial.
Interpreting Coefficients [continuous variables]
Constant = the true mean of the y variable when the x variables are all equal to 0
Beta 1 = how much the true mean of the y variable increases when x1 increases by 1 unit, while all other x variables remain constant
Dummy Variables
Used to test whether the difference between the two categories is statistically significant. Like having different CONSTANTS for the 0 and 1 categories, BUT (without a slope dummy) any continuous variable has the same IMPACT (slope) across both categories.
Generate a dummy in Stata: "generate black = 0", then "replace black = 1 if color == "black"". If there are N categories, create N-1 dummy variables.
Interpreting coefficients with one dummy variable: B0 = average (mean) of Y when the dummy category = 0; B0 + B1 = average (mean) of Y when the dummy category = 1; B1 = difference in Y between the two categories. To get the CI for the omitted category: klincom _b[_cons]; for category 1: klincom _b[_cons] + _b[X1]*1. Each dummy coefficient is the difference between the mean of Y for that category and the mean of Y for the excluded category, holding all other X-variables fixed. Each choice of excluded category lets us easily TEST the significance of the difference between it and the included categories.
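A minimal sketch of the color example above (hypothetical data with price and the string variable color):
. generate black = 0
. replace black = 1 if color == "black"
. regress price black
. klincom _b[_cons]                 // mean price for the omitted (non-black) category, with CI
. klincom _b[_cons] + _b[black]*1   // mean price for the black category, with CI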
Slope Dummy Variables
Used to measure differences in slopes between categories - i.e. when we want to combine two linear equations. Enables a DIFFERENT IMPACT (slope) for the different categories; include one when you believe the change in y per unit change in the continuous variable differs across the dummy's categories (if the impact is the same, just use a normal dummy). Stata: "generate SlopeDummyName = (continuous variable name)*(dummy variable name)"
What is the difference between the estimated rooms for tourists and locals when price equals 50? Here we are asking whether being a tourist adds a price differential: klincom tourist + 50*touristprice
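A hedged sketch of the tourist example (hypothetical variables rooms, price, and dummy tourist):
. generate touristprice = tourist*price
. regress rooms price tourist touristprice
. klincom tourist + 50*touristprice   // tourist-vs-local difference in mean rooms at price = 50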
Multicollinearity
The opposite of omitted variable bias: caused by included variables that move together / are strongly related to one another. Occurs when correlation among the independent variables is too high. Makes it harder to estimate the impact of each variable individually on the Y variable, therefore SEs (and hence p-values) may be inflated.
- Test using VIF or Variance Inflation Factor to identify excessive multicollinearity. If the VIF > 10 then you have a serious
multicollinearity problem (.vif)
- Use the partial F-test (testparm) to find out if the two variables are jointly significant. If the p-value is small, we can say that at least one of the variables is significantly related to Y - so don't remove both variables.
- F-TEST: hypothesis test (returns a p-value) that at least one x-variable coefficient ≠ 0; like a p-value for the whole regression; it accounts for the number of variables being tested and requires stronger evidence for more variables. If the overall F-test p-value is high but an individual coefficient's p-value is low, that individual result is likely spurious correlation
- As you add variables to your regression, the likelihood that one of the "junk" variables appears statistically significant increases
- PARTIAL F-TEST: Tests if some of the variables in a regression are jointly statistically significant; e.g. if there is enough evidence that X2 OR X3 is related to Y
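A minimal sketch of both diagnostics (hypothetical variables y, x1, x2, x3):
. regress y x1 x2 x3
. vif              // VIF > 10 flags a serious multicollinearity problem
. testparm x2 x3   // partial F-test: are x2 and x3 jointly significant?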
Influential Observations
Outlier (extreme y values) - a data point whose y value is far from the line. We identify outliers by looking at residuals, defining an outlier as more than about 1.96 standard deviations from the line; by that definition about 5% of our data should be outliers.
.inflobs - command that helps identify data points that may have a disproportionate influence on your regression.
Leverage (extreme x values) - a point that can have a residual near zero but is lonely in its x value. Such points tend to shape the regression line more than other data points.
Cook’s D - How much a single data point moves the regression line.
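A hedged sketch for flagging outliers by the ~1.96-SD rule (hypothetical variables y and x):
. regress y x
. predict rstd, rstandard                     // standardized residuals
. list if abs(rstd) > 1.96 & !missing(rstd)   // observations far from the line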
Extrapolation
Trying to predict a value outside the range of x values used to fit the model
Log-Log
The log-log model is: ln(Y) = β0 + β1 ln(X)
To use KLINCOM or KPREDINT you need to take the log of the X value you want to test (e.g. to test X = 2,
you need to multiply the coefficient by ln(2) ≈ 0.69)
REMEMBER! Then convert your prediction back to original units AND multiply by the correction factor exp((SE of regression)²/2) (works for estimates and CI endpoints). You need the correction factor ONLY for estimates around the mean, NOT for individual predictions!
Interpreting coefficients in log-log: each 1% increase in X1 is associated with an approximately β1% change in the Y variable
β1 is also called the elasticity of the Y variable with respect to the X variable
UnitCost = a(Experience)^b --> ln(UnitCost) = ln(a) + b ln(Experience)
Correction Factor
Logging and exponentiating creates a bias in the average, so we can correct that bias by multiplying fitted values and CI (not PI) by a
correction factor
. display exp(e(rmse)^2/2)
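A hedged end-to-end sketch of the learning-curve example (hypothetical variables unitcost and experience):
. generate lncost = ln(unitcost)
. generate lnexp = ln(experience)
. regress lncost lnexp
. klincom _b[_cons] + _b[lnexp]*ln(2)   // mean of ln(unitcost) at experience = 2
. display exp(e(rmse)^2/2)              // correction factor for converting fitted values and CIs back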
Independence
Assumption: residuals are independent; violated when data is "groupy" (panel) AND you aren't using a fixed-effects regression. Each error must be independent of every other error. This fails under autocorrelation, e.g. in time series data.
- Identify via a pattern in the residual plot
- Can correct autocorrelation by making the dependent variable the change in y from day to day; for grouped data, in Stata add ", cluster(GroupVariableName)" at the end of your regression
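A minimal clustering sketch (hypothetical group variable storeid):
. regress y x, cluster(storeid)   // SEs now allow errors to be correlated within each store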
Omitted Variable Bias (OVB)
We can say that there is a negative relationship (quantified below) between how long a student studied for the midterm and the student's IQ. Explanation: comparing the two regressions, the OVB on the coefficient of study from omitting IQ equals 8 - 12 = -4, which is negative. This OVB is the product of (1) the relation between study time and IQ and (2) the relation between IQ and scores when study time is held fixed. The latter is the estimated coefficient of 30 on IQ in the second regression. Thus -4 = (?)*(30), so the ? representing the relation between study time and IQ must equal -4/30 ≈ -0.133, meaning that in this data an additional hour of study time is associated with an IQ lower by 0.133 points. In other words, in our data, students with lower IQs tended to study more for the midterm.
Solutions to OVB:
- Find data for the omitted variables (however, there may be omitted factors we don’t think of)
- Use fixed effects - requires panel data (see Panel Data & Fixed Effects section below)
Building Models
Convincing results have five features:
1. The coefficients are unbiased (or known biases do not pose a problem)
2. The model is parsimonious
3. The standard errors (and associated confidence and prediction intervals) are correctly estimated
4. The results are robust
5. When required, the model is well identified, meaning that we can sort out cause and effect
The shop is considering installing new “puncture resistant” tires on their Simple bikes. The new tires will cost an extra $45 per pair but will reduce the average number of visits in the first year by 2.5. Is this upgrade a good
idea for the bike shop? Justify your recommendation.
. klincom -2.5*repairs - 45
( 1) - 2.5*repairs = 45
------------------------------------------------------------------------------
netprofits | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 10.55826 5.146492 2.05 0.043 .3621264 20.75438
------------------------------------------------------------------------------
Since the estimated net effect of $10.56 per bike is positive and statistically significant (p = 0.043; the 95% CI [0.36, 20.75] excludes zero), the upgrade is a good idea for the bike shop.
The non-profit organization is considering an aggressive new marketing campaign to encourage attendance at the farmer's market. They believe that the campaign will increase attendance by 50 people per week on average,
but they will have to increase rents for each seller. The Aloha Honey Co would have to pay an extra $150 per week to support the campaign. Should the company support this new campaign? Explain.
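By analogy with the bike-shop test above, and assuming the weekly-profit regression has an attendance coefficient named attend (hypothetical), one would test the net effect with:
. klincom 50*attend - 150   // support the campaign only if the estimate is positive and statistically significant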