DECS Cheat Sheet

The document covers fundamental concepts in probability, random variables, and statistical inference, including mutually exclusive events, conditional probability, and Bayes' Rule. It also discusses regression analysis, including linear regression, multiple regression, and the importance of understanding relationships between variables, as well as techniques for estimating and benchmarking. Additionally, it addresses issues like multicollinearity, influential observations, and the use of log and linear models in data analysis.

Probability

Mutually Exclusive / Disjoint Events - the events have no common elements and cannot happen at the same time. P(A or B) = P(A) + P(B)
Not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)

Conditional Probability - The probability that event A happens, given event B has occurred
P(A|B) = P(A and B)/P(B)
If these are independent events, then P(A|B) = P(A)

Independent Events - when the outcome of one event does not change the likelihood of the other; P(A and B) = P(A)*P(B)

Bayes’ Rule
P(B|A) = P(A|B)*P(B) / [P(A|B)*P(B) + P(A|Bc)*P(Bc)]
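A quick numeric check of Bayes' Rule in Python (a minimal sketch; the prevalence and test rates below are made-up numbers, not from the course):

# Hypothetical example: B = condition of interest, A = positive signal.
p_B = 0.01            # P(B)
p_A_given_B = 0.95    # P(A|B)
p_A_given_Bc = 0.10   # P(A|Bc)

p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)   # total probability of A
p_B_given_A = p_A_given_B * p_B / p_A                # Bayes' Rule
print(round(p_B_given_A, 4))                         # ~0.0876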

Probability Trees

Random Variables
An uncertain numerical outcome
E(aX) = a*E(X)
E(X+Y) = E(X) + E(Y)

Expected Value in Excel = SUMPRODUCT(A range of x values, B range of probability)

Total Revenue Example: X sells for $500 and Y sells for $1200, so total revenue = 500X + 1200Y and E(500X + 1200Y) = 500E(X) + 1200E(Y)
Percentage Returns Example: We invest 60% of the portfolio in X and 40% in Y, so the portfolio return is 0.60X + 0.40Y and E(0.60X + 0.40Y) = 0.60E(X) + 0.40E(Y)

Variance - measures how spread out the outcomes are around the expected value


Var(X) = p1[x1 - E(X)]^2 + p2[x2 - E(X)]^2 + ... + pn[xn - E(X)]^2
Var(X+Y) = Var(X) + Var(Y) (only if X and Y are independent)
Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) (in general)
Var(aX+bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
Var(aX) = a^2 Var(X)

Two possible investments with independent returns


Asset X: rate of return X, with E(X) = .12; Var(X) = .04
Asset Y: rate of return Y, with E(Y) = .05; Var(Y) = .01
Consider a 20-80 mix of X and Y
We define P, our portfolio's return, to be: P = .2X + .8Y
E(P) = .2E(X) + .8E(Y) = .2(.12) + .8(.05) = .064
Var(P) = .2^2 Var(X) + .8^2 Var(Y) = .04(.04) + .64(.01) = .008
Compared with holding Y alone, the combined portfolio has less risk and a greater expected return. Take the square root of the variance to get the standard deviation - the measure of risk.
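The same portfolio arithmetic in Python (a sketch reproducing the 20-80 example above; returns are assumed independent, so there is no covariance term):

e_x, var_x = 0.12, 0.04   # Asset X
e_y, var_y = 0.05, 0.01   # Asset Y
a, b = 0.2, 0.8           # portfolio weights

e_p = a * e_x + b * e_y               # E(P) = 0.064
var_p = a**2 * var_x + b**2 * var_y   # Var(P) = 0.008 (Cov = 0 by assumption)
sd_p = var_p ** 0.5                   # standard deviation = measure of risk, ~0.089
print(e_p, var_p, sd_p)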

Correlation, Gauss and CLT


Covariance of Random Variables
Covariance measures how X and Y vary together - the direction and strength of their co-movement. It is the basic way we measure relationships between random variables.
Cov(X,Y) = Σ P(X=x,Y=y) [x-E(X)][y-E(Y)]
- Can be small when the two random variables do not tend to move together very much, or they just don’t vary much at all
- Can be large when the two random variables generally move together quite closely or they vary an awful lot.
=COVARIANCE.P(A2:A6,B2:B6)

Correlation
A more reliable and intuitive (unit-free) measure of how strongly two variables move together
Corr(X,Y) = Cov(X,Y) / [SD(X)*SD(Y)]
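An illustrative check in Python (the x and y values are made up; this mirrors Excel's COVARIANCE.P and the Corr formula above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

cov_p = np.cov(x, y, bias=True)[0, 1]        # population covariance, like COVARIANCE.P
corr = cov_p / (x.std() * y.std())           # Corr(X,Y) = Cov(X,Y) / [SD(X)*SD(Y)]
print(cov_p, corr, np.corrcoef(x, y)[0, 1])  # the last two values should match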

Normal Distribution
A normal random variable is completely described by its mean and standard deviation
- Adding or subtracting a constant keeps it normal: X + b ~ Normal(μ + b, σ)
- Multiplying/dividing by a constant keeps it normal: aX ~ Normal(aμ, aσ)
- Combining: Y = aX + b ~ Normal(aμ + b, aσ)
P(X ≤ 2) = NORM.DIST( 2, 5, 3.6, 1) or [x, μ, σ, Cumulative?]
P(X ≥ 20) = 1 –NORM.DIST(20, 10, 3, 1)
How short would a call have to be to be in the quickest 2% of all calls?
=NORM.INV(.02, 10, 3) or (p, μ, σ)
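Python equivalents of the three Excel calls above (a sketch using scipy, with the same μ and σ values):

from scipy.stats import norm

p_le_2 = norm.cdf(2, loc=5, scale=3.6)       # =NORM.DIST(2, 5, 3.6, 1)
p_ge_20 = 1 - norm.cdf(20, loc=10, scale=3)  # =1 - NORM.DIST(20, 10, 3, 1)
cutoff = norm.ppf(0.02, loc=10, scale=3)     # =NORM.INV(.02, 10, 3)
print(p_le_2, p_ge_20, cutoff)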

Z = (X - μ)/σ ~ Normal(0,1); P(-1.96 ≤ Z ≤ 1.96) = .9500


Standard Normal

CLT - Central Limit Theorem


For any distribution of X, when n is large the total of n independent trials of X is approximately normally distributed.

Sum ~ Normal(nμ, sqrt(n)·σ); where E(Sum) = nμ and Var(Sum) = nσ^2 - use when we only care about the SUM TOTAL of the trials.

Average ~ Normal(μ, σ/sqrt(n)); where E(Xbar) = μ and Var(Xbar) = σ^2/n

Sampling Distributions

Xbar~ Normal (μ, 𝜎/sqrt(n))
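A small simulation of the sampling distribution (a sketch; the exponential population and all numbers are arbitrary): even for a skewed population, Xbar is roughly Normal(μ, σ/sqrt(n)) when n is large.

import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 50, 10_000
samples = rng.exponential(scale=mu, size=(reps, n))  # skewed population with mean = sd = 2
xbar = samples.mean(axis=1)                          # 10,000 sample means

print(xbar.mean(), mu)                # close to μ
print(xbar.std(), mu / np.sqrt(n))    # close to σ/sqrt(n)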


Confidence Intervals

Suppose we know the average of n trials, Xbar. Then we are 95% confident that
the true population mean, μ, falls within the range: Xbar ± 1.96·σ/sqrt(n)

T-distribution - used when we don't know the population standard deviation.
The 95% confidence interval in the unknown-population-standard-deviation case is: Xbar ± t·s/sqrt(n)

You can get the t-value from Excel: T.INV.2T(two-tailed probability, n-1), e.g. T.INV.2T(0.05, n-1) for a 95% CI
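A t-based 95% CI in Python (a sketch that borrows the sample numbers from the hypothesis-testing example below: Xbar = 10,362, s = 2,287, n = 40):

import numpy as np
from scipy import stats

xbar, s, n = 10362, 2287, 40
t_crit = stats.t.ppf(0.975, df=n - 1)         # same t-value as T.INV.2T(0.05, 39)
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)   # roughly (9630, 11093)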

Statistical Inference / Hypothesis Testing


If a certain condition (the null hypothesis) were met, how surprising would it be to see data like this? That degree of surprise is the significance level, or p-value.
Claim: The population mean exceeds $10,000.
Supporting evidence:
1. X bar= $10,362, s= $2,287, n= 40.
2. “Such a high sample mean would be surprising if the population mean did not exceed $10,000”
The null hypothesis (H0): the “default” claim.
The alternative hypothesis (Ha): the claim one is trying to establish (or “demonstrate” or “prove”).
For us:
H0: μ≤ 10,000
Ha: μ> 10,000
=T.DIST.RT(t, n-1) gives the probability of getting an average this high (or higher)
Conclude: If the population mean were 10,000, there would only be a 16.1% chance of getting a sample mean this large or larger. In our
case a p-value of 16% is not considered “small enough” to reject the null hypothesis. p > α, so we do not reject the null hypothesis.
The p-value of our test is p = 16.1%.

Other tests
=T.DIST(t, n-1, 1) if testing Ha: μ < the H0 value (left-tailed)
=T.DIST.2T(ABS(t), n-1) if testing Ha: μ ≠ the H0 value (two-tailed)
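Reproducing the example's p-value in Python (a sketch; same numbers as above, H0: μ ≤ 10,000 vs Ha: μ > 10,000):

import numpy as np
from scipy import stats

xbar, s, n, mu0 = 10362, 2287, 40, 10000
t = (xbar - mu0) / (s / np.sqrt(n))   # test statistic
p_right = stats.t.sf(t, df=n - 1)     # right-tail probability, like T.DIST.RT(t, n-1)
print(t, p_right)                     # p ≈ 0.161, so do not reject H0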

Linear Regression: Prediction


What single factor is the biggest determinant in the price of a home?

The Regression MODEL: Y = β0 + β1·X1 + ε

β0 - Y-intercept
β1 - Slope / rate of change between X and Y

Best-Fit Line - Least Squares Method


The Regression LINE –
ŷ = b0 + b1x1
Degrees of freedom = n-k-1 where k is the number of x variables
Statistical Significance ≠ Economic Significance

Root MSE - Root Mean Squared Error


Uses our data set to estimate the standard deviation of the error term

R-squared
Used to measure regression performance, but is NOT AS IMPORTANT AS IT SEEMS. It only measures the percentage of the variance of
Y that can be explained by the value of X.
- If high, it means the regression equation does a much better job predicting y than simply using the sample mean of y
- It is based on observations included in the regression -- only useful if the model that generates future observations is the same as the
model that generated your data.

Prediction Interval
When we want to be 95% confident about the outcome of one particular instance. KPREDINT: Estimate a single observation – only
LEVEL questions also gives prediction interval Whereas the confidence interval is for the mean. Kpredint - creates a prediction interval
for the sales in any one specific store with X=x. Prediction interval is always bigger than CI because SE(induv pred)2 = SE(mean) 2 +
SE(regression)2. The actual value of Y in a line of data (e.g. one specific store) is equal to the mean of the Y variable for other specific
stores with the same X variable levels plus an error term, which measures the effect of other variables impacting that specific data point.
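A sketch in Python/statsmodels of the CI-for-the-mean vs. prediction-interval distinction (the store data and variable names below are made up; KPREDINT itself is a Stata command):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"price": [30, 40, 50, 60, 70, 80],
                   "sales": [210, 190, 178, 150, 140, 120]})
fit = smf.ols("sales ~ price", data=df).fit()

pred = fit.get_prediction(pd.DataFrame({"price": [50]}))
# mean_ci_* columns = CI for the mean; obs_ci_* columns = (wider) prediction interval
print(pred.summary_frame(alpha=0.05))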

Regression: Estimating and Benchmarking


Rate of Change / Estimating
Instead of using a regression equation to make predictions of Y for a particular value of X, we want to use it to understand how changes
in X relate to changes in Y. The key here is the slope. It answers the question when X1 increases by ONE, by how much does Y
increase? Use Klincom.
- CAPM or firm specific risk, where we compare to a beta of 1
- Economics of Pricing

Linear Combinations
Klincom - lets you run hypotheses about linear combinations of regression coefficients. The output is the mean of y across the
population with that specific x.
KLINCOM: estimate the mean of a variable while controlling for other factors (for a LEVEL question (include the constant) or a CHANGE
question (don't include the constant) about the mean, OR for a CHANGE question about a SINGLE OBS), e.g. “klincom _b[_cons] +
_b[VAR1 Name]*(value level)”
To determine if a mean Y value is above e.g. 10: “klincom _b[_cons] + _b[VAR1]*(VALUE) - 10”; this returns the p-value for mean > 10

Benchmarking
Using the regression model to benchmark certain situations. Can use residuals as rankings to see how performance pans out given what
was expected.
- Run the regression
- predict residuals, residuals
- sort residuals - smallest to largest
- gsort -residuals - largest to smallest
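The same residual-ranking idea in Python (a sketch with made-up store data; in Stata you would use predict and gsort as above):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"store": ["A", "B", "C", "D"],
                   "sales": [250, 180, 300, 210],
                   "shoppers": [60, 50, 70, 55]})
fit = smf.ols("sales ~ shoppers", data=df).fit()
df["residual"] = fit.resid                            # performance vs. what the model expects
print(df.sort_values("residual", ascending=False))    # largest residual = best performance given expectations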

Multiple Regression
To solve for omitted variable bias and isolate the effect of a specific variable on the dependent variable, we include control variables. We
use this to approximate the conditions of a Randomized Controlled Trial.
Interpreting Coefficients [continuous variables]
Constant = the true mean of the y variable when the x variables are all equal to 0
Beta 1 = how much the true mean of the y variable increases when x1 increases by 1 unit, while all other x variables remain constant

Dummy Variables
Used to test whether the difference between the 2 categories is statistically significant. Like having different CONSTANTS for the 0 and 1
categories, BUT (without a slope dummy) any continuous variables have the same IMPACT (slope) across categories.
Generate a dummy in Stata: “generate black = 0”, then “replace black = 1 if color == "black"”. If there are N categories, create N-1 dummy
variables.
Interpreting coefficients when there is only one dummy variable: B0 = average (mean) of the Y variable when the dummy category = 0;
B0 + B1 = average (mean) of the Y variable when the dummy category = 1; B1 = difference in the Y variable between the two categories.
To get a CI for the omitted category: klincom _b[_cons]; for category 1: klincom _b[_cons] + _b[X1]*1.
Each coefficient is the difference between the mean of Y for that category and the mean of Y for the excluded category, holding all other
X-variables fixed. The choice of excluded category lets us easily TEST the significance of the difference between the excluded category
and each included one.
Slope Dummy Variables
Used to measure differences in slopes between the categories; in effect it combines two linear equations. Enables a DIFFERENT
IMPACT (slope) for the different categories; include one when you believe the change in y for a change in the continuous x-variable
differs across the categories of the dummy (if the impact is the same, just use a normal dummy). Stata: “generate
SlopeDummyName = (continuous variable name)*(dummy variable name)”
What is the difference between the estimated rooms for tourists and locals when price equals 50? Here we are asking if we think the
impact of being a tourist adds a price differential. Klincom tourist + 50*touristprice
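A sketch of the dummy + slope-dummy setup in Python (the rooms/price/tourist data are invented to mirror the question above; the t-test plays the role of klincom):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "rooms":   [20, 18, 15, 12, 30, 26, 21, 16],
    "price":   [40, 50, 60, 70, 40, 50, 60, 70],
    "tourist": [0, 0, 0, 0, 1, 1, 1, 1],
})
df["touristprice"] = df["tourist"] * df["price"]       # the slope dummy

fit = smf.ols("rooms ~ price + tourist + touristprice", data=df).fit()
print(fit.params)
print(fit.t_test("tourist + 50*touristprice = 0"))     # tourist-vs-local gap at price = 50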

Multicollinearity
The opposite of omitted variable bias: it is caused by variables included in the regression that vary with each other / are strongly related
to one another. Occurs when correlation among the independent variables is too high. It makes it more difficult to estimate the
impact of each variable individually on the Y variable, therefore p-values and SEs may be high.
- Test using VIF or Variance Inflation Factor to identify excessive multicollinearity. If the VIF > 10 then you have a serious
multicollinearity problem (.vif)
- Use the partial F-test (.testparm) to find out whether the collinear variables are jointly significant. If the p-value is small, then we can say that at least
one of the variables is significantly related to Y. But don't remove both variables.
- F-TEST: hypothesis test (returns a p-value) that at least one x-variable coefficient ≠ 0; like a p-value for the whole
regression; it takes into account the number of variables being tested and requires stronger evidence when there are more variables. If
the overall F-test is insignificant but an individual coefficient's p-value is low, that individual significance is likely spurious correlation
- As you add variables to your regression, the likelihood that one of the “junk” variables is statistically significant increases
- PARTIAL F-TEST: Tests if some of the variables in a regression are jointly statistically significant; e.g. if there is enough evidence
that X2 OR X3 are related to Y
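A VIF check in Python (a sketch; x1 and x2 are simulated to be nearly collinear so their VIFs come out large; Stata's .vif does the same after a regression):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # almost a copy of x1
x3 = rng.normal(size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))   # x1 and x2 VIFs come out >> 10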

Influential Observations
Outlier (extreme y values) - if the y value of our data point is far away from the line. We can identify outliers by looking at residuals. We
define this as being more than about 1.96 standard deviations away from the line. About 5% of our data should be outliers.

.inflobs - command that helps identify the data points that may be having disproportionate influence on your regression.

Leverage (extreme x values) - a high-leverage point may have a residual near zero, but it is lonely in its x value. Such points tend to shape the regression line more than other
data points.

Cook’s D - How much a single data point moves the regression line.

Extrapolation
Trying to predict a value of Y for an X outside the range of the data used to fit the model

Log v. Linear Models


If you plot rvfplot and there is a pattern, the relationship is likely not linear. Two ways to deal with nonlinear functions:
quadratic (applied to one or more x's) and logarithmic (applied to y alone, or to both y and the x's).

Quadratic

Semi-log - a log tends to pop up when the impact of X is multiplicative rather than additive. Only log the Y variable. Typical cases:
- Diminishing returns
- Learning curves
- Growth rates
- Price elasticity

The semi-log model is: logY = β0 + β1·X


To use KLINCOM or KPREDINT in a semi-log model, use the NORMAL UNITS for the X variables in your command.
REMEMBER! Then convert your prediction back to original units AND multiply by the correction factor exp((SE of regression)^2 / 2) (this
applies to estimates and CI levels). Only estimates around the mean need the correction factor, NOT individual predictions!
In Stata: di exp(prediction#)*exp((e(rmse)^2)/2)
Interpreting the coefficient in a semi-log model: each 1-unit increase in the X1 variable is associated with an approximately (100*B1)% change in
the Y variable
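A semi-log sketch in Python (made-up x/y data; it shows the coefficient interpretation and the exp(RMSE^2/2) correction when converting a mean estimate back to original units):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                   "y": [110, 135, 160, 205, 240, 300, 360, 450]})
fit = smf.ols("np.log(y) ~ x", data=df).fit()

print(fit.params["x"] * 100)                               # approx. % change in y per 1-unit change in x
log_pred = fit.predict(pd.DataFrame({"x": [5]})).iloc[0]   # prediction in log units
rmse = np.sqrt(fit.mse_resid)                              # SE of the regression
print(np.exp(log_pred) * np.exp(rmse**2 / 2))              # corrected estimate of mean y at x = 5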

Log-Log
The log-log model is: logY = β0 + β1·logX
To use KLINCOM or KPREDINT you need to take the log of the X-variable value you want to test (e.g. if you want to test 2,
multiply the coefficient by ln(2) ≈ 0.69).
REMEMBER! Then convert your prediction back to original units AND multiply by the correction factor exp((SE of regression)^2 / 2) (this
applies to estimates and CI levels). Only estimates around the mean need the correction factor, NOT individual predictions!
Interpreting the coefficient in a log-log model: each 1% increase in X1 is associated with an approx. B1% change in the Y variable.
B1 is also called the elasticity of the Y variable with respect to the X variable.
UnitCost = a(Experience)b --> ln(UnitCost) = ln(a) + b ln(Experience)

Correction Factor
Logging and exponentiating creates a bias in the average, so we can correct that bias by multiplying fitted values and CI (not PI) by a
correction factor
. display exp(e(rmse)^2/2)

Error Assumptions in Regression


In order to compute the standard errors of the estimated coefficients in regressions, we have to make two assumptions about the errors:

Homoskedasticity and Cluster-Robust Standard Errors


The distribution (variation of individual observations around the regression line) of each observation's error is identical. The model
violates this and is heteroskedastic if the magnitudes of the errors are related to any x. The coefficient estimates are still fine, but SEs,
CIs/PIs and p-values are wrong.
- .rvfplot, and see if there is a pattern / cone shape
- .hettest is a better way. If the p-value is significant, the model is heteroskedastic
- Logs can correct for this. If logs still don't help, add vce(robust) to the end of the regression command
- . regress sales shoppers, vce(robust)
- Cluster-robust SEs (see Independence below) reduce the effective sample size to the number of groups, so you typically need a large
number of groups (>20). You still need linearity, homoskedasticity, and independence within groups after using cluster-robust SEs

Independence
Assumption: residuals are independent; violated when data is “groupy” (panel) AND you aren't using a fixed effects regression. Each error
must be independent of every other error. This fails under autocorrelation, e.g. in a time series.
- Identify it via a pattern in the residual plot
- Can correct by making the dependent variable the change in y from period to period, or by clustering. In Stata: add “, cluster(GroupVariableName)” at the
end of your regression command
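A sketch of these Stata options in Python/statsmodels (sales, shoppers and store are made-up names and data, and only two clusters are used for brevity; in practice you want many more groups, as noted above). The coefficients are identical; only the standard errors change:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "sales":    [200, 220, 250, 260, 150, 160, 180, 190],
    "shoppers": [50, 55, 60, 65, 30, 33, 40, 42],
    "store":    [1, 1, 1, 1, 2, 2, 2, 2],
})
model = smf.ols("sales ~ shoppers", data=df)
robust = model.fit(cov_type="HC1")                        # like regress ..., vce(robust)
clustered = model.fit(cov_type="cluster",
                      cov_kwds={"groups": df["store"]})   # like regress ..., cluster(store)
print(robust.bse)
print(clustered.bse)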

Omitted Variable Bias (OVB)


OVB occurs when the coefficient of an x-variable is interpreted incorrectly because the regression leaves out an important factor that is related to both
the X1 variable and the Y variable (therefore, the coefficient captures not only the direct impact of X1, but also the impact of X1 on Y through X2)
Consider these regressions:
- Y = B0* +B1*(X1 var)
- Y = B0 + B1(X1 var) + B2(X2 var)
- X2 var = A0 + A1*(X1 var)
OVB = B1* - B1 = A1*B2
OVB = (relationship b/w X1 and omitted variable)*(relationship b/w Y and omitted variable)
- If negative OVB, then B1* - B1 < 0 ; B1* < B1
- The coefficient on an X variable measures DIFFERENT things depending on what other variables are in a regression
- Business question may mean only include one variable (i.e. don’t care how a company reduces energy cost, just care that it does, so
would use just energy cost as the X variable)
Example:
scores = 30 + 8*study
scores = -70 + 12*study + 30*IQ
What can you conjecture about the relationship between IQ and the hours devoting to studying among your older peers?

We can say that there is a negative relationship (quantified more exactly below) between how long a student studied for the midterm and the
student’s IQ. Explanation: Comparing the two regressions, we see that the OVB on the coefficient of study from omitting IQ is equal to 8 – 12 = -4,
which is negative. This OVB is the product of (1) the relation between study time and IQ and (2) the relation between IQ and scores when study time
is held fixed. The latter is given by the 30 estimated coefficient on IQ in the second regression. Thus -4 = (?)*(30), so that the ? representing the
relation between study time and IQ must equal -4/30 ≈ -0.133, meaning that in this data an additional hour of study time is associated with an IQ
lower by 0.133 points. In other words, in our data, students with lower IQs tended to study more for the midterm.

Solutions to OVB:
- Find data for the omitted variables (however, there may be omitted factors we don't think of)
- Use fixed effects - requires panel data (see the Panel Data & Fixed Effects section below)
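A simulation of the OVB identity above (a sketch; the coefficients 0.5, 2, and 3 are arbitrary): omitting x2 biases the coefficient on x1 by roughly A1*B2.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)             # A1 = 0.5
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)   # B1 = 2, B2 = 3

short = sm.OLS(y, sm.add_constant(x1)).fit()                        # omits x2
long = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # includes x2
print(short.params[1] - long.params[1], 0.5 * 3)                    # both roughly 1.5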

Panel Data Fixed Effects Models


Cross-sectional data is data collected from one point in time across different subjects; panel data is data with multiple dimensions (often
many subjects) over many periods of time (e.g. multiple stock values over time)
- Cross-section studies are particularly vulnerable to OVB because the data come from a single point in time
- This can be corrected by using panel data and implementing a fixed effects regression. When have panel data, we can use
dummy variables (Fixed Effect) to eliminate OVB. Fixed effects models focus on this within-group variation
- Need to assume the effect of the X variable on the Y variable is the SAME across the different 'panels' (e.g. people, regions, stocks,
etc.); then you can create dummy variables for each of these categories/panels/groups. If you DO NOT want this assumption, you also
need to use a slope dummy
- We regress sales on bonus and include in the regression dummy variables for the regions (remembering to omit one region). Look for
groupiness.
- xi: regress sales bonus i.region i.year
- The fixed effects (the region dummies) soak up all between group variation. In doing so, they pick up the combined effects of all
factors that differ across groups but remain constant over time within each group.
- Regressions based on cross-section data are vulnerable to OVB
- Panel data includes both between-group and “within”-group variation. Between variation is the difference in the mean values of
variables between each group of data (e.g., each individual, firm, etc.)
- Within variation is the variation around the mean (over time) within each group
- FE regression focuses on the effects of within-group variation. FE regression eliminates OVB due to the effects of unobserved
between-group differences
- Some OVB may remain, but only if unobservables vary over time within groups, and this variation is correlated with the variation over
time of the variables in your regression
- When you create a fixed effect, the interpretation of the coefficients on your variables of interest is now “controlled for within one
group” (e.g. within one store, one stock, one person, etc.)
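A fixed-effects-by-dummies sketch in Python (region/year names and numbers are invented; C() builds the dummy variables and drops one category per group, like Stata's i.region i.year):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "sales":  [100, 120, 90, 110, 140, 160, 130, 150],
    "bonus":  [10, 15, 8, 12, 20, 25, 18, 22],
    "region": ["N", "N", "S", "S", "N", "N", "S", "S"],
    "year":   [2020, 2021, 2020, 2021, 2020, 2021, 2020, 2021],
})
fit = smf.ols("sales ~ bonus + C(region) + C(year)", data=df).fit()
print(fit.params)   # the bonus coefficient is now a within region/year effect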

Building Models
Convincing results have five features:
1. The coefficients are unbiased (or known biases do not pose a problem)
2. The model is parsimonious
3. The standard errors (and associated confidence and prediction intervals) are correctly estimated
4. The results are robust
5. When required, the model is well identified, meaning that we can sort out cause and effect

Problem 3: Simple re-greasin’


Run a regression of netprofits versus repairs and write out the regression equation below.

The shop is considering installing new “puncture resistant” tires on their Simple bikes. The new tires will cost an extra $45 per pair but will reduce the average number of visits in the first year by 2.5. Is this upgrade a good
idea for the bike shop? Justify your recommendation.

. klincom -2.5*repairs - 45

( 1) - 2.5*repairs = 45

------------------------------------------------------------------------------
netprofits | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 10.55826 5.146492 2.05 0.043 .3621264 20.75438
------------------------------------------------------------------------------

If Ha: < then Pr(T < t) = .979

If Ha: not = then Pr(|T| > |t|) = .043

If Ha: > then Pr(T > t) = .021


We should reject the null hypothesis. This upgrade is a good idea. We expect profits to go up by $10.56 per customer.

The non-profit organization is considering an aggressive new marketing campaign to encourage attendance at the farmer's market. They believe that the campaign will increase attendance by 50 people per week on average,
but they will have to increase rents for each seller. The Aloha Honey Co would have to pay an extra $150 per week to support the campaign. Should the company support this new campaign? Explain.

. klincom 50*shoppers -150
