Questions Stats and Trix

1. What is the difference between logistic and linear regression?

Linear regression is used for regression, i.e. to predict continuous values, whereas logistic regression can technically be framed as a regression on probabilities but is used almost exclusively as a classification algorithm. Regression models aim to project a value based on independent features.

The main distinction between the two is the type of dependent variable: when the dependent variable is binary, logistic regression is used, and when the dependent variable is continuous, linear regression is used.

Differences Between Linear And Logistic Regression

 Linear regression is used to predict a continuous dependent variable from a given set of independent features, whereas logistic regression is used to predict a categorical dependent variable.
 Linear regression is used to solve regression problems, whereas logistic regression is used to solve classification problems.
 In linear regression, the approach is to find the best-fit line to predict the output, whereas in logistic regression the approach is to fit an S-shaped (sigmoid) curve that separates the two classes, 0 and 1.
 Parameters in linear regression are estimated by least squares, whereas in logistic regression they are estimated by maximum likelihood.
 In linear regression the output should be continuous, like price or age, whereas in logistic regression the output must be categorical, like Yes/No or 0/1.
 Linear regression requires a linear relationship between the dependent and independent features, whereas logistic regression does not.
 Collinearity between independent features can be tolerated in linear regression, whereas it is not tolerated in logistic regression.

A minimal sketch contrasting the two models follows.
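A minimal sketch (an assumed example, not from the original notes) contrasting the two models on synthetic data; scikit-learn and the variable names are illustrative assumptions.

```python
# Hedged sketch: linear vs. logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))

# Continuous target -> linear regression (least-squares fit of a straight line)
y_cont = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_cont)
print("estimated slope:", lin.coef_[0])            # close to 3.0

# Binary target -> logistic regression (maximum-likelihood fit of an S-curve)
y_bin = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print("P(y=1 | x=1.0):", log.predict_proba([[1.0]])[0, 1])
```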
2. What do you mean by panel, cross-section, and time series data?

Panel data, also known as longitudinal data or (in some special cases) cross-sectional time series data, is derived from a (usually small) number of observations over time on a (usually large) number of cross-sectional units like individuals, households, firms, or governments.

In econometrics and statistics, panel data refers to multi-dimensional data that generally involves measurements over some period of time. A panel data set may follow a given sample of individuals over time, recording observations or information on each individual in the sample. In other words, panel data contains observations on different cross-sectional units across time.

(Cross-sectional data, by contrast, consists of observations on many units, such as individuals, firms, or countries, taken at a single point in time.)

Examples: GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances.

Time series data is a collection of quantities assembled over evenly spaced intervals in time and ordered chronologically. The time interval at which data is collected is generally referred to as the time series frequency.

Mean-reverting data returns, over time, to a time-invariant mean. It is important to know whether a model includes a non-zero mean because this is a prerequisite for determining appropriate testing and modeling methods. In short, time series data is data collected on the same variable at different points in time.

Examples: Gross Domestic Product (GDP), the Consumer Price Index (CPI), the S&P 500 Index, and unemployment rates across years.

A small sketch of these data layouts follows.
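A small illustrative sketch (assumed example, not from the notes) of the three data shapes using pandas; the countries and numbers are made up.

```python
# Hedged sketch: cross-sectional, time series, and panel data layouts in pandas.
import pandas as pd

# Cross section: many units observed at a single point in time
cross_section = pd.DataFrame({"country": ["IN", "US", "DE"], "gdp_2020": [2.7, 20.9, 3.9]})

# Time series: one unit observed at many points in time
time_series = pd.Series([2.6, 2.7, 2.9, 3.1],
                        index=pd.period_range("2018", periods=4, freq="Y"),
                        name="india_gdp")

# Panel: many units observed over many time points (MultiIndex of unit x time)
panel = pd.DataFrame({"gdp": [2.6, 2.7, 20.5, 20.9]},
                     index=pd.MultiIndex.from_product([["IN", "US"], [2019, 2020]],
                                                      names=["country", "year"]))
print(cross_section, time_series, panel, sep="\n\n")
```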

3. What is stationarity?
 Data points are often non-stationary, i.e. they have means, variances, and covariances that change over time.
 Non-stationary behaviors can be trends, cycles, random walks, or combinations of the three.
 Non-stationary data, as a rule, are unpredictable and cannot be modeled or forecasted directly. The results obtained by using non-stationary time series may be spurious, in that they may indicate a relationship between two variables where one does not exist.
 In order to obtain consistent, reliable results, non-stationary data need to be transformed into stationary data.
 In contrast to a non-stationary process, which has a variable variance and a mean that does not remain near (or return to) a long-run mean over time, a stationary process reverts around a constant long-term mean and has a constant variance independent of time.
 In the most intuitive sense, stationarity means that the statistical properties of the process generating a time series do not change over time. It does not mean that the series does not change over time, just that the way it changes does not itself change over time. The algebraic analogue is a linear function rather than a constant one: the value of a linear function changes as x grows, but the way it changes remains constant; it has a constant slope, one value that captures that rate of change.
 Examples of non-stationary processes are a random walk with or without a drift (a slow, steady change) and deterministic trends (trends that are constant, positive, or negative, independent of time, for the whole life of the series).

A common formal check for stationarity is sketched below.
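A hedged sketch (assumed example) of checking stationarity with the augmented Dickey-Fuller test from statsmodels; the simulated series are illustrative.

```python
# The ADF null hypothesis is that the series has a unit root, i.e. is non-stationary.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary
white_noise = rng.normal(size=500)              # stationary

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
    # A small p-value (< 0.05) rejects the unit-root null -> treat as stationary.
```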

4. Explain standard deviation in layman terms.


 Standard deviation is a number used to tell how measurements for a group are
spread out from the average (mean or expected value).
 A low standard deviation means that most of the numbers are close to the
average, while a high standard deviation means that the numbers are more
spread out.
 Standard deviation is a descriptive statistic that is used to understand the distribution of a dataset. It is often reported in combination with the mean (or average), giving context to that statistic. Specifically, the standard deviation describes how much scores in a dataset tend to spread out from the mean.
 A small standard deviation (relative to the mean score) indicates that the majority of individuals (or data points) tend to have scores very close to the mean. In this case, the data points look clustered around the mean score, with only a few scores farther away from the mean (probably outliers).

1. How to treat missing values?

 The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights.
 If no patterns are identified, then the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. A default value can also be assigned, which can be the mean, minimum, or maximum value; getting into the data first is important.
 If it is a categorical variable, a default value is assigned. If the data follow a known distribution, for example a normal distribution, impute with the mean. If 80% of the values for a variable are missing, it is usually better to drop the variable instead of treating the missing values.
 For categorical variables we can also use the most frequent value (the mode), or a constant value such as zero.

A small imputation sketch is shown below.
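A minimal sketch (assumed example) of mean and most-frequent imputation with pandas and scikit-learn's SimpleImputer; the toy data frame is made up.

```python
# Hedged sketch: simple imputation of numeric and categorical columns.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Numeric column: impute with the mean
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Categorical column: impute with the most frequent value (mode)
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df)
```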

2. Correlation matrix
 A correlation matrix is a table showing correlation coefficients
between variables.
 Each cell in the table shows the correlation between two variables.
 A correlation matrix is used to summarize data, as an input into a more
advanced analysis, and as a diagnostic for advanced analyses.
 As a diagnostic when checking other analyses: for example, with linear regression, high correlations among the independent variables (multicollinearity) suggest that the linear regression estimates will be unreliable. A short sketch of building one follows.
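A hedged sketch (assumed example) of computing a correlation matrix with pandas on synthetic data; the column names are illustrative.

```python
# Hedged sketch: pairwise Pearson correlation coefficients with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),   # strongly correlated with x
    "z": rng.normal(size=100),                      # roughly uncorrelated
})

print(df.corr())   # each cell is the correlation between a pair of variables
```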

3. Regression vs classification
 The most significant difference between regression vs classification is that while
regression helps predict a continuous quantity, classification predicts discrete
class labels.
 Let’s consider a dataset that contains student information of a particular
university. A regression algorithm can be used in this case to predict the height of
any student based on their weight, gender, diet, or subject major. We use
regression in this case because height is a continuous quantity. There is an
infinite number of possible values for a person’s height.  
 On the contrary, classification can be used to analyse whether an email is spam or not spam. The algorithm checks the keywords in an email and the sender's address to find out the probability of the email being spam. Similarly, while a regression model can be used to predict the temperature for the next day, we can use a classification algorithm to determine whether it will be cold or hot according to the given temperature values.
Parameter | Classification | Regression
Basic | The mapping function is used to map values to predefined classes. | The mapping function is used to map values to a continuous output.
Involves prediction of | Discrete values | Continuous values
Nature of the predicted data | Unordered | Ordered
Method of calculation | By measuring accuracy | By measuring root mean square error
Example algorithms | Decision tree, logistic regression, etc. | Regression tree (random forest), linear regression, etc.

4. Drawbacks of linear regression

Disadvantages of Linear Regression

1. Assumption of linearity: the main limitation of linear regression is the assumption of a linear relationship between the dependent variable and the independent variables. In the real world, a straight-line relationship rarely holds, so this assumption is often incorrect.
2. Prone to noise and overfitting: if the number of observations is smaller than the number of features, linear regression should not be used, as it may overfit by starting to fit the noise while building the model.
3. Prone to outliers: linear regression is very sensitive to outliers (anomalies), so outliers should be analyzed and, where appropriate, removed before applying linear regression to the dataset.
4. Prone to multicollinearity: before applying linear regression, multicollinearity should be removed (for example using dimensionality-reduction techniques), because the model assumes there is no relationship among the independent variables. Otherwise it becomes difficult to interpret which variable has a significant effect when the variables are highly or perfectly collinear.

A short multicollinearity check is sketched below.
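A hedged sketch (assumed example) of checking multicollinearity with variance inflation factors (VIF) from statsmodels before fitting a linear regression; the synthetic columns are illustrative.

```python
# Hedged sketch: VIF as a multicollinearity diagnostic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})
X_const = sm.add_constant(X)   # VIF is usually computed with an intercept included

for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X_const.values, i), 1))
# A VIF well above ~10 is a common rule of thumb for problematic multicollinearity.
```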

1. Degrees of freedom
 Degrees of Freedom refers to the maximum number of logically
independent values, which are values that have the freedom to
vary, in the data sample.

The easiest way to understand degrees of freedom conceptually is through an example:

 Consider a data sample consisting of, for the sake of simplicity, five
positive integers. The values could be any number with no known
relationship between them. This data sample would, theoretically, have
five degrees of freedom.
 Four of the numbers in the sample are {3, 8, 5, and 4} and the
average of the entire data sample is revealed to be 6.
 This must mean that the fifth number has to be 10. It can be
nothing else. It does not have the freedom to vary.
 So the Degrees of Freedom for this data sample is 4.

The formula for Degrees of Freedom equals the size of the data sample minus
one:

Df = N − 1
where Df = degrees of freedom and N = sample size.

 Degrees of freedom encompasses the notion that the amount of independent information you have limits the number of parameters that you can estimate.
 Typically, the degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis. It is usually a positive whole number.

Chi-Square Tests
There are two different kinds of Chi-Square tests: the test of independence,
which asks a question of relationship, such as, "Is there a relationship
between gender and SAT scores?"; and the goodness-of-fit test, which
asks something like "If a coin is tossed 100 times, will it come up heads
50 times and tails 50 times?"

For these tests, degrees of freedom are used to determine whether a certain null hypothesis can be rejected based on the total number of variables and samples in the experiment. For example, when considering students and course choice, a sample size of 30 or 40 students is likely not large enough to generate significant data. Getting the same or similar results from a study using a sample size of 400 or 500 students is more valid. (A small sketch of both chi-square tests is given below.)
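A hedged sketch (assumed example) of the two chi-square tests described above using scipy.stats; the counts are made up.

```python
# Hedged sketch: goodness-of-fit test and test of independence.
from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: a coin tossed 100 times lands heads 58 times.
# Null hypothesis: the coin is fair (expected 50/50).
stat, p = chisquare([58, 42], f_exp=[50, 50])
print("goodness-of-fit p-value:", round(p, 3))

# Test of independence: a 2x2 contingency table (e.g. gender vs. pass/fail).
table = [[30, 20],
         [25, 25]]
stat, p, dof, expected = chi2_contingency(table)
print("independence test p-value:", round(p, 3), "df:", dof)
```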

2.Interpretation of regression coefficients for log log, log level etc


WRITTEN IN THE NOTEBOOK
3. F test

F Test to Compare Two Variances

A statistical F test uses an F statistic to compare two variances, s1^2 and s2^2, by dividing them. The result is always a positive number (because variances are always positive). The equation for comparing two variances with the F test is:

F = s1^2 / s2^2

If the variances are equal, the ratio of the variances will equal 1. For example, if you had two data sets with sample 1 (variance of 10) and sample 2 (variance of 10), the ratio would be 10/10 = 1.

When running an F test, the null hypothesis is always that the population variances are equal; in other words, you assume the ratio of the variances equals 1. A short sketch of this test follows.
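A hedged sketch (assumed example) of a two-sided F test for equal variances; scipy has no single-call variance F test, so the ratio and p-value are computed directly from the F distribution.

```python
# Hedged sketch: F test comparing two sample variances.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
sample1 = rng.normal(scale=2.0, size=30)
sample2 = rng.normal(scale=1.0, size=40)

F = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
df1, df2 = len(sample1) - 1, len(sample2) - 1
p = 2 * min(f.cdf(F, df1, df2), f.sf(F, df1, df2))   # two-tailed p-value

print("F =", round(F, 2), "p-value =", round(p, 4))
```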

4. How to deal with outliers?


WRITTEN IN THE NOTEBOOK
5. What is a decision tree?
 A decision tree is a supervised machine learning algorithm mainly used for
Regression and Classification.
 It breaks down a data set into smaller and smaller subsets while at the
same time an associated decision tree is incrementally developed.
 The final result is a tree with decision nodes and leaf nodes.
 A decision tree can handle both categorical and numerical data.
 Decision Tree is considered to be one of the most useful Machine Learning
algorithms since it can be used to solve a variety of problems.
 Here are a few reasons why you should use a Decision Tree:

o It is considered to be the most understandable Machine Learning algorithm and it can be easily interpreted.
o It can be used for classification and regression problems.
o Unlike most Machine Learning algorithms, it works effectively with non-linear data.
o Constructing a Decision Tree is a very quick process since it uses only one feature per node to split the data.

A short sketch of fitting a decision tree follows.
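A hedged sketch (assumed example) of fitting and inspecting a small decision tree with scikit-learn on the built-in iris data set.

```python
# Hedged sketch: a shallow decision tree and its decision/leaf nodes as rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the tree as readable if/else splits, one feature per node.
print(export_text(tree, feature_names=load_iris().feature_names))
```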

- What is a Z test
 A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
 The test statistic is assumed to have a normal distribution, and nuisance parameters such as the standard deviation should be known in order for an accurate z-test to be performed.
 A z-statistic, or z-score, is a number representing how many standard deviations above or below the population mean a score derived from a z-test is.
 Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size.
o Also, t-tests assume the standard deviation is unknown, while z-tests assume it is known. If the standard deviation of the population is unknown (but the sample is large), the sample variance is assumed to equal the population variance.

One-Sample Z-Test Example


Assume an investor wishes to test whether the average daily return of a stock is greater than 1%. A simple random sample of 50 returns is calculated and has an average of 2%. Assume the standard deviation of the returns is 2.5%. Therefore, the null hypothesis is that the average, or mean, return is equal to 1%.

Conversely, the alternative hypothesis is that the mean return is different from 1%. Assume an alpha of 0.05 is selected with a two-tailed test. Consequently, there is 0.025 of the distribution in each tail, and the critical values are 1.96 and -1.96. If the value of z is greater than 1.96 or less than -1.96, the null hypothesis is rejected.

The value of z is calculated by subtracting the average daily return under the null hypothesis, or 1% in this case, from the observed average of the samples. Next, divide the resulting value by the standard deviation divided by the square root of the number of observed values. The test statistic is therefore z = (0.02 - 0.01) / (0.025 / sqrt(50)) = 2.83. The investor rejects the null hypothesis since z is greater than 1.96 and concludes that the average daily return is greater than 1%. This computation is reproduced in the short sketch below.
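A hedged sketch reproducing the worked z-test above, using the same numbers as the text.

```python
# Hedged sketch: one-sample z-test statistic and two-tailed p-value.
from math import sqrt
from scipy.stats import norm

x_bar, mu0, sigma, n = 0.02, 0.01, 0.025, 50
z = (x_bar - mu0) / (sigma / sqrt(n))
p_two_tailed = 2 * norm.sf(abs(z))

print(round(z, 2))             # ~2.83
print(round(p_two_tailed, 4))  # well below 0.05 -> reject the null hypothesis
```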

- What Does Anova Table Tell You

 The degrees of freedom associated with SSR will always be 1 for the simple linear regression model. The degrees of freedom associated with SSTO is n − 1 = 49 − 1 = 48. The degrees of freedom associated with SSE is n − 2 = 49 − 2 = 47. And the degrees of freedom add up: 1 + 47 = 48.
 The sums of squares add up: SSTO = SSR + SSE. That is, here: 53637 = 36464 + 17173.

 We already know that the "mean square error (MSE)" is defined as:

MSE = Σ(y_i − ŷ_i)² / (n − 2) = SSE / (n − 2)

 That is, we obtain the mean square error by dividing the error sum of squares by its associated degrees of freedom, n − 2. Similarly, we obtain the "regression mean square (MSR)" by dividing the regression sum of squares by its degrees of freedom, 1:

MSR = Σ(ŷ_i − ȳ)² / 1 = SSR / 1

 Of course, that means the regression sum of squares (SSR) and the regression mean square (MSR) are always identical for the simple linear regression model.

Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not.
Analysts use the ANOVA test to determine the influence that independent variables have on the dependent variable in a regression study.

F = MST / MSE
where: F = ANOVA coefficient, MST = mean sum of squares due to treatment, MSE = mean sum of squares due to error.
The ANOVA test allows a comparison of more than two groups at the same
time to determine whether a relationship exists between them. The ANOVA
test is the initial step in analyzing factors that affect a given data set.
There are two types of ANOVA: one-way (or unidirectional) and two-way.
One-way or two-way refers to the number of independent variables in
your analysis of variance test. A one-way ANOVA evaluates the impact
of a sole factor on a sole response variable. It determines whether all the
samples are the same. The one-way ANOVA is used to determine
whether there are any statistically significant differences between the
means of three or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For example, a two-way ANOVA allows a company to compare worker productivity based on two independent variables, such as salary and skill set. It is used to observe the interaction between the two factors and tests the effect of the two factors at the same time. (A one-way ANOVA sketch is given below.)

READ THE ECONOMETRICS TERM PAPER FOR ANOVA

https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/

https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php

https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_HypothesisTesting-ANOVA/BS704_HypothesisTesting-Anova3.html#:~:text=The%20ANOVA%20table%20breaks%20down,Source%20of%20Variation

https://online.stat.psu.edu/stat462/node/107/

2. Confidence interval
ANOTHER FILE STATS QUESTIONS

3. Overfitting
ANOTHER FILE STATS QUESTIONS

4. Difference between correlation and covariance


 Extent to which / how strongly they move together
 Direction of the linear relationship / strength and direction of the linear relationship
 Varies between minus infinity and positive infinity
 Affected by change of scale

Covariance | Correlation
Covariance is a measure indicating the extent to which two random variables change in tandem. | Correlation is a measure representing how strongly two random variables are related to each other.
Covariance is nothing but a measure of correlation. | Correlation refers to the scaled form of covariance.
Covariance indicates the direction of the linear relationship between the variables. | Correlation measures both the strength and the direction of the linear relationship between two variables.
Covariance can vary between -∞ and +∞. | Correlation ranges between -1 and +1.
Covariance is affected by a change in scale: if all the values of one variable are multiplied by a constant, and all the values of the other variable are multiplied by a similar or different constant, then the covariance changes. | Correlation is not influenced by a change in scale.
Covariance takes its units from the product of the units of the two variables. | Correlation is dimensionless, i.e. a unit-free measure of the relationship between variables.
Covariance of two dependent variables measures how much they co-vary on average in real quantities (e.g. cm, kg, liters). | Correlation of two dependent variables measures the proportion of how much, on average, these variables vary with respect to one another.
Covariance is zero in the case of independent variables (if one variable moves and the other doesn't), because then the variables do not necessarily move together. | Independent movements do not contribute to the total correlation; therefore, completely independent variables have zero correlation.
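A hedged sketch (assumed example) with numpy showing that rescaling a variable changes the covariance but not the correlation.

```python
# Hedged sketch: covariance vs. correlation under a change of scale.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)

print(np.cov(x, y)[0, 1])          # covariance, scale-dependent
print(np.corrcoef(x, y)[0, 1])     # correlation, always between -1 and +1

# Multiply x by 100: the covariance blows up, the correlation stays the same.
print(np.cov(100 * x, y)[0, 1])
print(np.corrcoef(100 * x, y)[0, 1])
```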

1. What is p-value
WRITTEN IN THE NOTEBOOK
How to calculate p-value
2. Power of a test
WRITTEN IN THE NOTEBOOK
3. Type 1 & type 2 errors
WRITTEN IN THE NOTEBOOK
4.Central Limit Theorem

 The central limit theorem states that if you have a population with mean μ and
standard deviation σ and take sufficiently large random samples from the
population with replacement , then the distribution of the sample means will be
approximately normally distributed.

 as you take more samples, especially large ones, your graph of the sample
means will look more like a normal distribution.

 This will hold true regardless of whether the source population is normal or
skewed, provided the sample size is sufficiently large (usually n > 30)

 CLT is a statistical theory stating that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
 Furthermore, all the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population divided by each sample's size.

 A key aspect of CLT is that the average of the sample means and standard
deviations will equal the population mean and standard deviation.
Why is central limit theorem important?
 The central limit theorem tells us that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size (N) increases.
 This is useful, as the researcher never knows which mean in the sampling distribution is the same as the population mean, but by selecting many random samples from a population the sample means will cluster together, allowing the researcher to make a very good estimate of the population mean.
 Thus, as the sample size (N) increases, the sampling error will decrease.
 An essential component of the Central Limit Theorem is that the average of your sample means will be the population mean. In other words, add up the means from all of your samples, find the average, and that average will be your actual population mean. Similarly, if you find the average of all of the standard deviations in your samples, you'll find the actual standard deviation for your population. It's a pretty useful phenomenon that can help accurately predict characteristics of a population.

A small simulation of this is sketched below.
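A hedged sketch (assumed example) simulating the CLT by drawing repeated samples from a heavily skewed (exponential) population and looking at the distribution of sample means.

```python
# Hedged sketch: sample means from a skewed population behave normally.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, mean = 2.0

sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

print("population mean:", round(population.mean(), 3))
print("mean of sample means:", round(np.mean(sample_means), 3))   # ~ population mean
print("std of sample means:", round(np.std(sample_means), 3))     # ~ sigma / sqrt(50)
print("sigma / sqrt(n):", round(population.std() / np.sqrt(50), 3))
```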

1. When do we use a t-test?

 A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features.
 It is mostly used when the data sets, like the data set recorded as the outcome from flipping a coin 100 times, would follow a normal distribution and may have unknown variances.
 Essentially, a t-test allows us to compare the average values of the two data sets and determine if they came from the same population.
 **Remember: the z-test is also used to test the difference between the means of two groups, but it is used when the population variance is known.
 Here, since the population variance is unknown, we use the sample variance in the formula.

A short two-sample t-test sketch follows.
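A hedged sketch (assumed example) of a two-sample t-test with scipy, relying on the sample variances because the population variances are unknown.

```python
# Hedged sketch: comparing the means of two groups with a t-test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)
group_a = rng.normal(loc=100, scale=15, size=25)
group_b = rng.normal(loc=110, scale=15, size=25)

t_stat, p_value = ttest_ind(group_a, group_b)   # assumes equal variances
print("t =", round(t_stat, 2), "p-value =", round(p_value, 4))
# Use ttest_ind(..., equal_var=False) for Welch's t-test when the variances differ.
```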
1. What do you understand by Gauss
2. What is multicollinearity
3. What is heteroscedasticity
4. What is alpha?
WRITTEN IN THE NOTEBOOK

5. What is hypothesis testing?

 A hypothesis is basically an educated guess about something in the world around us. It should be testable, either by an experiment or an observation.
 Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.
 Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. Such data may come from a larger population or from a data-generating process. The word "population" will be used for both of these cases in the following descriptions.
https://stattrek.com/hypothesis-test/hypothesis-testing.aspx

Difference between standard deviation and standard error


Standard Error

 There will be, of course, different means for different samples (from the same population); this is called the "sampling distribution of the mean".
 The variation between the means of different samples can be estimated by the standard deviation of this sampling distribution, and that is the standard error of the estimate of the mean. This is where everybody gets confused: the standard error is a type of standard deviation, namely for the distribution of the sample means.
 Standard error measures the precision of the estimate of the sample mean, where sigma is the population standard deviation and n is the sample size:
 Standard error = sigma / sqrt(n)
 The standard error depends strictly on the sample size, and thus the standard error falls as the sample size increases.
 This makes sense if you think about it: the bigger the sample, the closer the sample mean is to the population mean, and thus the estimate is closer to the actual value.

Basis for comparison | Standard deviation | Standard error
Meaning | A measure of dispersion of the set of values from their mean. | A measure of the statistical exactness of an estimate.
Statistic | Descriptive | Inferential
Measures | How much the observations vary from each other. | How precise the sample mean is relative to the true population mean.
Distribution | Distribution of the observations about the normal curve. | Distribution of the estimate about the normal curve.
Formula | Square root of the variance. | Standard deviation divided by the square root of the sample size.
Increase in sample size | Gives a more specific measure of the standard deviation. | Decreases the standard error.

5. Sampling bias

What is linear regression? How is regression different from correlation

Correlation vs causation vs regression

Correlation does not imply causation.

 Correlation measures the degree of relationship between two variables. Regression analysis is about how one variable affects another, or what changes it triggers in the other.
 Correlation doesn't capture causality, only the degree of interrelation between the two variables. Regression is based on causality: it shows not a degree of connection, but cause and effect.
 A property of correlation is that the correlation between x and y is the same as between y and x.
 Regressions of y on x and of x on y yield different results. Think about income and education: predicting income based on education makes sense, but the opposite does not.

Spurious correlation
A spurious correlation occurs when two variables are statistically related but not directly causally related. The two variables falsely appear to be related to each other, normally due to an unseen, third factor.
A classic example compares the number of driver deaths in railway collisions each year with the annual US imports of Norwegian crude oil. There is a strong correlation in the data, with a correlation statistic of 0.95, yet it is spurious because there is no reason to believe that railway deaths cause oil imports, or vice versa.

Spurious regression
A spurious regression is a regression that provides misleading statistical evidence of a linear relationship between independent non-stationary variables. Example: rainfall and stock prices.

When do you use Chebyshev's inequality

Chebyshev's inequality is a probabilistic inequality. It provides an upper bound on the probability that the absolute deviation of a random variable from its mean will exceed a given threshold.

Proposition: let X be a random variable having finite mean mu and finite variance sigma^2, and let k > 0 (i.e., k is a strictly positive real number). Then the following inequality, called Chebyshev's inequality, holds:

P(|X - mu| >= k) <= sigma^2 / k^2
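A hedged sketch (assumed example) empirically checking Chebyshev's bound on simulated data; any finite-variance distribution works.

```python
# Hedged sketch: empirical tail probability vs. the Chebyshev upper bound.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)
mu, sigma = x.mean(), x.std()

k = 2 * sigma
empirical = np.mean(np.abs(x - mu) >= k)
bound = sigma**2 / k**2

print("empirical tail probability:", round(empirical, 4))
print("Chebyshev upper bound:     ", round(bound, 4))   # bound >= empirical
```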

What are different types of sampling

1. Simple random sampling


 In a simple random sample, every member of the population has an equal
chance of being selected. Your sampling frame should include the whole
population.

 To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.
 Example
You want to select a simple random sample of 100 employees of Company X.
You assign a number to every employee in the company database from 1 to
1000, and use a random number generator to select 100 numbers.

2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly
easier to conduct. Every member of the population is listed with a number, but
instead of randomly generating numbers, individuals are chosen at regular
intervals.

Example
All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on),
and you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden pattern
in the list that might skew the sample. For example, if the HR database groups
employees by team, and team members are listed in order of seniority, there is a risk
that your interval might skip over people in junior roles, resulting in a sample that is
skewed towards senior employees.

3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called strata)
based on the relevant characteristic (e.g. gender, age range, income bracket, job
role).

Based on the overall proportions of the population, you calculate how many people
should be sampled from each subgroup. Then you use random
or systematic sampling to select a sample from each subgroup.

Example
The company has 800 female employees and 200 male employees. You want to
ensure that the sample reflects the gender balance of the company, so you sort the
population into two strata based on gender. Then you use random sampling on each
group, selecting 80 women and 20 men, which gives you a representative sample of
100 people.

4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but
each subgroup should have similar characteristics to the whole sample. Instead of
sampling individuals from each subgroup, you randomly select entire subgroups.
Example
The company has offices in 10 cities across the country (all with roughly the same
number of employees in similar roles). You don’t have the capacity to travel to every
office to collect your data, so you use random sampling to select 3 offices – these
are your clusters

What is sampling bias?

Sampling bias occurs when some members of a population are systematically more likely to be selected in a sample than others. It is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample, a non-random sample of a population in which all individuals, or instances, were not equally likely to have been selected.

Let us consider a specific example: we might want to predict the outcome of a presidential election by means of an opinion poll. Asking 1000 voters about their voting intentions can give a pretty accurate prediction of the likely winner, but only if our sample of 1000 voters is 'representative' of the electorate as a whole (i.e. unbiased). If we only poll the opinion of 1000 white middle-class college students, then the views of many important parts of the electorate as a whole (ethnic minorities, elderly people, blue-collar workers) are likely to be underrepresented in the sample, and our ability to predict the outcome of the election from that sample is reduced.

Difference between R^2 and adjusted R^2


 Every time you add an independent variable to a model, R-squared increases even if the independent variable is insignificant; it never declines. Adjusted R-squared, by contrast, increases only when the added independent variable is significant and affects the dependent variable.
 For example, adjusted R-squared may peak when two variables are included and then decline when a third is added, while R-squared keeps increasing when the third variable is included; this indicates that the third variable is insignificant to the model.

R-Squared vs. Adjusted R-Squared


 Adjusted r-squared can be negative when r-squared is close to zero.
 Adjusted r-squared value always be less than or equal to r-squared value.
Adjusted R squared
 Adding more independent variables or predictors to a regression model
tends to increase the R-squared value, which tempts makers of the
model to add even more. This is called overfitting and can return an
unwarranted high R-squared value. Adjusted R-squared is used to
determine how reliable the correlation is and how much is determined
by the addition of independent variables.
 In a portfolio model that has more independent variables, adjusted R-
squared will help determine how much of the correlation with the index
is due to the addition of those variables. The adjusted R-squared
compensates for the addition of variables and only increases if the new
predictor enhances the model above what would be obtained by
probability. Conversely, it will decrease when a predictor improves the
model less than what is predicted by chance.

In one such comparison, adding a random independent variable did not help in explaining the variation in the target variable: the R-squared value remained essentially the same, giving a false indication that this variable might be helpful in predicting the output, while the adjusted R-squared value decreased, indicating that the new variable is actually not capturing the trend in the target variable. (The sketch below reproduces this behaviour.)
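A hedged sketch (assumed example) comparing R-squared and adjusted R-squared when a pure-noise feature is added to a linear regression; the adjustment formula is the standard 1 - (1 - R²)(n - 1)/(n - k - 1).

```python
# Hedged sketch: R^2 vs. adjusted R^2 with an irrelevant predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, k):
    # k = number of predictors, n = number of observations
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))                 # irrelevant predictor

for X, k in [(x, 1), (np.hstack([x, noise]), 2)]:
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(k, "predictor(s): R2 =", round(r2, 4),
          "adjusted R2 =", round(adjusted_r2(r2, n, k), 4))
# R2 creeps up with the noise feature; adjusted R2 stays flat or drops.
```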

Lognormal
A lognormal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X has a lognormal distribution, then Y = ln(X) has a normal distribution.
Normal distributions may present a few problems that log-normal distributions can solve. Mainly, normal distributions allow for negative random variables, while log-normal distributions contain only positive values.

One of the most common applications where log-normal distributions are used
in finance is in the analysis of stock prices. The potential returns of a stock
can be graphed in a normal distribution. The prices of the stock however can
be graphed in a log-normal distribution. The log-normal distribution curve can
therefore be used to help better identify the compound return that the stock
can expect to achieve over a period of time.

Note that log-normal distributions are positively skewed with long right tails due to low mean values and high variances in the random variables.
Selection bias

Bayes Theorem
Bayes' theorem is a mathematical formula for determining conditional
probability. Conditional probability is the likelihood of an outcome occurring,
based on a previous outcome occurring. Bayes' theorem provides a way to
revise existing predictions or theories (update probabilities) given new or
additional evidence.
Bayes' theorem relies on incorporating prior probability distributions in order to
generate posterior probabilities. Prior probability, in Bayesian statistical
inference, is the probability of an event before new data is collected. Posterior
probability is the revised probability of an event occurring after taking into
consideration new information. Posterior probability is calculated by updating
the prior probability by using Bayes' theorem. In statistical terms, the posterior
probability is the probability of event A occurring given that event B has
occurred.
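A hedged sketch (assumed example) of Bayes' theorem for a diagnostic test, updating a prior probability of disease with a positive test result; the probabilities are made up.

```python
# Hedged sketch: posterior = P(positive | disease) * P(disease) / P(positive).
prior = 0.01                 # P(disease), the prior probability
sensitivity = 0.95           # P(positive | disease)
false_positive = 0.05        # P(positive | no disease)

# P(positive) by the law of total probability
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior probability of disease given a positive test
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))   # ~0.161, despite the seemingly accurate test
```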

Mean, Median, Mode (which will be a better measure type)

However, in a skewed distribution the mean can miss the mark and fall outside the central area. This problem occurs because outliers have a substantial impact on the mean: extreme values in an extended tail pull the mean away from the center. As the distribution becomes more skewed, the mean is drawn further away from the center. Consequently, it is best to use the mean as a measure of central tendency when you have a symmetric distribution.

When to use the mean: Symmetric distribution, Continuous data

Outliers and skewed data have a smaller effect on the median. To understand why, imagine a dataset whose median is 46. Suppose we then discover data entry errors and need to change four values, making them all significantly higher so that we now have a skewed distribution with large outliers.
As you can see, the median doesn’t change at all. It is still 46. Unlike the mean, the
median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course,
with other types of changes, the median can change. When you have a skewed
distribution, the median is a better measure of central tendency than the mean.
In a skewed distribution, the outliers in the tail pull the mean away from the center
towards the longer tail. 
When you have a symmetrical distribution for continuous data, the mean, median,
and mode are equal. In this case, analysts tend to use the mean because it includes
all of the data in the calculations. However, if you have a skewed distribution, the
median is often the best measure of central tendency.

When you have ordinal data, the median or mode is usually the best choice. For
categorical data, you have to use the mode.

Omitted Variable Bias in Regression with a Single Regressor

Omitted variable bias is the bias in the OLS estimator that arises when the regressor, X, is correlated with an omitted variable. For omitted variable bias to occur, two conditions must be fulfilled:

1. X is correlated with the omitted variable.
2. The omitted variable is a determinant of the dependent variable Y.

Together, 1. and 2. result in a violation of the first OLS assumption E(u_i | X_i) = 0. Formally, the resulting bias can be expressed as

beta_1_hat  ->p  beta_1 + rho_Xu * (sigma_u / sigma_X)        (6.1)

See Appendix 6.1 of the book for a detailed derivation. Equation (6.1) states that OVB is a problem that cannot be solved by increasing the number of observations used to estimate beta_1, as beta_1_hat is inconsistent: OVB prevents the estimator from converging in probability to the true parameter value. The strength and direction of the bias are determined by rho_Xu, the correlation between the error term and the regressor.

What Are Confounding Variables?


 In statistics, a confounder is a variable that influences both the dependent variable and the independent variable.
 For example, if you are researching whether a lack of exercise leads to weight gain, then lack of exercise is the independent variable and weight gain is the dependent variable. A confounding variable here would be any other variable that affects both of these, such as the age of the subject.

Autocovariance
Linear dependence between two points on the same series observed at different times.

Condition for weakly stationary time series

A) The mean is constant and does not depend on time.
B) The autocovariance function depends on s and t only through their difference |s − t| (where t and s are moments in time).
C) The time series under consideration is a finite-variance process.

Bias and Variance


We have been given the data – divide into train and test. LR fits a straight to the train
set. Sometimes it cannot really capture the true relationship. The inability for a
machine learning method (like linear regression) to capture the true relationship is
called Bias. Another ML method might fit a squiggly line to our train data. The
squiggly line is super flexible and hugs the arc of the true relationship. Since it can
handle the arc in the true relationship very well, it has very little bias.
Now for test data, we suppose that the straight-line fits very well whereas the
squiggly line doesn’t do a great job. The difference in fits between the datasets is
called as variance.
Bias is the difference between the average prediction of our model and the correct
value which we are trying to predict. Model with high bias pays very little attention
to the training data and oversimplifies the model. It always leads to high error on
training and test data.
Variance is the variability of model prediction for a given data point or a value
which tells us spread of our data. Model with high variance pays a lot of attention
to training data and does not generalize on the data which it hasn’t seen before. As
a result, such models perform very well on training data but has high error rates on
test data.
Its hard to predict how well the squiggly line will perform in the future datasets.
Straight line has high bias. But has low variance since sum of sqaures are very similar
for different datasets.

Ideal ML algorithm should have low bias and low variance by producing consistent
predictions across different datasets.

Y = f(X) + e
We build a model f̂(X) of f(X) using linear regression or any other modeling technique. The expected squared error at a point x is:
Expected squared error = Bias^2 + Variance + Irreducible error
Irreducible error is the error that can't be reduced by creating good models. It is a measure of the amount of noise in our data. Here it is important to understand that no matter how good we make our model, our data will have a certain amount of noise, or irreducible error, that cannot be removed.

Underfitting happens when a model is unable to capture the underlying pattern of the data. Such models generally have high bias and low variance. It generally happens when there is too little data to build an accurate model, or when we try to fit a linear model to non-linear data.
Overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model a lot over a noisy dataset. Such models have low bias and high variance. Very complex models, like deep decision trees, are prone to overfitting.

 If our model is too simple and has very few parameters then it may have high
bias and low variance.
 On the other hand, if our model has a large number of parameters then it's going to have high variance and low bias.
 So we need to find the right/good balance without overfitting and underfitting
the data.
 This tradeoff in complexity is why there is a tradeoff between bias and variance.
An algorithm can’t be more complex and less complex at the same time.
 To build a good model, we need to find a good balance between bias and
variance such that it minimizes the total error.
 An optimal balance of bias and variance would never overfit or underfit the
model.

https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Normal Distribution
Normal distribution can be checked for by plotting a histogram. If the distribution is
not normal then we transform the data to normal using the following techniques
 Z score
 Box cox transformation
o A Box-Cox transformation is a transformation of non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation means that you are able to run a broader number of tests.
o At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the "optimal value" is the one which results in the best approximation of a normal distribution curve. The transformation of Y has the form:
Y(λ) = (Y^λ − 1) / λ for λ ≠ 0, and Y(λ) = log(Y) for λ = 0.
This transformation only works for positive data; however, Box and Cox did propose a second, two-parameter formula that can be used for negative y-values.
o The formulae are deceptively simple, but testing all possible λ values by hand is unnecessarily labor intensive; most software packages include an option for a Box-Cox transformation.
 Checking for normality
o There are various methods available to test the normality of the
continuous data, out of them, most popular methods are Shapiro–Wilk
test, Kolmogorov–Smirnov test, skewness, kurtosis, histogram, box plot, P–
P Plot, Q–Q Plot, and mean with SD.
o The Shapiro–Wilk test is the more appropriate method for small sample sizes (<50 samples), although it can also handle larger sample sizes, while the Kolmogorov–Smirnov test is used for n ≥ 50. For both of the above tests, the null hypothesis states that the data are taken from a normally distributed population. When P > 0.05, the null hypothesis is not rejected and the data are treated as normally distributed.
o A histogram is an estimate of the probability distribution of a continuous
variable. If the graph is approximately bell-shaped and symmetric about
the mean, we can assume normally distributed data.
o In statistics, a Q–Q plot is a scatterplot created by plotting two sets of quantiles (observed and expected) against one another. For normally distributed data, the observed quantiles approximate the expected ones, that is, they are statistically equal.
o A P–P plot (probability–probability plot or percent–percent plot) is a graphical technique for assessing how closely two data sets (observed and expected) agree. It forms an approximate straight line when data are normally distributed; departures from this straight line indicate departures from normality.
o A box plot is another way to assess the normality of the data. It shows the median as a horizontal line inside the box and the IQR (range between the first and third quartile) as the length of the box. The whiskers (lines extending from the top and bottom of the box) represent the minimum and maximum values when they are within 1.5 times the IQR from either end of the box (i.e., Q1 − 1.5*IQR and Q3 + 1.5*IQR). Scores more than 1.5 times and 3 times the IQR beyond the box fall outside the box plot and are considered outliers and extreme outliers, respectively. A box plot that is symmetric, with the median line at approximately the center of the box and with symmetric whiskers, indicates that the data may have come from a normal distribution. If many outliers are present in the data set, either the outliers need to be removed or the data should be treated as non-normally distributed. (A short sketch of the formal tests follows.)
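A hedged sketch (assumed example) of a normality test and a Box-Cox transformation with scipy.stats on simulated skewed data.

```python
# Hedged sketch: Shapiro-Wilk test before and after a Box-Cox transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
skewed = rng.lognormal(mean=0.0, sigma=0.7, size=200)   # positive, right-skewed

# Shapiro-Wilk: null hypothesis = data come from a normal distribution
stat, p = stats.shapiro(skewed)
print("Shapiro-Wilk p-value (raw):", round(p, 4))        # small -> not normal

# Box-Cox transform (requires strictly positive data); lambda chosen automatically
transformed, best_lambda = stats.boxcox(skewed)
stat, p = stats.shapiro(transformed)
print("optimal lambda:", round(best_lambda, 2))
print("Shapiro-Wilk p-value (transformed):", round(p, 4))  # larger -> closer to normal
```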
Dummy Variables
 Numeric variable that represents categorical data, such as gender, race,
political affiliation, etc.
 The number of dummy variables required to represent a particular
categorical variable depends on the number of values that the categorical
variable can assume. To represent a categorical variable that can
assume k different values, a researcher would need to define k - 1 dummy
variables.
 For example, suppose we are interested in political affiliation, a categorical
variable that might assume three values - Republican, Democrat, or
Independent. We could represent political affiliation with two dummy
variables:

X1 = 1, if Republican; X1 = 0, otherwise.

X2 = 1, if Democrat; X2 = 0, otherwise.

 In this example, notice that we don't have to create a dummy variable to represent the "Independent" category of political affiliation. If X1 equals zero and X2 equals zero, we know the voter is neither Republican nor Democrat. Therefore, the voter must be Independent.
 Dummy variables are dichotomous, quantitative variables. Their range of
values is small; they can take on only two quantitative values. As a practical
matter, regression results are easiest to interpret when dummy variables are
limited to two specific values, 1 or 0. Typically, 1 represents the presence of a
qualitative attribute, and 0 represents the absence.

Dummy Variable Trap

 When defining dummy variables, a common mistake is to define too many variables. If a categorical variable can take on k values, it is tempting to define k dummy variables. Resist this urge. Remember, you only need k − 1 dummy variables.
 A kth dummy variable is redundant; it carries no new information. And it creates a severe multicollinearity problem for the analysis. Using k dummy variables when only k − 1 dummy variables are required is known as the dummy variable trap. Avoid this trap!
Once a categorical variable has been recoded as a dummy variable, the dummy variable can be used
in regression analysis just like any other quantitative variable.

For example, suppose we wanted to assess the relationship between household income and political
affiliation (i.e., Republican, Democrat, or Independent). The regression equation might be:

Income = b0 + b1X1+ b2X2

where b0, b1, and b2 are regression coefficients, and X1 and X2 are dummy variables defined as:

 X1 = 1, if Republican; X1 = 0, otherwise.
 X2 = 1, if Democrat; X2 = 0, otherwise.

The value of the categorical variable that is not represented explicitly by a dummy variable is called
the reference group. In this example, the reference group consists of Independent voters.

In the analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the group represented by that dummy variable than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income discrepancy with the reference group is also statistically significant. (A short encoding sketch follows.)
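A hedged sketch (assumed example) of encoding political affiliation with k − 1 dummy variables using pandas, which avoids the dummy variable trap; the data frame is made up.

```python
# Hedged sketch: k - 1 dummies with drop_first, the dropped level is the reference group.
import pandas as pd

df = pd.DataFrame({
    "affiliation": ["Republican", "Democrat", "Independent", "Democrat"],
    "income": [55, 48, 51, 62],
})

dummies = pd.get_dummies(df["affiliation"], prefix="aff", drop_first=True)
print(pd.concat([df, dummies], axis=1))
```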

Single Independent Dummy variable interpretation

The intercept for males is b0, and the intercept for females is b0 + d0. Because there are just two
groups, we only need two different intercepts.

Using two dummy variables would introduce perfect collinearity because female + male = 1, which
means that male is a perfect linear function of female. Including dummy variables for both
genders is the simplest example of the so-called dummy variable trap, which arises when too
many dummy variables describe a given number of groups.

In the above example the base group is males or benchmark group, that is, the group against which
comparisons are made. This is why b0 is the intercept for males, and d0 is the difference in
intercepts between females and males.
If we take a woman and a man with the same levels of education, experience, and tenure, the
woman earns, on average, $1.81 less per hour than the man.

The intercept is the average wage for men in the sample (let female = 0), so men earn $7.10 per
hour on average. The coefficient on female is the difference in the average wage between women
and men. Thus, the average wage for women in the sample is 7.10 - 2.51 = 4.59, or $4.59 per hour

Base group is single men.


we must remember that the base group is single males. Thus, the estimates on the three dummy
variables measure the proportionate difference in wage relative to single males. For example,
married men are estimated to earn about 21.3% more than single men, holding levels of education,
experience, and tenure fixed.

Because the overall intercept is common to all groups, we can ignore that in finding differences.
Thus, the estimated proportionate difference between single and married women is

-0.110 - (- 0.198) = .088 which means that single women earn about 8.8% more than married
women.

Interaction dummy variables

We can recast that model by adding an interaction term between female and married to the model where female and married appear separately. This allows the marriage premium to depend on gender, just as it did in equation (7.11). For purposes of comparison, the estimated model with the female·married interaction term is given in equation (7.14).
Setting female = 0 and married = 0 corresponds to the group of single men, which is the base group, since this eliminates female, married, and female·married. We can find the intercept for married men by setting female = 0 and married = 1 in (7.14); this gives an intercept of .321 + .213 = .534, and so on.

It is worth noticing that the estimated return to using a computer at work (but not at home) is about
17.7%. (The more precise estimate is 19.4%.) Similarly, people who use computers at home but not
at work have about a 7% wage premium over those who do not use a computer at all. The
differential between those who use a computer at both places, relative to those who use a computer
in neither place, is about 26.4% (obtained by adding all three coefficients and multiplying by 100), or
the more precise estimate 30.2% obtained from equation (7.10).

All of this is taken from Jeffrey Wooldridge.

Principal Component Analysis


 PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
 The idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.
 Step by step explanation of PCA:
 Standardization - standardize the range of the continuous initial variables so that
each one of them contributes equally to the analysis.
 if there are large differences between the ranges of initial variables, those variables
with larger ranges will dominate over those with small ranges (For example, a
variable that ranges between 0 and 100 will dominate over a variable that ranges
between 0 and 1), which will lead to biased results.
 this can be done by subtracting the mean and dividing by the standard deviation for
each value of each variable.
 Variance Covariance matrix - to understand how the variables of the input
data set are varying from the mean with respect to each other, or in other
words, to see if there is any relationship between them. Because
sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we
compute the covariance matrix.
 What do the covariances that we have as entries of the matrix tell us about
the correlations between the variables?
 It’s actually the sign of the covariance that matters:

 if positive then : the two variables increase or decrease together (correlated)


 if negative then : One increases when the other decreases (Inversely correlated)
 Compute the eigenvectors and eigenvalues
o These are computed from the covariance matrix in order to determine
the principal components of the data.
o Principal components are new variables that are constructed as
linear combinations or mixtures of the initial variables. These
combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information
within the initial variables is squeezed or compressed into the first
components. So, the idea is that 10-dimensional data gives you 10
principal components, but PCA tries to put the maximum possible
information in the first component, then the maximum remaining
information in the second, and so on.
o Organizing information in principal components this way allows
you to reduce dimensionality without losing much information, by
discarding the components with low information and keeping the
remaining components as your new variables.
 The relationship between variance and information here is that the larger
the variance carried by a line, the larger the dispersion of the data points
along it, and the larger the dispersion along a line, the more
information it carries.
 principal components are constructed in such a manner that the first
principal component accounts for the largest possible variance in the data
set.
 The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first
principal component and that it accounts for the next highest variance.
 This continues until a total of p principal components have been calculated,
equal to the original number of variables.
 for a 3-dimensional data set, there are 3 variables, therefore there are 3
eigenvectors with 3 corresponding eigenvalues.
 Without further ado, it is the eigenvectors and eigenvalues that are behind all
the magic explained above: the eigenvectors of the covariance
matrix are the directions of the axes where there is the most
variance (most information), and these are what we call the principal components. The
eigenvalues are simply the coefficients attached to the eigenvectors, which give
the amount of variance carried in each principal component.
 By ranking your eigenvectors in order of their eigenvalues, highest to
lowest, you get the principal components in order of significance (a minimal code sketch follows this list).
 https://builtin.com/data-science/step-step-explanation-principal-component-
analysis
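Here is a minimal NumPy sketch of the steps above (standardize, covariance matrix, eigendecomposition, ranking, projection). The random example data is purely illustrative; in practice sklearn.decomposition.PCA does the same job.

```python
import numpy as np

def pca(X, n_components=2):
    # 1. Standardization: subtract the mean and divide by the standard deviation
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized variables
    cov = np.cov(Z, rowvar=False)
    # 3. Eigenvectors and eigenvalues of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Rank components by eigenvalue, highest to lowest
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project the data onto the top components
    scores = Z @ eigvecs[:, :n_components]
    explained = eigvals / eigvals.sum()
    return scores, explained

# Illustrative random data: 100 observations, 10 variables
X = np.random.default_rng(0).normal(size=(100, 10))
scores, explained = pca(X, n_components=3)
print(explained[:3])  # share of total variance carried by the first three components
```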

What are the types of data? Explain how all these data types are
different from each other? What are the problems you face in these
data types?

 Knowing the data types is a crucial prerequisite for doing Exploratory Data Analysis (EDA), since you
can use certain statistical measurements only for specific data types.
 It also helps you choose the right visualization method.
 Categorical data represents characteristics. Therefore, it can represent
things like a person’s gender, language etc. Categorical data can also take on
numerical values (Example: 1 for female and 0 for male).
 Nominal values represent discrete units and are used to label variables that
have no quantitative value. Just think of them as "labels". Note that nominal
data has no order.
 Ordinal values represent discrete and ordered units. It is therefore nearly
the same as nominal data, except that its ordering matters.
 Numerical data
o Discrete data - if its values are distinct and separate. In other words:
We speak of discrete data if the data can only take on certain values.
o Continuous data - represents measurements and therefore their
values can’t be counted but they can be measured. An example would
be the height of a person, which you can describe by using intervals
on the real number line.
 Datatypes are an important concept because statistical methods can only be
used with certain data types. You have to analyze continuous data
differently than categorical data otherwise it would result in a wrong
analysis.
 You can use one-hot encoding to transform nominal data into a numeric
feature.
 You can use label encoding to transform ordinal data into a
numeric feature (see the sketch below).
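A minimal pandas sketch of both encodings; the column names and the category ordering here are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "language": ["english", "hindi", "french"],          # nominal: no order
    "education": ["highschool", "bachelors", "masters"], # ordinal: order matters
})

# One-hot encoding for the nominal column: one 0/1 indicator column per category
one_hot = pd.get_dummies(df["language"], prefix="language")

# Label encoding for the ordinal column: map categories to ordered integers
order = {"highschool": 0, "bachelors": 1, "masters": 2}
df["education_encoded"] = df["education"].map(order)

print(pd.concat([df, one_hot], axis=1))
```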

Endogeneity
In general, we say that a variable X is endogenous if it is correlated with the model error
term. Endogeneity always induces bias.

 Endogeneity in a scatter plot


 One variable with intercept model
 Draw the true regression line and the data
 The OLS regression line will pick up both the slope of Y in X and the slope of
the conditional mean of the error with respect to X.
 Correlated missing regressor – X2 is invisible
 More generally, some of what looks like the effect of x1 on y is really the model error term
moving with x1, and not a true relationship between x1 and y.
 Causes of endogeneity
o Correlated missing regressors
o Sample selection
o Reverse causality – if y also causes x
 Endogeneity is like pollution in X
o Including the missing regressors is like identifying the pollution exactly, so
that you can just use the part of X that is uncorrelated with that pollution.
o Alternatively, you could find a part of the variation in X that is
unpolluted by construction (this is the idea behind instrumental variables).
 Endogeneity yields biased and inconsistent beta estimates (as n tends to infinity, the OLS
beta hat does not converge to the true beta); a simulated illustration follows the links below.


 Testing a single suspect (endogenous) variable – t test
 Testing two or more suspect variables – F test
 https://www.stata.com/meeting/spain16/slides/pinzon-spain16.pdf
 http://www.sfu.ca/~pendakur/teaching/buec333/Multicollinearity%20and
%20Endogeneity.pdf
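Below is a minimal simulation (my own, not taken from the linked notes) of the "correlated missing regressor" case: when x2 is omitted it is folded into the error term, the error becomes correlated with x1, and the OLS slope on x1 is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x2 = rng.normal(size=n)               # the "invisible" regressor
x1 = 0.8 * x2 + rng.normal(size=n)    # x1 is correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true slope on x1 is 2.0

# Short regression: y on x1 only; x2 hides in the error term ("pollution" in X)
X_short = np.column_stack([np.ones(n), x1])
beta_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Long regression: including x2 identifies the pollution and removes the bias
X_long = np.column_stack([np.ones(n), x1, x2])
beta_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

print("slope on x1 with x2 omitted :", beta_short[1])  # well above 2.0
print("slope on x1 with x2 included:", beta_long[1])   # close to 2.0
```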

Instrumental Variables
 Instruments are variables, denoted Z, that are correlated with X, but
uncorrelated with the model error term by assumption or by construction.
 Cov(Z,e) = 0, so in the Ballentine (Venn diagram), Z and the error term have no overlap.
 But Z and X do overlap – that overlapping variation is what two-stage least squares uses (see the sketch below).
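Continuing the simulation above, here is a minimal hand-rolled two-stage least squares sketch (in practice a package such as linearmodels would normally be used). The instrument z and the confounder u are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

z = rng.normal(size=n)                    # instrument: moves x1, unrelated to the error
u = rng.normal(size=n)                    # unobserved confounder ("pollution")
x1 = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * u + rng.normal(size=n)   # true slope on x1 is 2.0

ones = np.ones(n)

# Stage 1: regress x1 on z and keep the fitted values -- the "unpolluted" part of x1
Z = np.column_stack([ones, z])
x1_hat = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Stage 2: regress y on the fitted values from stage 1
beta_iv = np.linalg.lstsq(np.column_stack([ones, x1_hat]), y, rcond=None)[0]
beta_ols = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0]

print("OLS slope (biased):", beta_ols[1])  # pulled away from 2.0 by the confounder
print("2SLS slope        :", beta_iv[1])   # close to 2.0
```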

When to use a log-log model?

Using natural logs for variables on both sides of your econometric specification is called a log-log
model. This model is handy when the relationship is nonlinear in parameters, because the log
transformation generates the desired linearity in parameters (you may recall that linearity in
parameters is one of the OLS assumptions).
In principle, any log transformation (natural or not) can be used to transform a model that's
nonlinear in parameters into a linear one.

A generic form of a constant elasticity model can be represented by

Y = alpha * X^beta

If you take the natural log of both sides, you end up with

ln(Y) = ln(alpha) + beta * ln(X)

You treat ln(alpha) as the intercept, say beta_0. You end up with the following model:

ln(Y) = beta_0 + beta * ln(X) + error

You can estimate this model with OLS by simply using natural log values for
the variables instead of their original scale. The slope beta is then an elasticity: a 1% change in X is
associated with approximately a beta% change in Y.
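A minimal sketch of estimating such a model with OLS on the logged variables; the file name and the columns (quantity, price) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical demand data with columns: quantity, price
df = pd.read_csv("demand.csv")

# Log-log specification: the slope on np.log(price) is the (constant) price elasticity
fit = smf.ols("np.log(quantity) ~ np.log(price)", data=df).fit()
print(fit.params)
```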

Point Biserial

A point-biserial correlation is used to measure the strength and direction of the association
that exists between one continuous variable and one dichotomous variable. It is a special
case of the Pearson’s product-moment correlation, which is applied when you have two
continuous variables, whereas in this case one of the variables is measured on a
dichotomous scale.

For example, you could use a point-biserial correlation to determine whether there is an
association between salaries, measured in US dollars, and gender (i.e., your continuous
variable would be "salary" and your dichotomous variable would be "gender", which has two
categories: "males" and "females"). Alternately, you could use a point-biserial correlation to
determine whether there is an association between cholesterol concentration, measured in
mmol/L, and smoking status (i.e., your continuous variable would be "cholesterol
concentration", a marker of heart disease, and your dichotomous variable would be
"smoking status", which has two categories: "smoker" and "non-smoker").

o Assumption #1: One of your two variables should be measured on a continuous scale.
Examples of continuous variables include revision time (measured in hours), intelligence
(measured using IQ score), exam performance (measured from 0 to 100), weight (measured
in kg), and so forth.
o Assumption #2: Your other variable should be dichotomous. Examples
of dichotomous variables include gender (two groups: male or female),
employment status (two groups: employed or unemployed), smoker (two groups: yes
or no), and so forth.
o Assumption #3: There should be no outliers for the continuous variable for each
category of the dichotomous variable. You can test for outliers using boxplots.
o Assumption #4: Your continuous variable should be approximately normally
distributed for each category of the dichotomous variable. You can test this using
the Shapiro-Wilk test of normality.
o Assumption #5: Your continuous variable should have equal variances for each
category of the dichotomous variable. You can test this using Levene's test of
equality of variances.
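A minimal SciPy sketch of the salary/gender example described above; the ten data points are invented purely for illustration.

```python
import numpy as np
from scipy import stats

gender = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])            # dichotomous: 0 = male, 1 = female
salary = np.array([52, 48, 61, 45, 50, 58, 63, 47, 49, 60])  # continuous: thousands of USD

r_pb, p_value = stats.pointbiserialr(gender, salary)
print(r_pb, p_value)

# Point-biserial is a special case of Pearson's r, so this returns the same numbers:
print(stats.pearsonr(gender, salary))
```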

Association between categorical variables


Null hypothesis: Assumes that there is no association between the two variables.

Alternative hypothesis: Assumes that there is an association between the two variables

Pearson chi-square test for categorical variables


The expected value for each cell is the total for its row multiplied by the total for
its column, then divided by the grand total for the table: that
is, (RowTotal*ColTotal)/GrandTotal. Thus, in the table below, the expected count
in cell (1,1) is (31*33)/54, or 18.94. Don't be afraid of decimals for your expected
counts; they're meant to be estimates!

             Pass         Fail        Total
Attended   25 (18.94)    6 (12.05)      31
Skipped     8 (14.05)   15 (8.94)       23
Total          33           21          54

Expected counts are shown in parentheses.

Chi-square formula

chi-square = sum over all cells i of (O_i - E_i)^2 / E_i

Actually, it's a fairly simple relationship. The variables in this formula are not
simply symbols, but actual concepts that we've been discussing all
along. O stands for the Observed frequency. E stands for the Expected frequency.
You subtract the expected count from the observed count to find the difference
between the two (also called the "residual"). You calculate the square of that
number to get rid of positive and negative values (because the squares of 5 and -5
are, of course, both 25). Then, you divide the result by the expected frequency to
normalize bigger and smaller counts (because we don't want a formula that will
give us a bigger Chi-square value just because you're working with a bigger set of
data). The huge sigma sitting in front of all that is asking for the sum of every i for
which you calculate this relationship - in other words, you calculate this for each
cell in the table, then add it all together. And that's it!

Using this formula on another simple 2x2 example (a gender/party table in which every expected
count is 25), the Chi-square value is ((20-25)^2/25) + ((30-25)^2/25) + ((30-25)^2/25) +
((20-25)^2/25), or (25/25) + (25/25) + (25/25) + (25/25), or 1 + 1 + 1 + 1, which comes out to 4.
You then compare the statistic against the chi-square distribution with (rows - 1) x (columns - 1)
degrees of freedom to obtain a p-value (see the sketch after the links below).
http://www.ce.memphis.edu/7012/L17_CategoricalVariableAssociation.pdf
https://www.ling.upenn.edu/~clight/chisquared.htm
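A minimal SciPy sketch using the Attended/Skipped table above; correction=False turns off the Yates continuity correction so the statistic matches the plain formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[25, 6],    # Attended: Pass, Fail
                     [8, 15]])   # Skipped:  Pass, Fail

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)  # chi2 comes out to roughly 11.7 with 1 degree of freedom
print(expected)            # matches the (RowTotal*ColTotal)/GrandTotal values in the table
```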

Shapiro Wilk test


 This test for normality has been found to be the most powerful test in most situations.
 It is the ratio of two estimates of the variance of a normal distribution based on a
random sample of n observations.
 The numerator is proportional to the square of the best linear estimator of the standard
deviation. The denominator is the sum of squares of the observations about the sample
mean.
 The test statistic W may be written as the square of the Pearson correlation coefficient
between the ordered observations and a set of weights which are used to calculate the
numerator. Since these weights are asymptotically proportional to the corresponding
expected normal order statistics, W is roughly a measure of the straightness of the
normal quantile-quantile plot. Hence, the closer W is to one, the more normal the
sample is.
 A p-value greater than 0.05 means we fail to reject the null hypothesis, i.e., the data are
consistent with a normal distribution (see the sketch below).
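A minimal SciPy sketch; the sample is simulated normal data, so W should be close to 1 and the p-value well above 0.05.

```python
import numpy as np
from scipy.stats import shapiro

sample = np.random.default_rng(42).normal(loc=0, scale=1, size=200)

w_statistic, p_value = shapiro(sample)
print(w_statistic, p_value)
# W close to 1 and p > 0.05: we fail to reject the null hypothesis of normality
```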
