Questions Stats and Trix
The main difference between logistic and linear regression is the type of dependent
variable: when the dependent variable is binary, logistic regression is used; when the
dependent variable is continuous, linear regression is used.
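A minimal sketch of this distinction with scikit-learn; the data, variable names, and coefficients below are made up for illustration:

```python
# Sketch: linear regression for a continuous target, logistic regression for a
# binary target, on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Continuous dependent variable -> linear regression
y_cont = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
lin = LinearRegression().fit(X, y_cont)

# Binary dependent variable -> logistic regression
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)

print(lin.predict(X[:2]))        # continuous predictions
print(log.predict_proba(X[:2]))  # class probabilities
```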
Panel data, also known as longitudinal data or cross-sectional time series data
in some special cases, is data that is derived from a (usually small) number of
observations over time on a (usually large) number of cross-sectional units like
individuals, households, firms, or governments.
Time series data is a collection of quantities that are assembled over evenly spaced
intervals in time and ordered chronologically. The time interval at which
data is collected is generally referred to as the time series frequency.
3. What is stationarity?
A stationary series has a mean, variance, and autocovariance that do not change over
time. Data points are often non-stationary, i.e. their means, variances,
and covariances change over time.
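One practical way to check stationarity is a unit-root test. A small sketch using the Augmented Dickey-Fuller test from statsmodels on simulated series (the series themselves are invented for illustration):

```python
# Sketch: Augmented Dickey-Fuller test.
# Null hypothesis: the series has a unit root (i.e. it is non-stationary).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary
white_noise = rng.normal(size=500)              # stationary

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
```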
The extent of the missing values is identified after identifying the variables
with missing values. If any patterns are identified, the analyst should
concentrate on them, as they could lead to interesting and meaningful business
insights.
If there are no patterns identified, then the missing values can be substituted
with mean or median values (imputation) or they can simply be ignored.
Another option is assigning a default value, which can be the mean, minimum, or maximum
value; getting to know the data first is important. If the data for a variable is
approximately normally distributed, the mean is a reasonable fill value. If 80% of the
values for a variable are missing, it is usually better to drop the variable
instead of treating the missing values.
For categorical variables we can use the most frequent value (the mode), or a
constant value.
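A small sketch of these options with pandas and scikit-learn; the DataFrame, column names, and the 80% threshold are made up for illustration:

```python
# Sketch: median imputation for a numeric column, mode imputation for a
# categorical column, and dropping a column with too many missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "city": ["NY", "LA", np.nan, "NY", "NY"],
    "mostly_missing": [1, np.nan, np.nan, np.nan, np.nan],
})

# Drop variables with a very high share of missing values (80% or more here)
df = df.drop(columns=[c for c in df if df[c].isna().mean() >= 0.8])

df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)
```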
2. Correlation matrix
A correlation matrix is a table showing correlation coefficients
between variables.
Each cell in the table shows the correlation between two variables.
A correlation matrix is used to summarize data, as an input into a more
advanced analysis, and as a diagnostic for advanced analyses.
As a diagnostic when checking other analyses: for example, with linear
regression, high correlations among the predictors (multicollinearity) suggest that the
linear regression estimates will be unreliable.
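A short sketch with pandas on toy data (column names and effect sizes are invented):

```python
# Sketch: building and inspecting a correlation matrix with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.5, size=200)   # correlated with x1
df["x3"] = rng.normal(size=200)                               # independent

corr = df.corr()           # Pearson by default; method="spearman" also available
print(corr.round(2))

# Large off-diagonal values flag potential multicollinearity for regression
print((corr.abs() > 0.7) & (corr.abs() < 1.0))
```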
3. Regression vs classification
The most significant difference between regression vs classification is that while
regression helps predict a continuous quantity, classification predicts discrete
class labels.
Let’s consider a dataset that contains student information of a particular
university. A regression algorithm can be used in this case to predict the height of
any student based on their weight, gender, diet, or subject major. We use
regression in this case because height is a continuous quantity. There is an
infinite number of possible values for a person’s height.
On the contrary, classification can be used to analyse whether an email is spam
or not. The algorithm checks the keywords in an email and the sender’s
address to estimate the probability of the email being spam. Similarly, while a
regression model can be used to predict temperature for the next day, we can use
a classification algorithm to determine whether it will be cold or hot according to
the given temperature values.
PARAMETER | CLASSIFICATION | REGRESSION
Mapping function | Maps input to discrete class labels | Maps input to a continuous quantity
Involves | Prediction of discrete/categorical values | Prediction of continuous values
Nature of the predicted data | Unordered | Ordered
4. Drawbacks of linear regression
1. Degrees of freedom
Degrees of Freedom refers to the maximum number of logically
independent values, which are values that have the freedom to
vary, in the data sample.
Consider a data sample consisting of, for the sake of simplicity, five
positive integers. The values could be any number with no known
relationship between them. This data sample would, theoretically, have
five degrees of freedom.
Four of the numbers in the sample are {3, 8, 5, and 4} and the
average of the entire data sample is revealed to be 6.
This must mean that the fifth number has to be 10. It can be
nothing else. It does not have the freedom to vary.
So the Degrees of Freedom for this data sample is 4.
The formula for Degrees of Freedom equals the size of the data sample minus
one:
Df = N − 1
where Df = degrees of freedom and N = sample size.
Chi-Square Tests
There are two different kinds of Chi-Square tests: the test of independence,
which asks a question of relationship, such as, "Is there a relationship
between gender and SAT scores?"; and the goodness-of-fit test, which
asks something like "If a coin is tossed 100 times, will it come up heads
50 times and tails 50 times?"
- What is a Z-test?
A z-test is a statistical test used to determine whether two population
means are different when the variances are known and the sample
size is large.
Z-tests are closely related to t-tests, but t-tests are best performed
when an experiment has a small sample size.
o Also, t-tests assume the standard deviation is unknown, while z-
tests assume it is known. If the standard deviation of the
population is unknown, the assumption of the sample variance
equaling the population variance is made.
The value for z is calculated by subtracting the value of the average daily
return selected for the test, or 1% in this case, from the observed average of
the samples. Next, divide the resulting value by the standard deviation divided
by the square root of the number of observed values. Therefore, the test
statistic is calculated to be 2.83, or (0.02 - 0.01) / (0.025 / (50)^(1/2)). The
investor rejects the null hypothesis since z is greater than 1.96 and concludes
that the average daily return is greater than 1%.
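A quick check of that arithmetic in Python, using the numbers quoted above:

```python
# Reproducing the z statistic from the example:
# observed mean 2%, hypothesized mean 1%, standard deviation 2.5%, n = 50.
import math

x_bar, mu0, sigma, n = 0.02, 0.01, 0.025, 50
z = (x_bar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))   # 2.83 > 1.96, so reject the null at the 5% level
```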
F = MST / MSE
where: F = ANOVA coefficient,
MST = mean sum of squares due to treatment (between-group variability),
MSE = mean sum of squares due to error (within-group variability)
The ANOVA test allows a comparison of more than two groups at the same
time to determine whether a relationship exists between them. The ANOVA
test is the initial step in analyzing factors that affect a given data set.
There are two types of ANOVA: one-way (or unidirectional) and two-way.
One-way or two-way refers to the number of independent variables in
your analysis of variance test. A one-way ANOVA evaluates the impact
of a sole factor on a sole response variable. It determines whether all the
samples are the same. The one-way ANOVA is used to determine
whether there are any statistically significant differences between the
means of three or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way,
you have one independent variable affecting a dependent variable. With a
two-way ANOVA, there are two independents. For example, a two-way
ANOVA allows a company to compare worker productivity based on two
independent variables, such as salary and skill set. It is utilized to observe
the interaction between the two factors and tests the effect of two factors at
the same time.
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/
https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php
https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_HypothesisTesting-ANOVA/
BS704_HypothesisTesting-Anova3.html#:~:text=The%20ANOVA%20table%20breaks%20down,Source%20of
%20Variation
https://online.stat.psu.edu/stat462/node/107/
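A small one-way ANOVA sketch with scipy; the three groups, their sizes, and their means are simulated for illustration:

```python
# Sketch: one-way ANOVA comparing the means of three independent groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=10, scale=2, size=30)
group_b = rng.normal(loc=11, scale=2, size=30)
group_c = rng.normal(loc=13, scale=2, size=30)

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one group mean differs from the others.
```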
2. Confidence interval
ANOTHER FILE STATS QUESTIONS
3. Overfitting
ANOTHER FILE STATS QUESTIONS
Covariance | Correlation
Covariance is a measure to indicate the extent to which two random variables change in tandem. | Correlation is a measure used to represent how strongly two random variables are related to each other.
Covariance indicates the direction of the linear relationship between variables. | Correlation measures both the strength and direction of the linear relationship between two variables.
Covariance assumes the units from the product of the units of the two variables. | Correlation is dimensionless, i.e. it’s a unit-free measure of the relationship between variables.
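A quick numerical illustration with NumPy on simulated data:

```python
# Sketch: covariance carries the units of the variables; correlation is the
# covariance scaled to lie in [-1, 1] and is unit-free.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

print(np.cov(x, y)[0, 1])        # covariance: depends on scale/units
print(np.corrcoef(x, y)[0, 1])   # correlation: unit-free, close to 1 here

# Rescaling x changes the covariance but not the correlation
print(np.cov(10 * x, y)[0, 1], np.corrcoef(10 * x, y)[0, 1])
```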
1. What is p-value
WRITTEN IN THE NOTEBOOK
How to calculate p-value
2. Power of a test
WRITTEN IN THE NOTEBOOK
3. Type 1 & type 2 errors
WRITTEN IN THE NOTEBOOK
4.Central Limit Theorem
The central limit theorem states that if you have a population with mean μ and
standard deviation σ and take sufficiently large random samples from the
population with replacement , then the distribution of the sample means will be
approximately normally distributed.
as you take more samples, especially large ones, your graph of the sample
means will look more like a normal distribution.
This will hold true regardless of whether the source population is normal or
skewed, provided the sample size is sufficiently large (usually n > 30).
A key aspect of the CLT is that the average of the sample means (and of the sample
standard deviations) will approximately equal the population mean and standard deviation.
Why is central limit theorem important?
The central limit theorem tells us that no matter what the distribution of the
population is, the shape of the sampling distribution will
approach normality as the sample size (N) increases.
This is useful, as the researcher never knows which mean in the sampling
distribution is the same as the population mean, but by selecting many
random samples from a population the sample means will cluster together,
allowing the researcher to make a very good estimate of the population mean.
Thus, as the sample size (N) increases, the sampling error will decrease.
An essential component of the Central Limit Theorem is that the average of your
sample means will be the population mean. In other words, add up the means
from all of your samples, find the average and that average will be your actual
population mean. Similarly, if you find the average of all of the standard
deviations in your sample, you’ll find the actual standard deviation for your
population. It’s a pretty useful phenomenon that can help accurately predict
characteristics of a population.
There will be, of course, different means for different samples (from the
same population); this is called the “sampling distribution of the mean”.
The variation between the means of different samples can be estimated by
the standard deviation of this sampling distribution, and this is the standard
error of the estimate of the mean. This is where everybody gets
confused: the standard error is a type of standard deviation, namely the standard
deviation of the distribution of the sample means.
Standard error measures the precision of the estimate of the sample mean:
Standard error = sigma / root(n)
where sigma is the standard deviation and n is the sample size.
The standard error depends directly on the sample size, and thus
the standard error falls as the sample size increases.
It makes total sense if you think about it, the bigger the sample, the
closer the sample mean is to the population mean and thus the
estimate of it is closer to the actual value.
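A small simulation sketch illustrating both the CLT and the standard error formula; the population, sample size, and number of repetitions are arbitrary:

```python
# Sketch: sample means from a skewed population are roughly normally
# distributed around the population mean, with spread sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(5)
population = rng.exponential(scale=2.0, size=1_000_000)   # skewed, mean ~2, sd ~2
n = 50

sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
print(np.mean(sample_means))            # close to the population mean (~2.0)
print(np.std(sample_means))             # close to sigma / sqrt(n)
print(population.std() / np.sqrt(n))    # theoretical standard error (~0.28)
```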
BASIS FOR COMPARISON | STANDARD DEVIATION | STANDARD ERROR
Meaning | Dispersion of individual observations around their mean | Precision of the sample mean as an estimate of the population mean
Formula | s (sample standard deviation) | s / root(n)
5. Sampling bias
Spurious correlation
A spurious correlation occurs when two variables are statistically
related but not directly causally related. These two variables falsely
appear to be related to each other, normally due to an unseen, third
factor.
Each dot on the chart below shows the number of driver deaths in railway
collisions by year (the horizontal position), and the annual imports of
Norwegian crude oil by the US. There is a strong correlation evident in the
data with a correlation statistic of 0.95. Yet this is a spurious correlation
because there's no reason to believe that railway deaths cause oil imports,
or vice versa.
Spurious regression
A spurious regression is a regression that provides misleading statistical evidence of a
linear relationship between independent non-stationary variables.
Example: rainfall and stock prices.
1. Simple random sampling
To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.
Example
You want to select a simple random sample of 100 employees of Company X.
You assign a number to every employee in the company database from 1 to
1000, and use a random number generator to select 100 numbers.
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly
easier to conduct. Every member of the population is listed with a number, but
instead of randomly generating numbers, individuals are chosen at regular
intervals.
Example
All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on),
and you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden pattern
in the list that might skew the sample. For example, if the HR database groups
employees by team, and team members are listed in order of seniority, there is a risk
that your interval might skip over people in junior roles, resulting in a sample that is
skewed towards senior employees.
3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that
may differ in important ways. It allows you to draw more precise conclusions by
ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata)
based on the relevant characteristic (e.g. gender, age range, income bracket, job
role).
Based on the overall proportions of the population, you calculate how many people
should be sampled from each subgroup. Then you use random
or systematic sampling to select a sample from each subgroup.
Example
The company has 800 female employees and 200 male employees. You want to
ensure that the sample reflects the gender balance of the company, so you sort the
population into two strata based on gender. Then you use random sampling on each
group, selecting 80 women and 20 men, which gives you a representative sample of
100 people.
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but
each subgroup should have similar characteristics to the whole sample. Instead of
sampling individuals from each subgroup, you randomly select entire subgroups.
Example
The company has offices in 10 cities across the country (all with roughly the same
number of employees in similar roles). You don’t have the capacity to travel to every
office to collect your data, so you use random sampling to select 3 offices – these
are your clusters.
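A sketch of the four methods with pandas; the employee DataFrame, column names, and sizes below are invented for illustration:

```python
# Sketch: simple random, systematic, stratified, and cluster sampling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
employees = pd.DataFrame({
    "id": range(1, 1001),
    "gender": rng.choice(["F", "M"], size=1000, p=[0.8, 0.2]),
    "office": rng.choice([f"city_{i}" for i in range(10)], size=1000),
})

# 1. Simple random sampling: 100 employees chosen purely by chance
simple = employees.sample(n=100, random_state=0)

# 2. Systematic sampling: random start within the first 10, then every 10th person
start = rng.integers(0, 10)
systematic = employees.iloc[start::10]

# 3. Stratified sampling: 10% from each gender stratum, preserving proportions
stratified = employees.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)

# 4. Cluster sampling: randomly pick 3 offices and take everyone in them
clusters = rng.choice(employees["office"].unique(), size=3, replace=False)
cluster_sample = employees[employees["office"].isin(clusters)]
```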
Let us consider a specific example: we might want to predict the outcome of a presidential
election by means of an opinion poll. Asking 1000 voters about their voting intentions can
give a pretty accurate prediction of the likely winner, but only if our sample of 1000 voters is
'representative' of the electorate as a whole (i.e. unbiased). If we only poll the opinions of
1000 white middle-class college students, then the views of many important parts of the
electorate as a whole (ethnic minorities, elderly people, blue-collar workers) are likely to be
underrepresented in the sample, and our ability to predict the outcome of the election from
that sample is reduced.
As you can see, adding a random independent variable did not help in explaining the
variation in the target variable. Our R-squared value remains essentially the same, which
on its own gives a false indication that this variable might be helpful in predicting the
output. However, the adjusted R-squared value decreased, which indicates that this new
variable is actually not capturing the trend in the target variable.
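A sketch reproducing this behaviour with statsmodels on simulated data (the data and effect sizes are made up):

```python
# Sketch: adding a purely random feature keeps R^2 roughly the same (it can
# never decrease), while adjusted R^2 typically falls for a useless regressor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x)).fit()

X2 = sm.add_constant(np.column_stack([x, rng.normal(size=n)]))  # + random noise column
m2 = sm.OLS(y, X2).fit()

print(m1.rsquared, m1.rsquared_adj)
print(m2.rsquared, m2.rsquared_adj)   # R^2 barely moves; adjusted R^2 typically drops
```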
Lognormal
A lognormal distribution is a continuous probability distribution of a
random variable whose logarithm is normally distributed. Thus, if
the random variable X has a lognormal distribution, then Y = ln(X) has a
normal distribution.
Normal distributions may present a few problems that log-normal distributions
can solve. Mainly, normal distributions allow for negative random
values, while log-normal distributions include only positive values.
One of the most common applications where log-normal distributions are used
in finance is in the analysis of stock prices. The potential returns of a stock
can be graphed in a normal distribution. The prices of the stock however can
be graphed in a log-normal distribution. The log-normal distribution curve can
therefore be used to help better identify the compound return that the stock
can expect to achieve over a period of time.
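A quick sketch with NumPy/scipy on simulated data:

```python
# Sketch: a lognormal sample and its log. If X is lognormal, ln(X) is normal,
# and all values of X are strictly positive (useful for prices).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

print(x.min() > 0)            # True: lognormal values are positive
print(stats.skew(x))          # right-skewed
print(stats.skew(np.log(x)))  # roughly 0: log(X) is approximately normal
```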
Bayes Theorem
Bayes' theorem is a mathematical formula for determining conditional
probability. Conditional probability is the likelihood of an outcome occurring,
based on a previous outcome occurring. Bayes' theorem provides a way to
revise existing predictions or theories (update probabilities) given new or
additional evidence.
Bayes' theorem relies on incorporating prior probability distributions in order to
generate posterior probabilities. Prior probability, in Bayesian statistical
inference, is the probability of an event before new data is collected. Posterior
probability is the revised probability of an event occurring after taking into
consideration new information. Posterior probability is calculated by updating
the prior probability by using Bayes' theorem. In statistical terms, the posterior
probability is the probability of event A occurring given that event B has
occurred.
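A worked sketch with assumed numbers (1% prevalence, 99% sensitivity, 5% false-positive rate; all three are illustrative):

```python
# Sketch: Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), for a diagnostic test.
p_disease = 0.01                     # prior P(A)
p_pos_given_disease = 0.99           # P(B|A)
p_pos_given_healthy = 0.05           # false-positive rate

# P(B) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): the prior updated by the new evidence (a positive test)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.167, far above the 1% prior
```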
When you have ordinal data, the median or mode is usually the best choice. For
categorical data, you have to use the mode.
Autocovariance
Linear dependence between two points on the same series observed at different times.
An ideal ML algorithm should have low bias and low variance, producing consistent
predictions across different datasets.
Y = f(X) + e
We build a model f^(X) of f(X) using linear regression or any other modeling
technique.
The expected squared error at a point x is
Expected squared error = Bias^2 + Variance + Irreducible error
Irreducible error is the error that can’t be reduced by creating good models. It is a
measure of the amount of noise in our data. Here it is important to understand that
no matter how good we make our model, our data will have certain amount of noise
or irreducible error that can not be removed.
Underfitting happens when a model is unable to capture the underlying pattern of the
data. Such models generally have high bias and low variance. It generally happens
when there is too little data to build an accurate model or when we try to fit a
linear model to non-linear data.
Overfitting happens when our model captures the noise along with the underlying
pattern in the data. It happens when we train our model too long on a noisy dataset.
Such models have low bias and high variance. Very flexible models, such as deep
decision trees, are prone to overfitting.
If our model is too simple and has very few parameters then it may have high
bias and low variance.
On the other hand, if our model has a large number of parameters then it’s going
to have high variance and low bias.
So we need to find the right balance without overfitting or underfitting
the data.
This tradeoff in complexity is why there is a tradeoff between bias and variance.
An algorithm can’t be more complex and less complex at the same time.
To build a good model, we need to find a good balance between bias and
variance such that it minimizes the total error.
An optimal balance of bias and variance would never overfit or underfit the
model.
https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
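A small sketch of under- and overfitting using polynomial degree as the complexity knob; the data and degrees are chosen for illustration:

```python
# Sketch: degree 1 underfits (high bias), a very high degree overfits
# (high variance); compare train vs test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(9)
X = np.sort(rng.uniform(-3, 3, size=100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```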
Normal Distribution
Normal distribution can be checked for by plotting a histogram. If the distribution is
not normal then we transform the data to normal using the following techniques
Z score
Box cox transformation
o A Box-Cox transformation is a transformation of a non-normal dependent
variable into a normal shape. Normality is an important assumption for
many statistical techniques; if your data isn’t normal, applying a Box-Cox
transformation means that you are able to run a broader number of tests.
o At the core of the Box-Cox transformation is an exponent, lambda (λ),
which varies from -5 to 5. All values of λ are considered and the optimal
value for your data is selected; the “optimal value” is the one which
results in the best approximation of a normal distribution curve. The
transformation of Y has the form:
y(λ) = (y^λ − 1) / λ for λ ≠ 0, and y(λ) = log(y) for λ = 0.
This only works for positive data. However, Box and Cox did propose a
second, two-parameter formula that can be used for negative y-values:
y(λ1, λ2) = ((y + λ2)^λ1 − 1) / λ1 for λ1 ≠ 0, and log(y + λ2) for λ1 = 0.
o The formulae are deceptively simple. Testing all possible values by hand is
unnecessarily labor intensive; most software packages include an
option for a Box-Cox transformation.
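A short sketch with scipy's boxcox on simulated positive, right-skewed data:

```python
# Sketch: scipy.stats.boxcox searches for the lambda that brings the
# (strictly positive) data closest to normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
skewed = rng.exponential(scale=2.0, size=1_000)      # positive, right-skewed

transformed, best_lambda = stats.boxcox(skewed)
print(best_lambda)                                   # lambda chosen by maximum likelihood
print(stats.skew(skewed), stats.skew(transformed))   # skewness drops after the transform
```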
Checking for normality
o There are various methods available to test the normality of the
continuous data, out of them, most popular methods are Shapiro–Wilk
test, Kolmogorov–Smirnov test, skewness, kurtosis, histogram, box plot, P–
P Plot, Q–Q Plot, and mean with SD.
o The Shapiro–Wilk test is the more appropriate method for small sample sizes
(<50 samples), although it can also handle larger samples, while the
Kolmogorov–Smirnov test is used for n ≥ 50. For both of the above tests, the
null hypothesis states that the data are taken from a normally distributed
population. When P > 0.05, the null hypothesis is not rejected and the data
are considered normally distributed.
o A histogram is an estimate of the probability distribution of a continuous
variable. If the graph is approximately bell-shaped and symmetric about
the mean, we can assume normally distributed data.
o In statistics, a Q–Q plot is a scatterplot created by plotting two sets of
quantiles (observed and expected) against one another. For normally
distributed data, observed data are approximate to the expected data,
that is, they are statistically equal [Figure 2].
o A P–P plot (probability–probability plot or percent–percent plot) is a
graphical technique for assessing how closely two data sets (observed
and expected) agree. It forms an approximate straight line when data are
normally distributed. Departures from this straight line indicate departures
from normality [Figure 3].
o Box plot is another way to assess the normality of the data. It shows the
median as a horizontal line inside the box and the IQR (range between the
first and third quartile) as the length of the box. The whiskers (line
extending from the top and bottom of the box) represent the minimum
and maximum values when they are within 1.5 times the IQR from either
end of the box (i.e., Q1 − 1.5* IQR and Q3 + 1.5* IQR). Scores >1.5 times
and 3 times the IQR are out of the box plot and are considered as outliers
and extreme outliers, respectively. A box plot that is symmetric with the
median line at approximately the center of the box and with symmetric
whiskers indicate that the data may have come from a normal
distribution. If many outliers are present in the data set, either the outliers
need to be removed or the data should be treated as non-normally
distributed.
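A sketch of these checks with scipy on a simulated sample (sample size and parameters are arbitrary):

```python
# Sketch: common normality checks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
sample = rng.normal(loc=5, scale=2, size=40)

# Shapiro-Wilk (small samples): p > 0.05 -> do not reject normality
print(stats.shapiro(sample))

# Kolmogorov-Smirnov against a normal with the sample's own mean/sd
print(stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1))))

# Skewness and kurtosis should both be close to 0 for normal data
print(stats.skew(sample), stats.kurtosis(sample))

# Q-Q plot (requires matplotlib):
# import matplotlib.pyplot as plt
# stats.probplot(sample, dist="norm", plot=plt); plt.show()
```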
Dummy Variables
Numeric variable that represents categorical data, such as gender, race,
political affiliation, etc.
The number of dummy variables required to represent a particular
categorical variable depends on the number of values that the categorical
variable can assume. To represent a categorical variable that can
assume k different values, a researcher would need to define k - 1 dummy
variables.
For example, suppose we are interested in political affiliation, a categorical
variable that might assume three values - Republican, Democrat, or
Independent. We could represent political affiliation with two dummy
variables: X1 = 1 if Republican, 0 otherwise; X2 = 1 if Democrat, 0 otherwise.
For example, suppose we wanted to assess the relationship between household income and political
affiliation (i.e., Republican, Democrat, or Independent). The regression equation might be:
Income = b0 + b1*X1 + b2*X2,
where b0, b1, and b2 are regression coefficients, and X1 and X2 are the dummy variables defined above.
The value of the categorical variable that is not represented explicitly by a dummy variable is called
the reference group. In this example, the reference group consists of Independent voters.
In analysis, each dummy variable is compared with the reference group. In this example, a positive
regression coefficient means that income is higher for the dummy variable political affiliation than for
the reference group; a negative regression coefficient means that income is lower. If the regression
coefficient is statistically significant, the income discrepancy with the reference group is also
statistically significant.
The intercept for males is b0, and the intercept for females is b0 + d0. Because there are just two
groups, we only need two different intercepts.
Using two dummy variables would introduce perfect collinearity because female + male = 1, which
means that male is a perfect linear function of female. Including dummy variables for both
genders is the simplest example of the so-called dummy variable trap, which arises when too
many dummy variables describe a given number of groups.
In the above example the base group is males or benchmark group, that is, the group against which
comparisons are made. This is why b0 is the intercept for males, and d0 is the difference in
intercepts between females and males.
If we take a woman and a man with the same levels of education, experience, and tenure, the
woman earns, on average, $1.81 less per hour than the man.
The intercept is the average wage for men in the sample (let female = 0), so men earn $7.10 per
hour on average. The coefficient on female is the difference in the average wage between women
and men. Thus, the average wage for women in the sample is 7.10 - 2.51 = 4.59, or $4.59 per hour
Because the overall intercept is common to all groups, we can ignore that in finding differences.
Thus, the estimated proportionate difference between single and married women is
-0.110 - (- 0.198) = .088 which means that single women earn about 8.8% more than married
women.
It is worth noticing that the estimated return to using a computer at work (but not at home) is about
17.7%. (The more precise estimate is 19.4%.) Similarly, people who use computers at home but not
at work have about a 7% wage premium over those who do not use a computer at all. The
differential between those who use a computer at both places, relative to those who use a computer
in neither place, is about 26.4% (obtained by adding all three coefficients and multiplying by 100), or
the more precise estimate 30.2% obtained from equation (7.10).
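A small sketch of dummy-variable regression with pandas and statsmodels; the data, group labels, and effect sizes below are invented for illustration:

```python
# Sketch: k = 3 categories -> k - 1 = 2 dummies; drop_first avoids the
# dummy variable trap, and the dropped category becomes the reference group.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 300
affiliation = rng.choice(["Independent", "Republican", "Democrat"], size=n)
effect = pd.Series(affiliation).map({"Independent": 0, "Republican": 5, "Democrat": 2})
income = 50 + effect + rng.normal(scale=10, size=n)

dummies = pd.get_dummies(affiliation, drop_first=True, dtype=float)
X = sm.add_constant(dummies)
model = sm.OLS(income, X).fit()
print(model.params)   # intercept = mean of the reference (dropped) group;
                      # each dummy coefficient = difference vs that reference group
```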
What are the types of data? Explain how all these data types are
different from each other? What are the problems you face in these
data types?
Knowing the types of data is a crucial prerequisite for doing Exploratory Data Analysis (EDA),
since you can use certain statistical measurements only for specific data types.
It also helps you to choose the right visualization method.
Categorical data represents characteristics. Therefore, it can represent
things like a person’s gender, language etc. Categorical data can also take on
numerical values (Example: 1 for female and 0 for male).
Nominal values represent discrete units and are used to label variables that
have no quantitative value. Just think of them as “labels”. Note that nominal
data has no order.
Ordinal values represent discrete and ordered units. It is therefore nearly
the same as nominal data, except that its ordering matters.
Numerical data
o Discrete data - if its values are distinct and separate. In other words:
We speak of discrete data if the data can only take on certain values.
o Continuous data - represents measurements and therefore their
values can’t be counted but they can be measured. An example would
be the height of a person, which you can describe by using intervals
on the real number line.
Data types are an important concept because statistical methods can only be
used with certain data types. You have to analyze continuous data
differently than categorical data; otherwise the analysis would be
wrong.
You can use one-hot encoding to transform nominal data into numeric
features.
You can use label encoding to transform ordinal data into numeric
features.
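A minimal sketch of both encodings with pandas (the categories and their ordering are toy examples):

```python
# Sketch: one-hot encoding for nominal data, label/ordinal encoding for ordered data.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green"],          # nominal: no order
    "size": ["small", "large", "medium"],       # ordinal: has an order
})

one_hot = pd.get_dummies(df["color"], prefix="color")     # one column per label
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)           # preserves the ordering

print(one_hot)
print(df)
```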
Endogeneity
In general, we say that a variable X is endogenous if it is correlated with the model error
term. Endogeneity always induces bias.
Endogeneity of a single variable – t test
Endogeneity of two variables – F test
https://www.stata.com/meeting/spain16/slides/pinzon-spain16.pdf
http://www.sfu.ca/~pendakur/teaching/buec333/Multicollinearity%20and
%20Endogeneity.pdf
Instrumental Variables
Instruments are variables, denoted Z, that are correlated with X, but
uncorrelated with the model error term by assumption or by construction.
Cov(Z,e)=0, so in the Ballentine, Z and the error term have no overlap.
But, (Z,X) do overlap
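A sketch of the instrumental-variables idea via two-stage least squares computed by hand on simulated data; the coefficients, variable names, and noise levels are made up:

```python
# Sketch: 2SLS where x is endogenous (correlated with the error e) and z is a
# valid instrument (correlated with x, uncorrelated with e).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 2_000
z = rng.normal(size=n)                        # instrument
e = rng.normal(size=n)                        # structural error
x = 0.8 * z + 0.5 * e + rng.normal(size=n)    # endogenous regressor
y = 2.0 * x + e                               # true coefficient on x is 2.0

# OLS is biased because Cov(x, e) != 0 (estimate lands noticeably above 2.0)
print(sm.OLS(y, sm.add_constant(x)).fit().params[1])

# Stage 1: regress x on z; Stage 2: regress y on the fitted values of x
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
print(sm.OLS(y, sm.add_constant(x_hat)).fit().params[1])   # close to 2.0
```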
If you take the natural log of both sides of a multiplicative model, you end up with a
model that is linear in the logged variables.
You can estimate this model with OLS by simply using natural log values for
the variables instead of their original scale.
Point Biserial
A point-biserial correlation is used to measure the strength and direction of the association
that exists between one continuous variable and one dichotomous variable. It is a special
case of the Pearson’s product-moment correlation, which is applied when you have two
continuous variables, whereas in this case one of the variables is measured on a
dichotomous scale.
For example, you could use a point-biserial correlation to determine whether there is an
association between salaries, measured in US dollars, and gender (i.e., your continuous
variable would be "salary" and your dichotomous variable would be "gender", which has two
categories: "males" and "females"). Alternately, you could use a point-biserial correlation to
determine whether there is an association between cholesterol concentration, measured in
mmol/L, and smoking status (i.e., your continuous variable would be "cholesterol
concentration", a marker of heart disease, and your dichotomous variable would be
"smoking status", which has two categories: "smoker" and "non-smoker").
Null hypothesis: assumes that there is no association between the two variables.
Alternative hypothesis: assumes that there is an association between the two variables.
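A short sketch with scipy's pointbiserialr on simulated data (the salary numbers and group effect are invented):

```python
# Sketch: point-biserial correlation between a 0/1 variable and a continuous one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
group = rng.integers(0, 2, size=200)                                   # dichotomous (0/1)
salary = 50_000 + 5_000 * group + rng.normal(scale=8_000, size=200)    # continuous

r_pb, p_value = stats.pointbiserialr(group, salary)
print(f"r = {r_pb:.2f}, p = {p_value:.4f}")

# Equivalent to Pearson's r with the 0/1 coding:
print(np.corrcoef(group, salary)[0, 1])
```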
Chi-square formula: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Actually, it's a fairly simple relationship. The variables in this formula are not
simply symbols, but actual concepts that we've been discussing all
along. O stands for the Observed frequency. E stands for the Expected frequency.
You subtract the expected count from the observed count to find the difference
between the two (also called the "residual"). You calculate the square of that
number to get rid of positive and negative values (because the squares of 5 and -5
are, of course, both 25). Then, you divide the result by the expected frequency to
normalize bigger and smaller counts (because we don't want a formula that will
give us a bigger Chi-square value just because you're working with a bigger set of
data). The huge sigma sitting in front of all that is asking for the sum of every i for
which you calculate this relationship - in other words, you calculate this for each
cell in the table, then add it all together. And that's it!
Using this formula, we find that the Chi-square value for our gender/party
example is ((20-25)^2/25) + ((30-25)^2/25) + ((30-25)^2/25) + ((20-25)^2/25), or
(25/25) + (25/25) + (25/25) + (25/25), or 1 + 1 + 1 + 1, which comes out to 4.
Then compare the chi-square statistic to the chi-square distribution with the appropriate
degrees of freedom to obtain a p-value.
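A quick check of this example with scipy, using the 2x2 counts from the gender/party example above:

```python
# Sketch: chi2 = 4 with 1 degree of freedom gives p ~ 0.0455.
from scipy import stats

observed = [[20, 30],
            [30, 20]]
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # 4.0, 1, ~0.0455
print(expected)       # all expected counts are 25
```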
http://www.ce.memphis.edu/7012/L17_CategoricalVariableAssociation.pdf
https://www.ling.upenn.edu/~clight/chisquared.htm