Data Analysis Notes

This document provides a comprehensive overview of quantitative data analysis, focusing on statistical significance, correlation, and regression. It outlines the stages of statistical analysis, methods for data preparation, and the importance of defining research problems and variables. Additionally, it discusses various statistical tests and coefficients, including correlation methods and their applications in research.

QUANTITATIVE DATA ANALYSIS

Fundamentals of Data Analysis

Dr. Olunga Okomo, PhD
Broad Objective
• By the end of the session learners shall have comprehended the concepts of statistical significance, correlation and regression, and be able to skilfully apply them in their research work
Specific Objectives

• To define basic terms in statistics
• To discuss correlation and regression analysis
• To describe the linear correlation coefficient
• To describe confidence intervals
• To discuss the binomial distribution
• To discuss the normal distribution
Definition of terms
• Statistical Data Analysis: This is the process of
collecting and analyzing large volumes of data in
order to identify trends and develop valuable
insights.
• In the professional world, statistical analysts take
raw data and find correlations between variables
to reveal patterns & trends to stakeholders
• Inferential Statistics are methods used to make
inferences about the relationship between the
dependent and independent variables in a
population, based on a sample of observations
Variables and Observation units
• Observation unit: the entity in the research about which data are collected and measured.
Examples:
– Individual (students, health workers, farmers ….)
– Groups (e.g. family, household, couples)
– Institution, organization or community (e.g. school, enterprise, municipality)
• Variable: a characteristic measured on each observation unit, e.g. time, age, sex, scores in an exam

Note: a mean of a sample is a statistic, while a mean of the whole population is a parameter
Definition cont…
• Correlation Analysis: a process used to establish relationship patterns within datasets of variables. A positive correlation means that both variables increase in relation to each other, while a negative correlation means that as one variable increases, the other decreases.
• Linear correlation: when the ratio of change between two given variables is the same/constant, for example, every time income increases by 20% there is a rise in expenditure of 5%.
• Non-linear correlation: a situation when the ratio of variation between two given variables changes (is not constant).
Stages of statistical analysis

Note: with a statistical data analysis program you can easily do several steps in one operation.

1. Clean your data
• Make very sure that your data are correct (e.g. check data transcription)
• Make sure that missing values (e.g. unanswered questions in a survey) are clearly identified as missing data
Stages of statistical analysis
2. Gain knowledge about your data
• Make lists of data (for small data sets only!)
• Produce descriptive statistics, e.g. means, standard deviations, min, max for each variable
• Produce graphics, e.g. histograms or box plots that show the distribution (a short sketch follows below)

3. Produce composite scales
• E.g. create a single variable from a set of questions
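For step 2, a minimal sketch with pandas and matplotlib, using a small made-up variable named score (not data from these notes), of the descriptive statistics and distribution graphics mentioned above:

```python
# Step 2 sketch: descriptive statistics and distribution graphics.
# "score" is a made-up variable used only for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"score": [12, 15, 9, 20, 14, 17, 11, 18, 13, 16]})

# Descriptive statistics: count, mean, standard deviation, min, max, quartiles
print(df["score"].describe())

# Graphics that show the distribution
df["score"].plot(kind="hist", title="Histogram of score")
plt.figure()
df["score"].plot(kind="box", title="Box plot of score")
plt.show()
```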
Stages of statistical analysis
4. Make graphics or tables that show
relationships
• E.g. Scatter plots for interval data (as in our
previous examples) or cross-tabulations
5. Calculate coefficients that measure the
strength and the structure of a relation
• Strength examples: Cramer’s V for cross-tabulations, or Pearson’s R for interval data
• Structure examples: regression coefficient,
tables of means in analysis of variance
Stages of statistical analysis
6. Calculate coefficients that describe the percentage of variance explained
• E.g. R² in a regression analysis

7. Compute the significance level, i.e. find out if you can interpret the relation
• E.g. Chi² for cross-tabs, Fisher’s F in regression analysis
Steps in conducting Chi-square in SPSS
• Go to the data entry view
• Click the Analyze menu, then select:
– Descriptive Statistics, then
– Crosstabs (cross-tabulation)
• Click the Rows box (insert the dependent variable)
• Click the Columns box (insert the independent variable(s) of your choice)
• Click on Exact: highlight Monte Carlo and configure the confidence interval (C.I.) of your choice, e.g. 95%
• Indicate the total number of samples
• Click on Statistics: highlight Chi-square
• Click on Cells: highlight Column and Row, then click OK.
Difference between Chi-square and Fisher's exact test
• The chi-squared test applies an approximation that assumes the sample is large, while Fisher's exact test runs an exact procedure suited especially to small samples (e.g. expected cell counts < 5); a short sketch of both tests follows below
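Outside SPSS, the same two tests can be illustrated in a few lines. A minimal sketch with scipy, run on a small made-up 2 x 2 cross-tabulation (the counts are invented for illustration):

```python
# Chi-square and Fisher's exact test on a made-up 2x2 table.
from scipy.stats import chi2_contingency, fisher_exact

table = [[20, 15],   # e.g. exposed:   outcome yes / outcome no
         [10, 25]]   #      unexposed: outcome yes / outcome no

chi2, p_chi, dof, expected = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, df = {dof}, p = {p_chi:.4f}")

# Fisher's exact test is preferred when expected cell counts are small
odds_ratio, p_fisher = fisher_exact(table)
print(f"Fisher's exact: OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")
```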
Significance Level

For a study finding to be considered significant:
1. P value < 0.05 (at the 95% confidence level)
2. Chi-square value greater than the critical value for the relevant degrees of freedom (e.g. 3.84 at 1 df)
3. Odds ratio > 2
4. Relative risk > 2
5. C.I.: both the lower and the upper limits must lie on the same side of the null value (e.g. both positive, both negative, or both decimal fractions), and the gap between the lower and the upper limits should be quite narrow, based on the subject under study
Measures of Association (Relations)

Data preparation and composite scale making
Data preparation
• Enter the data. Assign a number to each response item (planned when you design the questionnaire)
• Enter a clear code for missing values (no response), e.g. -1
• Make sure that your data set is complete and free of errors
Data preparation and composite scale making
Data preparation
• Some simple descriptive statistics (minimum, maximum, missing values, etc.) can help
• Learn how to document the data in your statistics program
• Enter labels for variables, labels for response items, and display instructions (e.g. decimal points to show)
• Define data-types (interval, ordinal or nominal)
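A minimal sketch of these preparation steps with pandas, assuming a hypothetical file survey.csv and made-up variable names (sex, degree, age); adapt the names, codes and categories to your own questionnaire:

```python
# Data preparation sketch: missing-value code, data types, quick checks.
import pandas as pd

# Treat the agreed missing-value code (-1) as missing while reading
df = pd.read_csv("survey.csv", na_values=[-1])

# Define data types: nominal/ordinal as categories, interval as numeric
df["sex"] = df["sex"].astype("category")                      # nominal
df["degree"] = pd.Categorical(df["degree"],
                              categories=["none", "diploma", "bachelor", "master"],
                              ordered=True)                    # ordinal
df["age"] = pd.to_numeric(df["age"], errors="coerce")         # interval

# Simple checks: minimum, maximum and number of missing values per variable
print(df.describe(include="all"))
print(df.isna().sum())
```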
Composite scales (indicators)

Basics:
• Most scales are made by simply adding the values from different items (sometimes called "Likert scales")
• Eliminate items that have a high number of non-responses
Composite scales (indicators)

Basics cont.:
• Make sure to take missing values (non-responses) into account when you add up the responses from the different items. A full statistics program (e.g. SPSS) does that for you
• Make sure when you create your questionnaire that all items use the same range of response options, otherwise you will need to standardize!
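A minimal sketch with pandas, using made-up items q1-q3 on a 1-5 response range; the item mean is used here so that a missing response does not distort the score, and a standardized version is shown for the case where items use different ranges:

```python
# Composite (Likert-type) scale from made-up items.
import pandas as pd

items = pd.DataFrame({
    "q1": [4, 5, 3, None, 2],
    "q2": [4, 4, 3, 5, 2],
    "q3": [5, 4, 2, 4, 1],
})

# Mean of the items ignores missing values instead of distorting the sum
scale = items.mean(axis=1, skipna=True)

# If items used different response ranges, standardize them first
standardized = (items - items.mean()) / items.std()
scale_std = standardized.mean(axis=1, skipna=True)

print(scale)
print(scale_std)
```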
Composite scales (indicators)
Quality of a scale:
Again: use a published set of items to measure a variable (if available); if you do, you can avoid making long justifications!

Sensitivity: questionnaire scores discriminate
e.g. if exploratory research has shown a higher degree of presence in one kind of learning environment than in another one, the results of the presence questionnaire should demonstrate this.
Composite scales (indicators)
Quality of a scale:
• Reliability: internal consistency is high, i.e. the inter-correlation between items (alpha) is high
• Validity: results obtained with the questionnaire can be tied to other measures
– e.g. they are similar to results obtained by other tools (e.g. in-depth interviews),
– e.g. results are correlated with similar variables.
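As an illustration of the reliability point, a minimal sketch computing Cronbach's alpha with NumPy on made-up item responses (rows are respondents, columns are items):

```python
# Internal consistency (Cronbach's alpha) from item-level data.
import numpy as np

items = np.array([
    [4, 4, 5],
    [5, 4, 4],
    [3, 3, 2],
    [4, 5, 4],
    [2, 2, 1],
])  # rows = respondents, columns = items

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # values near 1 indicate high consistency
```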
Stage One: Define the Research Problem

In this stage, the following issues are addressed:
• Relationship to be analyzed
• Specifying the dependent and independent variables
• Method for including independent variables
Define Relationship to be analyzed
The goal of this analysis is to examine the
relationship between …………….
Specifying the dependent and independent variables

The dependent variables are/is …………………...

The independent variables:

• AGE 'Age of respondent’


• EDUC ‘Highest year of school completed’
• DEGREE ‘Respondent’s Highest Degree’
• SEX ‘Respondent’s Sex’

Note: Identify the method for including variables


Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:

• Missing data analysis
• Minimum sample size requirement: cases per independent variable

Missing data analysis

Check the magnitude of missing cases. If the number of cases with missing data is small, it is unlikely to produce a missing-data process that is disruptive to the analysis.
Stage 2: Develop the Analysis Plan: Sample Size Issues

Minimum sample size requirement: cases per independent variable

Check the sample size and verify that you have enough cases to give a stable result. The requirement depends on the statistical test, but usually 15-20 cases per independent variable are sufficient.
Stage 2: Develop the Analysis Plan: Measurement Issues
• Examine the data structure:
– How to incorporate non-metric data with dummy variables
– How to deal with curvilinear effects with polynomials
– How to identify and describe interaction or moderator effects
Stage 3: Evaluate Underlying Assumptions

Evaluate the assumptions for the intended statistics in terms of:
• Non-metric dependent variable with two or more groups
• Metric or non-metric independent variables
Stage 4: Run your statistical estimates and Assess Overall Fit: Model Estimation

• Compute all your statistical estimates and assess the model fit
• Interpret the results
• Report your findings
Overview of statistical methods

Descriptive statistics
• Descriptive statistics are not very interesting
in most cases (unless they are used to
compare different cases in comparative
systems designs)
• Therefore, do not fill up pages of your report
with tons of Excel diagrams !!
Which data analysis for which data types?

Popular multivariate analysis

                           Dependent variable Y
Independent (explaining)   Quantitative (interval)        Qualitative (nominal or ordinal)
variable X
Quantitative               Factor analysis,               Transform Y into a
                           multiple regression,           quantitative variable
                           SEM, cluster analysis
Qualitative                ANOVA                          Multidimensional scaling etc.
Which data analysis for which data types?

Popular bivariate analysis

                           Dependent variable Y
Independent (explaining)   Quantitative (interval)        Qualitative (nominal or ordinal)
variable X
Quantitative               Correlation and regression     Transform Y into a
                                                          quantitative variable
Qualitative                Analysis of variance           Cross-tabulations
Types of statistical coefficients

• First of all, the coefficient must be more or less appropriate for the data

The big four:
1. Strength of a relation
• Coefficients usually range from -1 (total negative relationship) to +1 (total positive relationship). 0 means no relationship.
2. Structure (tendency) of a relation
3. Percentage of variance explained
Types of statistical coefficients:

4. Significance level of a model
• The probability of observing a result as extreme or more extreme than the one actually observed, from chance alone
• Typically in the social sciences a sig. level of 5% (0.05) is acceptable
• These four are mathematically connected:
e.g. the significance level is not just dependent on the size of the sample, but also on the strength of the relation.
Cross-tabulation
• Cross-tabulation is a popular technique to study relationships between nominal (categorical) or ordinal variables.
Computing the percentages (probabilities)
1. For each value of the independent variable compute the percentages
• Usually the X variable is put on top (i.e. its values show in columns), so the percentages are computed down each column rather than across the rows. If in doubt, remember this:
• You want to know the probability (percentage) that a value of X leads to a value of Y
• Compare (interpret) the percentages across the values of the dependent variable.
Cross-tabulation example: socio-economic status as risk factor

Ever been emotionally      Has health        Has no health
or physically abused?      insurance         insurance
No                         1,011             954
Yes                        104               294
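A minimal sketch with pandas that reproduces the column percentages for this table, treating health-insurance status as the independent variable X (that framing is an assumption made here for illustration):

```python
# Column percentages for the abuse-by-insurance cross-tabulation above.
import pandas as pd

counts = pd.DataFrame(
    {"has health insurance": [1011, 104],
     "has no health insurance": [954, 294]},
    index=["No", "Yes"],  # ever been emotionally or physically abused?
)

# Percentage of each X category (column) falling into each Y category (row)
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100
print(col_pct.round(1))
# About 9% of the insured vs. about 24% of the uninsured report abuse,
# so the percentages are compared across the columns of X.
```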


Steps in Statistical Testing

• Null hypothesis
Ho: there is no difference between the groups
• Alternative hypothesis
H1: there is a difference between the groups
• Collect data
• Compute the test statistic, e.g. t-test, Chi-square or ANOVA
• Interpret the P value and confidence intervals
P value ≤ 0.05 → Reject Ho
P value > 0.05 → Do not reject Ho
• Draw conclusions
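A minimal sketch of these steps with scipy, using two small made-up groups and an independent-samples t-test:

```python
# Hypothesis-testing steps on made-up data.
from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3]
group_b = [5.9, 6.1, 5.7, 6.0, 5.8, 6.2]

# Ho: there is no difference between the groups
# H1: there is a difference between the groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    print("Reject Ho: the groups differ significantly")
else:
    print("Do not reject Ho: no significant difference detected")
```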
CORRELATION
• Correlation is a powerful statistical tool used in examining the relationship between two or more variables.
• We can use the correlation coefficient to determine the presence, direction and magnitude of the relation between variables.
• We can determine if the differences among subjects on one variable can be accounted for, or explained by, another variable.
Methods of studying Correlation

• Methods include:
– Scatter diagram
– Pearson’s correlation coefficient
– Spearman’s Rank correlation coefficient
– Method of least squares
• The first method is based on the knowledge
of graphs whereas the others are
mathematical methods
Types of Correlation
• Correlation may be classified as:
1. Positive and negative
2. Linear and non-linear
3. Simple and multiple
• Give examples of correlation
How to calculate correlation
• Step 1: Find the mean of x, and the mean of y.
• Step 2: Subtract the mean of x from every x value (call them "a"), and subtract the mean of y from every y value (call them "b")
• Step 3: Calculate ab, a² and b² for every value.
• Step 4: Sum up ab, sum up a² and sum up b².
• Step 5: Divide the sum of ab by the square root of (the sum of a² multiplied by the sum of b²); the result is r.
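A minimal sketch of steps 1-5 with NumPy on a small made-up data set, checked against the library's own correlation function:

```python
# Manual Pearson correlation following steps 1-5, on made-up data.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

a = x - x.mean()          # steps 1-2: deviations from the mean of x
b = y - y.mean()          #            deviations from the mean of y
ab = a * b                # step 3: products and squares
a2, b2 = a ** 2, b ** 2

r = ab.sum() / np.sqrt(a2.sum() * b2.sum())   # steps 4-5
print(f"r = {r:.4f}")
print(np.corrcoef(x, y)[0, 1])                # the same value from the library
```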
Key points
• The Pearson correlation coefficient, also known as Pearson's R, is a statistic that estimates the strength of the relationship between two variables. Hence, whenever a statistical test is performed on two variables, it is a good idea to also estimate the correlation coefficient to know how strong the relationship between them is.

Key points
• A correlation coefficient of -1 indicates a perfect negative relationship between the variables. A correlation coefficient of 0 indicates no relationship. A correlation coefficient of +1 indicates a perfect positive relationship between the variables.

Key points
• The Pearson correlation coefficient shows the relationship between two variables measured on the same interval or ratio scale. It estimates the strength of the relationship between two continuous variables.
Example of correlation calculation
• Pearson's correlation coefficient:
– Is there a relationship between the ages of husbands and wives at the time of marriage?
– Husbands: 23 27 28 28 28 30 30 33 35 38
– Wives: 18 20 22 27 21 29 27 29 28 29
Correlation
Mean of X = 30, mean of Y = 25

X      dx     dx²     Y      dy     dy²     dx·dy
23                    18
27                    20
28                    22
28                    27
28                    21
30                    29
30                    27
33                    29
35                    28
38                    29
ΣX=300 Σdx=   Σdx²=   ΣY=250 Σdy=   Σdy²=   Σdx·dy=
Correlation
Mean of X = 30, mean of Y = 25

X      dx     dx²      Y      dy     dy²      dx·dy
23     -7     49       18     -7     49       49
27     -3     9        20     -5     25       15
28     -2     4        22     -3     9        6
28     -2     4        27     2      4        -4
28     -2     4        21     -4     16       8
30     0      0        29     4      16       0
30     0      0        27     2      4        0
33     3      9        29     4      16       12
35     5      25       28     3      9        15
38     8      64       29     4      16       32
ΣX=300 Σdx=0  Σdx²=168 ΣY=250 Σdy=0  Σdy²=164 Σdx·dy=133
Correlation

r = (sum of the products of the deviations) ÷ (square root of the product of the sums of squares)

r = Σdx·dy / √(Σdx² · Σdy²)
  = 133 ÷ √(168 × 164)
  = 133 ÷ √27552
  = 133 ÷ 165.99
  = 0.8013
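A minimal check of this hand calculation with NumPy, using the ten pairs of ages from the table above:

```python
# Verify r for the husbands/wives example.
import numpy as np

husbands = np.array([23, 27, 28, 28, 28, 30, 30, 33, 35, 38])
wives    = np.array([18, 20, 22, 27, 21, 29, 27, 29, 28, 29])

dx = husbands - husbands.mean()   # deviations from the mean of 30
dy = wives - wives.mean()         # deviations from the mean of 25

r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(f"r = {r:.4f}")   # about 0.8013, matching the hand calculation
```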
Properties of the Coefficient of Correlation
• The coefficient of correlation lies between -1 and +1 (-1 ≤ r ≤ +1)
• The nearer the coefficient is to ±1, the stronger the relationship
• The coefficient of correlation is symmetrical with respect to x and y
• The coefficient of correlation is independent of change of origin and change of scale of X and Y; by change of scale we mean dividing or multiplying every value of X and Y by some constant
• If X and Y are independent variables then the coefficient of correlation is zero; however, the converse is not true
Regression
• Equation of a straight line.
• Given two points (x1, y1) and (x2, y2), one can draw a line
• The equation of a line is given by y = ax + b.
– Where b = y intercept (where the line crosses the y-axis, i.e. where x = 0)
– a = gradient/slope
Regression

[Figure: a straight line plotted on X-Y axes]
Regression
Gradient/slope (a) = Δy/Δx = (y2 – y1) / (x2 – x1)

Find the gradient and equation of a line passing through the two points (1, 1) and (2, 3):

Gradient = Δy/Δx = (3 – 1) / (2 – 1) = 2
Regression
• Equation of a straight line.

Using the point (2, 3) and the gradient 2:
y – 3 = 2(x – 2)
y = 2x – 4 + 3
y = 2x – 1
Regression
• The equation of a straight line is given by
y = b + ax
• In regression analysis, we call the x variable the independent variable because we want to use it to predict or estimate scores on the dependent variable, the y variable.
• We estimate the intercept (b) and the slope (a), since in general not all points lie on one straight line. This straight line, also known as the line of best fit, is known as the regression of y on x.
Regression

It can be shown, using the least squares method of estimation, that for the line y = b + ax:

a = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)²

and

b = ȳ – a·x̄
Regression
• A regression line is a straight line that best represents the linear relationship between two variables x and y.
• The regression line is considered to be the 'line of best fit' when the sum of the squared distances between the actual scores y and the predicted scores ŷ is minimal, i.e. Σ(y – ŷ)² is minimal.
Regression

No    x1    y1    x1 – x̄    y1 – ȳ    (x1 – x̄)(y1 – ȳ)    (x1 – x̄)²
1     2     1
2     3     3
3     5     7
4     7     11
5     9     15
6     10    17
Sum   36    54
Mean  6     9
Regression

No    x1    y1    x1 – x̄    y1 – ȳ    (x1 – x̄)(y1 – ȳ)    (x1 – x̄)²
1     2     1     -4         -8         32                   16
2     3     3     -3         -6         18                   9
3     5     7     -1         -2         2                    1
4     7     11    1          2          2                    1
5     9     15    3          6          18                   9
6     10    17    4          8          32                   16
Σ     36    54    0          0          104                  52
Mean  6     9
Regression
Thus x̄ = 6 and ȳ = 9

a = 104 / 52 = 2 (the slope)
b = 9 – (2 × 6) = –3 (the intercept)

Thus the regression line is
y = 2x – 3
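A minimal check of this worked example with NumPy, using the same six (x, y) pairs:

```python
# Verify the least-squares slope and intercept for the worked example.
import numpy as np

x = np.array([2, 3, 5, 7, 9, 10])
y = np.array([1, 3, 7, 11, 15, 17])

a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()
print(f"slope a = {a}, intercept b = {b}")   # slope 2.0, intercept -3.0

# np.polyfit fits the same least-squares line (highest power first)
print(np.polyfit(x, y, 1))                   # approximately [ 2. -3.]
```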
Linearity

• Example: most popular statistical methods for interval data assume linear relationships.
• In the following example the relationship is non-linear: students that show weak daily computer use have bad grades, but so do the ones that show very strong use.
• Popular measures like Pearson's r will "not work", i.e. you will get a very weak correlation and therefore miss this non-linear relationship.
Linearity
Normal Distribution
• Most methods for interval data also require a "normal distribution"
• If you have data with "extreme cases" and/or data that are skewed, some individuals will have much more "weight" than the others.
• Hypothetical example: the "red" student who uses the computer for very long hours will determine a positive correlation and positive regression rate, whereas the "black" ones suggest a non-existent correlation.
• The mean use of computers does not represent "typical" usage.
• The "green" student, however, will not have a major impact on the result, since the other data are well distributed along the two axes. In this second case the "mean" represents a "typical" student.
Normal distribution
Standard Normal Distribution

Mean ± 1 SD → encompasses 68% of observations
Mean ± 2 SD → encompasses 95% of observations
Mean ± 3 SD → encompasses 99.7% of observations
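A minimal check of these coverage figures for a standard normal distribution with scipy:

```python
# The 68-95-99.7 rule for a standard normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"Mean +/- {k} SD covers {coverage * 100:.1f}% of observations")
# Prints roughly 68.3%, 95.4% and 99.7%
```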
Principle of statistical analysis

The goal of statistical analysis is quite simple: find structure in the data

DATA = STRUCTURE + NON-STRUCTURE

DATA = EXPLAINED VARIANCE + UNEXPLAINED VARIANCE

Example: simple regression analysis
• DATA = predicted regression line + residuals
• In other words: regression analysis tries to find a line that will maximize prediction and minimize residuals
Confidence Intervals

• Confidence intervals express the range in which the true value of a population parameter (as estimated by the sample statistic) falls, with a high degree of confidence (usually 95% or 99%).
• Example: take mean = 205.15;
95% CI = (204.70, 205.60);
99% CI = (204.56, 205.75).
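A minimal sketch of a 95% confidence interval for a mean with scipy, using made-up measurements (it does not reproduce the exact interval in the slide's example):

```python
# 95% confidence interval for a mean, on made-up measurements.
import numpy as np
from scipy import stats

data = np.array([204.9, 205.3, 205.0, 205.6, 204.8, 205.4, 205.2, 205.0])

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(data) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```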
END
