Data Analysis Notes
Basics:
• Most scales are made by simply adding
the values from different items
(sometimes called "Likert scales")
• Eliminate items that have a high
number of non-responses
Composite scales (indicators)
Basics cont.:
• Make sure to take missing values
(non-responses) into account when you add up the
responses from the different items. A real
statistics program (e.g. SPSS) does that for you.
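One way to handle missing values when adding up items is sketched below. It is only an illustration under stated assumptions: the function name `scale_score`, the `max_missing` threshold and the prorating rule (mean of the answered items times the number of items) are not from these notes, and a statistics package may use different defaults.

```python
def scale_score(responses, max_missing=1):
    """Composite score for one respondent; None marks a non-response.

    Assumption (not from the notes): the score is prorated from the
    answered items, and respondents with more than `max_missing`
    non-responses get no score at all.
    """
    answered = [r for r in responses if r is not None]
    n_missing = len(responses) - len(answered)
    if not answered or n_missing > max_missing:
        return None
    # Mean of the answered items, scaled back to the full item count,
    # so a skipped item does not silently count as zero.
    return sum(answered) / len(answered) * len(responses)


print(scale_score([4, 5, None, 3]))     # one item missing: prorated score
print(scale_score([4, None, None, 3]))  # too many missing: None
```

Simply summing the raw values would treat each non-response as 0 and pull the score down, which is the problem the bullet above warns about.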
Descriptive statistics
• Descriptive statistics are not very interesting
in most cases (unless they are used to
compare different cases in comparative
systems designs)
• Therefore, do not fill up pages of your report
with tons of Excel diagrams!
Which data analysis for which data types?
• Null hypothesis
H0: there is no difference between the groups
• Alternative hypothesis
H1: there is a difference between the groups
• Collect data
• Compute a test statistic, e.g. t-test, chi-square or ANOVA
• Interpret the P value and confidence intervals
P value ≤ 0.05: reject H0
P value > 0.05: do not reject H0
• Draw conclusions
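The steps above can be sketched in code. Note the swap: instead of a t-test, chi-square or ANOVA (which need a statistics library for the P value), this sketch uses an exact permutation test on the difference in group means, because it can be written with the standard library alone. The function names, the 0.05 threshold as a default argument and the two small groups are all made up for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    H0: the group labels do not matter (no difference between groups).
    The P value is the share of all relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(indices):
        sum_a = sum(pooled[i] for i in indices)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(range(n_a))
    extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        # Small tolerance so ties are not lost to rounding.
        if abs_mean_diff(indices) >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


def decide(p_value, alpha=0.05):
    """Apply the decision rule from the notes."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# Hypothetical data, for illustration only.
p = permutation_p_value([1, 2, 3, 4], [11, 12, 13, 14])
print(round(p, 4), decide(p))
```

The exact version enumerates every split, so it is only practical for small groups; for larger samples a library test would be the usual choice.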
CORRELATION
• Correlation is a powerful statistical tool
used to examine the relationship
between two or more variables.
• We can use the correlation coefficient to
determine the presence, direction and
magnitude of the relationship between variables.
• We can determine whether the differences among
subjects on one variable can be accounted
for, or explained, by another variable.
Methods of studying Correlation
• Methods include:
– Scatter diagram
– Pearson’s correlation coefficient
– Spearman’s Rank correlation coefficient
– Method of least squares
• The first method is based on the knowledge
of graphs whereas the others are
mathematical methods
Types of Correlation
• Correlation may be classified as:
1. Positive and negative
2. Linear and non-linear
3. Simple and multiple
• Give examples of correlation
How to calculate correlation
• Step 1: Find the mean of x, and the mean of y.
• Step 2: Subtract the mean of x from every x
value (call them "a"), and subtract the mean
of y from every y value (call them "b").
• Step 3: Calculate ab, a² and b² for every value.
• Step 4: Sum up ab, sum up a² and sum up b².
• Step 5: Divide the sum of ab by the square root
of (sum of a²) × (sum of b²).
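The steps above can be sketched as a small function. It follows the steps literally and then divides Σab by √(Σa² × Σb²), the standard final step of Pearson's formula. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed step by step."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")

    # Step 1: means of x and y.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Step 2: deviations from the means ("a" and "b").
    a = [x - mean_x for x in xs]
    b = [y - mean_y for y in ys]

    # Steps 3 and 4: products and squares, then their sums.
    sum_ab = sum(ai * bi for ai, bi in zip(a, b))
    sum_a2 = sum(ai * ai for ai in a)
    sum_b2 = sum(bi * bi for bi in b)

    # Final step: divide the sum of ab by sqrt(sum of a² × sum of b²).
    denominator = math.sqrt(sum_a2 * sum_b2)
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sum_ab / denominator


# Hypothetical data: y is exactly 2x, so the relationship is perfectly linear.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```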
Key points
• The Pearson correlation coefficient, also known as
Pearson's r, is a statistic that estimates
the strength of the linear relationship between two
variables. Hence, whenever a
statistical test is performed between two
variables, it is a good idea to
estimate the correlation coefficient
to know how strong the relationship between
them is.
Key points
• A correlation coefficient of -1 means a
perfect negative relationship
between the variables. If the correlation
coefficient is 0, there is no linear relationship.
If the correlation coefficient is 1, there is
a perfect positive
relationship between the variables.
Key points
• The Pearson correlation coefficient describes the
relationship between two variables,
each measured on an interval or ratio scale.
In other words, it estimates the strength of the
relationship between two continuous
variables.
Example of Correlation calculation
• Pearson’s correlation coefficient:
Is there a relationship between the ages of
husbands and wives at the time of marriage?
Husbands: 23 27 28 28 28 30 30 33 35 38
Wives: 18 20 22 27 21 29 27 29 28 29
Correlation
Mean of x = 30, mean of y = 25
  X    dx   dx²     Y    dy   dy²   dx·dy
 23                18
 27                20
 28                22
 28                27
 28                21
 30                29
 30                27
 33                29
 35                28
 38                29
ΣX = 300   Σdx =    Σdx² =    ΣY = 250   Σdy =    Σdy² =    Σdx·dy =
Correlation
Mean of x = 30, mean of y = 25
  X    dx   dx²     Y    dy   dy²   dx·dy
 23    -7    49    18    -7    49      49
 27    -3     9    20    -5    25      15
 28    -2     4    22    -3     9       6
 28    -2     4    27     2     4      -4
 28    -2     4    21    -4    16       8
 30     0     0    29     4    16       0
 30     0     0    27     2     4       0
 33     3     9    29     4    16      12
 35     5    25    28     3     9      15
 38     8    64    29     4    16      32
ΣX = 300   Σdx = 0   Σdx² = 168   ΣY = 250   Σdy = 0   Σdy² = 164   Σdx·dy = 133
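Correlation
$r = \frac{\sum dx \cdot dy}{\sqrt{\sum dx^2 \times \sum dy^2}} = \frac{133}{\sqrt{168 \times 164}} \approx 0.80$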
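The deviation table above can be reproduced with a short script, which is a convenient way to check the column sums by machine. The variable names are illustrative; the data are the ten (husband, wife) pairs from the table rows.

```python
import math

# The ten pairs from the table rows.
husbands = [23, 27, 28, 28, 28, 30, 30, 33, 35, 38]
wives = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]

mean_x = sum(husbands) / len(husbands)  # 30
mean_y = sum(wives) / len(wives)        # 25

# One row per pair: X, dx, dx², Y, dy, dy², dx·dy
rows = []
for x, y in zip(husbands, wives):
    dx = x - mean_x
    dy = y - mean_y
    rows.append((x, dx, dx * dx, y, dy, dy * dy, dx * dy))

sum_dx2 = sum(row[2] for row in rows)
sum_dy2 = sum(row[5] for row in rows)
sum_dxdy = sum(row[6] for row in rows)

# Pearson's r from the three column sums.
r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

for row in rows:
    print("  ".join(f"{value:6.0f}" for value in row))
print(f"sum dx2 = {sum_dx2:.0f}, sum dy2 = {sum_dy2:.0f}, sum dx.dy = {sum_dxdy:.0f}")
print(f"r = {r:.2f}")
```

A value of r near 0.8 indicates a fairly strong positive relationship between the two ages in this sample.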
Regression
Gradient/slope: $a = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$
Gradient: $\frac{\Delta y}{\Delta x} = \frac{3 - 1}{2 - 1} = 2$
Regression
• Equation of a straight line.
y - 3 = 2(x - 2)
y - 3 + 3 = 2x - 4 + 3
y = 2x - 4 + 3
y = 2x - 1
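The gradient and line equation above can be checked with a few lines of code. This is a minimal sketch: the function name `line_through` is made up, and the two points (1, 1) and (2, 3) are read off the gradient calculation (Δy = 3 - 1, Δx = 2 - 1).

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    slope = (y2 - y1) / (x2 - x1)  # gradient = Δy / Δx
    intercept = y1 - slope * x1    # rearranged from y - y1 = slope * (x - x1)
    return slope, intercept


slope, intercept = line_through((1, 1), (2, 3))
print(f"y = {slope:g}x + ({intercept:g})")
```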
Regression
• The equation of a straight line is given by
y = b + ax
• In regression analysis, we call x the
independent variable because we want to
use it to predict or estimate scores
on the dependent variable, y.
• We estimate the intercept (b) and the
slope (a) because the points generally do not all lie on one
straight line. The resulting line, also known as
the line of best fit, is called the regression
of y on x.
Regression
$a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$y = b + ax$
and
$b = \bar{y} - a\bar{x}$
Regression
• A regression line is a straight line that best
represents the linear relationship between
two variables x and y.
• The regression line is considered to be the
‘line of best fit’ when the sum of the squared
distances between the actual scores y and the
predicted scores ŷ is minimal, i.e. Σ(y - ŷ)² is
minimal.
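The "minimal sum of squared distances" idea can be illustrated numerically. In the sketch below, the data set is hypothetical and the function names are made up; the fitted line uses the usual least-squares formulas for slope and intercept, and its sum of squared residuals is compared with a few slightly shifted lines.

```python
def least_squares(xs, ys):
    """Slope and intercept of the least-squares line for y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (y - predicted y) squared for a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data that do not lie exactly on one line.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
print(f"fitted line: slope={slope:.2f}, intercept={intercept:.2f}, SSE={best:.2f}")

# Any nearby line has a larger sum of squared residuals.
for d_slope, d_intercept in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    other = sum_squared_residuals(xs, ys, slope + d_slope, intercept + d_intercept)
    print(f"shifted by ({d_slope:+.1f}, {d_intercept:+.1f}): SSE={other:.2f}")
```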
Regression
No     x     y    x - x̄    y - ȳ    (x - x̄)(y - ȳ)    (x - x̄)²
1      2     1
2      3     3
3      5     7
4      7    11
5      9    15
6     10    17
Sum   36    54
Mean   6     9
Regression
No     x     y    x - x̄    y - ȳ    (x - x̄)(y - ȳ)    (x - x̄)²
1      2     1      -4       -8           32             16
2      3     3      -3       -6           18              9
3      5     7      -1       -2            2              1
4      7    11       1        2            2              1
5      9    15       3        6           18              9
6     10    17       4        8           32             16
Σ     36    54       0        0          104             52
Mean   6     9
Regression
Thus $\bar{x} = 6$ and $\bar{y} = 9$
Slope: $a = \frac{104}{52} = 2$
Intercept: $b = 9 - (2 \times 6) = -3$
So the regression line is $y = -3 + 2x$
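The worked example can be verified with a short script that rebuilds the two sums from the table and then derives the slope and intercept. The function and variable names are illustrative, and the code uses the words "slope" and "intercept" rather than letters to avoid any ambiguity in labeling.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept, plus the two table sums."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of (x - mean x)(y - mean y) and sum of (x - mean x)²
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_squares = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_products / sum_squares
    intercept = mean_y - slope * mean_x
    return slope, intercept, sum_products, sum_squares


# Data from the regression table.
xs = [2, 3, 5, 7, 9, 10]
ys = [1, 3, 7, 11, 15, 17]

slope, intercept, sum_products, sum_squares = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]

print(f"sum of products = {sum_products:g}, sum of squares = {sum_squares:g}")
print(f"slope = {slope:g}, intercept = {intercept:g}")
print("predicted y:", predicted)
```

In this particular data set every point lies exactly on the fitted line, so the predicted values match the observed y values.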