STAT - Lec.3 - Correlation and Regression

1. The document discusses correlation and regression analysis. Correlation describes the relationship between two variables using a correlation coefficient and scatter plot. Regression models the relationship by quantifying it with an equation. 2. The correlation coefficient ranges from -1 to 1, where values closer to 1 or -1 indicate a stronger positive or negative relationship, respectively. Regression estimates the relationship with an equation of the form Y = B0 + B1X + ε, where B0 is the intercept, B1 is the slope, and ε is the error. 3. An example calculates the correlation and estimates the regression equation for hours of study (X) and exam marks (Y). The correlation is positive and strong

Uploaded by

Salma Hazem
Copyright
© © All Rights Reserved

Dr. Ayman Amin Descriptive Statistics

Correlation and Regression


Simply put, the correlation coefficient, together with the scatter plot, is a means of
describing the nature of a relationship between variables, whilst a regression model
is a means of modelling (quantifying) the relationship between variables.

1. Correlation Coefficient
The correlation coefficient, together with the scatter plot, helps to describe the
nature of a relationship between two variables, in particular the direction and
strength of the relationship. By direction we mean a positive or negative relation,
and by strength we mean a weak, moderate, or strong relation. In general, the
correlation coefficient can take any value between -1 and 1. When the value of the
correlation coefficient for two variables is close to 1, the relation between the two
variables is strong and positive. On the other hand, when the value is close to -1,
the relation between the two variables is strong and negative. A value close to 0
indicates a weak or absent linear relation. More details about the direction and
strength of the relation based on the correlation coefficient values are presented
in the following simple figure:

The relation between the correlation coefficient (r) and scatter plot is shown in the
following Figure.


The correlation coefficient (r) can be computed as:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

Note that ∑ denotes the sum, $\bar{x}$ is the mean of X, and $\bar{y}$ is the mean of Y.
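
As an illustration only, the formula above can be translated into a short Python function. The name `pearson_r` is hypothetical and not part of the lecture; the sketch assumes two equally long lists of numbers in which neither variable is constant (otherwise the denominator is zero).

```python
import math


def pearson_r(x, y):
    """Correlation coefficient r computed from sums of deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / len(x)  # mean of X
    y_bar = sum(y) / len(y)  # mean of Y
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator terms: sums of squared deviations.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)
```

A perfectly increasing straight-line relation gives r = 1 and a perfectly decreasing one gives r = -1, which matches the range described above.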

Example 1:
For the following two variables, hours of study (x) and marks (y), plot the scatter
diagram, compute the correlation coefficient, and interpret the results.

Hours of study (x)   3   4   6   4   2     5
Marks (y)            6   8   9   5   4.5   9.5
Solution
First we plot the scatter diagram as follows.

[Figure: Scatter diagram for hours of study (x) and marks (y); horizontal axis X, vertical axis Y]


From the figure it is clear that there is a strong positive relation between hours of
study and marks: students who study more hours tend to obtain higher marks, and
students who study fewer hours tend to obtain lower marks.

Now, we can compute the correlation coefficient to quantify exactly the value of the
correlation between hours of study and marks using the following table.

Y      X    (Y - Ȳ)²               (X - X̄)²        (X - X̄)(Y - Ȳ)
6      3    (6 - 7)² = 1           (3 - 4)² = 1    (3 - 4)(6 - 7) = 1
8      4    (8 - 7)² = 1           (4 - 4)² = 0    (4 - 4)(8 - 7) = 0
9      6    (9 - 7)² = 4           (6 - 4)² = 4    (6 - 4)(9 - 7) = 4
5      4    (5 - 7)² = 4           (4 - 4)² = 0    (4 - 4)(5 - 7) = 0
4.5    2    (4.5 - 7)² = 6.25      (2 - 4)² = 4    (2 - 4)(4.5 - 7) = 5
9.5    5    (9.5 - 7)² = 6.25      (5 - 4)² = 1    (5 - 4)(9.5 - 7) = 2.5

ΣY = 42, Ȳ = 42/6 = 7
ΣX = 24, X̄ = 24/6 = 4
∑(Y - Ȳ)² = 22.5
∑(X - X̄)² = 10
∑(X - X̄)(Y - Ȳ) = 12.5

Note that X̄ is the mean of X, and Ȳ is the mean of Y.

Based on the results from the table we can compute the correlation coefficient as:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{12.5}{\sqrt{10 \times 22.5}} = 0.8333$$

which confirms what the scatter plot shows: there is a strong positive relation
between hours of study and marks.
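
The hand calculation can be checked with a few lines of Python. This is only a sketch that recomputes the three sums from the Example 1 data; the variable names are illustrative and not part of the lecture.

```python
import math

hours = [3, 4, 6, 4, 2, 5]         # X, hours of study
marks = [6, 8, 9, 5, 4.5, 9.5]     # Y, marks

x_bar = sum(hours) / len(hours)    # 4.0
y_bar = sum(marks) / len(marks)    # 7.0

# The three column totals of the table.
s_xx = sum((x - x_bar) ** 2 for x in hours)                          # 10.0
s_yy = sum((y - y_bar) ** 2 for y in marks)                          # 22.5
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))  # 12.5

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))  # 0.8333
```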

2. Regression Model
A regression model is a means of modelling (quantifying) the relationship between
variables.

Suppose there are two variables, one of which is explained by the other; for example,
marks can be explained by hours of study. In this case marks is called the "dependent
variable" and denoted Y, and hours of study is called the "independent variable" and
denoted X. The regression model is then an equation quantifying the relation between
Y and X. Note: because there is only one independent variable X and the relation is
modelled as a straight line, the model is called "simple linear regression", which is
the model we study in this course.

The simple linear regression has the following form:

𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀


where

Y: is the dependent variable (response)

X: is the independent variable (predictor or explanatory)

𝛽0: is the intercept (the value of Y when X = 0)

𝛽1: is the slope of the regression line (the change in Y when X changes by one unit)

𝜀: is the random error (unknown factors that affect Y but are not of interest in the
study or cannot be measured)

In reality the values of 𝛽0 and 𝛽1 are unknown and need to be estimated. The estimate
of 𝛽1 is denoted 𝑏1 and is calculated as follows:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

Similarly, the estimate of 𝛽0 is denoted 𝑏0 and is calculated as follows:

𝑏0 = 𝑦̅ − 𝑏1 𝑥̅

Accordingly, the estimated regression model is written as:


𝑌̂ = 𝑏0 + 𝑏1 𝑋,

where 𝑌̂ is the predicted value of Y using the estimated regression model.
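
The two estimation formulas can be sketched in Python as follows. The function name `fit_simple_regression` is hypothetical, and the sketch assumes equally long numeric lists in which X is not constant.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1), the estimated intercept and slope."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope: sum of cross-products over sum of squares of X
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


# Points lying exactly on the line Y = 1 + 2X recover b0 = 1 and b1 = 2.
print(fit_simple_regression([0, 1, 2], [1, 3, 5]))  # (1.0, 2.0)
```

The slope is computed first because the intercept formula depends on it.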

Example 2:
For the data given in Example 1 on the two variables, hours of study (x) and marks (y),
estimate the regression model equation and interpret the results.
Solution
By "estimate the regression model equation" we mean finding the estimates 𝑏1 and 𝑏0
using their equations given above. To find these estimates we can use the following
table.

Y      X    (X - X̄)²        (X - X̄)(Y - Ȳ)
6      3    (3 - 4)² = 1    (3 - 4)(6 - 7) = 1
8      4    (4 - 4)² = 0    (4 - 4)(8 - 7) = 0
9      6    (6 - 4)² = 4    (6 - 4)(9 - 7) = 4
5      4    (4 - 4)² = 0    (4 - 4)(5 - 7) = 0
4.5    2    (2 - 4)² = 4    (2 - 4)(4.5 - 7) = 5
9.5    5    (5 - 4)² = 1    (5 - 4)(9.5 - 7) = 2.5

ΣY = 42, Ȳ = 42/6 = 7
ΣX = 24, X̄ = 24/6 = 4
∑(X - X̄)² = 10
∑(X - X̄)(Y - Ȳ) = 12.5

Note that X̄ is the mean of X, and Ȳ is the mean of Y.


From the table,

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{12.5}{10} = 1.25,$$

and

$$b_0 = \bar{y} - b_1 \bar{x} = 7 - 1.25 \times 4 = 2$$

Therefore, the estimated regression model is:

𝑌̂ = 𝑏0 + 𝑏1 𝑋
𝑌̂ = 2 + 1.25𝑋

The interpretation is that the predicted value of marks (Y) will be "2" when hours of
study (X) is 0, and the predicted value of marks (Y) will change by 1.25 marks when
hours of study (X) changes by one hour.

Now we can use the estimated model to predict marks (Y) for new values of hours of
study (X). For instance, suppose a student studies for 7 hours, a value that does not
appear in the table. What is the predicted value of marks (Y)? We simply substitute
X = 7 into the estimated model: 𝑌̂ = 2 + 1.25 × 7 = 10.75, so the predicted value of
marks is 10.75 for 7 hours of study.
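
The substitution step can also be written as a small Python function. This is a sketch only; the name `predict_marks` is illustrative, and the default coefficients are the estimates b0 = 2 and b1 = 1.25 obtained in Example 2.

```python
def predict_marks(hours, b0=2.0, b1=1.25):
    """Predicted marks from the estimated model Y-hat = b0 + b1 * X."""
    return b0 + b1 * hours


print(predict_marks(7))  # 10.75
print(predict_marks(0))  # 2.0, the intercept: predicted marks at 0 hours of study
```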

3. Application
A personnel officer, at a certain institution, believes that there is a relationship
between a worker's age and absenteeism. In order to discover this relationship, she
has compiled the following information for 10 randomly selected employees:

Age (X)           19   22   25   27   30   33   36   39   42   57
Days absent (Y)    8   10    9    7    5    6    5    4    2    4

For the given data, answer the following questions:

1. Use the scatter diagram and correlation coefficient to show whether there is a
relation between the worker's age (X) and days absent (Y).

2. Estimate the regression model for the worker's age (X) and days absent (Y), and
explain the meaning of its coefficient estimates.

3. Using the estimated regression model for the worker's age (X) and days absent (Y),
find the predicted value of days absent (Y) when the worker's age (X) equals 35
and 45.

Solution


1. Use the scatter diagram and correlation coefficient to show whether there
is a relation between the worker's age (X) and days absent (Y).

First we plot the scatter diagram as follows.

[Figure: Scatter diagram for worker's age (X) and days absent (Y); horizontal axis "Worker's age", vertical axis "Days absent"]

From the figure it is clear that there is strong negative relation between the worker's
age and days absent. This implies that when the worker's age increases the number
of days absent decreases and vice versa.

Now, we can compute the correlation coefficient to quantify exactly the value of the
correlation between the worker's age and days absent using the following table.

X     Y     (Y - Ȳ)²    (X - X̄)²    (X - X̄)(Y - Ȳ)
19    8     4           196         -28
22    10    16          121         -44
25    9     9           64          -24
27    7     1           36          -6
30    5     1           9           3
33    6     0           0           0
36    5     1           9           -3
39    4     4           36          -12
42    2     16          81          -36
57    4     4           576         -48

ΣX = 330, X̄ = 330/10 = 33
ΣY = 60, Ȳ = 60/10 = 6
∑(Y - Ȳ)² = 56
∑(X - X̄)² = 1128
∑(X - X̄)(Y - Ȳ) = -198

Based on the results from the table we can compute the correlation coefficient as:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{-198}{\sqrt{1128 \times 56}} = -0.79$$

which confirms what the scatter plot shows: there is a strong negative relation
between the worker's age and days absent.
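
As a cross-check of the table totals and of r, the sums can be recomputed from the raw data. This is a sketch with illustrative variable names that are not part of the lecture.

```python
import math

age = [19, 22, 25, 27, 30, 33, 36, 39, 42, 57]   # X, worker's age
days_absent = [8, 10, 9, 7, 5, 6, 5, 4, 2, 4]    # Y, days absent

x_bar = sum(age) / len(age)                  # 33.0
y_bar = sum(days_absent) / len(days_absent)  # 6.0

s_xx = sum((x - x_bar) ** 2 for x in age)                                # 1128.0
s_yy = sum((y - y_bar) ** 2 for y in days_absent)                        # 56.0
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, days_absent))  # -198.0

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 2))  # -0.79
```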


2. Estimate the regression model for the worker's age (X) and days absent (Y),
and explain the meaning of its coefficient estimates.

From the previous table we can directly compute 𝑏1 and 𝑏0 as:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{-198}{1128} = -0.1755 \approx -0.18,$$

and

$$b_0 = \bar{y} - b_1 \bar{x} = 6 - (-0.1755) \times 33 = 11.79$$

Therefore, the estimated regression model is:

𝑌̂ = 𝑏0 + 𝑏1 𝑋
𝑌̂ = 11.79 − 0.18𝑋

The interpretation is that the predicted value of days absent (Y) is 11.79 when the
worker's age (X) is 0 (which has no logical meaning in this case), and that the
predicted value of days absent (Y) decreases by 0.18 days when the worker's age (X)
increases by one year.
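
The two estimates can be reproduced from the table totals in Python. This is a sketch only, with illustrative variable names; it uses the sums and means computed in part 1.

```python
s_xy = -198.0   # sum of (X - X_bar)(Y - Y_bar), from the table
s_xx = 1128.0   # sum of (X - X_bar)^2, from the table
x_bar = 33.0    # mean age
y_bar = 6.0     # mean days absent

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

print(round(b1, 4))  # -0.1755
print(round(b0, 2))  # 11.79
```

Note that the intercept is computed with the slope kept at four decimal places or more, as in the hand calculation, and only the reported slope is rounded to -0.18.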

3. Using the estimated regression model for the worker's age (X) and days absent (Y),
find the predicted value of days absent (Y) when the worker's age (X) equals
35 and 45.

We can predict the values of days absent for new values of the worker's age by
substituting the values into the estimated regression model as follows.

For worker's age = 35: the predicted value of days absent is 𝑌̂ = 11.79 − 0.18 × 35 =
5.49 days.

For worker's age = 45: the predicted value of days absent is 𝑌̂ = 11.79 − 0.18 × 45 =
3.69 days.
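
The two substitutions can be checked with a short Python sketch. The function name `predict_days_absent` is illustrative, and the default coefficients are the rounded estimates 11.79 and -0.18 from part 2; results are rounded to two decimals to avoid floating-point noise.

```python
def predict_days_absent(age, b0=11.79, b1=-0.18):
    """Predicted days absent from the estimated model Y-hat = b0 + b1 * X."""
    return b0 + b1 * age


print(round(predict_days_absent(35), 2))  # 5.49
print(round(predict_days_absent(45), 2))  # 3.69
```

Passing unrounded coefficients (for example b1 = -198/1128) gives slightly different predictions, because rounding the slope to two decimals changes the result noticeably when it is multiplied by ages in the thirties and forties.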


4. Exercise
A car was driven at various speeds and fuel economy in km travelled per litre was
recorded.

Speed (X)   10   20   15   20   60   90   75    55   30   40
Km/L (Y)    12   11   11   12   10    9   9.5   10   11   11

For the given data, answer the following questions:

1. Use the scatter diagram and correlation coefficient to show whether there is a
relation between the speed (X) and fuel economy in km per litre (Y).

2. Estimate the regression model for the speed (X) and fuel economy in km per litre
(Y), and explain the meaning of its coefficient estimates.

3. Using the estimated regression model for the speed (X) and fuel economy in km per
litre (Y), find the predicted value of the fuel economy (Y) when the speed (X)
equals 35 and 45.
