05 Class RegressionCorrelation

This document discusses linear regression and correlation analysis. It begins with an overview of simple linear regression, describing how to estimate the regression coefficients to find the linear equation relating an outcome variable y to a predictor variable x. It discusses how the slope and y-intercept are estimated using the least squares method. The document provides an example using data on body weight and plasma volume. It describes how to interpret the regression coefficients and perform inference on the slope. The document also defines the correlation coefficient and how it differs from the slope in measuring the strength of the linear relationship between two variables.

BIOSTATISTICS

Prof. dr. Siswanto Agus Wilopo, S.U., M.Sc., Sc.D.


Department of Public Health,
Faculty of Medicine,
Universitas Gadjah Mada

1
LINEAR REGRESSION AND
CORRELATION

2
Lecture Topics

Scatterplots
Review equation of a line
Estimates of regression coefficients
Inference on regression slope
Correlation vs. slope
Correlation not causation
Multiple linear regression

3
Regression and Correlation

Describing association between two continuous variables
– Scatterplot
– Simple linear regression
– Correlation coefficient

4
Regression and Correlation

GOAL: to find a linear equation relating an outcome, y, to a predictor, x
Measure the magnitude, direction, and strength of the relationship between y and x

5
Example
 Association between body weight (x) and plasma volume (y)

Subject   Body Weight (kg)   Plasma Volume (l)
1         58.0               2.75
2         70.0               2.86
3         74.0               3.37
4         63.5               2.76
5         62.0               2.62
6         70.5               3.49
7         71.0               3.05
8         66.0               3.12

6
Simple Linear Regression
y is the dependent variable
x is the independent variable
– Predictor
– Regressor
– Covariate
We try to predict y from x
Called “simple” because there is only one
independent variable x
If there are several independent variables,
it’s called multiple linear regression

7
Simple Linear Regression

[Scatterplot: Y, plasma volume (liters) vs. X, body weight (kg)]


8
How Do We Choose
the “Right” Line?
 The linear regression line is the line
which gets “closest” to all of the points
 How do we measure closeness to
more than one point?
minimize Σ (yᵢ − point_on_lineᵢ)², summing over i = 1, …, n

9
Simple Linear Regression

[Scatterplot: Y, plasma volume (liters) vs. X, body weight (kg)]


10
How Do We Choose
the “Right” Line?
The linear regression line minimizes the
sum of squares of vertical deviations
Least squares line
Recall, from algebra, there are two values
which uniquely define any line
– y-intercept—where the line crosses the
y-axis (when x = 0)
– Slope—the “rise over the run”—how
much y changes for each one unit
change in x
11
The Equation of a Line
 y = mx + b
– b = y-intercept
– m = slope
 More common notation amongst
statisticians!
y = b̂0 + b̂1 x
– b̂0 = y-intercept
– b̂1 = slope

12
The Equation of a Line
 b̂0 and b̂1 are called regression coefficients
 These two quantities are estimated by the
“least squares” method
 The intercept b̂0 is the estimated expected
value of y when x is 0
 The slope b̂1 is the estimated expected
change in y corresponding to a unit increase
in x

13
The Equation of a Line
[Diagram: the fitted line y = b̂0 + b̂1 x, showing the intercept b̂0 where the line crosses the y-axis and the slope b̂1 as the rise per unit increase in x]
14
The Slope, b̂1
 The slope b̂1 is the estimated expected change in y corresponding to a unit increase in x
– b̂1 = 0: No association between y and x
– b̂1 > 0: Positive association (as x increases, y tends to increase)
– b̂1 < 0: Negative association (as x increases, y tends to decrease)

15
The Slope
[Diagram: lines with positive slope (b̂1 > 0), zero slope (b̂1 = 0), and negative slope (b̂1 < 0)]
16
Simple Linear Regression Model
 Points don’t fall exactly on line, so to represent that
we add “ε”
– y = b̂0 + b̂1 x + ε
– Observed value = mean + “error”
 ε is often called the “residual” or “error”
 Model without “ε” sometimes written

ŷ = b̂0 + b̂1 x
where ŷ represents “predicted value” or point on
the line
 Note:
εi = yi − ŷi
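As a small sketch of this notation (assuming NumPy), the fitted values ŷ and residuals ε can be computed for the example data; with an intercept in the model, the least-squares residuals average to zero.

```python
import numpy as np

x = np.array([58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0])  # body weight (kg)
y = np.array([2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12])  # plasma volume (l)

# Least-squares fit (same closed-form estimates as before)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x    # predicted values (points on the line)
eps = y - y_hat        # residuals: eps_i = y_i - y_hat_i
print(eps.mean())      # essentially 0, as assumed for the error term
```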
17
Simple Linear Regression Model

Assumptions about ε
– Random noise
– Sometimes positive, sometimes
negative but, on average, it’s 0
– Normally distributed about 0

18
Plasma Example
Suppose we used least squares regression
to estimate a line relating plasma volume to
body weight from 8 data points
Plasma volume = 0.0857 + .0436 * weight
y = 0.0857 + .0436 x
Estimate of the intercept = 0.0857
Estimate of the slope = 0.0436
– Interpretation: For each kilogram
increase in body weight, we expect
plasma volume to increase by .0436
liters
19
Plasma Example
Measurement of plasma volume is very time consuming
Body weight is easy to measure, so use the equation and body weight to estimate plasma volume
– If we know an individual’s weight, we
can estimate his plasma volume
– Estimate the plasma volume for a 60 kg man
– y = 0.0857 + .0436 x
– For this individual, x = 60
→ y = 0.0857 + .0436 * 60 = 2.7 liters
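The prediction on this slide can be reproduced with simple arithmetic (a minimal sketch using the estimates quoted above):

```python
b0, b1 = 0.0857, 0.0436   # intercept and slope from the slide
weight = 60.0             # body weight in kg
plasma = b0 + b1 * weight
print(round(plasma, 2))   # 2.7 liters, as on the slide
```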

20
Estimated vs True Slope
 Is the estimated slope (.0436) a good
estimate?
 Is it close to the true population slope?
 Standard error of estimated slope: Measures
precision of estimated slope
 Standard error of estimated slope, b̂1 , is
influenced by sample size (n)
– The larger n, the smaller SE(b̂1 )
– Just like we’ve seen before, more
information means more precision!
21
Precision of Estimated Slope

Standard error of the estimated slope, b̂1, is influenced by how close the points are to the line.
The closer the points are to the line, the smaller SE(b̂1)
22
95% Confidence Interval for Slope

Plasma/Weight example
– 0.0436 ± 2.447 * .0153
– 0.0436 ± .0374
– (.0062, .0810)
With 95% confidence, we can say the true value of b1 is between .0062 and .0810 (i.e., a plausible range of values for b1 is from .0062 to .0810)
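A short sketch of where these numbers come from (assuming SciPy is available): 2.447 is the 97.5th percentile of the t distribution with n − 2 = 6 degrees of freedom for the 8 data points.

```python
from scipy import stats

b1, se_b1, n = 0.0436, 0.0153, 8       # slope, its SE, and sample size from the slides
t_crit = stats.t.ppf(0.975, df=n - 2)  # about 2.447 for 6 degrees of freedom
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(round(lower, 4), round(upper, 4))  # roughly 0.0062 and 0.0810
```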

23
Hypothesis Testing
 H0 : b1 = 0 (slope is zero)
 Ha : b1 ≠ 0
 We need to calculate a test statistic based
on our sample and compare to appropriate
distribution to get a p-value
t = estimated_slope / SE(estimated_slope) = b̂1 / SE(b̂1) = 0.0436 / 0.0153 = 2.85

 p-value = .03 based on t-test


 Plasma volume is positively associated with
body weight (p = .03)
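The test statistic and two-sided p-value can be reproduced as follows (a sketch assuming SciPy; values taken from the slides):

```python
from scipy import stats

b1, se_b1, n = 0.0436, 0.0153, 8
t_stat = b1 / se_b1                              # about 2.85
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value, 6 df
print(round(t_stat, 2), round(p_value, 3))       # roughly 2.85 and 0.03
```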
24
Some Notes about Simple Linear
Regression
Model predicts y from x
Relationship between y and x is a straight
line (linear)—start with scatterplot
Can only use equation for predicting y
values based on x values inside range of
original x values
– Beware of extrapolation!
Variability of values about the line is
approximately normal
Pairs of data points (x, y) are independent
25
Correlation:
How Close Do the Points Fall on the Line?

How close the points fall to the line is not measured by the slope
Slope measures magnitude (the size of the relationship)
Slope does not measure “closeness” of the points to the line
Closeness is measured by the correlation coefficient

26
The Correlation Coefficient, r

Measures the direction and strength of the linear association between x and y
The correlation coefficient is between −1 and +1
– r > 0: Positive association
– r < 0: Negative association
– r = 0: No association
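A minimal sketch (assuming NumPy) of computing r for the plasma example introduced earlier:

```python
import numpy as np

x = np.array([58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0])  # body weight (kg)
y = np.array([2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12])  # plasma volume (l)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 2))           # about 0.76, as shown on a later slide
```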

27
[Four example scatterplots of Y vs. X illustrating different correlation coefficients: r = 0.01, r = 0.68, r = 0.98, r = −0.9]

28
Correlation in the Plasma Example
[Scatterplot: Y, plasma volume (liters) vs. X, body weight (kg); r = .76]


29
Examples of the Correlation
Coefficient
 Important to note that correlation measures the strength of linear association
 There could be a strong non-linear relationship between y and x, and r may not catch it
[Scatterplot: a strong non-linear relationship between Y and X for which r = 0]
30
Examples of the Correlation
Coefficient
Outliers can really affect the correlation coefficient
One extreme point can change r sizably
[Scatterplot: data with one extreme point; r = .7]

31
The Moral

A picture is worth . . .
Always look at your scatterplot when
trying to interpret correlation coefficient

32
Slope (b1) vs. the Correlation
Coefficient (r)
Both indicate direction of association (positive
or negative)
Slope b̂1 is the estimated expected change in y per unit increase in x
Larger slopes do not necessarily mean a stronger linear association
The correlation coefficient is scaled between −1 and +1
r can be small, but you can still have a slope b1 that is significantly different from 0

33
Testing for an Association

We can test whether the correlation coefficient is 0 or not, but it turns out this is equivalent to testing whether the slope b1 is 0

34
R², the Coefficient of Determination

R² is just as it sounds (r², but written with an uppercase R)
Fraction of observed variability in y that can be explained by x
R² is a number between 0 and 1
When there is a straight-line relation, some of the variation in y is accounted for by the fact that as x changes, it pulls y along with it
35
Plasma/Weight Example

In the plasma volume and body weight example:

r = .76
R² = .76² = .58
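As a quick check of the arithmetic:

```python
r = 0.76
r_squared = r ** 2
print(round(r_squared, 2))  # 0.58: about 58% of the variability in plasma
                            # volume is explained by body weight
```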

36
Notes and Caveats on Correlation
Coefficient
Measures only linear relationships
Other kinds of relationships are also important
Look at and graph the data
Sensitive to outliers
Useful when x values are measured, but not
controlled by the experimental design
That is, x and y are random
What about ‘confounders’?

37
Multiple Linear Regression

Sometimes we are interested in
– predicting the value of a continuous variable as a function of more than 1 independent variable
– estimating the relationship between a continuous outcome and one independent variable in the presence of confounding factors

38
Multiple Linear Regression
Multiple linear regression (MLR) is a method
to estimate:
– an equation to predict a value of the
dependent variable y from multiple
independent variables
(the x’s)
– the size and statistical significance of
the relationship between a dependent
variable, y, and one predictor, x1, after
adjusting for confounders (x2 . . .)

39
Hemoglobin (Hb) and Packed Cell
Volume (PCV)
Suppose we have collected data on 21
women and want to investigate the
relationship between Hb levels and
PCV
Previous research suggests age is related to both PCV and Hb level
In order to correctly estimate the Hb/PCV relationship, we need to control for age
40
Hemoglobin (Hb) and Packed Cell
Volume (PCV)
One solution:
– Stratify women by age and perform
separate simple regressions of Hb on
PCV
Potential problems:
– We only have 21 women in our sample:
Strata could be quite small
– Estimates unstable [large SE (slopes)]
– This would be a pain

41
Hemoglobin (Hb) and Packed
Cell Volume (PCV)
 Better solution:
– Perform multiple linear regression with Hb level
as outcome; PCV and age as predictors
– Here we would need a multiple linear
regression equation with two predictors
– Just keep adding x’s to the simple linear
regression equation

y = b̂0 + b̂1 x1 + b̂2 x2 + b̂3 x3 + ...

42
Multiple Regression Equation
 MLR with 2 predictors

y = b̂0 + b̂1 x1 + b̂2 x2


 y = dependent variable
 x1 = first independent variable
 x2 = second independent variable
 b̂0 = intercept
 b̂1 = regression coefficient (slope) for x1
 b̂2 = regression coefficient (slope) for x2
43
Hemoglobin and PCV

The following is an excerpt from the data set on 21 women:

Sub. #   Hb (g/dl)   PCV (%)   Age (yrs)
1        12.0        35        20
2        10.7        39        22
3        12.4        47        26
4        14.2        53        28
5        13.1        30        28
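As a sketch of how such a model could be fit (assuming NumPy), ordinary least squares can be applied to a design matrix with columns for the intercept, PCV, and age. Only the 5-row excerpt above is available here, so the printed coefficients are illustrative and will not match the estimates reported later for the full 21-woman data set.

```python
import numpy as np

# Excerpt only (5 of the 21 women); coefficients below are illustrative
hb  = np.array([12.0, 10.7, 12.4, 14.2, 13.1])   # hemoglobin (g/dl)
pcv = np.array([35.0, 39.0, 47.0, 53.0, 30.0])   # packed cell volume (%)
age = np.array([20.0, 22.0, 26.0, 28.0, 28.0])   # age (years)

# Design matrix for y = b0 + b1*PCV + b2*age
X = np.column_stack([np.ones_like(pcv), pcv, age])
coef, *_ = np.linalg.lstsq(X, hb, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)
```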

44
Hemoglobin vs. PCV(%)
[Scatterplot: hemoglobin (g/dl) vs. PCV (%) with fitted regression line; b̂0 = 5.78, b̂1 = .203]
Interpretation: Hemoglobin increases by
.203 gm/dl for every one unit increase in PCV
45
Hemoglobin and PCV vs. Age(yrs)

[Two scatterplots: hemoglobin (g/dl) vs. age (years), and PCV (%) vs. age (years)]

46
Hemoglobin and PCV
Main questions:
– Is the relationship between hemoglobin
and PCV only because both are related
to age?
– Can we control for age as a potential
confounder?
How does MLR estimate size of relationship
between Hb and PCV while controlling for
age?
– ‘statistical adjustment’

47
MLR results
y = b̂0 + b̂1 x1 + b̂2 x2

where x1 = PCV, x2 = age (yrs)

b̂0 = 5.52
b̂1 = 0.102
b̂2 = 0.101

48
Hemoglobin and PCV
The resulting equation:
– y = 5.52 + .102*PCV + .101*Age
– (It is a coincidence that coefficients for age and
PCV are nearly equal)
Interpretation
– For a given age, hemoglobin increases by .102
gm/dl for every one percent increase in PCV
– Recall, the coefficient of PCV from the simple
linear regression was .203
Conclusion
– Hemoglobin is positively associated with PCV
after accounting for age (p = .004)

49
Hemoglobin and Age

Interpretation
– For a given PCV level, hemoglobin
increases by .101 gm/dl for every one
year increase in age
Conclusion
– Hemoglobin is positively associated with age after accounting for PCV (p < .001)

50
Regression Coefficients
 General interpretation of regression
coefficients from equation:

y = b̂0 + b̂1 x1 + b̂2 x2


 b̂1 is the expected change in y per unit
change in x1 if x2 is held constant
 b̂1 estimates the relationship between y and
x1 adjusted (controlling) for x2
 (and vice versa for b̂2 )
51
Hemoglobin and PCV
 Suppose we received some more information
about the women in our sample—each woman’s
menopausal status
 we are interested in testing for a relationship
between hemoglobin level and menopausal status
 The problem:
– Age is a potential confounder!
 We will need a multiple regression model with two
independent variables
y = b̂0 + b̂1 x1 + b̂2 x2
where x1 is age and x2 is “menopausal status”
52
Indicator (Dummy) Variables
How can we represent “menopausal status”
as a variable?
– There are only two possible values, “pre-” and
“post-” menopause
– x2 = 1 if woman is post-menopausal
= 0 if woman is pre-menopausal
A predictor taking on one of two values, “0” or
“1,” is called an indicator or dummy variable
Used when we want to include grouping
factors as covariates / predictors
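A minimal sketch of constructing such an indicator variable (the status values below are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical menopausal status for six women
status = np.array(["pre", "post", "pre", "post", "post", "pre"])

# Indicator (dummy) variable: 1 if post-menopausal, 0 if pre-menopausal
x2 = (status == "post").astype(float)
print(x2)  # [0. 1. 0. 1. 1. 0.]
```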

53
Interpreting Coefficients
 y = b̂0 + b̂1 x1 + b̂2 x2
– y = hemoglobin level
– x1 = age
– x2 = “menopausal status”
– b̂1 = expected change in hemoglobin for a one year increase in age, after adjusting for menopause

 How can we interpret b̂2?

54
Interpreting Coefficients
 Write out the equation for both “types”
of women with same age (say 50)
– Post: ŷ = b̂0 + b̂1(50) + b̂2
– Pre: ŷ = b̂0 + b̂1(50)

 Difference in two equations is exactly b̂2 !


 b̂2 = expected difference in hemoglobin level for post-menopausal women as compared to pre-menopausal women, after adjusting for age
55
Results
b̂0 = 10.08
b̂1 = 0.07
b̂2 = 1.90

Interpretation
– After controlling for age, we would
expect post-menopausal women to
have hemoglobin level 1.9 units higher
than pre-menopausal women
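Using the estimates on this slide, the predicted difference between a post- and a pre-menopausal woman of the same age (say 50) works out to exactly b̂2, as the previous slide showed algebraically:

```python
b0, b1, b2 = 10.08, 0.07, 1.90   # estimates from this slide
age = 50

hb_post = b0 + b1 * age + b2       # post-menopausal (x2 = 1)
hb_pre = b0 + b1 * age             # pre-menopausal  (x2 = 0)
print(round(hb_post - hb_pre, 2))  # 1.9, regardless of the age chosen
```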

56
Statistical Inference
 95% confidence intervals can be calculated
for coefficients in MLR

b1  tn k 1se(b1 )


Where n = number of observations, k =
number of covariates
 T-tests can also be used to test for
significance of association between y and x,
adjusting for other variables (tests b1 = 0)
b1
t n  k 1 
se(b1 )
57
