
Chap 2. Data presentation

The document discusses methods of data organization and presentation in biostatistics, emphasizing the importance of summarizing raw data to reveal patterns. It covers various techniques such as frequency distributions, tables, graphs, and charts to effectively display data, including histograms and pie charts. Additionally, it highlights the significance of understanding data characteristics before deciding on precise analysis methods.

Uploaded by

edison

Biostatistics

Chap 3. Methods of data organization and presentation
• The data collected in a survey are called raw data.
• In most cases, useful information is not immediately evident from the mass of unsorted data.
• Collected data need to be organized so as to condense the information they contain and show patterns of variation clearly.
• Even small data sets are difficult to comprehend without some summarization.
Note!!
• Precise methods of analysis can be decided upon only when the characteristics of the data are understood.
3.1. Tables
Frequency Distributions
• When analysing voluminous collected data, it is useful to put them into compact tables.
• Data are presented in a meaningful way by preparing a frequency distribution of the variable.
• A frequency distribution tells how often a variable takes on each of its possible values.
• For two qualitative variables, a contingency table (cross-tabulation) is useful.
Frequency Distributions
• A Relative Frequency Distribution presents the corresponding proportions of observations within the classes, i.e. Frequency / n as a proportion, or (Frequency / n) x 100 as a percentage (n = size of sample).

Gender    Absolute frequency   Relative frequency (proportion)   Relative frequency (%)
Females   10                   10/18                             (10/18) x 100 = 55.6
Males     8                    8/18                              (8/18) x 100 = 44.4
Total     18
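
As a minimal sketch, the relative frequencies of the gender table can be computed in Python as follows. The variable names and the print layout are illustrative assumptions, not part of the original slides.

```python
# Absolute frequencies taken from the gender table.
counts = {"Females": 10, "Males": 8}

n = sum(counts.values())  # sample size, here 18

# Relative frequency as a proportion, then as a percentage of n.
proportions = {group: freq / n for group, freq in counts.items()}
percentages = {group: 100 * p for group, p in proportions.items()}

for group, freq in counts.items():
    print(f"{group:8s} {freq:3d} {proportions[group]:.3f} {percentages[group]:.1f}%")
```

The percentages always add up to 100, which is a quick check that no category has been forgotten.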
Frequency distribution for continuous variables
• Frequency distributions present data in a relatively compact form, give a good overall picture, and contain information that is adequate for many purposes, but some things can usually be determined only from the original data.
• The construction of a grouped frequency distribution consists essentially of four steps:
• (1) choosing the classes,
• (2) sorting (or tallying) the data into these classes,
• (3) counting the number of items in each class, and
• (4) displaying the results in the form of a table.
• First we group observations into classes by choosing a set of contiguous, non-overlapping intervals, called class intervals (grouping the observations in this way forms a discrete variable from the continuous variable).
Cumulative and Relative Frequencies:
• When the frequencies of two or more classes are added up, the resulting totals are called Cumulative Frequencies.
• These frequencies help us to find the total number of items whose values are less than or greater than some value.
• Relative frequencies express the frequency of each value or class as a percentage of the total frequency.
Mid-Point of a class interval and the determination of Class Boundaries
• The mid-point or class mark (Xc) of an interval is the value which lies mid-way between the lower true limit (LTL) and the upper true limit (UTL) of a class.
• It is calculated as:
• $X_c = \dfrac{\text{Upper Class Limit} + \text{Lower Class Limit}}{2}$
• The true limits are what the tabulated limits would correspond to if one could measure exactly.
Frequency Distribution (continuous variable)
Age of 6209 persons chosen randomly in Rwanda

Classes    Centres   Absolute frequencies   Relative frequencies   Cumulative frequencies
ci         xi        ni                     fi (%)                 Fi (%)
[0-10[     5         311                    5.0                    5.0
[10-20[    15        120                    1.9                    6.9
[20-30[    25        2255                   36.3                   43.3
[30-40[    35        2090                   33.7                   76.9
[40-50[    45        870                    14.0                   90.9
[50-60[    55        399                    6.4                    97.4
[60-70[    65        127                    2.0                    99.4
[70-80[    75        37                     0.6                    100.0
Total                6209                   100

Relative frequencies are expressed in %, i.e. fi = (ni / 6209) x 100.
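
The relative and cumulative columns of the age table can be recomputed from the absolute frequencies alone. The following is a minimal sketch under that assumption; the list names are illustrative.

```python
from itertools import accumulate

# Class intervals [lower, upper[ and absolute frequencies from the age table.
classes = [(0, 10), (10, 20), (20, 30), (30, 40),
           (40, 50), (50, 60), (60, 70), (70, 80)]
counts = [311, 120, 2255, 2090, 870, 399, 127, 37]

total = sum(counts)                                # 6209 persons
centres = [(lo + hi) / 2 for lo, hi in classes]    # class mid-points xi
relative = [100 * c / total for c in counts]       # fi (%)
cumulative = list(accumulate(relative))            # Fi (%), running sum of fi

for (lo, hi), x, c, f, F in zip(classes, centres, counts, relative, cumulative):
    print(f"[{lo}-{hi}[  {x:4.0f}  {c:5d}  {f:5.1f}  {F:6.1f}")
```

Cumulating the unrounded percentages, as done here, avoids the small drift that appears when already-rounded values are added.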
Histogram
[Figure: Histogram of age, with absolute frequency (0 to 2,500) and relative frequency (0% to 40.3%) on the vertical axes and age (0 to 80) on the horizontal axis. Bars are labelled 311 or 5%, 120 or 1.9%, 2255 or 36.3%, 2090 or 33.7%, 870 or 14%, 399 or 6.4%, 127 or 2% and 37 or 0.6%; total 6209 or 100%. Mean = 33.03, Std. Dev. = 12.348, N = 6209.]
Density of relative frequency

Classes    Mid-points   Absolute frequencies   Relative frequencies (%)   Density of relative frequencies
ci         xi           ni
[0-10[     5            311                    5.0 = (311/6209) x 100     0.5
[10-20[    15           120                    1.9                        0.19
[20-30[    25           2255                   36.3                       3.63
[30-40[    35           2090                   33.7                       3.37
[40-50[    45           870                    14.0                       1.4
[50-60[    55           399                    6.4                        0.64
[60-70[    65           127                    2.0                        0.2
[70-80[    75           37                     0.6                        0.06
Total                   6209                   100                        10

Density of relative frequency = relative frequency divided by the class interval, i.e. 10.
N.B.
When classes have the same interval, we can use
• Absolute frequencies
• Relative frequencies
• Density of frequency

When classes have different intervals, we use the density of frequency.
• Why? To avoid overestimating the area of long-interval classes
=> the total area always stays equal to 100% (or 1).
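
To illustrate why density is used with unequal classes, here is a small sketch. It reuses the age counts but merges the four oldest classes into a single class [40-80[; that merged class is an assumption made purely for illustration and does not appear in the original slides.

```python
# Age classes with unequal widths: the four oldest classes of the age table
# are merged into one class [40-80[ (hypothetical, for illustration only).
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 80)]
counts = [311, 120, 2255, 2090, 870 + 399 + 127 + 37]

total = sum(counts)
relative = [100 * c / total for c in counts]          # % of persons per class
widths = [hi - lo for lo, hi in classes]              # class intervals in years
density = [f / w for f, w in zip(relative, widths)]   # % per year of age

# Bar area = density x width, so the areas sum to 100% whatever the widths.
total_area = sum(d * w for d, w in zip(density, widths))

for (lo, hi), f, d in zip(classes, relative, density):
    print(f"[{lo}-{hi}[  relative = {f:5.1f}%  density = {d:.2f}")
print(f"total area = {total_area:.1f}%")
```

Plotted by relative frequency, the merged class would be a bar of height 23.1% spread over 40 years and would look dominant; plotted by density, its height is about 0.58, in proportion to the other bars.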
3.2. Graphs
• Frequency distributions can often be displayed effectively using graphs or diagrams.
• Diagrams give a very clear picture of data.
• The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a graph than from a table.
• They are more attractive and facilitate comparison.
• But they should not be used when comparison is either not possible or not necessary.
• Diagrammatic representation is not an alternative to tabulation.
• It gives only an approximate idea, so diagrams are not suitable where greater accuracy is needed.
Histogram
• For quantitative continuous data.
• Put the observations in ascending order.
• Take a number of classes near to Ntot
• Define classes [1-2] [3-4] or [1-2[ [2-3[...
• Calculate the frequency (absolute, relative, cumulative) or the frequency density for each class.
• Draw a rectangle for each class.
• The base of the rectangle = the interval of the class.
• The height of each bar gives the frequency in each interval (or the frequency density when the intervals differ).
• The area of the rectangle is proportional (not necessarily equal) to the number of observations in that class.
• The total area equals 100% of all observations.
Histogram (continuous variable)
[Figure: Histogram of age, with frequency (0 to 2,500) and density of relative frequencies (0 to 3.5) on the vertical axes and age (0 to 80) on the horizontal axis. Mean = 33.03, Std. Dev. = 12.348, N = 6209.]
Exercise no 1
The table shows the age distribution of 1st year Law students in 2003.

Draw the histogram of absolute frequencies and relative frequencies (= percentages)
(one figure => 2 legends).
Exercise n° 2
In a class, the marks (out of 20) of students in an exam are:

9 15 15 7 11 12 14 10 11 8
8 11 11 14 8 10 11 11 10 11
7 15 12 6 14 9 15 8 8 14
15 10 11 13 11 11 15 12 15 10
11 9 8 13 9 8 13 14 15 15
10 10 7 15 15 7 14 9 3 10
15 10 15 8 15 8 14 9 6 13
12 11 9 9 13 14 8 13 8 5

Make a table of 10 classes with equal intervals (0-2; 2-4; 4-6;…18-20) showing the absolute, relative and cumulative frequencies and the density of relative frequencies.
Exercise n° 3

Classes    Absolute freq   Relative freq (%)   Cumulative freq (%)   Dens of rel freq
[0-2[      0               0.00                0.00                  0.00
[2-4[      1               1.25                1.25                  0.63
[4-6[      1               1.25                2.50                  0.63
[6-8[      6               7.50                10.00                 3.75
[8-10[     19              23.75               33.75                 11.88
[10-12[    21              26.25               60.00                 13.13
[12-14[    10              12.50               72.50                 6.25
[14-16[    22              27.50               100.00                13.75
[16-18[    0               0.00                100.00                0.00
[18-20[    0               0.00                100.00                0.00
Total      80              100                                       50

Draw the histogram of the densities of relative frequencies.
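
The four construction steps (choose the classes, tally, count, display) can be sketched on the 80 exam marks of Exercise n° 2. This is one possible implementation, not the method used to prepare the slides; placing a mark of exactly 20 in the last class is an assumption, since the half-open interval [18-20[ would otherwise exclude it.

```python
# The 80 marks (out of 20) listed in Exercise n° 2.
marks = [
    9, 15, 15, 7, 11, 12, 14, 10, 11, 8,
    8, 11, 11, 14, 8, 10, 11, 11, 10, 11,
    7, 15, 12, 6, 14, 9, 15, 8, 8, 14,
    15, 10, 11, 13, 11, 11, 15, 12, 15, 10,
    11, 9, 8, 13, 9, 8, 13, 14, 15, 15,
    10, 10, 7, 15, 15, 7, 14, 9, 3, 10,
    15, 10, 15, 8, 15, 8, 14, 9, 6, 13,
    12, 11, 9, 9, 13, 14, 8, 13, 8, 5,
]

# Step 1: choose the classes, 10 intervals of width 2 from 0 to 20.
width, n_classes = 2, 10

# Steps 2 and 3: tally each mark into its class [lower, lower + width[.
# Assumption: a mark of exactly 20 is counted in the last class.
counts = [0] * n_classes
for m in marks:
    counts[min(m // width, n_classes - 1)] += 1

n = len(marks)
relative = [100 * c / n for c in counts]     # relative frequency (%)
cumulative, running = [], 0.0
for f in relative:
    running += f
    cumulative.append(running)               # cumulative frequency (%)
density = [f / width for f in relative]      # density of relative frequency

# Step 4: display the results as a table.
for k in range(n_classes):
    lo = k * width
    print(f"[{lo}-{lo + width}[  {counts[k]:3d}  {relative[k]:6.2f}  "
          f"{cumulative[k]:7.2f}  {density[k]:6.2f}")
```

The densities sum to 100 / 2 = 50 because every class has width 2, which matches the total of the last column in the table.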
Frequency Polygon
• If we join the midpoints of the tops of the adjacent rectangles of the histogram with line segments, a frequency polygon is obtained.
• When the polygon is continued to the X-axis just outside the range of the observed values, the total area under the polygon is equal to the total area under the histogram.
• Note that it is not essential to draw a histogram in order to obtain a frequency polygon.
[Figure: Frequency polygon]
Ogive or Cumulative Frequency Curve
• When the cumulative frequencies of a distribution are graphed, the resulting curve is called an Ogive curve.
• To construct an Ogive curve:
• i) Compute the cumulative frequency of the distribution.
• ii) Prepare a graph with the cumulative frequency on the vertical axis and the true upper class limits (class boundaries) of the intervals scaled along the X-axis (horizontal axis).
• The true lower limit of the lowest class interval is included in the X-axis scale; it is also the true upper limit of the next lower interval, which has a cumulative frequency of 0.
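
The construction can be sketched by computing the points of the curve for the age distribution given earlier (8 classes of width 10, 6209 persons). Only the coordinates are computed here; passing them to a plotting library is left out, and the variable names are illustrative.

```python
from itertools import accumulate

# True upper class limits and absolute frequencies of the age distribution.
upper_limits = [10, 20, 30, 40, 50, 60, 70, 80]
counts = [311, 120, 2255, 2090, 870, 399, 127, 37]

# The curve starts at the true lower limit of the lowest class (here 0),
# where the cumulative frequency is 0.
x = [0] + upper_limits
y = [0] + list(accumulate(counts))

ogive_points = list(zip(x, y))
print(ogive_points)
```

Each point (x, y) reads as "y persons are younger than x years", so the last point always carries the total number of observations.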
The line diagram
• The line graph is especially useful when a variable is measured at each of many consecutive points in time.
• The time, in weeks, months or years, is marked along the horizontal axis, and the value of the quantity being studied is marked on the vertical axis.
• The distance of each plotted point above the base-line indicates its numerical value.
• The line graph is suitable for depicting the trend of a series over a long period.
[Figure: The line diagram]
2. Bar Chart
• Bar diagrams are used to display absolute or relative frequency distributions, or to compare the frequency distributions of categorical variables (ordinal or nominal).
• When we represent data using a bar diagram, all the bars must have equal width and the distance between bars must be equal.
A. Simple bar chart
• It is a one-dimensional diagram in which the
bar represents the whole of the magnitude.
The height or length of each bar indicates the
size (frequency) of the figure represented.
[Figure: Bar chart of marital status (Single, Married, Divorced, Widowed), with percentage (0 to 45%) on the vertical axis.]
B. Multiple bar chart
• In this type of chart the component figures are shown as separate bars adjoining each other.
• The height of each bar represents the actual value of the component figure.
• It depicts the distributional pattern of more than one variable.
[Figure: Multiple bar chart of marital status (Single, Married, Divorced, Widowed) for Male and Female, with percentage (0 to 50%) on the vertical axis.]
C. Component (or sub-divided) Bar Diagram
• Bars are sub-divided into component parts of the figure.
• These sorts of diagrams are constructed when each total is built up from two or more component figures.
[Figure: Component bar diagram]
4. Pie-chart
• For displaying the relative frequency distribution of qualitative or quantitative discrete data.
• It is a circle divided into sectors so that the areas of the sectors are proportional to the frequencies.
3.3. Summarizing data
• The data must be summarized as succinctly (concisely, briefly) as possible, since the number of sample points is frequently large and it is easy to lose track of the overall picture by looking at all the data at once.
3.3.1. Descriptive Statistics
• Quantities and techniques used to describe a sample characteristic or illustrate the sample data.
• For numeric variables, there are two commonly reported types of descriptive measures: location and dispersion.
1. Measures of Central Tendency (Location)
• The tendency of statistical data to concentrate at certain values is called the “Central Tendency”, and the various methods of determining the actual value at which the data tend to concentrate are called measures of Central Tendency or averages.
• Common measures of location are:
(i) the Mean, which represents the arithmetic average of all measurements in the population.
(ii) the Median, which represents the point where half the measurements fall above it and half the measurements fall below it.
(iii) the Mode, which represents the value or class with the highest frequency in the sample / population.
a) The Mean
• Let x1, x2, x3, …, xn be the realised values of a random variable X, from a sample of size n. The sample arithmetic mean is defined as:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example
Example 1: The systolic blood pressures of seven patients were as follows:
151, 124, 132, 170, 146, 124 and 113.

The mean is
$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$
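
As a quick check of Example 1, the sample mean can be computed directly from its definition. This is a minimal sketch; the variable names are illustrative.

```python
# Systolic blood pressures of the seven patients in Example 1.
pressures = [151, 124, 132, 170, 146, 124, 113]

n = len(pressures)
total = sum(pressures)   # sum of the observations
mean = total / n         # arithmetic mean

print(f"n = {n}, sum = {total}, mean = {mean:.2f}")
```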
$\bar{x} = \frac{\sum x}{n}$, where $\sum x$ is the sum of the observed values.

Example 2.

Marks out of 20 for 20 students

15 7 12 10 8
11 14 10 11 11
15 6 9 8 14
16 13 11 12 10

Mean = 223/20 = 11.15
b)The Median and Mode
• If the sample data are arranged in increasing
order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is
an even number
• The mode is the most commonly occurring
value.

2. Median:
Position of the median in the values rearranged into order of magnitude (smallest first):

$\text{position} = \frac{2n + 2}{4}$, where n is the sample size

Example:
Marks out of 20 for 5 students: 15 7 12 10 8

Position   1   2   3    4    5
Value      7   8   10   12   15

n = 5
Position of the median = 3
Value of the median = 10
N.B.: Median = 50th percentile = P50
Example 1 – n is odd
The reordered systolic blood pressure data seen earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered data, i.e. 132.

Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.
Example 2 – n is even
Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings, i.e. (274 + 292) ÷ 2 = 283.

Two men have the same cholesterol level, so the Mode is 274.
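
Both examples can be checked with Python's standard `statistics` module. This is a sketch, not part of the original course material; `multimode` is used rather than `mode` so that a tie between several values would be reported as a list of modes.

```python
from statistics import median, multimode

pressures = [151, 124, 132, 170, 146, 124, 113]   # Example 1, n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]      # Example 2, n = 6 (even)

# median() sorts the data itself; for even n it averages the two middle values.
median_bp = median(pressures)
median_chol = median(cholesterol)

# multimode() returns every value that shares the highest frequency.
modes_bp = multimode(pressures)
modes_chol = multimode(cholesterol)

print(median_bp, modes_bp)
print(median_chol, modes_chol)
```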
1. Mode: the value or class with the highest frequency in the sample / population

Marks out of 20 of 20 students (MCQ):
15 7 12 10 8
11 14 10 11 11
15 6 9 8 14
16 13 11 12 10

Mode = 11
For a continuous variable: modal class
Examples of unimodal distributions (one mode)
Rem: Bimodal distribution: 2 modes
Plurimodal (multimodal) distribution: several modes
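Symmetric and unimodal distribution
[Figure: symmetric curve; the mean and the median coincide]

Unimodal distribution with negative skewness
[Figure: left-skewed curve; the mean lies to the left of the median]

Unimodal distribution with positive skewness
[Figure: right-skewed curve; the median lies to the left of the mean]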
Skewness
• If extremely low or extremely high observations are present in a distribution, the mean tends to shift towards those scores.
• Based on the type of skewness, distributions can be:
a) Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.
b) Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extreme large scores are scattered at the right end.
c) Symmetrical distribution: it is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
c) Geometric mean
• The geometric mean (GM) is a type of mean or average which indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean, which uses their sum).
• It is obtained by taking the nth root of the product of “n” values, i.e., if the values of the observations are denoted by $x_1, x_2, \ldots, x_n$, then

$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$

• For instance, the geometric mean of two numbers, say 2 and 8, is just the square root of their product; that is, $\sqrt{2 \times 8} = 4$.
• As another example, the geometric mean of the three numbers 4, 1, and 1/32 is the cube root of their product (1/8), which is 1/2; that is, $\sqrt[3]{4 \times 1 \times 1/32} = 1/2$.
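
The two worked examples can be reproduced with a few lines of Python. The function below is a minimal sketch that assumes all values are strictly positive; the name `geometric_mean` is chosen here for illustration.

```python
import math

def geometric_mean(values):
    """Return the nth root of the product of n strictly positive values."""
    n = len(values)
    return math.prod(values) ** (1 / n)

gm_two = geometric_mean([2, 8])            # square root of 16
gm_three = geometric_mean([4, 1, 1 / 32])  # cube root of 1/8

print(f"{gm_two:.4f} {gm_three:.4f}")
```

For long lists the product can overflow or underflow, so in practice the mean of the logarithms is often exponentiated instead; the direct product is kept here because it mirrors the definition.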
2. Measures of Dispersion
• Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.
• Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range
1. Range
The difference between the maximum and the minimum value in the data set
Range = Max – Min
E.g. data: -4 -3 -1 1 3 5
Range = 5 – (-4) = 9
• easy to calculate
• useful for “best” or “worst” case scenarios
• sensitive to extreme values
2. Variance
The variance is the mean of squared deviations from the mean. It is never negative, and it is zero only when all values are equal.

Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where N is the population size

Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

E.g. data: 4 3 1 2 3 5
Mean: 18/6 = 3
Squared deviations from the mean: 1 0 4 1 0 4
Sum of squared deviations from the mean: 10
Variance: $s^2 = 10/5 = 2$
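
The worked example can be followed step by step in Python. This sketch also shows the population formula for comparison, under the assumption that the six values would then be treated as a complete population.

```python
from statistics import pvariance, variance

data = [4, 3, 1, 2, 3, 5]
n = len(data)

mean = sum(data) / n                            # 18 / 6 = 3
squared_dev = [(x - mean) ** 2 for x in data]   # 1, 0, 4, 1, 0, 4
ss = sum(squared_dev)                           # sum of squared deviations = 10

sample_var = ss / (n - 1)     # divides by n - 1 = 5
population_var = ss / n       # divides by N = 6 (whole population assumed)

print(sample_var, population_var)
print(variance(data), pvariance(data))   # same results from the statistics module
```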
3. Standard Deviation
• The sample standard deviation, s, is the square root of the variance:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

• s has the advantage of being in the same units as the original variable x.
Example
Data          Deviation     Deviation²
151           13.86         192.02
124           -13.14        172.73
132           -5.14         26.45
170           32.86         1079.59
146           8.86          78.45
124           -13.14        172.73
113           -24.14        582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86
$\bar{x} = 137.14$
Example (contd.)
$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$

Therefore,
$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$
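
The same calculation can be checked in Python, both from the definition and with `statistics.stdev`. This is a minimal sketch with illustrative variable names.

```python
import math
from statistics import stdev

# Systolic blood pressures of the seven patients.
pressures = [151, 124, 132, 170, 146, 124, 113]

n = len(pressures)
mean = sum(pressures) / n
ss = sum((x - mean) ** 2 for x in pressures)   # sum of squared deviations
s = math.sqrt(ss / (n - 1))                    # sample standard deviation

print(f"sum of squares = {ss:.2f}, s = {s:.1f}")
print(f"statistics.stdev gives {stdev(pressures):.1f}")
```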
4. Coefficient of Variation
• In some cases the variance of a variable changes with its mean.
• The coefficient of variation (CV) or relative standard deviation (RSD) is a measure of relative variability.
• It is the ratio of data dispersion (standard deviation) to the mean and shows the extent of variability in relation to the mean:

$CV = \left( \frac{s}{\bar{x}} \right) \times 100\%$

• The CV is not affected by multiplicative changes in scale.
• Consequently, it is a useful way of comparing the dispersion of variables independently of the unit in which the measurements were taken.
• Generally, small values of CV are considered best, since they mean that the variability in measurements is small relative to their mean (measurements are consistent in their magnitudes).
• i.e., the higher the CV, the greater the dispersion.
Example
The CV of the blood pressure data is:

$CV = 100 \times \left( \frac{19.6}{137.1} \right) \% = 14.3\%$

i.e., the standard deviation is 14.3% as large as the mean.
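
The following sketch recomputes the CV of the blood pressure data and illustrates that it does not change under a multiplicative change of unit. The conversion from mm Hg to kPa (factor 0.133322) is added here only as an illustration and is not part of the original example.

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

pressures_mmhg = [151, 124, 132, 170, 146, 124, 113]
pressures_kpa = [x * 0.133322 for x in pressures_mmhg]   # same data, other unit

cv_mmhg = coefficient_of_variation(pressures_mmhg)
cv_kpa = coefficient_of_variation(pressures_kpa)

print(f"CV in mm Hg: {cv_mmhg:.1f}%")
print(f"CV in kPa:   {cv_kpa:.1f}%")
```

Both the mean and the standard deviation are multiplied by the same factor, so the factor cancels in the ratio.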
5. Inter-quartile range
• The Median divides a distribution into two halves.
• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
– 25% of the data lie below Q1 (and 75% lie above Q1),
– 25% of the data lie above Q3 (and 75% lie below Q3).
• The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e.
IQR = Q3 - Q1
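4. Quartiles and interquartile range
• The quartiles: the points below which 25%, 50% and 75% of the scores lie. Their positions in the ordered data (N = number of scores) are:

1st Quartile = P25: position $\frac{1N + 1}{4}$
2nd Quartile = P50 = median: position $\frac{2N + 2}{4}$
3rd Quartile = P75: position $\frac{3N + 3}{4}$
• Interquartile range: P75 – P25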
Example
The ordered blood pressure data are:
113 124 124 132 146 151 170

Q1 = 124 (position 2), Q2 = 132 (position 4), Q3 = 151 (position 6)

The Inter-Quartile Range (IQR) is 151 - 124 = 27
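
The position rule k(N + 1)/4 can be sketched as below. When a position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values; that interpolation is an assumption, since the slides only show a case where the positions are whole numbers. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    lower = int(position)           # whole part of the position
    fraction = position - lower     # fractional part, 0 for whole positions
    if fraction == 0:
        return ordered[lower - 1]
    # Assumed convention: linear interpolation between the two neighbours.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Return (Q1, Q2, Q3) using positions k(N + 1)/4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch needs at least 3 observations")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

pressures = [151, 124, 132, 170, 146, 124, 113]
q1, q2, q3 = quartiles(pressures)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Statistical packages use several slightly different quartile conventions, so results for small samples may differ from one program to another.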
Exercise
In one class, the marks (out of 20) obtained in biostatistics by a sample of students are as follows:

9, 13, 14, 18, 20, 12, 14, 10, 11, 19

Calculate the measures of central tendency (mean, mode, median) and dispersion (variance, standard deviation, CV, range, IQR).
Mean 14
Mode 14
Median 13.5

Variance 14.67
Std dev 3.83
Range 11
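
The answers above can be verified with the sketch below, which also computes the CV and the IQR that the answer table leaves out. The IQR uses the default "exclusive" method of `statistics.quantiles`, which places the quartiles at positions k(N + 1)/4 with linear interpolation; other conventions would give a slightly different value.

```python
from statistics import mean, median, multimode, quantiles, stdev, variance

marks = [9, 13, 14, 18, 20, 12, 14, 10, 11, 19]

m = mean(marks)
s = stdev(marks)
q1, q2, q3 = quantiles(marks, n=4)   # default method="exclusive"

summary = {
    "mean": m,
    "mode": multimode(marks),
    "median": median(marks),
    "variance": variance(marks),     # sample variance, divisor n - 1
    "std dev": s,
    "range": max(marks) - min(marks),
    "CV (%)": 100 * s / m,
    "IQR": q3 - q1,
}

for name, value in summary.items():
    print(f"{name:9s} {value}")
```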
3.3.2. Box-plots
• A box-plot is a visual description of the distribution based on
– Minimum
– Q1
– Median
– Q3
– Maximum
• Useful for comparing large sets of data
Building a box plot
1. Calculate important values
• 1st quartile: Q1/4
• Median: Q1/2
• 3rd quartile: Q3/4
[Figure: box-plot layout marking, from left to right, LSV, P25, Med, P75 and HSV]
2. Calculate limit values
• Calculate the interquartile range (IQR): Q3/4 – Q1/4
• Low limit value: Q1/4 – 1.5 x IQR
• High limit value: Q3/4 + 1.5 x IQR
3. Look for subsequent values
• Low fence: the lowest real value >= low limit value
• High fence: the highest real value <= high limit value
4. Look for outliers (real values below the low limit value or above the high limit value)
Example 1
The heights of 12 individuals, arranged in increasing order, are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Calculate Q1, Q3 and the median.

Example 1: Box-plot
[Figure: box-plot of the 12 heights]
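
The four building steps can be sketched for the 12 heights as follows. The quartiles come from `statistics.quantiles` with its default "exclusive" method (positions k(N + 1)/4 with linear interpolation), which is an assumption about the convention intended here; a different quartile convention would shift the box slightly.

```python
from statistics import quantiles

heights = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]

# Step 1: important values.
q1, med, q3 = quantiles(heights, n=4)

# Step 2: limit values.
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# Step 3: fences = most extreme real values still inside the limits.
low_fence = min(x for x in heights if x >= low_limit)
high_fence = max(x for x in heights if x <= high_limit)

# Step 4: outliers = real values outside the limits.
outliers = [x for x in heights if x < low_limit or x > high_limit]

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
print(f"limits = ({low_limit}, {high_limit}), fences = ({low_fence}, {high_fence})")
print(f"outliers = {outliers}")
```

With these data no height falls outside the limits, so the whiskers simply run from the minimum to the maximum.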
Remarks
• The box is always limited by Q1 and Q3.
• But the whiskers can represent several things, depending on the author or program:
– the minimum and the maximum
– the low and high subsequent values
– one standard deviation above and below the mean
http://en.wikipedia.org/wiki/Box_plot

• Importance of the box-plot:
– examination of the symmetry of the distribution
– visualisation of outlier values
QUIZ 1/ 2 marks

Determine the type of data that we have

1. Marital status (divorced, single, married, widow)

2. Number of teeth you have
