Chapter 4 Data Management
Learning Outcomes:
At the end of the chapter, the student should be able to:
1. Use a variety of statistical tools to process and manage numerical data.
2. Use the methods of linear regression and correlations to predict the value of a variable given certain
conditions.
3. Advocate the use of statistical data in making important decisions.
LESSON 1: DATA
Data are pieces of information, usually numbers, recorded and used for the purpose of analysis. Data
can come from a census or surveys or observations. Usually, we gather large amounts of data. These data
need to be organized, processed and interpreted to become meaningful.
Frequency Distribution is defined as the arrangement of the gathered data by categories plus their
corresponding frequencies and class marks or midpoint.
EXAMPLE: Given the following scores in a Statistics examination, make a frequency distribution table.
50 85 91 54 62 72 68 70 79 90
58 35 52 61 93 98 60 62 76 99
64 78 49 88 73 51 69 80 93 89
68 98 66 96 55 77 57 61 70 92
46 73 83 91 79 53 62 59 82 93
Creating a Grouped Frequency Distribution
Step 1: Find the largest and smallest values and compute for the Range.
Lowest Score = 35, Highest Score = 99
Range (R) = 99 – 35 = 64
Step 3: Organize the class intervals. Use the lowest score as the lower limit of the lowest class, then add the
class width c (here c = 9) to each lower limit to obtain the next one.
Class interval
35-43
44-52
53-61
62-70
71-79
80-88
89-97
98-106
Steps 4 and 5: Tally each score under the class interval to which it belongs. Summarize the tallies under
column f (frequency).
Step 6: Compute the Midpoint for each class interval and put it under column M.
Class interval f M
35-43 1 39
44-52 5 48
53-61 9 57
62-70 10 66
71-79 8 75
80-88 5 84
89-97 9 93
98-106 3 102
N = 50
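The tallying in the steps above can also be done programmatically. The following is a minimal Python sketch, assuming a class width of 9 (inferred from the class intervals shown) and integer scores; the function name grouped_frequency is illustrative only and does not come from the module.

```python
# Scores from the Statistics examination example.
scores = [
    50, 85, 91, 54, 62, 72, 68, 70, 79, 90,
    58, 35, 52, 61, 93, 98, 60, 62, 76, 99,
    64, 78, 49, 88, 73, 51, 69, 80, 93, 89,
    68, 98, 66, 96, 55, 77, 57, 61, 70, 92,
    46, 73, 83, 91, 79, 53, 62, 59, 82, 93,
]


def grouped_frequency(data, width):
    """Return (lower, upper, frequency, midpoint) rows for integer data."""
    lower = min(data)          # lowest score is the first lower limit
    highest = max(data)
    rows = []
    while lower <= highest:
        upper = lower + width - 1                     # class limits are inclusive
        f = sum(lower <= x <= upper for x in data)    # tally of scores in the class
        rows.append((lower, upper, f, (lower + upper) / 2))
        lower += width                                # next lower limit
    return rows


# Assumption: class width c = 9, as in the intervals 35-43, 44-52, ...
table = grouped_frequency(scores, width=9)
for lower, upper, f, mid in table:
    print(f"{lower}-{upper}\t{f}\t{mid:g}")
print("N =", sum(row[2] for row in table))
```

Each printed row lists the class interval, its frequency f, and its midpoint M, so the output can be compared line by line with the table above.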
Step 7: Find the cumulative frequencies. Depending on what you are trying to accomplish, it may not be
necessary to find the cumulative frequencies.
Data can also be presented in graphical form. A graph is often the most effective means of organizing and
presenting statistical data because the important relationships are brought out more clearly in visual
form than in a table of numbers.
HISTOGRAM - A graph which displays the data by using continuous vertical bars of various heights to represent
frequencies. The horizontal axis can be either the class boundaries, the class marks, or the class limits.
(http://www.albany.edu/~reinhold/m308/Assgnmt1_HowTo.htm)
FREQUENCY POLYGON - A line graph. The frequency is placed along the vertical axis and the class midpoints
are placed along the horizontal axis. These points are connected with lines.
(http://www.albany.edu/~reinhold/m308/Assgnmt1_HowTo.htm)
[Figure: graph of the frequency distribution; horizontal axis: Scores, vertical axis: frequency]
After the data have been organized or presented using frequency distributions or graphs, analysis and
interpretation come in. Interpretation is the process of making sense of numerical data that has been
collected, presented and analyzed.
EXAMPLE: Using the frequency distribution table below, answer the questions that follow:
Answers:
1. 10% of the students obtained a score within 80-88.
2. 15 students have scored lower than 62.
3. The percentage of the students who have passed the Statistics examination is 34%.
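Answers of this kind can be checked against the grouped frequency table from Lesson 1. The sketch below is an illustration only: it assumes the table in question is that Lesson 1 table and that the passing score is 80, which is not stated in the text but is consistent with the 34% figure.

```python
from itertools import accumulate

# Class intervals and frequencies from the grouped frequency table in Lesson 1.
classes = [(35, 43), (44, 52), (53, 61), (62, 70),
           (71, 79), (80, 88), (89, 97), (98, 106)]
freqs = [1, 5, 9, 10, 8, 5, 9, 3]
n = sum(freqs)

# "Less than" cumulative frequencies: running total of the f column.
cum_freqs = list(accumulate(freqs))

# 1. Percentage of students with a score within 80-88.
pct_80_88 = 100 * freqs[classes.index((80, 88))] / n

# 2. Number of students who scored lower than 62 (all classes up to 53-61).
below_62 = cum_freqs[classes.index((53, 61))]

# 3. Percentage who passed, ASSUMING a passing score of 80 (not given in the text).
PASSING_SCORE = 80
passed = sum(f for (lower, _), f in zip(classes, freqs) if lower >= PASSING_SCORE)
pct_passed = 100 * passed / n

print(pct_80_88, below_62, pct_passed)
```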
LESSON 2: MEASURES OF CENTRAL TENDENCY
The first type of descriptive statistics identifies the center of the distribution of scores. These are
called measures of central tendency because they all identify the center of the distribution in different ways.
The three most important measures of central tendency are the mean, the median and the mode.
MEAN (x̄) - The mean is the average of a set of scores. It is by far the most common measure of central
tendency in statistics. It is also the measure most sensitive to extreme values, because every score enters its computation.
Arithmetic Mean – The sum of the values of a group of items divided by the number of such items.
The sample mean: \[ \bar{x} = \frac{\sum x}{n} \]
Where: \(\bar{x}\) = sample mean
\(\sum x\) = the sum of all the scores
\(n\) = total number of cases
The population mean: \[ \mu = \frac{\sum x}{N} \]
Where: \(\mu\) = population mean
\(\sum x\) = the sum of all the scores
\(N\) = total number of cases
EXAMPLE: Consider the scores of ten people who took a make-up quiz in Algebra.
12 14 16 10 5 8 18 7 10 4
The sum of the scores is \(\sum x = 104\), so the mean score is
\[ \bar{x} = \frac{\sum x}{n} = \frac{104}{10} = 10.4 \]
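The same computation can be written as a short Python sketch, shown here only to mirror the formula for the sample mean; the variable names are illustrative.

```python
# Scores of the ten people who took the make-up quiz in Algebra.
scores = [12, 14, 16, 10, 5, 8, 18, 7, 10, 4]

total = sum(scores)        # sum of all the scores
n = len(scores)            # total number of cases
mean = total / n           # sample mean: sum of scores divided by n

print(total, n, mean)
```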
Weighted Arithmetic Mean – can be expressed as the sum of the values multiplied by their
corresponding weights divided by the total weight.
The formula is: \[ \bar{x} = \frac{\sum fx}{\sum f} \]
Where: \(f\) = weight or frequency of each item
\(x\) = value of each item
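As an illustration of the weighted mean formula, the sketch below applies it to the grouped frequency table from Lesson 1, treating each class midpoint M as the value x and each class frequency as the weight f. Using the midpoints this way is an assumption made for the illustration, and the result only approximates the mean of the raw scores.

```python
# Midpoints (x) and frequencies (f) from the grouped frequency table in Lesson 1.
midpoints = [39, 48, 57, 66, 75, 84, 93, 102]
freqs = [1, 5, 9, 10, 8, 5, 9, 3]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))   # sum of f times x
sum_f = sum(freqs)                                      # sum of the weights
weighted_mean = sum_fx / sum_f

print(sum_fx, sum_f, weighted_mean)
```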
EXAMPLE: The final grades of a student at the end of semester are the following:
\[ \bar{x} = \frac{1753}{20} = 87.65 \]
\[ Md = \frac{5 + 7}{2} = 6 \]
MODE (Mo) - It is the most frequently occurring score in a series. A distribution in which every score occurs
only once has n modes. A distribution where a single score is most frequent has one mode and is called
unimodal. When there are ties for the most frequent score, the distribution is bimodal if two scores tie or
multimodal if more than two scores tie.
Arrange the scores, so that it will be easier to find the most frequently occurring score.
16 18 21 21 23 23 23 24 25 26 28 30
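The mode of this arranged list can be found with Python's standard statistics module. This is a minimal sketch; the labels unimodal, bimodal and multimodal follow the definitions given above.

```python
from statistics import multimode

# Arranged scores from the example above.
scores = [16, 18, 21, 21, 23, 23, 23, 24, 25, 26, 28, 30]

# multimode returns every value tied for the highest frequency.
modes = multimode(scores)

if len(modes) == 1:
    kind = "unimodal"
elif len(modes) == 2:
    kind = "bimodal"
else:
    kind = "multimodal"

print(modes, kind)
```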
A MEASURE OF DISPERSION describes how spread out or scattered a set of data is. It is
also known as a measure of variation or a measure of spread.
A. Range
- It is the simplest measure of dispersion.
- It is the difference between the highest (maximum) and lowest (minimum) values.
R = Highest Value − Lowest Value (for ungrouped data)
Interpretation: Based on the computed range for sets A, B, and C, it can be concluded that set A has
greater variability than B and C.
Steps in Computing the Mean Absolute Deviation (MAD) for Ungrouped Data
1. Compute the mean of the distribution, \(\bar{x}\).
2. Subtract the mean from each individual score and take the absolute value, \(|x - \bar{x}|\).
3. Add up the absolute differences between the individual scores and the mean, \(\sum |x - \bar{x}|\).
4. To get the MAD, substitute the values into the formula
\[ MAD = \frac{\sum |x - \bar{x}|}{n} \]
x        x − x̄      |x − x̄|
22       -5          5
24       -3          3
26       -1          1
28        1          1
30        3          3
32        5          5
Σx = 162             Σ|x − x̄| = 18
\[ \bar{x} = \frac{\sum x}{n} = \frac{162}{6} = 27 \]
\[ MAD = \frac{\sum |x - \bar{x}|}{n} = \frac{18}{6} = 3.00 \]
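The four steps for the mean absolute deviation can be sketched in Python as follows, using the same six values. The variable names are illustrative.

```python
data = [22, 24, 26, 28, 30, 32]

n = len(data)
mean = sum(data) / n                                # Step 1: mean of the distribution
abs_deviations = [abs(x - mean) for x in data]      # Step 2: |x - mean| for each score
total_abs_deviation = sum(abs_deviations)           # Step 3: sum of the absolute deviations
mad = total_abs_deviation / n                       # Step 4: divide by n

print(mean, total_abs_deviation, mad)
```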
D. Variance and Standard Deviation
Standard Deviation
The standard deviation is the square root of the variance. The population standard deviation is the square
root of the population variance, and the sample standard deviation is the square root of the sample variance.
The standard deviation is expressed in the same units as the data.
EXAMPLE: Compute for the variance and standard deviation of the following sample data:
x        x − x̄      (x − x̄)²
22       -5          25
24       -3          9
26       -1          1
28        1          1
30        3          9
32        5          25
Σx = 162             Σ(x − x̄)² = 70
\[ \bar{x} = \frac{\sum x}{n} = \frac{162}{6} = 27 \]
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{70}{6 - 1} = \frac{70}{5} = 14.00 \]
\[ s = \sqrt{14.00} = 3.74 \]
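The sample variance and standard deviation for this example can be reproduced with the short Python sketch below. It follows the sample formula with n − 1 in the denominator; the variable names are illustrative.

```python
import math

data = [22, 24, 26, 28, 30, 32]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]   # (x - mean)^2 for each value
sum_sq = sum(squared_deviations)                       # sum of squared deviations
variance = sum_sq / (n - 1)                            # sample variance uses n - 1
std_dev = math.sqrt(variance)                          # standard deviation

print(sum_sq, variance, round(std_dev, 2))
```

For a population rather than a sample, the divisor would be n instead of n − 1.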
The standard deviation is the most useful and important measure of variation/dispersion. It is widely
used in research and is used in drawing inferences from samples to populations. The interpretation of the
standard deviation is of great importance in Research and Statistics.
Chebyshev’s Theorem
Chebyshev's theorem gives a lower bound on the proportion of scores in a distribution that lie within a
given distance of the mean: for any k > 1, at least \(1 - \frac{1}{k^2}\) of the data lie within k standard deviations of the mean.
EXAMPLE: If the mean score of the students enrolled in a Business Statistics class is 66 points with a standard
deviation of 5 points, at least what percentage of the scores must lie between 46 and 86?
Solution:
\[ \bar{x} - k(s) = 46 \]
\[ 66 - k(5) = 46 \]
\[ 5k = 20 \rightarrow k = 4 \]
\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 0.9375 \text{ or } 93.75\% \]
∴ At least 93.75% of the data lie between 46 and 86.
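The Chebyshev computation above can be sketched in Python as follows. The helper chebyshev_bound is a hypothetical name introduced for this illustration, and it assumes the interval is symmetric about the mean, as in the example.

```python
def chebyshev_bound(mean, sd, lower, upper):
    """Return (k, minimum proportion of data between lower and upper)."""
    k = (upper - mean) / sd
    # Assumption: the interval is symmetric about the mean.
    if abs((mean - lower) / sd - k) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return k, 1 - 1 / k ** 2


k, proportion = chebyshev_bound(mean=66, sd=5, lower=46, upper=86)
print(k, proportion)
```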
LESSON 4: MEASURE OF RELATIVE POSITION
Measures of Relative Position are conversions of values, usually standardized test scores, to show where
a given value stands in relation to other values of the same grouping.
\[ z = \frac{X - \mu}{\sigma} \]
where z is the z-score, X is the value of the element or the raw score, μ is the population mean, and
σ is the standard deviation.
EXAMPLE: Raidah scored 55 on a mathematics test that had a mean of 45 and a standard deviation of 10. On
an English test with a mean of 56 and a standard deviation of 12, she had scored 70. Compare her relative
positions on the two tests.
Solution: Convert her scores for the two tests to standard score:
For Mathematics:
\[ z = \frac{X - \mu}{\sigma} = \frac{55 - 45}{10} = 1.00 \]
For English:
\[ z = \frac{X - \mu}{\sigma} = \frac{70 - 56}{12} = 1.17 \]
Since the standard score for English is larger, her relative position in English is higher than her relative
position in Mathematics.
EXAMPLE: Suppose that the mean of a test is 122 and the standard deviation s is 24. If Jose earns a score of
146 on the test, his deviation from the mean is 146 − 122 = 24. Dividing Jose's deviation of 24 by the s of the
test gives him a z of 1.00. If Edgar's score is 110, what is Edgar's z-score?
\[ z = \frac{110 - 122}{24} = -0.50 \]
EXAMPLE: Two equivalent intelligence tests are given to similar groups, but the tests are designed with different
scales. The statistics for the tests are listed below. Which is better: a score of 145 on Test I or a score of 60
on Test II?
            Test I      Test II
Mean        100         40
s           15          5
A score of 145 on Test I is (145 − 100)/15 = 3.00 standard deviations above the mean, and a score of 60 on
Test II is (60 − 40)/5 = 4.00 standard deviations above the mean. This implies that 60 on Test II is the better score relative to its group.
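The z-score comparisons in the examples above can be checked with a small Python sketch. The function name z_score is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Raidah's two tests.
z_math = z_score(55, mean=45, sd=10)
z_english = z_score(70, mean=56, sd=12)

# Jose and Edgar on the same test.
z_jose = z_score(146, mean=122, sd=24)
z_edgar = z_score(110, mean=122, sd=24)

# Test I versus Test II.
z_test1 = z_score(145, mean=100, sd=15)
z_test2 = z_score(60, mean=40, sd=5)

print(round(z_math, 2), round(z_english, 2))
print(z_jose, z_edgar)
print(z_test1, z_test2)
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly: the larger z indicates the higher relative position.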
PERCENTILE
A percentile is a measure indicating the value below which a given percentage of observations in a group
of observations falls. For example, the 80th percentile is the value below which 80% of the observations may
be found.
Example: On an examination given to 4,500 students, Mia's score of 340 was higher than the scores of 2,898
students who took the examination. What is the percentile of Mia's score?
Solution:
\[ \text{Percentile} = \frac{2898}{4500} \times 100 = 0.644 \times 100 = 64.4 \approx 64 \]
Mia's score is at the 64th percentile.
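The percentile computation in this example reduces to one line of arithmetic. The Python sketch below mirrors it; the function name percentile_rank is illustrative and not a library call.

```python
def percentile_rank(count_below, total):
    """Percentage of observations that fall below a given score."""
    return 100 * count_below / total


# Mia's score was higher than 2,898 of the 4,500 scores.
mia = percentile_rank(count_below=2898, total=4500)
print(mia, round(mia))
```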
QUARTILE
Refers to the value that divides the distribution into four (4) equal parts.
Q1 – refers to the value of the distribution that falls on the first one fourth of the distribution arranged
in magnitude.
Q2 – two-fourths or half of the distribution. This is also the median of the distribution.
Q3 – three-fourths of the distribution.
Step 3: Find the median of the data values that fall below Q2.
16 18 21 21 22 22
Q1 = 21
Step 4: Find the median of the data values that are above Q2.
24 25 26 28 30 30
\[ \text{median} = \frac{26 + 28}{2} = 27 \]
Q3 = 27
Box-and-Whisker Plots
A box-and-whisker plot or boxplot is a diagram based on the five-number summary of a data set. The
five-number summary of a data set consists of the minimum, Q1 (the 1st quartile), the median, Q3 (the
3rd quartile), and the maximum value of the data set.
To construct a box-and-whisker plot, first draw an equal interval scale on which to make the box plot.
The boxplot is a visual representation of the distribution of the data. Greater distances in the diagram should
correspond to greater distances between numeric values.
Using the equal interval scale, draw a rectangular box with one end at Q1 and the other end at Q3. Then
draw a vertical segment at the median value. Finally, draw two horizontal segments, one on each side of the
box: one down to the minimum value and one up to the maximum value. These segments are called the
"whiskers".
EXAMPLE: Draw a box-and-whisker plot for the data set
{16, 18, 21, 21, 22, 22, 23, 24, 25, 26, 28, 30, 30}.
Solution:
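The five-number summary needed for this plot can be computed with the sketch below. It assumes the median-of-halves method described in the quartile steps, in which Q1 and Q3 are the medians of the values below and above Q2; other quartile conventions can give slightly different values. The function name five_number_summary is illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    # For an odd count the median itself is excluded from both halves.
    lower_half = data[: n // 2]
    upper_half = data[(n + 1) // 2:]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


data_set = [16, 18, 21, 21, 22, 22, 23, 24, 25, 26, 28, 30, 30]
minimum, q1, q2, q3, maximum = five_number_summary(data_set)
print(minimum, q1, q2, q3, maximum)
```

The box then spans Q1 to Q3, the segment inside the box sits at the median, and the whiskers extend to the minimum and maximum.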
The normal probability curve is the most commonly used theoretical distribution in statistical
inference. De Moivre developed the mathematical equation of the normal curve in 1733. It is sometimes called
the Gaussian distribution in honor of Carl Friedrich Gauss, who derived the equation in the 19th century.
In most cases, it is used to describe the distribution of variables such as the grades of students, the weights
or heights of persons, the incomes of families, or IQ.
The standard normal curve represents a normal curve with mean 0 and standard deviation 1.
For a population:
\[ z_x = \frac{x - \mu}{\sigma} \]
For a sample:
\[ z_x = \frac{x - \bar{x}}{s} \]
Tables and calculators are used to determine the area under the normal curve. The following table of
Areas under the Normal Curve will help. Since the normal curve is symmetrical about the mean, the area between
the mean and a negative z-score equals the area between the mean and the positive z-score of the same magnitude.
EXAMPLE: Find the area under the standard normal curve for the following z-scores and draw and shade the
corresponding area on the curve.
a. Between z = 0 and z = 0.50
Solution: Using the table, the area between the mean and a z-score of 0.50 corresponds to
0.1915.
EXAMPLE: Express Delivery Service has found that the delivery times for packages are normally distributed,
with a mean of 16 hours and a standard deviation of 2 hours.
a. What is the probability that a randomly selected package will be delivered in between 12 and
17 hours?
b. What percent of packages will be delivered in more than 18.5 hours?
Solution:
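a. Convert 12 and 17 hours to standard scores.
\[ z_1 = \frac{x_1 - \mu}{\sigma} = \frac{12 - 16}{2} = -2.0 \]
\[ z_2 = \frac{x_2 - \mu}{\sigma} = \frac{17 - 16}{2} = 0.5 \]
From the table, a z-score of -2 has an area of 0.4772 and the area for a z-score of 0.5 is 0.1915. Because the
two z-scores lie on opposite sides of the mean, the areas are added: 0.4772 + 0.1915 = 0.6687. Hence,
the probability that a randomly selected package will be delivered in between 12 and 17 hours is 0.6687.
b. Convert 18.5 hours to a z-score.
\[ z = \frac{x - \mu}{\sigma} = \frac{18.5 - 16}{2} = 1.25 \]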
From the table, a z-score of 1.25 has a corresponding area of .3944. This is the area from the mean to
the raw score of 18.5. Hence, the area from 18.5 and above is computed as:
𝐴 = 0.5 − 0.3944 = 0.1056
Hence, the percentage of packages that will be delivered in more than 18.5 hours is 10.56%.
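The table lookups in this example can be cross-checked in Python. The sketch below uses NormalDist from the standard statistics module (available in Python 3.8 and later) in place of the printed table; it works with cumulative areas to the left of a value rather than areas measured from the mean, so the arithmetic looks slightly different from the table method.

```python
from statistics import NormalDist

# Delivery times: mean of 16 hours, standard deviation of 2 hours.
delivery = NormalDist(mu=16, sigma=2)

# a. Probability of delivery between 12 and 17 hours:
#    area to the left of 17 minus area to the left of 12.
p_between = delivery.cdf(17) - delivery.cdf(12)

# b. Proportion delivered in more than 18.5 hours:
#    whole area (1) minus the area to the left of 18.5.
p_more = 1 - delivery.cdf(18.5)

print(round(p_between, 4), round(p_more, 4))
```

The rounded results should agree with the table-based answers to four decimal places; small differences in the last digit can occur because the printed table is itself rounded.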
EXAMPLE: 5,000 students participated in a certain test, yielding results that follow the normal distribution
with a mean of 65 points and a standard deviation of 10 points.
a. Find the percentage of students scoring between 75 and 85 points inclusive.
b. Find the percentage of students scoring less than 60 points.
Solution:
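a. Convert 75 and 85 points to z.
\[ z_1 = \frac{x_1 - \mu}{\sigma} = \frac{75 - 65}{10} = 1.0 \]
\[ z_2 = \frac{x_2 - \mu}{\sigma} = \frac{85 - 65}{10} = 2.0 \]
The areas for the z-values of 1.0 and 2.0 are 0.3413 and 0.4772, respectively. Because both z-values lie on the
same side of the mean, the smaller area is subtracted from the larger: 0.4772 − 0.3413 = 0.1359. Hence, 13.59%
of the students scored between 75 and 85 points.
b. Convert 60 points to z.
\[ z = \frac{x - \mu}{\sigma} = \frac{60 - 65}{10} = -0.5 \]
The area corresponding to -0.5 is 0.1915. Since we are looking for the percentage of students scoring less
than 60 points, this area is subtracted from the half of the curve below the mean: 0.5 − 0.1915 = 0.3085.
Hence, 30.85% of the students scored less than 60 points.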
Correlation is a bivariate analysis that measures the strength of association between two variables
and the direction of the relationship.
The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to
+1.0.
When the correlation coefficient equals +1 or –1, there is said to be a perfect degree of
association between the two variables. The closer r is to +1 or –1, the stronger the correlation. The direction
of the relationship is given by the sign of the correlation: + indicates a positive relationship between the
variables and - indicates a negative relationship.
Interpreting Correlation
Correlation is an effect size and so we can verbally describe the strength of the correlation using the
guide that Evans (1996) suggests for the absolute value of r:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
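The guide above can be expressed as a small lookup function. This is a sketch only: the function name describe_strength is illustrative, and values that fall between the printed bands (for example 0.195) are assigned to the lower band, which is an assumption the guide itself does not address.

```python
def describe_strength(r):
    """Verbal label for the absolute value of r, following the bands listed above."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.35))    # sign is ignored; only the magnitude matters
print(describe_strength(-0.85))
```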
Scatterplot
An effective way to see a relationship in data is to display the information as a scatter plot. It shows
how two variables relate to each other by showing how closely the data points fit to a line. If the variables
are correlated, the points will fall along a line or curve. The better the correlation, the tighter the points will
hug the line.
A simple scatterplot can be used to (a) determine whether a relationship is linear, (b) detect outliers
and (c) graphically present a relationship. For example, correlation and regression both assume that the
relationship is linear, so checking linearity with a scatterplot is an important step before analyzing your data with these methods.
Various types of correlation can be interpreted through the patterns displayed on Scatterplots. These
are: positive (values increase together), negative (one value decreases as the other increases), null (no
correlation). The strength of the correlation can be determined by how closely packed the points are to each
other on the graph. Points that end up far outside the general cluster of points are known as outliers.
[Figure: scatterplot of the data points (Series1) with a linear trend line]
Source: https://datavizcatalogue.com/methods/scatterplot.html
1. Select the data you want to graph. (If the data are not yet encoded, encode and label them in
MS Excel first.)
2. Click the Insert tab, and then click Insert Scatter (X, Y) or Bubble Chart. Click Scatter. Click Ok.
3. Click the Design tab, and then click the chart style you want to use.
4. You can quickly edit the chart by clicking the icons available on the right side of the chart.
5. You can also add a trendline to your scatter plot. Right-click a data point in the chart and select Add
Trendline. Select Linear and tick Display Equation on chart.
Pearson Correlation
Pearson r correlation is the most widely used correlation statistic to measure the degree of the
relationship between linearly related variables. The calculation of Pearson’s correlation coefficient and
subsequent significance testing of it requires the following data assumptions to hold: interval or ratio level;
linearly related; and bivariate normally distributed.
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2} \cdot \sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
Where:
r = Pearson r correlation coefficient
n = number of paired values in the data set
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores
EXAMPLE: A study investigated the relationship of height and self-esteem of 20 randomly selected women.
The following are the heights in inches and the level of their self-esteem. Solve for the correlation coefficient
r.
Height in inches (X)    Self-Esteem Level (Y)    X²    Y²    XY
68 4.1 4624 16.81 278.80
71 4.6 5041 21.16 326.60
62 3.8 3844 14.44 235.60
75 4.4 5625 19.36 330.00
58 3.2 3364 10.24 185.60
60 3.1 3600 9.61 186.00
67 3.8 4489 14.44 254.60
68 4.1 4624 16.81 278.80
71 4.3 5041 18.49 305.30
69 3.7 4761 13.69 255.30
68 3.5 4624 12.25 238.00
67 3.2 4489 10.24 214.40
63 3.7 3969 13.69 233.10
62 3.3 3844 10.89 204.60
60 3.4 3600 11.56 204.00
63 4.0 3969 16.00 252.00
65 4.1 4225 16.81 266.50
67 3.8 4489 14.44 254.60
63 3.4 3969 11.56 214.20
61 3.6 3721 12.96 219.60
∑X = 1308    ∑Y = 75.1    ∑X² = 85912    ∑Y² = 285.45    ∑XY = 4937.6
\[ r = \frac{20(4937.6) - (1308)(75.1)}{\sqrt{20(85912) - (1308)^2} \cdot \sqrt{20(285.45) - (75.1)^2}} = 0.731 \]
Interpretation: The coefficient r = 0.731 indicates a strong positive relationship between height and self-esteem
level. In this sample, taller women tend to report higher self-esteem and shorter women tend to report lower
self-esteem; the correlation alone does not show that height causes self-esteem.
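The manual computation of r for this example can be verified with a short Python sketch that applies the same computational formula to the twenty (height, self-esteem) pairs from the table. The function name pearson_r is illustrative.

```python
import math

# Heights (X, in inches) and self-esteem levels (Y) from the table above.
heights = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
           68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]


def pearson_r(xs, ys):
    """Pearson r from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(heights, esteem)
print(round(r, 3))
```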
Linear Regression
In practice, a relationship is often found to exist between two (or more) variables, and one wants to express
this relationship in mathematical form by finding an equation connecting the variables. To do this, first
collect data showing the corresponding values of the variables. Next, plot the points on the
rectangular coordinate system. The resulting graph is sometimes called the scatter plot or scatter diagram.
Linear regression tries to model the relationship between two variables by fitting a linear equation to
the observed data. One variable is considered to be an explanatory (independent) variable, and the other is
considered to be the dependent variable.
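A linear regression line has an equation of the form \(Y = aX + b\), where X is the explanatory variable
and Y is the dependent variable. Here a is the slope of the line and b is the intercept (the value of Y when X = 0).
Using the least-squares method, they are computed as:
\[ a = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \]
\[ b = \frac{\sum y - a\left(\sum x\right)}{n} \]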
Step-by-step Procedure:
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy
(Σ means "sum up").
EXAMPLE: The table below shows some data for the first ten (10) years of a certain Manufacturing and Canning
company, Marina. Each row in the table shows Marina’s sales for a year, and the amount spent on advertising
in that year. Calculate the regression equation for the data using advertising as the explanatory variable.
Year    Advertising (in million pesos)    Sales (in million pesos)
1 18 665
2 23 758
3 25 823
4 28 1078
5 30 1199
6 33 1301
7 39 1472
8 47 1500
9 52 1604
10 61 1699
Solution:
Year    X    Y    XY    X²
1 18 665 11970 324
2 23 758 17434 529
3 25 823 20575 625
4 28 1078 30184 784
5 30 1199 35970 900
6 33 1301 42933 1089
7 39 1472 57408 1521
8 47 1500 70500 2209
9 52 1604 83408 2704
10 61 1699 103639 3721
∑X = 356    ∑Y = 12099    ∑XY = 474021    ∑X² = 14406
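From these sums, the slope and intercept can be obtained with the usual least-squares formulas, a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = (Σy − aΣx) / n. The Python sketch below applies them to the Marina data; the function name fit_line is illustrative.

```python
# Advertising (X) and sales (Y), both in million pesos, for years 1 to 10.
advertising = [18, 23, 25, 28, 30, 33, 39, 47, 52, 61]
sales = [665, 758, 823, 1078, 1199, 1301, 1472, 1500, 1604, 1699]


def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the line Y = aX + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    b = (sum_y - a * sum_x) / n                                    # intercept
    return a, b


a, b = fit_line(advertising, sales)
print(round(a, 3), round(b, 3))

# Illustrative use of the fitted line: predicted sales for a hypothetical
# advertising budget of 40 million pesos (a value chosen only for this sketch).
predicted = a * 40 + b
print(round(predicted, 1))
```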
4. A new dialogue box for Regression will appear. Select the Y Range. This is the response variable (also
called the dependent variable). Select the X Range. These are the explanatory variables (also called
independent variables); these columns must be adjacent to each other. Check Labels. Click in the
Output Range box and select any vacant cell in the worksheet. Check Residuals. Click OK.
5. Excel produces the following Summary Output (rounded to 3 decimal places).
R Square measures how well the regression line fits the data: the closer its value is to 1, the better the fit.
To check if the results are reliable or statistically significant, look at Significance F (0.000052). If this
value is less than 0.05, you're OK. It means that it is statistically significant. If Significance F is greater than
0.05, it's probably better to stop using this set of independent variables. Delete a variable with a high P-value
(greater than 0.05) and rerun the regression until Significance F drops below 0.05.
Coefficients
The regression line is: y = 24.992x + 320.175. In other words, for each unit increase in advertising,
sales increase by 24.992 units, and 320.175 is the predicted sales when advertising is zero. This is important information.
*The same example was used in performing the regression analysis in MS Excel. There might be a slight
difference in the final answer due to manual computation and rounding off of data.