
Chapter 4 Data Management

This document discusses data management and summarization techniques. It covers creating frequency distributions to organize data into classes and calculate frequencies. Guidelines for creating grouped frequency distributions include using 5-20 classes of equal width and ensuring classes are mutually exclusive and collectively exhaustive. The document demonstrates how to create a frequency distribution table and histogram for a set of exam scores. It also introduces frequency polygons for displaying grouped data with lines connecting class midpoint frequencies.

CHAPTER 4: DATA MANAGEMENT

Learning Outcomes:
At the end of the chapter, the student should be able to:
1. Use a variety of statistical tools to process and manage numerical data.
2. Use the methods of linear regression and correlations to predict the value of a variable given certain
conditions.
3. Advocate the use of statistical data in making important decisions.

LESSON 1: DATA
Data are pieces of information, usually numbers, recorded and used for the purpose of analysis. Data
can come from a census or surveys or observations. Usually, we gather large amounts of data. These data
need to be organized, processed and interpreted to become meaningful.
Frequency Distribution is defined as the arrangement of the gathered data by categories plus their
corresponding frequencies and class marks or midpoint.

Grouped Frequency Distribution of Interval Data

Guidelines for classes


1. There should be between 5 and 20 classes.
2. The class width should be an odd number. This will guarantee that the class midpoints are integers
instead of decimals.
3. The classes must be mutually exclusive. This means that no data value can fall into two different
classes
4. The classes must be all inclusive or exhaustive. This means that all data values must be included.
5. The classes must be continuous. There are no gaps in a frequency distribution. Classes that have no
values in them must be included (unless it's the first or last class which are dropped).
6. The classes must be equal in width. The exception here is the first or last class. It is possible to have
a "below ..." or "... and above" class. This is often used with ages.

Creating a Grouped Frequency Distribution


1. Find the largest and smallest values and compute for the range:
Range = Maximum - Minimum
2. Compute the number of classes and the class width using the following formulas:
For Number of Classes:
Sturges' Formula:
$k = 1 + 3.3 \log N$, where k = no. of classes and N = no. of scores/cases
For Class Width:
$c = \frac{R}{k}$, where c = class width/interval size, R = range, and k = no. of classes
3. Organize the class interval.
4. Tally each score to the category of class interval it belongs.
5. Count the tally column and summarize it under “f”. Then add the total number of frequencies
(N = total).
6. Compute the midpoint for each class interval and put it under column M.
$M = \frac{LS + HS}{2}$, where M = midpoint, LS = lowest score in the class interval, and HS = highest score in
the class interval.
7. Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not be
necessary to find the cumulative frequencies.
8. If necessary, find the relative frequencies and/or relative cumulative frequencies.

EXAMPLE: Given the following scores in a Statistics examination, make a frequency distribution table.

50 85 91 54 62 72 68 70 79 90

58 35 52 61 93 98 60 62 76 99

64 78 49 88 73 51 69 80 93 89

68 98 66 96 55 77 57 61 70 92

46 73 83 91 79 53 62 59 82 93
Creating a Grouped Frequency Distribution

Step 1: Find the largest and smallest values and compute for the Range.
Lowest Score = 35, Highest Score = 99
Range (R) = 99 – 35 = 64

Step 2: Compute for the number of classes and class width. N = 50

$k = 1 + 3.3 \log N$
$k = 1 + 3.3 \log 50 = 6.6 \approx 7$
$c = \frac{R}{k} = \frac{64}{7} = 9.14 \approx 9$

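Step 3: Organize the class intervals. Use the lowest score as the lower limit of the lowest class. Add c to each
lower limit to get the lower limit of the next class, and continue until the highest score (99) is covered. With
c = 9 this takes eight classes, one more than the estimate k = 7.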

Class interval
35-43
44-52
53-61
62-70
71-79
80-88
89-97
98-106

Step 4 and 5: Tally each score to the category of class interval it belongs. Summarize under column f
(frequency).

Class interval Tally f


35-43 I 1
44-52 IIIII 5
53-61 IIIII-IIII 9
62-70 IIIII-IIIII 10
71-79 IIIII-III 8
80-88 IIIII 5
89-97 IIIII-IIII 9
98-106 III 3
N = 50

Step 6: Compute the Midpoint for each class interval and put it under column M.

Class interval f M
35-43 1 39
44-52 5 48
53-61 9 57
62-70 10 66
71-79 8 75
80-88 5 84
89-97 9 93
98-106 3 102
N = 50

Step 7: Find the cumulative frequencies. Depending on what you're trying to accomplish, it may not be
necessary to find the cumulative frequencies

Class interval f M <cf >cf


35-43 1 39 1 50
44-52 5 48 6 49
53-61 9 57 15 44
62-70 10 66 25 35
71-79 8 75 33 25
80-88 5 84 38 17
89-97 9 93 47 12
98-106 3 102 50 3
N = 50
Step 8: If necessary, find the relative frequencies and/or relative cumulative frequencies.
$Rf(\%) = \frac{f}{N} \times 100$

Frequency Distribution of the Results of the Examination in Statistics

Class interval f M <cf >cf Rf(%)


35-43 1 39 1 50 2.00
44-52 5 48 6 49 10.00
53-61 9 57 15 44 18.00
62-70 10 66 25 35 20.00
71-79 8 75 33 25 16.00
80-88 5 84 38 17 10.00
89-97 9 93 47 12 18.00
98-106 3 102 50 3 6.00
N = 50 100.00
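
The eight steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed method: the function name `grouped_frequency_table` and the dictionary keys are made up for this sketch, and it assumes integer scores so that each class runs from its lower limit to lower limit + c - 1.

```python
import math

scores = [
    50, 85, 91, 54, 62, 72, 68, 70, 79, 90,
    58, 35, 52, 61, 93, 98, 60, 62, 76, 99,
    64, 78, 49, 88, 73, 51, 69, 80, 93, 89,
    68, 98, 66, 96, 55, 77, 57, 61, 70, 92,
    46, 73, 83, 91, 79, 53, 62, 59, 82, 93,
]

def grouped_frequency_table(data):
    """Build a grouped frequency table for integer scores (hypothetical helper)."""
    n = len(data)
    low, high = min(data), max(data)
    k = round(1 + 3.3 * math.log10(n))   # Sturges' formula, rounded to a whole number
    c = round((high - low) / k)          # class width c = R / k, rounded
    rows, less_cf, lower = [], 0, low
    while lower <= high:                 # add classes until the maximum is covered
        upper = lower + c - 1
        f = sum(lower <= x <= upper for x in data)
        greater_cf = n - less_cf         # scores in this class or any higher class
        less_cf += f                     # scores in this class or any lower class
        rows.append({
            "class": (lower, upper), "f": f, "M": (lower + upper) / 2,
            "<cf": less_cf, ">cf": greater_cf, "Rf%": 100 * f / n,
        })
        lower += c
    return rows

table = grouped_frequency_table(scores)
for row in table:
    print(row)
```

Run on the 50 examination scores, the sketch produces eight classes starting at 35-43, matching the table above.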
Representing Data using Graphs

Data can also be presented in graphical form. This form is often the most effective means of organizing and
presenting statistical data because the important relationships are brought out more clearly and creatively in
visually clear and colorful figures.

HISTOGRAM - A graph which displays the data by using continuous vertical bars of various heights to represent
frequencies. The horizontal axis can be either the class boundaries, the class marks, or the class limits.

(http://www.albany.edu/~reinhold/m308/Assgnmt1_HowTo.htm)

How to construct a Histogram manually:


1. Draw an XY-plane.
2. Place the class marks/midpoints or class limits or boundaries on the x-axis and frequency on the y-axis.
3. Label the marks so that the scale is clear and give a name to the horizontal and vertical axes.
4. Construct bars for each class. The height of each bar should correspond to the frequency of the class
at the base of the bar.

How to construct a Histogram using Microsoft Excel:


1. Input raw data into the worksheet. One column for the classes and another column for the frequency.
Select the entire dataset.
2. Click the Insert tab.

3. In the Charts group, click Column Chart.


4. Right click the chart and do the necessary adjustments using Format Plot Area.

FREQUENCY POLYGON - A line graph. The frequency is placed along the vertical axis and the class midpoints
are placed along the horizontal axis. These points are connected with lines.

(http://www.albany.edu/~reinhold/m308/Assgnmt1_HowTo.htm)

How to construct a Frequency polygon manually:


1. Draw and label the x-axis as the midpoint or class marks and y-axis as frequencies.
2. Place a point in the intersection of the midpoints and frequency per class.
3. Finally, connect the points. Include one class interval below the lowest value in your data and one
above the highest value. The graph will then touch the X-axis on both sides.

How to Construct a Frequency Polygon using MS Excel:


1. Set up the data. Highlight the two columns of data you want to use (for example midpoints and the
corresponding frequencies).
2. Select Insert → Charts → Insert XY (Scatter) → Scatter with Straight Lines. Click OK.
3. You can adjust or format your frequency polygon. Right click the Chart Area and choose your preferred
options.

[Figure: frequency polygon titled "Examination Results in Statistics"; x-axis: Scores, y-axis: frequency]

Interpreting Organized Data

After the data have been organized or presented using frequency distributions or graphs, analysis and
interpretation come in. Interpretation is the process of making sense of numerical data that has been
collected, presented and analyzed.

EXAMPLE: Using the frequency distribution table below, answer the questions that follow:

Frequency Distribution of the Results of the Examination in Statistics

Class interval f M <cf >cf Rf(%)


35-43 1 39 1 50 2.00
44-52 5 48 6 49 10.00
53-61 9 57 15 44 18.00
62-70 10 66 25 35 20.00
71-79 8 75 33 25 16.00
80-88 5 84 38 17 10.00
89-97 9 93 47 12 18.00
98-106 3 102 50 3 6.00
N = 50 100.00

1. What percent of the students obtained a score within 80-88?


2. How many students got a score lower than 62?
3. If the passing score is 80, what percentage of the students passed the Statistics Examination?

Answers:
1. 10% of the students obtained a score within 80-88.
2. 15 students have scored lower than 62.
3. The percentage of the students who have passed the Statistics examination is 34%.
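
The three answers can be checked directly from the class limits and frequencies in the table. The variable names below are illustrative assumptions for this sketch, not part of the original lesson.

```python
classes = [(35, 43), (44, 52), (53, 61), (62, 70),
           (71, 79), (80, 88), (89, 97), (98, 106)]
freqs = [1, 5, 9, 10, 8, 5, 9, 3]
n = sum(freqs)

# 1. Relative frequency (%) of the class 80-88
pct_80_88 = 100 * freqs[classes.index((80, 88))] / n

# 2. Number of students in classes whose upper limit is below 62
below_62 = sum(f for (lo, hi), f in zip(classes, freqs) if hi < 62)

# 3. Percentage of students in classes whose lower limit is at least 80
pct_passed = 100 * sum(f for (lo, hi), f in zip(classes, freqs) if lo >= 80) / n

print(pct_80_88, below_62, pct_passed)
```

Questions 2 and 3 work cleanly here because 62 and 80 are both lower class limits. A cutoff that falls inside a class could not be answered exactly from the grouped table alone.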
LESSON 2: MEASURES OF CENTRAL TENDENCY

The first type of descriptive statistics identifies the center of the distribution of scores. These are
called measures of central tendency because they all identify the center of the distribution in different ways.
The three most important measures of central tendency are the mean, the median and the mode.

MEAN (𝒙 ̅) - The mean is the average of the set of scores. By far, the most common measure of central tendency
in statistics is the mean. It is the most sensitive measure of central tendency.

A. Mean for Ungrouped Data

Arithmetic Mean – The most commonly used measure of central tendency. The sum of the
values of a group of items divided by the number of such items.

The sample mean: $\bar{x} = \frac{\sum x}{n}$
Where: $\bar{x}$ = sample mean
$\sum x$ = the sum of all the scores
$n$ = total number of cases

The population mean: $\mu = \frac{\sum x}{N}$
Where: $\mu$ = population mean
$\sum x$ = the sum of all the scores
$N$ = total number of cases

EXAMPLE: Consider the scores of ten people who took a make-up quiz in Algebra.

12 14 16 10 5 8 18 7 10 4

The sum of the scores is $\sum x = 104$, so the mean score is $\bar{x} = \frac{\sum x}{n} = \frac{104}{10} = 10.4$

Weighted Arithmetic Mean – can be expressed as the sum of the values multiplied by their
corresponding weights divided by the total weight.

The formula is: $\bar{x} = \frac{\sum f x}{\sum f}$
Where: $f$ = weight or frequency of each item
$x$ = value of each item

EXAMPLE: The final grades of a student at the end of semester are the following:

Subjects Grades (x) Units (f)


ITPC 314 - Object-oriented Programming 85.00 3
ITPC 315 – Technopreneurship 88.00 3
ITPC 316 – Software Engineering 84.00 3
ITCC 317 - Management Information System 89.00 3
GECC 103 – Mathematics in the Modern World 87.00 3
GECC 106 – The Contemporary World 90.00 3
PE 101 – Physical Fitness 92.00 2

Then the mean grade of the student is:


$\bar{x} = \frac{3(85.00) + 3(88.00) + 3(84.00) + 3(89.00) + 3(87.00) + 3(90.00) + 2(92.00)}{3+3+3+3+3+3+2}$

$\bar{x} = \frac{1753}{20} = 87.65$
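
Both means can be reproduced with a few lines of Python. The function names `mean` and `weighted_mean` are assumptions chosen for this sketch.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted mean: sum of f*x divided by the sum of the weights f."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

quiz_scores = [12, 14, 16, 10, 5, 8, 18, 7, 10, 4]
grades = [85.0, 88.0, 84.0, 89.0, 87.0, 90.0, 92.0]
units = [3, 3, 3, 3, 3, 3, 2]

print(mean(quiz_scores))             # mean of the make-up quiz scores
print(weighted_mean(grades, units))  # grade average weighted by units
```

When every weight is equal, the weighted mean reduces to the ordinary arithmetic mean.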

Characteristics of the Mean


1. It is the most reliable measure of central tendency.
2. Summarizes data in a way that is easy to understand.
3. Uses all the data.
4. Used in many statistical applications.
5. It is the best measure to use when the distribution is symmetrical or normal.
6. The mean is sensitive or greatly affected by extreme values.
MEDIAN (𝑴𝒅)- Is a measure of central tendency that occupies the middle position in an array of values. It is
the number that divides the bottom 50% of the data from the top 50%.

Median for Ungrouped Data


The median for ungrouped data is the middlemost value when the data or scores are arranged
in an array (ascending/descending manner).

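If n is odd: Median (Md) is the middle score in the array

$Md = \left(\frac{n+1}{2}\right)^{th} \text{ data}$

If n is even: Median (Md) is the average of the two middle scores in the array

$Md = \frac{\left(\frac{n}{2}\right)^{th} \text{ data} + \left(\frac{n}{2}+1\right)^{th} \text{ data}}{2}$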

EXAMPLE: The Median for Ungrouped Data


Compute for the median of the following scores.
a. 3, 9, 2, 8, 5, 7

Arrange the scores in an array: 2, 3, 5, 7, 8, 9 (n = 6)


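Since n is even:
$Md = \frac{\left(\frac{6}{2}\right)^{th} \text{ data} + \left(\frac{6}{2}+1\right)^{th} \text{ data}}{2} = \frac{3^{rd} \text{ data} + 4^{th} \text{ data}}{2}$

Note: 3rd data = 5, 4th data = 7

$Md = \frac{5+7}{2} = 6$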

b. 34, 56, 27, 25, 98, 12, 32, 54, 47


Arrange the scores in an array: 12, 25, 27, 32, 34, 47, 54, 56, 98
(n = 9)
Since n is odd:
$Md = \left(\frac{n+1}{2}\right)^{th} \text{ data} = \left(\frac{9+1}{2}\right)^{th} \text{ data} = 5^{th} \text{ data} = 34$
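
The odd and even cases can be sketched as a single function. The name `median` is an assumption for this example; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n + 1) / 2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2   # the (n / 2)th and (n / 2 + 1)th values

print(median([3, 9, 2, 8, 5, 7]))
print(median([34, 56, 27, 25, 98, 12, 32, 54, 47]))
```

List indices start at 0 in Python, so the kth data value in the formulas corresponds to `ordered[k - 1]`.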

Characteristics of the Median


1. Easy to understand and easy to compute.
2. The point/score that divides the distribution into two halves.
3. The median is not affected by extreme values.

MODE (𝑴𝒐) - It is the most frequently occurring score in a series. A distribution that consists of only one of
each score has n modes. A distribution where a single score is most frequent has one mode and is called
unimodal. When there are ties for the most frequent score, the distribution is bimodal if two scores tie or
multimodal if more than two scores tie.

Mode for Ungrouped Data


The most frequent occurring score is the mode. Arrange the scores from least to greatest and
inspect to determine the score/s that has/have more occurrences. The score or value which occurs
the greatest number of times in the data is the mode.

EXAMPLE: Find the mode of the following scores.


23 21 24 21 23 30 16 18 23 28 25 26

Arrange the scores, so that it will be easier to find the most frequently occurring score.
16 18 21 21 23 23 23 24 25 26 28 30

The mode is Mo = 23.
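
A short Python sketch of the same inspection, using the standard library's `collections.Counter` to count occurrences. The function name `modes` is an assumption; it returns a list so that bimodal and multimodal data are handled as well.

```python
from collections import Counter

def modes(values):
    """Return every score that occurs the greatest number of times, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

scores = [23, 21, 24, 21, 23, 30, 16, 18, 23, 28, 25, 26]
print(modes(scores))
```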


LESSON 3: MEASURE OF DISPERSION

A MEASURE OF DISPERSION is a measure that describes how spread out or scattered a set of data is. It is
also known as a measure of variation or a measure of spread.

A. Range
- It is the simplest measure of dispersion.
- It is the difference between the highest (maximum) and lowest (minimum) values.
R = Highest Value − Lowest Value (for ungrouped data)

Characteristic of the Range:


1. Easy to compute and understand.
2. Emphasizes the extreme values.
3. Most unstable and unreliable measure, because its value can fluctuate greatly with a change in just a
single score, either the lowest or the highest.

EXAMPLE: Find the range of the following sets of data:

Set A: 5, 7, 8, 8, 9, 10, 11, 12 → Range = HV – LV = 12 – 5 = 7
Set B: 8, 9, 9, 10, 11, 11, 12, 12 → Range = HV – LV = 12 – 8 = 4
Set C: 7, 7, 8, 8, 9, 9, 10, 10, 10 → Range = HV – LV = 10 – 7 = 3

Interpretation: Based on the computed range for sets A, B, and C, it can be concluded that set A has
greater variability than B and C.
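
The comparison can be checked with a small Python sketch. The helper name `value_range` is an assumption, chosen to avoid shadowing Python's built-in `range`.

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

sets = {
    "A": [5, 7, 8, 8, 9, 10, 11, 12],
    "B": [8, 9, 9, 10, 11, 11, 12, 12],
    "C": [7, 7, 8, 8, 9, 9, 10, 10, 10],
}
ranges = {name: value_range(data) for name, data in sets.items()}
print(ranges)
```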

B. Mean Absolute Deviation (MAD)


- The mean deviation measures the average deviation of the values from the arithmetic mean. It gives
equal weight to the deviation of every observation.
- As a measure of variation/variability, the mean or average deviation is considered more important
because it takes into account all the individual values of the distribution.

MAD for Ungrouped Data:


$MAD = \frac{\sum |x - \bar{x}|}{n}$
Where: MAD – Mean Absolute Deviation
x – individual value/score
𝑥̅ - sample mean
n – total number of items/score or observations

Steps in computing for the Mean Absolute Deviation for Ungrouped Data
1. Compute for the mean of the distribution 𝑥̅ .
2. Subtract the mean from the individual score. Get the absolute value
|𝑥 − 𝑥̅ |.
3. Get the summation of all the absolute value of the differences of the mean and the individual scores
Σ|𝑥 − 𝑥̅ |.
4. To get the MAD, substitute the values into the formula $MAD = \frac{\sum |x - \bar{x}|}{n}$.

EXAMPLE: Find the MAD of the following scores.

x 𝑥 − 𝑥̅ |𝑥 − 𝑥̅ |
22 -5 5
24 -3 3
26 -1 1
28 1 1
30 3 3
32 5 5
Σ𝑥 = 162 Σ|𝑥 − 𝑥̅ | = 18

$\bar{x} = \frac{\sum x}{n} = \frac{162}{6} = 27$

$MAD = \frac{\sum |x - \bar{x}|}{n} = \frac{18}{6} = 3.00$
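
The four steps translate almost line for line into Python. The function name `mean_absolute_deviation` is an assumption for this sketch.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations |x - mean| over all values."""
    n = len(values)
    mean = sum(values) / n                             # step 1: the mean
    abs_deviations = [abs(x - mean) for x in values]   # step 2: |x - mean| for each score
    return sum(abs_deviations) / n                     # steps 3 and 4: sum, then divide by n

scores = [22, 24, 26, 28, 30, 32]
print(mean_absolute_deviation(scores))
```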
C. Variance and Standard Deviation

Variance ($s^2$ and $\sigma^2$)


- Examines how far, on average, each score is away from the mean. The sample variance is symbolized
as $s^2$ and the population variance as $\sigma^2$.
- The variance of a population is equal to the sum of the squared deviations about the mean divided by
the number of scores.

Variance of a Population ($\sigma^2$)


- The average of the squares of the distances from the population mean. It is the sum of the squares of
the deviations from the mean divided by the population size. The units of the variance are the square of
the units of the data.

Variance of a Population Formula

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

Where: $\sigma^2$ = variance of a population
$X$ = values of observations in the population
$\mu$ = population mean
$N$ = total number of observations in the population

Variance of a Sample ($s^2$)


- Unbiased estimator of a population variance. Instead of dividing by the population size, the sum of the
squares of the deviations from the sample mean is divided by one less than the sample size. The units
of the variance are the square of the units of the data.

Variance of a Sample Formula

$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$

Where: $s^2$ = variance of a sample
$x$ = values of observations in the sample
$\bar{x}$ = sample mean
$n$ = total number of observations in the sample

Standard Deviation
- The square root of the variance. The population standard deviation is the square root of the population
variance and the sample standard deviation is the square root of the sample variance. The units of the
standard deviation are the same as the units of the population/sample data.

Standard Deviation of a Population (𝝈)


$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Where: $\sigma$ = standard deviation of a population
$X$ = values of observations in the population
$\mu$ = population mean
$N$ = total number of observations in the population

Standard Deviation of a Sample


$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$
Where: $s$ = standard deviation of a sample
$x$ = values of observations in the sample
$\bar{x}$ = sample mean
$n$ = total number of observations in the sample

Variance and Standard Deviation for Ungrouped Data


To compute for the variance of ungrouped data, the following steps should be undertaken:
1. Find the Mean of the set of scores.
2. Subtract the Mean from each score/number and square the result
3. Then get the summation of those squared differences.
4. To compute for the variance, divide the summation by the total number of scores minus 1.
5. To compute for the standard deviation, just get the square root of the variance.

EXAMPLE: Compute for the variance and standard deviation of the following sample data:
x 𝑥 − 𝑥̅ (𝑥 − 𝑥̅ )2
22 -5 25
24 -3 9
26 -1 1
28 1 1
30 3 9
32 5 25
Σ𝑥 = 162 ∑(𝑥 − 𝑥̅ )2 = 70

$\bar{x} = \frac{\sum x}{n} = \frac{162}{6} = 27$

$s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{70}{6-1} = \frac{70}{5} = 14.00$

$s = \sqrt{14.00} = 3.74$
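
The same computation can be sketched in Python, with the sample and population versions side by side. The two function names are assumptions for this example; the standard library's `statistics.variance` and `statistics.pvariance` are used only as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [22, 24, 26, 28, 30, 32]
s2 = sample_variance(data)
s = math.sqrt(s2)
print(s2, round(s, 2))

# Cross-check against the standard library
print(statistics.variance(data), statistics.pvariance(data))
```

For the same data the population variance is smaller than the sample variance, because it divides by N rather than n - 1.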

Interpreting the Standard Deviation

The standard deviation is the most useful and important measure of variation/dispersion. It is widely
used in research and is used in drawing inferences from samples to populations. The interpretation of the
standard deviation is of great importance in Research and Statistics.

Chebyshev’s Theorem

The minimum proportion of the scores in a distribution that lie within a given distance of the mean can be
computed by using Chebyshev’s theorem.

Chebyshev’s Theorem states that the proportion or percentage of any data set that lies within k standard
deviations of the mean (where k is any number greater than one) is at least $1 - \frac{1}{k^2}$.

EXAMPLE: If the mean score of the students enrolled in a Business Statistics class is 66 points with a standard
deviation of 5 points, at least what percentage of the scores must lie between 46 and 86?

Solution: $\bar{x} - k(s) = 46$
$66 - k(5) = 46$
$5k = 20 \rightarrow k = 4$

$1 - \frac{1}{k^2} = 1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 0.9375$ or 93.75%

∴ At least 93.75% of the data lie between 46 and 86.
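
A minimal Python sketch of the same calculation. The function name `chebyshev_lower_bound` and its parameters are assumptions for this example; when the interval is not symmetric about the mean, the sketch uses the shorter side, which is a conservative choice.

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Minimum proportion of data between low and high, by Chebyshev's theorem."""
    k = min(mean - low, high - mean) / sd   # number of standard deviations covered
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

bound = chebyshev_lower_bound(mean=66, sd=5, low=46, high=86)
print(bound)
```

The bound holds for any distribution, which is why it is weaker than the percentages that apply specifically to normal data.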
LESSON 4: MEASURE OF RELATIVE POSITION
Measures of Relative Position are conversions of values, usually standardized test scores, to show where
a given value stands in relation to other values of the same grouping.

Standard Scores (z-Scores)


A standard score (or z-score) indicates how many standard deviations an element is from the mean.
A standard score can be calculated from the following formula.

$z = \frac{X - \mu}{\sigma}$

where z is the z-score, X is the value of the element or the raw score, μ is the population mean, and
σ is the standard deviation.

How to interpret z-scores:

- A z-score less than 0 represents an element less than the mean.
- A z-score greater than 0 represents an element greater than the mean.
- A z-score equal to 0 represents an element equal to the mean.
- A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score
equal to 2, 2 standard deviations greater than the mean; etc.
- A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score
equal to -2, 2 standard deviations less than the mean; etc.

EXAMPLE: Raidah scored 55 on a mathematics test that had a mean of 45 and a standard deviation of 10. On
an English test with a mean of 56 and a standard deviation of 12, she had scored 70. Compare her relative
positions on the two tests.

Solution: Convert her scores for the two tests to standard score:

For Mathematics:
$z = \frac{X - \mu}{\sigma} = \frac{55 - 45}{10} = 1.00$

For English:
$z = \frac{X - \mu}{\sigma} = \frac{70 - 56}{12} = 1.17$

Since the standard score for English is larger, her relative position in English is higher than her relative
position in Mathematics.

EXAMPLE: Suppose that the mean of a test is 122 and s is 24. If Jose earns a score of 146 on the test, his
deviation from the mean is 146 − 122 = 24. Dividing Jose’s deviation of 24 by the s of the test gives him a z
of 1.00. If Edgar’s score is 110, what is Edgar’s z-score?

$z = \frac{110 - 122}{24} = -0.50$

EXAMPLE: Two equivalent intelligence tests are given to similar groups, but the tests are designed with different
scales. The statistics for the tests are listed below. Which is better: a score of 145 on Test I or a score of 60
on Test II?

          Test I    Test II
Mean      100       40
s         15        5

z-score for Test I: $z = \frac{145 - 100}{15} = 3.00$
z-score for Test II: $z = \frac{60 - 40}{5} = 4.00$

Therefore, a score of 145 on Test I is 3.00 standard deviations above the mean and a score of 60 on
Test II is 4.00 standard deviations above the mean. This implies that 60 on Test II is the better score.
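
The three z-score examples can be reproduced with one small function. The name `z_score` is an assumption for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Raidah: Mathematics vs. English
print(round(z_score(55, 45, 10), 2), round(z_score(70, 56, 12), 2))

# Jose and Edgar on the same test
print(z_score(146, 122, 24), z_score(110, 122, 24))

# Test I vs. Test II
print(z_score(145, 100, 15), z_score(60, 40, 5))
```

Because z-scores are unit-free, scores from tests with different scales can be compared directly.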
PERCENTILE

A percentile is a measure indicating the value below which a given percentage of observations in a group
of observations falls. For example, the 80th percentile is the value below which 80% of the observations may
be found.

Percentile for a Given Data Value

Given a set of data and a data value x,


$\text{Percentile of score } x = \frac{\text{number of data values less than } x}{\text{total number of data values}} \times 100$

Example: On an examination given to 4,500 students, Mia’s score of 340 was higher than the scores of 2,898
students who took the examination. What is the percentile of Mia’s score?
Solution:
$\text{Percentile} = \frac{2898}{4500} \times 100 = 0.644 \times 100 = 64.4 \approx 64$

Mia’s score of 340 places her at the 64th percentile.
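
A short Python sketch of the percentile formula. The function names are assumptions for this example: `percentile_from_counts` works from the counts given in the problem, while `percentile_of_score` counts the values below x in a raw data list.

```python
def percentile_from_counts(count_below, total):
    """Percentile = (number of values less than x) / (total number of values) * 100."""
    return count_below / total * 100

def percentile_of_score(values, x):
    """Same formula, counting the values below x directly from the data."""
    return percentile_from_counts(sum(v < x for v in values), len(values))

mia = percentile_from_counts(2898, 4500)
print(round(mia))
```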

QUARTILE
Refers to the values that divide the distribution into four (4) equal parts.
- Q1 – refers to the value below which the first one-fourth of the distribution, arranged in order of
magnitude, falls.
- Q2 – two-fourths or half of the distribution. This is also the median of the distribution.
- Q3 – three-fourths of the distribution.

EXAMPLE: Find Q1, Q2 and Q3 of the following scores.


23, 21, 24, 21, 22, 30, 16, 18, 22, 28, 25, 26, 30
Solution:
Step 1: Arrange the scores.
16 18 21 21 22 22 23 24 25 26 28 30 30
Step 2: Find the median or Q2. The median is 23.

Step 3: Find the median of the data values that fall below Q2.
16 18 21 21 22 22
Q1 = 21
Step 4: Find the median of the data values that are above Q2.
24 25 26 28 30 30
$\text{median} = \frac{26+28}{2} = 27$
Q3 = 27
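
The median-of-halves procedure in the steps above can be sketched as follows. The function name `quartiles` is an assumption, and the sketch excludes the median itself from both halves when n is odd, as the worked example does. Other quartile conventions exist (for instance, the default of `statistics.quantiles`) and can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # values below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above it
    return median(lower), median(ordered), median(upper)

scores = [23, 21, 24, 21, 22, 30, 16, 18, 22, 28, 25, 26, 30]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)
```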

Box-and-Whisker Plots

A box-and-whisker plot or boxplot is a diagram based on the five-number summary of a data set. The
five-number summary of a data set consists of the five numbers determined by computing the minimum, Q1
or the 1st Quartile, median, Q3 or the 3rd Quartile, and maximum value of the data set.

To construct a box-and-whisker plot, first draw an equal interval scale on which to make the box plot.
The boxplot is a visual representation of the distribution of the data. Greater distances in the diagram should
correspond to greater distances between numeric values.

Using the equal interval scale, draw a rectangular box with one end at Q1 and the other end at Q3. Then
draw a segment across the box at the median value. Finally, draw two segments, one on each side of the
box: one out to the minimum value and one out to the maximum value (these segments are called the
"whiskers").
EXAMPLE: Draw a box-and-whisker plot for the data set
{16, 18, 21, 21, 22, 22, 23, 24, 25, 26, 28, 30, 30}.

Solution:

1. Find/Compute for the five-number summary:


Minimum =16, Q1 = 21, Median = 23, Q3 = 27 and Maximum = 30.

2. Plot the values.


LESSON 5: THE NORMAL DISTRIBUTION

The normal probability curve is one of the most commonly used theoretical distributions in statistical
inference. De Moivre developed the mathematical equation of the normal curve in 1733. It is sometimes called
the Gaussian distribution in honor of Carl Friedrich Gauss, who derived the equation in the 19th century.

In most cases, this is used to describe the distribution of variables such as grades of students, weights
or heights of persons, incomes of families, or IQ.

The Normal Curve


A normal curve is a bell-shaped curve which shows the probability distribution of a continuous random
variable. The normal curve represents a normal distribution. The total area under the normal curve is one.
The parameters of a normal distribution are the mean (μ) and the standard deviation (σ).

Characteristics of the Normal Curve


1. The curve is symmetrical and bell-shaped.
2. The number of cases, N, is infinite.
3. The three measures of central tendency (Mean, Median and Mode) coincide at one point at the center
of distribution.
4. The height of the curve indicates the frequency of cases, expressed as probability, proportion or
percentage.
5. The basic unit of measurement is expressed in sigma units (σ) or standard deviations along the baseline.
The sigma units are also called Z-scores, $z = \frac{x - \mu}{\sigma}$.
6. Two parameters are used to describe the curve: the mean and the standard deviation. For the standard
normal curve, the mean is equal to zero (μ = 0) and the standard deviation is equal to 1 (σ = 1).
7. Standard deviations or Z-scores departing away from the μ towards the right of the curve or above the
mean are expressed in positive values while the scores departing from the mean to the left of the
curve or below the mean are in negative values.

The Empirical Rule


In a normal distribution, approximately:
- 68% of the data lie within 1 standard deviation from the mean.
- 95% of the data lie within 2 standard deviations from the mean.
- 99.7% of the data lie within 3 standard deviations from the mean.
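
The three percentages can be verified numerically. This sketch uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `within` is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 1))
```

The exact values are slightly different from 68% and 95%, which is why the rule is stated as approximate.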
Standard Normal Distribution

The standard normal curve represents a normal curve with mean 0 and standard deviation 1.

It is helpful to convert raw scores to z-scores using the following formulas:

For population:
$z_x = \frac{x - \mu}{\sigma}$

For sample:
$z_x = \frac{x - \bar{x}}{s}$

Tables and calculators are used to determine the area under the normal curve. The following table of
Areas under the Normal Curve will help. Since the normal curve is symmetrical, the areas for negative and
positive z-scores of the same magnitude are equal.
EXAMPLE: Find the area under the standard normal curve for the following z-scores and draw and shade the
corresponding area on the curve.
a. Between z = 0 and z = 0.50
Solution: Using the table, the area between the mean and a z-score of 0.50 corresponds to
0.1915.

b. Between z = -1.50 and z = 0.50


Solution: Using the table, the area from -1.50 to the mean (0) is 0.4332 and the area
from the mean to 0.50 is 0.1915. The total area is 0.4332 + 0.1915 = 0.6247.

c. Between z = - 2.40 and z = 0


Solution: Using the table, the area between a z-score of -2.4 and the mean is 0.4918.
d. To the left of z = 2.30
Solution: Using the table, the area from the mean to z = 2.3 is 0.4893. The total area to the
left of z is 0.9893.

e. To the right of z = 1.00


Solution: Using the table, the area from the mean to a z-score of 1.0 is 0.3413. The total area
to the right of z = 1.00 is 0.1587.
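
The table lookups in parts a to e can be cross-checked with the cumulative distribution function of the standard normal curve. This is a sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_between` is an assumption.

```python
from statistics import NormalDist

phi = NormalDist().cdf   # area to the left of z under the standard normal curve

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return phi(z_high) - phi(z_low)

areas = {
    "a": area_between(0, 0.50),
    "b": area_between(-1.50, 0.50),
    "c": area_between(-2.40, 0),
    "d": phi(2.30),          # everything to the left of z = 2.30
    "e": 1 - phi(1.00),      # everything to the right of z = 1.00
}
for part, area in areas.items():
    print(part, round(area, 4))
```

The table gives the area from the mean to z, which corresponds to `phi(z) - 0.5` for positive z.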

EXAMPLE: Express Delivery Service has found that the delivery times for packages are normally distributed,
with a mean of 16 hours and a standard deviation of 2 hours.
a. What is the probability that a randomly selected package will be delivered in between 12 and
17 hours?
b. What percent of packages will be delivered in more than 18.5 hours?

Solution:
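a. Convert 12 and 17 hours to standard scores.
$z_1 = \frac{x_1 - \mu}{\sigma} = \frac{12 - 16}{2} = -2.0$
$z_2 = \frac{x_2 - \mu}{\sigma} = \frac{17 - 16}{2} = 0.5$
From the table, a z-score of -2 has an area of 0.4772 and the area for a z-score of 0.5 is 0.1915. Because the
two z-scores lie on opposite sides of the mean, the areas are added: 0.4772 + 0.1915 = 0.6687. Hence,
the probability that a randomly selected package will be delivered in between 12 and 17 hours is 0.6687.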
b. Convert 18.5 hrs to a z-score.
$z = \frac{x - \mu}{\sigma} = \frac{18.5 - 16}{2} = 1.25$
From the table, a z-score of 1.25 has a corresponding area of .3944. This is the area from the mean to
the raw score of 18.5. Hence, the area from 18.5 and above is computed as:
𝐴 = 0.5 − 0.3944 = 0.1056
Hence, the percentage of packages that will be delivered in more than 18.5 hours is 10.56%.

EXAMPLE: 5000 students participated in a certain test yielding a result that follows the normal distribution
with mean of 65 points and standard deviation of 10 points.

a. Find the percentage of students scoring more than 75 points and less than 85 points.
b. Find the percentage of students scoring less than 60 points.

Solution:
a. Convert 75 and 85 points to z.
$z_1 = \frac{x_1 - \mu}{\sigma} = \frac{75 - 65}{10} = 1.0$
$z_2 = \frac{x_2 - \mu}{\sigma} = \frac{85 - 65}{10} = 2.0$

The following are the areas of the z-values of 1.0 and 2.0 respectively, 0.3413 and 0.4772.

To get the percentage, subtract the two areas:


𝐴 = 0.4772 − 0.3413 = 0.1359
Hence, the percentage of students scoring more than 75 points and less than 85 points is 13.59%.

b. Percentage of students scoring less than 60 points.

$z = \frac{x - \mu}{\sigma} = \frac{60 - 65}{10} = -0.5$

The area corresponding to -0.5 is 0.1915. Since we are looking for the percentage of students scoring less
than 60 points,

$A = 0.5 - 0.1915 = 0.3085$

The percentage of students scoring less than 60 points is 30.85%.

*All normal curves in the examples are generated from
https://www.mathportal.org/calculators/statistics-calculator/normal-distribution-calculator.php
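
The delivery-time and test-score examples above can also be checked without a table. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are assumptions made for illustration.

```python
from statistics import NormalDist

# Delivery times: mean 16 hours, standard deviation 2 hours
delivery = NormalDist(mu=16, sigma=2)
p_12_to_17 = delivery.cdf(17) - delivery.cdf(12)   # P(12 < X < 17)
p_over_18_5 = 1 - delivery.cdf(18.5)               # P(X > 18.5)

# Test scores: mean 65 points, standard deviation 10 points
exam = NormalDist(mu=65, sigma=10)
p_75_to_85 = exam.cdf(85) - exam.cdf(75)           # P(75 < X < 85)
p_below_60 = exam.cdf(60)                          # P(X < 60)

print(round(p_12_to_17, 4), round(p_over_18_5, 4))
print(round(p_75_to_85, 4), round(p_below_60, 4))
```

`NormalDist` accepts the raw mean and standard deviation, so the conversion to z-scores happens internally; the results agree with the table-based answers to four decimal places.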
LESSON 6: CORRELATION AND LINEAR REGRESSION

Correlation is a bivariate analysis that measures the strength of association between two variables
and the direction of the relationship.

The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to
+1.0.

When the value of the correlation coefficient is exactly +1 or –1, there is said to be a perfect degree of
association between the two variables. The closer r is to +1 or –1, the stronger the correlation. The direction
of the relationship is simply the + (indicating a positive relationship between the variables) or - (indicating a
negative relationship between the variables) sign of the correlation.

Interpreting Correlation

Correlation is an effect size, so we can verbally describe the strength of the correlation using the
guide that Evans (1996) suggests for the absolute value of r:
• .00-.19 “very weak”
• .20-.39 “weak”
• .40-.59 “moderate”
• .60-.79 “strong”
• .80-1.0 “very strong”

Scatterplot

An effective way to see a relationship in data is to display the information as a scatter plot. It shows
how two variables relate to each other by showing how closely the data points fit to a line. If the variables
are correlated, the points will fall along a line or curve. The better the correlation, the tighter the points will
hug the line.

A simple scatterplot can be used to (a) determine whether a relationship is linear, (b) detect outliers
and (c) graphically present a relationship. Checking whether a relationship is linear matters because
linearity is an assumption of both correlation and regression analysis.

Various types of correlation can be interpreted through the patterns displayed on Scatterplots. These
are: positive (values increase together), negative (one value decreases as the other increases), null (no
correlation). The strength of the correlation can be determined by how closely packed the points are to each
other on the graph. Points that end up far outside the general cluster of points are known as outliers.

Example: Sample Scatterplots

[Figure: sample scatterplot of Series1 with a fitted linear trend line, Linear (Series1); x-axis from 0 to 100, y-axis from 0 to 60]

Source: https://datavizcatalogue.com/methods/scatterplot.html

How to make a Scatterplot in MS Excel

1. Select the data you want to graph. (If the data is not yet encoded, encode and label it in
MS Excel first.)
2. Click the Insert tab, and then click Insert Scatter (X, Y) or Bubble Chart. Click Scatter. Click Ok.
3. Click the Design tab, and then click the chart style you want to use.

4. You can quickly edit the chart by clicking the icons available on the right side of the chart.

5. You can also add a trendline to your scatter plot. Right-click a data point in the chart and select Add
Trendline. Select Linear and tick Display Equation on chart.
Pearson Correlation

Pearson r correlation is the most widely used correlation statistic to measure the degree of the
relationship between linearly related variables. The calculation of Pearson’s correlation coefficient and
subsequent significance testing of it requires the following data assumptions to hold: interval or ratio level;
linearly related; and bivariate normally distributed.

The following is the formula for r:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}$$

Where:
r = Pearson r correlation coefficient
n = number of paired values in the data set
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores

EXAMPLE: A study investigated the relationship of height and self-esteem of 20 randomly selected women.
The following are the heights in inches and the level of their self-esteem. Solve for the correlation coefficient
r.

Height (in inches) Self-Esteem Level


68 4.10
71 4.60
62 3.80
75 4.40
58 3.20
60 3.10
67 3.80
68 4.10
71 4.30
69 3.70
68 3.50
67 3.20
63 3.70
62 3.30
60 3.40
63 4.00
65 4.10
67 3.80
63 3.40
61 3.60
Solution: Solve for x², y² and xy together with their summations.

X (Height in inches)   Y (Self-Esteem Level)   X²   Y²   XY
68 4.1 4624 16.81 278.80
71 4.6 5041 21.16 326.60
62 3.8 3844 14.44 235.60
75 4.4 5625 19.36 330.00
58 3.2 3364 10.24 185.60
60 3.1 3600 9.61 186.00
67 3.8 4489 14.44 254.60
68 4.1 4624 16.81 278.80
71 4.3 5041 18.49 305.30
69 3.7 4761 13.69 255.30
68 3.5 4624 12.25 238.00
67 3.2 4489 10.24 214.40
63 3.7 3969 13.69 233.10
62 3.3 3844 10.89 204.60
60 3.4 3600 11.56 204.00
63 4.0 3969 16.00 252.00
65 4.1 4225 16.81 266.50
67 3.8 4489 14.44 254.60
63 3.4 3969 11.56 214.20
61 3.6 3721 12.96 219.60
∑X = 1308   ∑Y = 75.1   ∑X² = 85912   ∑Y² = 285.45   ∑XY = 4937.6
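
The column sums in the table can be reproduced with a few lines of code. This is a sketch for checking the arithmetic, using the 20 height and self-esteem pairs from the example; the variable names are illustrative.

```python
# Height (inches) and self-esteem level for the 20 women in the example
heights = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
           68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

n = len(heights)
sum_x = sum(heights)                                   # sum of X
sum_y = sum(esteem)                                    # sum of Y
sum_x2 = sum(x * x for x in heights)                   # sum of X squared
sum_y2 = sum(y * y for y in esteem)                    # sum of Y squared
sum_xy = sum(x * y for x, y in zip(heights, esteem))   # sum of XY

# Decimal sums are rounded to hide floating-point noise
print(n, sum_x, round(sum_y, 2), sum_x2, round(sum_y2, 2), round(sum_xy, 2))
# 20 1308 75.1 85912 285.45 4937.6
```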

Substitute values into the formula:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}$$

$$r = \frac{20(4937.6) - (1308)(75.1)}{\sqrt{20(85912) - (1308)^2} \cdot \sqrt{20(285.45) - (75.1)^2}}$$

$$r = \frac{98752 - 98230.8}{\sqrt{7376} \cdot \sqrt{68.99}} = \frac{521.2}{713.35} = 0.731$$


Interpretation: The coefficient r = 0.731 indicates a strong positive relationship between height and self-esteem
level. In this sample, taller women tend to report higher self-esteem and shorter women tend to report lower self-esteem. The correlation alone does not show that height causes self-esteem.
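
As a check on the substitution, the same formula can be evaluated directly from the five sums in the solution table. This is a minimal sketch; the variable names are illustrative.

```python
import math

# Sums taken from the solution table of the height and self-esteem example
n = 20
sum_x, sum_y = 1308, 75.1
sum_x2, sum_y2, sum_xy = 85912, 285.45, 4937.6

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(numerator, 1))   # 521.2
print(round(r, 3))           # 0.731
```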
Linear Regression

In practice, when a relationship is found to exist between two (or more) variables, one often wants to express
this relationship in mathematical form by finding an equation connecting the variables. To do this, first
collect data showing the corresponding values of the variables. Next, plot the points on a
rectangular coordinate system. The resulting graph is called a scatter plot or scatter diagram.

Linear regression tries to model the relationship between two variables by fitting a linear equation to
the observed data. One variable is considered to be an explanatory (independent) variable, and the other is
considered to be the dependent variable.

A linear regression line has an equation of the form 𝑌 = 𝑎𝑋 + 𝑏, where 𝑋 is the explanatory variable
and 𝑌 is the dependent variable. a is the slope of the line and b is the intercept (the value of y when x = 0)

To compute the slope a:

$$a = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

To compute the intercept b:

$$b = \frac{\sum y - a(\sum x)}{n}$$

Step-by-step Procedure:

Step 1: For each (x, y) point, calculate x² and xy.

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy
(Σ means "sum up").

Step 3: Calculate Slope a.

Step 4: Calculate Intercept b.

Step 5: Assemble the equation of a line: y = ax + b
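
The five steps can be sketched as a short function. The name fit_line and the four sample points are hypothetical and chosen only for illustration; the points lie exactly on y = 2x + 1, so the expected slope and intercept are easy to verify by hand.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = ax + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired (x, y) values")
    n = len(xs)
    # Steps 1 and 2: compute x squared and xy for each point, then sum
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denominator   # Step 3: slope
    b = (sum_y - a * sum_x) / n                      # Step 4: intercept
    return a, b

# Step 5: assemble the equation (illustrative points on y = 2x + 1)
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {a}x + {b}")   # y = 2.0x + 1.0
```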

EXAMPLE: The table below shows some data for the first ten (10) years of a certain Manufacturing and Canning
company, Marina. Each row in the table shows Marina’s sales for a year, and the amount spent on advertising
in that year. Calculate the regression equation for the data using advertising as the explanatory variable.

Year   Advertising (in million pesos)   Sales (in million pesos)
1 18 665
2 23 758
3 25 823
4 28 1078
5 30 1199
6 33 1301
7 39 1472
8 47 1500
9 52 1604
10 61 1699
Solution:
Year   X   Y   XY   X²
1 18 665 11970 324
2 23 758 17434 529
3 25 823 20575 625
4 28 1078 30184 784
5 30 1199 35970 900
6 33 1301 42933 1089
7 39 1472 57408 1521
8 47 1500 70500 2209
9 52 1604 83408 2704
10 61 1699 103639 3721
∑X = 356   ∑Y = 12099   ∑XY = 474021   ∑X² = 14406

To compute the slope a:

$$a = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

$$a = \frac{10(474021) - (356)(12099)}{10(14406) - (356)^2}$$

$$a = 24.992$$

To compute the intercept b:

$$b = \frac{\sum y - a(\sum x)}{n}$$

$$b = \frac{(12099) - 24.992(356)}{10}$$

$$b = 320.185$$

The regression equation is: Y = 24.992X + 320.185
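
The computation can be checked in code using the advertising and sales figures from the table. The sketch below also shows how sensitive the intercept is to rounding: computing b with the slope already rounded to 24.992, as in the manual solution, gives 320.185, while carrying the unrounded slope gives about 320.175. Variable names are illustrative.

```python
# Marina data: advertising (X) and sales (Y), both in million pesos
advertising = [18, 23, 25, 28, 30, 33, 39, 47, 52, 61]
sales = [665, 758, 823, 1078, 1199, 1301, 1472, 1500, 1604, 1699]

n = len(advertising)
sum_x = sum(advertising)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in advertising)
sum_xy = sum(x * y for x, y in zip(advertising, sales))

a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
b_unrounded = (sum_y - a * sum_x) / n             # intercept from the full-precision slope
b_manual = (sum_y - round(a, 3) * sum_x) / n      # intercept from the slope rounded to 24.992

print(round(a, 3), round(b_unrounded, 3), round(b_manual, 3))
# 24.992 320.175 320.185
```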

How to perform Regression Analysis in MS Excel:

1. Encode your data in MS Excel worksheet.

2. On the Data tab, in the Analysis group, click Data Analysis.


3. A dialogue box will appear, select Regression and click OK.

4. A new dialogue box for Regression will appear. Select the Y Range. This is the response variable (also
called the dependent variable). Select the X Range. These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other. Check Labels. Click in the
Output Range box and select any vacant cell in the worksheet. Check Residuals. Click OK.
5. Excel produces the following Summary Output (rounded to 3 decimal places).

R Square measures how well the regression line fits the data: the closer the value is to 1, the better the fit.
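
For a regression with a single explanatory variable, R Square equals the square of the Pearson correlation coefficient r. The sketch below computes it that way for the advertising and sales data; it is an illustration under that assumption and does not reproduce Excel's Summary Output.

```python
# Marina data: advertising (X) and sales (Y), both in million pesos
advertising = [18, 23, 25, 28, 30, 33, 39, 47, 52, 61]
sales = [665, 758, 823, 1078, 1199, 1301, 1472, 1500, 1604, 1699]

n = len(advertising)
sum_x = sum(advertising)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in advertising)
sum_y2 = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(advertising, sales))

# r squared, written without square roots:
# (n*Sxy - Sx*Sy)^2 / ((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = (n * sum_xy - sum_x * sum_y) ** 2
denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
r_squared = numerator / denominator

print(f"R Square = {r_squared:.3f}")
```

A value near 1 would indicate that the line accounts for most of the variation in sales.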

Significance F and P-values

To check if the results are reliable or statistically significant, look at Significance F (0.000052). If this
value is less than 0.05, you're OK. It means that it is statistically significant. If Significance F is greater than
0.05, it's probably better to stop using this set of independent variables. Delete a variable with a high P-value
(greater than 0.05) and rerun the regression until Significance F drops below 0.05.

Coefficients
The regression line is: y = 24.992x + 320.175. In other words, for each unit increase in advertising,
sales increase by 24.992 units, which is the slope of the line. The intercept, 320.175, is the predicted sales when advertising is zero. This is important information.
*The same example was used in performing the regression analysis in MS Excel. There might be a slight
difference in the final answer because the manual computation rounds off intermediate values.
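
Once the line is fitted, it can be used to predict sales for a given advertising budget. The following is a minimal sketch using the Excel coefficients quoted above; the function name predict_sales and the budget of 40 million pesos are hypothetical, chosen because 40 lies inside the observed range of 18 to 61.

```python
SLOPE = 24.992        # change in sales per one-unit increase in advertising
INTERCEPT = 320.175   # predicted sales when advertising is zero

def predict_sales(advertising):
    """Predicted sales (million pesos) for an advertising budget (million pesos)."""
    return SLOPE * advertising + INTERCEPT

print(round(predict_sales(40), 3))                      # 1319.855
# Raising advertising by one unit raises the prediction by the slope
print(round(predict_sales(41) - predict_sales(40), 3))  # 24.992
```

Predictions for budgets far outside the observed range should be treated with caution, since the data say nothing about how sales behave there.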
