Data visualization (3)
• Data refers to raw facts, figures, or measurements collected about events, objects, or phenomena. It forms the
foundation for gaining insights and making informed decisions. Data can be qualitative, such as names or categories, or
quantitative, such as numerical values or measurements. However, raw data in its initial form is often vast, unstructured,
and challenging to interpret.
• Examples of Data:
• Names of students in a class.
• Heights of individuals in centimeters.
• Monthly sales of a product.
• Customer satisfaction ratings.
• Tabulating and visualizing data are crucial processes that simplify and organize complex datasets, making them
easier to understand and analyze.
• Tabulation systematically arranges data into rows and columns, highlighting key features and relationships, while
visualization uses charts, graphs, and diagrams to provide a quick, engaging overview of the information.
• Importance of Data Visualization: Data visualization involves using graphical methods to represent data, and it
plays a critical role in data analysis.
Organizing data
The process of organizing data related to a quantitative phenomenon typically involves the following stages:
• Raw data: A collection of individual observations in their original, unorganized form. For example, a list of exam
scores such as 45, 67, 89, 56, etc.
• Organized (Arrayed) data: Sorting raw data into ascending or descending order to make patterns more apparent.
For instance, arranging the scores as 45, 56, 67, 89.
• Discrete (Ungrouped) frequency distribution: Representing data by showing how often each individual value
occurs. For example, a table displaying the frequency of each exam score.
• Grouped frequency distribution: Combining data values into intervals or ranges (e.g., 40–49, 50–59) and showing
the frequency of observations within each range. This method is useful for summarizing large datasets.
• Continuous frequency distribution: Similar to grouped frequency distribution but used for continuous data,
where the intervals have no gaps (e.g., 40.5–49.5, 50.5–59.5). This is particularly relevant for measurements like
height or weight.
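The first three stages can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the score list is hypothetical (the four scores from the text, with a few made-up repeats so that the frequencies are not all 1).

```python
from collections import Counter

# Hypothetical raw data: 45, 67, 89, 56 come from the text, the repeats are made up
raw_scores = [45, 67, 89, 56, 67, 45, 67]

# Arrayed data: the same observations sorted in ascending order
arrayed = sorted(raw_scores)

# Discrete (ungrouped) frequency distribution: how often each value occurs
frequency = Counter(arrayed)

print("Arrayed:", arrayed)
for score in sorted(frequency):
    print(score, frequency[score])
```

Sorting first is not required for counting, but it mirrors the order of the stages listed above.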
• Let's express the data in the form of a discrete (ungrouped) frequency distribution:
Age   Tally bar   Frequency      Age   Tally bar   Frequency
30    ||          2              42    ||          2
31    |           1              43    ||          2
32    |||         3              44    ||          2
33    ||          2              45    |||         3
34    |||         3              46    |           1
35    ||||/       5              47    |||         3
36    ||          2              48    ||          2
37    |           1              49    ||          2
38    ||          2              50    ||||/       5
39    ||          2              51    |           1
40    |           1              52    |           1
41    |           1              53    |           1
• Identify class intervals: Start with the class intervals from the grouped frequency distribution. For example:
30–34, 35–39, 40–44, 45–49, 50–54.
• Adjust the limits of each class to make them continuous. This is done by subtracting half the gap between
consecutive classes from each lower limit and adding the same value to each upper limit. Here the gap is
35 − 34 = 1, so the adjustment is 0.5.
• For 30–34: new lower boundary = 30 − 0.5 = 29.5, new upper boundary = 34 + 0.5 = 34.5. Repeat this for all
intervals.
• Create continuous classes: The adjusted intervals will now be: 29.5–34.5, 34.5–39.5, 39.5–44.5, 44.5–49.5,
49.5–54.5.
Marks range Number of students
29.5–34.5 11
34.5–39.5 12
39.5–44.5 8
44.5–49.5 11
49.5–54.5 8
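The grouping and boundary adjustment above can be checked with a short Python sketch. It assumes the classes are given as inclusive integer limits and that every pair of neighbouring classes is separated by the same gap; the function name `to_continuous` is illustrative only.

```python
# Ungrouped frequencies copied from the tally table above (value -> frequency)
ungrouped = {30: 2, 31: 1, 32: 3, 33: 2, 34: 3, 35: 5, 36: 2, 37: 1,
             38: 2, 39: 2, 40: 1, 41: 1, 42: 2, 43: 2, 44: 2, 45: 3,
             46: 1, 47: 3, 48: 2, 49: 2, 50: 5, 51: 1, 52: 1, 53: 1}

class_limits = [(30, 34), (35, 39), (40, 44), (45, 49), (50, 54)]

# Grouped frequency: add up the frequencies of all values inside each class
grouped_freq = [
    sum(f for value, f in ungrouped.items() if low <= value <= high)
    for low, high in class_limits
]

def to_continuous(limits):
    """Turn inclusive class limits into continuous class boundaries."""
    # Half the gap between one class's upper limit and the next class's lower limit
    half_gap = (limits[1][0] - limits[0][1]) / 2
    return [(low - half_gap, high + half_gap) for low, high in limits]

boundaries = to_continuous(class_limits)

for (low, high), f in zip(boundaries, grouped_freq):
    print(f"{low}-{high}: {f}")
```

The printed classes and frequencies should match the table above, which is a quick way to confirm that no observation was lost or double-counted during grouping.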
• Choosing a suitable number of classes in a frequency distribution is essential to ensure the data is represented
in a balanced manner. Too few classes can oversimplify the data and obscure important details, while too many
classes can make the data difficult to interpret.
• To estimate a suitable number of classes, either of the following formulas can be used. They are equivalent,
since log₂ N ≈ 3.322 log₁₀ N:
• k = 1 + 3.322 log₁₀ N
• k = 1 + log₂ N
Here, k is the approximate number of classes and N is the total number of observations. Given data:
• Total number of observations, N = 50 (from the marks data in the previous example).
Using the formula: k = 1 + log₂ 50 = 1 + 5.64 = 6.64. Rounding k to the nearest whole number gives k ≈ 7.
Researchers may adjust the number of classes based on specific dataset characteristics and the objectives of their
analysis.
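As a quick check of the arithmetic, the calculation can be reproduced in Python. This is a minimal sketch; the function name `number_of_classes` is illustrative. The formula itself is commonly known as Sturges' rule.

```python
import math

def number_of_classes(n):
    """Approximate number of classes, k = 1 + log2(n)."""
    return 1 + math.log2(n)

n = 50
k_base2 = number_of_classes(n)
k_base10 = 1 + 3.322 * math.log10(n)  # the same rule written with base-10 logarithms

print(f"k (log2 form)  = {k_base2:.2f}")
print(f"k (log10 form) = {k_base10:.2f}")
print("rounded k      =", round(k_base2))
```

Both forms give essentially the same value, which is why either can be used.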
Histogram
• A histogram is a graphical representation of a grouped frequency distribution with continuous classes. It is an area
diagram: a set of adjacent rectangles whose bases lie along the intervals between class boundaries and whose areas
are proportional to the frequencies in the corresponding classes.
• Example: The following table represents the variable and its frequency distribution:
Variable range Frequency
10–20 15
21–30 23
31–40 9
41–50 36
51–60 53
61–70 48
71–80 60
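Because a histogram is an area diagram, the rectangle heights can be worked out from the table before drawing. The sketch below assumes the ranges in the table are inclusive limits with a gap of 1 between neighbouring classes, so it first converts them to continuous boundaries by adjusting each limit by 0.5. Under that assumption the first class (9.5–20.5) is slightly wider than the others, so the height is taken as frequency divided by class width rather than the raw frequency.

```python
class_limits = [(10, 20), (21, 30), (31, 40), (41, 50), (51, 60), (61, 70), (71, 80)]
frequencies = [15, 23, 9, 36, 53, 48, 60]

# Half the gap between one class's upper limit and the next class's lower limit
half_gap = (class_limits[1][0] - class_limits[0][1]) / 2

rows = []
for (low, high), freq in zip(class_limits, frequencies):
    lower, upper = low - half_gap, high + half_gap
    width = upper - lower
    density = freq / width  # rectangle height, chosen so that area = frequency
    rows.append((lower, upper, width, density))

for (lower, upper, width, density), freq in zip(rows, frequencies):
    bar = "#" * round(density * 5)  # crude text bar: 5 characters per unit of height
    print(f"{lower:>5}-{upper:<5} width={width:<5} f={freq:<3} height={density:.2f} {bar}")
```

When all classes have the same width, the heights are simply proportional to the frequencies and this extra division makes no visual difference.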
Ogive
• An ogive is a graphical representation of the cumulative frequency distribution of a dataset. It is used to determine
the number of observations below a particular value in the dataset and is particularly helpful in understanding
the distribution of data. It is drawn by plotting the cumulative frequencies against the upper class boundaries and
joining the points, which gives a non-decreasing curve that rises as the cumulative frequencies increase.
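The points of an ogive can be computed from any grouped table. The sketch below reuses the continuous classes 29.5–34.5, ..., 49.5–54.5 from the earlier example and builds a "less than" ogive, which is one common convention; the variable names are illustrative.

```python
from itertools import accumulate

lowest_boundary = 29.5
upper_boundaries = [34.5, 39.5, 44.5, 49.5, 54.5]
frequencies = [11, 12, 8, 11, 8]

# Running total of frequencies: [11, 23, 31, 42, 50]
cumulative = list(accumulate(frequencies))

# Ogive points: no observations lie below the lowest boundary, so start at 0
ogive_points = [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cum_freq in ogive_points:
    print(f"less than {boundary}: {cum_freq}")
```

Plotting these points with the boundaries on the X-axis and the cumulative frequencies on the Y-axis, then joining them, gives the ogive. The last point always equals the total number of observations.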
Scatter plot
• A scatter plot is a graph that presents the relationship between two variables in a dataset. It represents data
points on a two-dimensional plane (a Cartesian system). The independent variable or attribute is plotted on the
X-axis, while the dependent variable is plotted on the Y-axis. These plots are also called scatter graphs or scatter
diagrams.
• Here, a scatter plot is depicted where the X-axis shows the ages and the Y-axis shows the weights of 50 students.
Cross-Section data
• Cross-section data refers to data collected at a single point in time across multiple subjects or entities. It provides
a snapshot of different entities or variables at a particular moment.
• Examples: Income levels of households in a city in 2025, GDP of various countries for the year 2023, test scores of
students in a particular class on a single exam day.
Time series data
• Time series data refers to data collected over multiple time periods for a single subject or entity. It captures how a
variable changes over time.
• Examples: Monthly sales of a product from January to December 2024, temperature readings recorded daily over a
year, stock prices of a company observed over a week.
Practice questions
1. A fitness center has conducted a survey to record the number of push-ups completed by 50 participants in a single
session. The following data represents the recorded push-ups for each participant: 45, 50, 38, 52, 47, 60, 43, 55, 49,
41, 62, 51, 48, 58, 40, 44, 46, 54, 53, 57, 39, 42, 61, 59, 56, 45, 50, 37, 63, 64, 41, 47, 52, 48, 46, 58, 44, 60, 49, 53,
55, 43, 62, 57, 45, 50, 40, 48, 54, 56. Create an ungrouped frequency distribution table. Organize the data into a
grouped frequency distribution table. Use the following intervals for the class ranges: 37–40, 41–44, 45–48, 49–52,
53–56, 57–60, 61–64. Draw a histogram to represent the grouped frequency distribution.
2. A teacher conducted a math exam for 40 students in the class. The teacher wants to analyze the overall performance
of the students to identify trends and areas for improvement. The following marks (out of 100) were obtained by
the students: 48, 72, 65, 89, 54, 77, 61, 92, 68, 74, 59, 81, 66, 90, 55, 73, 50, 85, 78, 69, 64, 70, 58, 88, 82, 60, 67, 76,
71, 53, 62, 84, 57, 75, 49, 86, 80, 63, 56, 79. Group the marks into class intervals of width 10 and draw a histogram
and an ogive.
3. A sports academy conducted a survey to examine the relationship between the age and height of 20 participants.
The following data represents the observations:
Age (Years) Height (cm) Age (Years) Height (cm)
12 140 22 174
13 145 23 175
14 150 24 176
15 152 25 178
16 158 26 179
17 160 27 180
18 165 28 181
19 168 29 182
20 170 30 183
21 172 31 184
Using the given data, draw a scatter plot.
4. A retail store wants to analyze the monthly sales trend for one of its popular products over the past year. The data
below shows the sales (in units) for each month:
Month Sales (Units) Month Sales (Units)
January 120 July 170
February 130 August 160
March 140 September 150
April 150 October 140
May 160 November 180
June 170 December 190
Using the data provided, plot a time series graph with: Months on the x-axis and sales (in units) on the y-axis.