FTA-Module 1-Notes (1)

The document provides a comprehensive overview of data types, their importance, and the foundational concepts of analytics, including data understanding, exploratory data analysis (EDA), and statistics. It covers various data types such as tabular, graphical, image, and audio data, and emphasizes the significance of data quality, cleaning, and preprocessing techniques. Additionally, it discusses statistical methods, including descriptive and inferential statistics, and their applications in making informed decisions and understanding data variability.


Institutional Elective 1: Analytics Foundation – 6th Sem

Unit 1: Introduction to data, analytics and EDA

Introduction to Data Understanding


In today's data-driven world, understanding various types of data is essential for making
informed decisions and solving complex problems effectively. This course provides an overview
of different data types, emphasizes the importance of data understanding, and explores real-
world applications across various domains.
Overview of Data Types
Tabular Data: Tabular data is structured in rows and columns, similar to a spreadsheet. It is
commonly used for storing structured information, such as customer records, financial
transactions, or survey responses. Examples include CSV files, Excel spreadsheets, and database
tables.
Graphical Data: Graphical data represents information visually through charts, graphs, and
diagrams. It helps in conveying complex data patterns and trends in a digestible format.
Examples include line graphs, bar charts, pie charts, and scatter plots.
Image Data: Image data consists of visual information represented in the form of pixels. It is
commonly used in fields such as medical imaging, satellite imagery, and computer vision.
Examples include photographs, MRI scans, satellite images, and digital paintings.
Audio Data: Audio data represents sound information captured over time. It is utilized in
applications such as speech recognition, music analysis, and sound processing. Examples
include WAV files, MP3 files, speech recordings, and music tracks.
Importance of Data Understanding
Data understanding is crucial for decision-making and problem-solving processes for several
reasons:
Informed Decision Making: Understanding different data types enables individuals and
organizations to extract meaningful insights from their data, leading to informed decision-
making processes.

Risk Mitigation: Proper understanding of data helps in identifying potential risks and
uncertainties, allowing proactive measures to be taken to mitigate them.
Efficient Problem Solving: Data understanding facilitates the identification of relevant
information and patterns, making problem-solving processes more efficient and effective.
Quality Assurance: Understanding data ensures data quality and integrity, minimizing errors
and inaccuracies that could lead to flawed conclusions.
Real-world Applications of Different Data Types
Tabular Data: In finance, tabular data is used for analyzing stock prices, predicting market
trends, and managing investment portfolios. In healthcare, it is utilized for patient record
management, clinical trials analysis, and drug development.
Graphical Data: In marketing, graphical data is used for visualizing sales trends, customer
demographics, and advertising performance. In academia, it is employed for presenting
research findings, illustrating scientific concepts, and summarizing data analysis results.
Image Data: In agriculture, image data is used for monitoring crop health, assessing soil
conditions, and predicting yields. In security, it is utilized for facial recognition, object detection,
and surveillance.
Audio Data: In telecommunications, audio data is used for voice communication, speech
recognition, and voice-controlled devices. In entertainment, it is employed for music streaming,
audio editing, and sound production.
Understanding these data types empowers individuals and organizations to leverage their data effectively, driving innovation and creating value across various industries and domains.
Data Understanding and Interpretation
Data understanding and interpretation are fundamental steps in the data analysis process,
facilitating the extraction of meaningful insights and patterns from raw data. Exploratory Data
Analysis (EDA) techniques play a crucial role in this process by providing an initial overview of
the dataset's characteristics, including its structure, distributions, and relationships between
variables. Descriptive statistics, such as mean, median, mode, variance, and standard deviation,
offer quantitative summaries of the dataset's central tendency, dispersion, and shape. These
statistics help analysts gain a deeper understanding of the data's distribution and variability.

Additionally, data visualization techniques, such as those offered by Matplotlib, Seaborn, and
Plotly, enable the graphical representation of data, making it easier to identify trends, outliers,
and patterns visually. Understanding data distributions and patterns is essential for uncovering
underlying relationships and making informed decisions based on the data's insights, ultimately
driving effective problem-solving and decision-making processes.

Data understanding forms the foundation of effective analytics. Before diving into complex
analyses or building sophisticated models, it's crucial to thoroughly understand the data you're
working with. This guide will walk you through the essential steps and considerations for
gaining a deep understanding of data in analytics.
1. Data Profiling:
Start by profiling your dataset to get an overview of its characteristics.
Examine the size, structure, and data types to understand the scope of your dataset.
Identify any missing values, outliers, or inconsistencies that may impact your analysis.

2. Exploratory Data Analysis (EDA):


EDA involves visually exploring the data to uncover patterns, trends, and relationships.
Utilize summary statistics, data visualization techniques, and correlation analysis to gain
insights.
Look for distributions, anomalies, and interesting relationships that may inform your analysis.

3. Data Quality Assessment:


Assess the quality of your data to ensure its reliability and accuracy.
Address missing values, outliers, and inconsistencies through data cleaning and preprocessing.
Pay attention to data integrity issues that could affect the validity of your analysis.

4. Documentation:


Document your findings from the data understanding process for transparency and
reproducibility.

Include descriptions of data sources, data dictionaries, and any data quality issues encountered.
Document your assumptions, methodologies, and initial insights to guide your analysis.

5. Integration of Domain Knowledge:


Incorporate domain knowledge into your analysis to provide context and validate assumptions.
Consult subject matter experts to gain insights into the data and identify relevant variables or
features.
Use domain expertise to interpret findings and draw meaningful conclusions.

6. Data Governance and Compliance:


Ensure compliance with regulatory requirements and organizational policies when working with
data.
Consider data governance, security, privacy, and ethical considerations throughout the analysis
process.
Adhere to best practices for handling sensitive data and protecting individual privacy rights.
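The profiling, EDA, and quality-assessment steps above can be sketched in a few lines of pandas. The dataset below is hypothetical, invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, invented purely for illustration
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 29],
    "income": [48000, 61000, 55000, None, 52000, 52000],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi", "Pune", "Pune"],
})

# 1. Profiling: size, structure and data types
print(df.shape)    # (rows, columns)
print(df.dtypes)

# 2. EDA: summary statistics of the numeric columns
print(df.describe())

# 3. Quality assessment: missing values and duplicate rows
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
```

The same checks scale to real datasets; only the column names change.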

Cleaning and Preprocessing Data


Cleaning and preprocessing data are critical steps in the data analysis process, ensuring that the
data is accurate, reliable, and suitable for further analysis or modeling.

Handling Missing Values, Outliers, and Duplicates:
Missing Values: Data may contain missing values, which can adversely affect analysis and
modeling. Common techniques for handling missing values include:
Removing rows or columns with missing values if they are negligible in quantity.
Imputing missing values using statistical measures such as mean, median, mode, or predictive
imputation.
Outliers: Outliers are data points that significantly deviate from the rest of the dataset.
Techniques for handling outliers include:
Identifying outliers using statistical methods (e.g., Z-score, IQR) and visual inspection.
Treating outliers by winsorizing (capping or flooring extreme values) or transforming them.
Duplicates: Duplicates are identical rows or observations within the dataset. Removing
duplicates ensures data integrity and avoids biases in analysis.
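A minimal sketch of outlier detection with the IQR rule and duplicate removal in pandas, using made-up values (the 1.5 × IQR multiplier is the usual convention):

```python
import pandas as pd

# Made-up measurements with one obvious outlier and repeated rows
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 120, 12]})

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]

# Removing duplicate rows keeps only the first occurrence
deduped = df.drop_duplicates()
```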
Data Transformation:
Normalization: Normalization scales numerical features to a standard range, typically between
0 and 1, making them comparable. It prevents features with larger scales from dominating the
model.
Standardization: Standardization transforms numerical features to have a mean of 0 and a
standard deviation of 1. It maintains the shape of the distribution and is useful when features
have different scales.
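Both scalings can be written directly with NumPy; the sample values here are arbitrary:

```python
import numpy as np

# Arbitrary sample values for illustration
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescales to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: mean 0 and standard deviation 1
x_std = (x - x.mean()) / x.std()
```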
Feature Engineering:
Feature engineering involves creating new features or transforming existing ones to improve
model performance or capture additional information from the data.
Techniques include:
Creating interaction terms or polynomial features to capture nonlinear relationships.
Encoding categorical variables into numerical representations using techniques like one-hot
encoding or label encoding.
Extracting information from date/time variables, text data, or spatial data to create meaningful
features.
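One-hot and label encoding can be sketched with pandas on a hypothetical colour column:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: one integer code per category (alphabetical order here)
df["colour_code"] = df["colour"].astype("category").cat.codes
```

One-hot encoding avoids imposing an artificial order on the categories, at the cost of extra columns.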

Data Imputation Methods:
Mean, Median, Mode Imputation: Imputing missing values with the mean, median, or mode of
the respective feature.
Predictive Imputation: Using machine learning algorithms to predict missing values based on
the values of other features. Techniques such as K-nearest neighbors (KNN), linear regression,
or decision trees can be employed for predictive imputation.
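Mean, median, and mode imputation can be sketched with pandas `fillna`; the series below is invented for illustration:

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0, None, 50.0])

# Mean imputation: replace NaN with the column mean
mean_filled = s.fillna(s.mean())

# Median imputation: more robust when outliers are present
median_filled = s.fillna(s.median())

# Mode imputation, typically used for categorical data
c = pd.Series(["a", "b", None, "a"])
mode_filled = c.fillna(c.mode()[0])
```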

Data engineering
Data engineering forms the backbone of any analytics endeavor, laying the groundwork for
collecting, processing, and preparing data for analysis. Here's a concise guide to the basics of
data engineering for students embarking on their analytics journey.

1. Data Collection:
Data engineering starts with collecting data from various sources like databases, APIs, files, or
sensors.
Designing efficient data pipelines to extract, ingest, and aggregate data is key.

2. Data Storage:
Once collected, data needs storage in systems like relational databases, NoSQL databases, data
lakes, or cloud storage services.

3. Data Processing:
Transforming raw data into a usable format for analysis is done through data processing.
Distributed computing frameworks like Apache Spark or Apache Flink are used for efficient
processing of large datasets.

4. Data Integration:
Integrating data from different sources into a centralized repository ensures a unified view.

ETL (Extract, Transform, Load) processes are designed for this purpose, ensuring data
consistency and quality.

5. Data Governance and Security:


Data governance frameworks and security measures are implemented to manage and protect
data in compliance with regulations and policies.

6. Scalability and Performance:


Data engineering solutions must be scalable to handle growing data volumes and
computational demands.
Optimizing pipelines and storage systems ensures performance, reliability, and cost-
effectiveness.

7. Monitoring and Maintenance:


Continuous monitoring and maintenance are essential for ensuring system reliability,
availability, and performance.
Monitoring tools and metrics help detect issues and optimize system performance.

STATISTICS
Statistics is a science of facts and figures which may be readily available or obtained through
the process of direct enquiry or enumeration. It deals with the methods of collecting,
classifying and analyzing the data so as to draw some valid conclusions. Statistics, as a branch of
mathematics, serves as a foundational framework for collecting, organizing, analyzing,
interpreting, and presenting data. Its applications span across diverse fields, including science,
business, economics, medicine, and social sciences, where data-driven decision-making is
paramount.
Need for Statistics and Exploratory Data Analysis (EDA):

 Understanding Variation: Statistics helps us understand and quantify the variation
inherent in data. By analyzing variability, we can identify patterns, trends, and
relationships that provide insights into real-world phenomena.
 Making Informed Decisions: Statistics provides tools and techniques for making
informed decisions based on data evidence rather than intuition or anecdotal evidence.
It enables us to draw reliable conclusions and predictions from empirical observations.
 Quality Improvement: In fields such as manufacturing and healthcare, statistics is used
for quality improvement initiatives. It helps identify areas for improvement, monitor
processes, and make data-driven decisions to enhance quality and efficiency.
 Risk Assessment and Management: Statistics plays a vital role in risk assessment and
management by quantifying uncertainty and identifying potential risks. It helps
businesses and organizations make informed decisions to mitigate risks and maximize
opportunities.
 Exploratory Data Analysis (EDA): EDA is an essential step in the data analysis process
that involves summarizing the main characteristics of a dataset, often through visual
and numerical techniques. It helps analysts understand the structure of the data, detect
anomalies, and generate hypotheses for further investigation.

Descriptive statistics
Descriptive statistics serve as a fundamental tool for summarizing and conveying the main
characteristics of a dataset. By analyzing descriptive statistics, researchers gain valuable insights
into the central tendency, variability, and distribution of the data. Measures such as the mean,
median, and mode offer insights into the typical or central value of the dataset, while measures
of dispersion like variance and standard deviation quantify the spread or variability of the data
points around the central tendency. Additionally, descriptive statistics help to characterize the
distribution of the data, including its shape, skewness, and kurtosis, providing further
understanding of the dataset's underlying structure. Overall, descriptive statistics play a vital
role in providing a concise and informative summary of the dataset, facilitating data
interpretation and decision-making processes.

Inferential Statistics
Inferential statistics involve making inferences or generalizations about a population based on
sample data. It allows researchers to draw conclusions, make predictions, and test hypotheses
about population parameters using sample statistics. Common inferential techniques include:
 Hypothesis Testing: A statistical method used to assess whether observed differences or
effects are statistically significant or occurred by chance. It involves formulating null and
alternative hypotheses, selecting an appropriate test statistic, and determining the
probability of observing the test statistic under the null hypothesis (p-value).
 Confidence Intervals: Confidence intervals provide a range of values within which the
true population parameter is likely to lie with a certain level of confidence. They are
constructed based on sample statistics and provide a measure of uncertainty around the
point estimate.
 Regression Analysis: Regression analysis is used to model the relationship between one
or more independent variables (predictors) and a dependent variable (outcome). It
helps in predicting the value of the dependent variable based on the values of the
independent variables.
 Analysis of Variance (ANOVA): ANOVA is used to compare means across multiple
groups or treatments to determine whether there are statistically significant differences
between them. It assesses whether the variability between groups is greater than the
variability within groups.
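As one worked illustration of a confidence interval, the sketch below uses a hypothetical sample and the normal critical value 1.96 for an approximate 95% interval (a t critical value would be more exact for small samples):

```python
import math
import statistics

# Hypothetical sample of 12 measurements, for illustration only
sample = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54, 50, 52]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# Approximate 95% confidence interval for the population mean
ci = (mean - 1.96 * se, mean + 1.96 * se)
```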

Basic Definitions and Formulae in Statistics

Classification
The significance of a large mass of statistical data known as raw data cannot be understood
unless it is arranged in some definite manner. The process of arrangement in different groups
is called classification.
Frequency table

It is a tabular arrangement consisting of various classes of uniform size known as class intervals
and the number in each class known as frequency.
The difference between the boundaries of two consecutive classes is known as the width of the class, usually denoted by h. The average of the left and right end points of a class interval is known as the midpoint of the class, usually denoted by x.
Mean, Variance and Standard Deviation

(i) If x₁, x₂, …, xₙ is a set of n values of a variate x, then the mean, denoted by x̄, the variance, denoted by V, and the standard deviation, denoted by σ, are defined as follows:

 Mean = x̄ = Σx / n
 Variance = V = Σ(x − x̄)² / n  (or)  V = Σx² / n − x̄²
 Standard Deviation = σ = √V

(ii) For grouped data in the form of a frequency distribution:

 Mean = x̄ = Σfx / Σf
 Variance = V = Σf(x − x̄)² / Σf  (or)  V = Σx²f / Σf − x̄²
 Standard Deviation = σ = √V
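The ungrouped and grouped formulas translate directly into code; this is a plain-Python sketch using the shortcut form V = Σx²/n − x̄²:

```python
def stats_ungrouped(xs):
    # Mean, variance and SD for raw values: V = (sum of x^2)/n - mean^2
    n = len(xs)
    mean = sum(xs) / n
    var = sum(x * x for x in xs) / n - mean ** 2
    return mean, var, var ** 0.5

def stats_grouped(mids, freqs):
    # Mean, variance and SD for a frequency distribution of midpoints
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(mids, freqs)) / n
    var = sum(x * x * f for x, f in zip(mids, freqs)) / n - mean ** 2
    return mean, var, var ** 0.5
```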

Standard Deviation of the Combination of two groups

If m₁, σ₁ are the mean and standard deviation of a sample of size n₁, and m₂, σ₂ are the mean and standard deviation of a sample of size n₂, then the standard deviation σ of the combined sample of size n₁ + n₂ is given by

(n₁ + n₂)σ² = n₁σ₁² + n₂σ₂² + n₁D₁² + n₂D₂²

where Dᵢ = mᵢ − M, M being the mean of the combined sample.
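The combined-variance identity can be checked numerically; a sketch:

```python
def combined_variance(n1, m1, s1, n2, m2, s2):
    # (n1+n2)*var = n1*s1^2 + n2*s2^2 + n1*D1^2 + n2*D2^2, with Di = mi - M
    M = (n1 * m1 + n2 * m2) / (n1 + n2)   # combined mean
    d1, d2 = m1 - M, m2 - M
    return (n1 * s1**2 + n2 * s2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2)
```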

Median

(i) If the values of a variable are arranged in the ascending or descending order of
magnitude:

 The median is the middle item if the number of items is odd.

 The median is the mean of the two middle items if the number of items is even.

Thus the median is the mid-value, i.e. the value which divides the total frequency into two equal parts.

(ii) For the grouped data:

Median = L + (H/F)(N/2 − C)

where L = lower limit of the median class
 H = width of the median class
 F = frequency of the median class
 N = total frequency
 C = cumulative frequency up to the class preceding the median class
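A sketch of the grouped-median formula in code, taking the classes as (lower, upper) pairs:

```python
def grouped_median(classes, freqs):
    # classes: (lower, upper) pairs in order; freqs: matching frequencies
    N = sum(freqs)
    cum = 0  # cumulative frequency up to the preceding class (C in the formula)
    for (L, U), F in zip(classes, freqs):
        if cum + F >= N / 2:           # this is the median class
            H = U - L
            return L + (H / F) * (N / 2 - cum)
        cum += F
```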

Mode

(i) The mode is defined as that value of the variable which occurs most frequently (ie)
the value of the maximum frequency.

(ii) For the grouped distribution it is given by the formula

Mode = L + H f₁ / (f₁ + f₂)

where L = lower limit of the modal class
 f₁ = excess of modal frequency over frequency of preceding class
 f₂ = excess of modal frequency over frequency of following class
 H = width of the modal class
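The grouped-mode formula, sketched the same way:

```python
def grouped_mode(classes, freqs):
    # Mode = L + H*f1/(f1 + f2) for the class with maximum frequency
    i = freqs.index(max(freqs))        # index of the modal class
    L, U = classes[i]
    f1 = freqs[i] - (freqs[i - 1] if i > 0 else 0)               # excess over preceding class
    f2 = freqs[i] - (freqs[i + 1] if i + 1 < len(freqs) else 0)  # excess over following class
    return L + (U - L) * f1 / (f1 + f2)
```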

PROBLEMS:

1. Find the (i) mean, (ii) median, (iii) mode and (iv) standard deviation of a set of
observations: 6, 8, 7, 5, 4, 9, 3, 3, 3

Solution:

(i) Mean = x̄ = Σx/n = (6 + 8 + 7 + 5 + 4 + 9 + 3 + 3 + 3)/9 = 5.333

(ii) set of observations in ascending order: 3 , 3 , 3 , 4 , 5 , 6 , 7 , 8 , 9


Median = middle term = 5

(iii) Mode = the value which occurs most frequently = 3

(iv) Standard Deviation = σ = √V

Variance = V = Σx²/n − x̄²
V = (1/9)(6² + 8² + 7² + 5² + 4² + 9² + 3² + 3² + 3²) − 5.333² = 4.6667

Standard Deviation = σ = √V = √4.6667 = 2.1603
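Python's standard `statistics` module reproduces these results (note that `pvariance` and `pstdev` use the population formulas, matching the notes):

```python
import statistics

data = [6, 8, 7, 5, 4, 9, 3, 3, 3]

mean = statistics.mean(data)       # 5.333
median = statistics.median(data)   # 5, the middle of the sorted values
mode = statistics.mode(data)       # 3, the most frequent value
var = statistics.pvariance(data)   # population variance, 4.6667
sd = statistics.pstdev(data)       # 2.1603
```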

2. Find the (i) mean, (ii) median, (iii) mode and (iv) standard deviation for the following
grouped data
Mid value 5 10 15 20 25 30 35
Frequency 2 5 8 10 7 4 1

Solution:
𝒙 𝒇 𝒙𝒇 𝒙𝟐 𝒇 C.F
5 2 10 50 2
10 5 50 500 7
15 8 120 1800 15
20 10 200 4000 25
25 7 175 4375 32
30 4 120 3600 36
35 1 35 1225 37
∑ 𝒇 = 37 ∑ 𝒇𝒙 = 710 ∑ 𝒙𝟐 𝒇 = 15550

(i) Mean = x̄ = Σfx/Σf = 710/37 = 19.1892

(ii) Here N/2 = 37/2 = 18.5. The cumulative frequency first reaches 18.5 at x = 20, hence the Median is 20.

(iii) The value of x corresponding to the maximum frequency 10 is x = 20. Hence the Mode is 20.

(iv) Standard Deviation = σ = √V

Variance = V = Σx²f/Σf − x̄² = 15550/37 − 19.1892² = 52.0453

Standard Deviation = σ = √V = √52.0453 = 7.2142
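The grouped-data mean, variance and standard deviation above can be checked in a few lines:

```python
x = [5, 10, 15, 20, 25, 30, 35]   # mid values
f = [2, 5, 8, 10, 7, 4, 1]        # frequencies

n = sum(f)                                                    # 37
mean = sum(xi * fi for xi, fi in zip(x, f)) / n               # 19.1892
var = sum(xi**2 * fi for xi, fi in zip(x, f)) / n - mean**2   # 52.0453
sd = var ** 0.5                                               # 7.2142
```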


3. Find the (i) mean, (ii) median, (iii) mode and (iv) standard deviation for the following
grouped data
Class 0 - 10 10 - 20 20 - 30 30 - 40 40 - 50 50 - 60
Frequency 3 16 26 31 16 8

Solution:
Class 𝒇 𝒙 𝒇𝒙 𝒙𝟐 𝒇 C.F
0 - 10 3 5 15 75 3
10 - 20 16 15 240 3600 19
20 - 30 26 25 650 16250 45
30 - 40 31 35 1085 37975 76
40 - 50 16 45 720 32400 92
50 - 60 8 55 440 24200 100
∑ 𝒇 = 100 ∑ 𝒇𝒙 =3150 ∑ 𝒙𝟐 𝒇 = 114500

(i) Mean = x̄ = Σfx/Σf = 3150/100 = 31.5

(ii) Here N/2 = 100/2 = 50, which falls in the interval 30 – 40.

Hence L = 30, H = 10, F = 31, C = 45

Median = L + (H/F)(N/2 − C) = 30 + (10/31)(50 − 45) = 31.6129
(iii) The maximum frequency 31 falls in the interval 30 – 40.
Hence L = 30, H = 10, f₁ = 31 − 26 = 5, f₂ = 31 − 16 = 15

Mode = L + H f₁/(f₁ + f₂) = 30 + 10(5)/(5 + 15) = 32.5
(iv) Standard Deviation = σ = √V

Variance = V = Σx²f/Σf − x̄² = 114500/100 − 31.5² = 152.75

Standard Deviation = σ = √V = √152.75 = 12.3592


4. The scores obtained by two batsmen X and Y in 10 matches are given below. Calculate the mean, standard deviation and coefficient of variation for each batsman. Who is the better score-getter and who is more consistent?

X 30 44 66 62 60 34 80 46 20 38
Y 34 46 70 38 55 48 60 34 45 30

Solution:

𝑿 𝒀 𝑿𝟐 𝒀𝟐
30 34 900 1156
44 46 1936 2116
66 70 4356 4900
62 38 3844 1444
60 55 3600 3025
34 48 1156 2304
80 60 6400 3600
46 34 2116 1156
20 45 400 2025
38 30 1444 900
∑ 𝑿 = 480 ∑ 𝒀 = 460 ∑ 𝑿𝟐 = 26152 ∑ 𝒀𝟐 = 22626

Mean of X = X̄ = ΣX/n = 480/10 = 48
Mean of Y = Ȳ = ΣY/n = 460/10 = 46
Variance of X = σ₁² = ΣX²/n − X̄² = 26152/10 − 48² = 311.2
Variance of Y = σ₂² = ΣY²/n − Ȳ² = 22626/10 − 46² = 146.6
Standard Deviation of X = σ₁ = √311.2 = 17.6409
Standard Deviation of Y = σ₂ = √146.6 = 12.1078
Coefficient of variation of X = (σ₁/X̄) × 100 = (17.6409/48) × 100 = 36.7519
Coefficient of variation of Y = (σ₂/Ȳ) × 100 = (12.1078/46) × 100 = 26.3213

Since the mean of X is greater than the mean of Y, X is the better score-getter. Since the coefficient of variation of Y is less than that of X, Y is more consistent than X.
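The comparison can be checked numerically with a small helper (population standard deviation, as in the notes):

```python
def coefficient_of_variation(scores):
    # CV = (sd / mean) * 100, with population standard deviation
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum(s * s for s in scores) / n - mean ** 2) ** 0.5
    return sd / mean * 100

X = [30, 44, 66, 62, 60, 34, 80, 46, 20, 38]
Y = [34, 46, 70, 38, 55, 48, 60, 34, 45, 30]
cv_x = coefficient_of_variation(X)   # about 36.75
cv_y = coefficient_of_variation(Y)   # about 26.32
```

The lower coefficient of variation belongs to Y, which marks Y as the more consistent batsman.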

5. An analysis of monthly wages paid to the workers of two companies X and Y belonging to the same industry gives the following results:

Company  Number of workers  Mean monthly wages  Variance
X        500                186                 81
Y        600                175                 100

(i) Which company has the larger wage bill?

(ii) In which company is there greater variability in individual wages?

(iii) Calculate the mean and standard deviation of the wages of all the workers in the two companies taken together.

Solution:
No. of workers in company X = n₁ = 500
Mean of X = X̄ = ΣX/n₁ = 186, so ΣX = n₁X̄ = 500(186) = 93000

No. of workers in company Y = n₂ = 600
Mean of Y = Ȳ = ΣY/n₂ = 175, so ΣY = n₂Ȳ = 600(175) = 105000

Variance of X = σ₁² = 81
Variance of Y = σ₂² = 100
Standard Deviation of X = σ₁ = √81 = 9
Standard Deviation of Y = σ₂ = √100 = 10
Coefficient of variation of X = (σ₁/X̄) × 100 = (9/186) × 100 = 4.84
Coefficient of variation of Y = (σ₂/Ȳ) × 100 = (10/175) × 100 = 5.71

(i) Since ΣX is less than ΣY, company Y has the larger wage bill.

(ii) Since the coefficient of variation of Y is greater than that of X, company Y has greater variability in individual wages.

(iii) Combined mean = (n₁X̄ + n₂Ȳ)/(n₁ + n₂) = (500(186) + 600(175))/(500 + 600) = 180

Combined S.D.:

σ² = (1/(n₁ + n₂)) (n₁σ₁² + n₂σ₂² + n₁D₁² + n₂D₂²)
σ² = (1/1100) (500(81) + 600(100) + 500(186 − 180)² + 600(175 − 180)²)
σ² = 121.3636

∴ σ = √121.3636 = 11.0165

Correlation and Regression

Correlation and correlation coefficient

Co-variation of two magnitudes is known as correlation. If two variables x and y are related in such a way that an increase or decrease in one of them corresponds to an increase or decrease in the other, we say that the variables are positively correlated. If an increase or decrease in one of them corresponds to a decrease or increase in the other, the variables are said to be negatively correlated.

The numerical measure of correlation between x and y is known as Pearson's coefficient of correlation, usually denoted by r, and is defined as follows:

r = Σ(x − x̄)(y − ȳ) / (n σx σy)

Alternative formulae for r:

1. r = ΣXY / (√ΣX² √ΣY²), where X = x − x̄, Y = y − ȳ, σx² = Σ(x − x̄)²/n and σy² = Σ(y − ȳ)²/n

2. r = (σx² + σy² − σz²) / (2 σx σy), where z = x − y

Properties
 The coefficient of correlation numerically does not exceed unity.

Covariance:
Let the corresponding values of two variables X and Y be given in ordered pairs; then the covariance between X and Y, denoted by Cov(X, Y), is defined as

Cov(X, Y) = Σ(x − x̄)(y − ȳ) / n

Alternative formulae for Cov(X, Y):

1. Cov(X, Y) = E(XY) − E(X) E(Y), where E(XY), E(X), E(Y) are the corresponding means
2. Cov(X, Y) = r σx σy

Properties
 If X and Y are independent then Cov(X, Y) = 0

Note
If r = ±1, we say that x and y are perfectly correlated, and if r = 0, we say that x and y are uncorrelated.
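The defining formulas for r and Cov(X, Y) translate directly into a short function; a sketch:

```python
def pearson(xs, ys):
    # r = cov / (sx * sy), with cov = sum((x - mx)(y - my)) / n
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5   # population SD of x
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5   # population SD of y
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy), cov
```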

PROBLEMS:
Q. Compute the coefficient of correlation and Covariance
x 1 2 3 4 5 6 7
y 9 8 10 12 11 13 14
Solution:
(i) We have r = (σx² + σy² − σz²) / (2 σx σy), where z = x − y

x y z = x − y x² y² z²
1 9 -8 1 81 64
2 8 -6 4 64 36
3 10 -7 9 100 49
4 12 -8 16 144 64
5 11 -6 25 121 36
6 13 -7 36 169 49
7 14 -7 49 196 49
Σx = 28  Σy = 77  Σz = -49  Σx² = 140  Σy² = 875  Σz² = 347

x̄ = Σx/n = 28/7 = 4 ;  σx² = Σx²/n − x̄² = 140/7 − 4² = 4
ȳ = Σy/n = 77/7 = 11 ;  σy² = Σy²/n − ȳ² = 875/7 − 11² = 4
z̄ = Σz/n = −49/7 = −7 ;  σz² = Σz²/n − z̄² = 347/7 − (−7)² = 0.57

Therefore r = (4 + 4 − 0.57) / (2(2)(2)) = 0.93

(ii) Cov(X, Y) = r σx σy = (0.93)(2)(2) = 3.72

Q. Obtain the coefficient of correlation for the following data

x 1 3 4 2 5 8 9 10 13 15
y 8 6 10 8 12 16 16 10 32 32

Solution:
Here x̄ = 70/10 = 7 and ȳ = 150/10 = 15.

x y X = x − x̄ Y = y − ȳ X² Y² XY
1 8 -6 -7 36 49 42
3 6 -4 -9 16 81 36
4 10 -3 -5 9 25 15
2 8 -5 -7 25 49 35
5 12 -2 -3 4 9 6
8 16 1 1 1 1 1
9 16 2 1 4 1 2
10 10 3 -5 9 25 -15
13 32 6 17 36 289 102
15 32 8 17 64 289 136
Σx = 70  Σy = 150  ΣX² = 204  ΣY² = 818  ΣXY = 360

Coefficient of correlation:

r = ΣXY / (√ΣX² √ΣY²) = 360 / (√204 × √818) = 0.88
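The computation can be verified directly from the raw data:

```python
x = [1, 3, 4, 2, 5, 8, 9, 10, 13, 15]
y = [8, 6, 10, 8, 12, 16, 16, 10, 32, 32]

n = len(x)
mx, my = sum(x) / n, sum(y) / n                         # 7 and 15
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))    # sum of XY = 360
sxx = sum((a - mx) ** 2 for a in x)                     # sum of X^2 = 204
syy = sum((b - my) ** 2 for b in y)                     # sum of Y^2 = 818
r = sxy / (sxx * syy) ** 0.5                            # about 0.88
```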

Problems for practice

1. Find the mean of the following by


Number 8 10 15 20
Frequency 5 8 8 4

2. The following is the frequency distribution of a random sample of weekly earnings of the employees. Calculate the average weekly earnings.

Weekly 10 12 14 16 18 20 22 24 26 28 30 32
earning
No. of 3 6 10 15 24 42 75 90 79 55 36 26
employees

3. Find the mean of the following

Class 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50
Frequency 7 8 20 10 5

4. Find the mean of the following

Class 0–8 8 – 16 16 – 24 24 – 32 32 - 40 40 - 48
Frequency 8 7 16 24 15 7

5. The total sales (in thousands) of a particular item in a shop on 10 consecutive days are reported by a clerk as 35, 29.6, 38, 30, 40, 41, 42, 45, 3.6, and 3.8. Calculate the average. Later it was found that the reports for the 4th to 8th days were 10 more than the true values, and that in the last 2 days the decimal point was put in the wrong place (for example, 3.6 was really 36). Calculate the true mean value.

6. For the two frequency distributions given below, the mean calculated from the first was 25.4 and that of the second was 32.5. Find the values of x and y.

Class 10 – 20 20 – 30 30 – 40 40 - 50 50 – 60
Frequency - 1 20 15 10 𝑥 𝑦
Frequency - 2 4 8 4 2𝑥 𝑦

7. Find the median of the following data

Number 1 2 3 4 5 6 7 8 9
Frequency 8 10 11 16 20 25 15 9 6

8. Find the median of the following data

Number 5 10 15 20 25 30 35 40 45
Frequency 29 224 465 582 634 644 650 653 655

9. Find the median of the following

Class 20 – 30 30 – 40 40 - 50 50 – 60 60 - 70
Frequency 3 5 20 10 5

10. A number of particular articles has been classified according to their weight. After drying for two weeks the same articles have again been weighed and similarly classified. It is known that the median weight in the first weighing was 20.8302, while in the second weighing it was 17.3502. Some frequencies a and b in the first weighing, and x and y in the second, are missing. It is given that a = x/3 and b = y/2. Find the missing frequencies.

Class 0–5 5 – 10 10 – 15 15 - 20 20 – 25 25 - 30
Frequency - 1 𝑎 𝑏 11 52 75 22
Frequency - 2 𝑥 𝑦 40 50 30 28

11. In a factory employing 3000 persons, 5% earn less than Rs. 3 per hour, 580 earn from Rs. 3.01 to Rs. 4.50 per hour, 30% earn from Rs. 4.51 to Rs. 6 per hour, 500 earn from Rs. 6.01 to Rs. 7.50 per hour, 20% earn from Rs. 7.51 to Rs. 9 per hour, and the rest earn Rs. 9.01 or more per hour. What is the median wage?

12. According to the census of 2021, the following are the population figures in thousands of 20
cities: 2000, 1180, 1785, 1500, 560, 782, 1200, 385, 1123, 222, 2001, 1178, 1780, 1550, 559,
780, 1250, 390, 1120, 225. Find the median.

13. Find the mode of the following data

Number 1 2 3 4 5 6 7 8
Frequency 4 9 16 25 22 15 7 3

14. The median and mode are given to be Rs. 25 and Rs. 24 respectively. Calculate the missing
frequency.

Class 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50
Frequency 14 𝑥 27 𝑦 15

15. Find the mode of the following distribution

Class 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50 50 - 60 60 - 70
Frequency 5 8 7 12 28 20 10

16. The median and mode of the following wages are known to be Rs. 33.5 and Rs. 34
respectively. Find the value of 𝑥, 𝑦 and 𝑧. Given total frequency is 230.

Class 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50 50 - 60 60 - 70
Frequency 4 16 𝑥 𝑦 𝑧 6 4

17. Calculate the mode from the following frequency distribution by the method of grouping

Number 4 5 6 7 8 9 10 11 12 13
Frequency 2 5 8 9 12 14 14 15 11 13

18. Calculate the standard deviation from the following frequency distribution

Number 6 7 8 9 10 11 12
Frequency 3 6 9 13 8 5 4

19. For a group of 200 candidates, the mean and standard deviation of scores were found to be 40 and 15 respectively. Later it was discovered that the scores 43 and 35 were misread as 34 and 53 respectively. Find the corrected standard deviation corresponding to the corrected figures.

20. Compute the standard deviation for the following data

Class interval 0 – 99  100 – 199  200 – 299  300 – 399  400 – 499  500 – 599  600 – 699  700 – 799
Frequency      10      54         184        264        246        40         1          1

21. The first group of the two samples has 100 items with mean 15 and standard deviation 3. If
the whole group has 250 items with mean 15.6 and standard deviation √13.44. Find the
standard deviation of the second group.

22. The number examined, the mean weight and the standard deviation of each group in a medical examination are given below. Find the mean weight and standard deviation of the entire data when grouped together.

Medical examination  Number examined  Mean weight (lbs)  Standard deviation (lbs)
A                    50               113                6
B                    60               120                7

23. Find the correlation co-efficient between 𝑥 and 𝑦 from the given data:

X 21 23 30 54 57 58 72 78 87 90
Y 60 71 72 83 110 84 100 92 113 135

24. Find the correlation co-efficient between 𝑥 and 𝑦 from the given data:

𝑥 78 89 97 69 59 79 68 57
𝑦 125 137 156 112 107 138 123 108
25. Calculate the covariance of the following pairs of observation of two variables:
(10,35), (15,20), (20,30), (25,30), (30,35), (35,38), (40,42), (45,30), (50,40)

26. Find the Covariance by using co-efficient of correlation between industrial production and
export using the following data and comment on the result.

Production (in tons) 55 56 58 59 60 60 62


Exports(in tons) 35 38 38 39 44 43 45

27. Find the covariance for the data given below

𝑥 98 87 90 85 95 75
𝑦 15 12 10 10 16 7

28. Calculate the Covariance by using correlation co-efficient for the following heights in inches
of fathers (x) and their sons (y).

x 65 66 67 67 68 69 70 72
y 67 68 65 68 72 72 69 71

*********
