Statistics for Data Science
What is Statistics?
Statistics is a fundamental component of data science, providing methods for collecting, analyzing,
interpreting, and presenting data. It is broadly classified into descriptive and inferential statistics.
Descriptive statistics focus on summarizing data using measures such as mean, median, mode,
variance, and standard deviation, along with visual tools like histograms and box plots. Inferential
statistics, on the other hand, help in making predictions and generalizations about a population
based on a sample, using techniques like hypothesis testing, regression analysis, and confidence
intervals. Probability theory is also an essential part of statistics in data science, helping to model
uncertainties and make data-driven predictions. Concepts such as probability distributions (normal,
binomial, Poisson), Bayes’ theorem, and the central limit theorem are widely used in statistical
analysis. In data science, statistical techniques are employed in exploratory data analysis (EDA),
hypothesis testing, regression modeling, and time series analysis to uncover patterns and
relationships within data. Moreover, statistics plays a crucial role in machine learning, aiding in feature
selection, model evaluation, and understanding the bias-variance tradeoff to build robust predictive
models. By leveraging statistical methods, data scientists can make informed decisions, validate
hypotheses, and develop accurate machine learning algorithms. Understanding statistics is,
therefore, essential for deriving meaningful insights from data and ensuring the reliability of
analytical models.
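As a quick illustration of the descriptive measures mentioned above, here is a minimal Python sketch using only the standard library's statistics module; the scores list is made up for demonstration.

    import statistics

    # Hypothetical exam scores, used purely for illustration
    scores = [50, 52, 55, 58, 60, 60, 70]

    print("mean:", statistics.mean(scores))          # about 57.86
    print("median:", statistics.median(scores))      # 58
    print("mode:", statistics.mode(scores))          # 60
    print("variance:", statistics.variance(scores))  # sample variance
    print("std dev:", statistics.stdev(scores))      # sample standard deviation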
2. Dispersion
• Definition: Dispersion measures how spread out the data points in a dataset are;
it reflects the variability of the data.
• Common Measures: Range, Variance, Standard Deviation.
• Example: In two classes, if students' scores are (50, 52, 55, 58, 60) in one and (30,
40, 50, 60, 60, 70) in the other, the second class has the higher dispersion in scores,
as the sketch below confirms.
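A minimal sketch (standard library only) computing the standard deviation of both classes; the values in the comments follow from the sample standard deviation formula.

    import statistics

    class_a = [50, 52, 55, 58, 60]
    class_b = [30, 40, 50, 60, 60, 70]

    # Sample standard deviation: class A is about 4.12, class B about 14.72,
    # confirming that the second class has the higher dispersion.
    print("class A std dev:", statistics.stdev(class_a))
    print("class B std dev:", statistics.stdev(class_b))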
3. Spread of Data
• Definition: The spread of data describes how values in a dataset differ from each
other and from the central value.
• Example: In a marathon, if runners finish within 1–2 minutes of each other, the
spread is small. If some finish in 2 hours and others in 4 hours, the spread is large.
Range, IQR, Variance, Standard Deviation and Standard Error
1. Range
o Definition: The difference between the maximum and minimum values in a
dataset. It measures the total spread of data.
o Formula: Range = Max Value − Min Value
o Example: If exam scores are (45, 50, 60, 70, 90), then Range = 90 − 45 = 45.
o Interpretation: The data is spread out over a range of 45 points.
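The heading above also lists the IQR; as a hedged sketch, the snippet below computes both the range and the interquartile range for the same scores with NumPy (assuming NumPy's default linear interpolation for percentiles).

    import numpy as np

    scores = np.array([45, 50, 60, 70, 90])

    print("range:", scores.max() - scores.min())  # 90 - 45 = 45
    q1, q3 = np.percentile(scores, [25, 75])      # 50.0 and 70.0 for these data
    print("IQR:", q3 - q1)                        # interquartile range = 20.0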
Real-Life Example: Poisson Distribution
• Traffic Flow: The number of cars passing through a toll booth per hour often follows
a Poisson distribution, where certain peak times have a higher probability of traffic
congestion.
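A minimal sketch of this idea with scipy.stats; the mean rate of 12 cars per hour and the congestion threshold of 20 are made-up numbers.

    from scipy.stats import poisson

    lam = 12  # assumed average number of cars per hour (illustrative)

    print("P(exactly 15 cars):", poisson.pmf(15, lam))
    print("P(more than 20 cars):", poisson.sf(20, lam))  # survival function: P(X > 20)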
2. Kurtosis
• Definition: Kurtosis measures the "tailedness" of a data distribution, showing how
extreme values (outliers) affect it.
• Types of Kurtosis:
o Leptokurtic (High Kurtosis): Tall, sharp peak with heavy tails (more extreme
outliers).
o Mesokurtic (Normal Kurtosis): Moderate peak and normal tails (follows
normal distribution).
o Platykurtic (Low Kurtosis): Flat peak with light tails (fewer outliers).
• Real-Life Example: Stock market returns often show leptokurtic behavior
because prices sometimes make extreme moves; the sketch below simulates this contrast.
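In this sketch, a Student's t distribution with few degrees of freedom stands in for heavy-tailed returns (an assumption, not real market data). Note that scipy reports excess kurtosis, so the normal baseline comes out near 0.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(42)
    mesokurtic = rng.normal(0, 1, 10_000)            # normal distribution
    leptokurtic = rng.standard_t(df=3, size=10_000)  # heavy tails

    print("normal (excess kurtosis):", kurtosis(mesokurtic))         # near 0
    print("heavy-tailed (excess kurtosis):", kurtosis(leptokurtic))  # well above 0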
Real-Life Example: Stratified Sampling
• A company launching a new product uses stratified sampling to gather feedback
from different age groups, ensuring a balanced and representative sample.
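A minimal pandas sketch of proportional stratified sampling; the survey frame, column names, and group proportions are all invented for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    frame = pd.DataFrame({
        "age_group": rng.choice(["18-30", "31-50", "51+"], size=600, p=[0.5, 0.3, 0.2]),
        "rating": rng.integers(1, 6, size=600),  # ratings from 1 to 5
    })

    # Draw 10% from each age group so every stratum is represented proportionally
    sample = frame.groupby("age_group").sample(frac=0.10, random_state=1)
    print(sample["age_group"].value_counts())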
Representative and Non-Representative Sampling Techniques
Hybrid Sampling
What is Hybrid Sampling?
• Definition: Hybrid sampling is a method that combines two or more sampling techniques
(probability and non-probability) to improve data collection efficiency and accuracy.
• Purpose: It is used to balance representation and practicality, ensuring better coverage of
diverse populations while saving time and cost.
Real-Life Example:
• A healthcare study aims to analyze patient satisfaction in a city. The researchers use:
o Stratified Sampling to divide patients into age groups.
o Convenience Sampling to collect data from hospitals where researchers have easy
access.
o Snowball Sampling to reach patients with rare diseases through referrals.
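Combining these stages can be sketched in pandas as well. Snowball referrals are hard to simulate briefly, so this sketch covers only the convenience and stratified stages; all data and names are hypothetical.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    patients = pd.DataFrame({
        "age_group": rng.choice(["18-30", "31-50", "51+"], size=300),
        "hospital": rng.choice(["A", "B", "C", "D"], size=300),
        "satisfaction": rng.integers(1, 6, size=300),
    })

    # Convenience stage: keep only hospitals the researchers can easily access
    accessible = patients[patients["hospital"].isin(["A", "B"])]

    # Stratified stage: a fixed number of patients from each age group
    sample = accessible.groupby("age_group").sample(n=10, random_state=0)
    print(sample["age_group"].value_counts())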
Confidence Interval
Definition: A confidence interval (CI) is a range of values, derived from sample data, that is
likely to contain the true population parameter (e.g., mean or proportion) with a certain
level of confidence.
Purpose: It provides an estimate of uncertainty in statistical analysis.
Common Confidence Levels:
• 90% CI → If sampling were repeated many times, about 90% of the resulting
intervals would contain the true population parameter.
• 95% CI → The most common choice; about 95% of such intervals would capture
the true value.
• 99% CI → Higher confidence, but at the cost of a wider (and hence less precise) interval.
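A minimal sketch of a 95% confidence interval for a mean, using a t-interval from scipy.stats; the data values are fabricated for illustration.

    import numpy as np
    from scipy import stats

    data = np.array([48, 52, 55, 60, 61, 63, 67, 70])  # hypothetical sample

    mean = data.mean()
    sem = stats.sem(data)  # standard error of the mean
    low, high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
    print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")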