Statistics for Data Science

What is Statistics?
Statistics is a fundamental component of data science, providing methods for collecting, analyzing,
interpreting, and presenting data. It is broadly classified into descriptive and inferential statistics.
Descriptive statistics focus on summarizing data using measures such as mean, median, mode,
variance, and standard deviation, along with visual tools like histograms and box plots. Inferential
statistics, on the other hand, help in making predictions and generalizations about a population
based on a sample, using techniques like hypothesis testing, regression analysis, and confidence
intervals. Probability theory is also an essential part of statistics in data science, helping to model
uncertainties and make data-driven predictions. Concepts such as probability distributions (normal,
binomial, Poisson), Bayes’ theorem, and the central limit theorem are widely used in statistical
analysis. In data science, statistical techniques are employed in exploratory data analysis (EDA),
hypothesis testing, regression modeling, and time series analysis to uncover patterns and
relationships within data. Moreover, statistics play a crucial role in machine learning, aiding in feature
selection, model evaluation, and understanding bias-variance tradeoffs to build robust predictive
models. By leveraging statistical methods, data scientists can make informed decisions, validate
hypotheses, and develop accurate machine learning algorithms. Understanding statistics is,
therefore, essential for deriving meaningful insights from data and ensuring the reliability of
analytical models.

Why is Statistics Important?


Statistics is crucial in data science because it provides the mathematical foundation for analyzing
data, identifying patterns, and making informed decisions. It enables data scientists to summarize
large datasets, extract meaningful insights, and quantify uncertainty using probability theory.
Descriptive statistics help in organizing and visualizing data, while inferential statistics allow for
making predictions and generalizations about populations based on samples. Statistical methods
are essential for hypothesis testing, regression analysis, and experimental design, ensuring that
conclusions drawn from data are valid and reliable. In machine learning, statistics are fundamental
for model evaluation, feature selection, and performance optimization, helping to prevent issues like
overfitting and underfitting. Additionally, probability distributions, statistical tests, and Bayesian
inference are widely used to model real-world uncertainties and optimize predictive algorithms.
Without statistics, data-driven decision-making would lack accuracy, leading to misleading
conclusions and ineffective models. Thus, statistics are the backbone of data science, ensuring that
data analysis is rigorous, evidence-based, and capable of driving meaningful insights across various
industries, from healthcare and finance to technology and business analytics.
Scales and Levels of Measurement
The scales and levels of measurement in statistics define how data is categorized, measured, and
analyzed. There are four primary levels of measurement: Nominal, Ordinal, Interval, and Ratio.
Each level has different properties and determines the types of statistical analyses that can be
performed.
1. Nominal Scale (Categorical Data)
• Definition: Data is classified into distinct categories with no inherent order.
• Characteristics: No numerical value or ranking.
• Examples:
o Gender (Male, Female, Other)
o Blood type (A, B, AB, O)
o Types of cars (Sedan, SUV, Truck)
2. Ordinal Scale (Ranked Data)
• Definition: Data is categorized in a meaningful order, but the difference between values is
not uniform.
• Characteristics: Can be ranked, but gaps between ranks are not equal.
• Examples:
o Education level (High School, Bachelor's, Master's, PhD)
o Customer satisfaction rating (Poor, Average, Good, Excellent)
o Competition rankings (1st, 2nd, 3rd place)
3. Interval Scale (Equal Differences, No True Zero)
• Definition: Data is measured on a scale with equal intervals between values but no true zero
point.
• Characteristics: Can perform addition and subtraction but not multiplication or division.
• Examples:
o Temperature in Celsius or Fahrenheit (0°C doesn’t mean "no temperature")
o IQ scores (Difference between 100 and 110 is the same as between 120 and 130)
o SAT scores
4. Ratio Scale (Equal Differences, True Zero Exists)
• Definition: Data has equal intervals and a meaningful zero, allowing all arithmetic
operations.
• Characteristics: Allows for the full range of mathematical operations (addition, subtraction,
multiplication, division).
• Examples:
o Height (cm, inches)
o Weight (kg, lbs)
o Age (years)
o Income ($0 means no income)
Differences Between Quantitative and Qualitative Data

Discrete, Continuous and Boolean Datasets


What is a Time Series?
A time series is a sequence of data points recorded at specific time intervals, typically in chronological order.
It is used to analyze trends, patterns, and dependencies over time. Time series data can be collected at
regular intervals (e.g., daily, monthly, yearly) or irregular intervals.

Key Characteristics of Time Series Data:


1. Temporal Dependence: The value of a data point depends on past values.
2. Trend: A long-term increase or decrease in the data over time.
3. Seasonality: Regular and repeating patterns (e.g., higher sales in December).
4. Cyclic Patterns: Fluctuations that occur over irregular time periods.
5. Stationarity: The statistical properties of data remain constant over time.
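
The sketch below is a minimal illustration of these characteristics, building a hypothetical monthly
sales series in pandas with an explicit trend, a repeating seasonal pattern, and random noise; the
dates and numbers are invented for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales: upward trend + yearly seasonality + random noise
dates = pd.date_range("2020-01-01", periods=36, freq="MS")            # 3 years, monthly
trend = np.linspace(100, 160, 36)                                     # long-term increase
seasonality = 10 * np.sin(2 * np.pi * dates.month.to_numpy() / 12)    # repeating yearly pattern
noise = np.random.default_rng(0).normal(0, 3, 36)                     # random variation

sales = pd.Series(trend + seasonality + noise, index=dates, name="sales")
print(sales.head())
print(sales.rolling(window=12).mean().dropna().head())  # 12-month average smooths out seasonality and reveals the trend
```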

What is Spatial Data?


Spatial data (also called geospatial data) refers to data that represents the location, shape,
and relationship of objects in a geographical space. It includes coordinates (latitude,
longitude), maps, and geographic features. Spatial data is crucial for analyzing patterns and
trends in a physical space and is used in Geographic Information Systems (GIS) and
spatial analytics.

Types of Spatial Data:


1. Vector Data: Uses points, lines, and polygons to represent geographical features
(e.g., city locations, roads, boundaries).
2. Raster Data: Uses grid-based structures like satellite images and heatmaps to
represent spatial information.
Difference between Categorical and Numerical Data
What is the Multivariate data and what are its types?
Multivariate data refers to datasets containing multiple variables or attributes observed
simultaneously for each data point. It helps in understanding relationships, patterns, and
dependencies between multiple factors, making it essential in data science, machine
learning, and statistics.
Types of Multivariate Data
1. Multivariate Categorical Data
o Definition: Data where multiple categorical variables are recorded for each
observation.
o Example: A customer survey recording gender, preferred product category,
and payment method (Male, Electronics, Credit Card).
2. Multivariate Numerical Data
o Definition: Data where multiple numerical variables are measured for each
observation.
o Example: A student's dataset with height, weight, and exam scores (5.7 ft, 65
kg, 85%).
3. Mixed Multivariate Data
o Definition: A combination of categorical and numerical variables in the same
dataset.
o Example: A dataset containing a person’s age (numeric), education level
(categorical), and income (numeric).
4. Time-Series Multivariate Data
o Definition: Data where multiple variables are recorded over time at regular
intervals.
o Example: Weather data containing temperature, humidity, and wind speed
recorded every hour.
5. Spatial Multivariate Data
o Definition: Data where multiple variables are associated with specific
geographic locations.
o Example: A dataset mapping city with pollution levels, temperature, and
population density.
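
As a small illustration, the sketch below builds a hypothetical mixed multivariate dataset (numeric
and categorical variables recorded for each person) as a pandas DataFrame; the column names and
values are invented.

```python
import pandas as pd

# Hypothetical mixed multivariate data: several variables per observation
df = pd.DataFrame({
    "age": [25, 34, 29, 41],                                        # numerical
    "education": ["Bachelor's", "Master's", "PhD", "Bachelor's"],   # categorical
    "income": [42000, 58000, 61000, 75000],                         # numerical
})

print(df.dtypes)      # shows the mix of numeric and object (categorical) columns
print(df.describe())  # summary statistics for the numerical variables
```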
Difference between Structured and Unstructured Data

What are the Boolean data types?


A Boolean data type is a data type that can hold only two possible values: True (1) or False
(0). It is used in logic-based operations, decision-making, and control flow in programming
and data science. Boolean data is fundamental in computer science, databases, and
machine learning, where binary conditions are evaluated.
Characteristics of Boolean Data Type:
1. Binary Nature: Can only be True/False, Yes/No, or 1/0.
2. Used in Logical Operations: Boolean algebra operations like AND, OR, and NOT.
3. Decision Making: Commonly used in conditional statements in programming (if-else
conditions).
4. Data Filtering: Used in queries and searches (e.g., filtering data where "Is Active =
True").

What are True Score and Error Score? Types of Errors


In statistics and measurement theory, True Score and Error Score are components of the
Observed Score in any measurement process.
1. True Score (T):
o The actual, correct, and consistent value of a measurement if there were no
errors.
o It represents the real ability, knowledge, or characteristic being measured.
o Example: If a student’s actual math ability is 85%, their true score should be
85%.
2. Error Score (E):
o The difference between the observed score and the true score, caused by
various errors.
o It includes factors like measurement inaccuracies, environmental factors, or
respondent mistakes.
o Example: If a student was distracted during an exam and scored 78% instead
of 85%, the error score is 85% - 78% = 7%.
Formula:
Observed Score (X) = True Score (T) + Error Score (E).

Types of Errors in Measurement


1. Systematic Error:
o An error that consistently occurs in one direction due to faulty measurement
tools or bias.
o Example: A weighing scale that always shows 2 kg extra.
2. Random Error:
o An unpredictable error that occurs due to temporary influences like mood,
fatigue, or environmental conditions.
o Example: A student performing poorly on a test due to illness.
3. Type I Error (False Positive):
o Rejecting a true null hypothesis (detecting an effect that does not exist).
o Example: A COVID-19 test incorrectly detecting the virus in a healthy person.
4. Type II Error (False Negative):
o Failing to reject a false null hypothesis (failing to detect an actual effect).
o Example: A smoke alarm failing to detect a real fire.

Type I & II Error


In hypothesis testing, errors can occur when making decisions about a hypothesis. These
errors are classified into Type I Error (False Positive) and Type II Error (False Negative).
1. Type I Error (False Positive)
• Occurs when a true null hypothesis (H₀) is incorrectly rejected.
• This means we conclude that an effect exists when it does not.
• Example: A fire alarm goes off when there is no fire.
2. Type II Error (False Negative)
• Occurs when a false null hypothesis (H₀) is incorrectly accepted.
• This means we fail to detect an effect when it exists.
• Example: A smoke detector failing to go off when there is a fire.
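
The simulation sketch below illustrates both error types under assumed settings (normally
distributed data, alpha = 0.05, repeated two-sample t-tests): when H₀ is true, the rejection rate
approximates the Type I error rate; when H₀ is false, the non-rejection rate approximates the
Type II error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims, n = 0.05, 2000, 30

# Case 1: H0 is true (both groups have the same mean) -> Type I error rate ~ alpha
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
print("Estimated Type I error rate:", false_positives / n_sims)   # close to 0.05

# Case 2: H0 is false (means differ) -> failing to reject is a Type II error
false_negatives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n)
    b = rng.normal(0.5, 1, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        false_negatives += 1
print("Estimated Type II error rate:", false_negatives / n_sims)
```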

Reliability and Validity


Reliability and validity are two key concepts in research and measurement to ensure the
accuracy and consistency of data.
1. Reliability
• Definition: The consistency and repeatability of a measurement or test. A
measurement is considered reliable if it produces the same results under consistent
conditions.
• Key Types of Reliability:
o Test-Retest Reliability: Consistency of results over time.
o Inter-Rater Reliability: Consistency between different observers or raters.
o Internal Consistency: Consistency within the test itself (e.g., survey
questions measuring the same concept).
• Example: A weighing scale that gives the same weight every time you step on it under
the same conditions is reliable.
2. Validity
• Definition: The accuracy and correctness of a measurement, meaning it measures
what it is supposed to measure.
• Key Types of Validity:
o Content Validity: The extent to which a test covers all aspects of the concept
being measured.
o Construct Validity: Whether the test truly measures the theoretical concept
it claims to measure.
o Criterion Validity: How well the test correlates with an external standard or
outcome.
• Example: If a weighing scale shows incorrect weight (e.g., 5 kg extra) every time, it
is reliable but not valid. A valid scale should give the correct weight.
What is Bias, and how do you remove it?
Bias in data refers to systematic errors that lead to inaccurate or misleading conclusions.
It occurs when data is collected, analyzed, or interpreted in a way that favors certain
outcomes over others, reducing fairness and accuracy.
Example of Bias:
• Hiring Bias: A company uses an AI recruitment system trained on past hiring data,
which favors male candidates over female candidates because past hiring decisions
were biased.

How to Remove Bias in Data?


1. Collect Representative Data: Ensure data comes from diverse and unbiased
sources.
2. Eliminate Sampling Bias: Use random sampling to avoid over-representing certain
groups.
3. Preprocess Data Fairly: Remove or adjust biased features (e.g., gender, race) in
machine learning models.
4. Use Fair Algorithms: Apply bias-detection tools and fairness-aware models.
5. Check for Data Imbalance: Ensure all groups have equal representation in training
data.
6. Regularly Audit Models: Continuously monitor and correct biases in AI and data-
driven decisions.

What is Statistics, and what are its types?


Statistics is the branch of mathematics that deals with collecting, organizing, analyzing,
interpreting, and presenting data to make informed decisions. It is widely used in research,
business, healthcare, and data science. It is broadly classified into two types: descriptive
statistics, which summarizes and organizes data, and inferential statistics, which draws
conclusions and makes predictions about a population based on a sample.
What is Data Analysis, and what are its types?
Data Analysis is the process of inspecting, cleaning, transforming, and interpreting data
to discover useful insights, patterns, and trends for decision-making. It is widely used in
business, healthcare, finance, and data science. Common types include descriptive, diagnostic,
predictive, and prescriptive analysis.
Sample, Population and Central Tendency
Sample
• Definition: A subset of data selected from a larger group (population) used for
analysis.
• Example: Surveying 1,000 students from a university of 10,000 students to study
average study hours.
Population
• Definition: The entire group from which a sample is taken for research.
• Example: The 10,000 students in a university represent the population in a study on
study habits.
Central Tendency
• Definition: A statistical measure that represents the center of a dataset using Mean,
Median, or Mode.
• Example: The average (Mean) height of students in a class is 5.7 feet, summarizing
the dataset.

Mean, Median and Mode of data


Mean (Average)
• Definition: The sum of all values divided by the total number of values.
• Example: If the test scores of five students are 70, 80, 90, 85, and 75, the mean is:
(70+80+90+85+75) / 5 = 80.
So, the mean score is 80.

Median (Middle Value)


• Definition: The middle value in an ordered dataset.
• Example: If the test scores are 70, 75, 80, 85, 90, the middle value (Median) is 80.
Mode (Most Frequent Value)
• Definition: The value that appears most frequently in a dataset.
• Example: If the test scores are 70, 80, 80, 85, 90, the most repeated score is 80 (Mode
= 80).
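
The same worked examples can be reproduced with Python's built-in statistics module, as in the
short sketch below.

```python
# Reproducing the worked examples above with Python's statistics module
import statistics

scores = [70, 80, 90, 85, 75]
print(statistics.mean(scores))    # (70+80+90+85+75) / 5 = 80
print(statistics.median(scores))  # sorted: 70, 75, 80, 85, 90 -> middle value is 80

scores_with_repeat = [70, 80, 80, 85, 90]
print(statistics.mode(scores_with_repeat))  # 80 appears most often
```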

Limitations of Mean, Median and Mode


Limitations of Mean:
• Affected by Outliers: A single extreme value can distort the mean, making it
unreliable in skewed distributions.
• Not Suitable for Categorical Data: Mean can only be calculated for numerical data,
not for categories like "Male" or "Female."
Limitations of Median:
• Ignores All Data Except the Middle Value: It does not consider how far values are
from each other, losing important details.
• Less Sensitive to Small Changes: If numbers change slightly, the median might stay
the same, reducing its responsiveness.
Limitations of Mode:
• May Not Exist or May Have Multiple Values: In some datasets, there may be no
mode or multiple modes, making interpretation difficult.
• Not Useful for Small Data Sets: In small datasets, mode may not provide
meaningful insights compared to mean or median.

What is the mean of a sample and a population?
The sample mean (x̄) is the average of the values in a sample: x̄ = (sum of sample values) / n,
where n is the sample size. The population mean (μ) is the average of all values in the entire
population: μ = (sum of all population values) / N. The sample mean is commonly used as an
estimate of the population mean.


Validity, Dispersion and Spread of Data
1. Validity
• Definition: Validity refers to how accurately a measurement or test represents what
it is supposed to measure.
• Example: A weighing scale is valid if it consistently shows the correct weight of an
object.

2. Dispersion
• Definition: Dispersion measures how spread out the data points are in a dataset.
It shows the variability in the data.
• Common Measures: Range, Variance, Standard Deviation.
• Example: In two classes, if students' scores are (50, 52, 55, 58, 60) in one and (30,
40, 50, 60, 60, 70) in another, the second class has higher dispersion in scores.

3. Spread of Data
• Definition: The spread of data describes how values in a dataset differ from each
other and from the central value.
• Example: In a marathon, if runners finish within 1–2 minutes of each other, the
spread is small. If some finish in 2 hours and others in 4 hours, the spread is large.
Range, IQR, Variance, Standard Deviation and Standard Error
1. Range
o Definition: The difference between the maximum and minimum values in a
dataset. It measures the total spread of data.
o Formula: Range=Max Value−Min Value.
o Example: If exam scores are (45, 50, 60, 70, 90),
then: Range=90−45=45
o Interpretation: The data is spread out over a range of 45 points.

2. Interquartile Range (IQR)


o Definition: The difference between the third quartile (Q3) and first quartile
(Q1) of a dataset. It measures the spread of the middle 50% of data.
o Formula: IQR=Q3−Q1
o Example: If Q1 = 50 and Q3 = 80,
then: IQR=80−50=30
o Interpretation: The middle 50% of the data lies within a 30-point spread.
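
The NumPy sketch below computes the spread measures named in this section's heading (range, IQR,
variance, standard deviation, and standard error) for the example scores (45, 50, 60, 70, 90);
the sample (n - 1) convention is assumed for variance and standard deviation.

```python
import numpy as np

scores = np.array([45, 50, 60, 70, 90])

data_range = scores.max() - scores.min()        # Range = 90 - 45 = 45
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                                   # IQR = Q3 - Q1
variance = scores.var(ddof=1)                   # sample variance (n - 1 in the denominator)
std_dev = scores.std(ddof=1)                    # sample standard deviation
std_error = std_dev / np.sqrt(len(scores))      # standard error of the mean

print(data_range, iqr, variance, std_dev, std_error)
```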
Normal Distribution and Standard Deviation
1. Normal Distribution
• Definition: A bell-shaped, symmetrical probability distribution where most values
cluster around the mean.
• Characteristics:
o Mean, median, and mode are equal.
o Data is symmetrically distributed around the mean.
o Follows the 68-95-99.7 rule (Empirical Rule).
• Real-Life Example:
o Height of People: In a large population, most people have an average height,
with fewer people being extremely short or tall, forming a bell curve.

2. Standard Deviation in Normal Distribution


• Definition: Measures how spread out values are from the mean in a normal
distribution.
• Empirical Rule (68-95-99.7 Rule):
o 68% of data falls within 1 standard deviation (σ) of the mean.
o 95% falls within 2σ.
o 99.7% falls within 3σ.
• Real-Life Example:
o IQ Scores: The average IQ is 100 with a standard deviation of 15.
▪ 68% of people have an IQ between 85 and 115.
▪ 95% have IQs between 70 and 130.
▪ 99.7% have IQs between 55 and 145.
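
A quick simulation sketch of the empirical rule is shown below, using simulated IQ-like data with
an assumed mean of 100 and standard deviation of 15.

```python
import numpy as np

rng = np.random.default_rng(42)
iq = rng.normal(loc=100, scale=15, size=100_000)   # simulated IQ scores

for k in (1, 2, 3):
    within = np.mean(np.abs(iq - 100) <= k * 15)   # fraction within k standard deviations
    print(f"within {k} sigma: {within:.3f}")        # roughly 0.68, 0.95, 0.997
```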

Data Distribution and Types of Data Distribution


1. What is Data Distribution?
• Definition: Data distribution refers to how data points are spread across a dataset.
It helps in understanding the shape, central tendency, and variability of data.
• Importance: Used in statistics, machine learning, and data science to make
predictions and analyze patterns.

2. Types of Data Distribution


1. Normal Distribution (Gaussian Distribution)
• Definition: A symmetrical, bell-shaped distribution where most values cluster
around the mean.
• Example: Heights of people in a large population follow a normal distribution.
2. Skewed Distribution
• Definition: A distribution where data is asymmetrically spread.
• Types:
o Right-Skewed (Positive Skew): Tail extends to the right (e.g., income
distribution).
o Left-Skewed (Negative Skew): Tail extends to the left (e.g., age at retirement).
• Example: Salaries in a company (a few high earners create right skew).
3. Uniform Distribution
• Definition: All values have equal probability of occurring.
• Example: Rolling a fair die, where each number (1-6) has an equal chance of
appearing.
4. Binomial Distribution
• Definition: Represents two possible outcomes (Success/Failure) over multiple
trials.
• Example: Flipping a coin 10 times and counting heads.
5. Poisson Distribution
• Definition: Models the probability of rare events occurring in a fixed interval of time
or space.
• Example: Number of customer arrivals at a bank per hour.
6. Exponential Distribution
• Definition: Models the time between events in a Poisson process.
• Example: Time between earthquakes in a region.

3. Real-Life Example

• Traffic Flow: The number of cars passing through a toll booth per hour often follows
a Poisson distribution, with peak times having a higher probability of traffic
congestion.
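
The sketch below draws samples from several of the distributions above using NumPy; all
parameters (heights, coin-flip probability, arrival rate, and so on) are illustrative
assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

normal_heights = rng.normal(loc=170, scale=10, size=1000)   # normal: heights in cm
die_rolls = rng.integers(1, 7, size=1000)                   # uniform: fair die (1-6)
heads = rng.binomial(n=10, p=0.5, size=1000)                # binomial: heads in 10 coin flips
arrivals = rng.poisson(lam=4, size=1000)                    # Poisson: customer arrivals per hour
wait_times = rng.exponential(scale=1 / 4, size=1000)        # exponential: hours between arrivals

print(normal_heights.mean(), die_rolls.mean(), heads.mean(),
      arrivals.mean(), wait_times.mean())
```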

Skewness and Kurtosis


1. Skewness
• Definition: Skewness measures the asymmetry of data distribution. It tells whether
data is symmetrically distributed or leans toward one side.
• Types of Skewness:
o Positive Skew (Right Skewed): Tail extends toward the right (higher values).
o Negative Skew (Left-Skewed): Tail extends toward the left (lower values).
o Zero Skewness: Data is perfectly symmetrical (normal distribution).
• Real-Life Example: Income Distribution is right-skewed because most people earn
average wages, while a few people earn extremely high salaries.

2. Kurtosis
• Definition: Kurtosis measures the "tailedness" of a data distribution, showing how
extreme values (outliers) affect it.
• Types of Kurtosis:
o Leptokurtic (High Kurtosis): Tall, sharp peak with heavy tails (more extreme
outliers).
o Mesokurtic (Normal Kurtosis): Moderate peak and normal tails (follows
normal distribution).
o Platykurtic (Low Kurtosis): Flat peak with light tails (fewer outliers).
• Real-Life Example: Stock Market Returns often show leptokurtic behavior
because stock prices sometimes have extreme fluctuations.
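
The sketch below measures skewness and kurtosis with SciPy on simulated data; the lognormal
"income" parameters are assumed purely for illustration, and note that SciPy reports excess
kurtosis (a normal distribution scores about 0).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
incomes = rng.lognormal(mean=10, sigma=0.6, size=10_000)   # right-skewed data
normal_data = rng.normal(0, 1, size=10_000)                # roughly symmetric data

print(stats.skew(incomes))        # positive -> right-skewed
print(stats.skew(normal_data))    # close to 0 -> symmetric

print(stats.kurtosis(incomes))      # > 0 -> heavier tails (leptokurtic)
print(stats.kurtosis(normal_data))  # ~ 0 -> mesokurtic
```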

Differentiating Between Primary and Secondary Data

Sampling and Sampling Techniques


1. What is Sampling?
• Definition: Sampling is the process of selecting a subset of individuals from a larger
population to analyze and make predictions about the whole population.
• Importance: Saves time, cost, and effort compared to studying the entire
population.

2. Types of Sampling Techniques


A. Probability Sampling (Random Selection, Equal Chance)
1. Simple Random Sampling: Every individual has an equal chance of being selected.
o Example: Drawing names from a hat for a prize.
2. Stratified Sampling: Population is divided into groups (strata), and samples are
taken proportionally from each group.
o Example: Selecting students from different grades in a school for a survey.
3. Cluster Sampling: Population is divided into clusters (groups), and entire clusters
are randomly selected.
o Example: Choosing random neighborhoods in a city to survey.
4. Systematic Sampling: Every nth person is selected from a list.
o Example: Surveying every 10th customer entering a mall.

B. Non-Probability Sampling (Non-Random, Based on Convenience)


1. Convenience Sampling: Choosing participants based on ease of access.
o Example: Surveying people at a nearby coffee shop.
2. Judgmental (Purposive) Sampling: Selecting individuals based on expert
judgment.
o Example: Interviewing top doctors about a new treatment.
3. Snowball Sampling: Existing participants refer new participants (useful for
hard-to-reach populations).
o Example: Researching drug addiction by asking participants to refer others.
4. Quota Sampling: Selecting a fixed quota from specific groups.
o Example: Surveying 50 men and 50 women for a market study.

3. Real-Life Example:
• A company launching a new product uses stratified sampling to collect feedback from
different age groups, ensuring a balanced and representative sample (see the sketch below).
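
The pandas sketch below contrasts simple random sampling and stratified sampling on an invented
student table; the column names and group sizes are assumptions.

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": range(1, 1001),
    "grade": ["9th", "10th", "11th", "12th"] * 250,   # 250 students per grade
})

# Simple random sampling: every student has an equal chance of selection
simple_sample = students.sample(n=100, random_state=0)

# Stratified sampling: take 10% from each grade so all grades are represented
stratified_sample = students.groupby("grade").sample(frac=0.10, random_state=0)

print(simple_sample["grade"].value_counts())       # proportions vary by chance
print(stratified_sample["grade"].value_counts())   # exactly 25 per grade
```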
Representative, Non-Representative Sampling Techniques
Hybrid Sampling
What is Hybrid Sampling?
• Definition: Hybrid sampling is a method that combines two or more sampling techniques
(probability and non-probability) to improve data collection efficiency and accuracy.
• Purpose: It is used to balance representation and practicality, ensuring better coverage of
diverse populations while saving time and cost.

Real-Life Example:
• A healthcare study wants to analyze patient satisfaction in a city. They use:
o Stratified Sampling to divide patients into age groups.
o Convenience Sampling to collect data from hospitals where researchers have easy
access.
o Snowball Sampling to reach patients with rare diseases through referrals.

Differentiate between Descriptive and Inferential Statistics

Choosing the Right Statistical Method


Choosing the right statistical method depends on data type, objective, and analysis requirements.
Below are the key steps to determine the best approach:
1. Identify the Type of Data
• Numerical Data (Continuous or Discrete) → Use parametric tests like t-tests, ANOVA,
regression.
• Categorical Data → Use non-parametric tests like Chi-square tests.

2. Consider Data Distribution


• If Data is Normally Distributed (Bell Curve) → Use parametric tests (T-test,
ANOVA).
• If Data is Not Normally Distributed → Use non-parametric tests (Mann-Whitney U
test, Kruskal-Wallis).
3. Sample Size Matters
• Large Sample (>30) → Use parametric tests (Central Limit Theorem applies).
• Small Sample (<30) → Use non-parametric tests (less assumption-dependent).
4. Real-Life Example
• A healthcare researcher wants to compare blood pressure levels between two
groups: those who take medicine and those who don’t.
o Right method? T-test (since blood pressure is numerical and comparing two
groups).
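
A minimal SciPy sketch of this t-test example is given below, with simulated (assumed)
blood-pressure readings for the two groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
medicine = rng.normal(loc=125, scale=10, size=40)      # group taking the medicine
no_medicine = rng.normal(loc=132, scale=10, size=40)   # group not taking the medicine

t_stat, p_value = stats.ttest_ind(medicine, no_medicine)
print(t_stat, p_value)
if p_value < 0.05:
    print("Significant difference in blood pressure between the two groups")
else:
    print("No significant difference detected")
```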

Confidence Interval
Definition: A confidence interval (CI) is a range of values, derived from sample data, that is
likely to contain the true population parameter (e.g., mean or proportion) with a certain
level of confidence.
Purpose: It provides an estimate of uncertainty in statistical analysis.
Common Confidence Levels:
• 90% CI → If the sampling were repeated many times, about 90% of such intervals would
contain the true population parameter.
• 95% CI → The most commonly used level; about 95% of such intervals would contain the
true value.
• 99% CI → Gives greater confidence, but the resulting interval is wider (less precise).
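
The sketch below computes a 95% confidence interval for a mean using the t-distribution in SciPy;
the sample itself is simulated, so the numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(loc=50, scale=8, size=40)   # simulated sample data

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```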

Difference Between Dependent and Independent Variables

Inferential Statistics and Hypothesis Testing


1. What is Inferential Statistics?
• Definition: Inferential statistics is the process of analyzing sample data to make
predictions, generalizations, or conclusions about a larger population.
• Purpose: It helps in making data-driven decisions based on probability and
estimation.
• Methods Used:
o Hypothesis Testing
o Confidence Intervals
o Regression Analysis
Example:
A marketing team analyzes the purchasing behavior of 1,000 customers and uses
inferential statistics to predict the behavior of 1 million customers.

2. What is Hypothesis Testing?


• Definition: Hypothesis testing is a statistical method used to determine whether
there is enough evidence in a sample to support or reject a claim about a population.
• Purpose: Helps in making scientific and business decisions based on statistical
evidence.
• Types of Hypothesis Tests:
o T-Test (Comparing two means)
o Chi-Square Test (For categorical data)
o ANOVA (Comparing more than two groups)
Example:
A pharmaceutical company tests whether a new drug is more effective than the existing
one. They conduct a hypothesis test to determine if there is a statistically significant
improvement in patients using the new drug.
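
As a small illustration, the sketch below runs a chi-square test on a hypothetical contingency
table of recovery counts for the new and existing drugs; the counts are invented for
demonstration.

```python
from scipy.stats import chi2_contingency

# Rows: new drug, existing drug; columns: recovered, not recovered (assumed counts)
observed = [[60, 40],
            [45, 55]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)
if p_value < 0.05:
    print("Reject H0: recovery rates differ between the two drugs")
else:
    print("Fail to reject H0: no significant difference detected")
```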
