Pattern Recognition - Unit - 1&2
Pattern Recognition - Unit - 1&2
Supervised Learning:
o Uses labeled data, where each data point has a corresponding output or
category.
o The goal is to learn a mapping function that can predict the output for new,
unseen data.
o Examples: Classification (predicting categories) and Regression (predicting
continuous values).
Unsupervised Learning:
o Uses unlabeled data, where there are no predefined outputs or categories.
o The goal is to discover hidden patterns or structures within the data.
o Examples: Clustering (grouping similar data points), Dimensionality Reduction
(reducing the number of variables).
Essentially, it aims to represent the data in a way that makes patterns easier to
identify.
It allows us to update our belief about the class of a pattern based on new
evidence (the features).
It forms the basis for Bayesian classifiers, which are widely used in applications
like spam filtering and medical diagnosis.
It allows us to incorporate prior knowlege into the classification process.
P(C∣X)=P(X)P(X∣C)P(C)
Where:
Parametric Classification:
o Assumes that the data follows a known probability distribution (e.g., Gaussian).
o Estimates the parameters of the distribution from the training data.
o Examples: Gaussian Naive Bayes, Linear Discriminant Analysis (LDA).
o Generally faster and requires less training data, but relies on the distribution
assumption being correct.
Non-Parametric Classification:
o Makes no assumptions about the underlying data distribution.
o The model complexity grows with the size of the training data.
o Examples: k-Nearest Neighbors (k-NN), Support Vector Machines (SVM).
o More flexible and can handle complex data distributions, but can be
computationally expensive and require more training data.
d(p,q)=i=1∑n(qi−pi)2
It is often caused by a model that is too complex for the given data.
The model essentially memorizes the training data instead of learning
generalizable patterns.
It leads to high variance and poor generalization performance.
A typical pattern recognition system can be broken down into several key
stages, which are often represented in a block diagram. Here's a general
overview:
General Block Diagram and Explanation:
1. Sensing/Data Acquisition:
o This is the initial stage where raw data is collected from the real world. This
data can be in various forms, such as images, audio signals, sensor readings,
or text.
o Sensors or input devices are used to capture the data.
o For example, a camera for image recognition, a microphone for speech
recognition, or sensors for medical data.
2. Preprocessing:
o The raw data is often noisy or contains irrelevant information. Preprocessing
aims to clean and prepare the data for further analysis.
o Common preprocessing techniques include:
Noise reduction (filtering).
Normalization (scaling data to a specific range).
Segmentation (isolating objects of interest).
Enhancement (increase the image quality).
o This step is vital to increase the signal to noise ratio of the data.
3. Feature Extraction:
o This stage involves extracting relevant features from the preprocessed data.
Features are characteristics that help distinguish between different patterns.
o The goal is to reduce the dimensionality of the data while retaining the most
important information.
o Examples of features include:
Edges and corners in images.
Frequency components in audio signals.
Statistical properties of data.
5. Classification/Pattern Recognition:
o This is the core of the pattern recognition system. The extracted features are
used to classify the input data into predefined categories or to recognize
patterns.
o Various classification algorithms can be used, such as:
k-Nearest Neighbors (k-NN).
Support Vector Machines (SVM).
Neural networks.
Bayesian classifiers.
6. Post-processing/Decision Making:
o This stage involves interpreting the classification results and making decisions
based on them.
o It may involve:
Refining the classification results.
Providing a final output or action.
Displaying the result to a user.
In summary:
Q.2 Discuss the Decision Theory in Pattern Recognition and take the
reference of under-fitting, good fit and over fitting.
Goal:
o The primary goal of decision theory in pattern recognition is to minimize the risk
of making incorrect classifications.2
o This involves calculating the probabilities of different outcomes and weighing
them against the potential consequences of each decision.3
Key Concepts:
o Bayes' Theorem:
A cornerstone of decision theory, Bayes' theorem allows us to calculate the
posterior probability of a class given the observed features.4
It's crucial for updating our beliefs about class membership as new evidence
becomes available.5
o Risk Minimization:
Decision theory focuses on minimizing the expected loss or risk associated with
classification errors.
This involves defining a "loss function" that quantifies the cost of each type of
misclassification.6
Decision Boundaries:
o Decision theory helps to define optimal decision boundaries that separate
different classes in the feature space.
Underfitting:
o When a model underfits the data, it's essentially making overly simplistic
decisions.7
o The decision boundaries are too coarse, leading to high misclassification rates.
o From a decision theory perspective, the model's assumptions are too rigid,
failing to capture the underlying probability distributions of the data.8
o Essentially the model has high bias.9
Good Fitting:
o A well-fitted model strikes a balance between complexity and accuracy.10
o It creates decision boundaries that accurately reflect the underlying data
patterns, minimizing the risk of misclassification.
o Decision theory helps to optimize these boundaries, ensuring they align with the
probabilistic structure of the data.11
o Essentially the model has low bias, and low variance.
Overfitting:
o Overfitting leads to excessively complex decision boundaries that conform too
closely to the training data, including its noise.12
o While the model may achieve near-perfect accuracy on the training set, it
performs poorly on unseen data.13
o Decision theory highlights the risk of overfitting by emphasizing the importance
of generalization.
o Essentially the model has low bias, but very high variance. The model is too
tightly fit to the training data.14
In essence:
Decision theory provides the tools to assess the quality of a pattern recognition
model by quantifying the risks associated with its decisions.15
It helps to guide the model selection process, favoring models that minimize
these risks and generalize well to unseen data.16
By using decision theory, one can attempt to find the optimal balance between
the bias and the variance of a model.
Learning:
o Learning refers to the process by which a system acquires knowledge or skills
from data. In pattern recognition, this typically involves training a model on a
dataset.3
o Types of Learning:
Supervised Learning: The system learns from labeled data, where each data
point has a corresponding output or category.4 The goal is to learn a mapping
function that can predict the output for new, unseen data.5
Unsupervised Learning: The system learns from unlabeled data, where there
are no predefined outputs or categories.6 The goal is to discover hidden
patterns or structures within the data.
Reinforcement Learning: The system learns through trial and error, receiving
feedback in the form of rewards or penalties.7 The goal is to learn a policy that
maximizes the cumulative reward.89
o Learning Algorithms:
Various algorithms are used for learning, including neural networks, support
vector machines, decision trees, and Bayesian classifiers.
These algorithms adjust their parameters or structure based on the training data
to improve their performance.10
Adaptation:
o Adaptation refers to the system's ability to modify its behavior or parameters in
response to changes in the environment or data.
o This is crucial for handling non-stationary data, where the underlying patterns
may change over time.
o Adaptation Mechanisms:
Online Learning: The system updates its model incrementally as new data
becomes available.11
Incremental Learning: Similar to online learning, but it can also retain
previously learned information.12
Concept Drift Adaptation: The system detects and adapts to changes in the
underlying data distribution.13
Feedback Mechanisms: The system receives feedback from its environment
and adjusts its behavior accordingly.
o Importance of Adaptation:
Adaptation allows pattern recognition systems to remain effective in dynamic
environments.
It improves the system's robustness to noise and variations in data.15
It enables the system to learn and improve over time.16
1. Problem Definition:
o Clearly define the problem that the pattern recognition system aims to solve.
o Specify the input data, the desired output, and the performance criteria.
o For example, "Develop a system to classify images of handwritten digits."
2. Data Collection and Preparation:
o Gather a representative dataset that covers the range of variations in the input
data.
o Preprocess the data to remove noise, normalize values, and handle missing
data.18
o Split the data into training, validation, and test sets.19
3. Feature Extraction:
o Identify and extract relevant features from the preprocessed data.
o Select features that are informative and discriminative for the classification task.
o Examples: edges, textures, frequency components, statistical properties.
4. Model Selection:
o Choose a suitable classification or recognition algorithm based on the problem
and data characteristics.
o Consider factors such as accuracy, computational complexity, and
interpretability.
o Examples: k-NN, SVM, neural networks, decision trees.20
5. Model Training:
o Train the chosen model on the training dataset.
o Adjust the model's parameters to minimize the error on the training data.21
o Use techniques like cross-validation to prevent overfitting.
6. Model Evaluation:
o Evaluate the trained model on the validation and test datasets.
Concept:
o Relies on statistical models and probability theory to classify patterns.2
o Assumes that patterns can be characterized by probability distributions.
o Uses techniques like Bayes' theorem, Gaussian models, and Hidden Markov
Models.3
Strengths:
o Well-established mathematical foundation.
o Effective for problems with well-defined statistical properties.
o Can handle uncertainty and noise.
Weaknesses:
o Requires assumptions about the underlying data distributions.
o Can be computationally expensive for high-dimensional data.
o May struggle with complex, non-linear patterns.
Examples: Bayesian classifiers, Gaussian Mixture Models (GMMs), Hidden
Markov Models (HMMs).4
Concept:
o Describes patterns using formal grammars and structural relationships.5
o Represents patterns as hierarchical structures of primitives.
o Useful for patterns with inherent structural information (e.g., character
recognition, shape analysis).6
Strengths:
o Effective for representing and recognizing structured patterns.7
o Can handle variations in pattern structure.
o Provides a symbolic representation of patterns.8
Weaknesses:
o Difficult to define grammars for complex patterns.
o Sensitive to noise and distortions.
o Can be computationally expensive.
Examples: Formal grammars, attributed relational graphs.9
3. Template Matching:
Concept:
o Compares an unknown pattern to a set of stored templates.10
o Classifies the pattern based on the best match.
o Simple and intuitive approach.
Strengths:
o Simple and easy to implement.
o Effective for problems with limited variation in patterns.
Weaknesses:
o Sensitive to variations in scale, rotation, and distortion.11
o Requires a large number of templates for complex patterns.12
o Can be computationally expensive.
Examples: Image correlation, dynamic time warping (DTW).
4. Neural Networks:
Concept:
o Uses interconnected layers of artificial neurons to learn patterns.
o Can learn complex, non-linear mappings between inputs and outputs.
o Powerful for a wide range of pattern recognition tasks.
Strengths:
o Can learn complex, non-linear patterns.
o Robust to noise and variations.
o Can be trained with large datasets.13
Weaknesses:
o Can be computationally expensive.
o Requires large amounts of training data.
o "Black box" nature makes it hard to understand the reasoning.
Examples: Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs).
Concept:
o Finds an optimal hyperplane that separates different classes in the feature
space.14
o Uses kernel functions to handle non-linear decision boundaries.
o Effective for high-dimensional data.
Strengths:
o Effective for high-dimensional data.
o Can handle non-linear decision boundaries.
o Good generalization performance.
Weaknesses:
o Can be computationally expensive for large datasets.15
o Kernel selection can be challenging.
o Less effective for very large datasets compared to Neural networks.
Comparison Table:
Approach Strengths Weaknesses Applications
Requires
Robust to noise, Speech
Statistical distributional
strong recognition,
assumptions,
mathematical computationally medical
foundation expensive diagnosis
Handles
structured Character
Complex grammars,
Syntactic patterns, recognition,
sensitive to noise
symbolic shape analysis
representation
Sensitive to Object
Template Simple, easy to
variations, requires recognition,
Matching implement
many templates image alignment
Image
Learns complex Computationally
Neural recognition,
patterns, robust expensive, requires
Networks natural language
to noise large datasets
processing
Computationally Image
Effective in high
expensive for large classification,
SVMs dimensions, good
datasets, kernel text
generalization
selection categorization
Choosing an Approach:
Often, hybrid approaches that combine multiple techniques can provide the best
results.
Q.5 Explain the concepts of True Positive (TP), False Positive (FP), True
Negative (TN), and False Negative (FN) and relate it for calculating
accuracy and error rate. Try to apply on some values and mention your
interpretation.
Ans.5
Absolutely! Let's break down True Positives, False Positives, True Negatives,
and False Negatives, and then see how they're used in accuracy and error rate
calculations.
Accuracy:
o The overall proportion of correct predictions.
o Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
o It tells you how often the model is correct overall.
Error Rate (Misclassification Rate):
o The overall proportion of incorrect predictions.
o Formula: Error Rate = (FP + FN) / (TP + TN + FP + FN)
o Or, Error Rate = 1 - Accuracy
o It tells you how often the model is wrong overall.
Let's say we have a medical test for a disease, and we've tested 100 patients.
The results are:
Accuracy:
o Accuracy = (40 + 30) / (40 + 30 + 10 + 20) = 70 / 100 = 0.7 or 70%
o Interpretation: The model correctly classified 70% of the patients.
Error Rate:
o Error Rate = (10 + 20) / (40 + 30 + 10 + 20) = 30 / 100 = 0.3 or 30%
o Interpretation: The model incorrectly classified 30% of the patients.
Interpretation Considerations:
In medical contexts, False Negatives (FN) are often more critical than False
Positives (FP). A False Negative means a sick patient is missed, which can
have severe consequences.
When evaluating a model, it's essential to consider the context and the relative
costs of FP and FN.
Accuracy alone can be misleading, especially in imbalanced datasets (where
one class is much more frequent than the other). Other metrics like precision,
recall, and F1-score are also very usefull.
Ans.6
1. Expectation (Expected Value)
Concept:
o The expectation of a random variable is the average value we expect it to take,
considering the probabilities of its possible values.
o It represents the long-run average of repeated trials or observations.
Formula (Discrete Random Variable):
o If X is a discrete random variable with possible values x1,x2,...,xn and
corresponding probabilities P(X=x1),P(X=x2),...,P(X=xn), then the expected
value of X, denoted as E(X) or μX, is:
E(X)=∑i=1nxi⋅P(X=xi)
Formula (Continuous Random Variable):
o If X is a continuous random variable with probability density function f(x), then
the expected value of X1 is:
E(X)=∫−∞∞x⋅f(x)dx
2. Mean
Concept:
o The mean is a specific type of expected value, typically used to describe the
average of a set of data points.
o For a sample, it's the sum of the values divided by the number of values.
Formula (Sample Mean):
o If x1,x2,...,xn are n data points, then the sample mean, denoted as xˉ, is:
xˉ=n1∑i=1nxi
Formula (Population Mean):
o The population mean is equivalent to the expected value of a random variable
sampled from that population.
3. Covariance
Concept:
o Covariance measures the degree to which two random variables change
together.
o A positive covariance indicates that the variables tend to increase or decrease
together.
o A negative covariance indicates that2 they tend to vary in opposite directions.
Formula (Population Covariance):
o If X and Y are two random variables with means μX and μY, respectively, then
the population covariance, denoted as Cov(X,Y), is:
Cov(X,Y)=E[(X−μX)(Y−μY)]
Formula (Sample Covariance):
o If (x1,y1),(x2,y2),...,(xn,yn) are n pairs of data points, then the sample
covariance, denoted as sxy, is:
sxy=n−11∑i=1n(xi−xˉ)(yi−yˉ)
Example
Scenario:
o We have a six-sided die, and we want to analyze the expected value and
covariance.
o We also have a set of paired data.
Expected Value (Die Roll):
o Let X be the random variable representing the outcome of a die roll.
o The possible values are xi=1,2,3,4,5,6, and each has a probability of P(X=xi
)=1/6.
o E(X)=(1⋅1/6)+(2⋅1/6)+(3⋅1/6)+(4⋅1/6)+(5⋅1/6)+(6⋅1/6)=3.5
o Interpretation: The expected value of a die roll is 3.5.
Mean (Sample Data):
o Data: [2, 4, 6, 8, 10]
o Mean = (2+4+6+8+10)/5 = 30/5 = 6.
Covariance (Sample Data):
o Data:
X: [1, 2, 3, 4, 5]
Y: [2, 4, 5, 4, 5]
o Mean of X: xˉ=3
o Mean of Y: yˉ=4
o sxy=5−11[(1−3)(2−4)+(2−3)(4−4)+(3−3)(5−4)+(4−3)(4−4)+(5−3)(5−4)]=41
[4+0+0+0+2]=1.5
o Interpretation: The covariance of 1.5 indicates a positive relationship between X
and Y. As X increases, Y tends to increase as well.
UNIT-II
Ans.1
P(C∣X)=P(X)P(X∣C)P(C)
4. Loss Function (L(α, C)): This function quantifies the cost associated with
making a particular decision (α) when the true class is C. It allows us to
incorporate the consequences of misclassification into the decision-making
process.
5. Risk (R(α|X)): This is the expected loss associated with making a decision (α)
given the observed feature vector (X). It is calculated as:
Q.2 What is the difference between prior probability and posterior probability?
Q.3 Define loss function and explain its role in decision-making
Q.4 Differentiate between discriminant functions and decision boundaries.
Q.5 What is Normal Density Function? Give an example.
Q.6 Define multivariate normal distribution and mention its application.
Q.7 Differentiate between Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA).
Discriminant Functions:
o A discriminant function is a function that takes a feature vector as input and
produces a scalar value that represents the "score" or "likelihood" of the input
belonging to a particular class.11
o It's used to assign data points to classes based on these scores.
o For example, in linear discriminant analysis, the discriminant function is a linear
combination of the features.12
Decision Boundaries:
o A decision boundary is the surface or boundary in the feature space that
separates different classes.13
o It's defined by the points where the discriminant functions for different classes
are equal.
o In a two-class problem, the decision boundary is the set of points where the two
discriminant functions have the same value.
o Discriminant functions are used to create the decision boundaries.14
Q.1 Derive the equation of the Normal Density Function. Explain its
significance in pattern recognition.
Ans.1=
f(z)=2π1e−2z2
o Where:
z is the standard normal variable.
e is Euler's number (approximately 2.71828).
π is pi (approximately 3.14159).
z=σx−μ
o Where:
x is the general normal variable.
μ is the mean of the distribution.
σ is the standard deviation of the distribution.
3. Substitute and Solve:
o Substitute the expression for z into the standard normal PDF:
f(σx−μ)=2π1e−21(σx−μ)2
f(x)=σ2π1e−21(σx−μ)2
o This is the PDF of the normal distribution with mean μ and standard deviation σ.
Ans.2
Absolutely, let's derive the multivariate normal density function and then explore
its real-world applications.
Ans.3
Definition:
o A discriminant function is a function that takes a feature vector as input and
produces a scalar value that represents the "score" or "likelihood" of the input
belonging to a particular class.
o It's used to assign data points to classes based on these scores.
How it's Used for Classification:
o In a classification problem with multiple classes, we define a discriminant
function for each class.
o Given a new data point (feature vector), we evaluate all the discriminant
functions.
o The class corresponding to the highest discriminant function value is assigned
to the data point.
o Essentially, the discriminant function transforms the feature space into a space
where class separation is more apparent.
o The decision boundaries between classes are formed where the discriminant
functions of two classes are equal.
g(x)=wTx+w0
where:
P(Ci∣x)=P(x)P(x∣Ci)P(Ci)
o Where P(Ci) is the prior probability and P(x|Ci) is the likelihood.
2. Gaussian Likelihoods:
o We assume that the likelihoods P(x∣Ci) are Gaussian (normal) distributions:
P(x∣Ci)=(2π)d/2∣Σi∣1/21exp[−21(x−μi)TΣi−1(x−μi)]
o Where μi is the mean vector and Σi is the covariance matrix for class Ci.
3. Linear Discriminant Function (Equal Covariance):
o If we assume that the covariance matrices are equal (Σ1=Σ2=Σ), we can
simplify the decision rule.
o We take the logarithm of the ratio of the posterior probabilities and set it to a
threshold:
lnP(C2∣x)P(C1∣x)>θ
o By simplifying the logarithm of the likelihood ratio, and canceling out similar
terms because of the equal covariance matrices, we reach a linear form.
o The resulting discriminant function is:
g(x)=wTx+w0
Where:
w=Σ−1(μ1−μ2)
w0=−21(μ1TΣ−1μ1−μ2TΣ−1μ2)+lnP(C2)P(C1)
Example
1. Calculate w:
o w=Σ−1(μ1−μ2)=[1001]([12]−[34])=[−2−2]
2. Calculate w0:
o w0=−21(μ1TΣ−1μ1−μ2TΣ−1μ2)+lnP(C2)P(C1)
o w0=−21(5−25)+ln(1.5)≈10+0.405≈10.405
3. Linear Discriminant Function:
o g(x)=[−2−2]T[x1x2]+10.405=−2x1−2x2+10.405
Now, to classify a new data point, we would evaluate g(x). If g(x)>0, we classify
it as C1; otherwise, we classify it as C2.
Q.4
a) Compare Linear, Quadratic, and Bayesian Classifiers.
Explain the advantages and disadvantages of each with examples.
Ans.4=
Can be linear or
Decision Quadratic non-linear,
Linear
Boundary (curved) depends on the
model
Variable,
depends on
Complexity Lower Higher
model
complexity
Variable,
Computational depends on
Lower Higher
Cost model
complexity
Q.5
a) Discuss real-world applications of Bayesian Classifiers.
Explain how discriminant functions are used in spam detection or medical
diagnosis.
Ans.5
Bayesian classifiers, with their ability to handle uncertainty and incorporate prior
knowledge, find applications in various domains:1
1. Spam Detection:
Discriminant Function:
o In spam detection, a discriminant function is used to calculate a score that
represents the likelihood of an email being spam.15
o For example, in a Naive Bayes classifier, the discriminant function might be a
linear combination of the log probabilities of the words in the email, weighted by
their probabilities of appearing in spam emails.
o g(x)=∑i=1nwi⋅log(P(wordi∣spam))
o Where:
g(x) is the discriminant function.16
wi is a weight for each word (often 1).
P(wordi∣spam) is the probability of word_i appearing in spam.
Classification:
o If the discriminant function value exceeds a certain threshold, the email is
classified as spam; otherwise, it's classified as non-spam.
o The threshold is typically determined through training and validation.
2. Medical Diagnosis:
Discriminant Function:
o In medical diagnosis, a discriminant function can be used to calculate a score
that represents the likelihood of a patient having a particular disease.17
o For example, in a Bayesian network, the discriminant function might combine
the probabilities of various symptoms and test results, weighted by their
relationships to the disease.18
o g(x)=log(P(disease∣symptoms,test_results))
o Where:
g(x) is the discriminant function.19
P(disease∣symptoms,test_results) is the posterior probability of the disease
given the observed symptoms and test results.
Classification:
o If the discriminant function value exceeds a certain threshold, the patient is
diagnosed with the disease; otherwise, the disease is ruled out.
o The threshold is determined based on the desired sensitivity and specificity of
the test.
o In cases where multiple diseases are possible, multiple discriminant functions
are used, and the one with the highest value determines the most likely
disease.