UNIT-2
Nearest Neighbor-Based Models
Introduction to Proximity Measures:
 In machine learning, proximity measures quantify the similarity or dissimilarity between
data points, crucial for tasks like clustering, classification, and outlier detection.
 These measures, often represented as distances or similarity scores, help algorithms
understand how alike or different objects are, enabling informed decision-making.
 Proximity measures are mathematical functions that assess the degree of similarity or
dissimilarity between data points or objects.
 They provide a numerical representation of how "close" two objects are, based on their
attributes or features.
Types:
 Similarity Measures: Quantify how alike two objects are, with higher values
indicating greater similarity.
 Dissimilarity Measures: Quantify how different two objects are, with lower values
indicating greater similarity.
Applications:
 Clustering: Grouping similar data points together.
 Classification: Assigning data points to categories based on their proximity to known
classes.
 Nearest Neighbor Algorithms: Finding the most similar data points to a given point.
Examples of Proximity Measures:
 Euclidean Distance: Measures the straight-line distance between two points in a
multi-dimensional space.
 Hamming Distance: Counts the number of positions where two strings or binary
vectors differ.
How they work:
 Proximity measures typically take two objects as input and return a numerical value
representing their proximity.
 The specific formula or algorithm used depends on the type of data and the desired
interpretation of proximity.

 For example, Euclidean distance calculates the square root of the sum of the squared differences between corresponding attributes (a short sketch follows).
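For instance, a minimal Python sketch of one dissimilarity measure (Euclidean distance) and a similarity score derived from it; the conversion 1/(1 + distance) is just one common illustrative choice, not something prescribed by these notes:

```python
import math

def euclidean_distance(a, b):
    # Dissimilarity: square root of the sum of squared attribute differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    # A simple similarity score: 1 when the points coincide, approaching 0 as they move apart.
    return 1.0 / (1.0 + euclidean_distance(a, b))

p, q = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean_distance(p, q))  # 5.0
print(similarity(p, q))          # ≈0.167
```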
--------------------------------------------------------------------------------------------------------------
Distance Measures:
 Distance measures quantify the dissimilarity or similarity between data points, crucial for algorithms like k-NN and clustering, with common examples including Euclidean, Manhattan, and Hamming distances (a short sketch after the list below illustrates several of them).
1. Euclidean Distance: The straight-line distance between two points in a multi-dimensional
space, often referred to as the L2 norm.
Formula: √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
2. Manhattan Distance: The sum of the absolute differences between corresponding
coordinates of two points, also known as the L1 norm.
Formula: |x1 - y1| + |x2 - y2| + ... + |xn - yn|
3. Minkowski Distance: A generalization of both Euclidean and Manhattan distances,
allowing for different powers (p) to be used in the calculation.
Formula: (∑|xi - yi|^p)^(1/p)
4. Hamming Distance: The number of positions at which two strings of equal length differ,
commonly used for comparing binary vectors or strings.
Formula: Count the number of differing bits/characters.
5. Cosine Similarity: Measures the cosine of the angle between two vectors, indicating the
similarity between their orientations rather than their magnitudes.
Formula: (Vector A . Vector B) / (||Vector A|| * ||Vector B||)
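A small Python sketch of these formulas (the function names are illustrative; Minkowski with p = 1 reproduces Manhattan distance and p = 2 reproduces Euclidean distance):

```python
import math

def minkowski(a, b, p):
    # Generalized distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

x, y = [1, 2, 3], [2, 4, 6]
print(minkowski(x, y, 1))             # Manhattan: 6.0
print(minkowski(x, y, 2))             # Euclidean: ≈3.742
print(hamming("karolin", "kathrin"))  # 3
print(cosine_similarity(x, y))        # ≈1.0 (same orientation)
```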
Choosing the Right Distance Metric:
The choice of distance metric depends on the type of data, the problem, and the
algorithm being used. Consider the following factors:
 Data Type: Are the features continuous, categorical, or binary?
 Algorithm: Some algorithms are more sensitive to certain distance metrics than others.
 Outliers: Some metrics are more robust to outliers than others.
 Dimensionality: Some metrics perform better in high-dimensional spaces.

--------------------------------------------------------------------------------------------------------------------------------------------------------------

Non-metric Similarity Functions:
 These functions don't necessarily adhere to the triangle inequality or other metric properties.
 They allow for more flexible interpretations of similarity, potentially capturing non-linear relationships or accommodating data transformations that might not be captured by strict metrics.
Examples of Non-Metric Similarity Functions:
Warping Distance (e.g., Dynamic Time Warping): Used to measure the similarity between curves or sequences, allowing for non-linear mappings and distortions.
Longest Common Subsequence (LCSS): Used to find the longest subsequence common to two sequences, which can serve as a similarity measure between them (see the sketch below).
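A minimal sketch of the LCSS idea, assuming two plain Python sequences; normalising the common-subsequence length by the shorter sequence's length is one illustrative way to turn it into a similarity score:

```python
def lcs_length(s, t):
    # Classic dynamic programme: dp[i][j] holds the LCS length of s[:i] and t[:j].
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]

def lcss_similarity(s, t):
    # Normalised similarity in [0, 1]: 1 when one sequence is a subsequence of the other.
    return lcs_length(s, t) / min(len(s), len(t))

print(lcs_length("ABCBDAB", "BDCABA"))       # 4 (e.g. "BCBA")
print(lcss_similarity("ABCBDAB", "BDCABA"))  # 4/6 ≈ 0.667
```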

---------------------------------------------------------------------------------------------------------------
Proximity between Binary Patterns:
 Measuring proximity between binary patterns involves quantifying their similarity or
dissimilarity, often using metrics like Hamming distance or Jaccard index, which are
useful for tasks like clustering and pattern recognition.
 Binary patterns are data representations where each element can be either 0 or 1,
representing the presence or absence of a specific feature or attribute.
 Similarity/Dissimilarity: Proximity measures help determine how similar or dissimilar
two binary patterns are.
 Clustering: Patterns with high proximity can be grouped together, allowing for the
identification of clusters or categories.
 Pattern Recognition: Proximity measures can help identify patterns or anomalies
within a dataset.
 Decision Making: Understanding the relationships between binary patterns can inform
decision-making in various fields.
Common Proximity Measures for Binary Patterns:
 Hamming Distance: Counts the number of positions where two binary patterns differ.
 Example: If pattern A is "1010" and pattern B is "0011", the Hamming distance is 2 (they differ in the first and fourth positions).
 Jaccard Index: Measures the similarity between two sets by dividing the number of elements common to both sets by the number of elements in their union.
It's calculated as: |A ∩ B| / |A ∪ B| (a short sketch after this list computes both measures).
 Local Binary Patterns (LBP): A technique used in computer vision to encode local
texture information by comparing a pixel's intensity to its neighbors.
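A minimal sketch computing both measures for the binary patterns above, treating each position holding a 1 as a present feature:

```python
def hamming(a, b):
    # Positions at which the two binary patterns differ.
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the sets of positions where each pattern has a 1.
    ones_a = {i for i, x in enumerate(a) if x == "1"}
    ones_b = {i for i, x in enumerate(b) if x == "1"}
    if not ones_a | ones_b:
        return 1.0  # two all-zero patterns are treated as identical
    return len(ones_a & ones_b) / len(ones_a | ones_b)

A, B = "1010", "0011"
print(hamming(A, B))  # 2: the first and fourth positions differ
print(jaccard(A, B))  # 1/3: one shared 1 (third position) out of three positions with any 1
```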
Applications:
 Image Processing: LBP is commonly used for texture analysis and image
classification.
 Data Mining: Proximity measures can be used for clustering, anomaly detection, and
pattern discovery.
 Social Network Analysis: Identifying similar users or groups.
----------------------------------------------------------------------------------------------------
Different Classification Algorithms Based on Distance Measures in Machine Learning
1. Logistic Regression: Logistic regression uses the sigmoid function to return the probability of a label. It is widely used when the classification problem is binary – true or false, win or lose, positive or negative.

2. Decision Tree: A decision tree builds branches in a hierarchical fashion, and each branch can be considered an if-else statement. The branches develop by partitioning the dataset into subsets based on the most important features. The final classification happens at the leaves of the decision tree.

3. Random Forest: A random forest is a collection of decision trees. It is a common type of ensemble method that aggregates the results of multiple predictors. Random forest additionally uses the bagging technique, in which each tree is trained on a random sample of the original dataset, and the final prediction is the majority vote across the trees.
4. Support Vector Machine (SVM): A support vector machine finds the best way to classify the data based on its position relative to a border between the positive class and the negative class. This border is known as the hyperplane, which is chosen to maximize the margin between data points from the different classes.

5. K-Nearest Neighbour (KNN): The k-nearest neighbour algorithm represents each data point in an n-dimensional space defined by n features. It calculates the distance from one point to another, and then assigns unobserved data the label of the nearest observed data points.

6. Naive Bayes: Naive Bayes is based on Bayes’ Theorem – an approach for calculating conditional probability based on prior knowledge – together with the naive assumption that the features are independent of each other.
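As a rough illustration (not part of the original notes), the six classifiers above could be trained and compared with scikit-learn; the Iris dataset and all parameter values below are assumptions chosen only for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on the held-out split
```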

----------------------------------------------------------------------------------------------------------------
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
 K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
 At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.

How does K-NN work?
The working of K-NN can be explained on the basis of the following steps:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the training data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.

Firstly, we choose the number of neighbors: k = 5.
Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the straight-line distance between two points, already familiar from geometry. Between two points A(x1, y1) and B(x2, y2) it is √((x2 - x1)² + (y2 - y1)²).
By calculating the Euclidean distance we obtain the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Since the majority of the five neighbors belong to category A, the new data point is assigned to category A. A minimal sketch of this procedure follows.
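A minimal from-scratch sketch of Steps 1 to 5; the toy points, labels, and query are assumed purely for illustration:

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=5):
    # Step 2-3: Euclidean distance to every training point, keep the k nearest.
    distances = [(math.dist(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    nearest = sorted(distances)[:k]
    # Step 4-5: count labels among the k neighbours and take the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Assumed toy data: two clusters, categories "A" and "B".
points = [(1, 1), (1, 2), (2, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 3), k=5))  # "A": 4 of the 5 nearest neighbours are in A
```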

-----------------------------------------------------------------------------------------------------------------
Radius Distance Nearest Neighbor Algorithm
 The Radius Nearest Neighbors algorithm identifies data points within a specified radius
of a given point, extending the concept of k-nearest neighbors by focusing on a fixed
distance rather than a fixed number of neighbors.
 Distance Calculation: The algorithm first calculates the distance between a query
point and all other data points in the dataset using a chosen distance metric (e.g.,
Euclidean distance, Manhattan distance).
 Radius Definition: A user-defined radius (r) is specified, determining the maximum
distance to consider as a "neighbor".
 Neighbor Identification: The algorithm identifies all data points whose distance from
the query point is less than or equal to the specified radius.
Applications:
 Classification: In classification tasks, the algorithm predicts the class of a new data
point by assigning it the most common class among the neighbors within the radius.
 Regression: In regression tasks, the algorithm predicts a target value by averaging the
target values of the neighbors within the radius, or using local interpolation.
Advantages:
 Simplicity: The algorithm is relatively simple to understand and implement.
 Flexibility: It can be used for both classification and regression tasks.
 Efficiency: Efficient algorithms can provide fast approximate query responses.
Disadvantages:
 Parameter Tuning: The choice of the radius parameter (r) can affect performance and
may require careful tuning.
 Computational Cost: In some cases, calculating distances between all data points can be computationally expensive, especially for large datasets.
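A brief sketch using scikit-learn's RadiusNeighborsClassifier; the toy data, the radius value r = 2.0, and the outlier label are illustrative assumptions:

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Assumed toy data: points near the origin are class 0, points near (5, 5) are class 1.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# All training points within radius r = 2.0 of a query vote on its class.
clf = RadiusNeighborsClassifier(radius=2.0, outlier_label=-1)
clf.fit(X, y)
# The last query has no neighbours within the radius, so it gets the outlier label.
print(clf.predict([[0.5, 0.5], [5.5, 5.5], [3.0, 3.0]]))  # [0, 1, -1]
```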

--------------------------------------------------------------------------------------------------------------
KNN Regression:
 K-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning
algorithm used for predicting continuous values by averaging the output values of the k
nearest data points in the training set.
 Non-parametric: k-NN regression doesn't make assumptions about the underlying data
distribution or the relationship between input and output variables.
 Instance-based: It stores the entire training dataset and makes predictions based on the
similarity of new data points to existing ones.
How it works:
 Choose k: Determine the number of nearest neighbors to consider for prediction.
 Calculate distances: Calculate the distance between the new data point and all points
in the training set, often using Euclidean distance.
 Select k nearest neighbors: Identify the k closest data points in the training set.
 Predict the output: Average the output values (target values) of the k nearest
neighbors to predict the output for the new data point.
Advantages:
 Simplicity: k-NN is a straightforward and intuitive algorithm.
 Flexibility: It can handle non-linear relationships between input and output variables.
 No training phase: It doesn't require a separate training phase; predictions are made
directly from the data.
Disadvantages:
 Computational cost: Calculating distances to all data points can be computationally
expensive, especially for large datasets.
 Sensitivity to parameters: The choice of k and the distance metric can significantly
impact the performance of the model.
 Not interpretable: It's difficult to understand why a particular prediction was made.
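A minimal sketch of the k-NN regression steps above on assumed one-dimensional toy data:

```python
def knn_regress(train_x, train_y, query, k=3):
    # Distances from the query to every training point (absolute difference in 1-D).
    distances = sorted((abs(x - query), y) for x, y in zip(train_x, train_y))
    # Average the target values of the k nearest neighbours.
    neighbours = distances[:k]
    return sum(y for _, y in neighbours) / k

# Assumed toy data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
# Averages the targets of x = 3, 4, 2: (6.2 + 8.1 + 3.9) / 3 ≈ 6.07
print(knn_regress(xs, ys, 3.4, k=3))
```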

---------------------------------------------------------------------------------------------------------------
Performance of Classifier:
 Classifier performance in machine learning is evaluated using metrics like accuracy, precision, recall, and F1-score, which measure how well a model predicts and categorizes data.
 These metrics help assess the model's predictive capabilities and identify areas for
improvement.
 Examples of classifier algorithms are Decision Trees, Support Vector Machines, Naive Bayes, and Random Forest.
Accuracy: The overall correctness of the classifier, calculated as the number of correctly
classified instances divided by the total number of instances.
Precision: The ability of the classifier to avoid false positives, calculated as the number of
true positives divided by the sum of true positives and false positives.
Recall: The ability of the classifier to find all relevant instances, calculated as the number of
true positives divided by the sum of true positives and false negatives.
F1-score: The harmonic mean of precision and recall, providing a balanced measure of the
classifier's performance.
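A compact sketch computing these four metrics directly from their definitions; the example labels are assumed:

```python
def classification_metrics(actual, predicted, positive=1):
    # Assumes the positive class appears in both the actual and the predicted labels.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed example: 10 binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))  # ≈0.8 for all four metrics
```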

---------------------------------------------------------------------------------------------------------------
Performance of Regression Algorithms:
 To evaluate the performance of regression algorithms, we use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE).
 These metrics quantify how well the model predicts continuous values by measuring the difference between predicted and actual values.

Key Performance Metrics for Regression Algorithms:
Mean Squared Error (MSE):
 Calculates the average of the squared differences between predicted and actual values.
 Penalizes larger errors more heavily due to the squaring operation.
 A lower MSE indicates better model performance.
Root Mean Squared Error (RMSE):
 The square root of the MSE.
 Provides a value in the same units as the target variable, making it easier to interpret.
 A lower RMSE indicates better model performance.
R-squared (Coefficient of Determination):
 Measures the proportion of variance in the dependent variable that is explained by the
independent variable(s).
 Ranges from 0 to 1, with higher values indicating a better fit.
 A value of 1 indicates that the model perfectly predicts the target variable.
Mean Absolute Error (MAE):
 Calculates the average of the absolute differences between predicted and actual values.
 Less sensitive to outliers than MSE and RMSE.
 A lower MAE indicates better model performance.
Mean Absolute Percentage Error (MAPE):
 Calculates the average of the absolute percentage differences between predicted and
actual values.
 Expresses the error as a percentage, making it easier to compare different models.
 A lower MAPE indicates better model performance.
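A short sketch computing all five metrics with NumPy; the example values are assumed, and MAPE requires that no actual value be zero:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                               # average squared error
    rmse = np.sqrt(mse)                                      # same units as the target
    mae = np.mean(np.abs(errors))                            # average absolute error
    mape = np.mean(np.abs(errors / y_true)) * 100            # assumes no true value is zero
    r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, rmse, mae, mape, r2

# Assumed example values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 8.0]
print(regression_metrics(y_true, y_pred))  # MSE 0.375, RMSE ≈0.61, MAE 0.5, MAPE ≈8.7%, R² 0.925
```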
---------------------------------------------------------------------------------------------------------------
