UNIT-2 ML notes
UNIT-2
Nearest Neighbor-Based Models
Introduction to Proximity Measures:
In machine learning, proximity measures quantify the similarity or dissimilarity between
data points, crucial for tasks like clustering, classification, and outlier detection.
These measures, often represented as distances or similarity scores, help algorithms
understand how alike or different objects are, enabling informed decision-making.
Proximity measures are mathematical functions that assess the degree of similarity or
dissimilarity between data points or objects.
They provide a numerical representation of how "close" two objects are, based on their
attributes or features.
Types:
Similarity Measures: Quantify how alike two objects are, with higher values
indicating greater similarity.
Dissimilarity Measures: Quantify how different two objects are, with higher values
indicating greater dissimilarity.
Applications:
Clustering: Grouping similar data points together.
Classification: Assigning data points to categories based on their proximity to known
classes.
Nearest Neighbor Algorithms: Finding the most similar data points to a given point.
Examples of Proximity Measures:
Euclidean Distance: Measures the straight-line distance between two points in a
multi-dimensional space.
Hamming Distance: Counts the number of positions where two strings or binary
vectors differ.
How they work:
Proximity measures typically take two objects as input and return a numerical value
representing their proximity.
The specific formula or algorithm used depends on the type of data and the desired
interpretation of proximity.
For example, Euclidean distance calculates the square root of the sum of the squared
differences between corresponding attributes.
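As an illustrative sketch (not part of the original notes), this minimal Python function implements the Euclidean distance computation just described; the sample points are made up:

import math

def euclidean_distance(x, y):
    # Square root of the sum of squared differences between corresponding attributes
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Example with made-up 3-dimensional points
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0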
--------------------------------------------------------------------------------------------------------------
Distance Measures:
Distance measures quantify the dissimilarity or similarity between data points, crucial
for algorithms like k-NN and clustering, with common examples including Euclidean,
Manhattan, and Hamming distances.
1. Euclidean Distance: The straight-line distance between two points in a multi-dimensional
space, often referred to as the L2 norm.
Formula: √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
2. Manhattan Distance: The sum of the absolute differences between corresponding
coordinates of two points, also known as the L1 norm.
Formula: |x1 - y1| + |x2 - y2| + ... + |xn - yn|
3. Minkowski Distance: A generalization of both Euclidean and Manhattan distances,
allowing for different powers (p) to be used in the calculation.
Formula: (∑|xi - yi|^p)^(1/p)
4. Hamming Distance: The number of positions at which two strings of equal length differ,
commonly used for comparing binary vectors or strings.
Formula: Count the number of differing bits/characters.
5. Cosine Similarity: Measures the cosine of the angle between two vectors, indicating the
similarity between their orientations rather than their magnitudes.
Formula: (Vector A . Vector B) / (||Vector A|| * ||Vector B||)
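As an illustrative sketch (not part of the original notes), the remaining measures above can be implemented in plain Python as follows; all sample inputs are made up:

import math

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p):
    # Generalizes Manhattan (p=1) and Euclidean (p=2)
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

def hamming(a, b):
    # Number of positions where two equal-length sequences differ
    return sum(ai != bi for ai, bi in zip(a, b))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

print(manhattan([1, 2], [4, 6]))          # 7
print(minkowski([1, 2], [4, 6], p=2))     # 5.0 (same as Euclidean)
print(hamming("10110", "10011"))          # 2
print(cosine_similarity([1, 0], [1, 1]))  # ~0.707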
Choosing the Right Distance Metric:
The choice of distance metric depends on the type of data, the problem, and the
algorithm being used. Consider the following factors:
Data Type: Are the features continuous, categorical, or binary?
Algorithm: Some algorithms are more sensitive to certain distance metrics than others.
--------------------------------------------------------------------------------------------------------------
Different Classification Algorithms Based on Distance Measures in Machine Learning
1. Logistic Regression: Logistic regression uses the sigmoid function to return the
probability of a label. It is widely used when the classification problem is binary: true or false,
win or lose, positive or negative.
2. Decision Tree: A decision tree builds branches in a hierarchical approach, and each branch
can be considered an if-else statement. The branches develop by partitioning the dataset into
subsets based on the most important features. Final classification happens at the leaves of the
decision tree.
3. Support Vector Machine (SVM): A support vector machine finds the best way to classify the
data based on its position relative to a boundary between the positive class and the negative class.
This boundary is known as the hyperplane, which maximizes the distance (margin) between data
points from the different classes.
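For illustration, a minimal scikit-learn sketch (an assumption of this example; the notes do not name a library) that fits each of the classifiers above on the library's built-in iris dataset and reports test accuracy:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each classifier and report accuracy on the held-out test split
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(), SVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))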
----------------------------------------------------------------------------------------------------------------
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to every point in the training data.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
Step-6: Our model is ready.
Firstly, we will choose the number of neighbors, so we will choose k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. For two
points in the plane it can be calculated as:
√((x2 - x1)² + (y2 - y1)²)
By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors
in category A and two nearest neighbors in category B. Since category A has the majority among
the five neighbors, the new data point is assigned to category A.
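A minimal from-scratch Python sketch of these steps (illustrative only; the training points are made up so that three of the five nearest neighbors fall in category A):

import math
from collections import Counter

def knn_classify(train, query, k=5):
    """train: list of (point, label) pairs; query: the new point to classify."""
    # Steps 2-3: compute Euclidean distances and take the K nearest neighbors
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    # Steps 4-5: count categories among the K neighbors and pick the majority
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up 2-D points: three of category A and two of category B near the query
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((9, 9), "B")]
print(knn_classify(train, (2, 2), k=5))  # "A" (3 of the 5 nearest are A)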
-----------------------------------------------------------------------------------------------------------------
Radius Nearest Neighbors Algorithm
The Radius Nearest Neighbors algorithm identifies data points within a specified radius
of a given point, extending the concept of k-nearest neighbors by focusing on a fixed
distance rather than a fixed number of neighbors.
Distance Calculation: The algorithm first calculates the distance between a query
point and all other data points in the dataset using a chosen distance metric (e.g.,
Euclidean distance, Manhattan distance).
Radius Definition: A user-defined radius (r) is specified, determining the maximum
distance to consider as a "neighbor".
Neighbor Identification: The algorithm identifies all data points whose distance from
the query point is less than or equal to the specified radius.
Applications:
Classification: In classification tasks, the algorithm predicts the class of a new data
point by assigning it the most common class among the neighbors within the radius.
Regression: In regression tasks, the algorithm predicts a target value by averaging the
target values of the neighbors within the radius, or using local interpolation.
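As an illustrative sketch (assuming scikit-learn, which provides a RadiusNeighborsClassifier; the data points are made up), radius-based classification can be run as follows:

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Made-up training data: two well-separated clusters labeled 0 and 1
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# All training points within radius r of the query vote on its class
clf = RadiusNeighborsClassifier(radius=2.0)
clf.fit(X, y)
print(clf.predict([[1.5, 1.5]]))  # [0]: only cluster-0 points fall inside the radius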
Advantages:
Simplicity: The algorithm is relatively simple to understand and implement.
--------------------------------------------------------------------------------------------------------------
KNN Regression:
K-Nearest Neighbors (k-NN) regression is a non-parametric, instance-based learning
algorithm used for predicting continuous values by averaging the output values of the k
nearest data points in the training set.
Non-parametric: k-NN regression doesn't make assumptions about the underlying data
distribution or the relationship between input and output variables.
Instance-based: It stores the entire training dataset and makes predictions based on the
similarity of new data points to existing ones.
How it works:
Choose k: Determine the number of nearest neighbors to consider for prediction.
Calculate distances: Calculate the distance between the new data point and all points
in the training set, often using Euclidean distance.
Select k nearest neighbors: Identify the k closest data points in the training set.
Predict the output: Average the output values (target values) of the k nearest
neighbors to predict the output for the new data point.
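For illustration (assuming scikit-learn; the training data is made up), a minimal k-NN regression sketch following these steps:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data: noisy samples of roughly y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# The prediction is the average target of the k nearest training points
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # mean of the targets at x=2 and x=3 -> ~5.05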
Advantages:
Simplicity: k-NN is a straightforward and intuitive algorithm.
Flexibility: It can handle non-linear relationships between input and output variables.
No training phase: It doesn't require a separate training phase; predictions are made
directly from the data.
Disadvantages:
Computational cost: Calculating distances to all data points can be computationally
expensive, especially for large datasets.
Sensitivity to parameters: The choice of k and the distance metric can significantly
impact the performance of the model.
Not interpretable: It's difficult to understand why a particular prediction was made.
---------------------------------------------------------------------------------------------------------------
Performance of Classifier:
Classifier performance is typically evaluated with metrics such as accuracy, precision,
recall, and the F1-score, which quantify how often the model's predicted labels match the
actual labels; these are usually derived from the confusion matrix of correct and incorrect
predictions.
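As an illustrative sketch (assuming scikit-learn; the label vectors are made up), these metrics can be computed as follows:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Made-up true and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1-score :", f1_score(y_true, y_pred))         # 0.75
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))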
---------------------------------------------------------------------------------------------------------------
Performance of Regression Algorithms:
To evaluate the performance of regression algorithms, we use metrics like Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error
(MAE), which quantify how well the model predicts continuous values by measuring the
difference between predicted and actual values.
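As an illustrative sketch (assuming scikit-learn and NumPy; the values are made up), these metrics can be computed as follows:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted continuous values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # RMSE is simply the square root of MSE
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))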