ML Unit-IV Notes
Classification
Classification Terminologies
Classifier – It is an algorithm that is used to map the input data to a specific category.
Classification Model – A model learned from the training data that draws a conclusion about new input data;
it predicts the class or category of that data.
Binary Classification – A type of classification with two possible outcomes, e.g. either true or false.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-Class Classification – Classification with more than two classes. In multi-class classification,
each sample is assigned to one and only one label or target. Examples: classification of types of crops,
classification of types of music.
Multi-label Classification – A type of classification where each sample is assigned to a set of
labels or targets. For example, predicting which of several categories (food, animal, vegetable)
apply to a given sample.
Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the model to
the training data X and the training labels y.
Predict the Target – For an unlabeled observation X, the predict(X) method returns the predicted label y.
Evaluate – The model is then evaluated, e.g. with a classification report, accuracy score, etc.
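The three steps above (train, predict, evaluate) can be sketched as follows. This is a minimal illustration that assumes scikit-learn is installed; the toy data and the choice of LogisticRegression are made up for the example, and any other scikit-learn classifier follows the same fit/predict pattern.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Made-up toy data: one feature, two well-separated classes.
X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [[1.5], [11.5]]
y_test = [0, 1]

clf = LogisticRegression()     # illustrative choice of classifier
clf.fit(X_train, y_train)      # train the classifier
y_pred = clf.predict(X_test)   # predict the target for unseen observations

# Evaluate the predictions against the known test labels.
accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
print(classification_report(y_test, y_pred))
```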
Classification
As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and
Classification algorithms. In Regression algorithms, we predict the output for continuous values.
A Classification algorithm is used to identify the category of new observations on the basis of training data.
In Classification, a machine learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog.
It can be performed on both structured and unstructured data. The process starts with predicting the
class of the given data points.
It takes labelled input data, which means each input comes with its corresponding output.
Classification predictive modeling is the task of approximating the mapping function from input
variables to discrete output variables.
The main goal of a Classification algorithm is to identify the category of a given data point, and these
algorithms are mainly used to predict the output for categorical data.
Classification can be pictured with two classes, Class A and Class B: the points within a class have
features that are similar to each other and dissimilar to those of the other class.
Examples of Classification
Analysis of customer data to predict whether a customer will buy computer accessories (Target class: Yes
or No)
Classifying fruits from features like colour, taste, size, weight (Target class: Apple, Orange, Cherry,
Banana)
Gender classification from hair length as the training feature (Target class: Male or Female)
Heart disease detection can be framed as a classification problem. It is a binary classification
since there can be only two classes, i.e. has heart disease or does not have heart disease. The
classifier, in this case, needs training data to understand how the given input variables are related to
the class. Once the classifier is trained accurately, it can be used to detect whether heart disease is
present or not for a particular patient.
Uses / Applications of Classification
Predicting the gender of a person by his/her handwriting style
Speech Recognition
Drugs Classification
Linear Models:
Logistic Regression
Non-linear Models: Nonlinear regression models are anything that does not follow the linear form.
Nonlinear regression is much more flexible in the shapes of the curves that it can fit.
K-Nearest Neighbours
Kernel SVM
Naïve Bayes
Lazy Learners:
A Lazy Learner first stores the training dataset and waits until it receives the test dataset.
In the Lazy Learner case, classification is done on the basis of the most closely related data stored in the
training dataset. It takes less time in training but more time in prediction.
Eager Learners:
Eager Learners develop a classification model based on a training dataset before receiving a test
dataset.
Unlike Lazy Learners, Eager Learners take more time in learning and less time in prediction.
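The contrast between the two kinds of learner can be sketched with two toy classifiers on one-dimensional data. Both classes below are hypothetical illustrations written for this note (a 1-nearest-neighbour rule as the lazy learner, a class-mean rule as the eager learner), not library implementations.

```python
class LazyNearestNeighbour:
    """Lazy learner: fit() only stores the data; the work happens in predict()."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Scan every stored training sample at prediction time.
        distances = [abs(xi - x) for xi in self.X]
        return self.y[distances.index(min(distances))]


class EagerCentroid:
    """Eager learner: fit() builds a compact model (one mean per class)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            members = [xi for xi, yi in zip(X, y) if yi == label]
            self.centroids[label] = sum(members) / len(members)
        return self

    def predict(self, x):
        # Only the stored class means are consulted, not the training data.
        return min(self.centroids, key=lambda label: abs(self.centroids[label] - x))


X = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
y = ["A", "A", "A", "B", "B", "B"]

lazy = LazyNearestNeighbour().fit(X, y)
eager = EagerCentroid().fit(X, y)
print(lazy.predict(2.5), eager.predict(7.0))
```

The lazy learner's training step is trivial but every prediction touches the whole training set, while the eager learner pays the cost once in fit() and predicts from a small summary.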
Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique.
It is used for predicting a categorical dependent variable using a given set of independent variables.
In its basic form, the outcome is measured with a dichotomous variable, meaning it has only two possible outcomes.
The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False,
etc., but instead of giving the exact value as 0 or 1, the model gives probabilistic values which lie
between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose
output is bounded between the two limiting values 0 and 1.
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
The goal of logistic regression is to find a best-fitting relationship between the dependent variable
and a set of independent variables.
It is often preferred over other binary classification algorithms such as nearest neighbour because it
quantitatively explains the factors leading to a classification.
It is used to predict the probability of a binary (yes/no) event occurring. An example of logistic
regression could be applying machine learning to determine if a person is likely to be infected with
COVID-19 or not. Since we have two possible outcomes to this question - yes they are infected, or no they
are not infected - this is called binary classification.
In this imaginary example, the probability of a person being infected with COVID-19 could be based on
the viral load, symptoms and the presence of antibodies (Independent Variables), which would
influence our outcome (Dependent Variable).
Logistic regression uses the same predictive modeling concept as regression, which is why it is called logistic
regression; however, it is used to classify samples, so it falls under the classification algorithms.
Logistic Regression Equation
The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get the Logistic Regression equation are given below:
We know the equation of the straight line can be written as:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y):
y / (1-y), which is 0 for y = 0 and tends to infinity as y approaches 1
But we need a range from -[infinity] to +[infinity], so we take the logarithm, and the equation becomes:
log[y / (1-y)] = b0 + b1*x1 + b2*x2 + ... + bn*xn
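The effect of the two transformations described above (dividing y by 1-y, then taking the logarithm) can be checked numerically. This is a small sketch; the function names odds, log_odds and inverse_log_odds are chosen here for illustration.

```python
import math


def odds(y):
    """Odds y / (1 - y): maps a probability in (0, 1) to (0, +infinity)."""
    return y / (1.0 - y)


def log_odds(y):
    """Log-odds (logit): maps a probability in (0, 1) to (-infinity, +infinity)."""
    return math.log(odds(y))


def inverse_log_odds(z):
    """Solving log(y / (1 - y)) = z for y gives the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


for y in (0.01, 0.5, 0.99):
    z = log_odds(y)
    print(f"y={y:.2f}  odds={odds(y):.4f}  log-odds={z:+.4f}  back to y={inverse_log_odds(z):.2f}")
```

Because the log-odds can take any real value, it can be set equal to a linear combination of the independent variables; inverting it brings the result back into the range 0 to 1.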
The independent variables should not have multicollinearity.
The dependent variable must be categorical in nature.
Types of Logistic Regression
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, i.e. types having no quantitative significance, such as "Type A", "Type B", or "Type C",
or "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic regression, the dependent variable can have 3 or more possible ordered types,
i.e. types having a quantitative significance, such as "poor", "good", "very good", "excellent" (each
category can have a score like 0, 1, 2, 3), or "low", "medium", "high".
Advantages & Disadvantages
Logistic regression is specifically meant for classification; it is useful in understanding how a set of
independent variables affects the outcome of the dependent variable.
The main disadvantages of the logistic regression algorithm are that, in its basic (binomial) form, it only
works when the predicted variable is binary, it assumes that the data is free of missing values, and it
assumes that the predictors are independent of each other.
Use Cases of Logistic Regression
Identifying risk factors for diseases
Word classification
Weather Prediction
Voting Applications
In health care, logistic regression can be used to predict if a tumor is likely to be benign or malignant.
In the financial industry, logistic regression can be used to predict if a transaction is fraudulent or not.
In marketing, logistic regression can be used to predict if a targeted audience will respond or not.
How does Logistic Regression Work?
Consider the following example: An organization wants to determine an employee’s salary increase
based on their performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by
considering the employee’s performance as the independent variable, and the salary increase as the
dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee would get a promotion or not
based on their performance? The above linear graph won’t be suitable in this case. As such, we clip the
line at zero and one, and convert it into a sigmoid curve (S curve).
Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to probabilities.
It maps any real value into another value within a range of 0 and 1.
The value cannot go beyond these limits, so it forms a curve like the "S" form. The S-form curve is called
the sigmoid function or logistic function.
In logistic regression, we use the concept of a threshold value, which decides whether a probability is mapped
to 0 or 1: values above the threshold are assigned to 1, and values below the threshold are assigned to 0.
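A minimal sketch of the sigmoid function and the threshold rule follows. The default threshold of 0.5 and the helper name classify are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map any real value z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  probability={sigmoid(z):.3f}  class={classify(z)}")
```

Raising the threshold makes the model more reluctant to predict class 1; for instance, classify(1.0, threshold=0.9) returns 0 because sigmoid(1.0) is roughly 0.73.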
Performance Metric: Confusion Matrix
The confusion matrix is a table used to describe the performance of a classification model on a set of test
data for which the true values are known.
The matrix itself can be easily understood, but the related terminologies may be confusing. Since it shows
the errors in the model performance in the form of a matrix, it is also known as an error matrix.
Some features of the confusion matrix are given below:
For 2 prediction classes, the matrix is a 2*2 table; for 3 classes, it is a 3*3 table, and so
on.
The matrix has two dimensions, predicted values and actual values, along with the
total number of predictions.
Predicted values are the values predicted by the model, and actual values are the true
values for the given observations.
For two classes, it looks like the table below:
                      Actual: Positive       Actual: Negative
Predicted: Positive   True Positive (TP)     False Positive (FP)
Predicted: Negative   False Negative (FN)    True Negative (TN)
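It tells us not only the errors made by the classifier but also the type of each error, i.e. whether it is a
Type-I error (false positive) or a Type-II error (false negative).
With the help of the confusion matrix, we can calculate different parameters for the model, such as
accuracy, precision, etc.
Suppose we are trying to create a model that predicts whether a person has a particular disease or not.
In the confusion matrix for this model, a true positive is a person with the disease who is predicted to have
it, a true negative is a healthy person predicted to be healthy, a false positive is a healthy person predicted
to have the disease, and a false negative is a person with the disease predicted to be healthy.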
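The four cells of a two-class confusion matrix can be counted directly from the actual and predicted labels. The sketch below uses made-up labels (1 = has the disease, 0 = does not); the function name confusion_counts is illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative (Type-I error)
        elif a == positive:
            fn += 1   # predicted negative, actually positive (Type-II error)
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


# Made-up labels for ten people.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```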
Calculations using Confusion Matrix
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These
calculations are given below:
Classification Accuracy
Misclassification rate
Precision
Recall / Sensitivity
ROC Curve
Classification Accuracy
It can be calculated as the ratio of the number of correct predictions made by the classifier to the total
number of predictions made by the classifier:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It is good to use the Accuracy metric when the target variable classes in the data are approximately
balanced.
For example, suppose 60% of the images in a fruit dataset are of Apple and 40% are of Mango. If a model
asked to predict whether an image is of Apple or Mango reaches 97% accuracy on this data, that figure is a
reliable indicator of its performance.
It is recommended not to use the Accuracy measure when the target variable mostly belongs to
one class.
For example, suppose there is a model for disease prediction in which, out of 100 people, only five
people have the disease and 95 people don't. In this case, if our model predicts that every person
has no disease (which is a bad prediction, since it misses every patient), the Accuracy measure will still
be 95%, which is misleading.
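The disease example above can be reproduced in a few lines. This is a minimal sketch with made-up labels (1 = has the disease, 0 = does not); the variable names are illustrative.

```python
# 100 people: 5 have the disease, 95 do not.
actual = [1] * 5 + [0] * 95
# A model that predicts "no disease" for everyone.
predicted = [0] * 100

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# Share of the actual patients that the model detected.
detected = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
detection_rate = detected / sum(actual)

print(f"accuracy={accuracy:.2f}  patients detected={detection_rate:.2f}")
```

The accuracy is high only because the majority class dominates; not a single patient is actually found.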
Misclassification rate
It is also termed the Error rate, and it defines how often the model gives wrong predictions.
The error rate can be calculated as the ratio of the number of incorrect predictions to the total number of
predictions made by the classifier:
Error rate = (FP + FN) / (TP + TN + FP + FN)
Precision
It is the number of positive patterns predicted correctly, divided by the total number of patterns predicted
as positive:
Precision = TP / (TP + FP)
Recall / Sensitivity
It is similar to the Precision metric; however, it calculates the proportion of actual positives
that were identified correctly.
It can be calculated as the number of true positives divided by the total number of actual positives,
whether correctly predicted as positive or incorrectly predicted as negative (true positives plus false
negatives):
Recall = TP / (TP + FN)
When to use Precision and Recall?
From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives information
about the performance of a classifier with respect to a false positive.
So, if we want to minimize false negatives, then recall should be as close to 100% as possible, and if we
want to minimize false positives, then precision should be as close to 100% as possible.
In simple words, if we maximize precision, it will minimize the FP errors, and if we maximize
recall, it will minimize the FN errors.
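The metrics discussed so far can all be computed from the four confusion-matrix counts. The sketch below uses hypothetical counts chosen only for illustration, and it assumes that at least one sample was predicted positive and at least one is actually positive (otherwise the divisions would fail).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, error rate, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # correct predictions / all predictions
        "error_rate": (fp + fn) / total,  # wrong predictions / all predictions
        "precision": tp / (tp + fp),      # how many predicted positives are real
        "recall": tp / (tp + fn),         # how many real positives were found
    }


# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here precision (0.8) is higher than recall (about 0.667): the model makes relatively few false-positive errors but misses a third of the actual positives.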
F- Measure / F-Score / F1-Score
If one model has low precision and high recall and another has the reverse, it is difficult to compare them. So,
for this purpose, we can use the F-score.
F-score or F1 Score is a metric to evaluate a binary classification model on the basis of predictions that are
made for the positive class.
It is a single score that represents both Precision and Recall.
The F1 Score is calculated as the harmonic mean of precision and recall, assigning
equal weight to each of them.
This score helps us to evaluate recall and precision at the same time.
For a given sum of precision and recall, the F-score is highest when recall is equal to precision. It can be
calculated using the formula below:
F1 = 2 * Precision * Recall / (Precision + Recall)
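A short sketch of the F1 calculation shows why the harmonic mean is used: it stays low unless both precision and recall are reasonably high. The function name f1_score and the convention of returning 0.0 when both inputs are zero are choices made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)     # precision and recall equal
lopsided = f1_score(0.9, 0.1)     # same arithmetic mean (0.5), very unequal values

print(f"balanced: {balanced:.2f}  lopsided: {lopsided:.2f}")
```

Both pairs have an arithmetic mean of 0.5, but the lopsided pair gets a much lower F1, so a model cannot score well by excelling at only one of the two.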
When to use F-Score?
As the F-score makes use of both precision and recall, it should be used when both of them are important
for evaluation. When one of them is more important than the other, for example when false negatives are
more costly than false positives or vice versa, the weighted F-beta variant of the score can be used instead
of F1.
Null error rate
It defines how often our model would be incorrect if it always predicted the majority class.
As per the accuracy paradox, "the best classifier for a particular application will sometimes have a higher
error rate than the null error rate."
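The null error rate is simply the share of samples that do not belong to the majority class. A minimal sketch, using a made-up fruit dataset with 60 Apple and 40 Mango labels; the function name null_error_rate is illustrative.

```python
from collections import Counter


def null_error_rate(labels):
    """Error rate of a model that always predicts the most frequent class."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return (len(labels) - majority_count) / len(labels)


labels = ["Apple"] * 60 + ["Mango"] * 40
baseline = null_error_rate(labels)
print(f"null error rate: {baseline:.2f}")
```

This value is a useful baseline: a trained classifier's error rate is normally compared against it before its accuracy is considered meaningful.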
Receiver Operating Characteristics / ROC Curve
The ROC curve is a graph displaying a classifier's performance for all possible thresholds. It is plotted
with the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis.
TPR is a synonym for Recall, hence it can be calculated as:
TPR = TP / (TP + FN)
FPR can be calculated as:
FPR = FP / (FP + TN)
To calculate the value at any point on a ROC curve, we can evaluate a logistic regression model multiple times
with different classification thresholds, but this would not be very efficient. So, an aggregate
measure is used instead, which is known as AUC.
AUC: Area Under the ROC curve
AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional
area under the entire ROC curve.
AUC calculates the performance across all the thresholds and provides an aggregate measure. The value of
AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have an AUC of 0.0, whereas
models with 100% correct predictions will have an AUC of 1.0.
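The ROC points and the AUC can be computed by sweeping the threshold over the predicted scores. The following is a simple sketch with made-up labels and scores; the function names roc_points and auc are illustrative, and the area is approximated with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve through the given points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: 1 = positive class, scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))

# Extreme cases: every positive ranked above / below every negative.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
inverted = auc(roc_points([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9]))
print(f"AUC={area:.3f}  perfect={perfect:.1f}  inverted={inverted:.1f}")
```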
Precision vs Recall
Precision: It helps us to measure the ability to classify positive samples in the model.
Recall: It helps us to measure how many positive samples were correctly classified by the ML model.
Precision: While calculating the Precision of a model, we should consider both Positive as well as Negative samples that are classified.
Recall: While calculating the Recall of a model, we only need all positive samples while all negative samples will be neglected.
Precision: When a model classifies most of the positive samples correctly as well as many false-positive samples, then the model is said to be a high recall and low precision model.
Recall: When a model classifies a sample as Positive, but it can only classify a few positive samples, then the model is said to be high accuracy, high precision, and low recall model.
Precision: The precision of a machine learning model is dependent on both the negative and positive samples.
Recall: Recall of a machine learning model is dependent on positive samples and independent of negative samples.
Precision: In Precision, we should consider all positive samples that are classified as positive either correctly or incorrectly.
Recall: The recall cares about correctly classifying all positive samples. It does not consider if any negative sample is classified as positive.
Gradient Descent / Steepest Descent
It is one of the most commonly used iterative optimization algorithms for training machine learning and deep
learning models by minimizing the error between actual and expected results. Gradient descent is
also used to train Neural Networks.
In machine learning, optimization is the task of minimizing the cost function parameterized by the
model's parameters.
The main objective of gradient descent is to minimize a convex function through iterative parameter
updates. Once machine learning models are optimized, they can be used as powerful tools
for Artificial Intelligence and various computer science applications.
If we move in the direction of the negative gradient, i.e. away from the gradient of the function at the
current point, we approach a local minimum of that function.
If we move in the direction of the positive gradient, i.e. towards the gradient of the function at the current
point, we approach a local maximum of that function. This procedure is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost function through
iteration. To achieve this goal, it performs two steps iteratively:
Calculate the first-order derivative of the function to compute the gradient or slope of that function.
Move in the direction opposite to the gradient, i.e. step from the current point by
alpha times the gradient, where alpha is the Learning Rate. It is a tuning parameter in the optimization
process which decides the length of the steps.
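The two steps can be sketched on a simple one-parameter convex function. The function f(w) = (w - 3)^2, the starting point and the learning rate below are assumptions chosen for illustration.

```python
def f(w):
    """A simple convex cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """First-order derivative of f."""
    return 2.0 * (w - 3.0)


w = 0.0       # arbitrary starting point
alpha = 0.1   # learning rate (step length)

for _ in range(100):
    # Step 1: compute the gradient. Step 2: move against it by alpha times the gradient.
    w = w - alpha * gradient(w)

print(f"w={w:.6f}  f(w)={f(w):.2e}")
```

With this learning rate the distance to the minimum shrinks by a constant factor on every step; a learning rate that is too large would overshoot, and one that is too small would converge slowly.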
How does Gradient Descent work?
Before looking at the working principle of gradient descent, we should know some basic concepts for finding
the slope of a line from linear regression. The equation for simple linear regression is given as:
Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point is just an arbitrary point at which the performance is first evaluated. At this starting
point, we compute the first derivative or slope, using a tangent line to measure the steepness of this slope.
This slope then informs the updates to the parameters (weights and bias).
The slope is steep at the starting point, but as new parameters are generated the steepness gradually
reduces until the curve reaches its lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function or the error between expected and
actual.
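The process described above can be sketched for the line Y=mX+c. The data, the mean squared error cost, the learning rate and the number of steps below are assumptions made for this example; the data is generated from Y = 2X + 1 so the expected result is known.

```python
# Made-up data generated from Y = 2X + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.0, 5.0, 7.0, 9.0]


def cost(m, c):
    """Mean squared error between the predictions m*x + c and the actual values."""
    return sum(((m * x + c) - y) ** 2 for x, y in zip(X, Y)) / len(X)


def gradients(m, c):
    """Partial derivatives of the cost with respect to m and c."""
    n = len(X)
    errors = [(m * x + c) - y for x, y in zip(X, Y)]
    grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, X))
    grad_c = 2.0 / n * sum(errors)
    return grad_m, grad_c


m, c = 0.0, 0.0   # arbitrary starting point
alpha = 0.05      # learning rate
history = []      # cost before each update, to observe convergence

for step in range(2000):
    history.append(cost(m, c))
    grad_m, grad_c = gradients(m, c)
    m -= alpha * grad_m
    c -= alpha * grad_c

print(f"m={m:.4f}  c={c:.4f}  final cost={cost(m, c):.2e}")
```

The recorded cost falls quickly at first, when the slope is steep, and then flattens out as the parameters approach the point of convergence.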
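Types of Gradient Descent
Based on the error in various training models, the Gradient Descent learning algorithm can be divided into
batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
Batch Gradient Descent
Batch gradient descent computes the error for every point in the training set and updates the model once
per training epoch.
It is computationally efficient as all resources are used for all training samples.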
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per
iteration. In other words, within each training epoch it goes through the examples of the dataset one at a
time and updates the model's parameters after each example.
As it requires only one training example at a time, it is easier to fit in the allocated memory. However,
it loses some computational efficiency in comparison to batch gradient descent because of its
frequent updates.
Further, due to these frequent updates, its gradient is noisy. However, this noise can sometimes be
helpful in escaping a local minimum and finding the global minimum.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, which gives it a few
advantages over the other types of gradient descent.
It is easier to fit in the allocated memory.
Each update is relatively faster to compute than in batch gradient descent.
It is more efficient for large datasets.
Mini Batch Gradient Descent
Mini Batch gradient descent is a combination of batch gradient descent and stochastic gradient
descent.
It divides the training dataset into small batches and then performs an update on each batch
separately. Splitting the training dataset into smaller batches strikes a balance between the computational
efficiency of batch gradient descent and the speed of stochastic gradient descent.
Hence, we obtain a type of gradient descent with higher computational efficiency and a less
noisy gradient.
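The three types differ only in how many training examples are used for each parameter update, which the sketch below expresses through a single batch_size argument. The data (generated from Y = 2X + 1), the learning rate, the number of epochs and the function name train are assumptions made for this example.

```python
import random

# Made-up data generated from Y = 2X + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(batch_size, alpha=0.02, epochs=2000, seed=0):
    """Fit Y = m*X + c with gradient descent, updating once per batch."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(m * X[i] + c) - Y[i] for i in batch]
            # Gradient of the mean squared error over this batch only.
            grad_m = 2.0 / len(batch) * sum(e * X[i] for e, i in zip(errors, batch))
            grad_c = 2.0 / len(batch) * sum(errors)
            m -= alpha * grad_m
            c -= alpha * grad_c
    return m, c


batch_result = train(batch_size=len(X))   # batch: one update per epoch
sgd_result = train(batch_size=1)          # stochastic: one update per example
mini_batch_result = train(batch_size=2)   # mini-batch: one update per small batch

for name, (m, c) in [("batch", batch_result), ("sgd", sgd_result), ("mini-batch", mini_batch_result)]:
    print(f"{name}: m={m:.4f}  c={c:.4f}")
```

On this tiny noise-free dataset all three variants reach the same line; on large, noisy datasets the trade-off between update cost and gradient noise becomes the deciding factor.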
It is computationally efficient.