ML Unit-IV Notes

The document provides an overview of classification in machine learning, detailing key terminologies such as classifiers, classification models, and types of classification (binary, multi-class, and multi-label). It explains the process of training classifiers, predicting targets, and evaluating models using metrics like confusion matrix, accuracy, precision, and recall. Additionally, it covers logistic regression, its applications, advantages, and disadvantages, along with the sigmoid function and performance metrics.

Uploaded by

Asfia Al Hera

ML Unit-IV

Classification
Classification Terminologies
Classifier – It is an algorithm that is used to map the input data to a specific category.

Classification Model – A model that draws conclusions from the input data given for training and then
predicts the class or category of new data.

Feature – A feature is an individual measurable property of the phenomenon being observed.

Binary Classification – It is a type of classification with two outcomes, e.g. either true or false.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

Multi-Class Classification – Classification with more than two classes. In multi-class classification,
each sample is assigned to one and only one label or target. Examples: classification of types of crops,
classification of types of music.

Multi-label Classification – This is a type of classification where each sample is assigned to a set of
labels or targets. For example, a single sample may receive several category labels at once (food
category, animal category, vegetable category).
Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the model to
the training data X and the training labels y.

Predict the Target – For an unlabeled observation X, the predict(X) method returns predicted label y.

Evaluate – Assess the model using, for example, a classification report or an accuracy score.
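
A minimal sketch of the train / predict / evaluate cycle described above. To stay dependency-free it uses a hypothetical NearestCentroidClassifier written from scratch; it only mirrors the fit(X, y) / predict(X) convention and is not scikit-learn code. The fruit data and the accuracy_score helper are likewise made up for illustration.

```python
from collections import defaultdict


class NearestCentroidClassifier:
    """Hypothetical toy classifier exposing fit(X, y) and predict(X)."""

    def fit(self, X, y):
        # Group training rows by label, then average each feature per label.
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        # Assign each row to the label whose centroid is closest.
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in X
        ]


def accuracy_score(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: [weight in grams, diameter in cm] -> fruit label.
X_train = [[150, 7.0], [160, 7.5], [10, 2.0], [12, 2.2]]
y_train = ["apple", "apple", "cherry", "cherry"]
X_test = [[155, 7.2], [11, 2.1]]
y_test = ["apple", "cherry"]

clf = NearestCentroidClassifier().fit(X_train, y_train)  # train
y_pred = clf.predict(X_test)                             # predict
print(y_pred, accuracy_score(y_test, y_pred))            # evaluate
```

With a real library the three calls keep the same shape; only the classifier class changes.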
Classification
As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and
Classification algorithms. Regression algorithms predict continuous output values; to predict
categorical values, we need Classification algorithms.

The Classification algorithm identifies the category of new observations on the basis of training data.

In Classification, a machine learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or
dog, etc. Classes can also be called targets, labels, or categories.

It can be performed on both structured and unstructured data. The process starts with predicting the
class of given data points.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc.

It takes labelled input data, which means it contains input with the corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete output (y), i.e. y = f(x), where y is a categorical output.

The classification predictive modeling is the task of approximating the mapping function from input
variables to discrete output variables.

The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood by picturing two classes, Class A and Class B. The
items within each class have features that are similar to each other and dissimilar to those of the
other class.

The algorithm which implements the classification on a dataset is known as a classifier.

We use the training dataset to learn the boundary conditions that determine each target class. Once
the boundary conditions are determined, the next task is to predict the target class. The whole
process is known as classification.
Target class examples:

Analysis of customer data to predict whether a customer will buy computer accessories (Target class:
Yes or No)

Classifying fruits from features like colour, taste, size, weight (Target class: Apple, Orange, Cherry,
Banana)

Gender classification from hair length as training feature (Target class: Male or Female)

Heart disease detection can be treated as a classification problem. It is a binary classification,
since there can be only two classes: has heart disease or does not have heart disease. The
classifier, in this case, needs training data to understand how the given input variables are related to
the class. Once the classifier is trained accurately, it can be used to detect whether or not a
particular patient has heart disease.
Uses / Applications of Classification
• Predicting the gender of a person from their handwriting style
• Predicting whether the monsoon will be normal next year
• Email spam detection
• Speech recognition
• Identification of cancer tumor cells
• Drug classification
• Biometric identification
• Predicting bank customers' willingness to repay loans
• Facial key point detection
• Pedestrian detection for automotive driving


Types of Classification
Linear Models: A linear regression equation simply sums the terms. While the model must be linear in
the parameters, you can raise an independent variable to an exponent to fit a curve. For instance, you
can include a squared or cubed term.

Logistic Regression

Support Vector Machines

Non-linear Models: Nonlinear regression models are anything that doesn’t follow this one form.
Nonlinear regression is much more flexible in the shapes of the curves that it can fit.

K-Nearest Neighbours

Kernel SVM

Naïve Bayes

Decision Tree Classification

Random Forest Classification


Types of Learners in Classification
Lazy Learners:

A Lazy Learner first stores the training dataset and waits until it receives the test dataset.

In the Lazy Learner case, classification is done on the basis of the most closely related data stored
in the training dataset.

It takes less time in training but more time for predictions.

Example: K-NN algorithm, Case-based reasoning

Eager Learners:

Eager Learners develop a classification model based on a training dataset before receiving a test
dataset.

In contrast to Lazy Learners, an Eager Learner takes more time in learning and less time in prediction.

Example: Decision Trees, Naïve Bayes, ANN.
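
A minimal sketch of why K-NN counts as a lazy learner, assuming a hypothetical LazyKNN class and made-up 2-D points: fit() only stores the data, and all distance computation is deferred to prediction time.

```python
from collections import Counter


class LazyKNN:
    """Toy k-nearest-neighbours classifier (illustrative, not a library API)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no model is built yet.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict_one(self, x):
        # All the work happens at prediction time: scan every stored point,
        # sort by squared Euclidean distance, and take a majority vote.
        by_distance = sorted(
            zip(self.X_, self.y_),
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        votes = Counter(label for _, label in by_distance[: self.k])
        return votes.most_common(1)[0][0]


# Made-up 2-D points for two classes.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["A", "A", "A", "B", "B", "B"]

knn = LazyKNN(k=3).fit(X, y)
print(knn.predict_one([2, 2]), knn.predict_one([7, 9]))
```

An eager learner would instead do its heavy computation inside fit() and keep only the resulting model.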


Logistic Regression
It is a classification algorithm in machine learning that uses one or more independent variables to
determine an outcome.

The outcome is measured with a dichotomous variable meaning it will have only two possible outcomes.

Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique.

It is used for predicting the categorical dependent variable using a given set of independent variables.

The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False,
etc. However, instead of giving the exact values 0 and 1, logistic regression gives probabilistic
values that lie between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function
whose output is bounded by the two limiting values 0 and 1.

The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.

The goal of logistic regression is to find a best-fitting relationship between the dependent variable
and a set of independent variables.

An advantage over other binary classification algorithms such as nearest neighbour is that it
quantitatively explains the factors leading to classification.

It is used to predict the probability of a binary (yes/no) event occurring. An example of logistic
regression could be applying machine learning to determine if a person is likely to be infected with
COVID-19 or not. Since we have two possible outcomes to this question - yes they are infected, or no they
are not infected - this is called binary classification.
In this imaginary example, the probability of a person being infected with COVID-19 could be based on
the viral load, symptoms and the presence of antibodies (Independent Variables), which would
influence our outcome (Dependent Variable).

Logistic regression borrows the predictive modeling approach of regression, hence the name. Because it
is used to classify samples, however, it falls under classification algorithms.
Logistic Regression Equation
The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:

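We know the equation of the straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

In Logistic Regression, y can lie between 0 and 1 only, so we divide y by (1 - y):

$$\frac{y}{1-y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and } \infty \text{ for } y = 1$$

But we need a range from -infinity to +infinity, so we take the logarithm, and the equation becomes:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

The above equation is the final equation for Logistic Regression.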
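
As a worked step, the log-odds form of the logistic regression equation can be solved for y, which shows where the S-shaped curve comes from. In this sketch z is only shorthand for the linear term b_0 + b_1 x_1 + ... + b_n x_n.

```latex
\begin{aligned}
\log\frac{y}{1-y} &= z, \qquad z = b_0 + b_1 x_1 + \dots + b_n x_n \\
\frac{y}{1-y} &= e^{z} \\
y &= e^{z}\,(1 - y) \\
y\,(1 + e^{z}) &= e^{z} \\
y &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{aligned}
```

The last line is the logistic (sigmoid) function: y tends to 1 as z grows large and to 0 as z becomes very negative.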


Assumptions for Logistic Regression

The independent variable should not have multi-collinearity.

The dependent variable must be categorical in nature.
Types of Logistic Regression
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.

Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, i.e. types having no quantitative significance. For example, these variables may
represent “Type A”, “Type B”, or “Type C”, or "cat", "dog", or "sheep".

Ordinal: In ordinal Logistic regression, the dependent variable can have 3 or more possible ordered
types, i.e. types having a quantitative significance. For example, these variables may represent “poor”,
“good”, “very good”, and “excellent”, where each category can have a score such as 0, 1, 2, 3; or
"low", "medium", and "high".
Advantages & Disadvantages
Logistic regression is specifically meant for classification. It is useful in understanding how a set of
independent variables affects the outcome of the dependent variable.

The main disadvantages of the logistic regression algorithm are that, in its basic (binomial) form, it
only works when the predicted variable is binary; it assumes that the data is free of missing values;
and it assumes that the predictors are independent of each other.
Use Cases of Logistic Regression
Identifying risk factors for diseases

Word classification

Weather Prediction

Voting Applications

In health care, logistic regression can be used to predict if a tumor is likely to be benign or malignant.

In the financial industry, logistic regression can be used to predict if a transaction is fraudulent or not.

In marketing, logistic regression can be used to predict if a targeted audience will respond or not.
How Does Logistic Regression Work?
Consider the following example: An organization wants to determine an employee’s salary increase
based on their performance.

For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by
considering the employee’s performance as the independent variable, and the salary increase as the
dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee would get a promotion or not
based on their performance? The above linear graph won’t be suitable in this case. As such, we clip the
line at zero and one, and convert it into a sigmoid curve (S curve).
Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to probabilities.

It maps any real value into another value within a range of 0 and 1.

The value cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the sigmoid function or logistic function.

In logistic regression, we use the concept of a threshold value, which decides whether a predicted
probability is assigned to class 0 or class 1: values above the threshold are mapped to 1, and values
below the threshold are mapped to 0.

Formula for sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

So, as the value of z goes to positive infinity, the predicted value of y approaches 1, and as z goes to
negative infinity, the predicted value of y approaches 0. If the output of the sigmoid function is
more than 0.5, we classify the sample as class 1 (the positive class); if it is less than 0.5, we
classify it as class 0 (the negative class).
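
A minimal sketch of the sigmoid function and the 0.5 threshold rule. The function names are illustrative, and sending a probability of exactly 0.5 to class 1 is an assumption of this sketch, since the rule above only covers values strictly above or below the threshold.

```python
import math


def sigmoid(z):
    # Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z, threshold=0.5):
    # Class 1 if the predicted probability reaches the threshold, else class 0.
    # Assumption: a probability exactly equal to the threshold goes to class 1.
    return 1 if sigmoid(z) >= threshold else 0


for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4), classify(z))
```

The printed probabilities move from near 0 to near 1 as z increases, tracing the S shape.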
Performance Metric: Confusion Matrix
The confusion matrix is a matrix used to determine the performance of the classification models for a
given set of test data. It can only be determined if the true values for test data are known.

The matrix itself is easy to understand, but the related terminology may be confusing. Since it shows
the errors in the model's performance in the form of a matrix, it is also known as an error matrix.
Some features of the confusion matrix are given below:

For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it is a 3×3
table, and so on.

The matrix has two dimensions, predicted values and actual values, along with the total number of
predictions.

Predicted values are those values, which are predicted by the model, and actual values are the true
values for the given observations.
It looks like the below table:

                 Predicted: No           Predicted: Yes
Actual: No       True Negative (TN)      False Positive (FP)
Actual: Yes      False Negative (FN)     True Positive (TP)

The above table has the following cases:

True Negative: The model has predicted No, and the actual value was also No.
True Positive: The model has predicted Yes, and the actual value was also Yes.
False Negative: The model has predicted No, but the actual value was Yes. It is also called a Type-II error.
False Positive: The model has predicted Yes, but the actual value was No. It is also called a Type-I error.
Need for Confusion Matrix
It evaluates the performance of the classification models, when they make predictions on test data, and
tells how good our classification model is.

It tells us not only how many errors the classifier made but also the type of each error, i.e. whether
it is a Type-I or a Type-II error.

With the help of the confusion matrix, we can calculate the different parameters for the model, such as
accuracy, precision, etc.

Example: We can understand the confusion matrix using an example.

Suppose we are trying to create a model that predicts whether or not a person has a particular
disease. So, the confusion matrix for this is given as:
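
A minimal sketch of how the four cells of a binary confusion matrix can be tallied. The labels below are hypothetical (1 = has the disease, 0 = does not) and the confusion_counts helper is illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Tally the four cells of a binary confusion matrix.
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1  # Type-I error: predicted Yes, actually No
        else:
            if actual == positive:
                fn += 1  # Type-II error: predicted No, actually Yes
            else:
                tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels: 1 = has the disease, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

The four counts always add up to the total number of predictions, which is a quick sanity check on any confusion matrix.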
Calculations using Confusion Matrix
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These
calculations are given below:

Classification Accuracy

Misclassification rate

Precision

Recall / Sensitivity

F-Measure / F-Score / F1-Score

Null error rate

ROC Curve

AUC : Area Under the ROC Curve


Classification Accuracy
It is one of the most important parameters for evaluating classification problems. It defines
how often the model predicts the correct output.

It can be calculated as the ratio of the number of correct predictions made by the classifier to the
total number of predictions made by the classifier.

The formula is given below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

When to Use Accuracy?

It is good to use the Accuracy metric when the target variable classes in data are approximately
balanced.

For example, suppose 60% of the images in a fruit image dataset are of Apple and 40% are of Mango. If
the model is asked to predict whether an image is of Apple or Mango, an accuracy of, say, 97% is a
meaningful indicator of how well it performs.

When not to use Accuracy?

It is recommended not to use the Accuracy measure when the target variable majorly belongs to
one class.

For example, suppose there is a model for disease prediction in which, out of 100 people, only five
people have the disease and 95 people don't. In this case, if our model predicts that no one has the
disease (which is a bad prediction), the Accuracy measure will still be 95%, which is misleading.
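
The disease example can be checked with a few lines of arithmetic. This is a minimal sketch with the 100 people encoded as made-up label lists (1 = has the disease, 0 = does not).

```python
# 5 of 100 people have the disease (1); the other 95 do not (0).
y_true = [1] * 5 + [0] * 95
# A useless model that predicts "no disease" for everyone.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the 5 sick people did the model actually find?
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)  # 0.95
print(found)     # 0
```

The accuracy looks high even though the model never identifies a single sick person, which is why accuracy alone is a poor guide on imbalanced data.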
Misclassification rate
It is also termed the error rate, and it defines how often the model gives wrong predictions.

The error rate can be calculated as the ratio of the number of incorrect predictions to the total
number of predictions made by the classifier.

The formula is given below:

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}$$

Precision
It answers the question: out of all the samples that the model predicted as positive, how many were
actually positive?

It is the number of positive patterns predicted correctly, divided by the total number of patterns predicted as positive.

The precision metric is used to overcome the limitation of Accuracy.

It can be calculated using the below formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall / Sensitivity
It answers the question: out of all the actual positive samples, how many did our model predict correctly?

It is defined as the fraction of positive patterns that are correctly classified.

The recall should be as high as possible.

It is similar to the Precision metric; however, it measures the proportion of actual positives
that were identified correctly.

It can be calculated as the ratio of True Positives (positive predictions that are actually true) to the
total number of actual positives, whether correctly predicted as positive or incorrectly predicted as
negative (True Positives plus False Negatives):

$$\text{Recall} = \frac{TP}{TP + FN}$$
When to use Precision and Recall?

From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives information
about the performance of a classifier with respect to a false positive.

So, if we want to minimize the false negatives, then recall should be as close to 100% as possible, and
if we want to minimize the false positives, then precision should be as close to 100% as possible.

In simple words, if we maximize precision, it will minimize the FP errors, and if we maximize
recall, it will minimize the FN error.
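
A minimal sketch of precision and recall computed from confusion-matrix counts. The two sets of counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision(tp, fp):
    # Of everything predicted positive, how much was truly positive?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Of everything truly positive, how much did the model find?
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for two models evaluated on the same 50 positive samples.
cautious = {"tp": 20, "fp": 0, "fn": 30}   # rarely predicts positive
liberal = {"tp": 45, "fp": 45, "fn": 5}    # predicts positive very often

for name, c in (("cautious", cautious), ("liberal", liberal)):
    print(name, precision(c["tp"], c["fp"]), recall(c["tp"], c["fn"]))
```

The cautious model makes no false-positive errors (precision 1.0) but misses most positives (recall 0.4); the liberal model shows the reverse trade-off (precision 0.5, recall 0.9).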
F-Measure / F-Score / F1-Score
If two models have low precision and high recall or vice versa, it is difficult to compare these models. So,
for this purpose, we can use F-score.

F-score or F1 Score is a metric to evaluate a binary classification model on the basis of predictions that are
made for the positive class.

It is a type of single score that represents both Precision and Recall.

So, the F1 Score can be calculated as the harmonic mean of both precision and Recall, assigning
equal weight to each of them.

This score helps us to evaluate the recall and precision at the same time.

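For a fixed sum of precision and recall, the F-score is largest when the two are equal. It can be calculated using the below formula:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$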
When to use F-Score?

As the F-score makes use of both precision and recall, it should be used when both of them are
important for evaluation, for example when false negatives and false positives both carry a real cost.
If one of them (precision or recall) is more important than the other, a weighted variant, the F-beta
score, can shift the balance toward it.
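
A minimal sketch of the F1 score as a harmonic mean. The three precision/recall pairs are made up so that each has the same arithmetic mean (0.7), which makes the effect of imbalance visible.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; taken as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Each pair averages to 0.7 arithmetically, but the F1 scores differ.
for p, r in ((1.0, 0.4), (0.5, 0.9), (0.7, 0.7)):
    print(p, r, round(f1_score(p, r), 4))
```

The balanced pair scores highest, while the more lopsided the pair, the further the F1 score falls below the arithmetic mean.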
Null error rate
It defines how often our model would be incorrect if it always predicted the majority class.

As per the accuracy paradox, the best classifier for a particular application will sometimes have a
higher error rate than the null error rate.
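
A minimal sketch of the null error rate, using made-up labels in which 95 of 100 samples belong to the majority class. The null_error_rate helper is illustrative only.

```python
from collections import Counter


def null_error_rate(y_true):
    # Error rate of a baseline that always predicts the most common class.
    majority_count = Counter(y_true).most_common(1)[0][1]
    return 1 - majority_count / len(y_true)


y_true = [0] * 95 + [1] * 5
print(round(null_error_rate(y_true), 4))
```

Any useful classifier should be judged against this baseline rather than against an error rate of 50%.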
Receiver Operating Characteristics / ROC Curve
The ROC (Receiver Operating Characteristic) curve is a graph displaying a classifier's performance for
all possible thresholds. It is plotted with the True Positive Rate (TPR) on the Y-axis and the False
Positive Rate (FPR) on the X-axis.

TPR, or True Positive Rate, is a synonym for Recall and can be calculated as:

$$TPR = \frac{TP}{TP + FN}$$

FPR, or False Positive Rate, can be calculated as:

$$FPR = \frac{FP}{FP + TN}$$

The ROC score (the area under the curve) lies between 0 and 1. A score close to 1 is good, a score of
0.5 is no better than random guessing, and a score close to 0 means the model ranks the classes the
wrong way round. A ROC score of 0.78 means that a randomly chosen positive sample receives a higher
score than a randomly chosen negative sample 78% of the time; it does not mean that 78% of predictions
are correct.

To obtain the points on a ROC curve, we could evaluate a logistic regression model many times with
different classification thresholds, but this would not be very efficient. Instead, a single aggregate
measure is used, which is known as AUC.
AUC: Area Under the ROC curve
AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional
area under the entire ROC curve.

AUC summarizes the performance across all the thresholds and provides an aggregate measure. The value of
AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, whereas a model
whose predictions are 100% correct has an AUC of 1.0.
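
A minimal sketch of how ROC points and the AUC can be computed by sweeping the threshold. The labels and scores are hypothetical, the helper names are illustrative, and the code assumes that both classes are present in y_true.

```python
def roc_points(y_true, scores):
    # Sweep the threshold from high to low and record (FPR, TPR) each time.
    # Assumes y_true contains at least one positive (1) and one negative (0).
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores)
print(pts, auc(pts))
```

For these four samples the AUC is 0.75: of the four positive/negative pairs, three are ranked in the right order.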
Precision vs Recall
Precision: It helps us measure the model's ability to classify positive samples.
Recall: It helps us measure how many positive samples were correctly classified by the ML model.

Precision: While calculating the Precision of a model, we should consider both Positive and Negative samples that are classified.
Recall: While calculating the Recall of a model, we only need the positive samples; all negative samples are neglected.

Precision: When a model classifies most of the positive samples correctly but also produces many false positives, the model is said to be a high recall and low precision model.
Recall: When the samples a model classifies as Positive are mostly correct but it identifies only a few of the positive samples, the model is said to be a high precision and low recall model.

Precision: The precision of a machine learning model depends on both the negative and positive samples.
Recall: The recall of a machine learning model depends on positive samples and is independent of negative samples.

Precision: In Precision, we should consider all samples that are classified as positive, either correctly or incorrectly.
Recall: Recall cares about correctly classifying all positive samples. It does not consider whether any negative sample is classified as positive.
Gradient Descent / Steepest Descent
It is one of the most commonly used iterative optimization algorithms for training machine learning and
deep learning models by minimizing the error between actual and expected results. Gradient descent is
also used to train Neural Networks.

Optimization refers to the task of minimizing or maximizing an objective function f(x) parameterized by x.

Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the
model's parameters.

The main objective of gradient descent is to minimize the convex function using iteration of parameter
updates. Once these machine learning models are optimized, these models can be used as powerful tools
for Artificial Intelligence and various computer science applications.

It helps in finding the local minimum of a function.


The local minimum or local maximum of a function can be found using the gradient as follows:

If we move in the direction of the negative gradient (away from the gradient of the function at the
current point), we approach a local minimum of that function. This is gradient descent.

If we move in the direction of the positive gradient (towards the gradient of the function at the
current point), we approach a local maximum of that function. This is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost function using
iteration. To achieve this goal, it performs two steps iteratively:

Calculates the first-order derivative of the function to compute the gradient or slope of that function.

Moves in the direction opposite to the gradient: the parameters are shifted from the current point by
alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter in
the optimization process that decides the length of the steps.
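
The two steps can be sketched in a few lines. This is a minimal illustration on a hypothetical convex cost f(x) = (x - 3)^2, whose derivative is 2(x - 3); the function name, starting point, learning rate, and step count are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    # Repeatedly step against the gradient, scaled by the learning rate alpha.
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x


# Hypothetical convex cost f(x) = (x - 3)^2 with derivative f'(x) = 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))
```

With alpha = 0.1 each step shrinks the distance to the minimum at x = 3 by a factor of 0.8; a learning rate above 1.0 would make the iterates diverge for this particular function.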
How does Gradient Descent work?
Before starting the working principle of gradient descent, we should know some basic concepts to find out
the slope of a line from linear regression. The equation for simple linear regression is given as:

Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
The starting point is just an arbitrary point used to evaluate the performance. At this starting point,
we find the first derivative, or slope, and use a tangent line to measure the steepness of this slope.
This slope then informs the updates to the parameters (weights and bias).

The slope is steepest at the starting point; as new parameters are generated, the steepness gradually
reduces until the curve reaches its lowest point, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function or the error between expected and
actual.
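
A minimal sketch of gradient descent fitting the line Y = mX + c by minimizing the mean squared error. The data points, learning rate, and epoch count are made up for illustration, and mean squared error is an assumed choice of cost function.

```python
def fit_line(xs, ys, alpha=0.05, epochs=2000):
    # Minimise the mean squared error of y = m*x + c by gradient descent.
    m, c = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(epochs):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to m and c.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2 / n) * sum(errors)
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c


# Made-up points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(round(m, 3), round(c, 3))
```

The recovered slope and intercept approach 2 and 1; the updates shrink as the slope of the cost flattens near the point of convergence.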
Types of Gradient Descent
Based on the error in various training models, the Gradient Descent learning algorithm can be divided into

Batch gradient descent,

Stochastic gradient descent, and

Mini-batch gradient descent.


Batch Gradient Descent
Batch gradient descent (BGD) computes the error for each point in the training set and updates the
model only after evaluating all training examples. One such pass over the training set is known as a
training epoch. In simple words, it is a greedy approach where we have to sum over all examples for
each update.

Advantages of Batch gradient descent:

It produces less noise in comparison to other gradient descent.

It produces stable gradient descent convergence.

It is computationally efficient as all resources are used for all training samples.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a type of gradient descent that runs one training example per
iteration. In other words, it processes each example in the dataset in turn and updates the parameters
after every single example.

As it requires only one training example at a time, it is easier to store in allocated memory. However,
it shows some loss of computational efficiency in comparison to batch gradient descent, because its
frequent updates require more detail and speed.

Further, due to frequent updates, it is also treated as a noisy gradient. However, sometimes it can be
helpful in finding the global minimum and also escaping the local minimum.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, and it has a few
advantages over other types of gradient descent.
It is easier to fit in the allocated memory.
It is relatively faster to compute than batch gradient descent.
It is more efficient for large datasets.
Mini Batch Gradient Descent
Mini Batch gradient descent is the combination of both batch gradient descent and stochastic gradient
descent.

It divides the training dataset into small batches and then performs the updates on those batches
separately. Splitting the training dataset into smaller batches strikes a balance between the
computational efficiency of batch gradient descent and the speed of stochastic gradient descent.

Hence, we can achieve a special type of gradient descent with higher computational efficiency and a
less noisy gradient.

Advantages of Mini Batch gradient descent:

It is easier to fit in allocated memory.

It is computationally efficient.

It produces stable gradient descent convergence.
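
The three variants differ only in how many examples feed each parameter update. A minimal sketch, assuming a hypothetical minibatch_gd helper, made-up noise-free data on y = 2x + 1, and a mean squared error cost:

```python
import random


def minibatch_gd(xs, ys, batch_size, alpha=0.02, epochs=2000, seed=0):
    # Fit y = m*x + c. batch_size selects the variant of gradient descent:
    #   len(xs) -> batch GD, 1 -> stochastic GD, anything between -> mini-batch GD.
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    m, c = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errors = [(m * x + c) - y for x, y in batch]
            grad_m = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_c = 2 * sum(errors) / len(batch)
            m -= alpha * grad_m  # one parameter update per batch
            c -= alpha * grad_c
    return m, c


xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 11]  # made-up, exactly y = 2x + 1
for name, size in (("batch", 6), ("stochastic", 1), ("mini-batch", 2)):
    m, c = minibatch_gd(xs, ys, batch_size=size)
    print(name, round(m, 3), round(c, 3))
```

Per epoch, batch gradient descent makes one update, stochastic makes six, and mini-batch with size 2 makes three. On noisy real data the stochastic updates would fluctuate more than the batch updates.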
