Unit 4 Model Evaluation

Model Evaluation

Performance Metrics - Baseline Models - Hyperparameters - Manual Hyperparameter Tuning - Automatic Hyperparameter Tuning - Grid Search - Random Search - Debugging Strategies

What do you mean by Model Evaluation?

Model evaluation is a critical process in machine learning that is used to assess the performance of a trained model. It involves comparing the predicted values from the model to the actual values in the test dataset to determine the level of agreement between the two.

Evaluation metrics are quantitative measures used to assess the performance and
effectiveness of a statistical or machine learning model. These metrics provide
insights into how well the model is performing and help in comparing different
models or algorithms.

When evaluating a machine learning model, it is crucial to assess its predictive
ability, generalization capability, and overall quality. Evaluation metrics provide
objective criteria to measure these aspects. The choice of evaluation metrics
depends on the specific problem domain, the type of data, and the desired
outcome.
Fig 1 Model Evaluation Metrics

Why are evaluation metrics important?

 To assess the performance of a model: Evaluation metrics provide a quantitative measure of how well a model performs on a given task. This is essential for understanding a model’s strengths and weaknesses and deciding whether to deploy it to production.

 To compare different models: Evaluation metrics can be used to compare machine learning models trained on the same dataset to solve the same problem. For example, if two models have similar accuracy scores but one has a higher precision score, the model with the higher precision will be preferred.


 To tune hyperparameters: Evaluation metrics are often used to tune the hyperparameters of a machine learning model. Hyperparameters control a model’s training process, such as the learning rate and the number of epochs. By adjusting the hyperparameters, data scientists can improve the performance of their models.

 To monitor the performance of a model over time: Evaluation metrics can be used to monitor the performance of a machine learning model over time. This is important because model performance can degrade over time due to changes in the data distribution and concept drift. By monitoring a model’s performance, data scientists can identify any problems early and take corrective action.

 To identify overfitting: Overfitting occurs when a model learns the training data too well and cannot generalize to new data. Evaluation metrics can identify overfitting by comparing the model’s performance on the training data to its performance on a held-out test set.

How to Choose the Best Evaluation Metrics?

To evaluate the machine learning models, you can follow these steps:

 Choose the right evaluation metric: The choice of the evaluation metric
will depend on the specific machine learning task and the desired outcome.

 For a classification model, you can choose accuracy, precision, recall, and F1 score as evaluation metrics.

 For a Regression model, you may use mean absolute error (MAE),
mean squared error (MSE), and root mean squared error (RMSE) as
evaluation metrics.
 Split data into training and test sets: The training set is used to train the model, and the test set is used to evaluate the model’s performance on unseen data. This ensures that the model is not evaluated on the same data it was trained on.

 To reduce the risk of overfitting, use a cross-validation technique.

 Train and evaluate multiple models: Try different machine learning
algorithms and hyperparameters to see which models perform the best on
the training data.

 Select the best model: Once you have evaluated all of your models, you
can select the best model based on the evaluation metrics.

 For example, if the model is going to be used to make high-stakes decisions, then it is important to select a model with high accuracy and precision.

Parameters for Evaluation Metrics

 True Positive (TP): Predicted positive, and it’s true.
 True Negative (TN): Predicted negative, and it’s true.
 False Positive (FP): Predicted positive, and it’s false.
 False Negative (FN): Predicted negative, and it’s false.

The parameters TP, TN, FP, and FN are required for the construction of the confusion matrix.
Accuracy
 Accuracy is one of the most commonly used evaluation metrics in classification problems. It measures the proportion of correct predictions out of the total predictions made. It is defined as:
 Accuracy = Number of Correct Predictions / Total Number of Predictions
 Mathematically, it is defined as:
 Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision
 Precision evaluates the accuracy of the positive predictions made by the classifier. In simple terms, precision answers the question: “Of all the instances that the model predicted as positive, how many were actually positive?”
 Mathematically, it is defined as:
 Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
Recall
 Recall is also known as sensitivity or true positive rate. It is the ratio of the number of true positive predictions to the total number of actual positive instances in the dataset. Recall measures the ability of a model to identify all relevant instances.
 Mathematically, recall is defined as:
 Recall = True Positive (TP) / (True Positive (TP) + False Negative (FN))
F1-Score
 The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall. It is especially useful when the class distribution is imbalanced.
 Mathematically, it is given by:
 F1 Score = 2 x [(Precision x Recall) / (Precision + Recall)]
 The F1-score ranges between 0 and 1.
1: indicates perfect precision and recall.
0: indicates that either precision or recall is zero.
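
As a quick illustration, here is a minimal Python sketch (assuming scikit-learn is installed and using made-up labels) that computes accuracy, precision, recall, and F1 score for a binary classifier:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual class labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # labels predicted by the model

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall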

AUC-ROC Curve
 AUC-ROC stands for the Area Under the Receiver Operating Characteristic Curve. The ROC curve is a graphical representation of classification model performance at different thresholds. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR), whereas AUC represents the area under the ROC curve. It provides a single scalar value that summarizes the overall performance of a classifier across all possible threshold values.
The formulas for TPR and FPR are:
 True Positive Rate (TPR/Sensitivity/Recall) = True Positive / (True Positive + False Negative)
 False Positive Rate (FPR) = False Positive / (False Positive + True Negative)
 A typical AUC-ROC curve plots the TPR (y-axis) against the FPR (x-axis) as the classification threshold varies.
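
The AUC and the points on the ROC curve can be computed directly from predicted scores. Below is a minimal sketch (assuming scikit-learn; the labels and scores are made up for illustration):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                       # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]     # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)       # one (FPR, TPR) point per threshold
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))           # area under the ROC curve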

Confusion Matrix

A confusion matrix is an N x N matrix, where N is the number of predicted classes. For a binary classification problem, N = 2, and hence we get a 2 x 2 matrix. It is a performance measurement for machine learning classification problems where the output can be two or more classes. A confusion matrix is a table with 4 different combinations of predicted and actual values. It is extremely useful for measuring precision, recall, specificity, accuracy, and, most importantly, the AUC-ROC curve.

The different elements in the confusion matrix are:

 True Positive: You predicted positive, and it’s true.
 True Negative: You predicted negative, and it’s true.
 False Positive (Type 1 Error): You predicted positive, and it’s false.
 False Negative (Type 2 Error): You predicted negative, and it’s false.
 Accuracy: the proportion of the total number of predictions that were correct.
 Positive Predictive Value or Precision: the proportion of predicted positive cases that were correctly identified.
 Negative Predictive Value: the proportion of predicted negative cases that were correctly identified.
 Sensitivity or Recall: the proportion of actual positive cases which are correctly identified.
 Specificity: the proportion of actual negative cases which are correctly identified.
 Rate: a measure derived from the confusion matrix; it has 4 types: TPR, FPR, TNR, and FNR.

For the example problem at hand, the accuracy comes out to be 88%. The Positive Predictive Value is high, but the Negative Predictive Value is quite low. The same holds for Sensitivity and Specificity. This is primarily driven by the threshold value we have chosen. If we decrease our threshold value, the two pairs of starkly different numbers will come closer.
In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned with minimizing wrong positive diagnoses, and hence will care more about high Specificity. On the other hand, an attrition model will be more concerned with Sensitivity. Confusion matrices are generally used only with class output models.

F1 Score

In the last section, we discussed precision and recall for classification problems and also highlighted the importance of choosing a precision/recall basis for our use case. What if, for a use case, we are trying to get the best precision and recall at the same time? The F1-Score is the harmonic mean of the precision and recall values for a classification problem. The formula for the F1-Score is:

F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

Now, an obvious question that comes to mind is why we take a harmonic mean and not an arithmetic mean. This is because the harmonic mean punishes extreme values more. Let us understand this with an example. We have a binary classification model with the following results:

A confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Confusion matrix with 2 class labels:

                    Predicted Positive   Predicted Negative
Actual Positive     TP = 2               FN = 1
Actual Negative     FP = 0               TN = 7

Precision

It tells what proportion of predicted positive classes was actually positive.
Formula: Precision = TP / (TP + FP)
In the given example, the Precision value is: 2 / (2 + 0) = 1

The Result – Our model has a precision of 1, so when it predicts a positive class, it is correct 100% of the time.
Recall

It tells what proportion of actual positive classes was predicted correctly.
Formula: Recall = TP / (TP + FN)
In the given example, the Recall value is: 2 / (2 + 1) = 0.667

The Result – Our model has a recall of 0.667, so it identified 66.7% of positive classes correctly.
False Positive Rate

It is also known as the Type I error rate. It tells what proportion of negative classes was wrongly classified.
Formula – False Positive Rate = FP / (FP + TN)
In the given example, the False Positive Rate is: 0 / (0 + 7) = 0

The Result – Our model has a zero False Positive Rate, so it wrongly classified 0% of negative classes.
False Negative Rate

It is also known as the Type II error rate. It tells what proportion of positive classes was wrongly classified.
Formula – False Negative Rate = FN / (FN + TP)
In the given example, the False Negative Rate is: 1 / (1 + 2) = 0.333

The Result – Our model has a 0.333 False Negative Rate, so it wrongly classified 33.3% of positive classes.
Accuracy

It tells what proportion of negative and positive classes is correctly classified.
Formula – Accuracy = (TP + TN) / (P + N), where P and N are the total positive and negative classes.
In the given example, Accuracy is (2 + 7) / (3 + 7) = 0.90

The Result – Our model has an accuracy of 0.90, so 90% of the time its prediction is correct.
Sensitivity

It tells what proportion of actual positive classes was predicted correctly. This is the same as the Recall value.
Formula: Sensitivity = TP / (TP + FN)
In the given example, Sensitivity is: 2 / (2 + 1) = 0.667

The Result – Our model has a sensitivity of 0.667, so it identified 66.7% of positive classes correctly.
Specificity

It tells what proportion of actual negative classes was predicted correctly.
Formula: Specificity = TN / (TN + FP)
In the given example, Specificity is: 7 / (7 + 0) = 1

The Result – Our model has a Specificity of 1, so it identified 100% of negative classes correctly.
F1 Score

It tells whether precision and recall are balanced – neither too high nor too low. Its best value is 1, when both precision and recall are high, and its worst value is 0, when either of them is very low.

Formula: F1 Score = 2 x ((Precision x Recall) / (Precision + Recall))

In the given example, the F1 Score is: 2 x (1 x 0.667) / (1 + 0.667) = 0.800
The Result – Our model has an F1 Score of 0.80, which is close to the best value of 1, so precision and recall are in balance.
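
The worked example above (TP = 2, FN = 1, FP = 0, TN = 7) can be reproduced with a short sketch, assuming scikit-learn and hand-built label vectors chosen only to match those counts:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 3 actual positives, 7 actual negatives
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 2 positives found correctly, 1 missed

print(confusion_matrix(y_true, y_pred))                 # returns [[TN, FP], [FN, TP]] = [[7, 0], [1, 2]]
print("Precision:", precision_score(y_true, y_pred))    # 1.0
print("Recall   :", recall_score(y_true, y_pred))       # 0.667
print("Accuracy :", accuracy_score(y_true, y_pred))     # 0.90
print("F1 Score :", f1_score(y_true, y_pred))           # 0.80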

Baseline Models:

1. A baseline model is mainly used to understand the data.

2. It is mainly used for performing iterations faster.

3. It provides a benchmark against which the metrics of later iterations can be compared.
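
One common way to obtain such a benchmark is a dummy model that ignores the input features. The sketch below is a minimal example, assuming scikit-learn and its built-in breast-cancer dataset; it fits a majority-class baseline whose accuracy any real model should beat:

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")   # always predicts the majority class
baseline.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))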

The different hyperparameters are:

1. Learning rate

2. Number of layers

3. Number of units

4. Activation function

Hyperparameter tuning consists of finding a set of optimal hyperparameter values for a learning algorithm while applying this optimized algorithm to a data set. That combination of hyperparameters maximizes the model’s performance, minimizing a predefined loss function to produce better results with fewer errors. Note that the learning algorithm optimizes the loss based on the input data and tries to find an optimal solution within the given setting; the hyperparameters, however, describe this setting exactly.
For instance, if we work on natural language processing (NLP) models, we probably use neural networks, support-vector machines (SVMs), Bayesian networks, or Extreme Gradient Boosting (XGB), all of which have parameters that need tuning.

Hyperparameter types

Some important hyperparameters that require tuning in neural networks are:

 Number of hidden layers: It’s a trade-off between keeping our neural network as simple as possible (fast and generalized) and classifying our input data correctly. We can start with values of four to six and check our data’s prediction accuracy when we increase or decrease this hyperparameter.
 Number of nodes/neurons per layer: More isn't always better when
determining how many neurons to use per layer. Increasing neuron count
can help, up to a point. But layers that are too wide may memorize the
training dataset, causing the network to be less accurate on new data.
 Learning rate: Model parameters are adjusted iteratively — and the
learning rate controls the size of the adjustment at each step. The lower the
learning rate, the lower the changes to parameter estimates are. This means
that it takes a longer time (and more data) to fit the model — but it also
means that it is more likely that we actually find the minimum loss.
 Momentum: Momentum helps us avoid falling into local minima by
resisting rapid changes to parameter values. It encourages parameters to
keep changing in the direction they were already changing, which helps
prevent zig-zagging on every iteration. Aim to start with low momentum
values and adjust upward as needed.
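
To make these hyperparameters concrete, the sketch below (assuming scikit-learn; the chosen values are arbitrary starting points, not recommendations) shows where each one appears when configuring a simple feed-forward network:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # two hidden layers with 64 and 32 neurons
    activation="relu",             # activation function for the hidden units
    solver="sgd",                  # stochastic gradient descent
    learning_rate_init=0.01,       # learning rate: size of each parameter update
    momentum=0.9,                  # momentum: resists rapid changes in update direction
    max_iter=200,                  # upper bound on training epochs
)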
Methods for tuning hyperparameters

Now that we understand what hyperparameters are and the importance of tuning
them, we need to know how to choose their optimal values. We can find these
optimal hyperparameter values using manual or automated methods.

When tuning hyperparameters manually, we typically start using the default recommended values or rules of thumb, then search through a range of values using trial and error. But manual tuning is a tedious and time-consuming approach. It isn’t practical when there are many hyperparameters with a wide range.

Automated hyperparameter tuning methods use an algorithm to search for the optimal values. Some of today’s most popular automated methods are grid search, random search, and Bayesian optimization. Let’s explore these methods in detail.

Grid search

Grid search is a sort of “brute force” hyperparameter tuning method. We create a grid of possible discrete hyperparameter values, then fit the model with every possible combination. We record the model performance for each set and then select the combination that has produced the best performance.

Grid search is an exhaustive algorithm that can find the best combination of
hyperparameters. However, the drawback is that it’s slow. Fitting the model with
every possible combination usually requires a high computation capacity and
significant time, which may not be available.
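
A minimal sketch of grid search, assuming scikit-learn and using an SVM on the built-in iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C": [0.1, 1, 10],              # every combination of these discrete values is tried
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.01, 0.1],
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print("Best hyperparameters:", grid.best_params_)
print("Best CV accuracy    :", grid.best_score_)

Here the grid has 3 x 2 x 3 = 18 combinations, and each one is fitted 5 times (once per cross-validation fold), which illustrates why grid search becomes expensive as the grid grows.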

Random search

The random search method (as its name implies) chooses values randomly rather
than using a predefined set of values like the grid search method.
Random search tries a random combination of hyperparameters in each iteration
and records the model performance. After several iterations, it returns the mix
that produced the best result.


Random search is appropriate when we have several hyperparameters with relatively large search domains. We can make discrete ranges (for instance, [5-100] in steps of 5) and still get a reasonably good set of combinations.

The benefit is that random search typically requires less time than grid search to
return a comparable result. It also ensures we don't end up with a model that's
biased toward value sets arbitrarily chosen by users. Its drawback is that the result
may not be the best possible hyperparameter combination.
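
A minimal sketch of random search, assuming scikit-learn and SciPy; the distributions and the number of iterations are illustrative choices:

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_distributions = {
    "C": loguniform(1e-2, 1e2),        # continuous range sampled at random
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy    :", search.best_score_)

Unlike grid search, only n_iter = 20 combinations are fitted regardless of how large the search space is, which is what makes random search cheaper over wide hyperparameter ranges.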
Steps for tuning hyperparameters:

1. Identify the hyperparameters.

2. Use a search method such as grid search, random search, Bayesian optimisation, genetic algorithms (GA), or particle swarm optimisation (PSO).

3. Cross-validate the data.

4. Split the data into different folds.

5. Evaluate each machine learning model using an accuracy metric.

6. Select the hyperparameter combination that gives the best accuracy.

Manual Hyperparameter and Automated Hyperparameter Tuning:

Choosing the optimal set of hyperparameters requires an in-depth understanding of the nature and scale of each hyperparameter. Hyperparameter tuning can be done either manually or by automated methods. Before this, a robust evaluation criterion has to be determined to evaluate the model performance for each set of hyperparameters. A good example of such a criterion is k-fold cross-validation.
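
As a minimal sketch of such an evaluation criterion (assuming scikit-learn and its iris dataset), 5-fold cross-validation can score a single hyperparameter setting before any tuning method compares several of them:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(C=1.0, max_iter=1000)   # C is the hyperparameter setting being evaluated

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")   # one score per fold
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())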

There are two main types of hyperparameter tuning:

 Manual hyperparameter tuning: Manual hyperparameter tuning involves experimenting with different sets of hyperparameters manually using the trial and error method. The results of each trial are tracked and used as feedback to obtain a combination of hyperparameters that yields the highest model performance.
 Automated hyperparameter tuning: In automated hyperparameter
tuning, the optimal set of hyperparameters is found by using an algorithm.
An automatic hyperparameter tuning technique involves methods in which
the user defines a set of hyperparameter combinations or a range for each
hyperparameter, and the tuning algorithm runs the trials to find the optimal
set of hyperparameters for the model.

The different steps in model evaluation are:

1. Performance metrics

2. Default baseline models

3. Determining whether to gather more data

4. Selecting hyperparameters

5. Debugging strategies

6. Example: multi-digit number recognition

Debugging Strategies:

The different steps in model evaluation are:

1. Performance metrics of the models

2. Identifying the default baselines

3. Selecting the hyperparameters

4. Debugging strategies
Debugging strategies:

We need to get around two difficulties:

1. Whether poor performance is intrinsic to the task or caused by a bug

2. Whether different parts of the system are compensating for each other

To do so, we can either:

3. Design a test case that is so simple that its behavior can be predicted, or

4. Design a test that exercises one part of the neural net implementation in isolation

Important Debugging Tests

1. Visualize the model in action

2. Visualize the worst mistakes

3. Reasoning about software using train and test error

4. Fit a tiny dataset

5. Compare back-propagated derivatives to numerical derivatives

6. Monitor histograms of activations and gradients
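
Test 5 above can be illustrated with a small, self-contained sketch: compare an analytic gradient with a numerical (finite-difference) estimate for a tiny squared-error loss. This is pure NumPy with made-up data and is not tied to any particular framework:

import numpy as np

def loss(w, X, y):
    return 0.5 * np.sum((X @ w - y) ** 2)             # squared-error loss

def analytic_grad(w, X, y):
    return X.T @ (X @ w - y)                          # hand-derived gradient

def numerical_grad(w, X, y, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):                           # central finite difference per weight
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)

# A large difference here would indicate a bug in the gradient code.
print(np.max(np.abs(analytic_grad(w, X, y) - numerical_grad(w, X, y))))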


Debugging Strategies :

Steps involved in Debugging Strategies :

1. Identify the Error: Identifying an error incorrectly may result in wasted time. It is very common that production errors reported by users are hard to interpret, and sometimes the information we receive is misleading. Thus, it is mandatory to identify the actual error.

2. Find the Error Location: Once the error is correctly discovered, you will
be required to thoroughly review the code repeatedly to locate the position
of the error. In general, this step focuses on finding the error rather than
perceiving it.

3. Analyze the Error: The third step comprises error analysis, a bottom-up approach that starts from the location of the error and then analyzes the surrounding code. This step makes it easier to comprehend the errors. Error analysis has two significant goals: re-evaluating the errors to find existing bugs, and estimating the risk of collateral damage that a fix may introduce.

4. Prove the Analysis: After analyzing the primary bugs, it is necessary to look for some extra errors that may show up in the application. By incorporating a test framework, the fourth step is used to write automated tests for such areas.

5. Cover Lateral Damage: The fifth phase is about accumulating all of the unit tests for the code that requires modification; when you run these unit tests, they must all pass.
6. Fix & Validate: The last stage is the fix and validation that emphasizes
fixing the bugs followed by running all the test scripts to check whether
they pass.

Debugging Strategies

o For a better understanding of a system, it is necessary to study the system in depth. This makes it easier for the debugger to build distinct mental models of the system that needs to be debugged.

o Backward analysis traces the program backwards from the location where the failure message occurred to determine the defect region. It is necessary to identify the area of the defect to understand its cause.

o In forward analysis, the program is tracked in the forward direction using breakpoints or print statements placed at different points in the program. It emphasizes those regions where wrong outputs are obtained.

o To check and fix similar kinds of problems, it is recommended to utilize past experience. The success rate of this approach is directly proportional to the proficiency of the debugger.

Problem 1:
Suppose 1000 people were tested for COVID-19. Of these, 427 had positive findings and 573 had negative findings. Out of the 427 people who tested positive, 369 actually had COVID-19. Out of the 573 people who tested negative, 558 did not have COVID-19.
True Positive: 369
False Negative: 15
True Negative: 558
False Positive: 58

Sensitivity = True Positives (A) / (True Positives (A) + False Negatives (C))
            = 369 / (369 + 15) = 369 / 384 = 0.961 (96.1%)

Specificity = True Negatives (D) / (True Negatives (D) + False Positives (B))
            = 558 / (558 + 58)
            = 558 / 616
            = 0.906 (90.6%)

Problem 2:

True Positive (TP) = 10
True Negative (TN) = 12
False Positive (FP) = 1
False Negative (FN) = 2
Accuracy:
Accuracy is one metric for evaluating classification models. Formally, accuracy is defined as the number of correct predictions divided by the total number of predictions.

Precision:
Precision is a measure of how accurate the model's positive predictions are.

Recall:
Recall is the true positive rate.

F1 Score:
The F1 score is a machine learning evaluation metric that measures a model's accuracy by combining its precision and recall scores, whereas the accuracy metric computes how many times the model made a correct prediction across the entire dataset.
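
Using the counts given for Problem 2, the metrics work out as in the following sketch (plain Python arithmetic, no libraries assumed):

TP, TN, FP, FN = 10, 12, 1, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)               # 22 / 25 = 0.88
precision = TP / (TP + FP)                               # 10 / 11 = 0.909
recall = TP / (TP + FN)                                  # 10 / 12 = 0.833
f1 = 2 * precision * recall / (precision + recall)       # approximately 0.870

print(accuracy, precision, recall, f1)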
