Unit 4 Model Evaluation
Evaluation metrics are quantitative measures used to assess the performance and
effectiveness of a statistical or machine learning model. These metrics provide
insights into how well the model is performing and help in comparing different
models or algorithms.
For example, if two models have similar accuracy scores but one has a much higher recall, the higher-recall model may be preferable when missing positive cases is costly.
To evaluate machine learning models, you can follow these steps:
Choose the right evaluation metric: The choice of the evaluation metric
will depend on the specific machine learning task and the desired outcome.
For a Regression model, you may use mean absolute error (MAE),
mean squared error (MSE), and root mean squared error (RMSE) as
evaluation metrics.
Split data into training and test sets: The training set is used to train the
model, and the test set is used to evaluate the model’s performance on
unseen data. This ensures the model is judged on data it has not seen
during training, so we can tell whether it generalizes rather than merely
memorizing the training data.
Train and evaluate multiple models: Try different machine learning
algorithms and hyperparameters to see which models perform best on
the held-out data.
Select the best model: Once you have evaluated all of your models, you
can select the best model based on the evaluation metrics.
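The workflow above can be illustrated with a minimal sketch using scikit-learn; the dataset, the candidate models, and the choice of F1 as the metric are assumptions made only for this example.

from sklearn.datasets import load_breast_cancer   # example dataset (assumption)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Split data into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate multiple candidate models with the chosen metric (F1 here)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

# Select the best model based on the evaluation metric
best = max(scores, key=scores.get)
print(scores, "best model:", best)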
Precision
Precision evaluates the accuracy of the positive predictions made by the
classifier. In simple terms, precision answers the question: “Of all the
instances that the model predicted as positive, how many were actually
positive”.
Mathematically it is defined as:
Precision = True Positives (TP) / [True Positives (TP) + False Positives (FP)]
Recall
The recall is also known as sensitivity or true positive rate. It is the ratio of
the number of true positive predictions to the total number of actual
positive instances in the dataset. Recall measures the ability of a model to
identify all relevant instances.
Mathematically, recall is defined as:
Recall = True Positives (TP) / [True Positives (TP) + False Negatives (FN)]
F1-Score
F1 score is the harmonic mean of precision and recall. It provides a single
metric that balances the trade-off between precision and recall. It is
especially useful when the class distribution is imbalanced.
Mathematically, it is given by:
F1 Score = 2 x [(Precision x Recall)/ (Precision + Recall)]
The F1-score ranges between 0 and 1.
1: indicates perfect precision and recall
0: indicates that either precision or recall is zero.
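As a quick illustration, all three metrics can be computed with scikit-learn from a vector of true labels and a vector of predictions; the labels below are made up for the example.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (example values)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (example values)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two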
AUC-ROC Curve
AUC-ROC stands for the Area Under the Receiver Operating
Characteristic Curve. The ROC curve is a graphical representation of
classification model performance at different thresholds. It is created by
plotting the True Positive Rate (TPR) against the False Positive Rate
(FPR). The AUC is the area under the ROC curve; it provides a single
scalar value that summarizes the overall performance of a classifier
across all possible threshold values.
The formulas for TPR and FPR are:
True Positive Rate (TPR / Sensitivity / Recall) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = False Positives / (False Positives + True Negatives)
[Figure: a typical ROC curve, plotting TPR on the y-axis against FPR on the x-axis; the AUC is the area under this curve.]
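A minimal sketch of how the ROC curve points and the AUC can be obtained with scikit-learn, assuming a classifier that outputs probability scores; the synthetic dataset is used only for illustration.

from sklearn.datasets import make_classification      # synthetic data (assumption)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve
auc = roc_auc_score(y_test, scores)               # single scalar summary
print("AUC:", auc)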
Confusion Matrix
A confusion matrix tabulates a classifier’s predictions against the actual classes,
giving the counts of true positives (TP), false positives (FP), true negatives (TN),
and false negatives (FN); all of the metrics above can be derived from these four
counts.
Consider an example in which the overall accuracy comes out to be 88%: the
Positive Predictive Value is high, but the Negative Predictive Value is quite low,
and the same holds for Sensitivity and Specificity. This is primarily driven by the
threshold value we have chosen. If we decrease the threshold, the two pairs of
starkly different numbers will come closer.
In general, we are concerned with one of the above-defined metrics. For instance,
a pharmaceutical company will be more concerned with keeping wrong positive
diagnoses to a minimum, and hence with high Specificity. On the other hand, an
attrition model will be more concerned with Sensitivity.
Confusion matrices are generally used only with class output models.
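A sketch of building a confusion matrix with scikit-learn and reading the four cells out of it; the labels are invented for illustration.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (example values)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # predicted classes (example values)

# For binary labels [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)    # true positive rate / recall
specificity = tn / (tn + fp)    # true negative rate
print(tp, fp, fn, tn, sensitivity, specificity)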
F1 Score
In the last section, we discussed precision and recall for classification problems
and also highlighted the importance of choosing a precision/recall basis for our
use case. What if, for a use case, we are trying to get the best precision and recall
at the same time? F1-Score is the harmonic mean of precision and recall values
for a classification problem. The formula for the F1-Score is as follows:
F1 Score = 2 x [(Precision x Recall) / (Precision + Recall)]
Now, an obvious question that comes to mind is why we take a harmonic
mean and not an arithmetic mean. This is because the harmonic mean punishes
extreme values more. Let us understand this with an example. We have a binary
classification model with the following results:

                         Predicted Positive    Predicted Negative
Actual Positive          TP = 2                FN = 1
Actual Negative          FP = 0                TN = 7
Precision
Formula: Precision = TP / (TP + FP)
In the given example, Precision is: 2 / (2 + 0) = 1
The Result – Our model has a precision of 1, so it means that 100% of the
instances it predicted as positive were actually positive.
False Positive Rate
Formula: FPR = FP / (FP + TN)
In the given example, FPR is: 0 / (0 + 7) = 0
The Result – Our model has a zero False Positive Rate, so it means that it identified
0% of negative classes wrongly.
False Negative Rate
Formula: FNR = FN / (FN + TP)
In the given example, FNR is: 1 / (1 + 2) = 0.33
The Result – Our model has a False Negative Rate of 0.33, so it means that it
missed 33% of the actual positive classes.
Accuracy
It tells what proportion of negative and positive classes are correctly classified.
Formula – Accuracy = (TP + TN) / (P+N), where P and N are total positive and
negative classes.
In the given example, Accuracy is (2+7)/(3+7)= .90
The Result – Our model has a .90 accuracy value so it means that 90% of the
time, the prediction is correct.
Sensitivity
It tells what proportion of actual positive classes was predicted correctly. This
is the same as the Recall value.
Formula: Sensitivity = TP / (TP + FN)
In the given example, Sensitivity is: 2 / (2 + 1) = 0.667
The Result – Our model has a sensitivity of .667 so it means that it identified
66.7% of positive classes correctly.
Specificity
Formula: Specificity = TN / (TN + FP)
In the given example, Specificity is: 7 / (7 + 0) = 1
The Result – Our model has a Specificity of 1, so it means that it identified 100%
of negative classes correctly.
F1 Score
It tells whether precision and recall are balanced, neither too high nor too low.
Its best value is 1, when both precision and recall are high, and its worst value is
0, when either of them is very low.
In the given example, F1 Score is: 2 x [(1 x 0.667) / (1 + 0.667)] = 0.8
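These hand calculations can be reproduced in a few lines; the sketch below recomputes every metric of the worked example directly from the four confusion-matrix counts and also contrasts the harmonic mean with the arithmetic mean.

# Confusion-matrix counts from the worked example above
TP, FN, FP, TN = 2, 1, 0, 7

precision   = TP / (TP + FP)                    # 1.0
recall      = TP / (TP + FN)                    # 0.667 (also sensitivity / TPR)
fpr         = FP / (FP + TN)                    # 0.0
fnr         = FN / (FN + TP)                    # 0.33
accuracy    = (TP + TN) / (TP + FN + FP + TN)   # 0.90
specificity = TN / (TN + FP)                    # 1.0

f1_harmonic   = 2 * precision * recall / (precision + recall)  # 0.80
f1_arithmetic = (precision + recall) / 2                        # 0.83 -- punishes the gap less

print(precision, recall, fpr, fnr, accuracy, specificity, f1_harmonic, f1_arithmetic)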
Baseline Models
A useful first step is to establish a simple baseline model; more complex models
are then kept only if they beat the baseline on the chosen evaluation metric.
Hyperparameter types
Typical hyperparameters of a neural network include:
1. Learning rate
2. Number of layers
3. Number of units per layer
4. Activation functions
Now that we understand what hyperparameters are and the importance of tuning
them, we need to know how to choose their optimal values. We can find these
optimal hyperparameter values using manual or automated methods.
Grid search
In grid search, we first specify a grid of possible discrete hyperparameter values,
then fit the model with every possible combination.
Grid search is an exhaustive algorithm that can find the best combination of
hyperparameters. However, the drawback is that it’s slow. Fitting the model with
every possible combination usually requires a high computation capacity and
significant time, which may not be available.
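A minimal sketch of grid search using scikit-learn's GridSearchCV; the estimator, the parameter grid, and the scoring metric below are assumptions chosen for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Grid of discrete hyperparameter values; every combination is tried
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)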
Random search
The random search method (as its name implies) chooses values randomly rather
than using a predefined set of values like the grid search method.
Random search tries a random combination of hyperparameters in each iteration
and records the model performance. After several iterations, it returns the mix
that produced the best result.
The benefit is that random search typically requires less time than grid search to
return a comparable result. It also ensures we don't end up with a model that's
biased toward value sets arbitrarily chosen by users. Its drawback is that the result
may not be the best possible hyperparameter combination.
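Random search can be sketched the same way with scikit-learn's RandomizedSearchCV, which samples a fixed number of random combinations instead of trying them all; the sampling distributions and n_iter value below are assumptions.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Distributions to sample hyperparameter values from
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 10),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=10,
                            scoring="f1", cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)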
Steps for Tuning Hyperparameters:
1. Performance metrics
4. Selecting hyperparameters
5. Debugging strategies
Debugging Strategies
When a deep learning system misbehaves, either design a test case so simple that
its correct behavior can be predicted, or design a test that exercises one part of
the neural net implementation in isolation.
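As a sketch of the second option, the snippet below tests one component in isolation: it compares the analytic derivative of a hypothetical sigmoid activation against a numerical finite-difference estimate.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # analytic derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Numerical gradient check on a handful of points
x = np.linspace(-3, 3, 7)
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_grad(x)

# The two estimates should agree closely if the implementation is correct
assert np.allclose(numeric, analytic, atol=1e-7), "gradient check failed"
print("sigmoid gradient check passed")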
2. Find the Error Location: Once the error is correctly discovered, you will
be required to thoroughly review the code repeatedly to locate the position
of the error. In general, this step focuses on finding the error rather than
perceiving it.
3. Analyze the Error: The third step comprises error analysis, a bottom-up
approach that starts from the location of the error and works outward through
the code. This step makes it easier to comprehend the error. Error analysis
has two main goals: re-examining the errors to find any further existing bugs,
and anticipating the collateral damage that a fix might introduce.
5. Cover Lateral Damage: The fifth phase is about gathering all of the
unit tests for the code that requires modification; when you run these
unit tests, they must all pass.
6. Fix & Validate: The last stage is the fix and validation that emphasizes
fixing the bugs followed by running all the test scripts to check whether
they pass.
Debugging Strategies
o Backward analysis traces the program backwards from the point where
the failure message occurred in order to determine the region of the defect.
Locating the defective region is necessary to understand the reason for the
defect.
o Forward analysis tracks the program in the forward direction using
breakpoints or print statements placed at different points in the program,
and focuses on the regions where wrong outputs are produced.
Problem 1:
Suppose 1000 people had a COVID-19 test. Out of these, 427 had positive
findings and 573 had negative findings. Of the 427 people who tested positive,
369 actually had COVID-19. Of the 573 people who tested negative, only 558
did not have COVID-19.
True Positives: 369
False Negatives: 15
True Negatives: 558
False Positives: 58
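From these counts, the metrics defined earlier follow directly; a small sketch of the arithmetic:

# Counts from Problem 1
TP, FP, TN, FN = 369, 58, 558, 15

precision   = TP / (TP + FP)                    # 369 / 427  = 0.864
recall      = TP / (TP + FN)                    # 369 / 384  = 0.961
specificity = TN / (TN + FP)                    # 558 / 616  = 0.906
accuracy    = (TP + TN) / (TP + FP + TN + FN)   # 927 / 1000 = 0.927
f1 = 2 * precision * recall / (precision + recall)   # about 0.910

print(precision, recall, specificity, accuracy, f1)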
Problem 2:
Precision:
Precision measures the accuracy of the positive predictions: of all the instances
predicted as positive, how many were actually positive.
Recall:
Recall is the true positive rate: of all the actual positive instances, how many
were identified correctly.
F1 Score:
The F1 score is a machine learning evaluation metric that combines the precision
and recall scores of a model into a single value, whereas the accuracy metric
computes how often the model made a correct prediction across the entire
dataset.
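In practice, scikit-learn's classification_report prints precision, recall, F1-score, and accuracy together, which is convenient for exercises like the ones above; the labels below are example values only.

from sklearn.metrics import classification_report

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (example values)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted classes (example values)

# Per-class precision, recall, and F1-score, plus overall accuracy
print(classification_report(y_true, y_pred))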