Unit - I Chap-4 Model Evaluation and Development
Model Evaluation
Model evaluation is the process of understanding a model's performance, as well as its strengths and
weaknesses, using different evaluation metrics.
It helps to find the best model that represents our data
It is an important part of the model development process.
Model evaluation is important to assess the efficiency of a model.
A model can be evaluated through:
1. Hold-Out
2. Cross-Validation
Model Evaluation Methods
1. Hold-Out
The dataset is divided into a training set and a validation set.
The model is trained on the training set,
and its performance is validated on the validation set.
A large dataset is randomly divided into three subsets:
♦ Training data is a subset of the dataset used to build predictive models.
♦ Validation data is a subset of the dataset used to assess the performance of models and to
select the best-performing model.
♦ Test data is used to assess the likely future performance of a model. If a model fits the
training set much better than it fits the test set, overfitting is probably the cause.
Example: If there are 20 data items, 12 are used in the training set and the remaining 8 in the test set.
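A minimal hold-out sketch in Python, assuming scikit-learn is available; the breast-cancer dataset and the 60/40 split are placeholders mirroring the 12-of-20 example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset (a stand-in for the 20-item example above).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 40% of the data for testing (the 12 train / 8 test proportions above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Train on the training set, validate on the held-out set.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```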
Model Evaluation Methods
2. Cross-Validation
Cross-validation is a technique used to evaluate the performance of a model on unseen data.
Several types of cross-validation techniques exist: k-fold cross-validation, leave-one-out cross-
validation, and stratified cross-validation.
When only a limited amount of data is available, to achieve an unbiased estimate of the model
performance, we use k-fold cross-validation.
In k-fold cross-validation, we divide the dataset into ‘k’ subsets of equal size.
We build models ‘k’ times, each time leaving out one of the subsets from training and using it as
the test set. If ‘k’ equals the sample size, this is called "leave-one-out".
The main purpose of cross-validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data.
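A minimal k-fold cross-validation sketch in Python, assuming scikit-learn is available; with cv = 5 the model is trained five times, each time scored on the left-out fold. The dataset and model are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k = 5: the data is split into 5 equal folds; each fold is used once as the test set.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```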
Model Evaluation Metrics
Classification Evaluation
Regression Evaluation
Model Evaluation Metrics
Classification Evaluation
To evaluate the performance of a classification model, different metrics are used: accuracy,
precision, confusion matrix, log-loss, and AUC (area under the ROC curve).
1. Confusion Matrix
Confusion Matrix is the graphical representation of the Actual and Predicted values.
It is used to describe the performance of a classification model on a set of test data for
which the true values are known.
Model Evaluation Metrics
Example: Suppose we are trying to create a model that can predict whether a patient has cancer or not.
In the calculations below, the confusion-matrix counts are: True Positives = 100, False Negatives = 5,
False Positives = 10, True Negatives = 50.
2. Accuracy:
The number of correct predictions divided by the total number of predictions.
Example:
Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + False
Negatives + True Negatives)
Accuracy = (100 + 50) / (100 + 10 + 5 + 50) = 150/165 = 0.91, i.e. 91%
Accuracy is useful when the target classes are well balanced, but it is not a good measure for unbalanced classes.
Model Evaluation Metrics
3. Precision
The number of true positives divided by the number of predicted positives.
Example:
Precision = True Positives / (True Positives + False Positives)
Precision = 100 / (100 + 10) = 100/110 = 0.91, i.e. 91%
When our model predicts that a patient has cancer, it is correct 91 percent of the
time.
4. Recall or Sensitivity
The number of true positives divided by the total number of true positives and false negatives.
Recall is also known as TPR (True Positive Rate).
Example:
Recall = True Positives / (True Positives + False Negatives)
Recall = 100 / (100 + 5) = 100/105 = 0.95, i.e. 95%
95 percent of all cancer patients are correctly predicted by the model to have cancer.
Increasing precision decreases recall and vice versa; this is known as the precision/recall trade-off.
Model Evaluation Metrics
5. Specificity:
The number of true negatives divided by the total number of true negatives and false positives.
Specificity is the opposite of sensitivity; it is also known as TNR (True Negative Rate).
Example:
Specificity = True Negatives / (True Negatives + False Positives)
Specificity = 50 / (50 + 10) = 50/60 = 0.83, i.e. 83%
A specificity of 0.83 means 83 percent of all patients that didn't have cancer are predicted correctly.
6. F1-Score
When two models have low precision and high recall or vice versa, it becomes hard to compare them;
to solve this issue we can use the F-score.
"The F-score is the harmonic mean of Precision and Recall."
By calculating the F-score, we can evaluate recall and precision at the same time. The F-score is
at its maximum when recall equals precision.
Example:
F1-Score = (2 × Recall × Precision) / (Recall + Precision)
F1-Score = (2 × 0.95 × 0.91) / (0.95 + 0.91) = 1.729/1.86 = 0.93, i.e. 93%
The F1 score is high, i.e. both the precision and the recall of the classifier indicate good results.
Model Evaluation Metrics
7. Error Rate (Misclassification Rate):
The number of incorrect predictions divided by the total number of predictions.
The best error rate is 0.0 and the worst error rate is 1.0.
Error Rate = 1 - Accuracy, or Error Rate = (False Negatives + False Positives) / Total
Error Rate = 1 - 0.91 = (10 + 5)/165 = 0.09
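A short sketch in plain Python that recomputes the worked figures from sections 2-7 directly from the confusion-matrix counts of the cancer example (TP = 100, TN = 50, FP = 10, FN = 5).

```python
# Confusion-matrix counts from the cancer example above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                    # 165

accuracy    = (TP + TN) / total              # 150/165 ≈ 0.91
precision   = TP / (TP + FP)                 # 100/110 ≈ 0.91
recall      = TP / (TP + FN)                 # 100/105 ≈ 0.95
specificity = TN / (TN + FP)                 # 50/60  ≈ 0.83
f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.93
error_rate  = (FP + FN) / total              # 15/165 ≈ 0.09

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, "
      f"specificity={specificity:.2f}, f1={f1:.2f}, error_rate={error_rate:.2f}")
```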
8. AUC-ROC curve
Receiver Operating Characteristic (ROC)
The ROC curve is a graphical plot of the True Positive Rate (y-axis) against the False Positive
Rate (x-axis).
It shows the performance of a classification model at different classification thresholds.
It was originally developed to test military radar receivers.
Model Evaluation Metrics
Area Under Curve (AUC)
AUC is the area under the ROC curve.
It measures the overall performance of a binary classification model.
Both TPR and FPR range between 0 and 1, so the area also lies between 0 and 1; a
greater value of AUC denotes better model performance.
Our main goal is to maximize this area in order to have the highest TPR and lowest FPR at a
given threshold.
The AUC measures the probability that the model will assign a randomly chosen positive
instance a higher predicted probability compared to a randomly chosen negative instance.
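A minimal ROC/AUC sketch, assuming scikit-learn and matplotlib are available; roc_curve sweeps the classification threshold and roc_auc_score returns the area under the resulting curve. The dataset and model are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predicted probability of the positive class for each test example.
probs = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)   # TPR vs FPR at every threshold
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.show()
```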
Model Evaluation Metrics
Regression Evaluation or Metrics
A regression model almost never predicts the actual value exactly; its predictions fall somewhat
lower or higher than the actual values.
Regression metrics aim to show us the prediction error of our model.
An error is defined as the difference between predicted and the actual value.
The lower the error, the better the performance of the model, and the higher the error, the
worse the performance of the model.
Linear regression
▲ It finds the linear relationship between the dependent and independent variables using a best-
fit straight line.
▲ Linear regression adjusts the line to the data for accurate predictions.
▲ The objective of linear regression is to find a line that minimizes the prediction errors.
▲ The linear equation is $y = mx+c$,
where y is the dependent variable and x is the independent variable given in your dataset.
Model Evaluation Metrics
Example: Implementation of Linear Regression
X = Age, Y = Glucose Level

X (Age)     Y (Glucose)   x - x̄       y - ȳ    (x - x̄)²     (x - x̄)(y - ȳ)
43          99             1.8333     18         3.3611      33
21          65           -20.1667    -16       406.69       322.6667
25          79           -16.1667     -2       261.36        32.3333
42          75             0.8333     -6         0.6944      -5
57          87            15.8333      6       250.69        95
59          81            17.8333      0       318.03         0
x̄ = 41.167  ȳ = 81        Sum = 0    Sum = 0   Sum = 1240.83  Sum = 478

m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 478 / 1240.83 = 0.3852

y = mx + c  →  81 = (0.3852 × 41.167) + c  →  coefficient c = 65.1416
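A sketch that reproduces the slope and intercept above with NumPy, using the same six (age, glucose) pairs.

```python
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59])   # Age
y = np.array([99, 65, 79, 75, 87, 81])   # Glucose level

# m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  c = y_bar - m * x_bar
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
print(m, c)          # ≈ 0.3852 and ≈ 65.14

y_pred = m * x + c   # predicted glucose levels used in the error metrics below
```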
Model Evaluation Metrics
1. Mean Absolute Error
Mean Absolute Error (MAE) is the average of the absolute differences between the actual values and
the predicted values. This difference is known as the prediction error.
MAE = (1/n) × Σ |yᵢ - ŷᵢ|
It measures the average magnitude of the residuals in the dataset.
It doesn't give any idea about the direction of the error.
It measures how far the predictions made by a model are from the actual output.
2. Mean Squared Error (MSE)
Mean Squared Error is the average of the squared differences between the actual and predicted
values. A lower error decreases the MSE and means more accurate predictions.
MSE = 1/6 × [(99-81.7056)² + (65-73.2312)² + (79-74.772)² +
(75-81.3204)² + (87-87.0984)² + (81-87.8688)²]
    = [(17.2944)² + (8.2312)² + (4.228)² + (6.3204)² + (0.0984)² + (6.8688)²] / 6
The mean squared error (MSE) for the above dataset is 78.6437.
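A sketch computing MAE and MSE with NumPy from the actual and predicted glucose values above; it gives MAE ≈ 7.17 and MSE ≈ 78.64.

```python
import numpy as np

y_true = np.array([99, 65, 79, 75, 87, 81])
y_pred = np.array([81.7056, 73.2312, 74.772, 81.3204, 87.0984, 87.8688])

mae = np.mean(np.abs(y_true - y_pred))   # mean of |error|   ≈ 7.17
mse = np.mean((y_true - y_pred) ** 2)    # mean of error^2   ≈ 78.64
print(mae, mse)
```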
Model Evaluation Metrics
3. Root Mean Square Error (RMSE)
Root Mean Squared Error is the square root of the mean of the squared errors, i.e. the square
root of the average squared difference between the predicted values and the actual values.
RMSE assesses the amount of error in a regression model.
Low RMSE values indicate that the model's predictions are more accurate; higher values suggest
more error and less accurate predictions.
RMSE is more sensitive to outliers.
RMSE = sqrt{ [(17.2944)² + (8.2312)² + (4.228)² + (6.3204)² + (0.0984)² + (6.8688)²] / 6 }
     = sqrt(78.6437)
RMSE = 8.8681

Age (X)   Glucose Level (Y)   Predicted (ŷ)   Error |y - ŷ|   Squared Error (y - ŷ)²
43        99                  81.7056         17.2944         299.0963
21        65                  73.2312          8.2312          67.7527
25        79                  74.772           4.228           17.8760
42        75                  81.3204          6.3204          39.9475
57        87                  87.0984          0.0984           0.0097
59        81                  87.8688          6.8688          47.1804
Mean Squared Error = 78.6437,  Root Mean Squared Error = 8.8681
Model Evaluation Metrics
4. Root Mean Square Log Error (RMSLE)
Root Mean Squared Logarithmic Error is calculated by applying log to the actual and the
predicted values and then taking their differences.
It penalizes the model more if the predicted value is less than the actual value, while the model
is penalized less if the predicted value is more than the actual value. It does not heavily punish
large errors, because of the log. Typically RMSE > RMSLE.
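A sketch of RMSE and RMSLE with NumPy on the same data; log1p is used so the logarithm is defined even when a value is zero.

```python
import numpy as np

y_true = np.array([99, 65, 79, 75, 87, 81])
y_pred = np.array([81.7056, 73.2312, 74.772, 81.3204, 87.0984, 87.8688])

rmse  = np.sqrt(np.mean((y_true - y_pred) ** 2))                      # ≈ 8.87
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))  # error on the log scale
print(rmse, rmsle)
```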
A histogram can be used to represent the actual data, overlaid with a density curve that
represents the predicted result.
distplot displays a histogram overlapped with a density curve.
Distribution Plot
Distribution plots are important for exploratory data analysis.
They help us detect outliers and skewness, or get an overview of the measures of central
tendency (mean, median, and mode).
This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for
smoother distributions by smoothing out the noise.
The horizontal axis of the histogram represents the entire range of data values. The vertical
axis represents the relative frequency.
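A minimal distribution-plot sketch, assuming seaborn and matplotlib are available; histplot with kde=True overlays a histogram with a kernel-density curve (seaborn's older distplot did the same), and kdeplot adds a density curve for the predictions.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

y_true = np.array([99, 65, 79, 75, 87, 81])
y_pred = np.array([81.7056, 73.2312, 74.772, 81.3204, 87.0984, 87.8688])

# Histogram of the actual values with a smoothed density curve,
# plus a density curve of the predictions for comparison.
sns.histplot(x=y_true, kde=True, stat="density", label="actual")
sns.kdeplot(x=y_pred, label="predicted")
plt.legend()
plt.show()
```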
Distribution Plot
Use distribution plots in addition to more formal hypothesis tests to determine whether the
sample data comes from a specified distribution.
Normal Probability Plots: Use normplot to assess whether a data set is approximately normally
distributed. Use probplot to create probability plots for distributions other than the normal.
Quantile-Quantile Plots: Use qqplot to determine whether two data sets come from populations
with a common distribution; it plots the quantiles of the first data set against the quantiles of
the second data set.
Cumulative Distribution Plots: Use cdfplot or ecdf to display the empirical cumulative
distribution function (cdf) of the sample data for visual comparison to the theoretical cdf of a
specified distribution.
Generalization error
Generalization error is the out-of-sample error that measures how accurately a model can predict
values for previously unseen data, i.e. it determines the model's ability to react to new, unseen data.
Generalization error is a composition of bias and variance and is commonly measured by MSE:
E = bias + variance (plus the irreducible noise described below).
The difference between predicted and actual data arises from model inaccuracy, sampling error
and noise.
• Noise - unnecessary or irrelevant data that can reduce our model's performance. It is the
irreducible error, the lower bound of the generalization error.
• Bias-
The difference between the expected (average) predicted values and the actual values.
Bias is the prediction error that is introduced in the model due to oversimplifying it.
It represents how much a model's predictions differ from the correct values.
High bias indicates that the model is underfitting the data because it cannot capture meaningful patterns.
Low Bias: the model's average predictions are very close to the actual values. When bias is low, the model
fits the training data well and makes accurate predictions on new or unseen data.
High Bias: the model's average predictions are far from the actual values, and the model is unable to
capture the meaningful patterns in the data.
Generalization error
Variance-
• The amount by which the predictions would change if different training data were used.
• High variance indicates that the model is overfitting the data
• Variance errors are either of low variance or high variance.
Low variance- A very small change in predictions when we change the input dataset.
High variance- A large difference in the predictions when we change the input data.
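For reference, the standard bias-variance decomposition that this section sketches can be written as the expected squared error at a point x, with f the true function, f̂ the learned model, and σ² the irreducible noise variance (not stated explicitly above):

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
$$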
Overfitting
• Overfitting occurs when the trained model performs well on the training data but poorly
on the testing dataset (new data), i.e. when the model tries to cover every data point
present in the given dataset.
• The model becomes more complex in the presence of noise in the data set.
• Overfitting can happen due to low bias and high variance.
• (Figure: the overfitted curve passes through every training data point.)
Reasons for Overfitting
• The training data contains noise.
• The model has high variance.
• The training dataset size is insufficient.
• The model is too complex
Techniques to reduce Overfitting
• Using K-fold cross-validation
• Using Regularization techniques such as Lasso and Ridge
• Training model with sufficient data
• Adopting ensembling techniques
• Increase training data.
• Reduce model complexity.
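A small sketch illustrating overfitting and one remedy (reducing model complexity), assuming scikit-learn; the sine data, seeds, and polynomial degrees are arbitrary choices. The high-degree model typically fits the training points almost perfectly but scores much worse on the held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)   # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (3, 9):   # simple model vs overly complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree}: train R2 = {model.score(X_tr, y_tr):.3f}, "
          f"test R2 = {model.score(X_te, y_te):.3f}")
```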
Underfitting
• Underfitting occurs when a model is not able to make accurate predictions based on training
data.
• Underfitting is just the opposite of overfitting.
• An underfitted model has poor performance on both the training and testing sets.
• Underfitting occurs due to high bias and low variance.
Reasons for Underfitting
• High bias and low variance
• The size of the training dataset used is not enough.
• The model is too simple.
• Training data is not cleaned and contains noise.
Ridge Regression
It aims to reduce the sum of squared errors between the actual and predicted values by adding
a penalty term to the regression estimates that shrinks the coefficients and brings them closer
to zero, i.e. it penalizes large coefficient values.
After adding the penalty hyperparameter, the model's best-fit line will also change.
Prediction by using Ridge Regression
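A minimal Ridge regression sketch, assuming scikit-learn; alpha is the penalty hyperparameter, and a larger alpha shrinks the slope further toward zero. The age/glucose data from the earlier example is reused.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

x = np.array([43, 21, 25, 42, 57, 59]).reshape(-1, 1)   # Age
y = np.array([99, 65, 79, 75, 87, 81])                  # Glucose level

ols   = LinearRegression().fit(x, y)
ridge = Ridge(alpha=10.0).fit(x, y)    # larger alpha => stronger shrinkage of the slope

print("OLS   slope:", ols.coef_[0],   "intercept:", ols.intercept_)
print("Ridge slope:", ridge.coef_[0], "intercept:", ridge.intercept_)
print("Ridge prediction at age 50:", ridge.predict([[50]])[0])
```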