
Machine Learning
Lecture 04: Machine Learning Basics

Arslan Ali Khan
[email protected]
Department of Cyber-Security and Data Science
Riphah Institute of Systems Engineering (RISE),
Riphah International University, Islamabad, Pakistan.
Confusion Matrix

Sensitivity or Recall

• Sensitivity is a measure of how well a machine learning model can detect positive instances. It is also known as the true positive rate (TPR) or recall.
• In other words, sensitivity measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
• Sensitivity is used to evaluate model performance because it allows us to see how many positive instances the model was able to correctly identify.

Sensitivity = TP / (TP + FN)
Sensitivity = True Positives / (True Positives + False Negatives)
Sensitivity or Recall

• Let’s consider an example of a medical test for a rare disease to understand the concept of sensitivity. Suppose that the test has a sensitivity of 95%. This means that if 100 people who have the disease take the test, the test will correctly identify 95 of them as positive, but it will miss 5 of them (false negatives).
• As shown above, a model with high sensitivity will have few false negatives, which means that it misses only a few of the positive instances.
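As a quick check, the arithmetic of this example in a few lines of Python:

```python
# Sensitivity from the example above: 95 true positives, 5 false negatives.
tp, fn = 95, 5
sensitivity = tp / (tp + fn)
print(sensitivity)  # 0.95, i.e., the test finds 95% of the sick patients
```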
Sensitivity or Recall

• In various use cases, it is important for models to have high sensitivity because we want them to find all of the positive instances in order to make accurate predictions.
• The sum of sensitivity (true positive rate) and the false negative rate is always 1. The higher the true positive rate, the better the model is at correctly identifying positive cases.
Specificity

• When sensitivity is used to evaluate model performance, it is often compared to specificity. Specificity measures the proportion of actual negatives that are correctly identified by the model; for this reason it is also called the true negative rate (TNR).
• This implies that there is another proportion of actual negatives which get predicted as positive; these are the false positives.

Specificity = TN / (TN + FP)
Specificity = True Negatives / (True Negatives + False Positives)
Specificity

• Let’s consider an example of a medical test for a rare disease. Suppose that the test has a specificity of 95%. This means that if 100 people who do not have the disease take the test, the test will correctly identify 95 of them as negative, but it will incorrectly identify 5 of them as positive (false positives).
• Thus, specificity in this case is the proportion of people not suffering from the disease who are correctly predicted as not suffering from it. In other words, specificity is the proportion of actually healthy people who are predicted to be healthy.
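Again, the arithmetic of this example in Python:

```python
# Specificity from the example above: 95 true negatives, 5 false positives.
tn, fp = 95, 5
specificity = tn / (tn + fp)
print(specificity)  # 0.95, i.e., 95% of healthy people are cleared correctly
```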
Specificity

• The sum of specificity (true negative rate) and the false positive rate is always 1. High specificity means that the model correctly identifies most of the negative results, while low specificity means that the model mislabels many negative results as positive.
Sensitivity vs Specificity

• Sensitivity: The ability of a test to correctly identify patients with a disease.
• Specificity: The ability of a test to correctly identify people without the disease.
Accuracy

• Accuracy is a metric that measures how often a machine learning model correctly predicts the outcome. You can calculate accuracy by dividing the number of correct predictions by the total number of predictions.
• In other words, accuracy answers the question: how often is the model right?

Accuracy = (TP + TN) / (TP + TN + FP + FN)
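A minimal sketch using scikit-learn, with hypothetical labels made up for illustration:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))  # 6 correct out of 8 -> 0.75
```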
Precision

• Precision is a metric that measures how often a machine learning model correctly predicts the positive class. You can calculate precision by dividing the number of correct positive predictions (true positives) by the total number of instances the model predicted as positive (both true and false positives).
• Precision is also known as the positive predictive value.

Precision = TP / (TP + FP)
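Continuing with the same hypothetical labels as in the accuracy sketch:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
# 4 predicted positives, of which 3 are true positives -> 3/4
print(precision_score(y_true, y_pred))  # 0.75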
F1 score

• The F1 score or F-measure is the harmonic mean of the precision and recall of a classification model. The two metrics contribute equally to the score, ensuring that the F1 metric correctly indicates the reliability of a model.

F1 = 2 * (precision * recall) / (precision + recall)

where precision is the number of true positives divided by the sum of true positives and false positives, and recall is the number of true positives divided by the sum of true positives and false negatives.
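With the same hypothetical labels, the harmonic-mean formula and scikit-learn's f1_score agree:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
p = precision_score(y_true, y_pred)  # 0.75
r = recall_score(y_true, y_pred)     # 3 of 4 actual positives found -> 0.75
print(2 * (p * r) / (p + r))         # harmonic mean: 0.75
print(f1_score(y_true, y_pred))      # same value from sklearn
```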
Confusion Matrix: Example
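The original slide shows this example as a figure, which is not reproduced here. A minimal sketch with the same hypothetical labels shows how every metric above falls out of one confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels, sklearn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Sensitivity/Recall:", tp / (tp + fn))
print("Specificity:       ", tn / (tn + fp))
print("Accuracy:          ", (tp + tn) / (tp + tn + fp + fn))
print("Precision:         ", tp / (tp + fp))
```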
Mathematical Concepts (Self Study)

• Scalar
• Vector
• Matrix
• Norm
• Eigen Decomposition
• Singular Value Decomposition
• Random Variable
• Probability Distribution
• Probability Mass Function
• Probability Density Function
Feature Engineering

• Dealing with Missing Data
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like “NA” or “unknown.” These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.
Feature Engineering

• Dealing with Missing Data
Missing values can pose a significant challenge in data analysis, as they can:
• Reduce the sample size: This can decrease the accuracy and reliability of your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the results of your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.
Feature Engineering

• Dealing with Missing Data
• Replacing missing values with estimated values.
  • Preserves sample size: Doesn’t reduce data points.
  • Can introduce bias: Estimated values might not be accurate.

Use of Mean, Median, and Mode:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn’t consider the relationships between variables.
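A minimal sketch of mean/median/mode replacement with pandas (the dataset and column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age":    [25, None, 37, 29, None],
    "income": [40000, 52000, None, 61000, 48000],
    "city":   ["Lahore", "Islamabad", None, "Islamabad", "Karachi"],
})

# Numeric columns: fill with the median or mean of the column.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical column: fill with the mode (most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```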


Feature Engineering

• Handling Categorical Data
Categorical data is data that can be divided into groups or categories, such as gender, hair color, or product type.
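Most models require numeric input, so categorical features are usually encoded. A minimal sketch using one-hot encoding with pandas (the data is hypothetical; scikit-learn's OneHotEncoder is a common alternative):

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"hair_color": ["black", "brown", "red", "black"]})

# One-hot encoding: one binary column per category.
print(pd.get_dummies(df, columns=["hair_color"]))
```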
Feature Engineering

• Normalizing Data
Normalization in machine learning is the process of translating data into the range [0, 1] (or any other range).
• Feature Construction or Generation
Feature generation (also known as feature construction, feature extraction, or feature engineering) is the process of transforming features into new features that better relate to the target. This can involve mapping a feature into a new feature using a function like log, or creating a new feature from one or multiple features using multiplication or addition.
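A minimal sketch of both ideas with pandas and NumPy (the features are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 50.0, 200.0, 1000.0],
                   "quantity": [3, 1, 4, 2]})

# Normalization: rescale "price" into the range [0, 1].
p = df["price"]
df["price_norm"] = (p - p.min()) / (p.max() - p.min())

# Feature construction: a log transform and a product of two features.
df["log_price"] = np.log(df["price"])
df["revenue"] = df["price"] * df["quantity"]

print(df)
```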
Feature Scaling

A technique often applied as part of data preparation for machine learning.

Goal: Change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.

Normalization
• Min-max normalization: Guarantees all features will have the exact same scale but does not handle outliers well.
• Z-score standardization: Handles outliers, but does not produce normalized data with the exact same scale.
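A minimal sketch contrasting the two, assuming scikit-learn's scalers and a made-up feature with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature; 1000 is an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Min-max: everything lands in [0, 1], but the outlier squashes
# the other values toward 0.
print(MinMaxScaler().fit_transform(X).ravel())

# Z-score: zero mean and unit variance, but no fixed output range.
print(StandardScaler().fit_transform(X).ravel())
```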
Training, Testing and Validation Sets
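The original slides illustrate the three splits with figures that are not reproduced here. A minimal sketch of one common way to produce such splits, assuming scikit-learn's train_test_split (the split ratios are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off 20% as the test set, then split the remainder into
# training and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```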
K-Fold Cross Validation

• K-fold cross-validation is a technique for evaluating predictive models.
• The dataset is divided into k subsets or folds. The model is trained and evaluated k times, using a different fold as the validation set each time.
• Performance metrics from each fold are averaged to estimate the model's generalization performance.
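A minimal sketch, assuming scikit-learn's cross_val_score with an illustrative model and k = 5:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: five train/validate rounds, one score each.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate of generalization performance
```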
Under-fitting and Over-fitting

• Overfitting occurs when the model fits the training data too well and does not generalize, so it performs badly on the test data. It is the result of an excessively complicated model.
• Underfitting occurs when the model does not fit the data well enough. It is the result of an excessively simple model.
Under-fitting and Over-fitting

• Both overfitting and underfitting lead to poor predictions on new datasets.
• A learning model that overfits or underfits does not generalize well.
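A minimal sketch of both failure modes on made-up noisy data, assuming scikit-learn: a degree-1 polynomial underfits, while a degree-15 polynomial overfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, 40).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # Underfitting: low R^2 on both sets. Overfitting: high train R^2
    # but a much lower test R^2.
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```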

About Python

Familiarize yourself with Python programming this week.


Python

• Install Anaconda Navigator https://www.anaconda.com/products/individual


Python

Environments and Libraries

Environments: Notebook, Qtconsole, Orange, VSCode, PyCharm
Libraries: Pandas, SciPy, Matplotlib, Sklearn, NumPy
Python Exercises to solve this week

• https://pynative.com/python-exercises-with-solutions/

• https://www.w3resource.com/machine-learning/scikit-learn/iris/index.php

• https://www.practicepython.org/
Reading Task for this week

Relevant sections from Chapter 2 of the textbook.


Reading Task for this week
Part I: Understanding Machine Learning, Chapters 2 and 3
Chapters 1 and 2
