
MAILAM Engineering College
(Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai,
Accredited by National Board of Accreditation (NBA), Accredited by NAAC with "A" Grade &
Accredited by TATA Consultancy Services (TCS), Chennai)

Department of Artificial Intelligence and Data Science
&
Department of Computer Science and Business System

Sub. Code: AL 3451 YEAR/ SEM: II/IV

Sub. Name: MACHINE LEARNING

UNIT I INTRODUCTION TO MACHINE LEARNING

Review of Linear Algebra for machine learning; Introduction and motivation for
machine learning; Examples of machine learning applications, Vapnik-Chervonenkis
(VC) dimension, Probably Approximately Correct (PAC) learning, Hypothesis spaces,
Inductive bias, Generalization, Bias variance trade-off.

UNIT II SUPERVISED LEARNING

Linear Regression Models: Least squares, single & multiple variables, Bayesian linear
regression, gradient descent, Linear Classification Models: Discriminant function -
Perceptron algorithm, Probabilistic discriminative model - Logistic regression,
Probabilistic generative model - Naive Bayes, Maximum margin classifier - Support
vector machine, Decision Tree, Random Forests.
UNIT III ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING

Combining multiple learners: Model combination schemes, Voting, Ensemble Learning
- bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based
Learning: KNN, Gaussian mixture models and Expectation maximization.

UNIT IV NEURAL NETWORKS

Multilayer perceptron, activation functions, network training - gradient descent
optimization - stochastic gradient descent, error backpropagation, from shallow
networks to deep networks - Unit saturation (aka the vanishing gradient problem) -
ReLU, hyperparameter tuning, batch normalization, regularization, dropout.


UNIT V DESIGN AND ANALYSIS OF MACHINE LEARNING EXPERIMENTS

Guidelines for machine learning experiments, Cross Validation (CV) and resampling -
K-fold CV, bootstrapping, measuring classifier performance, assessing a single
classification algorithm and comparing two classification algorithms - t test,
McNemar's test, K-fold CV paired t test.

TEXTBOOKS:
1. Ethem Alpaydin, "Introduction to Machine Learning", Fourth Edition, MIT Press, 2020.
2. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", Second Edition, CRC Press, 2014.

REFERENCES:
1. Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
2. Tom Mitchell, "Machine Learning", McGraw Hill, 3rd Edition, 1997.
3. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, "Foundations of Machine Learning", Second Edition, MIT Press, 2018.
4. Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning", MIT Press, 2016.
5. Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning", 3rd Edition, Packt Publishing, 2019.

PREPARED BY

Ms. M. Nithya, AP/AI&DS

Mr. Thiyaneshwaran, AP/CSBS

HOD:
Dr. Artheeswart, Prof. & Head / AI&DS
Mr. S. Prasanna, Prof. & Head / CSBS

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / IV SEM
AL3451- MACHINE LEARNING
UNIT I INTRODUCTION TO MACHINE LEARNING
SYLLABUS:
Review of Linear Algebra for machine learning; Introduction and motivation for
machine learning; Examples of machine learning applications, Vapnik-Chervonenkis
(VC) dimension, Probably Approximately Correct (PAC) learning, Hypothesis spaces,
Inductive bias, Generalization, Bias variance trade-off.

PART A
1. Define linear Algebra.
Linear Algebra is an essential field of mathematics, which defines the study of
vectors, matrices, planes, mapping, and lines required for linear transformation.
2. What are the Benefits of learning Linear Algebra before Machine learning?
 Better Graphic experience
 Improved Statistics
 Creating better Machine Learning algorithms
 Estimating the forecast of Machine Learning
 Easy to Learn

3. List some of the supervised learning algorithms that can be created using Linear
Algebra.
 Logistic Regression
 Linear Regression
 Decision Trees
 Support Vector Machines (SVM)


4. List some of the unsupervised learning algorithms that can be created using
Linear Algebra.
 Singular Value Decomposition (SVD)
 Clustering
 Components Analysis

5. Give some examples of Linear Algebra in Machine Learning.


 Datasets and Data Files
 Linear Regression
 Recommender Systems
 One-hot encoding
 Regularization
 Principal Component Analysis
 Images and Photographs
 Singular-Value Decomposition
 Deep Learning
 Latent Semantic Analysis

6. Define Machine Learning.


 Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959
while at IBM.
 He defined machine learning as “the field of study that gives
computers the ability to learn without being explicitly
programmed“.
 Machine learning is programming computers to optimize a performance
criterion using example data or past experience. The model may be
predictive to make predictions in the future, or descriptive to gain
knowledge from data.

7. Mention the various classifications of Machine Learning.


Machine learning implementations are classified into four major
categories, depending on the nature of the learning “signal” or “response”
available to a learning system which are as follows:


 Supervised learning
 Unsupervised learning
 Reinforcement learning
 Semi-supervised learning

8. Define Supervised learning


 Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.
 The given data is labeled.
 Both classification and regression problems are supervised learning
problems.

9. Define Unsupervised learning


 Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
 In unsupervised learning algorithms, classification or categorization is not
included in the observations.
 In unsupervised learning the agent learns patterns in the input without
any explicit feedback.
 The most common unsupervised learning task is clustering: detecting
potentially useful clusters of input examples.

10. What is Reinforcement learning?


In reinforcement learning the agent learns from a series of reinforcements:
rewards and punishments. Reinforcement learning is the problem of getting an agent
to act in the world so as to maximize its rewards. A learner is not told what actions
to take as in most forms of machine learning but instead must discover which actions
yield the most reward by trying them.


11. What is Semi-supervised learning?


 Semi-Supervised learning is a type of Machine Learning algorithm that
represents the intermediate ground between Supervised and
Unsupervised learning algorithms.
 It uses the combination of labeled and unlabeled datasets during the
training period, where an incomplete training signal is given: a training set
with some of the target outputs missing.

12. How are algorithms categorized based on the required output?


 Classification
 Regression
 Clustering

13. What is Vapnik-Chervonenkis (VC) dimension?


 The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a
hypothesis set to fit different data sets.
 The VC dimension is a measure of the complexity of a model, which can help
us understand how well it can fit different data sets.
 The VC dimension of a hypothesis set H is the largest number of points that
can be shattered by H.

14. When does a hypothesis set shatter a set of points?


A hypothesis set H shatters a set of points S if, for every possible labeling of
the points in S, there exists a hypothesis in H that correctly classifies the points. In
other words, a hypothesis set shatters a set of points if it can fit any possible
labeling of those points.

15. Define Probably Approximately Correct (PAC) learning.


 PAC (Probably Approximately Correct) learning is a theoretical framework in
machine learning that deals with the learning of hypotheses from a given set
of data.
 It provides guarantees on the generalization performance of a learning algorithm
by bounding the error between the learned hypothesis and the true hypothesis.


16. Define Hypothesis.


A hypothesis space is a collection of possible functions or classifiers that a
learning algorithm can choose from to make predictions or decisions based on input
data.

17. Define Hypothesis in Machine Learning (ML).


The hypothesis is one of the commonly used concepts of statistics in
Machine Learning. It is specifically used in Supervised Machine learning, where
an ML model learns a function that best maps the input to corresponding outputs
with the help of an available dataset.

18. Define Hypothesis space (H):


Hypothesis space is defined as a set of all possible legal hypotheses;
hence it is also known as a hypothesis set. It is used by supervised machine
learning algorithms to determine the best possible hypothesis to describe the
target function or best maps input to output.

19. Define Hypothesis (h)


A hypothesis is a function that best describes the target in
supervised machine learning. The hypothesis that an algorithm would come
up with depends upon the data and also upon the restrictions and bias that
we have imposed on the data.
The Hypothesis can be calculated as:
y= mx + b
Where,
 y = range
 m = slope of the line
 x = domain
 b = intercept


20. Define Hypothesis in Statistics.


In statistics, a hypothesis refers to a statement or assumption about
a population parameter. It is a proposition or educated guess that helps
guide statistical analyses. There are two types of hypotheses: the null
hypothesis (H0) and the alternative hypothesis (H1 or Ha).
 Null Hypothesis(H0): This hypothesis suggests that there is no
significant difference or effect, and any observed results are due to
chance. It often represents the status quo or a baseline assumption.
 Alternative Hypothesis(H1 or Ha): This hypothesis contradicts the
null hypothesis, proposing that there is a significant difference or
effect in the population. It is what researchers aim to support with
evidence.

21. Define Inductive bias.


 Inductive bias is a key component of any machine learning model. It
guides the model's learning process and influences its ability to
make predictions on new data.
 The inductive bias is what allows a model to choose one hypothesis
over another when multiple hypotheses are consistent with the
observed training data.

22. Define Generalization.


Generalization defines the ability of an ML model to provide a suitable
output by adapting to the given set of unknown inputs. It means after providing
training on the dataset, it can produce reliable and accurate output.

23. What is Overfitting?


Overfitting occurs when our machine learning model tries to cover
all the data points or more than the required data points present in the
given dataset. Because of this, the model starts capturing noise and
inaccurate values present in the dataset, and all these factors reduce the
efficiency and accuracy of the model. The overfitted model has low
bias and high variance.


24. What is Underfitting?


Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data. To avoid overfitting in the
model, the feeding of training data can be stopped at an early stage, due to
which the model may not learn enough from the training data. As a result,
it may fail to find the best fit of the dominant trend in the data.

25. What are Errors in Machine Learning?


In machine learning, an error is a measure of how accurately an
algorithm can make predictions for the previously unknown dataset. On
the basis of these errors, the machine learning model is selected that can
perform best on the particular dataset.

26. What are the types of errors?


There are mainly two types of errors in machine learning, which
are:
 Reducible errors: These errors can be reduced to improve the
model accuracy. Such errors can further be classified into bias
and Variance.
 Irreducible errors: These errors will always be present in the
model regardless of which algorithm has been used. The cause
of these errors is unknown variables whose value can't be
reduced.

27. What is Bias?


Bias can be defined as an inability of machine learning algorithms
such as Linear Regression to capture the true relationship between the
data points. Each algorithm begins with some amount of bias because bias
occurs from assumptions in the model, which makes the target function
simple to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about
the form of the target function.


o High Bias: A model with a high bias makes more assumptions,


and the model becomes unable to capture the important features
of our dataset. A high bias model also cannot perform well on
new data.

28. What is a Variance Error?


The variance would specify the amount of variation in the prediction
if different training data were used. In simple words, variance tells
how much a random variable differs from its expected
value.

29. List the Ways to Reduce High Variance.


o Reduce the input features or number of parameters, as the
model is overfitted.
o Do not use a very complex model.
o Increase the training data.
o Increase the Regularization term.

30. What is the Bias-Variance Trade-Off?


 While building the machine learning model, it is really important to
take care of bias and variance in order to avoid overfitting and
underfitting in the model.
 If the model is very simple with fewer parameters, it may have low
variance and high bias.
 Whereas, if the model has a large number of parameters, it will have
high variance and low bias.
 So, it is required to make a balance between bias and variance
errors, and this balance between the bias error and variance error
is known as the Bias-Variance trade-off.


PART-B

1. Explain in detail about linear algebra for machine learning and give some popular
examples of linear algebra and explain them.

Review of Linear Algebra for machine learning


1.1 Linear algebra for Machine Learning
1.2 Benefits of learning Linear Algebra before Machine learning
1.2.1. Better Graphic experience
1.2.2. Improved Statistics
1.2.3. Creating better Machine Learning algorithms
1.2.4. Estimating the forecast of Machine Learning
1.2.5. Easy to Learn
1.3 Minimum Linear Algebra for Machine Learning
1.3.1 Notation
1.3.2 Operations
1.3.3 Matrix Factorization
1.4 Examples of Linear Algebra in Machine Learning
1.4.1 Datasets and Data Files
1.4.2 Images and Photographs
1.4.3 One-hot encoding
1.4.4 Linear Regression
1.4.5 Regularization
1.4.6 Principal Component Analysis
1.4.7 Singular-Value Decomposition
1.4.8 Latent Semantic Analysis
1.4.9 Recommender Systems
1.4.10 Deep Learning

1.1 Linear Algebra for Machine learning


Linear Algebra is an essential field of mathematics, which defines the study of
vectors, matrices, planes, mapping, and lines required for linear transformation.
Linear algebra plays a vital role and key foundation in machine learning, and
it enables ML algorithms to run on a huge number of datasets.


Linear algebra is used in almost every concept of Machine Learning; specifically,
it can perform the following tasks:
o Optimization of data.
o Applicable in loss functions, regularization, covariance matrices, Singular
Value Decomposition (SVD), Matrix Operations, and support vector
machine classification.
o Implementation of Linear Regression in Machine Learning.
o Linear algebra is also used in neural networks and the data science field.

1.2 Benefits of learning Linear Algebra before Machine learning:


1.2.1 Better Graphic experience
 Linear Algebra helps to provide better graphical processing in Machine
Learning like Image, audio, video, and edge detection.
 These are the various graphical representations supported by Machine
Learning projects that you can work on.
 Further, parts of the given data set are trained based on their categories
by classifiers provided by machine learning algorithms.
 These classifiers also remove the errors from the trained data.
 Moreover, Linear Algebra helps solve and compute large and complex data
sets through matrix decomposition techniques. The two most popular
matrix decomposition techniques are as follows:
 Q-R
 L-U
1.2.2 Improved Statistics:
 Statistics is an important concept to organize and integrate data in
Machine Learning.
 Also, linear Algebra helps to understand the concept of statistics in a
better manner.
 Advanced statistical topics can be integrated using methods, operations,
and notations of linear algebra.


1.2.3 Creating better Machine Learning algorithms:


 Linear Algebra also helps to create better supervised as well as
unsupervised Machine Learning algorithms.
 Few supervised learning algorithms can be created using Linear Algebra,
which is as follows:
 Logistic Regression
 Linear Regression
 Decision Trees
 Support Vector Machines (SVM)
 Further, below are some unsupervised learning algorithms listed that can
also be created with the help of linear algebra as follows:
 Singular Value Decomposition (SVD)
 Clustering
 Components Analysis
 With the help of Linear Algebra concepts, you can also self-customize the
various parameters in the live project and understand in-depth knowledge
to deliver the same with more accuracy and precision.
1.2.4 Estimating the forecast of Machine Learning:
 If you are working on a Machine Learning project, then you must be a
broad-minded person and also, you will be able to impart more
perspectives.
 Hence, in this regard, you must increase the awareness and affinity of
Machine Learning concepts. You can begin with setting up different graphs,
visualization, using various parameters for diverse machine learning
algorithms or taking up things that others around you might find difficult
to understand.
1.2.5 Easy to Learn:
Linear Algebra is an important department of Mathematics that is easy
to understand. It is taken into consideration whenever there is a requirement
of advanced mathematics and its applications.


1.3 Minimum Linear Algebra for Machine Learning


1.3.1 Notation:
Notation in linear algebra enables you to read algorithm descriptions in
papers, books, and websites to understand the algorithm's working. Even if you use
for-loops rather than matrix operations, you will be able to piece things together.
1.3.2 Operations:
Working with an advanced level of abstractions in vectors and matrices can
make concepts clearer, and it can also help in the description, coding, and even
thinking capability. In linear algebra, it is required to learn the basic operations such
as addition, multiplication, inversion, transposing of matrices, vectors, etc.
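The notes do not prescribe a tool, but these basic operations can be sketched in a few lines of Python with NumPy (the library choice here is an assumption made purely for illustration):

```python
import numpy as np

# Basic matrix/vector operations mentioned above, shown with NumPy.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
v = np.array([1.0, -1.0])

print(A + B)             # element-wise matrix addition
print(A @ B)             # matrix multiplication
print(A @ v)             # matrix-vector product
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse (A must be square and non-singular)
```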
1.3.3 Matrix Factorization:
One of the most recommended areas of linear algebra is matrix factorization,
specifically matrix decomposition methods such as SVD and QR.
1.4 Examples of Linear Algebra in Machine Learning
Below are some popular examples of linear algebra in Machine learning:
1.4.1 Datasets and Data Files
 Each machine learning project works on the dataset, and we fit the
machine learning model using this dataset.
 Each dataset resembles a table-like structure consisting of rows and
columns. Where each row represents observations, and each column
represents features/Variables. This dataset is handled as a Matrix, which
is a key data structure in Linear Algebra.
 Further, when this dataset is divided into input and output for the
supervised learning model, it represents a Matrix(X) and Vector(y), where
the vector is also an important concept of linear algebra.
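As a minimal sketch of this Matrix (X) / Vector (y) view, assuming an invented toy dataset and NumPy (neither is specified in the notes):

```python
import numpy as np

# Toy dataset: each row is an observation, each column a feature or the label.
data = np.array([
    # feature1, feature2, label
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
])

X = data[:, :-1]   # input matrix X, shape (3, 2)
y = data[:, -1]    # output vector y, shape (3,)
print(X.shape, y.shape)
```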
1.4.2 Images and Photographs
 In machine learning, images/photographs are used for computer vision
applications. Each Image is an example of the matrix from linear algebra
because an image is a table structure consisting of height and width for
each pixel.
 Moreover, different operations on images, such as cropping, scaling,
resizing, etc., are performed using notations and operations of Linear
Algebra.


1.4.3 One Hot Encoding


 In machine learning, sometimes, we need to work with categorical data.
These categorical variables are encoded to make them simpler and easier
to work with, and the popular encoding technique to encode these variables
is known as one-hot encoding.
 In the one-hot encoding technique, a table is created that shows a variable
with one column for each category and one row for each example in the
dataset.
 Further, each row is encoded as a binary vector, which contains either zero
or one value. This is an example of sparse representation, which is a
subfield of Linear Algebra.
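A minimal sketch of one-hot encoding, assuming an invented categorical column and NumPy:

```python
import numpy as np

# Hypothetical categorical data to encode.
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))          # one column per category

# One row per example, each row a binary vector with a single 1.
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, value in enumerate(colors):
    one_hot[row, categories.index(value)] = 1

print(categories)   # ['blue', 'green', 'red']
print(one_hot)
```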
1.4.4 Linear Regression
 Linear regression is a popular technique of machine learning borrowed
from statistics.
 It describes the relationship between input and output variables and is
used in machine learning to predict numerical values.
 Linear regression problems are most commonly solved using least-squares
optimization, which is in turn solved with the help of matrix factorization
methods, as sketched below.
 Some commonly used matrix factorization methods are LU decomposition
and singular-value decomposition, which are concepts of linear algebra.
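A minimal sketch of least-squares linear regression on invented data; np.linalg.lstsq is used here because it solves the least-squares problem via a matrix factorization (an SVD) internally:

```python
import numpy as np

# Design matrix with a column of ones for the intercept, plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])

# Least-squares solution minimizing ||X @ coef - y||^2.
coef, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # [intercept, slope]
```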
1.4.5 Regularization
 In machine learning, we usually look for the simplest possible model to
achieve the best outcome for the specific problem.
 Simpler models generalize well, ranging from specific examples to unknown
datasets. These simpler models are often considered models with smaller
coefficient values.
 A technique used to minimize the size of coefficients of a model while it is
being fit on data is known as regularization.
 Common regularization techniques are L1 and L2 regularization.
 Both of these forms of regularization are, in fact, a measure of the
magnitude or length of the coefficients as a vector and are methods lifted
directly from linear algebra called the vector norm.
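The L1 and L2 penalties mentioned above are simply vector norms of the coefficient vector; a small sketch with an arbitrary (invented) coefficient vector:

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])   # hypothetical model coefficients

l1_penalty = np.sum(np.abs(w))        # L1 norm (Lasso-style regularization)
l2_penalty = np.sqrt(np.sum(w ** 2))  # L2 norm (Ridge-style regularization)
# Equivalently: np.linalg.norm(w, 1) and np.linalg.norm(w, 2).
print(l1_penalty, l2_penalty)
```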


1.4.6 Principal Component Analysis


 Generally, each dataset contains thousands of features, and fitting the
model with such a large dataset is one of the most challenging tasks of
machine learning.
 Moreover, a model built with irrelevant features is less accurate than a
model built with relevant features.
 There are several methods in machine learning that automatically reduce
the number of columns of a dataset, and these methods are known as
Dimensionality reduction.
 The most commonly used dimensionality reduction method in machine
learning is Principal Component Analysis or PCA. This technique makes
projections of high-dimensional data for both visualizations and training
models. PCA uses the matrix factorization method from linear algebra.
1.4.7 Singular-Value Decomposition
 Singular-Value decomposition is also one of the popular dimensionality
reduction techniques and is also written as SVD in short form.
 It is the matrix-factorization method of linear algebra, and it is widely used
in different applications such as feature selection, visualization, noise
reduction, and many more.
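A minimal sketch of SVD-based dimensionality reduction on an invented matrix (a rank-k approximation, which is the operation behind the applications listed above):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))   # invented data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular-value decomposition
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-2 approximation of A
print(np.linalg.matrix_rank(A_k))                  # 2
```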
1.4.8 Latent Semantic Analysis
 Natural Language Processing or NLP is a subfield of machine learning that
works with text and spoken words.
 NLP represents a text document as large matrices with the occurrence of
words. For example, the matrix column may contain the known
vocabulary words, and rows may contain sentences, paragraphs, pages,
etc., with cells in the matrix marked as the count or frequency of the
number of times the word occurred.
 It is a sparse matrix representation of text. Documents processed in this
way are much easier to compare, query, and use as the basis for a
supervised machine learning model.
 This form of data preparation is called Latent Semantic Analysis, or LSA
for short, and is also known by the name Latent Semantic Indexing or LSI.
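A minimal sketch of the document-term count matrix described above, reduced with a truncated SVD (the LSA step); the vocabulary and documents are invented for illustration:

```python
import numpy as np

vocab = ["machine", "learning", "matrix", "algebra"]
docs = [
    "machine learning learning",
    "matrix algebra matrix",
    "machine learning algebra",
]
# Rows = documents, columns = vocabulary words, cells = word counts.
counts = np.array([[doc.split().count(w) for w in vocab] for doc in docs], dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]   # each document as a point in a 2-D latent space
print(doc_topics)
```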


1.4.9 Recommender System


 A recommender system is a sub-field of machine learning, a predictive
modelling problem that provides recommendations of products. For
example, online recommendation of books based on the customer's
previous purchase history, recommendation of movies and TV series, as we
see in Amazon & Netflix.
 The development of recommender systems is mainly based on linear
algebra methods. We can understand it as an example of calculating the
similarity between sparse customer behaviour vectors using distance
measures such as Euclidean distance or dot products.
 Different matrix factorization methods such as singular-value
decomposition are used in recommender systems to query, search, and
compare user data.
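A minimal sketch of comparing customer behaviour vectors with dot products and Euclidean distance, as described above; the ratings matrix is invented:

```python
import numpy as np

# Rows = customers, columns = items (0 means not purchased/rated).
ratings = np.array([
    [5, 0, 3, 0],
    [4, 0, 4, 1],
    [0, 5, 0, 4],
])
a, b, c = ratings

print(a @ b, a @ c)                                   # dot-product similarity: a is closer to b
print(np.linalg.norm(a - b), np.linalg.norm(a - c))   # Euclidean distances
```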
1.4.10 Deep Learning
 Artificial Neural Networks (ANNs) are non-linear ML algorithms that
mimic the way the brain processes information and transfers it from one
layer to another.
 Deep learning studies these neural networks, which implement newer and
faster hardware for the training and development of larger networks with a
huge dataset.
 All deep learning methods achieve great results for different challenging
tasks such as machine translation, speech recognition, etc.
 The core of processing neural networks is based on linear algebra data
structures, which are multiplied and added together.
 Deep learning algorithms also work with vectors, matrices, tensors (matrix
with more than two dimensions) of inputs and coefficients for multiple
dimensions.


2. Define Machine Learning. Give an introduction to Machine Learning.

INTRODUCTION TO MACHINE LEARNING

2.1 Machine Learning


2.1.1 Definition of Machine Learning
2.1.2 Definition of learning
2.1.3 Examples
2.1.3.1 Handwriting recognition learning problem
2.1.3.2 A robot driving learning problem
2.2 Classification of Machine Learning
2.2.1 Supervised learning
2.2.2 Unsupervised learning
2.2.3 Reinforcement learning
2.2.4 Semi-supervised learning
2.3 Categorizing based on required Output
2.3.1 Classification
2.3.2 Regression
2.3.3 Clustering

2.1 Machine Learning:


2.1.1 Definition of Machine Learning:
 Arthur Samuel, an early American leader in the field of computer
gaming and artificial intelligence, coined the term “Machine
Learning” in 1959 while at IBM.
 He defined machine learning as “the field of study that gives
computers the ability to learn without being explicitly programmed".
 Machine learning is programming computers to optimize a
performance criterion using example data or past experience. The
model may be predictive to make predictions in the future, or
descriptive to gain knowledge from data.
 Figure 1.1 shows the block diagram of Machine Learning.


Figure 1.1 Block diagram of Machine Learning


2.1.2 Definition of learning:
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its
performance at tasks T, as measured by P, improves with experience
E.

2.1.3 Examples
2.1.3.1 Handwriting recognition learning problem
 Task T: Recognizing and classifying handwritten words
within images
 Performance P: Percent of words correctly classified
 Training experience E: A dataset of handwritten words
with given classifications
2.1.3.2 A robot driving learning problem
 Task T: Driving on highways using vision sensors
 Performance P: Average distance traveled before an error
 Training experience E: A sequence of images and steering
commands recorded while observing a human driver

2.2 Classification of Machine Learning


 Machine learning implementations are classified into four major
categories, depending on the nature of the learning “signal” or “response”
available to a learning system which are as follows:
2.2.1 Supervised learning:
 Supervised learning is the machine learning task of learning a
function that maps an input to an output based on example input-
output pairs.


 The given data is labeled.


 Both classification and regression problems are supervised
learning problems.
 For example, the inputs could be camera images, each one
accompanied by an output saying “bus” or “pedestrian,” etc.
 An output like this is called a label.
 The agent learns a function that, when given a new image,
predicts the appropriate label.

2.2.2 Unsupervised learning:


 Unsupervised learning is a type of machine learning algorithm used
to draw inferences from datasets consisting of input data without
labeled responses.
 In unsupervised learning algorithms, classification or
categorization is not included in the observations.
 In unsupervised learning the agent learns patterns in the input
without any explicit feedback.
 The most common unsupervised learning task is clustering:
detecting potentially useful clusters of input examples.
 For example, when shown millions of images taken from the
Internet, a computer vision system can identify a large cluster of
similar images which an English speaker would call “cats.”

2.2.3 Reinforcement learning:


 In reinforcement learning the agent learns from a series of
reinforcements: rewards and punishments.
 Reinforcement learning is the problem of getting an agent to act in
the world so as to maximize its rewards.
 A learner is not told what actions to take as in most forms of
machine learning but instead must discover which actions yield the
most reward by trying them.


 For example, consider teaching a dog a new trick: we cannot tell
it what to do or what not to do, but we can reward/punish it if it
does the right/wrong thing.

2.2.4 Semi-supervised learning:


 Semi-Supervised learning is a type of Machine Learning algorithm
that represents the intermediate ground between Supervised and
Unsupervised learning algorithms.
 It uses the combination of labeled and unlabeled datasets during
the training period, where an incomplete training signal is given: a
training set with some of the target outputs missing.

2.3 Categorizing based on required Output


2.3.1 Classification:
 The Classification algorithm is a Supervised Learning technique
that is used to identify the category of new observations on the
basis of training data.
 In Classification, a program learns from the given dataset or
observations and then classifies new observation into a number
of classes or groups.
 Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
Classes can be called as targets/labels or categories.

2.3.2 Regression:
 Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to predict
the continuous output variable based on the one or more predictor
variables.
 It is mainly used for prediction, forecasting, time series modeling,
and determining the causal-effect relationship between variables.


2.3.3 Clustering:
 Clustering or cluster analysis is a machine learning technique,
which groups the unlabeled dataset.
 It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points.
 The objects with the possible similarities remain in a group that
has less or no similarities with another group."

3.Explain in detail about Vapnik-Chervonenkis (VC) dimension.

Vapnik-Chervonenkis (VC) dimension


3.1 Vapnik-Chervonenkis (VC) dimension
3.1.1 Bounds of VC – Dimension
3.1.2 Shattering of a set

3.1 Vapnik-Chervonenkis (VC) dimension


 The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity
of a hypothesis set to fit different data sets.
 The VC dimension is a measure of the complexity of a model, which can
help us understand how well it can fit different data sets.
 The VC dimension of a hypothesis set H is the largest number of points
that can be shattered by H.
 A hypothesis set H shatters a set of points S if, for every possible labeling
of the points in S, there exists a hypothesis in H that correctly classifies
the points. In other words, a hypothesis set shatters a set of points if it
can fit any possible labeling of those points.
3.1.1 Bounds of VC – Dimension
The VC dimension provides both upper and lower bounds on the number of
training examples required to achieve a given level of accuracy; both bounds
grow roughly linearly with the VC dimension.


3.1.2 Shattering of a set


Let D be a dataset containing N examples for a binary classification problem
with class labels 0 and 1. Let H be a hypothesis space for the problem. Each
hypothesis h in H partitions D into two disjoint subsets as follows:
{x ∈ D ∣ h(x) = 0} and {x ∈ D ∣ h(x) = 1}.
Such a partition of D is called a "dichotomy" of D. It can be shown that there
are 2^N possible dichotomies of D. To each dichotomy of D there is a unique
assignment of the labels "1" and "0" to the elements of D. Conversely, if S is any
subset of D, then S defines a unique hypothesis h as follows:

h(x) = 1 if x ∈ S, and h(x) = 0 otherwise.

Thus, to specify a hypothesis h, we need only specify the set {x ∈ D ∣ h(x) = 1}.
Figure 1.2 shows all possible dichotomies of D if D has three elements. In the figure,
we have shown only one of the two sets in a dichotomy, namely the set {x ∈ D ∣ h(x) = 1}. The
circles and ellipses represent such sets.

Figure 1.2 Different forms of the set {x ∈ D ∣ h(x) = 1} for D = {a, b, c}


Definition A set of examples D is said to be shattered by a hypothesis space H if and
only if for every dichotomy of D there exists some hypothesis in H consistent with the
dichotomy of D.
The following example illustrates the concept of Vapnik-Chervonenkis dimension.
Example In figure 1.3, we see that an axis-aligned rectangle can shatter four points in two
dimensions. Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two
dimensions, is four. In calculating the VC dimension, it is enough that we find four points


that can be shattered; it is not necessary that we be able to shatter any four points in two
dimensions.

Figure 1.3 An axis-aligned rectangle can shatter four points. Only rectangles
covering two points are shown.
VC dimension may seem pessimistic. It tells us that using a rectangle as our
hypothesis class, we can learn only datasets containing four points and not more.

4.Explain in detail Probably Approximately Correct (PAC) learning.


Probably Approximately Correct (PAC) learning
 PAC (Probably Approximately Correct) learning is a theoretical
framework in machine learning that deals with the learning of
hypotheses from a given set of data.
 It provides guarantees on the generalization performance of a learning
algorithm by bounding the error between the learned hypothesis and
the true hypothesis.
 The goal is that, with high probability (the "probably" part), the selected
function will have low generalization error (the "approximately correct"
part).
 In probably approximately correct (PAC) learning, given a class, C, and
examples drawn from some unknown but fixed probability distribution,
p(x), we want to find the number of examples, N, such that with
probability at least 1 − δ, the hypothesis h has error at most ε, for
arbitrary δ ≤ 1/2 and ε> 0
P{CΔh ≤ ε} ≥ 1 – δ
where CΔh is the region of difference between C and h.


 As an example, suppose the concept C is an axis-aligned rectangle and our
hypothesis h = S is the tightest rectangle containing the positive examples. In this
case, the error region between C and h = S is the sum of four rectangular strips
(see figure 1.4).
 We would like to make sure that the probability of a positive example falling
in here (and causing an error) is at most ε.
 For any of these strips, if we can guarantee that the probability is upper
bounded by ε /4, the error is at most 4(ε /4) = ε.
 Note that we count the overlaps in the corners twice, and the total actual error
in this case is less than 4(ε /4).
 The probability that a randomly drawn example misses this strip is 1 − ε /4.
 The probability that all N independent draws miss the strip is (1 − ε/4)^N, and
the probability that all N independent draws miss any of the four strips is at
most 4(1 − ε/4)^N, which we would like to be at most δ. We have the inequality
(1 − x) ≤ exp(−x)
So if we choose N and δ such that we have
4 exp(−εN/4) ≤ δ
we can also write 4(1 − ε/4)^N ≤ δ. Dividing both sides by 4, taking the (natural) log and
rearranging terms, we have
N ≥ (4/ε) log(4/δ)

Figure 1.4 The difference between h and C is the sum of four rectangle
strips, one of which is shaded.
Therefore, provided that we take at least (4/ ε) log(4/δ) independent examples
from C and use the tightest rectangle as our hypothesis h, with confidence probability
at least 1 − δ, a given point will be misclassified with error probability at most ε. We
can have arbitrary large confidence by decreasing δ and arbitrary small error by


decreasing ε, and we see from the inequality above that the number of examples is a
slowly growing function of 1/ε and 1/δ, linear and logarithmic, respectively.
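Plugging numbers into the bound derived above, N ≥ (4/ε) log(4/δ), gives concrete sample sizes; a small sketch:

```python
import math

def pac_sample_size(eps, delta):
    # N >= (4/eps) * log(4/delta) for the tightest-rectangle hypothesis.
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# e.g. error at most 0.1 with probability at least 0.95 (delta = 0.05)
print(pac_sample_size(eps=0.1, delta=0.05))   # 176 examples suffice
```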

5. Explain in detail Hypothesis spaces in Machine Learning.


Hypothesis spaces in Machine Learning
5.1 Hypothesis
5.2 Hypothesis in Machine Learning (ML)
5.2.1 Hypothesis space (H)
5.2.2 Hypothesis (h)
5.2.3 Hypothesis in Statistics
5.1 Hypothesis
A hypothesis space is a collection of possible functions or classifiers that a
learning algorithm can choose from to make predictions or decisions based on input
data.
5.2 Hypothesis in Machine Learning (ML)
The hypothesis is one of the commonly used concepts of statistics
in Machine Learning. It is specifically used in Supervised Machine
learning, where an ML model learns a function that best maps the input
to corresponding outputs with the help of an available dataset as shown
in figure 1.5.

Figure 1.5 Hypothesis spaces in Machine Learning


In supervised learning techniques, the main aim is to determine the


possible hypothesis out of hypothesis space that best maps input to the
corresponding or correct outputs.
There are some common methods given to find out the possible hypothesis
from the Hypothesis space, where hypothesis space is represented by uppercase-
h (H) and hypothesis by lowercase-h (h).
5.2.1 Hypothesis space (H):
Hypothesis space is defined as a set of all possible legal hypotheses;
hence it is also known as a hypothesis set. It is used by supervised machine
learning algorithms to determine the best possible hypothesis to describe the
target function or best maps input to output.

5.2.2 Hypothesis (h)


A hypothesis is a function that best describes the target in
supervised machine learning. The hypothesis that an algorithm would come
up with depends upon the data and also upon the restrictions and bias that
we have imposed on the data.
The Hypothesis can be calculated as:
y= mx + b
Where,
 y = range
 m = slope of the line
 x = domain
 b = intercept
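A small sketch of this idea, with invented data and an invented finite set of candidate lines: the hypothesis space H is a set of (m, b) pairs, and the learner picks the single hypothesis h that fits the data best.

```python
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]     # (x, y) training pairs

# Hypothesis space H: a few candidate lines y = m*x + b.
H = [(m, b) for m in (0.5, 1.0, 2.0) for b in (-1.0, 0.0, 1.0)]

def squared_error(m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in data)

best_h = min(H, key=lambda h: squared_error(*h))
print(best_h)   # the chosen hypothesis h from H, here (2.0, 0.0)
```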
To better understand the Hypothesis Space and Hypothesis consider the
following coordinate that shows the distribution of some data:

Figure 1.6 distribution of data to understand the hypothesis space


Say suppose we have test data for which we have to determine the outputs or results. The
test data is as shown in figure 1.7.

Figure 1. 7 test data to determine the output


We can predict the outcomes by dividing the coordinate as shown in figure 1.8.

Figure 1.8 predicts the outcomes by dividing the coordinate


So the test data would yield the following result shown in figure 1.9.

Figure 1.9 the test data would yield this result


But note here that we could also have divided the coordinate plane in other ways.
The way in which the coordinate plane is divided
depends on the data, algorithm and constraints.


 All these legal possible ways in which we can divide the
coordinate plane to predict the outcome of the test data
compose the Hypothesis Space.
 Each individual possible way is known as a hypothesis.
 Hence, in this example the hypothesis space is as shown in
figure 1.10.

Figure 1.10 example the hypothesis space


5.2.3 Hypothesis in Statistics
In statistics, a hypothesis refers to a statement or assumption about
a population parameter. It is a proposition or educated guess that helps
guide statistical analyses. There are two types of hypotheses: the null
hypothesis (H0) and the alternative hypothesis (H1 or Ha).
 Null Hypothesis(H0): This hypothesis suggests that there is no
significant difference or effect, and any observed results are due to
chance. It often represents the status quo or a baseline assumption.
 Alternative Hypothesis(H1 or Ha): This hypothesis contradicts the
null hypothesis, proposing that there is a significant difference or
effect in the population. It is what researchers aim to support with
evidence.


6. Write about Inductive bias.


Inductive bias
 Inductive bias is a key component of any machine learning model. It
guides the model's learning process and influences its ability to make
predictions on new data.
 The inductive bias is what allows a model to choose one hypothesis over
another when multiple hypotheses are consistent with the observed
training data.
 Without any inductive bias, a learning algorithm would not be able to
generalize beyond the specific training examples it has seen, making it
incapable of predicting outcomes for new instances.
 Different learning algorithms have different inductive biases, which affect
their performance and suitability for various types of problems. Some
common examples of inductive biases include:
1. Occam’s razor: The preference for simpler hypotheses over more
complex ones when both are consistent with the observed data. This bias is
based on the principle that simpler models are more likely to generalize well
to new data.
2. Smoothness: The assumption that nearby points in the input space
are likely to have similar outputs. This bias is often used in regression and
classification problems and is the basis for kernel methods and radial basis
function networks.
3. Feature selection: The assumption that only a subset of input
features is relevant to the target variable. This bias is used in many feature
selection algorithms, which aim to identify and use the most informative
features for learning.
4. Hierarchical structure: The assumption that the underlying
structure of the data can be represented by a hierarchy of concepts or
features. This bias is present in deep learning models, which learn
hierarchical representations of the data through multiple layers of
abstraction.
 Inductive bias is a crucial aspect of machine learning, as it influences
a model’s ability to generalize to new data. Selecting an appropriate


inductive bias for a specific problem is an essential step in the design


of effective machine learning algorithms.

7. What is Generalization? Explain in detail.


Generalization defines the ability of an ML model to provide a suitable output by
adapting to the given set of unknown inputs. It means after providing training on the dataset,
it can produce reliable and accurate output.
General Terms:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of
the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference between
the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training dataset,
but does not perform well with the test dataset, then variance occurs.
7.1 Overfitting
 Overfitting occurs when our machine learning model tries to cover
all the data points or more than the required data points present in
the given dataset. Because of this, the model starts caching noise
and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. The overfitted model
has low bias and high variance.
 The chances of overfitting increase the more training we provide
to our model: the more we train the model, the higher the chance
of producing an overfitted model.
 Overfitting is the main problem that occurs in supervised learning.
 Example: The concept of overfitting can be understood from the
graph of a linear regression output shown in figure 1.11.


Figure 1.11 Overfitting graph of the linear regression


 As we can see from the above graph, the model tries to cover all the
data points present in the scatter plot. It may look efficient, but in
reality, it is not so, because the goal of the regression model is to find
the best-fit line; here we have not got a best fit, so it will
generate prediction errors.
How to avoid the Overfitting in Model?
Both overfitting and underfitting cause the degraded performance of the
machine learning model. But the main cause is overfitting, so there are some
ways by which we can reduce the occurrence of overfitting in our model.
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
7.2 Underfitting
 Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data. To avoid overfitting in
the model, the feeding of training data can be stopped at an early stage,
due to which the model may not learn enough from the training data.
As a result, it may fail to find the best fit of the dominant trend in the
data.
 In the case of underfitting, the model is not able to learn enough from
the training data, and hence it reduces the accuracy and produces
unreliable predictions.


 An underfitted model has high bias and low variance.


 Example: We can understand the underfitting using below output of
the linear regression model as shown in the figure 1.12.

Figure 1.12 underfitting in linear regression model


 As we can see from the above diagram, the model is unable to
capture the data points present in the plot.
How to avoid underfitting?
o By increasing the training time of the model.
o By increasing the number of features.
7.3 Goodness of Fit
 The "Goodness of fit" term is taken from the statistics, and the goal
of the machine learning models to achieve the goodness of fit. In
statistics modeling, it defines how closely the result or predicted
values match the true values of the dataset.
 The model with a good fit is between the underfitted and overfitted
model, and ideally, it makes predictions with 0 errors, but in
practice, it is difficult to achieve it.
 As we train our model for a time, the errors in the training
data go down, and the same happens with the test data. But if we train
the model for too long, then the performance of the model
may decrease due to overfitting, as the model also learns the
noise present in the dataset.


 The errors in the test dataset then start increasing, so the point just
before the test error starts to rise is the good point, and we can stop
there to achieve a good model.

Figure 1.13
There are two other methods by which we can get a good point for our model, which
are the resampling method to estimate model accuracy and validation dataset.
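A minimal sketch of the over/underfitting behaviour discussed in this answer, assuming invented data and polynomial models of different degrees; comparing training and test errors shows the underfit, good-fit and overfit regimes:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

def errors(degree):
    coeffs = np.polyfit(x_train, y_train, degree)              # fit a polynomial
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 3, 9):
    print(degree, errors(degree))
# degree 1: both errors high                           -> underfitting (high bias)
# degree 3: both errors low                            -> good fit
# degree 9: train error very low, test error typically
#           much higher                                -> overfitting (high variance)
```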

8. Explain in detail about Bias variance trade-off.


Bias variance trade-off
8.1 Errors in Machine Learning
8.1.1 Reducible errors
8.1.2 Irreducible errors
8.2 What is Bias?
8.2.1 Low Bias
8.2.2 High Bias
8.3 What is a Variance Error?
8.3.1 Low variance
8.3.2 High variance
8.3.3 Different Combinations of Bias-Variance
8.3.3.1 Low-Bias, Low-Variance
8.3.3.2 Low-Bias, High-Variance
8.3.3.3 High-Bias, Low-Variance
8.3.3.4 High-Bias, High-Variance


8.1 Errors in Machine Learning


 In machine learning, an error is a measure of how accurately an algorithm
can make predictions for the previously unknown dataset.
 On the basis of these errors, the machine learning model is selected that
can perform best on the particular dataset.
 There are mainly two types of errors in machine learning, which are:
o Reducible errors: These errors can be reduced to improve the
model accuracy. Such errors can further be classified into bias and
Variance.

Figure 1.14
o Irreducible errors: These errors will always be present in the model
regardless of which algorithm has been used. The cause of these
errors is unknown variables whose value can't be reduced.
8.2 What is Bias?
 In general, a machine learning model analyses the data, find patterns
in it and make predictions. While training, the model learns these
patterns in the dataset and applies them to test data for prediction.
 While making predictions, a difference occurs between prediction
values made by the model and actual values/expected values, and this
difference is known as bias errors or Errors due to bias.
 Bias can be defined as an inability of machine learning algorithms such
as Linear Regression to capture the true relationship between the data


points. Each algorithm begins with some amount of bias because bias
occurs from assumptions in the model, which makes the target
function simple to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about
the form of the target function.
o High Bias: A model with a high bias makes more assumptions,
and the model becomes unable to capture the important features
of our dataset. A high bias model also cannot perform well on
new data.
 Generally, a linear algorithm has a high bias, as this makes it learn
fast. The simpler the algorithm, the more bias is likely to be
introduced, whereas a nonlinear algorithm often has low bias.
 Some examples of machine learning algorithms with low bias are
Decision Trees, k-Nearest Neighbours and Support Vector
Machines. At the same time, an algorithm with high bias is Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs due to a very simple model. Below are some
ways to reduce high bias:
o Increase the input features as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, such as including some polynomial
features.
8.3 What is a Variance Error?
 The variance specifies the amount of variation in the prediction if
different training data were used. In simple words, variance tells
how much a random variable differs from its expected
value. Ideally, a model should not vary too much from one training dataset
to another, which means the algorithm should be good at understanding
the hidden mapping between input and output variables. Variance errors
are either of low variance or high variance.

8.3.1 Low variance


 Low variance means there is a small variation in the prediction of the target
function with changes in the training data set.
 At the same time, High variance shows a large variation in the prediction
of the target function with changes in the training dataset.
 A model that shows high variance learns a lot and performs well with the
training dataset, but does not generalize well with the unseen dataset.
 As a result, such a model gives good results with the training dataset but
shows high error rates on the test dataset.
 Since, with high variance, the model learns too much from the dataset, it
leads to overfitting of the model. A model with high variance has the below
problems:
o A high variance model leads to overfitting.
o Increase model complexities.
o Usually, nonlinear algorithms, which have a lot of flexibility to fit
the model, have high variance.

Figure 1.15
o Some examples of machine learning algorithms with low
variance are Linear Regression, Logistic Regression,
and Linear Discriminant Analysis. At the same time,
algorithms with high variance are Decision Tree, Support
Vector Machine, and K-Nearest Neighbours.
Ways to Reduce High Variance:
o Reduce the input features or the number of parameters as the
model is overfitted.
o Do not use a much more complex model.
o Increase the training data.


o Increase the Regularization term.


 Different Combinations of Bias-Variance
There are four possible combinations of bias and variance, which
are represented as shown in figure 1.16.

Figure 1.16

1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning
model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions
are inconsistent but accurate on average. This case occurs when the model
learns with a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not
learn well with the training dataset or uses a small number of parameters. It
leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also
inaccurate on average.
How to identify High variance or High Bias?
High variance can be identified if the model has:


o Low training error and high test error.


High Bias can be identified if the model has:
o High training error and the test error is almost similar to training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias
and variance in order to avoid overfitting and underfitting in the model. If the model is
very simple with fewer parameters, it may have low variance and high bias. Whereas, if
the model has a large number of parameters, it will have high variance and low bias.
So, it is required to make a balance between bias and variance errors, and this balance
between the bias error and variance error is known as the Bias-Variance trade-off.

For an accurate prediction of the model, algorithms need a low variance and low bias.
But this is not possible because bias and variance are related to each other:
o If we decrease the variance, it will increase the bias.
o If we decrease the bias, it will increase the variance.


Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a
model that accurately captures the regularities in the training data and simultaneously
generalizes well to the unseen dataset. Unfortunately, doing both at once is not possible,
because a high-variance algorithm may perform well with training data but may overfit
noisy data, whereas a high-bias algorithm generates a much simpler model that may not
even capture important regularities in the data. So, we need to find a sweet spot between
bias and variance to make an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a
balance between bias and variance errors.
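
This trade-off can be seen in a small experiment. The sketch below is only an illustration and is not part of the original notes; the synthetic dataset, the noise level and the polynomial degrees are assumed values. A degree-1 model underfits (high bias: both errors are high), while a degree-15 model overfits (high variance: low training error but high test error).

# Illustrative sketch of the bias-variance trade-off (assumed toy data)
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(degree, round(train_err, 4), round(test_err, 4))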



DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / IV SEM
AL3451- MACHINE LEARNING

UNIT II SUPERVISED LEARNING

SYLLABUS:
Linear Regression Models: Least squares, single & multiple variables,
Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function - Perceptron algorithm - Probabilistic
discriminative model - Logistic regression, Probabilistic generative
model – Naive Bayes, Maximum margin classifier – Support vector
machine, Decision Tree, Random forests
PART A
1. Define Regression.

 Regression is a supervised learning technique which helps in finding the


correlation between variables and enables us to predict the continuous
output variable based on the one or more predictor variables.
 It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.

2. Define Classification.

 The Classification algorithm is a Supervised Learning technique that is


used to identify the category of new observations on the basis of training
data.
 In Classification, a program learns from the given dataset or observations
and then classifies new observation into a number of classes or groups.
 Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called as targets/labels or categories.

3. Define Clustering.
 Clustering or cluster analysis is a machine learning technique, which
groups the unlabeled dataset.


 It can be defined as "A way of grouping the data points into different
clusters, consisting of similar data points.
 The objects with the possible similarities remain in a group that has
less or no similarities with another group."

4. What is Linear Regression?


 In statistics, linear regression is a linear approach to modeling the
relationship between a dependent variable and one or more independent
variables.
 Let X be the independent variable and Y be the dependent variable.
 A linear relationship between these two variables is as follows:

Y = mX + c

Where,
m: Slope
c: y-intercept

5. What is Least Squares Regression Line?

 Least squares are a commonly used method in regression analysis


for estimating the unknown parameters by creating a model which
will minimize the sum of squared errors between the observed data
and the predicted data.

6. Narrate Least Squares Regression Equation

 The equation that minimizes the total of all squared prediction
errors for known Y scores in the original correlation analysis is:

Y´ = bX + a

where
Y´ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis


7. List and define the types of Linear Regression.

It is of two types: Simple and Multiple.


 Simple Linear Regression is where only one independent variable is
present and the model has to find the linear relationship of it with the
dependent variable
 Equation of Simple Linear Regression: y = b0 + b1x, where b0 is the intercept, b1 is the
coefficient or slope, x is the independent variable and y is the dependent
variable.

o In Multiple Linear Regression there are more than one independent


variables for the model to find the relationship.
Equation of Multiple Linear Regression: y = b0 + b1x1 + b2x2 + ... + bnxn, where b0 is the intercept,
b1, b2, b3, ..., bn are coefficients or slopes of the independent variables
x1, x2, x3, ..., xn and y is the dependent variable.

8. Define Linear Regression Model.

A Linear Regression model’s main aim is to find the best fit linear
line and the optimal values of intercept and coefficients such that the error is
minimized.

9. What is error or residual?

 Error is the difference between the actual value and Predicted value and the
goal is to reduce this difference.
 The vertical distance between the data point and the regression line is known
as error or residual.
 Each data point has one residual and the sum of all the differences is known
as the Sum of Residuals/Errors.

10. Define Bayesian Regression.

 Bayesian Regression is used when the data is insufficient in the dataset or


the data is poorly distributed.
 The output of a Bayesian Regression model is obtained from a probability
distribution.


 The aim of Bayesian Linear Regression is to find the ‘posterior‘


distribution for the model parameters.
 The expression for Posterior is:

Posterior = (Likelihood × Prior) / Evidence, i.e. P(H | E) = P(E | H) · P(H) / P(E)
where
o Posterior: It is the probability of an event to occur; say, H, given that
another event; say, E has already occurred. i.e., P(H | E).
o Prior: It is the probability of an event H has occurred prior to another
event. i.e., P(H)
o Likelihood: It is a likelihood function in which some parameter variable
is marginalized.

11. List the Advantages and Disadvantages of Bayesian Regression.

Advantages of Bayesian Regression:

 Very effective when the size of the dataset is small.

 Particularly well-suited for on-line based learning (data is


received in real-time), as compared to batch based learning,
where we have the entire dataset on our hands before we start
training the model. This is because Bayesian Regression doesn’t
need to store data.

 The Bayesian approach is a tried and tested approach and is very


robust, mathematically. So, one can use this without having any
extra prior knowledge about the dataset.

Disadvantages of Bayesian Regression:


 The inference of the model can be time-consuming.

 If there is a large amount of data available for our dataset, the


Bayesian approach is not worth it.

12. What are the two types of Classifications problem?

 Two-class problems:

o Binary representation or Binary Classifier:

o If the classification problem has only two possible outcomes,


then it is called as Binary Classifier.

o There is a single target variable t ∈ {0, 1}


 t = 1 represents class C1
 t = 0 represents class C2
o Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM,
CAT or DOG, etc.

o Multi-class Problems:
o If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
o Example: Classifications of types of crops, Classification of types
of music.

13. List the different Types of ML Classification Algorithms.


 Logistic Regression
 K-Nearest Neighbors
 Support Vector Machines
 Kernel SVM
 Naïve Bayes
 Decision Tree Classification
 Random Forest Classification

14. What is Discriminant function?


 A function of a set of variables that is evaluated for samples of events or
objects and used as an aid in discriminating between or classifying them.
 A discriminant function (DF) maps independent (discriminating) variables
into a latent variable D.
 DF is usually postulated to be a linear function:
D = a0 + a1x1 + a2x2 + ... + aNxN

15. Define Probabilistic discriminative models.


 Discriminative models are a class of supervised machine learning
models which make predictions by estimating conditional
probability P(y|x).
 For the two-class classification problem, the posterior probability of class
C1 can be written as a logistic sigmoid acting on a linear function of x:

p(C1 | x) = σ(wᵀx + w0)

 For the multi-class case, the posterior probability of class Ck is given by a
softmax transformation of a linear function of x:

p(Ck | x) = exp(ak) / Σj exp(aj), where ak = wkᵀx + wk0


16. Define Logistics Regression


 Logistic regression is a Machine Learning algorithm that comes under the
classification algorithms of the Supervised Learning technique.
 Logistic regression is used to describe data and the relationship between
one dependent variable and one or more independent variables.
 The independent variables can be nominal, ordinal, or of interval type.
 Logistic regression predicts the output of a categorical dependent
variable.
 Therefore the outcome must be a categorical or discrete value.

17. Define Logistic Function or Sigmoid Function.


 The logistic function is also known as the sigmoid function.
 The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
 The value of the logistic regression must be between 0 and 1, so it forms
a curve like the "S" form.
 The S-form curve is called the Sigmoid function or the logistic function.

18. List the types of Logistic Regression.


 Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or
more possible unordered types of the dependent variable, such as
"cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as "low",
"Medium", or "High".

19. What are the Steps in Logistic Regression?


 To implement the Logistic Regression using Python, the steps are given
below:
 Data Pre-processing step
 Fitting Logistic Regression to the Training set
 Predicting the test result
 Test accuracy of the result


 Visualizing the test set result.

20. List the advantages of Logistic Regression Algorithm.


 Logistic regression performs better when the data is linearly separable
 It does not require too many computational resources
 There is no problem scaling the input features
 It is easy to implement and train a model using logistic regression

21. Define Probabilistic Generative model


 Given a model of one conditional probability, and estimated probability
distributions for the variables X and Y, denoted P(X) and P(Y), we can
estimate the opposite conditional probability using Bayes' rule:

P(Y | X) = P(X | Y) P(Y) / P(X)

 A generative model is a statistical model of the joint probability
distribution on a given observable variable X and target variable Y.
Given a generative model for P(X | Y), we can estimate:

P(Y | X) = P(X | Y) P(Y) / P(X)

22. Define Discriminative model.


 A discriminative model is a model of the conditional probability of the
target Y given an observation x. Given a discriminative model for
P(Y | X), we can estimate P(Y | X) directly from the observation x.

 Classifier based on a generative model is a generative classifier, while a


classifier based on a discriminative model is a discriminative classifier

23. List the types of Generative models.


 Types of generative models are:
 Naive Bayes classifier or Bayesian network
 Linear discriminant analysis

24. Mention the algorithms in Discriminative models.


 Logistic regression
 Support Vector Machines
 Decision Tree Learning
 Random Forest
25. Define Support Vector Machine (SVM)


 Support Vector Machine(SVM) is a supervised machine learning


algorithm used for both classification and regression.
 The objective of SVM algorithm is to find a hyperplane in an N-
dimensional space that distinctly classifies the data points.
 Hyperplanes are decision boundaries that help classify the data points.

26. Define Hinge loss function

 The cost is 0 if the predicted value and the actual value are of the same
sign. If they are not, then calculate the loss value.

27. Define SVM Kernel.


 The SVM kernel is a function that takes low dimensional input space
and transforms it into higher-dimensional space, ie it converts non-
separable problem to separable problem. It is mostly useful in non-
linear separation problems.

28. List the types of SVMs


 There are two different types of SVMs, each used for different things:
o Simple SVM: Typically used for linear regression and classification
problems.
o Kernel SVM: Has more flexibility for non-linear data.

29. What are the advantages and disadvantages of SVM?


Advantages
 Effective on datasets with multiple features, like financial or medical
data.
 Effective in cases where number of features is greater than the number
of data points.
 Its memory efficient as it uses a subset of training points in the decision
function called support vectors
 Different kernel functions can be specified for the decision functions and
its possible to specify custom kernels
Disadvantages
 If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.


 SVMs don't directly provide probability estimates. Those are calculated


using an expensive five-fold cross-validation.
 Works best on small sample sets because of its high training time.
30. List the applications of SVM.
SVMs can be used to solve various real-world problems:
 SVMs are helpful in text and hypertext categorization.
 Classification of images can also be performed using SVMs.
 Classification of satellite data like SAR data using supervised SVM.
 Hand-written characters can be recognized using SVM.
 The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.

31. Define Decision Tree


 Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems.
 It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
 In a Decision tree, there are two nodes, the Decision Node and Leaf
Node.

32. List the types of Decision Trees


1. Categorical Variable Decision Tree: Decision Tree which has a categorical
target variable then it called a Categorical variable decision tree.
2. Continuous Variable Decision Tree: Decision Tree has a continuous target
variable then it is called Continuous Variable Decision Tree.

33. Mention the reason for using Decision Trees


 Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
 The logic behind the decision tree can be easily understood because it
shows a tree-like structure.

34. List the algorithms used to construct Decision Trees:


 ID3 → (extension of D3)
 C4.5 → (successor of ID3)
 CART → (Classification And Regression Tree)
 CHAID → (Chi-square automatic interaction detection Performs multi-level
splits when computing classification trees)
 MARS → (multivariate adaptive regression splines)


PART B

1. Explain in detail about Linear Regression Models. Or Explain Linear


Regression Models: Least squares, single & multiple variables, Bayesian
linear regression, gradient descent.

LINEAR REGRESSION MODELS


1.1 Linear Regression
1.2 Least Squares Regression Line
1.2.1 Least Squares Regression Equation
1.2.2 Least Squares Regression in Python
1.3 Types of Linear Regression
1.4 Linear Regression Model
1.5 Bayesian Regression
1.5.1 Implementation Of Bayesian Regression Using Python
1.6 Gradient descent
1.6.1 Cost Function
1.6.2 Gradient Descent Algorithm.

1.1 Linear Regression


 In statistics, linear regression is a linear approach to modeling the
relationship between a dependent variable and one or more
independent variables.
 Let X be the independent variable and Y be the dependent variable.
 A linear relationship between these two variables is as follows:

Y = mX + c

Where,
m: Slope
c: y-intercept
 Linear regression algorithm shows a linear relationship between a
dependent (y) and one or more independent (x) variables, hence called
as linear regression.
 Linear regression finds how the value of the dependent variable is
changing according to the value of the independent variable.
 The linear regression model provides a sloped straight line
representing the relationship between the variables.
 Consider the below Figure 2.1, which represents the relationship
between independent and dependent variables


Figure 2.1 – Relationship between independent and dependent variables

1.2 Least Squares Regression Line


 Least squares are a commonly used method in regression analysis
for estimating the unknown parameters by creating a model which
will minimize the sum of squared errors between the observed data
and the predicted data.
1.2.1 Least Squares Regression Equation
 The equation that minimizes the total of all squared prediction
errors for known Y scores in the original correlation analysis is:

Y´ = bX + a

where
Y´ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
1.2.2 Least Squares Regression in Python
Scenario: A rocket motor is manufactured by combining an igniter
propellant and a sustainer propellant inside a strong metal housing. It
was noticed that the shear strength of the bond between two propellers
is strongly dependent on the age of the sustainer propellant.
Problem Statement: Implement a simple linear regression algorithm
using Python to build a machine learning model that studies the
relationship between the shear strength of the bond between two
propellers and the age of the sustainer propellant.
Step 1: Import the required Python libraries.
# Importing Libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Next step is to read and load the dataset.


# Loading dataset
data = pd.read_csv('PropallantAge.csv')
data.head()
data.info()

Step 3: Create a scatter plot just to check the relationship between


these two variables.
# Plotting the data
plt.scatter(data['Age of Propellant'], data['Shear Strength'])

Step 4: Next step is to assign X and Y as independent and dependent


variables respectively.
# Computing X and Y
X = data['Age of Propellant'].values
Y = data['Shear Strength'].values

Step 5: Compute the mean of variables X and Y to determine the values


of slope (m) and y-intercept.
Also, let n be the total number of data points.
# Mean of variables X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)
# Total number of data values
n = len(X)

Step 6: Calculate the slope and the y-intercept using the formulas
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)


# Printing coefficients
print("Coefficients")
print(m, c)

The above step has given the values of m and c.
Substituting them,
Shear Strength = 2627.822359001296 + (-37.15359094490524) * Age of Propellant

Step 7: The above equation represents the linear regression model.


Let’s plot this graphically. Refer fig 2.2
# Plotting Values and Regression Line
maxx_x = np.max(X) + 10
minn_x = np.min(X) - 10

# line values for x and y


x = np.linspace(minn_x, maxx_x, 1000)
y = c + m * x
# Plotting Regression Line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Plotting Scatter Points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Age of Propellant (in years)')
plt.ylabel('Shear Strength')
plt.legend()
plt.show()
Output:

Figure 2.2 – Example for Regression Line


1.3 Types of Linear Regression


It is of two types: Simple and Multiple.
o Simple Linear Regression is where only one independent variable is
present and the model has to find the linear relationship of it with the
dependent variable
Equation of Simple Linear Regression: y = b0 + b1x, where b0 is the intercept, b1 is the
coefficient or slope, x is the independent variable and y is the dependent
variable.

o In Multiple Linear Regression there are more than one independent


variables for the model to find the relationship.
Equation of Multiple Linear Regression: y = b0 + b1x1 + b2x2 + ... + bnxn, where b0 is the intercept,
b1, b2, b3, ..., bn are coefficients or slopes of the independent variables
x1, x2, x3, ..., xn and y is the dependent variable.

1.4 Linear Regression Model


 A Linear Regression model’s main aim is to find the best fit linear line
and the optimal values of intercept and coefficients such that the
error is minimized.
 Error is the difference between the actual value and Predicted value
and the goal is to reduce this difference.

Figure 2.3 – Example for Linear Regression Model

 In the above figure 2.3,


 x is our independent variable which is plotted on the x-axis and y is the
dependent variable which is plotted on the y-axis.
 Black dots are the data points i.e the actual values.


 bo is the intercept which is 10 and b1 is the slope of the x variable.


 The blue line is the best fit line predicted by the model i.e the predicted
values lie on the blue line.
 The vertical distance between the data point and the regression line is
known as error or residual.
 Each data point has one residual and the sum of all the differences is known
as the Sum of Residuals/Errors.

Mathematical Approach:
Residual/Error = Actual values – Predicted Values
Sum of Residuals/Errors = Sum(Actual – Predicted Values)
Sum of Squared Residuals/Errors = Sum((Actual – Predicted Values)²)
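
As a small numerical illustration (the data points and the candidate line below are assumed values, not from the notes), the residuals and the sum of squared errors can be computed directly:

import numpy as np

# Assumed data points and a candidate line y = b0 + b1*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_actual = np.array([12.2, 14.1, 15.9, 18.3, 20.1])
b0, b1 = 10.0, 2.0

y_pred = b0 + b1 * x              # predicted values on the line
residuals = y_actual - y_pred     # error/residual for each data point
sse = np.sum(residuals ** 2)      # sum of squared residuals/errors
print(residuals, sse)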

1.5 Bayesian Regression


 Bayesian Regression is used when the data is insufficient in the dataset or
the data is poorly distributed.
 The output of a Bayesian Regression model is obtained from a probability
distribution.
 The aim of Bayesian Linear Regression is to find the ‘posterior‘
distribution for the model parameters.
 The expression for Posterior is:

Posterior = (Likelihood × Prior) / Evidence, i.e. P(H | E) = P(E | H) · P(H) / P(E)
where
o Posterior: It is the probability of an event to occur; say, H, given that
another event; say, E has already occurred. i.e., P(H | E).
o Prior: It is the probability of an event H has occurred prior to another
event. i.e., P(H)
o Likelihood: It is a likelihood function in which some parameter
variable is marginalized.
 The Bayesian Ridge Regression prior over the coefficients is as follows:
p(w|λ) = N(w | 0, λ^-1 Ip)
where
 the vector "w" is made up of the elements w0, w1, ...,
 lambda (λ) is the shape (precision) parameter of the prior distribution over w.


1.5.1 Implementation Of Bayesian Regression Using Python


 Boston Housing dataset, which includes details on the average price of
homes in various Boston neighborhoods.
 The r2 score will be used for evaluation.
 The crucial components of a Bayesian Ridge Regression model:

Program
# Note: load_boston is available only in scikit-learn versions earlier than 1.2
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge

# Loading the dataset


dataset = load_boston()
X, y = dataset.data, dataset.target

# Splitting the dataset into testing and training sets


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Creating to train the model


model = BayesianRidge()
model.fit(X_train, y_train)

# Model predicting the test data


prediction = model.predict(X_test)

# Evaluation of r2 score of the model against the test dataset


print(f"Test Set r2 score : {r2_score(y_test, prediction)}")

Output
Test Set r2 score : 0.7943355984883815

Advantages of Bayesian Regression:


 Very effective when the size of the dataset is small.
 Particularly well-suited for on-line based learning (data is received in
real-time), as compared to batch based learning, where we have the
entire dataset on our hands before we start training the model. This is
because Bayesian Regression doesn’t need to store data.


 The Bayesian approach is a tried and tested approach and is very


robust, mathematically. So, one can use this without having any extra
prior knowledge about the dataset.

Disadvantages of Bayesian Regression:


 The inference of the model can be time-consuming.
 If there is a large amount of data available for our dataset, the Bayesian
approach is not worth it.

1.6 Gradient descent


1.6.1 Cost Function
 The cost is the error in our predicted value.
 It is calculated using the Mean Squared Error function as shown in figure 2.4.

Figure 2.4 – Example for Cost function

 The goal is to minimize the cost as much as possible in order to find the
best fit line.

1.6.2 Gradient Descent Algorithm.


 Gradient descent is an optimization algorithm that finds the best-fit line


for a given training dataset in a smaller number of iterations.
 If m and c are plotted against MSE, it will acquire a bowl shape as shown
in figure 2.4a and figure 2.4b.

Figure 2.4a – Process of gradient descent algorithm

Figure 2.4b – Gradient Descent Shape

Learning Rate
 A learning rate is used for each pair of input and output values. It is a
scalar factor and coefficients are updated in direction towards minimizing
error.
 The process is repeated until a minimum sum squared error is achieved or
no further improvement is possible.

Step by Step Algorithm:

1. Initially, let m = 0 and c = 0, and choose a learning rate L, which controls
how much the values of "m" and "c" change with each step.
The smaller the L, the greater the accuracy; L = 0.001 gives good accuracy.
2. Calculate the partial derivative of the loss function with respect to "m" to get the


derivative Dm:

Dm = (-2/n) Σ xi (yi − ŷi), where ŷi = m·xi + c is the predicted value.

3. Similarly, find the partial derivative with respect to c to get Dc:

Dc = (-2/n) Σ (yi − ŷi)

4. Update the current values of m and c using the following equations:

m = m − L × Dm
c = c − L × Dc

5. Repeat this process until our Cost function is very small (ideally 0).
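
A minimal NumPy sketch of these steps (the example data, the learning rate and the iteration count below are assumptions for illustration, not values from the notes):

import numpy as np

# Assumed toy data lying on the line y = 2x + 1
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3, 5, 7, 9, 11], dtype=float)

m, c = 0.0, 0.0          # step 1: initial slope and intercept
L = 0.001                # learning rate
n = len(X)

for _ in range(100000):
    Y_pred = m * X + c                          # current predictions
    D_m = (-2 / n) * np.sum(X * (Y - Y_pred))   # step 2: partial derivative w.r.t. m
    D_c = (-2 / n) * np.sum(Y - Y_pred)         # step 3: partial derivative w.r.t. c
    m = m - L * D_m                             # step 4: update m
    c = c - L * D_c                             # step 4: update c

print(m, c)   # m and c approach the best-fit values (close to 2 and 1)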


2. Explain in detail about Linear Classification Models – Discriminant


function.

LINEAR CLASSIFICATION MODELS – DISCRIMINANT FUNCTION.


2.1 Linear Classification Models

2.2 Types of ML Classification Algorithms

2.3 Discriminant function

2.1 Linear Classification Models

 The Classification algorithm is a Supervised Learning technique that


is used to identify the category of new observations on the basis of
training data.
 In Classification, a program learns from the given dataset or
observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam,
cat or dog, etc.
 Classes can be called as targets/labels or categories.
 The output variable of Classification is a category, not a value, such
as "Green or Blue", "fruit or animal", etc.
 Since the Classification algorithm is a supervised learning
technique, hence it takes labeled input data, which means it
contains input with the corresponding output.
 In a classification algorithm, a discrete output function (y) is mapped
to the input variable (x):
y = f(x), where y = categorical output

 The best example of an ML classification algorithm is Email Spam


Detector.
 The goal of the classification algorithm is
o Take a D-dimensional input vector x
o Assign it to one of K discrete classes Ck , k = 1, . . . , K
 In the most common scenario, the classes are taken to be disjoint
and each input is assigned to one and only one class
 The input space is divided into decision regions
 The boundaries of the decision regions
o decision boundaries
o decision surfaces


 With linear models for classification, the decision surfaces are linear
functions of the input x. Classes that can be separated well by linear surfaces
are linearly separable.
 In the figure 2.5, there are two classes, class A and Class B.
 These classes have features that are similar to each other and
dissimilar to other classes.

Figure 2.5 – Example of Classification

 The algorithm which implements the classification on a dataset is


known as a classifier.
 There are two types of Classifications:
o Two-class problems :
o Binary representation or Binary Classifier:
o If the classification problem has only two possible outcomes,
then it is called as Binary Classifier.
o There is a single target variable t ∈ {0, 1}
o t = 1 represents class C1
o t = 0 represents class C2
o Examples: YES or NO, MALE or FEMALE, SPAM or NOT
SPAM, CAT or DOG, etc.
o Multi-class Problems:
o If a classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
o Example: Classifications of types of crops, Classification of
types of music.
o 1-of-K coding scheme
o There is a K-long target vector t such that, if the class is Cj, the
element tj of t is one and all other elements tk (k ≠ j) are zero.
tk can be interpreted as the probability that the class is Ck. For example, if K = 6
and the class is C4, then t = (0, 0, 0, 1, 0, 0)T


 The simplest approach to classification problems is through


construction of a discriminant function that directly assigns each
vector x to a specific class

2.2 Types of ML Classification Algorithms:


 Logistic Regression
 K-Nearest Neighbors
 Support Vector Machines
 Kernel SVM
 Naïve Bayes
 Decision Tree Classification
 Random Forest Classification

2.3Discriminant function
 A function of a set of variables that is evaluated for samples of events
or objects and used as an aid in discriminating between or classifying
them.
 A discriminant function (DF) maps independent (discriminating)
variables into a latent variable D.
 DF is usually postulated to be a linear function:
D = a0 + a1x1 + a2x2 + ... + aNxN
 The goal of discriminant analysis is to find such values of the
coefficients {ai, i=0,...,N} that the distance between the mean values of
DF is maximal for the two groups.
 Whenever there is a requirement to separate two or more classes
having multiple features efficiently, the Linear Discriminant Analysis
model is considered the most common technique to solve such
classification problems.
 For example, suppose there are classes with multiple features that need to be
separated efficiently. If we classify them using a single feature, then they
may show overlapping, as shown in figure 2.6.

Figure 2.6 – Example for Classification using single feature


 To overcome the overlapping issue in the classification process, we must
increase the number of features regularly.
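
As a small illustration (not part of the notes), scikit-learn's LinearDiscriminantAnalysis fits such a linear discriminant function D = a0 + a1x1 + ... + aNxN from labelled samples; the two-feature, two-class data below is an assumed toy example:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed two-class data with two features
X = np.array([[1.0, 2.1], [1.5, 1.8], [2.0, 2.5],    # class 0
              [6.0, 7.2], [6.5, 6.8], [7.0, 7.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)               # a1..aN and a0 of the fitted discriminant
print(lda.predict([[2.0, 2.0], [6.8, 7.0]]))   # expected output: [0 1]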


3. Explain in detail about Linear Discriminant Functions and its types.


Also elaborate about logistic regression in detail.

LINEAR DISCRIMINANT FUNCTIONS AND LOGISTIC REGRESSION


3.1 Linear Discriminant Functions
3.2 The Two-Category Case
3.3 The Multi-category Case
3.4 Generalized Linear Discriminant Functions
3.5 Probabilistic discriminative models
3.6 Logistics Regression
3.6.1 Logistic Function (Sigmoid Function)
3.6.2 Assumptions for Logistic Regression
3.6.3 Logistic Regression Equation
3.6.4 Type of Logistic Regression
3.6.5 Steps in Logistic Regression
3.6.6 Advantages of Logistic Regression Algorithm

3.1 Linear Discriminant Functions


A discriminant function that is a linear combination of the
components of x can be written as

(2.1)
where w is the weight vector and w0 the bias or threshold weight.

3.2 The Two-Category Case


 For a discriminant function of the form of eq.2.1, a two-category
classifier implements the following decision rule:
 Decide w1 if g(x)>0 and w2 if g(x)<0.
 Thus, x is assigned to w1 if the inner product wTx exceeds the
threshold – w0 and to w2 otherwise.
 If g(x)=0, x can ordinarily be assigned to either class, or can be left
undefined.
 The equation g(x)=0 defines the decision surface that separates
points assigned to w1 from points assigned to w2.
 When g(x) is linear, this decision surface is a hyperplane.
 If x1 and x2 are both on the decision surface, then

wᵀx1 + w0 = wᵀx2 + w0        (2.2)

or

wᵀ(x1 − x2) = 0        (2.3)


Figure 2.7: The linear decision boundary H separates the feature


space into two half-spaces.
 In figure 2.7, the hyperplane H divides the feature space into two
half-spaces:
o Decision region R1 for w1
o Decision region R2 for w2.
 The discriminant function g(x) gives an algebraic measure of the
distance from x to the hyperplane.
 The way to express x is

x = xp + r (w / ||w||)        (2.4)

where xp is the normal projection of x onto H, and r is the desired
algebraic distance which is positive if x is on the positive side and
negative if x is on the negative side. Then, because g(xp) = 0,


g(x) = wᵀx + w0 = r ||w||        (2.5)

or

r = g(x) / ||w||        (2.6)

o The distance from the origin to H is given by w0 / ||w||.


o If w0>0, the origin is on the positive side of H, and if w0<0, it is on the
negative side.

o If w0 = 0, then g(x) has the homogeneous form g(x) = wᵀx, and the
hyperplane passes through the origin.
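
A tiny NumPy sketch of this decision rule (the weight vector w, the bias w0 and the sample points below are assumed values for illustration):

import numpy as np

w = np.array([2.0, 1.0])   # assumed weight vector
w0 = -4.0                  # assumed bias / threshold weight

def g(x):
    return w @ x + w0      # g(x) = w'x + w0

for x in (np.array([3.0, 1.0]), np.array([0.5, 1.0])):
    label = "w1" if g(x) > 0 else "w2"     # decide w1 if g(x) > 0, otherwise w2
    r = g(x) / np.linalg.norm(w)           # signed distance from x to the hyperplane H
    print(x, g(x), label, r)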

3.3 The Multi-category Case


 To devise multi category classifiers employing linear discriminant
functions reduce the problem to c two-class problems.
 Defining c linear discriminant functions

gi(x) = wiᵀx + wi0,   i = 1, ..., c        (2.7)

and assigning x to wi if gi(x) > gj(x) for all j ≠ i; in case of ties, the
classification is left undefined.
 The resulting classifier is called a linear machine.

Figure 2.8: Decision boundaries defined by linear machines

 A linear machine divides the feature space into c decision regions
as shown in figure 2.8, with gi(x) being the largest discriminant
if x is in region Ri.
 If Ri and Rj are contiguous, the boundary between them is a
portion of the hyperplane Hij defined by

gi(x) = gj(x)

or

(wi − wj)ᵀx + (wi0 − wj0) = 0

It follows that wi − wj is normal to Hij, and the signed distance from x to Hij is
given by (gi(x) − gj(x)) / ||wi − wj||.

3.4 Generalized Linear Discriminant Functions


 The linear discriminant function g(x) can be written as

g(x) = w0 + Σ (i = 1 to d) wi xi        (2.8)

where the coefficients wi are the components of the weight
vector w.

Quadratic Discriminant Function

g(x) = w0 + Σ (i = 1 to d) wi xi + Σ (i = 1 to d) Σ (j = 1 to d) wij xi xj        (2.9)

3.5 Probabilistic discriminative models


 Discriminative models are a class of supervised machine learning
models which make predictions by estimating conditional
probability P(y|x).
 For the two-class classification problem, the posterior probability
of class C1 can be written as a logistic sigmoid acting on a linear
function of x:

p(C1 | x) = σ(wᵀx + w0)

 For the multi-class case, the posterior probability of class Ck is
given by a softmax transformation of a linear function of x:

p(Ck | x) = exp(ak) / Σj exp(aj), where ak = wkᵀx + wk0

3.6 Logistics Regression


o Logistic regression is a Machine Learning algorithm that comes under the
classification algorithms of the Supervised Learning technique.


o Logistic regression is used to describe data and the relationship


between one dependent variable and one or more independent
variables.
o The independent variables can be nominal, ordinal, or of interval type.
o Logistic regression predicts the output of a categorical dependent
variable.
o Therefore the outcome must be a categorical or discrete value.
o It can be either Yes or No, 0 or 1, true or False, etc. it gives the
probabilistic values which lie between 0 and 1.
o Linear Regression is used for solving Regression problems,
whereas Logistic regression is used for solving the classification
problems.
o The figure 2.9 predicts the logistic function

Figure 2.9 – Logistic Function or Sigmoid Function

3.6.1 Logistic Function (Sigmoid Function):


o The logistic function is also known as the sigmoid function.
o The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
o The value of the logistic regression must be between 0 and 1, so it
forms a curve like the "S" form.
o The S-form curve is called the Sigmoid function or the logistic
function.
3.6.2 Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
3.6.3 Logistic Regression Equation:
 The Logistic regression equation can be obtained from the Linear
Regression equation.
 The mathematical steps are given below:
 The equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so divide the
above equation by (1 - y):

y / (1 - y);  this is 0 for y = 0 and infinity for y = 1

o For the range between -[infinity] to +[infinity], take the logarithm of
the equation:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

The above equation is the final equation for Logistic Regression.


3.6.4 Type of Logistic Regression:
 Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only
two possible types of the dependent variables, such as 0 or 1,
Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3
or more possible unordered types of the dependent variable,
such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as "low",
"Medium", or "High".

3.6.5 Steps in Logistic Regression:


 To implement the Logistic Regression using Python, the steps are
given below:
 Data Pre-processing step
 Fitting Logistic Regression to the Training set
 Predicting the test result
 Test accuracy of the result
 Visualizing the test set result.
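
A compact scikit-learn sketch of these steps (the toy dataset, the split and the scaling below are illustrative assumptions, not part of the original notes):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed toy data: one feature (e.g. hours studied) and a pass/fail label
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Data pre-processing step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fitting Logistic Regression to the training set
clf = LogisticRegression().fit(X_train, y_train)

# Predicting the test result and testing the accuracy
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(clf.predict_proba(X_test))   # sigmoid outputs between 0 and 1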

3.6.6 Advantages of Logistic Regression Algorithm


 Logistic regression performs better when the data is linearly
separable
 It does not require too many computational resources
 There is no problem scaling the input features
 It is easy to implement and train a model using logistic regression


4. Elaborate in detail about Probabilistic Generative model and Naïve


Bayes.

PROBABILISTIC GENERATIVE MODEL AND NAÏVE BAYES


4.1 Probabilistic Generative model
4.2 Simple example
4.3 Generative models
4.4 Discriminative models

4.1 Probabilistic Generative model


 Given a model of one conditional probability, and estimated probability
distributions for the variables X and Y, denoted P(X) and P(Y), we can
estimate the opposite conditional probability using Bayes' rule:

P(Y | X) = P(X | Y) P(Y) / P(X)

 A generative model is a statistical model of the joint probability
distribution on a given observable variable X and target variable Y.
Given a generative model for P(X | Y), we can estimate:

P(Y | X) = P(X | Y) P(Y) / P(X)

 A discriminative model is a model of the conditional probability of the
target Y given an observation x. Given a discriminative model for P(Y | X),
we can estimate P(Y | X) directly from the observation x.

 Classifier based on a generative model is a generative classifier, while a


classifier based on a discriminative model is a discriminative classifier


4.2 Simple example

4.3 Generative models


Types of generative models are:
 Naive Bayes classifier or Bayesian network
 Linear discriminant analysis

4.4 Discriminative models


 Logistic regression
 Support Vector Machines
 Decision Tree Learning
 Random Forest
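
As a quick illustration of a generative classifier from the list above (this example is not in the original notes; the iris dataset and the split are arbitrary choices), Gaussian Naive Bayes models P(x | Ck) and the class priors P(Ck) and then applies Bayes' rule for prediction:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)   # estimates P(x|Ck) and P(Ck) per class
print(gnb.score(X_test, y_test))           # accuracy on the held-out data
print(gnb.predict_proba(X_test[:3]))       # posterior P(Ck|x) obtained via Bayes' rule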

5. Elaborate in detail about Support Vector Machine (SVM).

SUPPORT VECTOR MACHINE


5.1 Support Vector Machine (SVM)
5.2 Cost Function and Gradient Updates
5.2.1 Hinge loss function
5.3 SVM Kernel
5.4 Types of SVMs
5.5 Advantages of SVM
5.6 Disadvantages
5.7 Applications


5.1 Support Vector Machine (SVM)


 Support Vector Machine(SVM) is a supervised machine learning
algorithm used for both classification and regression.
 The objective of SVM algorithm is to find a hyperplane in an N-
dimensional space that distinctly classifies the data points.
 Hyperplanes are decision boundaries that help classify the data points.
 The dimension of the hyperplane depends upon the number of features.
 If the number of input features is 2, then the hyperplane is just a line.
 If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane.
 It becomes difficult to imagine when the number of features exceeds 3.
 The objective is to find a plane that has the maximum margin, i.e the
maximum distance between data points of both classes.
 Support vectors are data points that are closer to the hyperplane and
influence the position and orientation of the hyperplane.
 Using these support vectors, can maximize the margin of the classifier.
 Deleting the support vectors will change the position of the hyperplane.
 Example Refer Figure 2.10

Figure 2.10 – Example for Support Vectors

 Let’s consider two independent variables x1, x2 and one dependent


variable which is either a blue circle or a red box as shown in figure
2.11.


Figure 2.11 - Linearly Separable Data points

5.2 Cost Function and Gradient Updates


 In the SVM algorithm, to maximize the margin between the data points
and the hyperplane, the loss function that helps to maximize the margin is
called hinge loss.

5.2.1 Hinge loss function

 The hinge loss is max(0, 1 − y·f(x)): the cost is 0 if the predicted value and the
actual value are of the same sign; if they are not, then the loss value is calculated.
 The objective of the regularization parameter is to balance the margin
maximization and loss.
 After adding the regularization parameter, the cost function looks as
shown below.
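
A small NumPy sketch of the hinge loss and of the regularized cost (the labels, the scores, the weight vector and the regularization strength below are assumed values, not from the notes):

import numpy as np

def hinge_loss(y_true, scores):
    # 0 when the sign matches with margin >= 1, otherwise 1 - y*f(x)
    return np.maximum(0, 1 - y_true * scores)

y = np.array([1, -1, 1, -1])                 # actual labels (+1 / -1)
scores = np.array([2.3, -0.5, 0.2, 1.1])     # f(x) = w'x + b for each sample

losses = hinge_loss(y, scores)
w = np.array([0.8, -0.4])                    # assumed weight vector
C = 1.0                                      # assumed regularization strength
objective = 0.5 * np.dot(w, w) + C * np.sum(losses)
print(losses, objective)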


5.3 SVM Kernel:


 The SVM kernel is a function that takes low dimensional input space
and transforms it into higher-dimensional space, ie it converts non
separable problem to separable problem. It is mostly useful in non-
linear separation problems.

5.4 Types of SVMs


 There are two different types of SVMs, each used for different things:
o Simple SVM: Typically used for linear regression and classification
problems.
o Kernel SVM: Has more flexibility for non-linear data.
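
An illustrative scikit-learn comparison of the two variants on a non-linearly separable toy dataset (the dataset and the parameters below are assumptions, not part of the notes):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not separable by a straight line
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # simple (linear) SVM
kernel_svm = SVC(kernel="rbf").fit(X_train, y_train)      # kernel SVM

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("rbf kernel accuracy:", kernel_svm.score(X_test, y_test))
print("number of support vectors:", kernel_svm.support_vectors_.shape[0])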

5.5 Advantages of SVM:


 Effective on datasets with multiple features, like financial or medical
data.
 Effective in cases where number of features is greater than the number
of data points.
 Its memory efficient as it uses a subset of training points in the decision
function called support vectors
 Different kernel functions can be specified for the decision functions and
its possible to specify custom kernels

5.6 Disadvantages
 If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.
 SVMs don't directly provide probability estimates. Those are calculated
using an expensive five-fold cross-validation.
 Works best on small sample sets because of its high training time.

5.7 Applications
SVMs can be used to solve various real-world problems:
 SVMs are helpful in text and hypertext categorization.
 Classification of images can also be performed using SVMs.
 Classification of satellite data like SAR data using supervised SVM.
 Hand-written characters can be recognized using SVM.
 The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.


6. Elaborate in detail about Decision Tree in Supervised Learning.

DECISION TREE IN SUPERVISED LEARNING


6.1 Decision Tree
6.2 Types of Decision Trees
6.3 Reason for using Decision Trees
6.4 Decision Tree Terminologies
6.5 Working of Decision Tree algorithm
6.6 Algorithms used to construct Decision Trees
6.7 Attribute Selection Measures
6.7.1 Entropy
6.7.2.Information Gain
6.7.3.Gini Index
6.7.4 Gain Ratio
6.7.5 Reduction in variance
6.7.6 Chi-Square
6.8. Avoid/counter Over fitting in Decision Trees
6.8.1 Pruning Decision Trees

6.8.2 Random Forest

6.9 Advantages of the Decision Tree


6.10 Disadvantages of the Decision Tree

6.1 Decision Tree


 Decision Tree is a supervised learning technique that can be used for
both classification and Regression problems.
 It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
 In a Decision tree, there are two nodes, the Decision Node and Leaf
Node.
 As shown in figure 2.12, Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
 Example for Decision Tree Refer Figure 2.13
 The goal of using a Decision Tree is to create a training model that can
be used to predict the class or value of the target variable by learning simple
decision rules inferred from prior data (training data).
 In order to build a tree, use the CART algorithm, which stands
for Classification and Regression Tree algorithm.


Figure 2.12 – Decision Tree Structure

Figure 2.13 – Decision Tree Example

6.2 Types of Decision Trees


1. Categorical Variable Decision Tree: Decision Tree which has a
categorical target variable then it called a Categorical variable decision
tree.
2. Continuous Variable Decision Tree: Decision Tree has a continuous
target variable then it is called Continuous Variable Decision Tree.


6.3 Reason for using Decision Trees


 Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
 The logic behind the decision tree can be easily understood because it
shows a tree-like structure.
6.4 Decision Tree Terminologies
 Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from
the tree.
 Parent/Child node: The root node of the tree is called the parent node,
and other nodes are called the child nodes.
6.5 Working of Decision Tree algorithm
 In a decision tree, for predicting the class of the given dataset, the
algorithm starts from the root node of the tree.
 This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
 For the next node, the algorithm again compares the attribute value with
the other sub-nodes and moves further.
 It continues the process until it reaches the leaf node of the tree.
 The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using Attribute
Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the
best attributes.
Step-4: Generate the decision tree node, which contains the best
attribute.
Step-5: Recursively make new decision trees using the subsets of the
dataset created in step -3. Continue this process until a stage
is reached where cannot further classify the nodes and called
the final node as a leaf node.

PREPARED BY: Ms.M.NITHYA, AP/AI & DS, Mr.THIYANESHWARAN, AP,CSBS 36

Downloaded by vidhya gopinath ([email protected])


lOMoARcPSD|49044958

AL 3451 –MACHINE LEARNING UNIT 2

Example:
 Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. So, to solve this problem, the
decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram: Refer fig 2.14

Figure 2.14 – Decision Tree Algorithm Example

6.6 Algorithms used to construct Decision Trees:


 ID3 → (extension of D3)
 C4.5 → (successor of ID3)
 CART → (Classification And Regression Tree)
 CHAID → (Chi-square automatic interaction detection Performs
multi-level splits when computing classification trees)
 MARS → (multivariate adaptive regression splines)

6.7Attribute Selection Measures


 While implementing a Decision tree, Attribute Selection Measure
or ASM is used to select the best attribute for the nodes of the tree.
1. Entropy,

2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square

6.7.1 Entropy:
 Entropy is a metric to measure the impurity in a given
attribute.
 Entropy is a measure of the randomness in the information
being processed.
 The higher the entropy, the harder it is to draw any conclusions
from that information.
 Flipping a coin is an example of an action that provides
information that is random.
 Entropy can be calculated as:

Entropy(S) = -P(yes)log2 P(yes)- P(no) log2 P(no)


Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no

6.7.2. Information Gain:


 Information gain or IG is a statistical property that measures
how well a given attribute separates the training examples
according to their target classification. For an example, refer to Figure 2.15.

Figure 2.15 – Information Gain Example

 Constructing a decision tree is all about finding an attribute that
returns the highest information gain and the smallest entropy.
 Information gain is a decrease in entropy.
 It computes the difference between entropy before split and
average entropy after split of the dataset based on given
attribute values.
 It can be calculated using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

6.7.3. Gini Index:


 Gini index as a cost function used to evaluate splits in the dataset.
 It is calculated by subtracting the sum of the squared probabilities
of each class from one.
 It favors larger partitions and is easy to implement, whereas
information gain favors smaller partitions with distinct values.
 Gini index can be calculated using the below formula:
Gini Index = 1 − Σ (pi)², where pi is the probability of class i in the node.
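As a minimal sketch, the three measures above (entropy, information gain and Gini index) can be computed with a few lines of Python; the helper function names and the toy label counts below are illustrative assumptions, not part of the notes.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum over classes of p^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # IG = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Toy example: 9 "yes" / 5 "no" samples split into two subsets by some attribute
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(entropy(parent), gini(parent), information_gain(parent, split))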

6.7.4 Gain Ratio


 Information gain is biased towards choosing attributes with a large
number of values as root nodes.
 Gain ratio overcomes the problem with information gain by taking
the intrinsic information of a split into account.

6.7.5 Reduction in variance


 Reduction in variance is an algorithm that uses the standard
formula of variance to choose the best split:
Variance = Σ (x − x̄)² / n, where x̄ is the mean of the values and n is the number of values in the node.
 The split with the lower variance is selected as the criterion to split the population.

6.7.6 Chi-Square
 The acronym CHAID stands for Chi-squared Automatic Interaction
Detector.
 It finds out the statistical significance between the differences
between sub-nodes and parent node.
 Higher the value of Chi-Square higher the statistical significance of
differences between sub-node and Parent node.
 It generates a tree called CHAID (Chi-square Automatic Interaction
Detector).
 Mathematically, the Chi-square statistic is represented as:
χ² = Σ (Observed − Expected)² / Expected

6.8. Avoid/counter Overfitting in Decision Trees


o Two ways to remove overfitting:
o Pruning Decision Trees.
o Random Forest

6.8.1 Pruning Decision Trees

 Pruning is a process of deleting the unnecessary nodes from a tree


in order to get the optimal decision tree.
 A too-large tree increases the risk of over fitting, and a small tree
may not capture all the important features of the dataset.
 Therefore, a technique that decreases the size of the learning tree
without reducing accuracy is known as Pruning.
 There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

6.8.2 Random Forest


 Random Forest is an example of ensemble learning, in which we
combine multiple machine learning algorithms to obtain better
predictive performance.
 The name "random" refers to:
 A random sampling of the training data set when building trees.
 Random subsets of features considered when splitting nodes.

6.9 Advantages of the Decision Tree


 It is simple to understand as it follows the same process which a
human follows while making any decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other
algorithms.

6.10 Disadvantages of the Decision Tree


 The decision tree contains lots of layers, which makes it complex.
 It may have an over fitting issue, which can be resolved using
the Random Forest algorithm.
 For more class labels, the computational complexity of the decision
tree may increase.

7. Elaborate in detail about Random Forest in Supervised Learning.

RANDOM FOREST
7.1 Random Forest
7.2 Steps in the working process of Random Forest
7.3 Need for Random Forest
7.4 Example:
7.5 Important Features of Random Forest
7.6 Applications of Random Forest
7.7 Advantages of Random Forest
7.8 Disadvantages of Random Forest
7.9 Difference between Decision Tree & Random Forest

7.1 Random Forest


 Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset.
 Random Forest is a popular machine learning algorithm that belongs
to the supervised learning technique.
 It can be used for both Classification and Regression problems in ML.
 It is based on the concept of ensemble learning, which is a process
of combining multiple classifiers to solve a complex problem and to
improve the performance of the model.
 The greater number of trees in the forest leads to higher accuracy and
prevents the problem of over fitting.

7.2 Steps in the working process of Random Forest


 The Working process can be explained in the below steps and
diagram:
Step 1: In Random forest n number of random records are taken from
the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or
Averaging for Classification and regression respectively.
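A minimal sketch of these steps using scikit-learn's RandomForestClassifier is shown below (assuming scikit-learn is available; the synthetic dataset is purely illustrative).

# Random forest sketch (assumes scikit-learn); data is synthetic
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators = number of trees; each tree sees a bootstrap sample of the
# records (Steps 1-2) and a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))        # Step 4: majority vote across the trees
print(forest.predict_proba(X[:3]))  # averaged class probabilities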

7.3 Need for Random Forest


 It takes less training time as compared to other algorithms.
 It predicts output with high accuracy, and even for large datasets it
runs efficiently.
 It can also maintain accuracy when a large proportion of data is
missing.

7.4 Example:
 Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is
divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results,
the Random Forest classifier predicts the final decision.

Figure 2.16 – Example for Random Forest

 In the above figure 2.16, the majority of the decision trees give the
output as an apple rather than a banana, so the final output is taken as an
apple.

7.5 Important Features of Random Forest


1. Diversity-
Not all attributes/variables/features are considered while
making an individual tree, each tree is different.
2. Immune to the curse of dimensionality-
Since each tree does not consider all the features, the feature
space is reduced.
3. Parallelization-
Each tree is created independently out of different data and
attributes. This means that we can make full use of the CPU to
build random forests.
4. Train-Test split-
In a random forest we don’t have to segregate the data into train
and test sets, as there will always be roughly 30% of the data (the
out-of-bag samples) that is not seen by a given decision tree.
5. Stability-
Stability arises because the result is based on majority voting/
averaging.

7.6 Applications of Random Forest


1. Banking: Banking sector mostly uses this algorithm for the
identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks
of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this
algorithm.
4. Marketing: Marketing trends can be identified using this
algorithm.

7.7 Advantages of Random Forest


 Random Forest is capable of performing both Classification and
Regression tasks.
 It is capable of handling large datasets with high dimensionality.
 It enhances the accuracy of the model and prevents the over fitting
issue.
7.8 Disadvantages of Random Forest
 Although random forest can be used for both classification and
regression tasks, it is less suitable for regression tasks.

7.9 Difference between Decision Tree & Random Forest


Decision Trees vs. Random Forest:
 Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control. Random forests are created from subsets of data and the final output is based on average or majority ranking, hence the problem of overfitting is taken care of.
 A single decision tree is faster in computation. A random forest is comparatively slower.
 When a data set with features is taken as input by a decision tree, it formulates a set of rules to make the prediction. A random forest randomly selects observations, builds decision trees, and the average result is taken; it does not use any set of rules.

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / IV SEM
AL3451- MACHINE LEARNING
UNIT III ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING
SYLLABUS:
Combining multiple learners: Model combination schemes, Voting,
Ensemble Learning - bagging, boosting, stacking, Unsupervised learning:
K-means, Instance Based Learning: KNN, Gaussian mixture models and
Expectation maximization

PART A
1. What is unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.

2. What is semi supervised learning?


 Semi-supervised machine learning is a combination of
supervised and unsupervised learning.
 It uses a small amount of labeled data and a large amount of
unlabeled data, which provides the benefits of both unsupervised
and supervised learning while avoiding the challenges of finding a
large amount of labeled data.

3. What is Ensemble method?


Ensemble methods are techniques that aim at improving the
accuracy of results in models by combining multiple models
instead of using a single model. The combined models increase the

accuracy of the results significantly. This has boosted the popularity of ensemble methods in machine learning.

4. Explain Clustering.
Clustering is the act of organizing similar objects into groups
within a machine learning algorithm. Assigning related objects into
clusters is beneficial for AI models. Clustering has many uses in data
science, like Image processing, knowledge discovery in data,
unsupervised learning, and various other applications.

5. What is a Cluster?
A cluster is a group of objects that belong to the same class. In
other words, similar objects are grouped in one cluster and dissimilar
objects are grouped in another cluster.

6. What is bagging?
Bagging, also known as bootstrap aggregation, is an ensemble method
that works by training multiple models independently and combining them
later to produce a strong model.

7. Define Boosting.
Boosting refers to a group of algorithms that utilize weighted
averages to turn weak learning algorithms into stronger learning
algorithms.

8. What is K-Nearest neighbor Methods?


 K-NN is a non-parametric algorithm, which means it does not
make any assumption on underlying data.
 It is also called a lazy learner algorithm because it does not learn
from the training set immediately instead it stores the dataset and
at the time of classification, it performs an action on the dataset.
 KNN algorithm at the training phase just stores the dataset and
when it gets new data, then it classifies that data into a category
that is much similar to the new data.

9. Which are the performance factors that influence KNN algorithm?


1. The distance function or distance metric used to determine the
nearest neighbors.
2. The Decision rule used to derive a classification from the K-
Nearest neighbors.
3. The number of neighbors used to classify the new example.

10. What is K Means Clustering?


 K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for
K=3, there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset into
k different clusters in such a way that each dataset belongs only
one group that has similar properties.

11.List the properties of K-Means algorithm.


1. There are always K clusters
2. There is always at least one item in each cluster.
3. The clusters are non-hierarchical and they do not overlap

12. What is stacking?


Stacking, sometimes called stacked generalization, is an ensemble
machine learning method that combines heterogeneous base or
component models via a meta model.

13. How do GMMs differentiate from K-means clustering?


GMMs and K-means, both are clustering algorithms used for
unsupervised learning tasks. However, the basic difference between
them is that K-means is a distance-based clustering method while
GMMs is a distribution based clustering method.

14. What is ‘Over fitting’ in Machine learning?


 In machine learning, when a statistical model describes random
error or noise instead of underlying relationship ‘over fitting’
occurs.
 When a model is excessively complex, over fitting is normally
observed, because of having too many parameters with respect to
the number of training data types.
 A model that has been overfit exhibits poor performance on new data.

15. What is ensemble learning?


To solve a particular computational program, multiple models
such as classifiers or experts are strategically generated and combined.
This process is known as ensemble learning.

16. What are the two paradigms of ensemble methods?


The two paradigms of ensemble methods are
 Sequential ensemble methods
 Parallel ensemble methods

17. What is voting?


A voting classifier is a machine learning estimator that trains
various base models or estimators and predicts on the basis of
aggregating the findings of each base estimator. The aggregating criteria
can be combined decision of voting for each estimator output.

18. What is Error-Correcting Output Codes?


 The main classification task is defined in terms of a number of
subtasks that are implemented by the base learners.
 The idea is that the original task of separating one class from all
other classes may be a difficult problem.
 We want to define a set of simpler classification problems, each
specializing in one aspect of the task, and combining these
simpler classifiers, we get the final classifier.

19. What is Gaussian Mixture model?


This model is a soft probabilistic clustering model that allows us
to describe the membership of points to a set of clusters using a mixture
of Gaussian densities.

20. Differentiate between Bagging and Boosting.

Bagging:
 The simplest way of combining predictions that belong to the same type.
 Aims to decrease variance, not bias.
 Each model receives equal weight.
 Each model is built independently.
Boosting:
 A way of combining predictions that belong to different types.
 Aims to decrease bias, not variance.
 Models are weighted according to their performance.
 New models are influenced by the performance of previously built models.

PART - B

1. Give Short notes on combining multiple learners.


Combining multiple learners is a machine learning technique that
combines multiple models into a single model. The goal is to improve the
model's performance by combining the weak learners in the right way.
Rationale
 The No Free Lunch Theorem states that there is no single learning
algorithm that in any domain always induces the most accurate
learner. The usual approach is to try many and choose the one that
performs the best on a separate validation set.

Different Learners use Different


 Algorithms: making different assumptions
 Hyper parameters: e.g. number of hidden nodes in NN, k in k-NN
 Representations: different features, multiple sources of information
 Training sets: small variations in the sets or different subproblems

Different Algorithms
o Different algorithms make different assumptions about the data and
lead to different classifiers.

Different Hyper parameters


o Use the same learning algorithm but use it with different hyper
parameters.

Different Input Representations


o Separate base-learners may be using different representations of the
same input object or event, making it possible to integrate different
types of Sensors/measurements/modalities.
o Different representations make different characteristics explicit
allowing better identification.

Different Training Sets


Another possibility is to train different base-learners on different
subsets of the training set. This can be done randomly by drawing
random training sets from the given sample; this is called bagging.

Model Combination Schemes


 There are also different ways the multiple base-learners are combined
to generate the final output:
 Multiexpert combination
o Multiexpert combination methods have base-learners that work in
parallel. These methods can in turn be divided into two:
A) The global approach, also called learner fusion: given an
input, all base-learners generate an output and all these
outputs are used. Examples are voting and stacking.
B) The local approach, or learner selection: for example, in a
mixture of experts there is a gating model, which looks at the input and
chooses one (or very few) of the learners as responsible for
generating the output.

 Multistage combination methods use a serial approach where the
next base-learner is trained on, or tested on, only the instances where
the previous base-learners are not accurate enough.

 There are L base-learners; dj(x) denotes the prediction of base-learner Mj
given the arbitrary-dimensional input x.
 Each Mj uses a different input representation xj.
 The final prediction is calculated from the predictions of the
base-learners:
y = f(d1, d2, …, dL | Φ)    (Eq. 1)
 where f(·) is the combining function with Φ denoting its
parameters.

 When there are K outputs, each learner produces dji(x), i = 1, …, K,
j = 1, …, L; combining them also generates K values yi, i = 1, …, K.
Then, for example in classification, we choose the class with the
maximum yi value:
Choose Ci if yi = max over k of yk
Voting
 A Voting Classifier is a machine learning model that trains on an
ensemble of numerous models and predicts an output (class) based
on their highest probability of chosen class as the output.
 It simply aggregates the findings of each classifier passed into Voting
Classifier and predicts the output class based on the highest majority
of voting.
 The idea is instead of creating separate dedicated models and finding
the accuracy for each them, create a single model which trains by
these models and predicts output based on their combined majority
of voting for each output class.
 The simplest way to combine multiple classifiers is by voting, which
corresponds to taking a linear combination of the learners (see Figure 3.1).
 This is also known as ensembles. In the simplest case, all learners are
given equal weight and simple voting corresponds to take an average.

Figure 3.1 Base-learners are dj and their outputs are combined using
f(·).
 Voting Classifier supports two types of voting’s.
1. Hard Voting: In hard voting, the predicted output class is a class with
the highest majority of votes i.e. the class which had the highest
probability of being predicted by each of the classifiers. Suppose three
classifiers predicted the output class(A, A, B), so here the majority
predicted A as output. Hence A will be the final prediction.
2. Soft Voting: In soft voting, the output class is the prediction based on
the average of probability given to that class. Suppose given some input
to three models, the prediction probability for class A = (0.30, 0.47,
0.53) and B = (0.20, 0.32, 0.40). So the average for class A is
0.4333 and B is 0.3067, the winner is clearly class A because it had the
highest probability averaged by each classifier.
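A minimal sketch of hard and soft voting with scikit-learn's VotingClassifier is given below (assuming scikit-learn; the base models and synthetic data are illustrative choices, not prescribed by the notes).

# Voting classifier sketch (assumes scikit-learn)
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=4)),
        ("nb", GaussianNB())]

hard = VotingClassifier(estimators=base, voting="hard")  # majority class wins
soft = VotingClassifier(estimators=base, voting="soft")  # average the predicted probabilities

hard.fit(X, y)
soft.fit(X, y)
print(hard.predict(X[:5]))
print(soft.predict(X[:5]))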

Table 4.1 Classifier combination rules


 Voting schemes can be seen as approximations under a Bayesian
framework, with weights approximating prior model probabilities and
model decisions approximating model-conditional likelihoods. This is
Bayesian model combination:
P(Ci | x) = Σ over all models Mj of P(Ci | x, Mj) P(Mj)    (Eq. 3)

2. Explain Ensemble learning Technique in detail.


Ensemble learning Technique
"An ensembled model is a machine learning model that combines the
predictions from two or more models.”
 Ensemble learning is one of the most powerful machine learning
techniques that use the combined output of two or more
models/weak learners and solve a particular computational
intelligence problem.
 E.g., a Random Forest algorithm is an ensemble of various decision
trees combined.
 Ensemble learning is primarily used to improve the model
performance, such as classification, prediction, function
approximation, etc.
Simple Ensemble Training Methods
Simple ensemble training methods typically just involve the application
of statistical summary techniques, such as determining the mode, mean, or
weighted average of a set of predictions.
Advanced Ensemble Training Methods
There are 3 most common ensemble learning methods in machine
learning. These are as follows:

 Bagging: Bagging is a method of ensemble modeling, which is


primarily used to solve supervised machine learning problems.
 Boosting: Boosting is an ensemble method that enables each member
to learn from the preceding member's mistakes and make better
predictions for the future.
 Stacking: Stacking is one of the popular ensemble modeling
techniques in machine learning. Various weak learners are
ensembled in a parallel manner in such a way that, by combining
them with meta learners, we can obtain better predictions for the
future.
Ensemble learning methods can be divided into one of two different groups:
1. Sequential methods
 In Sequential ensemble methods the base learners/models are
generated sequentially.

 In the case of sequential methods, the essential idea is that the


dependence between the base learners is exploited in order to get
more accurate predictions.
 Examples of sequential ensemble methods
include AdaBoost, XGBoost, and Gradient tree boosting.

2. Parallel ensemble methods


 Parallel ensemble methods generate the base learners in parallel.
 When carrying out parallel ensemble learning, the idea is to
exploit the fact that the base learners have independence, as the
general error rate can be reduced by averaging the predictions of
the individual learners.

Methods for Independently Constructing Ensembles –


 Majority Vote
 Bagging and Random Forest
 Randomness Injection
 Feature-Selection Ensembles
 Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles –
 Boosting
 Stacking

3. Explain in Detail about Bagging Technique in Ensemble Learning.


Bagging
Bagging is a method of ensemble modeling, which is primarily used to
solve supervised machine learning problems. It is generally completed in two
steps as follows:
 Bootstrapping: It is a random sampling method that is used to derive
samples from the data using sampling with replacement. In this method,
first, random data samples are fed to the primary model, and then a base
learning algorithm is run on the samples to complete the learning
process.
 Aggregation: This is a step that involves the process of combining the
output of all base models and, based on their output, predicting an
aggregate result with greater accuracy and reduced variance.
 Example: In the Random Forest method, predictions from multiple
decision trees are ensembled parallelly. Further, in regression problems,
we use an average of these predictions to get the final output, whereas,
in classification problems, the model is selected as the predicted class.

Figure 3.2 Bagging Technique.


Implementation Steps of Bagging
 Step 1: Multiple subsets are created from the original data set with equal
tuples, selecting observations with replacement.
 Step 2: A base model is created on each of these subsets.
 Step 3: Each model is learned in parallel with each training set and
independent of each other.
 Step 4: The final predictions are determined by combining the
predictions from all the models.
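These four steps can be sketched with scikit-learn's BaggingClassifier as shown below (an illustrative sketch only; scikit-learn and the synthetic data are assumptions, not part of the notes).

# Bagging sketch (assumes scikit-learn); data and parameters are illustrative
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each of the 50 base trees is trained on a bootstrap sample drawn with
# replacement (bootstrap=True); predictions are aggregated by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=50,
                            bootstrap=True,
                            random_state=0)
bagging.fit(X, y)
print(bagging.predict(X[:5]))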

Figure 3.3 Bagging Technique.


Advantages:
 Reduce overfitting of the model.
 Handles higher dimensionality data very well.
 Maintains accuracy for missing data
Disadvantage:
 Since final prediction is based on mean prediction from the subset trees,
it won’t give precise values for the classification and regression model

4. Explain boosting Technique in Ensemble learning.


 Boosting is an ensemble method that enables each member to learn from
the preceding member's mistakes and make better predictions for the
future.
 It is done by building a model by using weak models in series.
 Firstly, a model is built from the training data.
 Then the second model is built which tries to correct the errors present
in the first model.
 This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number
of models is added.
 AdaBoost is short for Adaptive Boosting and is a very popular boosting
technique that combines multiple “weak classifiers” into a single “strong
classifier”.

 Refer figure 3.4.

Algorithm
1. Initialize the dataset and assign equal weight to each of the data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points and decrease
the weights of correctly classified data points. And then normalize the
weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
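A minimal AdaBoost sketch with scikit-learn is shown below (illustrative only; the library, dataset and parameter values are assumptions).

# AdaBoost sketch (assumes scikit-learn); data is synthetic
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each new weak learner (a shallow decision stump by default) concentrates on
# the samples that the previous learners misclassified, via sample re-weighting.
boost = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
boost.fit(X, y)
print(boost.predict(X[:5]))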

Figure 3.4 An illustration presenting the intuition behind the boosting algorithm,
consisting of the parallel learners and weighted dataset.

Similarities Between Bagging and Boosting


1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.

3. Both make the final decision by averaging the N learners (or


taking the majority of them i.e Majority Voting).
4. Both are good at reducing variance and provide higher stability.

Differences between Bagging and Boosting

Bagging:
 The simplest way of combining predictions that belong to the same type.
 Aims to decrease variance, not bias.
 Each model receives equal weight.
 Each model is built independently.
 Example: the Random Forest model uses bagging.
 The base classifiers are trained in parallel.
Boosting:
 A way of combining predictions that belong to different types.
 Aims to decrease bias, not variance.
 Models are weighted according to their performance.
 New models are influenced by the performance of previously built models.
 Example: AdaBoost uses boosting techniques.
 The base classifiers are trained sequentially.
5. Explain in Detail about Stacking.


Stacking
 Stacking is one of the popular ensemble modeling techniques in machine
learning. Various weak learners are ensembled in a parallel manner in such
a way that, by combining them with meta learners, we can obtain better
predictions for the future.

 This ensemble technique works by feeding the combined predictions of
multiple weak learners to a meta learner as input, so that a better output
prediction model can be achieved.
 In stacking, an algorithm takes the outputs of sub-models as input and
attempts to learn how to best combine the input predictions to make a
better output prediction.
 Stacking is also known as a stacked generalization and is an extended
form of the Model Averaging Ensemble technique in which all sub-models
equally participate as per their performance weights and build a new
model with better predictions. This new model is stacked up on top of the
others; this is the reason why it is named stacking.
Architecture of Stacking
 The architecture of the stacking model consists of two or more
base/learner's models and a meta-model that combines the
predictions of the base models.
 These base models are called level 0 models, and the meta-model
is known as the level 1 model.
 So, the Stacking ensemble method includes original (training)
data, primary level models, primary level prediction,
secondary level model, and final prediction.
 The basic architecture of stacking can be represented as shown in
figure 3.5

Figure 3.5 basic architecture of stacking


o Original data: This data is divided into n-folds and is also
considered test data or training data.

o Base models: These models are also referred to as level-0 models.


These models use training data and provide compiled predictions
(level-0) as an output.
o Level-0 Predictions: Each base model is triggered on some
training data and provides different predictions, which are known
as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of
one meta-model, which helps to best combine the predictions of
the base models. The meta-model is also known as the level-1
model.
o Level-1 Prediction: The meta-model learns how to best combine
the predictions of the base models and is trained on different
predictions made by individual base models, i.e., data not used to
train the base models are fed to the meta-model, predictions are
made, and these predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to
fit the meta-model.
o Refer figure 3.6.

Figure 3.6 Stacking Architecture

Steps to implement Stacking models:


o Split training data sets into n-folds using the Repeated Stratified
KFold as this is the most common approach to preparing training
datasets for meta-models.
o Now a base model is fitted on the first n-1 folds, and it will make
predictions for the nth fold.
o The prediction made in the above step is added to the x1_train list.
o Repeat steps 2 & 3 for the remaining folds, so that x1_train becomes an
array of size n.
o Now, the model is trained on all the n parts, which will make
predictions for the sample data.
o Add this prediction to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test
by using Model 2 and 3 for training, respectively, to get Level 2
predictions.
o Now train the Meta model on level 1 prediction, where these
predictions will be used as features for the model (refer Figure 4.3).
o Finally, Meta learners can now be used to make a prediction on test
data in the stacking model.
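A minimal stacking sketch using scikit-learn's StackingClassifier is given below (the base models, meta-model and synthetic data are illustrative assumptions, not the only valid choices).

# Stacking sketch (assumes scikit-learn)
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

level0 = [("dt", DecisionTreeClassifier(max_depth=4)),
          ("knn", KNeighborsClassifier(n_neighbors=5))]

# cv=5 produces the level-0 predictions out-of-fold, so the level-1 meta-model
# is trained on predictions for data the base models did not see.
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
print(stack.predict(X[:5]))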

6. Explain in detail about Unsupervised Learning.


 Unsupervised learning is a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that
data without any supervision.
Clustering
 Clustering or cluster analysis is a machine learning technique, which
groups the unlabelled dataset.
 It is "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible
similarities remain in a group that has less or no similarities with
another group."
 Clustering involves dividing data points into multiple clusters of
similar values.
 Cluster is a group of objects that belongs to the same class

 Clustering is a process of partitioning a set of data in a meaningful


subclass as shown in figure 3.7.

Figure 3.7 Clustering


Common uses of this technique are:
 Market Segmentation
 Statistical data analysis
 Social network analysis
 Image segmentation
 Anomaly detection, etc.

Types of Clustering Methods


 Hard clustering (datapoint belongs to only one group)
 Soft Clustering (data points can belong to another group also).

Main clustering methods used in Machine learning:


 Partitioning Clustering
 Density-Based Clustering
 Distribution Model-Based Clustering
 Hierarchical Clustering
 Fuzzy Clustering

Partitioning Clustering
 It is a type of clustering that divides the data into non-hierarchical
groups.
 It is also known as the centroid-based method.

 The most common example of partitioning clustering is the K-Means


Clustering algorithm as shown in figure 3.8.

Figure 3.8 K-Means Clustering algorithm.

Density-Based Clustering
 The density-based clustering method connects the highly-dense areas
into clusters, and the arbitrarily shaped distributions are formed as
long as the dense region can be connected.
 This algorithm does it by identifying different clusters in the dataset
and connects the areas of high densities into clusters.
 The dense areas in data space are divided from each other by sparser
areas, as shown in figure 3.9.

Figure 3.9 density-based clustering

Distribution Model-Based Clustering


 In the distribution model-based clustering method, the data is divided
based on the probability of how a dataset belongs to a particular
distribution.

 The grouping is done by assuming some distributions


commonly Gaussian Distribution.
 The example of this type is the Expectation-Maximization
Clustering algorithm that uses Gaussian Mixture Models (GMM) as
shown in figure 3.9.

Figure 3.9 Expectation-Maximization Clustering algorithm that uses Gaussian


Mixture Models (GMM)
Hierarchical Clustering
 In this technique, the dataset is divided into clusters to create a tree-
like structure, which is also called a dendrogram.
 The observations or any number of clusters can be selected by cutting
the tree at the correct level.
 The most common example of this method is the Agglomerative
Hierarchical algorithm.

 Figure 3.10 Agglomerative Hierarchical algorithm.


Fuzzy Clustering
 Fuzzy clustering is a type of soft method in which a data object may
belong to more than one group or cluster.
 Each dataset has a set of membership coefficients, which depend on
the degree of membership to be in a cluster.

 Fuzzy C-means algorithm is the example of this type of clustering; it


is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms
 K-Means algorithm
 Mean-shift algorithm
 DBSCAN Algorithm-Density-Based Spatial Clustering of Applications
with Noise.
 Expectation-Maximization Clustering using GMM
 Agglomerative Hierarchical algorithm
 Affinity Propagation

Applications of Clustering
 In Identification of Cancer Cells: It divides the cancerous and
non-cancerous data sets into different groups.
 In Search Engines: It does it by grouping similar data objects in
one group that is far from the other dissimilar objects.
 Customer Segmentation: It is used in market research to
segment the customers based on their choice and preferences.
 In Biology: It is used in the biology stream to classify different
species of plants and animals using the image recognition
technique.
 In Land Use: The clustering technique is used in identifying the
area of similar lands use in the GIS database.

Unsupervised Learning: K means


 K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for
K=3, there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset into k
different clusters in such a way that each dataset belongs only one
group that has similar properties.

 It is a centroid-based algorithm, where each cluster is associated with


a centroid.
 The main aim of this algorithm is to minimize the sum of distances
between the data point and their corresponding clusters.
 The algorithm takes the unlabeled dataset as input, divides the
dataset into k-number of clusters, and repeats the process until it
does not find the best clusters.
 The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:


o Determines the best value for K center points or centroids by an iterative
process.
o Assigns each data point to its closest k-center. Those data points which
are near to the particular k-center, create a cluster.
Hence each cluster has data points with some commonalities and is away
from the other clusters, as shown in figure 3.11.

Working of K-Means Algorithm


The working of the K-Means algorithm is explained in the below steps:
1. Choose the value of k and the k initial guesses for the centroids.
2. Compute the distance from each data point to each centroid.
In two dimensions, the distance, d, between any two points, (X1, Y1) and (X2, Y2),
in the Cartesian plane is typically expressed by the Euclidean distance measure:
d = √((X1 − X2)² + (Y1 − Y2)²)
3. Compute the centroid, the center of mass, of each newly defined cluster
from Step 2.
In two dimensions, the centroid (Xc, Yc) of the m points in a k-means
cluster is calculated as:
(Xc, Yc) = ( (1/m) Σ Xi , (1/m) Σ Yi )

4. Calculate the distance d between pi and q.
For a given point pi at (pi1, pi2, . . ., pin) and a centroid q located at
(q1, q2, . . ., qn), the distance d between pi and q is expressed as:
d(pi, q) = √( Σ from j = 1 to n of (pij − qj)² )
5. Calculate the centroid q.
The centroid q of a cluster of m points pi = (pi1, pi2, . . ., pin) is calculated
as:
q = (1/m) Σ from i = 1 to m of pi
Figure 3.11 K-means Clustering Algorithm
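A minimal sketch of the algorithm with scikit-learn's KMeans is shown below (the 2-D points are invented for illustration and are not part of the notes).

# K-means sketch (assumes scikit-learn and numpy)
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],   # points near (1, 1)
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.1]])  # points near (5, 5)

# n_clusters = K; internally the algorithm alternates between assigning each
# point to its nearest centroid (Euclidean distance) and recomputing each
# centroid as the mean of the points assigned to it.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster index of each point
print(kmeans.cluster_centers_)   # final centroids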

7. Explain in detail about Instance based Learning and discuss KNN - k-


Nearest
Neighbor Learning?
Instance based learning:
 Instance-based learning are the systems that learn the training
examples by heart and then generalizes to new instances based on
some similarity measure.
 It is called instance-based because it builds the hypotheses from
the training instances.
 It is also known as memory-based learning or lazy-learning.
 When a new instance is encountered, its relationship to the stored
examples is examined in order to assign a target function value for
the new instance.
 Instance-based methods are sometimes referred to as lazy learning
methods because they delay processing until a new instance must
be classified.
 A key advantage of lazy learning is that instead of estimating the
target function once for the entire instance space, these methods
can estimate it locally and differently for each new instance to be
classified.
 Some of the instance-based learning algorithms are:
o K Nearest Neighbor (KNN)
o Self-Organizing Map (SOM)
o Learning Vector Quantization (LVQ)
o Locally Weighted Learning (LWL)
o Case-Based Reasoning

k-Nearest Neighbor Learning - KNN


o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most
similar to the available categories.

o K-NN algorithm can be used for Regression as well as for Classification


but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from
the training set immediately instead it stores the dataset and at the time
of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.

Need of a K-NN Algorithm


 Suppose there are two categories, i.e., Category A and Category B, and
we have a new data point x1, so this data point will lie in which of these
categories. To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:

Figure 3.12 Explains the working of the K-NN Algorithm

Working of K-NN
The K-NN working can be explained on the basis of the below
algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of
neighbors

o Step-3: Take the K nearest neighbors as per the calculated


Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data
points in each category.
o Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
o Step-6: Our model is ready.
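A minimal sketch of these steps with scikit-learn's KNeighborsClassifier is shown below (the training points and labels are illustrative assumptions).

# KNN sketch (assumes scikit-learn)
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]

# k = 5 neighbours, Euclidean distance (the default metric); the predicted
# class is the majority class among the 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.predict([[3, 4]]))   # the new data point is assigned to category A or B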

Example
Suppose we have a new data point and we need to put it in the required
category as shown in figure 3.13.

Figure 3.13 new data point and we need to put it in the required category
 Firstly, we will choose the number of neighbors, so we will choose
the k=5.
 Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance between two points (x1, y1) and (x2, y2)
can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)

Figure 3.14
o By calculating the Euclidean distance we got the nearest neighbors, as
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:

Figure 3.15 nearest neighbors

o Since 3 of the nearest neighbors are from category A (see Figure 3.15),
the new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.

o A very low value for K such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are good, but it may find some difficulties.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o Always needs to determine the value of K which may be complex some
time.
o The computation cost is high because of calculating the distance between
the data points for all the training samples.
8. Explain in detail about Gaussian Mixture models and Expectation
Maximization.
Gaussian Mixture Model
 Gaussian mixture models (GMMs) are a type of machine learning
algorithm. They are used to classify data into different categories
based on the probability distribution.
 Gaussian mixture models can be used in many different areas,
including finance, marketing and so much more.
 The Gaussian mixture model is a probabilistic model that assumes all
the data points are generated from a mix of Gaussian distributions
with unknown parameters.
 A Gaussian mixture model can be used for clustering, which is the
task of grouping a set of data points into clusters.
 In general, K-means will be faster and more accurate when the data
set is large and the clusters are well-separated. Gaussian mixture
models will be more accurate when the data set is small or the
clusters are not well-separated.
 GMM consists of two parts – mean vectors (μ) & covariance matrices
(Σ).
 A Gaussian distribution is defined as a continuous probability
distribution that takes on a bell-shaped curve. Another name for

Gaussian distribution is the normal distribution, as shown in figure 3.16.

Figure 3.16 Gaussian distribution


 Gaussian mixture models can handle missing data, whereas K-means
cannot.
 The probability density function of a Gaussian distribution in one
dimension is given by:
f(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
where μ is the mean and σ² is the variance.

Figure 3.17 2D bell-shaped curve


In the case of two variables, instead of a 2D bell-shaped curve, we will
have a 3D bell curve as shown in figure 3.18.

Figure 3.18 3D bell curve


The probability density function would be given by:
f(x | μ, Σ) = (1 / (2π |Σ|^(1/2))) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2
covariance matrix.
Thus, this multivariate Gaussian model would have x and μ as vectors of
length d, and Σ would be a d × d covariance matrix.

Expectation-Maximization Algorithm
 The mean and variance value for each Gaussian distribution are
determined using a technique called Expectation-Maximization (EM).
 Expectation-Maximization (EM) is a statistical algorithm for finding
the right model parameters. Use EM when the data has missing
values, or in other words, when the data is incomplete.
 These missing variables are called latent variables.
 Expectation-Maximization algorithm has two steps:
o E-step: In this step, the available data is used to estimate
(guess) the values of the missing variables
o M-step: Based on the estimated values generated in the E-step,
the complete data is used to update the parameters
 Expectation-Maximization is the base of many algorithms, including
Gaussian Mixture Models.

E-step:
 For each point xi, calculate the probability ric that it belongs to
cluster/distribution c1, c2, …, ck. This is done using the below
formula:
ric = πc N(xi | μc, Σc) / Σ over k of πk N(xi | μk, Σk)
where πc is the mixing weight (density) of cluster c and N(· | μ, Σ) is the Gaussian density.
 This value will be high when the point is assigned to the right cluster
and lower otherwise.
M-step:
1. The new density (mixing weight) of each cluster is defined by the ratio of the
(soft) number of points in the cluster to the total number of points:
πc = ( Σ over i of ric ) / N
2. The mean and the covariance matrix are updated based on the values
assigned to the distribution, in proportion to the probability values
for the data points. Hence, a data point that has a higher probability of
being a part of that distribution will contribute a larger portion:
μc = ( Σ over i of ric xi ) / ( Σ over i of ric )
Σc = ( Σ over i of ric (xi − μc)(xi − μc)ᵀ ) / ( Σ over i of ric )
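In practice the EM iterations above do not have to be coded by hand; scikit-learn's GaussianMixture runs them internally. The sketch below is illustrative only (the library and the synthetic two-blob data are assumptions).

# GMM / EM sketch (assumes scikit-learn and numpy); data is synthetic
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),   # blob 1
                  rng.normal(loc=[4, 4], scale=0.8, size=(100, 2))])  # blob 2

# fit() alternates the E-step (soft membership probabilities) and the M-step
# (re-estimating weights, means and covariances) until convergence.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

print(gmm.weights_)                 # mixing proportions (cluster densities)
print(gmm.means_)                   # mean vector of each Gaussian
print(gmm.predict_proba(data[:3]))  # soft memberships (E-step output)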

UNIT IV
NEURAL NETWORKS
Perceptron - Multilayer perceptron, activation functions, network training –
gradient descent optimization – stochastic gradient descent, error
backpropagation, from shallow networks to deep networks –Unit saturation (aka
the vanishing gradient problem) – ReLU, hyperparameter tuning, batch
normalization, regularization, dropout.

PART A
1. Define neuron.
A neuron is a cell in the brain whose principal function is the collection,
processing, and dissemination of electrical signals.

2. Define neural networks.


The brain's information-processing capacity is thought to emerge primarily
from networks of such neurons. For this reason, some of the earliest AI work
aimed to create artificial neural networks.

3. What are the two main categories of neural network structures?


 Acyclic or feed-forward networks
 Cyclic or recurrent networks

4. Define Perceptron.
A network with all the inputs connected directly to the outputs is called a
single layer neural network, or a perceptron network. Since each output unit
is independent of the others-each weight affects only one of the outputs.

5. Define Multi-Layer Perceptron.


 Multi-layer perceptron is also known as MLP. It consists of fully connected
dense layers, which transform any input dimension to the desired
dimension.

 A multi-layer perceptron is a neural network that has multiple
layers. To create a neural network we combine neurons together so
that the outputs of some neurons are inputs of other neurons.
6. Define Activation function.
In an artificial neural network, the function which takes the incoming
signals as input and produces the output signal is known as the activation
function.

7. List down the advantages of multilayer perceptron.


 It can be used to solve complex nonlinear problems.
 It handles large amounts of input data well.
 Makes quick predictions after training.
 The same accuracy ratio can be achieved even with smaller
samples.

8. How do you train a neural model?


1. First an ANN will require a random weight initialization
2. Split the dataset in batches (batch size)
3. Send the batches 1 by 1 to the GPU
4. Calculate the forward pass (what would be the output with the
current weights)
5. Compare the calculated output to the expected output (loss)
6. Adjust the weights (using the learning rate increment or decrement)
according to the backward pass (backward gradient propagation).
7. Go back to step 2.
9. List out the activation functions.
1. ReLU Function
2. Sigmoid Function
3. Linear Function
4. Tanh Function
5. Softmax Function

10. Define Gradient Descent Optimization.


 Gradient Descent is a generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.
 The general idea is to tweak parameters iteratively in order to
minimize the cost function.
 An important parameter of Gradient Descent (GD) is the size of the
steps, determined by the learning rate hyperparameter. If the
learning rate is too small, then the algorithm will have to go through
many iterations to converge, which will take a long time; if it is
too high, we may jump over the optimal value.

11. List the types of Gradient Descent.


1.Batch Gradient Descent
2.Stochastic Gradient Descent
3.Mini-batch Gradient Descent

12. Define Stochastic Gradient Descent (SGD).


 In Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
 In Gradient Descent, there is a term called “batch” which denotes the
total number of samples from a dataset that is used for calculating the
gradient for each iteration.
 In typical Gradient Descent optimization, like Batch Gradient Descent,
the batch is taken to be the whole dataset.

13. What are the advantages and disadvantages of SGD?


Advantages:
 Speed: SGD is faster than other variants of Gradient Descent.
 Memory Efficiency: it is memory-efficient and can handle large
datasets that cannot fit into memory.


 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and converge to a global minimum.
Disadvantages:
 Noisy updates: The updates in SGD are noisy and have a high
variance, which can make the optimization process less stable and
lead to oscillations around the minimum.
 Slow Convergence: SGD may require more iterations to converge
to the minimum since it updates the parameters for each training
example one at a time.
 Sensitivity to Learning Rate: The choice of learning rate can be
critical in SGD since using a high learning rate can cause the
algorithm to overshoot the minimum, while a low learning rate can
make the algorithm converge slowly.
 Less Accurate: Due to the noisy updates, SGD may not converge
to the exact global minimum and can result in a suboptimal
solution. This can be mitigated by using techniques such as
learning rate scheduling and momentum-based updates.

14. Define Backpropagation.


The neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural
network, we would update the weights and biases of the neurons on the basis
of the error at the output. This process is known as backpropagation.

15. List out the types of Backpropagation.


 Two Types of Backpropagation Networks are:
1. Static Back-propagation
2. Recurrent Backpropagation


 Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues like
optical character recognition.
 Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed
value is achieved. After that, the error is computed and propagated backward.

16. What are the advantages and disadvantages of error backropagation?


Advantages:
 It does not have any parameters to tune except for the number of inputs.
 It is highly adaptable and efficient and does not require any prior
knowledge about the network.
 It is a standard process that usually works well.
Disadvantages:
 The performance of backpropagation relies very heavily on the training
data.
 Backpropagation needs a very large amount of time for training.
 Backpropagation requires a matrix-based method instead of mini-batch.

17. Define Unit saturation (aka the vanishing gradient problem).


The vanishing gradient problem is a challenge that occurs during the
training of artificial neural networks. It happens when the gradients used to
update the network become very small or "vanish" as they are backpropagated
from the output layers to the earlier layers.

18. Define Rectified linear unit (ReLU).


 It Stands for Rectified linear unit. It is the most widely used activation
function. Chiefly implemented in hidden layers of Neural network. (see
the below figure).
 Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.


 Value Range :- [0, inf)


 Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the ReLU
function.

ReLU Function

 Uses :- ReLu is less computationally expensive than tanh and sigmoid


because it involves simpler mathematical operations. At a time only a few
neurons are activated making the network sparse making it efficient and
easy for computation.
 In simple words, RELU learns much faster than sigmoid and Tanh
function.

19. Define Normalization.


 Normalization is a data pre-processing tool used to bring the numerical
data to a common scale without distorting its shape.

20. Define Batch Normalization.


 Batch normalization is a process to make neural networks faster and
more stable through adding extra layers in a deep neural network. The
new layer performs the standardizing and normalizing operations on the
input of a layer coming from a previous layer. A typical neural network is
trained using a collected set of input data called batch.


21. Define Hyperparameter tuning.


 A Machine Learning model is defined as a mathematical model with a
number of parameters that need to be learned from the data. By
training a model with existing data, we are able to fit the model
parameters.
 However, there is another kind of parameter, known
as Hyperparameters, that cannot be directly learned from the regular
training process. They are usually fixed before the actual training
process begins. These parameters express important properties of the
model such as its complexity or how fast it should learn.
22. What are the two strategies of Hyperparameter tuning?
1. GridSearchCV
2. RandomizedSearchCV

23. Define GridSearchCV.


In GridSearchCV approach, the machine learning model is evaluated for
a range of hyperparameter values. This approach is called GridSearchCV,
because it searches for the best set of hyperparameters from a grid of
hyperparameters values.

24. Define RandomizedSearchCV.


RandomizedSearchCV solves the drawbacks of GridSearchCV, as it
goes through only a fixed number of hyperparameter settings. It moves within
the grid in a random fashion to find the best set of hyperparameters. This
approach reduces unnecessary computation.

25. Define Overfitting.


When a model performs well on the training data and does not perform well
on the testing data, then the model is said to have high generalization error. In
other words, in such a scenario, the model has low bias and high variance and is
too complex. This is called overfitting.


26. What is Regularization?


Regularization is one of the most important concepts of machine learning.
It is a technique to prevent the model from overfitting by adding extra information
to it. Regularization helps choose a simple model rather than a complex one.

27. List out the Regularization Technique.


1. Hints
2. Weight decay:
3. Ridge Regression (or) L2 Regularization.
4. Lasso Regression (or) L1 Regularization.
5. Dropout

28. Define L1 Regularization and L2 Regularization.


Lasso Regression (or) L1 Regularization.
 Least Absolute Shrinkage and Selection Operator (or LASSO)
Regression penalizes the coefficients to the extent that it becomes zero.
 It eliminates the insignificant independent variables. This
regularization technique uses the L1 norm for regularization.

Ridge Regression or L2 Regularization


 The Ridge regression technique is used to analyze the model where the
variables may be having multicollinearity.
 It reduces the insignificant independent variables though it does not
remove them completely. This type of regularization uses the L₂ norm for
regularization.


29. Define Dropout.


 "Dropout" in machine learning refers to the process of randomly
ignoring certain nodes in a layer during training.

30. Difference between Shallow and Deep neural network.


Shallow Network:
 A shallow neural network has only one hidden layer between the input and output layers.
 A shallow network might be used for simple tasks like image classification.
 It is computationally less expensive to train, and can be sufficient for simple tasks.
 It may not be powerful enough to capture complex patterns in the data.
Deep Network:
 A deep neural network has multiple hidden layers.
 A deep network might be used for more complex tasks like image segmentation or natural language processing.
 It is more computationally expensive to train and may require more data to avoid overfitting.
 It can capture more complex patterns in the data and potentially achieve higher accuracy.

31. Difference between Stochastic Gradient Descent and Gradient Descent.


Gradient Descent (Batch):
 Computes the gradient using the whole training set.
 Not suggested for huge training samples.
 Gives the optimal solution, given sufficient time to converge.
 Deterministic in nature.
 Convergence is slow.
Stochastic Gradient Descent:
 Computes the gradient using a single training sample at a time.
 Can be used for large training samples.
 Gives a good solution, but not necessarily the optimal one.
 Stochastic in nature.
 Reaches convergence much faster.

32. What is meant by Training set?


Training set is a set of pairs of input patterns with corresponding desired
output patterns. Each pair represents how the network is supposed to respond to
a particular input. The network is trained to respond correctly to each input
pattern from the training set.

33. What is meant by Test set?


The test set is the dataset on which the trained model is evaluated; it contains
examples that were not used during training and is used to measure how well the
model generalizes to unseen data.

34. Difference between Data Mining and Machine learning.


Data Mining:
 Extracting useful information from a large amount of data.
 Used to understand the data flow.
 Human interference is more in it.
 Works on huge databases with unstructured data.
 Uses historical data.
 Data mining is more of research using methods like machine learning.
Machine Learning:
 Introduces algorithms from data as well as from past experience.
 Teaches the computer to learn and understand from the data flow.
 No human effort is required after design.
 Uses existing data as well as algorithms.
 Uses historical and real-time data.
 Self-learned; trains the system to do the intelligent task.


35. Define Forward Pass.


Forward Propagation is the way to move from the Input layer (left) to the
Output layer (right) in the neural network. A neural network can be understood
by a collection of connected input/output nodes.

36. Define Backward Pass.


 In the backward pass, the flow is reversed so that we start by propagating
the error to the output layer until reaching the input layer passing through
the hidden layer(s).
 The process of propagating the network error from the output layer to the
input layer is called backward propagation, or simple backpropagation.

37. Define Tanh Function.


The activation that works almost always better than sigmoid function is
Tanh function also known as Tangent Hyperbolic function. It’s actually
mathematically shifted version of the sigmoid function. Both are similar and
can be derived from each other.
Equation :- f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Value Range:- -1 to +1
Nature:- non-linear
Uses:- Usually used in hidden layers of a neural network as it’s values
lies between -1 to 1 hence the mean for the hidden layer comes out be 0
or very close to it, hence helps in centering the data by bringing mean
close to 0. This makes learning for the next layer much easier.


Tanh Function
38. Define Sigmoid Function.
 It is a function which is plotted as ‘S’ shaped graph.
 Equation : A = 1/(1 + e^(-x))
 Nature : Non-linear. Notice that X values lies between -2 to 2, Y
values are very steep. This means, small changes in x would also
bring about large changes in the value of Y.
 Value Range : 0 to 1
 Uses : Usually used in output layer of a binary classification,
where result is either 0 or 1, as value for sigmoid function lies
between 0 and 1 only so, result can be predicted easily to be 1 if
value is greater than 0.5 and 0 otherwise.

39. What is Leaky ReLU?


In the leaky ReLU, the output is also linear on the negative side but with a
smaller slope, just enough to make sure that there will be updates for negative
activations, albeit small:
f(a) = a if a > 0, and αa otherwise, where α is a small constant such as 0.01.

40. What is Artificial Neuron?


An artificial neuron is a connection point in an artificial neural network.
Artificial neural networks, like the human body's biological neural network, have
a layered architecture and each network node (connection point) has the
capability to process input and forward output to other nodes in the network.

41. What is meant by Feed forward neural network?


 "The process of receiving an input to produce some kind of output to make
some kind of prediction is known as Feed Forward."
 Feed Forward neural network is the core of many other important neural
networks such as convolution neural network.

42. Define Bias.


Neural network bias can be defined as the constant which is added to the
product of features and weights. It is used to offset the result. It helps the models
to shift the activation function towards the positive or negative side.


PART B

1. Explain in detail about Perceptron.


 The perceptron is the basic processing element.
 It has inputs that may come from the environment or may be the
outputs of other perceptrons.
 Associated with each input, xj ∈ ℝ, j = 1, … , d, is a connection weight,
or synaptic weight, wj ∈ ℝ, and the output, y, in the simplest case is
a weighted sum of the inputs (see Figure 4.1):
y = Σj wj xj + w0

 w0 is the intercept value to make the model more general

Figure 4.1 Simple Perceptron


 We can write the output of the perceptron as a dot product
Y=wTx
 where w = [w0, w1, … , wd]T and x = [1, x1, … , xd]T are
augmented vectors to include also the bias weight and input.
 During testing, with given weights, w, for input x, we compute the
output y. To implement a given task, we need to learn the weights
w, the parameters of the system, such that correct outputs are
generated given the inputs.
 When d = 1 and x is fed from the environment through an input
unit, we have

y = w x + w0
where w as the slope and w0 as the intercept
 Thus this perceptron with one input and one output can be used to
implement a linear fit. With more than one input, the line becomes
a (hyper)plane, and the perceptron with more than one input can be
used to implement multivariate linear fit.
 The perceptron as defined in equation 1 defines a hyperplane and
as such can be used to divide the input space into two:
o the half-space where it is positive and the half-space
where it is negative
 By using it to implement a linear discriminant function, the
perceptron can separate two classes by checking the sign of the
output. If we define s(·) as the threshold function
s(a) = 1 if a > 0, and 0 otherwise,
then we can choose C1 if s(wᵀx) = 1 and C2 otherwise.

Figure 4.2 K parallel perceptrons


From the Figure 4.2,
 xj, j = 0, … , d are the inputs and yi, i = 1, … , K are the outputs. wij is the
weight of the connection from input xj to output yi. Each output is a


weighted sum of the inputs. When used for a K-class classification problem,
there is a postprocessing to choose the maximum, or softmax if we need the
posterior probabilities.
 It is assumed that a hyperplane wᵀx = 0 can be found that separates xᵗ ∈
C1 and xᵗ ∈ C2. If at a later stage we need the posterior probability (for
example, to calculate risk), we need to use the sigmoid function at the
output as
y = sigmoid(wᵀx) = 1 / (1 + exp(-wᵀx))
 When there are K > 2 outputs, there are K perceptrons, each of which has
a weight vector wi (see Figure 4.2)

 where wij is the weight from input xj to output yi. W is the K ×(d + 1) weight
matrix of wij whose rows are the weight vectors of the K perceptrons. When
used for classification, during testing, we

 Each perceptron is a local function of its inputs and synaptic weights. In


classification, if we need the posterior probabilities and use the softmax,
we need the values of all outputs.
 Implementing this as a neural network results in a two-stage process,
where the first calculates the weighted sums, and the second calculates
the softmax values; but we denote this as a single layer:


 a linear transformation from a d dimensional space to a K-dimensional


space and can also be used for dimensionality reduction if K < d. we have
a two-layer network where the first layer of perceptrons implements the
linear transformation and the second layer implements the linear
regression or classification in the new space.
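The following short NumPy sketch (an illustrative example, not from the textbook; the input values and weights are made up) computes the perceptron output y = wᵀx with an augmented input vector and thresholds it to choose between two classes:

import numpy as np

# Augmented input vector: x0 = 1 accounts for the bias weight w0
x = np.array([1.0, 2.0, -1.5])      # [1, x1, x2]
w = np.array([0.5, 1.2, -0.7])      # [w0, w1, w2]

# Weighted sum (dot product): y = w^T x
y = np.dot(w, x)

# Threshold function s(.): choose C1 if the output is positive, C2 otherwise
label = "C1" if y > 0 else "C2"
print("weighted sum =", y, "-> class", label)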

2. Explain in detail about Multi Layer Perceptron.


 Multi-layer perceptron is also known as MLP. It consists of fully connected
dense layers, which transform any input dimension to the desired
dimension.
 A multi-layer perceptron is a neural network that has multiple
layers. To create a neural network, we combine neurons together so
that the outputs of some neurons are inputs of other neurons.
 A multi-layer perceptron has one input layer and for each input,
there is one neuron (or node), it has one output layer with a single
node for each output and it can have any number of hidden layers
and each hidden layer can have any number of nodes. A schematic
diagram of a Multi-Layer Perceptron (MLP) is shown in figure 4.3.

Figure 4.3 A Multi-Layer Perceptron (MLP)

 In the multi-layer perceptron diagram, we can see that there are


three inputs and thus three input nodes and the hidden layer has
three nodes.

 The output layer gives two outputs, therefore there are two
output nodes. The nodes in the input layer take input and
forward it for further process, in the diagram above the nodes in
the input layer forwards their output to each of the three nodes in
the hidden layer, and in the same way, the hidden layer
processes the information and passes it to the output layer.
 Every node in the multi-layer perceptron uses a sigmoid
activation function. The sigmoid activation function takes real
values as input and converts them to numbers between 0 and 1
using the sigmoid formula:
σ(x) = 1/(1 + exp(-x))
 A perceptron that has a single layer of weights can only approximate
linear functions of the input and cannot solve problems like the
XOR, where the discriminant to be estimated is nonlinear.
 Similarly, a perceptron cannot be used for nonlinear regression.
This limitation does not apply to feedforward networks with an
intermediate or hidden layer between the input and the output
layers.
 If used for classification, such multilayer perceptrons (MLP) can
implement nonlinear discriminants and, if used for regression, can
approximate nonlinear functions of the input.

Figure 4.4 XOR problem is not linearly separable.


From the figure 4.4, We cannot draw a line where the empty circles are on one
side and the filled circles on the other side.
 Input x is fed to the input layer (including the bias),
the “activation” propagates in the forward direction,
and the values of the hidden units zh are calculated
(see Figure 4.5). Each hidden unit is a perceptron
by itself and applies the nonlinear sigmoid function
to its weighted sum:
zh = sigmoid(whᵀx) = 1 / [1 + exp(-(Σj whj xj + wh0))], h = 1, … , H
 The outputs yi are perceptrons in the second layer
taking the hidden units as their inputs:
yi = viᵀz = Σh vih zh + vi0
 Where there is also a bias unit in the hidden layer,


which we denote by z0, and vi0 are the bias weights.
The input layer of xj is not counted since no
computation is done there and when there is a
hidden layer, this is a two-layer network.
 In a two-class discrimination task, there is one
sigmoid output unit and when there are K > 2
classes, there are K outputs with softmax as the
output nonlinearity.


Figure 4.5 The structure of a multilayer perceptron.


 From the figure 4.5 xj, j = 0,… , d are the inputs and zh, h = 1, … , H
are the hidden units where H is the dimensionality of this hidden
space. z0 is the bias of the hidden layer. yi, i = 1, … , K are the
output units. whj are weights in the first layer, and vih are the
weights in the second layer.
 The MLP that implements XOR with two hidden units that
implement the two ANDs and the output that takes an OR of them.

 One is not limited to having one hidden layer, and more hidden layers
with their own incoming weights can be placed after the first hidden
layer with sigmoid hidden units, thus calculating nonlinear functions of


the first layer of hidden units and implementing more complex functions of
the inputs.
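A minimal sketch of the forward pass described above, assuming one hidden layer of sigmoid units and softmax outputs (the layer sizes, input instance, and random weights are illustrative only):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, H, K = 3, 4, 2                       # inputs, hidden units, outputs

x = np.array([1.0, 0.5, -0.3])          # one input instance
W = rng.normal(size=(H, d + 1))         # first-layer weights whj (bias column included)
V = rng.normal(size=(K, H + 1))         # second-layer weights vih (bias column included)

# Hidden units: zh = sigmoid(wh^T [1, x])
z = sigmoid(W @ np.concatenate(([1.0], x)))

# Outputs: oi = vi^T [1, z], then softmax to get class posteriors
o = V @ np.concatenate(([1.0], z))
y = np.exp(o) / np.sum(np.exp(o))
print("hidden units:", z)
print("posteriors:", y)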

3. Explain in Detail about Activation function.


 Artificial neurons are elementary units in an artificial neural network.
The artificial neuron receives one or more inputs and sums them to
produce an output. Each input is separately weighted, and the sum is
passed through a function known as an activation function or transfer
function.
 In an artificial neural network, the function which takes the incoming
signals as input and produces the output signal is known as the
activation function.
 Neuron should be activated or not by calculating the weighted sum
and further adding bias to it. The purpose of the activation function
is to introduce non-linearity into the output of a neuron.
 The neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a
neural network, we would update the weights and biases of the
neurons on the basis of the error at the output. This process is
known as Backpropagation.
 Activation functions make the back-propagation possible since the
gradients are supplied along with the error to update the weights
and biases.
Need for Non-linear activation function
 A neural network without an activation function is essentially
just a linear regression model. The activation function does the
non-linear transformation to the input making it capable to learn
and perform more complex tasks.
 The two main categories of activation functions are:
1. Linear Activation Function
2. Non-linear Activation Functions


Linear Activation Function

Figure 4.6 Linear Activation Function


Non-linear Activation Functions
1. Linear Function
 A linear function has an equation similar to that of a straight line,
i.e. y = x (see Figure 4.6).
 No matter how many layers we have, if all are linear in nature,
the final activation function of last layer is nothing but just a
linear function of the input of first layer.
 Range : -inf to +inf
 Uses : Linear activation function is used at just one place i.e.
output layer.
2. Sigmoid Function
 It is a function which is plotted as ‘S’ shaped graph (Refer Figure
4.7).
 Equation : A = 1/(1 + e^(-x))
 Nature : Non-linear. Notice that X values lies between -2 to 2, Y
values are very steep. This means, small changes in x would also
bring about large changes in the value of Y.
 Value Range : 0 to 1
 Uses : Usually used in output layer of a binary classification,
where result is either 0 or 1, as value for sigmoid function lies


between 0 and 1 only, so the result can be predicted easily to be 1 if the
value is greater than 0.5 and 0 otherwise.

Figure 4.7 Sigmoid Function


3. Tanh Function
 The activation that works almost always better than sigmoid
function is Tanh function also known as Tangent Hyperbolic
function. It’s actually mathematically shifted version of the
sigmoid function. Both are similar and can be derived from
each other (see Figure 4.8).


Figure 4.8 Tanh Function


 Equation :- f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
 Value Range :- -1 to +1
 Nature :- non-linear
 Uses :- Usually used in hidden layers of a neural network as it’s values
lies between -1 to 1 hence the mean for the hidden layer comes out be 0
or very close to it, hence helps in centering the data by bringing mean
close to 0. This makes learning for the next layer much easier.
4. ReLU Function
 It Stands for Rectified linear unit. It is the most widely used
activation function. Chiefly implemented in hidden layers of
Neural network.


Figure 4.9 ReLU FUNCTION

 Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0
otherwise (see Figure 4.9).
 Value Range :- [0, inf)
 Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the
ReLU function.
 Uses :- ReLu is less computationally expensive than tanh and
sigmoid because it involves simpler mathematical operations. At a
time only a few neurons are activated making the network sparse
making it efficient and easy for computation.
 In simple words, RELU learns much faster than sigmoid and Tanh
function.
5. Softmax Function
 It is a generalization of the sigmoid function; the softmax function comes in
handy when dealing with multiclass classification issues.
 Used frequently when managing several classes. In the output nodes of
image classification issues, the softmax was typically present. The
softmax function would split by the sum of the outputs and squeeze all
outputs for each category between 0 and 1.
 The output unit of the classifier, where we are actually attempting to
obtain the probabilities to determine the class of each input, is where
the softmax function is best applied.
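The activation functions listed above can be written in a few lines of NumPy (an illustrative sketch; the test vector is arbitrary):

import numpy as np

def linear(x):                  # y = x
    return x

def sigmoid(x):                 # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # squashes values into (-1, 1)
    return np.tanh(x)

def relu(x):                    # max(0, x)
    return np.maximum(0.0, x)

def softmax(x):                 # outputs are positive and sum to 1
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

a = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
for f in (linear, sigmoid, tanh, relu, softmax):
    print(f.__name__, f(a))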


4. Discuss in detail about how the network is trained.


 A biological neuron is composed of multiple dendrites, a nucleus and
an axon. When a stimulus is sent to the brain, it is received through
the synapse located at the extremity of the dendrite.
 When a stimulus arrives at the brain it is transmitted to the neuron via
the synaptic receptors, which adjust the strength of the signal sent
to the nucleus. This message is transported by the dendrites to
the nucleus, to then be processed in combination with other signals
emanating from other receptors on the other dendrites. Thus the
combination of all these signals takes place in the nucleus.
 After processing all these signals, the nucleus will emit an output
signal through its single axon. The axon will then stream this signal
to several other downstream neurons via its axon terminations. Thus
a neuron analysis is pushed in the subsequent layers of neurons.
 On the other hand, artificial neural networks are built on the
principle of bio-mimicry. External stimuli (the data), whose signal
strength is adjusted by the neuronal weights, circulates to the
neuron via the dendrites. The result of the calculation – called
the output – is then re-transmitted (via the axon) to several other
neurons and then subsequent layers are combined, and so on.(see in
Figure 4.10 & 4.11).

Figure 4.10 Biological Neuron


Figure 4.11 Artificial Neuron


Example:
1. Decide on the number of output classes (meaning the number of image
classes – for example two for cat vs dog)
2. Draw as many computation units as the number of output
classes (congratulations, you have just created the Output Layer of the ANN)
3. Add as many Hidden Layers as needed within the defined architecture.
4. Stack those Hidden Layers to the Output Layer using Neural
Connections
5. It is important to understand that the Input Layer is basically a layer of
data ingestion
6. Add an Input Layer that is adapted to ingest your data
7. Assemble many Artificial Neurons together in a way where
the output (axon) an Neuron on a given Layer is (one) of the input of
another Neuron on a subsequent Layer. As a consequence, the Input
Layer is linked to the Hidden Layers which are then linked to the Output
Layer using Neural Connections (also shown in Figure 4.12).


Figure 4.12 A Neural Network Architecture


Train an Artificial Neural Network
 All Neurons of a given Layer are generating an Output, but they don’t
have the same Weight for the next Neurons Layer. This means that if a
Neuron on a layer observes a given pattern it might mean less for the
overall picture and will be partially or completely muted. This is
called Weighting.
 A big weight means that the Input is important and of course a small
weight means that we should ignore it. Every Neural
Connection between Neurons will have an associated Weight.
 Weights will be adjusted over the training to fit the objectives we have set
(recognize that a dog is a dog and that a cat is a cat).
 In simple terms: Training a Neural Network means finding the
appropriate Weights of the Neural Connections thanks to a feedback loop
called Gradient Backward propagation .
Steps to Training an Artificial Neural Network
1. First an ANN will require a random weight initialization
2. Split the dataset in batches (batch size)
3. Send the batches 1 by 1 to the GPU


4. Calculate the forward pass (what would be the output with the current
weights)
5. Compare the calculated output to the expected output (loss)
6. Adjust the weights (using the learning rate increment or decrement)
according to the backward pass (backward gradient propagation).
7. Go back to step 2
5. Discuss in detail about Gradient descent optimization Algorithm.
 Gradient Descent is a generic optimization algorithm capable of finding
optimal solutions to a wide range of problems.
 The general idea is to tweak parameters iteratively in order to minimize the
cost function.
 An important parameter of Gradient Descent (GD) is the size of the steps,
determined by the learning rate hyperparameters. If the learning rate is too
small, then the algorithm will have to go through many iterations to
converge, which will take a long time, and if it is too high we may jump the
optimal value.
Types of Gradient Descent:
 Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
Batch Gradient Descent involves calculations over the full training
set at each step as a result of which it is very slow on very large
training data. Thus, it becomes very computationally expensive to do
Batch GD.
2. Stochastic Gradient Descent
In SGD, only one training example is used to compute the gradient
and update the parameters at each iteration. This can be faster than
batch gradient descent but may lead to more noise in the updates.
3. Mini-batch Gradient Descent
In mini-batch gradient descent, a small batch of training examples
is used to compute the gradient and update the parameters at each
iteration. This can be a good compromise between batch gradient


descent and SGD, as it can be faster than batch gradient descent and
less noisy than SGD.
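As a concrete illustration of the batch variant described above, the sketch below fits a one-variable linear model with batch gradient descent; the toy data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = 2 * x + 1 + 0.05 * rng.normal(size=100)

w, b = 0.0, 0.0          # parameters to learn
eta = 0.5                # learning rate (step size)

for step in range(1000):
    error = (w * x + b) - y
    # Gradients of the mean squared error over the FULL training set
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= eta * grad_w
    b -= eta * grad_b

print("learned w, b:", w, b)   # should end up close to 2 and 1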
6. Explain in detail about Stochastic gradient descent .
Stochastic Gradient Descent:
 In Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
 In Gradient Descent, there is a term called “batch” which denotes the total
number of samples from a dataset that is used for calculating the gradient
for each iteration.
 In typical Gradient Descent optimization, like Batch Gradient Descent, the
batch is taken to be the whole dataset. Although using the whole dataset is
really useful for getting to the minima in a less noisy and less random
manner, the problem arises when our dataset gets big.
 Suppose, you have a million samples in your dataset, so if you use a
typical Gradient Descent optimization technique, you will have to use all of
the one million samples for completing one iteration while performing the
Gradient Descent, and it has to be done for every iteration until the
minima are reached. Hence, it becomes computationally very expensive to
perform.
 This problem is solved by Stochastic Gradient Descent. In SGD, it uses
only a single sample, i.e., a batch size of one, to perform each iteration.
 The sample is randomly shuffled and selected for performing the iteration.
SGD Algorithm

 In SGD, we find out the gradient of the cost function of a single example
at each iteration instead of the sum of the gradient of the cost function of
all the examples.
 In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm. But that


doesn’t matter all that much because the path taken by the algorithm
does not matter, as long as we reach the minima and with a significantly
shorter training time (see in Figure 4.13 & 4.14).

Figure 4.13 The path taken by Batch Gradient Descent

Figure 4.14 The path taken by Stochastic Gradient Descent


 SGD is generally noisier than typical Gradient Descent; it usually takes a
higher number of iterations to reach the minima because of the
randomness in its descent.
 Even though it requires a higher number of iterations to reach the
minima than typical Gradient Descent, it is still computationally much
less expensive than typical Gradient Descent. Hence, in most scenarios,
SGD is preferred over Batch Gradient Descent for optimizing a learning
algorithm.
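In contrast to the batch update shown earlier, the following sketch performs the SGD update after every single randomly chosen sample (again an illustrative example with made-up data, learning rate, and epoch count):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 2 * x + 1 + 0.05 * rng.normal(size=100)

w, b = 0.0, 0.0
eta = 0.1

for epoch in range(50):
    for i in rng.permutation(len(x)):     # shuffle, then take one sample at a time
        error = (w * x[i] + b) - y[i]
        w -= eta * 2 * error * x[i]       # gradient computed from a single example
        b -= eta * 2 * error

print("learned w, b:", w, b)              # noisier path, but a similar final result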


Advantages:
 Speed: SGD is faster than other variants of Gradient Descent.
 Memory Efficiency:it is memory-efficient and can handle large datasets
that cannot fit into memory.
 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and converge to a global minimum.
Disadvantages:
 Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
 Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one
at a time.
 Sensitivity to Learning Rate: The choice of learning rate can be critical
in SGD since using a high learning rate can cause the algorithm to
overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.
 Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates.
7. Explain in detail about error backpropagation.
Backpropagation
 Backpropagation is one of the important concepts of a neural network.
For a single training example, the backpropagation algorithm calculates the
gradient of the error function.
 Backpropagation algorithms are a set of methods used to efficiently
train artificial neural networks following a gradient descent approach
which exploits the chain rule.
 The main features of Backpropagation are the iterative, recursive
and efficient method through which it calculates the updated weight to


improve the network until it is not able to perform the task for which it
is being trained.
 The derivatives of the activation function need to be known at network
design time for backpropagation.
How Backpropagation Algorithm Works?
 The backpropagation algorithm in a neural network computes the gradient
of the loss function for a single weight by the chain rule. It efficiently
computes one layer at a time, unlike a native direct computation. It
computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule (see Figure 4.16).

Figure 4.16 Back propagation neural network


1. Inputs X, arrive through the preconnected path
2. Input is modeled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
4. Calculate the error in the outputs
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.


6. Keep repeating the process until the desired output is achieved


Types of Backpropagation Networks
 Two Types of Backpropagation Networks are:
o Static Back-propagation
o Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues like
optical character recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed
value is achieved. After that, the error is computed and propagated backward.
Advantages:
 It does not have any parameters to tune except for the number of inputs.
 It is highly adaptable and efficient and does not require any prior
knowledge about the network.
 It is a standard process that usually works well.
Disadvantages:
 The performance of backpropagation relies very heavily on the training
data.
 Backpropagation needs a very large amount of time for training.
 Backpropagation requires a matrix-based method instead of mini-batch.
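The steps above can be seen in a small self-contained sketch that trains a one-hidden-layer sigmoid network on the XOR problem (the hidden-layer size, learning rate, and epoch count are illustrative choices, not a prescribed setup):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR training data: 4 patterns, 2 inputs, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                                 # learning rate

for epoch in range(20000):
    # Forward pass: input layer -> hidden layer -> output layer
    z = sigmoid(X @ W1 + b1)
    out = sigmoid(z @ W2 + b2)

    # Backward pass: propagate the output error toward the input layer
    d_out = (out - y) * out * (1 - out)   # delta at the output unit
    d_z = (d_out @ W2.T) * z * (1 - z)    # delta at the hidden units

    # Adjust the weights and biases (gradient descent step)
    W2 -= eta * z.T @ d_out
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_z
    b1 -= eta * d_z.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically approaches [0, 1, 1, 0]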

8. Explain in detail about Unit saturation (aka the vanishing gradient problem).
 The vanishing gradient problem is an issue that sometimes arises when
training machine learning algorithms through gradient descent. This most
often occurs in neural networks that have several neuronal layers such as
in a deep learning system, but also occurs in recurrent neural networks.
 The key point is that the calculated partial derivatives used to compute the
gradient become smaller and smaller as one goes deeper into the network.
Since the gradients control how much the network learns during training,
if the gradients are very small


or zero, then little to no training can take place, leading to poor predictive
performance.
The problem:
 As more layers using certain activation functions are added to neural
networks, the gradients of the loss function approaches zero, making the
network hard to train.
Why:
 Certain activation functions, like the sigmoid function, squish a large
input space into a small output space between 0 and 1. Therefore, a large
change in the input of the sigmoid function will cause a small change in
the output. Hence, the derivative becomes small.
The sigmoid function and its derivative
 As an example, the below Figure 4.17 is the sigmoid function and its
derivative. Note how when the inputs of the sigmoid function become
larger or smaller (when |𝑥| becomes bigger), the derivative becomes close
to zero.

Figure 4.17 The sigmoid function and its derivative


Why it's significant:
 For a shallow network with only a few layers that use these activations, this
isn't a big problem. However, when more layers are used, it can cause the
gradient to be too small for training to work effectively. Gradients of neural


networks are found using backpropagation. Simply put, backpropagation
finds the derivatives of the network by moving layer by layer from the final
layer to the initial one. By the chain rule, the derivatives of each layer are
multiplied down the network (from the final layer to the initial) to compute
the derivatives of the initial layers.
 However, when n hidden layers use an activation like the sigmoid function,
n small derivatives are multiplied together. Thus, the gradient decreases
exponentially as we propagate down to the initial layers. A small gradient
means that the weights and biases of the initial layers will not be updated
effectively with each training session. Since these initial layers are often
crucial to recognizing the core elements of the input data, it can lead to
overall inaccuracy of the whole network.
Solution:
 The simplest solution is to use other activation functions, such as ReLU,
which doesn't cause a small derivative. Residual networks are another
solution, as they provide residual connections straight to earlier layers.
 The residual connection directly adds the value at the beginning of the
block, x, to the end of the block (F(x) + x). This residual connection doesn't
go through activation functions that “squash” the derivatives, resulting
in a higher overall derivative of the block (see Figure 4.18).

Figure 4.18 A residual block


 Finally, batch normalization layers can also resolve the issue. As stated
before, the problem arises when a large input space is mapped to a small
one, causing the derivatives to disappear.
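A quick numerical illustration of the effect (values are illustrative only): the sigmoid derivative is at most 0.25, so multiplying one such factor per layer through the chain rule makes the gradient shrink roughly exponentially with depth.

import numpy as np

def sigmoid_grad(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1 - s)              # maximum value is 0.25 at a = 0

grad = 1.0
for layer in range(1, 11):          # 10 layers, each contributing one factor
    grad *= sigmoid_grad(2.0)       # derivative at a moderately large activation
    print("after layer", layer, ": gradient factor =", grad)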
9. Explain in detail about Rectified linear unit (ReLU).
 It Stands for Rectified linear unit. It is the most widely used activation
function. Chiefly implemented in hidden layers of Neural network. (see
the below figure 4.19)

Figure 4.19 RELU FUNCTION


 Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.
 Value Range :- [0, inf)
 Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the ReLU
function.
 Uses :- ReLu is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At a time only a
few neurons are activated making the network sparse making it efficient
and easy for computation.
 In simple words, RELU learns much faster than sigmoid and Tanh
function.
 An activation function for hidden units that has become popular recently
with deep networks is the rectified linear unit (ReLU), which is defined as


Though ReLU is not differentiable at a = 0, we use it anyway; we use the left
derivative:
d ReLU(a)/da = 1 if a > 0, and 0 otherwise.
Leaky ReLU
In the leaky ReLU, the output is also linear on the negative side but with a
smaller slope, just enough to make sure that there will be updates for negative
activations, albeit small:
f(a) = a if a > 0, and αa otherwise, where α is a small constant such as 0.01.
Advantage:
 It does not saturate (unlike sigmoid and tanh), so updates can still be done
for large positive a. For some inputs, some hidden unit activations will be
zero, meaning that we will have a sparse representation.
 Sparse representations lead to faster training.

Disadvantage:
 The derivative is zero for a ≤ 0, there is no further training if, for a hidden
unit, the weighted sum somehow becomes negative. This implies that one
should be careful in initializing the weights so that the initial activation for
all hidden units is positive.
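A short sketch of ReLU and leaky ReLU together with their derivatives (the leak factor 0.01 is a common but arbitrary choice):

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):                      # left derivative at a = 0 is taken as 0
    return (a > 0).astype(float)

def leaky_relu(a, alpha=0.01):         # small slope on the negative side
    return np.where(a > 0, a, alpha * a)

def leaky_relu_grad(a, alpha=0.01):    # gradient never becomes exactly zero
    return np.where(a > 0, 1.0, alpha)

a = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(a), relu_grad(a))
print(leaky_relu(a), leaky_relu_grad(a))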


10. Explain in detail about Batch Normalization.


Normalization
 Normalization is a data pre-processing tool used to bring the numerical
data to a common scale without distorting its shape.

Batch normalization
 Batch normalization is a process to make neural networks faster and
more stable through adding extra layers in a deep neural network. The
new layer performs the standardizing and normalizing operations on the
input of a layer coming from a previous layer. A typical neural network is
trained using a collected set of input data called batch. Similarly, the
normalizing process in batch normalization takes place in batches, not as
a single input.
 A similar case can also be made for the hidden units, and this is the idea
behind batch normalization.
 We normalize hidden unit values before applying the activation function,
such as sigmoid or ReLU. Let us call that weighted sum aj. For each batch
or minibatch, for each hidden unit j we calculate the mean mj and
standard deviation sj of its values, and we first z-normalize:

 We can then map these to have arbitrary mean γj and scale βj and then we
apply the activation function.

 First, mj and sj are calculated anew for each batch, and we see
immediately that batch normalization is not meaningful with online
learning or very small minibatches.
 Second, γj and βj are parameters that are initialized and updated (after
each batch or minibatch) using gradient descent, just like the connection
weights. So they require extra memory and computation.

 These new parameters allow each hidden unit to have its arbitrary (but
learned) mean and standard deviation for its activation.
 Once learning is done, we can go over the whole training data (or a large
enough subset) and calculate mj and sj to use during testing, when we see
one instance at a time.
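A minimal sketch of this computation for one hidden unit over a minibatch (the batch values, γj, and βj below are illustrative; a real layer also keeps running estimates of mj and sj for use at test time):

import numpy as np

a_j = np.array([2.0, 3.5, 1.0, 4.5, 3.0])   # unit j's weighted sums over one minibatch
eps = 1e-5                                  # avoids division by zero

m_j = a_j.mean()                            # batch mean
s_j = a_j.std()                             # batch standard deviation

a_hat = (a_j - m_j) / (s_j + eps)           # z-normalize
gamma_j, beta_j = 1.5, 0.1                  # learned scale and shift parameters
a_tilde = gamma_j * a_hat + beta_j          # values passed on to the activation function

print(a_hat)
print(a_tilde)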
Why batch normalization?
 An internal covariate shift occurs when there is a change in the input
distribution to our network. When the input distribution changes, hidden
layers try to learn to adapt to the new distribution. This slows down the
training process. If a process slows down, it takes a long time to converge
to a global minimum.
Example: Suppose we are training an image classification model, that classifies
the images into Dog or Not Dog. Let’s say we have the images of white dogs only,
these images will have certain distribution as well. Using these images model will
update its parameters.
Later, if we get a new set of images, consisting of non-white dogs. These
new images will have a slightly different distribution from the previous images.
Now the model will change its parameters according to these new images. Hence
the distribution of the hidden activation will also change. This change in hidden
activation is known as an internal covariate shift. Here batch normalization
works.
Advantages of Batch Normalization
 Speed Up the Training
 Handles internal covariate shift
 The model is less delicate to hyperparameter tuning.
 Batch normalization smoothens the loss function that in turn by
optimizing the model parameters improves the training speed of the model.

11. Explain in detail about hyperparameter tuning.

Hyperparameter tuning


Hyperparameter tuning is a machine learning process that involves finding


the best hyperparameters for a learning algorithm. Hyperparameters are settings
that control the learning process of a model. Some examples of model
hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2
regularization
2. The learning rate for training a neural network.
3. The C and sigma hyperparameters for support vector
machines.
4. The k in k-nearest neighbors.
 Models can have many hyperparameters and finding the best
combination of parameters can be treated as a search problem. The two
best strategies for Hyperparameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV

 GridSearchCV
 In GridSearchCV approach, the machine learning model is evaluated
for a range of hyperparameter values. This approach is called
GridSearchCV, because it searches for the best set of
hyperparameters from a grid of hyperparameters values.
 For example, if we want to set two hyperparameters C and Alpha of
the Logistic Regression Classifier model, with different sets of values.
The grid search technique will construct many versions of the model
with all possible combinations of hyperparameters and will return the
best one.
 As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2,
0.3, 0.4], the combination of C = 0.3 and Alpha = 0.2 gives the highest
performance score of 0.726, and is therefore selected (see Figure 4.20).


Figure 4.20 GridSearchCV

The following code illustrates how to use GridSearchCV


# Necessary imports
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating the logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object (5-fold cross-validation)
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)   # X, y: training features and labels (assumed to be already loaded)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334
Drawback:
GridSearch CV will go through all the intermediate combinations of
hyperparameters which makes grid search computationally very expensive.


 RandomizedSearchCV
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes
through only a fixed number of hyperparameter settings. It moves within
the grid in a random fashion to find the best set of hyperparameters. This
approach reduces unnecessary computation.
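A sketch of RandomizedSearchCV in the same style as the GridSearchCV code above (the dataset, candidate parameter values, and n_iter are illustrative assumptions, not a prescribed setup):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)            # any labelled dataset would do

# Candidate C values; only n_iter of them are tried, chosen at random
param_dist = {'C': np.logspace(-5, 8, 15)}

logreg = LogisticRegression(max_iter=1000)
logreg_rcv = RandomizedSearchCV(logreg, param_dist, n_iter=5, cv=5, random_state=0)
logreg_rcv.fit(X, y)

print("Tuned parameters: {}".format(logreg_rcv.best_params_))
print("Best score is {}".format(logreg_rcv.best_score_))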

12. Explain in detail about Regularization.


Overfitting:
 When a model performs well on the training data and does not perform
well on the testing data, then the model is said to have high generalization
error. In other words, in such a scenario, the model has low bias and high
variance and is too complex. This is called overfitting.
 Overfitting means that the model is a good fit on the training data but not
on the testing data. Overfitting is also a result of the model being too complex.
Regularization:
 Regularization is one of the most important concepts of machine learning.
It is a technique to prevent the model from overfitting by adding extra
information to it. Regularization helps choose a simple model rather than
a complex one.
 Generalization error is "a measure of how accurately an algorithm can
predict outcome values for previously unseen data." Regularization refers
to the modifications that can be made to a leaming algorithm that helps to
reduce this generalization error and not the training error.
The commonly used regularization techniques are:
1. Hints
 Hints are properties of the target function that are known to us
independent of the training examples

Figure 4.21 HINTS


 The identity of the object does not change when it is translated,


rotated, or scaled. Note that this may not always be true, or may be
true up to a point: ‘b’ and ‘q’ are rotated versions of each other.
These are hints that can be incorporated into the learning process
to make learning easier.
 In image recognition, there are invariance hints: The identity of an
object does not change when it is rotated, translated, or scaled (see
Figure 4.21). Hints are auxiliary information that can be used to
guide the learning process and are especially useful when the
training set is limited.
 There are different ways in which hints can be used:
 Hints can be used to create virtual examples.
 The hint may be incorporated into the network structure.
2. Weight decay:
 Incentivize the network to use smaller weights by adding a penalty
to the loss function.
 Even if we start with a weight close to zero, because of some noisy
instances, it may move away from zero; the idea in weight decay is
to add some small constant background force that always pulls a
weight toward zero, unless it is absolutely necessary that it be large
(in magnitude) to decrease error. For any weight wi, the update rule
is
Δwi = -η ∂E/∂wi - λwi
where η is the learning rate and λ determines the strength of the decay.
3. Ridge Regression
 The Ridge regression technique is used to analyze the model where
the variables may be having multicollinearity.
 It reduces the insignificant independent variables though it does not
remove them completely. This type of regularization uses the L₂ norm
for regularization.


4. Lasso Regression
Least Absolute Shrinkage and Selection Operator (or LASSO)
Regression penalizes the coefficients to the extent that it becomes zero. It
eliminates the insignificant independent variables. This regularization
technique uses the L1 norm for regularization.

5. Dropout
"Dropout" in machine learning refers to the process of randomly
ignoring certain nodes in a layer during training.
 In Figure 4.22, the neural network on the left represents a typical
neural network where all units are activated. On the right, the red
units have been dropped out of the model: the values of their weights
and biases are not considered during training.

Figure 4.22 Dropout


 Dropout is used as a regularization technique - it prevents overfitting by


ensuring that no units are codependent.
 In dropout, we have a hyperparameter p, and we drop the input or hidden
unit with probability p, that is, set its output to zero, or keep it with
probability 1 – p.
 p is adjusted using cross-validation; with more inputs or hidden units in a
layer, we can afford a larger p (see Figure 4.23).

Figure 4.23 In dropout, the output of a random subset of the units are set to
zero, and backpropagation is done on the remaining smaller network.
 In each batch or minibatch, for each unit independently we decide
randomly to keep it or not. Let us say that p = 0.25. So, on average, we
remove a quarter of the units and we do backpropagation as usual on the
remaining network for that batch or minibatch. We need to make up for
the loss of units, though: In each layer, we divide the activation of the
remaining units by 1 − p to make sure that they provide a vector of similar
magnitude to the next layer. There is no dropout during testing.
 In each batch or minibatch, a smaller network (with smaller variance) is
trained. Thus dropout is effectively sampling from a pool of possible
networks of different depths and widths.
 There is a version called dropconnect that drops out or not connections
independently, which allows a larger set of possible networks to sample
from, and this may be preferable in smaller networks.
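A minimal sketch of the dropout computation during training as described above (the layer activations and p = 0.25 are illustrative; no dropout is applied at test time):

import numpy as np

rng = np.random.default_rng(0)
z = np.array([0.8, 0.2, 0.5, 0.9, 0.4, 0.7, 0.1, 0.6])   # activations of one hidden layer

p = 0.25                                # probability of dropping a unit
keep = rng.random(z.shape) >= p         # keep each unit independently with probability 1 - p

# Dropped units output zero; the kept ones are divided by 1 - p so that
# the next layer receives a vector of similar magnitude.
z_dropout = np.where(keep, z / (1 - p), 0.0)
print(keep)
print(z_dropout)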
