ML Notes Question Bank Exstraction From Notes
ML Notes Question Bank Exstraction From Notes
AL 3451 ML
O MAILAM
Engineering College
Chennai,
Approved by AICTE, New Delhi, AMiliatcd to Anna University,
"A" Grade &
Accredited by National Board of Accreditation(NBA), Accredited by NAACwith
Accredited by TATA ConsultancyServices (TCS), Chennai)
Science
Department Of Artificial Intelligence and Data
&
LEARNING EXPERIMENTS
UNIT V DESIGN AND ANALYSIS OF MACHINE
resampling -
Guidelines for machinc lcarning cxperiments, Cross Validation (CV) and
assessing a single
K-fold CV, bootstrapping, measuring classificr performance, test, McNemar's
classificationalgorithm and comparingtwo classificationalgorithms - t
test, K-fold CV paired t test.
TEXTBOOKS:
Fourth Edition,
1. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press,
2020.
2. Stephen Marsland,"MachineLearning: An AlgorithmicPerspective,"Second Edition",
CRC Press, 2014.
REFERENCES:
2006.
1. Christopher M. Bishop, "Pattern Recognitionand Machine Learning", Springer,
2. Tom Mitchell, "Machine Learning", McGraw Hill, 3rd Edition, 1997.
Machine
3. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, "Foundations of
Learning",Second Edition, MIT Press, 2012, 2018.
4. lan Goodfellow,Yoshua Bengio, Aaron Courville, "Deep Learning", MIT Press, 2016
5. Sebastain Raschka, Vahid Mirjalili , "Python Machine Learning", Packt publishing,
3rd Edition, 2019.
PREPARED BY
HOD PRINCÍPAL
Dr. Artheeswart,Prof &Head /AI&DS
PART A
1. Define linear Algebra.
Linear Algebra is an essential field of mathematics, which defines the study of
vectors, matrices, planes, mapping, and lines required for linear transformation.
2. What are the Benefits of learning Linear Algebra before Machine learning?
Better Graphic experience
Improved Statistics]
Creating better Machine Learning algorithms
Estimating the forecast of Machine Learning
Easy to Learn
3. List some of the supervised learning algorithms can be created using Linear
Algebra.
Logistic Regression
Linear Regression
Decision Trees
Support Vector Machines (SVM)
Supervised learning
Unsupervised learning
Reinforcement learning
Semi-supervised learning
PART-B
1. Explain in detail about linear algebra for machine learning and give some popular
examples of linear algebra and explain them.
2.1.3 Examples
2.1.3.1 Handwriting recognition learning problem
Task T: Recognizing and classifying handwritten words
within images
Performance P: Percent of words correctly classified
Training experience E: A dataset of handwritten words
with given classifications
2.1.3.2 A robot driving learning problem
Task T: Driving on highways using vision sensors
Performance P: Average distance traveled before an error
Training experience E: A sequence of images and steering
commands recorded while observing a human driver
2.3.2 Regression:
Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to predict
the continuous output variable based on the one or more predictor
variables.
It is mainly used for prediction, forecasting, time series modeling,
and determining the causal-effect relationship between variables.
2.3.3 Clustering:
Clustering or cluster analysis is a machine learning technique,
which groups the unlabeled dataset.
It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that
has less or no similarities with another group."
Thus, to specify a hypothesis h, we need only specify the set {x ∈ D ∣ h(x) = 1}.
Figure 3.1 shows all possible dichotomies of D if D has three elements. In the figure,
we have shown only one of the two sets in a dichotomy, namely the set {x ∈ D ∣ h(x) = 1}. The
circles and ellipses represent such sets.
that can be shattered; it is not necessary that we be able to shatter any four points in two
dimensions.
Figure 1.3 An axis-aligned rectangle can shattered four points. Only rectangle
covering two points are shown
VC dimension may seem pessimistic. It tells us that using a rectangle as our
hypothesis class, we can learn only datasets containing four points and not more.
In this case, because S is the tightest possible rectangle, the error region
between C and h = S is the sum of four rectangular strips (see figure 2.7).
We would like to make sure that the probability of a positive example falling
in here (and causing an error) is at most ε.
For any of these strips, if we can guarantee that the probability is upper
bounded by ε /4, the error is at most 4(ε /4) = ε.
Note that we count the overlaps in the corners twice, and the total actual error
in this case is less than 4(ε /4).
The probability that a randomly drawn example misses this strip is 1 − ε /4.
The probability that all N independent draws miss the strip is (1− ε /4)N , and
the probability that all N independent draws miss any of the four strips is at
most 4(1 − ε /4)N , which we would like to be at most δ. We have the inequality.
(1 − x) ≤ exp[−x]
So if we choose N and δ such that we have
4 exp[−ε N/4] ≤ δ
we can also write 4(1 − ε /4)N ≤ δ. Dividing both sides by 4, taking (natural) log and
rearranging terms, we have
N ≥ (4/ ε) log(4/δ
Figure 1.4 The difference between h and C is the sum of four rectangle
strips, one of which is shaded.
Therefore, provided that we take at least (4/ ε) log(4/δ) independent examples
from C and use the tightest rectangle as our hypothesis h, with confidence probability
at least 1 − δ, a given point will be misclassified with error probability at most ε. We
can have arbitrary large confidence by decreasing δ and arbitrary small error by
decreasing ε, and we see in equation 2.7 that the number of examples is a slowly
growing function of 1/ ε and 1/δ, linear and logarithmic, respectively.
Say suppose we have test data for which we have to determine the outputs or results. The
test data is as shown in figure 1.7.
The errors in the test dataset start increasing, so the point, just
before the raising of errors, is the good point, and we can stop here
for achieving a good model.
Figure 1.13
There are two other methods by which we can get a good point for our model, which
are the resampling method to estimate model accuracy and validation dataset.
Figure 1.14
o Irreducible errors: These errors will always be present in the model
regardless of which algorithm has been used. The cause of these
errors is unknown variables whose value can't be reduced.
What is Bias?
In general, a machine learning model analyses the data, find patterns
in it and make predictions. While training, the model learns these
patterns in the dataset and applies them to test data for prediction.
While making predictions, a difference occurs between prediction
values made by the model and actual values/expected values, and this
difference is known as bias errors or Errors due to bias.
Bias can be defined as an inability of machine learning algorithms such
as Linear Regression to capture the true relationship between the data
points. Each algorithm begins with some amount of bias because bias
occurs from assumptions in the model, which makes the target
function simple to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about
the form of the target function.
o High Bias: A model with a high bias makes more assumptions,
and the model becomes unable to capture the important features
of our dataset. A high bias model also cannot perform well on
new data.
Generally, a linear algorithm has a high bias, as it makes them learn
fast. The simpler the algorithm, the higher the bias it has likely to be
introduced. Whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are
Decision Trees, k-Nearest Neighbours and Support Vector
Machines. At the same time, an algorithm with high bias is Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs due to a much simple model. Below are some
ways to reduce the high bias:
o Increase the input features as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, such as including some polynomial
features.
7.3 What is a Variance Error?
The variance would specify the amount of variation in the prediction if the
different training data was used. In simple words, variance tells that
how much a random variable is different from its expected
value. Ideally, a model should not vary too much from one training dataset
to another, which means the algorithm should be good in understanding
the hidden mapping between inputs and output variables. Variance errors
are either of low variance or high variance.
Low variance means there is a small variation in the prediction of the target
function with changes in the training data set.
At the same time, High variance shows a large variation in the prediction
of the target function with changes in the training dataset.
A model that shows high variance learns a lot and perform well with the
training dataset, and does not generalize well with the unseen dataset.
As a result, such a model gives good results with the training dataset but
shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it
leads to overfitting of the model. A model with high variance has the below
problems:
o A high variance model leads to overfitting.
o Increase model complexities.
o Usually, nonlinear algorithms have a lot of flexibility to fit
the model, have high variance.
Figure 1.15
o Some examples of machine learning algorithms with low
variance are, Linear Regression, Logistic Regression,
and Linear discriminant analysis. At the same time,
algorithms with high variance are decision tree, Support
Vector Machine, and K-nearest neighbours.
o Ways to Reduce High Variance:
o Reduce the input features or number of parameters as a
model is overfitted.
o Do not use a much complex model.
o Increase the training data.
Figure 1.16
1. Low-Bias,Low-Variance:
The combination of low bias and low variance shows an ideal machine learning
model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions
are inconsistent and accurate on average. This case occurs when the model
learns with a large number of parameters and hence leads to an overfitting
3. High-Bias, Low-Variance: With High bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not
learn well with the training dataset or uses few numbers of the parameter. It
leads to underfitting problems in the model.
4. High-Bias,High-Variance:
With high bias and high variance, predictions are inconsistent and also
inaccurate on average.
How to identify High variance or High Bias?
High variance can be identified if the model has:
For an accurate prediction of the model, algorithms need a low variance and low bias.
But this is not possible because bias and variance are related to each other:
o If we decrease the variance, it will increase the bias.
o If we decrease the bias, it will increase the variance.
SYLLABUS:
Linear Regression Models: Least squares, single & multiple variables,
Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function- Percepton algorithm– Probabilistic
discriminative model - Logistic regression, Probabilistic generative
model – Naive Bayes, Maximum margin classifier – Support vector
machine, Decision Tree, Random forests
PART A
1. Define Regression.
2. Define Classification.
3. Define Clustering.
Clustering or cluster analysis is a machine learning technique, which
groups the unlabeled dataset.
It can be defined as "A way of grouping the data points into different
clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that has
less or no similarities with another group."
Where,
m: Slope
c: y-intercept
where
Y´ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
A Linear Regression model’s main aim is to find the best fit linear
line and the optimal values of intercept and coefficients such that the error is
minimized.
Error is the difference between the actual value and Predicted value and the
goal is to reduce this difference.
The vertical distance between the data point and the regression line is known
as error or residual.
Each data point has one residual and the sum of all the differences is known
as the Sum of Residuals/Errors.
where
o Posterior: It is the probability of an event to occur; say, H, given that
another event; say, E has already occurred. i.e., P(H | E).
o Prior: It is the probability of an event H has occurred prior to another
event. i.e., P(H)
o Likelihood: It is a likelihood function in which some parameter variable
is marginalized.
Two-class problems:
t = 1 represents class C1
t = 0 represents class C2
o Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM,
CAT or DOG, etc.
o Multi-class Problems:
o If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
o Example: Classifications of types of crops, Classification of types
of music.
The cost is 0 if the predicted value and the actual value are of the same
sign. If they are not, then calculate the loss value.
PART B
Where,
m: Slope
c: y-intercept
Linear regression algorithm shows a linear relationship between a
dependent (y) and one or more independent (x) variables, hence called
as linear regression.
Linear regression finds how the value of the dependent variable is
changing according to the value of the independent variable.
The linear regression model provides a sloped straight line
representing the relationship between the variables.
Consider the below Figure 2.1, which represents the relationship
between independent and dependent variables
where
Y´ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
1.2.2 Least Squares Regression in Python
Scenario: A rocket motor is manufactured by combining an igniter
propellant and a sustainer propellant inside a strong metal housing. It
was noticed that the shear strength of the bond between two propellers
is strongly dependent on the age of the sustainer propellant.
Problem Statement: Implement a simple linear regression algorithm
using Python to build a machine learning model that studies the
relationship between the shear strength of the bond between two
propellers and the age of the sustainer propellant.
Step 1: Import the required Python libraries.
# Importing Libraries
importnumpyasnp
import pandas aspd
importmatplotlib.pyplotasplt
Step 6: Calculate the slope and the y-intercept using the formulas
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
num += (X[i] - mean_x) * (Y[i] - mean_y)
denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
Mathematical Approach:
Residual/Error = Actual values – Predicted Values
Sum of Residuals/Errors = Sum(Actual- Predicted Values)
Square of Sum of Residuals/Errors = (Sum(Actual- Predicted Values))2
where
o Posterior: It is the probability of an event to occur; say, H, given that
another event; say, E has already occurred. i.e., P(H | E).
o Prior: It is the probability of an event H has occurred prior to another
event. i.e., P(H)
o Likelihood: It is a likelihood function in which some parameter
variable is marginalized.
The Bayesian Ridge Regression formula is as follows:
p(y|λ) = N(w|0, λ^-1Ip)
where
'y' is the expected value,
lambda is the distribution's shape parameter before the lambda
parameter
the vector "w" is made up of the elements w0, w1,....
Program
fromsklearn.datasets import load_boston
fromsklearn.model_selection import train_test_split
fromsklearn.metrics import r2_score
fromsklearn.linear_model import BayesianRidge
Output
Test Set r2 score : 0.7943355984883815
The goal is to minimize the cost as much as possible in order to find the
best fit line.
Learning Rate
A learning rate is used for each pair of input and output values. It is a
scalar factor and coefficients are updated in direction towards minimizing
error.
The process is repeated until a minimum sum squared error is achieved or
no further improvement is possible.
derivative D.
Repeat this process until our Cost function is very small (ideally 0).
With linear models for classification, the decision surfaces are linear
functions, Classes that can be separated well by linear surfaces are
linearly separable.
In the figure 2.5, there are two classes, class A and Class B.
These classes have features that are similar to each other and
dissimilar to other classes.
2.3Discriminant function
A function of a set of variables that is evaluated for samples of events
or objects and used as an aid in discriminating between or classifying
them.
A discriminant function (DF) maps independent (discriminating)
variables into a latent variable D.
DF is usually postulated to be a linear function:
D = a0 + a1 x1 + a2 x2 ... aNxN
The goal of discriminant analysis is to find such values of the
coefficients {ai, i=0,...,N} that the distance between the mean values of
DF is maximal for the two groups.
Whenever there is a requirement to separate two or more classes
having multiple features efficiently, the Linear Discriminant Analysis
model is considered the most common technique to solve such
classification problems.
For example, if there are classes with multiple features and need to
separate them efficiently. Classify them using a single feature, then it
may show overlapping as shown in figure 3.6.
(2.1)
where w is the weight vector and w0 the bias or threshold weight.
(2.2)
or
(2.3)
(2.4)
where xp is the normal projection of x onto H, and r is the desired
algebraic distance which is positive if x is on the positive side and
negative if x is on the negative side. Then, because g(xp)=0,
Since then
(2.5)
or
(2.6)
(2.7)
or
(2.8)
(2.9)
The cost is 0 if the predicted value and the actual value are of the same
sign. If they are not, then calculate the loss value.
The objective of the regularization parameter is to balance the margin
maximization and loss.
After adding the regularization parameter, the cost functions looks as
below.
5.6 Disadvantages
If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.
SVMs don't directly provide probability estimates. Those are calculated
using an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
5.7 Applications
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization.
Classification of images can also be performed using SVMs.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.
Example:
Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the
decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram: Refer fig 2.14
Decision Tree
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square
6.7.1 Entropy:
Entropy is a metric to measure the impurity in a given
attribute.
Entropy is a measure of the randomness in the information
being processed.
The higher the entropy, the harder it is to draw any conclusions
from that information.
Flipping a coin is an example of an action that provides
information that is random.
Entropy can be calculated as:
6.7.6 Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction
Detector.
It finds out the statistical significance between the differences
between sub-nodes and parent node.
Higher the value of Chi-Square higher the statistical significance of
differences between sub-node and Parent node.
It generates a tree called CHAID (Chi-square Automatic Interaction
Detector).
Mathematically, Chi-squared is represented as:
RANDOM FOREST
7.1 Random Forest
7.2 Steps in the working process of Random Forest
7.3 Need for Random Forest
7.4 Example:
7.5 Important Features of Random Forest
7.6 Applications of Random Forest
7.8 Advantages of Random Forest
7.8 Disadvantages of Random Forest
7.9 Difference between Decision Tree & Random Forest
7.4 Example:
Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is
divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results,
the Random Forest classifier predicts the final decision.
In the above figure 2.16, the majority decision tree gives output as an
apple when compared to a banana, so the final output is taken as an
apple.
When a data set with features is taken Random forest randomly selects
as input by a decision tree it will observations, builds a decision tree and
formulate some set of rules to do the average result is taken. It doesn’t use
prediction. any set of formulas.
PART A
1. What is unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
4. Explain Clustering.
Clustering is the act of organizing similar objects into groups
within a machine learning algorithm. Assigning related objects into
clusters is beneficial for AI models. Clustering has many uses in data
science, like Image processing, knowledge discovery in data,
unsupervised learning, and various other applications.
5. What is a Cluster?
Cluster is a group of objects that belongs to the same class. In
other words the similar objects are grouped in one cluster and dissimilar
are grouped in other cluster.
6. What is bagging?
Bagging is also known as Bootstrap aggregation, ensemble method
works by training multiple models independently and combining later to
result in strong model.
7. Define Boosting.
Boosting refers to a group of algorithms that utilize weighted
averages to make weak learning algorithms to stronger learning
algorithms.
Bagging Boosting
PART - B
Different Algorithms
o Different algorithms make different assumptions about the data and
lead to different classifiers.
When there are K outputs, for each learner there are dji(x), i= 1,
… , K, j = 1, … , L, and, combining them, also generate K values,
yi, i = 1, … , K and then for example in classification, choose the
class with the maximum yi value:
Voting
A Voting Classifier is a machine learning model that trains on an
ensemble of numerous models and predicts an output (class) based
on their highest probability of chosen class as the output.
It simply aggregates the findings of each classifier passed into Voting
Classifier and predicts the output class based on the highest majority
of voting.
The idea is instead of creating separate dedicated models and finding
the accuracy for each them, create a single model which trains by
these models and predicts output based on their combined majority
of voting for each output class.
The simplest way to combine multiple classifiers is by voting, which
corresponds to taking a linear combination of the learners (see figure
4.1)
This is also known as ensembles. In the simplest case, all learners are
given equal weight and simple voting corresponds to take an average.
Figure 3.1 Base-learners are dj and their outputs are combined using
f(·).
Voting Classifier supports two types of voting’s.
1. Hard Voting: In hard voting, the predicted output class is a class with
the highest majority of votes i.e. the class which had the highest
probability of being predicted by each of the classifiers. Suppose three
classifiers predicted the output class(A, A, B), so here the majority
predicted A as output. Hence A will be the final prediction.
2. Soft Voting: In soft voting, the output class is the prediction based on
the average of probability given to that class. Suppose given some input
to three models, the prediction probability for class A = (0.30, 0.47,
0.53) and B = (0.20, 0.32, 0.40). So the average for class A is
0.4333 and B is 0.3067, the winner is clearly class A because it had the
highest probability averaged by each classifier.
Eq no 3
Algorithm
1. Initialize the dataset and assign equal weight to each of the data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points and decrease
the weights of correctly classified data points. And then normalize the
weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
Figure 3.4 An illustration presenting the intuition behind the boosting algorithm,
consisting of the parallel learners and weighted dataset.
Bagging Boosting
In this base classifiers are trained In this base classifiers are trained
parallelly. sequentially.
a way that by combining them with Meta learners, we can predict better
predictions for the future.
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical
groups.
It is also known as the centroid-based method.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas
into clusters, and the arbitrarily shaped distributions are formed as
long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset
and connects the areas of high densities into clusters.
The dense areas in data space are divided from each other by sparser
areas as shown in 3.9.
Clustering Algorithms
K-Means algorithm
Mean-shift algorithm
DBSCAN Algorithm-Density-Based Spatial Clustering of Applications
with Noise.
Expectation-Maximization Clustering using GMM
Agglomerative Hierarchical algorithm
Affinity Propagation
Applications of Clustering
In Identification of Cancer Cells: It divides the cancerous and
non-cancerous data sets into different groups.
In Search Engines: It does it by grouping similar data objects in
one group that is far from the other dissimilar objects.
Customer Segmentation: It is used in market research to
segment the customers based on their choice and preferences.
In Biology: It is used in the biology stream to classify different
species of plants and animals using the image recognition
technique.
In Land Use: The clustering technique is used in identifying the
area of similar lands use in the GIS database.
3. Compute the centroid, the center of mass, of each newly defined cluster
from Step2.
In two dimensions, the centroid (Xc, Yc) of the m points in a k-means
cluster is
calculated as follows in
5. Calculate centroid, q
The centroid, q, of a cluster of m points, (pi1, pi2, . . . pin) , is calculated
as
Working of K-NN
The K-NN working can be explained on the basis of the below
algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of
neighbors
Example
Suppose we have a new data point and we need to put it in the required
category as shown in figure 3.13.
Figure 3.13 new data point and we need to put it in the required category
Firstly, we will choose the number of neighbors, so we will choose
the k=5.
Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance is the distance between two points
It can be calculated as:
Figure 3.14
o By calculating the Euclidean distance we got the nearest neighbors, as
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:
o Since 3 nearest neighbors are from category A in Figure 4.9 , hence the
new data point must belong to category A.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are good, but it may find some difficulties.
where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2
covariance matrix.
Thus, this multivariate Gaussian model would have x and μ as vectors of
length d, and Σ would be a d x d covariance matrix.
Expectation-Maximization Algorithm
The mean and variance value for each Gaussian distribution are
determined using a technique called Expectation-Maximization (EM).
Expectation-Maximization (EM) is a statistical algorithm for finding
the right model parameters. Use EM when the data has missing
values, or in other words, when the data is incomplete.
These missing variables are called latent variables.
Expectation-Maximization algorithm has two steps:
o E-step: In this step, the available data is used to estimate
(guess) the values of the missing variables
o M-step: Based on the estimated values generated in the E-step,
the complete data is used to update the parameters
Expectation-Maximization is the base of many algorithms, including
Gaussian Mixture Models.
E-step:
For each point xi, calculate the probability that it belongs to
cluster/distribution c1, c2, … ck. This is done using the below
formula:
This value will be high when the point is assigned to the right cluster
and lower otherwise.
M-step:
1. The new density is defined by the ratio of the number of points in the
cluster and the total number of points:
2. The mean and the covariance matrix are updated based on the values
assigned to the distribution, in proportion with the probability values
for the data point. Hence, a data point that has a higher probability of
being a part of that distribution will contribute a larger portion:
UNIT IV
NEURAL NETWORKS
Perceptron - Multilayer perceptron, activation functions, network training –
gradient descent optimization – stochastic gradient descent, error
backpropagation, from shallow networks to deep networks –Unit saturation (aka
the vanishing gradient problem) – ReLU, hyperparameter tuning, batch
normalization, regularization, dropout.
PART A
1. Define neuron.
A neuron is a cell in the brain whose principal function is the collection,
processing, and dissemination of electrical signals.
4. Define Perceptron.
A network with all the inputs connected directly to the outputs is called a
single layer neural network, or a perceptron network. Since each output unit
is independent of the others-each weight affects only one of the outputs.
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues like
optical character recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed
value is achieved. After that, the error is computed and propagated backward.
ReLU Function
Value Range:- -1 to +1
Nature:- non-linear
Uses:- Usually used in hidden layers of a neural network as it’s values
lies between -1 to 1 hence the mean for the hidden layer comes out be 0
or very close to it, hence helps in centering the data by bringing mean
close to 0. This makes learning for the next layer much easier.
Tanh Function
38. Define Sigmoid Function.
It is a function which is plotted as ‘S’ shaped graph.
Equation : A = 1/(1 + e -x )
Nature : Non-linear. Notice that X values lies between -2 to 2, Y
values are very steep. This means, small changes in x would also
bring about large changes in the value of Y.
Value Range : 0 to 1
Uses : Usually used in output layer of a binary classification,
where result is either 0 or 1, as value for sigmoid function lies
between 0 and 1 only so, result can be predicted easily to be 1 if
value is greater than 0.5 and 0 otherwise.
In the leaky ReLU (the output is also linear on the negative side but with a
smaller slope, just enough to make sure that there will be updates for negative
activations, albeit small:
PART B
y = w x + w0
where w as the slope and w0 as the intercept
Thus this perceptron with one input and one output can be used to
implement a linear fit. With more than one input, the line becomes
a (hyper)plane, and the perceptron with more than one input can be
used to implement multivariate linear fit.
The perceptron as defined in equation 1 defines a hyperplane and
as such can be used to divide the input space into two:
o the half-space where it is positive and the half-space
where it is negative
By using it to implement a linear discriminant function, the
perceptron can separate two classes by checking the sign of the
output. If we define s(·) as the threshold function
then we can,
When there are K > 2 outputs, there are K perceptrons, each of which has
a weight vector wi (see figure 5.2)
where wij is the weight from input xj to output yi. W is the K ×(d + 1) weight
matrix of wij whose rows are the weight vectors of the K perceptrons. When
used for classification, during testing, we
The output layer gives two outputs, therefore there are two
output nodes. The nodes in the input layer take input and
forward it for further process, in the diagram above the nodes in
the input layer forwards their output to each of the three nodes in
the hidden layer, and in the same way, the hidden layer
processes the information and passes it to the output layer.
Every node in the multi-layer perception uses a sigmoid
activation function. The sigmoid activation function takes real
values as input and converts them to numbers between 0 and 1
using the sigmoid formula.
(x)=1/(1+(exp(-x))
A perceptron that has a single layer of weights can only approximate
linear functions of the input and cannot solve problems like the
XOR, where the discrimininant to be estimated is nonlinear.
Similarly, a perceptron cannot be used for nonlinear regression.
This limitation does not apply to feedforward networks with an
intermediate or hidden layer between the input and the output
layers.
If used for classification, such multilayer perceptrons (MLP) can
implement nonlinear discriminants and, if used for regression, can
approximate nonlinear functions of the input.
From the figure 4.4, We cannot draw a line where the empty circles are on one
side and the filled circles on the other side.
Input x is fed to the input layer (including the bias),
the “activation” propagates in the forward direction,
and the values of the hidden units zh are calculated
(see Figure 5.5). Each hidden unit is a perceptron
by itself and applies the nonlinear sigmoid function
to its weighted sum:
One is not limited to having one hidden layer, and more hidden layers
with their own incoming weights can be placed after the first hidden
layer with sigmoid hidden units, thus calculating nonlinear functions of
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network as it’s values
lies between -1 to 1 hence the mean for the hidden layer comes out be 0
or very close to it, hence helps in centering the data by bringing mean
close to 0. This makes learning for the next layer much easier.
4. ReLU Function
It Stands for Rectified linear unit. It is the most widely used
activation function. Chiefly implemented in hidden layers of
Neural network.
4. Calculate the forward pass (what would be the output with the current
weights)
5. Compare the calculated output to the expected output (loss)
6. Adjust the weights (using the learning rate increment or decrement)
according to the backward pass (backward gradient propagation).
7. Go back to step 2
5. Discuss in detail about Gradient descent optimization Algorithm.
Gradient Descent is a generic optimization algorithm capable of finding
optimal solutions to a wide range of problems.
The general idea is to tweak parameters iteratively in order to minimize the
cost function.
An important parameter of Gradient Descent (GD) is the size of the steps,
determined by the learning rate hyperparameters. If the learning rate is too
small, then the algorithm will have to go through many iterations to
converge, which will take a long time, and if it is too high we may jump the
optimal value.
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
Batch Gradient Descent involves calculations over the full training
set at each step as a result of which it is very slow on very large
training data. Thus, it becomes very computationally expensive to do
Batch GD.
2. Stochastic Gradient Descent
In SGD, only one training example is used to compute the gradient
and update the parameters at each iteration. This can be faster than
batch gradient descent but may lead to more noise in the updates.
3. Mini-batch Gradient Descent
In mini-batch gradient descent, a small batch of training examples
is used to compute the gradient and update the parameters at each
iteration. This can be a good compromise between batch gradient
descent and SGD, as it can be faster than batch gradient descent and
less noisy than SGD.
6. Explain in detail about Stochastic gradient descent .
Stochastic Gradient Descent:
In Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
In Gradient Descent, there is a term called “batch” which denotes the total
number of samples from a dataset that is used for calculating the gradient
for each iteration.
In typical Gradient Descent optimization, like Batch Gradient Descent, the
batch is taken to be the whole dataset. Although using the whole dataset is
really useful for getting to the minima in a less noisy and less random
manner, the problem arises when our dataset gets big.
Suppose, you have a million samples in your dataset, so if you use a
typical Gradient Descent optimization technique, you will have to use all of
the one million samples for completing one iteration while performing the
Gradient Descent, and it has to be done for every iteration until the
minima are reached. Hence, it becomes computationally very expensive to
perform.
This problem is solved by Stochastic Gradient Descent. In SGD, it uses
only a single sample, i.e., a batch size of one, to perform each iteration.
The sample is randomly shuffled and selected for performing the iteration.
SCD Algorithm
In SGD, we find out the gradient of the cost function of a single example
at each iteration instead of the sum of the gradient of the cost function of
all the examples.
In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm. But that
doesn’t matter all that much because the path taken by the algorithm
does not matter, as long as we reach the minima and with a significantly
shorter training time (see in Figure 4.13 & 4.14).
Advantages:
Speed: SGD is faster than other variants of Gradient Descent.
Memory Efficiency:it is memory-efficient and can handle large datasets
that cannot fit into memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and converge to a global minimum.
Disadvantages:
Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one
at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical
in SGD since using a high learning rate can cause the algorithm to
overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates.
7. Explain in detail about error backpropagation.
Backpropagation
Backpropagation is one of the important concepts of a neural network.
or a single training example, Backpropagation algorithm calculates the
gradient of the error function.
Backpropagation algorithms are a set of methods used to efficiently
train artificial neural networks following a gradient descent approach
which exploits the chain rule.
The main features of Backpropagation are the iterative, recursive
and efficient method through which it calculates the updated weight to
improve the network until it is not able to perform the task for which it
is being trained.
Derivatives of the activation function to be known at network design
time is required to Backpropagation.
How Backpropagation Algorithm Works?
The Back propagation algorithm in neural network computes the gradient
of the loss function for a single weight by the chain rule. It efficiently
computes one layer at a time, unlike a native direct computation. It
computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.(see in Figure 5.16)
8. Explain in detail about Unit saturation (aka the vanishing gradient problem).
The vanishing gradient problem is an issue that sometimes arises when
training machine learning algorithms through gradient descent. This most
often occurs in neural networks that have several neuronal layers such as
in a deep learning system, but also occurs in recurrent neural networks.
The key point is that the calculated partial derivatives used to compute the
gradient as one goes deeper into the network. Since the gradients control
how much the network learns during training, the gradients are very small
or zero, then little to no training can take place, leading to poor predictive
performance.
The problem:
As more layers using certain activation functions are added to neural
networks, the gradients of the loss function approaches zero, making the
network hard to train.
Why:
Certain activation functions, like the sigmoid function, squishes a large
input space into a small input space between 0 and 1. Therefore, a large
change in the input of the sigmoid function will cause a small change in
the output. Hence, the derivative becomes small.
The sigmoid function and its derivative
As an example, the below Figure 4.17 is the sigmoid function and its
derivative. Note how when the inputs of the sigmoid function become
larger or smaller (when |𝑥| becomes bigger), the derivative becomes close
to zero.
Finally, batch normalization layers can also resolve the issue. As stated
before, the problem arises when a large input space is mapped to a small
one, causing the derivatives to disappear.
9. Explain in detail about Rectified linear unit (ReLU).
It Stands for Rectified linear unit. It is the most widely used activation
function. Chiefly implemented in hidden layers of Neural network. (see
the below figure 4.19)
Leaky ReLU
In the leaky ReLU (the output is also linear on the negative side but with a
smaller slope, just enough to make sure that there will be updates for negative
activations, albeit small:
Advantage:
it does not saturate (unlike sigmoid and tanh), updates can still be done
for large positive a for some inputs, some hidden unit activations will be
zero, meaning that we will have a sparse representation
Sparse representations lead to faster Training.
Disadvantage:
The derivative is zero for a ≤ 0, there is no further training if, for a hidden
unit, the weighted sum somehow becomes negative. This implies that one
should be careful in initializing the weights so that the initial activation for
all hidden units is positive.
Batch normalization
Batch normalization is a process to make neural networks faster and
more stable through adding extra layers in a deep neural network. The
new layer performs the standardizing and normalizing operations on the
input of a layer coming from a previous layer. A typical neural network is
trained using a collected set of input data called batch. Similarly, the
normalizing process in batch normalization takes place in batches, not as
a single input.
A similar case can also be made for the hidden units, and this is the idea
behind batch normalization.
We normalize hidden unit values before applying the activation function,
such as sigmoid or ReLU. Let us call that weighted sum aj. For each batch
or minibatch, for each hidden unit j we calculate the mean mj and
standard deviation sj of its values, and we first z-normalize:
We can then map these to have arbitrary mean γj and scale βj and then we
apply the activation function.
First, mj and sj are calculated anew for each batch, and we see
immediately that batch normalization is not meaningful with online
learning or very small minibatches.
Second, γj and βj are parameters that are initialized and updated (after
each batch or minibatch) using gradient descent, just like the connection
weights. So they require extra memory and computation.
PREPARED BY: Ms.M.NITHYA, AP/AI & DS, Mr.THIYANESHWARAN, AP,CSBS 39
These new parameters allow each hidden unit to have its arbitrary (but
learned) mean and standard deviation for its activation.
Once learning is done, we can go over the whole training data (or a large
enough subset) and calculate mj and sj to use during testing, when we see
one instance at a time.
Why batch normalization?
An internal covariate shift occurs when there is a change in the input
distribution to our network. When the input distribution changes, hidden
layers try to learn to adapt to the new distribution. This slows down the
training process. If a process slows down, it takes a long time to converge
to a global minimum.
Example: Suppose we are training an image classification model, that classifies
the images into Dog or Not Dog. Let’s say we have the images of white dogs only,
these images will have certain distribution as well. Using these images model will
update its parameters.
Later, if we get a new set of images, consisting of non-white dogs. These
new images will have a slightly different distribution from the previous images.
Now the model will change its parameters according to these new images. Hence
the distribution of the hidden activation will also change. This change in hidden
activation is known as an internal covariate shift. Here batch normalization
works.
Advantages of Batch Normalization
Speed Up the Training
Handles internal covariate shift
The model is less delicate to hyperparameter tuning.
Batch normalization smoothens the loss function that in turn by
optimizing the model parameters improves the training speed of the model.
Hyperparameter tuning
GridSearchCV
In GridSearchCV approach, the machine learning model is evaluated
for a range of hyperparameter values. This approach is called
GridSearchCV, because it searches for the best set of
hyperparameters from a grid of hyperparameters values.
For example, if we want to set two hyperparameters C and Alpha of
the Logistic Regression Classifier model, with different sets of values.
The grid search technique will construct many versions of the model
with all possible combinations of hyperparameters and will return the
best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2,
0.3, 0.4]. For a combination of C=0.3 and Alpha=0.2, the
performance score comes out to be 0.726(Highest), therefore it is
selected. (see in Figure 5.20)
RandomizedSearchCV
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes
through only a fixed number of hyperparameter settings. It moves within
the grid in a random fashion to find the best set of hyperparameters. This
approach reduces unnecessary computation.
3. Ridge Regression
The Ridge regression technique is used to analyze the model where
the variables may be having multicollinearity.
It reduces the insignificant independent variables though it does not
remove them completely. This type of regularization uses the L₂ norm
for regularization.
4. Lasso Regression
Least Absolute Shrinkage and Selection Operator (or LASSO)
Regression penalizes the coefficients to the extent that it becomes zero. It
eliminates the insignificant independent variables. This regularization
technique uses the L1 norm for regularization.
5. Dropout
"Dropout" in machine learning refers to the process of randomly
ignoring certain nodes in a layer during training.
In the Figure 5.22, the neural network on the left represents a typical
neural network where all units are activated. On the right, the red
units have been dropped out of the model- the values of their weights
and biases are not considered during training.
Figure 4.23 In dropout, the output of a random subset of the units are set to
zero, and backpropagation is done on the remaining smaller network.
In each batch or minibatch, for each unit independently we decide
randomly to keep it or not. Let us say that p = 0.25. So, on average, we
remove a quarter of the units and we do backpropagation as usual on the
remaining network for that batch or minibatch. We need to make up for
the loss of units, though: In each layer, we divide the activation of the
remaining units by 1 − p to make sure that they provide a vector of similar
magnitude to the next layer. There is no dropout during testing.
In each batch or minibatch, a smaller network (with smaller variance) is
trained. Thus dropout is effectively sampling from a pool of possible
networks of different depths and widths.
There is a version called dropconnect that drops out or not connections
independently, which allows a larger set of possible networks to sample
from, and this may be preferable in smaller networks.