Math Behind Machine Learning
Richard Han
www.onlinemathtraining.com
[email protected]
Abstract
In this article, we look at the mathematics behind the machine learning techniques linear
regression, linear discriminant analysis, logistic regression, artificial neural networks, and
support vector machines.
1. Introduction
In this article, we will identify the math subjects and topics used by the machine
learning techniques linear regression, linear discriminant analysis, logistic
regression, artificial neural networks, and support vector machines.
2. Linear Regression
Let’s look at several techniques in machine learning and the math topics that are
used in the process.
In linear regression, we try to find the best-fit line or hyperplane for a given set of
data points. We model the output as a linear combination of the input variables,
using a set of parameters as weights.
The parameters are found by minimizing the residual sum of squares. We find a
critical point by setting the vector of derivatives of the residual sum of squares to
the zero vector. By the second derivative test, if the Hessian of the residual sum
of squares at a critical point is positive definite, then the residual sum of squares
has a local minimum there.
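As a brief sketch of this calculus step, write the data inputs as the rows of a matrix X (each row beginning with 1) and the data outputs as a vector y, as in the matrix formulation below; then

    RSS(β) = (y − Xβ)ᵀ(y − Xβ)
    ∇RSS(β) = −2Xᵀ(y − Xβ)
    ∇²RSS(β) = 2XᵀX  (the Hessian)

Setting the gradient to the zero vector gives the critical-point condition XᵀXβ = Xᵀy, and the critical point is a local minimum when the Hessian 2XᵀX is positive definite.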
In the above process, we used derivatives, the second derivative test, and the
Hessian, which are notions from multivariable calculus. We can also find the
solution to our minimization problem using linear algebra.
Let X be the matrix whose rows are our data inputs, beginning with 1 in each row, and let y be the vector of our data outputs. We want a vector β such that Xβ is close to y; in other words, we want a vector β such that the distance ‖Xβ − y‖ between Xβ and y is minimized. The vector β which minimizes the distance is such that Xβ is the projection of y onto the column space of X, because the projection of y onto the column space of X is the vector in the column space of X that is closest to y. We then use the fact that Euclidean N-space can be broken into two subspaces, the column space of X and the orthogonal complement of the column space of X, and the fact that any vector in Euclidean N-space can be written uniquely as the sum of vectors in the column space of X and in the orthogonal complement of the column space of X, respectively, to deduce that y − Xβ is orthogonal to the columns of X. From here, we arrive at the matrix equation XᵀXβ = Xᵀy. If XᵀX is positive definite, then the eigenvalues of XᵀX are all positive, so 0 is not an eigenvalue of XᵀX, and it follows that XᵀX is invertible. We can then solve the matrix equation for β, and the result agrees with the one obtained using multivariable calculus.
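As a minimal computational sketch of this solution (the function name and the use of numpy are illustrative assumptions, not part of the article):

    import numpy as np

    def fit_linear_regression(X_raw, y):
        """Solve the normal equations X^T X beta = X^T y for the least-squares weights.
        X_raw: (n, p) array of data inputs without the leading 1s; y: (n,) array of outputs."""
        n = X_raw.shape[0]
        X = np.hstack([np.ones((n, 1)), X_raw])   # each row of X begins with 1
        # Solving the linear system is numerically preferable to forming the inverse of X^T X
        return np.linalg.solve(X.T @ X, X.T @ y)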
In the solution we just discussed, the notions of norm, projection, column space, subspace, orthogonal complement, orthogonality, positive definiteness, eigenvalue, and invertibility are used. These are notions from linear algebra. We also used the facts that Euclidean N-space can be broken into two subspaces, the column space of X and the orthogonal complement of the column space of X, and that any vector in Euclidean N-space can be written uniquely as the sum of vectors in the column space of X and in the orthogonal complement of the column space of X, respectively.
3. Linear Discriminant Analysis
In linear discriminant analysis, we estimate Pr(Y = k | X = x), the probability that Y is the class k given that the input variable X is x. This is called a posterior probability function. Once we have all of these probabilities for a fixed x, we pick the class k for which the probability Pr(Y = k | X = x) is largest. We then classify x as that class k.
Using Bayes’ rule, we can rewrite the posterior probability function in terms of π_k = Pr(Y = k), the prior probability that Y = k, and f_k(x) = Pr(X = x | Y = k), the probability that X = x given that Y = k.
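Concretely, writing p_k(x) for the posterior probability, Bayes’ rule gives

    p_k(x) = Pr(Y = k | X = x) = π_k f_k(x) / Σ_l π_l f_l(x),

where the sum in the denominator runs over all classes l.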
In linear discriminant analysis, each class density f_k(x) is modeled as a multivariate Gaussian with mean μ_k and a covariance matrix Σ common to all classes. Now, we find estimates for π_k, μ_k, and Σ, and hence for p_k(x). We classify x according to the class k for which the estimated p_k(x) is greatest.
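A minimal sketch of this estimation and classification step under the Gaussian model with a shared covariance matrix (the function name, the use of numpy, and the form of the linear discriminant score are illustrative):

    import numpy as np

    def lda_classify(X, y, x_new):
        """Estimate pi_k, mu_k, and the shared covariance Sigma from training data
        (X: (n, p) inputs, y: (n,) integer class labels), then classify x_new."""
        classes = np.unique(y)
        n, p = X.shape
        pi = np.array([np.mean(y == k) for k in classes])          # prior estimates
        mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
        # Pooled (within-class) covariance estimate
        Sigma = sum((X[y == k] - mu[i]).T @ (X[y == k] - mu[i])
                    for i, k in enumerate(classes)) / (n - len(classes))
        Sigma_inv = np.linalg.inv(Sigma)
        # Linear discriminant scores; the largest score gives the largest estimated posterior
        scores = [x_new @ Sigma_inv @ mu[i] - 0.5 * mu[i] @ Sigma_inv @ mu[i] + np.log(pi[i])
                  for i in range(len(classes))]
        return classes[int(np.argmax(scores))]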
4. Logistic Regression
Assuming there are only two classes 0 and 1, let p(x) = Pr(Y = 1 | X = x). In logistic regression, we assume that the log-odds is a linear function of the components of x, with parameters β_0, β_1, …, β_p. Under this assumption, we can solve for p(x) as a function of the parameters and the components of x. We can then obtain an estimate for p(x) once we have estimates for the parameters β_0, β_1, …, β_p.
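Concretely, the log-odds assumption and the resulting expression for p(x) are

    log[ p(x) / (1 − p(x)) ] = β_0 + β_1 x_1 + ⋯ + β_p x_p
    p(x) = exp(β_0 + β_1 x_1 + ⋯ + β_p x_p) / (1 + exp(β_0 + β_1 x_1 + ⋯ + β_p x_p))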
The probability of our observed data is a function of the parameters β_0, β_1, …, β_p and is called a likelihood function. We find estimates for the parameters by maximizing the likelihood function. Maximizing the likelihood function is equivalent to maximizing the log of the likelihood function. To maximize the log-likelihood function, we use the Newton-Raphson method.
The log-likelihood function L(β) is a real-valued function of β = (β_0, β_1, …, β_p). So L is a function from ℝ^(p+1) to ℝ. Further, L is twice continuously differentiable, so we can apply the multivariate Newton-Raphson method.
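A minimal sketch of the Newton-Raphson iteration for the logistic log-likelihood (the function name, the use of numpy, and the stopping rule are illustrative assumptions):

    import numpy as np

    def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
        """X: (n, p) array of inputs without the leading 1s; y: (n,) array of 0/1 labels."""
        n, p = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])          # prepend the constant 1
        beta = np.zeros(p + 1)
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # current estimate of p(x) for each row
            grad = Xb.T @ (y - mu)                    # gradient of the log-likelihood
            W = mu * (1.0 - mu)
            hess = -(Xb * W[:, None]).T @ Xb          # Hessian of the log-likelihood
            step = np.linalg.solve(hess, grad)
            beta = beta - step                        # Newton update: beta - H^{-1} grad
            if np.linalg.norm(step) < tol:
                break
        return beta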
5. Artificial Neural Networks
Next, we’ll look at a method that can be used to solve both regression and classification problems. In artificial neural networks, we use compositions of linear and nonlinear functions to model our output functions.
The input units, including the constant 1, will form the input layer. We take a linear combination of the input units, including the constant 1, and then apply an activation function h to it to get a new unit; h is a differentiable (possibly nonlinear) function. We do this, say, M times; we now have M hidden units, and these make up a hidden layer. In a diagram of the network, the weights in the linear combinations are represented by the line segments connecting two units. We can continue this process of taking linear combinations of the units in the previous layer and applying an activation function to each linear combination to create new hidden units, thus creating the next hidden layer. At some point we have a last layer, called the output layer, and we use an activation function g_k for each output unit Y_k. Each g_k is a function of all the linear combinations of the units from the previous layer.
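As an illustrative sketch (the choice of tanh for the hidden activation h and sigmoids for the output activations g_k is an assumption, not the article’s), a forward pass through a network with one hidden layer of M units might look like this:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        """x: (p,) input; W1: (M, p) and b1: (M,) give the hidden layer;
        W2: (K, M) and b2: (K,) give the output layer."""
        a1 = W1 @ x + b1                    # linear combinations of the input units
        z1 = np.tanh(a1)                    # activation h applied to each -> M hidden units
        a2 = W2 @ z1 + b2                   # linear combinations of the hidden units
        return 1.0 / (1.0 + np.exp(-a2))    # output activations g_k (sigmoid here)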
So far, we’ve constructed output values Y_k that depend on an input x and that
involve a bunch of unknown parameters. Our goal now is to use our training data
to find values for the unknown parameters that minimize error. For binary
classification, we find estimates for the parameters by maximizing the likelihood
function associated with the probability of our observed data; this corresponds to
minimizing what’s called the cross-entropy error function. Similarly, for multiclass classification, maximizing the likelihood of our observed data corresponds to minimizing what’s called the multiclass cross-entropy error function. For regression, we find estimates for the parameters by minimizing the sum-of-squares error function.
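For reference, with targets t_n and corresponding network outputs y_n, these error functions take the following standard forms:

    Cross-entropy (binary):      E = −Σ_n [ t_n log y_n + (1 − t_n) log(1 − y_n) ]
    Cross-entropy (multiclass):  E = −Σ_n Σ_k t_nk log y_nk
    Sum-of-squares (regression): E = (1/2) Σ_n ‖y(x_n) − t_n‖²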
To minimize the error function, we use gradient descent, which requires finding
the gradient of the error function. To find the gradient of the error function, we
use backpropagation.
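A minimal sketch of one gradient-descent step computed by backpropagation, for a single-hidden-layer network with a sigmoid output and the binary cross-entropy error (all names and the tanh/sigmoid choices are illustrative assumptions):

    import numpy as np

    def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
        """Update the parameters with one gradient-descent step of learning rate lr."""
        # Forward pass
        a1 = W1 @ x + b1
        z1 = np.tanh(a1)
        a2 = W2 @ z1 + b2
        y = 1.0 / (1.0 + np.exp(-a2))
        # Backward pass: propagate the error from the output layer toward the input layer
        delta2 = y - t                                # dE/da2 for sigmoid output + cross-entropy
        delta1 = (1.0 - z1 ** 2) * (W2.T @ delta2)    # dE/da1 through the tanh hidden units
        # Gradient-descent updates
        W2 -= lr * np.outer(delta2, z1)
        b2 -= lr * delta2
        W1 -= lr * np.outer(delta1, x)
        b1 -= lr * delta1
        return W1, b1, W2, b2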
6. Support Vector Machines
Let’s look at the method of support vector machines for solving classification problems. The idea is that we have a bunch of data points, say of two classes, and we want to separate them with a decision boundary. For instance, the data points might be easily separated by a straight line.
If the data points can be easily separated using a line or hyperplane, we find the
separating hyperplane that is as far as possible from the points so that there is a
large margin. This requires maximizing the margin, and it ends up being a convex
optimization problem. To solve this convex optimization problem, we use
Lagrange multipliers, a notion from multivariable calculus. Once we find the
maximal margin hyperplane, we can classify new points depending on which side
of the hyperplane the point lies on. This method of classifying points is called the
maximal margin classifier.
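In a standard formulation, with class labels y_i ∈ {−1, +1} and hyperplane wᵀx + b = 0, maximizing the margin amounts to the convex problem

    minimize over w, b:   (1/2)‖w‖²
    subject to:           y_i (wᵀx_i + b) ≥ 1   for every training point (x_i, y_i),

and the Lagrange multipliers α_i ≥ 0 enter through the Lagrangian (1/2)‖w‖² − Σ_i α_i [ y_i(wᵀx_i + b) − 1 ].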
If the data points are not separable by a hyperplane, we can still try to find a hyperplane that separates most of the points but that may leave some points inside the margin or on the wrong side of the hyperplane.
Just as in the case of the maximal margin classifier, we want our hyperplane to be as far as possible from each point that is on the correct side of it. Points on or outside the margin and on the correct side of the hyperplane should be as far as possible from the hyperplane. Points inside the margin but on the correct side of the hyperplane should be as far as possible from the hyperplane and as close as possible to the margin boundary. Points on the wrong side of the hyperplane should be as close to the hyperplane as possible.
So, as before, we want to maximize the margin so that points on the correct side of the hyperplane are as far as possible from the hyperplane.
Not only do we want to maximize the margin, we also want to minimize the
violations of the margin. This problem turns out to be a convex optimization
problem, and it is solved using Lagrange multipliers.
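In one common formulation, slack variables ξ_i measure the margin violations, and the problem becomes

    minimize over w, b, ξ:   (1/2)‖w‖² + C Σ_i ξ_i
    subject to:              y_i (wᵀx_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0  for every i,

where the constant C controls how heavily margin violations are penalized.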
Once we find the separating hyperplane, called the soft margin hyperplane, we can
classify new points depending on which side of the hyperplane the point lies on.
This method of classifying points is called the soft margin classifier.
If the data points are not linearly separable and it appears that the decision
boundary separating the two classes is non-linear, we can use what’s called the
support vector machine, or support vector machine classifier. The idea is to
consider a larger feature space with data points in this larger space associated with
the original data points and to apply the support vector classifier to this new set of
data points in the larger feature space. This will give us a linear decision
boundary in the enlarged feature space but a non-linear decision boundary in the
original feature space. Any new point is classified by sending it into the larger
space and using the linear decision boundary. A situation that requires support vector machines is one in which the two classes can be separated only by a curved boundary in the original feature space.
In the process of solving the convex optimization problem for the soft margin
classifier, a dot product occurs; in the method of support vector machines, we
replace the dot product by something called a kernel. A kernel is essentially a
function that can be represented as the inner product of the images of the input
values under some transformation ℎ. This replacement of the dot product with a
kernel is called the kernel trick.
The kernel K should be a valid kernel; that is, there should be a feature space mapping h that corresponds to K. By Mercer’s theorem, it’s sufficient that K be symmetric positive semidefinite.
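As a small illustration of a valid kernel (this particular degree-2 polynomial kernel and feature map are an example, not the article’s choice): for 2-dimensional inputs, K(x, z) = (xᵀz)² is the inner product of the images of x and z under the mapping h(x) = (x_1², √2 x_1 x_2, x_2²).

    import numpy as np

    def h(x):
        """Feature map corresponding to the degree-2 polynomial kernel on 2-D inputs."""
        x1, x2 = x
        return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

    def K(x, z):
        """Degree-2 polynomial kernel: the squared dot product."""
        return (x @ z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(K(x, z), h(x) @ h(z))   # both equal 1.0, so K is an inner product of h-images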
In the support vector machine method, the enlarged feature space could be very
high-dimensional, even infinite dimensional. By working directly with kernels,
we don’t have to deal with the feature mapping ℎ or the enlarged feature space.
In the method of support vector machines, we see that the notions of dot product
and symmetric positive semidefiniteness are used; these notions are from linear
algebra. To solve the convex optimization problem, the method of Lagrange multipliers is used; this is a notion from multivariable calculus.
7. Conclusion
In this article, we have looked at the mathematics behind the machine learning
techniques linear regression, linear discriminant analysis, logistic regression,
artificial neural networks, and support vector machines.