Math Behind Machine Learning
Richard Han
www.onlinemathtraining.com
[email protected]
Abstract
In this article, we look at the mathematics behind the machine learning techniques linear
regression, linear discriminant analysis, logistic regression, artificial neural networks, and
support vector machines.
1. Introduction
In this article, we will identify the math subjects and topics used by the machine
learning techniques linear regression, linear discriminant analysis, logistic
regression, artificial neural networks, and support vector machines.
2. Linear Regression
Let’s look at several techniques in machine learning and the math topics that are
used in the process.
In linear regression, we try to find the best-fit line or hyperplane for a given set of
data points. We model the output as a linear combination of the input variables,
using a set of parameters as weights.
The parameters are found by minimizing the residual sum of squares. We find a
critical point by setting the vector of derivatives of the residual sum of squares to
the zero vector. By the second derivative test, if the Hessian of the residual sum
of squares at a critical point is positive definite, then the residual sum of squares
has a local minimum there.
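As a brief sketch of this calculus step, write the data inputs as the rows of a matrix X (each row beginning with 1) and the data outputs as a vector y, as in the matrix formulation below; then

    RSS(β) = (y − Xβ)ᵀ(y − Xβ)
    ∇RSS(β) = −2Xᵀ(y − Xβ)
    ∇²RSS(β) = 2XᵀX  (the Hessian)

Setting the gradient to the zero vector gives the critical-point condition XᵀXβ = Xᵀy, and the critical point is a local minimum when the Hessian 2XᵀX is positive definite.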
In the above process, we used derivatives, the second derivative test, and the
Hessian, which are notions from multivariable calculus. We can also find the
solution to our minimization problem using linear algebra.
Let X be the matrix whose rows are our data inputs, beginning with 1 in each row, and let y be the vector of our data outputs. We want a vector β such that Xβ is close to y; in other words, we want a vector β such that the distance ‖Xβ − y‖ between Xβ and y is minimized. The vector β which minimizes the distance is such that Xβ is the projection of y onto the column space of X, because the projection of y onto the column space of X is the vector in the column space of X that is closest to y. We then use the fact that Euclidean N-space can be broken into two subspaces, the column space of X and the orthogonal complement of the column space of X, and the fact that any vector in Euclidean N-space can be written uniquely as the sum of vectors in the column space of X and in the orthogonal complement of the column space of X, respectively, to deduce that y − Xβ is orthogonal to the columns of X. From here, we arrive at the matrix equation XᵀXβ = Xᵀy. If XᵀX is positive definite, then the eigenvalues of XᵀX are all positive, so 0 is not an eigenvalue of XᵀX, and it follows that XᵀX is invertible. We can then solve the matrix equation for β, and the result agrees with the one obtained using multivariable calculus.
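As a minimal computational sketch of this solution (the function name and the use of numpy are illustrative assumptions, not part of the article):

    import numpy as np

    def fit_linear_regression(X_raw, y):
        """Solve the normal equations X^T X beta = X^T y for the least-squares weights.
        X_raw: (n, p) array of data inputs without the leading 1s; y: (n,) array of outputs."""
        n = X_raw.shape[0]
        X = np.hstack([np.ones((n, 1)), X_raw])   # each row of X begins with 1
        # Solving the linear system is numerically preferable to forming the inverse of X^T X
        return np.linalg.solve(X.T @ X, X.T @ y)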
In the solution we just discussed, the notions of norm, projection, column space, subspace, orthogonal complement, orthogonality, positive definiteness, eigenvalue, and invertibility are used. These are notions from linear algebra. We also used the facts that Euclidean N-space can be broken into two subspaces, the column space of X and the orthogonal complement of the column space of X, and that any vector in Euclidean N-space can be written uniquely as the sum of vectors in the column space of X and in the orthogonal complement of the column space of X, respectively.
3. Linear Discriminant Analysis
In linear discriminant analysis, we estimate Pr(Y = k | X = x), the probability that Y is the class k given that the input variable X is x. This is called a posterior probability function. Once we have all of these probabilities for a fixed x, we pick the class k for which the probability Pr(Y = k | X = x) is largest. We then classify x as that class k.
Using Bayes’ rule, we can rewrite the posterior probability function in terms of π_k = Pr(Y = k), the prior probability that Y = k, and f_k(x) = Pr(X = x | Y = k), the probability that X = x given that Y = k.
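Concretely, writing p_k(x) for the posterior probability, Bayes’ rule gives

    p_k(x) = Pr(Y = k | X = x) = π_k f_k(x) / Σ_l π_l f_l(x),

where the sum in the denominator runs over all classes l.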
In linear discriminant analysis, each class density f_k(x) is modeled as a multivariate Gaussian with mean μ_k and a covariance matrix Σ common to all classes. Now, we find estimates for π_k, μ_k, and Σ, and hence for p_k(x). We classify x according to the class k for which the estimated p_k(x) is greatest.
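A minimal sketch of this estimation and classification step under the Gaussian model with a shared covariance matrix (the function name, the use of numpy, and the form of the linear discriminant score are illustrative):

    import numpy as np

    def lda_classify(X, y, x_new):
        """Estimate pi_k, mu_k, and the shared covariance Sigma from training data
        (X: (n, p) inputs, y: (n,) integer class labels), then classify x_new."""
        classes = np.unique(y)
        n, p = X.shape
        pi = np.array([np.mean(y == k) for k in classes])          # prior estimates
        mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
        # Pooled (within-class) covariance estimate
        Sigma = sum((X[y == k] - mu[i]).T @ (X[y == k] - mu[i])
                    for i, k in enumerate(classes)) / (n - len(classes))
        Sigma_inv = np.linalg.inv(Sigma)
        # Linear discriminant scores; the largest score gives the largest estimated posterior
        scores = [x_new @ Sigma_inv @ mu[i] - 0.5 * mu[i] @ Sigma_inv @ mu[i] + np.log(pi[i])
                  for i in range(len(classes))]
        return classes[int(np.argmax(scores))]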
4. Logistic Regression
Assuming there are only two classes 0 and 1, let p(x) = Pr(Y = 1 | X = x). In logistic regression, we assume that the log-odds is a linear function of the components of x, with parameters β_0, β_1, …, β_p. Under this assumption, we can solve for p(x) as a function of the parameters and the components of x. We can then obtain an estimate for p(x) once we have estimates for the parameters β_0, β_1, …, β_p.
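Concretely, the log-odds assumption and the resulting expression for p(x) are

    log[ p(x) / (1 − p(x)) ] = β_0 + β_1 x_1 + ⋯ + β_p x_p
    p(x) = exp(β_0 + β_1 x_1 + ⋯ + β_p x_p) / (1 + exp(β_0 + β_1 x_1 + ⋯ + β_p x_p))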
The probability of our observed data is a function of the parameters β_0, β_1, …, β_p and is called a likelihood function. We find estimates for the parameters by maximizing the likelihood function. Maximizing the likelihood function is equivalent to maximizing the log of the likelihood function. To maximize the log-likelihood function, we use the Newton-Raphson method.
The log-likelihood function L(β) is a real-valued function of β = (β_0, β_1, …, β_p). So L is a function from ℝ^(p+1) to ℝ. Further, L is twice continuously differentiable, so we can apply the multivariate Newton-Raphson method.
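A minimal sketch of the Newton-Raphson iteration for the logistic log-likelihood (the function name, the use of numpy, and the stopping rule are illustrative assumptions):

    import numpy as np

    def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
        """X: (n, p) array of inputs without the leading 1s; y: (n,) array of 0/1 labels."""
        n, p = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])          # prepend the constant 1
        beta = np.zeros(p + 1)
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # current estimate of p(x) for each row
            grad = Xb.T @ (y - mu)                    # gradient of the log-likelihood
            W = mu * (1.0 - mu)
            hess = -(Xb * W[:, None]).T @ Xb          # Hessian of the log-likelihood
            step = np.linalg.solve(hess, grad)
            beta = beta - step                        # Newton update: beta - H^{-1} grad
            if np.linalg.norm(step) < tol:
                break
        return beta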
5. Artificial Neural Networks
Next, we’ll look at a method that can be used to solve both regression and classification problems. In artificial neural networks, we use compositions of linear and nonlinear functions to model our output functions.
The input units, including the constant 1, will form the input layer. We take a linear combination of the input units, including the constant 1, and then apply an activation function h to it to get a new unit; h is a differentiable (possibly nonlinear) function. We do this, say, M times; we now have M hidden units, and these make up a hidden layer. In a diagram of the network, the weights in the linear combinations are represented by the line segments connecting two units. We can continue this process of taking linear combinations of the units in the previous layer and applying an activation function to each linear combination to create new hidden units, thus creating the next hidden layer. At some point we have a last layer, called the output layer, and we use an activation function g_k for each output unit Y_k. Each g_k is a function of all the linear combinations of the units from the previous layer.
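As an illustrative sketch (the choice of tanh for the hidden activation h and sigmoids for the output activations g_k is an assumption, not the article’s), a forward pass through a network with one hidden layer of M units might look like this:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        """x: (p,) input; W1: (M, p) and b1: (M,) give the hidden layer;
        W2: (K, M) and b2: (K,) give the output layer."""
        a1 = W1 @ x + b1                    # linear combinations of the input units
        z1 = np.tanh(a1)                    # activation h applied to each -> M hidden units
        a2 = W2 @ z1 + b2                   # linear combinations of the hidden units
        return 1.0 / (1.0 + np.exp(-a2))    # output activations g_k (sigmoid here)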
So far, we’ve constructed output values Y_k that depend on an input x and that
involve a bunch of unknown parameters. Our goal now is to use our training data
to find values for the unknown parameters that minimize error. For binary
classification, we find estimates for the parameters by maximizing the likelihood
function associated with the probability of our observed data; this corresponds to
minimizing what’s called the cross-entropy error function. Similarly, for multiclass classification, maximizing the likelihood of our observed data corresponds to minimizing what’s called the multiclass cross-entropy error function. For regression, we find estimates for the parameters by minimizing the sum-of-squares error function.
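For reference, with targets t_n and corresponding network outputs y_n, these error functions take the following standard forms:

    Cross-entropy (binary):      E = −Σ_n [ t_n log y_n + (1 − t_n) log(1 − y_n) ]
    Cross-entropy (multiclass):  E = −Σ_n Σ_k t_nk log y_nk
    Sum-of-squares (regression): E = (1/2) Σ_n ‖y(x_n) − t_n‖²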
To minimize the error function, we use gradient descent, which requires finding
the gradient of the error function. To find the gradient of the error function, we
use backpropagation.
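A minimal sketch of one gradient-descent step computed by backpropagation, for a single-hidden-layer network with a sigmoid output and the binary cross-entropy error (all names and the tanh/sigmoid choices are illustrative assumptions):

    import numpy as np

    def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
        """Update the parameters with one gradient-descent step of learning rate lr."""
        # Forward pass
        a1 = W1 @ x + b1
        z1 = np.tanh(a1)
        a2 = W2 @ z1 + b2
        y = 1.0 / (1.0 + np.exp(-a2))
        # Backward pass: propagate the error from the output layer toward the input layer
        delta2 = y - t                                # dE/da2 for sigmoid output + cross-entropy
        delta1 = (1.0 - z1 ** 2) * (W2.T @ delta2)    # dE/da1 through the tanh hidden units
        # Gradient-descent updates
        W2 -= lr * np.outer(delta2, z1)
        b2 -= lr * delta2
        W1 -= lr * np.outer(delta1, x)
        b1 -= lr * delta1
        return W1, b1, W2, b2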
6. Support Vector Machines
Let’s look at the method of support vector machines for solving classification problems. The idea is that we have a bunch of data points, say of two classes, and we want to separate them with a decision boundary. For instance, the data points might be easily separated by a straight line.
If the data points can be easily separated using a line or hyperplane, we find the
separating hyperplane that is as far as possible from the points so that there is a
large margin. This requires maximizing the margin, and it ends up being a convex
optimization problem. To solve this convex optimization problem, we use
Lagrange multipliers, a notion from multivariable calculus. Once we find the
maximal margin hyperplane, we can classify new points depending on which side
of the hyperplane the point lies on. This method of classifying points is called the
maximal margin classifier.
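In a standard formulation, with class labels y_i ∈ {−1, +1} and hyperplane wᵀx + b = 0, maximizing the margin amounts to the convex problem

    minimize over w, b:   (1/2)‖w‖²
    subject to:           y_i (wᵀx_i + b) ≥ 1   for every training point (x_i, y_i),

and the Lagrange multipliers α_i ≥ 0 enter through the Lagrangian (1/2)‖w‖² − Σ_i α_i [ y_i(wᵀx_i + b) − 1 ].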
If the data points are not separable by a hyperplane, we can still try to find a hyperplane that separates most of the points but that may leave some points inside the margin or on the wrong side of the hyperplane.
Just as in the case of the maximal margin classifier, we want our hyperplane to be as far as possible from each point that is on the correct side of it. Points on or outside the margin and on the correct side of the hyperplane should be as far as possible from the hyperplane. Points inside the margin but on the correct side of the hyperplane should be as far as possible from the hyperplane and as close as possible to the margin boundary. Points on the wrong side of the hyperplane should be as close to the hyperplane as possible.
So, as before, we want to maximize the margin so that points on the correct side of the hyperplane are as far as possible from the hyperplane.
Not only do we want to maximize the margin, we also want to minimize the
violations of the margin. This problem turns out to be a convex optimization
problem, and it is solved using Lagrange multipliers.
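In one common formulation, slack variables ξ_i measure the margin violations, and the problem becomes

    minimize over w, b, ξ:   (1/2)‖w‖² + C Σ_i ξ_i
    subject to:              y_i (wᵀx_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0  for every i,

where the constant C controls how heavily margin violations are penalized.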
Once we find the separating hyperplane, called the soft margin hyperplane, we can
classify new points depending on which side of the hyperplane the point lies on.
This method of classifying points is called the soft margin classifier.
If the data points are not linearly separable and it appears that the decision
boundary separating the two classes is non-linear, we can use what’s called the
support vector machine, or support vector machine classifier. The idea is to
consider a larger feature space with data points in this larger space associated with
the original data points and to apply the support vector classifier to this new set of
data points in the larger feature space. This will give us a linear decision
boundary in the enlarged feature space but a non-linear decision boundary in the
original feature space. Any new point is classified by sending it into the larger
space and using the linear decision boundary. A situation that requires support vector machines is one in which the two classes can be separated only by a curved boundary in the original feature space.
In the process of solving the convex optimization problem for the soft margin
classifier, a dot product occurs; in the method of support vector machines, we
replace the dot product by something called a kernel. A kernel is essentially a
function that can be represented as the inner product of the images of the input
values under some transformation ℎ. This replacement of the dot product with a
kernel is called the kernel trick.
The kernel K should be a valid kernel; that is, there should be a feature space mapping h that corresponds to K. By Mercer’s theorem, it’s sufficient that K be symmetric positive semidefinite.
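As a small illustration of a valid kernel (this particular degree-2 polynomial kernel and feature map are an example, not the article’s choice): for 2-dimensional inputs, K(x, z) = (xᵀz)² is the inner product of the images of x and z under the mapping h(x) = (x_1², √2 x_1 x_2, x_2²).

    import numpy as np

    def h(x):
        """Feature map corresponding to the degree-2 polynomial kernel on 2-D inputs."""
        x1, x2 = x
        return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

    def K(x, z):
        """Degree-2 polynomial kernel: the squared dot product."""
        return (x @ z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(K(x, z), h(x) @ h(z))   # both equal 1.0, so K is an inner product of h-images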
In the support vector machine method, the enlarged feature space could be very
high-dimensional, even infinite dimensional. By working directly with kernels,
we don’t have to deal with the feature mapping ℎ or the enlarged feature space.
In the method of support vector machines, we see that the notions of dot product
and symmetric positive semidefiniteness are used; these notions are from linear
algebra. To solve the convex optimization problem, the method of Lagrange multipliers is used; this is a notion from multivariable calculus.
7. Conclusion
In this article, we have looked at the mathematics behind the machine learning
techniques linear regression, linear discriminant analysis, logistic regression,
artificial neural networks, and support vector machines.