
MATHEMATICS FOR MACHINE LEARNING

A Comprehensive Guide to Building Mathematical Foundations for AI and Data Science

Part 1: Beginner level

By Mohamed AAZI
SECTION 1 : LINEAR ALGEBRA

Vector Addition

$$u + v = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} + \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{pmatrix}$$

Explanation: Vector addition combines two vectors component-wise. It is commonly used in machine learning for gradient updates or geometric vector operations.

Example: If $u = (1, 2)^T$ and $v = (3, 4)^T$, then $u + v = (4, 6)^T$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = u + v

Scalar Multiplication of a Vector

$$\alpha v = \alpha \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{pmatrix}$$

Explanation: Scalar multiplication scales each component of a vector by the same scalar. It is used in scaling gradients or controlling vector magnitudes.

Example: If $\alpha = 3$ and $v = (2, -1)^T$, then $\alpha v = (6, -3)^T$.

Implementation:

import numpy as np
alpha = 3
v = np.array([2, -1])
result = alpha * v

Dot Product

$$u \cdot v = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$

Explanation: The dot product calculates a scalar representing the magnitude of projection of one vector onto another. It is widely used in ML for similarity measures or linear operations.

Example: If $u = (1, 2)^T$ and $v = (3, 4)^T$, then $u \cdot v = 1 \cdot 3 + 2 \cdot 4 = 11$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = np.dot(u, v)

Cross Product (3D)

$$u \times v = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}$$

Explanation: The cross product generates a vector perpendicular to two input vectors in 3D space. It is commonly used in physics and computer graphics.

Example: If $u = (1, 0, 0)^T$ and $v = (0, 1, 0)^T$, then $u \times v = (0, 0, 1)^T$.

Implementation:

import numpy as np
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
result = np.cross(u, v)

Norm of a Vector (Euclidean)

$$\|v\| = \sqrt{\sum_{i=1}^{n} v_i^2} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

Explanation: The Euclidean norm measures the magnitude (length) of a vector. It is useful in optimization and distance computations in ML.

Example: If $v = (3, 4)^T$, then $\|v\| = \sqrt{3^2 + 4^2} = 5$.

Implementation:

import numpy as np
v = np.array([3, 4])
result = np.linalg.norm(v)

Orthogonality Condition

$$u \cdot v = 0$$

Explanation: Two vectors are orthogonal if their dot product is zero. This condition is critical in linear algebra and ML for understanding independence and basis construction.

Example: If $u = (1, 2)^T$ and $v = (-2, 1)^T$, then $u \cdot v = 1 \cdot (-2) + 2 \cdot 1 = 0$, confirming orthogonality.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([-2, 1])
result = np.dot(u, v)
is_orthogonal = result == 0

Matrix Addition

$$A + B = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}$$

Explanation: Matrix addition combines two matrices element-wise. It is used in ML for updating weights and biases or aggregating data.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $A + B = \begin{pmatrix} 6 & 8 \\ 10 & 12 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = A + B

Matrix Scalar Multiplication

$$\alpha A = \alpha \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} \alpha a_{11} & \alpha a_{12} \\ \alpha a_{21} & \alpha a_{22} \end{pmatrix}$$

Explanation: Scaling a matrix by a scalar is useful in ML for adjusting learning rates or normalization.

Example: If $\alpha = 2$ and $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\alpha A = \begin{pmatrix} 2 & 4 \\ 6 & 8 \end{pmatrix}$.

Implementation:

import numpy as np
alpha = 2
A = np.array([[1, 2], [3, 4]])
result = alpha * A

Matrix-Vector Multiplication

$$Ax = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \end{pmatrix}$$

Explanation: Matrix-vector multiplication transforms a vector using a linear transformation defined by the matrix. It is fundamental in ML for applying weights to inputs.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $x = (5, 6)^T$, then $Ax = (17, 39)^T$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([5, 6])
result = np.dot(A, x)

Matrix Multiplication

$$C = AB, \quad c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

Explanation: Matrix multiplication combines two matrices, producing a matrix that represents the composition of linear transformations. It is used in ML for layer operations in neural networks.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $AB = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.dot(A, B)

Transpose of a Matrix

$$A^T = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^T = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}$$

Explanation: The transpose of a matrix flips it over its diagonal, exchanging rows with columns. It is used in ML for switching between data representations.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = A.T

Determinant of a 2×2 Matrix

$$\det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11} a_{22} - a_{12} a_{21}$$

Explanation: The determinant measures the scaling factor of the transformation represented by a matrix. It is used to determine matrix invertibility.

Example: If $A = \begin{pmatrix} 3 & 8 \\ 4 & 6 \end{pmatrix}$, then $\det(A) = 3 \cdot 6 - 8 \cdot 4 = -14$.

Implementation:

import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.det(A)

Inverse of a 2×2 Matrix

$$A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}, \quad \det(A) \neq 0$$

Explanation: The inverse of a 2×2 matrix reverses the linear transformation it represents. It is used in solving systems of linear equations.

Example: If $A = \begin{pmatrix} 3 & 8 \\ 4 & 6 \end{pmatrix}$, then $\det(A) = -14$ and $A^{-1} = \frac{1}{-14} \begin{pmatrix} 6 & -8 \\ -4 & 3 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.inv(A)

Cramer's Rule

$$x_i = \frac{\det(A_i)}{\det(A)}, \quad \det(A) \neq 0$$

Explanation: Cramer's Rule solves a system of linear equations $Ax = b$ by replacing each column of $A$ with $b$ and computing determinants. It is a theoretical method often used for small systems.

Example: For $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$ and $b = (5, 7)^T$,

$$A_1 = \begin{pmatrix} 5 & 1 \\ 7 & 3 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 2 & 5 \\ 1 & 7 \end{pmatrix}$$

and $\det(A) = 5$, so $x_1 = \frac{\det(A_1)}{\det(A)} = \frac{8}{5}$ and $x_2 = \frac{\det(A_2)}{\det(A)} = \frac{9}{5}$.

Implementation:

import numpy as np
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
det_A = np.linalg.det(A)
x = []
for i in range(A.shape[1]):
    A_i = A.copy()
    A_i[:, i] = b  # replace column i of A with b
    x.append(np.linalg.det(A_i) / det_A)  # x = [1.6, 1.8]

Inverse of a Square Matrix

$$A^{-1} = \frac{1}{\det(A)} \operatorname{adj}(A), \quad \det(A) \neq 0$$

Explanation: The inverse of a square matrix generalizes the process for higher dimensions using the adjugate and determinant. It is crucial in linear algebra and ML for solving systems of equations.

Example: If $A = \begin{pmatrix} 4 & 7 \\ 2 & 6 \end{pmatrix}$, the inverse is computed using cofactor expansion and scaling.

Implementation:

import numpy as np
A = np.array([[4, 7], [2, 6]])
result = np.linalg.inv(A)

Determinant of a Triangular Matrix

$$\det(A) = \prod_{i=1}^{n} a_{ii}$$

Explanation: The determinant of a triangular matrix (upper or lower) is the product of its diagonal elements. This simplifies determinant calculations and is useful in decompositions.

Example: If $A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 3 & 4 \\ 0 & 0 & 5 \end{pmatrix}$, then $\det(A) = 2 \cdot 3 \cdot 5 = 30$.

Implementation:

import numpy as np
A = np.array([[2, 1, 0], [0, 3, 4], [0, 0, 5]])
result = np.prod(np.diag(A))

Rank-Nullity Theorem

$$\operatorname{rank}(A) + \operatorname{nullity}(A) = n$$

Explanation: The Rank-Nullity Theorem states that the sum of the rank (dimension of column space) and nullity (dimension of null space) of a matrix equals the number of columns. It is fundamental in linear algebra for understanding solutions to systems of linear equations.

Example: If $A$ has 3 columns and its rank is 2, then the nullity is 1, since $2 + 1 = 3$.

Implementation:

import numpy as np
from numpy.linalg import matrix_rank
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rank = matrix_rank(A)
nullity = A.shape[1] - rank

Hadamard (Elementwise) Product

$$C = A \circ B = \begin{pmatrix} a_{11} b_{11} & a_{12} b_{12} \\ a_{21} b_{21} & a_{22} b_{22} \end{pmatrix}$$

Explanation: The Hadamard product performs elementwise multiplication between two matrices. It is used in ML for feature-wise scaling or gating.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $C = \begin{pmatrix} 5 & 12 \\ 21 & 32 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.multiply(A, B)

Outer Product

$$C = u \otimes v = \begin{pmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{pmatrix}$$

Explanation: The outer product generates a matrix by multiplying every element of one vector by every element of another. It is used in tensor operations and constructing rank-1 matrices.

Example: If $u = (1, 2)^T$ and $v = (3, 4, 5)^T$, then $u \otimes v = \begin{pmatrix} 3 & 4 & 5 \\ 6 & 8 & 10 \end{pmatrix}$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4, 5])
result = np.outer(u, v)

Frobenius Norm

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$$

Explanation: The Frobenius norm measures the magnitude of a matrix by summing the squares of all its elements. It is widely used in optimization and matrix analysis.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.linalg.norm(A, 'fro')

Matrix Norm Inequality

$$\|Ax\| \leq \|A\| \, \|x\|$$

Explanation: The matrix norm inequality states that the norm of a matrix-vector product is bounded by the product of the matrix norm and the vector norm. It is a key property in numerical linear algebra and ML for error analysis.

Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $x = (1, 1)^T$, verify that $\|Ax\| \leq \|A\| \, \|x\|$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([1, 1])
left = np.linalg.norm(np.dot(A, x))
right = np.linalg.norm(A) * np.linalg.norm(x)
inequality_holds = left <= right

Matrix Trace

$$\operatorname{Tr}(A) = \sum_{i=1}^{n} a_{ii}$$

Explanation: The trace of a matrix is the sum of its diagonal elements. It is used in ML for loss functions and characterizing matrix properties.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\operatorname{Tr}(A) = 1 + 4 = 5$.
3 4

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.trace(A)

Trace of a Product

$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$$

Explanation: The trace of a product of two matrices is invariant under cyclic permutations. This property is useful in ML for simplifying gradients in matrix calculus.

Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, verify that $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
trace1 = np.trace(np.dot(A, B))
trace2 = np.trace(np.dot(B, A))
equality_holds = trace1 == trace2

Block Matrix Multiplication

$$C = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{pmatrix}$$

Explanation: Block matrix multiplication follows the same rules as scalar matrix multiplication, but each element is a submatrix. It is used in ML for large-scale computations and decompositions.

Example: Compute the block product for two partitioned 4 × 4 matrices.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.array([[9, 10], [11, 12]])
D = np.array([[13, 14], [15, 16]])
E = np.array([[17, 18], [19, 20]])
F = np.array([[21, 22], [23, 24]])
G = np.array([[25, 26], [27, 28]])
H = np.array([[29, 30], [31, 32]])
top_left = np.dot(A, E) + np.dot(B, G)
top_right = np.dot(A, F) + np.dot(B, H)
bottom_left = np.dot(C, E) + np.dot(D, G)
bottom_right = np.dot(C, F) + np.dot(D, H)

result = np.block([[top_left, top_right], [bottom_left, bottom_right]])

Kronecker Product

$$C = A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{pmatrix}$$

Explanation: The Kronecker product produces a block matrix by multiplying each element of one matrix by the entirety of another. It is used in ML for tensor operations and signal processing.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 5 \\ 6 & 7 \end{pmatrix}$, compute $A \otimes B$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])
result = np.kron(A, B)

SECTION 2 : PROBABILITY AND STATISTICS

Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Explanation: Conditional probability quantifies the likelihood of event A occurring given that event B has occurred. It is fundamental in probabilistic reasoning and Bayesian inference.

Example: If $P(A \cap B) = 0.2$ and $P(B) = 0.5$, then $P(A \mid B) = \frac{0.2}{0.5} = 0.4$.

Implementation:

P_A_and_B = 0.2
P_B = 0.5
P_A_given_B = P_A_and_B / P_B

Law of Total Probability

$$P(A) = \sum_{i} P(A \mid B_i) P(B_i)$$

Explanation: The law of total probability relates the probability of an event A to the probabilities of A given a partition of events $\{B_i\}$. It is used in scenarios with conditional dependencies.

Example: If $P(A \mid B_1) = 0.3$, $P(A \mid B_2) = 0.7$, $P(B_1) = 0.4$, and $P(B_2) = 0.6$, then $P(A) = 0.3 \cdot 0.4 + 0.7 \cdot 0.6 = 0.54$.

Implementation:

P_A_given_B1 = 0.3
P_A_given_B2 = 0.7
P_B1 = 0.4
P_B2 = 0.6
P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2

Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}, \quad P(B) > 0$$

Explanation: Bayes' Theorem allows the reversal of conditional probabilities, often used in updating beliefs with new evidence in ML and statistics.

Example: If $P(B \mid A) = 0.8$, $P(A) = 0.3$, and $P(B) = 0.5$, then $P(A \mid B) = \frac{0.8 \cdot 0.3}{0.5} = 0.48$.

Implementation:

P_B_given_A = 0.8
P_A = 0.3
P_B = 0.5
P_A_given_B = (P_B_given_A * P_A) / P_B

Expectation

$$E[X] = \sum_{i} x_i P(X = x_i)$$

Explanation: The expectation (mean) of a random variable is the weighted average of all possible values, weighted by their probabilities. It is central in probability and statistics.

Example: If $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, then $E[X] = 1 \cdot 0.2 + 2 \cdot 0.5 + 3 \cdot 0.3 = 2.1$.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))

Variance

$$\operatorname{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Explanation: Variance measures the spread of a random variable around its mean. It is widely used in ML for assessing uncertainty and model performance.

Example: For $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, compute $E[X] = 2.1$ and $E[X^2] = 4.9$, so $\operatorname{Var}(X) = 4.9 - (2.1)^2 = 0.49$.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))
expectation_X2 = sum(x**2 * p for x, p in zip(X, P_X))
variance = expectation_X2 - expectation**2

Standard Deviation

$$\sigma(X) = \sqrt{\operatorname{Var}(X)}$$

Explanation: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the random variable. It is widely used in data analysis and ML for variability assessment.

Example: If $\operatorname{Var}(X) = 0.49$, as in the previous entry, then $\sigma(X) = \sqrt{0.49} = 0.7$.

Implementation:

variance = 0.29
std_dev = variance**0.5

Covariance

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Explanation: Covariance measures the joint variability of two random variables. A positive value indicates that they increase together, while a negative value indicates an inverse relationship.

Example: If $X = \{1, 2\}$, $Y = \{3, 4\}$, $P(X, Y) = \{0.5, 0.5\}$, and $E[X] = 1.5$, $E[Y] = 3.5$, compute $\operatorname{Cov}(X, Y) = 0.25$.

Implementation:

X = [1, 2]
Y = [3, 4]
P_XY = [0.5, 0.5]
E_X = sum(x * p for x, p in zip(X, P_XY))
E_Y = sum(y * p for y, p in zip(Y, P_XY))
covariance = sum((x - E_X) * (y - E_Y) * p for x, y, p in zip(X, Y, P_XY))

Correlation

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma(X)\sigma(Y)}$$

Explanation: Correlation normalizes covariance to a scale of $[-1, 1]$, quantifying the strength and direction of a linear relationship between two variables.

Example: If $\operatorname{Cov}(X, Y) = 0.25$, $\sigma(X) = 0.5$, and $\sigma(Y) = 1.0$, then $\rho(X, Y) = \frac{0.25}{0.5 \cdot 1.0} = 0.5$.

Implementation:

covariance = 0.25
std_X = 0.5
std_Y = 1.0
correlation = covariance / (std_X * std_Y)

Probability Mass Function (PMF)

$$P(X = x) = \begin{cases} p_i, & \text{if } x = x_i \\ 0, & \text{otherwise} \end{cases}$$

Explanation: The PMF defines the probabilities of discrete outcomes of a random variable. It is a foundational concept in probability theory.

Example: If $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, the PMF is defined for these values.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
def pmf(x):
return P_X[X.index(x)] if x in X else 0

Probability Density Function (PDF)

$$f_X(x) \geq 0, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$

Explanation: The PDF defines the relative likelihood of a continuous random variable at a specific value. It is used in probability and statistics for modeling continuous distributions.

Example: For a standard normal distribution, the PDF is $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

Implementation:

import numpy as np
from scipy.stats import norm
x = 0 # example point
pdf_value = norm.pdf(x)

Joint Probability

$$P(A \cap B) = P(A \mid B) P(B)$$

Explanation: Joint probability quantifies the likelihood of two events occurring together. It is essential in probabilistic modeling and understanding relationships between variables.

Example: If $P(A \mid B) = 0.4$ and $P(B) = 0.5$, then $P(A \cap B) = 0.4 \cdot 0.5 = 0.2$.

Implementation:

P_A_given_B = 0.4
P_B = 0.5
P_A_and_B = P_A_given_B * P_B

CDF (Cumulative Distribution Function)

$$F_X(x) = P(X \leq x)$$

Explanation: The CDF of a random variable gives the probability that the variable takes a value less than or equal to $x$. It is used to describe the distribution function for both discrete and continuous variables.

Example: For a uniform distribution $X \sim U(0, 1)$, $F_X(0.5) = 0.5$.

Implementation:

from scipy.stats import uniform


x = 0.5
cdf_value = uniform.cdf(x, loc=0, scale=1)

Entropy (discrete)

$$H(X) = -\sum_{i} P(X = x_i) \log_2 P(X = x_i)$$

Explanation: Entropy measures the uncertainty of a discrete random variable. It is a fundamental concept in information theory and ML, particularly in decision trees and loss functions.

Example: If $P(X) = \{0.5, 0.5\}$, then $H(X) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1$.

Implementation:

import numpy as np
P_X = [0.5, 0.5]
entropy = -sum(p * np.log2(p) for p in P_X if p > 0)

Conditional Expectation

$$E[X \mid Y] = \sum_{x} x \, P(X = x \mid Y)$$

Explanation: Conditional expectation is the expected value of a random variable $X$ given that another variable $Y$ is known. It is critical in Bayesian inference and probabilistic modeling.

Example: If $X = \{1, 2\}$ with $P(X = 1 \mid Y) = 0.7$ and $P(X = 2 \mid Y) = 0.3$, then $E[X \mid Y] = 1 \cdot 0.7 + 2 \cdot 0.3 = 1.3$.

Implementation:

X = [1, 2]
P_X_given_Y = [0.7, 0.3]
conditional_expectation = sum(x * p for x, p in zip(X, P_X_given_Y))

Law of Iterated Expectations

$$E[X] = E[E[X \mid Y]]$$

Explanation: The law of iterated expectations states that the expectation of $X$ is the weighted average of its conditional expectations over $Y$. It is foundational in probability theory and statistics.

Example: Suppose $X$ depends on $Y = \{1, 2\}$, with $E[X \mid Y = 1] = 3$, $E[X \mid Y = 2] = 5$, and $P(Y = 1) = 0.6$, $P(Y = 2) = 0.4$. Then $E[X] = 3 \cdot 0.6 + 5 \cdot 0.4 = 3.8$.

Implementation:

E_X_given_Y = [3, 5]
P_Y = [0.6, 0.4]
E_X = sum(e * p for e, p in zip(E_X_given_Y, P_Y))

Marginal Probability

$$P(A) = \sum_{B} P(A \cap B)$$

Explanation: Marginal probability calculates the probability of an event $A$ by summing (or integrating, for continuous cases) over all possible outcomes of another variable $B$. It is used in probabilistic modeling to reduce joint distributions.

Example: If $P(A \cap B_1) = 0.3$ and $P(A \cap B_2) = 0.4$, then $P(A) = 0.3 + 0.4 = 0.7$.

Implementation:

P_A_and_B = [0.3, 0.4]


P_A = sum(P_A_and_B)

Skewness

$$\operatorname{Skewness}(X) = \frac{E[(X - \mu)^3]}{\sigma^3}$$

Explanation: Skewness measures the asymmetry of the probability distribution of a random variable about its mean. Positive skew indicates a longer right tail, and negative skew indicates a longer left tail.

Example: For $X = \{1, 2, 3\}$ with mean $\mu = 2$ and standard deviation $\sigma \approx 0.816$, compute $\operatorname{Skewness}(X)$ using the third central moment.

Implementation:

import numpy as np
X = [1, 2, 3]
mu = np.mean(X)
sigma = np.std(X)
skewness = np.mean(((X - mu) / sigma)**3)

Kurtosis

$$\operatorname{Kurtosis}(X) = \frac{E[(X - \mu)^4]}{\sigma^4}$$

Explanation: Kurtosis measures the "tailedness" of the probability distribution. A high kurtosis indicates heavy tails, while a low kurtosis indicates light tails.

Example: For $X = \{1, 2, 3\}$ with mean $\mu = 2$ and standard deviation $\sigma \approx 0.816$, compute $\operatorname{Kurtosis}(X)$ using the fourth central moment.

Implementation:

import numpy as np
X = [1, 2, 3]
mu = np.mean(X)
sigma = np.std(X)
kurtosis = np.mean(((X - mu) / sigma)**4)

Binary Cross-Entropy (special case)

$$\operatorname{BCE}(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$$

Explanation: Binary cross-entropy is a loss function used for binary classification tasks. It measures the dissimilarity between predicted probabilities ($\hat{y}$) and true labels ($y$).

Example: For $y = [1, 0]$ and $\hat{y} = [0.8, 0.2]$, compute $\operatorname{BCE} = -\frac{1}{2}(\log(0.8) + \log(0.8))$.

Implementation:

import numpy as np
y = np.array([1, 0])
y_hat = np.array([0.8, 0.2])
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Variance (Alternative)

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$

Explanation: An alternative formula for variance uses the difference between the expected value of the square of $X$ and the square of the expected value of $X$. This method is computationally efficient.

Example: For $X = \{1, 2, 3\}$ with equal weights, compute $E[X^2] = \frac{1^2 + 2^2 + 3^2}{3} \approx 4.67$ and $(E[X])^2 = 2^2 = 4$, so $\operatorname{Var}(X) \approx 0.67$.

Implementation:

import numpy as np
X = np.array([1, 2, 3])
E_X2 = np.mean(X**2)
E_X = np.mean(X)
variance = E_X2 - E_X**2

SECTION 3 : CALCULUS

Limit Definition of Derivative

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

Explanation: The derivative of a function is defined as the limit of the difference quotient as $h$ approaches zero. It represents the instantaneous rate of change of the function.

Example: For $f(x) = x^2$, $f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = 2x$.

Implementation:

def derivative(f, x, h=1e-5):


return (f(x + h) - f(x)) / h
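
A quick numerical check of the example above, using the derivative function just defined (the input values are illustrative):

approx = derivative(lambda x: x**2, 3.0)  # roughly 6.0, matching f'(x) = 2x at x = 3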

Power Rule

$$\frac{d}{dx} x^n = n x^{n-1}$$

Explanation: The power rule simplifies differentiation of monomials. It is foundational for calculus and widely used in gradient computations in ML.

Example: For $f(x) = x^3$, $f'(x) = 3x^2$.

Implementation:

def power_rule(n, x):


return n * x**(n - 1)

Product Rule

$$\frac{d}{dx}[u(x)v(x)] = u'(x)v(x) + u(x)v'(x)$$

Explanation: The product rule computes the derivative of the product of two functions. It is crucial for handling multiplicative relationships in ML.

Example: For $f(x) = x^2 e^x$, $f'(x) = 2x e^x + x^2 e^x$.

Implementation:

def product_rule(u, v, u_prime, v_prime, x):


return u_prime(x) * v(x) + u(x) * v_prime(x)

Quotient Rule

$$\frac{d}{dx}\left[\frac{u(x)}{v(x)}\right] = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}$$

Explanation: The quotient rule computes the derivative of the ratio of two functions. It is essential for operations involving divisions in ML models.

Example: For $f(x) = \frac{x^2}{e^x}$, $f'(x) = \frac{2x e^x - x^2 e^x}{e^{2x}}$.

Implementation:

def quotient_rule(u, v, u_prime, v_prime, x):


return (u_prime(x) * v(x) - u(x) * v_prime(x)) / (v(x)**2)

Chain Rule

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$

Explanation: The chain rule computes the derivative of a composite function. It is extensively used in backpropagation for training neural networks.

Example: For $f(x) = \sin(x^2)$, $f'(x) = \cos(x^2) \cdot 2x$.

Implementation:

def chain_rule(f_prime, g, g_prime, x):


return f_prime(g(x)) * g_prime(x)

Logarithmic Derivative

$$\frac{d}{dx} \ln(x) = \frac{1}{x}, \quad x > 0$$

Explanation: The derivative of the natural logarithm function is the reciprocal of its argument. It is frequently used in ML for optimization and logarithmic transformations.

Example: For $f(x) = \ln(x)$, $f'(2) = \frac{1}{2}$.

Implementation:

import numpy as np
def log_derivative(x):
return 1 / x

Exponential Derivative

$$\frac{d}{dx} e^x = e^x$$

Explanation: The exponential function is unique in that its derivative is equal to itself. This property is key in gradient computations and exponential growth models in ML.

Example: For $f(x) = e^x$, $f'(2) = e^2$.

Implementation:

import numpy as np
def exp_derivative(x):
return np.exp(x)

Integral of a Power Function

$$\int x^n \, dx = \frac{x^{n+1}}{n+1} + C, \quad n \neq -1$$

Explanation: The integral of a power function generalizes the antiderivative for monomials. This rule is fundamental in integral calculus and applied in ML for cost function analysis.

Example: For $f(x) = x^2$, $\int x^2 \, dx = \frac{x^3}{3} + C$.

Implementation:

def power_integral(n, x):


return x**(n + 1) / (n + 1)

Fundamental Theorem of Calculus

$$\int_a^b f(x)\,dx = F(b) - F(a), \quad \text{where } F'(x) = f(x)$$

Explanation: The Fundamental Theorem of Calculus links differentiation and integration, stating that integration over an interval is the difference of the antiderivative evaluated at the endpoints.

Example: For $f(x) = x^2$ over $[1, 3]$, $\int_1^3 x^2\,dx = \left[\frac{x^3}{3}\right]_1^3 = \frac{27}{3} - \frac{1}{3} = \frac{26}{3}$.

Implementation:

def definite_integral(f, a, b):


from scipy.integrate import quad
result, _ = quad(f, a, b)
return result

Partial Derivatives

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}, \qquad \frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y + h) - f(x, y)}{h}$$

Explanation: Partial derivatives measure the rate of change of a multivariable function with respect to one variable while keeping others constant. They are essential in optimization and gradient-based ML methods.

Example: For $f(x, y) = x^2 + y^2$, $\frac{\partial f}{\partial x} = 2x$ and $\frac{\partial f}{\partial y} = 2y$.

Implementation:

def partial_derivative(f, var, point, h=1e-5):


args = list(point)
args[var] += h
return (f(*args) - f(*point)) / h

Gradient

$$\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

Explanation: The gradient is a vector containing all partial derivatives of a scalar-valued function. It points in the direction of the steepest ascent and is widely used in ML optimization algorithms like gradient descent.

Example: For $f(x, y) = x^2 + y^2$, $\nabla f(x, y) = (2x, 2y)^T$.

Implementation:

import numpy as np
def gradient(f, point, h=1e-5):
grad = np.zeros(len(point))
for i in range(len(point)):
args = point.copy()
args[i] += h
grad[i] = (f(*args) - f(*point)) / h
return grad

Second Derivative (Hessian)

$$H(f) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

Explanation: The Hessian is a square matrix of second-order partial derivatives. It is used in optimization to assess curvature and convergence properties of a function.

Example: For $f(x, y) = x^2 + y^2$, the Hessian is $H(f) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.

Implementation:

def hessian(f, point, h=1e-5):


n = len(point)
hess = np.zeros((n, n))
for i in range(n):
for j in range(n):
args = point.copy()
args[i] += h
args[j] += h
f_ij = f(*args)
args[j] -= h
f_i = f(*args)
args[i] -= h
args[j] += h

f_j = f(*args)
f_orig = f(*point)
hess[i, j] = (f_ij - f_i - f_j + f_orig) / (h ** 2)
return hess

Directional Derivative

$$D_v f(x) = \nabla f(x) \cdot v$$

Explanation: The directional derivative measures the rate of change of a function in the direction of a given vector. It is critical in optimization and ML for evaluating function behavior in a specific direction.

Example: For $f(x, y) = x^2 + y^2$, $\nabla f(x, y) = (2x, 2y)^T$. In the direction $v = (1, 0)^T$, $D_v f(x, y) = 2x$.

Implementation:

def directional_derivative(f, grad_f, point, direction):


grad = grad_f(point)
return np.dot(grad, direction)

Higher-Order Partial Derivatives

$$\frac{\partial^k f}{\partial x_1^{p_1} \partial x_2^{p_2} \cdots \partial x_n^{p_n}}$$

Explanation: Higher-order partial derivatives extend partial derivatives to greater orders. Mixed derivatives often satisfy equality ($f_{xy} = f_{yx}$) under smoothness conditions.

Example: For $f(x, y) = x^2 y$, $\frac{\partial^2 f}{\partial x \partial y} = 2x$.

Implementation:

def higher_order_partial(f, point, var_indices, h=1e-5):
    # Recursively apply a forward difference for each index in var_indices.
    if not var_indices:
        return f(*point)
    shifted = list(point)
    shifted[var_indices[0]] += h
    return (higher_order_partial(f, shifted, var_indices[1:], h)
            - higher_order_partial(f, point, var_indices[1:], h)) / h

Total Derivative

$$\frac{df}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}$$

Explanation: The total derivative accounts for changes in all independent variables as functions of an external variable $t$. It is used in dynamical systems and optimization.

Example: If $f(x, y) = x^2 + y^2$, $x = t$, and $y = t^2$, then $\frac{df}{dt} = 2x \cdot 1 + 2y \cdot 2t = 2t + 4t^3$.

Implementation:

def total_derivative(f, partials, dx_dt, point):


return sum(partials[i] * dx_dt[i] for i in range(len(point)))

Implicit Differentiation

$$\frac{dy}{dx} = -\frac{\partial F / \partial x}{\partial F / \partial y}$$

Explanation: Implicit differentiation computes the derivative of a dependent variable in an equation where the variable cannot be explicitly solved. It is used in ML and calculus for handling complex equations.

Example: For $F(x, y) = x^2 + y^2 - 1 = 0$, $\frac{dy}{dx} = -\frac{x}{y}$.

Implementation:

def implicit_differentiation(F, x, y, partial_F_x, partial_F_y):


return -partial_F_x(x, y) / partial_F_y(x, y)

Taylor Series Expansion

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots$$

Explanation: The Taylor series approximates a function near a point $a$ using its derivatives. It is used in optimization and numerical analysis.

Example: For $f(x) = e^x$ near $a = 0$, $f(x) \approx 1 + x + \frac{x^2}{2} + \cdots$

Implementation:

from math import factorial

def taylor_series(f, derivatives, a, x, terms=3):
    # derivatives[n] returns the n-th derivative of f (derivatives[0] is f itself)
    result = 0
    for n in range(terms):
        result += derivatives[n](a) * (x - a)**n / factorial(n)
    return result

Jacobian Matrix

$$J(f) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

Explanation: The Jacobian matrix contains all first-order partial derivatives of a vector-valued function. It is essential in ML for gradient-based optimization in multivariable spaces.

Example: For $f(x, y) = \begin{pmatrix} x^2 + y \\ y^2 + x \end{pmatrix}$, the Jacobian is $\begin{pmatrix} 2x & 1 \\ 1 & 2y \end{pmatrix}$.

Implementation:

def jacobian(f, point, h=1e-5):


m = len(f)
n = len(point)
J = np.zeros((m, n))
for i in range(m):
for j in range(n):
args = point.copy()
args[j] += h
J[i, j] = (f[i](*args) - f[i](*point)) / h
return J

Arc Length of a Curve

$$L = \int_a^b \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\, dx$$

Explanation: The arc length measures the distance along a curve between two points. It is used in geometry and physics for path analysis.

Example: For $y = x^2$ over $[0, 1]$, $L = \int_0^1 \sqrt{1 + (2x)^2}\, dx$.

Implementation:

from scipy.integrate import quad


def arc_length(f_prime, a, b):
integrand = lambda x: np.sqrt(1 + f_prime(x)**2)
return quad(integrand, a, b)[0]

Curvature of a Function

$$\kappa(x) = \frac{|y''(x)|}{\left(1 + [y'(x)]^2\right)^{3/2}}$$

Explanation: Curvature quantifies how sharply a curve bends at a given point. It is used in geometry and trajectory analysis in robotics and ML.

Example: For $y = x^2$, $y'(x) = 2x$ and $y''(x) = 2$, so $\kappa(x) = \frac{2}{(1 + 4x^2)^{3/2}}$.

Implementation:

def curvature(f_prime, f_double_prime, x):


numerator = abs(f_double_prime(x))
denominator = (1 + f_prime(x)**2)**1.5
return numerator / denominator

Integration by Parts

$$\int u v' \, dx = uv - \int u' v \, dx$$

Explanation: Integration by parts is a technique derived from the product rule of differentiation. It is used to simplify integrals involving products of functions.

Example: For $\int x e^x \, dx$, let $u = x$ and $v' = e^x$. Then $\int x e^x \, dx = x e^x - \int e^x \, dx = x e^x - e^x + C$.

Implementation:

from sympy import symbols, integrate, exp


x = symbols('x')
u = x
v_prime = exp(x)
v = integrate(v_prime, x)
integral = u * v - integrate(v * u.diff(x), x)

Volume of Revolution (Disk Method)

$$V = \pi \int_a^b [f(x)]^2 \, dx$$

Explanation: The disk method computes the volume of a solid of revolution by slicing it into disks perpendicular to the axis of rotation. It is common in geometry and physics.

Example: For $y = x^2$ revolved around the x-axis over $[0, 1]$, $V = \pi \int_0^1 (x^2)^2 \, dx = \pi \int_0^1 x^4 \, dx = \frac{\pi}{5}$.

Implementation:

from scipy.integrate import quad


import numpy as np
def volume_of_revolution(f, a, b):
integrand = lambda x: np.pi * f(x)**2
return quad(integrand, a, b)[0]

Surface Integral

$$\iint_S f(x, y, z)\, dS = \iint_R f(x, y, g(x, y)) \sqrt{1 + \left(\frac{\partial g}{\partial x}\right)^2 + \left(\frac{\partial g}{\partial y}\right)^2}\, dA$$

Explanation: A surface integral extends the idea of a line integral to a surface, summing a scalar field or vector flux over the surface.

Example: Compute the surface integral of $f(x, y, z) = z$ over $z = x^2 + y^2$ for $x^2 + y^2 \leq 1$.

Implementation:

import numpy as np
from scipy.integrate import dblquad

def surface_integral(f, g, grad_g, x_bounds, y_bounds):
    # g(x, y) gives the surface height z; grad_g(x, y) returns (dg/dx, dg/dy)
    def integrand(y, x):  # dblquad passes the inner variable first
        gx, gy = grad_g(x, y)
        return f(x, y, g(x, y)) * np.sqrt(1 + gx**2 + gy**2)
    return dblquad(integrand, *x_bounds, *y_bounds)[0]

Divergence of a Vector Field

$$\operatorname{div} \mathbf{F} = \nabla \cdot \mathbf{F} = \frac{\partial F_1}{\partial x} + \frac{\partial F_2}{\partial y} + \frac{\partial F_3}{\partial z}$$

Explanation: The divergence measures the magnitude of a vector field's source or sink at a given point. It is used in fluid dynamics and electromagnetism.

Example: For $\mathbf{F} = (x, y, z)^T$, $\operatorname{div} \mathbf{F} = 1 + 1 + 1 = 3$.

Implementation:

from sympy import symbols, diff


x, y, z = symbols('x y z')
F = [x, y, z]
divergence = sum(diff(F[i], var) for i, var in enumerate([x, y, z]))

Curl of a Vector Field

$$\operatorname{curl} \mathbf{F} = \nabla \times \mathbf{F} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\ F_1 & F_2 & F_3 \end{vmatrix}$$

Explanation: The curl measures the rotation or circulation of a vector field at a point. It is critical in fluid mechanics and electromagnetism.

Example: For $\mathbf{F} = (0, 0, xy)^T$, $\operatorname{curl} \mathbf{F} = (x, -y, 0)^T$.

Implementation:

from sympy import symbols, diff, Matrix

x, y, z = symbols('x y z')
F1, F2, F3 = 0, 0, x*y
curl = Matrix([diff(F3, y) - diff(F2, z),
               diff(F1, z) - diff(F3, x),
               diff(F2, x) - diff(F1, y)])  # evaluates to (x, -y, 0)

SECTION 4 : OPTIMIZATION

Gradient Descent

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J(\theta^{(t)})$$

Explanation: Gradient descent is an optimization algorithm that iteratively updates parameters in the direction of the negative gradient to minimize the cost function $J(\theta)$.

Example: For $J(\theta) = \theta^2$ and $\eta = 0.1$, the update is $\theta^{(t+1)} = \theta^{(t)} - 0.2\,\theta^{(t)}$.

Implementation:

def gradient_descent(gradient, theta, eta, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
return theta
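
A usage sketch on the quadratic example above, passing the gradient $\nabla J(\theta) = 2\theta$ as a lambda (the starting point is illustrative):

theta = gradient_descent(lambda t: 2 * t, theta=5.0, eta=0.1, steps=100)
# each step multiplies theta by (1 - 0.2), so theta decays toward 0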

Stochastic Gradient Descent (SGD)

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J_i(\theta^{(t)})$$

Explanation: SGD computes gradients on individual data points, updating parameters more frequently. It is widely used in ML due to its efficiency with large datasets.

Example: For $J_i(\theta) = (\theta - y_i)^2$, the update is based on one data point at each iteration.

Implementation:

def stochastic_gradient_descent(gradient, theta, eta, data, steps):


for _ in range(steps):
i = np.random.randint(len(data))
theta -= eta * gradient(theta, data[i])
return theta

Momentum-based Gradient Descent

$$v^{(t+1)} = \beta v^{(t)} - \eta \nabla J(\theta^{(t)}), \qquad \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$$

Explanation: Momentum adds an exponentially weighted moving average of past gradients to the current update, improving convergence speed and stability.

Example: For $\beta = 0.9$ and $\eta = 0.1$, the velocity update smooths oscillations in gradient descent.

Implementation:

def momentum_gradient_descent(gradient, theta, eta, beta, steps):


v = 0
for _ in range(steps):
v = beta * v - eta * gradient(theta)
theta += v
return theta

Nesterov Accelerated Gradient (NAG)

$$v^{(t+1)} = \beta v^{(t)} - \eta \nabla J(\theta^{(t)} + \beta v^{(t)}), \qquad \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$$

Explanation: NAG improves upon momentum by calculating gradients at a lookahead position, resulting in more precise updates.

Example: For $\beta = 0.9$, NAG anticipates the future direction, reducing overshooting in oscillatory scenarios.

Implementation:

def nesterov_gradient_descent(gradient, theta, eta, beta, steps):


v = 0
for _ in range(steps):
lookahead = theta + beta * v
v = beta * v - eta * gradient(lookahead)
theta += v
return theta

RMSProp

$$s^{(t+1)} = \beta s^{(t)} + (1 - \beta)[\nabla J(\theta^{(t)})]^2, \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{s^{(t+1)}} + \epsilon} \nabla J(\theta^{(t)})$$

Explanation: RMSProp scales the learning rate by a moving average of squared gradients, improving convergence for non-convex problems.

Example: For $\beta = 0.9$, RMSProp adapts the step size for each parameter, stabilizing updates.

Implementation:

def rmsprop(gradient, theta, eta, beta, epsilon, steps):


s = 0
for _ in range(steps):
grad = gradient(theta)
s = beta * s + (1 - beta) * grad**2
theta -= eta / (np.sqrt(s) + epsilon) * grad
return theta

Adam Optimization

$$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\nabla J(\theta^{(t)}), \qquad s^{(t+1)} = \beta_2 s^{(t)} + (1 - \beta_2)[\nabla J(\theta^{(t)})]^2$$

$$\hat{m} = \frac{m^{(t+1)}}{1 - \beta_1^t}, \qquad \hat{s} = \frac{s^{(t+1)}}{1 - \beta_2^t}, \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{\hat{s}} + \epsilon} \hat{m}$$

Explanation: Adam combines momentum and RMSProp, adapting step sizes and smoothing updates. It is one of the most popular optimization algorithms in ML.

Example: For $\beta_1 = 0.9$ and $\beta_2 = 0.999$, Adam balances momentum and per-parameter scaling.

Implementation:

def adam(gradient, theta, eta, beta1, beta2, epsilon, steps):


m, s = 0, 0
for t in range(1, steps + 1):
grad = gradient(theta)
m = beta1 * m + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
s_hat = s / (1 - beta2**t)
theta -= eta / (np.sqrt(s_hat) + epsilon) * m_hat
return theta

Regularized Optimization Objective

$$J_{\text{reg}}(\theta) = J(\theta) + \lambda R(\theta)$$

Explanation: Regularization penalizes model complexity to prevent overfitting. Common regularizers include L1 (lasso) and L2 (ridge) norms.

Example: For $R(\theta) = \|\theta\|_2^2$, $J_{\text{reg}}(\theta) = J(\theta) + \lambda \|\theta\|_2^2$.

Implementation:

def regularized_objective(loss, theta, reg, lam):


return loss(theta) + lam * reg(theta)

Learning Rate Decay

$$\eta_t = \frac{\eta_0}{1 + \gamma t}$$

Explanation: Learning rate decay gradually reduces the learning rate to improve convergence stability as training progresses.

Example: For $\eta_0 = 0.1$ and $\gamma = 0.01$, at step $t = 10$, $\eta_t = 0.1 / (1 + 0.01 \cdot 10) \approx 0.0909$.

Implementation:

def learning_rate_decay(eta0, gamma, t):


return eta0 / (1 + gamma * t)

Gradient Clipping

$$g = \operatorname{clip}(g, -\tau, \tau)$$

Explanation: Gradient clipping limits the gradient magnitude to prevent exploding gradients in deep neural networks.

Example: For $\tau = 1.0$, clip gradients to the range $[-1, 1]$.

Implementation:

def gradient_clipping(grad, tau):


return np.clip(grad, -tau, tau)

Minibatch Gradient Descent

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J_{B_t}(\theta^{(t)})$$

Explanation: Minibatch gradient descent computes updates using small random subsets of data, balancing SGD's noise and batch gradient descent's stability.

Example: Use minibatch size $B = 32$ to compute updates on smaller subsets of data.

Implementation:

def minibatch_gradient_descent(gradient, theta, eta, data, batch_size, steps):


for _ in range(steps):
batch = np.random.choice(data, batch_size, replace=False)
theta -= eta * gradient(theta, batch)
return theta

Coordinate Descent

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \eta \frac{\partial J(\theta)}{\partial \theta_j}$$

Explanation: Coordinate descent optimizes a single parameter at a time, cycling through all parameters until convergence. It is effective for high-dimensional problems.

Example: Minimize $J(\theta_1, \theta_2) = (\theta_1 - 1)^2 + (\theta_2 - 2)^2$ by alternately updating $\theta_1$ and $\theta_2$.

Implementation:

def coordinate_descent(gradient, theta, eta, steps):


for _ in range(steps):
for j in range(len(theta)):
theta[j] -= eta * gradient(theta, j)
return theta

Elastic Net Regularization

$$J_{\text{reg}}(\theta) = J(\theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$

Explanation: Elastic Net combines L1 and L2 regularization to handle sparsity and multicollinearity. It is commonly used in regression tasks.

Example: For $\lambda_1 = 0.1$, $\lambda_2 = 0.2$, and $J(\theta) = \|\theta - y\|_2^2$, compute the regularized objective.

Implementation:

def elastic_net_objective(loss, theta, lam1, lam2):


return loss(theta) + lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta**2)

Adagrad Optimization

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{G^{(t)}} + \epsilon} \nabla J(\theta^{(t)}), \qquad G^{(t)} = \sum_{i=1}^{t} [\nabla J(\theta^{(i)})]^2$$

Explanation: Adagrad adapts the learning rate for each parameter based on the history of gradients, improving performance on sparse data.

Example: For $\eta = 0.1$, adaptively scale updates for different features.

Implementation:

def adagrad(gradient, theta, eta, epsilon, steps):


G = 0
for _ in range(steps):
grad = gradient(theta)
G += grad**2
theta -= eta / (np.sqrt(G) + epsilon) * grad
return theta

AdamW Optimization

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{\hat{s}} + \epsilon} \hat{m} - \lambda \theta^{(t)}$$

Explanation: AdamW modifies Adam by decoupling weight decay from the gradient updates, improving regularization and generalization in ML models.

Example: For $\lambda = 0.01$, regularize weights alongside adaptive learning rates.

Implementation:

def adamw(gradient, theta, eta, beta1, beta2, lam, epsilon, steps):


m, s = 0, 0
for t in range(1, steps + 1):
grad = gradient(theta)
m = beta1 * m + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
s_hat = s / (1 - beta2**t)
theta -= eta / (np.sqrt(s_hat) + epsilon) * m_hat + lam * theta
return theta

Momentum "Heavy Ball" Method

$$\theta^{(t+1)} = \theta^{(t)} + \beta(\theta^{(t)} - \theta^{(t-1)}) - \eta \nabla J(\theta^{(t)})$$

Explanation: This variant of momentum includes an inertial term to improve convergence speed for strongly convex problems.

Example: For $\beta = 0.9$, the "heavy ball" accelerates gradient descent.

Implementation:

def heavy_ball(gradient, theta, eta, beta, steps):


prev_theta = theta.copy()
v = 0
for _ in range(steps):
grad = gradient(theta)
v = beta * (theta - prev_theta) - eta * grad
prev_theta = theta.copy()
theta += v
return theta

Projection / Projected Gradient Descent

$$\theta^{(t+1)} = \operatorname{Proj}_C\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)})\right)$$

Explanation: Projected gradient descent ensures that updates remain within a feasible set $C$, often used for constrained optimization.

Example: For $C = \{\theta : \|\theta\|_2 \leq 1\}$, project $\theta$ onto the unit ball after each step.

Implementation:

def projected_gradient_descent(gradient, theta, eta, projection, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
theta = projection(theta)
return theta

Newton's Method

$$\theta^{(t+1)} = \theta^{(t)} - [H(\theta^{(t)})]^{-1} \nabla J(\theta^{(t)})$$

Explanation: Newton's method uses second-order information via the Hessian to improve convergence, especially for quadratic cost functions.

Example: For $J(\theta) = \theta^2$, the update uses $H = 2$.

Implementation:

def newtons_method(gradient, hessian, theta, steps):


for _ in range(steps):
grad = gradient(theta)
hess = hessian(theta)
theta -= np.linalg.inv(hess).dot(grad)
return theta

Proximal Gradient Method

$$\theta^{(t+1)} = \operatorname{prox}_{\lambda R}\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)})\right)$$

Explanation: The proximal gradient method generalizes gradient descent to handle nonsmooth regularization terms such as the L1 norm.

Example: For $R(\theta) = \|\theta\|_1$, compute soft thresholding for each parameter.

Implementation:

def proximal_gradient(gradient, theta, eta, prox, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
theta = prox(theta)
return theta

Proximal Gradient with L1 (ISTA)

$$\theta^{(t+1)} = \operatorname{soft}\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)}),\ \lambda \eta\right)$$

Explanation: The Iterative Shrinkage-Thresholding Algorithm (ISTA) applies soft thresholding to update parameters for sparse optimization.

Example: For $J(\theta) = \|\theta - y\|_2^2 + \lambda\|\theta\|_1$, apply shrinkage to each $\theta_i$.

Implementation:

def ista(gradient, theta, eta, lam, steps):


def soft_threshold(x, lam):
return np.sign(x) * max(0, abs(x) - lam)
for _ in range(steps):
theta -= eta * gradient(theta)
theta = np.vectorize(soft_threshold)(theta, lam * eta)
return theta

Penalty Method

$$J_{\text{penalty}}(\theta) = J(\theta) + \frac{1}{\mu} h(\theta)^2$$

Explanation: The penalty method solves constrained optimization problems by penalizing constraint violations in the objective function.

Example: For $h(\theta) = \|\theta\|_2^2 - 1$, penalize deviations from the unit ball constraint.

Implementation:

def penalty_method(loss, theta, penalty, mu):


return loss(theta) + penalty(theta)**2 / mu

Augmented Lagrangian Method

$$\mathcal{L}(\theta, \lambda, \mu) = J(\theta) + \lambda h(\theta) + \frac{\mu}{2} h(\theta)^2$$

Explanation: The augmented Lagrangian method combines Lagrangian and penalty approaches to solve constrained optimization problems. It alternates between updating parameters and Lagrange multipliers.

Example: For $J(\theta) = \|\theta\|_2^2$ and $h(\theta) = \|\theta\|_1 - 1$, compute updates for $\theta$, $\lambda$, and $\mu$.

Implementation:

def augmented_lagrangian(grad_loss, h, grad_h, theta, lam, mu, eta, steps):
    # Gradient step on the augmented Lagrangian, then a multiplier update.
    # grad_loss and grad_h supply the gradients of J and h as callables.
    for _ in range(steps):
        grad_L = grad_loss(theta) + (lam + mu * h(theta)) * grad_h(theta)
        theta -= eta * grad_L
        lam += mu * h(theta)
    return theta, lam

Dual Ascent Method

$$\lambda^{(t+1)} = \lambda^{(t)} + \eta\, h(\theta^{(t)})$$

Explanation: The dual ascent method optimizes the dual problem of constrained optimization by updating the Lagrange multipliers iteratively.

Example: For $h(\theta) = \|\theta\|_1 - 1$, update $\lambda$ based on the constraint violation.

Implementation:

def dual_ascent(grad_loss, h, grad_h, theta, lam, eta, steps):
    # Gradient step on the Lagrangian in theta, then an ascent step on lambda.
    for _ in range(steps):
        theta -= eta * (grad_loss(theta) + lam * grad_h(theta))
        lam += eta * h(theta)
    return theta, lam

Trust Region Method

$$\theta^{(t+1)} = \arg\min_{\Delta} \left\{ J(\theta) + \nabla J(\theta)^T \Delta + \frac{1}{2} \Delta^T H \Delta \ \middle|\ \|\Delta\| \leq \Delta_{\max} \right\}$$

Explanation: The trust region method restricts the step size to a region where the quadratic approximation of the cost function is valid, ensuring stability.

Example: For $J(\theta) = \|\theta - y\|_2^2$, compute steps $\Delta$ constrained by $\|\Delta\| \leq \Delta_{\max}$.

Implementation:

def trust_region(loss, gradient, hessian, theta, delta_max, steps):


for _ in range(steps):
grad = gradient(theta)
hess = hessian(theta)
delta = np.linalg.solve(hess, -grad)
if np.linalg.norm(delta) > delta_max:
delta *= delta_max / np.linalg.norm(delta)
theta += delta
return theta

Barrier Method

$$J_{\text{barrier}}(\theta) = J(\theta) - \frac{1}{\mu} \sum_{i=1}^{m} \ln(-h_i(\theta))$$

Explanation: The barrier method solves constrained optimization by penalizing constraint violations with a logarithmic barrier, keeping updates within the feasible region.

Example: For $h(\theta) = \|\theta\|_1 - 1$, use $-\ln(1 - \|\theta\|_1)$ as the barrier term.

Implementation:

def barrier_method(grad_loss, h, grad_h, theta, mu, eta, steps):
    # Gradient step on the log-barrier objective; requires h(theta) < 0.
    # grad_loss and grad_h supply the gradients of J and h as callables.
    for _ in range(steps):
        grad_barrier = -grad_h(theta) / h(theta)  # gradient of -ln(-h(theta))
        theta -= eta * (grad_loss(theta) + (1 / mu) * grad_barrier)
        mu *= 0.9
    return theta

Simulated Annealing

$$P(\Delta E) = \exp\left(-\frac{\Delta E}{T}\right)$$

Explanation: Simulated annealing is a probabilistic optimization algorithm inspired by annealing in metallurgy. It explores the solution space by accepting worse solutions probabilistically to escape local minima.

Example: Minimize $J(\theta) = \theta^2$ with an initial temperature $T = 1$, gradually cooling down.

Implementation:

import numpy as np
def simulated_annealing(loss, theta, T, cooling_rate, steps):
for _ in range(steps):
new_theta = theta + np.random.uniform(-1, 1, size=theta.shape)
delta_E = loss(new_theta) - loss(theta)
if delta_E < 0 or np.exp(-delta_E / T) > np.random.rand():
theta = new_theta
T *= cooling_rate
return theta

SECTION 5 : REGRESSION

Linear Regression Hypothesis

$$\hat{y} = X\beta + \epsilon$$

Explanation: The hypothesis for linear regression assumes that the target variable $y$ is a linear combination of features $X$, coefficients $\beta$, and an error term $\epsilon$.

Example: For $y = 2x_1 + 3x_2 + \epsilon$, predict $y$ as a linear function of $x_1$ and $x_2$.

Implementation:

import numpy as np
X = np.array([[1, 2], [3, 4]])
beta = np.array([2, 3])
y_pred = X @ beta

Ordinary Least Squares (OLS)

$$\beta = (X^T X)^{-1} X^T y$$

Explanation: OLS finds the coefficient vector $\beta$ that minimizes the sum of squared residuals between predicted and actual values.

Example: For $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $y = (5, 11)^T$, compute $\beta$.

Implementation:

beta = np.linalg.inv(X.T @ X) @ X.T @ y
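
A sketch with the example values above; since $X$ here is square and invertible, the normal equations recover the exact solution:

import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([5, 11])
beta = np.linalg.inv(X.T @ X) @ X.T @ y  # approximately [1, 2]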

Mean Squared Error (MSE)

$$\operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Explanation: MSE quantifies the average squared difference between actual and predicted values. It is a standard loss function in regression.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute the MSE.

Implementation:

mse = np.mean((y - y_pred)**2)

Gradient of the MSE Loss

$$\frac{\partial}{\partial \beta} \operatorname{MSE} = -\frac{2}{n} X^T (y - X\beta)$$

Explanation: The gradient of MSE with respect to $\beta$ is used in gradient-based optimization algorithms like gradient descent.

Example: Compute the gradient for $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $y = (5, 11)^T$, and $\beta = (1, 1)^T$.

Implementation:

grad = -2 / len(y) * X.T @ (y - X @ beta)

Coefficient of Determination (R²)

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Explanation: R² measures the proportion of variance in the target variable explained by the model. A value close to 1 indicates a good fit.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute $R^2$.

Implementation:

r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2)

Adjusted R²

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Explanation: Adjusted R² accounts for the number of predictors $p$ in the model, penalizing overfitting.

Example: For $R^2 = 0.9$, $n = 100$, and $p = 5$, compute $\bar{R}^2$.

Implementation:

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

Mean Absolute Error (MAE)

$$\operatorname{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Explanation: MAE measures the average magnitude of prediction errors. It is less sensitive to outliers compared to MSE.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute the MAE.

Implementation:

mae = np.mean(np.abs(y - y_pred))

Weighted Least Squares (WLS)

$$\beta = (X^T W X)^{-1} X^T W y$$

Explanation: WLS minimizes the sum of weighted residuals, allowing for heteroscedasticity in the data.

Example: For $W = \operatorname{diag}([1, 2])$, compute $\beta$.

Implementation:

W = np.diag([1, 2])
beta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y

Polynomial Regression Hypothesis

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n$$

Explanation: Polynomial regression models the relationship between $x$ and $y$ as a polynomial. It generalizes linear regression to non-linear patterns.

Example: Fit $y = 2x + x^2$.

Implementation:

from numpy.polynomial.polynomial import Polynomial


poly = Polynomial.fit(X, y, deg=2)
y_pred = poly(X)

Non-Linear Regression

$$\hat{y} = f(X, \beta) + \epsilon$$

Explanation: Non-linear regression models relationships where the target variable is a non-linear function of the parameters.

Example: Fit $y = a e^{bx}$ using optimization.

Implementation:

from scipy.optimize import curve_fit


def model(X, a, b):
return a * np.exp(b * X)
params, _ = curve_fit(model, X, y)

Maximum Likelihood Estimation for Regression

$$\hat{\beta} = \arg\max_{\beta} \prod_{i=1}^{n} p(y_i \mid X_i, \beta)$$

Explanation: MLE estimates the parameters that maximize the likelihood of observing the data under a probabilistic model.

Example: Estimate $\beta$ assuming Gaussian noise.

Implementation:

from scipy.optimize import minimize


def neg_log_likelihood(beta, X, y):
residuals = y - X @ beta
return np.sum(residuals**2)
beta = minimize(neg_log_likelihood, np.zeros(X.shape[1]), args=(X, y)).x

Empirical Risk Minimization

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(X_i, \theta))$$

Explanation: ERM minimizes the average loss over the training data to estimate the model parameters.

Example: Minimize the MSE loss for linear regression.

Implementation:

def empirical_risk(theta, X, y, loss):


return np.mean([loss(y[i], np.dot(X[i], theta)) for i in range(len(y))])

Logistic Regression Hypothesis

$$\hat{y} = \sigma(X\beta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Explanation: Logistic regression predicts probabilities for binary classification using the sigmoid function applied to a linear combination of inputs.

Example: For $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $\beta = (1, -1)^T$, compute $\hat{y}$.

Implementation:

def sigmoid(z):
return 1 / (1 + np.exp(-z))
y_pred = sigmoid(X @ beta)

Binary Cross-Entropy Loss

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$

Explanation: Binary cross-entropy measures the dissimilarity between predicted probabilities and true labels in binary classification.

Example: For $y = [1, 0]$ and $\hat{y} = [0.9, 0.1]$, compute the loss.

Implementation:

loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

Cross-Entropy Loss (Multi-Class)

$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij})$$

Explanation: Cross-entropy loss generalizes to multi-class classification, comparing one-hot-encoded true labels with predicted probabilities.

Example: For $y = [1, 0, 0]$ and $\hat{y} = [0.8, 0.1, 0.1]$, compute the loss.

Implementation:

loss = -np.mean(np.sum(y * np.log(y_pred), axis=1))

Hinge Loss for SVM

$$L = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i)$$

Explanation: Hinge loss penalizes predictions that are not at least 1 margin away from the correct classification in support vector machines (SVMs).

Example: For $y = [1, -1]$ and $\hat{y} = [0.8, -0.5]$, compute the loss.

Implementation:

loss = np.mean(np.maximum(0, 1 - y * y_pred))

Lasso Regression Objective

$$L = \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Explanation: Lasso regression adds an L1 regularization term to the least squares loss, promoting sparsity in the coefficients.

Example: For $\lambda = 0.1$, add $\|\beta\|_1$ as a penalty.

Implementation:

loss = 0.5 * np.mean((y - X @ beta)**2) + lam * np.sum(np.abs(beta))

Ridge Regression Objective

$$L = \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

Explanation: Ridge regression adds an L2 regularization term to reduce overfitting by shrinking coefficients.

Example: For $\lambda = 0.1$, compute the loss with L2 regularization.

Implementation:

loss = 0.5 * np.mean((y - X @ beta)**2) + lam * np.sum(beta**2)

Negative Binomial Regression

$$\hat{y} = \frac{\Gamma(y + \alpha)}{\Gamma(y + 1)\Gamma(\alpha)} \left( \frac{\alpha}{\alpha + \hat{\mu}} \right)^{\alpha} \left( \frac{\hat{\mu}}{\alpha + \hat{\mu}} \right)^{y}$$

Explanation: Negative binomial regression models count data with overdispersion using a generalized linear model.

Example: Fit a model for overdispersed count data.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.NegativeBinomial())
results = model.fit()

Poisson Regression Model

$$\hat{\mu} = e^{X\beta}$$

Explanation: Poisson regression models count data using a log link function, assuming the target variable follows a Poisson distribution.

Example: Predict event counts given feature data.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Poisson())
results = model.fit()

Gamma Regression Objective

$$L = \frac{1}{\phi} \sum_{i=1}^{n} \left( -\log(\hat{\mu}_i) + \frac{y_i}{\hat{\mu}_i} \right)$$

Explanation: Gamma regression models positive continuous data with a Gamma distribution, often for skewed datasets.

Example: Predict insurance claim amounts.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Gamma())
results = model.fit()

Probit Regression Model

$$P(y = 1) = \Phi(X\beta)$$

Explanation: Probit regression models binary classification using the cumulative normal distribution function $\Phi$.

Example: Predict binary outcomes using a probit link.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Binomial(link=families.links.probit()))
results = model.fit()

Multinomial Logistic Regression

$$P(y = k) = \frac{e^{X\beta_k}}{\sum_{j=1}^{K} e^{X\beta_j}}$$

Explanation: Multinomial logistic regression generalizes logistic regression for multi-class classification tasks.

Example: Classify samples into one of $K = 3$ classes.

Implementation:

from sklearn.linear_model import LogisticRegression


model = LogisticRegression(multi_class='multinomial')
model.fit(X, y)

Quantile Regression Loss

$$L = \sum_{i=1}^{n} \rho_\tau(y_i - \hat{y}_i), \qquad \rho_\tau(e) = \max(\tau e, (\tau - 1)e)$$

Explanation: Quantile regression minimizes the weighted sum of residuals, modeling conditional quantiles of the target variable.

Example: Estimate the 90th percentile of target values.

Implementation:

from statsmodels.api import QuantReg


model = QuantReg(y, X)
results = model.fit(q=0.9)

Huber Loss

$$L = \sum_{i=1}^{n} \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \leq \delta \\ \delta |y_i - \hat{y}_i| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$

Explanation: Huber loss combines MSE and MAE, being quadratic for small errors and linear for large errors, making it robust to outliers.

Example: Fit a regression model robust to outliers with $\delta = 1$.

Implementation:

def huber_loss(y, y_pred, delta):


diff = np.abs(y - y_pred)
return np.where(diff <= delta, 0.5 * diff**2, delta * diff - 0.5 * delta**2)

SECTION 6 : NEURAL NETWORKS

Perceptron Update Rule

$$w^{(t+1)} = w^{(t)} + \eta (y - \hat{y}) x$$

Explanation: The perceptron update rule adjusts weights based on prediction errors. It is used for binary classification on linearly separable data.

Example: For $x = [1, 2]$, $y = 1$, $\hat{y} = 0$, and $\eta = 0.1$, update $w$.

Implementation:

w += eta * (y - y_pred) * x

Forward Propagation (Single Layer)

$$\hat{y} = \sigma(Xw + b)$$

Explanation: Forward propagation computes predictions by applying a weight matrix and activation function to input features.

Example: For $X = [1, 2]$, $w = [0.5, 0.5]$, and $b = 0$, compute $\hat{y}$.
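
Implementation (a minimal sketch with the example values; the sigmoid is written inline rather than reused from a later entry):

import numpy as np
X = np.array([1, 2])
w = np.array([0.5, 0.5])
b = 0
y_pred = 1 / (1 + np.exp(-(X @ w + b)))  # sigma(1.5), roughly 0.82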

Sigmoid Activation

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Explanation: The sigmoid activation maps inputs to $[0, 1]$, commonly used for binary classification.

Example: For $z = 0.5$, compute $\sigma(0.5)$.

Implementation:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
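For the example above, sigmoid(0.5) ≈ 0.622.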

126
Tanh Activation

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

Explanation: Tanh activation maps inputs to [−1, 1] and is useful for


symmetric data.

Example: For z = 0.5, compute tanh(0.5).

Implementation:

def tanh(z):
    return np.tanh(z)

127
ReLU Activation

ReLU(z) = max(0, z)

Explanation: ReLU introduces non-linearity by zeroing negative val-


ues, often used in deep networks.

Example: For z = −1, compute ReLU(−1).

Implementation:

def relu(z):
    return np.maximum(0, z)

128
Heaviside Step Activation


H(z) = { 1,  z ≥ 0;   0,  z < 0 }

Explanation: The Heaviside step function outputs binary values for


classification tasks.

Example: For z = −1, compute H(−1).

Implementation:

def heaviside(z):
    return np.where(z >= 0, 1, 0)

129
Leaky ReLU Activation


Leaky ReLU(z) = { z,  z ≥ 0;   αz,  z < 0 }

Explanation: Leaky ReLU allows small gradients for negative inputs,


mitigating dead neurons.

Example: For z = −1 and α = 0.01, compute Leaky ReLU(−1).

Implementation:

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

130
ELU Activation (Exponential Linear Unit)


ELU(z) = { z,  z ≥ 0;   α(e^z − 1),  z < 0 }

Explanation: ELU smooths ReLU by providing exponential outputs


for negative inputs, improving gradient flow.

Example: For z = −1 and α = 1, compute ELU(−1).

Implementation:

def elu(z, alpha=1):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

131
Softmax Function

Softmax(z)_i = e^(z_i) / Σ_{j=1}^{n} e^(z_j)

Explanation: Softmax normalizes a vector into a probability distribu-


tion over n classes.

Example: For z = [1, 2, 3], compute Softmax(z).

Implementation:

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum(axis=0)
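For the example above, softmax(np.array([1, 2, 3])) ≈ [0.090, 0.245, 0.665].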

132
Loss Function for Multi-Class (Cross-Entropy)

L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log(ŷ_{ij})

Explanation: Cross-entropy loss measures the dissimilarity between


predicted probabilities and true labels in multi-class classification.

Example: For y = [1, 0, 0] and ŷ = [0.8, 0.1, 0.1], compute the loss.

Implementation:

loss = -np.mean(np.sum(y * np.log(y_pred), axis=1))

133
Gradient Descent for Neural Networks

θ^(t+1) = θ^(t) − η ∂L/∂θ

Explanation: Gradient descent updates the network’s weights by min-


imizing the loss function using gradients.

Example: Update θ for L = (y − ŷ)2 .
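Implementation (a minimal sketch of one update step for a scalar parameter, assuming the hypothetical model ŷ = θx):

theta, x, y, eta = 0.0, 2.0, 1.0, 0.1
y_pred = theta * x
grad = -2 * (y - y_pred) * x  # dL/dtheta for L = (y - y_pred)**2
theta -= eta * grad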

134
Backpropagation (Gradient for Weights)

∂L/∂w_{ij} = δ_j a_i,   δ_j = (∂L/∂z_j) σ′(z_j)

Explanation: Backpropagation computes the gradient of the loss func-


tion with respect to the weights in a neural network using the chain rule.

Example: Compute gradients for a single-layer neural network.

Implementation:

delta = (y_pred - y) * sigmoid_prime(z)


grad_w = np.outer(delta, a)

135
Mean Squared Error Loss

L = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

Explanation: Mean squared error measures the average squared differ-


ence between predictions and actual values, commonly used in regression.

Example: For y = [1, 2] and ŷ = [1.1, 1.8], compute the loss.

Implementation:

loss = np.mean((y - y_pred)**2)

136
Binary Cross-Entropy Loss

L = −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

Explanation: Binary cross-entropy measures the difference between


predicted probabilities and true binary labels.

Example: For y = [1, 0] and ŷ = [0.9, 0.1], compute the loss.

Implementation:

loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

137
Batch Normalization

x̂ = (x − µ) / √(σ² + ϵ),   y = γx̂ + β

Explanation: Batch normalization normalizes inputs to a layer, re-


ducing internal covariate shift and accelerating training.

Example: Normalize x = [1, 2, 3] with γ = 1, β = 0.

Implementation:

mean = np.mean(x)
var = np.var(x)
x_norm = (x - mean) / np.sqrt(var + epsilon)
y = gamma * x_norm + beta
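For x = [1, 2, 3] with γ = 1 and β = 0, the normalized output is approximately [−1.225, 0, 1.225].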

138
Dropout Regularization


0,

with probability p
âi =
ai
, otherwise


1−p

Explanation: Dropout randomly sets a fraction p of activations to


zero during training to prevent overfitting.

Example: Apply dropout to activations a = [1, 2, 3] with p = 0.5.

Implementation:

mask = np.random.rand(len(a)) > p


a_dropout = a * mask / (1 - p)

139
Gradient of Sigmoid

σ′(z) = σ(z)(1 − σ(z))

Explanation: The derivative of the sigmoid function is used in back-


propagation to compute gradients efficiently.

Example: For z = 0.5, compute σ ′ (0.5).

Implementation:

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

140
RMSProp for Weight Updates

s^(t+1) = βs^(t) + (1 − β)g²,   w^(t+1) = w^(t) − η·g / (√(s^(t+1)) + ϵ)

Explanation: RMSProp adapts the learning rate for each weight based
on the moving average of squared gradients.

Implementation:

s = beta * s + (1 - beta) * grad**2


w -= eta / (np.sqrt(s) + epsilon) * grad

141
Xavier (Glorot) Initialization

w ∼ U( −√(6/(n_in + n_out)),  √(6/(n_in + n_out)) )

Explanation: Xavier initialization sets weights to maintain variance


across layers, improving convergence in deep networks.

Implementation:

limit = np.sqrt(6 / (n_in + n_out))


w = np.random.uniform(-limit, limit, size=(n_in, n_out))

142
L2 Regularization (Weight Decay)

L = L0 + (λ/2) ∥w∥₂²

Explanation: L2 regularization adds a penalty proportional to the


square of weights to prevent overfitting.
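Implementation (a minimal sketch; loss0, lam, and w are hypothetical placeholders for the base loss, the penalty strength, and the weight vector):

import numpy as np

w = np.array([0.5, -0.3])  # hypothetical weights
loss0 = 0.42               # hypothetical unregularized loss
lam = 0.1
loss = loss0 + (lam / 2) * np.sum(w**2)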

143
Heaviside vs. Hard Sigmoid

Hard Sigmoid(z) = max(0, min(1, 0.2z + 0.5))

Explanation: Heaviside is a binary activation function, while Hard


Sigmoid approximates sigmoid for efficiency.

Implementation:

def hard_sigmoid(z):
    return np.clip(0.2 * z + 0.5, 0, 1)

144
Swish Activation

Swish(z) = z · σ(z)

Explanation: Swish is a smooth, non-monotonic activation function


that often outperforms ReLU in deep networks.

Implementation:

def swish(z):
    return z * sigmoid(z)

145
Maxout Activation

Maxout(z) = max_{i∈[1,k]} z_i

Explanation: Maxout selects the maximum value from k linear func-


tions, enabling learnable piecewise linear activations.

Implementation:

def maxout(z):
    return np.max(z, axis=0)

146
Sparse Categorical Cross-Entropy

L = −(1/n) Σ_{i=1}^{n} log(ŷ_{i, y_i})

Explanation: Sparse categorical cross-entropy simplifies the loss cal-


culation by directly indexing the true class probabilities.

Implementation:

loss = -np.mean(np.log(y_pred[range(len(y)), y]))

147
Cosine Similarity / Cosine Loss

Cosine Similarity = (u · v) / (∥u∥ ∥v∥)

Explanation: Cosine similarity measures the angle between vectors,


commonly used in text and embedding similarity.

Implementation:

cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

148
SECTION 7 : CLUSTERING

Distance Metric (Euclidean)

d(u, v) = √( Σ_{i=1}^{n} (u_i − v_i)² )

Explanation: Euclidean distance measures the straight-line distance


between two points in n-dimensional space. It is widely used in clustering
and nearest-neighbor methods.

Example: For u = [1, 2] and v = [3, 4], d(u, v) = √((3 − 1)² + (4 − 2)²) = √8 ≈ 2.83.

Implementation:

def euclidean_distance(u, v):
    return np.sqrt(np.sum((u - v)**2))

149
Manhattan Distance

d(u, v) = Σ_{i=1}^{n} |u_i − v_i|

Explanation: Manhattan distance measures the sum of absolute differ-


ences between corresponding components, resembling city block distances.

Example: For u = [1, 2] and v = [3, 4], d(u, v) = |3 − 1| + |4 − 2| = 4.

Implementation:

def manhattan_distance(u, v):
    return np.sum(np.abs(u - v))

150
Cosine Similarity

Cosine Similarity = (u · v) / (∥u∥ ∥v∥)

Explanation: Cosine similarity measures the cosine of the angle be-


tween two vectors, capturing orientation rather than magnitude.

Example: For u = [1, 0] and v = [0, 1], similarity is 0.

Implementation:

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

151
Jaccard Similarity (Binary Data)

Jaccard Similarity = |u ∩ v| / |u ∪ v|

Explanation: Jaccard similarity compares the intersection and union


of binary data, commonly used in text and set-based similarity.

Example: For u = [1, 1, 0] and v = [1, 0, 1], similarity is 1/3.

Implementation:

def jaccard_similarity(u, v):
    return np.sum(np.logical_and(u, v)) / np.sum(np.logical_or(u, v))

152
k-Means Objective

J = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: The k-means objective minimizes the sum of squared


distances between data points and their assigned cluster centroids.

Example: For points [1, 2], [3, 4] in cluster C1 with centroid [2, 3],
compute J.

Implementation:

def k_means_objective(X, centroids, labels):
    return np.sum(np.linalg.norm(X - centroids[labels], axis=1)**2)

153
Centroid Update Rule (k-Means)

µ_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j

Explanation: The centroid of each cluster is updated as the mean of


points assigned to it.

Example: For cluster C1 = {[1, 2], [3, 4]}, compute µ1 = [2, 3].

Implementation:

def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

154
Elbow Method for Optimal k

J(k) = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: The elbow method finds the optimal number of clusters


k by identifying the ”elbow” in the plot of J(k) versus k.

Implementation:

from sklearn.cluster import KMeans

def elbow_method(X, max_k):
    distortions = []
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k).fit(X)
        distortions.append(kmeans.inertia_)
    return distortions

155
k-Medoids Objective

J = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} d(x_j, m_i)

Explanation: k-Medoids minimizes the sum of distances between data


points and their cluster medoids, robust to outliers.

Example: Replace centroids with medoids for robust clustering.

Implementation:

def k_medoids_objective(X, medoids, labels):
    return np.sum([np.sum(np.linalg.norm(X[labels == i] - medoids[i], axis=1))
                   for i in range(len(medoids))])

156
Fuzzy c-Means Objective

J = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m ∥x_j − c_i∥²

Explanation: Fuzzy c-means assigns membership values uij to each


data point for each cluster, allowing soft clustering.

Implementation:

def fuzzy_c_means_objective(X, centroids, memberships, m):
    return np.sum(memberships**m *
                  np.linalg.norm(X[:, None] - centroids, axis=2)**2)

157
Silhouette Score

S = (b − a) / max(a, b),   a = intra-cluster distance,  b = nearest-cluster distance

Explanation: Silhouette score evaluates the quality of clustering by


comparing intra-cluster and nearest-cluster distances.

Implementation:

from sklearn.metrics import silhouette_score


score = silhouette_score(X, labels)

158
Hierarchical Clustering Dendrogram

d(C1, C2) = min_{x∈C1, y∈C2} ∥x − y∥

Explanation: A dendrogram visually represents the hierarchical clus-


tering process, showing cluster merges.

Implementation:

from scipy.cluster.hierarchy import dendrogram, linkage


Z = linkage(X, method='ward')
dendrogram(Z)

159
Ward’s Linkage

d(C1, C2) = ( |C1||C2| / (|C1| + |C2|) ) ∥µ1 − µ2∥²

Explanation: Ward’s linkage minimizes the variance increase when


merging clusters, resulting in compact clusters.

Implementation:

from scipy.cluster.hierarchy import linkage


Z = linkage(X, method='ward')

160
Single vs. Complete Linkage

d_single(C1, C2) = min_{x∈C1, y∈C2} ∥x − y∥,   d_complete(C1, C2) = max_{x∈C1, y∈C2} ∥x − y∥

Explanation: Single linkage merges clusters based on the smallest


distance between points, while complete linkage uses the largest distance.
They influence the shape of hierarchical clustering.

Implementation:

from scipy.cluster.hierarchy import linkage


Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

161
Average Linkage

d_average(C1, C2) = ( 1 / (|C1||C2|) ) Σ_{x∈C1} Σ_{y∈C2} ∥x − y∥

Explanation: Average linkage computes the average distance between


all pairs of points in two clusters, balancing the extremes of single and
complete linkage.

Implementation:

Z_average = linkage(X, method='average')

162
Minimum Spanning Tree Criterion

MST weight = Σ_{(u,v)∈E} w(u, v),   w(u, v) = ∥u − v∥

Explanation: The minimum spanning tree (MST) connects all points


with the minimum total edge weight, often used in clustering to detect
dense regions.

Implementation:

from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

mst = minimum_spanning_tree(distance_matrix(X, X))

163
DBSCAN Core Point Condition

|Neighbors(x)| ≥ MinPts, where Neighbors(x) = {y : ∥x − y∥ ≤ ϵ}

Explanation: A core point in DBSCAN must have at least MinPts


neighbors within a distance ϵ.

Implementation:

core_condition = len(neighbors) >= MinPts

164
DBSCAN Density Condition

Density-connected: ∃ a chain of points x1 , x2 , . . . , xn such that ∥xi −xi+1 ∥ ≤ ϵ

Explanation: DBSCAN forms clusters by connecting points that are


density-reachable through chains of neighbors.

Implementation:

from sklearn.cluster import DBSCAN


dbscan = DBSCAN(eps=epsilon, min_samples=MinPts).fit(X)

165
Cohesion Metric

Cohesion = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: Cohesion measures the compactness of clusters, where


smaller values indicate tighter clusters.

Implementation:

cohesion = sum((np.linalg.norm(X[labels == i]
                - centroids[i], axis=1)**2).sum() for i in range(k))

166
Separation Metric

Separation = Σ_{i=1}^{k} Σ_{j=i+1}^{k} ∥µ_i − µ_j∥²

Explanation: Separation measures the distance between cluster cen-


troids, where larger values indicate well-separated clusters.

Implementation:

separation = sum(np.linalg.norm(centroids[i]
- centroids[j])**2 for i in range(k) for j in range(i+1, k))

167
Soft Clustering Membership

u_{ij} = ∥x_j − c_i∥^(−2/(m−1)) / Σ_{k=1}^{c} ∥x_j − c_k∥^(−2/(m−1))

Explanation: Soft clustering assigns membership values uij to each


point for each cluster, indicating the degree of belonging.

Implementation:

inv_dist = distances**(-2 / (m - 1))
memberships = inv_dist / inv_dist.sum(axis=1, keepdims=True)

168
Entropy for Clustering Evaluation

H = − Σ_{i=1}^{k} Σ_{j=1}^{n} P_{ij} log P_{ij}

Explanation: Entropy measures the uncertainty in clustering assign-


ments, where lower values indicate clearer clustering.

Implementation:

entropy = -np.sum(P * np.log(P))

169
Mutual Information for Clustering

I(U, V) = Σ_{i=1}^{|U|} Σ_{j=1}^{|V|} P_{ij} log( P_{ij} / (P_i P_j) )

Explanation: Mutual information measures the shared information


between true and predicted clusters.

Implementation:

from sklearn.metrics import mutual_info_score


mi = mutual_info_score(true_labels, predicted_labels)

170
F-Measure for Clustering

F = (2 · Precision · Recall) / (Precision + Recall)

Explanation: The F-measure evaluates clustering performance by bal-


ancing precision and recall.

Implementation:

from sklearn.metrics import f1_score


f_measure = f1_score(true_labels, predicted_labels, average=’weighted’)

171
Adjusted Rand Index (ARI)

ARI = (Index − Expected Index) / (Max Index − Expected Index)

Explanation: ARI adjusts the Rand Index for chance, measuring clus-
tering similarity.

Implementation:

from sklearn.metrics import adjusted_rand_score


ari = adjusted_rand_score(true_labels, predicted_labels)

172
Normalized Mutual Information (NMI)

NMI = 2·I(U, V) / (H(U) + H(V))

Explanation: NMI normalizes mutual information to compare clus-


tering solutions of different sizes.

Implementation:

from sklearn.metrics import normalized_mutual_info_score


nmi = normalized_mutual_info_score(true_labels, predicted_labels)

173
SECTION 8 : DIMENSIONALITY REDUCTION

Principal Component Analysis (PCA) Objective

Maximize: Var(z) = wᵀSw,  subject to ∥w∥₂ = 1

Explanation: PCA seeks directions (principal components) that max-


imize the variance of projected data while being orthogonal to each other.

Implementation:

from sklearn.decomposition import PCA


pca = PCA(n_components=k).fit(X)

174
Covariance Matrix for PCA

S = (1/(n − 1)) (X − X̄)ᵀ(X − X̄)

Explanation: The covariance matrix captures pairwise feature depen-


dencies and is central to PCA.

Implementation:

mean_X = np.mean(X, axis=0)


cov_matrix = np.cov(X - mean_X, rowvar=False)

175
Eigen Decomposition for PCA

Sw = λw

Explanation: PCA uses eigen decomposition of the covariance matrix


to find eigenvalues (variances) and eigenvectors (principal components).

Implementation:

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

176
SVD (Singular Value Decomposition)

X = UΣVᵀ

Explanation: SVD factorizes a matrix into orthogonal components,


enabling dimensionality reduction by truncating Σ.

Implementation:

U, S, Vt = np.linalg.svd(X, full_matrices=False)

177
Reconstruction Error for PCA

Error = ∥X − X̂∥²_F,   X̂ = ZWᵀ + X̄

Explanation: Reconstruction error quantifies the information loss when


reducing dimensionality with PCA.

Implementation:

X_hat = Z @ W.T + mean_X


reconstruction_error = np.linalg.norm(X - X_hat, 'fro')**2

178
Explained Variance Ratio

Explained Variance Ratio = λ_i / Σ_{j=1}^{n} λ_j

Explanation: The explained variance ratio quantifies the proportion


of variance captured by each principal component.

Implementation:

explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

179
Cumulative Explained Variance

Cumulative Explained Variance = Σ_{i=1}^{k} λ_i / Σ_{j=1}^{n} λ_j

Explanation: Cumulative explained variance evaluates the total vari-


ance captured by the first k principal components.

Implementation:

cumulative_explained_variance = np.cumsum(explained_variance_ratio)

180
Random Projection

X_proj = XR,   R_ij ∼ N(0, 1)

Explanation: Random projection reduces dimensionality by project-


ing data onto a lower-dimensional random matrix while approximately pre-
serving distances.

Implementation:

from sklearn.random_projection import GaussianRandomProjection


rp = GaussianRandomProjection(n_components=k).fit_transform(X)

181
Isomap Distance Matrix

d_ij = Shortest Path Distance on G,   G = (X, ϵ-Neighborhoods)

Explanation: Isomap computes geodesic distances in a graph of near-


est neighbors to preserve non-linear structures in the data.

Implementation:

from sklearn.manifold import Isomap


isomap = Isomap(n_neighbors=k).fit_transform(X)

182
MDS Stress Function

Stress = Σ_{i<j} (d_ij − d̂_ij)²

Explanation: The stress function measures the discrepancy between


original and embedded distances in Multidimensional Scaling (MDS).

Implementation:

from sklearn.manifold import MDS


mds = MDS(n_components=2).fit_transform(X)

183
Multidimensional Scaling (MDS)

X_MDS = argmin_Y Stress(Y)

Explanation: MDS embeds data into a lower-dimensional space while


preserving pairwise distances as much as possible.

Implementation:

from sklearn.manifold import MDS


mds = MDS(n_components=k).fit_transform(X)

184
NMF (Non-Negative Matrix Factorization)

X ≈ WH, W ≥ 0, H ≥ 0

Explanation: NMF factorizes a non-negative matrix into two lower-


rank non-negative matrices, often used in topic modeling and image pro-
cessing.

Implementation:

from sklearn.decomposition import NMF


nmf = NMF(n_components=k).fit_transform(X)

185
ICA (Independent Component Analysis) Objective

Maximize: Σ_{i=1}^{n} log p(s_i),   where s = WX

Explanation: ICA separates mixed signals into statistically indepen-


dent components by maximizing non-Gaussianity.

Implementation:

from sklearn.decomposition import FastICA


ica = FastICA(n_components=k).fit_transform(X)

186
Factor Analysis Model

X = ZΛ + ϵ, ϵ ∼ N (0, Ψ)

Explanation: Factor analysis models observed variables as linear com-


binations of latent factors plus noise.

Implementation:

from sklearn.decomposition import FactorAnalysis


fa = FactorAnalysis(n_components=k).fit_transform(X)

187
Kernel PCA Transformation

K = ϕ(X)ϕ(X)ᵀ,   Eigen Decomposition: Kα = λα

Explanation: Kernel PCA applies PCA in a high-dimensional feature


space defined by a kernel function.

Implementation:

from sklearn.decomposition import KernelPCA


kpca = KernelPCA(kernel='rbf', n_components=k).fit_transform(X)

188
LDA (Fisher’s Criterion)

J(w) = (wᵀ S_B w) / (wᵀ S_W w)

Explanation: LDA finds a projection that maximizes class separation


by optimizing the ratio of between-class to within-class variance.

Implementation:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


lda = LinearDiscriminantAnalysis(n_components=k).fit_transform(X, y)

189
Robust PCA (RPCA)

min ∥L∥_* + λ∥S∥₁,   subject to X = L + S

Explanation: RPCA decomposes a matrix into a low-rank component


(L) and a sparse component (S).

Implementation:

from r_pca import R_pca  # third-party package implementing RPCA

rpca = R_pca(X)
L, S = rpca.fit()

190
Hessian LLE

Minimize: ∥WX − X∥₂²,   subject to local Hessian alignment

Explanation: Hessian LLE preserves local geometric structures while


optimizing a low-dimensional embedding.

Implementation:

from sklearn.manifold import LocallyLinearEmbedding


hessian_lle = LocallyLinearEmbedding(n_neighbors=k,
    method='hessian').fit_transform(X)

191
Laplacian Eigenmaps Objective

Minimize: Σ_{i,j} w_{ij} ∥y_i − y_j∥²,   W = Graph Weights

Explanation: Laplacian Eigenmaps embeds data while preserving lo-


cal neighborhood information based on a graph structure.

Implementation:

from sklearn.manifold import SpectralEmbedding


laplacian = SpectralEmbedding(n_components=k).fit_transform(X)

192
Autoencoder Reconstruction

X̂ = Decoder(Encoder(X))

Explanation: Autoencoders minimize reconstruction error by com-


pressing data into a latent representation and reconstructing it.

Implementation:

from keras.models import Model

# encoder and decoder are assumed to be pre-built Keras models
encoded = encoder(X)
decoded = decoder(encoded)

193
Autoencoder Latent Representation

Z = Encoder(X)

Explanation: The latent representation (Z) compresses input data


into a lower-dimensional space for downstream tasks.

Implementation:

latent_representation = encoder.predict(X)

194
Sparse PCA Objective

Maximize: ∥XW∥₂²,   subject to sparsity constraints on W

Explanation: Sparse PCA introduces sparsity in the principal compo-


nents to improve interpretability.

Implementation:

from sklearn.decomposition import SparsePCA


spca = SparsePCA(n_components=k).fit_transform(X)

195
t-SNE Objective

Minimize: KL(P ∥ Q) = Σ_{i≠j} P_{ij} log(P_{ij} / Q_{ij})

Explanation: t-SNE minimizes the Kullback-Leibler divergence be-


tween high-dimensional and low-dimensional distributions.

Implementation:

from sklearn.manifold import TSNE


tsne = TSNE(n_components=k).fit_transform(X)

196
Gradient of t-SNE

∂KL/∂y_i = 4 Σ_j (P_{ij} − Q_{ij})(y_i − y_j) Q_{ij}

Explanation: The gradient of the t-SNE objective updates low-dimensional


embeddings to align distributions.
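Implementation (a minimal sketch of the gradient exactly as written above; P, Q, and Y are assumed to be precomputed affinity matrices and the current embedding):

import numpy as np

def tsne_gradient(P, Q, Y):
    grad = np.zeros_like(Y)
    for i in range(len(Y)):
        weights = (P[i] - Q[i]) * Q[i]  # (P_ij - Q_ij) * Q_ij
        grad[i] = 4 * np.sum(weights[:, None] * (Y[i] - Y), axis=0)
    return grad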

197
UMAP (Uniform Manifold Approximation and Projection)

Optimize: Σ_{i,j} w_{ij} ∥y_i − y_j∥² − λ Σ_{k,l} w_{kl} log(∥y_k − y_l∥)

Explanation: UMAP preserves local and global structures by opti-


mizing a balance between distances and densities.

Implementation:

import umap
umap_embedding = umap.UMAP(n_components=k).fit_transform(X)

198
SECTION 9 : PROBABILITY DISTRIBUTIONS

Bernoulli Distribution

P(X = x) = p^x (1 − p)^(1−x),   x ∈ {0, 1},  0 ≤ p ≤ 1

Explanation: The Bernoulli distribution models a single binary event,


with success probability p.

Example: For p = 0.7, P (X = 1) = 0.7, P (X = 0) = 0.3.

Implementation:

from scipy.stats import bernoulli


prob = bernoulli.pmf(k=1, p=0.7)

199
Binomial Distribution


P(X = k) = C(n, k) p^k (1 − p)^(n−k),   k ∈ {0, 1, . . . , n}

Explanation: The Binomial distribution models the number of suc-


cesses in n independent Bernoulli trials.

Example: For n = 5 and p = 0.5, P(X = 3) = C(5, 3) (0.5)³ (0.5)² = 0.3125.

Implementation:

from scipy.stats import binom


prob = binom.pmf(k=3, n=5, p=0.5)

200
Poisson Distribution

P(X = k) = λ^k e^(−λ) / k!,   k ∈ {0, 1, 2, . . .}

Explanation: The Poisson distribution models the number of events


in a fixed interval, with a mean rate λ.

Example: For λ = 3, P(X = 2) = 3² e^(−3) / 2! ≈ 0.224.

Implementation:

from scipy.stats import poisson


prob = poisson.pmf(k=2, mu=3)

201
Uniform Distribution (Continuous)

f(x) = 1/(b − a),   x ∈ [a, b]

Explanation: The continuous uniform distribution assigns equal prob-


ability density to all points in [a, b].

Example: For a = 0, b = 2, f(1) = 1/2.

Implementation:

from scipy.stats import uniform


prob = uniform.pdf(x=1, loc=0, scale=2)

202
Discrete Uniform Distribution

P(X = x) = 1/n,   x ∈ {1, 2, . . . , n}

Explanation: The discrete uniform distribution assigns equal proba-


bility to n discrete outcomes.

Example: For n = 6, P(X = 3) = 1/6.

Implementation:

from scipy.stats import randint


prob = randint.pmf(k=3, low=1, high=7)

203
Normal (Gaussian) Distribution

f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

Explanation: The normal distribution models data with a symmetric


bell shape, defined by mean µ and standard deviation σ.

Example: For µ = 0, σ = 1, f(0) = 1/√(2π) ≈ 0.399.


Implementation:

from scipy.stats import norm


prob = norm.pdf(x=0, loc=0, scale=1)

204
Exponential Distribution

f(x) = λe^(−λx),   x ≥ 0

Explanation: The exponential distribution models the time between


events in a Poisson process.

Example: For λ = 2, f(1) = 2e^(−2) ≈ 0.271.

Implementation:

from scipy.stats import expon


prob = expon.pdf(x=1, scale=1/2)

205
Geometric Distribution

P(X = k) = (1 − p)^(k−1) p,   k ∈ {1, 2, . . .}

Explanation: The geometric distribution models the number of trials


until the first success in repeated Bernoulli trials.

Example: For p = 0.5, P(X = 3) = (0.5)²(0.5) = 0.125.

Implementation:

from scipy.stats import geom


prob = geom.pmf(k=3, p=0.5)

206
Hypergeometric Distribution

P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n)

Explanation: The hypergeometric distribution models successes in n


draws without replacement from a population of N with K successes.

Example: For N = 20, K = 7, n = 5, P (X = 3).

Implementation:

from scipy.stats import hypergeom

# scipy's convention: M = population size, n = successes in the population,
# N = number of draws
prob = hypergeom.pmf(k=3, M=20, n=7, N=5)

207
Beta Distribution

f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β),   x ∈ [0, 1]

Explanation: The Beta distribution models probabilities as a function


of parameters α and β.

Example: For α = 2, β = 3, compute f (0.5).

Implementation:

from scipy.stats import beta


prob = beta.pdf(x=0.5, a=2, b=3)

208
Gamma Distribution

f(x) = β^α x^(α−1) e^(−βx) / Γ(α),   x > 0

Explanation: The Gamma distribution generalizes the exponential


distribution, often used for waiting times.

Example: For α = 2, β = 1, compute f (1).

Implementation:

from scipy.stats import gamma


prob = gamma.pdf(x=1, a=2, scale=1/1)

209
Multinomial Distribution

P(X1 = k1, . . . , Xk = kk) = ( n! / (k1! · · · kk!) ) p1^(k1) · · · pk^(kk)

Explanation: The multinomial distribution generalizes the binomial


distribution for multiple categories.

Example: For n = 3, p = [0.2, 0.5, 0.3], and k = [1, 1, 1].

Implementation:

from scipy.stats import multinomial


prob = multinomial.pmf(x=[1, 1, 1], n=3, p=[0.2, 0.5, 0.3])

210
Chi-Square Distribution

f(x) = x^(k/2−1) e^(−x/2) / (2^(k/2) Γ(k/2)),   x > 0

Explanation: The chi-square distribution models the sum of squares


of k independent standard normal variables, commonly used in hypothesis
testing.

Example: For k = 3, compute f (2).

Implementation:

from scipy.stats import chi2


prob = chi2.pdf(x=2, df=3)

211
Student’s t-Distribution

f(x) = ( Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) ) · (1 + x²/ν)^(−(ν+1)/2)

Explanation: The Student’s t-distribution is used for estimating pop-


ulation parameters when the sample size is small.

Example: For ν = 5, compute f (1).

Implementation:

from scipy.stats import t


prob = t.pdf(x=1, df=5)

212
F-Distribution

f(x) = √((d1·x/d2)^d1) · (1 + d1·x/d2)^(−(d1+d2)/2) / ( x · B(d1/2, d2/2) ),   x > 0

Explanation: The F-distribution models the ratio of variances and is


commonly used in ANOVA tests.

Implementation:

from scipy.stats import f


prob = f.pdf(x=2, dfn=5, dfd=10)

213
Laplace Distribution

f(x) = (1/(2b)) e^(−|x−µ|/b)
2b

Explanation: The Laplace distribution, also known as the double ex-


ponential distribution, is used for modeling differences in data.

Implementation:

from scipy.stats import laplace


prob = laplace.pdf(x=0, loc=0, scale=1)

214
Rayleigh Distribution

f(x) = (x/σ²) e^(−x²/(2σ²)),   x ≥ 0
σ2

Explanation: The Rayleigh distribution models the magnitude of a


two-dimensional vector with independent normal components.

Implementation:

from scipy.stats import rayleigh


prob = rayleigh.pdf(x=2, scale=1)

215
Triangular Distribution


f(x) = { 2(x − a) / ((b − a)(c − a)),   a ≤ x < c;    2(b − x) / ((b − a)(b − c)),   c ≤ x ≤ b }

Explanation: The triangular distribution models data with a known


minimum, maximum, and mode.

Implementation:

from scipy.stats import triang


prob = triang.pdf(x=0.5, c=0.5, loc=0, scale=1)

216
Log-Normal Distribution

f(x) = (1/(xσ√(2π))) e^(−(ln x − µ)²/(2σ²)),   x > 0

Explanation: The log-normal distribution models data whose loga-


rithm follows a normal distribution.

Implementation:

from scipy.stats import lognorm


prob = lognorm.pdf(x=2, s=1, scale=np.exp(0))

217
Arcsine Distribution

f(x) = 1 / (π√(x(1 − x))),   x ∈ (0, 1)

Explanation: The arcsine distribution models probabilities with end-


points more likely than the middle.

Implementation:

from scipy.stats import arcsine


prob = arcsine.pdf(x=0.5)

218
Beta-Binomial Distribution


P(X = k) = C(n, k) · B(k + α, n − k + β) / B(α, β)

Explanation: The beta-binomial distribution models overdispersed bi-


nomial outcomes using a Beta prior.

Implementation:

from scipy.stats import betabinom


prob = betabinom.pmf(k=2, n=5, a=2, b=3)

219
Cauchy Distribution

f(x) = 1 / ( πγ [1 + ((x − x0)/γ)²] )

Explanation: The Cauchy distribution models data with heavy tails,


often used in robust statistics.

Implementation:

from scipy.stats import cauchy


prob = cauchy.pdf(x=0, loc=0, scale=1)

220
Weibull Distribution

f(x) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k),   x ≥ 0

Explanation: The Weibull distribution is used for reliability analysis


and modeling lifetimes.

Implementation:

from scipy.stats import weibull_min


prob = weibull_min.pdf(x=2, c=1.5, scale=1)

221
Pareto Distribution

f(x) = α·x_m^α / x^(α+1),   x ≥ x_m

Explanation: The Pareto distribution models wealth distribution and


heavy-tailed phenomena.

Implementation:

from scipy.stats import pareto


prob = pareto.pdf(x=2, b=1)

222
Log-Cauchy Distribution

f(x) = 1 / ( xπγ [1 + ((ln x − x0)/γ)²] ),   x > 0

Explanation: The log-Cauchy distribution is the logarithmic trans-


form of the Cauchy distribution, with heavy tails.
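Implementation (a minimal sketch; scipy.stats has no built-in log-Cauchy distribution, so the density above is coded directly):

import numpy as np

def log_cauchy_pdf(x, x0=0.0, gamma=1.0):
    return 1 / (x * np.pi * gamma * (1 + ((np.log(x) - x0) / gamma)**2))

prob = log_cauchy_pdf(2.0)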

223
SECTION 10 : REINFORCEMENT LEARNING

Reward Function

R(s, a) = E[Reward | s, a]

Explanation: The reward function provides the immediate reward


received after taking action a in state s, guiding the agent’s behavior.

Implementation:

def reward_function(state, action):
    # Example reward calculation
    return rewards[state, action]

224
Discounted Return


G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1

Explanation: The discounted return accumulates rewards over time,


weighting future rewards by the discount factor γ.

Implementation:

def discounted_return(rewards, gamma):
    G = 0
    for t, r in enumerate(rewards):
        G += (gamma**t) * r
    return G

225
Bellman Equation (State-Value Function)

V (s) = Eπ [R(s, a) + γV (s′ )]

Explanation: The Bellman equation relates the value of a state to the


expected return from it under a policy π.

Implementation:

def bellman_state_value(s, rewards, transition_prob, gamma, V):
    return np.sum(transition_prob[s] * (rewards[s] + gamma * V))

226
Bellman Equation (Action-Value Function)

Q(s, a) = E[R(s, a) + γV (s′ )]

Explanation: The Bellman equation for the action-value function ex-


presses the value of taking action a in state s and following the policy
afterward.

Implementation:

def bellman_action_value(s, a, rewards, transition_prob, gamma, V):
    return rewards[s, a] + gamma * np.sum(transition_prob[s, a] * V)

227
Temporal Difference (TD) Update

V (st ) ← V (st ) + α [Rt+1 + γV (st+1 ) − V (st )]

Explanation: The TD update improves the value estimate of a state


by using the difference between predicted and actual returns.

Implementation:

def td_update(V, state, reward, next_state, alpha, gamma):
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

228
Monte Carlo Policy Evaluation

V (s) ← E[Gt | st = s]

Explanation: Monte Carlo evaluation updates the value of a state by


averaging returns from multiple episodes starting from that state.

Implementation:

def monte_carlo_evaluation(V, state_returns, state_counts):
    for state, returns in state_returns.items():
        V[state] = np.mean(returns)

229
Policy Improvement

π′(s) = argmax_a Q(s, a)

Explanation: Policy improvement updates the policy by choosing the


action that maximizes the action-value function.

Implementation:

def policy_improvement(Q):
    return np.argmax(Q, axis=1)

230
Q-Learning Update

Q(s_t, a_t) ← Q(s_t, a_t) + α[ R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Explanation: Q-learning is an off-policy algorithm that updates action-


value estimates using the maximum future Q-value.

Implementation:

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                 - Q[state, action])

231
SARSA Update

Q(st , at ) ← Q(st , at ) + α [Rt+1 + γQ(st+1 , at+1 ) − Q(st , at )]

Explanation: SARSA is an on-policy algorithm that updates Q-values


based on the action actually taken under the current policy.

Implementation:

def sarsa_update(Q, state, action, reward,
                 next_state, next_action, alpha, gamma):
    Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                 - Q[state, action])

232
Value Iteration Update

" #
X
′ ′
V (s) ← max R(s, a) + γ P (s | s, a)V (s )
a
s′

Explanation: Value iteration iteratively updates state values by find-


ing the optimal action at each step.

Implementation:

def value_iteration(V, rewards, transition_prob, gamma, num_actions):
    for s in range(len(V)):
        V[s] = max(np.sum(transition_prob[s, a] * (rewards[s, a] + gamma * V))
                   for a in range(num_actions))

233
Actor–Critic Policy Update

θ ← θ + α∇θ log πθ (at | st )δt , δt = Rt+1 + γV (st+1 ) − V (st )

Explanation: The actor updates the policy using the advantage, while
the critic updates the value function to estimate the advantage.

Implementation:

def actor_critic_update(actor, critic, state, action, reward, next_state,
                        alpha, gamma):
    delta = reward + gamma * critic[next_state] - critic[state]
    actor.update(state, action, alpha * delta)
    critic[state] += alpha * delta

234
Deterministic Policy Gradient

∇J(θ) = Es∼ρπ [∇a Q(s, a)∇θ πθ (s)]

Explanation: Deterministic policy gradients update the policy directly


in a continuous action space using gradients of the Q-function.

Implementation:

def deterministic_policy_gradient(policy, q_function, state, alpha):
    action = policy(state)
    grad_q = q_function.gradient(state, action)
    grad_pi = policy.gradient(state)
    policy.update(state, alpha * np.dot(grad_q, grad_pi))

235
Discount Factor (γ)


G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1

Explanation: The discount factor determines the weight given to fu-


ture rewards. A smaller γ prioritizes immediate rewards, while a larger γ
considers longer-term rewards.

Implementation:

def discounted_return(rewards, gamma):
    G = 0
    for t, r in enumerate(rewards):
        G += (gamma**t) * r
    return G

236
Expected SARSA

Q(st , at ) ← Q(st , at ) + α [Rt+1 + γEa′ [Q(st+1 , a′ )] − Q(st , at )]

Explanation: Expected SARSA updates Q-values using the expected


value of the next action, improving stability over standard SARSA.

Implementation:

def expected_sarsa(Q, state, action, reward, next_state, policy, alpha, gamma):
    expected_value = np.sum(policy[next_state] * Q[next_state])
    Q[state, action] += alpha * (reward + gamma * expected_value
                                 - Q[state, action])

237
Eligibility Traces Update (TD(λ))

et = γλet−1 + ∇θ V (st ), θ ← θ + αδt et

Explanation: TD(λ) combines TD and Monte Carlo methods using


eligibility traces, balancing bias and variance in value updates.

Implementation:

def td_lambda_update(V, eligibility, state, reward, next_state, alpha,
                     gamma, lambda_):
    delta = reward + gamma * V[next_state] - V[state]
    eligibility[state] += 1
    V += alpha * delta * eligibility
    eligibility *= gamma * lambda_

238
TD Error

δt = Rt+1 + γV (st+1 ) − V (st )

Explanation: The TD error measures the difference between predicted


and observed rewards, guiding updates in temporal difference learning.

Implementation:

def td_error(V, state, reward, next_state, gamma):
    return reward + gamma * V[next_state] - V[state]

239
Stochastic Gradient Descent in RL

θ ← θ − α∇θ L(θ)

Explanation: Stochastic gradient descent updates model parameters


by minimizing a loss function, often used in function approximation for RL.

Implementation:

def sgd_update(theta, grad, alpha):
    return theta - alpha * grad

240
Double Q-Learning

Q1(s_t, a_t) ← Q1(s_t, a_t) + α[ R_{t+1} + γ Q2(s_{t+1}, argmax_a Q1(s_{t+1}, a)) − Q1(s_t, a_t) ]

Explanation: Double Q-learning reduces overestimation bias by alter-


nating updates between two Q-functions.

Implementation:

def double_q_learning_update(Q1, Q2, state, action, reward, next_state,
                             alpha, gamma):
    max_action = np.argmax(Q1[next_state])
    target = reward + gamma * Q2[next_state, max_action]
    Q1[state, action] += alpha * (target - Q1[state, action])

241
Advantage Actor–Critic (A2C)

δt = Rt+1 + γV (st+1 ) − V (st ), θ ← θ + α∇θ log πθ (at | st )δt

Explanation: A2C uses the advantage function to reduce variance in


policy updates while learning the value function as a baseline.

Implementation:

def a2c_update(actor, critic, state, action, reward, next_state, alpha, gamma):
    delta = reward + gamma * critic[next_state] - critic[state]
    actor.update(state, action, alpha * delta)
    critic[state] += alpha * delta

242
Off-Policy Evaluation (Importance Sampling)

"T −1 #
Y π(at | st )
E[Ĝ] = E Gt
t=0
µ(at | st )

Explanation: Importance sampling corrects for discrepancies between


the behavior policy µ and the target policy π when estimating returns.

Implementation:

def importance_sampling(weights, returns):
    return np.sum(weights * returns)

243
Policy Gradient Update Rule

θ ← θ + α∇θ Eπθ [Gt log πθ (at | st )]

Explanation: The policy gradient algorithm updates parameters in


the direction of performance improvement, directly optimizing the policy.

Implementation:

def policy_gradient_update(policy, rewards, states, actions, alpha):
    for state, action, reward in zip(states, actions, rewards):
        grad = policy.gradient(state, action)
        policy.update(state, action, alpha * reward * grad)

244
Soft Q-Learning Objective

L = Es,a [Q(s, a) − α log π(a | s)]

Explanation: Soft Q-learning optimizes a policy by balancing reward


maximization and entropy regularization.

Implementation:

def soft_q_update(Q, policy, state, action, reward, next_state, alpha, gamma):
    entropy = -policy.log_prob(action, state)
    target = reward + gamma * (Q[next_state].max() + alpha * entropy)
    Q[state, action] += alpha * (target - Q[state, action])

245
Entropy-Regularized RL

π* = argmax_π ( E[G_t] + αH(π) )

Explanation: Entropy regularization encourages exploration by max-


imizing the entropy of the policy.

Implementation:

def entropy_regularized_update(policy, rewards, states,
                               actions, alpha, entropy_coeff):
    for state, action, reward in zip(states, actions, rewards):
        entropy = -policy.log_prob(action, state)
        grad = policy.gradient(state, action)
        policy.update(state, action, alpha *
                      (reward + entropy_coeff * entropy) * grad)

246
Soft Actor–Critic (SAC)

L = Es,a [Q(s, a) − α log π(a | s)] , Q(s, a) = R + γV (s′ )

Explanation: SAC combines entropy regularization with actor–critic


methods to improve stability and exploration in continuous control.

Implementation:

def sac_update(Q, policy, state, action, reward, next_state, alpha, gamma):
    entropy = -policy.log_prob(action, state)
    target = reward + gamma * (Q[next_state].max() + alpha * entropy)
    Q[state, action] += alpha * (target - Q[state, action])

247
Trust Region Policy Optimization (TRPO)


max_θ E_{π_θold}[ ( π_θ(a | s) / π_θold(a | s) ) A(s, a) ],   subject to D_KL(π_θ ∥ π_θold) ≤ δ
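Explanation: TRPO maximizes a surrogate objective that weights advantages by the probability ratio between the new and old policies, while a KL-divergence trust region keeps each update close to the previous policy.

Implementation (a minimal sketch for a discrete action space; pi_new, pi_old, and advantages are hypothetical arrays of action probabilities and advantage estimates):

import numpy as np

def trpo_surrogate(pi_new, pi_old, advantages, delta=0.01):
    surrogate = np.mean((pi_new / pi_old) * advantages)  # ratio-weighted advantage
    kl = np.sum(pi_new * np.log(pi_new / pi_old))        # D_KL(pi_theta || pi_theta_old)
    return surrogate, kl <= delta                        # objective and trust-region check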

248
