
MATHEMATICS FOR MACHINE LEARNING

A Comprehensive Guide to Building Mathematical Foundations for AI and Data Science

Part 1: Beginner level

By Mohamed AAZI
SECTION 1 : LINEAR ALGEBRA

Vector Addition

$$u + v = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} + \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{pmatrix}$$

Explanation: Vector addition combines two vectors component-wise. It is commonly used in machine learning for gradient updates or geometric vector operations.

Example: If $u = (1, 2)^T$ and $v = (3, 4)^T$, then $u + v = (4, 6)^T$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = u + v

Scalar Multiplication of a Vector

$$\alpha v = \alpha \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{pmatrix}$$

Explanation: Scalar multiplication scales each component of a vector by the same scalar. It is used in scaling gradients or controlling vector magnitudes.

Example: If $\alpha = 3$ and $v = (2, -1)^T$, then $\alpha v = (6, -3)^T$.

Implementation:

import numpy as np
alpha = 3
v = np.array([2, -1])
result = alpha * v

Dot Product

$$u \cdot v = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$

Explanation: The dot product calculates a scalar representing the magnitude of projection of one vector onto another. It is widely used in ML for similarity measures or linear operations.

Example: If $u = (1, 2)^T$ and $v = (3, 4)^T$, then $u \cdot v = 1 \cdot 3 + 2 \cdot 4 = 11$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = np.dot(u, v)

Cross Product (3D)

$$u \times v = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}$$

Explanation: The cross product generates a vector perpendicular to two input vectors in 3D space. It is commonly used in physics and computer graphics.

Example: If $u = (1, 0, 0)^T$ and $v = (0, 1, 0)^T$, then $u \times v = (0, 0, 1)^T$.

Implementation:

import numpy as np
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
result = np.cross(u, v)

Norm of a Vector (Euclidean)

$$\|v\| = \sqrt{\sum_{i=1}^{n} v_i^2} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

Explanation: The Euclidean norm measures the magnitude (length) of a vector. It is useful in optimization and distance computations in ML.

Example: If $v = (3, 4)^T$, then $\|v\| = \sqrt{3^2 + 4^2} = 5$.

Implementation:

import numpy as np
v = np.array([3, 4])
result = np.linalg.norm(v)

Orthogonality Condition

$$u \cdot v = 0$$

Explanation: Two vectors are orthogonal if their dot product is zero. This condition is critical in linear algebra and ML for understanding independence and basis construction.

Example: If $u = (1, 2)^T$ and $v = (-2, 1)^T$, then $u \cdot v = 1 \cdot (-2) + 2 \cdot 1 = 0$, confirming orthogonality.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([-2, 1])
result = np.dot(u, v)
is_orthogonal = result == 0

Matrix Addition

$$A + B = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}$$

Explanation: Matrix addition combines two matrices element-wise. It is used in ML for updating weights and biases or aggregating data.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $A + B = \begin{pmatrix} 6 & 8 \\ 10 & 12 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = A + B

Matrix Scalar Multiplication

$$\alpha A = \alpha \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} \alpha a_{11} & \alpha a_{12} \\ \alpha a_{21} & \alpha a_{22} \end{pmatrix}$$

Explanation: Scaling a matrix by a scalar is useful in ML for adjusting learning rates or normalization.

Example: If $\alpha = 2$ and $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\alpha A = \begin{pmatrix} 2 & 4 \\ 6 & 8 \end{pmatrix}$.

Implementation:

import numpy as np
alpha = 2
A = np.array([[1, 2], [3, 4]])
result = alpha * A

Matrix-Vector Multiplication

$$Ax = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \end{pmatrix}$$

Explanation: Matrix-vector multiplication transforms a vector using a linear transformation defined by the matrix. It is fundamental in ML for applying weights to inputs.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $x = (5, 6)^T$, then $Ax = (17, 39)^T$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([5, 6])
result = np.dot(A, x)

Matrix Multiplication

$$C = AB, \quad c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

Explanation: Matrix multiplication combines two matrices, producing a matrix that represents the composition of linear transformations. It is used in ML for layer operations in neural networks.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $AB = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.dot(A, B)

Transpose of a Matrix

$$A^T = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^T = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}$$

Explanation: The transpose of a matrix flips it over its diagonal, exchanging rows with columns. It is used in ML for switching between data representations.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = A.T

Determinant of a 2×2 Matrix

$$\det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11} a_{22} - a_{12} a_{21}$$

Explanation: The determinant measures the scaling factor of the transformation represented by a matrix. It is used to determine matrix invertibility.

Example: If $A = \begin{pmatrix} 3 & 8 \\ 4 & 6 \end{pmatrix}$, then $\det(A) = 3 \cdot 6 - 8 \cdot 4 = -14$.

Implementation:

import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.det(A)

Inverse of a 2×2 Matrix

$$A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}, \quad \det(A) \neq 0$$

Explanation: The inverse of a 2×2 matrix reverses the linear transformation it represents. It is used in solving systems of linear equations.

Example: If $A = \begin{pmatrix} 3 & 8 \\ 4 & 6 \end{pmatrix}$, then $\det(A) = -14$ and $A^{-1} = \frac{1}{-14} \begin{pmatrix} 6 & -8 \\ -4 & 3 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.inv(A)

Cramer's Rule

$$x_i = \frac{\det(A_i)}{\det(A)}, \quad \det(A) \neq 0$$

Explanation: Cramer's Rule solves a system of linear equations $Ax = b$ by replacing each column of $A$ with $b$ and computing determinants. It is a theoretical method often used for small systems.

Example: For $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$ and $b = (5, 7)^T$,

$$A_1 = \begin{pmatrix} 5 & 1 \\ 7 & 3 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 2 & 5 \\ 1 & 7 \end{pmatrix}$$

and $\det(A) = 5$, so $x_1 = \frac{\det(A_1)}{\det(A)} = \frac{8}{5}$ and $x_2 = \frac{\det(A_2)}{\det(A)} = \frac{9}{5}$.

Implementation:

import numpy as np
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
det_A = np.linalg.det(A)
x = []
for i in range(A.shape[1]):
    A_i = A.copy()
    A_i[:, i] = b  # replace column i of A with b
    x.append(np.linalg.det(A_i) / det_A)  # x = [1.6, 1.8]

Inverse of a Square Matrix

$$A^{-1} = \frac{1}{\det(A)} \operatorname{adj}(A), \quad \det(A) \neq 0$$

Explanation: The inverse of a square matrix generalizes the process for higher dimensions using the adjugate and determinant. It is crucial in linear algebra and ML for solving systems of equations.

Example: If $A = \begin{pmatrix} 4 & 7 \\ 2 & 6 \end{pmatrix}$, the inverse is computed using cofactor expansion and scaling.

Implementation:

import numpy as np
A = np.array([[4, 7], [2, 6]])
result = np.linalg.inv(A)

Determinant of a Triangular Matrix

$$\det(A) = \prod_{i=1}^{n} a_{ii}$$

Explanation: The determinant of a triangular matrix (upper or lower) is the product of its diagonal elements. This simplifies determinant calculations and is useful in decompositions.

Example: If $A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 3 & 4 \\ 0 & 0 & 5 \end{pmatrix}$, then $\det(A) = 2 \cdot 3 \cdot 5 = 30$.

Implementation:

import numpy as np
A = np.array([[2, 1, 0], [0, 3, 4], [0, 0, 5]])
result = np.prod(np.diag(A))

Rank-Nullity Theorem

$$\operatorname{rank}(A) + \operatorname{nullity}(A) = n$$

Explanation: The Rank-Nullity Theorem states that the sum of the rank (dimension of column space) and nullity (dimension of null space) of a matrix equals the number of columns. It is fundamental in linear algebra for understanding solutions to systems of linear equations.

Example: If $A$ has 3 columns and its rank is 2, then the nullity is 1, since $2 + 1 = 3$.

Implementation:

import numpy as np
from numpy.linalg import matrix_rank
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rank = matrix_rank(A)
nullity = A.shape[1] - rank

Hadamard (Elementwise) Product

$$C = A \circ B = \begin{pmatrix} a_{11} b_{11} & a_{12} b_{12} \\ a_{21} b_{21} & a_{22} b_{22} \end{pmatrix}$$

Explanation: The Hadamard product performs elementwise multiplication between two matrices. It is used in ML for feature-wise scaling or gating.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then $C = \begin{pmatrix} 5 & 12 \\ 21 & 32 \end{pmatrix}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.multiply(A, B)

Outer Product

$$C = u \otimes v = \begin{pmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{pmatrix}$$

Explanation: The outer product generates a matrix by multiplying every element of one vector by every element of another. It is used in tensor operations and constructing rank-1 matrices.

Example: If $u = (1, 2)^T$ and $v = (3, 4, 5)^T$, then $u \otimes v = \begin{pmatrix} 3 & 4 & 5 \\ 6 & 8 & 10 \end{pmatrix}$.

Implementation:

import numpy as np
u = np.array([1, 2])
v = np.array([3, 4, 5])
result = np.outer(u, v)

Frobenius Norm

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$$

Explanation: The Frobenius norm measures the magnitude of a matrix by summing the squares of all its elements. It is widely used in optimization and matrix analysis.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.linalg.norm(A, 'fro')

Matrix Norm Inequality

$$\|Ax\| \leq \|A\| \, \|x\|$$

Explanation: The matrix norm inequality states that the norm of a matrix-vector product is bounded by the product of the matrix norm and the vector norm. It is a key property in numerical linear algebra and ML for error analysis.

Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $x = (1, 1)^T$, verify that $\|Ax\| \leq \|A\| \, \|x\|$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([1, 1])
left = np.linalg.norm(np.dot(A, x))
right = np.linalg.norm(A) * np.linalg.norm(x)
inequality_holds = left <= right

Matrix Trace

$$\operatorname{Tr}(A) = \sum_{i=1}^{n} a_{ii}$$

Explanation: The trace of a matrix is the sum of its diagonal elements. It is used in ML for loss functions and characterizing matrix properties.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, then $\operatorname{Tr}(A) = 1 + 4 = 5$.
3 4

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.trace(A)

Trace of a Product

$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$$

Explanation: The trace of a product of two matrices is invariant under cyclic permutations. This property is useful in ML for simplifying gradients in matrix calculus.

Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, verify that $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
trace1 = np.trace(np.dot(A, B))
trace2 = np.trace(np.dot(B, A))
equality_holds = trace1 == trace2

Block Matrix Multiplication

$$C = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{pmatrix}$$

Explanation: Block matrix multiplication follows the same rules as scalar matrix multiplication, but each element is a submatrix. It is used in ML for large-scale computations and decompositions.

Example: Compute the block product for two partitioned 4 × 4 matrices.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.array([[9, 10], [11, 12]])
D = np.array([[13, 14], [15, 16]])
E = np.array([[17, 18], [19, 20]])
F = np.array([[21, 22], [23, 24]])
G = np.array([[25, 26], [27, 28]])
H = np.array([[29, 30], [31, 32]])
top_left = np.dot(A, E) + np.dot(B, G)
top_right = np.dot(A, F) + np.dot(B, H)
bottom_left = np.dot(C, E) + np.dot(D, G)
bottom_right = np.dot(C, F) + np.dot(D, H)

result = np.block([[top_left, top_right], [bottom_left, bottom_right]])

Kronecker Product

$$C = A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{pmatrix}$$

Explanation: The Kronecker product produces a block matrix by multiplying each element of one matrix by the entirety of another. It is used in ML for tensor operations and signal processing.

Example: If $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 5 \\ 6 & 7 \end{pmatrix}$, compute $A \otimes B$.

Implementation:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])
result = np.kron(A, B)

SECTION 2 : PROBABILITY AND STATISTICS

Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Explanation: Conditional probability quantifies the likelihood of event A occurring given that event B has occurred. It is fundamental in probabilistic reasoning and Bayesian inference.

Example: If $P(A \cap B) = 0.2$ and $P(B) = 0.5$, then $P(A \mid B) = \frac{0.2}{0.5} = 0.4$.

Implementation:

P_A_and_B = 0.2
P_B = 0.5
P_A_given_B = P_A_and_B / P_B

Law of Total Probability

$$P(A) = \sum_{i} P(A \mid B_i) P(B_i)$$

Explanation: The law of total probability relates the probability of an event A to the probabilities of A given a partition of events $\{B_i\}$. It is used in scenarios with conditional dependencies.

Example: If $P(A \mid B_1) = 0.3$, $P(A \mid B_2) = 0.7$, $P(B_1) = 0.4$, and $P(B_2) = 0.6$, then $P(A) = 0.3 \cdot 0.4 + 0.7 \cdot 0.6 = 0.54$.

Implementation:

P_A_given_B1 = 0.3
P_A_given_B2 = 0.7
P_B1 = 0.4
P_B2 = 0.6
P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2

Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}, \quad P(B) > 0$$

Explanation: Bayes' Theorem allows the reversal of conditional probabilities, often used in updating beliefs with new evidence in ML and statistics.

Example: If $P(B \mid A) = 0.8$, $P(A) = 0.3$, and $P(B) = 0.5$, then $P(A \mid B) = \frac{0.8 \cdot 0.3}{0.5} = 0.48$.

Implementation:

P_B_given_A = 0.8
P_A = 0.3
P_B = 0.5
P_A_given_B = (P_B_given_A * P_A) / P_B

Expectation

$$E[X] = \sum_{i} x_i P(X = x_i)$$

Explanation: The expectation (mean) of a random variable is the weighted average of all possible values, weighted by their probabilities. It is central in probability and statistics.

Example: If $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, then $E[X] = 1 \cdot 0.2 + 2 \cdot 0.5 + 3 \cdot 0.3 = 2.1$.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))

Variance

$$\operatorname{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Explanation: Variance measures the spread of a random variable around its mean. It is widely used in ML for assessing uncertainty and model performance.

Example: For $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, compute $E[X] = 2.1$ and $E[X^2] = 4.9$, so $\operatorname{Var}(X) = 4.9 - (2.1)^2 = 0.49$.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))
expectation_X2 = sum(x**2 * p for x, p in zip(X, P_X))
variance = expectation_X2 - expectation**2

Standard Deviation

$$\sigma(X) = \sqrt{\operatorname{Var}(X)}$$

Explanation: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the random variable. It is widely used in data analysis and ML for variability assessment.

Example: If $\operatorname{Var}(X) = 0.49$, as in the previous entry, then $\sigma(X) = \sqrt{0.49} = 0.7$.

Implementation:

variance = 0.29
std_dev = variance**0.5

Covariance

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Explanation: Covariance measures the joint variability of two random variables. A positive value indicates that they increase together, while a negative value indicates an inverse relationship.

Example: If $X = \{1, 2\}$, $Y = \{3, 4\}$, $P(X, Y) = \{0.5, 0.5\}$, and $E[X] = 1.5$, $E[Y] = 3.5$, compute $\operatorname{Cov}(X, Y) = 0.25$.

Implementation:

X = [1, 2]
Y = [3, 4]
P_XY = [0.5, 0.5]
E_X = sum(x * p for x, p in zip(X, P_XY))
E_Y = sum(y * p for y, p in zip(Y, P_XY))
covariance = sum((x - E_X) * (y - E_Y) * p for x, y, p in zip(X, Y, P_XY))

Correlation

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma(X)\sigma(Y)}$$

Explanation: Correlation normalizes covariance to a scale of $[-1, 1]$, quantifying the strength and direction of a linear relationship between two variables.

Example: If $\operatorname{Cov}(X, Y) = 0.25$, $\sigma(X) = 0.5$, and $\sigma(Y) = 1.0$, then $\rho(X, Y) = \frac{0.25}{0.5 \cdot 1.0} = 0.5$.

Implementation:

covariance = 0.25
std_X = 0.5
std_Y = 1.0
correlation = covariance / (std_X * std_Y)

Probability Mass Function (PMF)

$$P(X = x) = \begin{cases} p_i, & \text{if } x = x_i \\ 0, & \text{otherwise} \end{cases}$$

Explanation: The PMF defines the probabilities of discrete outcomes of a random variable. It is a foundational concept in probability theory.

Example: If $X = \{1, 2, 3\}$ with $P(X=1) = 0.2$, $P(X=2) = 0.5$, and $P(X=3) = 0.3$, the PMF is defined for these values.

Implementation:

X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
def pmf(x):
return P_X[X.index(x)] if x in X else 0

Probability Density Function (PDF)

$$f_X(x) \geq 0, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$

Explanation: The PDF defines the relative likelihood of a continuous random variable at a specific value. It is used in probability and statistics for modeling continuous distributions.

Example: For a standard normal distribution, the PDF is $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

Implementation:

import numpy as np
from scipy.stats import norm
x = 0 # example point
pdf_value = norm.pdf(x)

Joint Probability

$$P(A \cap B) = P(A \mid B) P(B)$$

Explanation: Joint probability quantifies the likelihood of two events occurring together. It is essential in probabilistic modeling and understanding relationships between variables.

Example: If $P(A \mid B) = 0.4$ and $P(B) = 0.5$, then $P(A \cap B) = 0.4 \cdot 0.5 = 0.2$.

Implementation:

P_A_given_B = 0.4
P_B = 0.5
P_A_and_B = P_A_given_B * P_B

CDF (Cumulative Distribution Function)

$$F_X(x) = P(X \leq x)$$

Explanation: The CDF of a random variable gives the probability that the variable takes a value less than or equal to $x$. It is used to describe the distribution function for both discrete and continuous variables.

Example: For a uniform distribution $X \sim U(0, 1)$, $F_X(0.5) = 0.5$.

Implementation:

from scipy.stats import uniform


x = 0.5
cdf_value = uniform.cdf(x, loc=0, scale=1)

Entropy (discrete)

$$H(X) = -\sum_{i} P(X = x_i) \log_2 P(X = x_i)$$

Explanation: Entropy measures the uncertainty of a discrete random variable. It is a fundamental concept in information theory and ML, particularly in decision trees and loss functions.

Example: If $P(X) = \{0.5, 0.5\}$, then $H(X) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1$.

Implementation:

import numpy as np
P_X = [0.5, 0.5]
entropy = -sum(p * np.log2(p) for p in P_X if p > 0)

Conditional Expectation

$$E[X \mid Y] = \sum_{x} x \, P(X = x \mid Y)$$

Explanation: Conditional expectation is the expected value of a random variable $X$ given that another variable $Y$ is known. It is critical in Bayesian inference and probabilistic modeling.

Example: If $X = \{1, 2\}$ with $P(X = 1 \mid Y) = 0.7$ and $P(X = 2 \mid Y) = 0.3$, then $E[X \mid Y] = 1 \cdot 0.7 + 2 \cdot 0.3 = 1.3$.

Implementation:

X = [1, 2]
P_X_given_Y = [0.7, 0.3]
conditional_expectation = sum(x * p for x, p in zip(X, P_X_given_Y))

Law of Iterated Expectations

$$E[X] = E[E[X \mid Y]]$$

Explanation: The law of iterated expectations states that the expectation of $X$ is the weighted average of its conditional expectations over $Y$. It is foundational in probability theory and statistics.

Example: Suppose $X$ depends on $Y = \{1, 2\}$, with $E[X \mid Y = 1] = 3$, $E[X \mid Y = 2] = 5$, and $P(Y = 1) = 0.6$, $P(Y = 2) = 0.4$. Then $E[X] = 3 \cdot 0.6 + 5 \cdot 0.4 = 3.8$.

Implementation:

E_X_given_Y = [3, 5]
P_Y = [0.6, 0.4]
E_X = sum(e * p for e, p in zip(E_X_given_Y, P_Y))

Marginal Probability

$$P(A) = \sum_{B} P(A \cap B)$$

Explanation: Marginal probability calculates the probability of an event $A$ by summing (or integrating, for continuous cases) over all possible outcomes of another variable $B$. It is used in probabilistic modeling to reduce joint distributions.

Example: If $P(A \cap B_1) = 0.3$ and $P(A \cap B_2) = 0.4$, then $P(A) = 0.3 + 0.4 = 0.7$.

Implementation:

P_A_and_B = [0.3, 0.4]


P_A = sum(P_A_and_B)

Skewness

$$\operatorname{Skewness}(X) = \frac{E[(X - \mu)^3]}{\sigma^3}$$

Explanation: Skewness measures the asymmetry of the probability distribution of a random variable about its mean. Positive skew indicates a longer right tail, and negative skew indicates a longer left tail.

Example: For $X = \{1, 2, 3\}$ with mean $\mu = 2$ and standard deviation $\sigma \approx 0.816$, compute $\operatorname{Skewness}(X)$ using the third central moment.

Implementation:

import numpy as np
X = [1, 2, 3]
mu = np.mean(X)
sigma = np.std(X)
skewness = np.mean(((X - mu) / sigma)**3)

Kurtosis

$$\operatorname{Kurtosis}(X) = \frac{E[(X - \mu)^4]}{\sigma^4}$$

Explanation: Kurtosis measures the "tailedness" of the probability distribution. A high kurtosis indicates heavy tails, while a low kurtosis indicates light tails.

Example: For $X = \{1, 2, 3\}$ with mean $\mu = 2$ and standard deviation $\sigma \approx 0.816$, compute $\operatorname{Kurtosis}(X)$ using the fourth central moment.

Implementation:

import numpy as np
X = [1, 2, 3]
mu = np.mean(X)
sigma = np.std(X)
kurtosis = np.mean(((X - mu) / sigma)**4)

Binary Cross-Entropy (special case)

$$\operatorname{BCE}(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$$

Explanation: Binary cross-entropy is a loss function used for binary classification tasks. It measures the dissimilarity between predicted probabilities ($\hat{y}$) and true labels ($y$).

Example: For $y = [1, 0]$ and $\hat{y} = [0.8, 0.2]$, compute $\operatorname{BCE} = -\frac{1}{2}(\log(0.8) + \log(0.8))$.

Implementation:

import numpy as np
y = np.array([1, 0])
y_hat = np.array([0.8, 0.2])
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Variance (Alternative)

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$

Explanation: An alternative formula for variance uses the difference between the expected value of the square of $X$ and the square of the expected value of $X$. This method is computationally efficient.

Example: For $X = \{1, 2, 3\}$ with equal weights, compute $E[X^2] = \frac{1^2 + 2^2 + 3^2}{3} \approx 4.67$ and $(E[X])^2 = 2^2 = 4$, so $\operatorname{Var}(X) \approx 0.67$.

Implementation:

import numpy as np
X = np.array([1, 2, 3])
E_X2 = np.mean(X**2)
E_X = np.mean(X)
variance = E_X2 - E_X**2

SECTION 3 : CALCULUS

Limit Definition of Derivative

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

Explanation: The derivative of a function is defined as the limit of the difference quotient as $h$ approaches zero. It represents the instantaneous rate of change of the function.

Example: For $f(x) = x^2$, $f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = 2x$.

Implementation:

def derivative(f, x, h=1e-5):


return (f(x + h) - f(x)) / h
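
A quick numerical check of the example above, using the derivative function just defined (the input values are illustrative):

approx = derivative(lambda x: x**2, 3.0)  # roughly 6.0, matching f'(x) = 2x at x = 3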

Power Rule

$$\frac{d}{dx} x^n = n x^{n-1}$$

Explanation: The power rule simplifies differentiation of monomials. It is foundational for calculus and widely used in gradient computations in ML.

Example: For $f(x) = x^3$, $f'(x) = 3x^2$.

Implementation:

def power_rule(n, x):


return n * x**(n - 1)

Product Rule

$$\frac{d}{dx}[u(x)v(x)] = u'(x)v(x) + u(x)v'(x)$$

Explanation: The product rule computes the derivative of the product of two functions. It is crucial for handling multiplicative relationships in ML.

Example: For $f(x) = x^2 e^x$, $f'(x) = 2x e^x + x^2 e^x$.

Implementation:

def product_rule(u, v, u_prime, v_prime, x):


return u_prime(x) * v(x) + u(x) * v_prime(x)

Quotient Rule

$$\frac{d}{dx}\left[\frac{u(x)}{v(x)}\right] = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}$$

Explanation: The quotient rule computes the derivative of the ratio of two functions. It is essential for operations involving divisions in ML models.

Example: For $f(x) = \frac{x^2}{e^x}$, $f'(x) = \frac{2x e^x - x^2 e^x}{e^{2x}}$.

Implementation:

def quotient_rule(u, v, u_prime, v_prime, x):


return (u_prime(x) * v(x) - u(x) * v_prime(x)) / (v(x)**2)

Chain Rule

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$

Explanation: The chain rule computes the derivative of a composite function. It is extensively used in backpropagation for training neural networks.

Example: For $f(x) = \sin(x^2)$, $f'(x) = \cos(x^2) \cdot 2x$.

Implementation:

def chain_rule(f_prime, g, g_prime, x):


return f_prime(g(x)) * g_prime(x)

Logarithmic Derivative

$$\frac{d}{dx} \ln(x) = \frac{1}{x}, \quad x > 0$$

Explanation: The derivative of the natural logarithm function is the reciprocal of its argument. It is frequently used in ML for optimization and logarithmic transformations.

Example: For $f(x) = \ln(x)$, $f'(2) = \frac{1}{2}$.

Implementation:

import numpy as np
def log_derivative(x):
return 1 / x

Exponential Derivative

$$\frac{d}{dx} e^x = e^x$$

Explanation: The exponential function is unique in that its derivative is equal to itself. This property is key in gradient computations and exponential growth models in ML.

Example: For $f(x) = e^x$, $f'(2) = e^2$.

Implementation:

import numpy as np
def exp_derivative(x):
return np.exp(x)

Integral of a Power Function

$$\int x^n \, dx = \frac{x^{n+1}}{n+1} + C, \quad n \neq -1$$

Explanation: The integral of a power function generalizes the antiderivative for monomials. This rule is fundamental in integral calculus and applied in ML for cost function analysis.

Example: For $f(x) = x^2$, $\int x^2 \, dx = \frac{x^3}{3} + C$.

Implementation:

def power_integral(n, x):


return x**(n + 1) / (n + 1)

Fundamental Theorem of Calculus

$$\int_a^b f(x)\,dx = F(b) - F(a), \quad \text{where } F'(x) = f(x)$$

Explanation: The Fundamental Theorem of Calculus links differentiation and integration, stating that integration over an interval is the difference of the antiderivative evaluated at the endpoints.

Example: For $f(x) = x^2$ over $[1, 3]$, $\int_1^3 x^2\,dx = \left[\frac{x^3}{3}\right]_1^3 = \frac{27}{3} - \frac{1}{3} = \frac{26}{3}$.

Implementation:

def definite_integral(f, a, b):


from scipy.integrate import quad
result, _ = quad(f, a, b)
return result

Partial Derivatives

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}, \qquad \frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y + h) - f(x, y)}{h}$$

Explanation: Partial derivatives measure the rate of change of a multivariable function with respect to one variable while keeping others constant. They are essential in optimization and gradient-based ML methods.

Example: For $f(x, y) = x^2 + y^2$, $\frac{\partial f}{\partial x} = 2x$ and $\frac{\partial f}{\partial y} = 2y$.

Implementation:

def partial_derivative(f, var, point, h=1e-5):


args = list(point)
args[var] += h
return (f(*args) - f(*point)) / h

Gradient

$$\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

Explanation: The gradient is a vector containing all partial derivatives of a scalar-valued function. It points in the direction of the steepest ascent and is widely used in ML optimization algorithms like gradient descent.

Example: For $f(x, y) = x^2 + y^2$, $\nabla f(x, y) = (2x, 2y)^T$.

Implementation:

import numpy as np
def gradient(f, point, h=1e-5):
grad = np.zeros(len(point))
for i in range(len(point)):
args = point.copy()
args[i] += h
grad[i] = (f(*args) - f(*point)) / h
return grad

Second Derivative (Hessian)

$$H(f) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

Explanation: The Hessian is a square matrix of second-order partial derivatives. It is used in optimization to assess curvature and convergence properties of a function.

Example: For $f(x, y) = x^2 + y^2$, the Hessian is $H(f) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.

Implementation:

def hessian(f, point, h=1e-5):


n = len(point)
hess = np.zeros((n, n))
for i in range(n):
for j in range(n):
args = point.copy()
args[i] += h
args[j] += h
f_ij = f(*args)
args[j] -= h
f_i = f(*args)
args[i] -= h
args[j] += h

f_j = f(*args)
f_orig = f(*point)
hess[i, j] = (f_ij - f_i - f_j + f_orig) / (h ** 2)
return hess

Directional Derivative

$$D_v f(x) = \nabla f(x) \cdot v$$

Explanation: The directional derivative measures the rate of change of a function in the direction of a given vector. It is critical in optimization and ML for evaluating function behavior in a specific direction.

Example: For $f(x, y) = x^2 + y^2$, $\nabla f(x, y) = (2x, 2y)^T$. In the direction $v = (1, 0)^T$, $D_v f(x, y) = 2x$.

Implementation:

def directional_derivative(f, grad_f, point, direction):


grad = grad_f(point)
return np.dot(grad, direction)

Higher-Order Partial Derivatives

$$\frac{\partial^k f}{\partial x_1^{p_1} \partial x_2^{p_2} \cdots \partial x_n^{p_n}}$$

Explanation: Higher-order partial derivatives extend partial derivatives to greater orders. Mixed derivatives often satisfy equality ($f_{xy} = f_{yx}$) under smoothness conditions.

Example: For $f(x, y) = x^2 y$, $\frac{\partial^2 f}{\partial x \partial y} = 2x$.

Implementation:

def higher_order_partial(f, point, var_indices, h=1e-5):
    # Recursively apply a forward difference for each index in var_indices.
    if not var_indices:
        return f(*point)
    shifted = list(point)
    shifted[var_indices[0]] += h
    return (higher_order_partial(f, shifted, var_indices[1:], h)
            - higher_order_partial(f, point, var_indices[1:], h)) / h

Total Derivative

$$\frac{df}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}$$

Explanation: The total derivative accounts for changes in all independent variables as functions of an external variable $t$. It is used in dynamical systems and optimization.

Example: If $f(x, y) = x^2 + y^2$, $x = t$, and $y = t^2$, then $\frac{df}{dt} = 2x \cdot 1 + 2y \cdot 2t = 2t + 4t^3$.

Implementation:

def total_derivative(f, partials, dx_dt, point):


return sum(partials[i] * dx_dt[i] for i in range(len(point)))

Implicit Differentiation

$$\frac{dy}{dx} = -\frac{\partial F / \partial x}{\partial F / \partial y}$$

Explanation: Implicit differentiation computes the derivative of a dependent variable in an equation where the variable cannot be explicitly solved. It is used in ML and calculus for handling complex equations.

Example: For $F(x, y) = x^2 + y^2 - 1 = 0$, $\frac{dy}{dx} = -\frac{x}{y}$.

Implementation:

def implicit_differentiation(F, x, y, partial_F_x, partial_F_y):


return -partial_F_x(x, y) / partial_F_y(x, y)

Taylor Series Expansion

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots$$

Explanation: The Taylor series approximates a function near a point $a$ using its derivatives. It is used in optimization and numerical analysis.

Example: For $f(x) = e^x$ near $a = 0$, $f(x) \approx 1 + x + \frac{x^2}{2} + \cdots$

Implementation:

from math import factorial

def taylor_series(f, derivatives, a, x, terms=3):
    # derivatives[n] returns the n-th derivative of f (derivatives[0] is f itself)
    result = 0
    for n in range(terms):
        result += derivatives[n](a) * (x - a)**n / factorial(n)
    return result

Jacobian Matrix

$$J(f) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

Explanation: The Jacobian matrix contains all first-order partial derivatives of a vector-valued function. It is essential in ML for gradient-based optimization in multivariable spaces.

Example: For $f(x, y) = \begin{pmatrix} x^2 + y \\ y^2 + x \end{pmatrix}$, the Jacobian is $\begin{pmatrix} 2x & 1 \\ 1 & 2y \end{pmatrix}$.

Implementation:

def jacobian(f, point, h=1e-5):


m = len(f)
n = len(point)
J = np.zeros((m, n))
for i in range(m):
for j in range(n):
args = point.copy()
args[j] += h
J[i, j] = (f[i](*args) - f[i](*point)) / h
return J

Arc Length of a Curve

$$L = \int_a^b \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\, dx$$

Explanation: The arc length measures the distance along a curve between two points. It is used in geometry and physics for path analysis.

Example: For $y = x^2$ over $[0, 1]$, $L = \int_0^1 \sqrt{1 + (2x)^2}\, dx$.

Implementation:

from scipy.integrate import quad


def arc_length(f_prime, a, b):
integrand = lambda x: np.sqrt(1 + f_prime(x)**2)
return quad(integrand, a, b)[0]

Curvature of a Function

$$\kappa(x) = \frac{|y''(x)|}{\left(1 + [y'(x)]^2\right)^{3/2}}$$

Explanation: Curvature quantifies how sharply a curve bends at a given point. It is used in geometry and trajectory analysis in robotics and ML.

Example: For $y = x^2$, $y'(x) = 2x$ and $y''(x) = 2$, so $\kappa(x) = \frac{2}{(1 + 4x^2)^{3/2}}$.

Implementation:

def curvature(f_prime, f_double_prime, x):


numerator = abs(f_double_prime(x))
denominator = (1 + f_prime(x)**2)**1.5
return numerator / denominator

Integration by Parts

$$\int u v' \, dx = uv - \int u' v \, dx$$

Explanation: Integration by parts is a technique derived from the product rule of differentiation. It is used to simplify integrals involving products of functions.

Example: For $\int x e^x \, dx$, let $u = x$ and $v' = e^x$. Then $\int x e^x \, dx = x e^x - \int e^x \, dx = x e^x - e^x + C$.

Implementation:

from sympy import symbols, integrate, exp


x = symbols('x')
u = x
v_prime = exp(x)
v = integrate(v_prime, x)
integral = u * v - integrate(v * u.diff(x), x)

Volume of Revolution (Disk Method)

$$V = \pi \int_a^b [f(x)]^2 \, dx$$

Explanation: The disk method computes the volume of a solid of revolution by slicing it into disks perpendicular to the axis of rotation. It is common in geometry and physics.

Example: For $y = x^2$ revolved around the x-axis over $[0, 1]$, $V = \pi \int_0^1 (x^2)^2 \, dx = \pi \int_0^1 x^4 \, dx = \frac{\pi}{5}$.

Implementation:

from scipy.integrate import quad


import numpy as np
def volume_of_revolution(f, a, b):
integrand = lambda x: np.pi * f(x)**2
return quad(integrand, a, b)[0]

Surface Integral

$$\iint_S f(x, y, z)\, dS = \iint_R f(x, y, g(x, y)) \sqrt{1 + \left(\frac{\partial g}{\partial x}\right)^2 + \left(\frac{\partial g}{\partial y}\right)^2}\, dA$$

Explanation: A surface integral extends the idea of a line integral to a surface, summing a scalar field or vector flux over the surface.

Example: Compute the surface integral of $f(x, y, z) = z$ over $z = x^2 + y^2$ for $x^2 + y^2 \leq 1$.

Implementation:

import numpy as np
from scipy.integrate import dblquad

def surface_integral(f, g, grad_g, x_bounds, y_bounds):
    # g(x, y) gives the surface height z; grad_g(x, y) returns (dg/dx, dg/dy)
    def integrand(y, x):  # dblquad passes the inner variable first
        gx, gy = grad_g(x, y)
        return f(x, y, g(x, y)) * np.sqrt(1 + gx**2 + gy**2)
    return dblquad(integrand, *x_bounds, *y_bounds)[0]

Divergence of a Vector Field

$$\operatorname{div} \mathbf{F} = \nabla \cdot \mathbf{F} = \frac{\partial F_1}{\partial x} + \frac{\partial F_2}{\partial y} + \frac{\partial F_3}{\partial z}$$

Explanation: The divergence measures the magnitude of a vector field's source or sink at a given point. It is used in fluid dynamics and electromagnetism.

Example: For $\mathbf{F} = (x, y, z)^T$, $\operatorname{div} \mathbf{F} = 1 + 1 + 1 = 3$.

Implementation:

from sympy import symbols, diff


x, y, z = symbols('x y z')
F = [x, y, z]
divergence = sum(diff(F[i], var) for i, var in enumerate([x, y, z]))

Curl of a Vector Field

$$\operatorname{curl} \mathbf{F} = \nabla \times \mathbf{F} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\ F_1 & F_2 & F_3 \end{vmatrix}$$

Explanation: The curl measures the rotation or circulation of a vector field at a point. It is critical in fluid mechanics and electromagnetism.

Example: For $\mathbf{F} = (0, 0, xy)^T$, $\operatorname{curl} \mathbf{F} = (x, -y, 0)^T$.

Implementation:

from sympy import symbols, diff, Matrix

x, y, z = symbols('x y z')
F1, F2, F3 = 0, 0, x*y
curl = Matrix([diff(F3, y) - diff(F2, z),
               diff(F1, z) - diff(F3, x),
               diff(F2, x) - diff(F1, y)])  # evaluates to (x, -y, 0)

SECTION 4 : OPTIMIZATION

Gradient Descent

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J(\theta^{(t)})$$

Explanation: Gradient descent is an optimization algorithm that iteratively updates parameters in the direction of the negative gradient to minimize the cost function $J(\theta)$.

Example: For $J(\theta) = \theta^2$ and $\eta = 0.1$, the update is $\theta^{(t+1)} = \theta^{(t)} - 0.2\,\theta^{(t)}$.

Implementation:

def gradient_descent(gradient, theta, eta, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
return theta
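
A usage sketch on the quadratic example above, passing the gradient $\nabla J(\theta) = 2\theta$ as a lambda (the starting point is illustrative):

theta = gradient_descent(lambda t: 2 * t, theta=5.0, eta=0.1, steps=100)
# each step multiplies theta by (1 - 0.2), so theta decays toward 0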

Stochastic Gradient Descent (SGD)

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J_i(\theta^{(t)})$$

Explanation: SGD computes gradients on individual data points, updating parameters more frequently. It is widely used in ML due to its efficiency with large datasets.

Example: For $J_i(\theta) = (\theta - y_i)^2$, the update is based on one data point at each iteration.

Implementation:

def stochastic_gradient_descent(gradient, theta, eta, data, steps):


for _ in range(steps):
i = np.random.randint(len(data))
theta -= eta * gradient(theta, data[i])
return theta

Momentum-based Gradient Descent

$$v^{(t+1)} = \beta v^{(t)} - \eta \nabla J(\theta^{(t)}), \qquad \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$$

Explanation: Momentum adds an exponentially weighted moving average of past gradients to the current update, improving convergence speed and stability.

Example: For $\beta = 0.9$ and $\eta = 0.1$, the velocity update smooths oscillations in gradient descent.

Implementation:

def momentum_gradient_descent(gradient, theta, eta, beta, steps):


v = 0
for _ in range(steps):
v = beta * v - eta * gradient(theta)
theta += v
return theta

Nesterov Accelerated Gradient (NAG)

$$v^{(t+1)} = \beta v^{(t)} - \eta \nabla J(\theta^{(t)} + \beta v^{(t)}), \qquad \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$$

Explanation: NAG improves upon momentum by calculating gradients at a lookahead position, resulting in more precise updates.

Example: For $\beta = 0.9$, NAG anticipates the future direction, reducing overshooting in oscillatory scenarios.

Implementation:

def nesterov_gradient_descent(gradient, theta, eta, beta, steps):


v = 0
for _ in range(steps):
lookahead = theta + beta * v
v = beta * v - eta * gradient(lookahead)
theta += v
return theta

RMSProp

$$s^{(t+1)} = \beta s^{(t)} + (1 - \beta)[\nabla J(\theta^{(t)})]^2, \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{s^{(t+1)}} + \epsilon} \nabla J(\theta^{(t)})$$

Explanation: RMSProp scales the learning rate by a moving average of squared gradients, improving convergence for non-convex problems.

Example: For $\beta = 0.9$, RMSProp adapts the step size for each parameter, stabilizing updates.

Implementation:

def rmsprop(gradient, theta, eta, beta, epsilon, steps):


s = 0
for _ in range(steps):
grad = gradient(theta)
s = beta * s + (1 - beta) * grad**2
theta -= eta / (np.sqrt(s) + epsilon) * grad
return theta

Adam Optimization

$$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\nabla J(\theta^{(t)}), \qquad s^{(t+1)} = \beta_2 s^{(t)} + (1 - \beta_2)[\nabla J(\theta^{(t)})]^2$$

$$\hat{m} = \frac{m^{(t+1)}}{1 - \beta_1^t}, \qquad \hat{s} = \frac{s^{(t+1)}}{1 - \beta_2^t}, \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{\hat{s}} + \epsilon} \hat{m}$$

Explanation: Adam combines momentum and RMSProp, adapting step sizes and smoothing updates. It is one of the most popular optimization algorithms in ML.

Example: For $\beta_1 = 0.9$ and $\beta_2 = 0.999$, Adam balances momentum and per-parameter scaling.

Implementation:

def adam(gradient, theta, eta, beta1, beta2, epsilon, steps):


m, s = 0, 0
for t in range(1, steps + 1):
grad = gradient(theta)
m = beta1 * m + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
s_hat = s / (1 - beta2**t)
theta -= eta / (np.sqrt(s_hat) + epsilon) * m_hat
return theta

Regularized Optimization Objective

$$J_{\text{reg}}(\theta) = J(\theta) + \lambda R(\theta)$$

Explanation: Regularization penalizes model complexity to prevent overfitting. Common regularizers include L1 (lasso) and L2 (ridge) norms.

Example: For $R(\theta) = \|\theta\|_2^2$, $J_{\text{reg}}(\theta) = J(\theta) + \lambda \|\theta\|_2^2$.

Implementation:

def regularized_objective(loss, theta, reg, lam):


return loss(theta) + lam * reg(theta)

Learning Rate Decay

$$\eta_t = \frac{\eta_0}{1 + \gamma t}$$

Explanation: Learning rate decay gradually reduces the learning rate to improve convergence stability as training progresses.

Example: For $\eta_0 = 0.1$ and $\gamma = 0.01$, at step $t = 10$, $\eta_t = 0.1 / (1 + 0.01 \cdot 10) \approx 0.0909$.

Implementation:

def learning_rate_decay(eta0, gamma, t):


return eta0 / (1 + gamma * t)

Gradient Clipping

$$g = \operatorname{clip}(g, -\tau, \tau)$$

Explanation: Gradient clipping limits the gradient magnitude to prevent exploding gradients in deep neural networks.

Example: For $\tau = 1.0$, clip gradients to the range $[-1, 1]$.

Implementation:

def gradient_clipping(grad, tau):


return np.clip(grad, -tau, tau)

Minibatch Gradient Descent

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla J_{B_t}(\theta^{(t)})$$

Explanation: Minibatch gradient descent computes updates using small random subsets of data, balancing SGD's noise and batch gradient descent's stability.

Example: Use minibatch size $B = 32$ to compute updates on smaller subsets of data.

Implementation:

def minibatch_gradient_descent(gradient, theta, eta, data, batch_size, steps):


for _ in range(steps):
batch = np.random.choice(data, batch_size, replace=False)
theta -= eta * gradient(theta, batch)
return theta

Coordinate Descent

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \eta \frac{\partial J(\theta)}{\partial \theta_j}$$

Explanation: Coordinate descent optimizes a single parameter at a time, cycling through all parameters until convergence. It is effective for high-dimensional problems.

Example: Minimize $J(\theta_1, \theta_2) = (\theta_1 - 1)^2 + (\theta_2 - 2)^2$ by alternately updating $\theta_1$ and $\theta_2$.

Implementation:

def coordinate_descent(gradient, theta, eta, steps):


for _ in range(steps):
for j in range(len(theta)):
theta[j] -= eta * gradient(theta, j)
return theta

Elastic Net Regularization

$$J_{\text{reg}}(\theta) = J(\theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2$$

Explanation: Elastic Net combines L1 and L2 regularization to handle sparsity and multicollinearity. It is commonly used in regression tasks.

Example: For $\lambda_1 = 0.1$, $\lambda_2 = 0.2$, and $J(\theta) = \|\theta - y\|_2^2$, compute the regularized objective.

Implementation:

def elastic_net_objective(loss, theta, lam1, lam2):


return loss(theta) + lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta**2)

Adagrad Optimization

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{G^{(t)}} + \epsilon} \nabla J(\theta^{(t)}), \qquad G^{(t)} = \sum_{i=1}^{t} [\nabla J(\theta^{(i)})]^2$$

Explanation: Adagrad adapts the learning rate for each parameter based on the history of gradients, improving performance on sparse data.

Example: For $\eta = 0.1$, adaptively scale updates for different features.

Implementation:

def adagrad(gradient, theta, eta, epsilon, steps):


G = 0
for _ in range(steps):
grad = gradient(theta)
G += grad**2
theta -= eta / (np.sqrt(G) + epsilon) * grad
return theta

AdamW Optimization

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta}{\sqrt{\hat{s}} + \epsilon} \hat{m} - \lambda \theta^{(t)}$$

Explanation: AdamW modifies Adam by decoupling weight decay from the gradient updates, improving regularization and generalization in ML models.

Example: For $\lambda = 0.01$, regularize weights alongside adaptive learning rates.

Implementation:

def adamw(gradient, theta, eta, beta1, beta2, lam, epsilon, steps):


m, s = 0, 0
for t in range(1, steps + 1):
grad = gradient(theta)
m = beta1 * m + (1 - beta1) * grad
s = beta2 * s + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
s_hat = s / (1 - beta2**t)
theta -= eta / (np.sqrt(s_hat) + epsilon) * m_hat + lam * theta
return theta

Momentum "Heavy Ball" Method

$$\theta^{(t+1)} = \theta^{(t)} + \beta(\theta^{(t)} - \theta^{(t-1)}) - \eta \nabla J(\theta^{(t)})$$

Explanation: This variant of momentum includes an inertial term to improve convergence speed for strongly convex problems.

Example: For $\beta = 0.9$, the "heavy ball" accelerates gradient descent.

Implementation:

def heavy_ball(gradient, theta, eta, beta, steps):


prev_theta = theta.copy()
v = 0
for _ in range(steps):
grad = gradient(theta)
v = beta * (theta - prev_theta) - eta * grad
prev_theta = theta.copy()
theta += v
return theta

Projection / Projected Gradient Descent

$$\theta^{(t+1)} = \operatorname{Proj}_C\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)})\right)$$

Explanation: Projected gradient descent ensures that updates remain within a feasible set $C$, often used for constrained optimization.

Example: For $C = \{\theta : \|\theta\|_2 \leq 1\}$, project $\theta$ onto the unit ball after each step.

Implementation:

def projected_gradient_descent(gradient, theta, eta, projection, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
theta = projection(theta)
return theta

Newton's Method

$$\theta^{(t+1)} = \theta^{(t)} - [H(\theta^{(t)})]^{-1} \nabla J(\theta^{(t)})$$

Explanation: Newton's method uses second-order information via the Hessian to improve convergence, especially for quadratic cost functions.

Example: For $J(\theta) = \theta^2$, the update uses $H = 2$.

Implementation:

def newtons_method(gradient, hessian, theta, steps):


for _ in range(steps):
grad = gradient(theta)
hess = hessian(theta)
theta -= np.linalg.inv(hess).dot(grad)
return theta

Proximal Gradient Method

$$\theta^{(t+1)} = \operatorname{prox}_{\lambda R}\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)})\right)$$

Explanation: The proximal gradient method generalizes gradient descent to handle nonsmooth regularization terms such as the L1 norm.

Example: For $R(\theta) = \|\theta\|_1$, compute soft thresholding for each parameter.

Implementation:

def proximal_gradient(gradient, theta, eta, prox, steps):


for _ in range(steps):
theta -= eta * gradient(theta)
theta = prox(theta)
return theta

Proximal Gradient with L1 (ISTA)

$$\theta^{(t+1)} = \operatorname{soft}\left(\theta^{(t)} - \eta \nabla J(\theta^{(t)}),\ \lambda \eta\right)$$

Explanation: The Iterative Shrinkage-Thresholding Algorithm (ISTA) applies soft thresholding to update parameters for sparse optimization.

Example: For $J(\theta) = \|\theta - y\|_2^2 + \lambda\|\theta\|_1$, apply shrinkage to each $\theta_i$.

Implementation:

def ista(gradient, theta, eta, lam, steps):


def soft_threshold(x, lam):
return np.sign(x) * max(0, abs(x) - lam)
for _ in range(steps):
theta -= eta * gradient(theta)
theta = np.vectorize(soft_threshold)(theta, lam * eta)
return theta

Penalty Method

$$J_{\text{penalty}}(\theta) = J(\theta) + \frac{1}{\mu} h(\theta)^2$$

Explanation: The penalty method solves constrained optimization problems by penalizing constraint violations in the objective function.

Example: For $h(\theta) = \|\theta\|_2^2 - 1$, penalize deviations from the unit ball constraint.

Implementation:

def penalty_method(loss, theta, penalty, mu):


return loss(theta) + penalty(theta)**2 / mu

Augmented Lagrangian Method

$$\mathcal{L}(\theta, \lambda, \mu) = J(\theta) + \lambda h(\theta) + \frac{\mu}{2} h(\theta)^2$$

Explanation: The augmented Lagrangian method combines Lagrangian and penalty approaches to solve constrained optimization problems. It alternates between updating parameters and Lagrange multipliers.

Example: For $J(\theta) = \|\theta\|_2^2$ and $h(\theta) = \|\theta\|_1 - 1$, compute updates for $\theta$, $\lambda$, and $\mu$.

Implementation:

def augmented_lagrangian(grad_loss, h, grad_h, theta, lam, mu, eta, steps):
    # Gradient step on the augmented Lagrangian, then a multiplier update.
    # grad_loss and grad_h supply the gradients of J and h as callables.
    for _ in range(steps):
        grad_L = grad_loss(theta) + (lam + mu * h(theta)) * grad_h(theta)
        theta -= eta * grad_L
        lam += mu * h(theta)
    return theta, lam

Dual Ascent Method

$$\lambda^{(t+1)} = \lambda^{(t)} + \eta\, h(\theta^{(t)})$$

Explanation: The dual ascent method optimizes the dual problem of constrained optimization by updating the Lagrange multipliers iteratively.

Example: For $h(\theta) = \|\theta\|_1 - 1$, update $\lambda$ based on the constraint violation.

Implementation:

def dual_ascent(grad_loss, h, grad_h, theta, lam, eta, steps):
    # Gradient step on the Lagrangian in theta, then an ascent step on lambda.
    for _ in range(steps):
        theta -= eta * (grad_loss(theta) + lam * grad_h(theta))
        lam += eta * h(theta)
    return theta, lam

Trust Region Method

$$\theta^{(t+1)} = \arg\min_{\Delta} \left\{ J(\theta) + \nabla J(\theta)^T \Delta + \frac{1}{2} \Delta^T H \Delta \ \middle|\ \|\Delta\| \leq \Delta_{\max} \right\}$$

Explanation: The trust region method restricts the step size to a region where the quadratic approximation of the cost function is valid, ensuring stability.

Example: For $J(\theta) = \|\theta - y\|_2^2$, compute steps $\Delta$ constrained by $\|\Delta\| \leq \Delta_{\max}$.

Implementation:

def trust_region(loss, gradient, hessian, theta, delta_max, steps):


for _ in range(steps):
grad = gradient(theta)
hess = hessian(theta)
delta = np.linalg.solve(hess, -grad)
if np.linalg.norm(delta) > delta_max:
delta *= delta_max / np.linalg.norm(delta)
theta += delta
return theta

Barrier Method

$$J_{\text{barrier}}(\theta) = J(\theta) - \frac{1}{\mu} \sum_{i=1}^{m} \ln(-h_i(\theta))$$

Explanation: The barrier method solves constrained optimization by penalizing constraint violations with a logarithmic barrier, keeping updates within the feasible region.

Example: For $h(\theta) = \|\theta\|_1 - 1$, use $-\ln(1 - \|\theta\|_1)$ as the barrier term.

Implementation:

def barrier_method(grad_loss, h, grad_h, theta, mu, eta, steps):
    # Gradient step on the log-barrier objective; requires h(theta) < 0.
    # grad_loss and grad_h supply the gradients of J and h as callables.
    for _ in range(steps):
        grad_barrier = -grad_h(theta) / h(theta)  # gradient of -ln(-h(theta))
        theta -= eta * (grad_loss(theta) + (1 / mu) * grad_barrier)
        mu *= 0.9
    return theta

Simulated Annealing

$$P(\Delta E) = \exp\left(-\frac{\Delta E}{T}\right)$$

Explanation: Simulated annealing is a probabilistic optimization algorithm inspired by annealing in metallurgy. It explores the solution space by accepting worse solutions probabilistically to escape local minima.

Example: Minimize $J(\theta) = \theta^2$ with an initial temperature $T = 1$, gradually cooling down.

Implementation:

import numpy as np
def simulated_annealing(loss, theta, T, cooling_rate, steps):
for _ in range(steps):
new_theta = theta + np.random.uniform(-1, 1, size=theta.shape)
delta_E = loss(new_theta) - loss(theta)
if delta_E < 0 or np.exp(-delta_E / T) > np.random.rand():
theta = new_theta
T *= cooling_rate
return theta

SECTION 5 : REGRESSION

Linear Regression Hypothesis

$$\hat{y} = X\beta + \epsilon$$

Explanation: The hypothesis for linear regression assumes that the target variable $y$ is a linear combination of features $X$, coefficients $\beta$, and an error term $\epsilon$.

Example: For $y = 2x_1 + 3x_2 + \epsilon$, predict $y$ as a linear function of $x_1$ and $x_2$.

Implementation:

import numpy as np
X = np.array([[1, 2], [3, 4]])
beta = np.array([2, 3])
y_pred = X @ beta

Ordinary Least Squares (OLS)

$$\beta = (X^T X)^{-1} X^T y$$

Explanation: OLS finds the coefficient vector $\beta$ that minimizes the sum of squared residuals between predicted and actual values.

Example: For $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $y = (5, 11)^T$, compute $\beta$.

Implementation:

beta = np.linalg.inv(X.T @ X) @ X.T @ y
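
A sketch with the example values above; since $X$ here is square and invertible, the normal equations recover the exact solution:

import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([5, 11])
beta = np.linalg.inv(X.T @ X) @ X.T @ y  # approximately [1, 2]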

Mean Squared Error (MSE)

$$\operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Explanation: MSE quantifies the average squared difference between actual and predicted values. It is a standard loss function in regression.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute the MSE.

Implementation:

mse = np.mean((y - y_pred)**2)

Gradient of the MSE Loss

$$\frac{\partial}{\partial \beta} \operatorname{MSE} = -\frac{2}{n} X^T (y - X\beta)$$

Explanation: The gradient of MSE with respect to $\beta$ is used in gradient-based optimization algorithms like gradient descent.

Example: Compute the gradient for $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, $y = (5, 11)^T$, and $\beta = (1, 1)^T$.

Implementation:

grad = -2 / len(y) * X.T @ (y - X @ beta)

Coefficient of Determination (R²)

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Explanation: R² measures the proportion of variance in the target variable explained by the model. A value close to 1 indicates a good fit.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute $R^2$.

Implementation:

r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2)

Adjusted R²

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Explanation: Adjusted R² accounts for the number of predictors $p$ in the model, penalizing overfitting.

Example: For $R^2 = 0.9$, $n = 100$, and $p = 5$, compute $\bar{R}^2$.

Implementation:

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

Mean Absolute Error (MAE)

$$\operatorname{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Explanation: MAE measures the average magnitude of prediction errors. It is less sensitive to outliers compared to MSE.

Example: For $y = [1, 2, 3]$ and $\hat{y} = [1.1, 1.9, 3.2]$, compute the MAE.

Implementation:

mae = np.mean(np.abs(y - y_pred))

Weighted Least Squares (WLS)

$$\beta = (X^T W X)^{-1} X^T W y$$

Explanation: WLS minimizes the sum of weighted residuals, allowing for heteroscedasticity in the data.

Example: For $W = \operatorname{diag}([1, 2])$, compute $\beta$.

Implementation:

W = np.diag([1, 2])
beta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y

Polynomial Regression Hypothesis

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n$$

Explanation: Polynomial regression models the relationship between $x$ and $y$ as a polynomial. It generalizes linear regression to non-linear patterns.

Example: Fit $y = 2x + x^2$.

Implementation:

from numpy.polynomial.polynomial import Polynomial


poly = Polynomial.fit(X, y, deg=2)
y_pred = poly(X)

Non-Linear Regression

$$\hat{y} = f(X, \beta) + \epsilon$$

Explanation: Non-linear regression models relationships where the target variable is a non-linear function of the parameters.

Example: Fit $y = a e^{bx}$ using optimization.

Implementation:

from scipy.optimize import curve_fit


def model(X, a, b):
return a * np.exp(b * X)
params, _ = curve_fit(model, X, y)

Maximum Likelihood Estimation for Regression

$$\hat{\beta} = \arg\max_{\beta} \prod_{i=1}^{n} p(y_i \mid X_i, \beta)$$

Explanation: MLE estimates the parameters that maximize the likelihood of observing the data under a probabilistic model.

Example: Estimate $\beta$ assuming Gaussian noise.

Implementation:

from scipy.optimize import minimize


def neg_log_likelihood(beta, X, y):
residuals = y - X @ beta
return np.sum(residuals**2)
beta = minimize(neg_log_likelihood, np.zeros(X.shape[1]), args=(X, y)).x

Empirical Risk Minimization

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(X_i, \theta))$$

Explanation: ERM minimizes the average loss over the training data to estimate the model parameters.

Example: Minimize the MSE loss for linear regression.

Implementation:

def empirical_risk(theta, X, y, loss):


return np.mean([loss(y[i], np.dot(X[i], theta)) for i in range(len(y))])

Logistic Regression Hypothesis

$$\hat{y} = \sigma(X\beta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Explanation: Logistic regression predicts probabilities for binary classification using the sigmoid function applied to a linear combination of inputs.

Example: For $X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $\beta = (1, -1)^T$, compute $\hat{y}$.

Implementation:

def sigmoid(z):
return 1 / (1 + np.exp(-z))
y_pred = sigmoid(X @ beta)

Binary Cross-Entropy Loss

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$

Explanation: Binary cross-entropy measures the dissimilarity between predicted probabilities and true labels in binary classification.

Example: For $y = [1, 0]$ and $\hat{y} = [0.9, 0.1]$, compute the loss.

Implementation:

loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

Cross-Entropy Loss (Multi-Class)

$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij})$$

Explanation: Cross-entropy loss generalizes to multi-class classification, comparing one-hot-encoded true labels with predicted probabilities.

Example: For $y = [1, 0, 0]$ and $\hat{y} = [0.8, 0.1, 0.1]$, compute the loss.

Implementation:

loss = -np.mean(np.sum(y * np.log(y_pred), axis=1))

Hinge Loss for SVM

$$L = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i)$$

Explanation: Hinge loss penalizes predictions that are not at least 1 margin away from the correct classification in support vector machines (SVMs).

Example: For $y = [1, -1]$ and $\hat{y} = [0.8, -0.5]$, compute the loss.

Implementation:

loss = np.mean(np.maximum(0, 1 - y * y_pred))

Lasso Regression Objective

$$L = \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Explanation: Lasso regression adds an L1 regularization term to the least squares loss, promoting sparsity in the coefficients.

Example: For $\lambda = 0.1$, add $\|\beta\|_1$ as a penalty.

Implementation:

loss = 0.5 * np.mean((y - X @ beta)**2) + lam * np.sum(np.abs(beta))

Ridge Regression Objective

$$L = \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

Explanation: Ridge regression adds an L2 regularization term to reduce overfitting by shrinking coefficients.

Example: For $\lambda = 0.1$, compute the loss with L2 regularization.

Implementation:

loss = 0.5 * np.mean((y - X @ beta)**2) + lam * np.sum(beta**2)

Negative Binomial Regression

$$\hat{y} = \frac{\Gamma(y + \alpha)}{\Gamma(y + 1)\Gamma(\alpha)} \left( \frac{\alpha}{\alpha + \hat{\mu}} \right)^{\alpha} \left( \frac{\hat{\mu}}{\alpha + \hat{\mu}} \right)^{y}$$

Explanation: Negative binomial regression models count data with overdispersion using a generalized linear model.

Example: Fit a model for overdispersed count data.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.NegativeBinomial())
results = model.fit()

Poisson Regression Model

$$\hat{\mu} = e^{X\beta}$$

Explanation: Poisson regression models count data using a log link function, assuming the target variable follows a Poisson distribution.

Example: Predict event counts given feature data.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Poisson())
results = model.fit()

Gamma Regression Objective

$$L = \frac{1}{\phi} \sum_{i=1}^{n} \left( -\log(\hat{\mu}_i) + \frac{y_i}{\hat{\mu}_i} \right)$$

Explanation: Gamma regression models positive continuous data with a Gamma distribution, often for skewed datasets.

Example: Predict insurance claim amounts.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Gamma())
results = model.fit()

Probit Regression Model

$$P(y = 1) = \Phi(X\beta)$$

Explanation: Probit regression models binary classification using the cumulative normal distribution function $\Phi$.

Example: Predict binary outcomes using a probit link.

Implementation:

from statsmodels.api import GLM, families


model = GLM(y, X, family=families.Binomial(link=families.links.probit()))
results = model.fit()

Multinomial Logistic Regression

$$P(y = k) = \frac{e^{X\beta_k}}{\sum_{j=1}^{K} e^{X\beta_j}}$$

Explanation: Multinomial logistic regression generalizes logistic regression for multi-class classification tasks.

Example: Classify samples into one of $K = 3$ classes.

Implementation:

from sklearn.linear_model import LogisticRegression


model = LogisticRegression(multi_class='multinomial')
model.fit(X, y)

Quantile Regression Loss

$$L = \sum_{i=1}^{n} \rho_\tau(y_i - \hat{y}_i), \qquad \rho_\tau(e) = \max(\tau e, (\tau - 1)e)$$

Explanation: Quantile regression minimizes the weighted sum of residuals, modeling conditional quantiles of the target variable.

Example: Estimate the 90th percentile of target values.

Implementation:

from statsmodels.api import QuantReg


model = QuantReg(y, X)
results = model.fit(q=0.9)

Huber Loss

$$L = \sum_{i=1}^{n} \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \leq \delta \\ \delta |y_i - \hat{y}_i| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$

Explanation: Huber loss combines MSE and MAE, being quadratic for small errors and linear for large errors, making it robust to outliers.

Example: Fit a regression model robust to outliers with $\delta = 1$.

Implementation:

def huber_loss(y, y_pred, delta):


diff = np.abs(y - y_pred)
return np.where(diff <= delta, 0.5 * diff**2, delta * diff - 0.5 * delta**2)

SECTION 6 : NEURAL NETWORKS

Perceptron Update Rule

$$w^{(t+1)} = w^{(t)} + \eta (y - \hat{y}) x$$

Explanation: The perceptron update rule adjusts weights based on prediction errors. It is used for binary classification on linearly separable data.

Example: For $x = [1, 2]$, $y = 1$, $\hat{y} = 0$, and $\eta = 0.1$, update $w$.

Implementation:

w += eta * (y - y_pred) * x

Forward Propagation (Single Layer)

$$\hat{y} = \sigma(Xw + b)$$

Explanation: Forward propagation computes predictions by applying a weight matrix and activation function to input features.

Example: For $X = [1, 2]$, $w = [0.5, 0.5]$, and $b = 0$, compute $\hat{y}$.
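
Implementation (a minimal sketch with the example values; the sigmoid is written inline rather than reused from a later entry):

import numpy as np
X = np.array([1, 2])
w = np.array([0.5, 0.5])
b = 0
y_pred = 1 / (1 + np.exp(-(X @ w + b)))  # sigma(1.5), roughly 0.82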

Sigmoid Activation

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Explanation: The sigmoid activation maps inputs to $[0, 1]$, commonly used for binary classification.

Example: For $z = 0.5$, compute $\sigma(0.5)$.

Implementation:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
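For the example above, sigmoid(0.5) ≈ 0.622.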

126
Tanh Activation

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

Explanation: Tanh activation maps inputs to [−1, 1] and is useful for


symmetric data.

Example: For z = 0.5, compute tanh(0.5).

Implementation:

def tanh(z):
    return np.tanh(z)

127
ReLU Activation

ReLU(z) = max(0, z)

Explanation: ReLU introduces non-linearity by zeroing negative val-


ues, often used in deep networks.

Example: For z = −1, compute ReLU(−1).

Implementation:

def relu(z):
    return np.maximum(0, z)

128
Heaviside Step Activation


H(z) = { 1,  z ≥ 0;   0,  z < 0 }

Explanation: The Heaviside step function outputs binary values for


classification tasks.

Example: For z = −1, compute H(−1).

Implementation:

def heaviside(z):
    return np.where(z >= 0, 1, 0)

129
Leaky ReLU Activation


Leaky ReLU(z) = { z,  z ≥ 0;   αz,  z < 0 }

Explanation: Leaky ReLU allows small gradients for negative inputs,


mitigating dead neurons.

Example: For z = −1 and α = 0.01, compute Leaky ReLU(−1).

Implementation:

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

130
ELU Activation (Exponential Linear Unit)


ELU(z) = { z,  z ≥ 0;   α(e^z − 1),  z < 0 }

Explanation: ELU smooths ReLU by providing exponential outputs


for negative inputs, improving gradient flow.

Example: For z = −1 and α = 1, compute ELU(−1).

Implementation:

def elu(z, alpha=1):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

131
Softmax Function

Softmax(z)_i = e^(z_i) / Σ_{j=1}^{n} e^(z_j)

Explanation: Softmax normalizes a vector into a probability distribu-


tion over n classes.

Example: For z = [1, 2, 3], compute Softmax(z).

Implementation:

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum(axis=0)
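For the example above, softmax(np.array([1, 2, 3])) ≈ [0.090, 0.245, 0.665].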

132
Loss Function for Multi-Class (Cross-Entropy)

L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_{ij} log(ŷ_{ij})

Explanation: Cross-entropy loss measures the dissimilarity between


predicted probabilities and true labels in multi-class classification.

Example: For y = [1, 0, 0] and ŷ = [0.8, 0.1, 0.1], compute the loss.

Implementation:

loss = -np.mean(np.sum(y * np.log(y_pred), axis=1))

133
Gradient Descent for Neural Networks

θ^(t+1) = θ^(t) − η ∂L/∂θ

Explanation: Gradient descent updates the network’s weights by min-


imizing the loss function using gradients.

Example: Update θ for L = (y − ŷ)2 .
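Implementation (a minimal sketch of one update step for a scalar parameter, assuming the hypothetical model ŷ = θx):

theta, x, y, eta = 0.0, 2.0, 1.0, 0.1
y_pred = theta * x
grad = -2 * (y - y_pred) * x  # dL/dtheta for L = (y - y_pred)**2
theta -= eta * grad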

134
Backpropagation (Gradient for Weights)

∂L/∂w_{ij} = δ_j a_i,   δ_j = (∂L/∂z_j) σ′(z_j)

Explanation: Backpropagation computes the gradient of the loss func-


tion with respect to the weights in a neural network using the chain rule.

Example: Compute gradients for a single-layer neural network.

Implementation:

delta = (y_pred - y) * sigmoid_prime(z)


grad_w = np.outer(delta, a)

135
Mean Squared Error Loss

L = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

Explanation: Mean squared error measures the average squared differ-


ence between predictions and actual values, commonly used in regression.

Example: For y = [1, 2] and ŷ = [1.1, 1.8], compute the loss.

Implementation:

loss = np.mean((y - y_pred)**2)

136
Binary Cross-Entropy Loss

L = −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

Explanation: Binary cross-entropy measures the difference between


predicted probabilities and true binary labels.

Example: For y = [1, 0] and ŷ = [0.9, 0.1], compute the loss.

Implementation:

loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

137
Batch Normalization

x̂ = (x − µ) / √(σ² + ϵ),   y = γx̂ + β

Explanation: Batch normalization normalizes inputs to a layer, re-


ducing internal covariate shift and accelerating training.

Example: Normalize x = [1, 2, 3] with γ = 1, β = 0.

Implementation:

mean = np.mean(x)
var = np.var(x)
x_norm = (x - mean) / np.sqrt(var + epsilon)
y = gamma * x_norm + beta
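For x = [1, 2, 3] with γ = 1 and β = 0, the normalized output is approximately [−1.225, 0, 1.225].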

138
Dropout Regularization


0,

with probability p
âi =
ai
, otherwise


1−p

Explanation: Dropout randomly sets a fraction p of activations to


zero during training to prevent overfitting.

Example: Apply dropout to activations a = [1, 2, 3] with p = 0.5.

Implementation:

mask = np.random.rand(len(a)) > p


a_dropout = a * mask / (1 - p)

139
Gradient of Sigmoid

σ′(z) = σ(z)(1 − σ(z))

Explanation: The derivative of the sigmoid function is used in back-


propagation to compute gradients efficiently.

Example: For z = 0.5, compute σ ′ (0.5).

Implementation:

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

140
RMSProp for Weight Updates

s^(t+1) = βs^(t) + (1 − β)g²,   w^(t+1) = w^(t) − η·g / (√(s^(t+1)) + ϵ)

Explanation: RMSProp adapts the learning rate for each weight based
on the moving average of squared gradients.

Implementation:

s = beta * s + (1 - beta) * grad**2


w -= eta / (np.sqrt(s) + epsilon) * grad

141
Xavier (Glorot) Initialization

w ∼ U( −√(6/(n_in + n_out)),  √(6/(n_in + n_out)) )

Explanation: Xavier initialization sets weights to maintain variance


across layers, improving convergence in deep networks.

Implementation:

limit = np.sqrt(6 / (n_in + n_out))


w = np.random.uniform(-limit, limit, size=(n_in, n_out))

142
L2 Regularization (Weight Decay)

L = L0 + (λ/2) ∥w∥₂²

Explanation: L2 regularization adds a penalty proportional to the


square of weights to prevent overfitting.
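Implementation (a minimal sketch; loss0, lam, and w are hypothetical placeholders for the base loss, the penalty strength, and the weight vector):

import numpy as np

w = np.array([0.5, -0.3])  # hypothetical weights
loss0 = 0.42               # hypothetical unregularized loss
lam = 0.1
loss = loss0 + (lam / 2) * np.sum(w**2)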

143
Heaviside vs. Hard Sigmoid

Hard Sigmoid(z) = max(0, min(1, 0.2z + 0.5))

Explanation: Heaviside is a binary activation function, while Hard


Sigmoid approximates sigmoid for efficiency.

Implementation:

def hard_sigmoid(z):
    return np.clip(0.2 * z + 0.5, 0, 1)

144
Swish Activation

Swish(z) = z · σ(z)

Explanation: Swish is a smooth, non-monotonic activation function


that often outperforms ReLU in deep networks.

Implementation:

def swish(z):
    return z * sigmoid(z)

145
Maxout Activation

Maxout(z) = max_{i∈[1,k]} z_i

Explanation: Maxout selects the maximum value from k linear func-


tions, enabling learnable piecewise linear activations.

Implementation:

def maxout(z):
    return np.max(z, axis=0)

146
Sparse Categorical Cross-Entropy

L = −(1/n) Σ_{i=1}^{n} log(ŷ_{i, y_i})

Explanation: Sparse categorical cross-entropy simplifies the loss cal-


culation by directly indexing the true class probabilities.

Implementation:

loss = -np.mean(np.log(y_pred[range(len(y)), y]))

147
Cosine Similarity / Cosine Loss

Cosine Similarity = (u · v) / (∥u∥ ∥v∥)

Explanation: Cosine similarity measures the angle between vectors,


commonly used in text and embedding similarity.

Implementation:

cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

148
SECTION 7 : CLUSTERING

Distance Metric (Euclidean)

d(u, v) = √( Σ_{i=1}^{n} (u_i − v_i)² )

Explanation: Euclidean distance measures the straight-line distance


between two points in n-dimensional space. It is widely used in clustering
and nearest-neighbor methods.

Example: For u = [1, 2] and v = [3, 4], d(u, v) = √((3 − 1)² + (4 − 2)²) = √8 ≈ 2.83.

Implementation:

def euclidean_distance(u, v):
    return np.sqrt(np.sum((u - v)**2))

149
Manhattan Distance

d(u, v) = Σ_{i=1}^{n} |u_i − v_i|

Explanation: Manhattan distance measures the sum of absolute differ-


ences between corresponding components, resembling city block distances.

Example: For u = [1, 2] and v = [3, 4], d(u, v) = |3 − 1| + |4 − 2| = 4.

Implementation:

def manhattan_distance(u, v):
    return np.sum(np.abs(u - v))

150
Cosine Similarity

Cosine Similarity = (u · v) / (∥u∥ ∥v∥)

Explanation: Cosine similarity measures the cosine of the angle be-


tween two vectors, capturing orientation rather than magnitude.

Example: For u = [1, 0] and v = [0, 1], similarity is 0.

Implementation:

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

151
Jaccard Similarity (Binary Data)

Jaccard Similarity = |u ∩ v| / |u ∪ v|

Explanation: Jaccard similarity compares the intersection and union


of binary data, commonly used in text and set-based similarity.

Example: For u = [1, 1, 0] and v = [1, 0, 1], similarity is 1/3.

Implementation:

def jaccard_similarity(u, v):
    return np.sum(np.logical_and(u, v)) / np.sum(np.logical_or(u, v))

152
k-Means Objective

J = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: The k-means objective minimizes the sum of squared


distances between data points and their assigned cluster centroids.

Example: For points [1, 2], [3, 4] in cluster C1 with centroid [2, 3],
compute J.

Implementation:

def k_means_objective(X, centroids, labels):
    return np.sum(np.linalg.norm(X - centroids[labels], axis=1)**2)

153
Centroid Update Rule (k-Means)

µ_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j

Explanation: The centroid of each cluster is updated as the mean of


points assigned to it.

Example: For cluster C1 = {[1, 2], [3, 4]}, compute µ1 = [2, 3].

Implementation:

def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

154
Elbow Method for Optimal k

J(k) = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: The elbow method finds the optimal number of clusters


k by identifying the ”elbow” in the plot of J(k) versus k.

Implementation:

from sklearn.cluster import KMeans

def elbow_method(X, max_k):
    distortions = []
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k).fit(X)
        distortions.append(kmeans.inertia_)
    return distortions

155
k-Medoids Objective

J = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} d(x_j, m_i)

Explanation: k-Medoids minimizes the sum of distances between data


points and their cluster medoids, robust to outliers.

Example: Replace centroids with medoids for robust clustering.

Implementation:

def k_medoids_objective(X, medoids, labels):
    return np.sum([np.sum(np.linalg.norm(X[labels == i] - medoids[i], axis=1))
                   for i in range(len(medoids))])

156
Fuzzy c-Means Objective

J = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m ∥x_j − c_i∥²

Explanation: Fuzzy c-means assigns membership values uij to each


data point for each cluster, allowing soft clustering.

Implementation:

def fuzzy_c_means_objective(X, centroids, memberships, m):
    return np.sum(memberships**m *
                  np.linalg.norm(X[:, None] - centroids, axis=2)**2)

157
Silhouette Score

S = (b − a) / max(a, b),   a = intra-cluster distance,  b = nearest-cluster distance

Explanation: Silhouette score evaluates the quality of clustering by


comparing intra-cluster and nearest-cluster distances.

Implementation:

from sklearn.metrics import silhouette_score


score = silhouette_score(X, labels)

158
Hierarchical Clustering Dendrogram

d(C1, C2) = min_{x∈C1, y∈C2} ∥x − y∥

Explanation: A dendrogram visually represents the hierarchical clus-


tering process, showing cluster merges.

Implementation:

from scipy.cluster.hierarchy import dendrogram, linkage


Z = linkage(X, method='ward')
dendrogram(Z)

159
Ward’s Linkage

d(C1, C2) = ( |C1||C2| / (|C1| + |C2|) ) ∥µ1 − µ2∥²

Explanation: Ward’s linkage minimizes the variance increase when


merging clusters, resulting in compact clusters.

Implementation:

from scipy.cluster.hierarchy import linkage


Z = linkage(X, method='ward')

160
Single vs. Complete Linkage

d_single(C1, C2) = min_{x∈C1, y∈C2} ∥x − y∥,   d_complete(C1, C2) = max_{x∈C1, y∈C2} ∥x − y∥

Explanation: Single linkage merges clusters based on the smallest


distance between points, while complete linkage uses the largest distance.
They influence the shape of hierarchical clustering.

Implementation:

from scipy.cluster.hierarchy import linkage


Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

161
Average Linkage

d_average(C1, C2) = ( 1 / (|C1||C2|) ) Σ_{x∈C1} Σ_{y∈C2} ∥x − y∥

Explanation: Average linkage computes the average distance between


all pairs of points in two clusters, balancing the extremes of single and
complete linkage.

Implementation:

Z_average = linkage(X, method='average')

162
Minimum Spanning Tree Criterion

MST weight = Σ_{(u,v)∈E} w(u, v),   w(u, v) = ∥u − v∥

Explanation: The minimum spanning tree (MST) connects all points


with the minimum total edge weight, often used in clustering to detect
dense regions.

Implementation:

from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

mst = minimum_spanning_tree(distance_matrix(X, X))

163
DBSCAN Core Point Condition

|Neighbors(x)| ≥ MinPts, where Neighbors(x) = {y : ∥x − y∥ ≤ ϵ}

Explanation: A core point in DBSCAN must have at least MinPts


neighbors within a distance ϵ.

Implementation:

core_condition = len(neighbors) >= MinPts

164
DBSCAN Density Condition

Density-connected: ∃ a chain of points x1 , x2 , . . . , xn such that ∥xi −xi+1 ∥ ≤ ϵ

Explanation: DBSCAN forms clusters by connecting points that are


density-reachable through chains of neighbors.

Implementation:

from sklearn.cluster import DBSCAN


dbscan = DBSCAN(eps=epsilon, min_samples=MinPts).fit(X)

165
Cohesion Metric

Cohesion = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²

Explanation: Cohesion measures the compactness of clusters, where


smaller values indicate tighter clusters.

Implementation:

cohesion = sum((np.linalg.norm(X[labels == i]
                - centroids[i], axis=1)**2).sum() for i in range(k))

166
Separation Metric

Separation = Σ_{i=1}^{k} Σ_{j=i+1}^{k} ∥µ_i − µ_j∥²

Explanation: Separation measures the distance between cluster cen-


troids, where larger values indicate well-separated clusters.

Implementation:

separation = sum(np.linalg.norm(centroids[i]
- centroids[j])**2 for i in range(k) for j in range(i+1, k))

167
Soft Clustering Membership

u_{ij} = ∥x_j − c_i∥^(−2/(m−1)) / Σ_{k=1}^{c} ∥x_j − c_k∥^(−2/(m−1))

Explanation: Soft clustering assigns membership values uij to each


point for each cluster, indicating the degree of belonging.

Implementation:

inv_dist = distances**(-2 / (m - 1))
memberships = inv_dist / inv_dist.sum(axis=1, keepdims=True)

168
Entropy for Clustering Evaluation

H = − Σ_{i=1}^{k} Σ_{j=1}^{n} P_{ij} log P_{ij}

Explanation: Entropy measures the uncertainty in clustering assign-


ments, where lower values indicate clearer clustering.

Implementation:

entropy = -np.sum(P * np.log(P))

169
Mutual Information for Clustering

I(U, V) = Σ_{i=1}^{|U|} Σ_{j=1}^{|V|} P_{ij} log( P_{ij} / (P_i P_j) )

Explanation: Mutual information measures the shared information


between true and predicted clusters.

Implementation:

from sklearn.metrics import mutual_info_score


mi = mutual_info_score(true_labels, predicted_labels)

170
F-Measure for Clustering

F = (2 · Precision · Recall) / (Precision + Recall)

Explanation: The F-measure evaluates clustering performance by bal-


ancing precision and recall.

Implementation:

from sklearn.metrics import f1_score


f_measure = f1_score(true_labels, predicted_labels, average=’weighted’)

171
Adjusted Rand Index (ARI)

ARI = (Index − Expected Index) / (Max Index − Expected Index)

Explanation: ARI adjusts the Rand Index for chance, measuring clus-
tering similarity.

Implementation:

from sklearn.metrics import adjusted_rand_score


ari = adjusted_rand_score(true_labels, predicted_labels)

172
Normalized Mutual Information (NMI)

NMI = 2·I(U, V) / (H(U) + H(V))

Explanation: NMI normalizes mutual information to compare clus-


tering solutions of different sizes.

Implementation:

from sklearn.metrics import normalized_mutual_info_score


nmi = normalized_mutual_info_score(true_labels, predicted_labels)

173
SECTION 8 : DIMENSIONALITY REDUCTION

Principal Component Analysis (PCA) Objective

Maximize: Var(z) = wᵀSw,  subject to ∥w∥₂ = 1

Explanation: PCA seeks directions (principal components) that max-


imize the variance of projected data while being orthogonal to each other.

Implementation:

from sklearn.decomposition import PCA


pca = PCA(n_components=k).fit(X)

174
Covariance Matrix for PCA

S = (1/(n − 1)) (X − X̄)ᵀ(X − X̄)

Explanation: The covariance matrix captures pairwise feature depen-


dencies and is central to PCA.

Implementation:

mean_X = np.mean(X, axis=0)


cov_matrix = np.cov(X - mean_X, rowvar=False)

175
Eigen Decomposition for PCA

Sw = λw

Explanation: PCA uses eigen decomposition of the covariance matrix


to find eigenvalues (variances) and eigenvectors (principal components).

Implementation:

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

176
SVD (Singular Value Decomposition)

X = UΣVᵀ

Explanation: SVD factorizes a matrix into orthogonal components,


enabling dimensionality reduction by truncating Σ.

Implementation:

U, S, Vt = np.linalg.svd(X, full_matrices=False)

177
Reconstruction Error for PCA

Error = ∥X − X̂∥²_F,   X̂ = ZWᵀ + X̄

Explanation: Reconstruction error quantifies the information loss when


reducing dimensionality with PCA.

Implementation:

X_hat = Z @ W.T + mean_X


reconstruction_error = np.linalg.norm(X - X_hat, 'fro')**2

178
Explained Variance Ratio

Explained Variance Ratio = λ_i / Σ_{j=1}^{n} λ_j

Explanation: The explained variance ratio quantifies the proportion


of variance captured by each principal component.

Implementation:

explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

179
Cumulative Explained Variance

Cumulative Explained Variance = Σ_{i=1}^{k} λ_i / Σ_{j=1}^{n} λ_j

Explanation: Cumulative explained variance evaluates the total vari-


ance captured by the first k principal components.

Implementation:

cumulative_explained_variance = np.cumsum(explained_variance_ratio)

180
Random Projection

X_proj = XR,   R_ij ∼ N(0, 1)

Explanation: Random projection reduces dimensionality by project-


ing data onto a lower-dimensional random matrix while approximately pre-
serving distances.

Implementation:

from sklearn.random_projection import GaussianRandomProjection


rp = GaussianRandomProjection(n_components=k).fit_transform(X)

181
Isomap Distance Matrix

d_ij = Shortest Path Distance on G,   G = (X, ϵ-Neighborhoods)

Explanation: Isomap computes geodesic distances in a graph of near-


est neighbors to preserve non-linear structures in the data.

Implementation:

from sklearn.manifold import Isomap


isomap = Isomap(n_neighbors=k).fit_transform(X)

182
MDS Stress Function

Stress = Σ_{i<j} (d_ij − d̂_ij)²

Explanation: The stress function measures the discrepancy between


original and embedded distances in Multidimensional Scaling (MDS).

Implementation:

from sklearn.manifold import MDS


mds = MDS(n_components=2).fit_transform(X)

183
Multidimensional Scaling (MDS)

X_MDS = argmin_Y Stress(Y)

Explanation: MDS embeds data into a lower-dimensional space while


preserving pairwise distances as much as possible.

Implementation:

from sklearn.manifold import MDS


mds = MDS(n_components=k).fit_transform(X)

184
NMF (Non-Negative Matrix Factorization)

X ≈ WH, W ≥ 0, H ≥ 0

Explanation: NMF factorizes a non-negative matrix into two lower-


rank non-negative matrices, often used in topic modeling and image pro-
cessing.

Implementation:

from sklearn.decomposition import NMF


nmf = NMF(n_components=k).fit_transform(X)

185
ICA (Independent Component Analysis) Objective

Maximize: Σ_{i=1}^{n} log p(s_i),   where s = WX

Explanation: ICA separates mixed signals into statistically indepen-


dent components by maximizing non-Gaussianity.

Implementation:

from sklearn.decomposition import FastICA


ica = FastICA(n_components=k).fit_transform(X)

186
Factor Analysis Model

X = ZΛ + ϵ, ϵ ∼ N (0, Ψ)

Explanation: Factor analysis models observed variables as linear com-


binations of latent factors plus noise.

Implementation:

from sklearn.decomposition import FactorAnalysis


fa = FactorAnalysis(n_components=k).fit_transform(X)

187
Kernel PCA Transformation

K = ϕ(X)ϕ(X)ᵀ,   Eigen Decomposition: Kα = λα

Explanation: Kernel PCA applies PCA in a high-dimensional feature


space defined by a kernel function.

Implementation:

from sklearn.decomposition import KernelPCA


kpca = KernelPCA(kernel='rbf', n_components=k).fit_transform(X)

188
LDA (Fisher’s Criterion)

J(w) = (wᵀ S_B w) / (wᵀ S_W w)

Explanation: LDA finds a projection that maximizes class separation


by optimizing the ratio of between-class to within-class variance.

Implementation:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


lda = LinearDiscriminantAnalysis(n_components=k).fit_transform(X, y)

189
Robust PCA (RPCA)

min ∥L∥_* + λ∥S∥₁,   subject to X = L + S

Explanation: RPCA decomposes a matrix into a low-rank component


(L) and a sparse component (S).

Implementation:

from r_pca import R_pca  # third-party package implementing RPCA

rpca = R_pca(X)
L, S = rpca.fit()

190
Hessian LLE

Minimize: ∥WX − X∥₂²,   subject to local Hessian alignment

Explanation: Hessian LLE preserves local geometric structures while


optimizing a low-dimensional embedding.

Implementation:

from sklearn.manifold import LocallyLinearEmbedding


hessian_lle = LocallyLinearEmbedding(n_neighbors=k,
    method='hessian').fit_transform(X)

191
Laplacian Eigenmaps Objective

Minimize: Σ_{i,j} w_{ij} ∥y_i − y_j∥²,   W = Graph Weights

Explanation: Laplacian Eigenmaps embeds data while preserving lo-


cal neighborhood information based on a graph structure.

Implementation:

from sklearn.manifold import SpectralEmbedding


laplacian = SpectralEmbedding(n_components=k).fit_transform(X)

192
Autoencoder Reconstruction

X̂ = Decoder(Encoder(X))

Explanation: Autoencoders minimize reconstruction error by com-


pressing data into a latent representation and reconstructing it.

Implementation:

from keras.models import Model

# encoder and decoder are assumed to be pre-built Keras models
encoded = encoder(X)
decoded = decoder(encoded)

193
Autoencoder Latent Representation

Z = Encoder(X)

Explanation: The latent representation (Z) compresses input data


into a lower-dimensional space for downstream tasks.

Implementation:

latent_representation = encoder.predict(X)

194
Sparse PCA Objective

Maximize: ∥XW∥₂²,   subject to sparsity constraints on W

Explanation: Sparse PCA introduces sparsity in the principal compo-


nents to improve interpretability.

Implementation:

from sklearn.decomposition import SparsePCA


spca = SparsePCA(n_components=k).fit_transform(X)

195
t-SNE Objective

Minimize: KL(P ∥ Q) = Σ_{i≠j} P_{ij} log(P_{ij} / Q_{ij})

Explanation: t-SNE minimizes the Kullback-Leibler divergence be-


tween high-dimensional and low-dimensional distributions.

Implementation:

from sklearn.manifold import TSNE


tsne = TSNE(n_components=k).fit_transform(X)

196
Gradient of t-SNE

∂KL/∂y_i = 4 Σ_j (P_{ij} − Q_{ij})(y_i − y_j) Q_{ij}

Explanation: The gradient of the t-SNE objective updates low-dimensional


embeddings to align distributions.
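Implementation (a minimal sketch of the gradient exactly as written above; P, Q, and Y are assumed to be precomputed affinity matrices and the current embedding):

import numpy as np

def tsne_gradient(P, Q, Y):
    grad = np.zeros_like(Y)
    for i in range(len(Y)):
        weights = (P[i] - Q[i]) * Q[i]  # (P_ij - Q_ij) * Q_ij
        grad[i] = 4 * np.sum(weights[:, None] * (Y[i] - Y), axis=0)
    return grad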

197
UMAP (Uniform Manifold Approximation and Projection)

Optimize: Σ_{i,j} w_{ij} ∥y_i − y_j∥² − λ Σ_{k,l} w_{kl} log(∥y_k − y_l∥)

Explanation: UMAP preserves local and global structures by opti-


mizing a balance between distances and densities.

Implementation:

import umap
umap_embedding = umap.UMAP(n_components=k).fit_transform(X)

198
SECTION 9 : PROBABILITY DISTRIBUTIONS

Bernoulli Distribution

P(X = x) = p^x (1 − p)^(1−x),   x ∈ {0, 1},  0 ≤ p ≤ 1

Explanation: The Bernoulli distribution models a single binary event,


with success probability p.

Example: For p = 0.7, P (X = 1) = 0.7, P (X = 0) = 0.3.

Implementation:

from scipy.stats import bernoulli


prob = bernoulli.pmf(k=1, p=0.7)

199
Binomial Distribution


P(X = k) = C(n, k) p^k (1 − p)^(n−k),   k ∈ {0, 1, . . . , n}

Explanation: The Binomial distribution models the number of suc-


cesses in n independent Bernoulli trials.

Example: For n = 5 and p = 0.5, P(X = 3) = C(5, 3) (0.5)³ (0.5)² = 0.3125.

Implementation:

from scipy.stats import binom


prob = binom.pmf(k=3, n=5, p=0.5)

200
Poisson Distribution

P(X = k) = λ^k e^(−λ) / k!,   k ∈ {0, 1, 2, . . .}

Explanation: The Poisson distribution models the number of events


in a fixed interval, with a mean rate λ.

Example: For λ = 3, P(X = 2) = 3² e^(−3) / 2! ≈ 0.224.

Implementation:

from scipy.stats import poisson


prob = poisson.pmf(k=2, mu=3)

201
Uniform Distribution (Continuous)

f(x) = 1/(b − a),   x ∈ [a, b]

Explanation: The continuous uniform distribution assigns equal prob-


ability density to all points in [a, b].

Example: For a = 0, b = 2, f(1) = 1/2.

Implementation:

from scipy.stats import uniform


prob = uniform.pdf(x=1, loc=0, scale=2)

202
Discrete Uniform Distribution

P(X = x) = 1/n,   x ∈ {1, 2, . . . , n}

Explanation: The discrete uniform distribution assigns equal proba-


bility to n discrete outcomes.

Example: For n = 6, P(X = 3) = 1/6.

Implementation:

from scipy.stats import randint


prob = randint.pmf(k=3, low=1, high=7)

203
Normal (Gaussian) Distribution

f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

Explanation: The normal distribution models data with a symmetric


bell shape, defined by mean µ and standard deviation σ.

Example: For µ = 0, σ = 1, f(0) = 1/√(2π) ≈ 0.399.


Implementation:

from scipy.stats import norm


prob = norm.pdf(x=0, loc=0, scale=1)

204
Exponential Distribution

f(x) = λe^(−λx),   x ≥ 0

Explanation: The exponential distribution models the time between


events in a Poisson process.

Example: For λ = 2, f(1) = 2e^(−2) ≈ 0.271.

Implementation:

from scipy.stats import expon


prob = expon.pdf(x=1, scale=1/2)

205
Geometric Distribution

P(X = k) = (1 − p)^(k−1) p,   k ∈ {1, 2, . . .}

Explanation: The geometric distribution models the number of trials


until the first success in repeated Bernoulli trials.

Example: For p = 0.5, P(X = 3) = (0.5)²(0.5) = 0.125.

Implementation:

from scipy.stats import geom


prob = geom.pmf(k=3, p=0.5)

206
Hypergeometric Distribution

P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n)

Explanation: The hypergeometric distribution models successes in n


draws without replacement from a population of N with K successes.

Example: For N = 20, K = 7, n = 5, P (X = 3).

Implementation:

from scipy.stats import hypergeom

# scipy's convention: M = population size, n = successes in the population,
# N = number of draws
prob = hypergeom.pmf(k=3, M=20, n=7, N=5)

207
Beta Distribution

f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β),   x ∈ [0, 1]

Explanation: The Beta distribution models probabilities as a function


of parameters α and β.

Example: For α = 2, β = 3, compute f (0.5).

Implementation:

from scipy.stats import beta


prob = beta.pdf(x=0.5, a=2, b=3)

208
Gamma Distribution

f(x) = β^α x^(α−1) e^(−βx) / Γ(α),   x > 0

Explanation: The Gamma distribution generalizes the exponential


distribution, often used for waiting times.

Example: For α = 2, β = 1, compute f (1).

Implementation:

from scipy.stats import gamma


prob = gamma.pdf(x=1, a=2, scale=1/1)

209
Multinomial Distribution

P(X1 = k1, . . . , Xk = kk) = ( n! / (k1! · · · kk!) ) p1^(k1) · · · pk^(kk)

Explanation: The multinomial distribution generalizes the binomial


distribution for multiple categories.

Example: For n = 3, p = [0.2, 0.5, 0.3], and k = [1, 1, 1].

Implementation:

from scipy.stats import multinomial


prob = multinomial.pmf(x=[1, 1, 1], n=3, p=[0.2, 0.5, 0.3])

210
Chi-Square Distribution

f(x) = x^(k/2−1) e^(−x/2) / (2^(k/2) Γ(k/2)),   x > 0

Explanation: The chi-square distribution models the sum of squares


of k independent standard normal variables, commonly used in hypothesis
testing.

Example: For k = 3, compute f (2).

Implementation:

from scipy.stats import chi2


prob = chi2.pdf(x=2, df=3)

211
Student’s t-Distribution

f(x) = ( Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) ) · (1 + x²/ν)^(−(ν+1)/2)

Explanation: The Student’s t-distribution is used for estimating pop-


ulation parameters when the sample size is small.

Example: For ν = 5, compute f (1).

Implementation:

from scipy.stats import t


prob = t.pdf(x=1, df=5)

212
F-Distribution

f(x) = √((d1·x/d2)^d1) · (1 + d1·x/d2)^(−(d1+d2)/2) / ( x · B(d1/2, d2/2) ),   x > 0

Explanation: The F-distribution models the ratio of variances and is


commonly used in ANOVA tests.

Implementation:

from scipy.stats import f


prob = f.pdf(x=2, dfn=5, dfd=10)

213
Laplace Distribution

f(x) = (1/(2b)) e^(−|x−µ|/b)
2b

Explanation: The Laplace distribution, also known as the double ex-


ponential distribution, is used for modeling differences in data.

Implementation:

from scipy.stats import laplace


prob = laplace.pdf(x=0, loc=0, scale=1)

214
Rayleigh Distribution

f(x) = (x/σ²) e^(−x²/(2σ²)),   x ≥ 0
σ2

Explanation: The Rayleigh distribution models the magnitude of a


two-dimensional vector with independent normal components.

Implementation:

from scipy.stats import rayleigh


prob = rayleigh.pdf(x=2, scale=1)

215
Triangular Distribution


f(x) = { 2(x − a) / ((b − a)(c − a)),   a ≤ x < c;    2(b − x) / ((b − a)(b − c)),   c ≤ x ≤ b }

Explanation: The triangular distribution models data with a known


minimum, maximum, and mode.

Implementation:

from scipy.stats import triang


prob = triang.pdf(x=0.5, c=0.5, loc=0, scale=1)

216
Log-Normal Distribution

f(x) = (1/(xσ√(2π))) e^(−(ln x − µ)²/(2σ²)),   x > 0

Explanation: The log-normal distribution models data whose loga-


rithm follows a normal distribution.

Implementation:

from scipy.stats import lognorm


prob = lognorm.pdf(x=2, s=1, scale=np.exp(0))

217
Arcsine Distribution

f(x) = 1 / (π√(x(1 − x))),   x ∈ (0, 1)

Explanation: The arcsine distribution models probabilities with end-


points more likely than the middle.

Implementation:

from scipy.stats import arcsine


prob = arcsine.pdf(x=0.5)

218
Beta-Binomial Distribution


P(X = k) = C(n, k) · B(k + α, n − k + β) / B(α, β)

Explanation: The beta-binomial distribution models overdispersed bi-


nomial outcomes using a Beta prior.

Implementation:

from scipy.stats import betabinom


prob = betabinom.pmf(k=2, n=5, a=2, b=3)

219
Cauchy Distribution

f(x) = 1 / ( πγ [1 + ((x − x0)/γ)²] )

Explanation: The Cauchy distribution models data with heavy tails,


often used in robust statistics.

Implementation:

from scipy.stats import cauchy


prob = cauchy.pdf(x=0, loc=0, scale=1)

220
Weibull Distribution

f(x) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k),   x ≥ 0

Explanation: The Weibull distribution is used for reliability analysis


and modeling lifetimes.

Implementation:

from scipy.stats import weibull_min


prob = weibull_min.pdf(x=2, c=1.5, scale=1)

221
Pareto Distribution

f(x) = α·x_m^α / x^(α+1),   x ≥ x_m

Explanation: The Pareto distribution models wealth distribution and


heavy-tailed phenomena.

Implementation:

from scipy.stats import pareto


prob = pareto.pdf(x=2, b=1)

222
Log-Cauchy Distribution

f(x) = 1 / ( xπγ [1 + ((ln x − x0)/γ)²] ),   x > 0

Explanation: The log-Cauchy distribution is the logarithmic trans-


form of the Cauchy distribution, with heavy tails.
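Implementation (a minimal sketch; scipy.stats has no built-in log-Cauchy distribution, so the density above is coded directly):

import numpy as np

def log_cauchy_pdf(x, x0=0.0, gamma=1.0):
    return 1 / (x * np.pi * gamma * (1 + ((np.log(x) - x0) / gamma)**2))

prob = log_cauchy_pdf(2.0)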

223
SECTION 10 : REINFORCEMENT LEARNING

Reward Function

R(s, a) = E[Reward | s, a]

Explanation: The reward function provides the immediate reward


received after taking action a in state s, guiding the agent’s behavior.

Implementation:

def reward_function(state, action):
    # Example reward calculation
    return rewards[state, action]

224
Discounted Return


G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1

Explanation: The discounted return accumulates rewards over time,


weighting future rewards by the discount factor γ.

Implementation:

def discounted_return(rewards, gamma):
    G = 0
    for t, r in enumerate(rewards):
        G += (gamma**t) * r
    return G

225
Bellman Equation (State-Value Function)

V (s) = Eπ [R(s, a) + γV (s′ )]

Explanation: The Bellman equation relates the value of a state to the


expected return from it under a policy π.

Implementation:

def bellman_state_value(s, rewards, transition_prob, gamma, V):
    return np.sum(transition_prob[s] * (rewards[s] + gamma * V))

226
Bellman Equation (Action-Value Function)

Q(s, a) = E[R(s, a) + γV (s′ )]

Explanation: The Bellman equation for the action-value function ex-


presses the value of taking action a in state s and following the policy
afterward.

Implementation:

def bellman_action_value(s, a, rewards, transition_prob, gamma, V):
    return rewards[s, a] + gamma * np.sum(transition_prob[s, a] * V)

227
Temporal Difference (TD) Update

V (st ) ← V (st ) + α [Rt+1 + γV (st+1 ) − V (st )]

Explanation: The TD update improves the value estimate of a state


by using the difference between predicted and actual returns.

Implementation:

def td_update(V, state, reward, next_state, alpha, gamma):
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

228
Monte Carlo Policy Evaluation

V (s) ← E[Gt | st = s]

Explanation: Monte Carlo evaluation updates the value of a state by


averaging returns from multiple episodes starting from that state.

Implementation:

def monte_carlo_evaluation(V, state_returns, state_counts):
    for state, returns in state_returns.items():
        V[state] = np.mean(returns)

229
Policy Improvement

π′(s) = argmax_a Q(s, a)

Explanation: Policy improvement updates the policy by choosing the


action that maximizes the action-value function.

Implementation:

def policy_improvement(Q):
    return np.argmax(Q, axis=1)

230
Q-Learning Update

Q(s_t, a_t) ← Q(s_t, a_t) + α[ R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Explanation: Q-learning is an off-policy algorithm that updates action-


value estimates using the maximum future Q-value.

Implementation:

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                 - Q[state, action])

231
SARSA Update

Q(st , at ) ← Q(st , at ) + α [Rt+1 + γQ(st+1 , at+1 ) − Q(st , at )]

Explanation: SARSA is an on-policy algorithm that updates Q-values


based on the action actually taken under the current policy.

Implementation:

def sarsa_update(Q, state, action, reward,
                 next_state, next_action, alpha, gamma):
    Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                 - Q[state, action])

232
Value Iteration Update

" #
X
′ ′
V (s) ← max R(s, a) + γ P (s | s, a)V (s )
a
s′

Explanation: Value iteration iteratively updates state values by find-


ing the optimal action at each step.

Implementation:

def value_iteration(V, rewards, transition_prob, gamma, num_actions):
    for s in range(len(V)):
        V[s] = max(np.sum(transition_prob[s, a] * (rewards[s, a] + gamma * V))
                   for a in range(num_actions))

233
Actor–Critic Policy Update

θ ← θ + α∇θ log πθ (at | st )δt , δt = Rt+1 + γV (st+1 ) − V (st )

Explanation: The actor updates the policy using the advantage, while
the critic updates the value function to estimate the advantage.

Implementation:

def actor_critic_update(actor, critic, state, action, reward, next_state,
                        alpha, gamma):
    delta = reward + gamma * critic[next_state] - critic[state]
    actor.update(state, action, alpha * delta)
    critic[state] += alpha * delta

234
Deterministic Policy Gradient

∇J(θ) = Es∼ρπ [∇a Q(s, a)∇θ πθ (s)]

Explanation: Deterministic policy gradients update the policy directly


in a continuous action space using gradients of the Q-function.

Implementation:

def deterministic_policy_gradient(policy, q_function, state, alpha):
    action = policy(state)
    grad_q = q_function.gradient(state, action)
    grad_pi = policy.gradient(state)
    policy.update(state, alpha * np.dot(grad_q, grad_pi))

235
Discount Factor (γ)


G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1

Explanation: The discount factor determines the weight given to fu-


ture rewards. A smaller γ prioritizes immediate rewards, while a larger γ
considers longer-term rewards.

Implementation:

def discounted_return(rewards, gamma):
    G = 0
    for t, r in enumerate(rewards):
        G += (gamma**t) * r
    return G

236
Expected SARSA

Q(st , at ) ← Q(st , at ) + α [Rt+1 + γEa′ [Q(st+1 , a′ )] − Q(st , at )]

Explanation: Expected SARSA updates Q-values using the expected


value of the next action, improving stability over standard SARSA.

Implementation:

def expected_sarsa(Q, state, action, reward, next_state, policy, alpha, gamma):
    expected_value = np.sum(policy[next_state] * Q[next_state])
    Q[state, action] += alpha * (reward + gamma * expected_value
                                 - Q[state, action])

237
Eligibility Traces Update (TD(λ))

et = γλet−1 + ∇θ V (st ), θ ← θ + αδt et

Explanation: TD(λ) combines TD and Monte Carlo methods using


eligibility traces, balancing bias and variance in value updates.

Implementation:

def td_lambda_update(V, eligibility, state, reward, next_state, alpha,
                     gamma, lambda_):
    delta = reward + gamma * V[next_state] - V[state]
    eligibility[state] += 1
    V += alpha * delta * eligibility
    eligibility *= gamma * lambda_

238
TD Error

δt = Rt+1 + γV (st+1 ) − V (st )

Explanation: The TD error measures the difference between predicted


and observed rewards, guiding updates in temporal difference learning.

Implementation:

def td_error(V, state, reward, next_state, gamma):
    return reward + gamma * V[next_state] - V[state]

239
Stochastic Gradient Descent in RL

θ ← θ − α∇θ L(θ)

Explanation: Stochastic gradient descent updates model parameters


by minimizing a loss function, often used in function approximation for RL.

Implementation:

def sgd_update(theta, grad, alpha):
    return theta - alpha * grad

240
Double Q-Learning

Q1(s_t, a_t) ← Q1(s_t, a_t) + α[ R_{t+1} + γ Q2(s_{t+1}, argmax_a Q1(s_{t+1}, a)) − Q1(s_t, a_t) ]

Explanation: Double Q-learning reduces overestimation bias by alter-


nating updates between two Q-functions.

Implementation:

def double_q_learning_update(Q1, Q2, state, action, reward, next_state,
                             alpha, gamma):
    max_action = np.argmax(Q1[next_state])
    target = reward + gamma * Q2[next_state, max_action]
    Q1[state, action] += alpha * (target - Q1[state, action])

241
Advantage Actor–Critic (A2C)

δt = Rt+1 + γV (st+1 ) − V (st ), θ ← θ + α∇θ log πθ (at | st )δt

Explanation: A2C uses the advantage function to reduce variance in


policy updates while learning the value function as a baseline.

Implementation:

def a2c_update(actor, critic, state, action, reward, next_state, alpha, gamma):
    delta = reward + gamma * critic[next_state] - critic[state]
    actor.update(state, action, alpha * delta)
    critic[state] += alpha * delta

242
Off-Policy Evaluation (Importance Sampling)

"T −1 #
Y π(at | st )
E[Ĝ] = E Gt
t=0
µ(at | st )

Explanation: Importance sampling corrects for discrepancies between


the behavior policy µ and the target policy π when estimating returns.

Implementation:

def importance_sampling(weights, returns):
    return np.sum(weights * returns)

243
Policy Gradient Update Rule

θ ← θ + α∇θ Eπθ [Gt log πθ (at | st )]

Explanation: The policy gradient algorithm updates parameters in


the direction of performance improvement, directly optimizing the policy.

Implementation:

def policy_gradient_update(policy, rewards, states, actions, alpha):
    for state, action, reward in zip(states, actions, rewards):
        grad = policy.gradient(state, action)
        policy.update(state, action, alpha * reward * grad)

244
Soft Q-Learning Objective

L = Es,a [Q(s, a) − α log π(a | s)]

Explanation: Soft Q-learning optimizes a policy by balancing reward


maximization and entropy regularization.

Implementation:

def soft_q_update(Q, policy, state, action, reward, next_state, alpha, gamma):
    entropy = -policy.log_prob(action, state)
    target = reward + gamma * (Q[next_state].max() + alpha * entropy)
    Q[state, action] += alpha * (target - Q[state, action])

245
Entropy-Regularized RL

π* = argmax_π ( E[G_t] + αH(π) )

Explanation: Entropy regularization encourages exploration by max-


imizing the entropy of the policy.

Implementation:

def entropy_regularized_update(policy, rewards, states,
                               actions, alpha, entropy_coeff):
    for state, action, reward in zip(states, actions, rewards):
        entropy = -policy.log_prob(action, state)
        grad = policy.gradient(state, action)
        policy.update(state, action, alpha *
                      (reward + entropy_coeff * entropy) * grad)

246
Soft Actor–Critic (SAC)

L = Es,a [Q(s, a) − α log π(a | s)] , Q(s, a) = R + γV (s′ )

Explanation: SAC combines entropy regularization with actor–critic


methods to improve stability and exploration in continuous control.

Implementation:

def sac_update(Q, policy, state, action, reward, next_state, alpha, gamma):
    entropy = -policy.log_prob(action, state)
    target = reward + gamma * (Q[next_state].max() + alpha * entropy)
    Q[state, action] += alpha * (target - Q[state, action])

247
Trust Region Policy Optimization (TRPO)


max_θ E_{π_θold}[ ( π_θ(a | s) / π_θold(a | s) ) A(s, a) ],   subject to D_KL(π_θ ∥ π_θold) ≤ δ
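Explanation: TRPO maximizes a surrogate objective that weights advantages by the probability ratio between the new and old policies, while a KL-divergence trust region keeps each update close to the previous policy.

Implementation (a minimal sketch for a discrete action space; pi_new, pi_old, and advantages are hypothetical arrays of action probabilities and advantage estimates):

import numpy as np

def trpo_surrogate(pi_new, pi_old, advantages, delta=0.01):
    surrogate = np.mean((pi_new / pi_old) * advantages)  # ratio-weighted advantage
    kl = np.sum(pi_new * np.log(pi_new / pi_old))        # D_KL(pi_theta || pi_theta_old)
    return surrogate, kl <= delta                        # objective and trust-region check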

248
