Mathematics for Machine Learning
Mohamed Aazi
SECTION 1 : LINEAR ALGEBRA
Vector Addition
u + v = (u_1 + v_1, u_2 + v_2, …, u_n + v_n)
Example: If u = (1, 2) and v = (3, 4), then u + v = (4, 6).
Implementation:
import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = u + v
2
Scalar Multiplication of a Vector
αv = α(v_1, v_2, …, v_n) = (αv_1, αv_2, …, αv_n)
Example: If α = 3 and v = (2, −1), then αv = (6, −3).
Implementation:
import numpy as np
alpha = 3
v = np.array([2, -1])
result = alpha * v
3
Dot Product
u·v = Σ_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + ⋯ + u_n v_n
Example: If u = (1, 2) and v = (3, 4), then u·v = 1·3 + 2·4 = 11.
Implementation:
import numpy as np
u = np.array([1, 2])
v = np.array([3, 4])
result = np.dot(u, v)
4
Cross Product (3D)
u × v = det [[i, j, k], [u_1, u_2, u_3], [v_1, v_2, v_3]]
Example: If u = (1, 0, 0) and v = (0, 1, 0), then u × v = (0, 0, 1).
Implementation:
import numpy as np
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
result = np.cross(u, v)
5
Norm of a Vector (Euclidean)
∥v∥ = √( Σ_{i=1}^{n} v_i² ) = √( v_1² + v_2² + ⋯ + v_n² )
Example: If v = (3, 4), then ∥v∥ = √(3² + 4²) = 5.
Implementation:
import numpy as np
v = np.array([3, 4])
result = np.linalg.norm(v)
6
Orthogonality Condition
u·v =0
Example: If u = (1, 2) and v = (−2, 1), then u·v = 1·(−2) + 2·1 = 0, confirming orthogonality.
Implementation:
import numpy as np
u = np.array([1, 2])
v = np.array([-2, 1])
result = np.dot(u, v)
is_orthogonal = result == 0
7
Matrix Addition
A + B = [[a_11 + b_11, a_12 + b_12], [a_21 + b_21, a_22 + b_22]]
Example: If A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], then A + B = [[6, 8], [10, 12]].
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = A + B
8
Matrix Scalar Multiplication
αA = [[αa_11, αa_12], [αa_21, αa_22]]
Example: If α = 2 and A = [[1, 2], [3, 4]], then αA = [[2, 4], [6, 8]].
Implementation:
import numpy as np
alpha = 2
A = np.array([[1, 2], [3, 4]])
result = alpha * A
9
Matrix-Vector Multiplication
Ax = (a_11 x_1 + a_12 x_2, a_21 x_1 + a_22 x_2)
Example: If A = [[1, 2], [3, 4]] and x = (5, 6), then Ax = (17, 39).
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([5, 6])
result = np.dot(A, x)
10
Matrix Multiplication
C = AB,   c_ij = Σ_{k=1}^{n} a_ik b_kj
Example: If A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], then AB = [[19, 22], [43, 50]].
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.dot(A, B)
11
Transpose of a Matrix
Aᵀ = [[a_11, a_12], [a_21, a_22]]ᵀ = [[a_11, a_21], [a_12, a_22]]
Example: If A = [[1, 2], [3, 4]], then Aᵀ = [[1, 3], [2, 4]].
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
result = A.T
12
Determinant of a 2×2 Matrix
det(A) = a_11 a_22 − a_12 a_21
Example: If A = [[3, 8], [4, 6]], then det(A) = 3·6 − 8·4 = −14.
Implementation:
import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.det(A)
13
Inverse of a 2×2 Matrix
A⁻¹ = (1 / det(A)) [[a_22, −a_12], [−a_21, a_11]],   det(A) ≠ 0
Example: If A = [[3, 8], [4, 6]], then det(A) = −14 and A⁻¹ = (1/−14) [[6, −8], [−4, 3]].
Implementation:
import numpy as np
A = np.array([[3, 8], [4, 6]])
result = np.linalg.inv(A)
14
Cramer’s Rule
x_i = det(A_i) / det(A),   det(A) ≠ 0
Example: For A = [[2, 1], [1, 3]] and b = (5, 7), A_1 = [[5, 1], [7, 3]], A_2 = [[2, 5], [1, 7]], and det(A) = 5, so x_1 = det(A_1)/det(A) = 8/5 and x_2 = det(A_2)/det(A) = 9/5.
Implementation:
import numpy as np
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])
det_A = np.linalg.det(A)
x = [np.linalg.det(np.column_stack([b if i == j else A[:, j]
                                    for j in range(A.shape[1])])) / det_A
     for i in range(A.shape[1])]
15
Inverse of a Square Matrix
A⁻¹ = (1 / det(A)) adj(A),   det(A) ≠ 0
Example: If A = [[4, 7], [2, 6]], the inverse is computed using cofactor expansion and scaling by 1/det(A).
Implementation:
import numpy as np
A = np.array([[4, 7], [2, 6]])
result = np.linalg.inv(A)
16
Determinant of a Triangular Matrix
det(A) = Π_{i=1}^{n} a_ii
Example: If A = [[2, 1, 0], [0, 3, 4], [0, 0, 5]], then det(A) = 2·3·5 = 30.
Implementation:
import numpy as np
A = np.array([[2, 1, 0], [0, 3, 4], [0, 0, 5]])
result = np.prod(np.diag(A))
17
Rank-Nullity Theorem
rank(A) + nullity(A) = n
Implementation:
import numpy as np
from numpy.linalg import matrix_rank
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rank = matrix_rank(A)
nullity = A.shape[1] - rank
18
Hadamard (Elementwise) Product
C = A ∘ B = [[a_11 b_11, a_12 b_12], [a_21 b_21, a_22 b_22]]
Example: If A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], then C = [[5, 12], [21, 32]].
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.multiply(A, B)
19
Outer Product
C = u ⊗ v = [[u_1 v_1, u_1 v_2, …, u_1 v_n], [u_2 v_1, u_2 v_2, …, u_2 v_n], …, [u_m v_1, u_m v_2, …, u_m v_n]]
Example: If u = (1, 2) and v = (3, 4, 5), then u ⊗ v = [[3, 4, 5], [6, 8, 10]].
Implementation:
import numpy as np
u = np.array([1, 2])
v = np.array([3, 4, 5])
result = np.outer(u, v)
20
Frobenius Norm
∥A∥_F = √( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_ij|² )
Example: If A = [[1, 2], [3, 4]], then ∥A∥_F = √(1² + 2² + 3² + 4²) = √30.
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.linalg.norm(A, 'fro')
21
Matrix Norm Inequality
∥Ax∥ ≤ ∥A∥∥x∥
Example: For A = [[1, 2], [3, 4]] and x = (1, 1), verify that ∥Ax∥ ≤ ∥A∥ ∥x∥.
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
x = np.array([1, 1])
left = np.linalg.norm(np.dot(A, x))
right = np.linalg.norm(A) * np.linalg.norm(x)
inequality_holds = left <= right
22
Matrix Trace
Tr(A) = Σ_{i=1}^{n} a_ii
Example: If A = [[1, 2], [3, 4]], then Tr(A) = 1 + 4 = 5.
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
result = np.trace(A)
23
Trace of a Product
Tr(AB) = Tr(BA)
Example: For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], verify that Tr(AB) = Tr(BA).
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
trace1 = np.trace(np.dot(A, B))
trace2 = np.trace(np.dot(B, A))
equality_holds = trace1 == trace2
24
Block Matrix Multiplication
[[A, B], [C, D]] [[E, F], [G, H]] = [[AE + BG, AF + BH], [CE + DG, CF + DH]]
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.array([[9, 10], [11, 12]])
D = np.array([[13, 14], [15, 16]])
E = np.array([[17, 18], [19, 20]])
F = np.array([[21, 22], [23, 24]])
G = np.array([[25, 26], [27, 28]])
H = np.array([[29, 30], [31, 32]])
top_left = np.dot(A, E) + np.dot(B, G)
top_right = np.dot(A, F) + np.dot(B, H)
bottom_left = np.dot(C, E) + np.dot(D, G)
bottom_right = np.dot(C, F) + np.dot(D, H)
result = np.block([[top_left, top_right], [bottom_left, bottom_right]])
26
Kronecker Product
C = A ⊗ B = [[a_11 B, a_12 B], [a_21 B, a_22 B]]
Example: If A = [[1, 2], [3, 4]] and B = [[0, 5], [6, 7]], compute A ⊗ B.
Implementation:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])
result = np.kron(A, B)
27
SECTION 2 : PROBABILITY AND
STATISTICS
Conditional Probability
P(A | B) = P(A ∩ B) / P(B),   P(B) > 0
Example: If P(A ∩ B) = 0.2 and P(B) = 0.5, then P(A | B) = 0.2 / 0.5 = 0.4.
Implementation:
P_A_and_B = 0.2
P_B = 0.5
P_A_given_B = P_A_and_B / P_B
28
Law of Total Probability
X
P (A) = P (A | Bi )P (Bi )
i
Implementation:
P_A_given_B1 = 0.3
P_A_given_B2 = 0.7
P_B1 = 0.4
P_B2 = 0.6
P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2
29
Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B),   P(B) > 0
Implementation:
P_B_given_A = 0.8
P_A = 0.3
P_B = 0.5
P_A_given_B = (P_B_given_A * P_A) / P_B
30
Expectation
X
E[X] = xi P (X = xi )
i
Implementation:
X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))
31
Variance
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
Implementation:
X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
expectation = sum(x * p for x, p in zip(X, P_X))
expectation_X2 = sum(x**2 * p for x, p in zip(X, P_X))
variance = expectation_X2 - expectation**2
32
Standard Deviation
σ(X) = √Var(X)
Example: If Var(X) = 0.29, then σ(X) = √0.29 ≈ 0.54.
Implementation:
variance = 0.29
std_dev = variance**0.5
33
Covariance
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Implementation:
X = [1, 2]
Y = [3, 4]
P_XY = [0.5, 0.5]
E_X = sum(x * p for x, p in zip(X, P_XY))
E_Y = sum(y * p for y, p in zip(Y, P_XY))
covariance = sum((x - E_X) * (y - E_Y) * p for x, y, p in zip(X, Y, P_XY))
34
Correlation
ρ(X, Y) = Cov(X, Y) / (σ(X) σ(Y))
Implementation:
covariance = 0.25
std_X = 0.5
std_Y = 1.0
correlation = covariance / (std_X * std_Y)
35
Probability Mass Function (PMF)
P(X = x) = p_i if x = x_i, and 0 otherwise
Implementation:
X = [1, 2, 3]
P_X = [0.2, 0.5, 0.3]
def pmf(x):
return P_X[X.index(x)] if x in X else 0
36
Probability Density Function (PDF)
f_X(x) ≥ 0,   ∫_{−∞}^{∞} f_X(x) dx = 1
Implementation:
import numpy as np
from scipy.stats import norm
x = 0 # example point
pdf_value = norm.pdf(x)
37
Joint Probability
P (A ∩ B) = P (A | B)P (B)
Implementation:
P_A_given_B = 0.4
P_B = 0.5
P_A_and_B = P_A_given_B * P_B
38
CDF (Cumulative Distribution Function)
FX (x) = P (X ≤ x)
Implementation:
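A minimal sketch, assuming SciPy is available and using the standard normal as the example distribution:
from scipy.stats import norm
x = 1.0
cdf_value = norm.cdf(x)   # P(X <= 1) ≈ 0.841 for the standard normal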
39
Entropy (discrete)
H(X) = − Σ_i P(X = x_i) log₂ P(X = x_i)
Example: If P(X) = {0.5, 0.5}, then H(X) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1.
Implementation:
import numpy as np
P_X = [0.5, 0.5]
entropy = -sum(p * np.log2(p) for p in P_X if p > 0)
40
Conditional Expectation
X
E[X | Y ] = xP (X = x | Y )
x
Implementation:
X = [1, 2]
P_X_given_Y = [0.7, 0.3]
conditional_expectation = sum(x * p for x, p in zip(X, P_X_given_Y))
41
Law of Iterated Expectations
E[X] = E[E[X | Y ]]
Implementation:
E_X_given_Y = [3, 5]
P_Y = [0.6, 0.4]
E_X = sum(e * p for e, p in zip(E_X_given_Y, P_Y))
42
Marginal Probability
X
P (A) = P (A ∩ B)
B
Implementation:
43
Skewness
Skewness(X) = E[(X − µ)³] / σ³
Implementation:
import numpy as np
X = np.array([1, 2, 3])
mu = np.mean(X)
sigma = np.std(X)
skewness = np.mean(((X - mu) / sigma)**3)
44
Kurtosis
Kurtosis(X) = E[(X − µ)⁴] / σ⁴
Implementation:
import numpy as np
X = np.array([1, 2, 3])
mu = np.mean(X)
sigma = np.std(X)
kurtosis = np.mean(((X - mu) / sigma)**4)
45
Binary Cross-Entropy (special case)
BCE(y, ŷ) = −(1/n) Σ_{i=1}^{n} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]
Example: For y = [1, 0] and ŷ = [0.8, 0.2], BCE = −(1/2)(log(0.8) + log(0.8)).
Implementation:
import numpy as np
y = np.array([1, 0])
y_hat = np.array([0.8, 0.2])
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
46
Variance (Alternative)
Var(X) = E[X²] − (E[X])²
Example: For X = {1, 2, 3} with equal probabilities, E[X²] = (1² + 2² + 3²)/3 ≈ 4.67 and (E[X])² = 2² = 4, so Var(X) ≈ 0.67.
Implementation:
import numpy as np
X = np.array([1, 2, 3])
E_X2 = np.mean(X**2)
E_X = np.mean(X)
variance = E_X2 - E_X**2
47
SECTION 3 : CALCULUS
Derivative (Limit Definition)
f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
Example: For f(x) = x², f′(x) = lim_{h→0} [(x + h)² − x²] / h = 2x.
Implementation:
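A minimal numerical sketch (the forward-difference step h and the test function are illustrative assumptions):
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h
f = lambda x: x**2
print(derivative(f, 2.0))   # ≈ 4.0, matching f'(x) = 2x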
48
Power Rule
d n
x = nxn−1
dx
Implementation:
49
Product Rule
d
[u(x)v(x)] = u′ (x)v(x) + u(x)v ′ (x)
dx
Implementation:
50
Quotient Rule
d/dx [u(x)/v(x)] = [u′(x)v(x) − u(x)v′(x)] / v(x)²
Example: For f(x) = x²/eˣ, f′(x) = (2x eˣ − x² eˣ) / e²ˣ = (2x − x²) e⁻ˣ.
Implementation:
51
Chain Rule
d
f (g(x)) = f ′ (g(x))g ′ (x)
dx
Implementation:
52
Logarithmic Derivative
d 1
ln(x) = , x>0
dx x
Implementation:
import numpy as np
def log_derivative(x):
return 1 / x
53
Exponential Derivative
d x
e = ex
dx
Implementation:
import numpy as np
def exp_derivative(x):
return np.exp(x)
54
Integral of a Power Function
∫ xⁿ dx = xⁿ⁺¹ / (n + 1) + C,   n ≠ −1
Example: For f(x) = x², ∫ x² dx = x³/3 + C.
Implementation:
55
Fundamental Theorem of Calculus
∫_a^b f(x) dx = F(b) − F(a),   where F′(x) = f(x)
Example: For f(x) = x² over [1, 3], ∫_1^3 x² dx = [x³/3]_1^3 = 27/3 − 1/3 = 26/3.
Implementation:
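A minimal sketch assuming SciPy is available for numerical integration:
from scipy.integrate import quad
result, error = quad(lambda x: x**2, 1, 3)   # ≈ 8.667 = 26/3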
56
Partial Derivatives
Example: For f(x, y) = x² + y², ∂f/∂x = 2x and ∂f/∂y = 2y.
Implementation:
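A minimal symbolic sketch, assuming SymPy is available:
import sympy as sp
x, y = sp.symbols('x y')
f = x**2 + y**2
print(sp.diff(f, x), sp.diff(f, y))   # 2*x 2*y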
57
Gradient
∇f(x) = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n)
Example: For f(x, y) = x² + y², ∇f(x, y) = (2x, 2y).
Implementation:
import numpy as np
def gradient(f, point, h=1e-5):
grad = np.zeros(len(point))
for i in range(len(point)):
args = point.copy()
args[i] += h
grad[i] = (f(*args) - f(*point)) / h
return grad
58
Second Derivative (Hessian)
H(f) is the matrix of second partial derivatives, with entries [H(f)]_ij = ∂²f/∂x_i∂x_j.
Example: For f(x, y) = x² + y², the Hessian is H(f) = [[2, 0], [0, 2]].
Implementation:
import numpy as np
def hessian(f, point, h=1e-5):
    # Finite-difference approximation of the matrix of second partial derivatives
    point = np.asarray(point, dtype=float)
    n = len(point)
    hess = np.zeros((n, n))
    f_orig = f(*point)
    for i in range(n):
        for j in range(n):
            f_i = f(*(point + h * np.eye(n)[i]))
            f_j = f(*(point + h * np.eye(n)[j]))
            f_ij = f(*(point + h * (np.eye(n)[i] + np.eye(n)[j])))
            hess[i, j] = (f_ij - f_i - f_j + f_orig) / (h ** 2)
    return hess
60
Directional Derivative
Dv f (x) = ∇f (x) · v
Example: For f(x, y) = x² + y², ∇f(x, y) = (2x, 2y). In the direction v = (1, 0), D_v f(x, y) = 2x.
Implementation:
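A minimal NumPy sketch evaluated at the point (1, 2) for f(x, y) = x² + y² (illustrative values):
import numpy as np
grad = np.array([2 * 1.0, 2 * 2.0])        # ∇f at (1, 2)
v = np.array([1.0, 0.0])                   # direction (assumed unit length)
directional_derivative = np.dot(grad, v)   # = 2x = 2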
61
Higher-Order Partial Derivatives
∂ᵏf / (∂x_1^{p_1} ∂x_2^{p_2} ⋯ ∂x_n^{p_n})
Example: For f(x, y) = x²y, ∂²f/∂x∂y = 2x.
Implementation:
62
Total Derivative
df/dt = Σ_{i=1}^{n} (∂f/∂x_i)(dx_i/dt)
Example: If f(x, y) = x² + y², x = t, and y = t², then df/dt = 2x·1 + 2y·2t = 2t + 4t³.
Implementation:
63
Implicit Differentiation
dy/dx = −(∂F/∂x) / (∂F/∂y)
Example: For F(x, y) = x² + y² − 1 = 0, dy/dx = −x/y.
Implementation:
64
Taylor Series Expansion
f(x) ≈ f(a) + f′(a)(x − a) + (f′′(a)/2!)(x − a)² + ⋯
Example: For f(x) = eˣ near a = 0, f(x) ≈ 1 + x + x²/2 + ⋯.
Implementation:
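A minimal sketch of the truncated expansion of eˣ around a = 0 (the truncation order is an illustrative choice):
import numpy as np
from math import factorial
def taylor_exp(x, order=5):
    return sum(x**k / factorial(k) for k in range(order + 1))
print(taylor_exp(1.0), np.exp(1.0))   # ≈ 2.7167 vs. 2.7183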
65
Jacobian Matrix
J(f) is the m × n matrix with entries [J(f)]_ij = ∂f_i/∂x_j.
Example: For f(x, y) = (x² + y, y² + x), the Jacobian is [[2x, 1], [1, 2y]].
Implementation:
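A minimal finite-difference sketch (the step size h and the test function are illustrative assumptions):
import numpy as np
def jacobian(f, point, h=1e-5):
    point = np.asarray(point, dtype=float)
    f0 = np.asarray(f(point))
    J = np.zeros((f0.size, point.size))
    for j in range(point.size):
        step = point.copy()
        step[j] += h
        J[:, j] = (np.asarray(f(step)) - f0) / h
    return J
f = lambda p: np.array([p[0]**2 + p[1], p[1]**2 + p[0]])
print(jacobian(f, [1.0, 2.0]))   # ≈ [[2, 1], [1, 4]]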
66
Arc Length of a Curve
L = ∫_a^b √( 1 + (dy/dx)² ) dx
Example: For y = x² over [0, 1], L = ∫_0^1 √(1 + (2x)²) dx.
Implementation:
67
Curvature of a Function
|y ′′ (x)|
κ(x) =
(1 + [y ′ (x)]2 )3/2
Implementation:
68
Integral by Parts
Z Z
′
uv dx = uv − u′ vdx
Implementation:
69
Volume of Revolution (Disk Method)
Z b
V =π [f (x)]2 dx
a
Implementation:
70
Surface Integral
∬_S f(x, y, z) dS = ∬_R f(x, y, g(x, y)) √( 1 + (∂g/∂x)² + (∂g/∂y)² ) dA
Implementation:
71
Divergence of a Vector Field
div F = ∇·F = ∂F_1/∂x + ∂F_2/∂y + ∂F_3/∂z
Example: For F = (x, y, z), div F = 1 + 1 + 1 = 3.
Implementation:
72
Curl of a Vector Field
curl F = ∇ × F = det [[i, j, k], [∂/∂x, ∂/∂y, ∂/∂z], [F_1, F_2, F_3]]
Example: For F = (0, 0, xy), curl F = (x, −y, 0).
Implementation:
73
SECTION 4 : OPTIMIZATION
Gradient Descent
θ(t+1) = θ(t) − η ∇J(θ(t))
Implementation:
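A minimal NumPy sketch (the quadratic objective J(θ) = ∥θ∥² and the learning rate are illustrative assumptions):
import numpy as np
def grad_J(theta):
    return 2 * theta          # gradient of the toy objective ||θ||²
theta, eta = np.array([1.0, -2.0]), 0.1
for _ in range(100):
    theta = theta - eta * grad_J(theta)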
74
Stochastic Gradient Descent (SGD)
Implementation:
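A minimal sketch using one randomly sampled example per update on a toy least-squares problem (data and step size are illustrative assumptions):
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
theta, eta = np.zeros(2), 0.05
for _ in range(1000):
    i = rng.integers(len(X))
    grad = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of a single-sample squared error
    theta -= eta * grad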
75
Momentum-based Gradient Descent
Implementation:
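A minimal sketch assuming the standard momentum update v ← βv + η∇J(θ), θ ← θ − v, on a toy objective:
import numpy as np
def grad_J(theta):
    return 2 * theta
theta, v = np.array([1.0, -2.0]), np.zeros(2)
eta, beta = 0.1, 0.9
for _ in range(100):
    v = beta * v + eta * grad_J(theta)
    theta = theta - v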
76
Nesterov Accelerated Gradient (NAG)
Implementation:
77
RMSProp
s(t+1) = β s(t) + (1 − β) [∇J(θ(t))]²,   θ(t+1) = θ(t) − (η / √(s(t+1) + ϵ)) ∇J(θ(t))
Example: For β = 0.9, RMSProp adapts the step size for each param-
eter, stabilizing updates.
Implementation:
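A minimal NumPy sketch of the update above (the toy objective and hyperparameters are illustrative assumptions):
import numpy as np
def grad_J(theta):
    return 2 * theta
theta, s = np.array([1.0, -2.0]), np.zeros(2)
eta, beta, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad_J(theta)
    s = beta * s + (1 - beta) * g**2
    theta = theta - eta * g / np.sqrt(s + eps)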
78
Adam Optimization
Implementation:
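A minimal sketch assuming the standard Adam recursions (first- and second-moment estimates with bias correction) and default hyperparameters, on a toy objective:
import numpy as np
def grad_J(theta):
    return 2 * theta
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)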
79
Regularized Optimization Objective
Implementation:
80
Learning Rate Decay
η_t = η_0 / (1 + γt)
Implementation:
81
Gradient Clipping
g = clip(g, −τ, τ )
Implementation:
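A minimal NumPy sketch with an illustrative threshold τ = 1:
import numpy as np
tau = 1.0
g = np.array([3.0, -0.5])
g_clipped = np.clip(g, -tau, tau)   # elementwise clipping to [−τ, τ]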
82
Minibatch Gradient Descent
Implementation:
83
Coordinate Descent
Implementation:
84
Elastic Net Regularization
Implementation:
85
Adagrad Optimization
θ(t+1) = θ(t) − (η / √(G(t) + ϵ)) ∇J(θ(t)),   G(t) = Σ_{i=1}^{t} [∇J(θ(i))]²
Implementation:
86
AdamW Optimization
θ(t+1) = θ(t) − (η / (√ŝ + ϵ)) m̂ − λ θ(t)
Implementation:
87
Momentum “Heavy Ball” Method
Implementation:
88
Projection / Projected Gradient Descent
Example: For C = ∥θ∥2 ≤ 1, project θ onto the unit ball after each
step.
Implementation:
89
Newton’s Method
Implementation:
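A minimal sketch assuming the standard Newton update θ ← θ − H⁻¹∇J(θ), applied to a quadratic objective (the matrix A and vector b are illustrative):
import numpy as np
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of J(θ) = ½θᵀAθ − bᵀθ
b = np.array([1.0, 1.0])
theta = np.zeros(2)
for _ in range(10):
    grad = A @ theta - b
    theta = theta - np.linalg.solve(A, grad)   # converges in one step for a quadratic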
90
Proximal Gradient Method
Example: For R(θ) = ∥θ∥1 , compute soft thresholding for each pa-
rameter.
Implementation:
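A minimal sketch of one proximal step with the ℓ₁ proximal operator (soft thresholding) mentioned in the example; the toy gradient and constants are illustrative assumptions:
import numpy as np
def soft_threshold(theta, t):
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)
theta, eta, lam = np.array([0.8, -0.1]), 0.1, 0.5
grad = 2 * theta                     # gradient of a toy smooth term
theta = soft_threshold(theta - eta * grad, eta * lam)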
91
Proximal Gradient with L1 (ISTA)
Implementation:
92
Penalty Method
J_penalty(θ) = J(θ) + (1/µ) h(θ)²
Example: For h(θ) = ∥θ∥₂² − 1, penalize deviations from the unit-ball constraint.
Implementation:
93
Augmented Lagrangian Method
L(θ, λ, µ) = J(θ) + λ h(θ) + (µ/2) h(θ)²
Example: For J(θ) = ∥θ∥₂² and h(θ) = ∥θ∥₁ − 1, compute updates for θ, λ, and µ.
Implementation:
94
Dual Ascent Method
Implementation:
95
Trust Region Method
∆* = arg min_∆ { J(θ(t)) + ∇J(θ(t))ᵀ∆ + ½ ∆ᵀH∆ : ∥∆∥ ≤ ∆_max },   θ(t+1) = θ(t) + ∆*
Implementation:
96
Barrier Method
J_barrier(θ) = J(θ) − (1/µ) Σ_{i=1}^{m} ln(−h_i(θ))
Example: For h(θ) = ∥θ∥1 − 1, use − ln(1 − ∥θ∥1 ) as the barrier term.
Implementation:
97
Simulated Annealing
P(∆E) = exp(−∆E / T)
Implementation:
import numpy as np
def simulated_annealing(loss, theta, T, cooling_rate, steps):
for _ in range(steps):
new_theta = theta + np.random.uniform(-1, 1, size=theta.shape)
delta_E = loss(new_theta) - loss(theta)
if delta_E < 0 or np.exp(-delta_E / T) > np.random.rand():
theta = new_theta
T *= cooling_rate
return theta
98
SECTION 5 : REGRESSION
Linear Regression Model
ŷ = Xβ + ϵ
Implementation:
import numpy as np
X = np.array([[1, 2], [3, 4]])
beta = np.array([2, 3])
y_pred = X @ beta
99
Ordinary Least Squares (OLS)
β = (XT X)−1 XT y
Example: For X = [[1, 2], [3, 4]] and y = (5, 11), compute β.
Implementation:
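A minimal NumPy sketch using the data from the example (np.linalg.lstsq would be the numerically safer choice):
import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([5, 11])
beta = np.linalg.inv(X.T @ X) @ X.T @ y   # ≈ [1, 2]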
100
Mean Squared Error (MSE)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
Example: For y = [1, 2, 3] and ŷ = [1.1, 1.9, 3.2], compute the MSE.
Implementation:
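A minimal NumPy sketch using the values from the example:
import numpy as np
y = np.array([1, 2, 3])
y_hat = np.array([1.1, 1.9, 3.2])
mse = np.mean((y - y_hat)**2)   # = 0.02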
101
Gradient of the MSE Loss
∂MSE/∂β = −(2/n) Xᵀ(y − Xβ)
Example: Compute the gradient for X = [[1, 2], [3, 4]], y = (5, 11), and β = (1, 1).
Implementation:
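A minimal NumPy sketch using the values from the example:
import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([5, 11])
beta = np.array([1, 1])
grad = -2 / len(y) * X.T @ (y - X @ beta)   # = [-14, -20]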
102
Coefficient of Determination (R²)
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²
Implementation:
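A minimal NumPy sketch (the target and prediction values are illustrative assumptions):
import numpy as np
y = np.array([1, 2, 3])
y_hat = np.array([1.1, 1.9, 3.2])
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - np.mean(y))**2)   # = 0.97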
103
Adjusted R²
(1 − R2 )(n − 1)
R̄2 = 1 −
n−p−1
Implementation:
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
104
Mean Absolute Error (MAE)
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
Example: For y = [1, 2, 3] and ŷ = [1.1, 1.9, 3.2], compute the MAE.
Implementation:
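A minimal NumPy sketch using the values from the example:
import numpy as np
y = np.array([1, 2, 3])
y_hat = np.array([1.1, 1.9, 3.2])
mae = np.mean(np.abs(y - y_hat))   # ≈ 0.133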
105
Weighted Least Squares (WLS)
β = (XT WX)−1 XT Wy
Implementation:
import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([5, 11])
W = np.diag([1, 2])
beta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y
106
Polynomial Regression Hypothesis
ŷ = β 0 + β 1 x + β 2 x2 + · · · + β n xn
Example: Fit y = 2x + x2 .
Implementation:
107
Non-Linear Regression
ŷ = f (X, β) + ϵ
Implementation:
108
Maximum Likelihood Estimation for Regression
n
Y
β̂ = arg max p(yi | Xi , β)
β
i=1
Implementation:
109
Empirical Risk Minimization
n
1X
θ̂ = arg min ℓ(yi , f (Xi , θ))
θ n i=1
Explanation: ERM minimizes the average loss over the training data
to estimate the model parameters.
Implementation:
110
Logistic Regression Hypothesis
ŷ = σ(Xβ),   σ(z) = 1 / (1 + e^{−z})
Example: For X = [[1, 2], [3, 4]] and β = (1, −1), compute ŷ.
Implementation:
def sigmoid(z):
return 1 / (1 + np.exp(-z))
y_pred = sigmoid(X @ beta)
111
Binary Cross-Entropy Loss
L = −(1/n) Σ_{i=1}^{n} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]
Implementation:
112
Cross-Entropy Loss (Multi-Class)
L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log(ŷ_ij)
Example: For y = [1, 0, 0] and ŷ = [0.8, 0.1, 0.1], compute the loss.
Implementation:
113
Hinge Loss for SVM
L = (1/n) Σ_{i=1}^{n} max(0, 1 − y_i ŷ_i)
Example: For y = [1, −1] and ŷ = [0.8, −0.5], compute the loss.
Implementation:
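A minimal NumPy sketch using the values from the example:
import numpy as np
y = np.array([1, -1])
y_hat = np.array([0.8, -0.5])
hinge = np.mean(np.maximum(0, 1 - y * y_hat))   # = (0.2 + 0.5) / 2 = 0.35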
114
Lasso Regression Objective
L = (1/(2n)) ∥y − Xβ∥₂² + λ∥β∥₁
Implementation:
115
Ridge Regression Objective
L = (1/(2n)) ∥y − Xβ∥₂² + λ∥β∥₂²
Implementation:
116
Negative Binomial Regression
P(Y = y) = [Γ(y + α) / (Γ(y + 1) Γ(α))] (α / (α + µ̂))^α (µ̂ / (α + µ̂))^y
Implementation:
117
Poisson Regression Model
µ̂ = eXβ
Implementation:
118
Gamma Regression Objective
L = (1/ϕ) Σ_{i=1}^{n} [ −log(µ̂_i) + y_i / µ̂_i ]
Implementation:
119
Probit Regression Model
P (y = 1) = Φ(Xβ)
Implementation:
120
Multinomial Logistic Regression
P(y = k) = e^{Xβ_k} / Σ_{j=1}^{K} e^{Xβ_j}
Implementation:
121
Quantile Regression Loss
L = Σ_{i=1}^{n} ρ_τ(y_i − ŷ_i),   ρ_τ(e) = max(τe, (τ − 1)e)
Implementation:
122
Huber Loss
L = Σ_{i=1}^{n} L_δ(y_i − ŷ_i),   L_δ(e) = ½e² if |e| ≤ δ,   δ|e| − ½δ² otherwise
Implementation:
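A minimal NumPy sketch (the residuals and δ = 1 are illustrative assumptions):
import numpy as np
def huber(e, delta=1.0):
    return np.where(np.abs(e) <= delta, 0.5 * e**2, delta * np.abs(e) - 0.5 * delta**2)
loss = np.sum(huber(np.array([0.3, -2.0])))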
123
SECTION 6 : NEURAL
NETWORKS
Perceptron Update Rule
w ← w + η (y − ŷ) x
Implementation:
w += eta * (y - y_pred) * x
124
Forward Propagation (Single Layer)
ŷ = σ(Xw + b)
125
Sigmoid Activation
1
σ(z) =
1 + e−z
Implementation:
def sigmoid(z):
return 1 / (1 + np.exp(-z))
126
Tanh Activation
tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})
Implementation:
def tanh(z):
return np.tanh(z)
127
ReLU Activation
ReLU(z) = max(0, z)
Implementation:
def relu(z):
return np.maximum(0, z)
128
Heaviside Step Activation
H(z) = 1 if z ≥ 0,   0 if z < 0
Implementation:
def heaviside(z):
return np.where(z >= 0, 1, 0)
129
Leaky ReLU Activation
Leaky ReLU(z) = z if z ≥ 0,   αz if z < 0
Implementation:
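A minimal NumPy sketch (α = 0.01 is an illustrative default):
import numpy as np
def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)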
130
ELU Activation (Exponential Linear Unit)
ELU(z) = z if z ≥ 0,   α(e^z − 1) if z < 0
Implementation:
131
Softmax Function
Softmax(z)_i = e^{z_i} / Σ_{j=1}^{n} e^{z_j}
Implementation:
def softmax(z):
exp_z = np.exp(z - np.max(z)) # Numerical stability
return exp_z / exp_z.sum(axis=0)
132
Loss Function for Multi-Class (Cross-Entropy)
L = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log(ŷ_ij)
Example: For y = [1, 0, 0] and ŷ = [0.8, 0.1, 0.1], compute the loss.
Implementation:
133
Gradient Descent for Neural Networks
θ(t+1) = θ(t) − η ∂L/∂θ
134
Backpropagation (Gradient for Weights)
∂L/∂w_ij = δ_j a_i,   δ_j = ∂L/∂z_j = (∂L/∂a_j) σ′(z_j)
Implementation:
135
Mean Squared Error Loss
L = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
Implementation:
136
Binary Cross-Entropy Loss
L = −(1/n) Σ_{i=1}^{n} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]
Implementation:
137
Batch Normalization
x̂ = (x − µ) / √(σ² + ϵ),   y = γ x̂ + β
Implementation:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
gamma, beta, epsilon = 1.0, 0.0, 1e-5
mean = np.mean(x)
var = np.var(x)
x_norm = (x - mean) / np.sqrt(var + epsilon)
y = gamma * x_norm + beta
138
Dropout Regularization
â_i = 0 with probability p,   â_i = a_i / (1 − p) otherwise
Implementation:
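A minimal NumPy sketch of inverted dropout (kept activations are scaled by 1/(1 − p)); the drop probability is an illustrative assumption:
import numpy as np
def dropout(a, p=0.5):
    mask = np.random.rand(*a.shape) >= p
    return mask * a / (1 - p)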
139
Gradient of Sigmoid
σ′(z) = σ(z)(1 − σ(z))
Implementation:
def sigmoid_prime(z):
s = sigmoid(z)
return s * (1 - s)
140
RMSProp for Weight Updates
s(t+1) = β s(t) + (1 − β) g²,   w(t+1) = w(t) − (η / √(s(t+1) + ϵ)) g
Explanation: RMSProp adapts the learning rate for each weight based
on the moving average of squared gradients.
Implementation:
141
Xavier (Glorot) Initialization
w ∼ U( −√(6 / (n_in + n_out)), √(6 / (n_in + n_out)) )
Implementation:
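A minimal NumPy sketch (the layer sizes are illustrative assumptions):
import numpy as np
n_in, n_out = 128, 64
limit = np.sqrt(6 / (n_in + n_out))
W = np.random.uniform(-limit, limit, size=(n_in, n_out))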
142
L2 Regularization (Weight Decay)
L = L_0 + (λ/2) ∥w∥₂²
143
Heaviside vs. Hard Sigmoid
Implementation:
def hard_sigmoid(z):
return np.clip(0.2 * z + 0.5, 0, 1)
144
Swish Activation
Swish(z) = z · σ(z)
Implementation:
def swish(z):
return z * sigmoid(z)
145
Maxout Activation
Maxout(z) = max zi
i∈[1,k]
Implementation:
def maxout(z):
return np.max(z, axis=0)
146
Sparse Categorical Cross-Entropy
L = −(1/n) Σ_{i=1}^{n} log(ŷ_{i, y_i})
Implementation:
147
Cosine Similarity / Cosine Loss
Cosine Similarity = u·v / (∥u∥ ∥v∥)
Implementation:
148
SECTION 7 : CLUSTERING
Euclidean Distance
d(u, v) = √( Σ_{i=1}^{n} (u_i − v_i)² )
Example: For u = [1, 2] and v = [3, 4], d(u, v) = √((3 − 1)² + (4 − 2)²) = √8.
Implementation:
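A minimal NumPy sketch using the vectors from the example:
import numpy as np
u, v = np.array([1, 2]), np.array([3, 4])
dist = np.linalg.norm(u - v)   # = √8 ≈ 2.828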
149
Manhattan Distance
d(u, v) = Σ_{i=1}^{n} |u_i − v_i|
Implementation:
150
Cosine Similarity
u·v
Cosine Similarity =
∥u∥∥v∥
Implementation:
151
Jaccard Similarity (Binary Data)
|u ∩ v|
Jaccard Similarity =
|u ∪ v|
Implementation:
152
k-Means Objective
J = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} ∥x_j − µ_i∥²
Example: For points [1, 2] and [3, 4] in cluster C_1 with centroid [2, 3], compute J.
Implementation:
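A minimal NumPy sketch for the single cluster in the example:
import numpy as np
points = np.array([[1, 2], [3, 4]])
centroid = np.array([2, 3])
J = np.sum(np.linalg.norm(points - centroid, axis=1)**2)   # = 4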
153
Centroid Update Rule (k-Means)
µ_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j
Example: For cluster C_1 = {[1, 2], [3, 4]}, µ_1 = [2, 3].
Implementation:
154
Elbow Method for Optimal k
k X
X
J(k) = ∥xj − µi ∥2
i=1 xj ∈Ci
Implementation:
155
k-Medoids Objective
k X
X
J= d(xj , mi )
i=1 xj ∈Ci
Implementation:
156
Fuzzy c-Means Objective
J = Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m ∥x_j − c_i∥²
Implementation:
157
Silhouette Score
S = (b − a) / max(a, b),   a = intra-cluster distance, b = nearest-cluster distance
Implementation:
158
Hierarchical Clustering Dendrogram
d(C_1, C_2) = min_{x ∈ C_1, y ∈ C_2} ∥x − y∥
Implementation:
159
Ward’s Linkage
d(C_1, C_2) = (|C_1||C_2| / (|C_1| + |C_2|)) ∥µ_1 − µ_2∥²
Implementation:
160
Single vs. Complete Linkage
Implementation:
161
Average Linkage
d_average(C_1, C_2) = (1 / (|C_1||C_2|)) Σ_{x ∈ C_1} Σ_{y ∈ C_2} ∥x − y∥
Implementation:
162
Minimum Spanning Tree Criterion
X
MST weight = w(u, v), w(u, v) = ∥u − v∥
(u,v)∈E
Implementation:
163
DBSCAN Core Point Condition
Implementation:
164
DBSCAN Density Condition
Implementation:
165
Cohesion Metric
k X
X
Cohesion = ∥xj − µi ∥2
i=1 xj ∈Ci
Implementation:
cohesion = sum((np.linalg.norm(X[labels == i]
    - centroids[i], axis=1)**2).sum() for i in range(k))
166
Separation Metric
k
X k
X
Separation = ∥µi − µj ∥2
i=1 j=i+1
Implementation:
separation = sum(np.linalg.norm(centroids[i]
- centroids[j])**2 for i in range(k) for j in range(i+1, k))
167
Soft Clustering Membership
u_ij = ∥x_j − c_i∥^{−2/(m−1)} / Σ_{k=1}^{c} ∥x_j − c_k∥^{−2/(m−1)}
Implementation:
168
Entropy for Clustering Evaluation
k X
X n
H=− Pij log Pij
i=1 j=1
Implementation:
169
Mutual Information for Clustering
I(U, V) = Σ_{i=1}^{|U|} Σ_{j=1}^{|V|} P_ij log( P_ij / (P_i P_j) )
Implementation:
170
F-Measure for Clustering
2 · Precision · Recall
F =
Precision + Recall
Implementation:
171
Adjusted Rand Index (ARI)
Explanation: ARI adjusts the Rand Index for chance, measuring clus-
tering similarity.
Implementation:
172
Normalized Mutual Information (NMI)
2I(U, V )
NMI =
H(U ) + H(V )
Implementation:
173
SECTION 8 : DIMENSIONALITY
REDUCTION
Implementation:
174
Covariance Matrix for PCA
S = (1/(n − 1)) (X − X̄)ᵀ(X − X̄)
Implementation:
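A minimal NumPy sketch (the data matrix is an illustrative assumption; np.cov(X, rowvar=False) gives the same result):
import numpy as np
X = np.random.rand(100, 3)
X_centered = X - X.mean(axis=0)
S = X_centered.T @ X_centered / (len(X) - 1)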
175
Eigen Decomposition for PCA
Sw = λw
Implementation:
176
SVD (Singular Value Decomposition)
X = UΣVT
Implementation:
U, S, Vt = np.linalg.svd(X, full_matrices=False)
177
Reconstruction Error for PCA
Implementation:
178
Explained Variance Ratio
Explained Variance Ratio_i = λ_i / Σ_{j=1}^{n} λ_j
Implementation:
179
Cumulative Explained Variance
Cumulative Explained Variance (k components) = Σ_{i=1}^{k} λ_i / Σ_{j=1}^{n} λ_j
Implementation:
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
180
Random Projection
Implementation:
181
Isomap Distance Matrix
Implementation:
182
MDS Stress Function
X 2
Stress = dij − dˆij
i<j
Implementation:
183
Multidimensional Scaling (MDS)
Implementation:
184
NMF (Non-Negative Matrix Factorization)
X ≈ WH, W ≥ 0, H ≥ 0
Implementation:
185
ICA (Independent Component Analysis) Objective
n
X
Maximize: log p(si ), where s = WX
i=1
Implementation:
186
Factor Analysis Model
X = ZΛ + ϵ, ϵ ∼ N (0, Ψ)
Implementation:
187
Kernel PCA Transformation
Implementation:
188
LDA (Fisher’s Criterion)
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
Implementation:
189
Robust PCA (RPCA)
X = L + S, ∥L∥∗ + λ∥S∥1
Implementation:
190
Hessian LLE
Implementation:
191
Laplacian Eigenmaps Objective
X
Minimize: wij ∥yi − yj ∥2 , W = Graph Weights
i,j
Implementation:
192
Autoencoder Reconstruction
X̂ = Decoder(Encoder(X))
Implementation:
193
Autoencoder Latent Representation
Z = Encoder(X)
Implementation:
latent_representation = encoder.predict(X)
194
Sparse PCA Objective
Implementation:
195
t-SNE Objective
Minimize: KL(P ∥ Q) = Σ_{i≠j} P_ij log( P_ij / Q_ij )
Implementation:
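A minimal sketch assuming scikit-learn is available (the data matrix is an illustrative assumption):
import numpy as np
from sklearn.manifold import TSNE
X = np.random.rand(100, 5)
X_embedded = TSNE(n_components=2).fit_transform(X)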
196
Gradient of t-SNE
∂KL X
=4 (Pij − Qij )(yi − yj )Qij
∂yi j
197
UMAP (Uniform Manifold Approximation and Pro-
jection)
Optimize: Σ_{i,j} w_ij ∥y_i − y_j∥² − λ Σ_{k,l} w_kl log(∥y_k − y_l∥)
Implementation:
import umap
umap_embedding = umap.UMAP(n_components=k).fit_transform(X)
198
SECTION 9 : PROBABILITY
DISTRIBUTIONS
Bernoulli Distribution
P(X = x) = p^x (1 − p)^{1−x},   x ∈ {0, 1}
Implementation:
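A minimal sketch assuming SciPy is available (p = 0.3 is an illustrative value):
from scipy.stats import bernoulli
p = 0.3
pmf_1 = bernoulli.pmf(1, p)        # = 0.3
samples = bernoulli.rvs(p, size=10)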
199
Binomial Distribution
P(X = k) = C(n, k) p^k (1 − p)^{n−k},   k ∈ {0, 1, …, n}
Example: For n = 5 and p = 0.5, P(X = 3) = C(5, 3)(0.5)³(0.5)² = 0.3125.
Implementation:
200
Poisson Distribution
P(X = k) = λ^k e^{−λ} / k!,   k ∈ {0, 1, 2, …}
Example: For λ = 3, P(X = 2) = 3² e⁻³ / 2! ≈ 0.224.
Implementation:
201
Uniform Distribution (Continuous)
1
f (x) = , x ∈ [a, b]
b−a
Implementation:
202
Discrete Uniform Distribution
1
P (X = x) = , x ∈ {1, 2, . . . , n}
n
Example: For n = 6, P (X = 3) = 16 .
Implementation:
203
Normal (Gaussian) Distribution
f(x) = (1 / √(2πσ²)) e^{−(x − µ)² / (2σ²)}
Implementation:
204
Exponential Distribution
f(x) = λ e^{−λx},   x ≥ 0
Implementation:
205
Geometric Distribution
P (X = k) = (1 − p)k−1 p, k ∈ {1, 2, . . .}
Implementation:
206
Hypergeometric Distribution
P(X = k) = C(K, k) C(N − K, n − k) / C(N, n)
Implementation:
207
Beta Distribution
xα−1 (1 − x)β−1
f (x) = , x ∈ [0, 1]
B(α, β)
Implementation:
208
Gamma Distribution
β α xα−1 e−βx
f (x) = , x>0
Γ(α)
Implementation:
209
Multinomial Distribution
P(X_1 = k_1, …, X_k = k_k) = (n! / (k_1! ⋯ k_k!)) p_1^{k_1} ⋯ p_k^{k_k}
Implementation:
210
Chi-Square Distribution
xk/2−1 e−x/2
f (x) = , x>0
2k/2 Γ(k/2)
Implementation:
211
Student’s t-Distribution
f(x) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + x²/ν)^{−(ν + 1)/2}
Implementation:
212
F-Distribution
Implementation:
213
Laplace Distribution
1 − |x−µ|
f (x) = e b
2b
Implementation:
214
Rayleigh Distribution
x −x2 /(2σ2 )
f (x) = e , x≥0
σ2
Implementation:
215
Triangular Distribution
f(x) = 2(x − a) / ((b − a)(c − a)) for a ≤ x < c,   f(x) = 2(b − x) / ((b − a)(b − c)) for c ≤ x ≤ b
Implementation:
216
Log-Normal Distribution
f(x) = (1 / (xσ√(2π))) e^{−(ln x − µ)² / (2σ²)},   x > 0
Implementation:
217
Arcsine Distribution
1
f (x) = p , x ∈ (0, 1)
π x(1 − x)
Implementation:
218
Beta-Binomial Distribution
n B(k + α, n − k + β)
P (X = k) =
k B(α, β)
Implementation:
219
Cauchy Distribution
f(x) = 1 / ( πγ [1 + ((x − x_0)/γ)²] )
Implementation:
220
Weibull Distribution
k x k−1 −(x/λ)k
f (x) = e , x≥0
λ λ
Implementation:
221
Pareto Distribution
αxαm
f (x) = , x ≥ xm
xα+1
Implementation:
222
Log-Cauchy Distribution
f(x) = 1 / ( xπγ [1 + ((ln x − x_0)/γ)²] ),   x > 0
223
SECTION 10 : REINFORCEMENT
LEARNING
Reward Function
R(s, a) = E[Reward | s, a]
Implementation:
224
Discounted Return
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1
Implementation:
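A minimal sketch (the reward sequence and γ are illustrative assumptions):
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9
G = sum(gamma**k * r for k, r in enumerate(rewards))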
225
Bellman Equation (State-Value Function)
Implementation:
226
Bellman Equation (Action-Value Function)
Implementation:
227
Temporal Difference (TD) Update
Implementation:
228
Monte Carlo Policy Evaluation
V (s) ← E[Gt | st = s]
Implementation:
229
Policy Improvement
Implementation:
def policy_improvement(Q):
return np.argmax(Q, axis=1)
230
Q-Learning Update
Q(s_t, a_t) ← Q(s_t, a_t) + α [ R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
Implementation:
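A minimal tabular sketch of one update (state/action indices, reward, and hyperparameters are illustrative assumptions):
import numpy as np
Q = np.zeros((5, 2))                 # 5 states, 2 actions
s, a, r, s_next = 0, 1, 1.0, 2
alpha, gamma = 0.1, 0.9
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])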
231
SARSA Update
Implementation:
232
Value Iteration Update
V(s) ← max_a [ R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]
Implementation:
233
Actor–Critic Policy Update
Explanation: The actor updates the policy using the advantage, while
the critic updates the value function to estimate the advantage.
Implementation:
234
Deterministic Policy Gradient
Implementation:
235
Discount Factor (γ)
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},   0 ≤ γ < 1
Implementation:
236
Expected SARSA
Implementation:
237
Eligibility Traces Update (TD(λ))
Implementation:
238
TD Error
Implementation:
239
Stochastic Gradient Descent in RL
θ ← θ − α∇θ L(θ)
Implementation:
240
Double Q-Learning
Q_1(s_t, a_t) ← Q_1(s_t, a_t) + α [ R_{t+1} + γ Q_2(s_{t+1}, arg max_a Q_1(s_{t+1}, a)) − Q_1(s_t, a_t) ]
Implementation:
241
Advantage Actor–Critic (A2C)
Implementation:
242
Off-Policy Evaluation (Importance Sampling)
E[Ĝ] = E[ ( Π_{t=0}^{T−1} π(a_t | s_t) / µ(a_t | s_t) ) G_t ]
Implementation:
243
Policy Gradient Update Rule
Implementation:
244
Soft Q-Learning Objective
Implementation:
245
Entropy-Regularized RL
Implementation:
246
Soft Actor–Critic (SAC)
Implementation:
247
Trust Region Policy Optimization (TRPO)
max_θ E[ (π_θ(a | s) / π_θ_old(a | s)) A(s, a) ],   subject to D_KL(π_θ ∥ π_θ_old) ≤ δ
248