Linear Algebra - Intuition, Math, Code
Mike X Cohen
2 February 2021
Contents
1 Introduction
1.1 What is linear algebra and why learn it?
1.2 About this book
1.3 Prerequisites
1.4 Exercises and code challenges
1.5 Online and other resources
2 Vectors
2.1 Scalars
2.2 Vectors: geometry and algebra
2.3 Transpose operation
2.4 Vector addition and subtraction
2.5 Vector-scalar multiplication
2.6 Exercises
2.7 Answers
2.8 Code challenges
2.9 Code solutions
3 Vector multiplication
3.1 Vector dot product: Algebra
3.2 Dot product properties
3.3 Vector dot product: Geometry
3.4 Algebra and geometry
3.5 Linear weighted combination
3.6 The outer product
3.7 Hadamard multiplication
3.8 Cross product
3.9 Unit vectors
3.10 Exercises
3.11 Answers
3.12 Code challenges
3.13 Code solutions
4 Vector spaces
4.1 Dimensions and fields
4.2 Vector spaces
4.3 Subspaces and ambient spaces
4.4 Subsets
4.5 Span
4.6 Linear independence
4.7 Basis
4.8 Exercises
4.9 Answers
5 Matrices
5.1 Interpretations and uses of matrices
5.2 Matrix terms and notation
5.3 Matrix dimensionalities
5.4 The transpose operation
5.5 Matrix zoology
5.6 Matrix addition and subtraction
5.7 Scalar-matrix mult.
5.8 "Shifting" a matrix
5.9 Diagonal and trace
5.10 Exercises
5.11 Answers
5.12 Code challenges
5.13 Code solutions
6 Matrix multiplication
6.1 "Standard" multiplication
6.2 Multiplication and eqns.
6.3 Multiplication with diagonals
6.4 LIVE EVIL
6.5 Matrix-vector multiplication
6.6 Creating symmetric matrices
6.7 Multiply symmetric matrices
6.8 Hadamard multiplication
6.9 Frobenius dot product
6.10 Matrix norms
6.11 What about matrix division?
6.12 Exercises
6.13 Answers
6.14 Code challenges
6.15 Code solutions
7 Rank
7.1 Six things about matrix rank
7.2 Interpretations of matrix rank
7.3 Computing matrix rank
7.4 Rank and scalar multiplication
7.5 Rank of added matrices
7.6 Rank of multiplied matrices
7.7 Rank of A, A^T, A^TA, and AA^T
7.8 Rank of random matrices
7.9 Boosting rank by "shifting"
7.10 Rank difficulties
7.11 Rank and span
7.12 Exercises
7.13 Answers
7.14 Code challenges
7.15 Code solutions
8 Matrix spaces
8.1 Column space of a matrix
8.2 Column space: A and AA^T
8.3 Determining whether v ∈ C(A)
8.4 Row space of a matrix
8.5 Row spaces of A and A^TA
8.6 Null space of a matrix
8.7 Geometry of the null space
8.8 Orthogonal subspaces
8.9 Matrix space orthogonalities
8.10 Dimensionalities of matrix spaces
8.11 More on Ax = b and Ay = 0
8.12 Exercises
8.13 Answers
8.14 Code challenges
8.15 Code solutions
9 Complex numbers
9.1 Complex numbers and ℂ
9.2 What are complex numbers?
9.3 The complex conjugate
9.4 Complex arithmetic
9.5 Complex dot product
9.6 Special complex matrices
9.7 Exercises
9.8 Answers
9.9 Code challenges
9.10 Code solutions
10 Systems of equations
10.1 Algebra and geometry of eqns.
10.2 From systems to matrices
10.3 Row reduction
10.4 Gaussian elimination
10.5 Row-reduced echelon form
10.6 Gauss-Jordan elimination
10.7 Possibilities for solutions
10.8 Matrix spaces, row reduction
10.9 Exercises
10.10 Answers
10.11 Coding challenges
10.12 Code solutions
11 Determinant
11.1 Features of determinants
11.2 Determinant of a 2×2 matrix
11.3 The characteristic polynomial
11.4 3×3 matrix determinant
11.5 The full procedure
11.6 Δ of triangles
11.7 Determinant and row reduction
11.8 Δ and scalar multiplication
11.9 Theory vs practice
11.10 Exercises
11.11 Answers
11.12 Code challenges
11.13 Code solutions
12 Matrix inverse
12.1 Concepts and applications
12.2 Inverse of a diagonal matrix
12.3 Inverse of a 2×2 matrix
12.4 The MCA algorithm
12.5 Inverse via row reduction
12.6 Left inverse
12.7 Right inverse
12.8 The pseudoinverse, part 1
12.9 Exercises
12.10 Answers
12.11 Code challenges
12.12 Code solutions
13 Projections
13.1 Projections in ℝ^2
13.2 Projections in ℝ^N
13.3 Orth and par vect comps
13.4 Orthogonal matrices
13.5 Orthogonalization via GS
13.6 QR decomposition
13.7 Inverse via QR
13.8 Exercises
13.9 Answers
13.10 Code challenges
13.11 Code solutions
14 Least-squares
14.1 Introduction
14.2 5 steps of model-fitting
14.3 Terminology
14.4 Least-squares via left inverse
14.5 Least-squares via projection
14.6 Least-squares via row-reduction
14.7 Predictions and residuals
14.8 Least-squares example
14.9 Code challenges
14.10 Code solutions
15 Eigendecomposition
15.1 Eigenwhatnow?
15.2 Finding eigenvalues
15.3 Finding eigenvectors
15.4 Diagonalization
15.5 Conditions for diagonalization
15.6 Distinct, repeated eigenvalues
15.7 Complex solutions
15.8 Symmetric matrices
15.9 Eigenvalues of singular matrices
15.10 Eigenlayers of a matrix
15.11 Matrix powers and inverse
15.12 Generalized eigendecomposition
15.13 Exercises
15.14 Answers
15.15 Code challenges
15.16 Code solutions
16 The SVD
16.1 Singular value decomposition
16.2 Computing the SVD
16.3 Singular values and eigenvalues
16.4 SVD of a symmetric matrix
16.5 SVD and the four subspaces
16.6 SVD and matrix rank
16.7 SVD spectral theory
16.8 Low-rank approximations
16.9 Normalizing singular values
16.10 Condition number of a matrix
16.11 SVD and the matrix inverse
16.12 MP Pseudoinverse, part 2
16.13 Code challenges
16.14 Code solutions
17 Quadratic form
17.1 Algebraic perspective
17.2 Geometric perspective
17.3 The normalized quadratic form
17.4 Evecs and the qf surface
17.5 Matrix definiteness
17.6 The definiteness of A^TA
17.7 λ and definiteness
17.8 Code challenges
17.9 Code solutions
18 Covariance matrices
18.1 Correlation
18.2 Variance and standard deviation
18.3 Covariance
18.4 Correlation coefficient
18.5 Covariance matrices
18.6 Correlation to covariance
18.7 Code challenges
18.8 Code solutions
19 PCA
19.1 PCA: interps and apps
19.2 How to perform a PCA
19.3 The algebra of PCA
19.4 Regularization
19.5 Is PCA always the best?
19.6 Code challenges
19.7 Code solutions
20 The end.
20.1 The end... of the beginning!
20.2 Thanks!
Chapter 1
Introduction to this book
1.1 What is linear algebra and why learn it?
The purpose of this book is to teach you how to think about and work with
matrices, with an eye towards applications in machine learning, multivariate
statistics, time series, and image processing. If you are interested in data
science, quantitative biology, statistics, or machine learning and artificial
intelligence, then this book is for you. If you don’t have a strong
background in mathematics, then don’t be concerned: You need only
high-school math and a bit of dedication to learn linear algebra from this
book.
This book is written with the self-studying reader in mind. Many people do
not realize how important linear algebra is until after university, or they do
not meet the requirements of university-level linear algebra courses
(typically, calculus). Linear algebra textbooks are often used as a
companion to a lecture-based course embedded in a traditional university
math program, and therefore can be a challenge to use as an independent
resource. I hope that this book is a self-contained resource that works well
inside or outside of a formal course.
Ebook version The ebook version is identical to the physical version of this
book, in terms of the text, formulas, visualizations, and code. However, the
formatting is necessarily quite different. The book was designed to be a
physical book; and thus, margins, fonts, text and figure placements, and code
blocks are optimized for pages, not for e-readers.
Therefore, I recommend getting the physical copy of the book if you have the
choice. If you get the ebook version, then please accept my apologies for any
ugly or annoying formatting issues. If you have difficulties reading the code,
please download it from github.com/mikexcohen/LinAlgBook.
So yes, there is a respectable number of equations here. There are three levels
of hierarchy in the equations throughout this book. Some equations are
simple or reminders of previously discussed equations; these are lowest on
the totem pole and are presented in-line with text like this: x(yz) = (xy)z.
More important equations are given on their own lines. The number in
parentheses to the right will allow me to refer back to that equation later in
the text (the number left of the decimal point is the chapter, and the number
to the right is the equation number).
(1.1)
And the most important equations—the ones you should really make sure to
understand and be comfortable using and reproducing—are presented in
their own box with a title:
__________________________________________________________________
Just keep in mind that not every concept in linear algebra has both a
geometric and an algebraic interpretation. The dualism is useful in many
cases, but it's not a fundamental fact that necessarily applies to all linear
algebra concepts.
1.3 Prerequisites
The obvious. Dare I write it? You need to be motivated to learn linear
algebra. Linear algebra isn’t so difficult, but it’s also not so easy. An
intention to learn applied linear algebra—and a willingness to expend mental
energy towards that goal—is the single most important prerequisite.
Everything below is minor in comparison.
Calculus. Simply put: none. I strongly object to calculus being taught before
linear algebra. No offense to calculus, of course; it’s a rich, beautiful, and
incredibly important subject. But linear algebra can be learned without any
calculus, whereas many topics in calculus involve some linear algebra.
Furthermore, many modern applications of linear algebra invoke no calculus
concepts. Hence, linear algebra should be taught assuming no calculus
background.
Vectors, matrices and <insert fancy-sounding linear algebra term here>.
If this book is any good, then you don’t need to know anything about
linear algebra before reading it. That said, some familiarity with matrices and
matrix operations will be beneficial.
This doesn’t mean you should forgo solving problems by hand; it is only
through laboriously solving lots and lots of problems on paper that you will
internalize a deep and flexible understanding of linear algebra. However, only
simple (often, integer) matrices are feasible to work through by hand;
computer simulations and plotting will allow you to understand an equation
visually, when staring at a bunch of letters and Greek characters might give
you nothing but a sense of dread. So, if you really want to learn modern,
applied linear algebra, it’s helpful to have some coding proficiency in a
language that interacts with a visualization engine.
I provide code for all concepts and problems in this book in both MATLAB
and Python. I find MATLAB to be more comfortable for implementing linear
algebra concepts. If you don’t have access to MATLAB, you can use
Octave, which is a free cross-platform software that emulates nearly all
MATLAB functionality. But the popularity of Python is undeniable, and you
should use whichever program you (1) feel more comfortable using or (2)
anticipate working with in the future. Feel free to use any other coding
language you like, but it is your responsibility to translate the code into your
preferred language.
I have tried to keep the code as simple as possible, so you need only minimal
coding experience to understand it. On the other hand, this is not an intro-
coding book, and I assume some basic coding familiarity. If you understand
variables, for-loops, functions, and basic plotting, then you know enough to
work with the book code.
To be clear: You do not need any coding to work through the book. The code
provides additional material that I believe will help solidify the concepts as
well as adapt to specific applications. But you can successfully and
completely learn from this book without looking at a single line of code.
Exercises are found at the end of each chapter and focus on drilling and
practicing the important concepts. The answers (yes, all of them; not just the
odd-numbered) follow the exercises, and in many cases you can also check
your own answer by solving the problem on a computer (using MATLAB or
Python). Keep in mind that these exercises are designed to be solved by hand,
and you will learn more by solving them by hand than by computer.
Code challenges are more involved, require some effort and creativity, and
can only be solved on a computer. These are opportunities for you to explore
concepts, visualizations, and parameter spaces in ways that are difficult or
impossible to do by hand. I provide my solutions to all code challenges, but
keep in mind that there are many correct solutions; the point is for you to
explore and understand linear algebra using code, not to reproduce my code.
This book is based on an online course that I created. The book and the
course are similar but not entirely redundant. You don’t need to enroll in
the online course to follow along with this book (or the other way around). I
appreciate that some people prefer to learn from online video lectures while
others prefer to learn from textbooks. I am trying to cater to both kinds of
learners.
2.1 Scalars
We begin not with vectors but with scalars. You already know everything
you need to know about scalars, even if you don’t yet recognize the term.
Why are single numbers called "scalars"? It’s because single numbers
"scale," or stretch, vectors and matrices without changing their direction. This
will become clear and intuitive later in this chapter when you learn about the
geometric interpretation of vectors.
Scalars can be represented as a point on a number line that has zero in the
middle, negative numbers to the left, and positive numbers to the right
(Figure 2.1). But don’t confuse a point on this number line with an arrow
from zero to that number—that’s a vector, not a scalar.
Notation In this book (as in most linear algebra texts), scalars will be
indicated using Greek lowercase letters such as λ, α, or γ. This helps
disambiguate scalars from vectors and matrices.
Practice problems Solve for λ in the following equations.
a) 2λ = 9
b) 5/λ = 7
c) π^3 = e^λ
d) 4(λ + 3) = −6
e) .5λ = .25
f) 8λ = 0
g) λ^2 = 36
h) λ = 5λ^2
Answers
a) λ = 9/2
b) λ = 5/7
c) λ = 3 ln π
d) λ = −9/2
e) λ = .5
f) λ = 0
g) λ = ±6
h) λ = 1/5 (for λ≠0)
________________________________________________
It is important to know that the definition of a vector does not include its
starting or ending locations. That is, the vector [1 -2] simply means a line that
goes one unit in the positive direction in the first dimension, and two units in
the negative direction in the second dimension. This is the key difference
between a vector and a coordinate: For any coordinate system (think of the
standard Cartesian coordinate system), a given coordinate is a unique point in
space. A vector, however, is any line—anywhere—for which the end point
(also called the head of the vector) is a certain number of units along each
dimension away from the starting point (the tail of the vector).
On the other hand, coordinates and vectors are coincident when the vector
starts at the origin (the [0,0] location on the graph). A vector with its tail at
the origin is said to be in its standard position. This is illustrated in Figure
2.3: The coordinate [1 -2] and the vector [1 -2] are the same when the vector
is in its standard position. But the three thick lines shown in the figure are all
the same vector [1 -2].
Code Vectors are easy to create and visualize in MATLAB and in Python.
The code draws the vector in its standard position.
Note that Python requires you to load in libraries (here, numpy and
matplotlib.pyplot) each time you run a new session. If you’ve already
imported the library in your current session, you won’t need to re-import
it.
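Here is a minimal sketch of what that code might look like in Python. This is an illustration only (the book's official code is on GitHub); it assumes numpy and matplotlib are installed, and uses the vector [1 -2] from the text.
import numpy as np
import matplotlib.pyplot as plt

v = np.array([1, -2])             # the vector [1 -2] from the text
plt.plot([0, v[0]], [0, v[1]])    # draw from the origin (standard position) to the head
plt.axis('square')
plt.grid(True)
plt.show()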
Figure 2.3: The three coordinates (circles) are distinct, but
the three vectors (lines) are the same, because they have
the same magnitude and direction ([1 -2]). When the
vector is in its standard position (the black vector), the
head of the vector [1 -2] overlaps with the coordinate [1
-2].
Be careful, however: Not all enclosing brackets simply signify vectors. The
following two objects are different from the above. In fact, they are not even
vectors—more on this in a few pages!
Vectors are not limited to numbers; the elements can also be functions.
Consider the following vector function.
(2.1)
Code The semicolon in MATLAB inside square brackets is used for vertical
concatenation (that is, to create a column vector). In Python, lists and numpy
arrays have no intrinsic orientation (meaning they are neither row nor column
vectors). In some cases that doesn't matter, while in other cases (for
example, in more advanced applications) it becomes a hassle. Additional
square brackets can be used to impose orientations.
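A small sketch of that point (my own illustration, not one of the book's numbered code blocks):
import numpy as np

v  = np.array([1, 2, 3])         # no orientation: shape (3,)
vR = np.array([[1, 2, 3]])       # extra brackets impose a row vector: shape (1, 3)
vC = np.array([[1], [2], [3]])   # column vector: shape (3, 1)
print(v.shape, vR.shape, vC.shape)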
It’s important to see that the letter is not bold-faced when referring to a
particular element. This is because subscripts can also be used to indicate
different vectors in a set of vectors. Thus, vi is the ith element in vector v, but
vi is the ith vector in a series of related vectors (v1, v2, ..., vi). I know, it's a
bit confusing, but unfortunately that’s common notation and you’ll
have to get used to it. I try to make it clear from context whether I’m
referring to vector element vi or vector vi.
That said, there are some special vectors that you should know about. The
vector that contains zeros in all of its elements is called the zeros vector. A
vector that contains some zeros but other non-zero elements is not the zeros
vector; it’s just a regular vector with some zeros in it. To earn the
distinction of being a "zeros vector," all elements must be equal to zero.
The zeros vector can also be indicated using a boldfaced zero: 0. That can be
confusing, though, because 0 also indicates the zeros matrix. Hopefully, the
correct interpretation will be clear from the context.
The zeros vector has some interesting and sometimes weird properties. One
weird property is that it doesn’t have a direction. I don’t mean its
direction is zero, I mean that its direction is undefined. That’s because the
zeros vector is simply a point at the origin of a graph. Without any
magnitude, it doesn’t make sense to ask which direction it points in.
________________________________________________
Practice problems State the type and dimensionality of the following vectors
(e.g., "four-dimensional column vector"). For 2D vectors, additionally draw
the vector starting from the origin.
a)
b)
c)
d)
Answers
a) 4D column
b) 4D row
c) 2D column
d) 2D row
__________________________________________________________________
-|*|- Reflection This gentle introduction to scalars and vectors seems simple,
but you may be surprised to learn that nearly all of linear algebra is built up
from scalars and vectors. From humble beginnings, amazing things emerge.
Just think of everything you can build with wood planks and nails. (But don't
think of what I could build — I'm a terrible carpenter.) -|*|-
You can transform a column vector to a row vector by transposing it. The
transpose operation simply means to convert columns to rows, and rows to
columns. The values and ordering of the elements stay the same; only the
orientation changes. The transpose operation is indicated by a super-scripted
T (some authors use an italicized T, but I think it looks nicer in regular font).
For example, transposing a column vector produces a row vector with the same
elements in the same order, and transposing a row vector produces a column
vector. Transposing the transpose returns the original vector:
v^TT = v
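In numpy, the transpose is available as the .T attribute; here is a brief sketch (my own example) showing that the double transpose returns the original:
import numpy as np

v = np.array([[1, 2, 3]])   # a 1x3 row vector (note the double brackets)
print(v.T)                  # a 3x1 column vector
print(v.T.T)                # transposing twice gives back the original row vector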
There are two ways to think about subtracting two vectors. One way is to
multiply one of the vectors by -1 and then add them as above (Figure 2.5,
lower left). Multiplying a vector by -1 means to multiply each vector element
by -1 (vector [1 1] becomes vector [-1 -1]). Geometrically, that flips the
vector by 180°.
The second way to think about vector subtraction is to keep both vectors in
their standard position, and draw the line that goes from the head of the
subtracted vector (the one with the minus sign) to the head of the other vector
(the one without the minus sign) (Figure 2.5, lower right). That resulting
vector is the difference. It’s not in standard position, but that doesn’t
matter.
You can see that the two subtraction methods give the same difference
vector. In fact, they are not really different methods; just different ways of
thinking about the same method. That will become clear in the algebraic
perspective below.
Figure 2.5: Two vectors (top left) can be added (top right)
and subtracted (lower two plots).
Do you think that a − b = b − a? Let’s think about this in terms of
scalars. For example: 2 − 5 = −3 but 5 − 2 = 3. In fact, the magnitudes of
the results are the same, but the signs are different. That’s because 2 − 5
= −(5 − 2). Same story for vectors: a−b = −(b−a). The resulting
difference vectors are not the same, but they are related to each other by
having the same magnitude but flipped directions. This should also be
intuitive from inspecting Figure 2.5: v2 −v1 would be the same line as v1
−v2 but with the arrow on the other side; essentially, you just swap the tail
with the head.
More formally, vector addition and subtraction are computed element-wise:
the ith element of the vector a ± b is ai ± bi.
Important: Vector addition and subtraction are valid only when the vectors
have the same dimensionality.
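As a quick numerical check of these points (my own made-up vectors):
import numpy as np

a = np.array([4, 1])
b = np.array([1, 3])
print(a + b)              # element-wise sum: [5 4]
print(a - b)              # element-wise difference: [3 -2]
print(a - b == -(b - a))  # a-b equals -(b-a): both elements True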
a) +
b)
− +
c)
+
d)
−
e)
+
f)
+
Answers
a)
b)
c)
d)
e)
f)
__________________________________________________________________
Still, the important thing is that the scalar does not rotate the vector off of its
original orientation. In other words, vector direction is invariant to scalar
multiplication.
Figure 2.6: Multiplying a scalar (λ) by a vector (v) means to
stretch or shrink the vector without changing its angle. λ
> 1 means the resulting vector will be longer than the
original, and 0 < λ < 1 means the resulting vector will
be shorter. Note the effect of a negative scalar (λ < 0):
The resulting scaled vector points "the other way" but it
still lies on the same imaginary infinitely long line as the
original vector.
_________________________________________________________________________
This definition holds for any number of dimensions and for any scalar. Here
is one example:
Because the scalar-vector multiplication is implemented as element-wise
multiplication, it obeys the commutative property. That is, a scalar times a
vector is the same thing as that vector times that scalar: λv = vλ. This fact
becomes key to several proofs later in the book.
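Here is a one-line sanity check of that commutativity in numpy (the vector and scalar are my own choices):
import numpy as np

lam = 3
v = np.array([2, -1, 4])
print(lam * v)                            # [ 6 -3 12]
print(np.array_equal(lam * v, v * lam))   # True: the order doesn't matter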
__________________________________________________________________
a) −2
b)
(−9 + 2 × 5)
c)
0
d)
λ
Answers
a)
b)
c)
d)
__________________________________________________________________
-|*|- Reflection Vector-scalar multiplication is conceptually and
computationally simple, but do not underestimate its importance: Stretching
a vector without rotating it is fundamental to many applications in linear
algebra, including eigendecomposition. Sometimes, the simple things (in
mathematics and in life) are the most powerful. -|*|-
2.6 Exercises
1.Â
Simplify the following vectors by factoring out common scalars. For
example, [2 4] can be simplified to 2[1 2].
a)
b)
c)
d)
2.Â
Draw the following vectors [ ] using the listed starting point ().
a) (0,0)
b) (1,-2)
c) (4,1)
d) (0,0)
e) (0,0)
f) (-3,0)
g) (2,4)
h) (0,0)
i) (1,0)
j) (0,3/2)
k) (1,1)
l) (8,4)
3.Â
Label the following as column or row vectors, and state their dimensionality.
Your answer should be in the form, e.g., "three-dimensional column vector."
a)
b)
c)
d)
e)
4.Â
a)
3
b)
c)
0
d)
4
e) λ
f) γ[0 0 0 0 0]
5.Â
Add or subtract the following pairs of vectors. Draw the individual vectors
and their sum (all starting from the origin), and confirm that the algebraic and
geometric interpretations match.
a)
+
b)
−
c)
+
d)
+
e)
−
f)
+
g)
+
h)
+
i)
+
j)
+ −
k)
− −
l)
− +
2.7 Answers
1.Â
a)
3
b)
12
c)
6
d)
2.Â
This one you should be able to do on your own. You just need to plot the
lines from their starting positions. The key here is to appreciate the
distinction between vectors and coordinates (they overlap when vectors are in
the standard position of starting at the origin).
3.Â
a) 5D row vector
b) 3D column vector
c) 2D row vector
d) 3D column vector
e) 6D row vector
4.Â
I’ll let you handle the drawing; below are the algebraic solutions.
a)
b)
c)
d)
e)
T
f)
5.Â
a)
b)
c)
d)
e)
f)
g)
h)
i)
j)
k)
l)
There are four ways to multiply a pair of vectors. They are: dot product, outer
product, element-wise multiplication, and cross product. The dot product is
the most important and owns most of the real estate in this chapter.
The dot product is a single number that provides information about the
relationship between two vectors. This fact (two vectors produce a scalar) is
why it’s sometimes called the "scalar product." The term "inner product"
is used when the two vectors are continuous functions. I will use only the
term dot product for consistency.
__________________________________________________________________
An example:
[1 2 3 4] ⋅ [5 6 7 8] = 1×5 + 2×6 + 3×7 + 4×8
= 5 + 12 + 21 + 32
= 70
I will mostly use the notation a^Tb for reasons that will become clear after
learning about matrix multiplication.
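The example above can be reproduced in numpy (a sketch, not one of the book's numbered code blocks):
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(np.dot(a, b))    # 70
print(np.sum(a * b))   # 70, the sum of element-wise products written out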
Why does the dot product require two vectors of equal dimensionality? Try to
compute the dot product between the following two vectors:
You cannot complete the operation because there is nothing to multiply the
final element in the left vector. Thus, the dot product is defined only between
two vectors that have the same dimensionality.
You can compute the dot product between a vector and itself. Equation 3.2
shows that this works out to be the sum of squared elements, and is denoted
∥a∥^2. The term ∥a∥ is called the magnitude,1 the length, or the norm
of vector a. Thus, the dot product of a vector with itself is called the
magnitude-squared, the length-squared, or the squared-norm, of the vector.
1 If the vector is mean-centered—the average of all vector elements is subtracted from each
element—then the dot product of a vector with itself is called the variance in statistics lingo.
a^T a = a1^2 + a2^2 + ... + an^2 = ∥a∥^2 (3.2)
(3.3)
But the second interpretation is what most people refer to when discussing
dot products. Let us consider three vectors and investigate what happens
when we move the parentheses around. In other words, is the following
statement true?
u^T(v^Tw) = (u^Tv)w (3.4)
The answer is No. To understand why, let’s start by assuming that all
three vectors have the same dimensionality. In that case, each side of the
equation is individually a valid mathematical operation, but the two sides of
the equation will differ from each other. In fact, neither side is a dot
product. The left-hand side of Equation 3.4 becomes the vector-scalar
product between row vector u^T and the scalar resulting from the dot product
v^Tw. Thus, the left-hand side of the equation is a row vector. Similar story for
the right-hand side: It is the scalar-vector multiplication between the scalar
u^Tv and the column vector w (don't be confused by transposing a scalar:
4^T = 4).
Therefore, the two sides of the equation are not equal; they wouldn’t even
satisfy a "soft equality" of having the same elements but in a different
orientation.
uT = = (3.5)
Tw = T = (3.6)
And it gets even worse, because if the three vectors have different
dimensionalities, then one or both sides of Equation 3.4 might even be
invalid. I’ll let you do the work to figure this one out, but imagine what
would happen if the dimensionalities of u, v, and w, were, respectively, 3, 3,
and 4.
The conclusion here is that the vector dot product does not obey the
associative property. (Just to make life confusing, matrix multiplication does
obey the associative property, but at least you don’t need to worry about
that for several more chapters.)
Commutative property The commutative property holds for the vector dot
product. This means that you can swap the order of the vectors that are being
"dot producted" together (I’m not sure if dot product can be used as a
verb like that, but you know what I mean), and the result is the same.
__________________________________________________________________
_________________________________________________________________________
Commutativity holds because the dot product is implemented element-wise,
and each element-wise multiplication is simply the product of two scalars.
Scalar multiplication is commutative, and therefore the dot product is
commutative:
a^Tb = b^Ta (3.8)
Distributive property This one also holds for the dot product, and it turns
out to be really important3 for showing the link between the algebraic
definition of the dot product and the geometric definition of the dot product,
which you will learn below.
3 The distributive law is that multiplication distributes over addition inside parentheses, e.g., a(b+c) = ab+ac
When looking at the equation below, keep in mind that the sum of two
vectors is simply another vector. (Needless to say, Equation 3.9 is valid only
when all three vectors have the same dimensionality.)
__________________________________________________________________
w^T(u + v) = w^Tu + w^Tv (3.9)
__________________________________________________________________
The distributive property says that we can break up a dot product into the
sum of two dot products, by breaking up one of the vectors into the sum of
two vectors. Conversely, you can turn this around: We can combine two dot
products into one by summing two vectors into one vector, as long as the two
dot products share a common vector (in this example, w).
Why is Equation 3.9 true? This has to do with how the dot product is defined
as the sum of element-wise multiplications. Common terms can be combined
across sums, which brings us to the following:
w^T(u + v) = Σi wi(ui + vi) = Σi wiui + Σi wivi = w^Tu + w^Tv (3.10)
This result may seem like an uninteresting academic exercise, but it’s not:
Equation 3.13 will allow us to link the algebraic and geometric interpretations
of the dot product.
There are many proofs of this inequality; I’m going to show one that
relies on the geometric perspective of the dot product. So put a mental pin in
this inequality and we’ll come back to it in a few pages.
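A quick numerical check of the distributive property, using made-up vectors (my own sketch):
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 0, -1])
w = np.array([2, 2, 2])
lhs = np.dot(w, u + v)              # w^T (u+v)
rhs = np.dot(w, u) + np.dot(w, v)   # w^T u + w^T v
print(lhs, rhs)                     # both are 18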
___________________________________
Practice problems Compute the dot product between the following pairs of
vectors.
a) T
b) T
c) T
d) T
e)
T
f)
T
g)
T
h)
T
Answers
a) -10
b) -5
c) -1
d) 0
e) -1
f) 26
g) 1/2
h) 31 = (81+3+9)∕3
_______________________________________________
3.3 Vector dot product: Geometry
Geometrically, the dot product is the cosine of the angle between the two
vectors, times the lengths of the two vectors. That seems very different from
the algebraic definition, and it’s also not intuitive that those are the same
operation. In this section, we will discover some properties and implications
of the geometric formula for the dot product; then, in the following section,
you will see that the geometric and algebraic formulae are simply different
ways of expressing the same concept.
_________________________________________________________________________
a^Tb = ∥a∥∥b∥ cos(θab) (3.15)
__________________________________________________________________
Note that if both vectors have unit length (∥a∥ = ∥b∥ = 1), then Equation 3.15
reduces to the cosine of the angle between the two vectors.
Equation 3.15 can be rewritten to give an expression for the angle between
two vectors:
θab = arccos( a^Tb / (∥a∥∥b∥) )
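In code, that rearrangement is straightforward; here is a short sketch with two made-up vectors:
import numpy as np

a = np.array([1, 0])
b = np.array([1, 1])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))
print(theta)   # 45 degrees for these two vectors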
First, let’s understand why the sign of the dot product is determined
exclusively by the angle between the two vectors. Equation 3.15 says that the
dot product is the product of three quantities: two magnitudes and a cosine.
Magnitudes are lengths, and therefore cannot be negative (magnitudes can be
zero for the zeros vector, but let’s assume for now that we are working
with non-zero-magnitude vectors). The cosine of an angle can range between
-1 and +1. Thus, the first two terms (∥a∥∥b∥) are necessarily non-
negative, meaning the cosine of the angle between the two vectors alone
determines whether the dot product is positive or negative.
Figure 3.1: Unit circle. The x-axis coordinate corresponds to
the cosine of the angle from the origin to each point.
With that in mind, we can group dot products into five categories according
to the angle between the vectors (Figure 3.2) (in the list below, θ is the
angle between the two vectors and α is the dot product):
1. θ < 90° → α > 0. The cosine of an acute angle is always positive,
so the dot product will be positive.
2. θ > 90° → α < 0. The cosine of an obtuse angle is always
negative, so the dot product will be negative.
3. θ = 90° → α = 0. The cosine of a right angle is zero, so the dot
product will be zero, regardless of the magnitudes of the vectors. This is
such an important case that it has its own name: orthogonal.5 Commit
to memory that if two vectors meet at a right angle, their dot product is
exactly zero, and they are said to be orthogonal to each other. This
important concept is central to statistics, machine-learning,
eigendecomposition, SVD, the Fourier transform, and so on. The symbol
for perpendicularity is an upside-down "T." So you can write that two
vectors are orthogonal as w ⊥ v.
4. θ = 0° → α = ∥a∥∥b∥. The cosine of 0 is 1, so the dot
product reduces to the product of the magnitudes of the two vectors. The
term for this situation is collinear (meaning on the same line). This also
means that the dot product of a vector with itself is simply ∥a∥^2,
which is the same result obtained algebraically in Equation 3.2.
5. θ = 180° → α = −∥a∥∥b∥. Basically the same situation
as above, but with a negative sign because cos(180°) = −1. Still
referred to as collinear.
5 Important: Vectors are orthogonal when they meet at a 90° angle, and orthogonal vectors
have a dot product of 0.
Figure 3.2: Inferences that can be made about the sign of the
dot product (α), based on the angle (θ) between the
two vectors. Visualization is in 2D, but the terms and
conclusions extend to any number of dimensions.
Keep in mind that the magnitude of the dot product depends on all three
terms (the magnitudes of the two vectors and the cosine of the angle); the
discussion above pertains only to the sign of the dot product as negative,
zero, or positive.
Were you surprised that the algebraic and geometric equations for the dot
product looked so different? It’s hard to see that they’re the same, but
they really are, and that’s what we’re going to discover now. The two
expressions are printed again below for convenience.
a^Tb = a1b1 + a2b2 + ... + anbn = ∥a∥∥b∥ cos(θab) (3.18)
sin(θab) = a1/a (3.19)
a1 = a sin(θab) (3.20)
We can solve for b1 in a similar way:
cos(θab) = b1/a (3.21)
b1 = a cos(θab) (3.22)
Next we need to find an expression for b2. The trick here is to express b2 in
terms of quantities we already know.
b = b1 + b2 (3.23)
b2 = b − b1 (3.24)
b2 = b − a cos(θab) (3.25)
At this point, we have created a right-triangle with side-lengths defined in
terms of the original values a and b, and that means that we can apply the
Pythagorean theorem on those quantities. From there, we simply work
through some algebra to arrive at the Law of Cosines in Equation 3.30:
c^2 = a1^2 + b2^2 (3.26)
= (a sin(θab))^2 + (b − a cos(θab))^2 (3.27)
= a^2 sin^2(θab) + b^2 + a^2 cos^2(θab) − 2ab cos(θab) (3.28)
= a^2 (sin^2(θab) + cos^2(θab)) + b^2 − 2ab cos(θab) (3.29)
= a^2 + b^2 − 2ab cos(θab) (3.30)
Recall the trig identity that cos^2(θ) + sin^2(θ) = 1. Notice that when θab
= 90°, the third term in Equation 3.30 drops out and we get the familiar
Pythagorean theorem.
I realize this was a bit of a long tangent, but we need the Law of Cosines to
prove the equivalence between the algebraic and geometric equations for the
dot product.
Let’s get back on track. Instead of thinking about the lengths of the
triangles as a, b, and c, we can think about the edges of the triangles as
vectors a, b, and c, and thus their lengths are ∥a∥, ∥b∥, and ∥c∥.
With that in mind, we can expand the definition of vector c using the
distributive property:
∥a − b∥^2 = (a − b)^T(a − b) = a^Ta − 2a^Tb + b^Tb
= ∥a∥^2 + ∥b∥^2 − 2a^Tb (3.32)
We’re almost done with the proof; we just need to do some
simplifications. Notice that we have discovered two different ways of writing
out the magnitude-squared of c. Let’s set those equal to each other and do
a bit of simplification.
Notice that ∥a∥^2 and ∥b∥^2 appear on both sides of the equation, so
these simply cancel. Same goes for the factor of −2. That leaves us with the
remarkable conclusion of the equation we started with in this section, re-
printed here for your viewing pleasure.
a^Tb = ∥a∥∥b∥ cos(θab) (3.15)
Whew! That was a really long proof! I’m pretty sure it’s the longest
one in the entire book. But it was important, because we discovered that the
algebraic and geometric definitions of the dot product are merely different
interpretations of the same operation.
I also wrote that the equality holds when the two vectors form a linearly
dependent set. Two collinear vectors meet at an angle of 0° or 180°, and the
absolute value of the cosine of those angles is 1.
-|*|- Reflection There is a lot to say about the dot product. That’s no
surprise—the dot product is one of the most fundamental computational
building blocks in linear algebra, statistics, and signal processing, out of
which myriad algorithms in math, signal processing, graphics, and other
calculations are built. A few examples to whet your appetite for later
chapters and real-world applications: In statistics, the cosine of the angle
between a pair of normalized vectors is called the Pearson correlation
coefficient; In the Fourier transform, the magnitude of the dot product
between a signal and a sine wave is the power of the signal at the frequency
of the sine wave; In pattern recognition and feature-identification, when the
dot product is computed between multiple templates and an image, the
template with the largest-magnitude dot product identifies the pattern or
feature most present in the image. -|*|-
__________________________________________________________________
w = λ1v1 + λ2v2 + ... + λnvn (3.35)
_________________________________________________________________________
It is assumed that all vectors vi have the same dimensionality, otherwise the
addition is invalid. The λ's can be any real number, including zero.
Technically, you could rewrite Equation 3.35 for subtracting vectors, but
because subtraction can be handled by setting the λi to be negative, it’s
easier to discuss linear weighted combinations in terms of addition.
An example:
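In code, a linear weighted combination looks like this (a sketch with made-up weights and vectors, not the example printed in the book):
import numpy as np

l1, l2, l3 = 1, 2, -3       # the scalar weights (the lambdas)
v1 = np.array([4, 5, 1])
v2 = np.array([-4, 0, -4])
v3 = np.array([1, 3, 2])
w = l1*v1 + l2*v2 + l3*v3   # the linear weighted combination
print(w)                    # [ -7  -4 -13]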
But let's start with a bit of notation. The outer product is indicated using
a notation that is initially confusingly similar to that of the dot product. In the
definitions below, v is an M-element column vector and w is an N-element
column vector.
Dot product: v^Tw = 1×1
Outer product: vw^T = M×N
This notation indicates that the dot product (v^Tw) is a 1×1 array (just a
single number; a scalar) whereas the outer product (vw^T) is a matrix whose
size is defined by the number of elements in the vectors.
The vectors do not need to be the same size for the outer product, unlike with
the dot product. Indeed, the dot product expression above is valid only when
M = N, but the outer product is valid even if M≠N.
I realize that this notation may seem strange. I promise it will make perfect
sense when you learn about matrix multiplication. Essentially, the dot
product and outer product are special cases of matrix multiplication.
Now let’s talk about how to create an outer product. There are three
perspectives for creating an outer product: The element perspective, the
column perspective, and the row perspective. The result is the same; they are
just different ways of thinking about the same operation.
Element perspective Each element i,j in the outer product matrix is the
scalar multiplication between the ith element in the first vector and the jth
element in the second vector. This also leads to the formula for computing
the outer product.
__________________________________________________________________
(vw^T)ij = vi wj
_________________________________________________________________________
Here is an example using letters instead of numbers, which helps make the
formula clear.
The element perspective lends itself well to a formula, but it’s not always
the best way to conceptualize the outer product.
Column perspective Each column in the outer product matrix comes from
repeating the left vector (v using the notation above) but scaled by each
element in the row vector on the right (wT). In other words, each column in
the outer product matrix is the result of scalar-vector multiplication, where
the vector is the column vector (repeated) and the scalar comes from each
element of the row vector. Thus, the number of columns in the matrix equals
the number of elements in the row vector. Notice that in the example below,
each column of the outer product matrix is a scaled version of the left column
vector.
Row perspective I'm sure you can already guess how this is going to
work: Form the outer product matrix one row at a time, by repeating the row
vector M times (once for each of the M elements in the column vector v),
each time scaling the row vector by the corresponding element of the column
vector. In the example below, notice that each row in the outer product
matrix is a scaled version of the row vector.
If you look closely at the two examples above, you’ll notice that when we
swapped the order of the two vectors, the two outer product matrices look the
same but with the columns and rows swapped. In fact, that’s not just a
coincidence of this particular example; that’s a general property of the
outer product. It’s fairly straightforward to prove that this is generally the
case, but you need to learn more about matrix multiplications before getting
to the proof. I’m trying to build excitement for you to stay motivated to
continue with this book. I hope it’s working!
Code The outer product, like the dot product, can be implemented in several
ways. Here are two of them. (Notice that MATLAB row vector elements can
be separated using a space or a comma, but separating elements with a
semicolon would produce a column vector, in which case you wouldn’t
need the transpose.)
Code block 3.5: Python
v1 = np.array([2,5,4,7])
v2 = np.array([4,1,0,2])
op = np.outer(v1,v2)
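As a follow-up sketch (my own addition, not from the book's code), the column and row perspectives can be made explicit with broadcasting: each column of the result is v1 scaled by one element of v2, and each row is v2 scaled by one element of v1.
import numpy as np

v1 = np.array([2, 5, 4, 7])
v2 = np.array([4, 1, 0, 2])
op = v1[:, None] * v2[None, :]               # column vector times row vector
print(np.array_equal(op, np.outer(v1, v2)))  # True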
__________________________________________________________________
a) T
b) T
c)
T
d)
T
Answers
a)
b)
c)
d)
__________________________________________________________________
This is actually the way of multiplying two vectors that you might have
intuitively guessed before reading the past few sections.
Also called the Hadamard product, this operation involves multiplying each
corresponding element in the two vectors. The resulting product vector
therefore is the same size as the two multiplying vectors. And thus,
Hadamard multiplication is defined only for two vectors that have the same
number of elements.
An example:
Code There are many parts of Python and MATLAB syntax that are nearly
identical. Matrix multiplication, however, is annoyingly—and confusingly—
different between the languages. Notice that Python uses an asterisk (*) for
element-wise multiplication whereas MATLAB uses a dot-asterisk (.*). It's
a subtle but important distinction.
Code block 3.7: Python
v1 = np.array([2,5,4,7])
v2 = np.array([4,1,0,2])
v3 = v1 * v2
__________________________________________________________________
b)
⊙
c)
⊙
Answers
a)
b)
c) undefined!
_________________________________________________________________________
a × b = [ a2b3 − a3b2, a3b1 − a1b3, a1b2 − a2b1 ] (3.38)
The cross product is used often in geometry, for example to create a vector c
that is orthogonal to the plane spanned by vectors a and b. It is also used in
vector and multivariate calculus to compute surface integrals. However, the
cross product is not really used in data analysis, statistics, machine-learning,
or signal-processing. I’m not going to discuss the cross-product again in
this book; it is included here in the interest of completeness.
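For completeness, numpy does provide a built-in cross product; a minimal sketch (the vectors are my own):
import numpy as np

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
c = np.cross(a, b)
print(c)                            # [0 0 1], orthogonal to the plane of a and b
print(np.dot(c, a), np.dot(c, b))   # 0 0, confirming orthogonality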
_____________________________________________
a)
,
b)
,
c)
,
d)
,
Answers
a)
b)
c)
d) undefined.
__________________________________________________________________
Therefore, the goal of this section is to derive a formula for computing a unit
vector in the same direction as some other vector v that is not necessarily a
unit vector. The way to do this is to find some scalar μ that satisfies our
criteria:6
6 s.t. means "such that" or "subject to"
v̂ = μv, s.t. ∥v̂∥ = 1 (3.40)
How to choose μ? Let’s start by thinking about how you would create a
"unit-norm number" (that is, find μ such that μ times a number has an
absolute value of 1). Let’s figure this out using the number 3.
|μ3| = 1
μ = 1∕3
Deriving the solution is simple: Divide both sides by the magnitude (that is,
the absolute value) of the number (the reason for the absolute value here is
that we want μ = 1∕3 even if we start with -3 instead of +3).
Now extend this concept to vectors: set μ to be the reciprocal (the inverse)
of the magnitude of the vector. In terms of notation, a unit vector is
commonly given a hat to indicate that it has a magnitude of one (v → v̂).
_______________________
_________________________________________________________________________
The norm of the vector, ∥v∥, is a scalar, which means (1) division is
allowed (division by a full vector is not defined) and (2) importantly, the
direction of the vector does not change. Here is a simple example:
This example also shows why the divisor is the magnitude (the square root of
the sum of squared vector elements), and not the squared magnitude v^Tv. It is
also clear that the unit vector v̂ points in the same direction as v.
It is worth mentioning explicitly what you might have already deduced: The
unit vector is defined only for non-zero vectors. This makes sense both
algebraically (Equation 3.41 would involve division by zero) and
geometrically (a vector with no length has no direction, hence, another vector
in the same direction is not a sensible construct).
Taking μ = 1∕∥v∥ allows for a quick proof that the unit vector really
does have unit length:
∥μv∥ = |μ| ∥v∥ = ∥v∥ / ∥v∥ = 1 (3.42)
Code Fortunately, both Python and MATLAB have built-in functions for
computing the norm of a vector.
Code block 3.9: Python
v = np.array([2,5,4,7])
vMag = np.linalg.norm(v)
v_unit = v / vMag
__________________________________________________________________
a)
b)
c)
d)
Answers
a)
b)
c)
d)
=
__________________________________________________________________
3.10 Exercises
1.Â
Compute the dot product between the following pairs of vectors.
a)
,
b)
,
c)
,
d)
,
e)
,
f)
,
g)
,
h)
,
i)
,
j)
k)
l)
,
m)
,
n)
,
o)
,
2.Â
Assume that ∥x∥ = ∥y∥ = 1. Determine whether each of the following
equations is necessarily true, necessarily false, or could be true depending on
the elements in x and y.
3.Â
Compute the angle θ between the following pairs of vectors. See if
you can do this without a calculator!
a)
,
b)
,
c)
,
4.Â
Compute the outer product between the following pairs of vectors. Implement
problems a-c element-wise (the i,jth element in the product matrix is the
product of the ith element in the left vector and the jth element in the right
vector); problems d-f row-wise (each row in the product matrix is the right-
vector scaled by each element in the left vector); and problems g-i column-
wise (each column in the product matrix is the left-vector scaled by each
element in the right-vector).
a) T
b)
T
c)
T
d)
T
e)
T
f)
T
g)
T
h)
T
i)
T
5.Â
Determine whether the following vectors are unit vectors.
a)
b)
c)
d)
6.Â
What is the magnitude of vector μv for the following μ?
a) μ = 0
b) μ = ∥v∥
c) μ = 1∕∥v∥
d) μ = 1∕∥v∥2
3.11 Answers
1.Â
a) 0
b) ac + bd
c) ab + cd
d) −9
e) −9
f) 22
g) a+b
h) 15 + 3i
i) 0
j) 0
k) 5
l) 0
m) 68
n) 0
o) 0
2.Â
a) True
b) True
c) Depends
d) False
e) Depends
f) True
3.Â
These you can solve by inspecting the elements of the vectors, without
applying the dot product formula.
4.Â
Of course, the answers are the same regardless of the perspective you use; the
goal is to become comfortable with different ways of thinking about the outer
product.
a)
b)
c)
d)
e)
f)
g)
h)
T
i)
5.Â
a) Yes
b) No
c) Yes
d) No
6.Â
a) 0
b) ∥v∥2
c) 1
d) 1∕∥v∥
2.
The trick is to compute the average of N numbers by summing them
together after multiplying by 1/N. Mathematically, you would write this
as v^T1/N, where 1 is the vector of all ones.
Code block 3.13: Python
v = [ 7, 4, -5, 8, 3 ]
o = np.ones(len(v))
ave = np.dot(v,o) / len(v)
3.
We don't know which numbers are more important than others, so I
will use randomized weights. In fact, the solution here is simply the
dot product between the data vector and any other vector that sums to
one.
Code block 3.15: Python
v = [ 7, 4, -5, 8, 3 ]
w = np.random.rand(len(v))
wAve = np.dot(v, w/sum(w))
As you learned in the previous chapter, the value of each vector element tells
you how far along that coordinate axis to travel. This is easiest to
conceptualize for vectors in their standard position, because the endpoint of
the vector is the same as the coordinate. For example, the vector [ 1 3 ] in
standard position is a line that goes from the origin to the coordinate point
(1,3). Hence: two elements, two coordinate axes.
A major departure from the typical Cartesian axes you're familiar with—
and a beautiful feature of linear algebra—is that the coordinate axes need
not be orthogonal; that is, they need not meet at right angles (Figure 4.3).
Orthogonal axes have several useful properties, but non-orthogonal axes also
have many useful properties, and are key to many applications, including data
compression. More on this in the section on Bases.
Figure 4.3: The familiar Cartesian plane (left) has
orthogonal coordinate axes. However, axes in linear
algebra are not constrained to be orthogonal (right), and
non-orthogonal axes can be advantageous.
Fields The term "field" in mathematics might be new to you. Perhaps
you've seen these fancy capital letters like ℝ, ℂ, ℚ, or ℤ. These letters
(typeface blackboard bold and sometimes called "hollow letters") refer to
fields. A field in mathematics is a set of numbers for which the basic
arithmetic operations (addition, subtraction, multiplication, division) are
defined.
The ℝ stands for real, as in, real numbers. In Chapter 9, we'll use the
field ℂ for complex numbers. There's also the field ℚ for rational
numbers (a rational number is not a number that makes good decisions; it's
one that is defined by a ratio).
When taking notes by hand, fields are often written as R2, and when typing
on a computer, fields are often indicated as R2 or R^6.
________________________
a)
b)
c)
d) 17
Answers
a) ℝ^4
b) ℝ^3
c) ℝ^2
d) ℝ^1
__________________________________________________________________
A vector space refers to any set of objects for which addition and scalar
multiplication are defined.1 Addition and scalar multiplication obey the
following axioms; these should all be sensible requirements based on your
knowledge of arithmetic:
1 An axiom is a statement that is taken to be true without requiring a formal proof.
Geometry A subspace is the set of all points that you can reach by stretching
and combining a collection of vectors (that is, addition and scalar
multiplication).
Let’s start with a simple example of the vector v=[-1 2]. In its standard
position, this is a line from the origin to the coordinate (-1,2). This vector on
its own is not a subspace. However, consider the set of all possible vectors
that can be obtained by λv for the infinity of possible real-valued λ’s,
ranging from −∞ to +∞: That set describes an infinitely long line in the
same direction as v, and is depicted in Figure 4.4 (showing the entire
subspace would require an infinitely long page).
Figure 4.4: A 1D subspace (gray dashed line) created from a
vector (solid black line).
That gray dashed line is the set of all points that you can reach by scaling and
combining all vectors in our collection (in this case, it’s a collection of
one vector). That gray line extends infinitely far in both directions, although
the vector v is finite.
This is the important sense in which λv does not change the "direction" of v,
which was mentioned in a previous chapter.
So the subspace obtained from one vector is an infinitely long line. What
happens when you have two vectors? They each individually have an
associated infinitely long 1D subspace. But the definition of a vector
subspace allows us to combine these vectors. And you learned in the previous
chapter that adding two vectors geometrically gives a third vector that can
point in a different direction compared to the two adding vectors.
So by scaling and adding two vectors, we can reach many points that are not
within the 1D subspaces defined by either vector alone. Figure 4.5 shows an
example of combining two vectors to reach a point that could not be reached
by either vector’s subspace alone.
In fact, the set of all points reachable by scaling and adding two vectors (that
is, the linear weighted combination of those two vectors) creates a new 2D
subspace, which is a plane that extends infinitely far in all directions of that
plane. Figure 4.6 shows how this looks in 3D: The subspace created from two
vectors is a plane. Any point in the plane can be reached by some linear
combination of the two grey vectors.
But let me unpack that answer by working through the answer to a slightly
different question: How many subspace dimensionalities are possible with an
N-dimensional ambient space? That answer is a finite number, and it turns
out to be N+1. Let us count the dimensionalities, using ℝ^3 as a visualizable
example.
The smallest possible subspace is the one defined by the zeros vector [0 0 0].
This vector is at the origin of the ambient 3D space, and any scalar λ times
this vector is still at the origin. This is the only subspace that is a single point,
and it is thus a zero-dimensional subspace.
Now for the 2-dimensional subspaces. These are formed by taking all linear
combinations of two vectors—two lines—in 3D. These vectors themselves
don’t need to pass through the origin, but the plane that is formed by
combining all scaled versions of these vectors must include the origin (Figure
4.7). It is also intuitive that there is an infinite number of 2D planes in 3D
ambient space that pass through the origin.
But wait a minute—will any two vectors form a plane? No, the vectors must
be distinct from each other. This should make sense intuitively: two vectors
that lie on the same line cannot define a unique plane. In a later section, I'll
define this concept as linear independence and provide a formal
explanation; for now, try to use your visual intuition and high-school
geometry knowledge to understand that a unique plane can be defined only
from two vectors that are different from each other.
So there you have it: For an N-dimensional ambient space, there are N+1
possible dimensions for subspaces (0 through N), and an infinite number of
possible subspaces, except for the one 0-dimensional subspace and the one
N-dimensional subspace.
-|*|- Reflection The visualization gets a bit hairy after three dimensions. In a
4D ambient space, there is an infinite number of unique 3D subspaces. Each
of those 3D subspaces is like a cube that extends infinitely in all directions
and yet is somehow still only an infinitely thin slice of the 4D space. I can
make this work in my mind by thinking about time as the 4th dimension:
There is a single instant in time in which an infinitely expansive space exists,
but for all of time before and after, that space doesn’t exist (and yes, I am
aware that time had a beginning and might have an end; it’s just a
visualization trick, not a perfect analogy). Now take a moment to try to
visualize what an 18D subspace embedded in an ambient ℝ^96 "looks like."
You can understand why we need the algebraic perspective to prevent
overcrowding at psychiatric institutions... -|*|-
A subspace is the set of all points that can be created by all possible linear
combinations of vector-scalar multiplication for a given set of vectors in ℝ^N
and all scalars in ℝ. Subspaces are often indicated using italicized upper-case
letters, for example: the subspace V . That same notation is also used to
indicate a set of vectors. You’ll need to infer the correct reference from
context.
In words, a subspace is the set of all points that satisfies the following
conditions:
The first condition means that for any vector v ∈ V (a vector v contained in
vector subspace V ), multiplying v by any scalar λ and/or adding any other
scaled vector αw that is also contained inside the vector subspace V produces
a new vector that remains in that same subspace.
3"Closed" means that what happens in the subspace, stays in the subspace.
In math terms, that first definition for a subspace translates to:
__________________________________________________________________
Algebraic definition of a subspace
∀v,w ∈ V, ∀λ,α ∈ ℝ; λv + αw ∈ V (4.1)
_________________________________________________________________________
Read out loud, this statement would be "for any vectors v and w contained in
the vector subspace V , and for any real-valued scalars λ and α, any linearly
weighted combination of v and w is still inside vector subspace V ." Below is
an example of a 1D subspace V defined by all linear combinations of a row
vector in ℝ^3.4
4 Read out loud: "The subspace V is defined as the set of all real-valued scalar multiples of the row
vector [1 3 4]."
V = { λ[ 1 3 4 ], λ ∈ ℝ } (4.2)
The first two vectors are contained in the subspace V. Algebraically, that's
the case because you can find some λ such that λ[ 1 3 4 ] = [ 3 9 12 ];
same for the second vector. Geometrically, the first two vectors are collinear
with the original vector [ 1 3 4 ]; they're on the same infinitely long line,
just scaled by some factor.
That third vector, however, is not contained in the subspace V because there
is no possible λ that can multiply the vector to produce [ 1 3 5 ].
Geometrically, that vector points in some direction that is different from the
subspace V .
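In code, one way to make this check (a minimal sketch; the rank function used here is formally introduced in Chapter 7) is to stack the candidate vector with [1 3 4] and test whether the pair is collinear:
import numpy as np
v  = np.array([1,3,4])
w1 = np.array([3,9,12])   # in the subspace: 3*v
w2 = np.array([1,3,5])    # not in the subspace
# collinear vectors stack into a rank-1 matrix
print( np.linalg.matrix_rank(np.vstack((v,w1))) )  # 1 -> in the subspace
print( np.linalg.matrix_rank(np.vstack((v,w2))) )  # 2 -> not in the subspace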
Let's consider another example: The set of all points in ℝ2 with non-
negative y-values. Is this a vector subspace? It contains the point [0,0], which
is a requirement. However, you can find some point in the set (e.g., v=[2,3])
and some scalar λ (e.g., -1) such that λv is outside the set. Thus, this set is
not closed under scalar multiplication, and it is therefore not a subspace. In
fact, this is an example of a subset.
4.4 Subsets
In linear algebra, subspace and subset are two entirely different concepts, but
they are easily confused because of the similarity of the names. Subsets are
actually not important for the linear algebra topics covered in this book. But
it is important to be able to distinguish subsets from subspaces. A subset is
simply any collection of points that satisfies some condition; consider the
following examples:
The set of all points on the XY plane such that x > 0 and y > 0.
The set of all points such that 4 > x > 2 and y > x².
The set of all points such that y = 4x, for x ranging from −∞ to +∞.
These are all valid subsets. The third example is also a subspace, because the
definition of that set is consistent with the definition of a subspace: an
infinitely long line that passes through the origin.
____________________________________
Answers
____________________________________________
4.5 Span
Geometry Span is a concept very similar to subspace, and the two are easy to
confuse. A subspace is the region of ambient space that can be reached by
any linear combination of a set of vectors; those vectors span that
subspace. You can think about the difference using grammar: a subspace is a
noun and span is a verb. A set of vectors spans, and the result of their
spanning is a subspace.
For example, the subspace defined as all of ℝ2 can be created by the span of
the vectors [0 1] and [1 0]. That is to say, all of ℝ2 can be reached by some
linear weighted combination of those two vectors.
Algebra The span of a set of vectors is the set of all points that can be
obtained by any linear weighted combination of those vectors (Equation 4.3).
_____________________
span(v1,...,vn) = { λ1v1 + λ2v2 + ... + λnvn : λ1,...,λn ∈ ℝ } (4.3)
_________________________________________________________________________
Thus, a vector w is in the span of the vector set S if w can be exactly created
by some linear combination of vectors in set S. For example, consider the
following two vectors w and v and set S. The question at hand is whether
either of these vectors is in the span of S. The answers are given and justified
below, but see if you can work through the problem on your own before
reading further.
Let’s start with vector v. We have a positive answer here: v is in the span
of S. Written formally:
Thus, v can be obtained by a weighted combination of vectors in set S. The
weightings (in this case, 5/6 and 1/6) might not be immediately obvious
just from looking at the vectors. In fact, it is generally not trivial to find the
correct weights just by visual inspection, unless you have a really nice linear
algebra teacher who gives you easy problems. There are algorithms for
determining whether a vector is in the span of some vector set, and, if so, for
finding the correct weights. Those algorithms are not too complicated, but
they rely on concepts that you haven’t yet learned about (primarily:
determinant and Gaussian elimination). So we’ll come back to those
algorithms later.
For now, it’s important to understand the concept that the span of a set of
vectors is the entire subspace that can be reached by any linear combination
of those vectors (often verbalized as "this vector set spans that subspace"),
and that we often want to determine whether a given vector is contained in
the span of that set.
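As a preview of the "augment-rank" idea that appears later in this chapter and in Chapter 7, here is a minimal Python sketch (with a hypothetical set S and test vector w, not the ones printed above): a vector is in the span of a set if appending it to the set does not increase the rank.
import numpy as np
S = np.array([[1,0],
              [2,1],
              [0,3]], dtype=float)   # hypothetical spanning vectors, stored as columns
w = np.array([[1],
              [3],
              [3]], dtype=float)     # hypothetical test vector (= column1 + column2)
inSpan = np.linalg.matrix_rank(np.hstack((S,w))) == np.linalg.matrix_rank(S)
print(inSpan)   # True for this example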
One thing to keep in mind about span: It doesn’t matter if the vectors in
the set are linear combinations of other vectors in that same set. For example,
the following set spans a 2D subspace (a plane) embedded in ambient ℝ4.
There are five vectors, but notice that the first three and the last two are
collinear. Hence, five vectors in total, but together they span a 2D subspace.
How do I know that this set spans only a 2D subspace? Well, I know because
I’ve already read the chapter on matrix rank and so I know the algorithm
for determining the subspace dimensionality. As I wrote in the section on
determining whether a vector is in the span of a set, it is, in most cases,
impossible to look at a set of vectors and "see" the subspace dimensionality.
You will learn several algorithms for computing this, but for now, focus on
the concept that it is possible for a set to contain five 4D vectors that together
span a 2D subspace.
__________________________________________________________________
a)
v= ,w= .S=
b)
p= ,q= .T=
c)
m= ,x= .U=
Answers
a) both
b) p yes, q no
c) Invalid dimensions
________________________________________________________________________
Finally, consider the right-hand set (panel C): This set of three vectors in ℝ2
is linearly dependent, because any one of the vectors can be obtained by a
linear combination of the other two vectors. In this example, the middle
vector can be obtained by averaging the other two vectors (that is, summing
them and scalar-multiplying by λ = .5). But that’s not just a quirk of this
example. In fact, there is a theorem about independence that is illustrated in
Figure 4.9: Any set of M > N vectors in ℝN is necessarily linearly dependent.
Any set of M ≤ N vectors in ℝN could be linearly independent. The proof of this theorem
involves creating a matrix out of the set of vectors and then computing the
rank of that matrix. That’s beyond the scope of this chapter, but I wanted
to present this theorem now anyway, because I think it is intuitive: For
example, three vectors that lie on the same plane (a 2D subspace) cannot
possibly create a cube (a 3D subspace).
Also notice the clause could be: M ≤ N merely creates the opportunity for
independence; it is up to the vectors themselves to be independent or not. For
example, imagine a set of 20 vectors in ℝ25 that all lie on the same line (that
is, the same 1D subspace embedded in ambient ℝ25). That set contains 20
vectors, but geometrically, they're occupying a 1D subspace; hence, that
set is linearly dependent.
Algebra A set of vectors is dependent if at least one vector in the set can be
expressed as a linear weighted combination of the other vectors in that set.
Consider the following examples.
(4.4)
(4.5)
Both examples show dependent sets. The way I’ve written the equations
on the right, it seems like w2 and v2 are the dependent vectors while the other
vectors are independent. But remember from the beginning of this section
that linear (in)dependence is a property of a set of vectors, not of individual
vectors. You could just as easily isolate v1 or v3 on the left-hand side of
Equation 4.5.
The important point is that it is possible to create any one vector in the set as
some linear weighted combination of the other vectors in the same set.
On the other hand, it is possible to create subsets of those sets that are
linearly independent. For example, the sets , , and
are all independent sets.
Try as hard as you can and for as long as you like; you will never be able to
define any one vector in each set using a linear weighted combination of the
other vectors in the same set. That’s easy to see in the first set: When
considering only the first two rows, then w1 = 2w2. However, this weighted
combination fails for the third row. Mapping this back onto the geometric
perspective, these two vectors are two separate lines in ℝ3; they point in
similar directions but are definitely not collinear.
The second set is more difficult to figure out by trial-and-error guessing. This
shows that even simple examples get out of hand rather quickly. Determining
whether a set of vectors is linearly independent is really important in linear
algebra, and we’re going to need more rigorous methods that will scale to
any size matrix. For example, you can put those vectors into a matrix and
compute the rank of the matrix; if the rank is the same as the number of
vectors, then the set is independent, and if the rank is less than the number of
vectors, the set is dependent. This is called the "augment-rank" algorithm,
and you will learn more about it in chapters 7 and 8.
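To make that concrete, here is a minimal sketch of the rank-based check (the vectors are hypothetical; rank itself is covered in Chapter 7):
import numpy as np
V = np.array([[1,2,4],
              [2,4,8],
              [0,1,3]], dtype=float)   # three hypothetical vectors, stored as columns
r = np.linalg.matrix_rank(V)
print('independent' if r == V.shape[1] else 'dependent')   # this particular set is dependent (rank 2)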
_________________________________________________________________________
This may seem like a strange definition: Where does it come from and why is
the zeros vector so important? Some rearranging, starting with subtracting
λ1v1 from both sides of the equation, will reveal why this equation indicates
dependence:
λ1v1 = λ2v2 + ... + λnvn
v1 = (λ2∕λ1)v2 + ... + (λn∕λ1)vn, λ ∈ ℝ, λ1 ≠ 0 (4.7)
Because the λ’s are scalars, then λn∕λ1 is also just some scalar. If you
like, you could replace all the fractional constants with some other constant,
e.g., βn = λn∕λ1.
The point is that with a bit of re-arranging, equation 4.6 states that one vector
can be expressed as some linear weighted combination of the other vectors in
the set. Now you can see the correspondence between the math equation and
the English definition of linear dependence. You can also see another
justification for the claim that at least one λ≠0.
Imagine that v1 = 0, λ1 = 1000, and all other λn = 0. Then the right-hand side
of the equation equals the left-hand side of the equation while at least one
λ ≠ 0. Thus, any set of vectors that contains the zeros vector is a dependent
set.
Step 1:
Count the number of vectors (call that number M) in the set and compare
to N in ℝN. As mentioned earlier, if M > N, then the set is necessarily
dependent. If M ≤ N then you have to move on to step 2.
Step 2:
Check for a vector of all zeros. Any set that contains the zeros vector is
a dependent set.
Step 3:
If you've gotten this far, it means you need to start doing some trial-
and-error educated guesswork. Start by looking for zeros in the entries
of some vectors, with the knowledge that zeros in some vectors in
combination with non-zero entries in corresponding dimensions in other
vectors is a tip towards independence (you cannot create something
from nothing, with the possible exception of the big bang).
Step 4:
This is where the real educated guesswork comes in. Start by creating
one element as a weighted combination of other vectors, and see
whether that same weighted combination will work for the other
dimensions. Consider the following set of vectors:
Look across the first dimension (top row of vector elements) and notice
that 2 times the first vector plus 1 times the second vector produces the
first element in the third vector. That same set of weights is also true for
the second dimension. However, it doesn’t work for the third
dimension. That proves that the third vector cannot be expressed as this
linear combination of the first two. You have to repeat this process (try
to find weights of M-1 vector entries that equal the Mth vector entry) for
each of M vectors. Eventually, you will determine that the set is linearly
independent or dependent.
Here is another, slightly different, way to think about step 4, which brings
back the formal definition of linear dependence, and also will help prepare
you for matrix concepts like null space, inverse, and eigendecomposition:
The goal is to come up with a set of coefficients for each vector such that the
weighted sum of all vectors gives the zeros vector. For the first dimension,
the coefficients (-2, -1, 1) will produce a zero (−2×1 + −1×2 + 1×4 = 0).
Those same coefficients will also work for the second dimension,
but they don't work for the third dimension. This means either (1) a
different set of coefficients could work for all three dimensions, or (2) the set
is linearly independent.
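If you want to let the computer do this guesswork, one option (a sketch using a hypothetical dependent set; the SVD used here is introduced much later in the book) is to search for coefficients x such that the weighted sum of the vectors is the zeros vector:
import numpy as np
A = np.array([[1,2,4],
              [1,1,3],
              [0,2,2]], dtype=float)   # hypothetical vectors as columns; column3 = 2*column1 + column2
U,s,Vt = np.linalg.svd(A)
print(s)                # a singular value near zero signals dependence
coeffs = Vt[-1]         # candidate coefficients
print(A @ coeffs)       # approximately the zeros vector for a dependent set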
Step 4 is really the limiting factor here. Again, by this point in the book, your
main concern should be with the concept and definition of linear
independence; scalable and rigorous methods will be introduced in later
chapters.
b)
c)
d)
Answers
a) Dependent (1,0,0)
b) Dependent (1,-3,1)
c) Dependent (-3, 2, 1)
d) Independent
___________________________________________________
4.7 Basis
A basis is the combination of span and independence: A set of vectors
{v1,v2,...,vn} forms a basis for some subspace of â„N if it (1) spans that
subspace and (2) is an independent set of vectors.
Geometrically, a basis is like a ruler for a space. The basis vectors tell you the
fundamental units (length and direction) to measure the space they describe.
For example, the most common basis set is the familiar Cartesian axis basis
vectors, which contains only 0s and 1s:
This basis set is so widely used because of its simplicity: Each basis vector
has unit length and all vectors in the set are mutually orthogonal (that is, the
dot product of any vector with any other vector is zero). These sets fulfill the
definition of basis because they (1) are linearly independent sets that (2) span
ℝ2 or ℝ3.
Set S1 is an independent set but does not span all of ℝ2; instead, it spans a
1D subspace of ℝ2, which is the line at Y=0. Set S2 is a basis for the XY
plane (where Z=0).
Now consider the following sets of vectors; which of these forms a basis for
ℝ3? How about ℝ2?
The answer is that set M2 is a basis for ℝ3, because its three vectors
in ℝ3 form a linearly independent set.
None of the sets is a basis of ℝ2 because all of these vectors live in ℝ3.
Set M1 is not a basis set, because the set is linearly dependent. However, a set
comprising any one of those vectors would be a basis for a 1D subspace (a
line) embedded in ambient 3D space, which, for a Cartesian space, would be
the line on the X-axis at Y=Z=0.
Set M3 is also not a basis set because one can define the first or second
vectors as a linear combination of the other vectors. However, the third vector
and either the first or second vector would form a basis for a 2D subspace,
which is a plane in ℝ3.
Let’s consider the geometric consequence of different bases for the same
subspace. The graph in Figure 4.10 shows a point p; how would you identify
this point using the three basis sets below?
For basis set T, we have p[T] = [2,-.5]. Why is this the correct answer? Starting
from the origin, draw a vector that is two times the first vector in set T, and
then add -.5 times the second vector in set T. That will get you from the
origin to point p.
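The basis vectors from Figure 4.10 are not reproduced here, so the numbers below are hypothetical; the point of this sketch is only that finding a point's coordinates in a basis amounts to solving a small system of equations (systems of equations are the topic of Chapter 10).
import numpy as np
T = np.array([[1, 1],
              [2,-1]], dtype=float)  # hypothetical basis vectors, as columns
p = np.array([3,0], dtype=float)     # hypothetical point in standard coordinates
coords = np.linalg.solve(T,p)        # coordinates of p in basis T
print(coords)                        # [1. 2.]
print(T @ coords)                    # reproduces p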
Now for set U. Ah, this is a trick. In fact, set U is not a basis set, because it is
not a linearly independent set. No set with the zeros vector can be
independent. For our exercise here, it is impossible to reach point p using the
vectors in set U, because span(U) is a 1D subspace that does not touch point
p.
Why is it important that the set be linearly independent? You might think that
it should be sufficient for the set to span the subspace, and any additional
vectors are just there, you know, for fun. But any given vector in the
subspace spanned by a basis should have a unique coordinate using that basis.
For example, consider the following set:
This set is not a basis for ℝ2, but let's pretend for a moment that it is.
The vector that, in standard coordinates, is given by [-2 6] can be obtained by
scaling these "basis vectors" by (-6,4,0) or (0,-2,6) or (-2,0,4), or an infinite
number of other possibilities.8 That’s confusing. Therefore,
mathematicians decided that a vector must have unique coordinates within
some basis set, which happens only when the set is linearly independent.
8Independent bases ensure uniqueness.
Infinite bases Although any given vector has unique coordinates within a
basis (assuming that basis spans the subspace the vector lives in), the reverse
is not the case: There is an infinite number of bases that describe any
subspace. You might have already guessed that this is the case from the
discussion around Figure 4.10. Here is an example of a small and finite
number of distinct bases for ℝ2. (In fact, any linearly independent set of two
2D vectors is a basis for all of ℝ2.)
For example, using the first two sets, the Cartesian-basis coordinate (3,4)
would be obtained by scaling the basis vectors by [3,8] and [-3,4],
respectively.
The truth is that not all basis sets are created equal. Some bases are better
than others, and some problems are easier to solve in certain bases and harder
to solve in other bases. For example, that third basis set above is valid, but
would be a huge pain to work with. In fact, finding optimal basis sets is one
of the most important problems in multivariate data science, in particular data
compression and components analyses.
Consider Figure 4.11: The dots correspond to data points, and the black lines
correspond to basis vectors. The basis set in the left graph was obtained via
principal components analysis (PCA) whereas the basis set in the right graph
was obtained via independent components analysis (ICA). (You’ll learn
the math and implementation of PCA in Chapter 19.) Thus, these two
analyses identified two different basis sets for â„2; both bases are valid, but
which one describes the patterns in the data better?
-|*|- Reflection In this section we are discussing only basis vectors. You may
have also heard about basis functions or basis images. The concept is the
same—a basis is the set of the minimum number of metrics needed to
describe something. In the Fourier transform, for example, sine waves are
basis functions, because all signals can be represented using sine waves of
different frequencies, phases, and amplitudes. -|*|-
4.8 Exercises
1.
Determine whether each vector is in the span of sets S and T, and if so,
what coefficients could produce the given vectors from the sets.
a)
b)
c)
d)
e)
2.
Determine whether the following vector is in the set spanned by the
bracketed vectors, in other words, whether u ∈ S =
a)
,
b)
,
c)
,
3.
Label the following sets as independent or dependent. For dependent sets,
determine whether it is possible to modify only one element of one vector to
change it to an independent set.
a)
b)
c)
d)
e)
f)
g)
h)
i)
4.
Determine the value of λ that would make the following sets of vectors
dependent.
a)
b)
c)
5.
The following sets of vectors are dependent sets. For each set, determine the
number of vectors to remove to create an independent set with the largest
possible number of vectors.
a)
b)
6.
Determine whether the following are descriptions of subspaces and subsets,
or only subsets.
7.
What is the dimensionality of the subspace spanned by the following sets of
vectors?
a)
b)
c)
d)
8.
Remove one vector in the following sets to create a basis set for a 2D
subspace.
a)
b)
c)
4.9 Answers
1.
a) Neither
b) S: [4,0]
c) Both [0,0]
d) T: [2,-3]
e) S: [2,-3]
2.
a) no
b) yes
c) no
3.
Most dependent sets can be made independent with one value-change if there
are N ≤ M vectors in ℝM.
a) Independent
b) Independent
c) Dependent, yes
d) Dependent, no
e) Dependent, yes
f) Dependent, no
g) Independent
h) Dependent, yes
i) Dependent, yes
4.
a) λ = 9
b) λ = 0 or anything if
a=b=c=d=0
c) Any λ
5.
The answers below indicate the number of vectors that can be removed. In
these examples, there are different collections of vectors that could produce
linearly independent sets.
a) 2
b) 2
6.
The strategy is to think about whether any point or vector p in the subset
would still be in the subset for any scalar-multiple μp, including μ = 0.
a) Both.
b) Subset only.
c) Both.
d) Subset only.
7.
a) 1D
b) 2D
c) 3D
d) 1D
8.
a) First or second
b) First or third
c) First or second
Chapter 5
Matrices
5.1 Interpretations and uses of matrices
The goal of this chapter is to introduce you to matrices: what they look like,
how to refer to them, several important special matrices, and basic matrix
arithmetic.
In this book, we will work only with matrices that are rows × columns, or
matrices that you could print out and lay flat on a table. The terminology can
get a bit confusing here, because you might think that these are 2D matrices.
However, although they would be in two physical dimensions when printed
on a piece of paper, the number of dimensions in which a matrix lives is more
open to interpretation compared to vector dimensionality (more on this later).
Matrices that would require a 3D printer (or, for higher dimensions, a hyper-
printer) occupy cubes or hypercubes in physical space, and are called tensors.
Tensors are useful for storing data and for representing, for example, various
physical forces acting on an object. But they will not be further discussed in
this book.
Matrices are ubiquitous in pure and applied mathematics because they have
many different purposes. To give you an idea of their versatility, below is a
non-exhaustive list of some uses of matrices.
The first step of learning anything new is to become acquainted with the
terminology. When referring to an entire matrix, a boldface capital letter is
used (matrix A), and when referring to individual elements of a matrix, a
lower-case letter with subscripts is used (matrix A contains elements ai,j). The
matrix-related terms that you need to know include row, column, element,
block, diagonal, skew-diagonal, and off-diagonal. Several of these are
illustrated in Figure 5.1.
Figure 5.1:Terminology for several key matrix components.
ℝM×N
ℝMN, if each matrix element is its own dimension (this is the closest
interpretation to the vector dimensionality definition).
ℝM, if the matrix is conceptualized as a series of column vectors (each
column contains M elements and is thus in ℝM).
ℝN, if the matrix is conceptualized as a stack of row vectors.
The notation is also the same as for vectors: the transpose of A is AT (as
mentioned previously, the T is sometimes italicized). Setting B to
be the transpose of A (that is, B = AT) leads to the formal definition:
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
a) T
b)
T
c)
TT
Answers
a)
T
b)
c)
_________________________________________________
Square or rectangular Very simply: A square matrix is a matrix that has the
same number of rows as columns, thus an N×N matrix or M×N if M =
N.
A non-square matrix is called a rectangular matrix. You can see a square and
a rectangular matrix below.
Let me guess what you are thinking: "But Mike, a square is also a rectangle!"
Yes, dear reader, that’s technically true. However, for ease of
comprehension, it is assumed that a "rectangular matrix" is one in which the
number of rows does not equal the number of columns, thus M≠N. "Non-
square" is also acceptable.
________________________________________________________________________
It is no accident that the diagonal elements are all zeros. The diagonal must
equal its negative, and the only number for which that is true is 0 (0 = −0).
Thus, all skew-symmetric matrices have diagonals of all zeros.
You might initially think that the identity matrix would be a matrix of all
1's;2 however, it actually has 1's only on the diagonal and all 0's on
the off-diagonals. I is always a square matrix.
2The matrix indicated by 1 has all elements equal to 1; this is called a "ones matrix."
Subscripts are sometimes provided to indicate the size of the matrix, as in the
following examples. If there is no subscript, you can assume that I is the
appropriate size to make the equation valid.
The identity matrix has many important functions. It features in many proofs,
in the matrix inverse, and in eigendecomposition and regularization.
Zeros The zeros matrix is a matrix of all zeros. That might sound like a pretty
boring matrix, but... well, it is. As you might guess, any matrix times the
zeros matrix is still the zeros matrix, just like any number times 0 is 0 (let's
stick to finite numbers to avoid headaches from thinking about what
happens when you multiply 0 by infinity).
Zeros matrices can be square or rectangular, but in most cases you can
assume that the zeros matrix is square; rectangular zeros matrices are often
used in code, for example when initializing data matrices.
Code Many special matrices can be created easily using dedicated functions.
Notice that zeros matrices in Python require two numbers to specify the size
(the number of rows and the number of columns).
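A minimal sketch of such functions in Python (NumPy):
import numpy as np
I = np.eye(4)           # 4x4 identity matrix
Z = np.zeros((4,6))     # zeros matrix; the size is given as (rows, columns)
O = np.ones((4,6))      # ones matrix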
ATA Pronounced "A transpose A" and sometimes written "AtA," this is one
of the most important matrix forms in all of applied linear algebra. Creating
ATA involves matrix multiplication, which is the topic of the next chapter.
However, there are several key properties of ATA that are worth listing here.
Their true value will become apparent as you progress through this book.
Each of the following claims will be discussed and proven in later chapters;
for now, simply marvel at the glorious existential goodness of ATA:
Diagonal A diagonal matrix has all zeros on the off-diagonals, and only the
diagonal elements (going from top-left to bottom-right) may contain non-zero
elements. I is an example of a diagonal matrix, as are the following two
matrices. 3
3It’s OK if some or all diagonal elements are zero, as long as all off-diagonal elements are zero.
(5.7)
If all diagonal elements are the same, then the matrix can be written as a
constant times the identity matrix, or λI. For example,
As shown in Figure 5.1, the diagonal of a matrix goes from the top-left to the
lower-right. The anti-diagonal goes from the top-right to the lower left.
Code In both Python and MATLAB, the same function extracts the diagonal
elements of a matrix, and produces a diagonal matrix given an input vector.
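In Python, that dual behavior looks like this (a brief sketch using NumPy's diag):
import numpy as np
A = np.random.randn(5,5)
d = np.diag(A)   # matrix input: returns the diagonal elements as a vector
D = np.diag(d)   # vector input: returns a diagonal matrix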
Two matrices can be augmented only if they have the same number of rows;
the number of columns need not match.
The vertical bar in the concatenated matrix indicates the break between the
two original matrices, and is sometimes omitted. Augmented matrices are
used in various applications including solving systems of linear equations and
computing the matrix inverse.
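For example, a sketch of column-wise augmentation in Python (np.hstack would work equally well):
import numpy as np
A  = np.random.randn(3,4)
B  = np.random.randn(3,2)
AB = np.concatenate((A,B), axis=1)   # augment: same number of rows required
print(AB.shape)                      # (3, 6)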
Code MATLAB and Python have dedicated functions to extract upper and
lower triangular matrices. Tip: You can provide an optional second input k
to isolate the elements above or below the kth diagonal.
Code block 5.9:Python
A = np.random.randn(5,5)
L = np.tril(A) # extract the lower triangle
U = np.triu(A) # extract the upper triangle
Dense and sparse A matrix in which most or all matrix elements are non-
zero is called a dense matrix (sometimes also called a "full matrix"). This
term is used only when it is necessary in context, for example, when
comparing a diagonal matrix with a dense matrix. You wouldn’t normally
call every matrix a "dense matrix."
A sparse matrix is one that contains mostly zeros and a relatively small
number of non-zero elements. Sparse matrices are extremely computationally
efficient, and therefore, a lot of modern algorithms in numerical linear
algebra emphasize sparse matrices. Notice that the 10×10 sparse matrix
below can be efficiently represented by listing the row and column indices
that contain non-zero elements.
1. All of its columns are pairwise orthogonal. That means that the dot
product between any two columns is exactly 0.
2. Each column i has ∥Qi∥ = 1, in other words, each column has unit
magnitude. Remember that the magnitude of a vector (in this case, the
column of a matrix) is computed as the square root of the dot product of
that vector with itself.
= (5.8)
__________________________________________________________________
Toeplitz Toeplitz and Hankel matrices (below) are closely related to each
other. Both involve creating new rows of a matrix as systematic
rearrangements of elements in previous rows. One of the remarkable
properties of Toeplitz and Hankel matrices is that they can have rank r > 1
even if they are created from a rank r = 1 vector.
In a Toeplitz matrix, all diagonals contain the same element. The matrix
below shows a Toeplitz matrix created from a vector. This Toeplitz matrix is
also symmetric.
Notice that the main diagonal is the same as the first element of the vector
(a), the next off-diagonal is the second element of the vector (b), and so on.
Figure 5.5:Visualization of Toeplitz and Hankel matrices
created from a vector with integers from 1 to 10.
A Hankel matrix can also "wrap around" to produce a full matrix, like this:
Notice that the ith column (and the ith row) of the Hankel matrix comes from
starting the vector at the ith element and wrapping around. You can also see
how the anti-diagonals relate to the vector.
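If you want to build these matrices in code, SciPy has dedicated functions; this is my sketch, assuming SciPy is available, and the "wrap-around" Hankel is produced by passing a rolled copy of the vector as the last row.
import numpy as np
from scipy.linalg import toeplitz, hankel
v = np.arange(1,11)             # the vector 1..10, as in Figure 5.5
T = toeplitz(v)                 # symmetric Toeplitz: constant diagonals
H = hankel(v, np.roll(v,1))     # wrap-around Hankel: constant anti-diagonals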
_________________________________________________________________________
Hankel matrices are used in time series convolution and in advanced signal-
processing methods such as time-delay embedding. Hankel matrices also
look pretty (Figure 5.5) and have aesthetically pleasing properties in
eigendecomposition.
Matrix addition is simple, and it works how you would think it should work:
Add or subtract each corresponding element in the two matrices. For addition
to be a valid operation, both matrices must be the same size, M×N, and the
resulting matrix will also be M×N.
Needless to say, matrix subtraction works exactly the same way, except with
a minus sign instead of a plus sign.
Like vector addition, matrix addition is commutative, meaning that
__________________________________________________________________
a)
+
b)
+
c)
+
Answers
a)
b)
c) Invalid operation!
_________________________________________________________________________
This means that you can move the scalar around, and the result is unchanged.
That feature turns out to be crucial for several proofs and derivations. The
following illustrates the idea.
_________________________________________________________________________
1. Only the diagonal elements are affected; shifting does not change the
off-diagonal elements. This is obvious from Equation 5.13 because the
identity matrix has all zeros on the off-diagonal elements.
2. The first two rows of the example matrix are identical before shifting,
and different after shifting. Thus, shifting a matrix can make matrices
with redundant rows (or columns) distinct.
3. When λ is close to zero, then the shifted matrix A + λI is really similar to
A. Indeed, when λ = 0, then A + λI = A. In practical applications, λ is often
selected to be as small as possible while large enough to satisfy other constraints.
Code Shifting in code illustrates both scalar multiplication and addition. Also
note the potential for confusion that the variable l (lower-case letter "L")
can look like the number 1 and the upper-case letter "I."
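A minimal version of what such code might look like (the matrix here is hypothetical, chosen so that the first two rows are identical before shifting):
import numpy as np
A = np.array([[1,2,3],
              [1,2,3],
              [4,5,6]], dtype=float)   # hypothetical matrix with two identical rows
l = .01                                # lambda; easy to confuse with the number 1 and the letter I
As = A + l*np.eye(len(A))              # shift: add a scaled identity matrix
print(As)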
__________________________________________________________________
Practice problems Shift the following matrices according to the specified λ.
a)
, λ = .1
b)
, λ = −1
Answers
a)
b)
__________________________________________________________________
_________________________________________________________________________
Note that this formula does not require the matrix to be square. Observe the
following examples (diagonal elements highlighted).
Trace The trace is an operation that produces a single number from a square
matrix. It is indicated as tr(A) and is defined as the sum of all diagonal
elements of a matrix:
__________________________________________________________________
tr(A) = a1,1 + a2,2 + ... + aN,N
_________________________________________________________________________
Notice that the off-diagonal elements do not contribute to the trace; thus, two
very different matrices (different sizes, different off-diagonal elements) can
have the same trace.
IMPORTANT The trace is defined only for square matrices. This may seem
strange, considering that the trace is the sum of the diagonal elements and the
diagonal exists for rectangular matrices. The reason for this rule has to do
with a property of eigendecomposition: The trace of a matrix equals the sum
of its eigenvalues. Eigendecomposition is valid only on square matrices, and
so ancient and wise linear algebraticians decreed that only square matrices
can have a trace.
Code Extracting the diagonal from a matrix was shown on page §.
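For the trace itself, a quick sketch (either the dedicated function or summing the extracted diagonal):
import numpy as np
A = np.random.randn(4,4)
tr1 = np.trace(A)           # dedicated function
tr2 = np.sum(np.diag(A))    # sum of the diagonal elements; same value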
__________________________________________________________________
a)
b)
c)
Answers
a) −9
b) 4
c) 0
____________________________________________________________
5.10 Exercises
1.
For the following matrix and vectors, solve the given arithmetic
problems, or state why they are not solvable.
a) wuT + A
b) wuT + AT
c) uvT − A
d) vwT − A
e) vwT + AT
2.
Perform the following matrix operations, when the operation is valid.
A = ,B= ,
C = D=
a) A + 3B
b) A+C
c) C − D
d) D+C
e) AT + D
f) (A + B)T + 2C
g) 3A + (BT + C)T
h) −4(AT + C)T + D
3.
An N×N matrix has N² elements. For a symmetric matrix, however, not all
elements are unique. Create 2×2 and 3×3 symmetric matrices and count
the number of total elements and the number of possible unique elements.
Then work out a formula for the number of possible unique elements in such
a matrix.
4.
Identify the following types of matrices from the list provided in the section
"A zoo of matrices." Note that some matrices can be given multiple labels.
a)
b)
c)
d)
e)
f)
5.
To "decompose" a matrix means to represent a matrix using the sum or
product of other matrices. Let’s consider an additive decomposition, thus
starting from some matrix A and setting A = B + C. You can also use more
matrices: A = B + ... + N. Decompose the following matrices A. Are your
decompositions unique? (A decomposition is "unique" if it has exactly one
solution.)
6.
Create a Hankel matrix from the vector
7.
Determine whether the following matrices are orthogonal.
a)
b)
c) I17
8.
Consider two M×N matrices A and B. Is the sum of their traces equal to
the trace of their sum? That is, tr(A) + tr(B) = tr(A + B)? Try for a few
examples, then see if you can work out a proof for this property.
a)
,
b)
,
c)
,
9.
Here’s another neat property of the trace: The trace of the outer product is
equal to the dot product: tr(vwT) = vTw. Demonstrate this property using the
following sets of vectors.
a)
b)
c)
10.
Perform the indicated matrix operations using the following matrices and
scalars. Determine the underlying principle regarding trace, matrix addition,
and scalar multiplication.
a) tr(A)
b) tr(B)
c) tr(A + B)
d) tr(λC)
e) λ tr(C)
f) λ tr(αC)
g) α tr(λC)
h) tr(αA + λB)
i) (λα) tr(A + B)
j) tr(λA + λB)
k) λ tr(A + B)
l) tr(A + BT)
5.11 Answers
1.
a) Size mismatch.
b)
c)
d) Size mismatch.
e) Size mismatch.
2.
a)
b) Invalid.
c)
d)
e)
f)
g)
h) Invalid.
3.
n(n+1)/2
4.
5.
The additive decomposition is definitely not unique; there is an infinity of
possible matrices that can sum to produce a given A. One interesting solution
is to create matrices of all zeros and one non-zero element. For example:
The reason why this is an interesting decomposition is that the four matrices
are convenient basis matrices for the vector space ℝ2×2.
6.
7.
8.
This property (sum of traces equals trace of sums) holds because of the
element-wise definition of both trace and matrix addition:
a) -2 in both cases
b) 21 in both cases
c) a + b + c + j + k + l in both cases
9.
a) -6
b) 0
c) ae + bf + cg + dh
10.
The underlying principle here is the same as in question 8: The trace is a
linear operator, so you can scalar-multiply and sum, and trace remains the
same.
a) 2
b) -1
c) 1
d) 5a + 5d
e) 5(a + d)
f) 5(−3a − 3d)
g) −3(5a + 5d)
h) -11
i) -15
j) 5
k) 5
l) 1
2.
You just discovered one way to make a symmetric matrix!
Code block 5.19:Python
A = np.random.randn(4,4)
Al = np.tril(A)    # extract the lower triangle
S = Al + Al.T      # lower triangle plus its transpose is symmetric
3.
The point here is to appreciate that indexing the diagonal of a matrix
involves the i,i indices.
Code block 5.21:Python
D = np.zeros((4,8))
for d in range(min(D.shape)):
    D[d,d] = d+1   # the diagonal is indexed by repeating the same index: [d,d]
The following five statements are ways to say the operation AB out loud
(e.g., when trying to show off to your math-inclined friends and family):
Validity Before learning how standard matrix multiplication works, you need
to learn when matrix multiplication is valid. The rule for multiplication
validity is simple and visual, and you need to memorize this rule before
learning the mechanics of multiplication.
If you write the matrix sizes underneath the matrices, then matrix
multiplication is valid only when the two "inner dimensions" match, and the
size of the resulting product matrix is given by the "outer dimensions." By
"inner" and "outer" I’m referring to the spatial organization of the matrix
sizes, as in Figure 6.1.
Figure 6.1:Visualization of the rule for matrix
multiplication validity. Note the reference to "inner
dimensions" (N) and "outer dimensions" (M and K).
The first pair (AB) is valid because the "inner" dimensions match (both 2).
The resulting product matrix will be of size 5×7. The second pair shows
the lack of commutativity in matrix multiplication: The "inner" dimensions
(7 and 5) do not match, and thus the multiplication is not valid. The third pair
is an interesting case. You might be tempted to call this an invalid operation;
however, when transposing C, the rows and columns swap, and so the "inner"
dimensions become consistent (both 5). So this multiplication is valid.
Here’s something exciting: you are now armed with the knowledge to
understand the notation for the dot product and outer product. In particular,
you can now appreciate why the order of transposition (vTw or vwT)
determines whether the multiplication is the dot product or the outer product
(Figure 6.2).
__________________________________________________________________
Practice problems For the following matrices, determine whether matrix
multiplication is
valid, and, if so, the size of the product matrix.
a) AB
b) AC
c) ABT
d) BCAT
e) BBTA
Answers
a) no
b) 2×4
c) 2×3
d) 3×2
e) no
___________________________________________________________
Code Unfortunately, matrix multiplication is confusingly different between
MATLAB and Python. Pay close attention to the subtle but important
differences (@ vs. *).
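Here is a minimal Python sketch of the distinction:
import numpy as np
A = np.random.randn(2,3)
B = np.random.randn(3,4)
C = A @ B        # matrix multiplication (equivalently np.matmul(A,B))
# A * B          # would attempt element-wise multiplication and fails here
#                # because the two matrices have different sizes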
It is now time to learn how to multiply matrices. There are four ways to
think about and implement matrix multiplication. All four methods give the
same result, but provide different perspectives on what matrix multiplication
means. It’s useful to understand all of these perspectives, because they
provide insights into matrix computations in different contexts and for
different problems. It is unfortunate that many linear algebra textbooks teach
only the dot-product method (what I call the "element perspective").
This makes it clear that element ci,j comes from the dot product between row
ai and column bj.
Here is another example; make sure you see how each element in the product
matrix is the dot product between the corresponding row of the left matrix
and column in the right matrix.
Figure 6.3:Visual representation of the mechanism of
computing each element of the matrix multiplication
product as the vector dot product between each row of
the left matrix (from left-to-right) and each column of the
right matrix (from top-to-bottom).
There is a hand gesture that you can apply to remember this rule: extend your
pointer fingers of both hands; simultaneously move your left hand from left
to right (across the row of the left matrix) while moving your right hand
down towards you (down the column of the right matrix) (Figure 6.3).
Below you can see another visualization of matrix multiplication from the
element perspective. This visualization highlights three important features of
matrix multiplication.
Figure 6.4:A simulacrum of building up a matrix one layer
at a time. Each layer is the same size as the product yet
provides only partial information.
The first point above is relevant for understanding data covariance matrices.
The second and third points are important for understanding several matrix
decompositions, most importantly QR decomposition and generalized
eigendecomposition.
Remember that the outer product is a matrix. Each outer product is the same
size as C, and can be thought of as a layer. By analogy, imagine constructing
an image by laying transparent sheets of paper on top of each other, with each
sheet containing a different part of the image (Figure 6.4).
Below is an example using the same matrix as in the previous section. Make
sure you understand how the two outer product matrices are formed from
column ai and row bj. You can also use nearly the same hand gesture as with
the element perspective (Figure 6.3), but swap the motions of the left and
right hands.
Notice that in each of the "layer matrices,"2 the columns form a dependent set
(the same can be said of the rows). However, the sum of these singular
matrices (the product matrix) has columns that form a linearly
independent set.
2Each of these layers is a rank-1 matrix. Rank will be discussed in more detail in a separate chapter, but
for now, you can think of a rank-1 matrix as containing only a single column’s worth of
information; all the other columns are scaled versions.
The layer perspective of matrix multiplication is closely related to the
spectral theorem of matrices, which says that any matrix can be represented
as a sum of rank-1 matrices. It’s like each rank-1 matrix is a single color
and the matrix is the rainbow. This important and elegant idea is the basis for
the singular value decomposition, which you will learn about in Chapter 16.
(3) The "column perspective" From the column perspective, all matrices
(the multiplying matrices and the product matrix) are thought of as sets of
column vectors. Then the product matrix is created one column at a time.
The first column in the product matrix is a linear weighted combination of all
columns in the left matrix, where the weights are defined by the elements in
the first column of the right matrix. The second column in the product matrix
is again a weighted combination of all columns in the left matrix, except that
the weights now come from the second column in the right matrix. And so on
for all N columns in the right matrix. Let’s start with a simple example:
(6.2)
Let’s go through Equation 6.2 slowly. The first column of the product
matrix is the sum of all columns in matrix A. But it’s not just the columns
added together—each column in A is weighted according to the
corresponding element from the first column of matrix B. Then, the second
column of matrix C is created by again summing all of the columns in matrix
A, except now each column is weighted by a different element of column 2
from matrix B. Equation 6.2 shows only two columns, but this procedure
would be repeated for however many columns are in matrix B.
Now for the same numerical example you’ve seen in the previous two
perspectives:
= =
(4) The "row perspective" You guessed it—it’s the same concept as
the column perspective but you build up the product matrix one row at a time,
and everything is done by taking weighted combinations of rows. Thus: each
row in the product matrix is the weighted sum of all rows in the right matrix,
where the weights are given by the elements in each row of the left matrix.
Let’s begin with the simple example:
(6.3)
The top row of the product matrix is created by summing together the two
rows of the right matrix, but each row is weighted according to the
corresponding element of the top row of the left matrix. Same story for the
second row. And of course, this would continue for however many rows are
in the left matrix.
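The four perspectives are easy to verify numerically; the following sketch (my own, with random matrices) implements each one and confirms that they all match A@B.
import numpy as np
A = np.random.randn(3,2)
B = np.random.randn(2,4)
M, N = A.shape
K = B.shape[1]
C1 = np.zeros((M,K))   # element perspective: dot products of rows and columns
for i in range(M):
    for j in range(K):
        C1[i,j] = np.dot(A[i,:], B[:,j])
C2 = np.zeros((M,K))   # layer perspective: sum of outer products
for n in range(N):
    C2 += np.outer(A[:,n], B[n,:])
C3 = np.zeros((M,K))   # column perspective: weighted combinations of A's columns
for j in range(K):
    C3[:,j] = A @ B[:,j]
C4 = np.zeros((M,K))   # row perspective: weighted combinations of B's rows
for i in range(M):
    C4[i,:] = A[i,:] @ B
print([np.allclose(C, A@B) for C in (C1,C2,C3,C4)])   # [True, True, True, True]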
Practice problems Multiply the following pairs of matrices four times, using
each of four perspectives. Make sure you get the same result each time.
a)
b)
Answers
a)
b)
__________________________________________________________________
__________________________________________________________________
Practice problems Perform the following matrix multiplications. Use
whichever perspective you find most confusing (that’s the perspective
you need to practice!).
a)
b)
Answers
a)
b)
_________________________________________________
AB = λA(C + D)
AB = A(C + D)λ
The λ can be moved around because it is a scalar, but A must pre-multiply
both sides (or post-multiply both sides, but it must be consistent on the left-
and right-hand sides of the equation). In contrast to the above, the following
progression of equations is WRONG.
3Matrix sizes are not stated, so assume that the sizes make the operations valid.
B = λ(C + D)
AB = λ(C + D)A
In other words, if you pre-multiply on one side of an equation, you must pre-
multiply on the other side of the equation. Same goes for post-multiplication.
If there are rectangular matrices in the equation, it is possible that pre- or
post-multiplying isn’t even valid.
Let’s see two examples of 3×3 matrices; notice how the diagonal
elements appear in the rows (pre-multiply) or the columns (post-multiply).
Also notice how the product matrix is the same as the dense matrix, but either
the columns or the rows are scaled by each corresponding diagonal element
of the diagonal matrix.
= (6.4)
= (6.5)
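A quick numerical check of this row-vs-column scaling (a sketch with a random dense matrix):
import numpy as np
A = np.random.randn(3,3)
D = np.diag([1.,2.,3.])
print(D @ A)    # pre-multiplying scales the rows of A by 1, 2, 3
print(A @ D)    # post-multiplying scales the columns of A by 1, 2, 3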
__________________________________________________________________
_________________________________________________________________________
a)
b)
c)
d)
Answers
a)
b)
c)
d)
__________________________________________________________________
Multiplying two diagonal matrices The product of two diagonal matrices is
another diagonal matrix whose diagonal elements are the products of the
corresponding diagonal elements. That’s a long sentence, but it’s a
simple concept. Here’s an example:
(6.6)
I assume you got the matrix . Now let’s try it again, but this time
transpose each matrix individually before multiplying them:
Did you get the same matrix as above? Well, if you did the math correctly,
then you will have gotten , which is the not the same as the previous
result. It’s not even the same result but transposed. In fact, it is an entirely
different matrix.
OK, now let’s try it one more time. But now, instead of applying the
transpose operation to each individual matrix, swap the order of the matrices.
Thus:
And now you get the same result as the first multiplication: .
And this brings us to the main topic of this section: An operation applied to
multiplied matrices gets applied to each matrix individually but in reverse
order.
It’s a weird rule, but that’s just how it works. "LIVE EVIL"4 is a
mnemonic that will help you remember this important rule. Notice that LIVE
spelled backwards is EVIL. It’s a palindrome.
__________________________________________________________________
4n.b.: LIVE EVIL is not a recommendation for how to interact with society. Please be nice,
considerate, and generous.
The LIVE EVIL rule: Reverse matrix order
(A…B)T = BT…AT (6.7)
_________________________________________________________________________
The LIVE EVIL rule applies to other operations as well, such as the matrix
inverse. For example:
= =
For square matrices, ignoring the LIVE EVIL rule still gives a result (though
incorrect). However, for rectangular matrices, multiplication would be
impossible when ignoring the LIVE EVIL rule. This is illustrated in Figure
6.6.
Figure 6.6:Example of the LIVE EVIL law for transposing
matrix multiplication. Pay attention to the matrix sizes:
Ignoring the LIVE EVIL rule and transposing without
reversing leads to an invalid expression (top), whereas
the multiplication remains valid when swapping matrix
order (bottom).
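Here is a short numerical confirmation of the LIVE EVIL rule with rectangular matrices (the sizes are my own example):
import numpy as np
A = np.random.randn(3,4)
B = np.random.randn(4,2)
print(np.allclose((A @ B).T, B.T @ A.T))   # True: transpose each matrix and reverse the order
# A.T @ B.T would be a (4x3)(2x4) multiplication, which is invalid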
a)
T T
b)
T T
Answers
a)
b)
_________________________________________________
bTA = = =
_________________________________________________________________________
Let’s work through a proof of this claim. The proof works by transposing
Ab and doing a bit of algebra (including applying the LIVE EVIL rule!) to
simplify and re-arrange.
The proof works because A = AT.5 If the matrix weren't symmetric, then
A ≠ AT; in other words, A and AT would be different matrices. And of
course, b and bT are the same except for orientation.
5Notice that our proof here involved transposing, expanding, and simplifying. This strategy underlies
many linear algebra proofs.
Let’s look at an example.
Ab = =
bTA = =
Notice, as mentioned above, that the two results are not identical vectors
because one is a column and the other is a row. However, they have identical
elements in a different orientation.
__________________________________________________________________
Practice problems Perform the following matrix multiplications.
a)
b)
Answers
a)
b)
__________________________________________________________________
There are two methods: Additive and multiplicative. The additive method is
not widely used in practical applications (to my knowledge), but it is worth
learning. The additive method to create a symmetric matrix from a non-
symmetric matrix is to add the matrix to its transpose. This method is valid
only for square matrices.
(6.9)
An example will illustrate why Equation 6.9 works. Notice that the diagonal
elements are doubled, which is why dividing by 2 is an appropriate
normalization factor. (In the matrices below, assume that b≠d, c≠h, and
f≠i.)
(6.10)
This is just one example of a 3×3 matrix. What if this is some quirk of this
particular matrix? How do we know that this method will always work? This
is an important question, because there are several special properties of 2×2
or 3×3 matrices that do not generalize to larger matrices.
The proof that Equation 6.9 will always produce a symmetric matrix from a
non-symmetric square matrix comes from the definition of symmetric
matrices (C = CT). Therefore, the proof works by transposing both sides of
Equation 6.9, doing a bit of algebra to simplify, and seeing what happens (I'm
omitting the scalar division by 2 because that doesn't affect
symmetry).
C = AT + A (6.11)
CT = (AT + A)T (6.12)
CT = ATT + AT (6.13)6
CT = A + AT (6.14)
Because matrix addition is commutative, A + AT = AT + A. The right-hand
side of Equation 6.14 is the same as the right-hand side of Equation 6.11.
And if the right-hand sides of the equations are equal, then the left-hand sides
must also be equal. This proves that C = CT, which is the definition of a
symmetric matrix. This proof does not depend on the size of the matrix,
which shows that our example above was not a fluke.
_____________________________________________________
6The matrices are summed, not multiplied, so the LIVE EVIL rule does not apply.
Practice problems Create symmetric matrices from the following matrices
using the additive method.
a)
b)
Answers
a)
b)
_________________________________________________
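In code, the additive method is a one-liner (a sketch with a random square matrix):
import numpy as np
A = np.random.randn(4,4)         # must be square for the additive method
S = (A + A.T) / 2                # symmetric by construction
print(np.allclose(S, S.T))       # True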
The multiplicative method This involves multiplying a matrix by its
transpose. In fact, this is the ATA matrix that you learned about in the
previous chapter.
First we prove that ATA is a square matrix. Let's say that A is size M×N
(assume M≠N). First note that AA is not a valid multiplication. Now
consider that ATA means (N×M)×(M×N). The "inner" dimensions
(M) match, and the result will be N×N. So there you go: we just proved
that ATA is a square matrix.
Now let’s prove that ATA is symmetric. The proof follows the same
strategy that we applied for the additive method: transpose ATA, do a bit of
algebra, and see what happens.
(6.15)
We transposed the matrix, applied the LIVE EVIL rule (and the property that
a double-transpose leaves the matrix unchanged), and got back to the original
matrix. A matrix that equals its transpose is the definition of a symmetric
matrix.
Are these two features unique to ATA? What about AAT—is that also square
and symmetric? The answer is Yes, but I want you to get a piece of paper and
prove it to yourself.
a)
b)
Answers
a)
ATA = , AAT =
b)
ATA = , AAT =
_________________________________________________
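A quick numerical check that both ATA and AAT come out square and symmetric (a sketch with a random rectangular matrix):
import numpy as np
A   = np.random.randn(5,3)
AtA = A.T @ A                    # 3x3
AAt = A @ A.T                    # 5x5
print(np.allclose(AtA, AtA.T))   # True: symmetric
print(np.allclose(AAt, AAt.T))   # True: symmetric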
If you multiply two symmetric matrices, will the product matrix also be
symmetric? Let’s try an example to find out:
(6.16)
Is the result symmetric? On its face, it seems like it isn’t. However, this
equation reveals an interesting condition on the symmetry of the result of
multiplying two 2×2 symmetric matrices: If a = c and d = f (in other
words, the matrix has a constant diagonal), then the product of two
symmetric matrices is itself symmetric. Observe and be amazed!
(6.17)
Is this a general rule that applies to symmetric matrices of any size? Let’s
repeat with 3×3 matrices, using the constant-diagonal idea.
A quick glance reveals the lack of symmetry. For example, compare the
element in position 2,1 with that in position 1,2. I won’t write out the
product of two 4×4 symmetric matrices (if you want to try it, go for it), but
you can take my word for it: the resulting product matrix will not be
symmetric.
The lesson here is that, in general, the product of two symmetric matrices is
not a symmetric matrix. There are exceptions to this rule, like the 2×2 case
with constant diagonals, or if one of the matrices is the identity or zeros
matrix.
Is this a surprising result? Refer back to the discussion at the end of the
"element perspective" of matrix multiplication (page §) concerning how the
upper-triangle vs. the lower-triangle of the product matrix is formed from
earlier vs. later rows of the left matrix. Different parts of the two multiplying
matrices meet in the lower triangle vs. the upper triangle of the product
matrix.
You can also see that the product matrix is not symmetric by trying to prove
that it is symmetric (the proof fails, which is called proof-by-contradiction).
Assume that A and B below are both symmetric matrices.
(6.18)
To be sure, Equation 6.18 is a perfectly valid equation, and all of the
individual multiplications are valid. However, matrix multiplication is not
commutative, which is why we had to put a ≠ sign at the end of the
equation. Thus, we cannot assume that AB = (AB)T, therefore, the
multiplication of two symmetric matrices is, in general, not a symmetric
matrix.
-|*|- Reflection This may seem like a uselessly academic factoid, but it leads
to one of the biggest limitations of principal components analysis, and one of
the most important advantages of generalized eigendecomposition, which is
the computational backbone of many machine-learning methods, most
prominently linear classifiers and discriminant analyses. -|*|-
_________________________________________________________________________
You might have already guessed that Hadamard multiplication is valid only
for two matrices that are both M×N, and the product matrix is also size
M×N.
There is also element-wise division, which is the same principle but for
division. This operation is valid only when the divisor matrix contains all
non-zero elements.
One can debate whether Hadamard multiplication and division are really
matrix operations; arguably, they are simply scalar multiplication or division,
implemented en masse using compact notation. Indeed, element-wise matrix
multiplication in computer applications facilitates convenient and efficient
coding (e.g., to avoid using for-loops), as opposed to utilizing some special
mathematical properties of Hadamard multiplication. That said, Hadamard
multiplication does have applications in linear algebra. For example, it is key
to one of the algorithms for computing the matrix inverse.
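In NumPy, Hadamard multiplication and element-wise division are the * and / operators (a brief sketch):
import numpy as np
A = np.random.randn(3,4)
B = np.random.randn(3,4)
H = A * B    # Hadamard multiplication: sizes must match
E = A / B    # element-wise division: requires all elements of B to be non-zero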
__________________________________________________________________
a)
,
b)
,
Answers
a)
b) Undefined!
____________________________________________________
To compute the Frobenius dot product, you first vectorize the two matrices
and then compute their dot product as you would for regular vectors.
Vectorizing a matrix means concatenating all of the columns in a matrix to
produce a single column vector. It is a function that maps a matrix in ℝM×N
to a vector in ℝMN.
_________________________________________________________________________
_________________________________________________________________________
As with many other operations you’ve learned about so far, there is some
arbitrariness in vectorization: Why not concatenate across the rows instead of
down the columns? It could go either way, but following a common
convention facilitates comprehension.
Anyway, with that tangent out of the way, we can now compute the
Frobenius dot product. Here is an example:
A curious yet useful way to compute the Frobenius dot product between
matrices A and B is by taking the trace of ATB. Therefore, the Frobenius dot
product can also be written as follows.
__________________________________________________________________
_________________________________________________________________________
I've omitted the matrix sizes in this equation, but you can tell from inspection that the operation is valid only when both matrices are the same size M×N: in that case AᵀB is N×N, and the trace is defined only on square matrices.
The reason why Equation 6.23 is valid can be seen by working through a few
examples, which you will have the opportunity to do in the exercises.
The Frobenius dot product has several uses in signal processing and machine
learning, for example as a measure of "distance," or similarity, between two
matrices.
The Frobenius inner product of a matrix with itself is the sum of all squared
elements, and is called the squared Frobenius norm or squared Euclidean
norm of the matrix. More on this in the next section.
Code The code below shows the trace-transpose trick for computing the
Frobenius dot product.
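A minimal Python sketch of the idea (the matrix sizes and variable names are arbitrary): the vectorize-and-dot approach and the trace of AᵀB give the same number.
import numpy as np
A = np.random.randn(4,3)
B = np.random.randn(4,3)
frob1 = np.dot(A.flatten(order='F'), B.flatten(order='F'))  # vectorize (column-wise), then dot product
frob2 = np.trace(A.T @ B)                                   # trace-transpose trick
print(frob1, frob2)   # identical up to rounding error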
__________________________________________________________________
a)
,
b)
,
Answers
a) −16
b) 40
_________________________________________________________________________
In section 3.9 you learned that the square root of the dot product of a vector
with itself is the magnitude or length of the vector, which is also called the
norm of the vector.
Let's start with the Frobenius norm, because it's fresh in your mind
from the previous section. The equation below is an alternative way to
express the Frobenius norm.
_____________________________________________________________
∥A∥F = √( Σᵢ Σⱼ aᵢⱼ² )   (6.24)
_________________________________________________________________________
Now you know three ways to compute the Frobenius norm: (1) directly implementing Equation 6.24, (2) vectorizing the matrix and computing the dot product with itself, (3) computing tr(AᵀA).
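A minimal Python sketch, with an arbitrary matrix, showing that the three approaches (plus the built-in function) give the same answer:
import numpy as np
A = np.random.randn(4,3)
n1 = np.sqrt(np.sum(A**2))                        # (1) directly sum the squared elements
n2 = np.sqrt(np.dot(A.flatten(), A.flatten()))    # (2) vectorize and dot with itself
n3 = np.sqrt(np.trace(A.T @ A))                   # (3) square root of tr(A^T A)
print(n1, n2, n3, np.linalg.norm(A,'fro'))        # all four should match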
The Frobenius norm also provides a convenient measure of the distance between two matrices:
∥A − B∥F = √( Σᵢ Σⱼ (aᵢⱼ − bᵢⱼ)² )   (6.25)
__________________________________________________________________
Of course, this formula is valid only for two matrices that have the same size.
The Frobenius norm is also called the ℓ2 norm (the ℓ character is just a fancy-looking l, or a lower-case cursive L). There is also an ℓ1 matrix norm. To compute the ℓ1 norm, you sum the absolute values of all individual elements in each column, then take the largest of those column sums.
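A small Python check of the ℓ1 matrix norm as just described, compared against the built-in norm function (the matrix is an arbitrary example):
import numpy as np
A = np.random.randn(4,3)
l1_manual  = np.max(np.sum(np.abs(A), axis=0))   # largest column sum of absolute values
l1_builtin = np.linalg.norm(A, 1)                # numpy's matrix 1-norm
print(l1_manual, l1_builtin)                     # should be identical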
There are many other matrix norms with varied formulas. Different
applications use different norms to satisfy different criteria or to minimize
different features of the matrix. Rather than overwhelm you with an
exhaustive list, I will provide one general formula for the matrix p-norm; you
can see that for p = 2, the following formula is equal to the Frobenius norm.
∥A∥p = ( Σᵢ Σⱼ |aᵢⱼ|ᵖ )^(1/p)   (6.26)
A closely related result bounds the norm of a matrix-vector product. Applying the Cauchy-Schwarz inequality to the dot product between each row of the matrix and the vector, and then re-summing the norms of the rows back into the norm of the matrix, gives the following inequality:
∥Av∥ ≤ ∥A∥F ∥v∥   (6.27)
In other words, the norm of a matrix-vector product is less than or equal to the product of the Frobenius norms of the matrix and the vector.
__________________________________________________________________
a)
,
b)
,
Answers
a) ≈ 17.94
b) ≈ 12.73
_________________________________________________________________________
This would be the equivalent of a scalar division like a∕b. Well, that doesn't exist for matrices; it is not possible to divide one matrix by another. However, there is a conceptually comparable operation, and it is based on the idea of re-writing scalar division like this:
a∕b = ab⁻¹
The matrix version of this is AB⁻¹. The matrix B⁻¹ is called the matrix inverse. It is such an important topic that it merits its own chapter (Chapter 12). For now, I'll leave you with two important facts about the matrix inverse.
6.12 Exercises
1.Â
Determine whether each of the following operations is valid, and, if so,
the size of the resulting matrix.
a) CB
b) CTB
c) (CB)T
d) CTBC
e) ABCB
f) ABC
g) CTBATAC
h) BTBCCTA
i) AAT
j) ATA
k) BBATABBCC
l) (CBBTCCT)T
m) (A + ACCTB)TA
n) C + CATABC
o) C + BATABC
p) B + 3B + ATA − CCT
q) A ⊙ (ABC)
r) A ⊙ ABC(BC)T
2.Â
Compute the following matrix multiplications. Each problem should be
completed twice using the two indicated perspectives of matrix multiplication
(#1: element, #2: layer, #3: column, #4: row).
a)
#1,2:
b)
#2,4:
c)
#3,4:
d)
#1,4:
e)
#2,3:
f)
#1,3:
g)
#2,3:
h)
#1,2:
i)
#2,3:
3.Â
Compute the following matrix-vector products, if the operation is valid.
a)
b) T
c)
d) T
e)
f)
g)
h)
T
4.Â
Consider square matrices A and B, with nonzero values at all elements. What
assumptions on symmetry would make the following equalities hold (note:
the operations might also be impossible under any assumptions)? Provide a
proof or example for each.
a) AB = ATBT
b) AB = (AB)T
c) AB = ABT
d) AB = ATB
e) AB = BTA
f) AB = (BA)T
5.Â
In section 6.7 you learned that the product of two symmetric matrices is
generally not symmetric. That was for standard multiplication; is the
Hadamard product of two symmetric matrices symmetric? Work through
your reasoning first, then test your hypothesis on the following matrix pair.
6.Â
For the following pairs of matrices, vectorize and compute the vector dot
product, then compute the Frobenius inner product as tr(ATB).
a)
,
b)
,
c)
,
d)
,
e)
,
f)
,
7.Â
Implement the indicated multiplications for the following matrices.
a) AB
b) AC
c) BC
d) CA
e) CB
f) BCA
g) ACB
h) ABC
6.13 Answers
1.Â
a) no
b) yes: 4×3
c) no
d) yes: 4×4
e) no
f) yes: 2×4
g) yes: 4×4
h) no
i) yes: 2×2
j) yes: 3×3
k) no
l) no
m) yes: 3×3
n) no
o) yes: 3×4
p) yes: 3×3
q) no
r) yes: 2×3
2.Â
a)
b)
c)
d)
e)
f)
g)
h)
i)
3.Â
a)
b)
c)
d)
e)
f)
g) invalid
h)
4.Â
5.Â
Yes, because the multiplication is done element-wise.
6.Â
a) a + 2b + 3c + 4d
b) 63
c) 135
d) a2 + b2 + ca + bd
e) a2 + b2 + c2 + d2
f) undefined
7.Â
a)
b)
c)
d)
e)
f)
g)
h)
1.
Note: Showing equivalence between the two results is done by subtracting the matrices to get the zeros matrix (mathematically, this is x = x ⇒ x − x = 0). However, due to precision and rounding errors, the results might be very small numbers such as 1e-16 (10⁻¹⁶). You can consider numbers smaller than around 1e-15 to be equal to zero.
Code block 6.13:Python
import numpy as np
A = np.random.randn(2,4)
B = np.random.randn(4,3)
C1 = np.zeros((2,3))
for i in range(4):
    C1 += np.outer(A[:,i],B[i,:])   # layer perspective: sum of outer products

C1 - A@B # show equality
2.
The diagonals of the two product matrices are the same.
Code block 6.15:Python
import numpy as np
D = np.diag(np.arange(1,5))
A = np.random.randn(4,4)
C1 = D*A    # Hadamard (element-wise) multiplication
C2 = D@A    # standard matrix multiplication
print(np.diag(C1))
print(np.diag(C2))
4.
The challenge isn't so challenging, but it's a good excuse to gain experience coding with norms. My strategy for the inequality is to show that the right-hand side minus the left-hand side is positive. You can run the code multiple times to test different random numbers.
Code block 6.19:Python
import numpy as np
m = 5
A = np.random.randn(m,m)
v = np.random.randn(m)
LHS = np.linalg.norm(A@v)
RHS = np.linalg.norm(A,ord='fro') * np.linalg.norm(v)
RHS-LHS # should always be positive
The rank of a matrix is a single number associated with that matrix, and is relevant for nearly all applications of linear algebra. Before learning about how to compute the rank, or even the formal definition of rank, it's useful to be exposed to a few key facts about matrix rank. In fact, you won't learn the formal methods to compute rank until the next few chapters; here we focus on the concept and interpretations of rank.
6. There are several definitions of rank that you will learn throughout this
book, and several algorithms for computing the rank. However, the key
definition to keep in mind is that the rank of a matrix is the largest
number of columns that can form a linearly independent set. This is
exactly the same as the largest number of rows that can form a
linearly independent set.
-|*|- Reflection Why all the fuss about rank? Why are full-rank matrices so
important? There are some operations in linear algebra that are valid only
for full-rank matrices (the matrix inverse being the most important). Other
operations are valid on reduced-rank matrices (for example,
eigendecomposition) but having full rank endows some additional properties.
Furthermore, many computer algorithms return more reliable results when
using full-rank compared to reduced-rank matrices. Indeed, one of the main
goals of regularization in statistics and machine-learning is to increase
numerical stability by ensuring that data matrices are full rank. So yeah,
matrix rank is a big deal. -|*|-
The phrasing that seems correct here would be "the number of linearly
independent columns" in the matrix, but of course you know that linear
independence is a property of a set of vectors, not individual vectors within a
set.
Below are a few matrices and their ranks. Although I haven't yet taught
you any algorithms for computing rank, try to understand why each matrix
has its associated rank based on the description above.
This object lives in ℝ^3, although it spans only a 1D subspace (a line). The rank of this matrix is therefore 1. In fact, all vectors have a rank of 1, except for the zeros vector, which has a rank of 0.
Now let's think about that matrix as comprising two row vectors in ℝ^3. Those two vectors are distinct, meaning they span a 2D plane embedded in ambient 3D space (Figure 7.2, right panel). Thus, the rank is 2.
Figure 7.2:Geometrically, the rank of a matrix is the
dimensionality of the subspaces spanned by either the
columns (left) or the rows (right).
Computing the matrix rank is, in modern applications, not trivial. In fact,
beyond small matrices, computers cannot actually compute the rank of a
matrix; they can only estimate the rank to a reasonable degree of certainty.
Below is a list of three methods to compute the rank of a matrix. At this point
in the book, you can implement the first method; the other two methods rely
on matrix operations that you will learn about in later chapters.
1. Count the largest number of columns (or rows) that can form a linearly
independent set. This involves a bit of trial-and-error and a bit of
educated guessing. You can follow the same tips for determining linear
independence in Chapter 4.
2. Count the number of pivots in the echelon or row-reduced echelon
form of the matrix (Chapter 10).
3. Count the number of nonzero singular values from a singular value
decomposition of the matrix (Chapter 16).
Code Except for your linear algebra exams, you will never compute matrix
rank by hand. Fortunately, both Python and MATLAB have functions that
return matrix rank.
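For example, in Python the relevant function is np.linalg.matrix_rank; the matrices below are arbitrary illustrations.
import numpy as np
A = np.random.randn(4,4)          # a random matrix is essentially always full rank
B = A.copy()
B[:,3] = B[:,0] + B[:,1]          # force a linear dependency in the last column
print(np.linalg.matrix_rank(A))   # 4
print(np.linalg.matrix_rank(B))   # 3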
__________________________________________________________________
a)
b)
c)
d)
Answers
a) r = 1
b) r = 2
c) r = 2
d) r = 2
_________________________________________________________________________
I'll keep this brief: Scalar multiplication has no effect on the rank of a
matrix, with one exception when the scalar is 0 (because this produces the
zeros matrix, which has a rank of 0).
The reason why scalar multiplication has no effect is that the scalar simply
stretches the information already present in the matrix; it does not transform,
rotate, mix, unmix, change, combine, or create any new information.
I don't think this rule needs an equation, but in the interest of practicing
my LaTeX skills:
rank(λA) = rank(A),   λ ≠ 0   (7.2)
Code There isn't any new code in this section, but I decided to add these
code blocks to increase your familiarity with matrix rank. You should
confirm that the ranks of the random matrix and the scalar-multiplied matrix
are the same.
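A minimal Python sketch along those lines (the matrix size and the scalar are arbitrary):
import numpy as np
A = np.random.randn(5,3)
print(np.linalg.matrix_rank(A))         # 3
print(np.linalg.matrix_rank(7.2 * A))   # still 3: scalar multiplication doesn't change the rank
print(np.linalg.matrix_rank(0 * A))     # 0: the one exception, the zeros matrix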
On the other hand, adding two rank-1 matrices does not guarantee a rank-2
matrix:
With that in mind, here are some more examples. The ranks of the individual
matrices are given in small subscripted numbers (n.b. non-standard notation
used only in this chapter for convenience), but the ranks of the summed
matrices are missing. You should compute them on your own, then check in
the footnote2 for the answers.
2Top to bottom: 3, 3, 2, 1
3 + 3 =   (7.4)
2 + 1 =   (7.5)
2 + 0 =   (7.6)
3 + 2 =   (7.7)
The rule shown in Equation 7.3, rank(A + B) ≤ rank(A) + rank(B), applies to matrix subtraction as well, because subtracting a matrix is the same as adding that matrix multiplied by −1, and scalar multiplication doesn't change the rank of a matrix (see Equation 7.2).
For example, imagine two 3×4 matrices, each with rank 3. The sum of those matrices cannot possibly have a rank of 6 (Equation 7.3), because 6 is greater than the maximum possible rank for matrices of that size. Thus, the largest possible rank of the summed matrix is 3 (the rank could be smaller than 3, depending on the values in the matrices). __________________
a) A + B
b) (A + B) + 0C
c) C − 3A
d) 0((C + A) + 4A)
Answers
a) 15
b) 15
c) 18
d) 0
____________________________________________________________
The corresponding rule for multiplied matrices is rank(AB) ≤ min{ rank(A), rank(B) }. As before, the ranks of the individual matrices are indicated; compute the ranks of the product matrices on your own.
3 · 3 =   (7.9)
2 · 1 =   (7.10)
2 · 0 =   (7.11)
3 · 2 =   (7.12)
How to understand this rule? You can think about it in terms of the column space of the matrix C = AB. "Column space" is a concept you'll learn more about in the next chapter, but basically it's the subspace spanned by the columns of a matrix. Think of the jth column of C as being the matrix-vector product of matrix A and the jth column in B:
cⱼ = Abⱼ   (7.13)
This means that each column of C is a linear combination of columns of A, with the weights defined by the corresponding column in B. In other words, each column of C is in the subspace spanned by the columns of A.
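A quick numerical illustration of the rule in Python (the sizes and ranks below are arbitrary choices): multiplying two rank-2 matrices produces a large matrix whose rank is still only 2.
import numpy as np
A = np.random.randn(6,2)      # rank 2
B = np.random.randn(2,6)      # rank 2
C = A @ B                     # 6x6 product
print(np.linalg.matrix_rank(C))   # 2 = min{rank(A), rank(B)}, even though C is 6x6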
-|*|- Reflection The rules in the past two sections will prepare you for the
next two sections, which have major implications for applied linear algebra,
primarily statistics and machine-learning. Moreover, the more comfortable
you are with matrix rank, the more intuitive advanced linear algebra
concepts will be. -|*|- ________________
a) AB
b) (AB)C
c) 3CA
d) (C + A)A
Answers
a) 4
b) 4
c) 4
d) 4
____________________________________________________________
The key take-home message from this section is that these four matrices, A, Aᵀ, AᵀA, and AAᵀ, all have exactly the same rank.
You already know that A and Aᵀ have the same rank because of property 3 in the first section of this chapter: the rank is a property of the matrix; it does not reflect the columns or the rows separately. Thus, transposing a matrix does not affect its rank.
Proving that AᵀA and AAᵀ have the same rank as that of A takes a little more work. I'm going to present two explanations here. Unfortunately, both of these explanations rely on some concepts that I will introduce later in the book. So if you find these explanations confusing, then please earmark this page and come back to it later. I know it's a bit uncomfortable to rely on concepts before learning about them, but it often happens in math (and in life in general) that a purely monotonic progression is impossible.
Now we've proven that AᵀA and A have the same null space. Why does that matter? You will learn in the next chapter that the row space (the set of all possible weighted combinations of the rows) and the null space together span all of ℝ^N, and so if the null spaces are the same, then the row spaces must have the same dimensionality (this is called the rank-nullity theorem). And the rank of a matrix is the dimensionality of the row space; hence, the ranks of AᵀA and A are the same.
Proving this for AAᵀ follows the same proof as above, except you start from yᵀA = 0ᵀ instead. I encourage you to reproduce the proof with a pen and some paper.
Let's see an example. The following page shows a 2×3 rank-1 matrix, and then that matrix transposed and multiplied by its transpose. You can confirm via visual inspection that the ranks of all these matrices are 1.
A =      rank(A) = 1
Aᵀ =     rank(Aᵀ) = 1
AᵀA =    rank(AᵀA) = 1
AAᵀ =    rank(AAᵀ) = 1
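You can verify the same property numerically; here is a small Python sketch with an arbitrary 2×3 random matrix (which will have rank 2 rather than 1, but the four ranks are still all equal):
import numpy as np
A = np.random.randn(2,3)
for M in (A, A.T, A.T@A, A@A.T):
    print(M.shape, np.linalg.matrix_rank(M))   # all four ranks are the same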
The interesting property of random matrices that is most relevant for this book (and for using computers to explore concepts in linear algebra) is that they are basically always full rank. Almost any time you populate a matrix with random numbers, you can assume that that matrix will have its maximum possible rank (there are some exceptions described below).
Why is this the case? When you generate random numbers on computers, particularly floating-point precision random numbers, it is simply mind-bogglingly unlikely that linear dependencies in the columns will just happen to arise. Here is an example of a 4×4 matrix of random numbers I generated in MATLAB using the function randn:
Apologies for making the font so small. The point isn't for you to read the actual numbers; the point is for you to appreciate that the probability of linear dependencies leading to a reduced-rank matrix is infinitesimal. Thus, whenever you create random matrices on computers, you can assume that their rank is their maximum possible rank.
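If you want to convince yourself, a short Python experiment (the sizes and the number of repetitions are arbitrary) is to generate many random matrices and check whether any of them falls short of full rank:
import numpy as np
ranks = [np.linalg.matrix_rank(np.random.randn(10,10)) for _ in range(1000)]
print(min(ranks), max(ranks))   # expect "10 10": every single draw is full rank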
In statistics and machine learning, one of the main reasons to shift a matrix is
to transform a reduced-rank matrix into a full-rank matrix. Remarkably, this
feat can be accomplished while making only tiny (and usually practically
insignificant) changes to the information contained in the matrix.
To show how this is possible, I'll start with an extreme example that shows how shifting a matrix can transform it into a full-rank matrix. What is the value of λ in the equation below, in which the 3×3 zeros matrix is being shifted?
0 + λI = I   (7.24)
I'm sure you calculated that λ = 1, and shifting trivially moved the matrix rank from 0 to 3. On the other hand, usually the goal of shifting a matrix is to change the information contained in the matrix as little as possible, and I've clearly violated that principle here.
(7.25)
The ranks of these matrices are, respectively, 2, 3, and 3. And unlike with example 7.24, the original matrix and its shifted version are really close to each other: all the off-diagonal elements are identical, and the diagonal elements differ by a mere .01, which, for this matrix, corresponds to a change of less than 1% on the diagonal elements.
(7.26)
Again, the ranks are 2, 3, and 3. But let's think about what we've done: by setting λ to be large relative to the values in A, we've pushed the matrix to be close to a scaled version of the identity matrix (in fact, we could even say that that matrix is 10³I plus some flotsam and jetsam).
I'd like to give you some sense of why it is difficult to compute the rank of large matrices, from both an algebraic and geometric interpretation.
I wrote above that one way to compute the rank of a matrix is to count the number of non-zero singular values. You haven't yet learned about singular values, but it is sufficient for now to know that an M×N matrix has min{M,N} singular values.
Code The standard computer algorithm for computing rank is to count the number of singular values above a threshold. The purpose of this code block is for you to inspect the source code for computing rank. Even if you don't understand each line of the code, you should be able to see the following operations: (1) compute the SVD of the matrix, (2) define a "tolerance" (the threshold for identifying a number as being significantly nonzero), (3) count the number of singular values above this tolerance.
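A condensed Python sketch of those three steps, using the tolerance formula documented for np.linalg.matrix_rank (the example matrix is an arbitrary reduced-rank product):
import numpy as np
A = np.random.randn(8,3) @ np.random.randn(3,8)        # an 8x8 matrix with rank 3
s   = np.linalg.svd(A, compute_uv=False)               # (1) singular values
tol = s.max() * max(A.shape) * np.finfo(A.dtype).eps   # (2) tolerance
print(np.sum(s > tol))                                 # (3) count values above tolerance -> 3
print(np.linalg.matrix_rank(A))                        # same answer from the library function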
But the sensors on the satellite are not perfect, and there is a tiny, tiny bit of
noise that corrupts the signal. So in fact, if you would look at the subspace
spanned by the columns of the matrix at "eye level," you would expect to see
the vectors perfectly lying in a plane. Instead, however, you see the vectors
pointing ever-so-slightly above or below that plane (Figure 7.3).
Your computer would tell you that the rank of this data matrix is 3, which
you know is actually due to sensor noise. So you might want your rank-
estimating-algorithm to ignore some small amount of noise, based on what
you know about the data contained in the matrix.
I wrote that there are several algorithms that can answer that question, but that you hadn't yet learned the necessary concepts. We can now re-interpret this problem in the context of matrices and rank.
There is an entire section about this procedure in the next chapter (section 8.3), including a deeper explanation and diagrams.4 However, I believe that you are now knowledgeable enough to be introduced to the augment-rank algorithm. If you're struggling with understanding why the following procedure tells us whether a vector is in the span of a set of vectors, then don't worry; it means you have something to look forward to in the next chapter!
4Repetition facilitates comprehension, especially when knowledge increases between each repetition.
7.12 Exercises
1.Â
Compute the rank of the following matrices based on visual inspection.
a)
b)
c)
d)
2.Â
Compute the rank of the result of the following operations.
a) A
b) B
c) A+B
d) AAT
e) ATA
f) ABT
g) (ABT)T
h) 2AAT
3.Â
For the following matrices, what value of λ would give each matrix rank m
− 1?
a)
b)
c)
d)
4.Â
Determine the maximum possible rank of the following operations.
A ∈ ℝ^(2×3), B ∈ ℝ^(3×3), C ∈ ℝ^(3×4), D ∈ ℝ^(3×4)
a) A
b) B
c) C
d) D
e) CTB
f) CTC
g) AD
h) CD
i) B+B
j) C+D
k) BATAC
l) BATAC +D
7.13 Answers
1.Â
a) r = 1
b) r = 2
c) r = 3
d) r = 3
2.Â
a) 2
b) 2
c) 2
d) 2
e) 2
f) 2
g) 2
h) 2
3.Â
a) λ = 3
b) λ≠0
c) λ≠2
d) λ = 0
4.Â
a) 2
b) 3
c) 3
d) 3
e) 3
f) 3
g) 2
h) invalid!
i) 3
j) 3
k) 2
l) 3
2.
On the laptop that I'm using while writing this exercise, I got rank-5 matrices down to scaling factors of around 10⁻³⁰⁷.
Code block 7.9:Python
import numpy as np
Z = np.zeros((5,5))
N = np.random.randn(5,5)
ZN = Z + N*np.finfo(float).eps*1e-307   # shift by a vanishingly small random matrix
print(np.linalg.matrix_rank(Z))
print(np.linalg.matrix_rank(ZN))
print(np.linalg.norm(ZN,'fro'))
As you work through this chapter, try to think about how each concept fits
into these questions. At the end of the chapter, I will provide a philosophical
discussion of the meaning of these two questions.
"Column space" sounds like a fancy and exotic term, but in fact, you already
know what a column space is: The column space of a matrix is the subspace
spanned by all columns of that matrix.1 In other words, think of a matrix as a
set of column vectors, and the subspace spanned by that set of vectors as the
column space of the matrix.
1Column space is also sometimes called the range or the image of a matrix.
The column space is indicated using the notation C(A). Here is a formal
definition.
__________________________________________________________________
C(A) = span{ a₁, a₂, …, aₙ }
C(A) = { β₁a₁ + β₂a₂ + … + βₙaₙ ,   β ∈ ℝ }
_________________________________________________________________________
The two equations above show two different, and equally acceptable, ways to
express the same concept.
The main difference between a subspace spanned by a set of vectors vs. the column space of a matrix is conceptual: a set of vectors is a collection of separate objects, whereas a matrix is one unit; it can be convenient to talk about particular groups of elements in a matrix as if they were column vectors. But the matrix elements that form columns are also always part of rows, and they are also individual elements. That fluid level of reorganization isn't possible with sets.
Also relevant here is the distinction between basis and span. The columns of
a matrix span a subspace, but they may or may not be a basis for that
subspace. Remember that a set of vectors is a basis for a subspace only if that
set is linearly independent.
Consider the two matrices below; their column spaces are identical, but the
columns of the left matrix form a basis for that subspace, whereas the
columns of the right matrix do not.
Let's try a few examples. For each of the matrices below, determine (1)
the dimensionality of the ambient space in which the column space is
embedded, (2) the dimensionality of the column space, and (3) whether the
columns form a basis for the column space.
Interestingly, A and AAᵀ have the same column space. Let's first confirm that the dimensions of their ambient spaces are the same: Matrix A is size M×N, so C(A) is in ℝ^M. Matrix AAᵀ is size (M×N)×(N×M) = M×M, and therefore its column space is also in ℝ^M. This doesn't prove that their column spaces are the same, but it is a prerequisite. (For example, it should now be obvious that A and AᵀA cannot possibly have the same column space if M ≠ N.)
Now let's see why the column spaces must be equal. Recall from section 6.1 that the "column perspective" of matrix multiplication states that multiplication is a linear weighted combination of the columns of the left matrix, where the weights come from the columns in the right matrix. Thus, each column of AAᵀ is simply a linear weighted combination of the columns of A, which means it is in the column space of A.
Let's see this in an example. I'm going to write out the multiplication AAᵀ using the column perspective.
Next we rely on the fact that AAᵀ and A have the same rank (see the rank-nullity theorem), which means the dimensionalities of their column spaces are the same. If the column space of AAᵀ is a subset of the column space of A, and those two subspaces have the same dimensionality, then they must be equal.
There is another explanation of why AAᵀ and A have the same column space, which relies on the singular value decomposition. More on this in chapter 16!
On the other hand, A and AAᵀ generally do not have exactly the same columns. So, the columns of those two matrices span the same subspace, but they can have different basis sets.
Let's start with an example. Consider the following matrix and vectors.
First, notice that the column space of A is a 2D plane embedded in ℝ^3. This is the case because the two columns form a linearly independent set (it is not possible to obtain one column as a scaled version of the other).
Now onto the question at hand: Is vector v in the column space of matrix A? Formally, this is written as v ∈ C(A) or v ∉ C(A). This is not a trivial question: we're asking whether a vector in 3D happens to lie on an infinitely thin plane.
For vector v, the answer is yes, v ∈ C(A), because it can be expressed as a linear combination of the columns in matrix A. A bit of guesswork will lead you to coefficients of (1,2) for the columns to produce the vector v:
(8.5)
One of the beautiful features of linear algebra is that it allows expressing a
large number of equations in a compact form. We can do that here by putting
the column coefficients (1,2) into a vector, and then re-writing equation 8.5
as a matrix-vector multiplication of the form Ax = v:
(8.6)
Whenever you see a matrix equation, the first thing you should do is confirm
that the matrix sizes allow for a valid equation. Here we have matrix sizes
(3×2)×(2×1) = (3×1). That works.
Start by creating a matrix B = A⊔v (that is, augment the matrix with the vector). Then compute the ranks of these two matrices (B and A).
Figure 8.1: The dashed gray line represents C(M) in ℝ^2. Then v ∈ C(M) and w ∉ C(M).
There are two possible outcomes: (1) the ranks are the same, which means that v ∈ C(A); or (2) the rank of B is one higher than the rank of A, which means that v ∉ C(A).
Why is this the case? You can think about it geometrically: If v is in the
column space of A, then the vector is sitting somewhere in the subspace,
hence, no new geometric directions are obtained by including vector v. In
contrast, if v is outside the column space, then it points off in some other
geometric dimension that is not spanned by the column space; hence, B has
one extra geometric dimension not contained in A, and thus the rank is one
higher. This is depicted in Figure 8.1.
A corollary of this method is that if A is a full-rank square matrix (that is, rank = M), then v is necessarily in the column space, because it is not possible to have a subspace with more than M dimensions in ℝ^M.
This algorithm only tells you whether a vector is in the column space of a matrix. It doesn't reveal how to combine the columns of the matrix to express that vector. For that, you can apply a procedure called Gaussian elimination, which is a major topic of Chapter 10. In the practice problems below, you can use a bit of trial-and-error and educated guessing to find the coefficients.
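Here is a small Python sketch of the augment-rank algorithm. The matrix and vectors are made up for this illustration (they are not the ones from the example above): v is constructed to be a combination of the columns, while w is not.
import numpy as np
A = np.array([[1.,0.],[2.,1.],[3.,2.]])    # its column space is a 2D plane in R^3
v = np.array([[1.],[4.],[7.]])             # = 1*column1 + 2*column2, so v is in C(A)
w = np.array([[1.],[0.],[0.]])             # not a combination of the columns
for b in (v, w):
    B = np.hstack((A, b))                  # augment the matrix with the vector
    print(np.linalg.matrix_rank(B) == np.linalg.matrix_rank(A))   # True for v, False for w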
__________________________________________________________________
a)
,
b)
,
c)
,
Answers
a) yes. (2,1)
b) yes. (2,-2,3)
c) no.
__________________________________________________________
The primary difference is the way that you ask the question whether a given
vector is in the row space of the matrix. In particular, this changes how you
multiply the matrix and the vector: Instead of matrix-vector multiplication as
with the column space (Ax = v), you have to put the row vector on the left
side of the matrix, like this: xᵀA = vᵀ. Now the weighting vector x,
sometimes also called the coefficients vector, is a row vector on the left,
meaning that we are taking weighted combinations of rows of A instead of
columns.
__________________________________________________________________
Practice problems Determine whether the following vectors are in the row
space of the accompanying matrices, and, if so, the coefficients on the rows
to reproduce the vector.
a) T,
b)
T,
c)
T,
Answers
a) no.
b) yes. (.5,.5,0)
c) wrong sizes!
_________________________________________________________________________
Simply put, A and AᵀA have the same row space (R(A) = R(AᵀA)). The arguments, explanations, and implications are exactly the same as with the column spaces discussed in section 8.2.
One new concept I will add here is that the fact that R(A) = R(AᵀA) can be an example of dimensionality reduction: both of those matrices have the same row space, but AᵀA might be a smaller matrix (that is, it might have fewer numbers), and therefore be computationally easier to work with.
Let's see an example to make this more concrete. See if you can come up
with a vector (that is, find x and y) that satisfies the equation.
(8.8)
Did you come up with the vector [-2 1]ᵀ? That satisfies the equation and therefore is in the null space of that matrix (thus: y ∈ N(A)).
But that's not the only vector that satisfies that equation; perhaps your solution was [2 -1]ᵀ or [-1 .5]ᵀ or [-2000 1000]ᵀ. I'm sure you see where this is going: there is an infinite number of vectors in the null space of this matrix, and all of those vectors are scaled versions of each other. In other words:
(8.9)
As mentioned earlier, the zeros vector also satisfies the equation, but that's a trivial solution and we ignore it.
_________________________________________________________________________
(Aside on math notation: The | indicates "such that" and the minus sign
excludes 0 from the set.)
The previous examples used square matrices; below are two examples of rectangular matrices so you can see that the null space is not just for squares. The hard work of finding the vector y is already done, so pay attention to the sizes: the null space is all about linear weighted combinations of the N columns, which are in ℝ^M, and the vector that contains the weightings for the columns is y ∈ ℝ^N, corresponding to the dimensionality of the row space.
There is a deterministic relationship between the rank of a matrix, its size,
and the dimensionality of the four matrix spaces. We will return to this in
section 8.10.
__________________________________________________________________
Practice problems Find a vector in the null space of each matrix, if there is
one.
a)
b)
c)
d)
Answers
a)
b)
a)
, ,
b)
, ,
c)
, ,
Answers
a) Neither
b) Both
c) Wrong sizes!
___________________________________________________
Left-null space There is a complementary space to the null space, called the
left-null space, which is the same concept but with a row vector on the left of
the matrix instead of a column vector on the right of the matrix. The resulting
zeros vector is also a row vector. It looks like this:
yᵀA = 0ᵀ   (8.13)
The left-null space can be thought of as the "regular" null space of the matrix
transpose. 5This becomes apparent when transposing both sides of Equation
8.13.
5The "regular" null space is formally the "right null space," but this is implied when referring to the
"null space."
(yᵀA)ᵀ = (0ᵀ)ᵀ
Aᵀy = 0   (8.14)
Considering the left-null space as the (right) null space of the matrix
transpose is analogous to how R(A) = C(AT). This also means that the two
null spaces are equal when the matrix is symmetric:
= (8.16)
= (8.17)
Here's the rule: For an M×N matrix, the row space is in ambient ℝ^N and the left-null space is in ambient ℝ^M. Again, this is sensible because the left-null space provides a weighting of all the rows to produce the zeros row vector. There are M rows, and so a left-null space vector must have M elements.
Look closely at Equation 8.16: is that vector the only vector in the left-null space? I'm not referring to scaled versions of that vector; I mean that there are more vectors in the left-null space that are separate from the one I printed. How many more (non-trivial) vectors can you identify?
__________________________________
Practice problems Find a vector in the left-null space of each matrix, if there is one. Notice that these matrices appeared in the earlier practice problems; are the answers the same?
a)
b)
c)
d)
Answers
a)
b)
c) Empty left
null space
d) Empty left
null space
__________________________________________________
Code Python and MATLAB will return basis vectors for the null space of a matrix (if it has one). How do they accomplish this seemingly magical feat? They use the singular value decomposition! However, you'll have to wait several more chapters to understand why the SVD reveals bases for the null space.
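For example, scipy provides a null_space function. The rank-1 matrix below is an arbitrary choice; the basis vector it returns is a unit-length scaled version of a vector like [-2 1]ᵀ (possibly with flipped sign).
import numpy as np
from scipy.linalg import null_space
A = np.array([[1.,2.],[2.,4.]])   # rank 1, so it has a nontrivial null space
y = null_space(A)                 # one basis vector (as a column), computed via the SVD
print(y)                          # proportional to [-2, 1]
print(A @ y)                      # essentially [[0],[0]], up to rounding error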
Recall that a matrix times a vector produces another vector. We'll stick to ℝ^2 here so that everything can be easily visualized. Consider the following matrix and vectors.
Figure 8.2 shows that when you multiply a vector in the null space by that matrix, the resulting "vector" is just a point at the origin. And that's basically the end of the line for this matrix-vector product My: it cannot do anything else but sit at the origin. Just like the basement of a horror movie: once you go in, you never come out. There is no escape. There is no possible other matrix A such that AMy ≠ 0. In other words, the vector entered the matrix's null space and became a singularity at the origin; no other matrix can bring it back from the abyss.
And this brings me to the final point I want to make in this section: there is nothing special about vector y, and there is nothing special about matrix M. Instead, what is special is their combination, that y ∈ N(M). MAy ≠ 0 because y is not in the null space of matrix A, and vector Ay is not in the null space of matrix M (it could be if A is the identity matrix, the zeros matrix, or has y as an eigenvector, but that is galactically unlikely when populating A with random integers).
__________________________________________________________
Practice problems For the following matrices, find a basis for the column
space and a basis for the left-null space. Then draw those two basis vectors
on the same Cartesian plot (one plot per matrix). Do you notice anything
about the plots?
a)
b)
Answers In both cases, the column space and the left-null space are orthogonal to each other. You can easily confirm that in these examples by computing the dot products between the basis vectors. That's not just some quirky effect of these specific matrices; that's a general principle, and you will soon learn the reason why. (You can repeat this exercise for the row space and the null space and arrive at the same conclusion.)
__________________________________________________________________
You already know what it means for two lines to be orthogonal to each other (dot product of zero; they meet at a right angle); what does it mean for two subspaces to be orthogonal? Let me start by mentioning something that's pretty obvious when you think about it: if a pair of vectors is orthogonal, then any scalar-vector multiplication will also be orthogonal.
vᵀw = 0  ⟹  (αv)ᵀ(βw) = 0   (8.18)
The reason why Equation 8.18 is obvious is that if vᵀw = 0 then any scalar times 0 is still 0.
Now let's take this one step further: imagine that two vectors v₁ and v₂ are each orthogonal to w. They don't have to be orthogonal to each other. Here's an example:
Now we can combine this concept with Equation 8.18 to write a new equation:
(λ₁v₁ + λ₂v₂)ᵀ(βw) = 0   (8.19)
This equation says that any linear combination of v₁ and v₂ is orthogonal to any scaled version of w.
__________________________________________________________________
∀ v ∈ S and ∀ w ∈ W:   vᵀw = 0
Read aloud, this definition is "For any vector v in subspace S, and for any vector w in subspace W, vector v is orthogonal to vector w."
Figure 8.3:Two 2D subspaces in â„3 cannot be orthogonal
subspaces.
But alas, no, these planes are not orthogonal subspaces. The line of
intersection between these two planes is a vector that is contained in both
subspaces. And a vector that appears is both subspaces is clearly not
orthogonal to itself. Of course, there can be many vectors in one plane that
are orthogonal to many vectors in the other plane, but for the two subspaces
to be orthogonal, every possible vector needs to be orthogonal.
Figure 8.4:A 1D subspace can be an orthogonal
complement to a 2D subspace in â„3.
The orthogonality of the column space and left-null space is a big deal, because we're talking about entire subspaces, not just individual vectors. A vector is a mere finite object, but subspaces are infinite expanses. Thus, the column space and the left-null space are orthogonal complements, and so together they must fill up the entire ambient space of ℝ^M.
C(A) ⊕ N(Aᵀ) = ℝ^M   (8.23)
Let me explain that again so it's clear: for any given M×N matrix, every vector in ℝ^M can be broken into a piece in the column space plus a piece in the left-null space. No vector other than the trivial zeros vector can be in both (because they are orthogonal subspaces). Therefore, the column space and the left-null space together span all of ℝ^M.
Analogously, if someone asks you if you want ice cream with sprinkles,9 ice
cream without sprinkles, or no ice cream, then those mutually exclusive
options literally account for every imaginable thing in the universe.
9OK, perhaps not a perfect analogy, but the point is that I like sprinkles without ice cream.
One implication of Equation 8.23 is that if the column space of the matrix
spans all of â„M, then the left-null space must be empty, because only the
zeros vector can be orthogonal to an entire ambient space.
The easiest way to choose a basis for the column space is to take the first r columns, where r is the rank of the matrix (make sure that those columns form a linearly independent set). Of course, this isn't necessarily the best basis set; it's just the easiest for small matrices.
Figure 8.6 shows these vectors, with black lines for the column space basis
vectors and gray lines for the left-null space basis vectors.
Matrix D has an empty null space, which I've depicted as a dot at the origin of the graph. It's visually apparent that the column spaces are orthogonal to their respective left-null spaces.
Orthogonality of the row space and the null space There isn't a lot to say here that isn't written above. Just swap "column" with "row" and "left-null space" with "null space." I will briefly walk through the reasoning as above but using slightly different notation, for variety.
The idea is to find a vector that is orthogonal to all rows in A, in other words,
y ⊥ R(A).
We can express this by writing that the dot product between each row of the matrix (indicated as aₘ below) and the vector y is 0.
a₁y = 0
a₂y = 0
⋮
aₘy = 0
And then we simply collect all of these individual equations into one compact
matrix equation, which is, of course, the definition of the null space.
As with the column and left-null spaces, the row space and null space are orthogonal complements that together span all of ℝ^N:
R(A) ⊕ N(A) = ℝ^N   (8.24)
Any vector in ℝ^N can be broken into a piece in the row space plus a piece in the null space. The only vector that can be in both spaces is the N-element zeros vector.
First, I want to reiterate that "dimension" is not the same thing as "rank." The rank is a property of a matrix, and it's the same regardless of whether you are thinking about rows, columns, or null spaces. The ambient dimensionality, in contrast, differs between rows and columns for non-square matrices.
The null space contains one basis vector, which means it has a dimensionality of one, while the column and row spaces each have a dimensionality of 2 (notice that row 3 is a multiple of row 1). The rank of the matrix is also 2. You can see in this example that the dimensionality of the column space plus the dimensionality of the left-null space adds up to the ambient dimensionality of ℝ^3.
Two more examples and then I'll present the rules: if the column space is 2D and embedded in ℝ^2, then the column space already covers the entire ambient space, which means there's nothing for the left-null space to capture; the left-null space must therefore be the empty set. Here's a matrix to illustrate this point; note that the left-null space is empty, because there is no way to combine the rows of the matrix (or, the columns of the matrix transpose) to get a vector of zeros.
Final example: the 2×2 zeros matrix has columns in ambient ℝ^2, but the column space is empty; it contains nothing but a point at the origin. It is 0-dimensional. Therefore, its orthogonal complement must fill up the entirety of ℝ^2. This tells us that the left-null space must be 2-dimensional. What is a basis set for that left-null space? Literally any independent set of two vectors can be a basis set. A common choice in this situation is the identity matrix, but that's because it's convenient, not because it's the only basis set.
The story is the same for the row space and the null space, so I will just re-state it briefly: the row space lives in ambient ℝ^N but can span a lower-dimensional subspace depending on the elements in the matrix. The orthogonal complement (the null space) fills up whatever directions in ℝ^N are not already spanned by the row space. If the matrix is full row rank, then the row space already spans all of ℝ^N and therefore the null space must be empty.
(Diagram: the column space and left-null space live in ℝ^M; the row space and null space live in ℝ^N.)
And with numbered equations in a grey box:
__________________________________________________________________
dim( C(A) ) + dim( N(Aᵀ) ) = M   (8.25)
dim( R(A) ) + dim( N(A) ) = N   (8.26)
__________________________________________________________________
One more relevant formula: the rank of the matrix is the dimensionality of the column space, which is the same as the dimensionality of the row space:
rank(A) = dim( C(A) ) = dim( R(A) )   (8.27)
_________________________________________________________________________
a)
b)
c)
Answers The four numbers below are the dimensionalities of the column
space, left-null space, row space, and null space. You can confirm that the
sum of the first two numbers corresponds to the number of rows, and the sum
of the second two numbers corresponds to the number of columns.
a) 2,1,2,1
b) 2,0,2,4
c) 1,2,1,1
________________________________________________________
8.11 More on Ax = b and Ay = 0
The letters might look different, though. For example, in statistics, the
common form is Xβ = y, where X is called the "design matrix," β is called
the "regression coefficients," and y is called the "observed data." You will
learn more about those terms and what they mean in Chapter 14. But you can
see that the general form is the same.
Now I'd like to tell you more about Ay = 0. It may seem strange that someone would be so interested in finding the null space of a matrix, considering that the null space is the "black hole of no return." In practice, people are not interested in this matrix A per se; instead, they are interested in a shifted version of this matrix, expressed as (A − λI)y = 0.
I hope this helps put things in perspective. It's not the case that every problem in linear algebra boils down to one of these two equations. But as you proceed in your adventures through the jungle of linear algebra, please keep these two equations in mind; the terminology may differ across fields, but the core concepts are the same.
8.12 Exercises
1.Â
For each matrix-vector pair, determine whether the vector is in the
column space of the matrix, and if so, the coefficients that map the
vector into that column space.
a)
,
b)
,
c)
,
d)
,
e)
,
f)
,
g)
,
h)
,
2.Â
Same as the previous exercise but for the row space.
a) T
,
b) T
,
c) T
,
d)
, T
3.Â
For each matrix-set pair, determine whether the vector set can form a basis
for the column space of the matrix.
a)
,
b)
,
c)
,
d)
,
e)
,
4.Â
Determine whether the following matrices have a null space. If so, provide
basis vector(s) for that null space.
a)
b)
c)
d)
5.Â
Fill in the blanks (dim=dimensionality) for matrix A ∈ â„2×3
8.13 Answers
1.Â
a)
g)
h)
2.Â
a) T
b) T
3.Â
a) No. Any column space basis must be a single vector that is a multiple of [1 2]ᵀ.
b) Yes: C(M) = ℝ^2, so any independent set of two vectors can be a basis.
c) Yes
d) Yes
e) Yes for the same reason as (b).
4.Â
a)
b) No null space
c) No null space
d)
5.Â
a) 2
b) 1
c) 0
d) Trick question; dim(C(A)) cannot be greater than 2.
e) Trick question; dim(N(A)) must be >0.
f) 2
g) 1
h) 0
2.
Recall that the dimensionality of the row space equals that of the column space, which equals the rank of the matrix.
Code block 8.5:Python
import numpy as np
from scipy.linalg import null_space
A = np.random.randn(16,9) @ np.random.randn(9,11)   # a 16x11 matrix with rank at most 9
rn = null_space(A)       # basis for the null space
ln = null_space(A.T)     # basis for the left-null space
r  = np.linalg.matrix_rank(A)
print(rn.shape[1]+r)     # dim(N(A)) + rank = 11 (number of columns)
print(ln.shape[1]+r)     # dim(N(A^T)) + rank = 16 (number of rows)
The imaginary operator was, for a long time, just a quirky exception. It was Carl Friedrich Gauss (yes, that Gauss)1 who had the brilliant insight that the imaginary operator was not merely an exceptional case study in solving one kind of equation, but instead, that the imaginary unit was the basis of an entirely different dimension of numbers. These numbers were termed "complex" and had both a "real" part and an "imaginary" part. Thus was born the complex plane as well as the field of complex numbers, ℂ.
1Fun fact: Gauss and many other people despised the term "imaginary," instead arguing that "lateral"
would be better. I (and many others) whole-heartedly agree, but unfortunately "imaginary" remains
standard terminology.
Gauss was interested in complex numbers because he was developing what is now known as the Fundamental Theorem of Algebra. The FTA, as it is sometimes abbreviated, states that an nth order algebraic equation has exactly n roots. That is, an equation of the form
aₙxⁿ + aₙ₋₁xⁿ⁻¹ + … + a₁x¹ + a₀x⁰ = 0   (9.1)
has exactly n solutions for x (remember that x⁰ = 1). The thing is that these solutions might be complex-valued. This important result is the reason why you can get complex-valued eigenvalues from real-valued matrices. More on this in Chapter 15. For now, let's review complex numbers.
You learned in school that numbers exist on a line, with zero in the middle,
negative numbers on the left, and positive numbers on the right, like in Figure
9.1
Don't lose sleep over what complex numbers really mean, whether imaginary numbers have any physical interpretation, or whether intelligent life elsewhere in the universe would also come up with imaginary numbers (I think the answer is Yes, and I hope they call them "lateral numbers"). It doesn't matter. What does matter is that complex numbers are useful from a practical perspective, and they simplify many operations in both theoretical and applied mathematics.
Complex numbers are referred to using the real and imaginary axis
coordinates, just like how you would refer to XY coordinates on a Cartesian
axis. Figure 9.3 shows a few examples of complex numbers as geometric
coordinates and their corresponding labels.
1. Complex numbers are always written using the real part first.
2. In between the two components could be a space, a comma, or a plus
or minus sign. You might also see square brackets, parentheses, or
nothing around the numbers. Variations include [a bi], (a,bi), [a+bi], a
bi.
3. z is the go-to letter to indicate a complex number: z = a+ib. After that,
w is the next-best-thing.
4. You can position the i before or after the imaginary component: [a bi]
or [a ib].
5. Most people use i to indicate the imaginary operator. Engineers tend to
use j because they use i for electrical current. On the other hand,
engineers write handwritten notes in ALL CAPS and start counting at 0,
so let's not be too quick to adopt all engineering practices.
6. To avoid embarrassment at math parties, be careful to distinguish the
following terms:
Complex number: A number that contains two parts, real and
imaginary, like in the notational varieties above.
Imaginary number: A complex number with no real part: [0+ib] =
ib.
Imaginary operator: The symbol (i or j) that represents the square root of minus one (√−1), without any other numbers attached to it.
Imaginary component: This is the real number that multiplies the
imaginary operator. In the number a+ib, the imaginary component
is b, not ib! Geometrically, this corresponds to the distance on the
y-axis on a complex plane.
The reason why complex numbers are so useful is that they pack a lot of
information into a compact representation. For the real number line, a number
has only two pieces of information: its distance away from the zero and its
sign (left or right of zero). A complex number contains more information: the
real part; the imaginary part; the distance of the complex number to the
origin; and the angle of the line from the origin to the complex number,
relative to the positive real axis.
Code There are several ways to create complex numbers in Python and
MATLAB. The complex operator in Python is 1j (or any other number and
j). MATLAB accepts 1i and 1j. Be mindful that Python is more stringent
about data types than is MATLAB. For example, try running the code below
without the second input into the np.zeros() function.
Code block 9.1:Python
import numpy as np
z = complex(3,4)   # the built-in complex type (np.complex was an alias for this and was removed from newer numpy versions)
Z = np.zeros(2,dtype=complex)
Z[0] = 3+4j
Figure 9.4: A complex number ([1 2i]) and its complex conjugate ([1 -2i]).
a) 3 + 4i
b) −6(−5 − i)
c) j
d) 17
Answers
a) 3 − 4i
b) −6(−5 + i)
c) −j
d) 17
___________________________________________________________
-|*|- Reflection Complex conjugate pairs are used in many areas of applied
mathematics. For example, the most efficient way to compute a power
spectrum from the Fourier transform is to multiply the complex-valued
spectrum by its complex conjugate. More germane to this book: A matrix
with entirely real-valued entries can have complex-valued eigenvalues; and
when there are complex-valued eigenvalues, they always come in conjugate
pairs. -|*|-
Subtraction works exactly the same way; just be careful that you are replacing the correct plus signs with minus signs. Pay attention to the pluses and minuses here:
z − w = a + ib − (c + id)
      = (a − c) + i(b − d)
Multiplication Multiplication of complex numbers, unfortunately, does not
work the way you might initially expect. You might have expected (hoped)
that you would separately multiply the two real parts and the two imaginary
parts, and then put them together. Instead, you have to incorporate the cross-
terms. Fortunately, though, the multiplication does follow algebraic rules you
already know for expanding grouped terms.
zw = (a + ib)(c + id)   (9.4)
   = ac + iad + ibc + i²bd   (9.5)
   = ac − bd + i(ad + bc)   (9.6)
Notice that i²bd = −bd.
Notice that the denominator becomes real-valued, which makes the fraction easier to work with. In the interest of learning-by-repetition, here is the concept again in compact form:
z∕w = (a + ib)∕(c + id) = ( (a + ib)(c − id) ) ∕ ( c² + d² )   (9.8)
Practice problems For the following two complex numbers, implement the
indicated arithmetic operations.
a) 2z + wz
b) w(z + z)
c) 5z + 6w
d) 5z + 6w
e)
Answers
a) 30 − 15i
b) −12 − 30i
c) 3
d) 3 + 60i
e) (1 + 12i)
____________________________________________________
Notice that nothing happened to the real-valued elements (first and third
entries). For this reason, the Hermitian and "regular" transpose are identical
operations for real-valued matrices.
Dot product with complex vectors The dot product with complex vectors is computed exactly the same way as the dot product with real-valued vectors: element-wise multiply and sum.
However, in nearly all cases, the "regular" dot product is replaced with the
Hermitian dot product, which simply means to implement the dot product as
zᴴw instead of zᵀw.
Figure 9.5:The vector [0 i] has length=1, which the dot
product formula must produce.
Why are complex vectors conjugated when computing the dot product? The
answer will be obvious from a geometric perspective: Recall that the dot
product of a vector with itself is the squared length of the line represented by
that vector. Consider what happens when we compute the length of a
complex vector that we know has length=1 (Figure 9.5):
Clearly, the third and fourth options provide accurate results. This example also shows that it doesn't matter which vector is conjugated, although the third option (vᴴv) is generally preferred for typographical reasons.
Code The MATLAB dot() function always implements the Hermitian dot
product, because conjugating real-valued numbers has no effect. In Python,
however, you need to use vdot() instead of dot().
Code block 9.5:Python
import numpy as np
v = [0,1j]
print(np.dot(v,v))    # -1: no conjugation, so the "length" comes out wrong
print(np.vdot(v,v))   # 1: vdot conjugates the first input
Code block 9.6:MATLAB
v = [0 1i];
dot(v,v)
__________________________________________________________________
Practice problems For the following vectors, implement the specified
operations.
a) zH(z + w)
b)
c) 2z + w ⊙ z
Answers
a) 27 − 2i
b) −10
c)
__________________________________________________________________
(9.9)
Notice that the diagonal elements must be real-valued, because only real-
valued numbers are equal to their complex conjugate (a + 0i = a − 0i).
Notice also that Hermitian matrices may contain numbers with non-zero-
valued imaginary parts.
You will have the opportunity to confirm that this matrix is indeed unitary in
the code challenges. There is a lot more that could be said about complex
matrices in linear algebra. However, the topics introduced in this chapter
cover what you will need to know for most linear algebra applications,
including everything you need to know for the rest of this book.
9.7 Exercises
1.Â
Implement the specified operations using the following variables.
a) wd
b) dHwd
c) Rd
d) RHRd
e) wz
f) wz∗
g) wdR
h) wdHR
9.8 Answers
1.Â
a)
b) 80 + 200i
c)
d)
e) 5 − 2i
f) −5 + 2i
g) Wrong sizes!
h)
1.
Earlier in this chapter I showed an example of a unitary matrix. Confirm that it is unitary by showing that UᴴU = I. Also confirm that UᵀU ≠ I.
2.
In Chapter 6 you learned two methods (additive and multiplicative) to
create a symmetric matrix from a non-symmetric matrix. What happens
when you apply those methods to complex matrices? To find out,
generate a 3×3 matrix of complex random numbers. Then apply those
two methods to generate two new matrices, and test whether those
matrices are (1) symmetric, (2) Hermitian, or (3) neither.
9.10 Code solutions
1.Â
Both Python and MATLAB allow you to implement Hermitian and
regular transpose operations. Be careful that you are implementing the
correct version! Notice that the method .H in Python requires converting
the matrix into a matrix object.
Code block 9.7:Python
import numpy as np
U = .5*np.array([ [1+1j,1-1j],[1-1j,1+1j] ])
print(U@np.matrix(U).H)   # Hermitian transpose: gives the identity matrix
U@U.T                     # "regular" transpose: does not give the identity
Code block 9.8:MATLAB
U = .5*[1+1i 1-1i; 1-1i 1+1i];
U'*U % Hermitian
transpose(U)*U % not Hermitian
2.
Both methods for creating symmetric matrices (A + Aᵀ and AᵀA) work for complex matrices, except that the resulting matrices are Hermitian, not symmetric. As mentioned earlier, be mindful that you are taking the Hermitian transpose, not the "regular" transpose. Python does not have a built-in function to test whether a matrix is Hermitian, but subtracting two equal matrices will produce the zeros matrix.
Code block 9.9:Python
import numpy as np
r = np.random.randn(3,3)
i = np.random.randn(3,3)
A = np.matrix( r+i*1j )   # a random complex matrix
A1 = A+A.H                # additive method
A2 = A.H@A                # multiplicative method
print(A1-A1.H)            # zeros matrix => A1 is Hermitian
print(A2-A2.H)            # zeros matrix => A2 is Hermitian
In this chapter you will learn how to represent systems of equations using
matrices and vectors, and how to solve those systems using linear algebra
operations. This knowledge will be central to a range of applications
including matrix rank, the inverse, and least-squares statistical model fitting.
To be sure, this single equation does not need matrices and vectors. In fact,
using matrices to solve one equation simply creates gratuitous confusion. On
the other hand, consider the following system of equations.
2x + 3y − 5z = 8
− 2y + 2z = −3
5x − 4z = 3
You can imagine an even larger system, with more variables and more equations. It turns out that this system can be represented compactly using the form Ax = b. And this isn't just about saving space and ink: converting a system of many equations into one matrix equation leads to new and efficient ways of solving those equations. Are you excited to learn? Let's begin.
Now things are starting to get interesting. The point on the graph where the
two lines intersect is the solution to both equations. In this case, that point is
(x,y) = (4∕3,8∕3). Try this yourself by plugging those values into both
equations in system 10.3. You can also try points that are on one line but not
the other; you will find that those pairs of numbers will solve only one of the
equations. Try, for example, (0,4) and (-4,0).
When learning about equations in high school, you learned that any
arithmetic operation performed on one side of the equation must be done to
the other. For example, in the equation x = 4, you may multiply both sides by
8 (8×x = 8×4), but it is not allowed to multiply only one side by 8.
With a system of equations, you still have the same rule, although you don’t
have to apply the same operation to all equations in the system. But
having a system of equations allows you to do something more: You may add
and subtract entire equations from each other (this is analogous to
multiplying the left-hand side by "8" and the right-hand side by "4×2").
Let’s try this with Equation 10.3. I will transform the first equation to be
itself minus the second equation:
(10.4)
Next, I will replace the second equation by itself plus two times the original
first equation:
(10.5)
At a superficial glance, system 10.5 looks really different from system 10.3.
What does the graph of this "new" system look like? Let’s see:
So the conclusion here is that you can take a system of equations, scalar
multiply individual equations, and add and subtract equations from each
other, to your heart’s delight. The individual equations—and the lines
representing those equations—will change, but the point of intersection will
remain exactly the same (of course, this statement breaks down if you scalar-
multiply by 0, so let’s exclude this trivial annoyance).
Now let me explain why this concept is powerful. Consider the system of
equations 10.6. Take a moment to try to find the solution to this system (that
is, the (x,y) pair that satisfies both equations); try it based on visual
inspection without writing anything down and without adding or subtracting
the equations from each other.
(10.6)
Did you get it? It’s actually pretty difficult to solve in your head. Now I
will add the first equation to the second; try again to solve the system without
writing anything down.
(10.7)
Suddenly, this new second equation is much easier to solve. You see
immediately that y = 1, and then you can plug that into the first equation to
get x = 5∕2 = 2.5. Thus, the solution is now easier to calculate because the
second equation decouples y from x. This is the principle of row reduction
and Gaussian elimination, which you will learn about later in this chapter.
1. Variables: These are the unknowns that you want to solve for. They
are typically labeled x, y, ..., or x1, x2, ...
2. Coefficients: These are the numbers that multiply the variables. There
is one coefficient per variable. If the variable is sitting by itself, then the
coefficient is 1; if the variable is not present, then the coefficient is 0.
3. Constants: These are the numbers that do not multiply variables. Every
equation has one constant (which might be 0).
Sometimes, these components are easy to spot. For example, in the equation
2x + 3y − 4z = 5   (10.8)
The variables are (x,y,z), the corresponding coefficients are (2,3,−4), and
the constant is 5. In other equations, you might need to do a bit of arithmetic
to separate the components:
(10.9)
The variables are still (x,y,z), the coefficients are (2,6,−1), and the constant
is 6.
Observe how this system is translated into matrices and vectors. In particular,
observe how the coefficients, variables, and constants are organized:
(10.11)
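To make the translation concrete, here is a short Python snippet (my own illustration, built from the three-equation system shown at the start of this chapter) that collects the coefficients into a matrix A and the constants into a vector b, and then verifies a solution:
import numpy as np

# the system from earlier in this chapter:
#   2x + 3y - 5z =  8
#        -2y + 2z = -3
#   5x       - 4z =  3
A = np.array([[2., 3., -5.],
              [0., -2., 2.],
              [5., 0., -4.]])
b = np.array([8., -3., 3.])

x = np.linalg.solve(A, b)   # the vector of unknowns (x, y, z)
print(A @ x)                # reproduces b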
You can see that converting a system of equations into a matrix equation is
not terribly difficult. But it does require vigilance, because a simple mistake
here can have dire consequences later on. For example, consider the
following system:
(10.12)
a)
b)
c)
d)
Answers
a)
=
b)
=
c)
=
d)
=
__________________________________________________________________
Practice problems (2 of 2) Convert the following matrix-vector products
into their "long-form" equations (i.e., the opposite of the previous exercises).
a)
=
b)
=
c)
d)
=
Answers
c)
d)
__________________________________________________________________
-|*|- Reflection Sometimes, the biggest challenge in data analysis and
modeling is figuring out how to represent a problem using equations; the rest
is usually just a matter of algebra and number-crunching. Indeed, the
translation from real-world problem to matrix equation is rarely trivial and
sometimes impossible. In this case, representing a system of equations as
matrix-vector multiplication leads to the compact and simplified notation: Ax
= b. And this form leads to an equally compact solution via the least-squares
algorithm, which is a major topic of Chapter 14. -|*|-
10.3 Row reduction, echelon form, and pivots
Row reduction may initially seem like a tangent from representing and
solving systems of equations; however, it provides the computational
backbone for solving those systems.
Row reduction involves modifying rows of a matrix while leaving many key
properties of the matrix intact. It is based on the principle that you learned
about in section 10.1: in a system of equations, individual equations can be
scalar-multiplied, added, and subtracted. Thus, row reduction involves
linearly combining the rows of a matrix, with the goal of transforming a
matrix into a form that facilitates subsequent inspection and analyses.
So how does row reduction work? It’s exactly the same procedure we
applied in Equation 10.6 (page §): Replace rows in the matrix with linear
combinations of other rows in the same matrix. Let’s start with an
example.
Consider the matrix defined by the coefficients in the equation system 10.6
(the constants vector is omitted for now; after learning the mechanism of row
reduction, you’ll learn how to incorporate the right-hand side of the
equations). Notice that by adding the first row to the second row, we get a
zero in the (2,1) position.
These two matrices are not the same, but they are related to each other by a
simple linear operation that could be undone if we keep track of which rows
were added to which other rows. That linear operation increased the number
of zeros in the matrix from none to one, and so is consistent with our goal. In
this case, the linear transformation converted the original matrix into an
upper-triangular matrix, which, in the parlance of row reduction, is called the
echelon form of the matrix.
Echelon form One of the main goals of row reduction is to convert a dense
matrix into its echelon form.3 A matrix is in its echelon form when the
following two criteria are satisfied:
3Sometimes also called
row echelon form.
1. The first non-zero number in each row is to the right of the first non-
zero numbers in the rows above.
2. Rows of all zeros are below rows with at least one non-zero element.
Notice how each of these matrices conforms to the two criteria: the first non-
zero term in each row is to the right of the first non-zero term in the rows
above, and rows of all zeros are on the bottom.
Obviously, most matrices are not already in their echelon form. This is where
row reduction comes in: Apply row reduction to a matrix until it reaches
echelon form. (Some matrices need to have rows swapped; we’ll deal
with this situation later.)
Let’s try an example with a 3×3 matrix. The goal is to find the
multiples of some rows to add to other rows in order to obtain the echelon
form of the matrix. It’s often easiest to start by clearing out the bottom
row using multiples of the top row(s).
One of the nice features of the echelon form is that linear dependencies in the
columns and rows reveal themselves. In the example above, you can
immediately see that the third column cannot be formed from a linear
combination of the first two. It is similarly straightforward to use the
presence of zeros to convince yourself that the first row cannot be created by
combinations of the other two rows (for example, nothing times zero can
produce the 1 in the top-left element). So, this matrix comprises a set of
linearly independent columns (and rows), which means this is a rank-3 (full-
rank) matrix.
Watch what happens to the echelon form when there are linear dependencies.
When the columns (or rows) of a matrix form a linearly dependent set, the
echelon form of the matrix has at least one row of zeros.
In fact, this is one way to compute the rank of a matrix: Transform it to its
echelon form, and count the number of rows that contain at least one non-
zero number. That count is the rank of the matrix. I’ll have more to say
about the relationship between rank and row reduction in a later section, but
this statement (rank is the number of non-zeros rows in the echelon form) is
the main idea.
______________________________________________________________
Practice problems Convert the following matrices into their echelon form.
a)
b)
c)
Answers
a)
b)
c)
_________________________________________________
A few tips for row reduction Row reduction is admittedly kind of a weird
procedure when you first start doing it. But after you solve several problems,
it will start to feel more natural. Here are a few tips that might help you avoid
difficult arithmetic. These are not steps that you always implement; these are
strategies to keep in mind that might make things easier.
It’s trivial that IA = A. But what would happen if we change the identity
matrix just a bit? Let’s see: 4
4This is no longer the identity matrix, so I’ll call it R for reduction.
What we’ve done here is keep row 1 the same and double row 2. But how
do we linearly combine rows? Well, if changing the diagonal elements affects
only the corresponding row, then we need to change the off-diagonal
elements to combine rows. To replace row-2 with row-1 plus row-2, you put
a 1 in the (2,1) position of the R matrix:
More generally, to multiply the ith row by σ and add it to the jth row, you
enter σ into the (j,i) entry of the identity matrix to form an R matrix.
Each step n of row reduction has its own Rn matrix, each new one multiplying
on the left of A and of any previously applied R matrices. Depending on why you are
implementing row reduction, you might not need to keep track of the
transformation matrices, but it is important to understand that every echelon
matrix E is related to its original form A through a series of transformation
matrices:
(10.14)
This also shows how row reduction does not actually entail losing
information in the matrix, or even fundamentally changing the information in
the matrix. Instead, we are merely applying a sequence of linear
transformations to reorganize the information in the matrix. (That said, row
reduction without storing the R matrices does involve information-loss;
whether that matters depends on the goal of row-reduction. More on this
point later.)
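As an illustration (the matrix values here are my own), the following Python snippet builds an R matrix that replaces row 2 with row 1 plus row 2, and applies it to a small matrix:
import numpy as np

A = np.array([[ 2., 3.],
              [-2., 2.]])

# start from the identity matrix and put a 1 in the (2,1) position:
# this implements "row 2  <-  row 2 + 1*(row 1)"
R = np.eye(2)
R[1,0] = 1

print(R @ A)    # the element in row 2, column 1 of the product is now zero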
Exchanging rows in a matrix In the examples of row reduction thus far, the
matrices "just so happened" to be constructed such that each row had nonzero
elements to the right of nonzero elements in higher rows (criterion #1 of an
echelon form matrix). This is not guaranteed to happen, and sometimes rows
need to be exchanged. Row exchanges have implications for several matrix
properties.
For example, if you want to exchange the first and second rows of a 3×3
matrix, create matrix P, which is the identity matrix with the first two rows
exchanged, then left-multiply by this permutation matrix.
(10.15)
(10.16)
__________________________________________________________________
If you are confused about why PA and AP are so different, you might
consider reviewing section 6.1 on thinking about matrix multiplication as
linear combinations of rows vs. columns.
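A quick numerical demonstration (my own example values) of left- versus right-multiplying by a permutation matrix:
import numpy as np

A = np.arange(9.).reshape(3,3)

# P is the identity matrix with the first two rows exchanged
P = np.eye(3)[[1,0,2],:]

print(P @ A)    # exchanges the first two ROWS of A
print(A @ P)    # instead exchanges the first two COLUMNS of A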
Now we can add to our general formula for transforming a matrix to its
echelon form (Equation 10.14) to include row exchanges.
(10.17)
(You might need additional permutation matrices, in which case they would
be called P1, P2, and so on.)
Notice that the right-most matrix is not in its echelon form, because the
leading non-zero term in the third row is to the left of the leading non-zero
term in the second row. Swapping the second and third rows will set things
right, and we’ll have our proper echelon form.
Row swapping has implications for the sign of the determinant, which is a
number associated with a square matrix that is zero for all singular square
matrices and non-zero for full-rank matrices. You’ll learn more about
determinants in the next chapter, including the effect of row swaps. I mention
it here to create a space in your brain for this information to be slotted in. To
quell suspense: Each row swap flips the sign of the determinant but does not
change its magnitude.
After putting the matrix into echelon form, the pivots are the left-most non-
zero elements in each row. Not every row has a pivot, and not every column
has a pivot. A zero cannot be a pivot even if it’s in a position that could
be a pivot. That’s because pivots are used as a denominator in row-
reduction, and terrible things happen when you divide by zero. In the
following matrix, the gray boxes highlight the pivots.
Practice problems Use row reduction to transform the matrix into its
echelon form, and then identify the pivots.
a)
b)
c)
b)
c)
__________________________________________________________________
Pivot-counting and rank The rank of a matrix is the number of pivots in the
echelon form of that matrix. Let’s think about why this is the case.
During row reduction, any row that can be created using a linear weighted
combination of other rows in the matrix will turn into a row of zeros. We
could phrase this another way: If row reduction produces a zeros row, then it
means that some linear combination of row vectors equals the zeros vector.
This is literally the definition of linear dependence (Equation 4.6, page §).
But this only proves that row reduction can distinguish a reduced-rank matrix
from a full-rank matrix. How do we know that the number of pivots equals
the rank of the matrix?
One way to think about this is using the theorem presented in Chapter 8, that
the rank of a matrix is the dimensionality of the row space. Row reduction
makes the dimensionality of the row space crystal clear, because all rows that
can be formed by other rows are zeroed out. Any remaining rows necessarily
form a linearly independent set, and the rank of a matrix is the largest number
of rows that can form a linearly independent set. Each non-zeros row in the
echelon form contains exactly one pivot; hence, the number of pivots in a
matrix equals the number of non-zeros rows in the echelon form of the
matrix, which equals the dimensionality of the row space (and thus also the
dimensionality of the column space), which equals the rank.
For this same reason, row reduction reveals a basis set for the row space: You
simply take all the non-zeros rows.
So the echelon form "cleans up" the matrix to reveal the dimensionality of the
row space and therefore the rank of the matrix. And the "cleaning" process
happens by moving information around in the matrix to increase the number
of zero-valued elements.
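As a quick check in Python (my own example matrix), the number of pivots reported by sympy's rref matches numpy's rank function:
import numpy as np
import sympy as sym

A = np.array([[1., 2., 3.],
              [2., 4., 6.],    # twice the first row: a linear dependency
              [0., 1., 1.]])

pivots = sym.Matrix(A).rref()[1]    # indices of the pivot columns
print(len(pivots))                  # 2 pivots
print(np.linalg.matrix_rank(A))     # rank is also 2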
__________________________________________________________
Practice problems Use row reduction to compute the rank of the following
matrices by counting pivots.
a)
b)
c)
Answers
a) Easy: no pivots!
r=0
b)
,r = 3
c)
,r = 2
__________________________________________________________________
Non-uniqueness of the echelon form Did you always get the exact same
echelon matrices that I got in all the exercises above? You might have, but
you might also have gotten some slightly different results.
The echelon form of a matrix is non-unique. This means that a given matrix
can have multiple—equally valid—associated echelon-form matrices.
That’s because of row-swapping and row-scalar multiplications. Take
any echelon matrix, pick a row at random, and multiply that randomly
selected row by some random scalar. There is literally no end to the possible
matrices you can create this way, and they will all be the echelon form of the
original matrix (assuming no arithmetic mistakes).
On the other hand, some features of the infinite number of echelon form
matrices for a given matrix are constant. In particular, the number of pivots
will always be the same (although the numerical values of the pivots may
differ), as will the number of zeros rows (this is obvious, because the number
of zeros rows is M − r, where M is the number of rows and r is the rank).
Now we’re ready for step 3. What is "back substitution"? This means
mapping the matrix-vector equation back into a "long-form" system of
equations, and then solving the system from the bottom row to the top row.
(10.24)
Take a moment to compare system 10.24 to matrix 10.23, and make sure you
see how they map onto each other.
Now the solution to this system is easy to compute, and you can understand
where the name "back-substitution" comes from: Starting from the bottom
row and moving upwards, substitute variables in each equation until you’ve
solved for all the variables. In this case, we already see that y = 1, and
substituting that into the first equation leads to x = 5∕2.
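Back-substitution is also easy to express in code. Below is a minimal sketch (the function and the example upper-triangular system are my own, chosen so that the solution matches the (x,y) = (5/2, 1) found above):
import numpy as np

def back_substitute(U, b):
    # solve Ux = b for an upper-triangular U, working from the bottom row up
    n = len(b)
    x = np.zeros(n)
    for i in range(n-1, -1, -1):
        x[i] = (b[i] - U[i,i+1:] @ x[i+1:]) / U[i,i]
    return x

U = np.array([[2., 2.],
              [0., 3.]])
b = np.array([7., 3.])
print(back_substitute(U, b))   # [2.5, 1.]
print(np.linalg.solve(U, b))   # same answer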
Are you impressed by Gaussian elimination? Perhaps you are now thinking
that this is the pinnacle of using linear algebra to solve systems of equations,
and it couldn’t possibly get better than this. Well, prepare to be proven
wrong!
__________________________________________________________________
a)
b)
Answers
a)
b)
_________________________________________________
-|*|- Reflection Before the Internet age, communication among scientists was
slow, difficult, and limited. Geographical distances and language barriers
exacerbated the problem. I’m not a math historian, but I did read up a bit
about the history of the term "Gaussian elimination." Apparently, the method
was known to Chinese mathematicians even before Christ was born. Gauss
himself did not discover the method, but improved it, along with better
notation, from Newton (who might have discovered it independently of the
Chinese). The attribution to Gauss was made in the 20th century, 100 years
after his death. He could neither correct the record nor inappropriately take
credit. -|*|-
It is often useful to apply step 2 from the bottom row of the matrix to the top.
Thus, to obtain the echelon form of the matrix, you work from the top row
down to the bottom; and to obtain the RREF, you work from the bottom row
(or the last row with pivots if there are rows of zeros) back up to the top.
Below are a few examples of matrices and their row-reduced echelon forms;
notice that all pivots have a value of one and that the pivot is the only non-
zero entry in its column.
At this point, you might be concerned that the mapping from a matrix to its
RREF is many-to-one, meaning that many different matrices will become the
same row-reduced echelon matrix. That’s true: although each matrix has
a unique RREF (that is, there is exactly one RREF associated with a matrix),
many different matrices can lead to the same RREF. For example, all square
full-rank matrices have the identity matrix as their RREF.
However, you already know how to keep track of the progression from the
original matrix to its RREF: Put each step of row reduction and row swaps
into transformation matrices, as described earlier in this chapter. On the other
hand, the usual goal of RREF is to solve a system of equations, and that
solution is the same for the original matrix, the echelon form, and the RREF.
Thus, in practice, you will probably not need to worry about saving the
intermediate transformation matrices.
Code Computing the RREF of a matrix is easy. In Python, you must first
convert the matrix (which is likely to be stored as a numpy array) into a
sympy matrix.
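A minimal sketch of that code (variable names are my own; sympy's rref() returns the RREF matrix along with the indices of the pivot columns):
import numpy as np
import sympy as sym

A = np.random.randn(3,3)       # a numpy array
M = sym.Matrix(A)              # convert to a sympy matrix
rrefM, pivots = M.rref()       # row-reduced echelon form and pivot columns
print(rrefM)
print(pivots)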
a)
b)
c)
Answers
a)
b)
c)
_________________________________________________
OK, so now it’s time for back-substitution... except you don’t need to
do any back-substitution! All the work is done for you; you simply read off
the solutions right from the equations.
_____________________________________________
a)
b)
c)
Answers
a)
b)
c) x=y=z=1
_________________________________________________
-|*|- Reflection Math has many life-lessons if you look at it in the right way.
The wisdom in this chapter is the following: Try to solve a problem (system
of equations) with no preparation or organization, and you will spend a lot of
time with little to show except stress and frustration. Do some preparation
and organization (Gaussian elimination) and you can solve the problem with
a bit of dedication and patience. And if you spend a lot of time preparing and
strategizing before even starting to solve the problem (Gauss-Jordan
elimination), the solution may simply reveal itself with little additional work.
You’ll be left with a deep sense of satisfaction and you will have earned
your Mojito on the beach. -|*|-
Do you spot the problem? Let’s map the RREF back onto a system of
equations:
(10.27)
Infinite solutions (Figure 10.5C) Geometrically, this means that the two
equations are collinear; they are in the same subspace. Here’s a system
that would produce Figure 10.5C:
(10.28)
Notice that the second equation is the first multiplied by 2. And here is that
system’s corresponding matrix and RREF, and below that, the mapping
back into a system of equations:
(10.29)
Yeah, it’s not really a "system" anymore, unless you want to include an
extra equation that reads 0 + 0 = 0.
So what is the solution to this equation? There are many. Pick whatever you
want for x and then solve for y. Or pick whatever you want for y and solve
for x.
-|*|-
To quell suspense, the main take-home message here is that row reduction
does not affect the row space of the matrix, but it can drastically affect the
column space of the matrix; on the other hand, the dimensionalities of the
matrix subspaces do not change. Now let’s work through this
methodically.
Let’s talk about rank first. You already know that rank is a property of
the matrix, not of the rows or the columns. The rank doesn’t change
before vs. after row-reduction. In fact, row-reduction makes the rank easier to
compute (count the pivots).
Next let’s think about the row space. That doesn’t change after row
reduction. To understand why, think about the definition of a subspace and
the process of row reduction: A subspace is defined as all possible linear
combinations of a set of vectors. That means that you can take any constant
times any vector in the subspace, add that to any other constant times any
other vector in the subspace, and the resulting vector will still be in the
subspace.
Now think about the process of row reduction: you take some constant times
one row and add it to another row. That is entirely consistent with the
algebraic definition of a subspace. So you can do row-reduction to your
heart’s delight and you will never leave the initial subspace.
The only characteristic of the row space that could change is the basis
vectors: If you take the rows of the matrix (possibly sub-selecting to get a
linearly independent set) as a basis for the row space, then row reduction will
change the basis vectors. But those are just different vectors that span the
same space; the subspace spanned by the rows is unchanged before, during,
and after row reduction.10
10Anyway, the rows tend to be a poor choice for basis vectors. You will understand why when you
learn about the SVD.
Now let’s talk about the column space. The column space actually can
change during row reduction. Let me first clarify that the dimensionality of
the column space does not change with row reduction. The dimensionality
will stay the same because the dimensionality of the row space, and the rank
of the matrix, are unaffected by row reduction.
But what can change is the actual subspace that is spanned by the columns.
This can happen when the column space occupies a lower-dimensional
subspace of the ambient space in which the columns live. The reason is that
row reduction involves changing entire rows at a time; individual elements of
a column will change while other elements in the same column stay the same.
I believe that one clear example will suffice for understanding. Below you
can see a matrix and its RREF.
(10.30)
The two columns form a linearly independent set in both matrices (clearly
seen in both, though easier to see in the RREF), so we can take the columns
as bases for the column spaces. In both matrices, the column space is a 2D
plane embedded in a 3D ambient space.
But are they the same plane, i.e., the same subspace? Not at all! They
intersect at the origin, and they have a line in common (that is, there is a 1D
subspace in both 2D subspaces), but otherwise they are very different from
each other—before RREF it was a tilted plane and after RREF it’s the
XY plane at Z=0 (Figure 10.6).
On the other hand, keep in mind that row reduction is not necessarily
guaranteed to change the column space. Consider the following matrix and its
RREF:
It’s a full-rank matrix and its RREF is the identity matrix. The column
space spans all of ℝ2 both before and after row reduction. Now this situation
is analogous to the row space: The elements in the columns are different but
the subspaces spanned by those columns are exactly the same.
Again, the take-home messages in this section are that row reduction (1) does
not affect the rank or the dimensionalities of the matrix subspaces, (2) does
not change the row space, and (3) can (but does not necessarily) change the
column space.
10.9 Exercises
1.
Reduce the following matrices to their echelon form. Highlight the
pivots.
a)
b)
2.
Given the following matrix sizes and ranks, determine the number of zeros
rows in the row-reduced echelon form.
a) ℝ2×3, r = 2
b) ℝ3×2, r = 2
c) ℝ7×7, r = 7
d) ℝ7×7, r = 3
e) ℝ7×2, r = 2
f) ℝ2×7, r = 2
g) ℝ6×7, r = 0
h) ℝ4×4, r = 4
3.
Use row reduction to determine whether the following systems have zero,
one, or an infinite number of solutions. You don’t need to solve the
system.
a)
b)
c)
4.
Solve the following systems of equations by converting the equations into a
matrix equation and then applying Gaussian elimination and back-substitution
(or Gauss-Jordan elimination).
a)
b)
10.10 Answers
1.
a)
b)
2.
a) 0
b) 1
c) 0
d) 4
e) 5
f) 0
g) 6
h) 0
3.
a) No solutions
b) ∞ solutions
c) One solution
4.
And voilà! You now have all the pieces of a system of equations. If you
want to create systems that have zero, one, or infinite solutions, you will need
to make some minor adjustments to the above procedure (e.g., by changing a
coefficient after already solving for the constants).
2.
The margin notes in section 10.5 described patterns in the RREF of a matrix
that depend on the size of the matrix. The idea is that every RREF is
essentially the identity matrix, possibly with some additional rows of zeros or
columns of non-zero numbers. Write code to explore these possibilities. In
particular, create matrices of random numbers that are (1) square, (2) wide, or
(3) tall; and then compute and examine the RREF of those matrices.
Chapter 11
Determinant
Thing 1: only for square matrices The determinant is defined only for
square matrices. Any time you hear or read the word "determinant," you can
immediately start thinking about square (M×M) matrices. Therefore, all of
the matrices in this chapter are square.
Thing 3: zero for singular matrices The determinant is zero for a non-
invertible matrix—a matrix with rank r < M. You can also say that a matrix
with linear dependencies in the columns or in the rows has a determinant of
zero. In fact, when you learn about computing the inverse in the next chapter,
you’ll learn that you should compute the determinant of a matrix before
even starting to compute the inverse.
However, there are handy short-cuts for computing the determinant of 2×2
and 3×3 matrices. Therefore, I will introduce these short-cuts first, and
then present the "full" determinant formula thereafter.
__________________________________________________________________
a)
=1
b)
= −1
c)
= −3
d)
= −3
e)
=0
f)
= 15
g)
= −8
h)
= −3
Answers
a) Correct
b) Correct
c) No, Δ = −53
d) No, Δ = 3
e) Correct
f) No, Δ = 0
g) No, Δ = −15
h) Correct
_______________________________________________________
Determinant and transpose The determinant is robust to the transpose
operation. In other words, det(A) = det(Aᵀ). This is because the determinant
is a property of the matrix, not of the rows or the columns. This is easy to
prove for a 2×2 matrix with elements a, b, c, d:
det(A) = ad − bc   (11.6)
det(Aᵀ) = ad − cb   (11.7)
Notice that the only difference between the matrix and its transpose is the
swapping of b and c. Scalar multiplication is commutative, and thus the
determinant is unaffected by the transpose.
This proof is valid only for 2×2 matrices because Equation 11.1 is a short-
cut for 2×2 matrices. However, this property does generalize to any sized
square matrix. We will revisit this concept when learning about 3×3
matrices.
This is interesting because it means that we can solve for a particular matrix
element if we know the determinant. Let’s see an example:
So we start with the knowledge that the matrix determinant is 4 and then we
can solve for a unique matrix element λ that makes the equation true. Let’s
try another example:
Notice how the equation worked out: There were two λs on the diagonal,
which produced a second-order polynomial equation, and thus there were two
solutions. So in this case, there is no single λ that satisfies our determinant
equation; instead, there are two equally acceptable solutions. That’s no
surprise: The Fundamental Theorem of Algebra tells us that an nth order
polynomial has exactly n roots (though they might be complex-valued).
Now I want to push this idea one step further. Instead of having only λ on
the diagonal of the matrix, let’s have both a real number and an unknown.
I’m also assuming that the matrix below is singular, which means I know
that its determinant is zero.
(1 − λ)² − 9 = 0   (11.9)
which is solved by λ = 4 or λ = −2. Setting λ = 4, for example, makes the matrix singular.
Now I will replace that specific matrix with a letter so that it generalizes to
any square matrix of any size. That gives us the characteristic polynomial of
the matrix.
____________________________________________________________
_________________________________________________________________________
But more importantly for our purposes in this book, when the characteristic
polynomial is set to zero (that is, when we assume the determinant of the
shifted matrix is 0), the λ’s—the roots of the polynomial—are the
eigenvalues of the matrix. Pretty neat, eh? More on this in Chapter 15. For
now, we’re going to continue our explorations of determinants of larger
matrices._________________________________________
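To see the link between the characteristic polynomial and eigenvalues numerically, here is a small Python illustration (the matrix is my own choice, picked so that its shifted determinant gives (1 − λ)² − 9 as above):
import numpy as np

A = np.array([[1., 3.],
              [3., 1.]])

# characteristic polynomial of a 2x2 matrix: l^2 - trace(A)*l + det(A)
coefs = [1, -np.trace(A), np.linalg.det(A)]
print(np.roots(coefs))          # [ 4. -2.]
print(np.linalg.eigvals(A))     # the same values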
a)
b)
c)
d)
e)
f)
g)
h)
Answers
a) λ = 8
b) λ = 0
c) λ = 1∕4
d) λ = −7
e) λ = 5∕9
f) λ = ±2
g) λ = ±3
h) λ = 6,4
_______________________________________________________
It’s easiest to understand visually. There are two ways to think about the
procedure, one that "wraps around" the matrix and one that augments the
matrix with the first two columns (Figure 11.1). Of course, these aren’t
really different methods, just different ways of interpreting the same
procedure. Whichever picture you find easier to remember is the one you
should focus on.
Figure 11.1:Two visualizations of the short-cut for
computing the determinant of a 3×3 matrix.
I hope you find the pictures intuitive and memorable. Trying to memorize the
algebraic equation is a terrible idea, but it’s printed below for
completeness.
__________________________________________________________________
For a 3×3 matrix with rows [a b c], [d e f], and [g h i]:
det = aei + bfg + cdh − ceg − bdi − afh   (11.12)
________________________________________________________________________
Let’s see a few examples with numbers.
The third example has a determinant of zero, which means the matrix is
singular (reduced-rank). It’s singular because the columns (or rows) form
a dependent set. Can you guess by visual inspection how the columns are
linearly related to each other? It might be easier to see from looking at the
rows.
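You can also confirm the short-cut numerically; the snippet below (my own example matrix) compares Equation 11.12 against numpy's determinant function:
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 0.]])
a,b,c, d,e,f, g,h,i = A.flatten()

det_shortcut = a*e*i + b*f*g + c*d*h - c*e*g - b*d*i - a*f*h
print(det_shortcut)          # 27
print(np.linalg.det(A))      # also 27 (up to rounding error)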
a)
= −2
b)
= −3
c)
= −3
d)
=8
Answers
a) No, Δ = −3
b) No, Δ = +3
c) Correct
d) No, Δ = 0
____________________________________________________
Practice problems Which value(s) of λ would make the following matrices
singular?
a)
b)
c)
d)
Answers
a) any λ
b) λ = 9
c) λ = 3
d) λ = 1,3,5
_____________________________________________________
Determinant and transpose We can now show that the determinant of a 3×3
matrix is the same before and after transposing. It takes slightly longer to
prove than for the 2×2 case, but is still worth the effort. The ordering of
the terms is different, but the individual products are the same, as are their
signs.
det(A) = aei + bfg + cdh − ceg − bdi − afh   (11.14)
det(Aᵀ) = aei + dhc + gbf − gec − dbi − ahf   (11.15)
It turns out that the aforediscussed tricks for computing the determinants of
2×2 and 3×3 matrices are simplifications of the full procedure to
compute the determinant of any sized matrix. In this section you will learn
that full procedure. It gets really complicated really quickly, so I will
illustrate it using a 4×4 matrix, and then you’ll see how this simplifies
to the smaller matrices.
In general, the procedure is to multiply the ith element of the first row of the
matrix by the determinant of the 3×3 submatrix created by excluding the
first row and the ith column. That gives four numbers. You then add the 1st and 3rd,
and subtract the 2nd and 4th. Figure 11.2 shows the operation.
The full procedure scales up to any sized matrix. But that means that
computing the determinant of a 5×5 matrix requires computing 5
determinants of 4×4 submatrices, and each of those submatrices must be
broken down into 3×3 submatrices. Honestly, I could live 1000 years and
die happy without ever computing the determinant of a 5×5 matrix by
hand. ________________________________________________
a)
b)
c)
Answers
a) -10
b) 72
c) 0
_________________________________________________________________________
In fact, for some matrices, it’s easier to apply row reduction to get the
matrix into its echelon form3 and then compute the determinant as the product
of the diagonal. HOWEVER, be aware that row exchanges and row-scalar-
multiplication affect the determinant, as you will learn in subsequent sections.
3Useful tip for exams!
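One way to check this numerically is with an LU decomposition, which performs row reduction and tracks the row exchanges in a permutation matrix P (this snippet is my own illustration and uses scipy):
import numpy as np
from scipy.linalg import lu

A = np.random.randn(4,4)
P, L, U = lu(A)       # A = P @ L @ U; U is an upper-triangular (echelon) form

# det(L) = 1, so det(A) = det(P) * product of U's diagonal,
# where det(P) is +1 or -1 depending on the number of row swaps
print(np.linalg.det(A))
print(np.linalg.det(P) * np.prod(np.diag(U)))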
Why is this shortcut true? Let’s start by proving it for upper-triangular 2×2
matrices:
(11.16)
You can now prove this for lower-triangular and diagonal matrices.
The proof for 3×3 matrices is comparable but involves slightly more
algebra. The expression below draws from Equation 11.12 but sets the lower-
triangular elements to zero.
(11.17)
Finally, let’s inspect the 4×4 case. Figure 11.4 is a modified version of
Figure 11.2 with e = i = j = m = n = o = 0.
Row swapping Curiously enough, swapping the rows of a matrix flips the
sign of the determinant, without affecting the magnitude of the determinant.
Let’s start with a simple example of the identity matrix with and without
rows swapped: 4
4As sure as the sun rises, the determinant of the identity matrix of any size is 1.
Thus, each row swap reverses the sign of the determinant. Two row swaps
therefore "double-reverses" the sign. More generally, an odd number of row
swaps effectively multiplies the sign of the determinant by −1, whereas an
even number of row swaps preserves the sign of the determinant.
= aei + bfg + cdh − ceg − bdi − afh (11.18)
= bdi + ceg + afh − bfg − aei − cdh (11.19)
= cdh + aei + bfg − afh − ceg − bdi (11.20)
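A quick numerical confirmation of the row-swap rule (my own random-matrix example):
import numpy as np

A  = np.random.randn(4,4)
As = A[[1,0,2,3],:]          # exchange the first two rows

print(np.linalg.det(A))
print(np.linalg.det(As))     # same magnitude, opposite sign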
Now you know that the determinant is unaffected by adding rows to each
other. Why does this happen? In Equation 11.21, the copy of row 2
embedded in row 1 adds, and then subtracts, the added elements (+cd and
−dc). Same story for Equation 11.22: The "extra terms" come from the
products of the right-going diagonals with the added elements (gei, hfg, idh)
(consult back to Figure 11.1 on page §), and then those same "extra terms"
are subtracted off from the left-going diagonals (ieg, hdi, gfh).
The previous paragraph will provide an answer to the next question: What
about adding multiples of one row to another? In the two examples above, I
only showed a row being added to another row, but row reduction often
involves scalar-multiplying rows and adding them. It turns out that scalar
multiplying does not change our conclusion: combining rows has no effect on
the determinant. The reason is that the multiple is simply a different constant
that gets added to, and then subtracted from, the determinant formula.
Let’s think through this using example 11.22. Instead of adding row 3 to
row 1, let’s add 4 times row 3 to row 1. The first row of the new matrix
then becomes [ (a + 4g) (b + 4h) (c + 4i) ]. Well, 4g is still just a constant, so
we might as well give it a new variable name g′. Thus, the first element
becomes (a + g′). Same for the other two elements of the first row: (b + h′)
and (c + i′). Then the opposite-signed terms in Equation 11.22 are, for
example, g′ei and −ieg′. Those still cancel out.
Finally, we can address subtracting rows: Subtracting rows is the same thing
as adding rows with one row scalar-multiplied by −1. For this reason,
subtracting rows has the same impact on the determinant as adding rows:
None whatsoever.
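Here is a small numerical confirmation (my own example) that adding a multiple of one row to another leaves the determinant untouched:
import numpy as np

A = np.random.randn(3,3)

B = A.copy()
B[0,:] = B[0,:] + 4*B[2,:]   # replace row 1 with row 1 + 4*(row 3)

print(np.linalg.det(A))
print(np.linalg.det(B))      # the same value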
The answer is that multiplying a row by some scalar β scales the determinant
by β. Let’s see an example in our favorite lettered 2×2 matrix.
(11.23)
How does this generalize to scaling rows by different values? Let’s find
out:
(11.24)
In other words, any scalar that multiplies an entire row, scales the
determinant by the same amount.
This scales up to any sized matrix, which is easy to confirm in the 3×3
case:
= ζ(aei + bfg + cdh − ceg − bdi − afh) (11.25)
The reason why this happens is that each term in the determinant formula
contains exactly one element from each row in the matrix. Therefore,
multiplying an entire row by a scalar means that every term in the formula
contains that scalar, and it can thus be factored out.
One way to think about this is that matrix-scalar multiplication is the same
thing as row-scalar multiplication, repeated for all rows and using the same
scalar. So if multiplying one row by a scalar β scales the determinant by β,
then multiplying two rows by β scales the determinant by ββ = β2.
Generalizing this to all M rows of a matrix leads to the following formula:
det(βA) = βᴹ det(A)   (11.26)
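And a numerical check of both the single-row and whole-matrix scaling rules (my own example):
import numpy as np

M = 3
A = np.random.randn(M,M)
beta = 2.5

C = A.copy()
C[1,:] *= beta                      # scale ONE row by beta

print(np.linalg.det(C))             # equals beta * det(A)
print(beta * np.linalg.det(A))

print(np.linalg.det(beta*A))        # scaling ALL rows: beta**M * det(A)
print(beta**M * np.linalg.det(A))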
It is also insightful to think about this from an element perspective of matrix-
scalar multiplication.
det(βA) = βa·βd − βb·βc = β²(ad − bc)   (11.27)
You can see that each term in the determinant formula brings β², which can
be factored out. Let’s see how this looks for a 3×3 matrix:
det(βA) = β³(aei + bfg + cdh − ceg − bdi − afh)
You will have the opportunity to examine this yourself in the code challenges
at the end of this chapter. The main point is that the determinant is an
important concept to understand for the theory of linear algebra, but you
should avoid computing or using it directly when implementing linear
algebra concepts on computers.
11.10 Exercises
1.
Compute the determinant of the following matrices.
a)
b)
c)
d)
e)
f)
g)
h)
i)
j)
k)
l)
m)
n)
o)
2.
Apply row-reduction to obtain the echelon form of these matrices, and then
compute the determinant as the product of the diagonal. Keep track of row-
swaps and row-scalar multiplications.
a)
b)
c)
d)
3.
Does having a zero in the main diagonal necessarily give a matrix Δ = 0? The
correct answer is "it depends"; now you need to figure out what it depends
on, and why that is. Then create an example matrix with all zeros on the
diagonal but with Δ≠0.
11.11 Answers
1.
a) a(c − b)
b) 0
c) 0
d) 1
e) 16
f) 2
g) 1
h) −1
i) −1
j) 107
k) −1
l) +1
m) 1
n) 0
o) −24
2.
These answers show the echelon form and the determinant of the original
matrix. Notice that the determinant of the original matrix is not necessarily
the same thing as the product of the diagonal elements of the echelon form of
the matrix (depending on row manipulations).
a)
,Δ = −2
b)
,Δ = 5
c)
,Δ = −4
d)
,Δ = 0
3.
Having a zero on the main diagonal guarantees that Δ = 0 only if the matrix is
triangular (upper-triangular, lower-triangular, or diagonal). This is because
the determinant of a triangular matrix is equal to the product of the diagonal
elements. Here is an example of a matrix with a diagonal of zeros and Δ = 4:
Run these three steps in a double for-loop: One over matrix sizes ranging
from 3×3 to 30×30, and a second that repeats the three steps 100 times.
This is equivalent to repeating a scientific experiment multiple times.
2.
One easy way to create a singular matrix is to set column 1 equal to
column 2.
Code block 11.5:Python
import numpy as np
import matplotlib.pyplot as plt

ns = np.arange(3,31)
iters = 100
dets = np.zeros((len(ns),iters))

for ni in range(len(ns)):
  for i in range(iters):
    A = np.random.randn(ns[ni],ns[ni])    # step 1
    A[:,0] = A[:,1]                       # step 2
    dets[ni,i] = np.abs(np.linalg.det(A)) # step 3

plt.plot(ns,np.log(np.mean(dets,axis=1)))
plt.xlabel('Matrix size')
plt.ylabel('Log determinant')
Code block 11.6:MATLAB
ns = 3:30;
iters = 100;
dets = zeros(length(ns),iters);

for ni=1:length(ns)
    for i=1:iters
        A = randn(ns(ni));        % step 1
        A(:,1) = A(:,2);          % step 2
        dets(ni,i) = abs(det(A)); % step 3
    end
end

plot(ns,log(mean(dets,2)),'s-')
xlabel('Matrix size')
ylabel('Log determinant')
Chapter 12
Matrix inverse
As usual, I will begin this chapter with some general conceptual
introductions, and then we’ll get into the details.
In section 6.11 I wrote that matrix division per se doesn’t exist; however,
there is a conceptually similar operation, which involves multiplying a matrix
by its inverse. By way of introduction to the matrix inverse, I’m going to
start with the "scalar inverse." Solve for x in this equation.
3x = 1   (12.1)
Obviously, x = 1∕3. How did you solve this equation? I guess you divided
both sides of the equation by 3. But let me write this in a slightly different
way:
3x = 1
3⁻¹3x = 3⁻¹1
1x = 3⁻¹
x = 1∕3
I realize that this is a gratuitously excessive number of steps to write out
explicitly, but it does illustrate a point: To separate the 3 from the x, we
multiplied by 3⁻¹, which we might call the "inverse of 3." And of course, we
had to do this to both sides of the equation. Thus, 3×3⁻¹ = 1. A number
times its inverse is 1. The number 1 is special because it is the multiplicative
identity.
What about the following equation; can you solve for x here?
Nope, you cannot solve for x here, because you cannot compute 0/0. There is
no number that can multiply 0 to give 1. The lesson here is that not all
numbers have inverses.
OK, now let’s talk about the matrix inverse. The matrix inverse is a
matrix that multiplies another matrix such that the product is the identity
matrix. The identity matrix is important because it is the matrix analog of the
number 1—the multiplicative identity (AI = A). Here it is in math notation:
(12.2)
Now let me show you an example of using the matrix inverse. In the
equations below, imagine that A and b are known and x is the vector of
unknowns that we want to solve for.
Ax = b   (12.3)
A⁻¹Ax = A⁻¹b   (12.4)
Ix = A⁻¹b   (12.5)
x = A⁻¹b   (12.6)
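In code, the derivation above looks like this (a minimal sketch with my own numbers; as discussed later in this chapter, np.linalg.solve is preferred in practice because it avoids explicitly computing the inverse):
import numpy as np

A = np.array([[2., 3.],
              [1., 4.]])
b = np.array([5., 6.])

x = np.linalg.inv(A) @ b      # x = A^{-1} b, as in Equation 12.6
print(A @ x)                  # reproduces b
print(np.linalg.solve(A, b))  # same x, without forming the inverse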
IMPORTANT:1 Because matrix multiplication is non-commutative (that is,
AB≠BA), you need to be mindful to multiply both sides of the equation by
the matrix inverse on the same side. For example, the following equation is
invalid:
1This was already mentioned in section 6.2 but it’s important enough to repeat.
(12.7)
This equation is WRONG because the inverse pre-multiplies on the left-hand
side of the equation but post-multiplies on the right-hand side of the equation.
As it turns out, post-multiplying by A−1:
(12.8)
is invalid for this equation, because both Ax and b are column vectors. Thus,
the sizes do not permit matrix post-multiplication.
Inverting the inverse Because the inverse is unique (you’ll learn the
proof of this claim later), it can be undone. Thus, the inverse of an inverse is
the original matrix. That is:
(12.9)
Conditions for invertibility You saw above that not all numbers have an
inverse. Not all matrices have inverses either. In fact, many (or perhaps most)
matrices that you will work with in practical applications are not invertible.
Remember that square matrices without an inverse are variously called
singular, reduced-rank, or rank-deficient.
A matrix has a full inverse matrix if the following criteria are met:
1. It is square
2. It is full-rank
So the full matrix inverse exists only for square matrices. What does a "full"
matrix inverse mean? It means you can put the inverse on either side of the
matrix and still get the identity matrix:
(12.11)
Thus, the full matrix inverse is one of the few cases in which matrix
multiplication is commutative.
1. Matrix A is invertible.
2. B is an inverse of A.
3. C is also an inverse of A, distinct from B (thus, B≠C).
The second proof is shorter; notice that each subsequent expression is based
on adding or removing the identity matrix, expressed as the matrix times its
inverse:
(12.17)
Again, the conclusion is that any two matrices that claim to be the inverse of
the same matrix must be equal to each other.
To prove this claim, we start from two assumptions: That the matrix A is
symmetric, and that it has an inverse. The strategy here is to transpose the
inverse equation and then do a bit of algebra. Here’s math tip: Write
down the first equation, close the book, and see if you can discover the proof
on your own.
A⁻¹A = I   (12.18)
(A⁻¹A)ᵀ = Iᵀ   (12.19)
AᵀA⁻ᵀ = I   (12.20)
AA⁻ᵀ = I   (12.21)
AA⁻ᵀ = AA⁻¹   (12.22)
In the previous page, we proved that if a matrix has an inverse, it has one
unique inverse. Therefore, Equation 12.22 brings us to our final conclusion
that if the matrix is symmetric, its inverse is also symmetric.
Avoid the inverse when possible! The last thing I want to discuss before
teaching you how to compute the matrix inverse is that the matrix inverse is
great in theory. When doing abstract paper-and-pencil work, you can invert
matrices as much as you want, regardless of their size and content (assuming
they are square and full-rank). But in practice, computing the inverse of a
matrix on a computer is difficult and can be fraught with numerical
inaccuracies and rounding errors.
Computer scientists have worked hard over the past several decades to
develop algorithms to solve problems that—on paper—require the inverse,
without actually computing the inverse. The details of those algorithms are
beyond the scope of this book. Fortunately, they are implemented in low-
level libraries called by MATLAB, Python, C, and other numerical
processing languages. This is good news, because it allows you to focus on
understanding the conceptual aspects of the inverse, while letting your
computer deal with the number-crunching.
Computing the matrix inverse There are several algorithms to compute the
matrix inverse. In this book, you will learn three: MCA (minors, cofactors,
adjugate), row-reduction, and SVD. You will learn about the first two
algorithms in this chapter, and you’ll learn about the SVD method in
Chapter 16. But there are convenient short-cuts for computing the inverses of
a diagonal matrix and a 2×2 matrix, and that’s where we’ll start.
A−1 = (12.24)
This example is for a 3×3 matrix for visualization, but the principle holds
for any number of dimensions.
This inverse procedure also shows one reason why singular matrices are not
invertible: A singular diagonal matrix has at least one diagonal element equal
to zero. If you try to apply the above short-cut, you’ll end up with an
element of 0/0.
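In code, inverting a diagonal matrix with this short-cut is a one-liner (my own illustration):
import numpy as np

D = np.diag([2., 5., 10.])
Dinv = np.diag(1/np.diag(D))   # invert each diagonal element

print(Dinv @ D)                # the identity matrix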
AÃ = (12.26)
ÃA = (12.27)
You can see that à is definitely not the inverse of A, because their product
is not the identity matrix.
Before learning the full formula for computing the matrix inverse, let’s
spend some time learning another short-cut for the inverse that works on 2×2
matrices.
The reason why you start the procedure by computing the determinant is that
the matrix has no inverse if the determinant is zero. Thus, if step 1 gives an
answer of zero, then you don’t need to apply the remaining steps. You’ll
see an example of this soon, but let’s start with an invertible matrix.
Multiplying the right-most and left-most matrices above will prove that one is
the inverse of the other. As I wrote earlier, a square invertible matrix has a
full inverse, meaning the order of multiplication doesn’t matter.
=
You can stop the computation as soon as you see that the determinant, which
goes into the denominator, is zero. This is one explanation for why a singular
matrix (with Δ = 0) has no inverse.
Below are two more examples of matrices and their inverses, and a check that
the multiplication of the two yields I.
2 −1 = ⇒ =
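The 2×2 short-cut is also easy to verify in code (the example matrix is my own): compute the determinant, swap the diagonal elements, negate the off-diagonal elements, and divide by the determinant.
import numpy as np

A = np.array([[1., 4.],
              [2., 7.]])

det  = A[0,0]*A[1,1] - A[0,1]*A[1,0]         # the determinant
Ainv = np.array([[ A[1,1], -A[0,1]],
                 [-A[1,0],  A[0,0]]]) / det  # swap diagonal, negate off-diagonal, divide

print(Ainv @ A)              # the identity matrix
print(np.linalg.inv(A))      # matches Ainv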
__________________________________________________________________
2Hint: Sometimes it’s easier to leave the determinant as a scalar, rather than to divide each element.
Practice problems Compute the inverse (if it exists) of the following
matrices.
a)
b)
c)
d)
Answers
a)
b)
c) No inverse.
d)
__________________________________________________________________
Minors matrix The minors matrix is a matrix in which each element mi,j of
the matrix is the determinant of the matrix excluding the ith row and the jth
column.4 Thus, for each element in the matrix, cross out that row and that
column, and compute the determinant of the remaining matrix. That
determinant goes into the matrix element under consideration. It’s easier
to understand visually (Figure 12.1).
4This should sound familiar from the formula for computing the determinant of a 4×4 matrix.
Figure 12.1:Each element mi,j of the minors matrix is the
determinant of the submatrix formed from excluding row
i and column j from the original matrix.
A= , m1,1 = (12.29)
m1,2 (12.30)
=
M = (12.31)
The minors matrix is the most time-consuming part of the MCA algorithm.
It’s also the most tedious. Don’t rush through it.
________________________________
a)
b)
c)
Answers
a)
b)
−3
c)
__________________________________________________________________
Cofactors matrix The cofactors matrix is the Hadamard product of the
minors matrix with a matrix of alternating signs. Let’s call that matrix G
for grid: The matrix is a grid of +1s and −1s, starting with +1 for the upper-
left element. Below are a few examples of G matrices. They look like
checkerboards (or chessboards, depending on your level of board game
sophistication) (Figure 12.2).
+1 −1 +1
−1 +1 −1
+1 −1 +1
Finally, the cofactors matrix: C = G ⊙ M. Using the example matrix from
the previous page,
Be mindful of the signs: The sign of each cofactors matrix element depends
both on the sign of the determinant, and on the sign of the G matrix. That is,
of course, obvious from your elementary school arithmetic, but most
mistakes in higher mathematics are arithmetic...
_________________________________________
Practice problems Compute the cofactors matrix of the following matrices.
a)
b)
c)
Answers
a)
b)
−3
c)
__________________________________________________________________
Adjugate matrix At this point, all the hard work is behind you. The adjugate
matrix is simply the transpose of the cofactors matrix, scalar-multiplied by
the inverse of the determinant of the matrix (note: it’s the determinant of
the original matrix, not the minors or cofactors matrices). Again, if the
determinant is zero, then this step will fail because of the division by zero.
Assuming the determinant is not zero, the adjugate matrix is the inverse of
the original matrix.
Now that you know the full MCA formula, we can apply this to a 2×2
matrix. You can see that the "short-cut" in the previous section is just a
simplification of this procedure.
Original matrix :
Minors matrix :
Cofactors matrix :
Adjugate matrix :
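For reference, here is a compact sketch of the whole MCA procedure in Python (my own implementation; the code challenges at the end of this chapter ask you to build this yourself, so consider writing your own version before reading this one):
import numpy as np

def mca_inverse(A):
    # inverse via minors, cofactors, adjugate (for illustration only)
    n = A.shape[0]
    M = np.zeros((n,n))                          # minors matrix
    for i in range(n):
        for j in range(n):
            sub = np.delete(np.delete(A, i, axis=0), j, axis=1)
            M[i,j] = np.linalg.det(sub)
    G = (-1.0)**(np.arange(n)[:,None] + np.arange(n)[None,:])   # +/- grid
    C = G * M                                    # cofactors matrix
    return C.T / np.linalg.det(A)                # adjugate = transpose / determinant

A = np.random.randn(3,3)
print(mca_inverse(A) @ A)                        # approximately the identity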
__________________________________________________________________
a)
b)
c)
Answers
a)
b)
c)
_________________________________________________
_________________________________________________
Practice problems Apply the MCA algorithm to the following matrices to
derive their inverses, when they exist.
a)
b)
c)
Answers
a)
3
b)
10
c) No inverse!
____________________________________________________
Code The code challenges at the end of this chapter will provide the
opportunity to implement the MCA algorithm in code. Below you can see
how easy it is to call the inverse functions in Python and MATLAB.
Code block 12.1:Python
import numpy as np
A = np.random.randn(3,3)
Ai = np.linalg.inv(A)
A@Ai
Code block 12.2:MATLAB
A = randn(3,3);
Ai = inv(A);
A*Ai
______________________
Row reduction method of computing the inverse:
rref([ A | I ]) ⇒ [ I | A⁻¹ ]   (12.33)
_________________________________________________________________________
Let’s start with an example:
You can confirm that the augmented part of the final matrix is the same as the
inverse we computed from the MCA algorithm in the practice problems in
the previous section.
We know that a singular matrix has no inverse. Let’s see what happens
when we apply the row-reduction method to a rank-1 matrix.
That’s the end of the line. We cannot row reduce anymore, and yet we
have not gotten the identity matrix on the left. This system is inconsistent,
and ergo, this matrix has no inverse.
_______________________________________________
a)
b)
c)
d)
Answers
a)
b)
c) No inverse.
d)
14
__________________________________________________________________
Code The code below shows the row-reduction method, and then compares
that to the inverse function.
Code block 12.3:Python
import numpy as np
import sympy as sym
A = np.random.randn(3,3)
Acat = np.concatenate((A,np.eye(3,3)),axis=1)
Ar = sym.Matrix(Acat).rref()[0] # RREF
Ar = Ar[:,3:] # keep the inverse part
Ai = np.linalg.inv(A)
Ar-Ai
Why does it work? Equation 12.33 almost seems like magic. (In fairness,
much of mathematics seems like magic before you become familiar with it...
and a lot of it continues to seem like magic even after years of experience.) In
fact, the reason why this method works is fairly straightforward, and involves
thinking about Equation 12.33 in terms of solving a system of equations.
In Chapter 10, you learned that you can solve the equation Ax = b by
performing Gauss-Jordan elimination on the augmented matrix [A|b]. If there
is a solution—that is, if b is in the column space of A—then row reduction
produces the augmented matrix [I|x].
Here we follow the same reasoning, but the vector b is expanded to the
matrix I. That is, we want to solve AX = I, where X is the inverse of A. It
might be easier to think about this in terms of columns of the identity matrix.
I’ll use ei to indicate the ith column of the identity matrix. And I’ll use
a 3×3 matrix in the interest of concreteness, but the procedure is valid for a
matrix of any size.
Each equation individually finds xi, which is the vector that represents the
weighting of the columns in A in order to obtain the vector ei, which is one
column of the identity matrix. As long as the column space of A spans all of
RM, there is guaranteed to be a solution to each of the systems of equations.
And C(A) spans the entire ambient space when A is full-rank.
Thus, when breaking the problem down into individual column vectors, there
is nothing new compared to what you learned in Chapter 10. The only
addition here is to collect all of these separate steps together into one matrix
equation: AX = I.
Let’s start with a tall matrix, so dimensions M > N. We’ll call this
matrix T for tall. Although this matrix is not invertible, we can, with a bit of
creativity, come up with another matrix (actually, the product of several
matrices) that will left-multiply T to produce the identity matrix. The key
insight to get us started is that TᵀT is a square matrix.6 In fact, TᵀT is
invertible if rank(T) = N (more on this condition later). If TᵀT is invertible,
then it has an inverse.
6Yes, I realize that there is an absurd number of "T"s in these equations.
(TᵀT)⁻¹(TᵀT) = I   (12.37)
If this expression looks strange, then just imagine rewriting it as:
C = TᵀT
C⁻¹C = I   (12.38)
Here’s the thing about Equation 12.37: The first set of parentheses is
necessary because we are inverting the product of two matrices (neither of
those matrices is individually invertible!). However, the second set of
parentheses is not necessary; they’re there just for aesthetic balance. By
removing the unnecessary parentheses and re-grouping, some magic happens:
the product of three matrices that can left-multiply T to produce the identity
matrix. ___________________________
(TᵀT)⁻¹Tᵀ T = I   (12.39)
where the grouped term (TᵀT)⁻¹Tᵀ is the left inverse.
_________________________________________________________________________
Here is another way to look at it:
T-L = (TᵀT)⁻¹Tᵀ
T-LT = I
where T-L indicates the left inverse. (This is non-standard notation, used here
only to facilitate comprehension.)
Why is this called a "one-sided" inverse? Let’s see what happens when
we put the left inverse on the right side of the matrix:
The left-hand side of this equation is actually valid, in the sense that all the
sizes match to make the multiplications work. But the result will not be the
identity matrix. You’ll see numerical examples later. The point is that this
is a left inverse.
Conditions for validity Now let’s think about the conditions for the left
inverse to be valid. Looking back to Equation 12.38, matrix C (which is
actually TᵀT) is invertible if it is full rank, meaning rank(C) = N. When is
TᵀT a full-rank matrix? Recall from section 7.7 that a matrix times its
transpose has the same rank as the matrix on its own. Thus, TᵀT is full-rank if
T has a rank of N, meaning it is full column-rank.
And this leads us to the two conditions for a matrix to have a left inverse: the
matrix must be tall or square (M ≥ N), and it must have full column-rank
(rank = N).
T = ⋯ (12.40)
TᵀT = ⋯ (12.41)
(TᵀT)⁻¹ = ⋯ (12.42)
T⁻ᴸT = I (12.44)
TT⁻ᴸ ≠ I (12.45)
Equation 12.44 shows that pre-multiplying by the left inverse produces
the identity matrix. In contrast, Equation 12.45 shows that post-multiplying
by the left inverse definitely does not produce the identity matrix. This
example also highlights that it is often useful to leave the determinant as a
scalar outside the matrix to avoid dealing with difficult fractions during
matrix multiplications.
_____________________________________________________
Practice problems Compute the left inverse of the following matrices.
a)
b)
c)
Answers
a)
b)
c) No left inverse!
________________________________________________
Did you figure it out? The reasoning is the same as for the left inverse. The
key difference is that you post-multiply by the transposed matrix instead of
pre-multiplying by it. Let’s call this matrix W for wide.
Briefly: We start from WWᵀ, and then right-multiply by (WWᵀ)⁻¹. That leads
us to the following:
____________________________________________________
WWᵀ(WWᵀ)⁻¹ = I (12.46)
W⁻ᴿ = Wᵀ(WWᵀ)⁻¹
WW⁻ᴿ = I
The conditions for a matrix to have a right inverse mirror those for the left inverse: the matrix must be wide (M < N), and it must be full row-rank (rank(W) = M).
As with the left-inverse, putting the right inverse on the left of matrix W is a
valid matrix multiplication, but will not produce the identity matrix.
WWᵀ = ⋯ (12.48)
(WWᵀ)⁻¹ = ⋯ (12.49)
W⁻ᴿ = ⋯
W⁻ᴿW ≠ I (12.51)
WW⁻ᴿ = I (12.52)
__________________________________________________________________
Practice problems Compute the right inverse of the following matrices.
a)
b)
c)
Answers
a)
b)
c) No right inverse!
_______________________________________________
Code The code below shows the left inverse for a tall matrix; it’s your
job to modify the code to produce the right-inverse for a wide matrix! (Note:
In practice, it’s better to compute the one-sided inverses via the Moore-
Penrose pseudoinverse algorithm, but it’s good practice to translate the
formulas directly into code.)
Code block 12.5:Python
import numpy as np
A = np.random.randn(5,3)
Al = np.linalg.inv(A.T@A)@A.T
Al@A
Code block 12.6:MATLAB
A = randn(5,3);
Al = inv(A'*A)*A';
Al*A
The pseudoinverse is used when a matrix does not have a full inverse, for
example if the matrix is square but rank-deficient. As I mentioned in the
beginning of this chapter, a rank-deficient matrix does not have a true
inverse. However, all matrices have a pseudoinverse, which is a matrix that
will transform the rank-deficient matrix to something close-to-but-not-quite
the identity matrix.
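As a quick illustration (the example matrix below is my own, not from the book), the following Python sketch shows the pseudoinverse of a rank-deficient square matrix: it does not produce the identity matrix, but it does satisfy the defining property A A⁺ A = A.
import numpy as np

# Sketch: a rank-deficient square matrix has no true inverse, but it has a pseudoinverse.
A = np.random.randn(4, 2)
A = A @ A.T                          # 4x4 but rank 2
Api = np.linalg.pinv(A)              # Moore-Penrose pseudoinverse
print(np.round(Api @ A, 3))          # close to, but not exactly, the identity matrix
print(np.allclose(A @ Api @ A, A))   # True: a defining property of the pseudoinverse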
1. Compute the inverse of the following matrices (when possible). For
problems with a ∗, additionally compute AA⁻¹ and A⁻¹A.
a)
b)
c) ∗
d)
e) ∗
f)
g) ∗
h)
i)
j) ∗
k)
l)
2. The inverse of the inverse is the original matrix. Is that A⁻¹A⁻¹ or (A⁻¹)⁻¹? (Think about the following exponent rules.)
xⁿxᵐ = xⁿ⁺ᵐ
(xⁿ)ᵐ = xⁿᵐ
3.
Use the row-reduction method to compute the inverse of the following
matrices.
a)
b)
c)
d)
4. For the following matrices and vectors, compute Aₙ⁻¹ and use it to solve for x
in Aₙx = bₙ.
a) A₁x = b₁
b) A₁x = b₂
c) A₂x = b₁
d) A₂x = b₂
e) A₃x = b₃
f) A₃x = b₄
g) A₄x = b₂
h) A₄x = b₄
12.10 Answers
1.
a)
b)
c)
d) No inverse!
e)
f)
g)
h) No inverse!
i)
j)
k)
l)
2. The correct expression for the inverse of the inverse is (A⁻¹)⁻¹. A⁻¹A⁻¹
would mean to matrix-multiply the inverse by itself, which is a valid
mathematical operation but is not relevant here.
3. The matrices below are the inverses (the augmented side of the row-reduced matrix).
a)
b)
c)
d) Not invertible!
4.
The strategy here is to compute the inverses of the four A matrices, then
implement matrix-vector multiplication. Below are the inverses, and below
that, the solutions to x.
a) 18 T
b)
18 T
c) 12 T
d) 12 T
e) 16 T
f) 16 T
g) Invalid operation!
h) 27 T
1.
Your task here is "simple": Implement the MCA algorithm to compute
the matrix inverse. Consult the description in section 12.4. Test your
algorithm by computing the inverse of random matrices and compare
against the Python/MATLAB inv functions.
2.
I wrote earlier that the algorithm underlying the MP pseudoinverse is
only understandable after learning about the SVD.8 But that needn’t
stop us from exploring the pseudoinverse with code! The goal of this
challenge is to illustrate that the pseudoinverse is the same as (1) the
"real" inverse for a full-rank square matrix, and (2) the left inverse for a
tall full-column-rank matrix.
8Reminder: Demonstrations in code help build insight but are no substitute for a formal proof.
2.
Code block 12.11:Python
import numpy as np

# square matrix
A   = np.random.randn(5,5)
Ai  = np.linalg.inv(A)
Api = np.linalg.pinv(A)
Ai - Api # test equivalence

# tall matrix
T   = np.random.randn(5,3)
Tl  = np.linalg.inv(T.T@T)@T.T # left inv
Tpi = np.linalg.pinv(T) # pinv
Tl - Tpi # test equivalence
Code block 12.12:MATLAB
% square matrix
A   = randn(5);
Ai  = inv(A);
Api = pinv(A);
Ai - Api % test equivalence

% tall matrix
T   = randn(5,3);
Tl  = inv(T'*T)*T'; % left inv
Tpi = pinv(T); % pinv
Tl - Tpi % test equivalence
Chapter 13
Projections and orthogonalization
The goal of this chapter is to introduce a framework for projecting one space
onto another space (for example, projecting a point onto a line). This
framework forms the basis for orthogonalization and for an algorithm called
linear least-squares, which is the primary method for estimating parameters
and fitting models to data, and is therefore one of the most important
algorithms in applied mathematics, including control engineering, statistics,
and machine learning. Along the way, we’ll also re-discover the left
inverse.
We start with a vector a, a point b not on a, and a scalar β such that βa is as
close to b as possible without leaving a. Figure 13.1 shows the situation.
(Because we are working with standard-position vectors, it is possible to
equate coordinate points with vectors.)
Figure 13.1:We need a formula to obtain the scalar β such
that point βa is as close as possible to point b without
leaving vector a.
The question is: Where do we place β so that the point βa is as close as
possible to point b? You might have an intuition that β should be placed such
that the line from βa to b is at a right angle to a. That’s the right intuition
(yes, that’s a pun).
One way to think about this is to imagine that the line from βa to b is one
leg of a right triangle, while the line from b to any other point on a is a
hypotenuse of such a triangle. A hypotenuse is always longer than either leg,
so the shortest distance from b to a is along the leg that meets a at a right angle.
We need an expression for the line from point b to point βa. We can express
this line as the subtraction of vector b (the vector from the origin to point b)
and vector βa (the scaled version of vector a). Thus, the expression for the
line is (b − βa).
And that is the key insight that geometry provides us. From here, solving for
β just involves a bit of algebra. It’s a beautiful and important derivation,
so I’ll put all of it into a math box.
aᵀ(b − βa) = 0
aᵀb − βaᵀa = 0
β = aᵀb / aᵀa (13.4)
Note that dividing both sides of the equation by aᵀa is valid because it is a
scalar quantity.
The technical term for this procedure is projection: We are projecting b onto
the subspace defined by vector a. This is commonly written as proja(b).
Note that Equation 13.4 gives the scalar value β, not the actual point
represented by the open circle in Figure 13.1. To calculate that vector βa,
simply replace β with its definition: 2
__________________________________________________________________
2In the fraction, one a in the denominator doesn’t cancel the multiplying a, because the vector in
the denominator is part of a dot product.
Eqn. box title=Projection of a point onto a line
proja(b) = (aᵀb / aᵀa) a (13.5)
________________________________________________________________________
Let’s work through an example with real numbers. We’ll use the
following vector and point.
If you draw these two objects on a 2D plane, you’ll see that this example
is different from that presented in Figure 13.1: The point is "behind" the line,
so it will project negatively onto the vector. Let’s see how this works out
algebraically.
proja(b) = (aᵀb / aᵀa) a = −1a
Notice that β = −1. Thus, we are projecting "backwards" onto the vector
(Figure 13.2). This makes sense when we think of a as being a basis vector
for a 1D subspace that is embedded in ℝ².
Figure 13.2:Example of projecting a point onto a line. The
intersection βa is coordinate (2,1).
__________________________________________________________________
Practice problems Draw the following lines (a) and points (b). Draw the
approximate location of the orthogonal projection of b onto a. Then compute
the exact proja(b) and compare with your guess.
a)
a= , b = (.5,2)
b)
a= , b = (0,1)
c)
a= , b = (0,−1)
d)
a= , b = (2,1)
Answers Note: proja(b) is βa.
a) 3.75/4.25
b) .5
c) 0
d) 4/5
_________________________________________________________________________
-|*|- Reflection Mapping over magnitude: Meditating on Equation 13.4 will
reveal that it is a mapping between two vectors, scaled by the squared length
of the "target" vector. It’s useful to understand this intuition (mapping
over magnitude), because many computations in linear algebra and its
applications (e.g., correlation, convolution, normalization) involve some kind
of mapping divided by some kind of magnitude or norm. -|*|-
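If you'd like to see Equations 13.4 and 13.5 in action, here is a minimal Python sketch (the vectors a and b are arbitrary choices of mine, not from the book):
import numpy as np

# Sketch: project point b onto the line spanned by vector a (arbitrary example numbers).
a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])
beta = (a @ b) / (a @ a)              # Equation 13.4
proj = beta * a                       # Equation 13.5: the point on the line closest to b
print(np.isclose(a @ (b - proj), 0))  # True: the residual b - beta*a is orthogonal to a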
13.2 Projections in ℝᴺ
Now you know how to project a point onto a line, which is a 1D subspace.
We’re now going to extend this to projecting a point onto an N-D
subspace.
Indeed, one way of thinking about the one-sided inverse is that it projects a
rectangular matrix onto the (square) identity matrix.
You can see that Equation 13.8, x = (AᵀA)⁻¹Aᵀb, involves inverting a matrix, which should
immediately raise the question in your mind: What is the condition for this
equation to be valid? The condition, of course, is that AᵀA is square and full-
rank. And you know from the previous chapter that this is the case if A is
already square and full-rank, or if it is tall and full column-rank.
We can also consider what happens when A is a square full-rank matrix: This
guarantees that b is already in its column space, because the column space of
A spans all of ℝᴹ. In that case, Equation 13.8 simplifies a bit:
x = (AᵀA)⁻¹Aᵀb
x = A⁻¹A⁻ᵀAᵀb
x = A⁻¹b (13.9)
__________________________________________________________________
Practice problems Solve for x using Equation 13.8.
a)
=
b)
=
c)
=
d)
=
Answers
a) T
b) T
c) T
_________________________________
-|*|- Reflection Figures 13.2 and 13.3 provide the geometric intuition
underlying the least-squares formula, which is the mathematical backbone of
many analyses in statistics, machine-learning, and AI. Stay tuned... -|*|-
Code You can implement Equation 13.8 based on what you learned in the
previous chapter. But that equation is so important and is used so often that
numerical processing software packages have short-cuts for solving it. The
Python code uses the numpy function lstsq, which stands for least-squares.
Code block 13.1:Python
import numpy as np
A = [[1,2],[3,1],[1,1]]
b = [5.5,-3.5,1.5]
np.linalg.lstsq(A,b)[0]
Code block 13.2:MATLAB
A = [1 2; 3 1; 1 1];
b = [5.5 -3.5 1.5]';
A\b
Let's start with a picture so you understand our goal (in ℝ², of course,
because linear algebra always starts in ℝ²). We begin with two vectors, w
and v. Let's call w the "target" vector, and v the "reference" vector.
The idea is that we want to break up w into two separate vectors, one of
which is parallel to v and the other of which is perpendicular to v. In Figure 13.4, the
component of vector w that is parallel to v is labeled w∥v and the component
that is perpendicular to v is labeled w⊥v. The thin gray dashed lines
illustrate that these two vector components sum to form the original vector w.
In other words:
w = w∥v + w⊥v (13.10)
Figure 13.4:Decomposing vector w into components that
are parallel and orthogonal to vector v.
Now that you have the picture, let’s start deriving formulas and proofs.
Don’t worry, it’ll be easier than you might think.
Parallel component If w and v have their tails at the same point, then the
component of w that is parallel to v is simply collinear with v. In other words,
w∥v is a scaled version of v. Does this situation look familiar? If not, then you
were either sleep-reading section 13.1, or you just opened the book to this
page. Either way, it would behoove you to go back and re-read section 13.1.
All we need to do here is project w onto v. Here's what that formula
looks like:
w∥v = projv(w) = (wᵀv / vᵀv) v
And this brings us to the full set of equations for decomposing a vector into
two components relative to a target vector.
w∥v = (wᵀv / vᵀv) v
w⊥v = w − w∥v
Here’s a tip that might help you remember these formulas: Becau
We need to prove that w∥v and w⊥v really are orthogonal to each other.
That looks on first glance like a messy set of equations, but when kee
5If you squint, this proof looks like a flock of birds.
Let’s work through an example in â„2, using vectors
It’s easily seen in this example that v||w and v⊥w are orthogonal
Now let's try an example in ℝ⁶. We won't be able to visualize these vectors, but the computation is exactly the same.
a= ,b=
You can confirm (on paper or computer) that the two vector components are orthogonal; a short code sketch for doing so follows.
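The sketch below uses my own example vectors (not the ones from the book) to verify both properties: the components sum to w, and they are orthogonal.
import numpy as np

# Sketch: decompose w into components parallel and perpendicular to the reference v.
w = np.array([2.0, 3.0])
v = np.array([4.0, 0.5])
w_par  = (w @ v) / (v @ v) * v          # component of w parallel to v
w_perp = w - w_par                      # component of w perpendicular to v
print(np.allclose(w_par + w_perp, w))   # True: the two components sum to w
print(np.isclose(w_par @ w_perp, 0))    # True: the two components are orthogonal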
a)
,
b)
,
c)
,
_______________________________________________________
-|*|- Reflection A quick glance at the geometric representations of ve
_______________________________________________________
If you’re struggling to see the link between Equations
This is a remarkable definition, because it matches the definition of t
A bit of arithmetic will confirm the first property, and a bit of trigono
This example is for a tall orthogonal matrix. Can we have a wide orthogonal matrix?
A wide matrix has size M < N, and its maximum possible rank is M.
You can see that all columns are pairwise orthogonal, but the third co
a)
ζ
b)
c)
a)
b) 1
c) any ζ
_______________________________________________________
v1∗ =
v2∗ =
v2∗ =
v3∗ =
Q=
Figure 13.6 shows the column vectors of V and Q. Aside from the intense
arithmetic on the previous page, it is also geometrically obvious that q3 must
be zeros, because it is not possible to have more than two orthogonal vectors
in ℝ². This is an application of the theorem about the maximum number of
vectors that can form a linearly independent set, which was introduced in
section 4.6.
__________________________________________________________________
Practice problems Produce orthogonal matrices using the Gram-Schmidt
procedure.
a)
b)
c)
Answers
a)
b)
c)
_________________________________________________
-|*|- Reflection The Gram-Schmidt procedure is numerically unstable, due to
round-off errors that propagate forward to each subsequent vector and affect
both the normalization and the orthogonalization. You’ll see an example
of this in the code challenges. Computer programs therefore use numerically
stable algorithms that achieve the same conceptual result, based on
modifications to the standard Gram-Schmidt procedure or alternative
methods such as Givens rotations or Householder reflections. -|*|-
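For reference, here is a minimal Python sketch of the "textbook" (classical) Gram-Schmidt procedure described above; the function name is mine and the sketch omits any handling of rank-deficient input:
import numpy as np

# Sketch of the classical ("textbook") Gram-Schmidt procedure.
def gram_schmidt(A):
    A = np.asarray(A, dtype=float)
    Q = np.zeros_like(A)
    for k in range(A.shape[1]):
        v = A[:, k].copy()
        for j in range(k):                       # orthogonalize against earlier columns
            v -= (Q[:, j] @ A[:, k]) * Q[:, j]
        Q[:, k] = v / np.linalg.norm(v)          # normalize to unit length
    return Q

A = np.random.randn(4, 3)
Q = gram_schmidt(A)
print(np.round(Q.T @ Q, 10))   # approximately the identity matrix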
13.6 QR decomposition
The Gram-Schmidt procedure transforms matrix A into orthogonal matrix Q.
Unless A is already an orthogonal matrix, Q will be different than A, possibly
very different. Thus, information is lost when going from A → Q. Is it
possible to recover the information that was lost in translation? Obviously,
the answer is yes. And that's the idea of QR decomposition:
A = QR (13.32)
The Q here is the same Q that you learned about above; it’s the result of
Gram-Schmidt orthogonalization (or other comparable but more numerically
stable algorithm).10 R is like a "residual" matrix that contains the information
that was orthogonalized out of A. You already know how to create matrix Q;
how do you compute R? Easy: Take advantage of the definition of orthogonal
matrices:
10QR decomposition is unrelated to QR codes.
QᵀA = QᵀQR (13.35)
R = QᵀA (13.36)
Sizes of Q and R, given A The sizes of Q and R depend on the size of A, and
on a parameter of the implementation.
Let’s start with the square case, because that’s easy: If A is a square
matrix, then Q and R are also square matrices, of the same size as A. This is
true regardless of the rank of A (more on rank and QR decomposition below).
Now let’s consider a tall matrix A (M > N). Computer algorithms can
implement the "economy QR" or "full QR" decomposition. The economy QR
is what you might expect based on the example in the previous page: Q will
be the same size as A, and R will be N×N. However, it is possible to create
a square Q from a tall A. That's because the columns of A are in ℝᴹ, so
even if A has only N columns, there are M − N more columns that can be
created to be orthogonal to the first N. Thus, the full QR decomposition
of a tall matrix will have Q ∈ ℝᴹˣᴹ and R ∈ ℝᴹˣᴺ. In other words, Q
is square and R is the same size as A.
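You can see the economy vs. full distinction directly in Python; the sketch below is my own example (numpy calls the two options 'reduced' and 'complete'):
import numpy as np

# Sketch: economy ("reduced") vs. full ("complete") QR decomposition of a tall matrix.
A = np.random.randn(6, 3)
Qe, Re = np.linalg.qr(A, mode='reduced')    # Qe is 6x3 (same size as A), Re is 3x3
Qf, Rf = np.linalg.qr(A, mode='complete')   # Qf is 6x6 (square), Rf is 6x3 (same size as A)
print(Qe.shape, Re.shape, Qf.shape, Rf.shape)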
Ranks of Q, R, and A The Q matrix will always have its maximum possible
rank (M or N depending on its size, as described above), even if A is not full
rank.
It may seem surprising that the rank of Q can be higher than the rank of A,
considering that Q is created from A. But consider this: If I give you a vector
[-1 1], can you give me a different, orthogonal, vector? Of course you could:
[1 1] (among others). Thus, new columns in Q can be created that are
orthogonal to all previous columns. You can try this yourself in Python or
MATLAB by taking the QR decomposition of 1 (the matrix of all 1’s).
On the other hand, R will have the same rank as A. First of all, because R is
created from the product QTA, the maximum possible rank of R will be the
rank of A, because the rank of A is equal to or less than the rank of Q. Now
let’s think about why the rank of R equals the rank of A: Each diagonal
element in R is the dot product between corresponding columns in A and Q.
Because each column of Q is orthogonalized to earlier columns in A, the dot
products forming the diagonal of R will be non-zero as long as each column
in A is linearly independent of its earlier columns. On the other hand, if
column k in A can be expressed as a linear combination of earlier columns in
A, then column k of Q is orthogonal to column k of A, meaning that matrix
element Rk,k will be zero.
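The claim about ranks is easy to check in code. The sketch below (my own example, using the all-1's matrix mentioned above) shows that Q is full-rank while R inherits the rank of A:
import numpy as np

# Sketch: Q is always full-rank, whereas R has the same rank as A.
A = np.ones((4, 4))                 # rank-1 matrix of all 1's
Q, R = np.linalg.qr(A)
print(np.linalg.matrix_rank(A))     # 1
print(np.linalg.matrix_rank(Q))     # 4
print(np.linalg.matrix_rank(R))     # 1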
13.8 Exercises
1.
Without looking back at page §, derive w∥v and w⊥v and prove that
they are orthogonal.
2. Simplify the following operations, assuming Q,A ∈ ℝᴹˣᴹ and Q is an orthogonal matrix.
a) Q⁻ᵀ
b) QQᵀQ⁻ᵀQᵀQ⁻¹QQ⁻¹Qᵀ
c) Q⁻ᵀQQQ⁻¹AAQᵀQA⁻¹
3.
In section 13.1, I defined the difference vector to be (b −βa). What happens
if you define this vector as (βa − b)? Do the algebra to see if the projection
formula changes.
4.
Why are orthogonal matrices called "pure rotation" matrices? It’s not
because they’ve never committed sins; it’s because they rotate but do
not scale any vector to which they are applied. Prove this claim by comparing
∥x∥ to ∥Qx∥. (Hint: It’s easier to prove that the magnitudes-
squared are equal.)
5.
a)
b)
c)
13.9 Answers
1.
You can check your proof on page §.
2.
a) Q
b) (Qᵀ)²
c) Q²A
3.
It doesn’t; you’ll arrive at the same projection formula.
4. Here's the proof; the key result at the end is that ∥Qx∥ = ∥x∥.
∥Qx∥² = (Qx)ᵀ(Qx)
= xᵀQᵀQx
= xᵀx = ∥x∥²
5. The first two columns of your Q should match v1∗ and v2∗ in the
text. But you will not get a third column of zeros; it will be some other
vector that has a magnitude of 1 and is not orthogonal to the previous two
columns (easily confirmed: QᵀQ ≠ I). What on Earth is going on
here?!?!
The issue here is normalization of computer rounding errors. To see this,
modify your algorithm so that column 3 of Q is not normalized.
You will find that both components of the third column have values
close to zero, e.g., around 10⁻¹⁵. That's basically zero plus
computer rounding error. The normalization step is making mountains out
of microscopic anthills. Congratulations, you have just discovered one
of the reasons why the "textbook" Gram-Schmidt algorithm is avoided
in computer applications!
Chapter 14
Least-squares
14.1 Introduction
The physical and biological world that we inhabit is a really, really, really
complicated place. There are uncountable dynamics and processes and
individuals with uncountable and mind-bogglingly complex interactions.
How can we make sense of this complexity? The answer is we can’t:
Humans are terrible at understanding such enormously complex systems.
On the other hand, there is a lot of diversity in nature, and models should be
sufficiently generic that they can be applied to different datasets. This means
that the models must be flexible enough that they can be adjusted to different
datasets without having to create a brand new model for each particular
dataset.
This is why models contain both fixed features and free parameters. The
fixed features are components of the model that the scientist imposes, based
on scientific evidence, theories, and intuition (a.k.a. random guesses). Free
parameters are variables that can be adjusted to allow the model to fit any
particular data set. And this brings us to the primary goal of model-fitting:
Find values for these free parameters that make the model match the data as
closely as possible.
Here is an example to make this more concrete. Let’s say you want to
predict how tall someone is. Your hypothesis is that height is a result of the
person’s sex (that is, male or female), their parents’ height, and their
childhood nutrition rated on a scale from 1-10. So, males tend to be taller,
people born to taller parents tend to be taller, and people who ate healthier as
children tend to be taller. Obviously, what really determines an adult’s
height is much more complicated than this, but we are trying to capture a few
of the important factors in a simplistic way. We can then construct our model
of height:
h = β₁s + β₂p + β₃n + ε (14.1)
where h is the person’s height, s is the sex, p is the parents’ height,
and n is the childhood nutrition score. These are the fixed features of the
model—they are in the model because I put them there, because I believe
that they are important. On the other hand, it is obvious that these three
factors do not exactly determine someone’s height; hence, we include the
𜖠as an "error term" or "residual" that captures all of the variance in height
that the three other variables cannot explain.
But here’s the thing: I don’t know how important each of these
factors is. So I specify that the model needs to include these terms, and I will
let the data tell me how important each term is. That’s why each fixed
feature is scaled by a β parameter. Those βs are the free parameters. For
example, if β3 is large, then nutrition is very important for determining
someone’s adult height; and if β3 is close to zero, then the data are telling
me that childhood nutrition is not important for determining height.
Of course, that leads to the question of how you find the model parameters
that make the model match the data as closely as possible. Randomly picking
different values for the βs is a terrible idea; instead, we need an algorithm
that will find the best parameters given the data. This is the idea of model
fitting. The most commonly used algorithm for fitting models to data is linear
least-squares, and that is what you will learn in this chapter.
Step 2 Work the data into the model. You get existing data from a database
or by collecting data in a scientific experiment. Or you can simulate your
own data. I made up the data in Table 14.1 for illustration purposes.
Individual | Height (cm) | Sex | Parents' height (cm) | Nutrition (1-10)
10 | 176 | F | 189 | 7
Table 14.1:Example data table. Height is in cm.
But this is not the format that we need the data in. We need to map the data in
this table into the model. That means putting these numbers inside the
equation from step 1. The first row of data, converted into an equation, would
look like this:
One row of data gives us one equation. But we have multiple rows,
corresponding to multiple individuals, and so we need to create a set of
equations. And because those equations are all linked, we can consider them
a system of equations. That system will look like this:
(14.2)
Notice that each equation follows the same "template" from Equation 14.1,
but with the numbers taken from the data table. (The statistically astute reader
will notice the absence of an intercept term, which captures expected value of
the data when the predictors are all equal to zero. I’m omitting that for
now, but will discuss it later.)
Step 3 The system of equations above should activate your happy memories of
working through Chapter 10 on solving systems of equations. Armed with
those memories, you might have already guessed what step 3 is: convert the
system of equations into a single matrix-vector equation. It’s exactly the
same concept that you learned previously: Split up the coefficients, variables,
and constants, and put those into matrices and vectors.
(14.3)
Step 5 Statistically evaluate the model: is it a good model? How well does it
fit the data? Does it generalize to new data or have we over-fit our sample
data? Do all model terms contribute or should some be excluded? This step is
all about inferential statistics, and it produces things like p-values and t-
values and F-values. Step 5 is important for statistics applications, but is
outside the scope of this book, in part because it relies on probability theory,
not on linear algebra. Thus, step 5 is not further discussed here.
14.3 Terminology
It is unfortunate and needlessly confusing that statisticians and linear
algebrists use different terminology for the same concepts. Terminological
confusion is a common speed-bump to progress in science and cross-
disciplinary collaborations. Nonetheless, there are worse things in human
civilization than giving a rose two names. The table below introduces the
terminology that will be used only in this chapter.
The final point I want to make here is about the term "linear" in linear least-
squares. Linear refers to the way that the parameters are estimated; it is not a
restriction on the model. The model may contain nonlinear terms and
interactions; the restriction for linearity is that the coefficients (the free
parameters) scalar-multiply their variables and sum to predict the dependent
variable. Basically that just means that it’s possible to transform the
system of equations into a matrix equation. That restriction allows us to use
linear algebra methods to solve for β; there are also nonlinear methods for
estimating parameters in nonlinear models.
To make sure this point is clear, consider the following model, which
contains two nonlinearities:
(14.4)
One nonlinearity is in the regressors (β2 ), and one is in a predictor (n3).
The latter is no problem; the former prevents linear least-squares from fitting
this model (there are nonlinear alternatives that you would learn about in a
statistics course).
You know what the solution is: instead of the full inverse, we use the left-
inverse. Thus, the solution to the least squares problem is:
Xβ = y
(XᵀX)⁻¹XᵀXβ = (XᵀX)⁻¹Xᵀy
β = (XᵀX)⁻¹Xᵀy (14.5)
When is Equation 14.5 a valid solution? I’m sure you guessed it—when
X is full column-rank, which means the design matrix contains a linearly
independent set of predictor variables. If the design matrix has repeated
columns, or if one predictor variable can be obtained from a linear
combination of other predictor variables, then X will be rank-deficient and
the solution is not valid. In statistics lingo, this situation is called
"multicollinearity."
When is the solution exact? I wrote in section 8.11 that one of the
fundamental questions in linear algebra is whether some vector is in the
column space of some matrix. Here you can see an application of that
question. We want to know whether vector y is in the column space of matrix
X. They're both in ℝᴹ, because they both have M elements in the
columns, so it's a valid question to ask. But here's the thing: y has M
elements and X has N columns, where N << M (e.g., M = 1000 vs. N = 3). In
this example, we are asking whether a specific vector is inside a 3D subspace
embedded in ambient ℝ¹⁰⁰⁰.
The answer is nearly always No, y ∉ C(X). Even for a good model, a tiny bit of
noise in the data would push vector y out of the column space of X. What to
do here?
There is more to say about this ε vector (section 14.7), but first I want to
prove to you that the least squares equation minimizes ε. And to do that,
we need to re-derive the least-squares solution from a geometric perspective.
Imagine an ambient ℝᴹ space, and a subspace inside that ambient space,
whose basis vectors are the columns of the design matrix. I will illustrate this
using a 2D subspace because it is easy to visualize (Figure 14.1), but the
actual dimensionality of this subspace corresponds to the number of columns
in the design matrix, which means the number of predictor variables in your
statistical model.
When working with real data, it is incredibly unlikely that y is exactly in the
column space of X. In fact, if y ∈ C(X), then probably the model is way too
complicated or the data are way too simple. The goal of science is to
understand things that are difficult to understand;3 if your model explains a
phenomenon 100%, then it’s not difficult to understand, and you should
try to work on a harder problem.
3n.b. I'm kind of joking here. Sort of.
But the column space is important because that is the mathematical
representation of our theory of the world. So then the question is, What is the
coordinate that is as close as possible to the data vector while still being
inside the subspace? The answer to that question comes from the orthogonal
projection of the vector onto the subspace. That orthogonal projection is
vector ε = y − Xβ, which is orthogonal to the column space of X.
Xᵀε = 0 (14.9)
Xᵀ(y − Xβ) = 0 (14.10)
Xᵀy − XᵀXβ = 0 (14.11)
XᵀXβ = Xᵀy (14.12)
β = (XᵀX)⁻¹Xᵀy (14.13)
And amazingly (though perhaps not surprisingly), we’ve arrived back at
the same solution as in the previous section.
We also see that the design matrix X and the residuals vector ε are
orthogonal. Geometrically, that makes sense; statistically, it means that the
prediction errors should be unrelated to the model, which is an important
quality check of the model performance.
-|*|- Reflection You might have noticed that I’m a bit loose with the plus
signs and minus signs. For example, why is ε defined as y − Xβ and not
Xβ − y? And why is ε added in Equation 14.6 instead of subtracted?
Sign-invariance often rears its confusing head in linear algebra, and in many
cases, it turns out that when the sign seems like it’s arbitrary, then the
solution ends up being the same regardless. Also, in many cases, there are
coefficients floating around that can absorb the signs. For example, you could
flip the signs of all the elements in ε to turn vector −ε into +ε. -|*|-
Now we’re going to derive the least-squares formula again, this time
using row-reduction and Gauss-Jordan elimination. It may seem gratuitous to
have yet another derivation of least-squares, but this section will help link
concepts across different chapters of this book, and therefore has high
conceptual/educational value (IMHO).
The problem is not the augmented matrix: X and y both have M rows, so the
augmenting is valid. And row-reduction is also valid. However, [ X | y ] will
have N + 1 columns and will have a rank of N + 1 < M. Thus, the actual
outcome of RREF will be
(14.15)
In other words, a tall matrix with the identity matrix IN on top and all zeros
underneath. This is not useful; we need to re-think this.
The solution comes from pre-multiplying the general linear model equation
by Xᵀ:
XᵀXβ = Xᵀy (14.16)
Equation 14.16 is often called the "normal equation." (I know what you're
thinking and I agree: it looks less normal than the equation we derived
it from. But that's the terminology.)
The reason why this works comes from thinking about the row spaces of X
and XᵀX. Remember from Chapter 8 (section 8.5) that these two matrices
have the same row space, and that XᵀX is a more compact representation of
the space spanned by rows of X. Furthermore, assuming that X has linearly
independent columns (which is an assumption we've been making in this
entire chapter), then XᵀX spans all of ℝᴺ, which means that any point in ℝᴺ
is guaranteed to be in the column space of XᵀX. And Xᵀy is just some point
in ℝᴺ, so it's definitely going to be in the column space of XᵀX.
Therefore, β contains the coefficients on the columns of matrix XᵀX to get us
exactly to Xᵀy. Note that with the Gauss-Jordan approach to solving least-
squares using the normal equations, we never leave the N-dimensional
subspace of ℝᴺ, so the question of whether y ∈ C(X) doesn't even come
up.
However, we still need to compute ε to make the GLM equation correct.
The reason is that β is a solution to the normal equation, but we want to find
the solution to y, which is not the same thing as Xᵀy. Fortunately, that's
an easy rearrangement of Equation 14.6:
ε = y − Xβ (14.18)
The statistical interpretation of Equation 14.18 is that the residual is the
difference between the model-predicted values (Xβ) and the observed data
(y).
By the way, we can also arrive at the normal equations by starting from the
known solution of the left-inverse, and then pre-multiplying by XᵀX:
β = (XᵀX)⁻¹Xᵀy
XᵀXβ = (XᵀX)(XᵀX)⁻¹Xᵀy (14.19)
XᵀXβ = Xᵀy (14.20)
Statistically, ε is the residual variance in the data that the model cannot
account for. We can also write this as follows:
(14.21)
When is the model a good fit to the data? It is sensible that the smaller ε
is, the better the model fits the data. In fact, we don't really care about the
exact values comprising ε (remember, there is one εᵢ for each data
value yᵢ); instead, we care about the norm of ε.
∥ε∥² = ∥Xβ − y∥² (14.22)
Now we can re-frame the goal of model-fitting in a slightly different way:
Find the values in vector β that minimize both sides of Equation 14.22. This
is a standard optimization problem, and can be expressed as
minᵦ ∥Xβ − y∥² (14.23)
The solution to this problem is obtained by computing the derivative and
setting that equal to zero:
0 = d∕dβ ∥Xβ − y∥² = 2Xᵀ(Xβ − y) (14.24)
0 = XᵀXβ − Xᵀy (14.25)
XᵀXβ = Xᵀy (14.26)
β = (XᵀX)⁻¹Xᵀy (14.27)
And here we are, back at the same solution to least-squares that we’ve
obtained already using three different approaches. And now I’ve slipped
in a fourth method to derive the least-squares equation, using a bit of calculus
and optimization. (If your calculus is rusty, then don’t worry if the first
equation is nebulous. The idea is to take the derivative with respect to β using
the chain rule.)
We begin with a set of numbers. Let’s call it set D for data. We assume
that the order is meaningful; perhaps these are data values from a signal being
recorded over time.
I’m going to start by assuming that the signal has a constant value. Thus,
my hypothesis at this point is that there is only one meaningful signal value,
and the divergences at each time point reflect errors (for example, sensor
noise). Thus, my model is:
(14.28)
That was step 1. Step 2 is to work the data into the model. That will give a
series of equations that looks like this:
(14.29)
Why is there a "1" next to the β? My hypothesis is that a single number
explains the data. But I don’t know a priori what that number should be.
That’s what the β is for. Now, that number could be anything else, and β
would scale to compensate. For example, let’s say the average data value
is 2. With the coefficient set to 1, then β = 2, which is easy to interpret.
However, if I were to set the coefficient to 9.8, then β = .20408.... And then
the interpretation would be that the average value of the signal is β×9.8.
That is mathematically correct, but difficult and awkward to interpret. That's
why we set the coefficient value to 1 (in a few pages, I will start calling
this column the "intercept").
Now for step 3: Convert the set of equations into a matrix-vector equation.
(14.30)
Step 4 is to fit the model.
β = (XᵀX)⁻¹Xᵀy (14.31)
β = 8⁻¹ × 17 (14.32)
β = 17∕8 = 2.125 (14.33)
Did you notice what happened here? We ended up summing all the data
points (dot product between the numbers and a vector of 1’s) and
dividing by the number of elements.4 That’s literally the arithmetic mean.
So we’ve re-derived the average from a statistics/linear algebra
perspective, where our model is that the data are meaningfully characterized
by a single number.
4Another good reason to set the coefficients to 1.
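A quick Python sketch (with made-up data values that I chose so the mean is 2.125, matching the example) confirms that a design matrix of all 1's reproduces the arithmetic mean:
import numpy as np

# Sketch: a column of 1's as the design matrix turns least-squares into the mean.
y = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 2.0, 3.5, 1.5])   # made-up data (mean = 2.125)
X = np.ones((len(y), 1))
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta, y.mean())   # both equal 2.125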
It’s always a good idea to visualize data and models. Let’s see what
they look like (Figure 14.2).
Figure 14.2:Observed and predicted data. Black squares
show the data (y) and white stars show the model-
predicted data (ŷ = Xβ). The dashed gray lines were
added to facilitate comparing data values to their
predicted values.
What do you think of the comparison of predicted and actual data shown in
Figure 14.2? I think the model looks pretty awful, to be honest. The data
clearly show an upward trend, which the model cannot capture. Let’s try
changing the model. (Important note: I’m skipping step 5, which would
involve formal tests of the model.)
Instead of predicting a single data value, let’s predict that the data values
change over the x-axis (the statistics term for this is a "linear trend"). The
new model equation from step 3 will look like this:
(14.34)
Note that we are no longer predicting the average value; we are now
predicting a slope. Let’s see what the parameter is:
β = (XᵀX)⁻¹Xᵀy (14.35)
β = 204⁻¹ × 148 (14.36)
β = 148∕204 = .7255 (14.37)
Figure 14.3 shows the same data with the new model-predicted data values.
Figure 14.3:Our second modeling attempt.
It looks better compared to Figure 14.2, but still not quite right: The predicted
data are too high (over-estimated) in the beginning and too low (under-
estimated) in the end. The problem here is that the model lacks an intercept
term.
The intercept is the expected value of the data when all other parameters are
set to 0. That’s not the same thing as the average value of the data.
Instead, it is the expected value when all model parameters have a value of
zero. Thinking back to the example model in the beginning of this chapter
(predicting adult height as a function of sex, parents’ height, and
childhood nutrition), the intercept term captures the expected height of an
adult female whose parents’ height is zero cm and who had a score of 0
on childhood nutrition. Not exactly a plausible situation. So why do these
models need an intercept term?
The reason is that without an intercept, the model is forced to include the
origin of the graph (X,Y=0,0). This caused the model in Figure 14.3 to be a
poor fit. The intercept term allows the best-fit line to shift off the origin (that
is, to intersect the Y-axis at any value other than zero). Thus, in many
statistical models, the intercept is not necessarily easily interpretable, but is
necessary for proper model fitting (unless all data and regressors are mean-
subtracted, in which case the best-fit line touches the origin).
Now let’s expand the model to include an intercept term, which involves
including an extra column of 1’s in the design matrix. We’ll also need
two βs instead of one.
(14.38)
The arithmetic gets a bit more involved, but produces the following.
β = (XᵀX)⁻¹Xᵀy (14.39)
β = ⋯ (14.40)
β = ⋯ (14.41)
Figure 14.4:Our final model of this dataset.
I’m sure you agree that this third model (linear trend plus intercept) fits
the data reasonably well (Figure 14.4). There are still residuals that the model
does not capture, but these seem like they could be random fluctuations.
Notice that the best-fit line does not go through the origin of the graph. We
don’t see exactly where the line will cross the y-axis because the x-axis
values start at 1. However, β1, the intercept term, predicts that this crossing
would be at y = −5.5357, which looks plausible from the graph.
The final thing I want to do here is confirm that ε ⊥ Xβ. The two
columns below show the residual and predicted values, truncated at 4 digits
after the decimal point.
It may seem like the dot product is not exactly zero, but it is 14 orders of
magnitude smaller than the data values, which we can consider to be zero
plus computer rounding error.
1.
Explain and write down a mathematical model that is appropriate for
this dataset.
2.
Write the matrix equation corresponding to the model, and describe the
columns in the design matrix.
3.
Compute the model coefficients using the least-squares algorithm in
MATLAB or Python. You can also divide the β coefficients by the
standard deviations of the corresponding independent variables, which
puts the various model terms in the same scale and therefore more
comparable.
4.
Produce scatter plots that visualize the relationship between the
independent and dependent variables.
5. One measure of how well the model fits the data is called R² ("R-
squared"), and can be interpreted as the proportion of variance in the
dependent variable that is explained by the design matrix. Thus, an R² of
1 indicates a perfect fit between the model and the data. The definition of R² is
R² = 1 − ( Σᵢ εᵢ² ) ∕ ( Σᵢ (yᵢ − ȳ)² ) (14.42)
εᵢ is the error (residual) at each data point that was introduced earlier,
and ȳ is the average of all elements in y.
Compute R² for the model to see how well it fits the data. Are you
surprised, based on what you see in the scatterplots?
A simple model would predict that time and age are both linear
predictors of widget purchases. There needs to be an intercept term,
because the average number of widgets purchased is greater than zero.
The variables could interact (e.g., older people buy widgets in the
morning while younger people buy widgets in the evening), but I’ve
excluded interaction terms in the interest of brevity.
y is the number of widgets sold, β1 is the intercept, t is the time of day,
and a is the age.
2.
The matrix equation is Xβ = y. X has three columns: intercept, which is
all 1’s; time of day; age.
3.
Python and MATLAB code are on the following page.
4.
The figure is below, code thereafter.
5.
The model accounts for 36.6% of the variance of the data. That seems
plausible given the variability in the data that can be seen in the graphs.
Code block 14.5:Python
yHat = X@beta
r2 = 1 - np.sum((yHat-y)**2) / np.sum((y-np.mean(y))**2)
Code block 14.6:MATLAB
yHat = X*beta;
r2 = 1 - sum((yHat-y).^2) ...
     / sum((y-mean(y)).^2);
Chapter 15
Eigendecomposition
15.1 What are eigenvalues and eigenvectors?
_____________________________________________________________
There are myriad explanations of eigenvectors and eigenvalues, and in my
experience, most students find most explanations incoherent on first
impression. In this section, I will provide three explanations that I hope will
build intuition; additional explanations and insights will come in subsequent
sections as well as subsequent chapters.
1The terms eigenvalue decomposition, eigenvector decomposition, and eigendecomposition are all used
interchangeably; here I will use eigendecomposition.
But there are two key properties to know about eigendecomposition before
we get to interpretations. First, eigendecomposition is defined only for square
matrices. They can be singular or invertible, symmetric or triangular or
diagonal; but eigendecomposition can be performed only on square matrices.
All matrices in this chapter are square (the singular value decomposition is
defined for any sized matrix, and is the main topic of the next chapter).
But something is different in the right-side panel of Figure 15.1: The output
vector points in the same direction as the input vector. The matrix did not
change the direction of the vector; it merely scaled it down.
In other words, the matrix-vector multiplication had the same effect as scalar-
vector multiplication. The entire matrix is acting on the vector as if that
matrix were a single number. That might not sound so impressive in a 2×2
example, but the same concept applies to a 10⁵⁰×10⁵⁰ matrix (a total of 10¹⁰⁰ [a
googol] elements!). If we could find that single number that has the
same effect on the vector as does the matrix, we would be able to write the
following equality, which is also known as the fundamental eigenvalue
equation:
Av = λv
When the above equation is satisfied, then v is an eigenvector and λ is its
associated eigenvalue (geometrically, λ is the amount of stretching). Let's
be clear about this equation: It is not saying that the matrix equals the
scalar; both sides of this equation produce a vector, and we cannot simply
scalar; both sides of this equation produce a vector, and we cannot simply
divide by v. Instead, this equation is saying that the effect of the matrix on the
vector is the same as the effect of the scalar on the vector.
The example shown in Figure 15.1 has λ = .5. That means that Av = .5v. In
other words, the matrix shrunk the vector by half, without changing its
direction.
Each dot corresponds to one day’s measurement of the number of people
going into the fast-food place (y-axis) vs. the gym (x-axis).
Our concern is that neither axis seems optimal for understanding the structure
in the data. Clearly, a better—more efficient and more informative—axis
would reflect this structure. I’m sure you agree that the grey line drawn
on top of the dots is an appropriate axis.
Why am I writing all of this? It turns out that the gray line is an
eigenvector of the data matrix times its transpose, which is also called a
covariance matrix. In fact, the gray line shows the first principal component
of the data. Principal components analysis (PCA) is one of the most
important tools in data science (for example, it is a primary method used in
unsupervised machine learning), and it is nothing fancier than an
eigendecomposition of a data matrix. More on this in Chapter 19.
If this analogy is confusing, then hopefully it will make more sense by the
end of the chapter. And if it’s still confusing after this chapter, then
please accept my apologies and I hope the previous two explanations were
sensible.
15.2 Finding eigenvalues
Eigenvectors are like secret passages that are hidden inside the matrix. In
order to find those secret passages, we first need to find the secret keys.
Eigenvalues are those keys. Thus, eigendecomposition requires first finding
the eigenvalues, and then using those eigenvalues as "magic keys" to unlock
the eigenvectors.
Of course, we could set v = 0 to make Equation 15.4 true for any matrix and
any λ. But this is a trivial solution, and we do not consider it. Thus, the
eigenvalue equation tells us that (A − λI)—the matrix shifted by λ—has
a non-trivial null space.
The key here is to remember what we know about a matrix with a non-trivial
null space, in particular, about its rank: We know that any square matrix with
a non-trivial null space is reduced rank. And we also know that a reduced-
rank matrix has a determinant of zero. And this leads to the mechanism for
finding the eigenvalues of a matrix.
______________________________________________
Eqn. box title=Finding eigenvalues
|A − λI| = 0 (15.5)
__________________________________________________________________
In section 11.3 you learned that the determinant of a matrix is computed by
solving the characteristic polynomial. And you saw examples of how a
known determinant can allow us to solve for some unknown variable inside
the matrix. That’s exactly the situation we have here: We have a matrix
with one missing parameter (λ) and we know that its determinant is zero.
And that’s how you find the eigenvalues of a matrix. Let’s work
through some examples.
det(A − λI) = (a − λ)(d − λ) − bc = 0
λ² − (a + d)λ + (ad − bc) = 0 (15.6)
The characteristic polynomial of a 2×2 matrix is a 2nd degree algebraic
equation, meaning there are two λ’s that satisfy the equation (they might
be complex). The solutions can be found using the quadratic formula, which
is one of those things that people try but fail to memorize in high-school math
class.
λ = ( (a + d) ± √((a + d)² − 4(ad − bc)) ) ∕ 2 (15.7)
λ = ⋯ (15.9)
λ = ⋯ (15.10)
If you’re lucky, then your teacher or linear algebra textbook author
designed matrices with entries that can be factored into the form (λ1 − λ)
(λ2 − λ) = 0. Otherwise, you’ll have to try again to memorize the
quadratic formula, or at least keep it handy for quick reference.
λ = 2 ±
__________________________________________________________________
Practice problems Find the eigenvalues of the following matrices.
a)
b)
c)
d)
Answers
a) 4,−4
b) ±
c) 3,5
d) −3,8
_________________________________________________________
There is a small short-cut you can take for finding eigenvalues of a 2×2
matrix. It comes from careful inspection of Equation 15.6. In particular, the
λ¹ coefficient is the trace of the matrix and the λ⁰ coefficient is the
determinant of the matrix. Hence, we can rewrite Equation 15.6 as follows:
λ² − tr(A)λ + det(A) = 0 (15.12)
You still need to solve the equation for λ, so this admittedly isn't such a
brilliant short-cut. But it will get you to the characteristic polynomial slightly
faster. Be mindful that this trick works only for 2×2 matrices; don't try to
apply it to any larger matrices.
Practice problems Find the eigenvalues of the following matrices.
a)
b)
c)
d)
Answers
a) (−1 ± )∕2
b) ± i
c) 6, -3
d) 0, 7.5
________________________________________________________
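If you want to check the shortcut numerically, here is a small Python sketch (the example matrix is my own choice) that compares the roots of λ² − tr(A)λ + det(A) = 0 against numpy's eigenvalues:
import numpy as np

# Sketch: eigenvalues of a 2x2 matrix from its trace and determinant.
A = np.array([[1.0, 2.0], [2.0, 1.0]])
tr, dt = np.trace(A), np.linalg.det(A)
lam = np.roots([1, -tr, dt])           # roots of lambda^2 - tr*lambda + det = 0
print(np.sort(lam))                    # [-1.  3.]
print(np.sort(np.linalg.eigvals(A)))   # [-1.  3.]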
Eigenvalues of a 3×3 matrix The algebra gets more complicated, but the
principle is the same: Shift the matrix by −λ and solve for Δ = 0. The
characteristic polynomial produces a third-order equation, so you will have
three eigenvalues as roots of the equation. Here is an example.
(−2 − λ)(1 − λ)(−λ) + 12 − 24 − 3(1 − λ) − 8λ + 12(2 + λ) = 0
−λ³ + λ² + 9λ − 9 = 0
(3 − λ)(1 − λ)(−3 − λ) = 0
The three λ's are 3, 1, and −3.
_________________________________________
Practice problems Find the eigenvalues of the following 3×3 matrices.
a)
b)
c)
Answers
a) 1, 2, 3
b) 0, 1, 3
c) 4, 3, 3
________________________________________________________
M columns, M λs It is no coincidence that the 2×2 matrices had 2
eigenvalues and the 3×3 matrices had 3 eigenvalues. Indeed, the
Fundamental Theorem of Algebra states that any m-degree polynomial has m
solutions. And because an M×M matrix has an Mth order characteristic
polynomial, it will have M roots, or M eigenvalues. Hence, an M×M
matrix has M eigenvalues (some of those eigenvalues may be repeated,
complex numbers, or zeros).
-|*|- Reflection You can already see from the examples above that
eigenvalues have no intrinsic sorting. We can come up with sensible sorting,
for example, ordering eigenvalues according to their position on the number
line or magnitude (distance from zero), or by a property of their
corresponding eigenvectors. Sorted eigenvalues can facilitate data analyses,
but eigenvalues are an intrinsically unsorted set. -|*|-
__________________________________________________________________
Let’s work through a practice problem. I’ll use the matrix presented
on page §. As a quick reminder, that matrix and its eigenvalues are:
Before looking at the numbers below, I encourage you to work through this
on your own: Shift the matrix by each λ and find a vector in its null space.
(15.15)
I wonder whether you got the same answers that I wrote above. The
eigenvalues are unique—there are no other possible solutions to the
characteristic polynomial. But are the eigenvectors unique? Are the
eigenvectors I printed above the only possible solutions? Take a moment to
test the following possible eigenvectors to see whether they are also in the
null space of the shifted matrix.
v1 , ,
v2 , ,
I believe you see where this is going: There is an infinite number of equally
good solutions, and they are all connected by being scaled versions of each
other. Thus, the true interpretation of an eigenvector is a basis vector for a 1D
subspace. Therefore, we could re-imagine solution 15.15 as
(15.16)
Figure 15.3:The "preferred" eigenvector is the unit-length
basis vector for the null space of the matrix shifted by its
eigenvalue.
That said, when solving problems by hand, it’s usually easier to have
eigenvectors with integer elements, rather than having to worry about all the
fractions and square roots that arise in normalization.
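Here is a small Python sketch of the procedure described above (my own example matrix; it uses SciPy's null_space to find a basis for the null space of the shifted matrix):
import numpy as np
from scipy.linalg import null_space

# Sketch: an eigenvector is a basis vector for the null space of the shifted matrix.
A = np.array([[3.0, 1.0], [0.0, 2.0]])
lam = 3.0                                # one eigenvalue of this triangular matrix
v = null_space(A - lam * np.eye(2))      # basis for N(A - lambda*I)
print(v.ravel())                         # [1, 0] up to sign
print(np.allclose(A @ v, lam * v))       # True: Av = lambda*v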
______________________________________
a)
b)
a)
λ1 = 2,v1 = , λ2 = 7,v2 =
b)
λ1 = 1,v1 = , λ2 = 2,v2 =
__________________________________________________________________
I’d like to show another example of eigendecomposition of a 3×3
matrix. You’ll see that the concept is the same, but the arithmetic quickly
gets challenging. It’s important to work through enough examples that
you understand the concept, and then you let computers do the hard work for
you.
Again, the concept is that we find the 3 eigenvalues of the 3×3 matrix by
shifting by −λ, setting the determinant to zero, and solving for λ. Then we
shift the matrix by each solved λ and find a vector in the null space of that
shifted matrix. Easy peasy, right? Let’s see...
(1 − λ)(3 − λ)(6 − λ) + 42 + 36 − 9(3 − λ) − 8(6 − λ) − 21(1 − λ) = 0
−λ³ + 10λ² − 27λ + 18 + 42 + 36 − 27 + 9λ − 48 + 8λ − 21 + 21λ = 0
−λ³ + 10λ² + 11λ = 0
−λ(λ² − 10λ − 11) = 0
−λ(λ + 1)(λ − 11) = 0
Our three eigenvalues are 0, -1, and 11. Notice that we got a λ = 0, and the
matrix is reduced-rank (the third column is the sum of the first two). That's
not a coincidence, but a deeper discussion of the interpretation of zero-valued
eigenvalues will come later.
Now we can solve for the three eigenvectors, which are the vectors just to the
left of the equals sign (non-normalized).
λ₁ : ⋯
λ₂ : ⋯
λ₃ : ⋯4
4You can appreciate why linear algebra was mostly a theoretical branch of mathematics before
computers came to the rescue.
(15.18)
This is clunky and ugly, and therefore violates an important principle in
linear algebra: Make equations compact, elegant, and simple. Fortunately, all
of these equations can be simplified by having each eigenvector be a column
in a matrix, and having each eigenvalue be an element in a diagonal matrix.5
5Λ is the capital version of λ. Be careful not to confuse the Greek Λ with the Roman A. We don’t
want to start any new Greek-Roman conflicts.
AV = VΛ (15.19)
On first glance, it may seem inconsistent that I wrote λv in Equation 15.18
but VΛ in the matrix Equation 15.19. There are two reasons why it must be
VΛ and cannot be ΛV. Both reasons can be appreciated by seeing the right-
hand side of Equation 15.19 written out for a 3×3 matrix. Before reading
the text below the equation, try to understand why (1) the λ’s have to be
the diagonal elements and (2) why we need to post-multiply by Λ. In the
eigenvectors below, the first subscript number corresponds to the
eigenvector, and the second subscript number corresponds to the eigenvector
element. For example, v12 is the second element of the first eigenvector.
(15.20)
Those two questions are closely related. The reason why the eigenvalues go
in the diagonal of a matrix that post-multiplies the eigenvectors matrix is that
the eigenvalues must scale each column of the V matrix, not each row (refer
to page ?? for the rule about pre- vs. post-multiplying a diagonal matrix). If Λ
pre-multiplied V, each element of each eigenvector would be scaled by a
different λ.
There is another reason why it's VΛ and not ΛV. If the equation read AV
= ΛV, then we could right-multiply both sides by V⁻¹, producing the statement A
= Λ, which is not generally true.
So then why did I write λv for the single equation instead of the more-
consistent vλ? For better or worse, λv is the common way to write it, and
that’s the form you will nearly always see.
Let’s now revisit the Rubik’s cube analogy from the beginning of this
chapter: In Equation 15.21, A is the scrambled Rubik’s cube with all
sides having inter-mixed colors; V is the set of rotations that you apply to the
Rubik’s cube in order to solve the puzzle; Λ is the cube in its "ordered"
form with each side having exactly one color; and V−1 is the inverse of the
rotations, which is how you would get from the ordered form to the original
mixed form.
Figure 15.4 shows what diagonalization looks like for a 5×5 matrix; the
gray-scale intensity corresponds to the matrix element value.
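A quick Python sketch (my own random example) verifies the diagonalization by reconstructing A from V and Λ:
import numpy as np

# Sketch: reconstruct A from its eigendecomposition, A = V Lambda V^{-1}.
A = np.random.randn(5, 5)
L, V = np.linalg.eig(A)
A_rec = V @ np.diag(L) @ np.linalg.inv(V)
print(np.allclose(A, A_rec))   # True (up to rounding error)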
a)
b)
a)
, ,
b)
, ,
_________________________________________________
-|*|- Reflection The previous reflection box mentioned sorting eigenvalues.
Figure 15.4 shows that the eigenvalues are sorted ascending along the
diagonal. Re-sorting eigenvalues is fine, but you need to be diligent to apply
the same re-sorting to the columns of V, otherwise the eigenvalues and their
associated eigenvectors will be mismatched. -|*|-
But there are matrices for which no matrix V can make that decomposition
true. Here’s an example of a non-diagonalizable matrix.
Notice that the matrix is rank-1 and yet has two zero-valued eigenvalues.
This means that our diagonal matrix of eigenvalues would be the zeros
matrix, and it is impossible to reconstruct the original matrix using Λ = 0.
All triangular matrices that have zeros on the diagonal are nilpotent, have all
zero-valued eigenvalues, and thus cannot be diagonalized.
All hope is not lost, however, because the singular value decomposition is
valid on all matrices, even the non-diagonalizable ones. The two non-
diagonalizable example matrices above have singular values, respectively, of
{2,0} and {1,0}.
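To get a feel for this in code, here is a minimal sketch using a nilpotent matrix of my own choosing (not necessarily one of the example matrices above): the eigenvalues are all zero even though the rank is 1, yet the singular values are perfectly well defined.
import numpy as np

N = np.array([[0, 1], [0, 0]])             # a rank-1, nilpotent, upper-triangular matrix (my example)
print(np.linalg.eigvals(N))                # [0. 0.]: both eigenvalues are zero
print(np.linalg.matrix_rank(N))            # 1: the matrix is not the zeros matrix
print(np.linalg.svd(N, compute_uv=False))  # [1. 0.]: the SVD still works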
Equation 15.22 is the starting point for our proof. We proceed by multiplying
the entire equation by matrix A. And from there we can replace Av with λv.
β1Av1 + β2Av2 = A0
β1λ1v1 + β2λ2v2 = 0 (15.23)
Next I will multiply Equation 15.22 by λ1, and then subtract Equation 15.23 from that scaled version of Equation 15.22.
The two terms in the left-hand side of the difference equation both contain
β2v2, so that can be factored out, revealing the final nail in the coffin of our
to-be-falsified hypothesis:
β2(λ1 − λ2)v2 = 0 (15.24)
Why is this the key equation? It says that we multiply three terms to obtain
the zeros vector, which means at least one of these terms must be zero. (λ1
−λ2) cannot equal zero because we began from the assumption that they are
distinct. v2≠0 because we do not consider zero eigenvectors.
Thus, we are forced to conclude that β2 = 0. This violates the assumption in
Equation 15.22, which proves that the assumption is invalid, which proves
that distinct eigenvalues lead to linearly independent eigenvectors.
Either way you look at it, the conclusion is that distinct eigenvalues lead to
linearly independent eigenvectors.
We did not need to impose any assumptions about the field from which the
λs are drawn; they can be real or complex-valued, rational or irrational,
integers or fractions. The only important quality is that each λ is distinct.
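If you want to confirm this numerically, here is a quick sketch of my own (a random matrix, which almost surely has distinct eigenvalues):
import numpy as np

A = np.random.randn(5, 5)           # random matrices almost surely have distinct eigenvalues
d, V = np.linalg.eig(A)
print(np.linalg.matrix_rank(V))     # 5: the eigenvectors form a linearly independent set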
The two solutions are λ1 = λ2 = 4. Plugging 4 into the shifted matrix yields
It’s clear how to get the first eigenvector, but then how do you get the
second eigenvector? The answer is you don’t—there is only one
eigenvector. MATLAB will return the following. 8
8MATLAB normalized the vectors to be unit-length vectors: the vector [1 2] points in the same direction as [1 2]∕√5.
L =
    4.0000   0
    0        4.0000
The columns of the output matrix V span the same subspace, although one points in the opposite direction. There isn't anything "wrong" with the matrix: it has a rank of 2 and a non-zero determinant. But it has a single eigenvalue that reveals only a single eigenvector. Let's see another example of a matrix with repeated eigenvalues.
Again, the two solutions are λ = 4. Plugging this into the matrix yields
A − λI = 4I − 4I = 0 (15.25)
Now we have an interesting situation: any vector times the zeros matrix
produces the zeros vector. So all vectors are eigenvectors of this matrix.
Which two vectors to select? The standard basis vectors are a good practical
choice, because they are easy to work with, are orthogonal, and have unit
length. Therefore, for the matrix above, V=I. To be clear: V could be I, or it
could be any other 2×2 full-rank matrix. This is a special case of
eigendecomposition of the identity matrix. I just multiplied it by 4 so we'd get the same eigenvalues as in the previous example.
Perhaps that was too extreme an example. Consider the following matrix.
(15.26)
This matrix has one distinct eigenvalue and one repeated eigenvalue. We
know with certainty that the eigenvector associated with λ = 6 will be
distinct (I encourage you to confirm that a good integer-valued choice is [3 -1
1]T); what will happen when we plug 4 into the shifted matrix?
(15.27)
Again, we can ask the question whether these are the only two eigenvectors.
I’m not referring to scaled versions of these vectors such as αv2, I mean
whether we could pick equally good eigenvectors with different directions
from the two listed above.
Consider vector v2. Because the third column of the shifted matrix is all
zeros, the third element of the eigenvector can be any number. I could have
picked [1 1 1] or [1 1 10]. As long as the first two elements are the same, the
third element can be anything. The pair of eigenvectors I picked is a
convenient choice because they are orthogonal and contain only 0's and 1's; but these weren't the only possible vectors I could have selected.
Now you’ve seen several examples of the possible outcomes of repeated
eigenvalues: there can be only one eigenvector or distinct eigenvectors, or an
infinity of possible sets of distinct eigenvectors. Which of these possibilities
is the case depends on the numbers in the matrix.
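Here is a small sketch (with two illustrative matrices of my own, not the ones used above) contrasting two of those outcomes for the repeated eigenvalue 4:
import numpy as np

defective = np.array([[4, 1], [0, 4]])   # repeated eigenvalue, only one eigenvector direction
scaled_I  = 4 * np.eye(2)                # repeated eigenvalue, every vector is an eigenvector
for M in (defective, scaled_I):
    d, V = np.linalg.eig(M)
    print(d)   # [4. 4.] in both cases
    print(V)   # nearly parallel columns for the first matrix; the identity for the second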
To understand why this can happen, let’s revisit the proof in the
beginning of this section. We need to modify the first assumption, though:
We now assume that the two eigenvalues are the same, thus λ1 = λ2. Most
of the rest of the proof continues as already written above. Below is Equation
15.24, re-written for convenience.
β2(λ1 − λ2)v2 = 0
Again, we have the product of three terms producing the zeros vector. What
do we know now? We still don’t accept the zeros vector as an
eigenvector, so v2≠0. However, we also know that the first term is equal to
zero. Thus, the equation really is
β2(0)v2 = 0
What can we now conclude? Not much. We have the zeros vector without the trivial solution of a zeros eigenvector, so assumption 2 can be maintained (any vector set that includes the zeros vector is linearly dependent). But we could also drop assumption 2 and say that the set of eigenvectors could be dependent or independent, depending on the entries in the matrix.
-|*|- Reflection Do you really need to worry about repeated eigenvalues for
applications of eigendecomposition? Repeated eigenvalues may seem like
some weird quirk of abstract mathematics, but real datasets can have
eigenvalues that are exactly repeated or statistically indistinguishable. In my
own research on multivariate neural time series analysis, I find nearly
identical eigenvalues to be infrequent, but common enough that I keep an eye
out for them. -|*|-
If 4ac > b2 in Equation 15.6, then you end up with the square root of a
negative number, which means the eigenvalues will be complex numbers.
And complex eigenvalues lead to complex eigenvectors.
Complex solutions can also arise from "normal" matrices with real-valued
entries. Consider the following matrix, which I generated in MATLAB from
random integers (that is, I did not carefully hand-select this matrix or look it
up in a secret online website for math teachers who want to torment their
students). (Numbers are rounded to nearest tenth.)
Av = λv
Āv̄ = λ̄v̄
The second equation follows from the first by taking the complex conjugate of both sides; because the matrix is real-valued, Ā = A, so if λ is an eigenvalue of A then so is its conjugate λ̄ (with eigenvector v̄).
Complex-valued solutions in eigendecomposition can be difficult to work
with in applications with real datasets, but there is nothing in principle weird
or strange about them.
________________________________________________________
a)
b)
c)
Answers
a) λ = 2 ± i
b) λ = 3 ± i7
c) λ = a ± ib
____________________________________________________
But of course we need to prove this rigorously for all symmetric matrices.
The goal is to show that the dot product between any pair of eigenvectors is
zero. We start from two assumptions: (1) matrix A is symmetric (A = AT) and
(2) λ1 and λ2 are distinct eigenvalues of A (thus λ1≠λ2). v1 and v2 are
their corresponding eigenvectors. I’m going to write a series of equalities;
make sure you can follow each step from left to right.
λ1v1Tv2 = (Av1)Tv2 = v1TATv2 = v1TAv2 = v1Tλ2v2 = λ2v1Tv2 (15.28)
The middle steps are actually just way-points; we care about the equality
between the first and last terms. I’ll write them below, and then set the
equation to zero.
λ1v1Tv2 = λ2v1Tv2 (15.29)
λ1v1Tv2 − λ2v1Tv2 = 0 (15.30)
Notice that both terms contain the dot product v1Tv2, which can be factored out, bringing us to the crux of the proof:
(λ1 − λ2)v1Tv2 = 0 (15.31)
Equation 15.31 says that two quantities multiply to produce 0, which means
that one or both of those quantities must be zero. (λ1 − λ2) cannot equal
zero because we began from the assumption that they are distinct. Therefore,
v1Tv2 must be zero, which means that v1 ⊥ v2, i.e., the two eigenvectors are
orthogonal. Note that this proof is valid only for symmetric matrices, when
AT = A. Otherwise, you'll get stuck in the middle of line 15.28; it won't be possible to set the first and last expressions equal to each other.
Orthogonal eigenvectors are a big deal. It means that the dot product between
any two non-identical columns will be zero:
viTvj = 0, whenever i ≠ j (15.32)
When putting all of those eigenvectors as columns into a matrix V, then VTV
is a diagonal matrix.
(15.33)
I hope this looks familiar. This is also the definition of an orthogonal matrix.
And that means:
VTV =I (15.34)
VT = V−1 (15.35)
Thus, the eigenvectors of a symmetric matrix form an orthogonal matrix.
This is an important property with implications for statistics, multivariate
signal processing, data compression, and other applications. You’ll see
several examples of this property later in this book, for example in Chapter
19.
a)
b)
c)
Answers
a) ∗ = 1, λ = −3, 2
b) ∗ = −4, λ = 20, 60
c) ∗ = 1, λ = 0, 6
_________________________________________________________________________
Lest you think I carefully selected that matrix to give a zero eigenvalue, the
next example shows that even in the general case of a matrix where one
column is a multiple of another column (thus, a rank-1 matrix), one of the
eigenvalues will be zero.
You can see from this example that the off-diagonal product (−σa×b) cancels the constant term (a×σb) that comes from expanding the diagonal product; it doesn't matter what values you assign to a, b, and σ. (The same effect happens if you construct one row to be a multiple of the other, which you should try on your own.) The canceling of the constant terms means that every remaining term of the characteristic polynomial contains λ, which means 0 will always be a solution.
At this point, you might be jumping to the conclusion that the rank of a
matrix corresponds to the number of non-zero eigenvalues. Although this is a
good thought, it’s incorrect (it is, however, true for singular values, which
you will learn in the next chapter). Let me make that more clear: The number
of non-zero eigenvalues equals the rank for some matrices, but this is not
generally true for all matrices. You’ve already seen an example of this in
the section about conditions for diagonalizability. As an additional
demonstration, create matrix A above using a=-4, b=2, σ=2. You will find
that that matrix has both eigenvalues equal to zero, and yet is a rank-1 matrix.
(This is another example of a nilpotent matrix. You can confirm that the
suggested matrix has AA = 0.)
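A quick numerical check of that suggestion, using the values a=-4, b=2, σ=2 mentioned above (the way I assemble the matrix from a, b, and σ is my reading of the construction in the text):
import numpy as np

a, b, s = -4, 2, 2
A = np.array([[a, s*a], [b, s*b]])   # second column is sigma times the first: a rank-1 matrix
print(np.linalg.eigvals(A))          # both eigenvalues are 0
print(np.linalg.matrix_rank(A))      # yet the rank is 1
print(A @ A)                         # nilpotent: AA is the zeros matrix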
Let’s see how this would work with a 3×3 rank-2 matrix (I encourage
you to work through this problem on your own before inspecting the
equations below).
There is a bit of algebra in that final step, but you can see that λ appears in
all terms, which means λ = 0 is one of the three solutions.
I hope you find those examples compelling. There are several explanations
for why singular matrices have at least one zero-valued eigenvalue. One is
that the determinant of a matrix equals the product of the eigenvalues, and the
determinant of a singular matrix is 0, so at least one eigenvalue must be zero.
Another way to interpret Equation 15.4 is to work the other way around: start
from the assumption that A is singular, hence there is a non-trivial solution to
Av = 0. The way to reconcile that statement with Equation 15.4 is to assume
that at least one λ = 0.
_________________________________________________________
Practice problems Compute the missing eigenvalues of these matrices (hint:
remember that the sum of eigenvalues equals the trace of the matrix).
a) λ = 1, ?
b) λ = ?, ?
c) λ = −2, ?, ?
Answers
a) λ = 1, 0
b) λ = 0, 11
c) λ = −2, −1, 0
__________________________________________________
Consider computing the outer product of one eigenvector with itself. That
will produce an M×M rank-1 matrix. The norm of this matrix will also be
1, because it is formed from a unit-norm vector. An M ×M matrix has M
eigenvectors, and thus, M outer product matrices can be formed from the set
of eigenvectors.
What would happen if we sum together all of those outer product matrices?
Well, nothing terribly important. It’s a valid operation, but that sum
would not equal the original matrix A. Why not? It’s because
eigenvectors have no intrinsic length; they need the eigenvalues to scale
them.
A = ∑i=1M viλiviT (15.36)
Expanding out the summation sign leads to the insight that we are re-
expressing diagonalization:
A = VΛVT (15.37)
But actually, this is not exactly Equation 15.21 (page §). Previously we
right-multiplied by V−1 but here we’re right-multiplying by VT. Can you
guess what that means? It means that reconstructing a matrix using Equation
15.36 is valid only for symmetric matrices, because V−1 = VT.
But don’t worry, reconstructing a matrix via eigenlayers is still valid for
non-symmetric matrices. We just need a different formulation. In fact, now
the vector on the right is the ith row of V−1. This comes from the definition
of the outer product, as the multiplication between a column vector and a row
vector. Some additional notation should minimize confusion.
W = V−T (15.38)
A = ∑i=1M viλiwiT (15.39)
Now we have the outer product between the eigenvector and the corresponding row of the inverse of the eigenvector matrix transposed, which here is written as the ith column of matrix W. I hope it is clear why Equation 15.36 is a simplification of Equation 15.39: V−T = V for an orthogonal matrix.
10What happens here for λi = 0?
It is important to appreciate that Equation 15.36 is valid only when the
eigenvectors are unit-normalized. The eigenvectors need to be unit-
normalized so that they provide only direction with no magnitude, allowing
the magnitude to be specified by the eigenvalue. That equation could be
generalized to non-unit vectors by dividing by the magnitudes of the vectors.
11Eigenvector normalization is also relevant for the SVD.
On the other hand, Equation 15.39 does not require unit-normalized
eigenvectors. Can you think about why that’s the case? It’s because
Equation 15.39 includes the matrix inverse. Thus, VV−1 = I regardless of the
magnitude of the individual eigenvectors, whereas VVT = I only when each
eigenvector is unit-normalized.
A1 = v1λ1v1T
A2 = v2λ2v2T
A1 + A2 = A
It is easy to confirm that the rank of A is 2 and the ranks of A1 and A2 are
both 1.
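A short sketch of both reconstruction formulas, using random matrices of my own (eigh and eig are the NumPy routines for symmetric and general matrices, respectively):
import numpy as np

# symmetric case (Equation 15.36): layers are v_i * lambda_i * v_i^T
A = np.random.randn(4, 4); A = A + A.T
d, V = np.linalg.eigh(A)
layersA = sum(d[i] * np.outer(V[:, i], V[:, i]) for i in range(4))
print(np.allclose(layersA, A))   # True

# general case (Equation 15.39): scale by the columns of W = V^{-T}
B = np.random.randn(4, 4)
d, V = np.linalg.eig(B)
W = np.linalg.inv(V).T
layersB = sum(d[i] * np.outer(V[:, i], W[:, i]) for i in range(4))
print(np.allclose(layersB, B))   # True (up to tiny complex round-off)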
_________________________________________________
-|*|- Reflection Who cares about eigenlayers? It may seem circuitous to deconstruct a matrix only to reconstruct it again. But consider this: do you need to sum up all of the layers? What if you summed only the layers with the largest k < r eigenvalues? That would be a low-rank approximation of the original matrix. Or maybe this is a data matrix and you identify certain eigenvectors that reflect noise; you can then reconstruct the data without the "noise layers." More on this in the next few chapters! -|*|-
For simple matrices like this, a few minutes of pen-and-paper work will
easily reveal the answer. What if the matrix were larger, and what if you had
to compute A5 = AAAAA?
a) λ = 1, −1
b) λ = 3, 6
c) λ = 1, 2, 3
__________________________________________________________________
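Returning to the question of computing A5: the logic of this section is that An = VΛnV−1, so only the diagonal elements of Λ need to be raised to the nth power. A minimal sketch with a random matrix of my own:
import numpy as np

A = np.random.randn(4, 4)
d, V = np.linalg.eig(A)
A5 = V @ np.diag(d**5) @ np.linalg.inv(V)             # V Lambda^5 V^{-1}
print(np.allclose(A5, np.linalg.matrix_power(A, 5)))  # True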
Matrix inverse The other application in this section follows the same logic:
Diagonalize a matrix, apply some operation to the diagonal elements of Λ,
then optionally reassemble the three matrices into one.
You might wonder whether this is really a short-cut, considering that V still
needs to be inverted. There are two computational advantages of Equation 15.45. One advantage comes from inverting a symmetric matrix (which has an orthogonal eigenvectors matrix), where V−1 = VT and so no explicit inverse is needed. A second advantage is that because the eigenvectors are normalized, V has a low condition number and is therefore numerically stable to invert. Thus, the V of a non-symmetric matrix might be easier to invert than A itself. (Condition number is a measure of the "spread" of a matrix that characterizes its stability to minor perturbations or noise; you'll learn more about this quantity in the next chapter. The point is that even if A is theoretically invertible, V may have a numerically more accurate inverse.)
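As a minimal sketch of that idea (the matrix is my own; I make it symmetric and full-rank so that V−1 = VT and all eigenvalues are non-zero):
import numpy as np

A = np.random.randn(4, 4)
A = A @ A.T + np.eye(4)              # symmetric and full-rank
d, V = np.linalg.eigh(A)
Ainv = V @ np.diag(1/d) @ V.T        # invert only the eigenvalues; V^{-1} = V^T for symmetric A
print(np.allclose(Ainv, np.linalg.inv(A)))   # True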
a) λ = 2, −5
b) λ = 2, 1
_________________________________________________
15.12 Generalized eigendecomposition
One quick glance will reveal that the following two equations are equivalent.
Av = λv
Av = λIv
What if we replace I with another (suitably sized) matrix?
Av = λBv
__________________________________________________________________
Generalized eigendecomposition is also called simultaneous diagonalization
of two matrices and leads to several equations that are not immediately easy
to interpret, including:
V−1B−1AV = Λ (15.48)
A = BVΛV−1 (15.49)
B = AVΛ−1V−1 (15.50)
Perhaps a better way to interpret generalized eigendecomposition is to think
about "regular" eigendecomposition on a matrix product involving an
inverse:
B−1Av = λv (15.51)
__________________________________________________________________
This interpretation is valid only when B is invertible. In practice, even when
B is invertible, inverting large or high-conditioned matrices can lead to
numerical inaccuracies and therefore should be avoided. Computer programs
will perform the eigendecomposition without actually inverting matrices.
Nonetheless, Equation 15.51 helps build intuition.
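A brief sketch in Python (the matrices are my own; B is built to be symmetric, invertible, and reasonably conditioned precisely so that the comparison with B−1A behaves):
import numpy as np
from scipy.linalg import eig         # scipy's eig accepts a second matrix for Av = lambda*Bv

A = np.random.randn(5, 5); A = A @ A.T
B = np.random.randn(5, 5); B = B @ B.T + np.eye(5)   # invertible and reasonably conditioned
evals = eig(A, B)[0]                                 # generalized eigenvalues, no explicit inverse
evals2 = np.linalg.eigvals(np.linalg.inv(B) @ A)     # the "ratio" interpretation
print(np.allclose(np.sort(evals.real), np.sort(evals2.real)))   # True for well-behaved B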
-|*|- Reflection You can also think of B−1A as the matrix version of a ratio
of A to B. This interpretation makes generalized eigendecomposition a
computational workhorse for several multivariate data science and machine-
learning applications, including linear discriminant analysis, source
separation, and classifiers. -|*|-
15.13 Exercises
1.
Find the eigenvalues of the following matrices.
a)
b)
c)
d)
e)
f)
g)
h)
i)
2.
Diagonalize the following matrices.
a)
b)
c)
3.
The following pairs of matrices show a matrix and its eigenvectors. Without
computing eigenvalues, determine the missing eigenvector component.
a)
b)
c)
4.
I wrote that finding eigenvectors involves computing the null space of
A−λI. What would happen if you started from λI − A? Try this using the
matrix in Equation 15.15 (page §) to see whether the results differ. Either
way, explain why this happens.
5.
15.14 Answers
1.
Following are the missing pieces.
a) 4
b) ±
c) 3, 5
d) -3, 8
e) (−1 ± )∕2
f) ± i
g) 2,4,1
h) a,b,c
i) a,b,c
2.
Matrices below are eigenvalues, eigenvectors (non-normalized).
a)
b)
c)
3.
Following are the eigenvector values.
a) ∗ = 1
b) ∗ = −4
c) ∗ = 1
4.
You get the same results: Same eigenvalues and same eigenvectors. It’s a
bit awkward in my opinion, because you have to flip the signs of all the
matrix elements, but conceptually it’s the same thing as writing (a − b)
= −(b − a).
5.
The second thing to notice is that the matrix reduces to its eigenvalue when
transformed on both sides by its eigenvector. That is: vTAv = λ. Again, this
leads to some interesting discoveries about the matrix, which you will learn
about in Chapter 17.
2.
Conveniently, the eigenvalues of a diagonal matrix are simply the
diagonal elements, while the eigenvectors matrix is the identity matrix.
This is because (A − λI) is made singular simply by setting λ to each
diagonal element.
In fact, the eigenvalues of any triangular matrix (including diagonal
matrices) are simply the elements on the main diagonal (see exercise
question 1g-i). However, V = I only for a diagonal matrix. The
eigenvectors matrix of a triangular matrix is itself triangular. That’s
not shown in the code below, but I encourage you to test it!
Code block 15.7:Python
import numpy as npÂ
DÂ =Â np.diag(range(1,6))Â
L,VÂ =Â np.linalg.eig(D)
Code block 15.8:MATLAB
DÂ =Â diag(1:5);Â
[V,L]Â =Â eig(D)
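To test the claim about triangular matrices yourself, here is a sketch of my own in Python:
import numpy as np

T = np.triu(np.random.randn(4, 4))          # an upper-triangular matrix
d, V = np.linalg.eig(T)
print(np.sort(d), np.sort(np.diag(T)))      # same values: the eigenvalues are the diagonal elements
print(np.round(V, 3))                       # and V is (up to round-off) upper-triangular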
3.
The eigenvectors matrix looks cute, like a pixelated flower from a 1970's computer. Plotting the eigenvectors reveals their remarkable property: they're sine waves! (Figure is shown below the code.) If
you are familiar with signal processing, then this might look familiar
from the Fourier transform. In fact, there is a deep connection between
the Hankel matrix and the Fourier transform. By the way, you can run
this code on a Toeplitz matrix for comparison (spoiler alert: Not nearly
as cool!).
Code block 15.9:Python
import numpy as npÂ
import matplotlib.pyplot as pltÂ
from scipy.linalg import hankelÂ
Â
t = np.arange(1,51)Â
lstrow = np.append(t[-1],np.arange(1,t[-1]))Â
HÂ =Â hankel(t,r=lstrow)Â
d,VÂ =Â np.linalg.eig(H)Â
VÂ =Â V[:,np.argsort(d)[::-1]]Â
Â
plt.subplot(221) # the matrixÂ
plt.imshow(H)Â
plt.subplot(222) # the eigenvectorsÂ
plt.imshow(V)Â
plt.subplot(212) # some evecsÂ
plt.plot(V[:,:4]);
Before learning about the mechanics of the SVD, I want to present the big
picture and the terminology. The core idea of SVD is to provide sets of basis
vectors—called singular vectors—for the four matrix subspaces: the row
space, the null space, the column space, and the left-null space; and to
provide scalar singular values that encode the "importance" of each singular
vector. The singular vectors are similar to eigenvectors, and the singular
values are similar to eigenvalues (though their exact relationships are
complicated and will be described later). The full SVD is expressed using the
following matrix letters.
__________________________________________________________________
A = UΣVT (16.1)
where A is M×N; U is an M×M orthogonal matrix whose columns are the left singular vectors; Σ is an M×N diagonal matrix of non-negative singular values; and V is an N×N orthogonal matrix whose columns are the right singular vectors.
________________________________________________________________________
As you can see in the descriptions above, the sizes of the right-hand-side
matrices depend on the size of A. Figure 16.1 shows graphical
representations of the SVD for square, tall, and wide rectangular matrices.
Notice that the size of U corresponds to the number of rows in A, that the size
of V corresponds to the number of columns in A, and that the size of Σ is the
same as that of A. These sizes allow for U to be an orthonormal basis for ℝM and for V to be an orthonormal basis for ℝN.
Let’s see what the SVD looks like for a real matrix. Figure 16.2 shows a
matrix (created by applying a 2D smoothing kernel to random noise; the
grayscale intensity at each pixel in the image is mapped onto the numerical
value at that element of the matrix) and its SVD.
We’re missing the U matrix, but I’m sure you’ve already figured
it out: Take the eigendecomposition of matrix AAT:
AAT = (UΣVT)(UΣVT)T (16.5)
= UΣVTVΣTUT (16.6)
= UΣ2UT (16.7)
So there you have it: the way to compute the SVD of any rectangular matrix is to apply the following steps: (1) take the eigendecomposition of ATA, whose eigenvectors give V and whose eigenvalues are the squared singular values; (2) take the eigendecomposition of AAT, whose eigenvectors give U.
It is actually not necessary to compute both steps to obtain the SVD. After
applying either one of these steps, you can compute the third matrix directly
using one of the following two formulas.
Σ−1UTA = VT
AVΣ−1 = U
Quick question: How do you know that U and V are orthogonal matrices?
Think of your answer and then check the footnote2.
I explained earlier in this chapter that the Σ matrix is the same size as A, and yet computing ΣTΣ actually produces a square matrix, even if A is rectangular. Thus, in practice, the "real" Σ matrix is the same size as A and its diagonal elements are the square roots of the diagonal elements of ΣTΣ. Equivalently, you could cut out the "excess" rows or columns from the square ΣTΣ (after taking those square roots) to trim it down to the size of A.
2Because they come from the eigendecomposition of a symmetric matrix.
When computing the SVD by hand (which I recommend doing at least a few times to solidify the concept) you should first decide whether to apply step 1 or step 2, and then one of the two formulas above. The best strategy depends on the size of the matrix, because you want to compute the eigendecomposition of whichever of ATA or AAT is smaller. On the other hand, following both steps 1 and 2 is easier than it looks, because you need to solve for the eigenvalues (the squared singular values on the diagonal of Σ2) only once.
You can see the effects of not normalizing in the exercises below. It will take
you a while to work through these two exercises by hand, but I believe it's worth the effort and will help you understand the SVD.
__________________________________________________________________
Practice problems Perform the following steps on each matrix below: (1)
Compute the SVD by hand. Construct the singular vectors to be integer
matrices for ease of visual inspection. (2) Using your non-normalized
singular vectors, compute UΣVT to confirm that it does not equal the original
matrix. (3) Normalize each singular vector (each column of U and row of
VT). (4) Re-compute UΣVT and confirm that it exactly equals the original
matrix.
a)
b)
Answers Matrices are ordered as U, Σ, and VT. Top rows show integer
vectors; bottom rows show normalized versions.
a)
b)
__________________________________________________________________
The previous section seemed to imply the trivial relationship that the
eigenvalues of ATA equal the squared singular values of A. That is true, but
there is a more nuanced relationship between the eigenvalues and the singular
values of a matrix. That relationship is organized into three cases.
Case 1: eig(ATA) vs. svd(A) The eigenvalues equal the squared singular
values, for the reasons explained in the previous section. Let’s see a quick
example:
A =
ATA =
This case concerns the eigenvalues of ATA, not the eigenvalues of A. In fact,
there are no eigenvalues of A because it is not a square matrix. This brings us
to our second case.
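A minimal numerical illustration of Case 1, using a random matrix of my own:
import numpy as np

A = np.random.randn(3, 4)
s = np.linalg.svd(A, compute_uv=False)    # singular values of A
evals = np.linalg.eigvalsh(A.T @ A)       # eigenvalues of ATA (one of the four is ~0)
print(np.sort(evals)[::-1][:3])           # the three non-zero eigenvalues...
print(s**2)                               # ...equal the squared singular values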
Case 2: eig(ATA) vs. svd(ATA) In this case, the eigenvalues and singular
values are identical—without squaring the singular values. This is because
eigendecomposition and SVD are the same operation for a square symmetric
matrix (more on this point later).
Case 3a: eig(A) vs. svd(A) for real-valued λ This is different from Case 2
because here we assume that A is not symmetric, which means that the
eigenvalues can be real-valued or complex-valued, depending on the
elements in the matrix. We start by considering the case of a matrix with all
real-valued eigenvalues. Of course, the matrix does need to be square for it to
have eigenvalues, so let’s add another row to the example matrix above.
A =
___________________________________________
Code As you might guess, the SVD is really easy to compute in Python and
in MATLAB. If you use both programs, be very mindful that Python returns
VT whereas MATLAB returns V. Python also returns the singular values in a
vector instead of in a diagonal matrix. You can use the code below to confirm
the answers to the practice problems a few pages ago.
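For instance, a minimal Python sketch of such a confirmation (the matrix below is a placeholder of my own; substitute the practice-problem matrices):
import numpy as np

A = np.array([[1, 1, 0], [0, 1, 1]])   # placeholder matrix
U, s, Vt = np.linalg.svd(A)            # note: Python returns V-transpose, and s as a vector
S = np.zeros(A.shape)
np.fill_diagonal(S, s)                 # place the singular values into an A-sized Sigma
print(np.allclose(U @ S @ Vt, A))      # True: U Sigma V^T reconstructs A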
_________________________________________________________________________
Proving that this is the case simply involves writing out the SVD of the
matrix and its transpose:
A = UΣVT (16.10)
AT = (UΣVT)T = VΣUT (16.11)
Because A = AT, these two equations must be equal:
UΣVT = VΣUT (16.12)
And this proves that U = V, which means that the left and right singular
vectors are the same.
Notice that this is not necessarily the same thing as "Case 2" of the
relationship between singular values and eigenvalues, because not all
symmetric matrices can be expressed as some other matrix times its
transpose.
By construction, σ1 > σ2. Indeed, SVD algorithms always sort the singular
values descending from top-left to lower-right. Thus, zero-valued singular
values will be in the lower-right of the diagonal. σ3 = 0 because this is a
rank-2 matrix. You’ll learn below that any zero-valued singular values
correspond to the null space of the matrix. Therefore, the number of non-zero
singular values corresponds to the dimensionality of the row and column
spaces, which means that the number of non-zero singular values corresponds
to the rank of the matrix. In fact, this is how programs like MATLAB and
Python compute the rank of a matrix: Take its SVD and count the number of
non-zero singular values.
Figure 16.3 shows the "big picture" of the SVD and the four subspaces. There
is a lot going on in that figure, so let’s go through each piece in turn. The
overall picture is a visualization of Equation 16.1; as a quick review: the
matrix gets decomposed into three matrices: U provides an orthogonal basis for ℝM and contains the left singular vectors; Σ is the diagonal matrix of singular values (all non-negative, real-valued); and V provides an orthogonal basis for ℝN (remember that the SVD uses VT, meaning that the rows are the singular vectors, not the columns).
Figure 16.3:Visualization of how the SVD reveals basis sets
for the four subspaces of a rank-r matrix. (orth. =
orthogonal matrix; diag. = diagonal matrix; 1:r = columns
(or rows) 1 through r.)
This figure additionally shows how the columns of U are organized into basis
vectors for the column space (light gray) and left-null space (darker gray);
and how the rows of VT are organized into basis vectors for the row space
(light gray) and null space (darker gray). In particular, the first r columns in
U, and the first r rows in VT, are the bases for the column and row spaces of
A. The columns and rows after r get multiplied by the zero-valued singular
values, and thus form bases for the null spaces. The singular vectors for the
column and row spaces are sorted according to their "importance" to the
matrix A, as indicated by the relative magnitude of the corresponding
singular values.
You can see that the SVD reveals a lot of important information about the
matrix. The points below are implicit in the visualization, and written
explicitly in the interest of clarity:
The rank of the matrix (r) is the number of non-zero singular values.
The dimensionality of the left-null space is the number of columns in U
from r + 1 to M.
The dimensionality of the null space is the number of rows in VT from r
+ 1 to N.
It is important to realize that the organization of the SVD matrices in Figure 16.3 is not a trivial result of the decomposition. Recall from the previous chapter that eigenvalues have no intrinsic sorting; likewise, when you compute the SVD by hand, the singular values are not magically revealed in descending order. Computer algorithms sort the singular values, and their associated singular vectors, to produce this beautiful arrangement.
Figure 16.3 also nicely captures the fact that the column space and left-null
space are orthogonal: If each column in U is orthogonal to each other
column, then any subset of columns is orthogonal to any other non-
overlapping subset of columns. Together, all of these columns span all of
ℝM, which means the rank of U is M, even if the rank of A is r < M.
The story for VT is the same, except we deal with rows instead of columns
(or, if you prefer, the columns of V) and the row space and null space of A.
So, the first r rows provide an orthonormal basis for the row space of A, and
the rest of the rows, which get multiplied by the zero-valued singular values,
are a basis for the null space of A.
________________________________________________________________
Practice problems The following triplets of matrices are UΣVT that were
computed from a matrix A that is not shown. From visual inspection,
determine the size and rank of A, and identify the basis vectors for the four
spaces of A. (Note: I re-scaled U and V to integers.)
a)
b)
Answers
a) The matrix is 2×3 with rank 2. U is a basis for the column space;
the left-null space is empty. The first 2 rows of VT are a basis for the
row space, and the third row is a basis for the null space.
b) The matrix is 4×2 with rank 2. The first two columns of U are a
basis for the column space; the latter two are a basis for the left-null
space. All of VT is a basis for the row space, and the null space is empty.
__________
-|*|- Reflection I know, Figure 16.3 is a lot to take in at first glance. Don't expect to understand everything about the SVD just by staring at that figure. You'll gain more familiarity and intuition about the SVD by working with it, which is the goal of the rest of this chapter! -|*|-
One key difference between eigendecomposition and SVD is that for SVD,
the two singular vectors matrices span the entire ambient spaces (ℝM and ℝN), which is not necessarily the case with eigenvectors (e.g., for non-
diagonalizable matrices).
Let’s start by rewriting the SVD using one pair of singular vectors and
their corresponding singular value. This is analogous to the single-vector
eigenvalue equation.
Av = σu (16.13)
Notice that replacing the vectors with matrices and then right-multiplying by
VT gives the SVD matrix equation that you’re now familiar with.
Now let me remind you of the definition of the column space and left-null
space: The column space comprises all vectors that can be expressed by some
combination of the columns of A, whereas the left-null space comprises all
non-trivial combinations of the columns of A that produce the zeros vector.
In other words:
C(A) : Ax = b (16.14)
N(AT) : Ay = 0 (16.15)
Now we can think about Equation 16.13 in this context: all singular vectors are nonzero, so if σ is nonzero then the right-hand side of Equation 16.13 must also be nonzero, and therefore in the column space of A. Likewise, the only way for the right-hand side of Equation 16.13 to be the zeros vector (and thus in the left-null space of A) is for σ to equal zero.
You can make the same argument for the row space, by starting from the
equation uTA = σvT.
"Effective" rank You’ve read many times in this book that computers
have difficulties with really small and really large numbers. These are called
rounding errors, precision errors, underflow, overflow, etc. How does a
computer decide whether a singular value is small but non-zero vs. zero with
rounding error?
The MATLAB source code for computing rank looks like this:
s = svd(A);
r = sum(s > tol);
In other words, retrieve the singular values, and count the number of those
values that are above a tolerance (variable tol). So the question is, how to set
that tolerance? If it's too small, then the rank will be over-estimated; if it's too large, then the rank will be under-estimated. MATLAB's solution is to set the tolerance dynamically, based on the size of and the elements in the matrix:
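A roughly equivalent computation in NumPy (a sketch of my own; np.spacing plays the role of MATLAB's eps, and the exact tolerance formula may differ across versions):
import numpy as np

A = np.random.randn(10, 4) @ np.random.randn(4, 10)   # a rank-4 matrix, for illustration
s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.spacing(np.max(s))             # dynamic tolerance
print(np.sum(s > tol), np.linalg.matrix_rank(A))       # both report 4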
The function eps returns the distance between a number and the next-larger
number that your computer is capable of representing. For example, if your
computer could only represent integers, then eps=1. (And you probably need
to buy a new computer...)
With matrix multiplication via layers, the two vectors that multiply to create
each "layer" of the product matrix are defined purely by their physical
position in the matrix. Is that really the best way to define basis vectors to
create each layer? Probably not: The position of a row or column in a matrix
may not be the organizing principle of "importance"; indeed, in many
matrices—particularly matrices that contain data—rows or columns can be
swapped or even randomized without any change in the information content
of the data matrix.
A = ∑i=1r σiuiviT (16.16)
where r is the rank of the matrix (the singular values after σr are zeros, and
thus can be omitted from this equation). Your delicate linear-algebraic
sensibilities might be offended by going from the elegant matrix Equation
16.1 to the clumsy vector-sum Equation 16.16. But this equation will set us
up for the SVD spectral theory, and will also lead into one of the important
applications of SVD, which is low-rank approximations (next section).
Because the singular values are sorted descending, A1 is actually the "most
meaningful" SVD-layer of matrix A ("meaningful" can be interpreted as the
amount of total variance in the matrix, or as the most important feature of the
matrix; more on this in the next section). A2 will be the next-most-
meaningful SVD-layer, and so on down to Ar. The SVD-layers after r have
zero-valued singular values and thus contribute nothing to the final matrix.
Thus, each corresponding left and right singular vector combine to produce a
layer of the matrix. This layer is like a direction in the matrix space. But that
direction is simply a pointer—it does not convey "importance." Instead, it is
the singular value that indicates how "important" each direction is. It’s
like a sign-post that points to the nearest gas station (1 km away) and to
Siberia (10,000 km away). The signs (the singular vectors) have the same
size; you have to look at the numbers on the signs (the singular values) to
know how far each destination is.
Figure 16.5 illustrates the concept of reconstructing a matrix by successively
adding SVD-layers. The three singular values of this random matrix are 3.4,
1.0, and 0.5 (rounded to the nearest tenth). In the next section, you will learn
how to interpret these values and how to normalize them into a more
meaningful metric.
Notice that layer 1 ("L1" in panel B) captures the most prominent feature of
the matrix A (the horizontal band in the middle); in the next section I will
refer to this as the best rank-1 approximation of A.
Although it might not be obvious from the color scaling, each column of the
L1 matrix is simply the left-most column of U scalar-multiplied by the
corresponding elements of the first row of VT, and also multiplied by Σ1,1.
Same story for matrices L2 and L3. Columns 4 and 5 of U do not contribute
to reconstructing A—that’s built-in to Equation 16.16 because the rank
of this matrix is r = 3—and it’s also apparent from visualization of the
multiplication: Columns 4 and 5 in U multiply rows 4 and 5 of Σ, which are
all zeros. In terms of matrix spaces, columns 4 and 5 of U are in the left-null
space of A.
__________________________________________________________________
Figure 16.6 shows an example. The matrix A is a 30×40 random numbers
matrix that was smoothed with a 2D Gaussian kernel. Panel A shows its SVD
decomposition, similar to previous figures. The important addition here is
that matrix A has rank = 30, whereas the right-most matrix in panel D has a
rank = 4. And yet, visually, that matrix appears extremely similar to matrix
A. In other words, the rank-4 approximation of the original rank-30 matrix
appears to capture all the important features.
Figure 16.6:Example of using SVD for a low-rank
approximation of a matrix. Panels C and D are
comparable to panels B and C in Figure 16.5. 1:4
indicates the cumulative sum of SVD-layers 1 through 4.
Why is a low-rank approximation useful? Why not simply keep A in all its
glorious original-rank perfection? There are several applications of low-rank
approximations; below are three examples.
The answer is that those two σmax values are probably not comparable,
unless you know that the matrices have numbers in exactly the same range.
Before showing you how to normalize the singular values, I want to show
that singular values are scale-dependent, meaning they change with the scale
of the numbers in the matrix. Below is an example; the second matrix is the
same as the first but multiplied by 10. Notice that their singular values are the
same except for a scaling of 10 (numbers are rounded to the nearest
thousandth).
Because the singular vectors are all unit-length and are scaled by the singular
values when reconstructing the matrix from its SVD layers, the sum over all
singular values can be interpreted as the total variance or "energy" in the
matrix.7 This sum over all singular values is formally called the Schatten 1-
norm of the matrix. For completeness, the equation below is the full formula
for the Schatten p-norm, although here we are considering only the case of p
= 1.
7See also section 6.10 about matrix norms.
‖A‖p = ( ∑i=1r σip )1/p (16.19)
The next step is to scale each singular value to the percent of the Schatten 1-
norm:
σ̃i = 100 σi ∕ ∑j=1r σj (16.20)
Getting back to the problem at the outset of this section, two matrices with
largest singular values of 8 and 678 are not comparable, but let’s say their
normalized largest singular values are 35% and 82%. In this case, 35% means
that the largest SVD-layer accounts for only 35% of the total variance in the
matrix. One interpretation is that this is a complicated matrix that contains a
lot of information along several different directions. In contrast, a largest
normalized singular value of 82% means nearly all of the variance is
explained by one component, so this matrix probably contains less complex
information. If this were a data matrix, it might correspond to one pattern and
18% noise.
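In code, this normalization is one line (a sketch with a random matrix of my own):
import numpy as np

A = np.random.randn(10, 20)
s = np.linalg.svd(A, compute_uv=False)
s_pct = 100 * s / np.sum(s)    # each singular value as a percent of the Schatten 1-norm
print(np.sum(s_pct))           # 100
print(s_pct[0])                # percent of total "variance" in the largest SVD-layer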
Now let’s think back to the question of how many SVD-layers to use in a
low-rank approximation (that is, how to select k). When the singular values
are normalized, you can pick some percent variance threshold and retain all
SVD-layers that contribute at least that much variance. For example, you
might keep all SVD-layers with σ > 1%, or perhaps 0.1% to retain more
information. The choice of a threshold is still somewhat arbitrary, but this is
at least more quantitative and reproducible than visual inspection of the scree
plot.
_________________________________________________________________________
For example, the condition number of the identity matrix is 1, because all of
its singular values are 1. The condition number of any singular matrix is
undefined ("not a number"; NaN), because singular matrices have at least one
zero-valued singular value, thus making the condition number σmax∕σmin a division by zero.
The condition number of all orthogonal matrices is the same. Can you guess
what that condition number is, and why? To build suspense, I’ll explain
the answer at the end of this section.
But don’t take the condition number of a matrix too seriously: Matrices
can contain a lot of information or very little information regardless of their
condition number. The condition number should not be used on its own to
evaluate the usefulness of a matrix.
OK, the answer to the question about orthogonal matrices is that the
condition number of any orthogonal matrix is 1, and that’s because all of
the singular values of an orthogonal matrix are 1. This is the case because an
orthogonal matrix is defined as QTQ = I, and the eigenvalues of a diagonal
matrix are its diagonal elements. And you know from earlier in this chapter
that the singular values of Q are the principal square roots of the singular
values of QTQ.
Code You can compute the condition number on your own based on the
SVD, but Python and MATLAB have built-in functions.
Code block 16.3:Python
import numpy as npÂ
AÂ =Â np.random.randn(5,5)Â
s = np.linalg.svd(A)[1]Â
condnum = np.max(s)/np.min(s)Â
# compare above with cond()Â
print(condnum,np.linalg.cond(A))
Code block 16.4:MATLAB
AÂ =Â randn(5,5);Â
s = svd(A);Â
condnum = max(s)/min(s);Â
disp([condnum,cond(A)])Â %Â comparison
Equation 16.24 also allows us to prove that the inverse of a symmetric matrix
is itself symmetric. Let’s write out the SVD for a symmetric matrix and
its inverse (remember that symmetric matrices have identical left and right
singular vectors).
AT = (VΣVT)T = VΣVT (16.25)
A−1 = (VΣVT)−1 = VΣ−1VT (16.26)
It is immediately clear that A, AT, and A−1 have the same singular vectors;
the singular values may differ, but the point is that Σ is also symmetric, and
thus A−1 is symmetric as long as A is symmetric.
Expressing the inverse via the SVD may seem like an academic exercise, but
this is a crucial introduction to the pseudoinverse, as you will now learn.
Σi,i† = 1∕Σi,i if Σi,i ≠ 0, and 0 otherwise
_________________________________________________________________________
Notice that this procedure will work for any matrix, square or rectangular,
full-rank or singular. When the matrix is square and full-rank, then the
Moore-Penrose pseudoinverse will equal the true inverse, which you can see
yourself by considering what happens when all of the singular values are
non-zero.
One-sided inverse When the matrix is rectangular and either full column
rank or full row rank, then the pseudoinverse will equal the left inverse or the
right inverse. This makes computing the one-sided inverse computationally
efficient, because it can be done without explicitly computing (ATA)-1.
T = UΣVT (16.28)
(TTT)−1TT = ((UΣVT)T(UΣVT))−1(UΣVT)T (16.29)
= (VΣUTUΣVT)−1VΣUT (16.30)
= (VΣ2VT)−1VΣUT (16.31)
= VΣ−2VTVΣUT (16.32)
= VΣ−1UT (16.33)
I know it’s a lot to work through, but the take-home message is that when
you write out the SVD of a left inverse and simplify, you end up with exactly
the same expression as the SVD-based inverse of the original matrix (replace
−1 with †where appropriate).
Needless to say, the conclusion is the same for the right inverse, which you
can work through on your own.
1.
In Chapter 13, you learned about "economy" QR decomposition, which
can be useful for large tall matrices. There is a comparable "economy"
version of the SVD. Your goal here is to figure out what that means.
First, generate three random matrices: square, wide, and tall. Then run
the full SVD to confirm that the sizes of the SVD matrices match your
expectations (e.g., Figure 16.1). Finally, run the economy SVD on all
three matrices and compare the sizes to the full SVD.
2.
Obtain the three SVD matrices from eigendecomposition, as described
in section 16.2. Then compute the SVD of that matrix using the svd()
function, to confirm that your results are correct. Keep in mind the
discussions of sign-indeterminacy.
3.
Write code to reproduce panels B and C in Figure 16.5. Confirm that the
reconstructed matrix (third matrix in panel C) is equal to the original
matrix. (Note: The matrix was populated with random numbers, so don't expect your results to look exactly like those in the figure.)
4.
Create a random-numbers matrix with a specified condition number. For
example, create a 6×16 random matrix with a condition number of κ
= 42. Do this by creating random U and V matrices, an appropriate Σ
matrix, and then create A = UΣVT. Finally, compute the condition
number of A to confirm that it matches what you specified (42).
5.
This and the next two challenges involve taking the SVD of a picture. A
picture is represented as a matrix, with the matrix values corresponding
to grayscale intensities of the pixels. We will use a picture of Einstein.
You can download the file at
https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg
Of course, you can replace this with any other picture: a selfie of you, your dog, your kids, your grandmother on her wedding day... However,
you may need to apply some image processing to reduce the image
matrix from 3D to 2D (thus, grayscale instead of RGB) and the datatype
must be double (MATLAB) or floats (Python).
2.
Hint: You need to sort the columns of U and V based on descending
eigenvalues. You can check your results by subtracting the eigenvectors
and the matching singular vectors matrices. Due to sign-indeterminacy,
you will likely find a few columns of zeros and a few columns of non-
zeros; comparing against −U will flip which columns are zeros and
which are non-zeros. Don’t forget that Python returns VT.
Code block 16.7:Python
AÂ =Â np.random.randn(4,5)Â #Â matrixÂ
L2,V = np.linalg.eig(A.T@A) # get VÂ
V = V[:,np.argsort(L2)[::-1]]#sort descendÂ
L2,U = np.linalg.eig([email protected]) # get UÂ
U = U[:,np.argsort(L2)[::-1]]#sort descendÂ
# create SigmaÂ
SÂ =Â np.zeros(A.shape)Â
for i,s in enumerate(np.sort(L2)[::-1]):Â
    S[i,i] = np.sqrt(s)Â
U2,S2,V2Â =Â np.linalg.svd(A)Â #Â svd
Code block 16.8:MATLAB
AÂ =Â randn(4,5);Â %Â matrixÂ
[V,L2] = eig(A’*A); % get VÂ
[L2,idx] = sort(diag(L2),’d’);Â
V = V(:,idx); % sort by descending LÂ
[U,L2] = eig(A*A’); % get UÂ
[L2,idx] = sort(diag(L2),’d’);Â
U = U(:,idx); % sort by descending LÂ
% create SigmaÂ
SÂ =Â zeros(size(A));Â
for i=1:length(L2)Â
    S(i,i) = sqrt(L2(i));Â
endÂ
[U2,S2,V2]Â =Â svd(A);Â %Â svd
3.
I changed the code slightly from the Figure to include the original
matrix to the right of the reconstructed matrix. Anyway, the important
part is creating the low-rank approximations in a for-loop. Be very
careful with the slicing in Python!
Code block 16.9:Python
import matplotlib.pyplot as pltÂ
fig,ax = plt.subplots(2,4)Â
Â
AÂ =Â np.random.randn(5,3)Â
U,s,VÂ =Â np.linalg.svd(A)Â
S = np.diag(s) # need Sigma matrixÂ
Â
for i in range(3):Â
Â
  onelayer = np.outer(U[:,i],V[i,:])*s[i]Â
  ax[0,i].imshow(onelayer)Â
  ax[0,i].set_title(’Layer %g’%i)Â
  ax[0,i].axis(’off’)Â
Â
  lowrank=U[:,:i+1]@S[:i+1,:i+1]@V[:i+1,:]Â
  ax[1,i].imshow(lowrank)Â
  ax[1,i].set_title(’Layers 0:%g’%i)Â
  ax[1,i].axis(’off’)Â
Â
Â
ax[1,3].imshow(A)Â
ax[1,3].set_title(’Orig. A’)Â
ax[1,3].axis(’off’)Â
ax[0,3].axis(’off’);
5.
You might have struggled a bit with transforming the image, but
hopefully the SVD-related code wasn’t too difficult. My code below
reconstructs the image using components 1-20, but you can also try, e.g.,
21-40, etc.
Code block 16.13:Python
import numpy as npÂ
import matplotlib.pyplot as pltÂ
from imageio import imreadÂ
pic = imread(’https://upload.wikimedia.org/Â
     wikipedia/en/8/86/Einstein_tongue.jpg’
np.array(pic,dtype=float) # convert to float
Â
U,s,V = np.linalg.svd( pic )Â
SÂ =Â np.zeros(pic.shape)Â
for i in range(len(s)):Â
  S[i,i] = s[i]Â
Â
comps = slice(0,21) # low-rank approx.Â
lowrank=U[:,comps]@S[comps,comps]@V[comps,:]Â
Â
# show the original and low-rankÂ
plt.subplot(1,2,1)Â
plt.imshow(pic,cmap=’gray’)Â
plt.title(’Original’)Â
plt.subplot(1,2,2)Â
plt.imshow(lowrank,cmap=’gray’)Â
plt.title(’Comps. %g-%g’Â
           %(comps.start,comps.stop-1));
6.
The low-rank calculations and plotting are basically the same as the
previous exercise. The main additions here are computing percent
variance explained and thresholding. It’s a good idea to check that
all of the normalized singular values sum to 100.
Code block 16.15:Python
# convert to percent explainedÂ
s = 100*s/np.sum(s)Â
plt.plot(s,’s-’); plt.xlim([0,100])Â
plt.xlabel(’Component number’)Â
plt.ylabel(’Pct variance explains’)Â
plt.show()Â
Â
thresh = 4 # threshold in percentÂ
I,J=np.ix_(s>thresh,s>thresh) # comps > X%Â
lowrank = np.squeeze(U[:,J]@S[I,J]@V[J,:])Â
Â
# show the original and low-rankÂ
plt.subplot(1,2,1)Â
plt.imshow(pic,cmap=’gray’)Â
plt.title(’Original’)Â
plt.subplot(1,2,2)Â
plt.imshow(lowrank,cmap=’gray’)Â
plt.title(’%g comps. at %g%%’Â
           %(len(I),thresh));
Code block 16.16:MATLAB
% convert to percent explainedÂ
s = 100*diag(S)./sum(S(:));Â
plot(s,’s-’), xlim([0 100])Â
xlabel(’Component number’)Â
ylabel(’Pct variance explains’)Â
Â
thresh = 4; % threshold in percentÂ
comps = s>thresh; % comps greater than X%
lowrank = U(:,comps) * ...Â
        S(comps,comps)*V(:,comps)’;Â
Â
% show the original and low-rankÂ
figure, subplot(121)Â
imagesc(pic), axis imageÂ
title(’Original’)Â
subplot(122)Â
imagesc(lowrank), axis imageÂ
title(sprintf(’%g comps with > %g%%’,...
               sum(comps),thresh))
colormap gray
7.
The RMS error plot goes down when you include more components.
That’s sensible. The scale of the data is pixel intensity errors, with
pixel values ranging from 0 to 255. However, each number in the plot is
the average over the entire picture, and therefore obscures local regions
of high- vs. low-errors. You can visualize the error map (variable
diffimg).
Code block 16.17:Python
RMSÂ =Â np.zeros(len(s))Â
for si in range(len(s)):Â
  i=si+1 # mind the indexing!Â
  lowrank = U[:,:i]@S[:i,:i]@V[:i,:]Â
  diffimg = lowrank - picÂ
  RMS[si] = np.sqrt(np.mean(Â
            diffimg.flatten()**2))Â
Â
plt.plot(RMS,’s-’)Â
plt.xlabel(’Rank approximation’)Â
plt.ylabel(’Error (a.u.)’);
Code block 16.18:MATLAB
for si=1:length(s)Â
  lowrank=U(:,1:si)*S(1:si,1:si)*V(:,1:si)’;
  diffimg = lowrank - pic;Â
  RMS(si) = sqrt(mean(diffimg(:).^2));Â
endÂ
plot(RMS,’s-’)Â
xlabel(’Rank approximation’)Â
ylabel(’Error (a.u.)’)
8.
This code challenge illustrates that translating formulas into code is not
always straightforward. I hope you enjoyed it!
Code block 16.19:Python
import numpy as npÂ
XÂ =Â np.random.randint(Â
        low=1,high=7,size=(4,2))Â
U,s,VÂ =Â np.linalg.svd(X)Â #Â eq.29Â
SÂ =Â np.zeros(X.shape)Â
for i,ss in enumerate(s):Â
  S[i,i] = ssÂ
longV1Â =Â np.linalg.inv(Â (U@S@V).T@U@S@VÂ )Â
                       Â
longV2Â =Â np.linalg.inv(Â [email protected]@U.T@U@S@VÂ )Â
                       Â
longV3Â =Â np.linalg.inv([email protected]@S@V)Â
                       Â
longV4Â =Â [email protected]_power(S.T@S,-1)Â
                     @ V@V
MPpinv = np.linalg.pinv(X) # eq.34
Code block 16.20:MATLAB
XÂ =Â randi([1Â 6],[4Â 2]);Â
[U,S,V]Â =Â svd(X);Â %Â eq.29Â
longV1 = inv((U*S*V’)’*U*S*V’)*(U*S*V’)’;
longV2 =inv(V*S’*U’*U*S*V’)*(U*S*V’)’;
longV3 = inv(V*S’*S*V’) * (U*S*V’)’;
longV4 = V*(S’*S)^(-1)*V’*V*S’*U’; %Â
MPpinv = pinv(X); % eq.34
9.
The pseudoinverse of a column of constants is a row vector where each
element is 1∕kn where k is the constant and n is the dimensionality.
The reason is that the vector times its pseudoinverse is actually just a dot
product; summing up k n times yields nk, and thus 1∕nk is the correct
inverse to yield 1. (I’m not sure if this has any practical value, but I
hope it helps you think about the pseudoinverse.)
Code block 16.21:Python
import numpy as npÂ
k = 5Â
n = 13Â
a = np.linalg.pinv(np.ones((n,1))*k)Â
a - 1/(k*n) # check for zeros
Code block 16.22:MATLAB
k = 5;Â
n = 13;Â
a = pinv(ones(n,1)*k);Â
a - 1/(k*n) % check for zeros
10.
The differences between the two approaches become much more apparent as the condition number grows. The issue is that high-conditioned matrices are more unstable, and thus so are their inverses. In practical applications, B might be singular, so an eigendecomposition of B−1A is usually not a good idea.
Code block 16.23:Python
import numpy as npÂ
from scipy.linalg import eigÂ
import matplotlib.pyplot as pltÂ
Â
M = 10 # matrix sizeÂ
cns = np.linspace(10,1e10,30)Â
avediffs = np.zeros(len(cns))Â
Â
# loop over condition numbersÂ
for condi in range(len(cns)):Â
Â
  # create AÂ
  U,r = np.linalg.qr( np.random.randn(M,M)Â
  V,r = np.linalg.qr( np.random.randn(M,M)Â
  S = np.diag(np.linspace(cns[condi],1,M))Â
  A = U@[email protected] # construct matrixÂ
Â
  # create BÂ
  U,r = np.linalg.qr( np.random.randn(M,M)Â
  V,r = np.linalg.qr( np.random.randn(M,M)Â
  S = np.diag(np.linspace(cns[condi],1,M))Â
  B = U@[email protected] # construct matrixÂ
Â
  # GEDs and sortÂ
  l1 = eig(A,B)[0]Â
  l2 = eig(np.linalg.inv(B)@A)[0]Â
  l1.sort()Â
  l2.sort()Â
Â
  avediffs[condi] = np.mean(np.abs(l1-l2))Â
Â
plt.plot(cns,avediffs);
One important note before we start: The quadratic form applies only to square
matrices. Thus, throughout this chapter, you can assume that all matrices are
square. Some will be symmetric or non-symmetric, some will be invertible
and some singular. There is some debate about whether the quadratic form
should be applied to non-symmetric matrices, because many special
properties of the quadratic form are valid only when the matrix is symmetric.
I will relax this constraint and point out when symmetry is relevant.
Equation 17.1 is called the "quadratic form of matrix A" and it represents the
energy in the matrix over the coordinate space described by vector v. That
"energy" definition will make more sense in the next section when you learn
about the geometric perspective of the quadratic form. First I want to show a
few numerical examples, using the same matrix and different vectors.
v1TAv1 = 9 (17.2)
v2TAv2 = 19 (17.3)
Two things in particular to notice about this example: The matrix is always
pre- and post-multiplied by the same vector; and the same matrix multiplied
by different vectors will give different results (that’s obvious, but it
becomes important later). Again, we’ll return to the interpretation of this
in the next section.
Now let’s generalize this by using letters instead of numbers. I will use
a,b,c,d for the matrix elements and x,y for the vector elements.
vTAv = [x y] [a b; c d] [x; y] (17.4)
= (ax + cy)x + (bx + dy)y (17.5)
= ax2 + (b + c)xy + dy2 (17.6)
Please take a moment to work through the multiplications by hand to confirm
that you arrive at the same expression. And then take a moment to admire the
beauty of what we’ve done: We have converted a vector-matrix-vector
expression into a polynomial.
Notice that we get three terms here: x2, y2, and their cross-product xy. The matrix elements become the coefficients on these terms, with the diagonal elements becoming the coefficients on the squared vector elements and the off-diagonal elements getting paired with the cross-term.
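You can confirm the expansion numerically; the matrix and vector elements below are arbitrary values of my own choosing:
import numpy as np

a, b, c, d = 2, 1, 3, 4
x, y = 0.7, -1.2
A = np.array([[a, b], [c, d]])
v = np.array([x, y])
print(v @ A @ v)                        # the quadratic form vTAv
print(a*x**2 + (b + c)*x*y + d*y**2)    # the expanded polynomial: same number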
Now imagine that the matrix elements are fixed scalars and x and y are
continuous variables, as if this were a function of two variables:
f(A,v) = vTAv = ax2 + (b + c)xy + dy2 (17.7)
The interpretation is that for each matrix A, we can vary the vector elements
xi and yi and obtain a scalar value for each (xi,yi) pair.
vTAv = [x y z] [a f d; f b e; d e c] [x; y; z] (17.8)
= (ax + fy + dz)x + (fx + by + ez)y + (dx + ey + cz)z (17.9)
= ax2 + fyx + dzx + fxy + by2 + ezy + dxz + eyz + cz2 (17.10)
= ax2 + by2 + cz2 + 2dxz + 2eyz + 2fxy (17.11)
Yeah, that’s quite a lot to look at. Still, you see the squared terms and the
cross-terms, with coefficients defined by the elements in the matrix, and the
diagonal matrix elements paired with their corresponding squared vector
elements.
I’m sure you’re super-curious to see how it looks for a 4×4 matrix.
It’s written out below. The principle is the same: diagonal matrix
elements are coefficients on the squared vector elements, and the off-
diagonals are coefficients on the cross-terms. Just don’t expect me to be
patient enough to keep this going for larger matrices...
Symmetric matrices When the matrix is symmetric, then the quadratic form
is also symmetric. This is easily proven by transposing the entire expression:
$$\left(v^T A v\right)^T = v^T A^T v^{TT} = v^T A v \qquad (17.12)$$
Complex matrices The quadratic form for complex-valued matrices is nearly
the same as for real-valued matrices, except that the Hermitian transpose
replaces the regular transpose:2
2Recall that $v^T = v^H$ for a real-valued vector.
$$v^H A v \qquad (17.13)$$
If the matrix A is Hermitian (the complex version of symmetric, thus AH =
A), then the quadratic form is real-valued. Non-Hermitian matrices will have
complex-valued quadratic forms. Equations 17.12 and 17.13 are two of the
reasons why some people limit the quadratic form to symmetric (and
Hermitian) matrices.
This result is slightly harder to interpret, because the left side is in $\mathbb{R}^3$ whereas the right side is in $\mathbb{R}^4$ (in this particular example). This situation is not further considered.
Code Not much new here, but be mindful of vector orientations: The vector on the left needs to be a row vector, regardless of its original orientation. I've swapped the orientations in the code just to make it a bit more confusing (which requires you to think a bit more!).
Code block 17.1:Python
import numpy as np
m = 4
A = np.random.randn(m,m)
v = np.random.randn(1,m) # row vec.
v@A@v.T
Code block 17.2:MATLAB
m = 4;
A = randn(m);
v = randn(1,m);
v*A*v'
-|*|- Reflection Notice that when A = I, the quadratic form reduces to the dot
product of the vector with itself, which is the magnitude-squared of the
vector. Thus, putting a different matrix in between the vectors is like using
this matrix to modulate, or scale, the magnitude of this vector. In fact, this is
the mechanism for measuring distance in several non-Euclidean geometries. -
|*|-
Now let's think about applying this function over and over again, for the same matrix and different elements in vector v. We can think of the vector as a coordinate space whose axes are defined by the vector elements. This is easiest to conceptualize in $\mathbb{R}^2$, which is illustrated in Figure 17.1. The graph of f(A,v) for v ∈ $\mathbb{R}^2$ is a 3D graph, because the two elements of v provide a 2D coordinate space, and the function value is mapped onto the height above (or below) that plane.
Figure 17.1:A visualization of the quadratic form result of a
matrix at two specific coordinates.
Thus, the 2D plane defined by the v1 and v2 axes is the coordinate system;
each location on that plane corresponds to a unique combination of elements
in the vector, that is, when setting v = [v1,v2]. The z-axis is the function result
(ζ). The vertical dashed gray lines leading to the gray dots indicate the value
of ζ for two particular v’s.
Once we have this visualization, the next step is to evaluate f(A,v) for many different possible values of v (of course, using the same A). If we keep using stick-and-button lines like in Figure 17.1, the plot will be impossible to interpret. So let's switch to a surface plot (Figure 17.2).
That graph is called the quadratic form surface, and it’s like an energy
landscape: The matrix has a different amount of "energy" at different points
in the coordinate system (that is, plugging in different values into v), and this
surface allows us to visualize that energy landscape.
Let’s make sure this is concrete. The matrix that I used to create that
surface is
Now let’s compute one specific data value, at coordinate (-2,1), which is
near the lower-left corner of the plot. Let’s do the math:
The value (the height of the surface) of the function at coordinate (-2,1) is 1.
That looks plausible from the graph.
The coordinates for this quadratic form surface are not bound by ±2; the
plane goes to infinity in all directions, but it’s trimmed here because I
cannot afford to sell this book with one infinitely large page. Fortunately,
though, the characteristics of the surface you see here don’t change as the
axes grow; for this particular matrix, two directions of the quadratic form
surface will continue to grow to infinity (away from the origin), while two
other directions will continue at 0.
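If you'd like to generate this kind of plot yourself, here is a minimal Python sketch; the example matrix is arbitrary and not necessarily the one behind Figure 17.2:

import numpy as np
import matplotlib.pyplot as plt

# an arbitrary example matrix
A = np.array([[1,2],[2,1]])

# evaluate the quadratic form over a grid of (v1,v2) coordinates
vi = np.linspace(-2,2,41)
Z = np.zeros((len(vi),len(vi)))
for i,v1 in enumerate(vi):
    for j,v2 in enumerate(vi):
        v = np.array([v1,v2])
        Z[i,j] = v @ A @ v

# surface plot of the "energy landscape"
V1,V2 = np.meshgrid(vi,vi,indexing='ij')
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(V1,V2,Z,cmap='viridis')
plt.show()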
This surface is the result for one specific matrix; different matrices (with the
same vector v) will have different surfaces. Figure 17.3 shows examples of
quadratic form surfaces for four different matrices. Notice the three
possibilities of the quadratic form surface: The quadratic form can bend up to
positive infinity, down to negative infinity, or stay along zero, in different
directions away from the origin of the [v1,v2] space.
Figure 17.3:Examples of quadratic form surfaces for four
different matrices. The v1,v2 axes are the same in all
subplots, and the f(A,v) = ζ axis is adapted to each
matrix.
There is more to say about the relationship between the matrix elements and
the features of the quadratic form surface. In fact, the shape and sign of the
quadratic form reflects the definiteness of the matrix, its eigenvalues, its
invertibility, and other remarkable features. For now, simply notice that the
shape and sign of the quadratic form surface depends on the elements in the
matrix.
That said, one thing that all quadratic form surfaces have (for all matrices) is
that they equal zero at the origin of the graph, corresponding to v = [0 0].
That’s obvious algebraically—the matrix is both pre- and post-
multiplied by all zeros—but geometrically, it means that we are interested in
the shape of the matrix relative to the origin.
The proof involves (1) expressing the quadratic form as a vector dot product,5
(2) applying the Cauchy-Schwarz inequality to that dot product, and then (3)
applying the Cauchy-Schwarz inequality again to the matrix-vector product.
5The Cauchy-Schwarz inequality for the dot product was on page § if you need a refresher.
To start with, think about $v^TAv$ as the dot product between two vectors: $(v)^T(Av)$. Then apply the Cauchy-Schwarz inequality:
$$|v^T A v| \leq \|v\| \, \|Av\| \qquad (17.15)$$
The next step is to break up ∥Av∥ into the product of norms:6
6The Cauchy-Schwarz inequality for matrix-vector multiplication was on page §.
$$\|Av\| \leq \|A\|_F \, \|v\| \qquad (17.16)$$
Then we put everything together:
$$|v^T A v| \leq \|v\| \, \|A\|_F \, \|v\|$$
$$|v^T A v| \leq \|A\|_F \, \|v\|^2 \qquad (17.17)$$
Equation 17.17 is the crux of our predicament: The magnitude of the
quadratic form depends on the matrix and on our coordinate system. In other
words, of course the quadratic form tends to plus or minus infinity (if not
zero) as it moves away from the origin of the space defined by v (and thus as
the magnitude of v increases).
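Here is a quick numerical sanity check of Equation 17.17, using random matrices and vectors (the matrix size and number of tests are arbitrary):

import numpy as np

# check that |v'Av| <= ||A||_F ||v||^2 for many random matrices and vectors
for _ in range(1000):
    A = np.random.randn(4,4)
    v = np.random.randn(4)
    lhs = abs(v @ A @ v)
    rhs = np.linalg.norm(A,'fro') * np.linalg.norm(v)**2
    assert lhs <= rhs + 1e-12  # small tolerance for floating-point rounding
print('Equation 17.17 held in every random test')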
You can call this a feature or you can call it a bug. Either way, it impedes
using the quadratic form in statistics and in machine learning. We need
something like the quadratic form that reveals important directions in the
matrix space independent of the magnitude of the vector that we use for the
coordinate space. That motivates including some kind of normalization factor
that will allow us to explore the quadratic form in a way that is independent
of the vector v.
To discover the right normalization factor, let's think about the quadratic form of the identity matrix. What values of v will maximize the quadratic form $v^TIv$? The mathy way of writing this question is:
$$\arg\max_{v} \ v^T I v \qquad (17.18)$$
This means we are looking for the argument (here v) that maximizes the
expression.
Obviously, the quadratic form goes to infinity in all directions, because we're simply adding up all of the squared vector elements. So then, what normalization factor could we apply to remove this trivial growth?
The answer is to divide by the squared magnitude of the vector, which gives the normalized quadratic form:
$$v_{\max} = \arg\max_{v} \ \frac{v^T A v}{v^T v} \qquad (17.20)$$
Perhaps it isn’t yet obvious why normalizing by the squared vector
magnitude is the right thing to do. Let’s revisit Equation 17.17 but now
include the normalization factor.
$$\frac{|v^T A v|}{\|v\|^2} \leq \frac{\|A\|_F \, \|v\|^2}{\|v\|^2} \qquad (17.21)$$
$$\leq \|A\|_F \qquad (17.22)$$
Now we see that the magnitude of the normalized quadratic form is bounded
by the magnitude of the matrix, and does not depend on the vector that
provides the coordinate space.
Geometrically, this means that the normalized quadratic form surface of the identity matrix is a flat sheet hovering at a constant height of 1 over the v1,v2 plane.
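A tiny sketch to confirm that claim about the identity matrix: the normalized quadratic form equals 1 for any nonzero vector.

import numpy as np

# the normalized quadratic form of the identity matrix is 1 for every nonzero v
I = np.eye(2)
for _ in range(5):
    v = np.random.randn(2)
    print( (v @ I @ v) / (v @ v) )  # 1.0 every time (up to rounding)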
Have you noticed the failure scenario yet? If you don’t already know the
answer, I think you can figure it out from looking again at Equation 17.20.
The failure happens when v = 0, in other words, with the zeros vector.
The normalized quadratic form surfaces look quite different from the "raw"
quadratic form surfaces. Figure 17.4 shows the surfaces from the normalized
quadratic forms for each of the four matrices shown in Figure 17.3.
-|*|- Reflection If you are reading this book carefully and taking notes (which
you should be doing!), then you’ll remember that on page 13.1 I
introduced the concept of "mapping over magnitude." The normalized
quadratic form can be conceptualized in the same way: It’s a mapping of
a matrix onto a vector coordinate space, over the magnitude of that
coordinate space. -|*|-
Each normalized quadratic form surface in Figure 17.4 has two prominent
features: a ridge and a valley. Going back to the interpretation that the
quadratic form surface is a representation of the "energy" in the matrix along
particular directions, the normalized surface tells us that there are certain
directions in the matrix that are associated with maximal and minimal
"energy" in the matrix.
(What does matrix "energy" mean? I put that term in apology quotes because
it is an abstract concept that depends on what the matrix represents. In the
next two chapters, the matrix will contain data covariances, and then
"energy" will translate to pattern of covariance across the data features.)
Why is this the case? Why do the eigenvectors point along the directions of
maximal and minimal "energy" in the matrix? The short answer is that the
vectors that maximize the quadratic form (Equation 17.20) turn out to be the
eigenvectors of the matrix. A deeper discussion of why that happens is
literally the mathematical basis of principal components analysis, and so I
will go through the math in that chapter.
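Here is a short sketch of that claim, using a random symmetric matrix: the normalized quadratic form evaluated at the eigenvectors returns the eigenvalues, and no other vector escapes that range.

import numpy as np

# a random symmetric matrix
A = np.random.randn(2,2)
A = (A + A.T) / 2

evals, evecs = np.linalg.eigh(A)       # eigh is appropriate for symmetric matrices
nqf = lambda v: (v @ A @ v) / (v @ v)  # the normalized quadratic form

print('at the eigenvectors:', [nqf(evecs[:,i]) for i in range(2)])  # equals evals
rand_vals = [nqf(np.random.randn(2)) for _ in range(1000)]
print('range over random vectors:', min(rand_vals), max(rand_vals))
# the random-vector values stay between the smallest and largest eigenvalue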
Although I really enjoy looking at quadratic form surfaces, they are not the
best way to determine the definiteness of a matrix. Instead, the way to
determine the definiteness category of a matrix is to inspect the signs of the
eigenvalues.
Question: Can you determine the definiteness of a matrix from its singular
values? The answer is in the footnote.10
10The answer is No, because all singular values are non-negative.
Complex matrices For Hermitian matrices (remember that a Hermitian matrix is the complex version of a symmetric matrix: it equals its Hermitian transpose), the story is exactly the same. This is because $v^HCv$ is a real-valued number, and because the eigenvalues of a Hermitian matrix are real-valued.
The conclusion is that any matrix that can be expressed as $A^TA$ is positive definite if it is full-rank, and positive semidefinite if it is reduced-rank. Before going further, please take a moment to test whether the above proof works for the matrix $AA^T$. (Obviously it will, but it's good practice to work through it on your own.)
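A quick numerical illustration of this conclusion, using randomly generated matrices (the sizes are arbitrary):

import numpy as np

# A'A from a full-(column-)rank matrix: all eigenvalues positive (positive definite)
A = np.random.randn(6,4)
print(np.linalg.eigvalsh(A.T @ A))

# A'A from a reduced-rank matrix: some eigenvalues are (numerically) zero
B = np.random.randn(6,2) @ np.random.randn(2,4)  # rank 2
print(np.linalg.eigvalsh(B.T @ B))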
Important: All matrices $S = A^TA$ are symmetric, but not all symmetric matrices can be expressed as $A^TA$ (or $AA^T$).12 A symmetric matrix that cannot be decomposed into another matrix times its transpose is not necessarily positive definite. Symmetry does not guarantee positive (semi)definiteness.
12Decomposing a symmetric matrix S into $A^TA$ is the goal of the Cholesky decomposition, which is valid only for positive (semi)definite matrices.
Here is an example:
But wait a minute: the quadratic form is defined for all vectors, not only for
eigenvectors. You might be tempted to say that our conclusion (positive
definite means all positive eigenvalues) is valid only for eigenvectors.
The key to expanding this to all vectors is to appreciate that as long as the eigenvectors matrix is full-rank and therefore spans all of $\mathbb{R}^M$, any vector can be created by some linear combination of eigenvectors. Let's see how this works for a linear combination of two eigenvectors.
The conclusion is that if we assume that a matrix is positive definite, then all
of its eigenvalues must be positive. You can also think about this proof the
other way: If a matrix has all positive eigenvalues (meaning the right-hand
side of Equation 17.33 is always positive), then it is a positive definite matrix
(meaning the left-hand side of that equation is always positive).
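And here is a brief numerical check that a matrix with all-positive eigenvalues has a positive quadratic form for any vector, not only for eigenvectors (the matrix is built as $A^TA$ of a random matrix):

import numpy as np

# a symmetric matrix with all-positive eigenvalues (with probability 1 for random A)
A = np.random.randn(3,3)
S = A.T @ A
print(np.linalg.eigvalsh(S))  # all positive

# the quadratic form is positive for every tested (nonzero) vector
for _ in range(1000):
    v = np.random.randn(3)
    assert v @ S @ v > 0
print('quadratic form was positive for all tested vectors')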
We don't need to write out a new set of equations. All we need to do13 is change our assumption about the matrix: Now we assume that the quadratic form of A is non-negative, meaning that ζ can be positive and it can be zero, but it can never be negative. Below is Equation 17.27, repeated for reference.
13Pro life tip: Sometimes you need to reassess your assumptions about matrices, yourself, other people, life, etc.
Again, the right-hand-side of the equation has a term that is strictly positive
and a term that could be positive, negative, or zero. The left-hand side of the
equation can take values that are positive or zero, but not negative. This
proves that at least one eigenvalue must be zero, and all non-zero eigenvalues
must be positive. And the proof works the other way: If all eigenvalues are
zero or positive, then the quadratic form must be non-negative, which is the definition of positive semidefinite.
The other categories I trust you see the pattern here. Proving the relationship
between the signs of the eigenvalues and the definiteness of a matrix does not
require additional math; it simply requires changing your assumption about A
and re-examining the above equations. As math textbook authors love to
write: The proof is left as an exercise to the reader.
Once you have the code working, embed it inside a for-loop to generate
500 matrices. Store the definiteness category for each matrix. Finally,
print out a list of the number of matrices in each definiteness category.
What have you discovered about the quadratic forms of random
matrices?
2.
Notice that I've set a tolerance for "zero"-valued eigenvalues, as discussed in previous chapters for thresholds for computing rank. You will find that all or nearly all random matrices are indefinite (positive and negative eigenvalues). If you create smaller matrices (3×3 or 2×2), you'll find more matrices in the other categories, although indefinite will still dominate. The category number corresponds to the rows of Table 17.1.
Code block 17.5:Python
import numpy as np
n = 4
nIterations = 500
defcat = np.zeros(nIterations)

for iteri in range(nIterations):
  # create a random integer matrix, redrawing until all eigenvalues are real
  A = np.random.randint(-10,11,size=(n,n))
  e = np.linalg.eig(A)[0]
  while not np.all(np.isreal(e)):
    A = np.random.randint(-10,11,size=(n,n))
    e = np.linalg.eig(A)[0]
  e = np.real(e) # drop the zero-valued imaginary parts

  # "zero" threshold (same idea as for computing rank)
  t = n*np.spacing(np.max(np.linalg.svd(A)[1]))

  # test definiteness
  if np.all(np.sign(e)==1):
    defcat[iteri] = 1 # pos. def
  elif np.all(np.sign(e)>-1) and np.sum(np.abs(e)<t)>0:
    defcat[iteri] = 2 # pos. semidef
  elif np.all(np.sign(e)<1) and np.sum(np.abs(e)<t)>0:
    defcat[iteri] = 4 # neg. semidef
  elif np.all(np.sign(e)==-1):
    defcat[iteri] = 5 # neg. def
  else:
    defcat[iteri] = 3 # indefinite

# print out summary
for i in range(1,6):
  print('cat %g: %g'%(i,sum(defcat==i)))
Clearly, these two variables are related to each other; you can imagine
drawing a straight line through that relationship. The correlation coefficient is
the statistical analysis method that quantifies this relationship.
Limited to linearity One thing to know about the correlation is that it can
detect only the linear component of an interaction between two variables.
Any nonlinear interactions will not be detected by a correlation. The lower-
right panel of Figure 18.2 shows an example: the value of y is clearly related to the value of x; however, that relationship is nonlinear and has no linear component. (There are, of course, measures of nonlinear
relationships, such as mutual information. It’s also possible to transform
the variables so that their relationship is linearized.)
$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (18.1)$$
You can imagine from the formula that when all data values are close to the
mean, the variance is small; and when the data values are far away from the
mean, the variance is large. Figure 18.3 shows examples of two datasets with identical means but different variances.
It is obvious that we can expand the squared term in Equation 18.1 and
rewrite as follows. This is relevant for understanding covariance.
$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x}) \qquad (18.2)$$
Once you know the variance, you also know the standard deviation. Standard
deviation is simply the square root of variance. It is implicit that standard
deviation is the principal (positive) square root of variance,2 because it
doesn’t make sense for a measure of dispersion to be negative, just like it
doesn’t make sense for a length to be negative.
2It is a bit confusing that σ indicates singular values in linear algebra and standard deviation in
statistics. As I wrote in Chapter 14, terminological overloading is simply unavoidable in modern human
civilization.
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (18.3)$$
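Here is a short Python sketch of these formulas, comparing the "by hand" computations against numpy's built-in functions (the ddof=1 input gives the n−1 normalization):

import numpy as np

x = np.random.randn(100)*3 + 10  # some made-up data

# variance "by hand" (Equation 18.1) vs numpy
var_manual = np.sum( (x - np.mean(x))**2 ) / (len(x)-1)
print(var_manual, np.var(x, ddof=1))

# standard deviation is the (principal) square root of the variance
print(np.sqrt(var_manual), np.std(x, ddof=1))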
18.3 Covariance
The term covariance means variance between two variables. The covariance
between variables x and y is computed by modifying Equation 18.2.
$$c_{x,y} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad (18.4)$$
where n is the number of observations, and $\bar{x}$ and $\bar{y}$ are the averages over all elements in x and y. Needless to say, this equation is valid only when x and y have the same number of elements.
This is looking better. It turns out that the normalization factor of n − 1 is often unnecessary in applications (for example, when comparing multiple covariances that all have the same n), therefore Equation 18.5 really simplifies to $x^Ty$ under the assumption that both variables are mean-centered.
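As a quick sketch with made-up data, here is the covariance computed as a (scaled) dot product of mean-centered variables, compared against np.cov:

import numpy as np

n = 200
x = np.random.randn(n)
y = .6*x + np.random.randn(n)  # y is partly driven by x

# mean-center, then the covariance is a scaled dot product
xm = x - np.mean(x)
ym = y - np.mean(y)
cov_manual = (xm @ ym) / (n-1)

print(cov_manual, np.cov(x,y)[0,1])  # matches numpy's covariance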
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (18.6)$$
The numerators are the same. The denominator of Equation 18.6 has two
terms, each of which is almost identical to the standard deviation, except that
the 1∕(n− 1) term is missing. In fact, that normalization term is missing
once in the numerator and twice in the denominator (each time inside each
square root). In other words, that term cancels and thus is omitted for
simplicity.
Again, the formula above is technically correct but is going to give our plane full of linear algebraticians indigestion. Let us make their flight more pleasant
by rewriting Equation 18.6, again assuming that the variables are already
mean-centered.
$$r = \frac{x^Ty}{\|x\|\,\|y\|} \qquad (18.7)$$
The letter r is commonly used to indicate a correlation coefficient.
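Here is Equation 18.7 in code, compared against np.corrcoef (the data are made up):

import numpy as np

n = 200
x = np.random.randn(n)
y = .6*x + np.random.randn(n)

# correlation: dot product of the mean-centered vectors, divided by their norms
xm = x - np.mean(x)
ym = y - np.mean(y)
r_manual = (xm @ ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

print(r_manual, np.corrcoef(x,y)[0,1])  # matches numpy's correlation coefficient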
Cosine similarity I realize that Chapter 3 was a long time ago, but I hope
you remember the geometric formula for the dot product, and how we
derived a formula for the angle between vectors. That was Equation 3.16
(page §), and I will reprint it below for convenience.
$$\cos(\theta_{ab}) = \frac{a^Tb}{\|a\|\,\|b\|}$$
When the data are not mean-centered, Equation 3.16 is called cosine
similarity. The difference in interpretation and use of correlation vs. cosine
similarity is a topic for a machine-learning course, but you can already see
that the two measures are somewhere between identical and similar,
depending on the average of the data values.
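And a short sketch contrasting cosine similarity (no mean-centering) with correlation, using data that have a large nonzero mean:

import numpy as np

n = 200
x = np.random.randn(n) + 10    # data with a mean far from zero
y = .6*x + np.random.randn(n)

# cosine similarity: normalized dot product WITHOUT mean-centering
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
# correlation: the same formula applied to mean-centered data
r = np.corrcoef(x,y)[0,1]

print(cos_sim, r)  # the two differ here because the means are far from zero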
We can also turn this equation around to compute a covariance matrix given correlation and standard deviation matrices. (Again, the normalization factor is omitted for simplicity.) Writing C for the covariance matrix, R for the correlation matrix, and S for the diagonal matrix of standard deviations, this is
$$C = S\,R\,S \qquad (18.13)$$
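A quick numerical check of that relationship, with the symbols as defined above (the data are made up):

import numpy as np

# made-up multichannel data (observations x features)
X = np.random.randn(500,3) @ np.random.randn(3,3)
X = X - X.mean(axis=0)            # mean-center

C = (X.T @ X) / (len(X)-1)        # covariance matrix
S = np.diag(np.sqrt(np.diag(C)))  # diagonal matrix of standard deviations
R = np.corrcoef(X, rowvar=False)  # correlation matrix

# rebuild the covariance matrix from the correlation and standard-deviation matrices
print(np.allclose(S @ R @ S, C))  # True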
Thus, in this chapter I will present the linear algebra aspects of PCA while
simplifying or ignoring some of the statistical aspects.
Now let’s consider panel B. The two data channels are negatively
correlated. Keeping the same weighting of .7 for each channel actually
reduced the variance of the result (that is, the resulting component has less
variance than either individual channel). Thus, a weighting of .7 for both
channels is not a good PCA solution.
Instead, PCA will negatively weight one of the channels to flip its sign (panel
C). That resulting component will maximize variance.
This is the idea of PCA: You input the data, PCA finds the best set of
weights, with "best" corresponding to the weights that maximize the variance
of the weighted combination of data channels.
Figure 19.1:Three scenarios to illustrate PCA in 2D data.
We can visualize the data shown in panel A as a scatter plot, with each axis
corresponding to a channel (Figure 19.2), and each dot corresponding to a
time point. The dashed line is the first principal component—the weighting of the two channels that maximizes variance. The second principal
component is the direction in the data cloud that maximizes the residual
variance (that is, all the variance not explained by the first component)
subject to the constraint that that component is orthogonal to the first
component.
Figure 19.2:The time series data in Figure 19.1A visualized
in a scatter plot, along with the principal components.
The qualitative method involves examining the scree plot, which is a plot of
the sorted eigenvalues. You learned about creating and interpreting scree
plots in Chapter 16 about the SVD (e.g., Figure 16.6).
Panel B shows the same data, mean-centered, and with the two principal
components (the eigenvectors of the covariance matrix) drawn on top. Notice
that the eigenvector associated with the larger eigenvalue points along the
direction of the linear relationship between the two variables.
Panel C shows the same data but redrawn in PC space. Because PCs are
orthogonal, the PC space is a pure rotation of the original data space.
Therefore, the data projected through the PCs are decorrelated.
Because these data are in $\mathbb{R}^2$, we can visualize the normalized quadratic form
surface of the data covariance matrix. It’s shown in Figure 19.4. The two
panels actually show the same data. I thought the rotated view (left panel)
looked really neat but was suboptimal for visualizing the eigenvectors. The
right panel is the bird’s-eye-view, and more clearly shows how the
eigenvectors point along the directions of maximal and minimal energy in the
quadratic form surface.
Let’s get back to the problem statement: Find a direction in the data space
that maximizes the data covariance. This can be translated into finding a
vector v that maximizes the quadratic form of the data covariance matrix.
$$v_{\max} = \arg\max_{v} \ v^T C v \qquad (19.1)$$
where C is the data covariance matrix.
You know from Chapter 17 that this maximization problem is difficult for the
non-normalized quadratic form, because the quadratic form will grow to
infinity as the elements in v get larger. Fortunately, avoiding this trivial
solution simply requires adding the constraint that v be a unit vector. That
can be equivalently written in one of two ways:
$$v_{\max} = \arg\max_{v} \ v^T C v, \quad \text{s.t. } \|v\| = 1$$
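To make this concrete, here is a minimal PCA sketch on made-up two-channel data: eigendecompose the covariance matrix, take the eigenvector with the largest eigenvalue as the channel weights, and confirm that the resulting component's variance equals that eigenvalue (and is at least as large as either channel's variance).

import numpy as np

# made-up two-channel data with a linear relationship
n = 1000
ch1 = np.random.randn(n)
ch2 = .8*ch1 + .4*np.random.randn(n)
X = np.column_stack((ch1,ch2))
X = X - X.mean(axis=0)            # mean-center

C = (X.T @ X) / (n-1)             # covariance matrix
evals, evecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order

w = evecs[:,-1]                   # eigenvector with the largest eigenvalue (unit norm)
comp = X @ w                      # the first principal component time series

print(np.var(comp, ddof=1), evals[-1])  # component variance equals the largest eigenvalue
print(np.var(X, axis=0, ddof=1))        # ...and is at least as large as each channel's variance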
Chapter 20
Where to go from here?
20.1 The end... of the beginning!
1. You read the chapters in order, and you’ve now finished the book.
Congrats! Amazing! Go treat yourself to an ice cream, chocolate bar,
glass of wine, pat on the back, 2-week trip to Tahiti, or whatever you
like to do to reward yourself.
2. You skipped forwards because you are curious, even though you
haven’t finished the book. In that case, feel free to keep reading, but
then get back to the important chapters!
Abstract linear algebra As you know by now, I have tried to keep this book
focused on linear algebra concepts that are most relevant for applications. I
like to call it "down-to-Earth" linear algebra. However, linear algebra is a
huge topic in mathematics, and there are many avenues of linear algebra that
are more focused on proofs, abstractions, and topics that are mathematically
intriguing but less directly relevant to applications on computers.
If you are interested in following this line of study, then a quick Internet
search for something like "theoretical linear algebra" or "abstract linear
algebra" will give you some places to start.
Application-specific linear algebra I intentionally avoided focusing on any
particular application or any particular field, to make the book relevant for a
broad audience.
20.2 Thanks!
Thank you for choosing to learn from this book, and for trusting me with
your linear algebra education. I hope you found the book useful, informative,
and perhaps even a bit entertaining. Your brain is your most valuable asset,
and investments in your brain always pay large dividends.