General Pseudo-Inverse
if $A$ has SVD $A = U\Sigma V^T$, then
$$A^\dagger = V\Sigma^{-1}U^T$$
is the pseudo-inverse or Moore-Penrose inverse of $A$

if $A$ is skinny and full rank,
$$A^\dagger = (A^T A)^{-1}A^T$$
gives the least-squares solution $x_{\mathrm{ls}} = A^\dagger y$

if $A$ is fat and full rank,
$$A^\dagger = A^T(AA^T)^{-1}$$
gives the least-norm solution $x_{\mathrm{ln}} = A^\dagger y$
Full SVD
SVD of $A \in \mathbf{R}^{m\times n}$ with $\mathrm{Rank}(A) = r$:
$$A = U_1\Sigma_1 V_1^T =
\begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix}
\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}
\begin{bmatrix} v_1^T \\ \vdots \\ v_r^T \end{bmatrix}$$

find $U_2 \in \mathbf{R}^{m\times(m-r)}$, $V_2 \in \mathbf{R}^{n\times(n-r)}$ s.t. $U = [\,U_1\ U_2\,] \in \mathbf{R}^{m\times m}$ and $V = [\,V_1\ V_2\,] \in \mathbf{R}^{n\times n}$ are orthogonal

add zero rows/cols to $\Sigma_1$ to form $\Sigma \in \mathbf{R}^{m\times n}$:
$$\Sigma = \begin{bmatrix} \Sigma_1 & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{bmatrix}$$

then
$$U_1\Sigma_1 V_1^T =
\begin{bmatrix} U_1 & U_2 \end{bmatrix}
\begin{bmatrix} \Sigma_1 & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{bmatrix}
\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$$

i.e.:
$$A = U\Sigma V^T$$
called full SVD of A
(SVD with positive singular values only called compact SVD)
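A small sketch (again mine, using an arbitrarily chosen rank-2 matrix) of assembling the full SVD: `numpy.linalg.svd` returns square $U$ and $V$, and the $m\times n$ matrix $\Sigma$ is obtained by zero-padding the singular values:

```python
import numpy as np

# A sketch (not from the notes): build the full SVD A = U Σ V^T with
# Σ ∈ R^{m×n} formed by zero-padding the diagonal block of singular values.
rng = np.random.default_rng(1)
m, n, r = 5, 3, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # Rank(A) = r

U, s, Vt = np.linalg.svd(A)            # full SVD: U is m×m, Vt is n×n
Sigma = np.zeros((m, n))               # zero rows/cols around the singular values
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(A, U @ Sigma @ Vt)                 # A = U Σ V^T
assert np.allclose(U.T @ U, np.eye(m))                # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(n))              # V orthogonal
print("numerical rank:", np.sum(s > 1e-10 * s[0]))    # prints r = 2
```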
Image of unit ball under linear transformation
full SVD
$$A = U\Sigma V^T$$
gives interpretation of $y = Ax$:

rotate (by $V^T$)

stretch along axes by $\sigma_i$ ($\sigma_i = 0$ for $i > r$)

zero-pad (if $m > n$) or truncate (if $m < n$) to get $m$-vector

rotate (by $U$)
Image of unit ball under A
[figure: unit ball rotated by $V^T$, stretched by $\Sigma = \mathrm{diag}(2, 0.5)$ along $u_1$, $u_2$, then rotated by $U$]

$\{Ax \mid \|x\| \le 1\}$ is ellipsoid with principal axes $\sigma_i u_i$
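The following sketch reproduces this geometry numerically for a made-up $A$ with $\Sigma = \mathrm{diag}(2, 0.5)$ (the rotations $U$, $V$ are arbitrary choices of mine, not the ones behind the notes' figure):

```python
import numpy as np

# A sketch with made-up rotations (the notes only specify Σ = diag(2, 0.5)):
# map the unit circle through A = U Σ V^T and check the principal semi-axes.
def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

U, V = rot(0.7), rot(-0.4)             # arbitrary orthogonal (rotation) matrices
Sigma = np.diag([2.0, 0.5])
A = U @ Sigma @ V.T

theta = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # points with ||x|| = 1
ellipse = A @ circle                                 # image {Ax : ||x|| = 1}

# longest/shortest image vectors have norms σ1 = 2 and σ2 = 0.5,
# attained in the directions u1 and u2 (the columns of U)
norms = np.linalg.norm(ellipse, axis=0)
print(norms.max(), norms.min())        # ≈ 2.0, 0.5
i_max = norms.argmax()
print(np.abs(U[:, 0] @ ellipse[:, i_max]) / norms[i_max])   # ≈ 1 (aligned with u1)
```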
Sensitivity of linear equations to data error
consider $y = Ax$, $A \in \mathbf{R}^{n\times n}$ invertible; of course $x = A^{-1}y$

suppose we have an error or noise in $y$, i.e., $y$ becomes $y + \delta y$

then $x$ becomes $x + \delta x$ with $\delta x = A^{-1}\delta y$

hence we have $\|\delta x\| = \|A^{-1}\delta y\| \le \|A^{-1}\|\,\|\delta y\|$

if $\|A^{-1}\|$ is large,

small errors in $y$ can lead to large errors in $x$

can't solve for $x$ given $y$ (with small errors)

hence, $A$ can be considered singular in practice
a more refined analysis uses relative instead of absolute errors in $x$ and $y$

since $y = Ax$, we also have $\|y\| \le \|A\|\,\|x\|$, hence
$$\frac{\|\delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta y\|}{\|y\|}$$

$$\kappa(A) = \|A\|\,\|A^{-1}\| = \sigma_{\max}(A)/\sigma_{\min}(A)$$
is called the condition number of $A$

we have:

relative error in solution $x$ $\le$ condition number $\cdot$ relative error in data $y$

or, in terms of # bits of guaranteed accuracy:

# bits accuracy in solution $\approx$ # bits accuracy in data $-\ \log_2\kappa$
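A short demonstration (my example, not from the notes) of the relative-error bound, using a deliberately ill-conditioned $A$:

```python
import numpy as np

# A small demonstration of ||δx||/||x|| ≤ κ(A) · ||δy||/||y||.
rng = np.random.default_rng(2)
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])                  # σmax/σmin = 1e6
kappa = np.linalg.cond(A, 2)                 # κ(A) = σmax(A)/σmin(A)

x = rng.standard_normal(2)
y = A @ x
dy = 1e-8 * rng.standard_normal(2)           # small error in the data y
dx = np.linalg.solve(A, y + dy) - x          # resulting error in the solution

rel_x = np.linalg.norm(dx) / np.linalg.norm(x)
rel_y = np.linalg.norm(dy) / np.linalg.norm(y)
print(f"kappa = {kappa:.1e}")
print(f"rel. error in x = {rel_x:.2e} <= {kappa * rel_y:.2e} = kappa * rel. error in y")
```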
Observability and state estimation

$$\begin{bmatrix} y(0) \\ \vdots \\ y(t-1) \end{bmatrix}
= \mathcal{O}_t\, x(0) + \mathcal{T}_t
\begin{bmatrix} u(0) \\ \vdots \\ u(t-1) \end{bmatrix}$$
where
$$\mathcal{O}_t = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{t-1} \end{bmatrix},
\qquad
\mathcal{T}_t = \begin{bmatrix}
D & 0 & \cdots & & \\
CB & D & 0 & \cdots & \\
\vdots & & \ddots & & \\
CA^{t-2}B & CA^{t-3}B & \cdots & CB & D
\end{bmatrix}$$

$\mathcal{O}_t$ maps initial state into resulting output over $[0, t-1]$

$\mathcal{T}_t$ maps input to output over $[0, t-1]$
hence we have
$$\mathcal{O}_t\, x(0) =
\begin{bmatrix} y(0) \\ \vdots \\ y(t-1) \end{bmatrix}
- \mathcal{T}_t
\begin{bmatrix} u(0) \\ \vdots \\ u(t-1) \end{bmatrix}$$
the observability matrix is
$$\mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix}$$
and the data entering the estimate are
$$\begin{bmatrix} y(0) \\ \vdots \\ y(t-1) \end{bmatrix} - \mathcal{T}_t \begin{bmatrix} u(0) \\ \vdots \\ u(t-1) \end{bmatrix}
\qquad\text{or, over a shifted window,}\qquad
\begin{bmatrix} y(\tau-t+1) \\ \vdots \\ y(\tau) \end{bmatrix} - \mathcal{T}_t \begin{bmatrix} u(\tau-t+1) \\ \vdots \\ u(\tau) \end{bmatrix}$$
$$\sum_{i=0}^{n-1} \alpha_i\, C A^i z = 0$$
(by C-H), where
$$\det(sI - A) = s^n + \alpha_{n-1}s^{n-1} + \cdots + \alpha_0$$
Continuous-time observability
continuous-time system with no sensor or state noise:
$$\dot x = Ax + Bu, \qquad y = Cx + Du$$

can we deduce state $x$ from $u$ and $y$?

let's look at derivatives of $y$:
$$y = Cx + Du$$
$$\dot y = C\dot x + D\dot u = CAx + CBu + D\dot u$$
$$\ddot y = CA^2 x + CABu + CB\dot u + D\ddot u$$
and so on
hence we have
$$\begin{bmatrix} y \\ \dot y \\ \vdots \\ y^{(n-1)} \end{bmatrix}
= \mathcal{O}x + \mathcal{T}
\begin{bmatrix} u \\ \dot u \\ \vdots \\ u^{(n-1)} \end{bmatrix},
\qquad
\mathcal{T} = \begin{bmatrix}
D & 0 & \cdots & & \\
CB & D & 0 & \cdots & \\
\vdots & & \ddots & & \\
CA^{n-2}B & CA^{n-3}B & \cdots & CB & D
\end{bmatrix}$$
so that
$$\mathcal{O}x =
\begin{bmatrix} y \\ \dot y \\ \vdots \\ y^{(n-1)} \end{bmatrix}
- \mathcal{T}
\begin{bmatrix} u \\ \dot u \\ \vdots \\ u^{(n-1)} \end{bmatrix}$$
RHS is known; x is to be determined
hence if $N(\mathcal{O}) = \{0\}$ we can deduce $x(t)$ from derivatives of $u(t)$, $y(t)$ up to order $n-1$
in this case we say system is observable
can construct an observer using any left inverse $F$ of $\mathcal{O}$:
$$x = F\left(
\begin{bmatrix} y \\ \dot y \\ \vdots \\ y^{(n-1)} \end{bmatrix}
- \mathcal{T}
\begin{bmatrix} u \\ \dot u \\ \vdots \\ u^{(n-1)} \end{bmatrix}
\right)$$
reconstructs $x(t)$ (exactly and instantaneously) from
$$u(t), \ldots, u^{(n-1)}(t), \quad y(t), \ldots, y^{(n-1)}(t)$$
derivative-based state reconstruction is dual of state transfer using
impulsive inputs
A converse
suppose $z \in N(\mathcal{O})$ (the unobservable subspace), and $u$ is any input, with $x$, $y$ the corresponding state and output, i.e.,
$$\dot x = Ax + Bu, \qquad y = Cx + Du$$

then the state trajectory $\tilde x = x + e^{tA}z$ satisfies
$$\dot{\tilde x} = A\tilde x + Bu, \qquad y = C\tilde x + Du$$

i.e., input/output signals $u$, $y$ are consistent with both state trajectories $x$, $\tilde x$

hence if the system is unobservable, no signal processing of any kind applied to $u$ and $y$ can deduce $x$

the unobservable subspace $N(\mathcal{O})$ gives a fundamental ambiguity in deducing $x$ from $u$, $y$
Least-squares observers
discrete-time system, with sensor noise:
$$x(t+1) = Ax(t) + Bu(t), \qquad y(t) = Cx(t) + Du(t) + v(t)$$

we assume $\mathrm{Rank}(\mathcal{O}_t) = n$ (hence, system is observable)

least-squares observer uses pseudo-inverse:
$$\hat x(0) = \mathcal{O}_t^\dagger \left(
\begin{bmatrix} y(0) \\ \vdots \\ y(t-1) \end{bmatrix}
- \mathcal{T}_t
\begin{bmatrix} u(0) \\ \vdots \\ u(t-1) \end{bmatrix}
\right)$$
where $\mathcal{O}_t^\dagger = \left(\mathcal{O}_t^T\mathcal{O}_t\right)^{-1}\mathcal{O}_t^T$
interpretation: $\hat x_{\mathrm{ls}}(0)$ minimizes discrepancy between

output $\hat y$ that would be observed, with input $u$ and initial state $x(0)$ (and no sensor noise), and

output $y$ that was observed,

measured as
$$\sum_{\tau=0}^{t-1} \|\hat y(\tau) - y(\tau)\|^2$$

can express least-squares initial state estimate as
$$\hat x_{\mathrm{ls}}(0) = \left(\sum_{\tau=0}^{t-1}(A^T)^\tau C^T C A^\tau\right)^{-1}
\sum_{\tau=0}^{t-1}(A^T)^\tau C^T \tilde y(\tau)$$
where $\tilde y$ is the observed output with the portion due to the input subtracted: $\tilde y = y - h * u$, where $h$ is the impulse response
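A sketch of the least-squares observer on a small randomly generated observable system (dimensions, horizon, and noise level are my choices); it builds $\mathcal{O}_t$ and $\mathcal{T}_t$ explicitly and applies the pseudo-inverse:

```python
import numpy as np

# A sketch (not the notes' code): recover x(0) from noisy outputs with O_t†.
rng = np.random.default_rng(3)
n, m, p, t = 3, 1, 2, 20                     # state, input, output dims; horizon
A = 0.5 * rng.standard_normal((n, n))        # scaled so dynamics stay well behaved
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))
D = np.zeros((p, m))

# O_t = [C; CA; ...; CA^{t-1}],  T_t = block lower-triangular Toeplitz of D, CB, CAB, ...
O_t = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(t)])
T_t = np.zeros((p * t, m * t))
for i in range(t):
    for j in range(i + 1):
        blk = D if i == j else C @ np.linalg.matrix_power(A, i - j - 1) @ B
        T_t[i*p:(i+1)*p, j*m:(j+1)*m] = blk

# simulate the system with small sensor noise v(t)
x0 = rng.standard_normal(n)
u = rng.standard_normal((t, m))
x, Y = x0.copy(), []
for k in range(t):
    Y.append(C @ x + D @ u[k] + 0.01 * rng.standard_normal(p))
    x = A @ x + B @ u[k]
Y = np.concatenate(Y)

x0_hat = np.linalg.pinv(O_t) @ (Y - T_t @ u.reshape(-1))   # least-squares observer
print(np.linalg.norm(x0_hat - x0))                          # small (≈ noise level)
```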
Least-squares observer uncertainty ellipsoid
since $\mathcal{O}_t^\dagger \mathcal{O}_t = I$, we have
$$\tilde x(0) = \hat x_{\mathrm{ls}}(0) - x(0) = \mathcal{O}_t^\dagger
\begin{bmatrix} v(0) \\ \vdots \\ v(t-1) \end{bmatrix}$$
now assume the noise is bounded: $\dfrac{1}{t}\sum_{\tau=0}^{t-1}\|v(\tau)\|_2^2 \le \alpha^2$
set of possible estimation errors is the ellipsoid
$$\tilde x(0) \in \mathcal{E}_{\mathrm{unc}} = \left\{
\mathcal{O}_t^\dagger \begin{bmatrix} v(0) \\ \vdots \\ v(t-1) \end{bmatrix}
\;\middle|\;
\frac{1}{t}\sum_{\tau=0}^{t-1}\|v(\tau)\|_2^2 \le \alpha^2
\right\}$$

$\mathcal{E}_{\mathrm{unc}}$ is the uncertainty ellipsoid for $x(0)$ (least-squares gives best $\mathcal{E}_{\mathrm{unc}}$)
shape of uncertainty ellipsoid determined by matrix
$$\left(\mathcal{O}_t^T\mathcal{O}_t\right)^{-1} = \left(\sum_{\tau=0}^{t-1}(A^T)^\tau C^T C A^\tau\right)^{-1}$$
maximum norm of error is
$$\|\hat x_{\mathrm{ls}}(0) - x(0)\| \le \alpha\sqrt{t}\,\|\mathcal{O}_t^\dagger\|$$

the limit
$$P := \lim_{t\to\infty}\left(\sum_{\tau=0}^{t-1}(A^T)^\tau C^T C A^\tau\right)^{-1}$$
always exists, and gives the limiting uncertainty in estimating $x(0)$ from $u$, $y$ over longer and longer periods:
if A is stable, P > 0
i.e., can't estimate initial state perfectly even with an infinite number of measurements $u(t)$, $y(t)$, $t = 0, \ldots$ (since memory of $x(0)$ fades ...)
if A is not stable, then P can have nonzero nullspace
i.e., initial state estimation error gets arbitrarily small (at least in some
directions) as more and more of signals u and y are observed
Continuous-time least-squares state estimation
assume $\dot x = Ax + Bu$, $y = Cx + Du + v$ is observable

least-squares estimate of initial state $x(0)$, given $u(\tau)$, $y(\tau)$, $0 \le \tau \le t$:

choose $\hat x_{\mathrm{ls}}(0)$ to minimize the integral square residual
$$J = \int_0^t \left\|\tilde y(\tau) - Ce^{\tau A}x(0)\right\|^2\, d\tau$$
where $\tilde y = y - h * u$ is the observed output minus the part due to the input

let's expand as $J = x(0)^T Q x(0) + 2r^T x(0) + s$, where
$$Q = \int_0^t e^{\tau A^T}C^T C e^{\tau A}\, d\tau, \qquad
r = -\int_0^t e^{\tau A^T}C^T \tilde y(\tau)\, d\tau, \qquad
s = \int_0^t \tilde y(\tau)^T \tilde y(\tau)\, d\tau$$
setting $\nabla_{x(0)}J$ to zero, we obtain the least-squares observer
$$\hat x_{\mathrm{ls}}(0) = -Q^{-1}r =
\left(\int_0^t e^{\tau A^T}C^T C e^{\tau A}\, d\tau\right)^{-1}
\int_0^t e^{\tau A^T}C^T \tilde y(\tau)\, d\tau$$

estimation error is
$$\tilde x(0) = \hat x_{\mathrm{ls}}(0) - x(0) =
\left(\int_0^t e^{\tau A^T}C^T C e^{\tau A}\, d\tau\right)^{-1}
\int_0^t e^{\tau A^T}C^T v(\tau)\, d\tau$$

therefore if $v = 0$ then $\hat x_{\mathrm{ls}}(0) = x(0)$
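A numerical sketch (mine) of the continuous-time observer: $Q$ and the vector $\int_0^t e^{\tau A^T}C^T\tilde y(\tau)\,d\tau$ are approximated by a Riemann sum on a fine grid, with $u = 0$ and $v = 0$ so the estimate recovers $x(0)$:

```python
import numpy as np
from scipy.linalg import expm

# A sketch of the continuous-time least-squares observer, with the integrals
# replaced by Riemann sums (same quadrature for both, so the recovery is exact
# up to floating point when v = 0).
rng = np.random.default_rng(4)
n, p, t_final, N = 3, 2, 2.0, 2000
A = rng.standard_normal((n, n)) / np.sqrt(n)
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)

taus = np.linspace(0.0, t_final, N)
dt = taus[1] - taus[0]
Q = np.zeros((n, n))
rhs = np.zeros(n)                      # approximates ∫ e^{τA^T} C^T ỹ(τ) dτ
for tau in taus:
    E = expm(tau * A)
    y_tilde = C @ E @ x0               # ỹ(τ) = C e^{τA} x(0) when u = 0, v = 0
    Q += E.T @ C.T @ C @ E * dt
    rhs += E.T @ C.T @ y_tilde * dt

x0_hat = np.linalg.solve(Q, rhs)       # x̂_ls(0) = Q^{-1} ∫ e^{τA^T} C^T ỹ dτ
print(np.linalg.norm(x0_hat - x0))     # ≈ 0
```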
Least-squares estimation of A from data

given data $\{x_i, y_i\}_{i=1}^N$ with $y \approx Ax$, estimate $A$ as
$$\hat A = \operatorname*{arg\,min}_A \sum_{i=1}^N \|y_i - Ax_i\|^2$$

expanding the objective,
$$\sum_{i=1}^N \|y_i - Ax_i\|^2 = \sum_{i=1}^N \left( x_i^T A^T A x_i - 2 y_i^T A x_i + y_i^T y_i \right)$$

setting the gradient with respect to $A$ to zero,
$$\nabla_A \sum_{i=1}^N \|y_i - Ax_i\|^2 = \sum_{i=1}^N \left( 2 A x_i x_i^T - 2 y_i x_i^T \right) = 0$$
so
$$A \sum_{i=1}^N x_i x_i^T = \sum_{i=1}^N y_i x_i^T$$

if $N > n$ (and $\sum_{i=1}^N x_i x_i^T$ is invertible),
$$\hat A = \left( \sum_{i=1}^N y_i x_i^T \right)\left( \sum_{i=1}^N x_i x_i^T \right)^{-1}$$

Note that this would also be the most likely $A$ if $w$ were Gaussian noise with zero mean and unit variance.

example: $y_t = x_{t+1}$, where the data come from $x_{t+1} = Ax_t + w$

[figure: phase plane trajectory of $x_{t+1} = Ax_t + w$]

[figure: squared error $\|\hat A - A\|^2$ versus time]
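A sketch of this estimator on simulated data (the dynamics matrix, horizon, and noise are my choices, not the experiment behind the figures above):

```python
import numpy as np

# Estimate the dynamics matrix A from one trajectory of x_{t+1} = A x_t + w_t
# using A_hat = (sum y_i x_i^T)(sum x_i x_i^T)^{-1}, with y_t = x_{t+1}.
rng = np.random.default_rng(5)
n, T = 2, 1000
A_true = np.array([[0.95, 0.20],
                   [-0.20, 0.90]])               # an assumed stable 2x2 dynamics
x = np.zeros((T + 1, n))
for t in range(T):
    x[t + 1] = A_true @ x[t] + rng.standard_normal(n)   # process noise w_t

X, Y = x[:-1], x[1:]                             # x_i = x_t, y_i = x_{t+1}
A_hat = (Y.T @ X) @ np.linalg.inv(X.T @ X)       # (Σ y_i x_i^T)(Σ x_i x_i^T)^{-1}
print(np.linalg.norm(A_hat - A_true) ** 2)       # squared error, shrinks as T grows
```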
EE363 Winter 2005-06
Lecture 6
Estimation
Gaussian random vectors
minimum mean-square estimation (MMSE)
MMSE with linear measurements
relation to least-squares, pseudo-inverse
Gaussian random vectors
random vector $x \in \mathbf{R}^n$ is Gaussian if it has density
$$p_x(v) = (2\pi)^{-n/2}(\det\Sigma)^{-1/2}
\exp\!\left(-\tfrac{1}{2}(v - \bar x)^T\Sigma^{-1}(v - \bar x)\right),$$
for some $\Sigma = \Sigma^T > 0$, $\bar x \in \mathbf{R}^n$

denoted $x \sim \mathcal{N}(\bar x, \Sigma)$

$\bar x \in \mathbf{R}^n$ is the mean or expected value of $x$, i.e.,
$$\bar x = \mathbf{E}\,x = \int v\, p_x(v)\, dv$$

$\Sigma = \Sigma^T > 0$ is the covariance matrix of $x$, i.e.,
$$\Sigma = \mathbf{E}(x - \bar x)(x - \bar x)^T$$
$$\Sigma = \mathbf{E}\,xx^T - \bar x\bar x^T = \int (v - \bar x)(v - \bar x)^T p_x(v)\, dv$$

density for $x \sim \mathcal{N}(0, 1)$:

[figure: the density $p_x(v) = \frac{1}{\sqrt{2\pi}}\,e^{-v^2/2}$ plotted versus $v$]
mean and variance of scalar random variable $x_i$ are
$$\mathbf{E}\,x_i = \bar x_i, \qquad \mathbf{E}(x_i - \bar x_i)^2 = \Sigma_{ii}$$
hence standard deviation of $x_i$ is $\sqrt{\Sigma_{ii}}$

covariance between $x_i$ and $x_j$ is $\mathbf{E}(x_i - \bar x_i)(x_j - \bar x_j) = \Sigma_{ij}$

correlation coefficient between $x_i$ and $x_j$ is $\rho_{ij} = \dfrac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}$

mean (norm) square deviation of $x$ from $\bar x$ is
$$\mathbf{E}\,\|x - \bar x\|^2 = \mathbf{E}\,\mathbf{Tr}(x - \bar x)(x - \bar x)^T = \mathbf{Tr}\,\Sigma = \sum_{i=1}^n \Sigma_{ii}$$
(using $\mathbf{Tr}\,AB = \mathbf{Tr}\,BA$)

example: $x \sim \mathcal{N}(0, I)$ means $x_i$ are independent identically distributed (IID) $\mathcal{N}(0,1)$ random variables
Confidence ellipsoids

$p_x(v)$ is constant for $(v - \bar x)^T\Sigma^{-1}(v - \bar x) = \alpha$, i.e., on the surface of the ellipsoid
$$\mathcal{E}_\alpha = \{\, v \mid (v - \bar x)^T\Sigma^{-1}(v - \bar x) \le \alpha \,\}$$

thus $\bar x$ and $\Sigma$ determine shape of density

can interpret $\mathcal{E}_\alpha$ as a confidence set for $x$: $(x - \bar x)^T\Sigma^{-1}(x - \bar x)$ has a $\chi_n^2$ distribution, so $\mathbf{Prob}(x \in \mathcal{E}_\alpha) = F_{\chi_n^2}(\alpha)$, where $F_{\chi_n^2}$ is the CDF

some good approximations (checked numerically in the sketch below):

$\mathcal{E}_n$ gives about 50% probability

$\mathcal{E}_{n + 2\sqrt{n}}$ gives about 90% probability
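These rules of thumb can be checked directly from the $\chi_n^2$ CDF; a quick sketch using `scipy.stats.chi2` (my own check, for a few representative values of $n$):

```python
from scipy.stats import chi2

# Check the rules of thumb above via Prob(x ∈ E_alpha) = F_{chi2_n}(alpha).
for n in (2, 10, 100):
    p50 = chi2.cdf(n, df=n)                  # E_n
    p90 = chi2.cdf(n + 2 * n**0.5, df=n)     # E_{n + 2*sqrt(n)}
    print(f"n={n:3d}: Prob(E_n) = {p50:.2f}, Prob(E_(n+2*sqrt(n))) = {p90:.2f}")
```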
geometrically:

mean $\bar x$ gives center of ellipsoid

semiaxes are $\sqrt{\alpha\lambda_i}\, u_i$, where $u_i$ are (orthonormal) eigenvectors of $\Sigma$ with eigenvalues $\lambda_i$
example: $x \sim \mathcal{N}(\bar x, \Sigma)$ with $\bar x = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$

$x_1$ has mean 2, std. dev. $\sqrt{2}$

$x_2$ has mean 1, std. dev. 1

correlation coefficient between $x_1$ and $x_2$ is $\rho = 1/\sqrt{2}$

$\mathbf{E}\,\|x - \bar x\|^2 = 3$

90% confidence ellipsoid corresponds to $\alpha = 4.6$:

[figure: 100 samples in the $(x_1, x_2)$ plane together with the 90% confidence ellipsoid]

(here, 91 out of 100 fall in $\mathcal{E}_{4.6}$)
Affine transformation

suppose $x \sim \mathcal{N}(\bar x, \Sigma_x)$

consider an affine transformation of $x$:
$$z = Ax + b,$$
where $A \in \mathbf{R}^{m\times n}$, $b \in \mathbf{R}^m$

then $z$ is Gaussian, with mean
$$\mathbf{E}\,z = \mathbf{E}(Ax + b) = A\,\mathbf{E}\,x + b = A\bar x + b$$
and covariance
$$\Sigma_z = \mathbf{E}(z - \bar z)(z - \bar z)^T = \mathbf{E}\,A(x - \bar x)(x - \bar x)^T A^T = A\Sigma_x A^T$$
examples:

if $w \sim \mathcal{N}(0, I)$ then $x = \Sigma^{1/2}w + \bar x$ is $\mathcal{N}(\bar x, \Sigma)$; useful for simulating vectors with given mean and covariance (see the sketch below)

conversely, if $x \sim \mathcal{N}(\bar x, \Sigma)$ then $z = \Sigma^{-1/2}(x - \bar x)$ is $\mathcal{N}(0, I)$ (normalizes & decorrelates)
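A sketch of the simulation recipe above, using the example mean and covariance from the confidence-ellipsoid slide (the sample size is my choice):

```python
import numpy as np

# Simulate x ~ N(x_bar, Sigma) as Sigma^{1/2} w + x_bar with w ~ N(0, I),
# then check the sample mean and covariance.
rng = np.random.default_rng(6)
x_bar = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.0]])

# symmetric square root Sigma^{1/2} via the eigendecomposition of Sigma
lam, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(lam)) @ U.T

W = rng.standard_normal((100000, 2))            # rows are samples of w ~ N(0, I)
X = W @ Sigma_half.T + x_bar                    # x = Sigma^{1/2} w + x_bar

print(X.mean(axis=0))                           # ≈ [2, 1]
print(np.cov(X, rowvar=False))                  # ≈ [[2, 1], [1, 1]]
```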
suppose $x \sim \mathcal{N}(\bar x, \Sigma)$ and $c \in \mathbf{R}^n$

scalar $c^T x$ has mean $c^T\bar x$ and variance $c^T\Sigma c$

thus the (unit length) direction of minimum variability for $x$ is $u$, where
$$\Sigma u = \lambda_{\min}u, \qquad \|u\| = 1$$
standard deviation of $u^T x$ is $\sqrt{\lambda_{\min}}$

(similarly for maximum variability)
Degenerate Gaussian vectors
it is convenient to allow $\Sigma$ to be singular (but still $\Sigma = \Sigma^T \ge 0$)

(in this case the density formula obviously does not hold)

meaning: in some directions $x$ is not random at all

write $\Sigma$ as
$$\Sigma = \begin{bmatrix} Q_+ & Q_0 \end{bmatrix}
\begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} Q_+ & Q_0 \end{bmatrix}^T$$
where $Q = [\,Q_+\ Q_0\,]$ is orthogonal, $\Sigma_+ > 0$

columns of $Q_0$ are an orthonormal basis for $N(\Sigma)$

columns of $Q_+$ are an orthonormal basis for $\mathrm{range}(\Sigma)$
then $Q^T x = \begin{bmatrix} z^T & w^T \end{bmatrix}^T$, where

$z \sim \mathcal{N}(Q_+^T\bar x, \Sigma_+)$ is (nondegenerate) Gaussian (hence, the density formula holds)

$w = Q_0^T x$ is not random

($Q_0^T x$ is called the deterministic component of $x$)
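A sketch (mine) of this decomposition for a singular covariance built as $BB^T$: an eigendecomposition splits the eigenvectors into $Q_+$ (range) and $Q_0$ (nullspace):

```python
import numpy as np

# Split a singular covariance into random and deterministic directions via
# Sigma = [Q+ Q0] diag(Sigma+, 0) [Q+ Q0]^T.
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
Sigma = B @ B.T                          # PSD with rank 2 (singular)

lam, Q = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
tol = 1e-10 * lam.max()
Q_zero = Q[:, lam <= tol]                # orthonormal basis for N(Sigma)
Q_plus = Q[:, lam > tol]                 # orthonormal basis for range(Sigma)
Sigma_plus = np.diag(lam[lam > tol])

# Q0^T x is deterministic: its covariance Q0^T Sigma Q0 is (numerically) zero
print(np.linalg.norm(Q_zero.T @ Sigma @ Q_zero))            # ≈ 0
# Q+^T x is nondegenerate Gaussian with covariance Sigma+
print(np.allclose(Q_plus.T @ Sigma @ Q_plus, Sigma_plus))   # True
```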
Linear measurements
linear measurements with noise:
$$y = Ax + v$$

$x \in \mathbf{R}^n$ is what we want to measure or estimate

$y \in \mathbf{R}^m$ is measurement

$A \in \mathbf{R}^{m\times n}$ characterizes sensors or measurements

$v$ is sensor noise
common assumptions:

$x \sim \mathcal{N}(\bar x, \Sigma_x)$

$v \sim \mathcal{N}(\bar v, \Sigma_v)$

$x$ and $v$ are independent

$\mathcal{N}(\bar x, \Sigma_x)$ is the prior distribution of $x$ (describes initial uncertainty about $x$)

$\bar v$ is noise bias or offset (and is usually 0)

$\Sigma_v$ is noise covariance
thus
$$\begin{bmatrix} x \\ v \end{bmatrix} \sim
\mathcal{N}\left(
\begin{bmatrix} \bar x \\ \bar v \end{bmatrix},
\begin{bmatrix} \Sigma_x & 0 \\ 0 & \Sigma_v \end{bmatrix}
\right)$$

using
$$\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} I & 0 \\ A & I \end{bmatrix}
\begin{bmatrix} x \\ v \end{bmatrix}$$
we can write
$$\mathbf{E}\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} \bar x \\ A\bar x + \bar v \end{bmatrix}$$
and
$$\mathbf{E}\begin{bmatrix} x - \bar x \\ y - \bar y \end{bmatrix}
\begin{bmatrix} x - \bar x \\ y - \bar y \end{bmatrix}^T
= \begin{bmatrix} I & 0 \\ A & I \end{bmatrix}
\begin{bmatrix} \Sigma_x & 0 \\ 0 & \Sigma_v \end{bmatrix}
\begin{bmatrix} I & 0 \\ A & I \end{bmatrix}^T
= \begin{bmatrix} \Sigma_x & \Sigma_x A^T \\ A\Sigma_x & A\Sigma_x A^T + \Sigma_v \end{bmatrix}$$
covariance of measurement $y$ is $A\Sigma_x A^T + \Sigma_v$

$A\Sigma_x A^T$ is signal covariance

$\Sigma_v$ is noise covariance
Minimum mean-square estimation
suppose $x \in \mathbf{R}^n$ and $y \in \mathbf{R}^m$ are random vectors (not necessarily Gaussian)

we seek to estimate $x$ given $y$

thus we seek a function $\phi : \mathbf{R}^m \to \mathbf{R}^n$ such that $\hat x = \phi(y)$ is near $x$

one common measure of nearness: mean-square error,
$$\mathbf{E}\,\|\phi(y) - x\|^2$$

minimum mean-square estimator (MMSE) $\phi_{\mathrm{mmse}}$ minimizes this quantity

general solution: $\phi_{\mathrm{mmse}}(y) = \mathbf{E}(x|y)$, i.e., the conditional expectation of $x$ given $y$
MMSE for Gaussian vectors
now suppose $x \in \mathbf{R}^n$ and $y \in \mathbf{R}^m$ are jointly Gaussian:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim
\mathcal{N}\left(
\begin{bmatrix} \bar x \\ \bar y \end{bmatrix},
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix}
\right)$$

(after a lot of algebra) the conditional density is
$$p_{x|y}(v|y) = (2\pi)^{-n/2}(\det\Lambda)^{-1/2}
\exp\!\left(-\tfrac{1}{2}(v - w)^T\Lambda^{-1}(v - w)\right),$$
where
$$\Lambda = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T, \qquad
w = \bar x + \Sigma_{xy}\Sigma_y^{-1}(y - \bar y)$$

hence the MMSE estimator (i.e., conditional expectation) is
$$\hat x = \phi_{\mathrm{mmse}}(y) = \mathbf{E}(x|y) = \bar x + \Sigma_{xy}\Sigma_y^{-1}(y - \bar y)$$
$\phi_{\mathrm{mmse}}$ is an affine function

MMSE estimation error, $\hat x - x$, is a Gaussian random vector:
$$\hat x - x \sim \mathcal{N}\!\left(0,\ \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T\right)$$

note that
$$\Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T \le \Sigma_x$$
i.e., covariance of estimation error is always less than prior covariance of $x$
Best linear unbiased estimator
estimator
$$\hat x = \phi_{\mathrm{blu}}(y) = \bar x + \Sigma_{xy}\Sigma_y^{-1}(y - \bar y)$$
makes sense when $x$, $y$ aren't jointly Gaussian

this estimator

is unbiased, i.e., $\mathbf{E}\,\hat x = \mathbf{E}\,x$

often works well

is widely used

has minimum mean square error among all affine estimators

sometimes called best linear unbiased estimator
MMSE with linear measurements
consider the specific case
$$y = Ax + v, \qquad x \sim \mathcal{N}(\bar x, \Sigma_x), \qquad v \sim \mathcal{N}(\bar v, \Sigma_v),$$
with $x$, $v$ independent

MMSE of $x$ given $y$ is the affine function
$$\hat x = \bar x + B(y - \bar y)$$
where $B = \Sigma_x A^T(A\Sigma_x A^T + \Sigma_v)^{-1}$, $\bar y = A\bar x + \bar v$

interpretation:

$\bar x$ is our best prior guess of $x$ (before measurement)

$y - \bar y$ is the discrepancy between what we actually measure ($y$) and the expected value of what we measure ($\bar y$)
estimator modifies prior guess by $B$ times this discrepancy

estimator blends prior information with measurement

$B$ gives gain from observed discrepancy to estimate

$B$ is small if noise term $\Sigma_v$ in "denominator" is large
MMSE error with linear measurements
MMSE estimation error, $\tilde x = \hat x - x$, is Gaussian with zero mean and covariance
$$\Sigma_{\mathrm{est}} = \Sigma_x - \Sigma_x A^T(A\Sigma_x A^T + \Sigma_v)^{-1}A\Sigma_x$$

$\Sigma_{\mathrm{est}} \le \Sigma_x$, i.e., measurement always decreases uncertainty about $x$

difference $\Sigma_x - \Sigma_{\mathrm{est}}$ gives value of measurement $y$ in estimating $x$

e.g., $\left(\Sigma_{\mathrm{est},ii}/\Sigma_{x,ii}\right)^{1/2}$ gives fractional decrease in uncertainty of $x_i$ due to measurement

note: error covariance $\Sigma_{\mathrm{est}}$ can be determined before measurement $y$ is made!
to evaluate $\Sigma_{\mathrm{est}}$, only need to know

$A$ (which characterizes sensors)

prior covariance of $x$ (i.e., $\Sigma_x$)

noise covariance (i.e., $\Sigma_v$)

you do not need to know the measurement $y$ (or the means $\bar x$, $\bar v$)
useful for experiment design or sensor selection
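A sketch putting the last few slides together (problem sizes and covariances are my choices): it computes the gain $B$, the estimate $\hat x$, and the error covariance $\Sigma_{\mathrm{est}}$, the last of which indeed uses only $A$, $\Sigma_x$, and $\Sigma_v$:

```python
import numpy as np

# MMSE estimate with linear measurements: x_hat = x_bar + B (y - y_bar),
# plus the error covariance Sigma_est, computable before any measurement.
rng = np.random.default_rng(7)
n, m = 3, 2
x_bar = np.zeros(n)
v_bar = np.zeros(m)
Sigma_x = np.eye(n)
Sigma_v = 0.1 * np.eye(m)
A = rng.standard_normal((m, n))

S = A @ Sigma_x @ A.T + Sigma_v                      # covariance of y
B = Sigma_x @ A.T @ np.linalg.inv(S)                 # MMSE gain
Sigma_est = Sigma_x - Sigma_x @ A.T @ np.linalg.inv(S) @ A @ Sigma_x

# draw one (x, v), form the measurement, and estimate
x = x_bar + np.linalg.cholesky(Sigma_x) @ rng.standard_normal(n)
v = v_bar + np.linalg.cholesky(Sigma_v) @ rng.standard_normal(m)
y = A @ x + v
x_hat = x_bar + B @ (y - (A @ x_bar + v_bar))

print("estimate:", x_hat)
print("fractional uncertainty remaining per coordinate:",
      np.sqrt(np.diag(Sigma_est) / np.diag(Sigma_x)))   # (Sigma_est_ii / Sigma_x_ii)^{1/2}
```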