Basics of Sensor Fusion 2020
September 3, 2020
Contents

1 Introduction
  1.1 Definition and Components of Sensor Fusion
  1.2 Model of a Drone
  1.3 Model of an Autonomous Car
  1.4 Dynamic Models of Drone and Car
  1.5 Tracking a Moving Drone or Car

5 Dynamic Models
  5.1 Continuous-Time State-Space Models
    5.1.1 Deterministic Linear State-Space Models
    5.1.2 Stochastic Linear State-Space Models
    5.1.3 Nonlinear State-Space Models
  5.2 Discrete-Time State-Space Models
    5.2.1 Deterministic Linear State-Space Models
    5.2.2 Stochastic Linear Dynamic Models
    5.2.3 Nonlinear Dynamic Model
  5.3 Discretization of Linear Dynamic Models
    5.3.1 Deterministic Linear Dynamic Models
    5.3.2 Stochastic Linear Dynamic Models
  5.4 Discretization of Nonlinear Dynamic Models
    5.4.1 Discretization of Linearized Nonlinear Models
    5.4.2 Euler Discretization of Deterministic Nonlinear Models
    5.4.3 Euler–Maruyama Discretization of Stochastic Nonlinear Models

6 Filtering
  6.1 The Filtering Approach
  6.2 Kalman Filter
  6.3 Extended Kalman Filter
  6.4 Unscented Kalman Filtering
  6.5 Iterated Extended and Unscented Kalman Filters
  6.6 Bootstrap Particle Filter
Preface
These lecture notes are aimed at an M.Sc. level course on sensor fusion. The goal is to provide a gentle and pragmatic introduction to the topic with an emphasis on methods that are useful for developing practical solutions to sensor fusion problems. The reader should be familiar with:
3. basic statistics.
The notes are based on the much more comprehensive and rigorous works on sensor fusion, estimation theory, filtering theory, stochastic process theory, and optimization theory by Gustafsson (2018), Kay (1993), Särkkä (2013), Särkkä and Solin (2019), and Nocedal and Wright (2006), and the interested reader is highly encouraged to consult these books.
The topics covered in these notes range from simple linear observation models and least squares estimation, to modeling of dynamic systems, and inference in dynamic systems using Kalman and particle filtering. To lower the threshold for entering the topic, we sometimes trade mathematical rigor for easier understanding. This is particularly true for the probabilistic interpretations, especially in the beginning of the notes.
Chapter 1
Introduction
The common denominator and main objective of sensor fusion systems is that they take measurements from different sensors and estimate or infer one or more quantities of interest. For example, we can use multiple sensors to infer the position of a drone or an autonomous car, possibly along with their velocities. On a high level, this involves three main components (Figure 1.1):

• model(s) that relate the observed quantity to the quantity of interest, and
Figure 1.2. A drone measuring its distance y1 to a wall and its height y2; its position is (px, py).
y1 = px ,
y2 = py , (1.1)
and the aim of the estimation algorithm is to determine (px, py) from the measurements (y1, y2). In this case it is indeed trivial, as the measurements directly give the position coordinates.
y1 = px + r1 ,
y2 = py + r2 , (1.2)
which says that the measurements that we obtain are the actual positions plus measurement noises r1 and r2. The measurement noises are typically modeled as zero-mean Gaussian random variables. In this case we cannot exactly determine the position from the sensor measurements because we only get to see noisy versions of the positions. We could, however, assume that the noises are "small", which implies that y1 ≈ px, y2 ≈ py, and thus we can use the measurements directly as estimates of the position coordinates. Given only the two sensor measurements, it is hard or impossible to get rid of the noise, and hence our position estimate will inherently be off by the amount of measurement noise; not even sophisticated statistical methods will be significantly better than the small-noise approximation. One way to cope with this problem is to average multiple measurements, which reduces the variance of the estimate, but this obviously requires us to make more measurements with the sensor. Alternatively, by adding more sensors we can diminish the effect of measurement noise even when each of the sensors only produces a single measurement.
Let us now look at a slightly extended sensor setup shown in Figure 1.3. We still have the wall-distance and height sensors, but additionally, we measure the distance from a wall that is at a 45° angle with the ground. From the geometry of the problem we can now determine that the relationship between the measurements and the positions should be
y1 = px + r1,
y2 = py + r2,
y3 = (1/√2) (px − x0) + (1/√2) py + r3,   (1.3)

where x0 is the horizontal coordinate of the rightmost point of the wall and r1, r2, and r3 are measurement noises.
The aim of the estimation algorithm is now to determine the position (px, py) given the 3 sensor measurements (y1, y2, y3). If we had no measurement noise, then any two of the measurements would exactly give us the position (after possibly solving a simultaneous pair of equations). It is indeed this over-completeness of the model that allows us to use least squares estimation to determine the position from the measurements such that the effect of the measurement noises is significantly diminished; the more over-complete the model is, the better we can diminish the noises. Furthermore, an important factor in the above model is that the measurement y3 is no longer a direct measurement of a position coordinate, but a linear combination of them, and in that sense it is an indirect observation of the position.
Figure 1.3. Extended sensor setup with the wall-distance (y1) and height (y2) sensors and an additional distance measurement y3 to a wall at a 45° angle.
The model above (and in fact the simpler model as well) is an example of a static linear model, which will be covered in Chapter 3. More specifically, the model in Equation (1.3) can also be rewritten in vector notation as

y = G x + b + r,   (1.4)

where

y = [y1; y2; y3],   G = [1, 0; 0, 1; 1/√2, 1/√2],   x = [px; py],   b = [0; 0; −x0/√2],   r = [r1; r2; r3].
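To make the vector form concrete, here is a minimal sketch (NumPy assumed; the wall coordinate and true position are invented for illustration) that builds G and b for the three-sensor drone model and simulates one noisy measurement vector:

```python
import numpy as np

# Static linear model y = G x + b + r from Equation (1.4).
x0 = 2.0  # horizontal coordinate of the 45-degree wall's corner (invented)

G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0 / np.sqrt(2), 1.0 / np.sqrt(2)]])
b = np.array([0.0, 0.0, -x0 / np.sqrt(2)])

x_true = np.array([3.0, 1.5])  # true position (px, py), invented

rng = np.random.default_rng(0)
r = 0.05 * rng.standard_normal(3)  # zero-mean measurement noise
y = G @ x_true + b + r             # one noisy measurement vector
print(y)
```

With zero noise, the third component reduces to (px − x0 + py)/√2, as in Equation (1.3).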
Even though, for example, a radar or a camera actually measures the direction and range from the car to the landmark, provided that both measurements are relatively accurate, a simple coordinate transformation converts them into a measurement of the landmark position in the car's coordinate system. Thus we get direct measurements of the relative positions of the landmarks with respect to the car. In this case, if the landmark positions are (sxj, syj) for j = 1, …, M, and the position of the car is (px, py), then the measurements are given as
y1 = sx1 − px + r1,
y2 = sy1 − py + r2,
⋮
y2M−1 = sxM − px + r2M−1,
y2M = syM − py + r2M,   (1.6)
where the matrix G of the corresponding vector-form model (1.7) is again non-square and non-invertible. However, the least squares method will be applicable for estimating the car's position x from the noisy measurements y.
Sometimes it happens that the sensor only measures the range from the car to the landmark and does not provide information on the direction. This can happen, for example, when a certain type of radar is used. When only range measurements are available, the range-only model can be written as
y1R = √((sx1 − px)² + (sy1 − py)²) + r1R,
⋮
yMR = √((sxM − px)² + (syM − py)²) + rMR,   (1.8)
and the aim of the estimation algorithm would be to determine the position (px, py) based on the measurements y1R, …, yMR in the presence of the measurement noises r1R, …, rMR. If we now define a function

g(x) = [√((sx1 − px)² + (sy1 − py)²); ⋮; √((sxM − px)² + (syM − py)²)],   (1.9)

as well as x = [px py]ᵀ, y = [y1R ··· yMR]ᵀ, and r = [r1R ··· rMR]ᵀ, then
the model can be written as
y = g(x) + r, (1.10)
where g(x) is a non-linear function of x.
Alternatively, let us now consider a case where we only measure the direction (i.e., bearing) of the landmark with respect to the car, but not the distance. This can happen, for example, when we are using a camera without knowledge of the true size of the landmark. The measurements in this bearings-only model can be written as
y1D = atan2(sy1 − py, sx1 − px) + r1D,
⋮
yMD = atan2(syM − py, sxM − px) + rMD,   (1.11)
where atan2 is the four-quadrant inverse tangent function (e.g., the function called atan2 in MATLAB and Python). This model again has the form (1.10) provided that we define

g(x) = [atan2(sy1 − py, sx1 − px); ⋮; atan2(syM − py, sxM − px)],   (1.12)

along with x = [px py]ᵀ, y = [y1D ··· yMD]ᵀ, and r = [r1D ··· rMD]ᵀ.
The range-only and bearings-only models above are examples of static nonlinear models that we cover in Chapter 4. The estimation of x in these models will be possible by solving the nonlinear least squares problem with an iterative optimization method.
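As a sketch of these nonlinear measurement functions (landmark coordinates invented for illustration), the range-only model (1.9) and the bearings-only model (1.12) can be written as:

```python
import numpy as np

# Landmark positions (sxj, syj); values invented for illustration.
landmarks = np.array([[0.0, 0.0],
                      [10.0, 0.0],
                      [0.0, 10.0]])

def g_range(x):
    """Range-only measurement function, Equation (1.9)."""
    d = landmarks - x                     # (sxj - px, syj - py)
    return np.sqrt(np.sum(d**2, axis=1))  # distances to the landmarks

def g_bearing(x):
    """Bearings-only measurement function, Equation (1.12)."""
    d = landmarks - x
    return np.arctan2(d[:, 1], d[:, 0])   # atan2(syj - py, sxj - px)

x = np.array([3.0, 4.0])  # car position (px, py), invented
print(g_range(x))
print(g_bearing(x))
```

Both functions map the two-dimensional position x into an M-dimensional measurement vector, which is exactly the nonlinear g(x) in y = g(x) + r.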
xn = xn−1 + qn , (1.13)
where we need to use two indices for denoting the vector index and the time step
number.
Although a random walk model might be good for, say, a person or an animal, both the drone and the car have a definite velocity which itself changes according to a random walk model. At each time step transition, the velocity times the time step length ∆t is added to the position coordinates, and the velocity components themselves receive additive noises q3,n and q4,n, which act as the noises in the velocity components.
At this point it is now convenient to change the notation so that our state vector x contains not only the position coordinates but also the velocities, x = [px py vx vy]ᵀ. As the velocities are just the 3rd and 4th components of x, we can then rewrite the above model as

x1,n = x1,n−1 + x3,n−1 ∆t + q1,n,
x2,n = x2,n−1 + x4,n−1 ∆t + q2,n,
x3,n = x3,n−1 + q3,n,
x4,n = x4,n−1 + q4,n,   (1.16)
that is,

xn = F xn−1 + qn,   (1.18)

where the transition matrix is

F = [1, 0, ∆t, 0; 0, 1, 0, ∆t; 0, 0, 1, 0; 0, 0, 0, 1].

The above model is an example of a discrete-time stochastic linear state-space model that we consider in Chapter 5.
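A minimal simulation sketch of this constant-velocity model (time step and noise levels chosen arbitrarily):

```python
import numpy as np

dt = 0.1  # time step length, chosen arbitrarily

# Transition matrix of the model (1.16): positions integrate the velocities.
F = np.array([[1.0, 0.0, dt,  0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

rng = np.random.default_rng(1)
x = np.array([0.0, 0.0, 1.0, 0.5])  # initial state [px, py, vx, vy], invented

trajectory = [x]
for n in range(100):
    q = rng.normal(0.0, [1e-3, 1e-3, 0.05, 0.05])  # noise mainly in velocities
    x = F @ x + q
    trajectory.append(x)
trajectory = np.array(trajectory)
print(trajectory[-1, :2])  # final position
```

The resulting path drifts smoothly, because the noise enters mostly through the velocity components rather than directly through the positions.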
The above model could also be derived by considering the state as a function x(t) of a continuous time variable t that can take any value t ≥ 0. The discrete-time model can then be derived as a discretization of the continuous-time model described as a differential equation with a random input. For example, the first random walk model corresponds to the continuous-time model

dpx(t)/dt = w1(t),
dpy(t)/dt = w2(t),   (1.19)
xn = F xn−1 + qn ,
yn = G xn + b + rn , (1.21)
where we get a measurement of the state at each time step n while the state xn evolves according to the dynamic model. For this kind of linear (or, in fact, affine) dynamic model we can use the Kalman filter to compute a statistical estimate of the dynamic state xn at each time step n. This methodology is covered in Chapter 6.
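The Kalman filter itself is derived in Chapter 6; as a rough preview, a single predict/update step for a model of the form (1.21) can be sketched as follows (an illustrative implementation with all matrices assumed given, not the derivation from the notes):

```python
import numpy as np

def kalman_step(m, P, y, F, Q, G, b, R):
    """One predict/update step for x_n = F x_{n-1} + q_n, y_n = G x_n + b + r_n."""
    # Prediction through the dynamic model
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update with the measurement y_n
    S = G @ P_pred @ G.T + R             # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)  # gain
    m_new = m_pred + K @ (y - (G @ m_pred + b))
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

# Tiny scalar demo (invented numbers): noiseless random walk, direct observation
m, P = kalman_step(np.array([0.0]), np.array([[1.0]]), np.array([2.0]),
                   np.array([[1.0]]), np.array([[0.0]]),
                   np.array([[1.0]]), np.array([0.0]), np.array([[1.0]]))
print(m, P)
```

In the scalar demo the prior and the measurement have equal variances, so the updated mean lands halfway between them; the update always weighs prediction and measurement according to their covariances.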
We can also combine the dynamic model (1.18) with a non-linear measurement
model of the form (1.10), which leads to a model of the form
xn = F xn−1 + qn ,
yn = g(xn ) + rn . (1.22)
For these kinds of models we can use the extended Kalman filter (EKF), the unscented Kalman filter (UKF), or the particle filter, which will also be covered in Chapter 6. Additionally, these methods can also be used when the dynamics are modeled with nonlinear models instead of a linear model as we have above.
Chapter 2

Sensors, Models, and Least Squares Criterion

2.1 Sensors
A sensor is a device that provides a measurement related to the quantity of interest. Usually, a sensor is implemented as a device which converts a physical phenomenon into an electrical signal (Wilson, 2005) which is then further transformed into digital form. The measurement may be direct or indirect: in direct measurements, the quantity observed by the sensor is the quantity of interest, for example, the temperature in a thermometer. Sensors that measure indirectly provide a measurement of a quantity that is only related to the quantity of interest. For example, we might only be able to measure the distance to an object when we are actually interested in the position. The indirect measurement can also be related to dynamical properties of the quantity; for example, an accelerometer measures a body's acceleration, which is the second derivative of the position.
There are many different types of sensors for measuring a wide range of phenomena available today. These include electronic sensors, (micro-)mechanical sensors, and virtual sensors. A (non-exhaustive) overview of a few commonly used sensors, their measurements, and some of their application areas is shown in Table 2.1.
Sensors are characterized by many different properties that affect their performance. These include limitations such as measurement range, requirements on and sensitivity to the environmental conditions (temperature, humidity, etc.), sampling frequency, as well as measurement noise, biases, and drifts. Some of these factors have to be taken into account when designing the hardware, whereas others can be accurately dealt with when designing the estimation algorithm. In this course, we will focus on the latter aspects of the sensors, that is, we will introduce tools that are appropriate for taking into account the properties that can be handled by sensor fusion.
2.2 Models
A model is a mathematical formulation that relates the quantity of interest to the
measurements in a systematic way. Additionally, in the dynamic case, models are
also used to describe how the quantity of interest evolves over time. We already
saw examples of models in Chapter 1.
As mentioned earlier, sensors suffer from several limitations that we need to account for in models. In particular, it is important to take uncertainties in the measurements, such as sensor noise or measurement noise, into account. Unfortunately, measurement noise is difficult to quantify analytically. Instead, a suitable approach is to use statistical modeling. It is based on the observation that the measurement noise exhibits statistical properties that can be quantified. While the actual value of the noise cannot be observed or measured, its statistical properties can, and this allows us to design sensor fusion methods that are able to (statistically) minimize the effect of the noise. A common assumption is that the noise is zero mean, that is, on average it is zero (but each individual realization is not).
yn = gn (x) + rn . (2.1)
• the measured quantity yn is found on the left-hand side of the equation¹, and
• on the right-hand side there are two terms: a function gn(x) that relates the quantities of interest x to the measurement yn, and a noise term rn.
E{rn} = 0,   (2.3a)
var{rn} = E{rn²} − (E{rn})² = σr,n²,   (2.3b)
Cov{rm, rn} = E{rm rn} − E{rm} E{rn} = 0   (m ≠ n),   (2.3c)

where E{rn} and var{rn} denote the expected value and variance of rn, respectively. Note that the measurement noise variance may vary for the different measurements, as indicated by the subscript n on σr,n² (e.g., if the different measurements are obtained by different sensors).
¹There are more general formulations than (2.1) that do not share this trait, but we will not make use of these in these notes.
Furthermore, the overall noise covariance R for this model is a diagonal matrix of the form

R = Cov{r} = diag(σr,1², σr,2², …, σr,N²).   (2.9)
For vector measurements, y, g(x), and r are constructed in the same way as for the scalar case in (2.8), that is,

y = [y1; y2; ⋮; yN],   g(x) = [g1(x); g2(x); ⋮; gN(x)],   and   r = [r1; r2; ⋮; rN].   (2.10)
The measurement noise covariance is a block-diagonal matrix with the individual covariance matrices Rn on the diagonal, that is,

R = Cov{r} = blkdiag(R1, R2, …, RN).   (2.11)
2.3 Least Squares Methods

The estimate x̂ is obtained by minimizing a cost function J(x),

x̂ = arg minx J(x),   (2.15)

which reads as "the estimate x̂ is the argument x that minimizes J(x)". Typical cost functions minimize a function of the error

en = yn − gn(x)   (2.16)
between the measurement and the output predicted by the estimate. Two typical functions are the absolute error (Figure 2.1a)

J(e) = |e|,   (2.17)

which penalizes all errors equally, and the quadratic cost function (Figure 2.1b)

J(e) = e²,   (2.18)

which penalizes large errors more heavily than small ones.

Figure 2.1. Examples of cost functions: (a) absolute error, and (b) quadratic error.

Additionally, the quadratic cost is closely related to the case where the measurement noise is assumed to be Gaussian, as discussed in Section 2.2. Minimizing the quadratic cost function is also referred to as the least squares method. As we will see, many of the most common estimation algorithms can be formulated based on this approach.
Typically, the aim of the estimation algorithm must be to minimize the error
over all measurements y1:N rather than only one measurement yn . Hence, the
resulting cost function is rather the sum over all the errors en . For the quadratic
cost (2.18), the cost function is thus
JLS(x) = Σ_{n=1}^{N} en² = Σ_{n=1}^{N} (yn − gn(x))².   (2.19)
Finally, we can write (2.19) and (2.21) more compactly by using the vector notation introduced in Section 2.2. This yields the compact cost function

JLS(x) = (y − g(x))ᵀ (y − g(x)).   (2.22)

A method that determines the estimate x̂ by minimizing the quadratic cost function in Equation (2.22) is called a least squares (LS) method. The formulations (2.19) or (2.21) and (2.22) are equivalent. Hence, we will use these formulations interchangeably, depending on the context.
Similarly, for the vector case (2.21), we can introduce a positive definite² weighting matrix Wn and weight each term using this matrix. The resulting cost function becomes

JWLS(x) = Σ_{n=1}^{N} (yn − gn(x))ᵀ Wn (yn − gn(x)).   (2.24)
for the scalar or vector case, respectively. The forms (2.24)–(2.25) are indeed very
general and they will be the basis for many of the estimation algorithms derived in
this course.
An important question is the choice of the weights wn (or Wn). In principle, these can be arbitrary as long as they are positive (or positive definite). In practice, however, a principled choice is to use

wn = 1/σr,n²   and   Wn = Rn⁻¹,

that is, to use the inverse (co)variance of the measurement noise as the weighting factor. The rationale behind this choice is as follows: the covariance is a measure of the amount of uncertainty in our measurement due to the measurement noise. A large covariance means that there is a large uncertainty and vice versa. Hence, by weighting the measurements using the inverse of the covariance, measurements with high uncertainty are given lower weights and measurements with low uncertainty are given higher weights.

²That is, a symmetric matrix M for which it holds that xᵀMx > 0 for any arbitrary non-zero vector x. The eigenvalues of this kind of matrix are all positive.
Chapter 3

Static Linear Models

In this chapter, we focus on static, linear models. These models are already quite versatile and have several important properties, one being that we can find a closed-form estimation algorithm.

3.1 Linear Model
Example 3.1. To measure the wall distance of the drone in Figure 1.2 we could use, for example, a radar, which essentially measures the time difference between the sent signal and the signal reflected from the wall. This time difference is ideally equal to twice the distance divided by the speed of light, τ = 2px/c, where c = 299 792 458 m/s. If we make this measurement N times, and each measurement is corrupted by noise, then the measurement model is

yn = (2/c) px + rn,   n = 1, …, N,   (3.2)

which has the form (3.1) with g = 2/c.
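A quick numeric illustration of this time-of-flight model (the distance value is made up):

```python
# Radar time-of-flight model of Equation (3.2): tau = (2/c) * px.
c = 299792458.0  # speed of light in m/s
px = 5.0         # true wall distance in meters (invented)

tau = 2.0 * px / c      # ideal round-trip time of the radar signal
px_hat = c * tau / 2.0  # inverting the model recovers the distance

print(tau)     # roughly 3.34e-8 seconds
print(px_hat)  # 5.0
```

The round-trip time is tens of nanoseconds for a wall a few meters away, which hints at why even nanosecond-level timing noise matters, as Example 3.4 below quantifies.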
With (3.1) in mind, we can construct similar models where the scalar measure-
ment linearly depends on a set of K unknown quantities x1 , . . . , xK :
yn = g1 x1 + g2 x2 + · · · + gK xK + rn , (3.3)
with known scale factors g1 , . . . , gK and noise term rn . We can rewrite (3.3) in
matrix form as
yn = [g1 g2 ··· gK] [x1; x2; ⋮; xK] + rn = g x + rn,   (3.4)

where g = [g1 g2 ··· gK] is the row vector of constants and x = [x1 x2 ··· xK]ᵀ
is the vector of unknowns. For further generality, we can replace g above with gn, which allows for a different set of coefficients for each measurement. This yields

yn = gn x + rn.   (3.5)
For a total of N measurements y1 , y2 , . . . , yN , we can use measurement stacking
as discussed in Section 2.2 to obtain the compact model
y = Gx + r, (3.6)
where y and r are the stacked measurements and measurement noise, respectively.
Moreover,

G = [g1; g2; ⋮; gN]   (3.7)

is a matrix that contains the known coefficient vectors g1, …, gN as its rows.
yn = Gn x + rn,   (3.9)

where Gn contains the collected terms and, similarly to the previous section, we have already allowed it to vary for each measurement.
Again, by stacking the measurements y1 , y2 , . . . , yN into a single vector y, we
obtain the compact model
y = Gx + r, (3.10)
y = Gx + b + r.   (3.12)

Fortunately, it turns out that the same algorithms that are used for linear models (without a bias) can also be directly used for models with a bias. This is because we can rewrite the above model as

y − b = Gx + r.   (3.13)

Defining the de-biased measurements ỹ = y − b then gives

ỹ = Gx + r.   (3.14)
Example 3.2. The drone model in Equations (1.3) and (1.4) has this kind of form. Thus, we can convert the model into a linear model by defining the de-biased measurements

ỹ1 = y1,
ỹ2 = y2,
ỹ3 = y3 + x0/√2,   (3.15)

which reduces (1.4) to

ỹ = G x + r,   (3.16)

where ỹ = [ỹ1; ỹ2; ỹ3], G = [1, 0; 0, 1; 1/√2, 1/√2], x = [px; py], and r = [r1; r2; r3].
Example 3.3. For the autonomous car model in Equations (1.6) and (1.7) we can eliminate the bias by defining

ỹ1 = y1 − sx1,
ỹ2 = y2 − sy1,
⋮
ỹ2M−1 = y2M−1 − sxM,
ỹ2M = y2M − syM.   (3.17)
3.2 Linear Least Squares

For the scalar linear model, the error between the measurement yn and the model prediction g x is

en = yn − g x.   (3.18)

The least squares estimate x̂LS is the value of x that minimizes the cost function

JLS(x) = Σ_{n=1}^{N} (yn − g x)².   (3.19)

The minima of the cost function with respect to x are found by setting the derivative of JLS(x) (with respect to x) to zero and solving it for x.
Two important aspects of an estimator are its mean and variance. The mean
indicates whether an estimator (on average) converges to the true parameter value
x and the covariance indicates how confident we are about the estimated value. For
the scalar least squares estimator in (3.21), the mean is given by
E{x̂LS} = E{ (1/(N g)) Σ_{n=1}^{N} yn }
= (1/(N g)) Σ_{n=1}^{N} E{g x + rn}
= (1/(N g)) Σ_{n=1}^{N} (g x + E{rn})
= (1/(N g)) (N g x + Σ_{n=1}^{N} E{rn})
= x + (1/(N g)) Σ_{n=1}^{N} E{rn}.
Hence, if (and only if) the measurement noise is zero-mean, that is, E{rn } = 0,
the mean reduces to
E{x̂LS } = x. (3.22)
In this case, the estimate coincides on average with the true parameter value and the estimator is said to be an unbiased estimator. Furthermore, for the zero-mean noise case with equal noise variances var{rn} = σ², the variance is given by

var{x̂LS} = σ² / (N g²),   (3.24)

and from it we can form the confidence interval

[x̂LS − 1.96 √var{x̂LS}, x̂LS + 1.96 √var{x̂LS}],

which under certain conditions (Gaussianity, etc.) contains the true parameter value with a probability of 95%. For visualization purposes the coefficient 1.96 is often replaced with the constant 2, as it is an approximation anyway. Furthermore, in the multivariate case the formula for the confidence interval is a bit more complicated, but using the multiplier 2 still provides a reasonable approximation to the confidence interval.
This estimator is unbiased. Further assume that the standard deviation of the
measurement is σ = 10−9 s (1 nanosecond). Then by (3.24), the standard deviation
of the estimator is
std{x̂1} = (1/√N) (c σ / 2).   (3.27)
With a single measurement (N = 1) we get an error of about 15 cm, whereas by averaging 100 measurements the error drops to about 1.5 cm.
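These numbers are easy to verify numerically (a small check, using the same assumed values):

```python
import math

c = 299792458.0  # speed of light in m/s
sigma = 1e-9     # timing noise standard deviation, 1 ns

def std_estimate(N):
    """Standard deviation of the averaged estimator, Equation (3.27)."""
    return (1.0 / math.sqrt(N)) * c * sigma / 2.0

print(std_estimate(1))    # about 0.15 m
print(std_estimate(100))  # about 0.015 m
```

Averaging N measurements shrinks the standard deviation by √N, which is why 100 measurements reduce the error by a factor of 10.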
To minimize this cost function, we can follow the same steps as in the scalar case.
First, we calculate the gradient of (3.28) with respect to the parameters x. Since
x is a vector, this can be done by calculating the partial derivative with respect to
each individual entry xk in x individually. A more compact alternative is to use
vector calculus. In vector calculus notation we have

∂JLS(x)/∂x = [∂JLS(x)/∂x1; ∂JLS(x)/∂x2; ⋮; ∂JLS(x)/∂xK],   (3.29)

which is the gradient of JLS(x), in some literature denoted as ∇JLS(x). We now have the following computation rules for the vector derivatives (Petersen and Pedersen, 2012):

∂(xᵀa)/∂x = ∂(aᵀx)/∂x = a,
∂(xᵀAx)/∂x = 2 A x   (for a symmetric matrix A),
Setting the gradient to zero and solving for x yields

x̂LS = (GᵀG)⁻¹ Gᵀ y   (3.30)

for the least squares estimator. Note that (3.30) is valid for both scalar measurements of the form (3.3) as well as (3.9), with the appropriate choice of the matrix G.
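As a sketch with synthetic data (all values invented), the estimator (3.30) in NumPy; in practice np.linalg.lstsq is numerically preferable to forming (GᵀG)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(42)

x_true = np.array([1.0, -2.0])    # unknown parameters (invented)
G = rng.standard_normal((50, 2))  # known model matrix, N = 50 measurements
y = G @ x_true + 0.1 * rng.standard_normal(50)  # noisy measurements

# Closed-form least squares estimator, Equation (3.30)
x_hat = np.linalg.inv(G.T @ G) @ G.T @ y

# Numerically preferable equivalent
x_hat_lstsq, *_ = np.linalg.lstsq(G, y, rcond=None)
print(x_hat)
```

Both computations return the same estimate; the lstsq route avoids explicitly inverting GᵀG, which can be badly conditioned.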
Similar to the scalar case, we can calculate the statistical properties of the estimator (3.30). The mean is found from

E{x̂LS} = E{(GᵀG)⁻¹ Gᵀ (G x + r)} = x + (GᵀG)⁻¹ Gᵀ E{r},

and we can see that if (and only if) E{r} = 0, we have that

E{x̂LS} = x,   (3.31)

and the estimator is unbiased. Furthermore, for the zero-mean noise case, the covariance can be derived from

Cov{x̂LS} = E{(x̂LS − E{x̂LS})(x̂LS − E{x̂LS})ᵀ}
= E{((GᵀG)⁻¹ Gᵀ (G x + r) − x)((GᵀG)⁻¹ Gᵀ (G x + r) − x)ᵀ}
= E{((GᵀG)⁻¹ Gᵀ r)((GᵀG)⁻¹ Gᵀ r)ᵀ}

and is given by

Cov{x̂LS} = (GᵀG)⁻¹ Gᵀ R G (GᵀG)⁻¹.
Example 3.5. Let us recall the autonomous car shown in Figure 1.4 and its model in Equations (1.6) and (1.7), with known landmark positions (sxj, syj) and noisy measurements of the car-to-landmark relative positions.
Following the same steps for the derivation as for the least squares case, we obtain the weighted least squares estimator

x̂WLS = (Gᵀ R⁻¹ G)⁻¹ Gᵀ R⁻¹ y.   (3.37)

Comparing the linear least squares estimator in (3.30) and the weighted version in (3.37), we see that the former is a special case of the latter if and only if the covariance matrix is proportional to the identity matrix, that is, R = σr² I.
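A sketch of the weighted estimator (3.37) with an invented setup in which one sensor is much noisier than the other:

```python
import numpy as np

rng = np.random.default_rng(7)

x_true = np.array([2.0, 1.0])
G = np.vstack([np.eye(2), np.eye(2)])    # two sensors, each measuring (x1, x2)
sigmas = np.array([0.1, 0.1, 1.0, 1.0])  # second sensor much noisier (invented)
R = np.diag(sigmas**2)

y = G @ x_true + sigmas * rng.standard_normal(4)

Rinv = np.linalg.inv(R)
# Weighted least squares estimator, Equation (3.37)
x_wls = np.linalg.inv(G.T @ Rinv @ G) @ G.T @ Rinv @ y
# Unweighted estimator (3.30) for comparison
x_ls = np.linalg.inv(G.T @ G) @ G.T @ y
print(x_wls, x_ls)
```

The weighted estimate leans almost entirely on the accurate sensor, while the unweighted one averages both sensors equally and therefore inherits more of the noisy sensor's error.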
Figure 3.2. Least squares (LS) and weighted least squares (WLS) estimates of the car position in the (x1, x2) plane.
Similarly, the mean and covariance of the weighted least squares estimator can be shown to be

E{x̂WLS} = x

and

Cov{x̂WLS} = (Gᵀ R⁻¹ G)⁻¹,   (3.38)

respectively.
As discussed in Section 2.3, it is possible to make other choices for the weighting matrix W. However, it can be shown that choosing the inverse noise covariance matrix R⁻¹ is the optimal choice for W in the sense that it minimizes the trace of the covariance of the estimator.
Example 3.6. Figure 3.2 shows the weighted least squares estimate of the autonomous car position using the same data as in Example 3.5. It can be seen that the confidence interval is now smaller and the estimate is closer to the true position.
Prior information, encoded by a prior mean m and a prior covariance P, can be incorporated by minimizing the regularized cost function

JReLS(x) = (y − G x)ᵀ R⁻¹ (y − G x) + (x − m)ᵀ P⁻¹ (x − m),   (3.39)

where the second term (x − m)ᵀ P⁻¹ (x − m) is a regularization term that encodes the prior information. The minimum of this cost function can now be derived by setting the gradient to zero. The gradient is given by

∂JReLS(x)/∂x = −2 Gᵀ R⁻¹ y + 2 Gᵀ R⁻¹ G x − 2 P⁻¹ m + 2 P⁻¹ x,
which now has some additional terms resulting from the regularization. Setting it
to zero and solving for x then yields the regularized linear least squares estimator
x̂ReLS = (GT R−1 G + P−1 )−1 (GT R−1 y + P−1 m). (3.40)
The estimator in (3.40) shows that, when using the regularization approach, the resulting estimate is a weighted average of the data, which enters through the term Gᵀ R⁻¹ y, and the prior information, which enters through P⁻¹ m. The contributions of the terms are determined by two factors. First, there are the weighting matrices R⁻¹ and P⁻¹, which directly weigh the contributions. Second, the number of data points in y also affects how much emphasis is put on the data and how much on the prior information.
The covariance of the estimator can then be shown to be

Cov{x̂ReLS} = (Gᵀ R⁻¹ G + P⁻¹)⁻¹.   (3.41)
It can further be shown that the trace of the above covariance (which is the total variance) is strictly smaller than that of the non-regularized version in Equation (3.38). This is because including prior information decreases the variance of the estimator. However, if we compute the expectation of the estimator (3.40), we find that the estimator is biased; it is unbiased if and only if P⁻¹ = 0. This bias is due to our inclusion of prior information, which pulls the estimate slightly towards the prior. However, if our prior belief is correct, the estimator is supposed to be biased, and the bias actually forces the estimate to be closer to the truth than the measurements alone imply.
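A sketch of the regularized estimator (3.40) with an invented prior, also verifying numerically that it matches the equivalent gain form discussed next:

```python
import numpy as np

rng = np.random.default_rng(3)

x_true = np.array([1.0, 2.0])
G = rng.standard_normal((5, 2))
R = 0.25 * np.eye(5)
y = G @ x_true + 0.5 * rng.standard_normal(5)

m = np.array([0.8, 2.2])  # prior mean (invented)
P = 0.5 * np.eye(2)       # prior covariance (invented)

Rinv, Pinv = np.linalg.inv(R), np.linalg.inv(P)
# Regularized least squares estimator, Equation (3.40)
x_rels = np.linalg.inv(G.T @ Rinv @ G + Pinv) @ (G.T @ Rinv @ y + Pinv @ m)

# Equivalent "prior mean plus gain times prediction error" form
K = P @ G.T @ np.linalg.inv(G @ P @ G.T + R)
x_gain = m + K @ (y - G @ m)
print(x_rels)
```

The two expressions agree to numerical precision, which is exactly the matrix-inversion-formula equivalence used below.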
It is also possible to express the mean and covariance equations in an alternative but equivalent form, which will be especially useful in the sequential least squares method that we discuss in the next section. The matrix inversion formula, which is also called the Woodbury matrix identity, gives

(Gᵀ R⁻¹ G + P⁻¹)⁻¹ = P − P Gᵀ (G P Gᵀ + R)⁻¹ G P.

If we further substitute this into the estimator equation and simplify, we can express the estimator (3.40) and its covariance (3.41) as

x̂ReLS = m + K (y − G m)   (3.43)
Figures 3.5 and 3.6. Weighted (WLS) and regularized (ReLS) least squares estimates for two different priors.
and

Cov{x̂ReLS} = P − K (G P Gᵀ + R) Kᵀ,   (3.44)

where K is a gain given by

K = P Gᵀ (G P Gᵀ + R)⁻¹.   (3.45)

In this form, a second interpretation of the regularized linear least squares estimator is possible. Recall that m was the prior mean. Then, the term y − G m is the error between the output and the output predicted by the prior mean. The final estimate is then the prior mean corrected by this error term through the weight K.
Finally, we also point out that the regularized least squares solution can also be derived as a non-regularized (weighted) least squares solution where the regularization term is interpreted as an additional set of measurements. Because (x − m)ᵀ P⁻¹ (x − m) = (m − x)ᵀ P⁻¹ (m − x), we can rewrite the cost function (3.39) as a weighted least squares cost for the augmented measurement vector [y; m], augmented model matrix [G; I], and block-diagonal covariance blkdiag(R, P).
Figures 3.5 and 3.6 illustrate the effect of the prior. On the left, the prior mean is close to the truth, but the prior variance is quite large; in this case the estimate is slightly shifted towards the true position. On the right, the prior mean is in the wrong place with a relatively small variance, which in this case shifts the estimate to the wrong place. If the prior information were correct, the estimate would be even better.
3.4 Sequential Linear Least Squares

When the measurements are processed one at a time, the cost function after n measurements can be split as

JSLS(x) = (y1:n−1 − G1:n−1 x)ᵀ R1:n−1⁻¹ (y1:n−1 − G1:n−1 x) + (yn − Gn x)ᵀ Rn⁻¹ (yn − Gn x),   (3.47)

where the first part is the cost function that was minimized when n − 1 measurements were available, and the second part is the additional cost for the nth sample.

Minimizing (3.47) is achieved in the same way as for the general linear model in the previous section, except that there is an additional term. The gradient is given by

∂JSLS(x)/∂x = −2 G1:n−1ᵀ R1:n−1⁻¹ y1:n−1 + 2 G1:n−1ᵀ R1:n−1⁻¹ G1:n−1 x − 2 Gnᵀ Rn⁻¹ yn + 2 Gnᵀ Rn⁻¹ Gn x.   (3.48)

Setting the gradient to zero and solving for x then yields the updated estimate

x̂n = (G1:n−1ᵀ R1:n−1⁻¹ G1:n−1 + Gnᵀ Rn⁻¹ Gn)⁻¹ (G1:n−1ᵀ R1:n−1⁻¹ y1:n−1 + Gnᵀ Rn⁻¹ yn)
= (Pn−1⁻¹ + Gnᵀ Rn⁻¹ Gn)⁻¹ (G1:n−1ᵀ R1:n−1⁻¹ y1:n−1 + Gnᵀ Rn⁻¹ yn).
Using the matrix inversion formula as in the previous section and some further simplifications, we can rewrite this as

x̂n = x̂n−1 + Kn (yn − Gn x̂n−1),

where the gain matrix is

Kn = Pn−1 Gnᵀ (Gn Pn−1 Gnᵀ + Rn)⁻¹.
Figure 3.4. Sequential least squares (SLS) estimates after one, two, three, and four measurements, compared with the weighted least squares (WLS) estimate.
that is, the updated estimate is unbiased provided that the previous one was. Furthermore, the covariance of the updated estimate is

Cov{x̂n} = E{(x̂n−1 + Kn (yn − Gn x̂n−1) − x)(x̂n−1 + Kn (yn − Gn x̂n−1) − x)ᵀ}
= E{(x̂n−1 − x)(x̂n−1 − x)ᵀ} + E{(x̂n−1 − x)(yn − Gn x̂n−1)ᵀ Knᵀ}
+ E{Kn (yn − Gn x̂n−1)(x̂n−1 − x)ᵀ}
+ E{Kn (yn − Gn x̂n−1)(yn − Gn x̂n−1)ᵀ Knᵀ}
= Pn−1 − Kn (Gn Pn−1 Gnᵀ + Rn) Knᵀ.   (3.52)
Note how the covariance update in (3.52) illustrates how adding new measurements
reduces the uncertainty of the estimate.
It turns out that regularization can be easily incorporated into sequential esti-
mation. All we need to do is to set x̂0 = m and P0 = P from the regularization
term in (3.39) and then proceed with the above update equations.
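One sequential update can be sketched as follows (a toy two-parameter setup with invented numbers; the function mirrors the estimate and covariance updates above):

```python
import numpy as np

def sls_update(x_hat, P, y_n, G_n, R_n):
    """One sequential least squares update (gain, estimate, covariance)."""
    S = G_n @ P @ G_n.T + R_n
    K = P @ G_n.T @ np.linalg.inv(S)         # gain K_n
    x_new = x_hat + K @ (y_n - G_n @ x_hat)  # corrected estimate
    P_new = P - K @ S @ K.T                  # covariance update, cf. (3.52)
    return x_new, P_new

# Regularized initialization: start from the prior (invented numbers)
x_hat = np.zeros(2)
P = 10.0 * np.eye(2)
R_n = np.array([[0.01]])

measurements = [(np.array([2.9]), np.array([[1.0, 0.0]])),   # measures x1
                (np.array([1.1]), np.array([[0.0, 1.0]]))]   # measures x2
for y_n, G_n in measurements:
    x_hat, P = sls_update(x_hat, P, y_n, G_n, R_n)
print(x_hat)
```

Each update pulls the estimate towards the new measurement and shrinks the covariance, exactly as the covariance update (3.52) predicts.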
Example 3.8. Figure 3.4 illustrates the sequential least squares estimation of the
autonomous car position using the same data as in Examples 3.5, 3.6, and 3.7. On
the top left we have used only one measurement, on the top right two measurements,
on the bottom left three, and on the bottom right all four measurements. It
can be seen how each measurement gives more information and after all four
measurements the result exactly coincides with the weighted least squares estimate.
Chapter 4
Static Nonlinear Models
In Chapter 3, we discussed problems where the sensor model is linear in the param-
eters of interest. For these model types, closed-form estimators exist and further-
more, we can analytically determine the estimators’ properties such as biasedness
or covariance. However, despite these desirable characteristics, linear models can
be limiting and a nonlinear model may be more accurate in describing the relation-
ship between the sensors’ measurements and the parameters of interest.
y = g(x) + r, (4.1)
• Gradient descent,
Figure 4.1. Illustration of the cost function for a nonlinear problem with two unknowns
x = [x1 x2]^T. (a) Contours of the cost function, and (b) the vector field defined by the
negative gradient.
Note that all of these algorithms are typically only able to find local minima,
unless there is only a single global minimum. Hence, it is important to have good
initial guesses of the parameters.
x̂^(i+1) = x̂^(i) − γ ∂J_WLS(x)/∂x |_{x=x̂^(i)}, (4.3)
where ∂J_WLS(x)/∂x denotes the gradient of J_WLS(x) with respect to x and γ > 0
is a constant defining the step length. For simplicity, we start our derivation
assuming scalar measurements yn together with the (non-weighted) least squares
cost function and later generalize it to the general model (4.1)–(4.2). In this case,
we have that
J_LS(x) = Σ_{n=1}^N (y_n − g_n(x))^2
and thus
∂J_LS(x)/∂x = ∂/∂x Σ_{n=1}^N (y_n − g_n(x))^2
= Σ_{n=1}^N −2(y_n − g_n(x)) ∂g_n(x)/∂x,
and the matrix of derivatives can be identified as the transpose of the Jacobian
matrix given by
G_x(x) = [∂g_1(x)/∂x_1  ∂g_1(x)/∂x_2  …  ∂g_1(x)/∂x_K;
          ∂g_2(x)/∂x_1  ∂g_2(x)/∂x_2  …  ∂g_2(x)/∂x_K;
          ⋮                              ⋮;
          ∂g_N(x)/∂x_1  ∂g_N(x)/∂x_2  …  ∂g_N(x)/∂x_K], (4.4)
which is a matrix whose rows contain the gradients of the elements of the vector-
valued function g(x). Hence, we have that
∂J_LS(x)/∂x = −2 G_x^T(x) (y − g(x)). (4.5)
We can now directly generalize (4.5) to the general cost (4.2), which yields
∂J_WLS(x)/∂x = ∂/∂x (y − g(x))^T R^{−1} (y − g(x))
= ∂/∂x [y^T R^{−1} y − y^T R^{−1} g(x) − g(x)^T R^{−1} y + g(x)^T R^{−1} g(x)]
= −2 G_x^T(x) R^{−1} (y − g(x)).
Figure 4.2. Gradient descent optimization example in the (px, py) plane. The left figure
shows the evolution of the estimates from start to end, and the right figure is a zoom to
the end of the optimization.
It remains to determine how the step length γ should be chosen. Choosing
γ too large may cause the cost function to increase, whereas too small steps may
cause unnecessarily slow convergence. A typical strategy is to simply choose it
small enough so that the cost decreases at every step. However, quite often it is
advisable to change the step length during the iterations in one way or another. One
approach is to use a line search for selecting it, but we will postpone the discussion
on line-search methods to Section 4.4, where we discuss it in the context of Gauss–
Newton methods. A general gradient descent algorithm is shown as Algorithm 4.1.
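To make the gradient descent iteration concrete, here is a minimal NumPy sketch for the weighted least squares cost. The function name, the fixed step length, and the stopping rule are illustrative choices of mine:

```python
import numpy as np

def gradient_descent(g, jac, y, R_inv, x0, gamma=0.05, max_iter=5000, tol=1e-10):
    """Minimize J_WLS(x) = (y - g(x))^T R^{-1} (y - g(x)) by gradient descent.
    jac(x) returns the Jacobian G_x(x) of the measurement function g."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        grad = -2.0 * jac(x).T @ R_inv @ (y - g(x))  # gradient of the cost
        x_new = x - gamma * grad                     # step (4.3)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Note that a fixed γ must be small enough for stability; with a line search (Section 4.4) the step can instead be adapted at every iteration.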
Example 4.1. Let us consider a similar car localization problem as in Example 3.5
which was depicted in Figure 1.4. However, let us assume that we are only able to
measure the ranges to the landmarks as specified in the model in Equations (1.8).
In this case the model becomes nonlinear. We assume that the variances of the
range measurements are R1R = 1, R2R = 0.1, R3R = 0.8, and R4R = 0.5. The
result of applying the gradient descent algorithm to the corresponding weighted
least squares problem is shown in Figure 4.2. It can be seen that gradient descent
indeed finds the minimum, but the route that it finds is far from a direct route and
when we get closer to the minimum, the steps become smaller and smaller although
we have kept γ constant.
While intuitive, the gradient descent approach suffers from an important prob-
lem. In areas where the cost function is flat, the gradient is very small. Hence, the
algorithm can only take small steps, which leads to the problem that a large num-
ber of iterations is needed to cross such areas. Thus in general, the pure gradient
descent method is a quite inefficient algorithm and if, for example, Gauss–Newton,
Levenberg–Marquardt, or quasi-Newton methods can be used, they should be
preferred. However, gradient descent methods, and in particular their stochastic
versions, stochastic gradient descent methods, have recently gained popularity due
to their computational efficiency in models with a large number of parameters,
such as large neural networks.
4: Calculate
x̂(i+1) = x̂(i) + ∆x(i+1)
5: Set i ← i + 1
6: until Converged
7: Set x̂WLS = x̂(i)
We then set the next estimate x̂(i+1) to be equal to the minimum, which gives
x̂^(i+1) = x̂^(i) + (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} e^(i)
= x̂^(i) + (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))). (4.9)
The resulting method is given in Algorithm 4.2. The algorithm produces the
least squares estimate of the parameters x̂WLS . Given this estimate, we can also ap-
proximate the covariance of the estimate by using the linearized model analogously
to the linear case in Equation (3.38), which gives
Cov{x̂_WLS} ≈ (G_x^T(x̂_WLS) R^{−1} G_x(x̂_WLS))^{−1}. (4.10)
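A compact NumPy sketch of the Gauss–Newton iteration (4.9), returning both the estimate and the covariance approximation (4.10). The interface is my own choice; the range-only test used below mirrors, but is not taken verbatim from, the localization example:

```python
import numpy as np

def gauss_newton(g, jac, y, R, x0, max_iter=50, tol=1e-12):
    """Gauss-Newton iteration (4.9) for the weighted least squares problem."""
    x = np.array(x0, dtype=float)
    R_inv = np.linalg.inv(R)
    for _ in range(max_iter):
        G = jac(x)
        # solve the linearized normal equations for the step
        dx = np.linalg.solve(G.T @ R_inv @ G, G.T @ R_inv @ (y - g(x)))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    G = jac(x)
    P = np.linalg.inv(G.T @ R_inv @ G)  # covariance approximation (4.10)
    return x, P
```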
Example 4.2. Figure 4.3 shows the result of applying the Gauss–Newton algorithm to
the problem that we considered in Example 4.1. It can be seen that Gauss–Newton
takes a large step already in the beginning and only a few additional iterations are
needed to obtain convergence.
Figure 4.3. Comparison of gradient descent and Gauss–Newton in the weighted least
squares problem. The full evolution of the estimates is shown on the left and the end of
the estimation is zoomed in on the right.

Figure 4.4. Comparison of gradient descent and Gauss–Newton in the weighted least
squares problem when the starting point is further away. The evolution of the estimates is
shown on the left and the cost J(x) as a function of the iteration number is shown on the
right.
2: for j ∈ {1, 2, . . . , Ng} do
3:   Set γ ← j/Ng
4:   if JWLS(x̂(i) + γ∆x̂(i+1)) < J∗ then
5:     Set γ∗ ← γ
6:   end if
7: end for
Then the idea of line search is that on every step we select the step size by locally
optimizing it in the range [0, 1].
One way to perform line search is simply by evaluating J_WLS^(i)(γ) on a grid of
values and then selecting the step size that gives the minimum. This is often called
exact line search in optimization literature (Nocedal and Wright, 2006). This grid
search algorithm is shown as Algorithm 4.3. It would also be possible to use more
sophisticated minimum search methods, such as the bisection algorithm or interpo-
lation methods, to find the minimum.
However, it turns out that the line search does not need to be exact, as long as it
guarantees a sufficient decrease in the cost. One simple idea is to use backtracking, where we
decrease the parameter γ until it provides a sufficient decrease in the cost. One
way is to simply halve the step size until it causes a cost decrease. This kind of
approach is used, for example, in Gustafsson (2018) and it often works well.
The Armijo backtracking is an inexact line search method where we demand
that the cost is decreased at least by the amount predicted by a first-order Taylor
series expansion, as follows. Let us expand the right-hand side of (4.12) using a
first-order Taylor series expansion (Nocedal and Wright, 2006):
J_WLS(x̂^(i) + γ∆x̂^(i+1)) ≈ J_WLS(x̂^(i)) − 2γ [∆x̂^(i+1)]^T G_x^T(x̂^(i)) (y − g(x̂^(i))). (4.13)
Then we demand that the cost decrease should satisfy the Armijo condition
J_WLS(x̂^(i) + γ∆x̂^(i+1)) − J_WLS(x̂^(i)) ≤ −2β γ [∆x̂^(i+1)]^T G_x^T(x̂^(i)) (y − g(x̂^(i))), (4.14)
where β is a parameter that we can choose freely on the range [0, 1), a typical
value being β = 0.1. In the backtracking we then decrease γ by multiplying it with
a parameter τ (e.g., τ = 0.5) on the range (0, 1) until the condition is satisfied. The
resulting method is shown as Algorithm 4.4.
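A sketch of the Armijo backtracking rule in NumPy. Note that −2 G_x^T(x̂)(y − g(x̂)) is exactly the cost gradient, so the right-hand side of (4.14) equals β γ times the directional derivative grad^T ∆x; the code below exploits this. Names are illustrative:

```python
import numpy as np

def armijo_gamma(J, x, dx, grad, beta=0.1, tau=0.5, max_backtracks=30):
    """Shrink gamma until the Armijo condition (4.14) holds.
    J is the cost function, dx the proposed step, grad the gradient at x."""
    gamma = 1.0
    J0 = J(x)
    slope = float(grad @ dx)   # directional derivative; negative for a descent step
    for _ in range(max_backtracks):
        if J(x + gamma * dx) - J0 <= beta * gamma * slope:
            return gamma
        gamma *= tau           # backtrack
    return gamma
```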
A Gauss–Newton algorithm with a generic line search is shown as Algorithm 4.5.
The algorithm produces the least squares estimate x̂WLS and its
covariance can be approximated using (4.10). More advanced line search methods
can be found in optimization literature (e.g. Nocedal and Wright, 2006).
Example 4.4. Figure 4.5 shows the result of applying the Gauss–Newton algorithm
with line search to the far-away initial-point case considered in Example 4.3. It can
be seen that the initial step is shorter than with plain Gauss–Newton and the cost
is decreased at every step. However, although the cost is strictly decreasing, the
intermediate costs during the iterations are larger than with plain Gauss–Newton.
However, the strict decrease in the cost enhances the stability of the algorithm.
[Figure 4.5: Gauss–Newton with line search; the evolution of the estimates is shown on the left and the cost J(x) as a function of the iteration number i on the right.]
Instead, the Levenberg–Marquardt algorithm controls the step
size by using regularization in the least squares solution following the linearization
made in the Gauss–Newton algorithm.
One way of deriving the Levenberg–Marquardt algorithm is indeed to see it
as a regularized Gauss–Newton algorithm. The Taylor series approximation of
the nonlinear function (4.7) is only valid in the local neighborhood of the current
parameter estimate x̂(i) . Hence, we can use this prior information to regularize the
approximated weighted least squares cost function using a regularization term as
introduced in Section 3.3. This yields the cost function approximation
J_ReLS(x) ≈ (y − g(x̂^(i)) − G_x(x̂^(i))(x − x̂^(i)))^T R^{−1} (y − g(x̂^(i)) − G_x(x̂^(i))(x − x̂^(i)))
+ λ(x − x̂^(i))^T (x − x̂^(i))
= (e^(i) − G_x(x̂^(i))(x − x̂^(i)))^T R^{−1} (e^(i) − G_x(x̂^(i))(x − x̂^(i)))
+ λ(x − x̂^(i))^T (x − x̂^(i)),
where the second term is the regularization around the last iteration’s estimate with
weight or damping factor λ. The gradient of the cost function approximation now
becomes
∂J_ReLS(x)/∂x ≈ −2 G_x^T(x̂^(i)) R^{−1} e^(i) + 2 G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i))(x − x̂^(i)) + 2λ(x − x̂^(i))
= −2 G_x^T(x̂^(i)) R^{−1} e^(i) + 2 (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI)(x − x̂^(i)). (4.15)
Setting (4.15) to zero and rearranging yields
(G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI)(x − x̂^(i)) = G_x^T(x̂^(i)) R^{−1} e^(i),
and thus,
∆x̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI)^{−1} G_x^T(x̂^(i)) R^{−1} e^(i). (4.16)
Equation (4.16) shows that when λ approaches zero, the algorithm converges
to the Gauss–Newton algorithm. On the other hand, for large values of λ, the term
λI will dominate. In this case, we have that
∆x̂^(i+1) ≈ (1/λ) G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))),
which is a scaled gradient descent step. Thus, an important question is how to
choose λ. Ideally, in flat regions of the cost function where the linear approxi-
mation is good, λ should be chosen small, whereas in steep regions, it should be
chosen large. Unfortunately, there is no definitive method for how to choose and
adapt λ, and several different approaches have been proposed.
A simple strategy which is often used, and which is a simplified version of the
method proposed by Marquardt (1963) and used, for example, in Pujol (2007), is to
start from some initial damping value λ(0) (e.g., λ(0) = 10−2) and select a
fixed factor ν (e.g., ν = 10). Then at each step we do the following:
• First compute using Equation (4.16) a candidate x̂(i+1) using the previous
parameter value λ(i−1) . Then proceed as follows:
– If JWLS (x̂(i+1) ) < JWLS (x̂(i) ) then accept x̂(i+1) and decrease the
damping parameter by λ(i) = λ(i−1) /ν.
– Otherwise continue with x̂(i+1) = x̂(i) and increase the damping pa-
rameter by λ(i) = ν λ(i−1) .
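The damping adaptation just described can be sketched as a simplified NumPy loop; the interface names are my own, and the loop keeps a fixed iteration budget rather than a convergence test for brevity:

```python
import numpy as np

def levenberg_marquardt(g, jac, y, R, x0, lam=1e-2, nu=10.0, max_iter=100):
    """Levenberg-Marquardt with the simple accept/reject damping rule:
    accepted steps divide lambda by nu, rejected ones multiply it by nu."""
    x = np.array(x0, dtype=float)
    R_inv = np.linalg.inv(R)
    cost = lambda x: float((y - g(x)) @ R_inv @ (y - g(x)))
    for _ in range(max_iter):
        G = jac(x)
        dx = np.linalg.solve(G.T @ R_inv @ G + lam * np.eye(x.size),
                             G.T @ R_inv @ (y - g(x)))   # damped step (4.16)
        if cost(x + dx) < cost(x):
            x, lam = x + dx, lam / nu   # accept: move toward Gauss-Newton
        else:
            lam *= nu                   # reject: move toward gradient descent
    return x
```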
5: else
     Compute
     ∆x̂(i+1) = (Gx^T(x̂(i)) R^{−1} Gx(x̂(i)) + λI)^{−1} Gx^T(x̂(i)) R^{−1} (y − g(x̂(i)))
6: end if
7: if JWLS(x̂(i) + ∆x̂(i+1)) < JWLS(x̂(i)) then
8:   Accept the candidate and decrease λ: Set λ ← λ/ν
9:   Set i ← i + 1
10: else
11:   Reject the candidate and increase λ: Set λ ← ν λ
12: end if
13: until Converged
14: Set x̂WLS = x̂(i)
[Figure: Comparison of gradient descent, Gauss–Newton (LS), Levenberg–Marquardt, and scaled Levenberg–Marquardt in the (x1, x2) plane.]
which has the form of a non-regularized cost and hence all the algorithms presented
in this chapter are again applicable.
[Figure: Cost J(x) as a function of the iteration number i for gradient descent, Gauss–Newton, Gauss–Newton (LS), Levenberg–Marquardt, and scaled Levenberg–Marquardt.]
function as follows:
J(x) ≈ J(x^(i)) + [∂J(x)/∂x |_{x=x^(i)}]^T (x − x^(i))
+ (1/2)(x − x^(i))^T [∂²J(x)/∂x² |_{x=x^(i)}] (x − x^(i)), (4.19)
where ∂ 2 J(x)/∂x2 denotes the Hessian matrix of J(x). The strategy is now to
minimize the right hand side with respect to x and use the result as the next guess.
The resulting Newton’s method takes the following form:
x^(i+1) = x^(i) − [∂²J(x)/∂x²]^{−1} ∂J(x)/∂x |_{x=x^(i)}. (4.20)
• The absolute or relative change in the cost falls below a certain threshold.
For example, a criterion could be (J(x^(i)) − J(x^(i+1)))/J(x^(i)) < c.
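Newton's method (4.20) together with a cost-change stopping criterion of this kind can be sketched as follows, assuming the gradient and Hessian are available in closed form (the interface is my own):

```python
import numpy as np

def newton_minimize(J, grad, hess, x0, c=1e-12, max_iter=100):
    """Newton iteration (4.20); stops when the change in the cost is small
    relative to 1 + |J(x)| (guards against division by a zero cost)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - np.linalg.solve(hess(x), grad(x))  # Newton step (4.20)
        if abs(J(x) - J(x_new)) <= c * (1.0 + abs(J(x))):
            return x_new
        x = x_new
    return x
```

For a quadratic cost a single Newton step lands exactly on the minimum, which the test below exploits.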
Chapter 5
Dynamic Models
Most sensor fusion problems involve dynamically changing variables. For exam-
ple, in autonomous driving, other road users continuously appear, disappear, and
move in the traffic scene. In this case, we are interested in estimating dynamically
varying parameters. We refer to these parameters as states or the state, because
they describe the state a system is in (e.g., the location and velocity of a car, or the
position, attitude, and velocity of an unmanned aerial vehicle).
In such dynamic scenarios, we have to employ sensors that provide repeated
measurements in time, that is, they sample periodically. In between those samples,
the state of the system evolves according to some process and thus, the states
at different times are different from each other. However, since the evolution of
the state follows some dynamic process, they are related and thus, we can relate
the states at two times to each other. To do that, we need a principled way of
describing how the state evolves between samples. We find a suitable approach
in differential equations (for continuous-time processes) and difference equations
(for discrete-time processes), which can be used to describe dynamic processes.
Indeed, we will make use of differential equations to derive state-space models
that turn an Lth order differential (or difference) equation into a first order vector-
valued differential (or difference) equation.
[Figure 5.1: A mass m attached to a spring with spring constant k and a damper with damping constant η, driven by the force u(t); the position of the mass is p.]
where p(t), v(t), and a(t) denote the position, velocity, and acceleration of the
mass m, respectively, k is the spring constant, and η is the damping constant.
Furthermore, the forcing function u(t) is a deterministic input to the system.
By introducing the second equation ṗ(t) = v(t), which holds by definition, and divid-
ing (5.1) by m, we obtain the equation system
Observe that in (5.3), the vector on the left hand side of the equation is the
derivative of the vector on the right hand side. Hence, we can define the vector
x(t) = [p(t) v(t)]^T,
where
ẋ(t) = dx(t)/dt
denotes the time derivative.
The form (5.4) is the state-space representation of the dynamic system in Fig-
ure 5.1 and the vector x(t) is called the state vector (or simply state). The state
vector now encodes all the information about the state the system is in (position,
speed, etc.). Therefore, solving this first-order vector valued ODE representation
rather than the original differential equation in (5.1), in some sense, provides richer
information about the system.
We can now generalize this approach. Consider the Lth order ODE of the form
where
x(t) = [x(t) dx(t)/dt … d^{L−1}x(t)/dt^{L−1}]^T,
A = [0 1 0 … 0; 0 0 1 … 0; …; a_0 a_1 … a_{L−1}],
B_u = [0 0 … b_1]^T, u(t) = u(t). (5.8)
Equation (5.7) is thus a quite general state-space representation for models
of the type in (5.5). However, we can also derive the same type of representation for
other dynamic models. For example, consider a freshly brewed hot cup of coffee.
Newton’s law of cooling then states that the change in temperature of the coffee is
proportional to the temperature difference between the coffee and its surrounding
(assuming that the surroundings are much larger than the coffee cup). This can be
formulated as a differential equation of the form
dTc (t)
= −k1 (Tc (t) − Tr (t)), (5.9)
dt
where Tc (t) denotes the coffee’s temperature, Tr (t) the room temperature, and k1
is a constant. The change in room temperature on the other hand depends on the
heat that is produced inside the room (e.g., due to heating) as well as the losses to
the outside. Again assuming that Newton’s law of cooling applies, we can write
the differential equation as
dTr (t)
= −k2 (Tr (t) − Ta (t)) + h(t), (5.10)
dt
where k2 is a constant, Ta (t) denotes the outside temperature and h(t) is the heating
input. Observe that (5.9)–(5.10) is a coupled system that can be written as the
equation system
dTr (t)
= −k2 (Tr (t) − Ta (t)) + h(t) (5.11a)
dt
dTc (t)
= −k1 (Tc (t) − Tr (t)), (5.11b)
dt
or equivalently in matrix form as
[dTr(t)/dt; dTc(t)/dt] = [−k2 0; k1 −k1] [Tr(t); Tc(t)] + [k2 1; 0 0] [Ta(t); h(t)]. (5.12)
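As an illustration, the coupled cooling model (5.9)–(5.10) can be simulated directly from its state-space form; the constants and the simple forward-Euler scheme below are invented for the example:

```python
import numpy as np

# State x = [Tr, Tc], input u = [Ta, h], as in (5.12); constants are made up.
k1, k2 = 0.1, 0.05
A = np.array([[-k2, 0.0],
              [k1, -k1]])
B_u = np.array([[k2, 1.0],
                [0.0, 0.0]])

def simulate(x0, u, dt, n_steps):
    """Forward-Euler simulation of dx/dt = A x + B_u u with a constant input."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + dt * (A @ x + B_u @ u)
    return x
```

With the outside temperature as the only input and no heating, both temperatures relax toward the outside temperature, as expected from Newton's law of cooling.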
Hence, the model (5.7) is a quite general dynamic model of the time-varying
behavior for many different processes. Recall that our original aim was to estimate
the dynamically changing state x(t). This requires that we measure x(t) in some
way, that is, we need to combine the dynamic model with a measurement model.
The simplest case is that of a linear measurement model of the form (3.10) where
we replace the static x with time varying state x(t). Since the measurements are
obtained at discrete time instances tn (i.e., at t1 , t2 , . . . ), the measurement model
becomes
yn = Gx(tn ) + rn , (5.14)
where rn denotes the measurement noise with covariance matrix Cov{rn } = Rn .
Defining x_n ≜ x(t_n) and combining the dynamic model (5.7) and the mea-
surement model (5.14) then yields the deterministic linear state-space model
Figure 5.2. Example of a white noise process w(t) with σ_w² = 2. (a) One realization of
the process, and (b) its power spectral density S_ww(ω) averaged over 100 realizations.
do not know the inputs made by the pilots. Additionally, variables such as wind
or pressure fields are not known either and it is difficult to model them accurately.
Instead, such random effects can be modeled by including a stochastic process as
an input to the differential equation (and hence to its state-space representation),
which turns an ODE into a stochastic differential equation (SDE; Øksendal, 2010;
Särkkä and Solin, 2019).
A simple class of stochastic processes are stationary stochastic processes which
can be characterized by their autocorrelation functions. The autocorrelation func-
tion of a stationary stochastic process w(t) is
R_ww(τ) = E{w(t + τ) w(t)}.
For a white noise process, the autocorrelation function is R_ww(τ) = σ_w² δ(τ),
where δ(τ) denotes the Dirac delta function. Hence, the power spectral density is
a constant, that is,
S_ww = σ_w². (5.17)
In other words, the white noise process contains equal contributions from each
frequency. One realization of such a process, together with its power spectral
density are shown in Figure 5.2.
The derivation of the state-space representation of the resulting SDE follows
the same steps as for the deterministic case, where the random process takes the
role of the input. Consider the Lth order SDE given by
[Figure 5.3: A spacecraft orbiting the earth; its position p(t) is expressed in the (p^x, p^y) coordinate system with origin at the center of the earth, and the forces acting on it are the gravitational pull F_g and the propulsion force F_p.]
which is of the same form as (5.5) but with the stochastic process w(t) as the input.
Rewriting (5.18) into matrix form yields
[dx(t)/dt; d²x(t)/dt²; …; d^L x(t)/dt^L]
= [0 1 0 … 0; 0 0 1 … 0; …; a_0 a_1 … a_{L−1}] [x(t); dx(t)/dt; …; d^{L−1}x(t)/dt^{L−1}]
+ [0; 0; …; b_1] w(t).
Hence, this can also be written compactly in the form of a first order vector
valued SDE system
ẋ(t) = Ax(t) + Bw w(t).
Naturally, also coupled models (like the coffee-cup example) can be written in this
form. Together with a linear measurement equation, the stochastic linear state-
space model becomes
Note that some dynamic models may contain both deterministic inputs u(t)
and stochastic inputs w(t). Naturally, both of these can be incorporated in the
model as well.
We start our discussion again based on an example. Figure 5.3 shows an illus-
tration of a spacecraft orbiting the earth. The forces acting upon the craft are the
gravitational pull of the earth Fg and the propulsion force Fp . The position p(t) of
the craft is defined according to the coordinate system shown in Figure 5.3, with
the origin at the center of the earth. The gravitational acceleration of an object at a
distance |p(t)| from the earth’s center is approximately
g ≈ g_0 (r_e / |p(t)|)²,
where g0 = 9.81 m/s2 is the gravitational acceleration on the earth’s surface and
re = 6371 km is the mean earth radius. The gravitational pull is in the opposite
direction of p(t) and hence, we can write it in vector form as
F_g = −m g_0 (r_e / |p(t)|)² p(t)/|p(t)|
= −m g_0 r_e² p(t)/|p(t)|³.
The propulsion force is perpendicular to the direction of the position vector p(t).
Hence, we can write it as
F_p = (F_p / |p(t)|) [−p^y(t); p^x(t)].
When tracking the spacecraft from the ground, for example using radar, the magni-
tude of the propulsion force Fp is unknown. But since we know that the engines are
only used to make small flight path corrections to conserve fuel, we can model this
as a stochastic process, that is, we can assume that F_p = w(t) with some spectral
density σ_w².
Using Newton’s second law it follows that the motion of the spacecraft is
governed by the vector-valued differential equation
m a(t) = −m g_0 r_e² p(t)/|p(t)|³ + (w(t)/|p(t)|) [−p^y(t); p^x(t)]. (5.20)
The highest order of the derivative here is the acceleration on the left hand
side. Hence, a suitable state vector includes the position (zeroth derivative) and the
speed (first derivative), that is,
x(t) = [p^x(t) p^y(t) v^x(t) v^y(t)]^T.
However, when trying to rewrite (5.20) into the form of a state-space representa-
tion (5.19), we notice that this is not possible since the right hand side of (5.20) is
not linear in p(t). Nevertheless, we can still write it as a nonlinear equation system
in terms of x(t) as
[v^x(t); v^y(t); a^x(t); a^y(t)]
= [v^x(t); v^y(t); −g_0 r_e² p^x(t)/|p(t)|³; −g_0 r_e² p^y(t)/|p(t)|³]
+ [0; 0; −p^y(t)/(m|p(t)|); p^x(t)/(m|p(t)|)] w(t)
= [f_1(x(t)); f_2(x(t)); f_3(x(t)); f_4(x(t))]
+ [0; 0; −p^y(t)/(m|p(t)|); p^x(t)/(m|p(t)|)] w(t), (5.21)
where f (x(t)) is the nonlinear state transition function, w(t) is the driving stochas-
tic process, g(xn ) is the measurement model, and rn is the measurement noise.
x_{1,n} = a_11 x_{1,n−1} + · · · + a_{1d_x} x_{d_x,n−1} + b_11 u_{1,n} + · · · + b_{1d_u} u_{d_u,n}
x_{2,n} = a_21 x_{1,n−1} + · · · + a_{2d_x} x_{d_x,n−1} + b_21 u_{1,n} + · · · + b_{2d_u} u_{d_u,n}
⋮
x_{d_x,n} = a_{d_x1} x_{1,n−1} + · · · + a_{d_xd_x} x_{d_x,n−1} + b_{d_x1} u_{1,n} + · · · + b_{d_xd_u} u_{d_u,n},
where x_{i,n} ≜ x_i(t_n) is the ith variable at time t_n. This can be rewritten in matrix
form according to
[x_{1,n}; …; x_{d_x,n}] = [a_11 … a_{1d_x}; …; a_{d_x1} … a_{d_xd_x}] [x_{1,n−1}; …; x_{d_x,n−1}]
+ [b_11 … b_{1d_u}; …; b_{d_x1} … b_{d_xd_u}] [u_{1,n}; …; u_{d_u,n}],
or, more compactly,
x_n = F x_{n−1} + B_u u_n, (5.25)
with
x_n = [x_{1,n} … x_{d_x,n}]^T, F = [a_11 … a_{1d_x}; …; a_{d_x1} … a_{d_xd_x}],
B_u = [b_11 … b_{1d_u}; …; b_{d_x1} … b_{d_xd_u}], u_n = [u_{1,n} … u_{d_u,n}]^T.
Equation (5.25) is the general dynamic model for linear discrete-time systems with
deterministic inputs un .
to the form (5.25), we proceed as follows. First, we have to choose the state
variables x_n. This is most easily done at time n − 1 and for the model (5.26), we
can choose
where we have expressed the state at time n solely in terms of the state at previous
times as well as the input un . Rewriting this on matrix form finally yields
[x_{1,n}; x_{2,n}; …; x_{d_x,n}] = [c_1 c_2 … c_L; 1 0 … 0; …; 0 … 1 0] [x_{1,n−1}; x_{2,n−1}; …; x_{d_x,n−1}] + [d_1; 0; …; 0] u_n,
xn = Fxn−1 + Bu un , (5.27a)
yn = Gxn + rn . (5.27b)
in the continuous case. For the discrete-time case, we can model this using a
random variable as the input to the dynamic model (compared to a stochastic
process for the continuous case). We denote this random input, commonly called
the process noise, as qn . Then, replacing the deterministic input un with qn yields
the stochastic discrete-time dynamic model
xn = Fxn−1 + Bq qn . (5.28)
The random variable qn follows some probability density function such that
qn ∼ p(qn ).
Here, we assume that q_n is zero-mean, that is, E{q_n} = 0, and that it has co-
variance Cov{q_n} = Q_n. Furthermore, we also assume that the cross-covariance
between two random inputs q_m and q_n is Cov{q_m, q_n} = 0 for m ≠ n.
Together with a linear sensor model, we then obtain the stochastic linear
discrete-time state-space model given by
xn = Fxn−1 + Bq qn , (5.29a)
yn = Gxn + rn . (5.29b)
Consider first the scalar ODE
ẋ(t) = a x(t) + b u(t), (5.32)
which we want to solve on the interval between the two sampling points t_{n−1} and
t_n. To do this, we introduce the integrating factor e^{−at} and note that it holds
that
d/dt [e^{−at} x(t)] = e^{−at} ẋ(t) − e^{−at} a x(t). (5.33)
Rearranging (5.32) by moving the term ax(t) to the left hand side and multiplying
by the integrating factor yields
which yields
[e^{−at} x(t)]_{t=t_{n−1}}^{t_n} = ∫_{t_{n−1}}^{t_n} e^{−at} b u(t) dt,
e^{−a t_n} x(t_n) − e^{−a t_{n−1}} x(t_{n−1}) = ∫_{t_{n−1}}^{t_n} e^{−at} b u(t) dt.
Now we can solve for x(tn ) by rearranging and multiplying by the factor eatn ,
which yields
x(t_n) = e^{a t_n} e^{−a t_{n−1}} x(t_{n−1}) + e^{a t_n} ∫_{t_{n−1}}^{t_n} e^{−at} b u(t) dt
= e^{a(t_n − t_{n−1})} x(t_{n−1}) + ∫_{t_{n−1}}^{t_n} e^{a(t_n − t)} b u(t) dt. (5.35)
Note that the second term on the right hand side of (5.35) is the convolution be-
tween the factor ea(tn −t) and the input u(t) on the interval between tn−1 and tn .
General Case. Based on the steps used for solving the scalar ODE (5.32), the
general case can now be solved. First, recall that the general dynamic model on
state-space form was given by the vector valued first order ODE system (5.15), that
is,
ẋ(t) = Ax(t) + Bu u(t). (5.36)
The key here is to note that (5.36) is a first order vector ODE just like (5.32).
Hence, we should be able to solve it using an integrating factor similar to the one
for the scalar case. Indeed, such an integrating factor can be found through the
matrix exponential. The matrix exponential for a square matrix A can be defined
in the same way as the regular exponential, that is, as an infinite sum of the form
e^A = Σ_{k=0}^∞ (1/k!) A^k.
Similar to the (scalar) exponential, it holds that the derivative with respect to a
scalar x of eAx is
d/dx e^{Ax} = e^{Ax} A. (5.37)
Another useful property of the matrix exponential is that of its transpose, which is
given by
(e^A)^T = e^{A^T}. (5.38)
With this in mind, we can solve the general case following the same steps as
for the scalar case. First, we multiply (5.36) by the integrating factor e−At and
rearrange it to obtain
d/dt [e^{−At} x(t)] = e^{−At} ẋ(t) − e^{−At} A x(t),
and thus, we have that
d/dt [e^{−At} x(t)] = e^{−At} B_u u(t).
This is again separable in t. Direct integration from tn−1 to tn then leads to
∫_{t_{n−1}}^{t_n} d[e^{−At} x(t)] = ∫_{t_{n−1}}^{t_n} e^{−At} B_u u(t) dt,
[e^{−At} x(t)]_{t=t_{n−1}}^{t_n} = ∫_{t_{n−1}}^{t_n} e^{−At} B_u u(t) dt,
e^{−A t_n} x(t_n) − e^{−A t_{n−1}} x(t_{n−1}) = ∫_{t_{n−1}}^{t_n} e^{−At} B_u u(t) dt.
Solving for x(tn ) by rearranging and multiplying both sides by eAtn yields
x(t_n) = e^{A(t_n − t_{n−1})} x(t_{n−1}) + ∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_u u(t) dt,
or
x_n = e^{A∆t} x_{n−1} + ∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_u u(t) dt. (5.39)
In this case, the factor in front of xn−1 (i.e., the state at time tn−1 ) is a matrix
exponential that again depends on the time difference ∆t between the two discrete
times tn−1 and tn . Note that ∆t can be arbitrary, meaning that there is no need for
uniform sampling between points tn−2 , tn−1 , tn , and so forth. Furthermore, the
second term in (5.39) is again the convolution between the deterministic input u(t)
and the matrix exponential on the same interval.
If the input u(t) is a zeroth-order-hold (ZOH) signal, which is common in
control or robot navigation, the second term can further be simplified. In that case,
u(t) is constant on the interval between tn−1 and tn and we can denote it as un−1 .
Thus, we then have that
∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_u u(t) dt = ∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_u u_{n−1} dt
= (∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_u dt) u_{n−1}.
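For a ZOH input, the pair e^{A∆t} and ∫ e^{A(t_n−t)} B_u dt can be computed together from a single augmented matrix exponential. A small NumPy sketch; the series-based matrix exponential is for illustration only, and in practice a library routine such as scipy.linalg.expm would be used:

```python
import numpy as np

def expm_series(M, terms=40):
    """Matrix exponential via its truncated power series; adequate for the
    small, well-scaled matrices used here."""
    E, T = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        T = T @ M / k
        E = E + T
    return E

def discretize_zoh(A, B_u, dt):
    """ZOH discretization of x' = A x + B_u u, using the identity
    exp([[A, B_u], [0, 0]] dt) = [[F, B_d], [0, I]]."""
    n, m = A.shape[0], B_u.shape[1]
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A
    M[:n, n:] = B_u
    E = expm_series(M * dt)
    return E[:n, :n], E[:n, n:]   # F = e^{A dt}, and the discrete input matrix
```

For the constant-velocity model A = [0 1; 0 0], B_u = [0; 1], the result matches the closed form F = [1 ∆t; 0 1], B_d = [∆t²/2; ∆t].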
Defining
with Rww (τ ) = E{w(t + τ )w(t)} = Σw δ(τ ), follows the same steps as the
discretization of the deterministic model. This yields
x(t_n) = e^{A(t_n − t_{n−1})} x(t_{n−1}) + ∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_w w(t) dt.
Now note that special attention has to be paid to the second term on the right hand
side. The noise process w(t) does not have the properties required for regular
integration rules to apply (Øksendal, 2010; Särkkä and Solin, 2019).
However, we can define a random variable q_n as
q_n ≜ ∫_{t_{n−1}}^{t_n} e^{A(t_n − t)} B_w w(t) dt.
Hence, the mean and covariance of the random variable qn given that w(t) is a
zero-mean white noise process with autocorrelation function Rww (τ ) = Σw δ(τ )
are given by
E{q_n} = 0, (5.43a)
Cov{q_n} = ∫_{t_{n−1}}^{t_n} e^{A(t_n − τ)} B_w Σ_w B_w^T e^{A^T(t_n − τ)} dτ ≜ Q_n. (5.43b)
Furthermore, it turns out that in this case, the process noise qn is actually a Gaus-
sian random variable, that is, it holds that
qn ∼ N (0, Qn ). (5.44)
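The covariance integral (5.43b) can be evaluated numerically with Van Loan's augmented-matrix method, which is not described in the text itself but is a standard way to compute exactly this integral. A sketch, again with a series-based matrix exponential standing in for a proper library routine:

```python
import numpy as np

def expm_series(M, terms=40):
    """Truncated power series for the matrix exponential (illustrative only)."""
    E, T = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        T = T @ M / k
        E = E + T
    return E

def discretize_noise(A, B_w, Sigma_w, dt):
    """Van Loan's method: exponentiate [[-A, B_w Sigma_w B_w^T], [0, A^T]] dt
    and read off F = e^{A dt} and Q_n from the blocks."""
    n = A.shape[0]
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n] = -A
    M[:n, n:] = B_w @ Sigma_w @ B_w.T
    M[n:, n:] = A.T
    E = expm_series(M * dt)
    F = E[n:, n:].T        # transpose of e^{A^T dt}
    Q = F @ E[:n, n:]      # equals the integral in (5.43b)
    return F, Q
```

For the constant-velocity model driven by white noise on the velocity, this reproduces the well-known closed form Q_n = Σ_w [∆t³/3 ∆t²/2; ∆t²/2 ∆t].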
xn = Fn xn−1 + qn (5.45)
where
Noting that the third term is the integral of the derivative of the matrix exponential
with respect to t (see (5.37)), we have that
x_n ≈ e^{A_x ∆t} x_{n−1} + ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} dt f(x_{n−1}) + [e^{A_x(t_n − t)}]_{t=t_{n−1}}^{t_n} x_{n−1}
+ ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} B_u u(t) dt
= e^{A_x ∆t} x_{n−1} + ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} dt f(x_{n−1}) + (I − e^{A_x(t_n − t_{n−1})}) x_{n−1}
+ ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} B_u u(t) dt,
and finally,
x_n ≈ x_{n−1} + ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} dt f(x_{n−1}) + ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − t)} B_u u(t) dt. (5.47)
with
q_n ∼ N(0, Q_n),
Q_n ≈ ∫_{t_{n−1}}^{t_n} e^{A_x(t_n − τ)} B_w Σ_w B_w^T e^{A_x^T(t_n − τ)} dτ.
and the problem is to calculate the integrals in the second and third terms on
the right hand side. However, for sufficiently small ∆t = tn − tn−1 , we can
approximate the integral by using the rectangle approximation. This means that
we can approximate the integral as
∫_{t_{n−1}}^{t_n} f(x(t)) dt ≈ f(x_{n−1})(t_n − t_{n−1}) = ∆t f(x_{n−1}),
and similarly for the second integral with the input u(t).
This then yields the Euler discretization of the deterministic, nonlinear dy-
namic model given by
xn ≈ xn−1 + ∆tf (xn−1 ) + ∆tBu un−1 . (5.49)
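One step of (5.49) is a one-liner; the test below checks it against a model with a known exact solution (the interface is my own choice):

```python
import numpy as np

def euler_step(f, x, u, B_u, dt):
    """One step of the Euler discretization (5.49) of x' = f(x) + B_u u.
    The approximation is only accurate for sufficiently small dt."""
    return x + dt * f(x) + dt * (B_u @ u)
```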
We can still use the rectangle approximation for the integral involving the deter-
ministic part (second term on the right hand side), but for the stochastic part, we
again lack the regular integration rules. Thus, we cannot use the rectangle approx-
imation in this case. However, we can again define the resulting random variable
as
q_n ≜ ∫_{t_{n−1}}^{t_n} B_w w(t) dt
and investigate its properties instead.
First, the mean is given by
E{q_n} = E{∫_{t_{n−1}}^{t_n} B_w w(t) dt} = ∫_{t_{n−1}}^{t_n} B_w E{w(t)} dt = 0,
where we have made use of the assumption that w(t) is a zero-mean process.
Second, the covariance can be found similarly to the linear case as follows:
Cov{q_n} = E{(∫_{t_{n−1}}^{t_n} B_w w(t) dt)(∫_{t_{n−1}}^{t_n} B_w w(τ) dτ)^T}
= ∫_{t_{n−1}}^{t_n} ∫_{t_{n−1}}^{t_n} B_w E{w(t) w(τ)^T} B_w^T dτ dt
= ∫_{t_{n−1}}^{t_n} ∫_{t_{n−1}}^{t_n} B_w Σ_w δ(t − τ) B_w^T dτ dt
= ∫_{t_{n−1}}^{t_n} B_w Σ_w B_w^T dt.
tn−1
At this point, note that the remaining integral does not involve any random process
anymore. Hence, we can again use the rectangle approximation of the integral,
which yields
Cov{qn } ≈ (Bw Σw BT
w )(tn − tn−1 )
= ∆tBw Σw BT
w
, Qn .
This yields the Euler–Maruyama discretization of the stochastic nonlinear dy-
namic model given by
xn = xn−1 + ∆tf (xn−1 ) + qn , (5.50)
with
q_n ∼ N(0, Q_n), Q_n = ∆t B_w Σ_w B_w^T.
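Simulating sample paths of (5.50) is straightforward; a sketch, where the Ornstein–Uhlenbeck test model f(x) = −x in the test is my own choice because its stationary variance is known in closed form:

```python
import numpy as np

def euler_maruyama(f, B_w, Sigma_w, x0, dt, n_steps, rng):
    """Simulate the Euler-Maruyama discretization (5.50):
    x_n = x_{n-1} + dt f(x_{n-1}) + q_n with q_n ~ N(0, dt B_w Sigma_w B_w^T).
    Assumes B_w Sigma_w B_w^T has full rank so its Cholesky factor exists."""
    x = np.array(x0, dtype=float)
    L = np.linalg.cholesky(dt * B_w @ Sigma_w @ B_w.T)  # q_n = L * std. normal
    xs = np.empty((n_steps + 1, x.size))
    xs[0] = x
    for n in range(n_steps):
        x = x + dt * f(x) + L @ rng.standard_normal(x.size)
        xs[n + 1] = x
    return xs
```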
Chapter 6
Filtering
xn = f (xn−1 ) + qn , (6.1a)
yn = g(xn ) + rn , (6.1b)
where the process and measurement noises are zero-mean (E{qn } = E{rn } = 0)
with covariances Cov{qn } = Qn and Cov{rn } = Rn , respectively. Note that the
dynamic model (6.1a) may be an inherently discrete-time model (Section 5.2) or
the result of discretizing a continuous-time dynamic model.
An aspect of the state-space model that has been neglected so far is the initial condition. Since the dynamic model (6.1a) is based on a difference equation, it also requires knowledge of the initial condition of the system. Naturally, in the sensor fusion and estimation context, the initial conditions are unknown
(e.g., it is not known where exactly a target is located in the beginning). Instead, it
is suitable to specify the initial conditions in a probabilistic way. In other words,
x0 ∼ p(x0 ).
This allows us to encode prior knowledge about where the initial state may be while taking into account that this knowledge is uncertain. In practice, the distribution of the initial state is often assumed to be Gaussian, but this does not need to be the case. Here, we only assume that the mean and covariance of the initial state are given by
E{x0 } = m0 , (6.2a)
Cov{x0 } = P0 . (6.2b)
Given the state-space model defined by (6.1) and (6.2), the filtering approach
can now be formulated. In particular, filtering iterates between 1) a prediction
step, which predicts the current state using the dynamic model and the previous
estimate of the state (also called time update), and 2) a measurement update step
that estimates the current state using the prediction and the new measurement. This
prediction–update strategy is very general and, as will be shown, all the algorithms discussed in this chapter make use of these two steps.
Prediction. The aim of the prediction step is to estimate the current state xn at tn
given the set of all the previous measurement data y1:n−1 = {y1 , y2 , . . . , yn−1 },
and hence, given the estimate of the state xn−1 at tn−1 . Here, the mean of the
prediction is denoted as x̂n|n−1 , which is read as “the estimate of the state at tn
given the data up to tn−1 ”. The hat indicates that it is an estimate, whereas
the subscript is closely related to the notation for conditional expectations and
densities, that is, it is read as “n given n − 1”. Similarly, the covariance of the
predicted xn is denoted as Pn|n−1 .
6.2 Kalman Filter

Together with the initial distribution (6.2), the model (6.3) has the discrete-time equivalent
xn = Fn xn−1 + qn (6.4a)
yn = Gn xn + rn (6.4b)
with
Based on (6.4)–(6.5), the prediction and update steps can now be derived. For ease
of presentation, the measurement update step is presented first, followed by the
prediction.
Measurement Update. For the measurement update at time tn , assume that there
exists a prediction of the mean of the state x̂n|n−1 and its covariance Pn|n−1 . This
can be seen as the prior knowledge of the state at tn . Then, the new measurement
yn provides the actual (noisy) information about the true state. Hence, the ob-
jective is to minimize the error with respect to the measurement, taking the prior
information (prediction) into account.
Recall that we have based our estimators mainly on cost functions in general
and quadratic cost functions (least squares) in particular. A cost function that is
particularly well suited for this kind of problem is the regularized least squares
cost function. Hence, for the filtering problem with the linear measurement model,
the cost function becomes
where the first term accounts for the measurement yn and the second term, the
regularization term, accounts for the prior information from the prediction.
To estimate the state, we can now solve the regularized least squares problem
Since the cost function is quadratic and the model is linear in the unknown xn , solving (6.7) follows the same steps as the linear least squares problems in Chapter 3, and we can find a closed-form solution.
Fortunately, we have already solved this problem for the static case. The solu-
tion is given by (see Section 3.3)
where E{xn | y1:n } denotes the conditional expectation of xn given the data y1:n
(and similar for the covariance).
Prediction. Given the estimated mean E{xn−1 | y1:n−1 } = x̂n−1|n−1 and its
covariance Cov{xn−1 | y1:n−1 } = Pn−1|n−1 from the previous time step tn−1 ,
the predicted mean and covariance can now be calculated. The mean is given by
Substituting xn in the expectation using the dynamic model given in (6.4), the
mean can be rewritten as
Distributing the expectation over the sum and noting that E{qn | y1:n−1 } = 0
yields
E{xn | y1:n−1 } = Fn E{xn−1 | y1:n−1 } = Fn x̂n−1|n−1 , (6.12)
where the last equality makes use of (6.11).
Similarly, the covariance is found from
Algorithm 6.1 Kalman filter (excerpt)
3: Prediction:
   x̂n|n−1 = Fn x̂n−1|n−1
   Pn|n−1 = Fn Pn−1|n−1 Fnᵀ + Qn
4: Measurement update:
   Kn = Pn|n−1 Gnᵀ (Gn Pn|n−1 Gnᵀ + Rn )⁻¹
   x̂n|n = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 )
   Pn|n = Pn|n−1 − Kn (Gn Pn|n−1 Gnᵀ + Rn ) Knᵀ
5: end for
Using the expression of the dynamic model (6.4) for xn and (6.12) for E{xn |
y1:n−1 } yields
Cov{xn | y1:n−1 }
= E{(Fn xn−1 + qn − Fn x̂n−1|n−1 )(Fn xn−1 + qn − Fn x̂n−1|n−1 )T | y1:n−1 }.
By rewriting and expanding the quadratic term and distributing the expectation we obtain
Cov{xn | y1:n−1 }
= E{(Fn xn−1 − Fn x̂n−1|n−1 )(Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 }
+ E{qn (Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 }
+ E{(Fn xn−1 − Fn x̂n−1|n−1 ) qnᵀ | y1:n−1 }
+ E{qn qnᵀ | y1:n−1 }.
Noting that the middle terms are zero due to the factors being independent and qn
being zero-mean, what remains is
Cov{xn | y1:n−1 }
= E{(Fn xn−1 − Fn x̂n−1|n−1 )(Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 } + E{qn qnᵀ | y1:n−1 },
and finally,
Cov{xn | y1:n−1 } = Fn Pn−1|n−1 Fnᵀ + Qn . (6.13)
Kalman Filter. Gathering the results (6.8)–(6.10) and (6.12)–(6.13), the filtering
algorithm for linear state-space models can now be formulated. First, during the
prediction (time update) step the predicted mean and covariance are calculated
according to
Second, in the measurement update step, the new measurement is incorporated and
the new state is estimated using
Kn = Pn|n−1 Gnᵀ (Gn Pn|n−1 Gnᵀ + Rn )⁻¹, (6.15a)
x̂n|n = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 ), (6.15b)
Pn|n = Pn|n−1 − Kn (Gn Pn|n−1 Gnᵀ + Rn ) Knᵀ. (6.15c)
and thus
E{yn | y1:n−1 } = Gn x̂n|n−1 . (6.16)
Hence, the difference between the measurement and its prediction gives an indication of how far the predicted state is from the true state: if the difference is large, the predicted state is far off and should be corrected a lot; if it is small, the prediction is close and needs little correction. This difference is therefore also called the innovation.
Cov{yn | y1:n−1 }
= E{(yn − E{yn | y1:n−1 })(yn − E{yn | y1:n−1 })ᵀ | y1:n−1 }
= E{(Gn xn + rn − Gn x̂n|n−1 )(Gn xn + rn − Gn x̂n|n−1 )ᵀ | y1:n−1 }
= E{Gn (xn − x̂n|n−1 )(xn − x̂n|n−1 )ᵀ Gnᵀ + rn rnᵀ | y1:n−1 }
and finally,
Cov{yn | y1:n−1 } = Gn Pn|n−1 Gnᵀ + Rn , (6.17)
which is the denominator of the Kalman gain Kn . Similarly, the cross-covariance
between the predicted state xn and predicted output yn is
Cov{xn , yn | y1:n−1 }
= E{(xn − E{xn | y1:n−1 })(yn − E{yn | y1:n−1 })ᵀ | y1:n−1 }
= E{(xn − x̂n|n−1 )(Gn xn + rn − Gn x̂n|n−1 )ᵀ | y1:n−1 }
= E{(xn − x̂n|n−1 )(xn − x̂n|n−1 )ᵀ Gnᵀ | y1:n−1 },
which yields
Cov{xn , yn | y1:n−1 } = Pn|n−1 Gnᵀ. (6.18)
Thus, using (6.16)–(6.18), the measurement update can be written in the alternative
form as
From (6.19), we can see that the denominator of the Kalman gain Kn is the covari-
ance of the predicted measurement. Hence, if the uncertainty of the prediction is
large (either due to a large uncertainty in the predicted state or due to large mea-
surement noise covariance Rn ), the gain becomes small. This in turn means that
the innovation only contributes little information to the updated state. Conversely,
if the uncertainty of the prediction is small, the gain becomes large and the innova-
tion contributes a lot.
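Putting the prediction step and the measurement update (6.15) together gives a compact implementation. The sketch below assumes NumPy; function and variable names are my own, and S is exactly the innovation covariance (6.17):

```python
import numpy as np

def kalman_step(x, P, y, F, G, Q, R):
    # Prediction (time update)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Measurement update; S is the innovation covariance (6.17)
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)      # Kalman gain (6.15a)
    x_new = x_pred + K @ (y - G @ x_pred)    # correct with the innovation
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new

# Illustration: estimate a constant scalar state from noisy measurements.
rng = np.random.default_rng(42)
F = G = np.eye(1)
Q = np.zeros((1, 1))          # the state does not move
R = np.eye(1)
x, P = np.zeros(1), 10.0 * np.eye(1)
for _ in range(50):
    y = np.array([5.0]) + rng.normal(0.0, 1.0, 1)
    x, P = kalman_step(x, P, y, F, G, Q, R)
# x[0] approaches the true value 5 and P shrinks as measurements accumulate
```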
6.3 Extended Kalman Filter

A first approximation is found in the extended Kalman filter (EKF), which is based
on a linearization using a Taylor series approximation of the nonlinear dynamic
and measurement models. Using this linearization, prediction and measurement
update steps similar to the Kalman filter are found.
Prediction. For the prediction, assume that the state estimate and its covariance
for tn−1 are given by x̂n−1|n−1 and Pn−1|n−1 , respectively. Then, the linear (first
order) Taylor series approximation of the dynamic model around the linearization
point x̂n−1|n−1 is given by
xn = f (xn−1 ) + qn
≈ f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn , (6.20)
where Fx is the Jacobian matrix of the vector-valued function f . Then, based on the
linearized dynamic model (6.20), we can predict the mean and covariance of the
state xn at tn given the set of all the measurements y1:n−1 = {y1 , y2 , . . . , yn−1 }.
The mean is given by
x̂n|n−1
= E{xn | y1:n−1 }
≈ E{f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn | y1:n−1 }
= f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 ) E{xn−1 | y1:n−1 } − Fx (x̂n−1|n−1 )x̂n−1|n−1
= f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )x̂n−1|n−1 − Fx (x̂n−1|n−1 )x̂n−1|n−1
= f (x̂n−1|n−1 ),
and thus
x̂n|n−1 = f (x̂n−1|n−1 ). (6.21)
Similarly, the covariance is
Pn|n−1
= E{(xn − x̂n|n−1 )(xn − x̂n|n−1 )T | y1:n−1 }
≈ E{(f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn − f (x̂n−1|n−1 ))
× (f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn − f (x̂n−1|n−1 ))T
| y1:n−1 }
= E{(Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn )
× (Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn )T | y1:n−1 }
= Fx (x̂n−1|n−1 ) E{(xn−1 − x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 )ᵀ | y1:n−1 } Fxᵀ(x̂n−1|n−1 ) + E{qn qnᵀ | y1:n−1 }
and finally,
Pn|n−1 = Fx (x̂n−1|n−1 ) Pn−1|n−1 Fxᵀ(x̂n−1|n−1 ) + Qn . (6.22)
Measurement Update. For the measurement update, the measurement model is linearized around the predicted mean x̂n|n−1 in the same way,
yn = g(xn ) + rn
≈ g(x̂n|n−1 ) + Gx (x̂n|n−1 )(xn − x̂n|n−1 ) + rn , (6.23)
which is linear in xn and has the same form as (6.6) but with the Jacobian matrix
Gx (x̂n|n−1 ) rather than the measurement matrix Gn . Hence, solving
yields
with covariance
Algorithm 6.2 Extended Kalman filter (excerpt)
3: Prediction:
   x̂n|n−1 = f (x̂n−1|n−1 )
   Pn|n−1 = Fx (x̂n−1|n−1 ) Pn−1|n−1 Fxᵀ(x̂n−1|n−1 ) + Qn
4: Measurement update:
   Kn = Pn|n−1 Gxᵀ(x̂n|n−1 ) (Gx (x̂n|n−1 ) Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )⁻¹
   x̂n|n = x̂n|n−1 + Kn (yn − g(x̂n|n−1 ))
   Pn|n = Pn|n−1 − Kn (Gx (x̂n|n−1 ) Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn ) Knᵀ
5: end for
Extended Kalman Filter. We can now summarize the prediction and measurement update steps for the EKF. First, during the prediction step, the predicted mean and covariance are calculated according to
where Fx is the Jacobian matrix of the dynamic model. Second, the measurement
update is
Kn = Pn|n−1 Gxᵀ(x̂n|n−1 ) (Gx (x̂n|n−1 ) Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )⁻¹, (6.28a)
x̂n|n = x̂n|n−1 + Kn (yn − g(x̂n|n−1 )), (6.28b)
Pn|n = Pn|n−1 − Kn (Gx (x̂n|n−1 ) Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn ) Knᵀ. (6.28c)
These two steps are then performed iteratively, and the algorithm is initialized
with the initial conditions. This yields the complete EKF algorithm shown in Algo-
rithm 6.2. Note that this is essentially the same algorithm as the original KF where
the nonlinear function takes the place of the prediction (both in the prediction step
and the prediction of the output) and in the covariance updates, the Jacobian matri-
ces take the place of the matrices in the dynamic model and measurement model.
Furthermore, if either of the models is linear, the corresponding step (prediction or
measurement update) may be replaced by an exact update step from the Kalman
filter in Section 6.2. Finally, note that Algorithm 6.2 is an approximate solution to the filtering problem, since the nonlinear functions are linearized around the filtered and predicted states; consequently, the state estimate and its covariance may or may not converge to the true state (similarly to the static nonlinear case).
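A hedged sketch of one EKF step (6.28), mirroring the Kalman filter but with user-supplied nonlinear functions and their Jacobians; the names are my own:

```python
import numpy as np

def ekf_step(x, P, y, f, F_jac, g, G_jac, Q, R):
    # Prediction: propagate the mean through f, the covariance via the Jacobian
    F = F_jac(x)
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    # Measurement update with the measurement Jacobian at the predicted mean
    G = G_jac(x_pred)
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - g(x_pred))   # note: g itself, not its linearization
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new
```

For linear f and g with constant Jacobians, this step reduces exactly to the Kalman filter update, which provides a convenient sanity check.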
6.4 Unscented Kalman Filtering
where it must hold that ∑_{j=0}^{J−1} wmj = 1, since the expected value of the sum should be unbiased.
Then, given the nonlinear function z = h(x), we can calculate the transformed
sigma-points according to
zj = h(xj ).
Based on the transformed sigma-points, the mean and covariance of the trans-
formed variable z as well as the cross-covariance between x and z can then be
calculated according to
E{z} ≈ ∑_{j=1}^{J} wmj zj , (6.30a)
Cov{z} ≈ ∑_{j=1}^{J} wPj (zj − E{z})(zj − E{z})ᵀ, (6.30b)
Cov{x, z} ≈ ∑_{j=1}^{J} wPj (xj − m)(zj − E{z})ᵀ. (6.30c)
The question then is how to choose the sigma-points xj and their weights for the mean (wmj ) and covariance (wPj ). The unscented transform is one such approach for
wm0 = λ / (L + λ), (6.32a)
wP0 = λ / (L + λ) + (1 − α² + β), (6.32b)
wmj = wPj = 1 / (2(L + λ)), j = 1, . . . , 2L. (6.32c)
In both (6.31) and (6.32), λ is
λ = α2 (L + κ) − L, (6.33)
and α, β, and κ are tuning parameters. The parameters α and κ determine the scaling of the spread of the sigma-points in the direction of the covariance P, whereas β is related to the higher-order moments of x and only affects the weight of the central sigma-point in the covariance calculations.
The choice of the tuning parameters may greatly affect the performance of the
filter and several default choices have been suggested. The value for κ is most
often chosen to be zero (i.e., κ = 0), which leaves the choices for α and β. First,
note that with κ = 0, the weight of the central sigma-point becomes
wm0 = λ / (L + λ) = (Lα² − L) / (L + Lα² − L) = (α² − 1) / α², (6.34)
and the scaling factor of the root of the covariance matrix becomes
√(L + λ) = √(L + α²L − L) = α√L. (6.35)
One suggestion is to choose α = 1 × 10⁻³ (Wan and Van Der Merwe, 2000). This gives a large, negative weight wm0 [see (6.34)] and a small scaling factor [see (6.35)], which makes the sigma-points lie close to the mean. To avoid this, it is sometimes more intuitive to instead start from the weight of the central point wm0 and determine α by solving (6.34) for α, which yields
α = √(1 / (1 − wm0)). (6.36)
Furthermore, since all the weights for j > 0 are the same [see (6.32)], it must also hold that
wmj = (1 − wm0) / (2L).
Finally, β only affects the central weight of the covariance, and a good starting point is normally to choose β = 2 (see, for example, Wan and Van Der Merwe (2000)).
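The weights (6.32)–(6.33) and the moment approximations (6.30) can be combined into a small unscented-transform routine. The sketch below assumes the standard symmetric sigma-point placement (the mean m plus m shifted by ± the columns of the square root of (L + λ)P, consistent with the scaling factor in (6.35)); function and variable names are my own:

```python
import numpy as np

def unscented_transform(m, P, h, alpha=1.0, beta=2.0, kappa=0.0):
    L = len(m)
    lam = alpha**2 * (L + kappa) - L                   # (6.33)
    # Requires L + lam > 0 so the scaled covariance has a Cholesky factor
    S = np.linalg.cholesky((L + lam) * P)
    # Central point plus one symmetric pair per state dimension
    sigmas = [m] + [m + S[:, j] for j in range(L)] + [m - S[:, j] for j in range(L)]
    wm = np.full(2 * L + 1, 1.0 / (2.0 * (L + lam)))   # (6.32c)
    wP = wm.copy()
    wm[0] = lam / (L + lam)                            # (6.32a)
    wP[0] = lam / (L + lam) + (1.0 - alpha**2 + beta)  # (6.32b)
    Z = np.array([h(s) for s in sigmas])               # transformed sigma-points
    z_mean = wm @ Z                                    # (6.30a)
    dZ = Z - z_mean
    z_cov = (wP[:, None] * dZ).T @ dZ                  # (6.30b)
    return z_mean, z_cov

# Sanity check: for a linear map the transform is exact.
A = np.array([[2.0, 0.0], [1.0, 3.0]])
m = np.array([1.0, -1.0])
P = np.array([[0.5, 0.1], [0.1, 0.4]])
z_mean, z_cov = unscented_transform(m, P, lambda x: A @ x)
# z_mean equals A @ m and z_cov equals A @ P @ A.T (up to rounding)
```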
Note that the additional term Qn in the covariance is due to the fact that the dy-
namic model also includes the process noise term qn , which increases the uncer-
tainty. In other words, the unscented transform only calculates the covariance of
the xn−1 transformed by f (xn−1 ) and does not take the effect of the noise into
account in this form.
Measurement Update. For the measurement update, first recall that it can be
written in terms of the predicted output, its covariance, and the covariance between
the predicted output and the state, see (6.19). Hence, we can also use the unscented
transform to calculate these means and covariances.
In this case, the sigma points are calculated from the predicted mean x̂n|n−1
and the covariance Pn|n−1 according to
Cov{yn | y1:n−1 } = ∑_{j=0}^{2L} wPj (ynj − E{yn | y1:n−1 })(ynj − E{yn | y1:n−1 })ᵀ + Rn , (6.40c)
Cov{xn , yn | y1:n−1 } = ∑_{j=0}^{2L} wPj (xjn − x̂n|n−1 )(ynj − E{yn | y1:n−1 })ᵀ. (6.40d)
6.6 Bootstrap Particle Filter

The samples (particles) can be seen as guesses of the state, whereas the weights provide an indication of how good each guess is. Particle filtering is a very general methodology and comprises a large
class of filtering algorithms. Here, we focus on the bootstrap particle filter, which
has an intuitive explanation but also a rigorous mathematical background (Gordon
et al., 1993; Doucet and Johansen, 2011; Särkkä, 2013).
To derive the particle filter, we first take another look at the general discrete-
time state-space model given by
xn = f (xn−1 ) + qn ,
yn = g(xn ) + rn ,
with qn ∼ p(qn ) and rn ∼ p(rn ). Recall that the dynamic model is a stochastic
process, meaning that every time a system is observed, a new realization of that
stochastic process is observed. For example, consider the scalar (one-dimensional)
random walk dynamic model
xn = xn−1 + qn
with
x0 ∼ N (0, 1),
qn ∼ N (0, 1).
This model essentially states that the initial state is a random variable distributed
around x0 = 0 with variance 1 and as time elapses, random steps from that initial
point are taken. Then, at any time tn , there are infinitely many random steps that
can be taken through qn (i.e., qn can take on any value, but with some values more
likely than others as specified by the distribution of the random variable). This
implies in turn that no two realizations of any random process are the same. As an
example, Figure 6.1a illustrates 20 realizations of the random walk process. As can be seen, each realization takes a completely different path, and the realizations
diverge as time elapses.
In practice, when estimating a system’s state, all the observations (measure-
ments) come from one single realization out of all the infinitely possible realiza-
tions of the process. The measurements then give us the information about the
particular realization that is observed and the realization and the measurements are
linked through the measurement model. For example, Figure 6.1b illustrates one
particular realization of the random walk process together with the measurements
from the linear measurement model
yn = xn + rn
Figure 6.1. Random walk process example: (a) 20 realizations of the random walk process,
and (b) one realization of the process together with its measurements.
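The realizations in Figure 6.1 can be reproduced in a few lines; the seed and sizes below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_real = 60, 20
# First row is x_0 ~ N(0, 1); each later row adds a step q_n ~ N(0, 1),
# so cumulative summing yields n_real independent random walk realizations.
x = np.cumsum(rng.normal(0.0, 1.0, (T + 1, n_real)), axis=0)
# Measurements y_n = x_n + r_n from the first realization only
y = x[1:, 0] + rng.normal(0.0, 1.0, T)
```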
Figure 6.2. Random walk example with measurements yn (blue) from one particular
realization of the random walk process (red) and 20 other realizations.
Similar to the prediction and measurement update steps introduced in Section 6.1, this approach alternates between:
1. Prediction: Simulate new samples xjn from the previous samples xjn−1 using the dynamic model.
2. Measurement update: Evaluate how well the simulated states xjn explain the observed measurement yn .
For example, Figure 6.2 combines the different realizations of the state trajectories
in Figure 6.1a (gray) with the measurements (blue dots) in Figure 6.1b (together
with the true trajectory that generated the measurements in red). As can be seen from the figure, several of the gray realizations could be the trajectories that generated the measurements in blue, at least at some points in time, whereas other realizations are very unlikely to have generated them.
To develop a filtering algorithm based on this reasoning, we need to find a
strategy to a) simulate samples xjn given the samples xjn−1 , and b) evaluate how
well a particular sample xjn of the state explains the current measurement yn .
In practice, the process noise qn is often Gaussian, that is, qn is a Gaussian random
variable (e.g., when the discrete-time model comes from the discretization of a
continuous-time model with a white noise process as the input). In that case,
sampling qn amounts to sampling from the normal distribution N (0, Qn ) where
Qn is the covariance matrix of qn .
To evaluate the importance of each sample with respect to the current measure-
ment yn , we assign a weight wnj (called the importance weight) to each sample xjn .
Intuitively, the weight should represent how well each sample explains the mea-
surement, and thus, the closer the sample is to the true state, the higher the weight
should be. If we ensure that the weights sum to one, that is, if
J
X
wnj = 1,
j=1
we can also loosely interpret the weight wnj as the probability of the jth sample
representing the correct state. The question then is how the weights should be cal-
culated, given the sample xjn and the measurement yn . So far, cost functions have
been used to evaluate whether a state explains the measurements well. However, a cost should be low for a state that explains the measurements well, whereas the importance weight should be high; hence, we cannot use cost functions directly.
Instead, we have another look at the measurement model. Recall that we have
assumed that the measurement model is of the form
yn = g(xn ) + rn ,
with rn ∼ p(rn ). The likelihood p(yn | xjn ) is then a suitable measure for how well any given sample xjn explains the measurement yn , and we can define the non-normalized weight w̃nj as
w̃nj = p(yn | xjn ).
The weight w̃nj is non-normalized because the weights do not, in general, sum to one. This can be ensured by simply dividing each non-normalized weight by the sum of all the non-normalized weights, that is,
wnj = w̃nj / ∑_{i=1}^{J} w̃ni .
Once the importance weights have been calculated, a point estimate of the
current state and its covariance can be obtained. Similar to the unscented Kalman filter, these are given by the weighted sum of the individual samples xjn , weighted by the importance weights wnj , that is,
x̂n|n = ∑_{j=1}^{J} wnj xjn , (6.41a)
Pn|n = ∑_{j=1}^{J} wnj (xjn − x̂n|n )(xjn − x̂n|n )ᵀ. (6.41b)
Observe that the covariances for the process noise (Qn ) and measurement noise
(Rn ) do not enter these equations (contrary to the Kalman-filter-type algorithms).
This is due to the fact that this uncertainty is accounted for when sampling and
calculating the weights.
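The sampling, weighting, and estimation steps described so far can be sketched as follows, assuming Gaussian process and measurement noise; resampling is intentionally omitted at this point. All names are illustrative:

```python
import numpy as np

def pf_predict_and_weight(particles, y, f, g, Q, R, rng):
    """One prediction + weighting pass of the bootstrap filter (no resampling).
    Assumes Gaussian q_n ~ N(0, Q) and r_n ~ N(0, R)."""
    J, d = particles.shape
    # 1) Prediction: simulate each sample through the dynamic model
    particles = np.array([f(xj) for xj in particles]) \
        + rng.multivariate_normal(np.zeros(d), Q, size=J)
    # 2) Importance weights: Gaussian density of the residual y - g(x_n^j)
    resid = y - np.array([g(xj) for xj in particles])
    w = np.exp(-0.5 * np.einsum('ji,ik,jk->j', resid, np.linalg.inv(R), resid))
    w = w / w.sum()                        # normalize so the weights sum to one
    # Point estimate and covariance, (6.41a)-(6.41b)
    x_hat = w @ particles
    diff = particles - x_hat
    P_hat = (w[:, None] * diff).T @ diff
    return particles, w, x_hat, P_hat

# Example: a static scalar state observed through the identity
rng = np.random.default_rng(3)
particles0 = rng.normal(0.0, 1.0, (4000, 1))   # initial guesses ~ N(0, 1)
particles, w, x_hat, P_hat = pf_predict_and_weight(
    particles0, np.array([1.0]), lambda x: x, lambda x: x,
    1e-6 * np.eye(1), 0.04 * np.eye(1), rng)
```

Dropping the Gaussian normalization constant in the weights is harmless because it cancels when they are normalized.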
It would appear that this now yields a working algorithm. However, there is one
important problem, illustrated for the random walk model in Figure 6.3: Assuming
that the red trajectory represents the true realization of our state trajectory, the
other simulated trajectories start to diverge from that trajectory as time elapses.
Consequently, toward the end of the time scale, there are only very few trajectories
in the neighborhood of the actual trajectory. In practice, this means that the weights
for almost all but a few trajectories would become zero and eventually, after some
more time has elapsed, there would be no trajectory close to the true realization
anymore and the filter would break down completely.
[Figure 6.3 shows the simulated random walk realizations diverging from the true trajectory as time elapses.]
To address this problem, one additional step called resampling has to be in-
troduced. The basic idea of resampling is to make sure that samples with low
weights, that is, samples that are unlikely to be close to the true state, are discarded
and replaced with copies of the samples with high weights. Hence, resampling essentially regenerates the sample set such that it contains approximately ⌊wnj J⌉ copies of xjn . Thus, if x̃in (for i = 1, . . . , J) denotes the ith resampled particle, that particle is equal to particle xjn with probability wnj , that is, we have that
Pr{x̃in = xjn } = wnj ,
where Pr{·} denotes the probability. This ensures that trajectories with low weight
are discarded and that the samples remain close to the true state.
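The text only requires that each resampled particle equals xjn with probability wnj, which plain multinomial sampling satisfies. The sketch below instead uses systematic resampling, a common lower-variance scheme (my choice, not prescribed by the text):

```python
import numpy as np

def systematic_resample(w, rng):
    """Return J indices such that index j appears approximately w[j]*J times."""
    J = len(w)
    # One uniform offset, then J evenly spaced positions in [0, 1)
    positions = (rng.random() + np.arange(J)) / J
    return np.searchsorted(np.cumsum(w), positions)

# Example: half the weight on each of the first two particles means each of
# them is duplicated twice and the zero-weight particles disappear.
rng = np.random.default_rng(0)
idx = systematic_resample(np.array([0.5, 0.5, 0.0, 0.0]), rng)
```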
The final bootstrap particle filtering algorithm then consists of the following three steps:
1. Prediction: Simulate new particles xjn from the previous particles xjn−1 using the dynamic model.
2. Measurement update: Calculate and normalize the importance weights wnj using the measurement yn .
3. Resampling: Regenerate the particle set by drawing each new particle from the current set with probabilities given by the weights.
These three steps are then repeated for the complete dataset and the recursion is
initialized by sampling from the initial distribution of the state p(x0 ). Assuming
Gaussian distributions for the initial state, process noise, and measurement noise,
that is,
x0 ∼ N (m0 , P0 ),
qn ∼ N (0, Qn ),
rn ∼ N (0, Rn ),
Algorithm Bootstrap particle filter (excerpt)
1: Initialize the particles (j = 1, . . . , J):
   xj0 ∼ N (m0 , P0 )
2: for n = 1, 2, . . . do
3:   for j = 1, 2, . . . , J do
4:     Sample qjn ∼ N (0, Qn )
5:     Propagate the particle: xjn = f (xjn−1 ) + qjn
6:     Calculate the non-normalized weight w̃nj = N (yn ; g(xjn ), Rn )
7:   end for
8:   Normalize the importance weights (j = 1, . . . , J):
     wnj = w̃nj / ∑_{i=1}^{J} w̃ni
Bibliography
Pujol, J. (2007). The solution of nonlinear inverse problems and the Levenberg–Marquardt method. Geophysics, 72(4):W1–W16.

Richards, M. A., Scheer, J., and Holm, W. A. (2010). Principles of Modern Radar. SciTech Publishing.

Skoglund, M., Hendeby, G., and Axehill, D. (2015). Extended Kalman filter modifications based on an optimization view point. In 18th International Conference on Information Fusion (Fusion), pages 1856–1861.

Wan, E. A. and Van Der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC), pages 153–158.