
Lecture notes on

Basics of Sensor Fusion

Roland Hostettler¹ and Simo Särkkä²

¹Uppsala University, Sweden
²Aalto University, Finland

September 3, 2020
Contents

1 Introduction
  1.1 Definition and Components of Sensor Fusion
  1.2 Model of a Drone
  1.3 Model of an Autonomous Car
  1.4 Dynamic Models of Drone and Car
  1.5 Tracking a Moving Drone or Car

2 Sensors, Models, and Least Squares Criterion
  2.1 Sensors
  2.2 Models
    2.2.1 Basic Model
    2.2.2 Vector Model
    2.2.3 Multiple Measurements and Measurement Stacking
    2.2.4 Gaussian Measurement Noise
  2.3 Least Squares Methods
    2.3.1 Minimization of Cost Functions
    2.3.2 Least Squares
    2.3.3 Weighted Least Squares
    2.3.4 Regularized Least Squares

3 Static Linear Models
  3.1 Linear Model
    3.1.1 Scalar Model
    3.1.2 Vector Model
    3.1.3 Affine Models
  3.2 Linear Least Squares
    3.2.1 Scalar Model
    3.2.2 General Linear Model
    3.2.3 Weighted Linear Least Squares
  3.3 Regularized Linear Least Squares
  3.4 Sequential Linear Least Squares

4 Static Nonlinear Models
  4.1 Nonlinear Model
  4.2 Gradient Descent
  4.3 Gauss–Newton Algorithm
  4.4 Gauss–Newton Algorithm with Line Search
  4.5 Levenberg–Marquardt Algorithm
  4.6 Regularized Non-Linear Models
  4.7 Quasi-Newton Methods
  4.8 Convergence Criteria

5 Dynamic Models
  5.1 Continuous-Time State-Space Models
    5.1.1 Deterministic Linear State-Space Models
    5.1.2 Stochastic Linear State-Space Models
    5.1.3 Nonlinear State-Space Models
  5.2 Discrete-Time State-Space Models
    5.2.1 Deterministic Linear State-Space Models
    5.2.2 Stochastic Linear Dynamic Models
    5.2.3 Nonlinear Dynamic Model
  5.3 Discretization of Linear Dynamic Models
    5.3.1 Deterministic Linear Dynamic Models
    5.3.2 Stochastic Linear Dynamic Models
  5.4 Discretization of Nonlinear Dynamic Models
    5.4.1 Discretization of Linearized Nonlinear Models
    5.4.2 Euler Discretization of Deterministic Nonlinear Models
    5.4.3 Euler–Maruyama Discretization of Stochastic Nonlinear Models

6 Filtering
  6.1 The Filtering Approach
  6.2 Kalman Filter
  6.3 Extended Kalman Filter
  6.4 Unscented Kalman Filtering
  6.5 Iterated Extended and Unscented Kalman Filters
  6.6 Bootstrap Particle Filter
Preface

These lecture notes are aimed at an M.Sc. level course on sensor fusion. The goal is to provide a gentle and pragmatic introduction to the topic with an emphasis on methods that are useful for developing practical solutions to sensor fusion problems. The reader should be familiar with:

1. linear algebra and calculus,

2. ordinary differential equations, and

3. basic statistics.

The notes are based on the much more comprehensive and rigorous works on sensor fusion, estimation theory, filtering theory, stochastic process theory, and optimization theory by Gustafsson (2018), Kay (1993), Särkkä (2013), Särkkä and Solin (2019), and Nocedal and Wright (2006), and the interested reader is highly recommended to read these books.
The topics covered in these notes range from simple linear observation models and least squares estimation, to modeling of dynamic systems, and inference in dynamic systems using Kalman and particle filtering. To lower the threshold to the topic, we sometimes trade mathematical rigor for easier understanding. This is in particular true for the probabilistic interpretations, especially in the beginning of the notes.

Chapter 1

Introduction

1.1 Definition and Components of Sensor Fusion


Sensor fusion refers to the computational methodology of combining measurements from multiple sensors such that they jointly give more information about the measured system than any of the sensors alone. Sensor fusion has applications in many different areas of daily life and plays an important role in modern society. For example, it is used to interpret traffic scenes in autonomous cars, for navigation and localization in robotics, for controlling drones in aerial photography and deliveries, and in the analysis of biosignals in biomedical engineering.

Figure 1.1. The basic components of a sensor fusion system: one or more sensors, models, and an estimation algorithm that produces an estimate of the quantity of interest.

The common denominator and main objective of sensor fusion systems is that they take measurements from different sensors and estimate or infer one or more quantities of interest. For example, we can use multiple sensors to infer the position of a drone or an autonomous car, possibly along with their velocities. On a high level, this involves three main components (Figure 1.1):

• one or more sensor(s) that measure an observable quantity,

• model(s) that relate the observed quantity to the quantity of interest, and

• an estimation algorithm that estimates the quantity of interest by combining the model and the measured data.

Figure 1.2. A simple illustration of fusion of multiple sensor measurements made by a drone. The height py is measured with one sensor (say, a barometer) via the measurement y2, and the distance px from a wall with another sensor (say, a radar) via the measurement y1. The "fusion" of the measurements in this case simply means using both measurements together to determine the drone's position.

1.2 Model of a Drone


As an illustration of (simple) sensor fusion, let us take a look at Figure 1.2 with a drone whose position (px, py) we wish to determine. In this case we have two sensors: one sensor which measures the distance to a nearby wall (for example, a radar or an ultrasound-based sensor), and a second sensor which measures the distance to the ground, that is, the height (for example, a radar or a barometer). In this case neither of the sensors alone gives the position of the drone. However, as the wall-distance sensor directly measures the horizontal position px of the drone and the height sensor directly measures the vertical position py of the drone, we can "fuse" the sensor measurements simply by using the measurements jointly to determine the position (px, py). In this case the fusion itself is trivial, as no actual computations need to be performed on the measurements since they directly measure the components of the quantity of interest, that is, the position.
If we denote the sensor reading measuring the distance to the wall as y1 and the height sensor measurement as y2, then a mathematical model which relates the observed quantities (measurements) to the quantities of interest is

y1 = px,
y2 = py,   (1.1)

and the aim of the estimation algorithm is to determine (px, py) from the measurements (y1, y2). In this case it is indeed trivial, as the measurements directly give the position coordinates.

In reality, the situation is not as simple, as all sensors contain measurement noise and a more proper model is actually

y1 = px + r1,
y2 = py + r2,   (1.2)

which says that the measurements that we obtain are the actual positions plus measurement noises r1 and r2. The measurement noises are typically modeled as zero-mean Gaussian random variables. In this case we cannot exactly determine the position from the sensor measurements because we only get to see noisy versions of the positions. We could, however, assume that the noises are "small", which implies that y1 ≈ px and y2 ≈ py, and thus we can use the measurements directly as estimates of the position coordinates. In this case, given only the two sensor measurements, it is hard or impossible to get rid of the noise, and hence our position estimate will inherently be off by the amount of measurement noise; not even sophisticated statistical methods will be significantly better than the small-noise approximation. One way to cope with this problem is to average multiple measurements, which reduces the variance of the estimate, but it obviously requires us to make more measurements with the sensor. Alternatively, by adding more sensors we can diminish the effect of measurement noise even when each of the sensors only produces a single measurement.
Let us now look at a slightly extended sensor setup shown in Figure 1.3. We still have the wall-distance and height sensors, but additionally, we measure the distance from a wall that is at a 45° angle with the ground. From the geometry of the problem we can now determine that the relationship of the measurements and positions should be

y1 = px + r1,
y2 = py + r2,
y3 = (1/√2)(px − x0) + (1/√2) py + r3,   (1.3)

where x0 is the horizontal coordinate of the rightmost point of the wall and r1, r2, and r3 are measurement noises.
The aim of the estimation algorithm is now to determine the position (px, py) given the three sensor measurements (y1, y2, y3). If we had no measurement noises, then any two of the measurements would exactly give us the position (after possibly solving a simultaneous pair of equations). It is indeed this over-completeness of the model that allows us to use least-squares estimation to determine the position from the measurements such that the effect of the measurement noises is significantly diminished; the more over-complete the model is, the better we can diminish the noises. Furthermore, an important factor in the above model is that the measurement y3 is no longer a direct measurement of a position coordinate, but a linear combination of them, and in that sense it is an indirect observation of the position.

Figure 1.3. An illustration of sensor fusion with an additional sensor measurement y3 as compared to Figure 1.2. When we measure the drone position using three sensors instead of two, the over-completeness of the problem enables us to diminish the noises in the sensor measurements.

The model above (and in fact the simpler model as well) is an example of a static linear model, which will be covered in Chapter 3. More specifically, the model in Equation (1.3) can also be rewritten in vector notation as

[y1]   [  1      0  ]        [   0    ]   [r1]
[y2] = [  0      1  ] [px] + [   0    ] + [r2],   (1.4)
[y3]   [ 1/√2  1/√2 ] [py]   [−x0/√2  ]   [r3]

where the five terms are denoted y, G, x, b, and r, respectively, so that the model has the form

y = G x + b + r,   (1.5)

and the estimation aim is to determine the unknown position vector x given the measurement vector y and the parameters G and b. It is worthwhile to notice that the matrix G is neither square nor invertible and, additionally, the noise vector r is unknown. Hence, we cannot just determine x by multiplying both sides with the inverse of G. Instead, we will use a least squares method to determine x.
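As a preview of the least squares machinery developed in Chapter 3, the sketch below solves the model (1.4)–(1.5) numerically with NumPy. The wall coordinate, true position, and noise level are illustrative assumptions, not values from the text.

```python
import numpy as np

# Model y = G x + b + r from Eq. (1.4); x0 is the wall's corner coordinate.
x0 = 2.0
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)]])
b = np.array([0.0, 0.0, -x0 / np.sqrt(2.0)])

# Simulate noisy measurements of an assumed true position (px, py) = (3, 4).
rng = np.random.default_rng(0)
x_true = np.array([3.0, 4.0])
y = G @ x_true + b + 0.1 * rng.standard_normal(3)

# Least squares estimate: minimize ||(y - b) - G x||^2.
x_hat, *_ = np.linalg.lstsq(G, y - b, rcond=None)
print(x_hat)  # close to (3, 4)
```

Note that the code never inverts G; `lstsq` solves the over-determined system in the least squares sense, which is exactly what the non-invertibility of G calls for.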

1.3 Model of an Autonomous Car


Driver-assistance and self-driving systems in cars are also important applications of
sensor fusion. Let us now take a look at an autonomous car illustrated in Figure 1.4
where the car is determining its position by measuring the direction and/or range
(i.e., distance) to nearby landmarks whose positions are known. This kind of
measurement could be done, for example, with a camera-based computer vision
system (see, e.g., Hartley and Zisserman, 2003) that detects the landmarks in the
image and determines their direction and range. Another option to measure them
would be, for example, by using a radar (see, e.g., Richards et al., 2010).

Figure 1.4. Illustration of positioning of an autonomous car from measurements of relative locations of landmarks (e.g., traffic signs). When only ranges or directions of the landmarks are measured, the sensor fusion model becomes non-linear.

Even though, for example, a radar or a camera actually measures the direction and range from the car to the landmark, provided that both measurements are relatively accurate, a simple coordinate transformation converts them into a measurement of the landmark position in the car's coordinate system. Thus we get direct measurements of the relative positions of the landmarks with respect to the car. In this case, if the landmark positions are (sxj, syj) for j = 1, . . . , M, and the position of the car is (px, py), then the measurements are given as

y1 = sx1 − px + r1 ,
y2 = sy1 − py + r2 ,
..
.
y2M −1 = sxM − px + r2M −1 ,
y2M = syM − py + r2M , (1.6)

where the odd-numbered measurements correspond to the horizontal measurements and the even-numbered ones to the vertical measurements. We can again

rewrite the model in form (1.5) as follows:


[y1    ]   [−1   0 ]        [sx1]   [r1    ]
[y2    ]   [ 0  −1 ]        [sy1]   [r2    ]
[  ⋮   ] = [   ⋮   ] [px] + [ ⋮ ] + [  ⋮   ],   (1.7)
[y2M−1 ]   [−1   0 ] [py]   [sxM]   [r2M−1 ]
[y2M   ]   [ 0  −1 ]        [syM]   [r2M   ]

with the stacked terms denoted y, G, x, b, and r as before,

where the matrix G is again non-square and non-invertible. However, the least squares method is applicable to estimate the car's position x from the noisy measurements y.
Sometimes it happens that the sensor only measures the range from the car to the landmark and does not provide information on the direction. This can happen, for example, when a certain type of radar is used. When only range measurements are available, the range-only model can be written as

y1R = √((sx1 − px)² + (sy1 − py)²) + r1R,
⋮
yMR = √((sxM − px)² + (syM − py)²) + rMR,   (1.8)

and the aim of the estimation algorithm would be to determine the position (px, py) based on the measurements y1R, . . . , yMR in the presence of the measurement noises r1R, . . . , rMR. If we now define a function

(sx1 − px )2 + (sy1 − py )2
p 
 .. 
g(x) =  q . 
 (1.9)
y
(sxM − px )2 + (sM − py )2
T R T , and r = r R · · · r R T , then
as well as x = px py , y = y1R · · · yM

1 M
the model can be written as
y = g(x) + r, (1.10)
where g(x) is a non-linear function of x.
Alternatively, let us now consider a case where we only measure the direction (i.e., bearing) of the landmark with respect to the car, but not the distance. This can happen, for example, when we are using a camera without knowledge of the true size of the landmark. The measurement in this bearings-only model can be written as

y1D = atan2(sy1 − py, sx1 − px) + r1D,
⋮
yMD = atan2(syM − py, sxM − px) + rMD,   (1.11)

where atan2 is the full-quadrant inverse tangent function (e.g., the function called atan2 in Matlab and Python). This model again has the form (1.10) provided that we define

       [ atan2(sy1 − py, sx1 − px) ]
g(x) = [             ⋮             ]   (1.12)
       [ atan2(syM − py, sxM − px) ]

along with x = [px py]ᵀ, y = [y1D · · · yMD]ᵀ, and r = [r1D · · · rMD]ᵀ.
The range-only and bearings-only models above are examples of static non-
linear models that we cover in Chapter 4. The estimation of x in these models
will be possible by solving the nonlinear least squares problem with an iterative
optimization method.
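Chapter 4 covers such iterative optimization methods; as a preview, below is a minimal Gauss–Newton sketch for the range-only model (1.8). The landmark positions, true position, initial guess, and noise level are all illustrative assumptions, not values from the text.

```python
import numpy as np

# Range-only model (1.8): y_j = ||s_j - p|| + r_j for landmark positions s_j.
landmarks = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # illustrative

def g(x):
    # Stacked measurement function (1.9): distances to all landmarks.
    return np.linalg.norm(landmarks - x, axis=1)

def jacobian(x):
    # Row j is the gradient of ||x - s_j||, i.e. (x - s_j) / ||x - s_j||.
    return (x - landmarks) / g(x)[:, None]

# Simulate noisy ranges from an assumed true position.
rng = np.random.default_rng(1)
x_true = np.array([4.0, 3.0])
y = g(x_true) + 0.05 * rng.standard_normal(len(landmarks))

# Gauss-Newton iteration: x <- x + (J^T J)^{-1} J^T (y - g(x)).
x = np.array([1.0, 1.0])  # initial guess
for _ in range(20):
    J = jacobian(x)
    x = x + np.linalg.solve(J.T @ J, J.T @ (y - g(x)))
print(x)  # close to (4, 3)
```

With three non-collinear landmarks the ranges determine the position uniquely, so the iteration converges to the least squares estimate from a reasonable initial guess.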

1.4 Dynamic Models of Drone and Car


The models of the drone and the car in the previous sections are static in the sense that they do not model the time behaviour of the positions. Of course, if the position changes, then we can use new sensor measurements to recalculate the position. However, in this kind of approach, where we recompute the position from scratch whenever the device moves, we lose the information contained in the sensor measurements that we made before. In order to use sensor measurements over time, we need to build a dynamic model which ties the positions at different time points together.
One simple model can be constructed by assuming that the position x = [px py]ᵀ could change to any other position near the current position on the next time step. One way to model this is that if the position at time step n − 1 was xn−1, then the position at the next time step is given as

xn = xn−1 + qn,   (1.13)

where qn is a zero-mean random variable, which is typically modeled as Gaussian. The model says that we do not know where the next position will be, but closer positions have higher probability than positions further away. This model is also called a random walk model, because it corresponds to random steps in random directions that are taken at each time step transition. In terms of the components of the vector xn, this model can be written as

pxn = pxn−1 + q1,n,
pyn = pyn−1 + q2,n,   (1.14)

where we need to use two indices for denoting the vector index and the time step number.
Although a random walk model would be good for, say, a person or an animal, both the drone and the car have a definite velocity, which itself changes according to a random walk model. At the time step transition, the velocity times the time step length ∆t is added to the position coordinates, which leads to the model

pxn = pxn−1 + vxn−1 ∆t + q1,n,
pyn = pyn−1 + vyn−1 ∆t + q2,n,
vxn = vxn−1 + q3,n,
vyn = vyn−1 + q4,n,   (1.15)

where we have added noises q3,n and q4,n which act as the noises in the velocity components.
At this point it is now convenient to change the notation so that our state vector x not only contains the position coordinates but also the velocities, x = [px py vx vy]ᵀ. As the velocities are just the 3rd and 4th components of x, we can then rewrite the above model as

x1,n = x1,n−1 + x3,n−1 ∆t + q1,n,
x2,n = x2,n−1 + x4,n−1 ∆t + q2,n,
x3,n = x3,n−1 + q3,n,
x4,n = x4,n−1 + q4,n,   (1.16)

which can also be written in convenient matrix form as

[x1,n]   [1 0 ∆t  0] [x1,n−1]   [q1,n]
[x2,n]   [0 1  0 ∆t] [x2,n−1]   [q2,n]
[x3,n] = [0 0  1  0] [x3,n−1] + [q3,n],   (1.17)
[x4,n]   [0 0  0  1] [x4,n−1]   [q4,n]

where the terms are denoted xn, F, xn−1, and qn, respectively, that is,

xn = F xn−1 + qn.   (1.18)

The above model is an example of a discrete-time stochastic linear state-space model that we consider in Chapter 5.
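To get a feel for the model (1.17)–(1.18), the following sketch simulates a short trajectory; the time step, initial state, and noise level are illustrative assumptions.

```python
import numpy as np

# Constant-velocity model (1.17): state x = [px, py, vx, vy].
dt = 0.1
F = np.array([[1.0, 0.0, dt, 0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Simulate N steps of x_n = F x_{n-1} + q_n with small Gaussian noise.
rng = np.random.default_rng(2)
N = 100
x = np.array([0.0, 0.0, 1.0, 0.5])  # start at the origin, velocity (1, 0.5)
trajectory = [x]
for _ in range(N):
    q = 0.01 * rng.standard_normal(4)
    x = F @ x + q
    trajectory.append(x)
trajectory = np.array(trajectory)
print(trajectory[-1, :2])  # roughly (N*dt*1.0, N*dt*0.5) = (10, 5)
```

The noise perturbs both the positions and the velocities at every step, so the simulated path wanders around the straight line that the deterministic part of the model would produce.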
The above model could also be derived by considering the state as a function of a continuous-time variable x(t), with t being a time variable that can take any value t ≥ 0. The discrete-time model can then be derived as a discretization of the continuous-time model described as a differential equation with random input. For example, the first random walk model corresponds to the continuous-time model

dpx(t)/dt = w1(t),
dpy(t)/dt = w2(t),   (1.19)

where w1(t) and w2(t) are continuous-time white noises.



The model with the velocity components included can be derived as a discretization of the second order model

d²px(t)/dt² = w1(t),
d²py(t)/dt² = w2(t).   (1.20)

We will come back to these in Chapter 5.

1.5 Tracking a Moving Drone or Car


We can now combine the dynamic model of a drone or car in Section 1.4 with a measurement model of a drone or car in Section 1.2 or 1.3. For example, we can combine the dynamic model in (1.18) with a linear measurement model of the form (1.5), which leads to

xn = F xn−1 + qn,
yn = G xn + b + rn,   (1.21)

where we get a measurement of the state at each time step n while the state xn is changing according to the dynamic model. For this kind of linear (or, in fact, affine) dynamic model we can use the Kalman filter to compute a statistical estimate of the dynamic state xn at each time step n. This methodology is covered in Chapter 6.
We can also combine the dynamic model (1.18) with a non-linear measurement model of the form (1.10), which leads to a model of the form

xn = F xn−1 + qn,
yn = g(xn) + rn.   (1.22)

For these kinds of models we can use the extended Kalman filter (EKF), the unscented Kalman filter (UKF), or the particle filter, which will also be covered in Chapter 6. Additionally, these methods can be used when the dynamics are modeled with a nonlinear model instead of the linear model above.
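Although the Kalman filter is only covered in Chapter 6, the structure of the model (1.21) already suggests its predict/update form. Below is a minimal, illustrative sketch; the random-walk dynamics and all noise covariances are assumptions made for the example, not values from the text.

```python
import numpy as np

def kalman_step(m, P, y, F, Q, G, b, R):
    """One predict/update step of the Kalman filter for a model like (1.21)."""
    # Prediction through the dynamic model x_n = F x_{n-1} + q_n.
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update with the measurement y_n = G x_n + b + r_n.
    S = G @ P_pred @ G.T + R              # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - G @ m_pred - b)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

# Illustrative run: random-walk dynamics (F = I) with direct, noisy
# position measurements (G = I, b = 0).
rng = np.random.default_rng(3)
F, Q = np.eye(2), 0.01 * np.eye(2)
G, b, R = np.eye(2), np.zeros(2), 0.5 * np.eye(2)
x_true = np.array([3.0, 4.0])
m, P = np.zeros(2), np.eye(2)
for _ in range(50):
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = G @ x_true + b + rng.multivariate_normal(np.zeros(2), R)
    m, P = kalman_step(m, P, y, F, Q, G, b, R)
print(m)  # tracks the (slowly drifting) true position
```

The estimate m ends up much closer to the true position than any single measurement, which is precisely the point of fusing measurements over time.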
Chapter 2

Sensors, Models, and Least Squares Criterion

2.1 Sensors
A sensor is a device that provides a measurement related to the quantity of in-
terest. Usually, a sensor is implemented as a device which converts a physical
phenomenon into an electrical signal (Wilson, 2005) which is then further trans-
formed into digital form. The measurement may be direct or indirect: In direct
measurements, the quantity observed by the sensor is the quantity of interest, for
example, the temperature in a thermometer. Sensors that measure indirectly pro-
vide a measurement of a quantity that is only related to the quantity of interest. For
example, we might only be able to measure the distance to an object when we are
actually interested in the position. The indirect measurement can also be related
to dynamical properties of the quantity, for example, an accelerometer measures a
body’s acceleration which is the second derivative of the position.
There are many different types of sensors available today for measuring a wide range of phenomena. These include electronic sensors, (micro-)mechanical sensors, and virtual sensors, and a (non-exhaustive) overview of a few commonly used sensors, their measurements, and some of their application areas is shown in Table 2.1.
Sensors are characterized by many different properties that affect their performance. These include limitations such as measurement range, requirements on and sensitivity to the environmental conditions (temperature, humidity, etc.), sampling frequency, as well as measurement noise, biases, and drifts. Some of these factors have to be taken into account when designing the hardware, whereas others can be accurately dealt with when designing the estimation algorithm. In this course, we will focus on the latter aspects of the sensors, that is, we will introduce tools that are appropriate for taking into account the properties that can be handled by sensor fusion.


Table 2.1. Examples of common sensors

Sensor          Measurement               Application examples
--------------  ------------------------  --------------------------------------------------------
Accelerometer   Gravity, acceleration     Inertial navigation, activity tracking, screen rotation
Gyroscope       Rotational velocity       Inertial navigation, activity tracking
Magnetometer    Magnetic field strength   Inertial navigation, digital compass, object tracking
Radar           Range, bearing, speed     Target tracking, autonomous vehicles
LIDAR           Range, bearing, speed     Target tracking, autonomous vehicles, robotics
Ultrasound      Range                     Robotics
Camera          Visual scene              Security systems, autonomous vehicles, robotics
Barometer       Air pressure              Inertial navigation, autonomous vehicles, robotics
GNSS            Position                  Autonomous vehicles, aerospace applications
Strain gauge    Strain                    Condition monitoring, scales

The measurement obtained by a sensor will be denoted by the symbol yn (scalar) or yn (vector, in bold). The subscript n denotes the nth measurement, which may refer to an arbitrary measurement index, a step number in a time series, or a sensor identification number in a sensor network, depending on the context. The regular notation (yn) is used to denote scalar measurements (e.g., from a temperature sensor), whereas the bold symbol (yn) denotes a vector observation (e.g., as measured by a tri-axial accelerometer).

2.2 Models

A model is a mathematical formulation that relates the quantity of interest to the measurements in a systematic way. Additionally, in the dynamic case, models are also used to describe how the quantity of interest evolves over time. We already saw examples of models in Chapter 1.
As mentioned earlier, sensors suffer from several limitations that we need to account for in models. In particular, it is important to take uncertainties in the measurements, such as sensor noise or measurement noise, into account. Unfortunately, measurement noise is difficult to quantify analytically. Instead, a suitable approach is to use statistical modeling. It is based on the observation that the measurement noise exhibits statistical properties that can be quantified. While the actual value of the noise cannot be observed or measured, its statistical properties can, and this allows us to design sensor fusion methods that are able to (statistically) minimize the effect of the noise. A common assumption is that the noise is zero-mean, that is, on average it is zero (even though each individual realization is not).

2.2.1 Basic Model


Next, we introduce a very simplistic model, which turns out to be quite generic despite its simplicity. First, assume that a scalar measurement yn of the quantity of interest x (e.g., the position) is available. The objective is then to relate the measurement yn to the unknown quantity x such that the uncertainty is taken into account. We know that the measurement might be some function of the unknown quantity plus measurement noise. Hence, we can write the model as

yn = gn(x) + rn.   (2.1)

Equation (2.1) is called a sensor model, measurement model, or observation model. The generic anatomy of a measurement model is that

• the measured quantity yn is found on the left-hand side of the equation¹, and

• on the right-hand side there are two terms: a function gn(x) that relates the quantities of interest x to the measurement yn, and a noise term rn.

As mentioned above, the measurement noise rn is assumed to be a random variable. As such, rn follows some kind of probability density function (pdf), that is, we have that

rn ∼ p(rn),   (2.2)

which reads as "rn is distributed according to p(rn)", where p(rn) denotes the corresponding pdf. At this point, we will not concern ourselves with any particular form of p(rn) but only assume that the rn are zero-mean, independent random variables with variance σ²r,n. In other words, we have that

E{rn} = 0,   (2.3a)
var{rn} = E{r²n} − (E{rn})² = σ²r,n,   (2.3b)
Cov{rm, rn} = E{rm rn} − E{rm} E{rn} = 0, (m ≠ n),   (2.3c)

where E{rn} and var{rn} denote the expected value and variance of rn, respectively. Note that the measurement noise variance may vary for the different measurements, as indicated by the subscript n on σ²r,n (e.g., if the different measurements are obtained by different sensors).

¹There are more general formulations than (2.1) that do not share this trait, but we will not make use of these in these notes.
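The assumptions (2.3) can be checked empirically by simulation. The sketch below draws a large number of independent zero-mean noise samples (with an illustrative σ = 0.3) and verifies that the sample mean, variance, and cross-covariance behave as stated.

```python
import numpy as np

# Empirically check the noise assumptions (2.3): zero mean, variance
# sigma^2, and zero cross-covariance between different measurements.
rng = np.random.default_rng(4)
sigma = 0.3  # illustrative noise standard deviation
r = sigma * rng.standard_normal((100_000, 2))  # columns play r_m and r_n

mean = r.mean(axis=0)                                  # should be near 0
var = r.var(axis=0)                                    # should be near sigma^2
cross_cov = np.mean(r[:, 0] * r[:, 1]) - mean[0] * mean[1]  # should be near 0
print(mean, var, cross_cov)
```

With 100 000 samples, the sample statistics match (2.3a)–(2.3c) to within a few thousandths, even though any single realization of the noise is, of course, nonzero.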

2.2.2 Vector Model

The basic model only considers scalar measurements. However, many sensors actually provide vector-valued measurements; for example, accelerometers provide full 3D-acceleration measurements at each sampling instant. The basic model (2.1) can easily be extended to account for such vector-valued measurements. In this case, the vector measurement model can be written as

yn = gn(x) + rn,   (2.4)

where yn is a dy-dimensional column vector. In this case, the measurement noise rn becomes a multivariate random variable with pdf

rn ∼ p(rn).   (2.5)

As for the scalar case, p(rn) can have any arbitrary form. For simplicity, we again assume that the rn are zero-mean, independent random variables with covariance matrix Rn, that is,

E{rn} = 0,   (2.6a)
Cov{rn} = E{rn rnᵀ} − E{rn} E{rn}ᵀ = Rn,   (2.6b)
Cov{rm, rn} = E{rm rnᵀ} − E{rm} E{rn}ᵀ = 0, (m ≠ n).   (2.6c)

As can be seen, the scalar model in (2.1) is just a special case of the vector model in (2.4) where the measurement and noise are scalars (i.e., dy = 1). Hence, it is often more general to work with the vector model, but sometimes more instructive to work with the scalar model.

2.2.3 Multiple Measurements and Measurement Stacking

In order to be able to fuse the measurements of one or more sensors, we need multiple measurements. These may either be from one sensor at different time instants (repeated measurements), from different sensors, or both. If we have obtained a total of N measurements y1, y2, . . . , yN, we will refer to the set of all measurements using the notation y1:N = {y1, y2, . . . , yN} for the scalar case and y1:N = {y1, y2, . . . , yN} for the vector case.
Sometimes it is also helpful to write the complete set of measurements together with the measurement model more compactly. This can be achieved by stacking the measurements into a single measurement vector, which yields a much more compact model of the form

y = g(x) + r.   (2.7)

For scalar measurements of the form (2.1), we have that

    [y1]          [g1(x)]           [r1]
y = [y2],  g(x) = [g2(x)],  and r = [r2].   (2.8)
    [ ⋮]          [  ⋮  ]           [ ⋮]
    [yN]          [gN(x)]           [rN]

Furthermore, the overall noise covariance R for this model is a diagonal matrix of the form

               [ σ²r,1    0    · · ·    0   ]
               [   0    σ²r,2           ⋮   ]
R = Cov{r} =   [   ⋮            ⋱       0   ].   (2.9)
               [   0    · · ·   0    σ²r,N  ]

For vector measurements, y, g(x), and r are constructed in the same way as
for the scalar case in (2.8), that is,

y = [y1^T, y2^T, . . . , yN^T]^T, g(x) = [g1(x)^T, . . . , gN(x)^T]^T, and r = [r1^T, . . . , rN^T]^T. (2.10)
The measurement noise covariance is a block-diagonal matrix with the individual
covariance matrices Rn on the diagonal, that is,

R = Cov{r} = blkdiag(R1, R2, . . . , RN). (2.11)
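As a concrete sketch of this stacking (the numbers below are made up for illustration and do not come from any example in the notes), the stacked measurement vector of (2.7) and the block-diagonal covariance of (2.11) can be built as follows:

```python
import numpy as np

# Two vector measurements (dy = 2) with their noise covariances.
y1, R1 = np.array([1.0, 2.0]), np.diag([0.1, 0.2])
y2, R2 = np.array([1.5, 1.8]), np.diag([0.3, 0.1])

# Stack the measurements into a single vector y = [y1; y2].
y = np.concatenate([y1, y2])

# Build the block-diagonal joint covariance R = blkdiag(R1, R2).
R = np.zeros((4, 4))
R[:2, :2] = R1
R[2:, 2:] = R2

print(y.shape, R.shape)  # (4,) (4, 4)
```

For many blocks, `scipy.linalg.block_diag` does the same assembly in one call.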

2.2.4 Gaussian Measurement Noise


A common assumption is that the measurement noise is zero-mean Gaussian noise.
In this case, the pdf of the noise is the (multivariate) Gaussian distribution given by

p(r) = (2π)^−M/2 |R|^−1/2 exp(−(1/2) rT R−1 r), (2.12)

where |R| denotes the matrix determinant and we have assumed that the measurement
noise is a vector with M dimensions. Equation (2.12) may also be written
more compactly as

p(r) = N (r; 0, R). (2.13)
More generally, N (z; µ, Σ) denotes the multivariate pdf of an M -dimensional
Gaussian random variable z with mean µ and covariance matrix Σ defined as

N (z; µ, Σ) = (2π)^−M/2 |Σ|^−1/2 exp(−(1/2) (z − µ)T Σ−1 (z − µ)). (2.14)
Assuming the measurement noise to be Gaussian is indeed reasonable in many
(but not all) applications. Furthermore, it also has some important implications
on the algorithms that we will derive, in particular with respect to the estimators’
statistical properties. However, we will not focus on these aspects throughout most
of this course.
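For reference, the density (2.14) is straightforward to evaluate numerically; the sketch below is a minimal implementation (in practice, a library routine such as `scipy.stats.multivariate_normal` would typically be used instead):

```python
import numpy as np

def gaussian_pdf(z, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(z; mu, Sigma) of (2.14)."""
    M = len(mu)
    d = z - mu
    norm = (2 * np.pi) ** (M / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

# Sanity check: the 1D standard normal at zero is 1/sqrt(2*pi) ≈ 0.3989.
p = gaussian_pdf(np.array([0.0]), np.array([0.0]), np.eye(1))
print(p)
```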

2.3 Least Squares Methods


The estimation algorithm or inference algorithm combines all the available mea-
surements by using the measurement models to estimate the parameters (or states)
in a way that is optimal with respect to some criterion — which is defined by a
cost function. The cost function is here selected to be the least squares cost func-
tion. By combining the information from multiple measurements and sensors, we
can exploit the diversity of the measurements to obtain a better parameter esti-
mate. Hence, sensor fusion algorithms can essentially be seen as an application of
estimation theory, which provides a statistically sound and flexible framework.

2.3.1 Minimization of Cost Functions


In this course (with the exception of the particle filtering method introduced in Sec-
tion 6.6), we sacrifice the generality and rigor of estimation theory for simplicity
and only consider algorithms that minimize a cost function J(x). These types of
algorithms are a good starting point for many sensor fusion applications and have
historically proved useful. They also have sound statistical interpretations. How-
ever, we will not discuss these explicitly, but for the interested student, there are a
couple of exercises that delve into this topic.
Mathematically, minimizing a cost function J(x) to find an estimate of the
quantity of interest x̂ (the hat indicates that x̂ is an estimate of the unknown quan-
tity x), can be written as the optimization problem

x̂ = arg min_x J(x), (2.15)

which reads as “the estimate x̂ is the argument x that minimizes J(x)”. Typical
cost functions minimize a function of the error

en = yn − gn (x) (2.16)

between the measurement and the output predicted by the estimate. Two typical
functions are the absolute error (Figure 2.1a)

|en | = |yn − gn (x)|, (2.17)

which penalizes all errors equally, and the quadratic cost function (Figure 2.1b)

en² = (yn − gn(x))², (2.18)

which favors smaller errors over larger ones.

2.3.2 Least Squares


In practice, the quadratic cost is much more common. It implicitly imposes a
certain smoothness onto the problem by penalizing large errors more than small

Figure 2.1. Examples of cost functions: (a) Absolute error, and (b) quadratic error.

ones. Additionally, it is closely related to the case where the measurement noise is
assumed to be Gaussian as discussed in Section 2.2. Minimizing the quadratic cost
function is also referred to as the least squares method. As we will see, many of
the most common estimation algorithms can be formulated based on this approach.
Typically, the aim of the estimation algorithm is to minimize the error
over all measurements y1:N rather than only a single measurement yn. Hence, the
resulting cost function is the sum over all the errors en. For the quadratic
cost (2.18), the cost function is thus
JLS(x) = ∑_{n=1}^{N} en² = ∑_{n=1}^{N} (yn − gn(x))². (2.19)

Similarly, for vector measurements, the quadratic error of a single measure-


ment can be written as

en² = (yn − gn(x))T (yn − gn(x)), (2.20)

and thus, the cost function becomes

JLS(x) = ∑_{n=1}^{N} (yn − gn(x))T (yn − gn(x)). (2.21)

Finally, we can write (2.19) and (2.21) more compactly by using the vector
notation introduced in Section 2.2. This yields the compact cost function

JLS (x) = (y − g(x))T (y − g(x)). (2.22)

A method that determines the estimate x̂ by minimizing the quadratic cost function
in Equation (2.22) is called a least squares (LS) method. The formulations (2.19)
or (2.21) and (2.22) are equivalent. Hence, we will use these formulations inter-
changeably, depending on the context.
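As a small illustration of (2.22), the sketch below evaluates the least squares cost for a toy linear model g(x) = Gx; the matrices and measurements are made up purely for illustration:

```python
import numpy as np

def J_LS(y, g, x):
    """Least squares cost (2.22): squared norm of the residual y - g(x)."""
    e = y - g(x)
    return e @ e

# Toy linear model g(x) = G x with made-up numbers.
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.1])
cost = J_LS(y, lambda x: G @ x, np.array([1.0, 2.0]))
print(cost)  # small residual cost, about 0.01
```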

2.3.3 Weighted Least Squares


The least squares method assumes that all measurements are equally reliable. In
other words, each measurement is weighed into the estimate equally. However, not
every measurement might actually carry the same amount of information: Some
measurements might be more important (in some sense) than others and thus,
should contribute more to the solution than others.
This can be achieved by using the weighted least squares (WLS) cost func-
tion instead. By introducing the positive weights wn for each term in (2.19), the
weighted least squares cost function for the scalar case becomes

JWLS(x) = ∑_{n=1}^{N} wn (yn − gn(x))². (2.23)

Similarly, for the vector case (2.21), we can introduce a positive definite²
weighting matrix Wn and weight each term using this matrix. The resulting cost
function becomes

JWLS(x) = ∑_{n=1}^{N} (yn − gn(x))T Wn (yn − gn(x)). (2.24)

Again, (2.23)–(2.24) can be written using the compact notation as

JWLS (x) = (y − g(x))T W(y − g(x)), (2.25)

where the overall weighting matrix is given by either

W = diag(w1, w2, . . . , wN) or W = blkdiag(W1, W2, . . . , WN)

for the scalar or vector case, respectively. The forms (2.24)–(2.25) are indeed very
general and they will be the basis for many of the estimation algorithms derived in
this course.
An important question is the choice of the weights wn (or Wn ). This can, in
principle, be arbitrary as long as they are positive (or positive definite). In practice,
however, a principled choice is to use
wn = 1/σr,n² and Wn = Rn^−1,

that is, use the inverse (co)variance of the measurement noise as the weighting fac-
tor. The rationale behind this choice is as follows: The covariance is a measure
² That is, a symmetric matrix M for which it holds that xT Mx > 0 for any arbitrary non-zero vector x. The eigenvalues of this kind of matrix are all positive.

for the amount of uncertainty in our measurement due to the measurement noise.
A large covariance means that there is a large uncertainty and vice-versa. Hence,
by weighing the measurements using the inverse of the covariance, measurements
with high uncertainty are given lower weights and measurements with low uncer-
tainty are given higher weights.
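This inverse-covariance weighting can be illustrated with a small sketch (toy numbers; the function implements the compact cost (2.25) with W = R−1):

```python
import numpy as np

def J_WLS(y, g, x, R):
    """Weighted least squares cost (2.25) with the weighting W = R^{-1}."""
    e = y - g(x)
    return e @ np.linalg.solve(R, e)

# Two scalar measurements of the same quantity x; the second sensor is much
# noisier and therefore contributes far less to the cost.
y = np.array([1.0, 2.0])
R = np.diag([0.1, 10.0])
g = lambda x: np.array([x[0], x[0]])
cost = J_WLS(y, g, np.array([1.0]), R)
print(cost)  # dominated by the accurate sensor
```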

2.3.4 Regularized Least Squares


In plain (weighted) least squares, we find the estimate that best explains the measurements,
that is, we select the estimate that minimizes the cost function.
However, sometimes it is also useful to regularize the estimate itself by, for example,
forcing it to be "small" or to lie in some predefined part of the space. This is advantageous
in particular when the estimate is not unique; in that case, a regularization
term can be used to select the estimate that corresponds to the region where we believe
the estimate should be. This kind of "prior belief" is called "prior information",
and indeed, regularization terms correspond to prior distributions of the unknown
quantity in estimation theory.
The general form of regularized least squares (ReLS) that we use is

JReLS (x) = (y − g(x))T R−1 (y − g(x)) + (x − m)T P−1 (x − m), (2.26)

which has a quadratic regularization term defined by the parameters m and P. We
have deliberately selected the weighting term to correspond to the inverse measurement
noise covariance, W = R−1, because in that case the regularization term can
be interpreted as a Gaussian prior distribution with mean m and
covariance P. That is, our prior belief is that the estimate should be distributed
around m with covariance P. As the covariance P goes to
infinity, the prior belief vanishes and the problem reduces to an ordinary (weighted)
least squares problem.
Chapter 3

Static Linear Models

In this chapter, we focus on static, linear models. These models are already quite versatile
and have several important properties, one being that we can find a closed-form
estimation algorithm.

3.1 Linear Model


3.1.1 Scalar Model
The simplest measurement model arises when the measurement is a scaled and
noisy version of a scalar quantity of interest. In this case, the scalar measurement
yn can be written as
yn = gx + rn , (3.1)
where we assume that the scale factor g is known.

Example 3.1. To measure the wall-distance of the drone in Figure 1.2, we could
use, for example, a radar, which essentially measures the time difference between the
sent signal and the signal reflected from the wall. This time difference is ideally
equal to twice the distance divided by the speed of light, τ = 2px/c, where
c = 299792458 m/s. If we repeat this measurement N times, and each measurement
is corrupted with noise, then the measurement model is

yn = (2/c) px + rn, n = 1, . . . , N, (3.2)

which has the form (3.1) with g = 2/c.

With (3.1) in mind, we can construct similar models where the scalar measure-
ment linearly depends on a set of K unknown quantities x1 , . . . , xK :

yn = g1 x1 + g2 x2 + · · · + gK xK + rn , (3.3)


with known scale factors g1, . . . , gK and noise term rn. We can rewrite (3.3) in
matrix form as

yn = [g1 g2 · · · gK] [x1 x2 · · · xK]^T + rn = g x + rn, (3.4)

where g = [g1 g2 · · · gK] is the row vector of known constants and
x = [x1 x2 · · · xK]^T is the vector of unknowns. For further generality, we can
replace g above with gn, which allows for a different set of coefficients for each
measurement and yields

yn = gn x + rn. (3.5)
For a total of N measurements y1 , y2 , . . . , yN , we can use measurement stacking
as discussed in Section 2.2 to obtain the compact model

y = Gx + r, (3.6)

where y and r are the stacked measurements and measurement noise, respectively.
Moreover,

    [ g1 ]
G = [ g2 ]   (3.7)
    [ ⋮  ]
    [ gN ]

is a matrix that contains the known factors g1, . . . , gN.

3.1.2 Vector Model


The scalar model can also be extended to the case where each measurement itself
is a vector. In this case, a single vector-valued measurement yn can be written as

     [ g11   g12  · · ·  g1K ] [ x1 ]
yn = [ g21   g22  · · ·  g2K ] [ x2 ] + rn.   (3.8)
     [  ⋮     ⋮    ⋱     ⋮   ] [ ⋮  ]
     [ gdy1  gdy2 · · · gdyK ] [ xK ]
We can now collect the terms gij into a matrix and rewrite the above model in the form

yn = Gn x + rn, (3.9)

where Gn contains the collected terms and, similarly to the previous section, we have
already allowed it to vary for each measurement.
Again, by stacking the measurements y1 , y2 , . . . , yN into a single vector y, we
obtain the compact model
y = Gx + r, (3.10)

where the matrix G is given by

    [ G1 ]
G = [ G2 ] .  (3.11)
    [ ⋮  ]
    [ GN ]
Comparing (3.6) and (3.10), it can be seen that both the scalar and the vector
model can be written in the same form, the only difference being the matrix G.
Furthermore, that common model is completely linear in all the unknowns x and
thus, this model is called the linear model.

3.1.3 Affine Models


An important class of models is a slight extension of the linear model, called the affine
model, where the model includes an additional known bias b:

y = Gx + b + r. (3.12)

Fortunately, it turns out that the same algorithms that are used for linear models
(without a bias) can also be directly used for models with a bias. This is because we
can rewrite the above model as

y − b = Gx + r, (3.13)

and consider ỹ = y − b as the measurement in a linear model. That is, we
start our estimation procedure by computing ỹ by subtracting the biases from the
measurements and then apply a linear model algorithm to the modified model

ỹ = Gx + r. (3.14)

Example 3.2. The drone model in Equations (1.3) and (1.4) has this kind of form.
Thus, we can convert the model into linear model by defining de-biased measure-
ments as
ỹ1 = y1,
ỹ2 = y2,
ỹ3 = y3 + (1/√2) x0, (3.15)

which reduces (1.4) to

[ ỹ1 ]   [  1      0   ]          [ r1 ]
[ ỹ2 ] = [  0      1   ] [ px ] + [ r2 ],   (3.16)
[ ỹ3 ]   [ 1/√2   1/√2 ] [ py ]   [ r3 ]

where the stacked quantities are ỹ, G (applied to x = [px py]^T), and r, respectively,

which is indeed a linear model.



Example 3.3. For the autonomous car model in Equations (1.6) and (1.7) we can
eliminate the bias by defining

ỹ1 = y1 − sx1 ,
ỹ2 = y2 − sy1 ,
..
.
ỹ2M −1 = y2M −1 − sxM ,
ỹ2M = y2M − syM . (3.17)
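The de-biasing step itself is a one-liner; the sketch below uses made-up landmark positions and measurements purely for illustration (stacked as in (3.17)):

```python
import numpy as np

# Made-up landmark coordinates, stacked as [sx1, sy1, sx2, sy2] (M = 2).
s = np.array([2.0, 6.0, 0.0, 0.0])

# Raw measurements y = [y1, y2, y3, y4].
y = np.array([2.5, 3.9, 0.4, -2.1])

# De-biased measurements (3.17): subtract the landmark coordinates.
y_tilde = y - s
print(y_tilde)
```

After this step, the standard linear model algorithms apply to ỹ directly.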

3.2 Linear Least Squares


The linear model introduced in the previous section in (3.10) can be seen as a
system of linear equations. Unfortunately, in addition to the K unknown parameters
of interest in x, there are also N dy unknown noise values, which yields a total
of K + N dy unknowns. Since we only have N dy measurements (≈ equations),
the equation system is under-determined and cannot be solved. Even adding more
measurements does not solve this problem, since every new measurement adds
new unknowns (the noise values).
Fortunately, we have already introduced an alternative approach: Minimizing
the error cost as discussed in Section 2.3. In this section, we will pursue the least
squares approach for the linear model introduced in the previous section.

3.2.1 Scalar Model


We start our discussion based on the scalar model (3.1). For this model, the error
for each measurement is given by

en = yn − gx. (3.18)

Hence, the least squares cost function becomes

JLS(x) = ∑_{n=1}^{N} (yn − gx)². (3.19)

The least squares estimate x̂LS is the value of x that minimizes the cost func-
tion (3.19). The minima of the cost function with respect to x are found by setting
the derivative of JLS (x) (with respect to x) to zero and solving it for x.

The derivative is given by

∂JLS(x)/∂x = ∂/∂x ∑_{n=1}^{N} (yn − gx)²
           = ∑_{n=1}^{N} −2g (yn − gx)
           = −∑_{n=1}^{N} 2g yn + ∑_{n=1}^{N} 2g² x
           = −2g ∑_{n=1}^{N} yn + 2N g² x. (3.20)

Setting (3.20) to zero yields

0 = −2g ∑_{n=1}^{N} yn + 2N g² x,

and solving for x finally yields the estimator

x̂LS = (1/(N g)) ∑_{n=1}^{N} yn. (3.21)

Two important aspects of an estimator are its mean and variance. The mean
indicates whether an estimator (on average) converges to the true parameter value
x and the covariance indicates how confident we are about the estimated value. For
the scalar least squares estimator in (3.21), the mean is given by
N
( )
1 X
E{x̂LS } = E yn
Ng
n=1
N
1 X
= E{gx + rn }
Ng
n=1
N
1 X
= gx + E{rn }
Ng
n=1
N
1 X
= N gx + E{rn }
Ng
n=1
N
X
=x+ E{rn }.
n=1

Hence, if (and only if) the measurement noise is zero-mean, that is, E{rn } = 0,
the mean reduces to
E{x̂LS } = x. (3.22)

In this case, x̂ converges to the true parameter value and the estimator is said to be
an unbiased estimator. Furthermore, for the zero-mean noise case, the variance is
given by

var{x̂LS} = E{(x̂LS − E{x̂LS})²}
          = E{((1/(N g)) ∑_{n=1}^{N} yn − x)²}
          = E{((1/(N g)) ∑_{n=1}^{N} (gx + rn) − x)²}
          = E{((1/(N g)) (N gx + ∑_{n=1}^{N} rn) − x)²}
          = E{((1/(N g)) ∑_{n=1}^{N} rn)²}
          = (1/(N² g²)) E{(∑_{n=1}^{N} rn)²},

and finally, assuming that var{rn} = σ² and that the individual rn's are independent,
we get

var{x̂LS} = σ²/(N g²). (3.23)
The error in the estimator is measured by the standard deviation, which is the
square root of the variance:

std{x̂LS} = σ/(√N |g|), (3.24)

which shows that the standard error of the estimate diminishes as 1/√N with the
number of measurements. Given the estimator and its standard deviation we can
also compute the confidence interval of (loosely speaking) the true parameter value.
The typically used confidence interval is the 95% interval

[x̂LS − 1.96 std{x̂LS }, x̂LS + 1.96 std{x̂LS }], (3.25)

which under certain conditions (Gaussianity, etc.) contains the true parameter
value with a probability of 95%. For visualization purposes the coefficient 1.96
is often replaced with constant 2 as it is an approximation anyway. Furthermore, in
multivariate case the formula for confidence interval is a bit more complicated, but
using the multiplier 2 still provides a reasonable approximation to the confidence
interval.

Example 3.4. Let us consider the wall-distance measurement model in Example 3.1
and assume that we estimate px by using (3.21) with N measurements:

x̂1 = (c/(2N)) ∑_{n=1}^{N} yn. (3.26)

This estimator is unbiased. Further assume that the standard deviation of the
measurement is σ = 10^−9 s (1 nanosecond). Then by (3.24), the standard deviation
of the estimator is

std{x̂1} = (1/√N) (cσ/2). (3.27)

With a single measurement (N = 1) we get an error of 15 cm, whereas by averaging
100 measurements the error drops to 1.5 cm.
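A quick numerical check of Example 3.4 (the Monte Carlo part simulates a hypothetical 5 m wall distance):

```python
import numpy as np

c = 299792458.0      # speed of light [m/s]
sigma = 1e-9         # measurement noise standard deviation [s]

# Theoretical standard deviation of the estimator, Eq. (3.27).
std_est = lambda N: c * sigma / (2 * np.sqrt(N))
print(std_est(1), std_est(100))  # ~0.15 m and ~0.015 m

# Monte Carlo check: estimate a 5 m wall distance from N = 100 samples.
rng = np.random.default_rng(0)
px, N = 5.0, 100
y = 2 * px / c + sigma * rng.standard_normal(N)
x_hat = c / (2 * N) * np.sum(y)
print(x_hat)  # close to 5 m, within a few centimeters
```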

3.2.2 General Linear Model


We can now generalize the result (3.21) for general linear models of the
form (3.10). Using the vector form of the least squares criterion in (2.22) together
with the model (3.10) yields the cost function

JLS (x) = (y − Gx)T (y − Gx). (3.28)

To minimize this cost function, we can follow the same steps as in the scalar case.
First, we calculate the gradient of (3.28) with respect to the parameters x. Since
x is a vector, this can be done by calculating the partial derivative with respect to
each individual entry xk in x individually. A more compact alternative is to use
vector calculus. In the vector calculus notation, we have

∂JLS(x)/∂x = [∂JLS(x)/∂x1, ∂JLS(x)/∂x2, . . . , ∂JLS(x)/∂xK]^T, (3.29)

which is the gradient of JLS(x), in some literature denoted "∇JLS(x)".
We now have the following computation rules for the vector derivatives (Petersen
and Pedersen, 2012):

∂(xT a)/∂x = ∂(aT x)/∂x = a,
∂(xT Ax)/∂x = 2Ax (for symmetric A),

which then gives

∂JLS(x)/∂x = ∂/∂x (y − Gx)T (y − Gx)
           = ∂/∂x (yT y − yT Gx − xT GT y + xT GT Gx)
           = −2GT y + 2GT Gx.

Then, setting the gradient to zero and solving for x yields

x̂LS = (GT G)−1 GT y (3.30)

for the least squares estimator. Note that (3.30) is valid both for scalar measurements
of the form (3.3) and for vector measurements of the form (3.9), with the appropriate
choice of the matrix G.
Similar to the scalar case, we can calculate the statistical properties of the
estimator (3.30). The mean is found from

E{x̂LS } = E{(GT G)−1 GT y}


= E{(GT G)−1 GT (Gx + r)}
= E{(GT G)−1 GT Gx + (GT G)−1 GT r}
= x + (GT G)−1 GT E{r},

and we can see that if (and only if) E{r} = 0, we have that

E{x̂LS } = x, (3.31)

and the estimator is unbiased. Furthermore, for the zero-mean noise case, the
covariance can be derived from

Cov{x̂LS} = E{(x̂LS − E{x̂LS})(x̂LS − E{x̂LS})T}
          = E{((GT G)−1 GT (Gx + r) − x)((GT G)−1 GT (Gx + r) − x)T}
          = E{((GT G)−1 GT r)((GT G)−1 GT r)T}
          = (GT G)−1 GT E{rrT} G ((GT G)−1)T,

and is given by

Cov{x̂LS} = (GT G)−1 GT RG ((GT G)−1)T. (3.32)
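The estimator (3.30) and its covariance (3.32) translate directly into code; the sketch below uses toy matrices (in practice, `np.linalg.lstsq` or a QR factorization is numerically preferable to forming GT G explicitly):

```python
import numpy as np

def linear_ls(G, y, R):
    """Least squares estimate (3.30) and its covariance (3.32)."""
    GtG_inv = np.linalg.inv(G.T @ G)
    x_hat = GtG_inv @ G.T @ y
    cov = GtG_inv @ G.T @ R @ G @ GtG_inv.T
    return x_hat, cov

# Toy problem: three measurements of two unknowns, unit noise covariance.
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_true = np.array([1.0, 2.0])
y = G @ x_true            # noiseless data for illustration
x_hat, cov = linear_ls(G, y, np.eye(3))
print(x_hat)  # recovers [1, 2] in the noiseless case
```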

Example 3.5. Let us recall the autonomous car shown in Figure 1.4 and its model
in Equations (1.6) and (1.7). The positions of the landmarks are

sx1 = 2, sx2 = 0, sx3 = 10, sx4 = 7,
sy1 = 6, sy2 = 0, sy3 = 2, sy4 = 8. (3.33)

Figure 3.1. Least squares estimation example.

The joint measurement noise covariance matrix is a diagonal matrix

R = diag(R1x , R1y , R2x , R2y , R3x , R3y , R4x , R4y ), (3.34)

where the individual variances are given as

R1x = 1, R2x = 0.1, R3x = 0.8, R4x = 0.5,
R1y = 0.3, R2y = 0.5, R3y = 2, R4y = 1.5. (3.35)

Furthermore, in order to use the least squares equations in this section, we
need to de-bias the measurements using (3.17). Figure 3.1 shows the least squares
estimate of the autonomous car position (red dot) and its confidence ellipse (red
ellipse) when the relative locations of the landmarks (orange) were measured with
the standard deviations given above. The true position is shown in blue.

3.2.3 Weighted Linear Least Squares


As mentioned in Section 2.3, it is sometimes desirable to use a weighted least
squares criterion rather than the least squares criterion to take different levels of
certainty into account. In this case, the cost function (2.25) with W = R−1 is
used together with the general linear model (3.10), which yields

JWLS (x) = (y − Gx)T R−1 (y − Gx). (3.36)

Following the same steps for the derivation as for the least squares case, we obtain
the weighted least squares estimator given by

x̂WLS = (GT R−1 G)−1 GT R−1 y. (3.37)

Comparing the linear least squares estimator in (3.30) and the weighted version
in (3.37), we see that the former is a special case of the latter if and only if the
covariance matrix is proportional to the identity matrix, that is, R = σr² I.

Figure 3.2. Weighted least squares estimation example.

Similarly, the mean and covariance of the weighted least squares estimator can
be shown to be

E{x̂WLS} = x + (GT R−1 G)−1 GT R−1 E{r}
         = x

and

Cov{x̂WLS} = (GT R−1 G)−1 GT R−1 R R−1 G (GT R−1 G)−1
           = (GT R−1 G)−1, (3.38)

respectively.
As discussed in Section 2.3, it is possible to make other choices for the weigh-
ing matrix W. However, it can be shown that choosing the inverse noise covariance
matrix R−1 is the optimal choice for W in the sense that it minimizes the trace of
the covariance of the estimator.

Example 3.6. Figure 3.2 shows the weighted least squares estimate of the
autonomous car position using the same data as in Example 3.5. It can be seen
that the confidence ellipse is now smaller and the estimate is closer to the true
position.
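A minimal sketch of the weighted estimator (3.37) and its covariance (3.38), illustrated with two sensors of different accuracy (made-up numbers):

```python
import numpy as np

def weighted_ls(G, y, R):
    """Weighted least squares estimate (3.37) and covariance (3.38)."""
    Rinv = np.linalg.inv(R)
    info = G.T @ Rinv @ G            # information matrix G^T R^-1 G
    cov = np.linalg.inv(info)        # estimator covariance (3.38)
    x_hat = cov @ G.T @ Rinv @ y     # estimate (3.37)
    return x_hat, cov

# Two sensors measuring the same scalar; the first is ten times more accurate.
G = np.array([[1.0], [1.0]])
y = np.array([1.0, 2.0])
R = np.diag([0.1, 1.0])
x_hat, cov = weighted_ls(G, y, R)
print(x_hat)  # pulled towards the accurate sensor's reading
```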

3.3 Regularized Linear Least Squares


So far, we have assumed that the unknowns x are completely arbitrary and de-
terministic. However, as discussed in Section 2.3, in many cases, we have some
prior information about the parameters, for example we know beforehand that they
might be distributed around some mean E{x} = m with covariance Cov{x} = P.
This prior information can be incorporated into the estimator by adding a regular-
ization term to the cost function, which leads to the regularized linear least squares
estimator.

The cost function then becomes

JReLS (x) = (y − Gx)T R−1 (y − Gx) + (x − m)T P−1 (x − m), (3.39)

where the second term (x − m)T P−1 (x − m) is a regularization term that encodes
the prior information. The minimum of this cost function can now be derived by
setting the gradient to zero. The gradient is given by

∂JReLS(x)/∂x = −2GT R−1 y + 2GT R−1 Gx − 2P−1 m + 2P−1 x,
which now has some additional terms resulting from the regularization. Setting it
to zero and solving for x then yields the regularized linear least squares estimator

x̂ReLS = (GT R−1 G + P−1 )−1 (GT R−1 y + P−1 m). (3.40)

The estimator in (3.40) shows that when using the regularization approach, the
resulting estimate is a weighted average of the data, which enters through the term
GT R−1 y and the prior information through P−1 m. The contributions of each of
the terms are determined by two factors. First, there are the weighing matrices
R−1 and P−1, which directly weigh the contributions. Second, the number of data
points in y also affects how much emphasis is put on the data, and how much on the
prior information.
The covariance of the estimator can then be shown to be

Cov{x̂ReLS } = (GT R−1 G + P−1 )−1 . (3.41)

It can further be shown that the trace of the above covariance (which is the total
variance) is strictly smaller than that of the non-regularized version in Equation
(3.38). This is because including prior information decreases the variance of the
estimator. However, if we compute the expectation of the estimator (3.40), we
find that the estimator is biased — it is unbiased if and only if P−1 = 0. This
bias is due to our inclusion of prior information, which forces the estimate slightly
towards the prior. However, if our prior belief is correct, the estimator is supposed to
be biased and the bias actually forces the estimate to be closer to the truth than the
measurements alone imply.
It is also possible to express the mean and covariance equations in an alternative,
but equivalent, form, which will be especially useful in the sequential least squares
method that we discuss in the next section. The matrix inversion formula, which is
also called the Woodbury matrix identity, gives

(GT R−1 G + P−1 )−1 = P − P GT (G P GT + R)−1 G P. (3.42)

If we further substitute this into the estimator equation and simplify, we can express
the estimator (3.40) and its covariance (3.41) as

x̂ReLS = m + K(y − Gm), (3.43)



Figure 3.3. Regularized least squares estimation example.

and
Cov{x̂ReLS } = P − K(GPGT + R)KT , (3.44)
where K is a gain given by

K = PGT (GPGT + R)−1 . (3.45)

In this form, a second interpretation of the regularized linear least squares estimator
is possible. Recall that m was the prior mean. Then, the term y − Gm is the error
between the output and the output predicted by the prior mean. The final estimate
is then the prior mean that is corrected by the error term through the weight K.
Finally, we also point out that the regularized least squares solution can also
be derived as a non-regularized (weighted) least squares solution where the regularization
term is interpreted as an additional set of measurements. Because
(x − m)T P−1 (x − m) = (m − x)T P−1 (m − x), we can rewrite the cost function
(3.39) as

JReLS(x) = (y − Gx)T R−1 (y − Gx) + (m − x)T P−1 (m − x)
         = ([y; m] − [G; I] x)T blkdiag(R−1, P−1) ([y; m] − [G; I] x), (3.46)

where [a; b] denotes vertical stacking and the last form is a non-regularized (weighted)
least squares problem in which the measurement vector contains y and m, the measurement
model matrix contains an additional identity matrix, and the weighting matrix is
block-diagonal with R−1 and P−1 on the diagonal.
By using Equations (3.37) and (3.38), we can now derive the regularized estimator
and its covariance. This form will be useful in nonlinear least squares problems,
as it allows us to derive algorithms for plain weighted least squares problems;
regularized least squares problems can then be solved with the
same algorithms by using the trick above.
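The equivalence of the direct form (3.40), the gain form (3.43)–(3.45), and the augmented-measurement trick (3.46) can be verified numerically; the matrices below are made up for illustration:

```python
import numpy as np

G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.1, 1.9, 3.2])
R = 0.1 * np.eye(3)
m = np.zeros(2)          # prior mean
P = 4.0 * np.eye(2)      # prior covariance

Rinv, Pinv = np.linalg.inv(R), np.linalg.inv(P)

# Direct form (3.40).
x_direct = np.linalg.solve(G.T @ Rinv @ G + Pinv, G.T @ Rinv @ y + Pinv @ m)

# Gain form (3.43)-(3.45).
K = P @ G.T @ np.linalg.inv(G @ P @ G.T + R)
x_gain = m + K @ (y - G @ m)

# Augmented-measurement trick (3.46): stack the prior as extra measurements.
G_aug = np.vstack([G, np.eye(2)])
y_aug = np.concatenate([y, m])
W_aug = np.block([[Rinv, np.zeros((3, 2))], [np.zeros((2, 3)), Pinv]])
x_aug = np.linalg.solve(G_aug.T @ W_aug @ G_aug, G_aug.T @ W_aug @ y_aug)

print(np.allclose(x_direct, x_gain), np.allclose(x_direct, x_aug))  # True True
```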
Example 3.7. Figure 3.3 illustrates the effect of regularization in least squares
estimate of the autonomous car position using the same data as in Examples 3.5

and 3.6. On the left we have prior information where the mean is close to the truth,
but the prior variance is quite large. In this case the estimate is slightly shifted
towards the true position. On the right, the prior mean is in the wrong place with
relatively small variance, which in this case shifts the estimate to the wrong place.
If the prior information were correct, the estimate would be even better.

3.4 Sequential Linear Least Squares


In many cases, the sensor data arrives sequentially at the estimator. Assume that we
have calculated the least squares estimate x̂n−1 according to (3.37) with covariance
Pn−1 = Cov{x̂n−1 } as in (3.38) for the dataset y1:n−1 = {y1 , y2 , . . . , yn−1 }.
When a new measurement yn arrives, the weighted least squares cost function can
be written as

JSLS(x) = (y1:n−1 − G1:n−1 x)T R1:n−1^−1 (y1:n−1 − G1:n−1 x)
          + (yn − Gn x)T Rn^−1 (yn − Gn x), (3.47)

where the first part is the cost function that was minimized when n − 1 measure-
ments were available, and the second part is the additional cost for the nth sample.
Minimizing (3.47) is achieved in the same way as for the general linear model
in the previous section, except that there is an additional term. The gradient is
given by

∂JSLS(x)/∂x = −2 G1:n−1^T R1:n−1^−1 y1:n−1 + 2 G1:n−1^T R1:n−1^−1 G1:n−1 x
              − 2 Gn^T Rn^−1 yn + 2 Gn^T Rn^−1 Gn x. (3.48)

Setting the gradient to zero and solving for x then yields the updated estimate

x̂n = (G1:n−1^T R1:n−1^−1 G1:n−1 + Gn^T Rn^−1 Gn)^−1 (G1:n−1^T R1:n−1^−1 y1:n−1 + Gn^T Rn^−1 yn)
   = (Pn−1^−1 + Gn^T Rn^−1 Gn)^−1 (G1:n−1^T R1:n−1^−1 y1:n−1 + Gn^T Rn^−1 yn).

Using the matrix inversion formula as in the previous section and some further
simplifications, we can rewrite this as

x̂n = x̂n−1 + Kn (yn − Gn x̂n−1 ) (3.49)

with the gain

Kn = Pn−1 Gn^T (Gn Pn−1 Gn^T + Rn)^−1. (3.50)
Knowing that E{x̂n−1 } = x (which follows from Section 3.2), the mean of the
updated estimate is given by

E{x̂n} = E{x̂n−1} + Kn (E{yn} − Gn E{x̂n−1}) = x, (3.51)

Figure 3.4. Sequential least squares estimation example.

that is, the updated estimate is unbiased provided that the previous one was. Fur-
thermore, the covariance of the updated estimate is

Cov{x̂n} = E{(x̂n−1 + Kn (yn − Gn x̂n−1) − x)(x̂n−1 + Kn (yn − Gn x̂n−1) − x)T}
         = E{(x̂n−1 − x)(x̂n−1 − x)T} + E{(x̂n−1 − x)(yn − Gn x̂n−1)T Kn^T}
           + E{Kn (yn − Gn x̂n−1)(x̂n−1 − x)T} + E{Kn (yn − Gn x̂n−1)(yn − Gn x̂n−1)T Kn^T}
         = Pn−1 − Kn (Gn Pn−1 Gn^T + Rn) Kn^T. (3.52)

Note how the covariance update in (3.52) illustrates how adding new measurements
reduces the uncertainty of the estimate.
It turns out that regularization can be easily incorporated into sequential esti-
mation. All we need to do is to set x̂0 = m and P0 = P from the regularization
term in (3.39) and then proceed with the above update equations.
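The update equations (3.49)–(3.52) can be sketched as follows; processing toy measurements one at a time and starting from a weak prior reproduces the batch (regularized) weighted least squares solution:

```python
import numpy as np

def sls_update(x_hat, P, G_n, y_n, R_n):
    """One sequential least squares update, Eqs. (3.49), (3.50), (3.52)."""
    S = G_n @ P @ G_n.T + R_n
    K = P @ G_n.T @ np.linalg.inv(S)    # gain (3.50)
    x_new = x_hat + K @ (y_n - G_n @ x_hat)  # estimate update (3.49)
    P_new = P - K @ S @ K.T             # covariance update (3.52)
    return x_new, P_new

# Toy data: three scalar measurements of a 2D parameter (made-up numbers).
Gs = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]), np.array([[1.0, 1.0]])]
ys = [np.array([1.1]), np.array([1.9]), np.array([3.2])]
Rs = [0.1 * np.eye(1)] * 3

# Initialize from a weak prior, as suggested in the text.
x_hat, P = np.zeros(2), 100.0 * np.eye(2)
for G_n, y_n, R_n in zip(Gs, ys, Rs):
    x_hat, P = sls_update(x_hat, P, G_n, y_n, R_n)
print(x_hat)  # matches the batch (regularized) WLS solution
```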
Example 3.8. Figure 3.4 illustrates the sequential least squares estimation of the
autonomous car position using the same data as in Examples 3.5, 3.6, and 3.7. In
the top left we have used only one measurement, in the top right two measurements,
in the bottom left three, and in the bottom right all four measurements. It
can be seen how each measurement gives more information, and after all four
measurements the result exactly coincides with the weighted least squares estimate.
Chapter 4

Static Nonlinear Models

In Chapter 3, we discussed problems where the sensor model is linear in the parameters
of interest. For these model types, closed-form estimators exist and, furthermore,
we can analytically determine the estimators' properties such as bias
or covariance. However, despite these desirable characteristics, linear models can
be limiting, and a nonlinear model may be more accurate in describing the relationship
between the sensors' measurements and the parameters of interest.

4.1 Nonlinear Model


In this section, we develop estimation algorithms for general models of the form

y = g(x) + r, (4.1)

where g(x) may be an arbitrary nonlinear (vector-valued) function. As for the
linear models, the aim will be to minimize the weighted least squares cost function

JWLS (x) = (y − g(x))T R−1 (y − g(x)) (4.2)


with respect to x.
Although the above cost function does not contain a regularization term, we
can recall from Section 3.3 that a regularized least squares problem can be reformulated
as a non-regularized least squares problem. The same trick also works
for nonlinear problems, and hence the algorithms presented in this section will also
work for regularized least squares problems. We come back to this in Section 4.6.
In some special cases, it is still possible to find analytical expressions at least
for the (W)LS estimator and possibly even for its properties. Hence, it is always
worth trying to derive it. However, for the vast majority of models, this is not possi-
ble and we have to resort to iterative, numerical optimization methods to minimize
the cost function (4.2). For this purpose, we discuss the following methods here:

• Gradient descent,


Figure 4.1. Illustration of the cost function for a nonlinear problem with two unknowns
x = [x1 x2]^T. (a) Contours of the cost function, and (b) vector field defined by the
negative gradient.

• the Gauss–Newton algorithm, and

• the Levenberg–Marquardt algorithm.

Note that all of these algorithms are typically only able to find local minima,
unless the cost function has only a single (global) minimum. Hence, it is important
to have good initial guesses of the parameters.

4.2 Gradient Descent


The first approach, gradient descent, is based on the following strategy.
Assume that we are given the current parameter estimate x̂(i) at the algorithm’s ith
iteration. Then, the gradient of the objective function at x̂(i) indicates the direction
of the function input x in which the function value increases. Hence, taking steps
from x̂(i) in the direction proportional to the negative gradient, the function value
should decrease (see Figure 4.1).
This leads to the update rule given by

    x̂^(i+1) = x̂^(i) − γ ∂J_WLS(x)/∂x |_{x=x̂^(i)},    (4.3)

where ∂JWLS (x)/∂x denotes the gradient of JWLS (x) with respect to x and γ > 0
is a positive constant defining the step length. For simplicity, we start our derivation
assuming scalar measurements yn together with the (non-weighted) least squares
cost function and later generalize it to the general model (4.1)–(4.2). In this case,
we have that

    J_LS(x) = Σ_{n=1}^N (y_n − g_n(x))^2

and thus

    ∂J_LS(x)/∂x = ∂/∂x Σ_{n=1}^N (y_n − g_n(x))^2
                = −2 Σ_{n=1}^N (y_n − g_n(x)) ∂g_n(x)/∂x.

The sum can be rewritten in vector form as

    ∂J_LS(x)/∂x = −2 [∂g_1(x)/∂x  ∂g_2(x)/∂x  ⋯  ∂g_N(x)/∂x] (y − g(x)),

where the nth column of the matrix is the gradient of g_n(x) with respect to x.
This matrix of derivatives can be identified as the transpose of the Jacobian
matrix given by

             [ ∂g_1(x)/∂x_1  ∂g_1(x)/∂x_2  ⋯  ∂g_1(x)/∂x_K ]
             [ ∂g_2(x)/∂x_1  ∂g_2(x)/∂x_2  ⋯  ∂g_2(x)/∂x_K ]
    G_x(x) = [      ⋮              ⋮        ⋱       ⋮       ],    (4.4)
             [ ∂g_N(x)/∂x_1  ∂g_N(x)/∂x_2  ⋯  ∂g_N(x)/∂x_K ]

which is a matrix containing the gradients of the elements of the vector-valued
g(x) on its rows. Hence, we have that

    ∂J_LS(x)/∂x = −2 G_x^T(x) (y − g(x)).    (4.5)
We can now directly generalize (4.5) to the general cost (4.2), which yields

    ∂J_WLS(x)/∂x = ∂/∂x (y − g(x))^T R^{−1} (y − g(x))
                 = ∂/∂x [y^T R^{−1} y − y^T R^{−1} g(x) − g(x)^T R^{−1} y + g(x)^T R^{−1} g(x)]
                 = −2 G_x^T(x) R^{−1} (y − g(x)).

Note that the negative gradient, and hence the direction of the parameter update, is
proportional to G_x^T(x) R^{−1} (y − g(x)), which we multiply with the step size γ.
We can absorb the factor 2 into the step size, which yields the gradient descent
update (4.3) given by

    x̂^(i+1) = x̂^(i) + γ G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))).    (4.6)

Algorithm 4.1 Gradient Descent

Input: Initial parameter guess x̂^(0), data y, function g(x), Jacobian G_x(x)
Output: Parameter estimate x̂_WLS
1: Set i ← 0
2: repeat
3:   Calculate the update direction
        Δx^(i+1) = G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
4:   Select a suitable γ^(i+1)
5:   Calculate
        x̂^(i+1) = x̂^(i) + γ^(i+1) Δx^(i+1)
6:   Set i ← i + 1
7: until Converged
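To make Algorithm 4.1 concrete, the following is a minimal NumPy sketch of gradient descent with a fixed step size, applied to a hypothetical range-only model with two landmarks (the landmark positions, noise covariance, step size, and iteration count are all illustrative choices, not taken from the examples in the text):

```python
import numpy as np

def gradient_descent(x0, y, g, G_x, R, gamma=0.05, n_iter=500):
    """Gradient descent for the WLS cost (4.2) using the update (4.6)
    with a fixed step size gamma (Algorithm 4.1)."""
    R_inv = np.linalg.inv(R)
    x = x0.copy()
    for _ in range(n_iter):
        dx = G_x(x).T @ R_inv @ (y - g(x))  # update direction (step 3)
        x = x + gamma * dx                  # fixed-step update (step 5)
    return x

# Hypothetical range-only model: distances to two known landmarks.
landmarks = np.array([[0.0, 0.0], [10.0, 0.0]])
def g(x):
    return np.linalg.norm(landmarks - x, axis=1)
def G_x(x):
    return (x - landmarks) / g(x)[:, None]  # Jacobian of the ranges, cf. (4.4)
R = np.diag([0.1, 0.1])
x_true = np.array([3.0, 4.0])
y = g(x_true)                               # noise-free ranges for illustration
x_hat = gradient_descent(np.array([2.0, 2.0]), y, g, G_x, R)
```

With this well-conditioned toy geometry the fixed step converges; in general γ must be tuned or selected by a line search as discussed below.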

Figure 4.2. Gradient descent optimization example. The left figure shows the evolution of
the estimates from start to end, and the right figure is a zoom in on the end of the
optimization.

It remains to determine how long the step length γ should be chosen. Choosing
γ too large may cause the cost function to increase, whereas too small steps might
cause unnecessarily slow convergence. A typical strategy is to simply choose it
small enough so that the cost decreases at every step. However, quite often it is
advisable to change the step length during the iterations in one way or another. One
approach is to use a line search for selecting it, but we will postpone the discussion
on line-search methods to Section 4.4, where we discuss it in the context of Gauss–
Newton methods. A general gradient descent algorithm is shown as Algorithm 4.1.

Example 4.1. Let us consider a similar car localization problem as in Example 3.5
which was depicted in Figure 1.4. However, let us assume that we are only able to
measure the ranges to the landmarks as specified in the model in Equations (1.8).
In this case the model becomes non-linear. We assume that the variances of the
range measurements are R1R = 1, R2R = 0.1, R3R = 0.8, and R4R = 0.5. The
result of applying the gradient descent algorithm to the corresponding weighted
least squares problem is shown in Figure 4.2. It can be seen that gradient descent

indeed finds the minimum, but the route that it finds is far from a direct route and
when we get closer to the minimum, the steps become smaller and smaller although
we have kept γ constant.
While intuitive, the gradient descent approach suffers from an important problem.
In areas where the cost function is flat, the gradient is very small. Hence, the
algorithm can only take small steps, which means that a large number of iterations
is needed to cross such areas. In general, pure gradient descent is thus a quite
inefficient algorithm and if, for example, Gauss–Newton, Levenberg–Marquardt, or
quasi-Newton methods can be used, they should be preferred. However, gradient
descent methods, and in particular their stochastic versions, stochastic gradient
descent methods, have recently gained popularity due to their computational
efficiency in models with a large number of parameters, such as large neural
networks.

4.3 Gauss–Newton Algorithm


The second method, the Gauss–Newton algorithm, is based on a different approach.
Given the estimate x̂(i) of x at the ith iteration, the nonlinear function g(x) can
locally be approximated by using a first order Taylor series expansion. This ap-
proximation is given by
g(x) ≈ g(x̂(i) ) + Gx (x̂(i) ) (x − x̂(i) ) (4.7)
where Gx is the Jacobian matrix of g(x) given by (4.4). Using this local linear
approximation of the nonlinear function, the weighted least squares cost can be
approximated by

    J_WLS(x) ≈ (y − g(x̂^(i)) − G_x(x̂^(i)) (x − x̂^(i)))^T R^{−1}
               × (y − g(x̂^(i)) − G_x(x̂^(i)) (x − x̂^(i)))
             = (e^(i) − G_x(x̂^(i)) (x − x̂^(i)))^T R^{−1} (e^(i) − G_x(x̂^(i)) (x − x̂^(i))),    (4.8)

where we have made use of e^(i) = y − g(x̂^(i)) (the error at the ith iteration).
The approximated cost function (4.8) is linear in the parameters x (remember
that x̂(i) is constant and given from the last iteration) and hence, we can derive
a closed-form solution for the parameter update. This is achieved by setting the
approximate cost function's gradient to zero and solving for x as

    ∂J_WLS(x)/∂x ≈ ∂/∂x [(e^(i))^T R^{−1} e^(i) − (e^(i))^T R^{−1} G_x(x̂^(i)) (x − x̂^(i))
                         − (x − x̂^(i))^T G_x^T(x̂^(i)) R^{−1} e^(i)
                         + (x − x̂^(i))^T G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) (x − x̂^(i))]
                 = −2 G_x^T(x̂^(i)) R^{−1} e^(i) + 2 G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) (x − x̂^(i)) = 0.

Algorithm 4.2 Gauss–Newton Algorithm

Input: Initial parameter guess x̂^(0), data y, function g(x), Jacobian G_x(x)
Output: Parameter estimate x̂_WLS
1: Set i ← 0
2: repeat
3:   Calculate the update direction
        Δx^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
4:   Calculate
        x̂^(i+1) = x̂^(i) + Δx^(i+1)
5:   Set i ← i + 1
6: until Converged
7: Set x̂_WLS = x̂^(i)

We then set the next estimate x̂^(i+1) to be equal to the minimum, which gives

    x̂^(i+1) = x̂^(i) + (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} e^(i)
             = x̂^(i) + (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))).    (4.9)

The resulting method is given in Algorithm 4.2. The algorithm produces the
least squares estimate of the parameters x̂_WLS. Given this estimate, we can also
approximate the covariance of the estimate by using the linearized model,
analogously to the linear case in Equation (3.38), which gives

    Cov{x̂_WLS} ≈ (G_x^T(x̂_WLS) R^{−1} G_x(x̂_WLS))^{−1}.    (4.10)
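The Gauss–Newton iteration (4.9) together with the covariance approximation (4.10) can be sketched in NumPy as follows (the range-only toy model used for illustration is hypothetical):

```python
import numpy as np

def gauss_newton(x0, y, g, G_x, R, n_iter=20):
    """Gauss-Newton iteration (4.9); returns the estimate and the
    covariance approximation (4.10)."""
    R_inv = np.linalg.inv(R)
    x = x0.copy()
    for _ in range(n_iter):
        G = G_x(x)
        H = G.T @ R_inv @ G                    # approximate Hessian (up to a factor 2)
        x = x + np.linalg.solve(H, G.T @ R_inv @ (y - g(x)))
    G = G_x(x)
    cov = np.linalg.inv(G.T @ R_inv @ G)       # covariance approximation (4.10)
    return x, cov

# Hypothetical range-only model with two landmarks (noise-free data).
landmarks = np.array([[0.0, 0.0], [10.0, 0.0]])
def g(x):
    return np.linalg.norm(landmarks - x, axis=1)
def G_x(x):
    return (x - landmarks) / g(x)[:, None]
R = np.diag([0.1, 0.1])
y = g(np.array([3.0, 4.0]))
x_hat, cov = gauss_newton(np.array([2.0, 2.0]), y, g, G_x, R)
```

For this zero-residual toy problem the iteration converges in a handful of steps, illustrating the fast local convergence noted in Example 4.2.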

A major disadvantage of the Gauss–Newton approach is that the linearization
of the measurement model is local. In highly nonlinear problems, this means that
it is dangerous to extrapolate too much, and only small steps can be taken at a
time to avoid missing local minima. This problem can be mitigated by using a line
search method to find a suitable step size. We discuss such methods in the next
section.

Example 4.2. Figure 4.3 shows the result of applying Gauss–Newton algorithm to
the problem that we considered in Example 4.1. It can be seen that Gauss–Newton
takes a large step already in the beginning and only a few additional iterations are
needed to obtain convergence.

Example 4.3. A problem with Gauss–Newton is illustrated in Figure 4.4. In this
case we have started the optimization from a point relatively far away and it can
be seen that Gauss–Newton's initial step is taken in the wrong direction. As can
be seen in the cost function evolution, the initial step actually increases the
cost, because the initial step is too large.

Figure 4.3. Comparison of gradient descent and Gauss–Newton in a weighted least squares
problem. The full evolution of the estimates is shown on the left and the end of the
estimation is zoomed in on the right.

Figure 4.4. Comparison of gradient descent and Gauss–Newton in a weighted least squares
problem when the starting point is further away. The evolution of the estimates is shown
on the left and the cost J(x) as a function of the iteration number is shown on the right.

4.4 Gauss–Newton Algorithm with Line Search


In practice, taking a full step according to (4.9) in the Gauss–Newton algorithm
might be too large with respect to the neighborhood for which the Taylor series
approximation (4.7) is valid. To avoid this, a scaled Gauss–Newton step can be taken
instead, proportional to the direction given by the local approximation. This yields

    x̂^(i+1) = x̂^(i) + γ Δx̂^(i+1),    (4.11a)
    Δx̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))),    (4.11b)

where γ is the scaling factor.


One way to select the scaling parameter is to use a line search as follows. If
we substitute the updated parameter vector to the cost function JWLS (x), we can

Algorithm 4.3 Line Search on Grid

Input: Previous iterate x̂^(i), the update direction Δx̂^(i+1), the cost function J_WLS(x), and
the grid size N_g
Output: Optimal step size γ*
1: Set γ* ← 0 and J* ← J_WLS(x̂^(i))
2: for j ∈ {1, 2, . . . , N_g} do
3:   Set γ ← j/N_g
4:   if J_WLS(x̂^(i) + γ Δx̂^(i+1)) < J* then
5:     Set γ* ← γ and J* ← J_WLS(x̂^(i) + γ Δx̂^(i+1))
6:   end if
7: end for

consider the cost as a function of the step size parameter:

    J_WLS^(i)(γ) = J_WLS(x̂^(i) + γ Δx̂^(i+1)).    (4.12)

Then the idea of line search is that on every step we select the step size by locally
optimizing it in the range [0, 1].
One way to perform a line search is simply to evaluate J_WLS^(i)(γ) on a grid of
values and then select the step size that gives the minimum. This is often called
exact line search in the optimization literature (Nocedal and Wright, 2006). This
grid search algorithm is shown as Algorithm 4.3. It would also be possible to use
more sophisticated minimum search methods, such as the bisection algorithm or
interpolation methods, to find the minimum.
However, it turns out that the line search does not need to be exact in order to
guarantee convergence. One simple idea is to use backtracking, where we decrease
the parameter γ until it provides a sufficient decrease in the cost. One way is to
simply halve the step size until it causes a cost decrease. This kind of approach
is used, for example, in Gustafsson (2018) and it often works well.
Armijo backtracking is an inexact line search method where we demand that the
cost decreases at least by an amount predicted by a first order Taylor series
expansion as follows. Let us expand the right hand side of (4.12) using a first
order Taylor series expansion (Nocedal and Wright, 2006):

    J_WLS(x̂^(i) + γ Δx̂^(i+1)) ≈ J_WLS(x̂^(i)) − 2γ [Δx̂^(i+1)]^T G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))).    (4.13)

Then we demand that the cost decrease should satisfy the Armijo condition

    J_WLS(x̂^(i) + γ Δx̂^(i+1)) − J_WLS(x̂^(i)) ≤ −2β γ [Δx̂^(i+1)]^T G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))),    (4.14)

where β is a parameter that we can choose freely on range [0, 1), typical parameter
value being β = 0.1. In the backtracking we then decrease γ by multiplying it with

Algorithm 4.4 Line Search with Armijo Backtracking

Input: Previous iterate x̂^(i), the update direction Δx̂^(i+1), the cost function J_WLS(x), and
parameters β and τ
Output: Suitable step size γ
1: Set γ ← 1 and J_0 ← J_WLS(x̂^(i))
2: Set d ← −2β [Δx̂^(i+1)]^T G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
3: while J_WLS(x̂^(i) + γ Δx̂^(i+1)) > J_0 + γ d do
4:   Set γ ← τ γ
5: end while

Algorithm 4.5 Gauss–Newton Algorithm with Line Search

Input: Initial parameter guess x̂^(0), data y, function g(x), Jacobian G_x(x)
Output: Parameter estimate x̂_WLS
1: Set i ← 0
2: repeat
3:   Calculate the update direction
        Δx^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)))^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
4:   Compute the step size γ^(i+1) with a line search (Algorithm 4.3 or 4.4)
5:   Calculate
        x̂^(i+1) = x̂^(i) + γ^(i+1) Δx^(i+1)
6:   Set i ← i + 1
7: until Converged
8: Set x̂_WLS = x̂^(i)

a parameter τ (e.g., τ = 0.5) in the range (0, 1) until the condition is satisfied. The
resulting method is shown as Algorithm 4.4.
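Armijo backtracking can be sketched in NumPy as follows (the toy linear model in the usage example is hypothetical):

```python
import numpy as np

def armijo_step(x, dx, y, g, G_x, R_inv, J, beta=0.1, tau=0.5, max_halvings=30):
    """Armijo backtracking line search: shrink gamma until the cost
    decrease satisfies the Armijo condition (4.14)."""
    gamma = 1.0
    J0 = J(x)
    # predicted decrease slope d; d < 0 for a descent direction
    d = -2.0 * beta * dx @ G_x(x).T @ R_inv @ (y - g(x))
    for _ in range(max_halvings):
        if J(x + gamma * dx) <= J0 + gamma * d:
            break
        gamma *= tau
    return gamma

# Hypothetical usage on a toy linear model g(x) = x with y = [1, 2]:
y = np.array([1.0, 2.0])
g = lambda x: x
G_x = lambda x: np.eye(2)
R_inv = np.eye(2)
J = lambda x: (y - g(x)) @ R_inv @ (y - g(x))
x0 = np.zeros(2)
gamma_full = armijo_step(x0, y - x0, y, g, G_x, R_inv, J)           # full step suffices
gamma_short = armijo_step(x0, 4.0 * (y - x0), y, g, G_x, R_inv, J)  # step is too long
```

In the first call the full step already satisfies the condition, so γ = 1 is kept; in the second call the overly long direction is shrunk until the condition holds.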
A Gauss–Newton algorithm with a generic line search is shown as Algorithm 4.5.
The algorithm produces the least squares estimate x̂_WLS and its covariance can be
approximated using (4.10). More advanced line search methods can be found in the
optimization literature (e.g., Nocedal and Wright, 2006).
Example 4.4. Figure 4.5 shows the result of applying the Gauss–Newton algorithm
with line search to the far-away initial-point case considered in Example 4.3. It can
be seen that the initial step is shorter than with plain Gauss–Newton and the cost
is decreased at every step. Although the cost is strictly decreasing, the intermediate
costs during the iterations are larger than with plain Gauss–Newton. However, the
strict decrease in the cost enhances the stability of the algorithm.

4.5 Levenberg–Marquardt Algorithm


The Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963; Nocedal
and Wright, 2006) uses a different strategy than line search to enhance the perfor-
mance of the Gauss–Newton algorithm. The underlying idea is to restrict the step

Figure 4.5. Illustration of the Gauss–Newton algorithm with (exact) line search.

size by using regularization in the least squares solution following the linearization
made in the Gauss–Newton algorithm.
One way of deriving the Levenberg–Marquardt algorithm is indeed to see it
as a regularized Gauss–Newton algorithm. The Taylor series approximation of
the nonlinear function (4.7) is only valid in the local neighborhood of the current
parameter estimate x̂(i) . Hence, we can use this prior information to regularize the
approximated weighted least squares cost function using a regularization term as
introduced in Section 3.3. This yields the cost function approximation

    J_ReLS(x) ≈ (y − g(x̂^(i)) − G_x(x̂^(i)) (x − x̂^(i)))^T
                × R^{−1} (y − g(x̂^(i)) − G_x(x̂^(i)) (x − x̂^(i)))
                + λ (x − x̂^(i))^T (x − x̂^(i))
              = (e^(i) − G_x(x̂^(i)) (x − x̂^(i)))^T R^{−1} (e^(i) − G_x(x̂^(i)) (x − x̂^(i)))
                + λ (x − x̂^(i))^T (x − x̂^(i)),

where the second term is the regularization around the last iteration's estimate with
weight or damping factor λ. The gradient of the cost function approximation now
becomes

    ∂J_ReLS(x)/∂x ≈ −2 G_x^T(x̂^(i)) R^{−1} e^(i) + 2 G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) (x − x̂^(i)) + 2λ (x − x̂^(i))
                  = −2 G_x^T(x̂^(i)) R^{−1} e^(i) + 2 (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI) (x − x̂^(i)).    (4.15)

Setting (4.15) to zero and rearranging yields

    (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI) (x − x̂^(i)) = G_x^T(x̂^(i)) R^{−1} e^(i)

and thus,

    x̂^(i+1) = x̂^(i) + Δx̂^(i+1),    (4.16a)
    Δx̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI)^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))).    (4.16b)

Equation (4.16b) shows that when λ approaches zero, the algorithm converges
to the Gauss–Newton algorithm. On the other hand, for large values of λ, the term
λI will dominate. In this case, we have that

    Δx̂^(i+1) ≈ (1/λ) G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))),

which is a scaled gradient descent step. Thus, an important question is how to
choose λ. Ideally, in flat regions of the cost function where the linear approxi-
mation is good, λ should be chosen small, whereas in steep regions, it should be
chosen large. Unfortunately, there is no definite method for how to choose and
adapt λ, and several different approaches have been proposed.
A simple strategy which is often used, and which is a simplified version of the
method proposed by Marquardt (1963) and used, for example, in Pujol (2007), is
to start from some initial damping value λ^(0) (e.g., λ^(0) = 10^{−2}) and select a
fixed factor ν (e.g., ν = 10). Then at each step we do the following:

• First compute a candidate x̂^(i+1) using Equation (4.16) with the previous
  damping value λ^(i−1). Then proceed as follows:

  – If J_WLS(x̂^(i+1)) < J_WLS(x̂^(i)), then accept x̂^(i+1) and decrease the
    damping parameter by λ^(i) = λ^(i−1)/ν.
  – Otherwise, continue with x̂^(i+1) = x̂^(i) and increase the damping pa-
    rameter by λ^(i) = ν λ^(i−1).
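This accept/reject adaptation, combined with the update (4.16), can be sketched in NumPy as follows (the range-only toy model is hypothetical):

```python
import numpy as np

def levenberg_marquardt(x0, y, g, G_x, R, lam=1e-2, nu=10.0, n_iter=50):
    """Levenberg-Marquardt update (4.16) with the simple accept/reject
    damping adaptation described above."""
    R_inv = np.linalg.inv(R)
    J = lambda x: (y - g(x)) @ R_inv @ (y - g(x))
    x = x0.copy()
    for _ in range(n_iter):
        G = G_x(x)
        H = G.T @ R_inv @ G + lam * np.eye(len(x))  # damped normal equations
        dx = np.linalg.solve(H, G.T @ R_inv @ (y - g(x)))
        if J(x + dx) < J(x):
            x, lam = x + dx, lam / nu  # accept the candidate, decrease damping
        else:
            lam = nu * lam             # reject the candidate, increase damping
    return x

# Hypothetical range-only model with two landmarks (noise-free data).
landmarks = np.array([[0.0, 0.0], [10.0, 0.0]])
def g(x):
    return np.linalg.norm(landmarks - x, axis=1)
def G_x(x):
    return (x - landmarks) / g(x)[:, None]
R = np.diag([0.1, 0.1])
y = g(np.array([3.0, 4.0]))
x_hat = levenberg_marquardt(np.array([2.0, 2.0]), y, g, G_x, R)
```

As λ shrinks after each accepted step, the iteration behaves increasingly like Gauss–Newton, consistent with the limit discussed around (4.16b).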

Furthermore, the regularization procedure in Equation (4.16) has the disadvantage
that it is not scale invariant. To make the regularization scale invariant, Marquardt
(1963) suggested to normalize the regularized cost function approximation using
the diagonal values of G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)). This procedure is equivalent
to replacing the regularization term λI with λ diag(G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i))),
which yields the following scaled parameter update:

    x̂^(i+1) = x̂^(i) + Δx̂^(i+1),    (4.17a)
    Δx̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λ diag(G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i))))^{−1}
               × G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i))).    (4.17b)

The resulting Levenberg–Marquardt algorithm, including both the non-scaled


and scaled versions, is shown in Algorithm 4.6.

Algorithm 4.6 Levenberg–Marquardt Algorithm with simple adaptation

Input: Initial parameter guess x̂^(0), data y, function g(x), Jacobian G_x(x), initial damping
λ^(0), and parameter ν
Output: Parameter estimate x̂_WLS
1: Set i ← 0 and λ ← λ^(0)
2: repeat
3:   Compute the (candidate) parameter update:
4:   if using scaled version then
        Δx̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λ diag(G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i))))^{−1}
                   × G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
5:   else
        Δx̂^(i+1) = (G_x^T(x̂^(i)) R^{−1} G_x(x̂^(i)) + λI)^{−1} G_x^T(x̂^(i)) R^{−1} (y − g(x̂^(i)))
6:   end if
7:   if J_WLS(x̂^(i) + Δx̂^(i+1)) < J_WLS(x̂^(i)) then
8:     Accept the candidate and decrease λ:
          x̂^(i+1) = x̂^(i) + Δx̂^(i+1)
          λ ← λ/ν
9:     Set i ← i + 1
10:  else
11:    Reject the candidate and increase λ:
          λ ← ν λ
12:  end if
13: until Converged
14: Set x̂_WLS = x̂^(i)

Example 4.5. Figure 4.6 illustrates the operation of the Levenberg–Marquardt
algorithm in comparison to the Gauss–Newton method with line search. It can be
seen that the performance of the algorithm is similar to Gauss–Newton, although
the path that the optimization takes is quite different. Furthermore, Figure 4.7
shows the evolution of the cost in all the considered optimization methods. It can
be seen that the performance of the methods is similar in this case, although the
exact path that the cost function takes differs between the methods.

4.6 Regularized Non-Linear Models


Although the optimization algorithms in this chapter have been formulated to com-
pute solutions to non-regularized least squares problems, they can also be used to
compute solutions to regularized problems. Similarly to Section 3.3, we can rewrite

Figure 4.6. Illustration of the operation of the Levenberg–Marquardt algorithm.

the regularized cost as follows:

    J_ReLS(x) = (y − g(x))^T R^{−1} (y − g(x)) + (m − x)^T P^{−1} (m − x)

              = ([y] − [g(x)])^T [R^{−1}    0   ] ([y] − [g(x)])
                ([m]   [ x  ])   [  0    P^{−1}] ([m]   [ x  ]),    (4.18)

which has the form of a non-regularized cost and hence all the algorithms presented
in this chapter are again applicable.
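The stacking in (4.18) can be sketched as a small helper that returns the augmented quantities (the helper name and the linear toy model used for checking are hypothetical):

```python
import numpy as np

def regularize(y, g, G_x, R, m, P):
    """Rewrite the regularized problem (4.18) as a non-regularized one
    by stacking the pseudo-measurement m of x under the data y."""
    y_aug = np.concatenate([y, m])
    g_aug = lambda x: np.concatenate([g(x), x])
    G_aug = lambda x: np.vstack([G_x(x), np.eye(len(m))])
    R_aug = np.block([[R, np.zeros((len(y), len(m)))],
                      [np.zeros((len(m), len(y))), P]])
    return y_aug, g_aug, G_aug, R_aug

# Check on a hypothetical linear model: the WLS solution of the augmented
# problem equals the closed-form regularized least squares estimate.
y = np.array([1.0, 2.0, 3.0])
Gl = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
R = np.diag([0.5, 1.0, 2.0])
m = np.array([0.1, -0.2])
P = np.diag([4.0, 9.0])
y_a, g_a, G_a, R_a = regularize(y, lambda x: Gl @ x, lambda x: Gl, R, m, P)
Ga, Ra_inv = G_a(m), np.linalg.inv(R_a)
x_aug = np.linalg.solve(Ga.T @ Ra_inv @ Ga, Ga.T @ Ra_inv @ y_a)
R_inv, P_inv = np.linalg.inv(R), np.linalg.inv(P)
x_reg = np.linalg.solve(Gl.T @ R_inv @ Gl + P_inv, Gl.T @ R_inv @ y + P_inv @ m)
```

The augmented functions can then be passed unchanged to any of the algorithms in this chapter.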

4.7 Quasi-Newton Methods


Although here we have only concentrated on least squares problems and restricted
our consideration to gradient descent, Gauss–Newton, and Levenberg–Marguardt
algorithms, there exist other algorithms that can be used for solving these optimiza-
tion problems. One class of algorithms is based on the classical Newton’s method
as follows.
Let us consider a generic cost function J(x) which we wish to minimize. The
so-called Newton’s method is based on the following iterative procedure. Assume
that our current guess for the minimum is x(i) . We can now Taylor expand the cost

Figure 4.7. Evolution of the cost in all the considered optimization methods.

function as follows:

    J(x) ≈ J(x^(i)) + (∂J(x)/∂x |_{x=x^(i)})^T (x − x^(i))
           + (1/2) (x − x^(i))^T (∂²J(x)/∂x² |_{x=x^(i)}) (x − x^(i)),    (4.19)

where ∂²J(x)/∂x² denotes the Hessian matrix of J(x). The strategy is now to
minimize the right hand side with respect to x and use the result as the next guess.
The resulting Newton's method takes the following form:

    x^(i+1) = x^(i) − (∂²J(x)/∂x²)^{−1} ∂J(x)/∂x |_{x=x^(i)}.    (4.20)

Unfortunately, computation of the Hessian is often not desirable and therefore, in
so-called quasi-Newton methods, the Hessian is approximated. This approximation
can be formed in various ways, which leads to multiple classes of quasi-Newton
methods (Nocedal and Wright, 2006); among them, the Broyden–Fletcher–
Goldfarb–Shanno (BFGS) algorithm has turned out to be particularly successful.
In fact, the Gauss–Newton method can also be seen as a quasi-Newton method
where we approximate the Hessian by 2 G_x^T R^{−1} G_x. A line search procedure
is also typically an essential part of quasi-Newton methods.
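As an illustration of the quasi-Newton idea, the following is a minimal BFGS sketch with a simple decrease-only backtracking line search (a didactic simplification of the BFGS method described in Nocedal and Wright (2006), not a production implementation; the quadratic test function is hypothetical):

```python
import numpy as np

def bfgs(J, grad, x0, n_iter=100, tol=1e-10):
    """Minimal BFGS sketch: the inverse Hessian approximation H is built
    from gradient differences instead of being computed exactly."""
    x = x0.copy()
    n = len(x0)
    H = np.eye(n)                        # initial inverse Hessian approximation
    g = grad(x)
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                       # quasi-Newton search direction
        gamma = 1.0                      # simple decrease-only backtracking
        while J(x + gamma * p) > J(x) and gamma > 1e-12:
            gamma *= 0.5
        s = gamma * p
        x_new = x + s
        g_new = grad(x_new)
        yv = g_new - g
        sy = s @ yv
        if sy > 1e-12:                   # curvature condition
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(s, yv)
            H = V @ H @ V.T + rho * np.outer(s, s)  # BFGS inverse-Hessian update
        x, g = x_new, g_new
    return x

# Hypothetical quadratic test function J(x) = (x - a)^T Q (x - a).
Q = np.diag([1.0, 10.0])
a = np.array([1.0, 2.0])
x_hat = bfgs(lambda x: (x - a) @ Q @ (x - a),
             lambda x: 2.0 * Q @ (x - a), np.zeros(2))
```

On the quadratic test function, H approaches the true inverse Hessian (2Q)^{−1}, so later iterations accept the full step γ = 1, mirroring Newton-like convergence.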

4.8 Convergence Criteria


The algorithms introduced in the previous sections are all based on an iterative
scheme, where the parameter estimates are improved iteratively, based on the last
estimate. An important question therefore is when to terminate the search. Three
simple and easy-to-check criteria are:

• The absolute or relative change in the parameter estimate falls below a
  threshold. We can, for example, stop the iterations when the vector norm of
  the update direction is below a threshold: ‖Δx^(i)‖ < ε_p.

• The absolute or relative change in the cost falls below a certain threshold.
  For example, a criterion could be (J(x^(i)) − J(x^(i+1)))/J(x^(i)) < ε_c.

• A maximum number of iterations is reached.

It is common practice to monitor all or at least a subset of these criteria (plus


possibly others, too) and terminate the search when at least one of the criteria is
fulfilled. In order to avoid infinite loops, the last criterion (a maximum number of
iterations) should always be checked for, possibly issuing a warning when this is
the reason for termination of the algorithm.
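These criteria can be combined into a single termination check, for example as follows (the threshold values are hypothetical defaults):

```python
import numpy as np

def converged(dx, J_prev, J_new, i, eps_p=1e-6, eps_c=1e-9, max_iter=100):
    """Check the three termination criteria listed above and report
    which one fired; the thresholds are hypothetical defaults."""
    if np.linalg.norm(dx) < eps_p:
        return True, "small parameter change"
    if J_prev > 0 and (J_prev - J_new) / J_prev < eps_c:
        return True, "small relative cost change"
    if i >= max_iter:
        # in practice, issue a warning when this is the stopping reason
        return True, "maximum number of iterations reached"
    return False, ""
```

Such a check would be called once per iteration in any of Algorithms 4.1 through 4.6 in place of the generic "until Converged" condition.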
Chapter 5

Dynamic Models

Most sensor fusion problems involve dynamically changing variables. For exam-
ple, in autonomous driving, other road users continuously appear, disappear, and
move in the traffic scene. In this case, we are interested in estimating dynamically
varying parameters. We refer to these parameters as states or the state, because
they describe the state a system is in (e.g., the location and velocity of a car, or
the position, attitude, and velocity of an unmanned aerial vehicle).
In such dynamic scenarios, we have to employ sensors that provide repeated
measurements in time, that is, they sample periodically. In between those samples,
the state of the system evolves according to some process and thus, the states
at different times are different from each other. However, since the evolution of
the state follows some dynamic process, they are related and thus, we can relate
the states at two times to each other. To do that, we need a principled way of
describing how the state evolves between samples. We find a suitable approach
in differential equations (for continuous-time processes) and difference equations
(for discrete-time processes), which can be used to describe dynamic processes.
Indeed, we will make use of differential equations to derive state-space models
that turn an Lth order differential (or difference) equation into a first order vector-
valued differential (or difference) equation.

5.1 Continuous-Time State-Space Models


5.1.1 Deterministic Linear State-Space Models
We start the derivation of the state-space formulation based on an example. Fig-
ure 5.1 shows a spring-damper system, a typical mechanical system. This system is
governed by Newton’s second law of motion, from which we can find the following
inhomogeneous ordinary differential equation (ODE):

ma(t) = −kp(t) − ηv(t) + u(t), (5.1)


Figure 5.1. Example of a mechanical dynamic system: a spring-damper system with
forcing function u(t). The positive direction of the motion p is defined to the right.

where p(t), v(t), and a(t) denote the position, velocity, and acceleration of the
mass m, respectively, k is the spring constant, and η is the damping constant.
Furthermore, the forcing function u(t) is a deterministic input to the system.
By introducing a second equation, v(t) = v(t), which trivially holds, and dividing
(5.1) by m, we obtain the equation system

    v(t) = v(t),    (5.2a)
    a(t) = −(k/m) p(t) − (η/m) v(t) + (1/m) u(t).    (5.2b)

This can be rewritten in matrix form, such that we obtain

    [v(t)]   [  0      1  ] [p(t)]   [ 0 ]
    [a(t)] = [−k/m   −η/m ] [v(t)] + [1/m] u(t).    (5.3)

Observe that in (5.3), the vector on the left hand side of the equation is the
derivative of the vector on the right hand side. Hence, we can define the vector

    x(t) = [p(t)]
           [v(t)]

and rewrite (5.3) as

    ẋ(t) = [  0      1  ] x(t) + [ 0 ] u(t),    (5.4)
           [−k/m   −η/m ]        [1/m]

where

    ẋ(t) = dx(t)/dt

denotes the time derivative.
The form (5.4) is the state-space representation of the dynamic system in Fig-
ure 5.1 and the vector x(t) is called the state vector (or simply state). The state
vector now encodes all the information about the state the system is in (position,
speed, etc.). Therefore, solving this first-order vector valued ODE representation
rather than the original differential equation in (5.1), in some sense, provides richer
information about the system.
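As an illustration, the state-space model (5.4) can be simulated numerically, for example with the forward Euler method (the parameter values, step size, and zero forcing below are hypothetical choices):

```python
import numpy as np

# Forward Euler simulation of the spring-damper model (5.4).
# The mass, spring, and damping values below are hypothetical.
m, k, eta = 1.0, 2.0, 0.5
A = np.array([[0.0, 1.0],
              [-k / m, -eta / m]])
Bu = np.array([0.0, 1.0 / m])

def simulate(x0, u, dt, n_steps):
    """Integrate x'(t) = A x(t) + B_u u(t) with the forward Euler method."""
    x = x0.copy()
    traj = [x0.copy()]
    for j in range(n_steps):
        x = x + dt * (A @ x + Bu * u(j * dt))
        traj.append(x.copy())
    return np.array(traj)

# Release the mass from position 1 with zero velocity and no forcing:
traj = simulate(np.array([1.0, 0.0]), lambda t: 0.0, dt=1e-3, n_steps=20000)
```

The trajectory contains both position and velocity at every time step, which illustrates the remark that the state vector carries richer information than the scalar solution of (5.1) alone.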

We can now generalize this approach. Consider the Lth order ODE of the form

    d^L x(t)/dt^L = a_0 x(t) + a_1 dx(t)/dt + · · · + a_{L−1} d^{L−1}x(t)/dt^{L−1} + b_1 u(t).    (5.5)

Then we can introduce L − 1 trivial equations of the form d^l x(t)/dt^l = d^l x(t)/dt^l to
obtain the equation system

    dx(t)/dt = dx(t)/dt,    (5.6a)
    d²x(t)/dt² = d²x(t)/dt²,    (5.6b)
        ⋮    (5.6c)
    d^L x(t)/dt^L = a_0 x(t) + a_1 dx(t)/dt + · · · + a_{L−1} d^{L−1}x(t)/dt^{L−1} + b_1 u(t).    (5.6d)

Rewriting (5.6) in vector form yields

    [ dx(t)/dt      ]   [ 0    1    0   ⋯    0    ] [ x(t)                 ]   [ 0 ]
    [ d²x(t)/dt²    ]   [ 0    0    1        ⋮    ] [ dx(t)/dt             ]   [ 0 ]
    [      ⋮        ] = [ ⋮              ⋱   0    ] [      ⋮               ] + [ ⋮ ] u(t).
    [ d^L x(t)/dt^L ]   [ a_0  a_1  ⋯   a_{L−1}  ] [ d^{L−1}x(t)/dt^{L−1} ]   [b_1]
This in turn, can be written compactly as the dynamic model

ẋ(t) = Ax(t) + Bu u(t), (5.7)

where

           [ x(t)                 ]       [ 0    1    0   ⋯    0    ]        [ 0 ]
           [ dx(t)/dt             ]       [ 0    0    1        ⋮    ]        [ 0 ]
    x(t) = [      ⋮               ], A =  [ ⋮              ⋱   0    ], B_u = [ ⋮ ], u(t) = u(t).
           [ d^{L−1}x(t)/dt^{L−1} ]       [ a_0  a_1  ⋯   a_{L−1}  ]        [b_1]
                                                                                (5.8)
While quite general, Equation (5.7) is a state-space representation for models
of the type in (5.5). However, we can also derive the same type of representation for
other dynamic models. For example, consider a freshly brewed hot cup of coffee.
Newton’s law of cooling then states that the change in temperature of the coffee is
proportional to the temperature difference between the coffee and its surrounding
(assuming that the surroundings are much larger than the coffee cup). This can be
formulated as a differential equation of the form
dTc (t)
= −k1 (Tc (t) − Tr (t)), (5.9)
dt
where Tc (t) denotes the coffee’s temperature, Tr (t) the room temperature, and k1
is a constant. The change in room temperature on the other hand depends on the

heat that is produced inside the room (e.g., due to heating) as well as the losses to
the outside. Again assuming that Newton’s law of cooling applies, we can write
the differential equation as

dTr (t)
= −k2 (Tr (t) − Ta (t)) + h(t), (5.10)
dt
where k2 is a constant, Ta (t) denotes the outside temperature and h(t) is the heating
input. Observe that (5.9)–(5.10) is a coupled system that can be written as the
equation system

    dT_r(t)/dt = −k_2 (T_r(t) − T_a(t)) + h(t),    (5.11a)
    dT_c(t)/dt = −k_1 (T_c(t) − T_r(t)),    (5.11b)

or equivalently in matrix form as

    [dT_r(t)/dt]   [−k_2    0 ] [T_r(t)]   [k_2  1] [T_a(t)]
    [dT_c(t)/dt] = [ k_1  −k_1] [T_c(t)] + [ 0   0] [ h(t) ].    (5.12)

This again has the form of (5.7), where

    x(t) = [T_r(t)],  A = [−k_2    0 ],  B_u = [k_2  1],  u(t) = [T_a(t)].    (5.13)
           [T_c(t)]       [ k_1  −k_1]        [ 0   0]          [ h(t) ]

Hence, the model (5.7) is a quite general dynamic model of the time-varying
behavior for many different processes. Recall that our original aim was to estimate
the dynamically changing state x(t). This requires that we measure x(t) in some
way, that is, we need to combine the dynamic model with a measurement model.
The simplest case is that of a linear measurement model of the form (3.10) where
we replace the static x with time varying state x(t). Since the measurements are
obtained at discrete time instances tn (i.e., at t1 , t2 , . . . ), the measurement model
becomes
yn = Gx(tn ) + rn , (5.14)
where rn denotes the measurement noise with covariance matrix Cov{rn } = Rn .
Defining x_n ≜ x(t_n) and combining the dynamic model (5.7) and the mea-
surement model (5.14) then yields the deterministic linear state-space model

    ẋ(t) = A x(t) + B_u u(t),    (5.15a)
    y_n = G x_n + r_n.    (5.15b)
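A model of the form (5.15) can be simulated by integrating the dynamics and sampling noisy measurements at discrete times. Here is a sketch based on the room/coffee model (5.12)-(5.13); all constants, the sampling scheme, and the measurement noise level are hypothetical:

```python
import numpy as np

# Euler simulation of the room/coffee model (5.12)-(5.13) with discrete,
# noisy measurements of the coffee temperature as in (5.15).
k1, k2 = 0.1, 0.05
A = np.array([[-k2, 0.0],
              [k1, -k1]])
Bu = np.array([[k2, 1.0],
               [0.0, 0.0]])
G = np.array([[0.0, 1.0]])          # measure the coffee temperature only
u = np.array([5.0, 0.0])            # outside temperature 5, no heating

rng = np.random.default_rng(0)
dt, n_steps = 0.01, 6000
x = np.array([20.0, 90.0])          # room at 20 degrees, coffee at 90
measurements = []
for j in range(n_steps):
    x = x + dt * (A @ x + Bu @ u)   # Euler step of the dynamic model (5.7)
    if (j + 1) % 1000 == 0:         # sample y_n = G x(t_n) + r_n every 10 time units
        measurements.append(G @ x + rng.normal(0.0, 0.1, size=1))
```

The continuous-time state evolves between the sampling instants, while the measurement model only fires at the discrete times t_n, exactly the structure assumed by (5.15).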

5.1.2 Stochastic Linear State-Space Models


Often, the behavior of dynamic systems is not completely deterministic, or the
input u(t) is not known. For example, when tracking an airplane using radar, we

Figure 5.2. Example of a white noise process w(t) with σ_w² = 2. (a) One realization of
the process, and (b) its power spectral density S_ww(ω) averaged over 100 realizations.

do not know the inputs made by the pilots. Additionally, variables such as wind
or pressure fields are not known either and it is difficult to model them accurately.
Instead, such random effects can be modeled by including a stochastic process as
an input to the differential equation (and hence to its state-space representation),
which turns an ODE into a stochastic differential equation (SDE; Øksendal, 2010;
Särkkä and Solin, 2019).
A simple class of stochastic processes is that of stationary stochastic processes, which
can be characterized by their autocorrelation functions. The autocorrelation func-
tion of a stationary stochastic process w(t) is

Rww (τ ) = E{w(t + τ )w(t)}.

An equivalent characterization is the power spectral density (which is the Fourier


transform of the autocorrelation function; Papoulis (1984)). A suitable assumption
in many models is that the stochastic process is a white noise process. In this case,
the autocorrelation function is
R_ww(τ) = σ_w² δ(τ),  (5.16)

where δ(τ ) denotes the Dirac delta function. Hence, the power spectral density is
a constant, that is,
S_ww(ω) = σ_w².  (5.17)
In other words, the white noise process contains equal contributions from each
frequency. One realization of such a process, together with its power spectral
density are shown in Figure 5.2.
The derivation of the state-space representation of the resulting SDE follows
the same steps as for the deterministic case, where the random process takes the
role of the input. Consider the Lth order SDE given by

d^L x(t)/dt^L = a0 x(t) + a1 dx(t)/dt + · · · + aL−1 d^{L−1}x(t)/dt^{L−1} + b1 w(t),  (5.18)
Figure 5.3. Illustration of a spacecraft orbiting the earth.

which is of the same form as (5.5) but with the stochastic process w(t) as the input.
Rewriting (5.18) into matrix form yields
\begin{bmatrix} dx(t)/dt \\ d^2 x(t)/dt^2 \\ \vdots \\ d^L x(t)/dt^L \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \ddots & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ a_0 & a_1 & \cdots & \cdots & a_{L-1} \end{bmatrix} \begin{bmatrix} x(t) \\ dx(t)/dt \\ \vdots \\ d^{L-1} x(t)/dt^{L-1} \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ b_1 \end{bmatrix} w(t).

Hence, this can also be written compactly in the form of a first order vector
valued SDE system
ẋ(t) = Ax(t) + Bw w(t).
Naturally, also coupled models (like the coffee-cup example) can be written in this
form. Together with a linear measurement equation, the stochastic linear state-
space model becomes

ẋ(t) = Ax(t) + Bw w(t), (5.19a)


yn = Gxn + rn . (5.19b)

Note that some dynamic models may contain both deterministic inputs u(t)
and stochastic inputs w(t). Naturally, both of these can be incorporated in the
model as well.

5.1.3 Nonlinear State-Space Models


So far, we have only discussed the state-space form of linear ODEs and SDEs.
However, many dynamic systems actually behave nonlinearly. Fortunately, we can
handle nonlinear models with a similar formalism.

We start our discussion again based on an example. Figure 5.3 shows an illus-
tration of a spacecraft orbiting the earth. The forces acting upon the craft are the
gravitational pull of the earth Fg and the propulsion force Fp . The position p(t) of
the craft is defined according to the coordinate system shown in Figure 5.3, with
the origin at the center of the earth. The gravitational acceleration of an object at a
distance |p(t)| from the earth’s center is approximately
g ≈ g0 (re/|p(t)|)²,

where g0 = 9.81 m/s2 is the gravitational acceleration on the earth’s surface and
re = 6371 km is the mean earth radius. The gravitational pull is in the opposite
direction of p(t) and hence, we can write it on vector form as
Fg = −m g0 (re/|p(t)|)² p(t)/|p(t)|
   = −m g0 re² p(t)/|p(t)|³.

The propulsion force is perpendicular to the direction of the position vector p(t). Hence, we can write

Fp = Fp (1/|p(t)|) \begin{bmatrix} -p^y(t) \\ p^x(t) \end{bmatrix}.
When tracking the spacecraft from the ground, for example using radar, the magni-
tude of the propulsion force Fp is unknown. But since we know that the engines are
only used to make small flight path corrections to conserve fuel, we can model this
as a stochastic process, that is, we can assume that Fp = w(t) with some spectral
density σ_w².

Using Newton’s second law it follows that the motion of the spacecraft is
governed by the vector-valued differential equation
m a(t) = −m g0 re² p(t)/|p(t)|³ + (1/|p(t)|) \begin{bmatrix} -p^y(t) \\ p^x(t) \end{bmatrix} w(t).  (5.20)

The highest order of the derivative here is the acceleration on the left hand
side. Hence, a suitable state vector includes the position (zeroth derivative) and the
speed (first derivative), that is,
x(t) = \begin{bmatrix} p^x(t) & p^y(t) & v^x(t) & v^y(t) \end{bmatrix}ᵀ.

However, when trying to rewrite (5.20) into the form of a state-space representa-
tion (5.19), we notice that this is not possible since the right hand side of (5.20) is
not linear in p(t). Nevertheless, we can still write it as a nonlinear equation system

in terms of x(t) as

\begin{bmatrix} \dot{p}^x(t) \\ \dot{p}^y(t) \\ \dot{v}^x(t) \\ \dot{v}^y(t) \end{bmatrix} = \begin{bmatrix} v^x(t) \\ v^y(t) \\ -g_0 r_e^2 \, p^x(t)/|p(t)|^3 \\ -g_0 r_e^2 \, p^y(t)/|p(t)|^3 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -p^y(t)/(m|p(t)|) \\ p^x(t)/(m|p(t)|) \end{bmatrix} w(t) = \begin{bmatrix} f_1(x(t)) \\ f_2(x(t)) \\ f_3(x(t)) \\ f_4(x(t)) \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -p^y(t)/(m|p(t)|) \\ p^x(t)/(m|p(t)|) \end{bmatrix} w(t),  (5.21)
where the fi (x(t)) are nonlinear functions of x(t).


This idea can be generalized to any arbitrary nonlinear dynamic system that is
described by a nonlinear ODE or SDE, including coupled systems. Given the state
vector x(t) = \begin{bmatrix} x_1(t) & x_2(t) & \cdots & x_{d_x}(t) \end{bmatrix}ᵀ of dimension dx, we can write the
ODE/SDE as a vector-valued, nonlinear equation system of the form
  
\begin{bmatrix} \dot{x}_1(t) \\ \dot{x}_2(t) \\ \vdots \\ \dot{x}_{d_x}(t) \end{bmatrix} = \begin{bmatrix} f_1(x(t)) \\ f_2(x(t)) \\ \vdots \\ f_{d_x}(x(t)) \end{bmatrix} + \begin{bmatrix} b_{11}(x(t)) & \cdots & b_{1 d_w}(x(t)) \\ b_{21}(x(t)) & & \vdots \\ \vdots & \ddots & \\ b_{d_x 1}(x(t)) & \cdots & b_{d_x d_w}(x(t)) \end{bmatrix} w(t).  (5.22)

More compactly, this can be written as

ẋ(t) = f (x(t)) + Bw (x(t))w(t), (5.23)

where f (x(t)) is a vector-valued nonlinear function of the state, and Bw (x(t)) is a


matrix that depends on the state x(t).
In order to estimate the state x(t) based on observations yn, we again need to relate the measurements to the state. This is achieved by combining the dynamic model (5.23) with a measurement model as introduced in Chapters 3–4. The most general model arises if we choose a nonlinear observation model of the form (4.1) where the state xn ≜ x(tn) at time tn takes the place of the unknown x. This
yields the general nonlinear state-space model

ẋ(t) = f (x(t)) + Bw (x(t))w(t), (5.24a)


yn = g(xn ) + rn , (5.24b)

where f (x(t)) is the nonlinear state transition function, w(t) is the driving stochas-
tic process, g(xn ) is the measurement model, and rn is the measurement noise.

5.2 Discrete-Time State-Space Models


5.2.1 Deterministic Linear State-Space Models
As discussed in the previous section, ODEs and SDEs are a natural way for mod-
eling the behavior of dynamic systems. Transforming them to state-space form
furthermore provides a way of describing how the state of a system evolves over
time.
However, not all dynamic systems can be modeled by using differential equa-
tions. For some systems, the state is only defined at some discrete points in time
t1 , t2 , . . . , that is, only in discrete-time. In this case, another approach than dif-
ferential equations is needed to model the dynamic behavior. The discrete-time
equivalent to differential equations are difference equations that describe the rela-
tionship between the value of a variable at time tn to its previous values at times
tn−1 , tn−2 , . . . .
The process of obtaining a state-space representation from difference
equations is closely related to the approach used for differential equations.
Consider a dx th order equation system with discrete-time, deterministic inputs u1,n ≜ u1(tn), u2,n ≜ u2(tn), . . . , udu,n ≜ udu(tn) of the form

x1,n = a11 x1,n−1 + · · · + a1dx xdx ,n−1 + b11 u1,n + · · · + b1du udu ,n
x2,n = a21 x1,n−1 + · · · + a2dx xdx ,n−1 + b21 u1,n + · · · + b2du udu ,n
..
.
xdx ,n = adx 1 x1,n−1 + · · · + adx dx xdx ,n−1 + bdx 1 u1,n + · · · + bdx du udu ,n

where xi,n ≜ xi(tn) is the ith variable at time tn. This can be rewritten in matrix form according to

\begin{bmatrix} x_{1,n} \\ \vdots \\ x_{d_x,n} \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1 d_x} \\ \vdots & \ddots & \vdots \\ a_{d_x 1} & \cdots & a_{d_x d_x} \end{bmatrix} \begin{bmatrix} x_{1,n-1} \\ \vdots \\ x_{d_x,n-1} \end{bmatrix} + \begin{bmatrix} b_{11} & \cdots & b_{1 d_u} \\ \vdots & \ddots & \vdots \\ b_{d_x 1} & \cdots & b_{d_x d_u} \end{bmatrix} \begin{bmatrix} u_{1,n} \\ \vdots \\ u_{d_u,n} \end{bmatrix}.

More compactly, this can also be written as

xn = Fxn−1 + Bu un , (5.25)

with
       
x_n = \begin{bmatrix} x_{1,n} \\ \vdots \\ x_{d_x,n} \end{bmatrix}, \quad F = \begin{bmatrix} a_{11} & \cdots & a_{1 d_x} \\ \vdots & \ddots & \vdots \\ a_{d_x 1} & \cdots & a_{d_x d_x} \end{bmatrix}, \quad B_u = \begin{bmatrix} b_{11} & \cdots & b_{1 d_u} \\ \vdots & \ddots & \vdots \\ b_{d_x 1} & \cdots & b_{d_x d_u} \end{bmatrix}, \quad u_n = \begin{bmatrix} u_{1,n} \\ \vdots \\ u_{d_u,n} \end{bmatrix}.

Equation (5.25) is the general dynamic model for linear discrete-time systems with
deterministic inputs un .

To convert the Lth order difference equation (with single input un )

zn = c1 zn−1 + c2 zn−2 + · · · + cL zn−L + d1 un (5.26)

to the form (5.25), we proceed as follows. First, we have to choose the state variables xn. This is most easily done at time n − 1 and, for the model (5.26), we can choose

x1,n−1 = zn−1 , x2,n−1 = zn−2 , . . . , xdx ,n−1 = zn−L .

Then we obtain that

x1,n = zn = c1 zn−1 + c2 zn−2 + · · · + cL zn−L + d1 un


= c1 x1,n−1 + c2 x2,n−1 + · · · + cL xdx ,n−1 + d1 un ,
x2,n = zn−1
= x1,n−1 ,
..
.
xdx ,n = zn−L+1
= xdx −1,n−1 ,

where we have expressed the state at time n solely in terms of the state at previous
times as well as the input un. Rewriting this in matrix form finally yields

\begin{bmatrix} x_{1,n} \\ x_{2,n} \\ \vdots \\ x_{d_x,n} \end{bmatrix} = \begin{bmatrix} c_1 & c_2 & \cdots & c_L \\ 1 & 0 & \cdots & 0 \\ & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} x_{1,n-1} \\ x_{2,n-1} \\ \vdots \\ x_{d_x,n-1} \end{bmatrix} + \begin{bmatrix} d_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} u_n,

which is of the form (5.25).
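The construction above can be sketched in a few lines of code. The function below builds the companion-form matrices of (5.25) from the coefficients of (5.26); the second-order model and its coefficient values are made up for illustration.

```python
# Sketch: companion-form matrices of (5.25) for the L-th order difference
# equation (5.26); the example coefficients are hypothetical.
import numpy as np

def companion_form(c, d1):
    """Return F and Bu such that x_n = F x_{n-1} + Bu u_n realizes (5.26)."""
    L = len(c)
    F = np.zeros((L, L))
    F[0, :] = c                  # first row carries c_1, ..., c_L
    F[1:, :-1] = np.eye(L - 1)   # shifted identity: x_{i,n} = x_{i-1,n-1}
    Bu = np.zeros((L, 1))
    Bu[0, 0] = d1                # only z_n is driven directly by the input
    return F, Bu

# Hypothetical second-order model: z_n = 1.5 z_{n-1} - 0.7 z_{n-2} + 0.5 u_n
F, Bu = companion_form([1.5, -0.7], 0.5)

# Cross-check the state-space recursion against the scalar recursion
x = np.zeros((2, 1))             # state [z_{n-1}; z_{n-2}], initially zero
z_prev, z_prev2 = 0.0, 0.0
for u in [1.0, 0.0, 0.0, 0.0]:
    x = F @ x + Bu * u           # state-space recursion (5.25)
    z_new = 1.5 * z_prev - 0.7 * z_prev2 + 0.5 * u
    z_prev2, z_prev = z_prev, z_new
    assert np.isclose(x[0, 0], z_new)
```

The assertion confirms that the first state component reproduces zn at every step.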


Equation (5.25) only models the dynamic behavior of the state. As for the con-
tinuous case, only noisy measurements of the dynamic process are available for
estimation. Combining the dynamic model with a measurement model thus again
yields the complete discrete-time state-space model. Assuming a linear measure-
ment model of the form (3.10) then yields the linear discrete-time model

xn = Fxn−1 + Bu un , (5.27a)
yn = Gxn + rn . (5.27b)

5.2.2 Stochastic Linear Dynamic Models


As for the continuous case, the deterministic discrete-time dynamic model cannot
take into account random effects, uncertainty, and unknown inputs. To account
for such effects, we need an approach that is similar to the random process input

in the continuous case. For the discrete-time case, we can model this using a
random variable as the input to the dynamic model (compared to a stochastic
process for the continuous case). We denote this random input, commonly called
the process noise, as qn . Then, replacing the deterministic input un with qn yields
the stochastic discrete-time dynamic model
xn = Fxn−1 + Bq qn . (5.28)
The random variable qn follows some probability density function such that
qn ∼ p(qn ).
Here, we assume that qn is zero-mean, that is E{qn } = 0, and that it has co-
variance Cov{qn } = Qn . Furthermore, we also assume that the cross-covariance
between two random inputs qm and qn is Cov{qm, qn} = 0 for m ≠ n.
Together with a linear sensor model, we then obtain the stochastic linear
discrete-time state-space model given by
xn = Fxn−1 + Bq qn , (5.29a)
yn = Gxn + rn . (5.29b)

5.2.3 Nonlinear Dynamic Model


Similar to differential equations, difference equations may be nonlinear. In this
case, we start from the nonlinear equation system with the random variables
q1,n , q2,n , . . . , qdq ,n as the inputs given by
x1,n = f1 (x1,n−1 , x2,n−1 , . . . , xdx ,n−1 ) + b11 q1,n + · · · + b1dq qdq ,n ,
x2,n = f2 (x1,n−1 , x2,n−1 , . . . , xdx ,n−1 ) + b21 q1,n + · · · + b2dq qdq ,n ,
..
.
xdx ,n = fdx (x1,n−1 , x2,n−1 , . . . , xdx ,n−1 ) + bdx 1 q1,n + · · · + bdx dq qdq ,n .
This can directly be rewritten into vector form, which yields
      
\begin{bmatrix} x_{1,n} \\ \vdots \\ x_{d_x,n} \end{bmatrix} = \begin{bmatrix} f_1(x_{1,n-1}, x_{2,n-1}, \ldots, x_{d_x,n-1}) \\ \vdots \\ f_{d_x}(x_{1,n-1}, x_{2,n-1}, \ldots, x_{d_x,n-1}) \end{bmatrix} + \begin{bmatrix} b_{11} & \cdots & b_{1 d_q} \\ \vdots & \ddots & \vdots \\ b_{d_x 1} & \cdots & b_{d_x d_q} \end{bmatrix} \begin{bmatrix} q_{1,n} \\ \vdots \\ q_{d_q,n} \end{bmatrix},
or more compactly
xn = f (xn−1 ) + Bq qn . (5.30)
The nonlinear dynamic model (5.30) together with the general nonlinear mea-
surement model (4.1) yields the nonlinear discrete-time state-space model
xn = f (xn−1 ) + Bq qn , (5.31a)
yn = g(xn ) + rn , (5.31b)
where qn ∼ p(qn ), E{qn } = 0, Cov{qn } = Qn , and rn ∼ p(rn ), E{rn } = 0,
Cov{rn } = Rn .

5.3 Discretization of Linear Dynamic Models


In practice, modern sensor fusion systems are implemented in a digital computer.
Hence, dealing with continuous-time dynamic models, which are defined in continuous time t, is not possible. Instead, we have to discretize the continuous-time model to obtain
an (approximately) equivalent discrete-time dynamic model. In other words, we
have to solve the differential equation on the interval between times tn−1 and tn ,
that is, we have to integrate the differential equation. For linear dynamic models,
this can be achieved exactly and we develop the necessary tools in this section. For
most nonlinear models, exact discretization is not feasible and we have to resort to
approximate methods. This is the subject of Section 5.4.

5.3.1 Deterministic Linear Dynamic Models


Scalar Model. To derive the necessary expressions for discretizing the linear
dynamic model, we first review how to solve a first order scalar case. Consider
the non-homogeneous first order ODE

ẋ(t) = ax(t) + bu(t), (5.32)

which we want to solve on the interval between the two sampling points tn and
tn−1 . To do this, we introduce the integrating factor e−at and note that it holds
that

d/dt [e^{−at} x(t)] = e^{−at} ẋ(t) − e^{−at} a x(t).  (5.33)
Rearranging (5.32) by moving the term ax(t) to the left hand side and multiplying
by the integrating factor yields

e−at ẋ(t) − e−at ax(t) = e−at bu(t)

and thus, using (5.33), we have that


d/dt [e^{−at} x(t)] = e^{−at} b u(t).  (5.34)
Equation (5.34) is separable in t. Hence, the ODE can directly be integrated
from tn−1 to tn according to
∫_{tn−1}^{tn} d[e^{−at} x(t)] = ∫_{tn−1}^{tn} e^{−at} b u(t) dt,

which yields

[e^{−at} x(t)]_{t=tn−1}^{tn} = ∫_{tn−1}^{tn} e^{−at} b u(t) dt,
e^{−a tn} x(tn) − e^{−a tn−1} x(tn−1) = ∫_{tn−1}^{tn} e^{−at} b u(t) dt.

Now we can solve for x(tn) by rearranging and multiplying by the factor e^{a tn}, which yields

x(tn) = e^{a tn} e^{−a tn−1} x(tn−1) + e^{a tn} ∫_{tn−1}^{tn} e^{−at} b u(t) dt
      = e^{a(tn − tn−1)} x(tn−1) + ∫_{tn−1}^{tn} e^{a(tn − t)} b u(t) dt.

Finally, letting ∆t ≜ tn − tn−1 and using the notation xn ≜ x(tn), we obtain

xn = e^{a∆t} xn−1 + ∫_{tn−1}^{tn} e^{a(tn − t)} b u(t) dt.  (5.35)

Note that the second term on the right hand side of (5.35) is the convolution be-
tween the factor ea(tn −t) and the input u(t) on the interval between tn−1 and tn .

General Case. Based on the steps used for solving the scalar ODE (5.32), the
general case can now be solved. First, recall that the general dynamic model on
state-space form was given by the vector valued first order ODE system (5.15), that
is,
ẋ(t) = Ax(t) + Bu u(t). (5.36)
The key here is to note that (5.36) is a first order vector ODE just like (5.32).
Hence, we should be able to solve it using an integrating factor similar to the one
for the scalar case. Indeed, such an integrating factor can be found through the
matrix exponential. The matrix exponential for a square matrix A can be defined
in the same way as the regular exponential, that is, as an infinite sum of the form

e^A = Σ_{k=0}^{∞} (1/k!) Aᵏ.

Similar to the (scalar) exponential, it holds that the derivative with respect to a
scalar x of eAx is
d/dx e^{Ax} = e^{Ax} A.  (5.37)
Another useful property of the matrix exponential is that of its transpose, which is
given by
(e^A)ᵀ = e^{Aᵀ}.  (5.38)
With this in mind, we can solve the general case following the same steps as
for the scalar case. First, we multiply (5.36) by the integrating factor e−At and
rearrange it to obtain

e−At ẋ(t) − e−At Ax(t) = e−At Bu u(t)



Next, similar to (5.33) (product rule), the left hand side is

d/dt [e^{−At} x(t)] = e^{−At} ẋ(t) − e^{−At} A x(t),

and thus, we have that

d/dt [e^{−At} x(t)] = e^{−At} Bu u(t).
This is again separable in t. Direct integration from tn−1 to tn then leads to
∫_{tn−1}^{tn} d[e^{−At} x(t)] = ∫_{tn−1}^{tn} e^{−At} Bu u(t) dt,
[e^{−At} x(t)]_{t=tn−1}^{tn} = ∫_{tn−1}^{tn} e^{−At} Bu u(t) dt,
e^{−A tn} x(tn) − e^{−A tn−1} x(tn−1) = ∫_{tn−1}^{tn} e^{−At} Bu u(t) dt.

Solving for x(tn) by rearranging and multiplying both sides by e^{A tn} yields

x(tn) = e^{A(tn − tn−1)} x(tn−1) + ∫_{tn−1}^{tn} e^{A(tn − t)} Bu u(t) dt,

or

xn = e^{A∆t} xn−1 + ∫_{tn−1}^{tn} e^{A(tn − t)} Bu u(t) dt.  (5.39)

In this case, the factor in front of xn−1 (i.e., the state at time tn−1 ) is a matrix
exponential that again depends on the time difference ∆t between the two discrete
times tn−1 and tn . Note that ∆t can be arbitrary, meaning that there is no need for
uniform sampling between points tn−2 , tn−1 , tn , and so forth. Furthermore, the
second term in (5.39) is again the convolution between the deterministic input u(t)
and the matrix exponential on the same interval.
If the input u(t) is a zeroth-order-hold (ZOH) signal, which is common in
control or robot navigation, the second term can further be simplified. In that case,
u(t) is constant on the interval between tn−1 and tn and we can denote it as un−1 .
Thus, we then have that
∫_{tn−1}^{tn} e^{A(tn − t)} Bu u(t) dt = ∫_{tn−1}^{tn} e^{A(tn − t)} Bu un−1 dt = (∫_{tn−1}^{tn} e^{A(tn − t)} Bu dt) un−1,

where Bu may or may not depend on t.



Defining

Fn ≜ e^{A(tn − tn−1)},  (5.40a)
Ln ≜ ∫_{tn−1}^{tn} e^{A(tn − t)} Bu dt,  (5.40b)

we can rewrite (5.39) as

xn = Fn xn−1 + Ln un−1.  (5.41)

Equation (5.41) is an exact solution to (5.36) and thus it is an exact discretization and equivalent representation of the continuous-time model. Furthermore, it also has the same form as the discrete-time state-space model (5.27).
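For a ZOH input, (5.40) can be evaluated numerically with the matrix exponential. The sketch below uses the common augmented-matrix trick (stacking A and Bu into one block matrix before exponentiation, so that one call yields both Fn and Ln); the trick and the double-integrator example are standard but not part of the notes, and the step size is made up.

```python
# Sketch: exact ZOH discretization (5.40)-(5.41) via the matrix exponential.
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, Bu, dt):
    """Return Fn = e^{A dt} and Ln = integral of e^{A s} Bu over [0, dt]."""
    dx, du = A.shape[0], Bu.shape[1]
    # expm of the augmented matrix [[A, Bu], [0, 0]] contains Fn in the
    # upper-left block and Ln in the upper-right block.
    M = np.zeros((dx + du, dx + du))
    M[:dx, :dx] = A
    M[:dx, dx:] = Bu
    Phi = expm(M * dt)
    return Phi[:dx, :dx], Phi[:dx, dx:]

# Double integrator: position and velocity driven by an acceleration input
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
Bu = np.array([[0.0],
               [1.0]])
F, L = discretize_zoh(A, Bu, dt=0.5)
# For this model, F = [[1, dt], [0, 1]] and L = [dt^2/2, dt]^T.
```

The closed-form result for the double integrator provides a convenient check of the implementation.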

5.3.2 Stochastic Linear Dynamic Models


Discretization of the stochastic linear dynamic model in (5.19) and given by

ẋ(t) = Ax(t) + Bw w(t), (5.42)

with Rww (τ ) = E{w(t + τ )w(t)} = Σw δ(τ ), follows the same steps as the
discretization of the deterministic model. This yields
x(tn) = e^{A(tn − tn−1)} x(tn−1) + ∫_{tn−1}^{tn} e^{A(tn − t)} Bw w(t) dt.

Now note that special attention has to be paid to the second term on the right hand
side. The noise process w(t) does not have the properties required for regular
integration rules to apply (Øksendal, 2010; Särkkä and Solin, 2019).
However, we can define a random variable qn as
qn ≜ ∫_{tn−1}^{tn} e^{A(tn − t)} Bw w(t) dt.

Next, we can calculate the properties of qn . The mean is given by


(Z )
tn
E{qn } = E eA(tn −t) Bw w(t)dt
tn−1
Z tn n o
= E eA(tn −t) Bw w(t) dt
tn−1
Ztn
= eA(tn −t) Bw E {w(t)} dt
tn−1
= 0.

Similarly, we can calculate the covariance matrix

Cov{qn} = E{(qn − E{qn})(qn − E{qn})ᵀ}
        = E{qn qnᵀ}
        = E{(∫_{tn−1}^{tn} e^{A(tn − t)} Bw w(t) dt)(∫_{tn−1}^{tn} e^{A(tn − τ)} Bw w(τ) dτ)ᵀ}
        = ∫_{tn−1}^{tn} ∫_{tn−1}^{tn} e^{A(tn − t)} Bw E{w(t) w(τ)ᵀ} Bwᵀ (e^{A(tn − τ)})ᵀ dτ dt
        = ∫_{tn−1}^{tn} ∫_{tn−1}^{tn} e^{A(tn − t)} Bw Σw δ(t − τ) Bwᵀ (e^{A(tn − τ)})ᵀ dτ dt
        = ∫_{tn−1}^{tn} e^{A(tn − τ)} Bw Σw Bwᵀ e^{Aᵀ(tn − τ)} dτ
        ≜ Qn.

Hence, the mean and covariance of the random variable qn given that w(t) is a
zero-mean white noise process with autocorrelation function Rww (τ ) = Σw δ(τ )
are given by

E{qn} = 0,  (5.43a)
Cov{qn} = ∫_{tn−1}^{tn} e^{A(tn − τ)} Bw Σw Bwᵀ e^{Aᵀ(tn − τ)} dτ ≜ Qn.  (5.43b)

Furthermore, it turns out that in this case, the process noise qn is actually a Gaussian random variable, that is, it holds that

qn ∼ N (0, Qn ). (5.44)

In summary, the discretized stochastic dynamic model (5.42) is

xn = Fn xn−1 + qn (5.45)

where

Fn ≜ e^{A(tn − tn−1)},  (5.46a)
qn ∼ N(0, Qn),  (5.46b)

with Qn as in (5.43). Again, the discretized model (5.45)–(5.46) is an exact discretization of the continuous-time model (5.42) and thus completely equivalent.
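The integral (5.43b) rarely needs to be evaluated by hand. A common numerical recipe is Van Loan's matrix-exponential method, sketched below for a constant-velocity model; Van Loan's method is not covered in these notes, and the parameter values are made up for illustration.

```python
# Sketch: evaluating Qn of (5.43b) numerically with Van Loan's method.
import numpy as np
from scipy.linalg import expm

def discretize_process_noise(A, Bw, Sigma_w, dt):
    """Return Fn = e^{A dt} and Qn = int_0^dt e^{A s} Bw Sw Bw^T e^{A^T s} ds."""
    dx = A.shape[0]
    # Van Loan: exponentiate the block matrix [[-A, Bw Sw Bw^T], [0, A^T]].
    M = np.zeros((2 * dx, 2 * dx))
    M[:dx, :dx] = -A
    M[:dx, dx:] = Bw @ Sigma_w @ Bw.T
    M[dx:, dx:] = A.T
    Phi = expm(M * dt)
    F = Phi[dx:, dx:].T          # lower-right block is e^{A^T dt}
    Q = F @ Phi[:dx, dx:]        # Qn = F times the upper-right block
    return F, Q

# Constant-velocity model driven by white-noise acceleration
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
Bw = np.array([[0.0],
               [1.0]])
Sigma_w = np.array([[2.0]])      # spectral density of w(t), made up
F, Q = discretize_process_noise(A, Bw, Sigma_w, dt=0.1)
# Closed form for this model: Q = Sigma_w * [[dt^3/3, dt^2/2], [dt^2/2, dt]].
```

For this particular model, (5.43b) has a well-known closed form, which makes it a handy test case for the routine.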

5.4 Discretization of Nonlinear Dynamic Models


Unlike the linear models discussed in Section 5.3, discretization of nonlinear dy-
namic models is generally harder. In particular, closed-form solutions to the dis-
cretization problem can only be found in a few special cases. In most other cases,
approximations have to be used. In this section three approaches for approximate
discretization are discussed.

5.4.1 Discretization of Linearized Nonlinear Models


The first approach is based on a Taylor series expansion of the nonlinear function.
Truncating the Taylor series expansion of the nonlinear function f (x(t)) around
the state xn−1 at the linear term yields the following approximation of the (deter-
ministic) nonlinear dynamic model:
ẋ(t) ≈ f (xn−1 ) + Ax (x(t) − xn−1 ) + Bu u(t),
where Ax is the Jacobian of f (x(t)) with respect to x(t), evaluated at x(t) = xn−1 .
Rearranging the terms on the right hand side yields
ẋ(t) ≈ Ax x(t) + f (xn−1 ) − Ax xn−1 + Bu u(t),
which is linear in x(t), and f (xn−1 ) and Ax xn−1 are constants. Thus, we can
use the same approach as for linear models in Section 5.3, that is, we can use the
integrating factor e−Ax t . This yields
xn ≈ e^{Ax ∆t} xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt f(xn−1) − ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt Ax xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} Bu u(t) dt.

Noting that the third term is the integral of the derivative of the matrix exponential
with respect to t (see (5.37)), we have that
xn ≈ e^{Ax ∆t} xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt f(xn−1) + [e^{Ax(tn − t)}]_{t=tn−1}^{tn} xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} Bu u(t) dt
   = e^{Ax ∆t} xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt f(xn−1) + (I − e^{Ax(tn − tn−1)}) xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} Bu u(t) dt,

and finally,

xn ≈ xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt f(xn−1) + ∫_{tn−1}^{tn} e^{Ax(tn − t)} Bu u(t) dt.  (5.47)

As usual, the drawback of this Taylor-series-based solution is that the linearization is only local, and large ∆t as well as highly nonlinear functions may be problematic. On the other hand, as can be seen from (5.47), the approximation only
affects the deterministic term of the model, whereas the input is calculated in the
same way as for the linear model (but with the Jacobian Ax ). Hence, for the
stochastic dynamic model, the result immediately follows from (5.47) to be
xn ≈ xn−1 + ∫_{tn−1}^{tn} e^{Ax(tn − t)} dt f(xn−1) + qn,  (5.48)

with

qn ∼ N(0, Qn),
Qn ≈ ∫_{tn−1}^{tn} e^{Ax(tn − τ)} Bw Σw Bwᵀ e^{Axᵀ(tn − τ)} dτ.

5.4.2 Euler Discretization of Deterministic Nonlinear Models


The second approach is based on the so-called Euler approximation. We know that
the solution for xn can be written as
xn = xn−1 + ∫_{tn−1}^{tn} f(x(t)) dt + ∫_{tn−1}^{tn} Bu u(t) dt,

and the problem is to calculate the integrals in the second and third terms on
the right hand side. However, for sufficiently small ∆t = tn − tn−1, we can approximate these integrals using the rectangle approximation, that is,

∫_{tn−1}^{tn} f(x(t)) dt ≈ f(xn−1)(tn − tn−1) = ∆t f(xn−1),
and similarly for the second integral with the input u(t).
This then yields the Euler discretization of the deterministic, nonlinear dynamic model given by

xn ≈ xn−1 + ∆t f(xn−1) + ∆t Bu un−1.  (5.49)
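A minimal sketch of (5.49) is given below, applied to the made-up scalar ODE ẋ = −x³ (no input), whose exact solution x(t) = x₀/√(1 + 2x₀²t) is known in closed form and serves as a reference.

```python
# Sketch: Euler discretization (5.49) of the hypothetical scalar nonlinear
# ODE xdot = -x^3, compared against its closed-form solution.
import numpy as np

def f(x):
    return -x**3                 # drift function f(x)

dt = 1e-4                        # small step; the Euler error is O(dt)
x = 1.0                          # initial state x(0)
for _ in range(int(1.0 / dt)):   # integrate until t = 1
    x = x + dt * f(x)            # x_n = x_{n-1} + dt f(x_{n-1}), cf. (5.49)

x_exact = 1.0 / np.sqrt(3.0)     # x(t) = x0 / sqrt(1 + 2 x0^2 t) at t = 1
# The Euler iterate approaches the exact solution as dt -> 0.
```

Halving dt roughly halves the final error, consistent with the first-order accuracy of the scheme.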

5.4.3 Euler–Maruyama Discretization of Stochastic Nonlinear Models
The Euler discretization is quite simple to implement. However, it does not readily
work for stochastic dynamic models. In this case, we are interested in solving the
integral equation
xn = xn−1 + ∫_{tn−1}^{tn} f(x(t)) dt + ∫_{tn−1}^{tn} Bw w(t) dt.

We can still use the rectangle approximation for the integral involving the deterministic part (second term on the right hand side), but for the stochastic part, we are again lacking the required integration rules. Thus, we cannot use the rectangle approximation in this case. However, we can again define the resulting random variable as

qn ≜ ∫_{tn−1}^{tn} Bw w(t) dt
and investigate its properties instead.
First, the mean is given by
E{qn} = E{∫_{tn−1}^{tn} Bw w(t) dt}
      = ∫_{tn−1}^{tn} Bw E{w(t)} dt
      = 0,
where we have made use of the assumption that w(t) is a zero-mean process.
Second, the covariance can be found similar to the linear case as follows.

Cov{qn} = E{(∫_{tn−1}^{tn} Bw w(t) dt)(∫_{tn−1}^{tn} Bw w(τ) dτ)ᵀ}
        = ∫_{tn−1}^{tn} ∫_{tn−1}^{tn} Bw E{w(t) w(τ)ᵀ} Bwᵀ dτ dt
        = ∫_{tn−1}^{tn} ∫_{tn−1}^{tn} Bw Σw δ(t − τ) Bwᵀ dτ dt
        = ∫_{tn−1}^{tn} Bw Σw Bwᵀ dτ.
At this point, note that the remaining integral does not involve any random process
anymore. Hence, we can again use the rectangle approximation of the integral,
which yields
Cov{qn} ≈ (Bw Σw Bwᵀ)(tn − tn−1) = ∆t Bw Σw Bwᵀ ≜ Qn.
This yields the Euler–Maruyama discretization of the stochastic nonlinear dynamic model given by

xn = xn−1 + ∆t f(xn−1) + qn,  (5.50)

with

qn ∼ N(0, Qn),  Qn = ∆t Bw Σw Bwᵀ.
Chapter 6

Filtering

In Chapter 5, we developed state-space models that describe the dynamic behavior


of time-varying processes and how the sensor measurements relate to the state. Unfortunately, the estimation methods developed earlier do not work for such dynamic
problems: These solve the estimation problem in the static case, but not when the
parameters (i.e., the states) vary between samples. Instead, new methods are re-
quired which are able to combine the prior information from the dynamic model
and the measurements at different points in time. This methodology is called filter-
ing and the general approach is discussed in Section 6.1, which is followed by the
exact solution of the filtering problem for linear state-space models in Section 6.2.
Approximate filtering methods for nonlinear systems are discussed in Sections 6.3–6.6.

6.1 The Filtering Approach


The basic discussion of the filtering approach is based on the general discrete-time
state-space model of the form

xn = f (xn−1 ) + qn , (6.1a)
yn = g(xn ) + rn , (6.1b)

where the process and measurement noises are zero-mean (E{qn } = E{rn } = 0)
with covariances Cov{qn } = Qn and Cov{rn } = Rn , respectively. Note that the
dynamic model (6.1a) may be an inherently discrete-time model (Section 5.2) or
the result of discretizing a continuous-time dynamic model.
An aspect of the state-space model that has been neglected so far are the initial
conditions. Since the dynamic model (6.1a) is based on a differential (difference)
equation, it also requires knowledge of the initial conditions of the system. Natu-
rally, in the sensor fusion and estimation context, the initial conditions are unknown
(e.g., it is not known where exactly a target is located in the beginning). Instead, it
is suitable to specify the initial conditions in a probabilistic way. In other words,


we assume that the state at t0, denoted as x0 ≜ x(t0), is a random variable which


follows a probability density function p(x0 ), that is,

x0 ∼ p(x0 ).

This allows us to specify the prior knowledge about where the initial state may be
but also take into account that this is an uncertain guess. In practice, the distribution
of the initial state is often assumed to follow a Gaussian distribution, but this does
not need to be the case. Here, we only make the assumptions that the mean and
covariance of the initial state are given by

E{x0 } = m0 , (6.2a)
Cov{x0 } = P0 . (6.2b)

Given the state-space model defined by (6.1) and (6.2), the filtering approach
can now be formulated. In particular, filtering iterates between 1) a prediction
step, which predicts the current state using the dynamic model and the previous
estimate of the state (also called time update), and 2) a measurement update step
that estimates the current state using the prediction and the new measurement. This
prediction–update strategy is actually very general and, as will be shown, all the
algorithms discussed in this chapter make use of these two steps.

Prediction. The aim of the prediction step is to estimate the current state xn at tn
given the set of all the previous measurement data y1:n−1 = {y1 , y2 , . . . , yn−1 },
and hence, given the estimate of the state xn−1 at tn−1 . Here, the mean of the
prediction is denoted as x̂n|n−1 , which is read as “the estimate of the state at tn
given the data up to tn−1 ”. The hat indicates that it is an estimate, whereas
the subscript is closely related to the notation for conditional expectations and
densities, that is, it is read as “n given n − 1”. Similarly, the covariance of the
predicted xn is denoted as Pn|n−1 .

Measurement Update. In the measurement update, the newly obtained measure-


ment yn is used together with the prediction (and its uncertainty) to estimate the
current value of the state xn . Similar to the prediction, the mean of the updated
state estimate at time tn is denoted as x̂n|n , which is read as “the estimate of the
state at tn given the data up to tn ”. In other words, this estimate is now based
on the set of all measurements including the latest measurement at tn given by
y1:n = {y1 , y2 , . . . , yn }. Again, the covariance of the updated estimate is denoted
as Pn|n .

6.2 Kalman Filter


In this section, filtering for linear state-space models is considered. Recall that the
linear state-space model with continuous-time dynamic model is (see Section 5.1)

ẋ(t) = Ax(t) + Bw w(t), (6.3a)


yn = Gn x(tn ) + rn . (6.3b)

Together with the initial distribution (6.2), the model (6.3) has the discrete-time
equivalent

xn = Fn xn−1 + qn (6.4a)
yn = Gn xn + rn (6.4b)

with

E{x0 } = m0 , Cov{x0 } = P0 , (6.5a)


E{qn } = 0, Cov{qn } = Qn , (6.5b)
E{rn } = 0, Cov{rn } = Rn . (6.5c)

Based on (6.4)–(6.5), the prediction and update steps can now be derived. For ease
of presentation, the measurement update step is presented first, followed by the
prediction.

Measurement Update. For the measurement update at time tn , assume that there
exists a prediction of the mean of the state x̂n|n−1 and its covariance Pn|n−1 . This
can be seen as the prior knowledge of the state at tn . Then, the new measurement
yn provides the actual (noisy) information about the true state. Hence, the ob-
jective is to minimize the error with respect to the measurement, taking the prior
information (prediction) into account.
Recall that we have based our estimators mainly on cost functions in general
and quadratic cost functions (least squares) in particular. A cost function that is
particularly well suited for this kind of problem is the regularized least squares
cost function. Hence, for the filtering problem with the linear measurement model,
the cost function becomes

JReLS(xn) = (yn − Gn xn)ᵀ Rn⁻¹ (yn − Gn xn) + (xn − x̂n|n−1)ᵀ Pn|n−1⁻¹ (xn − x̂n|n−1),  (6.6)

where the first term accounts for the measurement yn and the second term, the
regularization term, accounts for the prior information from the prediction.
To estimate the state, we can now solve the regularized least squares problem

x̂n|n = argmin_{xn} JReLS(xn).  (6.7)

Since the cost function is completely linear in the unknown xn , solving (6.7) fol-
lows the same steps as solving the linear least squares problems in Chapter 3 and
we can find a closed-form solution.
Fortunately, we have already solved this problem for the static case. The solu-
tion is given by (see Section 3.3)

x̂n|n = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 ) (6.8)

where Kn is called the Kalman gain that is given by


Kn = Pn|n−1 Gnᵀ (Gn Pn|n−1 Gnᵀ + Rn)⁻¹.  (6.9)

Furthermore, we have also shown that the covariance of x̂n|n is

Pn|n = Pn|n−1 − Kn (Gn Pn|n−1 Gnᵀ + Rn) Knᵀ.  (6.10)

An important property of the measurement update step is that the updated


estimate and its covariance actually are the mean and covariance of the state given
the set of all measurements y1:n = {y1 , y2 , . . . , yn } (Särkkä, 2013). In other
words, it holds that

E{xn | y1:n } = x̂n|n , (6.11a)


Cov{xn | y1:n } = Pn|n , (6.11b)

where E{xn | y1:n } denotes the conditional expectation of xn given the data y1:n
(and similar for the covariance).

Prediction. Given the estimated mean E{xn−1 | y1:n−1 } = x̂n−1|n−1 and its
covariance Cov{xn−1 | y1:n−1 } = Pn−1|n−1 from the previous time step tn−1 ,
the predicted mean and covariance can now be calculated. The mean is given by

x̂n|n−1 = E{xn | y1:n−1 }.

Substituting xn in the expectation using the dynamic model given in (6.4), the
mean can be rewritten as

E{xn | y1:n−1 } = E{Fn xn−1 + qn | y1:n−1 }.

Distributing the expectation over the sum and noting that E{qn | y1:n−1 } = 0
yields
E{xn | y1:n−1 } = Fn E{xn−1 | y1:n−1 } = Fn x̂n−1|n−1 ,        (6.12)
where the last equality makes use of (6.11).
Similarly, the covariance is found from

Pn|n−1 = Cov{xn | y1:n−1 }


= E{(xn − E{xn | y1:n−1 })(xn − E{xn | y1:n−1 })T | y1:n−1 }.

Algorithm 6.1 Kalman Filter

1: Initialize x̂0|0 = m0 , P0|0 = P0
2: for n = 1, 2, . . . do
3:   Prediction (time update):
       x̂n|n−1 = Fn x̂n−1|n−1
       Pn|n−1 = Fn Pn−1|n−1 Fnᵀ + Qn
4:   Measurement update:
       Kn = Pn|n−1 Gnᵀ (Gn Pn|n−1 Gnᵀ + Rn )⁻¹
       x̂n|n = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 )
       Pn|n = Pn|n−1 − Kn (Gn Pn|n−1 Gnᵀ + Rn )Knᵀ
5: end for

Using the expression of the dynamic model (6.4) for xn and (6.12) for E{xn |
y1:n−1 } yields

Cov{xn | y1:n−1 }
= E{(Fn xn−1 + qn − Fn x̂n−1|n−1 )(Fn xn−1 + qn − Fn x̂n−1|n−1 )T | y1:n−1 }.

By rewriting and expanding the quadratic term and distributing the expectation we
obtain

Cov{xn | y1:n−1 }
= E{(Fn xn−1 − Fn x̂n−1|n−1 )(Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 }
+ E{qn (Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 }
+ E{(Fn xn−1 − Fn x̂n−1|n−1 )qnᵀ | y1:n−1 } + E{qn qnᵀ | y1:n−1 }.

Noting that the middle terms are zero due to the factors being independent and qn
being zero-mean, what remains is

Cov{xn | y1:n−1 }
= E{(Fn xn−1 − Fn x̂n−1|n−1 )(Fn xn−1 − Fn x̂n−1|n−1 )ᵀ | y1:n−1 }
+ E{qn qnᵀ | y1:n−1 },

and finally,
Cov{xn | y1:n−1 } = Fn Pn−1|n−1 Fnᵀ + Qn .        (6.13)

Kalman Filter. Gathering the results (6.8)–(6.10) and (6.12)–(6.13), the filtering
algorithm for linear state-space models can now be formulated. First, during the
prediction (time update) step the predicted mean and covariance are calculated
according to

x̂n|n−1 = Fn x̂n−1|n−1 ,        (6.14a)
Pn|n−1 = Fn Pn−1|n−1 Fnᵀ + Qn .        (6.14b)

Second, in the measurement update step, the new measurement is incorporated and
the new state is estimated using
Kn = Pn|n−1 Gnᵀ (Gn Pn|n−1 Gnᵀ + Rn )⁻¹,        (6.15a)
x̂n|n = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 ),        (6.15b)
Pn|n = Pn|n−1 − Kn (Gn Pn|n−1 Gnᵀ + Rn )Knᵀ.        (6.15c)

Furthermore, the recursion is initialized by letting x̂0|0 = m0 and P0|0 = P0 . This


leads to the algorithm shown in Algorithm 6.1, which is called the Kalman filter
(KF) (Kalman, 1960).
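To make the recursion concrete, the prediction and measurement update (6.14)–(6.15) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the text; the scalar random-walk test model at the bottom (F = G = Q = R = 1) is our own choice of example.

```python
import numpy as np

def kf_step(x, P, y, F, Q, G, R):
    """One Kalman filter recursion: prediction (6.14) followed by update (6.15)."""
    # Prediction (time update)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Measurement update
    S = G @ P_pred @ G.T + R               # innovation covariance, cf. (6.17)
    K = P_pred @ G.T @ np.linalg.inv(S)    # Kalman gain (6.15a)
    x_new = x_pred + K @ (y - G @ x_pred)  # innovation-weighted correction (6.15b)
    P_new = P_pred - K @ S @ K.T           # covariance update (6.15c)
    return x_new, P_new

# Illustrative scalar random walk: xn = xn-1 + qn, yn = xn + rn, unit variances
F = Q = G = R = np.eye(1)
x, P = np.zeros(1), np.eye(1)
rng = np.random.default_rng(0)
x_true = 0.0
for n in range(50):
    x_true += rng.standard_normal()                  # simulate the true state
    y = np.array([x_true]) + rng.standard_normal(1)  # noisy measurement
    x, P = kf_step(x, P, y, F, Q, G, R)
```

For this model the posterior variance converges to a steady state of (√5 − 1)/2 ≈ 0.618, which is a quick way to sanity-check the recursion.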
It is worth pointing out several aspects of the prediction and measurement
update (6.14)–(6.15). First, note that in the prediction, the covariance increases
due to the scaling factor Fn and the added uncertainty by the process noise. On
the other hand, during the measurement update, the covariance is decreased by a
factor that scales quadratically with the gain Kn . This follows the intuition that
a prediction should increase the uncertainty, while incorporating new information
should decrease it.
Second, the measurement update is a sum of the prediction x̂n|n−1 and the
term yn − Gn x̂n|n−1 scaled by the gain Kn . Noting that Gn actually is the matrix
relating the state xn to the measurement yn , the term Gn x̂n|n−1 can be interpreted
as a prediction of the output, which it in fact is. To show this, consider the expected
value of the output yn given all the data up to tn−1 , that is,

E{yn | y1:n−1 } = E{Gn xn + rn | y1:n−1 }


= Gn E{xn | y1:n−1 } + E{rn | y1:n−1 }

and thus
E{yn | y1:n−1 } = Gn x̂n|n−1 . (6.16)
Hence, the difference between the measurement and its prediction gives an indi-
cation about how far the predicted state is from the true state: If the difference is
large, the predicted state is far and should be corrected a lot, if it is small, it is close
and does not need to be corrected much. Hence, this difference is also called the
innovation.

Third, the covariance of the predicted output yn is given by

Cov{yn | y1:n−1 }
= E{(yn − E{yn | y1:n−1 })(yn − E{yn | y1:n−1 })ᵀ | y1:n−1 }
= E{(Gn xn + rn − Gn x̂n|n−1 )(Gn xn + rn − Gn x̂n|n−1 )ᵀ | y1:n−1 }
= E{Gn (xn − x̂n|n−1 )(xn − x̂n|n−1 )ᵀ Gnᵀ + rn rnᵀ | y1:n−1 }

and finally,

Cov{yn | y1:n−1 } = Gn Pn|n−1 Gnᵀ + Rn ,        (6.17)
which is the denominator of the Kalman gain Kn . Similarly, the cross-covariance
between the predicted state xn and predicted output yn is

Cov{xn , yn | y1:n−1 }
= E{(xn − E{xn | y1:n−1 })(yn − E{yn | y1:n−1 })ᵀ | y1:n−1 }
= E{(xn − x̂n|n−1 )(Gn xn + rn − Gn x̂n|n−1 )ᵀ | y1:n−1 }
= E{(xn − x̂n|n−1 )(xn − x̂n|n−1 )ᵀ Gnᵀ | y1:n−1 },

which yields

Cov{xn , yn | y1:n−1 } = Pn|n−1 Gnᵀ.        (6.18)
Thus, using (6.16)–(6.18), the measurement update can be written in the alternative
form as

Kn = Cov{xn , yn | y1:n−1 } Cov{yn | y1:n−1 }⁻¹,        (6.19a)
x̂n|n = x̂n|n−1 + Kn (yn − E{yn | y1:n−1 }),        (6.19b)
Pn|n = Pn|n−1 − Kn Cov{yn | y1:n−1 }Knᵀ.        (6.19c)

From (6.19), we can see that the denominator of the Kalman gain Kn is the covari-
ance of the predicted measurement. Hence, if the uncertainty of the prediction is
large (either due to a large uncertainty in the predicted state or due to large mea-
surement noise covariance Rn ), the gain becomes small. This in turn means that
the innovation only contributes little information to the updated state. Conversely,
if the uncertainty of the prediction is small, the gain becomes large and the innova-
tion contributes a lot.

6.3 Extended Kalman Filter


Similar to the static, linear case discussed in Chapter 3, the Kalman filter is the
closed-form solution for the state estimation problem in linear state-space models
and thus an exact solution. However, if we consider general nonlinear state-space
models of the form (6.1)–(6.2), closed-form solutions are normally not available.
Similar to the nonlinear static case, we need approximative solutions instead. A

first approximation is found in the extended Kalman filter (EKF), which is based
on a linearization using a Taylor series approximation of the nonlinear dynamic
and measurement models. Using this linearization, prediction and measurement
update steps similar to the Kalman filter are found.

Prediction. For the prediction, assume that the state estimate and its covariance
for tn−1 are given by x̂n−1|n−1 and Pn−1|n−1 , respectively. Then, the linear (first
order) Taylor series approximation of the dynamic model around the linearization
point x̂n−1|n−1 is given by

xn = f (xn−1 ) + qn
   ≈ f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn ,        (6.20)

where Fx is the Jacobian matrix of the vector-valued function f . Then, based on the
linearized dynamic model (6.20), we can predict the mean and covariance of the
state xn at tn given the set of all the measurements y1:n−1 = {y1 , y2 , . . . , yn−1 }.
The mean is given by
x̂n|n−1
= E{xn | y1:n−1 }
≈ E{f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn | y1:n−1 }
= f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 ) E{xn−1 | y1:n−1 } − Fx (x̂n−1|n−1 )x̂n−1|n−1
= f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )x̂n−1|n−1 − Fx (x̂n−1|n−1 )x̂n−1|n−1
= f (x̂n−1|n−1 ),
and thus
x̂n|n−1 = f (x̂n−1|n−1 ). (6.21)
Similarly, the covariance is
Pn|n−1
= E{(xn − x̂n|n−1 )(xn − x̂n|n−1 )ᵀ | y1:n−1 }
≈ E{(f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn − f (x̂n−1|n−1 ))
  × (f (x̂n−1|n−1 ) + Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn − f (x̂n−1|n−1 ))ᵀ | y1:n−1 }
= E{(Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn )(Fx (x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 ) + qn )ᵀ | y1:n−1 }
= Fx (x̂n−1|n−1 ) E{(xn−1 − x̂n−1|n−1 )(xn−1 − x̂n−1|n−1 )ᵀ | y1:n−1 } Fxᵀ(x̂n−1|n−1 ) + E{qn qnᵀ | y1:n−1 }

and finally,

Pn|n−1 = Fx (x̂n−1|n−1 )Pn−1|n−1 Fxᵀ(x̂n−1|n−1 ) + Qn .        (6.22)

Measurement Update. The derivation of the measurement update follows very


similar steps as the prediction in the EKF and the measurement update in the KF.
Given the predicted mean x̂n|n−1 and covariance Pn|n−1 , the nonlinear measurement
model is linearized using a first order Taylor series expansion around the
prediction x̂n|n−1 . This yields

yn = g(xn ) + rn
   ≈ g(x̂n|n−1 ) + Gx (x̂n|n−1 )(xn − x̂n|n−1 ) + rn ,        (6.23)

where Gx denotes the Jacobian matrix of g.


Next, introducing the regularized least squares criterion for the linearized
model (6.23) yields

JReLS (xn ) = (yn − g(x̂n|n−1 ) − Gx (x̂n|n−1 )(xn − x̂n|n−1 ))ᵀ Rn⁻¹
            × (yn − g(x̂n|n−1 ) − Gx (x̂n|n−1 )(xn − x̂n|n−1 ))        (6.24)
            + (xn − x̂n|n−1 )ᵀ Pn|n−1⁻¹ (xn − x̂n|n−1 ).

By introducing a change of variables with zn = yn − g(x̂n|n−1 ) + Gx (x̂n|n−1 )x̂n|n−1 ,
(6.24) can be rewritten as

JReLS (xn ) = (zn − Gx (x̂n|n−1 )xn )ᵀ Rn⁻¹ (zn − Gx (x̂n|n−1 )xn )
            + (xn − x̂n|n−1 )ᵀ Pn|n−1⁻¹ (xn − x̂n|n−1 ),        (6.25)

which is linear in xn and has the same form as (6.6) but with the Jacobian matrix
Gx (x̂n|n−1 ) rather than the measurement matrix Gn . Hence, solving

x̂n|n = argmin_{xn} JReLS (xn )        (6.26)

yields

x̂n|n = x̂n|n−1 + Kn (zn − Gx (x̂n|n−1 )x̂n|n−1 ),
Kn = Pn|n−1 Gxᵀ(x̂n|n−1 )(Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )⁻¹,

with covariance

Pn|n ≈ Pn|n−1 − Kn (Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )Knᵀ.

Finally, changing zn back to its definition yields

x̂n|n = x̂n|n−1 + Kn (yn − g(x̂n|n−1 ) + Gx (x̂n|n−1 )x̂n|n−1 − Gx (x̂n|n−1 )x̂n|n−1 )


= x̂n|n−1 + Kn (yn − g(x̂n|n−1 ))

for the measurement update.



Algorithm 6.2 Extended Kalman Filter

1: Initialize x̂0|0 = m0 , P0|0 = P0
2: for n = 1, 2, . . . do
3:   Prediction (time update):
       x̂n|n−1 = f (x̂n−1|n−1 )
       Pn|n−1 = Fx (x̂n−1|n−1 )Pn−1|n−1 Fxᵀ(x̂n−1|n−1 ) + Qn
4:   Measurement update:
       Kn = Pn|n−1 Gxᵀ(x̂n|n−1 )(Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )⁻¹
       x̂n|n = x̂n|n−1 + Kn (yn − g(x̂n|n−1 ))
       Pn|n = Pn|n−1 − Kn (Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )Knᵀ
5: end for

Extended Kalman Filter. We can now summarize the prediction and measurement
update steps for the EKF. First, during the prediction step, the predicted
mean and covariance are calculated according to

x̂n|n−1 = f (x̂n−1|n−1 ),        (6.27a)
Pn|n−1 = Fx (x̂n−1|n−1 )Pn−1|n−1 Fxᵀ(x̂n−1|n−1 ) + Qn ,        (6.27b)

where Fx is the Jacobian matrix of the dynamic model. Second, the measurement
update is
Kn = Pn|n−1 Gxᵀ(x̂n|n−1 )(Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )⁻¹,        (6.28a)
x̂n|n = x̂n|n−1 + Kn (yn − g(x̂n|n−1 )),        (6.28b)
Pn|n = Pn|n−1 − Kn (Gx (x̂n|n−1 )Pn|n−1 Gxᵀ(x̂n|n−1 ) + Rn )Knᵀ.        (6.28c)

These two steps are then performed iteratively, and the algorithm is initialized
with the initial conditions. This yields the complete EKF algorithm shown in Algo-
rithm 6.2. Note that this is essentially the same algorithm as the original KF where
the nonlinear function takes the place of the prediction (both in the prediction step
and the prediction of the output) and in the covariance updates, the Jacobian matri-
ces take the place of the matrices in the dynamic model and measurement model.
Furthermore, if either of the models is linear, the corresponding step (prediction or
measurement update) may be replaced by an exact update step from the Kalman
filter in Section 6.2. Finally, note that Algorithm 6.2 is an approximation of the
filtering problem since the nonlinear function is approximated around the filtered
and predicted state, meaning that both the state estimate and its covariance may or
may not converge to the true state (similarly to the static nonlinear problems).
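As an illustration of Algorithm 6.2, the EKF recursion (6.27)–(6.28) can be sketched as follows. The two-dimensional range-measurement model in the example is our own choice (it is not taken from the text), and the Jacobians are supplied as functions.

```python
import numpy as np

def ekf_step(x, P, y, f, F_jac, Q, g, G_jac, R):
    """One EKF recursion: prediction (6.27) and measurement update (6.28)."""
    # Prediction using the nonlinear dynamic model and its Jacobian Fx
    Fx = F_jac(x)
    x_pred = f(x)
    P_pred = Fx @ P @ Fx.T + Q
    # Measurement update using the measurement model linearized at x_pred
    Gx = G_jac(x_pred)
    S = Gx @ P_pred @ Gx.T + R
    K = P_pred @ Gx.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - g(x_pred))
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new

# Illustrative example: 2-D position with random-walk dynamics and a scalar
# range measurement yn = ||xn|| + rn (nonlinear in the state).
f = lambda x: x
F_jac = lambda x: np.eye(2)
g = lambda x: np.array([np.hypot(x[0], x[1])])
G_jac = lambda x: (x / np.hypot(x[0], x[1]))[None, :]
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
x, P = np.array([1.0, 1.0]), np.eye(2)
x, P = ekf_step(x, P, np.array([1.5]), f, F_jac, Q, g, G_jac, R)
```

Replacing either model with a linear one reduces the corresponding step to the exact Kalman filter update, as noted above.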

6.4 Unscented Kalman Filtering


One of the major problems of the EKF is that the linearization is local, which
often leads to problems such as the covariance of the state being underestimated
(which in turn affects the Kalman gain in the next iteration). Another approach to
solve the filtering problem within the Kalman filtering framework is the unscented
Kalman filter (UKF) (Wan and Van Der Merwe, 2000; Julier and Uhlmann, 2004),
which makes use of a so-called unscented transform. Hence, this transform will be
discussed first, and then the prediction and measurement update steps for the UKF
are derived.

Unscented Transform. The unscented transform is based on calculating the mo-


ments of the nonlinear transformation of a finite set of deterministic sampling
points as follows. Given a random variable x with mean m and covariance P,
we can find a set of J points {x⁰, x¹, . . . , x^{J−1}}, so-called sigma-points, such that
their weighted sums are equal to the mean and covariance of x, that is, such that

m = ∑_{j=0}^{J−1} wmʲ xʲ,        (6.29a)

P = ∑_{j=0}^{J−1} wPʲ (xʲ − m)(xʲ − m)ᵀ,        (6.29b)

where it must hold that ∑_{j=0}^{J−1} wmʲ = 1 since the expected value of the sum should
be unbiased.
Then, given the nonlinear function z = h(x), we can calculate the transformed
sigma-points according to
zʲ = h(xʲ ).
Based on the transformed sigma-points, the mean and covariance of the trans-
formed variable z as well as the cross-covariance between x and z can then be
calculated according to
E{z} ≈ ∑_{j=0}^{J−1} wmʲ zʲ,        (6.30a)

Cov{z} ≈ ∑_{j=0}^{J−1} wPʲ (zʲ − E{z})(zʲ − E{z})ᵀ,        (6.30b)

Cov{x, z} ≈ ∑_{j=0}^{J−1} wPʲ (xʲ − m)(zʲ − E{z})ᵀ.        (6.30c)

The question then is how to choose the sigma-points xʲ and their weights for the
mean wmʲ and covariance wPʲ. The unscented transform is one such approach for

choosing sigma-points that fulfill the conditions (6.29). Here, J = 2L + 1 (with L


being the dimension of the vector x) sigma-points are chosen according to
x⁰ = m,        (6.31a)
xʲ = m + √(L + λ) [√P]j ,        j = 1, . . . , L,        (6.31b)
xʲ = m − √(L + λ) [√P](j−L) ,        j = L + 1, . . . , 2L.        (6.31c)

In (6.31), √P denotes a matrix square root such that P = √P (√P)ᵀ (in practice,
this can be implemented using the Cholesky factorization) and [√P]j denotes the
jth column of the matrix √P. The weights corresponding to the above sigma-points are

wm⁰ = λ / (L + λ),        (6.32a)
wP⁰ = λ / (L + λ) + (1 − α² + β),        (6.32b)
wmʲ = wPʲ = 1 / (2(L + λ)),        j = 1, . . . , 2L.        (6.32c)
In both (6.31) and (6.32), λ is

λ = α²(L + κ) − L,        (6.33)

and α, β, and κ are tuning parameters. The parameters α and κ determine the
scaling of the spread of the sigma-points in the direction of the covariance P,
whereas β is related to the higher-order moments of x and only affects the weight
for the central sigma-point in the covariance calculations.
The choice of the tuning parameters may greatly affect the performance of the
filter and several default choices have been suggested. The value for κ is most
often chosen to be zero (i.e., κ = 0), which leaves the choices for α and β. First,
note that with κ = 0, the weight of the central sigma-point becomes

wm⁰ = λ / (L + λ) = (Lα² − L) / (L + Lα² − L) = (α² − 1) / α²,        (6.34)

and the scaling factor of the root of the covariance matrix becomes

√(L + λ) = √(L + α²L − L) = α√L.        (6.35)
One suggestion is to choose α = 1 × 10⁻³ (Wan and Van Der Merwe, 2000).
This gives a quite large and negative weight wm⁰ [see (6.34)] and a small scaling
factor [see (6.35)], which makes the sigma-points lie close to the mean. To avoid
this, it is sometimes more intuitive to instead start from the weight of the central
point wm⁰ and then determine α by reformulating (6.34), which yields

α = √(1 / (1 − wm⁰)).        (6.36)

Furthermore, since all the weights for j > 0 are the same [see (6.32)], it must also
hold that
wmʲ = (1 − wm⁰) / (2L).
Finally, β only affects the central weight of the covariance and a good starting
point is normally to choose β = 2 (see, for example, Wan and Van Der Merwe
(2000)).
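The sigma-point construction (6.31)–(6.33) and the moment-matching conditions (6.29) can be checked numerically with a short sketch. The helper below is our own, shown with the illustrative choices α = 1, β = 2, κ = 0.

```python
import numpy as np

def sigma_points(m, P, alpha=1.0, beta=2.0, kappa=0.0):
    """Sigma-points (6.31) and weights (6.32)-(6.33) of the unscented transform."""
    L = m.shape[0]
    lam = alpha**2 * (L + kappa) - L          # (6.33)
    A = np.linalg.cholesky(P)                 # matrix square root, P = A A^T
    X = np.tile(m, (2 * L + 1, 1))
    X[1:L + 1] += np.sqrt(L + lam) * A.T      # rows of A.T are the columns of A
    X[L + 1:] -= np.sqrt(L + lam) * A.T
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wm[0] = lam / (L + lam)
    wP = wm.copy()
    wP[0] += 1.0 - alpha**2 + beta
    return X, wm, wP

# The weighted sigma-points reproduce the original mean and covariance, cf. (6.29)
m = np.array([1.0, -1.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
X, wm, wP = sigma_points(m, P)
m_rec = wm @ X
P_rec = (wP[:, None] * (X - m_rec)).T @ (X - m_rec)
```

With these parameter values, m_rec and P_rec recover m and P up to floating-point error, as (6.29) requires.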

Prediction. The unscented transform as introduced above can directly be used in


the prediction step to calculate the predicted mean and the covariance. In this case,
the nonlinear transformation is the function of the dynamic model, that is, f (xn−1 ).
The sigma-points are calculated using the estimated mean and covariance at tn−1 , that
is,

x⁰n−1 = x̂n−1|n−1 ,        (6.37a)
xʲn−1 = x̂n−1|n−1 + √(L + λ) [√Pn−1|n−1 ]j ,        j = 1, . . . , L,        (6.37b)
xʲn−1 = x̂n−1|n−1 − √(L + λ) [√Pn−1|n−1 ](j−L) ,        j = L + 1, . . . , 2L.        (6.37c)

The weights of the sigma-points are given by (6.32) and λ is as in (6.33).


Then, the moments of the transformed variable, that is, the moments of the
prediction become

xʲn = f (xʲn−1 ),        j = 0, . . . , 2L,        (6.38a)

x̂n|n−1 = ∑_{j=0}^{2L} wmʲ xʲn ,        (6.38b)

Pn|n−1 = ∑_{j=0}^{2L} wPʲ (xʲn − x̂n|n−1 )(xʲn − x̂n|n−1 )ᵀ + Qn .        (6.38c)

Note that the additional term Qn in the covariance is due to the fact that the dy-
namic model also includes the process noise term qn , which increases the uncer-
tainty. In other words, the unscented transform only calculates the covariance of
the xn−1 transformed by f (xn−1 ) and does not take the effect of the noise into
account in this form.

Measurement Update. For the measurement update, first recall that it can be
written in terms of the predicted output, its covariance, and the covariance between
the predicted output and the state, see (6.19). Hence, we can also use the unscented
transform to calculate these means and covariances.

Algorithm 6.3 Unscented Kalman Filter

1: Initialize x̂0|0 = m0 , P0|0 = P0
2: for n = 1, 2, . . . do
3:   Calculate xʲn−1 , wmʲ, and wPʲ using (6.37) and (6.32)
4:   Calculate x̂n|n−1 and Pn|n−1 using (6.38)
5:   Calculate xʲn , wmʲ, and wPʲ using (6.39) and (6.32)
6:   Calculate E{yn | y1:n−1 }, Cov{yn | y1:n−1 }, and Cov{xn , yn | y1:n−1 } using (6.40)
7:   Measurement update:
       Kn = Cov{xn , yn | y1:n−1 } Cov{yn | y1:n−1 }⁻¹
       x̂n|n = x̂n|n−1 + Kn (yn − E{yn | y1:n−1 })
       Pn|n = Pn|n−1 − Kn Cov{yn | y1:n−1 }Knᵀ
8: end for

In this case, the sigma points are calculated from the predicted mean x̂n|n−1
and the covariance Pn|n−1 according to

x⁰n = x̂n|n−1 ,        (6.39a)
xʲn = x̂n|n−1 + √(L + λ) [√Pn|n−1 ]j ,        j = 1, . . . , L,        (6.39b)
xʲn = x̂n|n−1 − √(L + λ) [√Pn|n−1 ](j−L) ,        j = L + 1, . . . , 2L,        (6.39c)

with the weights as in (6.32). Then, the necessary moments become

ynʲ = g(xʲn ),        j = 0, . . . , 2L,        (6.40a)

E{yn | y1:n−1 } = ∑_{j=0}^{2L} wmʲ ynʲ ,        (6.40b)

Cov{yn | y1:n−1 } = ∑_{j=0}^{2L} wPʲ (ynʲ − E{yn | y1:n−1 })(ynʲ − E{yn | y1:n−1 })ᵀ + Rn ,        (6.40c)

Cov{xn , yn | y1:n−1 } = ∑_{j=0}^{2L} wPʲ (xʲn − x̂n|n−1 )(ynʲ − E{yn | y1:n−1 })ᵀ,        (6.40d)

and the measurement update given in (6.19) can be performed.



Unscented Kalman Filter. Based on the prediction and measurement update


discussed above, the unscented Kalman filter then becomes as shown in Algo-
rithm 6.3. Note that the UKF still performs the two steps very similarly to the orig-
inal Kalman filter for linear systems, that is, it basically estimates the mean and
covariance of the state at each time step and propagates this information between
time steps, but without analytical linearization. The latter also implies that the
UKF does not require any Jacobians to be calculated and is thus somewhat easier
to implement.
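Putting the pieces together, Algorithm 6.3 can be sketched compactly. This is an illustrative implementation with our own function names; on a linear model it reproduces the Kalman filter update exactly, which makes a convenient sanity check.

```python
import numpy as np

def ukf_step(x, P, y, f, Q, g, R, alpha=1.0, beta=2.0, kappa=0.0):
    """One UKF recursion following Algorithm 6.3."""
    L = x.shape[0]
    lam = alpha**2 * (L + kappa) - L
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wm[0] = lam / (L + lam)
    wP = wm.copy()
    wP[0] += 1.0 - alpha**2 + beta

    def sigmas(m, C):  # sigma-points per (6.37)/(6.39)
        A = np.sqrt(L + lam) * np.linalg.cholesky(C).T
        return np.vstack([m[None, :], m + A, m - A])

    # Prediction, (6.37)-(6.38)
    Xp = np.array([f(s) for s in sigmas(x, P)])
    x_pred = wm @ Xp
    P_pred = (wP[:, None] * (Xp - x_pred)).T @ (Xp - x_pred) + Q
    # Measurement update, (6.39)-(6.40) combined with the gain form (6.19)
    Xu = sigmas(x_pred, P_pred)
    Y = np.array([g(s) for s in Xu])
    y_pred = wm @ Y
    S = (wP[:, None] * (Y - y_pred)).T @ (Y - y_pred) + R
    C = (wP[:, None] * (Xu - x_pred)).T @ (Y - y_pred)
    K = C @ np.linalg.inv(S)
    return x_pred + K @ (y - y_pred), P_pred - K @ S @ K.T

# Sanity check on the scalar random walk (linear, so the UKF matches the KF):
x, P = ukf_step(np.zeros(1), np.eye(1), np.array([1.0]),
                lambda x: x, np.eye(1), lambda x: x, np.eye(1))
```

For this linear check the update gives x̂1|1 = 2/3 and P1|1 = 2/3, the same values the Kalman filter produces for one step of this model.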

6.5 Iterated Extended and Unscented Kalman Filters


In Section 6.3 we derived the extended Kalman filter (EKF) update by computing
the regularized (weighted) least squares solution to a linearized model. However,
this is exactly the same procedure that was used in Gauss–Newton and Levenberg–
Marquardt algorithms, except that there the linearization was used iteratively to compute
the solution to the non-linear regularized least squares problem. This same ap-
proach can also be used to improve the EKF — instead of taking just a single step,
we solve the non-linear regularized least squares problem using a least squares
optimization method.
When the non-linear least squares problem is solved using subsequent lin-
earization — which is equivalent to Gauss–Newton optimization — we obtain the
iterated extended Kalman filter (Bell and Cathey, 1993). The iteration helps to
solve highly non-linear problems where the single linearization used in EKF —
which is equivalent to a single step of the Gauss–Newton algorithm — is not enough.
This concept can also be extended to include line search procedures and we can
also use Levenberg–Marquardt algorithms to solve the non-linear least squares
problem (Skoglund et al., 2015).
Furthermore, the unscented Kalman filter (UKF) introduced in Section 6.4 can
also be improved by using iteration. In this case the optimization criterion is no
longer the non-linear regularized least squares problem, but instead, we optimize
the Kullback–Leibler divergence to the posterior distribution. Nevertheless, the re-
sulting iterated posterior linearization filter (García-Fernández et al., 2015) takes a
similar form as the iterated extended Kalman filter, but the Taylor series lineariza-
tions are replaced with sigma-point approximations.

6.6 Bootstrap Particle Filter


Another approach of solving the estimation problem in dynamic systems is particle
filtering, which differs quite a lot from the Kalman filtering approaches discussed in
the previous sections. Rather than being based on predicting and updating the mean
and covariance from time step to time step, particle filtering propagates a set of
weighted random samples (called particles) and hence there is some resemblance
to the unscented Kalman filter. The particles can be seen as qualified but random

guesses of the state whereas the weights provide an indication of how good the
guess is. Particle filtering is a very general methodology and comprises a large
class of filtering algorithms. Here, we focus on the bootstrap particle filter, which
has an intuitive explanation but also a rigorous mathematical background (Gordon
et al., 1993; Doucet and Johansen, 2011; Särkkä, 2013).
To derive the particle filter, we first take another look at the general discrete-
time state-space model given by

xn = f (xn−1 ) + qn ,
yn = g(xn ) + rn ,

with qn ∼ p(qn ) and rn ∼ p(rn ). Recall that the dynamic model is a stochastic
process, meaning that every time a system is observed, a new realization of that
stochastic process is observed. For example, consider the scalar (one-dimensional)
random walk dynamic model

xn = xn−1 + qn

with

x0 ∼ N (0, 1),
qn ∼ N (0, 1).

This model essentially states that the initial state is a random variable distributed
around x0 = 0 with variance 1 and as time elapses, random steps from that initial
point are taken. Then, at any time tn , there are infinitely many random steps that
can be taken through qn (i.e., qn can take on any value, but with some values more
likely than others as specified by the distribution of the random variable). This
implies in turn that no two realizations of any random process are the same. As an
example, Figure 6.1a illustrates 20 realizations of the random walk process. As it
can be seen, each realization takes a completely different path, and the realizations
diverge as time elapses.
In practice, when estimating a system’s state, all the observations (measure-
ments) come from one single realization out of all the infinitely possible realiza-
tions of the process. The measurements then give us the information about the
particular realization that is observed and the realization and the measurements are
linked through the measurement model. For example, Figure 6.1b illustrates one
particular realization of the random walk process together with the measurements
from the linear measurement model

yn = xn + rn

with rn ∼ N (0, 1).
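The realizations and measurements shown in Figure 6.1 can be reproduced with a few lines; this is our own sketch, using 20 realizations over 60 time steps as in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_realizations = 60, 20

# 20 realizations of xn = xn-1 + qn with x0 ~ N(0, 1) and qn ~ N(0, 1)
x0 = rng.standard_normal(n_realizations)
trajectories = x0 + np.cumsum(rng.standard_normal((T, n_realizations)), axis=0)

# One particular realization observed through yn = xn + rn with rn ~ N(0, 1)
x = trajectories[:, 0]
y = x + rng.standard_normal(T)
```

Plotting the columns of `trajectories` shows the paths spreading apart as time elapses, exactly the divergence visible in Figure 6.1a.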


This leads to a new strategy for filtering where we first simulate a set of trajec-
tories and then evaluate how well each of these trajectories explain the measure-
ments that we are observing. More formally, in the framework of the prediction
6.6. BOOTSTRAP PARTICLE FILTER 89


Figure 6.1. Random walk process example: (a) 20 realizations of the random walk process,
and (b) one realization of the process together with its measurements.


Figure 6.2. Random walk example with measurements yn (blue) from one particular
realization of the random walk process (red) and 20 other realizations.

and measurement update steps introduced in Section 6.1, this approach alternates
between:

1. Prediction: Given a set of simulated states xʲn−1 (for j = 1, . . . , J), simulate
   from tn−1 to tn to obtain a set of simulated states xʲn (j = 1, . . . , J);

2. Measurement update: Evaluate how well the simulated states xʲn explain the
   observed measurement yn .

For example, Figure 6.2 combines the different realizations of the state trajectories
in Figure 6.1a (gray) with the measurements (blue dots) in Figure 6.1b (together
with the true trajectory that generated the measurements in red). As it can be seen
from the figure, several of the gray realizations could be the trajectories that gener-
ated the measurements in blue, at least at some points in time, whereas other real-
izations are very unlikely to be the trajectories that generated the measurements.
To develop a filtering algorithm based on this reasoning, we need to find a
strategy to a) simulate samples xʲn given the samples xʲn−1 , and b) evaluate how
well a particular sample xʲn of the state explains the current measurement yn .

An intuitive way to generate new samples is to use the dynamic model to


simulate one time step. In other words, we can generate a realization of the random
variable qn and use the nonlinear function f (xn−1 ) to pass the sample from tn−1 to
tn . This means performing the following two steps for each sample j = 1, . . . , J:

1. Sample qʲn ∼ p(qn ),

2. Calculate xʲn = f (xʲn−1 ) + qʲn .

In practice, the process noise qn is often Gaussian, that is, qn is a Gaussian random
variable (e.g., when the discrete-time model comes from the discretization of a
continuous-time model with a white noise process as the input). In that case,
sampling qn amounts to sampling from the normal distribution N (0, Qn ) where
Qn is the covariance matrix of qn .
To evaluate the importance of each sample with respect to the current measure-
ment yn , we assign a weight wnʲ (called the importance weight) to each sample xʲn .
Intuitively, the weight should represent how well each sample explains the mea-
surement, and thus, the closer the sample is to the true state, the higher the weight
should be. If we ensure that the weights sum to one, that is, if
∑_{j=1}^{J} wnʲ = 1,

we can also loosely interpret the weight wnʲ as the probability of the jth sample
representing the correct state. The question then is how the weights should be cal-
culated, given the sample xʲn and the measurement yn . So far, cost functions have
been used to evaluate whether a state explains the measurements well. However,
in this approach the cost should be low for a state that explains the measurements
well, whereas the importance weight should be high and hence, we cannot use cost
functions¹.
Instead, we have another look at the measurement model. Recall that we have
assumed that the measurement model is of the form

yn = g(xn ) + rn ,

where rn ∼ p(rn ) is a random variable with probability density function p(rn ).


Hence, yn is a random variable too. Assuming that xn is given, yn follows the
same distribution as rn but offset by the constant term g(xn ). We can express this
probability density function for yn as p(yn | xn ) which is read as “the probability
density function of yn given that xn is known”. The probability density function
p(yn | xn ) is commonly known as the likelihood because it indicates how likely a
certain state xn is when observing the measurement yn . Hence, the likelihood is a
¹ In practice, cost functions are closely related to the way the importance weights are calculated,
but this is beyond the scope of this course.

suitable measure for how well any given sample xʲn explains the measurement yn ,
and we can define the non-normalized weight w̃nʲ as

w̃nʲ = p(yn | xʲn ).

w̃nʲ is non-normalized because the sum over all weights does generally not ful-
fill the requirement that the weights should sum to one. This can be ensured by
simply dividing each non-normalized weight by the sum of all the non-normalized
weights, that is,

wnʲ = w̃nʲ / ∑_{i=1}^{J} w̃nⁱ .

In practice, the measurement noise is often assumed to be a zero-mean Gaus-


sian random variable with covariance matrix Rn , that is, p(rn ) = N (rn ; 0, Rn ).
In this case, the likelihood is also a Gaussian random variable with mean g(xn )
(due to the offset) and covariance Rn (due to the uncertainty introduced by rn ).
Hence, the likelihood becomes

p(yn | xn ) = N (yn ; g(xn ), Rn ).

Once the importance weights have been calculated, a point estimate of the
current state and its covariance can be obtained. Similar to the unscented Kalman
filter, these are given by the weighted sum of the individual samples xʲn , weighted
by the importance weights wnʲ , that is,

x̂n|n = ∑_{j=1}^{J} wnʲ xʲn ,        (6.41a)

Pn|n = ∑_{j=1}^{J} wnʲ (xʲn − x̂n|n )(xʲn − x̂n|n )ᵀ.        (6.41b)

Observe that the covariances for the process noise (Qn ) and measurement noise
(Rn ) do not enter these equations (contrary to the Kalman-filter-type algorithms).
This is due to the fact that this uncertainty is accounted for when sampling and
calculating the weights.
It would appear that this now yields a working algorithm. However, there is one
important problem, illustrated for the random walk model in Figure 6.3: Assuming
that the red trajectory represents the true realization of our state trajectory, the
other simulated trajectories start to diverge from that trajectory as time elapses.
Consequently, toward the end of the time scale, there are only very few trajectories
in the neighborhood of the actual trajectory. In practice, this means that the weights
for almost all but a few trajectories would become zero and eventually, after some
more time has elapsed, there would be no trajectory close to the true realization
anymore and the filter would break down completely.


Figure 6.3. Illustration of the divergence problem when simulating trajectories.

To address this problem, one additional step called resampling has to be in-
troduced. The basic idea of resampling is to make sure that samples with low
weights, that is, samples that are unlikely to be close to the true state, are discarded
and replaced with copies of the samples with high weight. Hence, resampling
essentially regenerates the sample xjn such that there are a total of approximately
bwnj Je copies of xjn in the resampled set of particles. Thus, if x̃in (for i = 1, . . . , J)
denotes the resampled particle, that particle is equal to particle xjn with probability
wnj , that is, we have that

Pr{x̃in = xjn } = wnj ,

where Pr{·} denotes the probability. This ensures that trajectories with low weight
are discarded and that the samples remain close to the true state.
The final bootstrap particle filtering algorithm then consists of the following
three steps:

1. Propagate the particles using the dynamic model,

2. calculate the importance weights using the measurement model, and

3. resample the particles according to their weights.

These three steps are then repeated for the complete dataset and the recursion is
initialized by sampling from the initial distribution of the state p(x0 ). Assuming
Gaussian distributions for the initial state, process noise, and measurement noise,
that is,

x0 ∼ N (m0 , P0 ),
qn ∼ N (0, Qn ),
rn ∼ N (0, Rn ),

the resulting algorithm is as summarized in Algorithm 6.4.



Algorithm 6.4 Bootstrap Particle Filter (Gaussian Process and Measurement Noises)

1: Initialization: Sample (j = 1, . . . , J)
     xʲ0 ∼ N (m0 , P0 )
2: for n = 1, 2, . . . do
3:   for j = 1, 2, . . . , J do
4:     Sample
         qʲn ∼ N (0, Qn )
5:     Propagate the state
         xʲn = f (xʲn−1 ) + qʲn
6:     Calculate the importance weights
         w̃nʲ = N (yn ; g(xʲn ), Rn )
7:   end for
8:   Normalize the importance weights (j = 1, . . . , J)
       wnʲ = w̃nʲ / ∑_{i=1}^{J} w̃nⁱ
9:   Calculate the mean and covariance
       x̂n|n = ∑_{j=1}^{J} wnʲ xʲn
       Pn|n = ∑_{j=1}^{J} wnʲ (xʲn − x̂n|n )(xʲn − x̂n|n )ᵀ
10:  Resample such that
       Pr{x̃ⁱn = xʲn } = wnʲ
11: end for


