To appear in Bayesian Brain, Doya, K. (ed), MIT Press (2006)

Optimal Control Theory

Emanuel Todorov
University of California San Diego

Optimal control theory is a mature mathematical discipline with numerous applications in both science and engineering. It is emerging as the computational framework of choice for studying the neural control of movement, in much the same way that probabilistic inference is emerging as the computational framework of choice for studying sensory information processing. Despite the growing popularity of optimal control models, however, the elaborate mathematical machinery behind them is rarely exposed and the big picture is hard to grasp without reading a few technical books on the subject. While this chapter cannot replace such books, it aims to provide a self-contained mathematical introduction to optimal control theory that is sufficiently broad and yet sufficiently detailed when it comes to key concepts. The text is not tailored to the field of motor control (apart from the last section, and the overall emphasis on systems with continuous state) so it will hopefully be of interest to a wider audience. Of special interest in the context of this book is the material on the duality of optimal control and probabilistic inference; such duality suggests that neural information processing in sensory and motor areas may be more similar than currently thought. The chapter is organized in the following sections:

1. Dynamic programming, Bellman equations, optimal value functions, value and policy iteration, shortest paths, Markov decision processes.
2. Hamilton-Jacobi-Bellman equations, approximation methods, finite and infinite horizon formulations, basics of stochastic calculus.
3. Pontryagin's maximum principle, ODE and gradient descent methods, relationship to classical mechanics.
4. Linear-quadratic-Gaussian control, Riccati equations, iterative linear approximations to nonlinear problems.
5. Optimal recursive estimation, Kalman filter, Zakai equation.
6. Duality of optimal control and optimal estimation (including new results).
7. Optimality models in motor control, promising research directions.

1 Discrete control: Bellman equations

Let x ∈ X denote the state of an agent's environment, and u ∈ U(x) the action (or control) which the agent chooses while at state x. For now both X and U(x) are finite sets. Let next(x, u) ∈ X denote the state which results from applying action u in state x, and cost(x, u) ≥ 0 the cost of applying action u in state x. As an example, x may be the city where we are now, u the flight we choose to take, next(x, u) the city where that flight lands, and cost(x, u) the price of the ticket. We can now pose a simple yet practical optimal control problem: find the cheapest way to fly to your destination. This problem can be formalized as follows: find an action sequence (u_0, u_1, ..., u_{n-1}) and corresponding state sequence (x_0, x_1, ..., x_n) minimizing the total cost

$$J(x_\cdot, u_\cdot) = \sum_{k=0}^{n-1} \mathrm{cost}(x_k, u_k)$$

where x_{k+1} = next(x_k, u_k) and u_k ∈ U(x_k). The initial state x_0 and the destination state x_n are given.

In the stochastic case the value function equation (2) becomes

$$v(x) = \min_{u \in U(x)} \left\{ \mathrm{cost}(x, u) + E\left[v(\mathrm{next}(x, u))\right] \right\}$$

where E denotes expectation over next(x, u), and is computed as

$$E\left[v(\mathrm{next}(x, u))\right] = \sum_{y} p(y \mid x, u)\, v(y)$$

Equations (1, 3, 4) generalize to the stochastic case in the same way as equation (2) does. An optimal control problem with discrete states and actions and probabilistic state transitions is called a Markov decision process (MDP).
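To make the stochastic Bellman backup concrete, here is a minimal value-iteration sketch for a small randomly generated MDP. It is an illustration rather than anything from the chapter: the state and action counts, the cost table, the transition probabilities, the convergence tolerance, and the added discount factor gamma (included so the backup converges over an infinite horizon) are all assumptions.

```python
import numpy as np

# Value iteration for a finite MDP: v(x) = min_u { cost(x,u) + gamma * E[v(next(x,u))] }
# All problem data below are made-up placeholders.
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))   # cost(x, u) >= 0
P = rng.uniform(size=(n_states, n_actions, n_states))      # p(y | x, u)
P /= P.sum(axis=2, keepdims=True)                          # normalize transition rows

gamma = 0.95                                               # discount factor (assumed)
v = np.zeros(n_states)
for _ in range(1000):
    q = cost + gamma * P @ v          # Q-values, shape (n_states, n_actions)
    v_new = q.min(axis=1)             # Bellman backup at every state
    if np.max(np.abs(v_new - v)) < 1e-8:
        v = v_new
        break
    v = v_new

policy = (cost + gamma * P @ v).argmin(axis=1)   # greedy policy w.r.t. the converged values
print(v, policy)
```

Each sweep applies the backup to every state simultaneously; once the values stop changing, the greedy policy extracted from them is optimal for this toy problem.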
MDPs are extensively studied in reinforcement learning, a sub-field of machine learning focusing on optimal control problems with discrete state. In contrast, optimal control theory focuses on problems with continuous state and exploits their rich differential structure.

2 Continuous control: Hamilton-Jacobi-Bellman equations

We now turn to optimal control problems where the state x ∈ R^{n_x} and control u ∈ U(x) ⊆ R^{n_u} are real-valued vectors. To simplify notation we will use the shortcut min_u instead of min_{u ∈ U(x)}, although the latter is implied unless noted otherwise. Consider the stochastic differential equation

$$dx = f(x, u)\, dt + F(x, u)\, dw \qquad (6)$$

where dw is n_w-dimensional Brownian motion. This is sometimes called a controlled Ito diffusion, with f(x, u) being the drift and F(x, u) the diffusion coefficient. In the absence of noise, i.e. when F(x, u) = 0, we can simply write ẋ = f(x, u). However in the stochastic case this would be meaningless because the sample paths of Brownian motion are not differentiable (the term dw/dt is infinite). What equation (6) really means is that the integral of the left hand side is equal to the integral of the right hand side:

$$x(t) - x(0) = \int_0^t f(x(s), u(s))\, ds + \int_0^t F(x(s), u(s))\, dw(s)$$

The last term is an Ito integral, defined for square-integrable functions g(t) as

$$\int_0^t g(s)\, dw(s) = \lim_{n \to \infty} \sum_{k=0}^{n-1} g(s_k)\left(w(s_{k+1}) - w(s_k)\right)$$

where 0 = s_0 < s_1 < ... < s_n = t.

The cost model includes a cost rate ℓ(x, u, t) ≥ 0 accrued along the trajectory, and a final cost h(x) ≥ 0 which is only evaluated at the final state x(t_f). Thus the total cost for a given state-control trajectory {x(t), u(t) : 0 ≤ t ≤ t_f} is defined as

$$J(x(\cdot), u(\cdot)) = h(x(t_f)) + \int_0^{t_f} \ell(x(t), u(t), t)\, dt$$

Keep in mind that we are dealing with a stochastic system. Our objective is to find a control law u = π(x, t) which minimizes the expected total cost for starting at a given (x, t) and acting according to π thereafter. In discrete time the total cost becomes

$$J(x_\cdot, u_\cdot) = h(x_n) + \Delta \sum_{k=0}^{n-1} \ell(x_k, u_k, k\Delta)$$

where n = t_f / Δ is the number of time steps (assume that t_f / Δ is an integer).

2.1 Derivation of the HJB equations

We are now ready to apply dynamic programming to the time-discretized stochastic problem. The development is similar to the MDP case except that the state space is now infinite: it consists of n + 1 copies of R^{n_x}. The reason we need multiple copies of R^{n_x} is that we have a finite-horizon problem, and therefore the time when a given x ∈ R^{n_x} is reached makes a difference.

The state transitions are now stochastic: the probability distribution of x_{k+1} given x_k, u_k is the multivariate Gaussian

$$x_{k+1} \sim \mathcal{N}\left(x_k + \Delta f(x_k, u_k),\ \Delta S(x_k, u_k)\right)$$

where S(x, u) = F(x, u) F(x, u)^T.

The Bellman equation for the optimal value function v is similar to (5), except that v is now a function of space and time. We have

$$v(x, k) = \min_u \left\{ \Delta\, \ell(x, u, k\Delta) + E\left[v(x + \Delta f(x, u) + \xi,\ k + 1)\right] \right\} \qquad (7)$$

where ξ ∼ N(0, Δ S(x, u)), and v(x, n) = h(x).

Consider the second-order Taylor-series expansion of v, with the time index k + 1 suppressed for clarity:

$$v(x + \delta) = v(x) + \delta^T v_x(x) + \tfrac{1}{2}\, \delta^T v_{xx}(x)\, \delta + o(\delta^3)$$

where δ = Δ f(x, u) + ξ, v_x = ∂v/∂x, and v_{xx} = ∂²v/∂x∂x^T.

Now compute the expectation of the optimal value function at the next state, using the above Taylor-series expansion and only keeping terms up to first order in Δ. The result is:

$$E[v] = v(x) + \Delta f(x, u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\left(\Delta S(x, u)\, v_{xx}(x)\right) + o(\Delta^2)$$

The trace term appears because

$$E\left[\xi^T v_{xx}\, \xi\right] = E\left[\mathrm{tr}\left(\xi \xi^T v_{xx}\right)\right] = \mathrm{tr}\left(\mathrm{Cov}[\xi]\, v_{xx}\right) = \mathrm{tr}\left(\Delta S\, v_{xx}\right)$$

Note the second-order derivative v_{xx} in the first-order approximation to E[v]. This is a recurrent theme in stochastic calculus. It is directly related to Ito's lemma, which states that if x(t) is an Ito diffusion with coefficient σ, then

$$dg(x(t)) = g_x(x(t))\, dx(t) + \tfrac{1}{2}\, \sigma^2 g_{xx}(x(t))\, dt$$
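As a side illustration (not from the chapter), here is a minimal Euler-Maruyama simulation of the controlled diffusion (6); each update draws exactly the Gaussian transition x_{k+1} ∼ N(x_k + Δ f, Δ S) used in the derivation above. The drift, diffusion coefficient, control law, time step, and initial state are all made-up placeholders.

```python
import numpy as np

# Euler-Maruyama simulation of dx = f(x,u) dt + F(x,u) dw for a toy 2-D system.
def f(x, u):                       # drift (assumed example)
    return np.array([x[1], u])

def F(x, u):                       # diffusion coefficient (assumed example)
    return np.array([[0.0], [0.5]])

def pi(x, t):                      # placeholder control law u = pi(x, t)
    return -1.0 * x[0] - 0.5 * x[1]

dt, t_f = 0.01, 2.0
n = int(t_f / dt)
rng = np.random.default_rng(1)

x = np.array([1.0, 0.0])
for k in range(n):
    u = pi(x, k * dt)
    dw = rng.normal(0.0, np.sqrt(dt), size=1)   # Brownian increment, variance dt
    x = x + f(x, u) * dt + F(x, u) @ dw         # x_{k+1} ~ N(x_k + dt*f, dt*S)
print(x)
```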
Coming back to the derivation, we substitute the expression for E[v] in (7), move the term v(x) outside the minimization operator (since it does not depend on u), and divide by Δ. Suppressing x, u, k on the right hand side, we have

$$\frac{v(x, k) - v(x, k+1)}{\Delta} = \min_u \left\{ \ell + f^T v_x + \tfrac{1}{2}\, \mathrm{tr}(S\, v_{xx}) + O(\Delta) \right\}$$

Recall that t = kΔ, and consider the optimal value function v(x, t) defined in continuous time. The left hand side in the above equation is then

$$\frac{v(x, t) - v(x, t + \Delta)}{\Delta}$$

In the limit Δ → 0 the latter expression becomes −∂v/∂t, which we denote −v_t. Thus for 0 ≤ t ≤ t_f the optimal value function satisfies the Hamilton-Jacobi-Bellman (HJB) equation

$$-v_t(x, t) = \min_{u \in U(x)} \left\{ \ell(x, u, t) + f(x, u)^T v_x(x, t) + \tfrac{1}{2}\, \mathrm{tr}\left(S(x, u)\, v_{xx}(x, t)\right) \right\} \qquad (8)$$

with boundary condition v(x, t_f) = h(x).

So far we have considered the finite-horizon formulation. One infinite-horizon alternative is the discounted-cost formulation, with total cost

$$J(x(\cdot), u(\cdot)) = \int_0^{\infty} \exp(-\alpha t)\, \ell(x(t), u(t))\, dt$$

with α > 0 being the discount factor. Intuitively this says that future costs are less costly (whatever that means). Here we do not have a final cost h(x), and the cost rate ℓ(x, u) no longer depends on time explicitly. The HJB equation for the optimal value function becomes

$$\alpha\, v(x) = \min_{u \in U(x)} \left\{ \ell(x, u) + f(x, u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\left(S(x, u)\, v_{xx}(x)\right) \right\} \qquad (10)$$

Another alternative is the average-cost-per-stage formulation, with total cost

$$J(x(\cdot), u(\cdot)) = \lim_{t_f \to \infty} \frac{1}{t_f} \int_0^{t_f} \ell(x(t), u(t))\, dt$$

In this case the HJB equation for the optimal value function is

$$\lambda = \min_{u \in U(x)} \left\{ \ell(x, u) + f(x, u)^T v_x(x) + \tfrac{1}{2}\, \mathrm{tr}\left(S(x, u)\, v_{xx}(x)\right) \right\} \qquad (11)$$

where λ ≥ 0 is the average cost per stage, and v now has the meaning of a differential value function.

Equations (10) and (11) do not depend on time, which makes them more amenable to numerical approximations in the sense that we do not need to store a copy of the optimal value function at each point in time. From another point of view, however, (8) may be easier to solve numerically. This is because dynamic programming can be performed in a single backward pass through time: initialize v(x, t_f) = h(x) and simply integrate (8) backward in time, computing the spatial derivatives numerically along the way. In contrast, (10) and (11) call for relaxation methods (such as value iteration or policy iteration) which in the continuous-state case may take an arbitrary number of iterations to converge. Relaxation methods are of course guaranteed to converge in a finite number of iterations for any finite state approximation, but that number may increase rapidly as the discretization of the continuous state space is refined.
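The backward pass just described is straightforward to implement on a grid. Below is a minimal one-dimensional sketch (my own illustration, not from the chapter): the dynamics, cost rate, final cost, noise level, grid resolution, and horizon are all assumptions, v_x and v_xx are approximated by finite differences, and the minimization in (8) is done by brute force over a coarse set of candidate controls.

```python
import numpy as np

# Backward integration of the finite-horizon HJB equation (8) on a 1-D grid.
nx, nu, nt = 101, 21, 200
xs = np.linspace(-2.0, 2.0, nx); dx = xs[1] - xs[0]
us = np.linspace(-1.0, 1.0, nu)
dt = 1.0 / nt                                   # horizon t_f = 1 (assumed)
sigma2 = 0.05                                   # S(x,u) = F F^T, state-independent here

ell = lambda x, u: x**2 + 0.1 * u**2            # cost rate (assumed)
h   = lambda x: 5.0 * x**2                      # final cost (assumed)
f   = lambda x, u: u - 0.5 * x                  # drift (assumed)

v = h(xs)                                       # boundary condition v(x, t_f) = h(x)
for k in range(nt):                             # step backward from t_f to 0
    vx  = np.gradient(v, dx)                    # finite-difference v_x
    vxx = np.gradient(vx, dx)                   # finite-difference v_xx
    # right-hand side of (8) for every grid point x and candidate control u
    rhs = (ell(xs[:, None], us[None, :])
           + f(xs[:, None], us[None, :]) * vx[:, None]
           + 0.5 * sigma2 * vxx[:, None])
    v = v + dt * rhs.min(axis=1)                # v(x, t-dt) = v(x, t) + dt * min_u{...}

print(v[nx // 2])                               # approximate cost-to-go from x = 0 at t = 0
```

This explicit scheme is only a sketch; in practice the time step must be kept small relative to the grid spacing for the finite-difference approximation to remain stable.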
3 Deterministic control: Pontryagin's maximum principle

Optimal control theory is based on two fundamental ideas. One is dynamic programming and the associated optimality principle, introduced by Bellman in the United States. The other is the maximum principle, introduced by Pontryagin in the Soviet Union. The maximum principle applies only to deterministic problems, and yields the same solutions as dynamic programming. Unlike dynamic programming, however, the maximum principle avoids the curse of dimensionality. Here we derive the maximum principle indirectly via the HJB equation, and directly via Lagrange multipliers. We also clarify its relationship to classical mechanics.

3.1 Derivation via the HJB equations

For deterministic dynamics ẋ = f(x, u) the finite-horizon HJB equation (8) becomes

$$-v_t(x, t) = \min_u \left\{ \ell(x, u, t) + f(x, u)^T v_x(x, t) \right\} \qquad (12)$$

Suppose a solution to the minimization problem in (12) is given by an optimal control law π(x, t) which is differentiable in x. Setting u = π(x, t) we can drop the min operator in (12) and write

$$0 = v_t(x, t) + \ell(x, \pi(x, t), t) + f(x, \pi(x, t))^T v_x(x, t)$$

This equation is valid for all x, and therefore can be differentiated w.r.t. x to obtain (in shortcut notation)

$$0 = v_{tx} + \ell_x + \pi_x^T \ell_u + \left(f_x^T + \pi_x^T f_u^T\right) v_x + v_{xx} f$$

Regrouping terms, and using the identity $\frac{d}{dt} v_x = v_{xt} + v_{xx} \dot{x} = v_{tx} + v_{xx} f$, yields

$$0 = \frac{d}{dt} v_x + \ell_x + f_x^T v_x + \pi_x^T \left[\ell_u + f_u^T v_x\right]$$

We now make a key observation: the term in the brackets is the gradient w.r.t. u of the quantity being minimized w.r.t. u in (12). That gradient is zero (assuming unconstrained minimization), which leaves us with

$$-\frac{d}{dt} v_x(x, t) = \ell_x(x, \pi(x, t), t) + f_x(x, \pi(x, t))^T v_x(x, t) \qquad (13)$$

This may look like a PDE for v, but if we think of v_x as a vector p instead of a gradient of a function which depends on x, then (13) is an ordinary differential equation (ODE) for p. That equation holds along any trajectory generated by π(x, t). The vector p is called the costate vector. We are now ready to formulate the maximum principle.
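The costate interpretation connects directly to the "gradient descent methods" listed in the chapter outline: integrating an adjoint equation of the same form as (13) backward along a trajectory yields the gradient of the total cost with respect to an open-loop control sequence. The sketch below is my own illustration of that idea for a made-up scalar problem; the dynamics, costs, initial state, step size, and horizon are all assumptions, and the backward recursion is the discrete-time adjoint of the forward Euler rollout rather than anything stated in the chapter.

```python
import numpy as np

# Adjoint (costate) gradient sketch: minimize J = h(x_n) + dt * sum_k ell(x_k, u_k)
# over an open-loop control sequence, for assumed scalar dynamics and costs.
dt, n = 0.01, 100
f,   f_x,   f_u   = (lambda x, u: -x + u),            (lambda x, u: -1.0), (lambda x, u: 1.0)
ell, ell_x, ell_u = (lambda x, u: x**2 + 0.1 * u**2), (lambda x, u: 2 * x), (lambda x, u: 0.2 * u)
h,   h_x          = (lambda x: 2 * x**2),             (lambda x: 4 * x)

def rollout(u):
    x = np.zeros(n + 1); x[0] = 1.0               # fixed initial state (assumed)
    for k in range(n):
        x[k + 1] = x[k] + dt * f(x[k], u[k])      # forward Euler dynamics
    return x

u = np.zeros(n)
for it in range(300):                             # plain gradient descent on the controls
    x = rollout(u)
    p = h_x(x[n])                                 # costate boundary condition p(t_f) = h_x
    grad = np.zeros(n)
    for k in reversed(range(n)):
        grad[k] = dt * (ell_u(x[k], u[k]) + f_u(x[k], u[k]) * p)
        p = p + dt * (ell_x(x[k], u[k]) + f_x(x[k], u[k]) * p)   # -dp/dt = ell_x + f_x p
    u -= 10.0 * grad                              # gradient step (step size assumed)

x = rollout(u)
J = h(x[n]) + dt * sum(ell(x[k], u[k]) for k in range(n))
print(J)
```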