The principle of optimality
An optimal policy remains optimal backwards in time
Start from the end and work your way backwards: Words to live by. Yet our daily experience is lived forwards in time, not backwards. According to our experience, one thing happens after another: $$x_{k+1} = f_k(x_k,u_k)$$ and if we are to affect it (by our choice of \(u_k\)), then we are going to affect the future and not the past. Perhaps this is why some people naïvely think that an optimal policy can be determined through forward computation. Although I find this view unintuitive, as do many others (more on this below), I have seen it surface from time to time even in the most educated circles.
To formalize the naïve view, let us assume a cost functional, $$F_T(x_T) + \sum_{k=0}^{T−1} F_k(x_k,u_k)$$ and ask if it is possible to determine an optimal policy forwards in time. Define, $$\tilde V_k(x_0,u_0,\dots,u_k) = \sum_{i=0}^k F_i(x_i,u_i)$$ so that, $$\tilde V_{k+1}(x_0,u_0,\dots,u_k,u_{k+1}) = \tilde V_k(x_0,u_0,\dots,u_k) + F_{k+1}(x_{k+1},u_{k+1})$$
Suppose we want to find the minimum¹ of the function \(\tilde V_{k+1}\). We would need to optimize over the variables \(u_0,\dots,u_{k+1}\) without exception. There is in general no way to find the optimum of \(\tilde V_{k}\) and use this information to optimize \(\tilde V_{k+1}\). This is because the term \(x_{k+1}\) in \(F_{k+1}(x_{k+1},u_{k+1})\) depends on all of the variables \(u_0,\dots,u_k\) preceding \(u_{k+1}\), so determining the optimum of \(\tilde V_{k+1}\) will in general require revisiting the decision variables that were optimal for \(\tilde V_{k}\). Therefore, given a certain initial state, optimizing forwards in time can give us an optimal control sequence for that state, but it cannot in general be used to determine an optimal policy.
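To make this concrete, here is a minimal sketch in Python (the two-stage problem, its dynamics, and its costs are entirely my own, chosen only for illustration): picking each decision to minimize the running sum \(\tilde V_k\) forwards in time misses the optimum found by exhaustive search.

```python
from itertools import product

T = 2                           # horizon
controls = [0, 1]

def f(k, x, u):                 # dynamics x_{k+1} = f_k(x_k, u_k)
    return u if k == 0 else x

def F(k, x, u):                 # stage cost F_k(x_k, u_k)
    return u if k == 0 else (10 if x == 0 else 0)

def total_cost(x0, us):         # total cost; the terminal cost F_T is zero here
    cost, x = 0, x0
    for k, u in enumerate(us):
        cost += F(k, x, u)
        x = f(k, x, u)
    return cost

# Forward, greedy: at each step pick the u_k that adds the least to the running sum.
x, greedy = 0, []
for k in range(T):
    u_best = min(controls, key=lambda u: F(k, x, u))
    greedy.append(u_best)
    x = f(k, x, u_best)

# Exhaustive search over all control sequences, for reference.
best = min(product(controls, repeat=T), key=lambda us: total_cost(0, us))

print("greedy forward:", greedy, "cost", total_cost(0, greedy))    # [0, 0] cost 10
print("optimal       :", list(best), "cost", total_cost(0, best))  # [1, 0] cost 1
```

Paying a small cost at \(k=0\) steers the state to where the later cost vanishes; a pass that only looks at the cost accumulated so far cannot see this.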
The inverse of this realization is encapsulated in the principle of optimality, which states that:
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
R.E. Bellman (1957)
Bellman had more to say on the topic of determining an optimal policy:
In place of determining the optimal sequence of decisions from the fixed state of the system, we wish to determine the optimal decision to be made at any state of the system. Only if we know the latter, do we understand the intrinsic structure of the solution.
R.E. Bellman (1957)
The implication is that the principle of optimality leads to closed-loop control laws, as opposed to mere open-loop control. Previous to Bellman, various thinkers had encountered this principle and even dismissed it as trivial. Remarkably, Rufus Isaacs formulated his own, game-theoretic version and called it the tenet of transition:
If the play proceeds from one position to a second and V is thought of as known at the second, then it is determined at the first by demanding that the players optimize (that is, make minimax) the increment of V during the transition.
R.P. Isaacs (1951)
Isaacs, even with an intrinsic understanding that differential games require closed-loop control laws, later came to regret dismissing the importance of his tenet:
Once I felt that here was the heart of the subject and cited it often in the early Rand seminars. Later I felt that it – like other mathematical hearts – was a mere truism. Thus, in Differential Games it is mentioned only by title. This I regret. I had no idea that Pontryagin's principle and Bellman's maximal principle (a special case of the tenet, appearing a little later in the Rand seminars) would enjoy such widespread citation.
R.P. Isaacs (1973)
Let me now state my own understanding of the principle:
An optimal policy remains optimal backwards in time.
Start from the end and work your way backwards: Words to live by.
The rest of this post contains derivations of the Bellman and Hamilton-Jacobi-Bellman (HJB) equations.
Discrete time (Bellman equation)
Consider the optimal control problem, $$\min_u F_T(x_T) + \sum_{i=0}^{T−1} F_i(x_i,u_i)$$ where, $$x_{k+1}=f_k(x_k,u_k)$$ for all \(k \in \mathbb Z_T\), and let, $$V_k(x') = \min_u F_T(x_T)+ \sum_{i=k}^{T−1} F_i(x_i,u_i)$$ where \(x_k = x'\).
The Bellman equation follows easily, \[ \begin{align} V_k(x') &= \min_u F_T(x_T) + \sum_{i=k+1}^{T−1} F_i(x_i,u_i) + F_k(x',u_k) \newline &= \min_{u_k} V_{k+1}(x_{k+1})+F_k(x',u_k) \end{align} \] where the minimization over \(u_{k+1},\dots,u_{T-1}\) has been absorbed into \(V_{k+1}\); this is the principle of optimality at work. Replacing with the expression for \(x_{k+1}\), we obtain the result: $$V_k(x)= \min_u V_{k+1}(f_k(x,u)) + F_k(x,u)$$ with final condition \(V_T(x) = F_T(x)\).
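For concreteness, here is a minimal sketch of this backward induction in Python on a synthetic finite problem (the tabular dynamics and costs below are random placeholders of my own):

```python
import numpy as np

# Backward induction on a synthetic finite-horizon problem: states and controls
# are indices, f[k][x, u] is the next state, F[k][x, u] the stage cost,
# F_T the terminal cost (all generated at random).
rng = np.random.default_rng(0)
T, n_x, n_u = 5, 4, 3

f = [rng.integers(0, n_x, size=(n_x, n_u)) for _ in range(T)]   # f_k(x, u)
F = [rng.random((n_x, n_u)) for _ in range(T)]                  # F_k(x, u)
F_T = rng.random(n_x)                                           # F_T(x)

# V_T(x) = F_T(x);  V_k(x) = min_u V_{k+1}(f_k(x, u)) + F_k(x, u)
V = F_T.copy()
policy = []
for k in reversed(range(T)):
    Q = V[f[k]] + F[k]                  # Q_k(x, u) = V_{k+1}(f_k(x, u)) + F_k(x, u)
    policy.insert(0, Q.argmin(axis=1))  # optimal decision at every state, stage k
    V = Q.min(axis=1)                   # V_k(x)

print("V_0:  ", V)          # optimal cost-to-go from every initial state
print("u_0*: ", policy[0])  # optimal first decision as a function of the state
```

Note that the backward pass hands us the optimal decision at every state and every stage, i.e. a closed-loop policy, which is precisely the point Bellman makes in the quote above.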
Discounted time
Oftentimes, the cost functional is discounted in time: $$\min_u \gamma^T \bar F_T(x_T) + \sum_{i=0}^{T−1} \gamma^i \bar F_i(x_i,u_i)$$
A direct application of the Bellman equation gives us the result, $$V_k(x) = \min_u V_{k+1}(f_k(x,u)) + \gamma^k \bar F_k(x,u)$$
To avoid carrying the vanishing factors \(\gamma^k\) through the Bellman equation, we make the change of variables \(V_k = \gamma^k \bar V_k\), obtaining, $$\bar V_k(x) = \min_u \gamma \bar V_{k+1}(f_k(x,u))+\bar F_k(x,u)$$ with final condition \(\bar V_T(x) = \bar F_T(x)\).
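As a quick numerical sanity check (again with synthetic data of my own), the rescaled recursion reproduces the original one up to the factor \(\gamma^k\), so the two value functions coincide at \(k = 0\):

```python
import numpy as np

# Verify numerically that V_k = gamma**k * Vbar_k on a synthetic,
# time-invariant problem (dynamics and costs are random placeholders).
rng = np.random.default_rng(1)
T, n_x, n_u, gamma = 6, 4, 3, 0.9

f = rng.integers(0, n_x, size=(n_x, n_u))   # f(x, u)
Fbar = rng.random((n_x, n_u))               # undiscounted stage cost
Fbar_T = rng.random(n_x)                    # undiscounted terminal cost

V = gamma**T * Fbar_T      # V_T(x) = gamma**T * Fbar_T(x)
Vbar = Fbar_T.copy()       # Vbar_T(x) = Fbar_T(x)

for k in reversed(range(T)):
    V = (V[f] + gamma**k * Fbar).min(axis=1)      # original Bellman recursion
    Vbar = (gamma * Vbar[f] + Fbar).min(axis=1)   # rescaled recursion

print(np.allclose(V, Vbar))   # True: V_0 == gamma**0 * Vbar_0
```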
Continuous time (HJB equation)
Consider the optimal control problem, $$\min_u F_T(x(T)) + \int_0^T F(x(t),u(t),t)dt$$ where, $$\dot x(t) = f(x(t),u(t),t)$$ for all \(t \in [0,T]\), and let, $$V(x',t) = \min_u F_T(x(T))+ \int_t^T F(x(\tau),u(\tau),\tau)d\tau$$ where \(x(t) = x'\).
We begin as in the derivation of the Bellman equation, optimizing over a time-step \(h\), \[ \begin{align} V(x',t) &= \min_u F_T(x(T)) + \int_{t+h}^T F(x(\tau),u(\tau),\tau)d\tau + \int_t^{t+h} F(x(\tau),u(\tau),\tau)d\tau \newline &= \min_{u|_{[t,t+h]}} V(x(t+h),t+h) + \int_t^{t+h} F(x(\tau),u(\tau),\tau)d\tau \end{align} \]
Subtracting \(V(x(t),t+h)\) from both sides, \[ \begin{multline} V(x',t)-V(x(t),t+h) \newline = \min_{u|_{[t,t+h]}} V(x(t+h),t+h)-V(x(t),t+h) + \int_t^{t+h} F(x(\tau),u(\tau),\tau)d\tau \end{multline} \] Dividing by \(h\) and taking the limit \(h \to 0\) (recall that \(x(t) = x'\)), $$−\frac{\partial V}{\partial t}(x',t) = \min_u \frac{\partial V}{\partial x}(x',t) \cdot \dot x + F(x',u,t)$$
Replacing with the expression for \(\dot x\), we obtain the result: $$−\frac{\partial V}{\partial t}(x,t) = \min_u \frac{\partial V}{\partial x}(x,t) \cdot f(x,u,t) + F(x,u,t)$$ with final condition \(V(x,T) = F_T(x)\).
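To see the HJB equation at work, here is a small sketch of a scalar linear-quadratic problem (my own example; all constants are arbitrary). Plugging the quadratic ansatz \(V(x,t) = p(t)\,x^2\) into the HJB equation reduces it to a Riccati ODE for \(p(t)\), which is integrated backwards from the final condition; simulating the resulting feedback law then recovers the predicted cost.

```python
import numpy as np

# Scalar LQR instance of the HJB equation: with dx/dt = a*x + b*u and cost
# integral of (q*x^2 + r*u^2) dt plus terminal cost p_T*x(T)^2, the ansatz
# V(x, t) = p(t)*x^2 turns the HJB equation into the Riccati ODE
#     dp/dt = -2*a*p + (b**2 / r) * p**2 - q,    p(T) = p_T,
# with optimal feedback u*(x, t) = -(b*p(t)/r) * x.
a, b, q, r, p_T = -0.5, 1.0, 1.0, 0.2, 2.0
T, N = 2.0, 20000
dt = T / N

# Integrate the Riccati ODE backwards in time from the final condition.
p = np.empty(N + 1)
p[N] = p_T
for k in range(N, 0, -1):
    dp = -2 * a * p[k] + (b**2 / r) * p[k]**2 - q
    p[k - 1] = p[k] - dt * dp

# Cross-check: simulate the closed loop with u = -(b*p/r)*x and compare the
# incurred cost against the value function prediction V(x0, 0) = p(0)*x0^2.
x0 = 1.5
x, cost = x0, 0.0
for k in range(N):
    u = -(b * p[k] / r) * x
    cost += (q * x**2 + r * u**2) * dt
    x += (a * x + b * u) * dt
cost += p_T * x**2

print("V(x0, 0)         =", p[0] * x0**2)
print("closed-loop cost =", cost)   # agrees up to discretization error
```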
Discounted time
When the cost functional is discounted in time, $$\min_u e^{-\rho T}\bar F_T(x(T)) + \int_0^T e^{-\rho t}\bar F(x(t),u(t),t)dt$$ we can proceed as in discrete time. Specifically, we make the change of variables \(V = e^{-\rho t} \bar V\), obtaining, $$−\frac{\partial \bar V}{\partial t}(x,t) = -\rho\bar V(x,t) + \min_u \frac{\partial \bar V}{\partial x}(x,t) \cdot f(x,u,t) + \bar F(x,u,t)$$
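For completeness, the substitution is just the product rule applied to \(V = e^{-\rho t} \bar V\): $$\frac{\partial V}{\partial t} = e^{-\rho t}\left(\frac{\partial \bar V}{\partial t} - \rho \bar V\right), \qquad \frac{\partial V}{\partial x} = e^{-\rho t}\frac{\partial \bar V}{\partial x},$$ so substituting into the HJB equation and cancelling the common factor \(e^{-\rho t}\) gives the equation above, with final condition \(\bar V(x,T) = \bar F_T(x)\).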
Postface
This post started when I sat down to re-derive the HJB equation for the umpteenth time. I figured it was about time to put a stop to that and write a note that I could refer to in the future. The history lesson is a bonus; I hope you enjoyed it.
References
H.J. Pesch & R. Bulirsch, "The maximum principle, Bellman's equation, and Carathéodory's work," 1994
D. Liberzon, "Calculus of variations and optimal control theory: A concise introduction," 2011
¹ For convenience, I am assuming in this post that functions are well-behaved enough so that the extremum not only exists, but is unique.