2.3.2 Optimal Plans of Unspecified Lengths

The value-iteration method for fixed-length plans can be generalized nicely to the case in which plans of different lengths are allowed. There will be no bound on the maximal length of a plan; therefore, the current case is truly a generalization of Formulation 2.1 because arbitrarily long plans may be attempted in efforts to reach ${X_{G}}$ . The model for the general case does not require the specification of

but instead introduces a special action,

Formulation 2..3 (Discrete Optimal Planning)

All of the components from Formulation 2.1 are inherited directly: , , , ${x_{I}}$ , and ${X_{G}}$ . Also, the notion of stages from Formulation 2.2 is used.
Let denote a stage-additive cost functional, which may be applied to any -step plan, $\pi _K$ , to yield

$\displaystyle L(\pi _K) = \sum_{k=1}^K l(x_k,u_k) + l_F(x_F) .$ (2.17)

In comparison with from Formulation 2.2, the present expression does not consider as a predetermined constant. It will now vary, depending on the length of the plan. Thus, the domain of is much larger.
Each contains the special termination action, . If is applied at , then the action is repeatedly applied forever, the state remains unchanged, and no more cost accumulates. Thus, for all $i \geq k$ , , , and .

The termination action is the key to allowing plans of different lengths. It will appear throughout this book. Suppose that value iterations are performed up to

, and for the problem there exists a two-step solution plan,

, that arrives in ${X_{G}}$ from ${x_{I}}$ . This plan is equivalent to the five-step plan

because the termination action does not change the state, nor does it accumulate cost. The resulting five-step plan reaches ${X_{G}}$ and costs the same as

. With this simple extension, the forward and backward value iteration methods of Section 2.3.1 may be applied for any fixed

to optimize over all plans of length

or less (instead of fixing

The next step is to remove the dependency on

. Consider running backward value iterations indefinitely. At some point,

will be computed, but there is no reason why the process cannot be continued onward to

, $G^*_{-1}$ , and so on. Recall that ${x_{I}}$ is not utilized in the backward value-iteration method; therefore, there is no concern regarding the starting initial state of the plans. Suppose that backward value iteration was applied for

and was executed down to $G^*_{-8}$ . This considers all plans of length

or less. Note that it is harmless to add

to all stage indices to shift all of the cost-to-go functions. Instead of running from $G^*_{-8}$ to $G^*_{16}$ , they can run from

to $G^*_{25}$ without affecting their values. The index shifting is allowed because none of the costs depend on the particular index that is given to the stage. The only important aspect of the value iterations is that they proceed backward and consecutively from stage to stage.

Eventually, enough iterations will have been executed so that an optimal plan is known from every state that can reach ${X_{G}}$ . From that stage, say

, onward, the cost-to-go values from one value iteration to the next will be stationary, meaning that for all $i \leq k$ , $G^*_{i-1}(x) = G^*_i(x)$ for all $x \in X$ . Once the stationary condition is reached, the cost-to-go function no longer depends on a particular stage

. In this case, the stage index may be dropped, and the recurrence becomes

Are there any conditions under which backward value iterations could be executed forever, with each iteration producing a cost-to-go function for which some values are different from the previous iteration? If

is nonnegative for all $x \in X$ and $u \in U(x)$ , then this could never happen. It could certainly be true that, for any fixed

, longer plans will exist, but this cannot be said of optimal plans. From every $x \in X$ , there either exists a plan that reaches $X_{G}$ with finite cost or there is no solution. For each state from which there exists a plan that reaches ${X_{G}}$ , consider the number of stages in the optimal plan. Consider the maximum number of stages taken from all states that can reach ${X_{G}}$ . This serves as an upper bound on the number of value iterations before the cost-to-go becomes stationary. Any further iterations will just consider solutions that are worse than the ones already considered (some may be equivalent due to the termination action and shifting of stages). Some trouble might occur if

contains negative values. If the state transition graph contains a cycle for which total cost is negative, then it is preferable to execute a plan that travels around the cycle forever, thereby reducing the total cost to $-\infty$ . Therefore, we will assume that the cost functional is defined in a sensible way so that negative cycles do not exist. Otherwise, the optimization model itself appears flawed. Some negative values for

, however, are allowed as long as there are no negative cycles. (It is straightforward to detect and report negative cycles before running the value iterations.)

Since the particular stage index is unimportant, let

be the index of the final stage, which is the stage at which the backward value iterations begin. Hence,

is the final stage cost, which is obtained directly from

. Let

denote the stage index at which the cost-to-go values all become stationary. At this stage, the optimal cost-to-go function, $G^*: X \rightarrow {\mathbb{R}} \cup \{\infty\}$ , is expressed by assigning $G^*= G^*_{-K}$ . In other words, the particular stage index no longer matters. The value

gives the optimal cost to go from state $x \in X$ to the specific goal state ${x_{G}}$ .

If the optimal actions are not stored during the value iterations, the optimal cost-to-go,

, can be used to efficiently recover them. Consider starting from some $x \in X$ . What is the optimal next action? This is given by

As in the case of fixed-length plans, the direction of the value iterations can be reversed to obtain a forward value-iteration method for the variable-length planning problem. In this case, the backward state transition equation, ${f^{-1}}$ , is used once again. Also, the initial cost term

is used instead of

, as in (2.14). The forward value-iteration method starts at

, and then iterates until the cost-to-come becomes stationary. Once again, the termination action,

, preserves the cost of plans that arrived at a state in earlier iterations. Note that it is not required to specify ${X_{G}}$ . A counterpart to

may be obtained, from which optimal actions can be recovered. When the cost-to-come values become stationary, an optimal cost-to-come function, $C^*: X \rightarrow {\mathbb{R}}\cup \{\infty\}$ , may be expressed by assigning

, in which

is the final stage reached when the algorithm terminates. The value

gives the cost of an optimal plan that starts from ${x_{I}}$ and reaches

. The optimal action sequence for any specified goal ${x_{G}}\in X$ can be obtained using

Example 2..5 (Value Iteration for Variable-Length Plans) Once again, Example 2.3 is revisited; however, this time the plan length is not fixed due to the termination action. Its effect is depicted in Figure 2.13 by the superposition of new edges that have zero cost. It might appear at first that there is no incentive to choose nontermination actions, but remember that any plan that does not terminate in state $x_{G} = d$ will receive infinite cost.

**Figure 2.13:** Compare this figure to Figure 2.11, for which was fixed at . The effect of the termination action is depicted as dashed-line edges that yield 0 cost when traversed. This enables plans of all finite lengths to be considered. Also, the stages extend indefinitely to the left (for the case of backward value iteration).
$\begin{figure}\centerline{\psfig{figure=figs/fivestate4.eps,width=5.0in} }\end{figure}$

**Figure 2.14:** The optimal cost-to-go functions computed by backward value iteration applied in the case of variable-length plans.
$\begin{figure}\begin{center} \begin{tabular}{\vert c\vert c\vert c\vert c\vert c... ...$ & 4 & 2 & 1 & 0 & $\infty$ \hline \end{tabular}\end{center} \end{figure}$

See Figure 2.14. After a few backward value iterations, the cost-to-go values become stationary. After this point, the termination action is being applied from all reachable states and no further cost accumulates. The final cost-to-go function is defined to be . Since is not reachable from , $G^*(e) = \infty$ .

As an example of using (2.19) to recover optimal actions, consider starting from state . The action that leads to is chosen next because the total cost is better than (the comes from the action cost). From state , the optimal action leads to , which produces total cost . Similarly, the next action leads to $d \in {X_{G}}$ , which terminates the plan.

**Figure 2.15:** The optimal cost-to-come functions computed by forward value iteration applied in the case of variable-length plans.
$\begin{figure}\begin{center} \begin{tabular}{\vert c\vert c\vert c\vert c\vert c... ...ine $C^*$ & 2 & 0 & 1 & 2 & 3 \\ \hline \end{tabular}\end{center} \end{figure}$

Using forward value iteration, suppose that ${x_{I}}= b$ . The following cost-to-come functions shown in Figure 2.15 are obtained. For any finite value that remains constant from one iteration to the next, the termination action was applied. Note that the last value iteration is useless in this example. Once is computed, the optimal cost-to-come to every possible state from ${x_{I}}$ is determined, and future cost-to-come functions are identical. Therefore, the final cost-to-come is renamed . $\blacksquare$