2.3.2 Optimal Plans of Unspecified Lengths

The value-iteration method for fixed-length plans can be generalized nicely to the case in which plans of different lengths are allowed. There will be no bound on the maximal length of a plan; therefore, the current case is truly a generalization of Formulation 2.1 because arbitrarily long plans may be attempted in efforts to reach . The model for the general case does not require the specification of but instead introduces a special action, .

- All of the components from Formulation 2.1 are inherited directly: , , , , and . Also, the notion of stages from Formulation 2.2 is used.
- Let denote a stage-additive cost functional, which may be
applied to any -step plan, , to yield
(2.17)

In comparison with from Formulation 2.2, the present expression does not consider as a predetermined constant. It will now vary, depending on the length of the plan. Thus, the domain of is much larger. - Each contains the special
*termination action*, . If is applied at , then the action is repeatedly applied forever, the state remains unchanged, and no more cost accumulates. Thus, for all , , , and .

The termination action is the key to allowing plans of different lengths. It will appear throughout this book. Suppose that value iterations are performed up to , and for the problem there exists a two-step solution plan, , that arrives in from . This plan is equivalent to the five-step plan because the termination action does not change the state, nor does it accumulate cost. The resulting five-step plan reaches and costs the same as . With this simple extension, the forward and backward value iteration methods of Section 2.3.1 may be applied for any fixed to optimize over all plans of length or less (instead of fixing ).

The next step is to remove the dependency on . Consider running backward value iterations indefinitely. At some point, will be computed, but there is no reason why the process cannot be continued onward to , , and so on. Recall that is not utilized in the backward value-iteration method; therefore, there is no concern regarding the starting initial state of the plans. Suppose that backward value iteration was applied for and was executed down to . This considers all plans of length or less. Note that it is harmless to add to all stage indices to shift all of the cost-to-go functions. Instead of running from to , they can run from to without affecting their values. The index shifting is allowed because none of the costs depend on the particular index that is given to the stage. The only important aspect of the value iterations is that they proceed backward and consecutively from stage to stage.

Eventually, enough iterations will have been executed so that an
optimal plan is known from every state that can reach . From
that stage, say , onward, the cost-to-go values from one value
iteration to the next will be *stationary*, meaning that for all ,
for all . Once the stationary condition is
reached, the cost-to-go function no longer depends on a particular
stage . In this case, the stage index may be dropped, and the
recurrence becomes

Are there any conditions under which backward value iterations could
be executed forever, with each iteration producing a cost-to-go
function for which some values are different from the previous
iteration? If is nonnegative for all and
, then this could never happen. It could certainly be true that,
for any fixed , longer plans will exist, but this cannot be said of
*optimal* plans. From every , there either exists a plan
that reaches with finite cost or there is no solution. For
each state from which there exists a plan that reaches ,
consider the number of stages in the optimal plan. Consider the
maximum number of stages taken from all states that can reach
. This serves as an upper bound on the number of value
iterations before the cost-to-go becomes stationary. Any further
iterations will just consider solutions that are worse than the ones
already considered (some may be equivalent due to the termination
action and shifting of stages). Some trouble might occur if
contains negative values. If the state transition graph contains a
cycle for which total cost is negative, then it is preferable to
execute a plan that travels around the cycle forever, thereby reducing
the total cost to . Therefore, we will assume that the cost
functional is defined in a sensible way so that negative cycles do not
exist. Otherwise, the optimization model itself appears flawed. Some
negative values for , however, are allowed as long as there
are no negative cycles. (It is straightforward to detect and report
negative cycles before running the value iterations.)

Since the particular stage index is unimportant, let be the index of the final stage, which is the stage at which the backward value iterations begin. Hence, is the final stage cost, which is obtained directly from . Let denote the stage index at which the cost-to-go values all become stationary. At this stage, the optimal cost-to-go function, , is expressed by assigning . In other words, the particular stage index no longer matters. The value gives the optimal cost to go from state to the specific goal state .

If the optimal actions are not stored during the value iterations, the optimal cost-to-go, , can be used to efficiently recover them. Consider starting from some . What is the optimal next action? This is given by

in which denotes the argument that achieves the minimum value of the expression. The action minimizes an expression that is very similar to (2.11). The only differences between (2.19) and (2.11) are that the stage indices are dropped in (2.19) because the cost-to-go values no longer depend on them, and is used so that is selected. After applying , the state transition equation is used to obtain , and (2.19) may be applied again on . This process continues until a state in is reached. This procedure is based directly on the dynamic programming recurrence; therefore, it recovers the optimal plan. The function serves as a kind of guide that leads the system from any initial state into the goal set optimally. This can be considered as a special case of a

As in the case of fixed-length plans, the direction of the value iterations can be reversed to obtain a forward value-iteration method for the variable-length planning problem. In this case, the backward state transition equation, , is used once again. Also, the initial cost term is used instead of , as in (2.14). The forward value-iteration method starts at , and then iterates until the cost-to-come becomes stationary. Once again, the termination action, , preserves the cost of plans that arrived at a state in earlier iterations. Note that it is not required to specify . A counterpart to may be obtained, from which optimal actions can be recovered. When the cost-to-come values become stationary, an optimal cost-to-come function, , may be expressed by assigning , in which is the final stage reached when the algorithm terminates. The value gives the cost of an optimal plan that starts from and reaches . The optimal action sequence for any specified goal can be obtained using

which is the forward counterpart of (2.19). The is the action in that yields when the state transition function, , is applied. The iterations proceed backward from and terminate when is reached.

See Figure 2.14. After a few backward value iterations, the cost-to-go values become stationary. After this point, the termination action is being applied from all reachable states and no further cost accumulates. The final cost-to-go function is defined to be . Since is not reachable from , .

As an example of using (2.19) to recover optimal actions, consider starting from state . The action that leads to is chosen next because the total cost is better than (the comes from the action cost). From state , the optimal action leads to , which produces total cost . Similarly, the next action leads to , which terminates the plan.

Using forward value iteration, suppose that
. The
following cost-to-come functions shown in Figure 2.15 are
obtained. For any finite value that remains constant from one
iteration to the next, the termination action was applied. Note that
the last value iteration is useless in this example. Once
is computed, the optimal cost-to-come to every possible state from
is determined, and future cost-to-come functions are
identical. Therefore, the final cost-to-come is renamed .

Steven M LaValle 2012-04-20