10.4.3 Q-Learning: Computing an Optimal Plan

This section moves from evaluating a plan to computing an optimal plan in the simulation-based framework. The central idea is the computation of Q-factors, $Q^*(x,u)$. These extend the optimal cost-to-go, $G^*$, by recording the optimal cost for each possible combination of a state, $x \in X$, and an action, $u \in U(x)$. The interpretation of $Q^*(x,u)$ is the expected cost received by starting from state $x$, applying $u$, and then following the optimal plan from the resulting next state, $x' = f(x,u,\theta)$. If $u$ happens to be the action that the optimal plan would select, $u = \pi^*(x)$, then $Q^*(x,u) = G^*(x)$. Thus, the Q-value can be thought of as the cost of making an arbitrary choice in the first stage and then making optimal decisions afterward.
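
To make the definition concrete, the following is a minimal sketch that computes Q-factors for a tiny stochastic shortest-path problem. It is an illustration under stated assumptions, not the simulation-based method of this section: it assumes the transition probabilities induced by $\theta$ are known and applies a model-based fixed-point iteration, $Q(x,u) \leftarrow l(x,u) + E_\theta[\min_{u'} Q(x',u')]$, with a cost term $l(x,u)$ that does not depend on $\theta$. The three-state model and the names `MODEL`, `GOAL`, `cost_to_go`, `q_value_iteration`, and `greedy_plan` are hypothetical and chosen only for this example.

```python
# Hypothetical toy model (not from the text):
# MODEL[x][u] = (cost l(x,u), [(probability, next state x'), ...])
GOAL = 2
MODEL = {
    0: {"a": (1.0, [(1.0, 1)]), "b": (5.0, [(1.0, GOAL)])},
    1: {"a": (1.0, [(0.5, GOAL), (0.5, 1)]), "b": (3.0, [(1.0, GOAL)])},
}


def cost_to_go(Q, model, goal, x):
    """G(x) = min over u of Q(x,u); the goal state incurs no further cost."""
    if x == goal:
        return 0.0
    return min(Q[(x, u)] for u in model[x])


def q_value_iteration(model, goal, sweeps=200):
    """Repeatedly apply Q(x,u) <- l(x,u) + E[G(x')], starting from all zeros."""
    Q = {(x, u): 0.0 for x in model for u in model[x]}
    for _ in range(sweeps):
        # Synchronous update: every new entry is computed from the previous Q.
        Q = {
            (x, u): cost
            + sum(p * cost_to_go(Q, model, goal, xn) for p, xn in outcomes)
            for x in model
            for u, (cost, outcomes) in model[x].items()
        }
    return Q


def greedy_plan(Q, model):
    """pi(x) = the action that minimizes Q(x,u) in each state."""
    return {x: min(model[x], key=lambda u: Q[(x, u)]) for x in model}


if __name__ == "__main__":
    Q = q_value_iteration(MODEL, GOAL)
    for (x, u), value in sorted(Q.items()):
        print(f"Q({x},{u}) = {value:.3f}")
    print("greedy plan:", greedy_plan(Q, MODEL))
```

In this toy model the iteration converges to $Q(0,a) = 3$, $Q(0,b) = 5$, $Q(1,a) = 2$, and $Q(1,b) = 3$, so the greedy plan picks `a` in both states and $G(0) = 3$, $G(1) = 2$. The entries for `b` show the other half of the interpretation: they are the cost of making a non-optimal first choice and then acting optimally afterward.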


Steven M LaValle 2012-04-20