Temporal differences

Rather than waiting until the end of each trial to compute $ c_i(x_i)$, it is possible to update each state, $ x_i$, immediately after it is visited and $ l(x_i,u_i)$ is received from the simulator. This leads to a well-known method of estimating the cost-to-go called temporal differences [929]. It is very similar to the method already given but somewhat more complicated. It will be introduced here because the method frequently appears in reinforcement learning literature, and an extension of it leads to a nice simulation-based method for updating the estimated cost-to-go.

Once again, consider the sequence $ x_1$, $ \ldots $, $ x_K$, $ x_F$ generated by a trial. Let $ d_k$ denote a temporal difference, which is defined as

$\displaystyle d_k = l(x_k,u_k) + \hat{G}_\pi (x_{k+1}) - \hat{G}_\pi (x_k) .$ (10.87)

Note that both $ l(x_k,u_k) + \hat{G}_\pi (x_{k+1})$ and $ \hat{G}_\pi (x_k)$ could each serve as an estimate of $ G_\pi (x_k)$. The difference is that the first of these, which forms the left part of the difference in (10.87), utilizes the latest cost obtained from the simulator for the first step and then uses $ \hat{G}_\pi (x_{k+1})$ as an estimate of the remaining cost. In this and subsequent expressions, every action, $ u_k$, is chosen using the plan: $ u_k = \pi (x_k)$.
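
As a small illustration of (10.87), the following Python sketch computes a single temporal difference from one simulated step under the plan. The names simulate_step, pi, and G_hat are hypothetical stand-ins for the simulator, the plan $ \pi $, and the table of estimates $ \hat{G}_\pi $; they are introduced here only for illustration.

```python
# Minimal sketch: one temporal difference d_k, following (10.87).
# simulate_step, pi, and G_hat are assumed (hypothetical) names.

def temporal_difference(x_k, pi, G_hat, simulate_step):
    """Return (d_k, x_next) for one simulated step of the plan pi from x_k."""
    u_k = pi[x_k]                           # action chosen by the plan: u_k = pi(x_k)
    x_next, cost = simulate_step(x_k, u_k)  # simulator returns x_{k+1} and l(x_k, u_k)
    d_k = cost + G_hat[x_next] - G_hat[x_k]
    return d_k, x_next
```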

Let $ v_k$ denote the number of times that $ x_k$ has been visited so far, for each $ 1 \leq k \leq K$, including previous trials and the current visit. The following update algorithm can be used during the trial. When $ x_2$ is reached, the value at $ x_1$ is updated as

$\displaystyle \hat{G}_\pi (x_1) := \hat{G}_\pi (x_1) + \frac{1}{v_1} d_1 .$ (10.88)

When $ x_3$ is reached, the values at $ x_1$ and $ x_2$ are updated as

\begin{displaymath}
\begin{split}
\hat{G}_\pi (x_1) := & \hat{G}_\pi (x_1) + \frac{1}{v_1} d_2 \\
\hat{G}_\pi (x_2) := & \hat{G}_\pi (x_2) + \frac{1}{v_2} d_2 .
\end{split}
\end{displaymath} (10.89)

Now consider what has been done so far at $ x_1$. The temporal differences partly collapse:

\begin{displaymath}
\begin{split}
\hat{G}_\pi (x_1) := & \hat{G}_\pi (x_1) + \frac{1}{v_1} (d_1 + d_2) \\
 = & \hat{G}_\pi (x_1) + \frac{1}{v_1} \big(l(x_1,u_1) + l(x_2,u_2) - \hat{G}_\pi (x_1) + \hat{G}_\pi (x_3)\big) .
\end{split}
\end{displaymath} (10.90)

When $ x_4$ is reached, similar updates are performed. In general, once $ x_{k+1}$ is reached and $ d_k$ becomes available, the updates are

\begin{displaymath}
\begin{split}
\hat{G}_\pi (x_1) := & \hat{G}_\pi (x_1) + \frac{1}{v_1} d_k \\
\hat{G}_\pi (x_2) := & \hat{G}_\pi (x_2) + \frac{1}{v_2} d_k \\
 & \;\; \vdots \\
\hat{G}_\pi (x_k) := & \hat{G}_\pi (x_k) + \frac{1}{v_k} d_k .
\end{split}
\end{displaymath} (10.91)

The updates are performed in this way until $ x_F \in
{X_{G}}$ is reached. Now consider what was actually computed for each $ x_k$. The temporal differences form a telescoping sum that collapses, as shown in (10.90) after two iterations. After all iterations have been completed, the value at $ x_k$ has been updated as

\begin{displaymath}
\begin{split}
\hat{G}_\pi (x_k) := & \hat{G}_\pi (x_k) + \frac{1}{v_k} (d_k + d_{k+1} + \cdots + d_K) \\
 = & \hat{G}_\pi (x_k) + \frac{1}{v_k} \big(l(x_k,u_k) + l(x_{k+1},u_{k+1}) + \cdots + l(x_K,u_K) - \hat{G}_\pi (x_k) + \hat{G}_\pi (x_F)\big) \\
 = & \hat{G}_\pi (x_k) + \frac{1}{v_k} \big(l(x_k,u_k) + l(x_{k+1},u_{k+1}) + \cdots + l(x_K,u_K) - \hat{G}_\pi (x_k)\big) .
\end{split}
\end{displaymath} (10.92)

The final $ \hat{G}_\pi (x_F)$ was deleted because its value is zero, assuming that the termination action is applied by $ \pi $. The resulting final expression is equivalent to (10.85) if each visited state in the sequence was distinct. This is often not true, which makes the method discussed above differ slightly from the method of (10.85) because the count, $ v_k$, may change during the trial in the temporal difference scheme. This difference, however, is negligible, and the temporal difference method computes estimates that converge to $ G_\pi $ [96,97].
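
To make the bookkeeping of (10.88)-(10.91) concrete, here is a minimal Python sketch of one trial of the within-trial update scheme. The simulator interface, the dictionaries G_hat and v (which persist across trials), and the goal set goal are assumptions introduced only for illustration.

```python
# Sketch of one trial of the temporal difference scheme in (10.88)-(10.91).
# Assumed interface: pi maps states to actions, goal plays the role of X_G,
# simulate_step(x, u) returns (x_next, cost) with cost = l(x, u), and the
# dictionaries G_hat and v persist across trials (G_hat stays 0 on goal states).

def td_trial(x1, pi, goal, simulate_step, G_hat, v):
    """Run one trial from x1, updating G_hat in place as each d_k arrives."""
    visited = []                               # x_1, ..., x_k in order of visitation
    x = x1
    while x not in goal:
        u = pi[x]                              # u_k = pi(x_k)
        visited.append(x)
        v[x] = v.get(x, 0) + 1                 # visit count v_k, including this visit
        x_next, cost = simulate_step(x, u)
        d = cost + G_hat.get(x_next, 0.0) - G_hat.get(x, 0.0)   # d_k, as in (10.87)
        for x_j in visited:                    # apply d_k to every state visited so far
            G_hat[x_j] = G_hat.get(x_j, 0.0) + d / v[x_j]
        x = x_next
```

Summed over a trial, the updates applied to a particular visited state telescope exactly as in (10.92).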

The temporal difference method presented so far can be generalized in a way that often leads to faster convergence in practice. Let $ \lambda \in [0,1]$ be a specified parameter. The $ TD(\lambda)$ temporal difference method replaces the equations in (10.91) with

\begin{displaymath}
\begin{split}
\hat{G}_\pi (x_1) := & \hat{G}_\pi (x_1) + \lambda^{k-1} \frac{1}{v_1} d_k \\
\hat{G}_\pi (x_2) := & \hat{G}_\pi (x_2) + \lambda^{k-2} \frac{1}{v_2} d_k \\
 & \;\; \vdots \\
\hat{G}_\pi (x_{k-1}) := & \hat{G}_\pi (x_{k-1}) + \lambda \frac{1}{v_{k-1}} d_k \\
\hat{G}_\pi (x_k) := & \hat{G}_\pi (x_k) + \frac{1}{v_k} d_k .
\end{split}
\end{displaymath} (10.93)

This has the effect of discounting costs that are received far away from $ x_k$. The method in (10.91) was the special case of $ \lambda = 1$, yielding $ TD(1)$.
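
Under the same assumed interface as the earlier sketch, the $ TD(\lambda)$ rule in (10.93) changes only the inner loop: the temporal difference $ d_k$ applied to the state visited at step $ j$ is weighted by $ \lambda^{k-j}$. As before, all of the names are illustrative assumptions.

```python
# Sketch of the TD(lambda) updates in (10.93); only the weighting of d_k
# differs from the previous sketch. All interface names are assumptions.

def td_lambda_trial(x1, pi, goal, simulate_step, G_hat, v, lam=0.5):
    visited = []
    x = x1
    while x not in goal:
        u = pi[x]
        visited.append(x)
        v[x] = v.get(x, 0) + 1
        x_next, cost = simulate_step(x, u)
        d = cost + G_hat.get(x_next, 0.0) - G_hat.get(x, 0.0)
        k = len(visited)
        for j, x_j in enumerate(visited, start=1):
            # the state visited at step j receives weight lambda^(k - j)
            G_hat[x_j] = G_hat.get(x_j, 0.0) + (lam ** (k - j)) * d / v[x_j]
        x = x_next
```

Setting lam = 1 recovers the previous sketch, and lam = 0 updates only the most recently visited state, which leads to the $ TD(0)$ rule below.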

Another interesting special case is $ TD(0)$, which becomes

$\displaystyle \hat{G}_\pi (x_k) := \hat{G}_\pi (x_k) + \frac{1}{v_k} \left(l(x_k,u_k) + \hat{G}_\pi (x_{k+1}) - \hat{G}_\pi (x_k)\right) .$ (10.94)

This form appears most often in reinforcement learning literature (although it is applied to the discounted-cost model instead). Experimental evidence indicates that lower values of $ \lambda$ help to improve the convergence rate. Convergence for all values of $ \lambda$ is proved in [97].
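
For completeness, the $ \lambda = 0$ case requires no bookkeeping of previously visited states. A per-step sketch of (10.94), using the same assumed dictionaries as before, is:

```python
# Sketch of the TD(0) update in (10.94), applied once per simulated step.
def td0_update(x, cost, x_next, G_hat, v):
    v[x] = v.get(x, 0) + 1                                    # visit count v_k
    d = cost + G_hat.get(x_next, 0.0) - G_hat.get(x, 0.0)     # temporal difference d_k
    G_hat[x] = G_hat.get(x, 0.0) + d / v[x]
```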

One source of intuition about why (10.94) works is that it is a special case of a stochastic iterative algorithm or the Robbins-Monro algorithm [88,97,566]. This is a general statistical estimation technique that is used for solving systems of the form $ h(y) = y$ by using a sequence of samples. Each sample represents a measurement of $ h(y)$ using Monte Carlo simulation. The general form of this iterative approach is to update $ y$ as

$\displaystyle y := (1-\rho) y + \rho h(y) ,$ (10.95)

in which $ \rho \in [0,1]$ is a parameter whose choice affects the convergence rate. Intuitively, (10.95) updates $ y$ by interpolating between its original value and the most recent sample of $ h(y)$. Convergence proofs for this algorithm are not given here; see [97] for details. The typical behavior is that a smaller value of $ \rho$ leads to more reliable estimates when there is substantial noise in the simulation process, but this comes at the cost of slowing the convergence rate. The convergence is asymptotic and requires that all edges (that have nonzero probability) in the plan-based state transition graph be visited infinitely often.
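
As a toy illustration of the iteration in (10.95), the sketch below estimates the fixed point of $ h(y) = E[Z]$ for a noisy measurement $ Z$ using a constant $ \rho$. The Gaussian measurement model and the particular value of $ \rho$ are arbitrary choices made only for this example.

```python
# Toy illustration of the stochastic iteration y := (1 - rho) y + rho h(y)
# from (10.95), in which each evaluation of h(y) is a noisy Monte Carlo sample.
# The measurement model and step size are arbitrary illustrative choices.
import random

def robbins_monro(sample, rho=0.05, iterations=20000, y0=0.0):
    """Estimate the fixed point of h(y) = E[sample()] by repeated interpolation."""
    y = y0
    for _ in range(iterations):
        y = (1.0 - rho) * y + rho * sample()   # interpolate toward the latest sample
    return y

if __name__ == "__main__":
    random.seed(0)
    noisy = lambda: 3.0 + random.gauss(0.0, 1.0)   # true mean (fixed point) is 3.0
    print(robbins_monro(noisy))                    # hovers near 3.0 for small rho
```

A smaller value of rho damps the noise in the final estimate but moves away from the initial value more slowly, matching the trade-off described above.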

A general approach to obtaining $ \hat{G}_\pi $ can be derived within the stochastic iterative framework by generalizing $ TD(0)$:

$\displaystyle \hat{G}_\pi (x) := (1 - \rho) \hat{G}_\pi (x) + \rho\left(l(x,u) + \hat{G}_\pi (x')\right) .$ (10.96)

The formulation of $ TD(0)$ in (10.94) essentially selects the $ \rho$ parameter by the way it was derived, but in (10.96) any $ \rho \in (0,1)$ may be used.

It may appear incorrect that the update equation does not take into account the transition probabilities. It turns out that they are taken into account in the simulation process because transitions that are more likely to occur have a stronger effect on (10.96). The same thing occurs when the mean of a nonuniform probability density function is estimated by using samples from the distribution. The values that occur with higher frequency make stronger contributions to the average, which automatically gives them the appropriate weight.
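
A small experiment illustrates why the transition probabilities need not appear explicitly in (10.96): successors that occur more frequently in the simulation simply contribute more updates. The two-successor example, its probabilities and costs, and the fixed rho below are all invented for illustration.

```python
# Illustration of (10.96): transition probabilities are accounted for by the
# sampling frequency rather than by the update rule. From state "x", the
# simulator moves to "a" with probability 0.8 (cost 1) or to "b" with
# probability 0.2 (cost 5); both successors are goal states with zero
# cost-to-go, so the true expected cost-to-go of "x" is 0.8*1 + 0.2*5 = 1.8.
# The example and the fixed rho are invented purely for illustration.
import random

def estimate_G_x(rho=0.01, iterations=50000, seed=0):
    random.seed(seed)
    G_hat = {"x": 0.0, "a": 0.0, "b": 0.0}
    for _ in range(iterations):
        if random.random() < 0.8:
            x_next, cost = "a", 1.0
        else:
            x_next, cost = "b", 5.0
        # generalized TD(0) update from (10.96) with a fixed rho in (0, 1)
        G_hat["x"] = (1.0 - rho) * G_hat["x"] + rho * (cost + G_hat[x_next])
    return G_hat["x"]

if __name__ == "__main__":
    print(estimate_G_x())   # settles near the expected value 1.8
```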
