Temporal-Difference Learning: State Values

Introduction

  1. This chapter is about the computation of $v_\pi$ in policy evaluation.
  2. Policy evaluation corresponds to prediction; policy improvement corresponds to control.
  3. The previous approaches are model-based: policy evaluation is done by repeatedly iterating the Bellman equation of the given policy $\pi$. The approach here is model-free. Unlike every-visit MC, however, it does not need a complete episode; the next state and reward are enough to start the iterative computation, which is a very elegant incremental manner.

Algorithm description

Notations

  • given a policy $\pi$, the aim is to estimate the state values $v_\pi(s)$ for all $s \in \mathcal{S}$
  • experience samples generated by following $\pi$, written either as a trajectory

$$
(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots) \tag{1}
$$

or as a set of transition samples

$$
\{(s_t, r_{t+1}, s_{t+1})\}_t \tag{2}
$$

Some notations of state values

  • $v_\pi(s)$: the value of state $s$ under policy $\pi$
  • $v_*(s)$: the value of state $s$ under the optimal policy
  • $v_\pi(s)$ is also called the true value of state $s$, which is what we want to estimate
  • $v(s)$: the estimated value of state $s$
  • $v_t(s)$: the estimated value of state $s$ at time $t$ (or iteration $t$) during the learning process

TD learning algorithm

The TD learning algorithm for estimating $v_\pi$ is

$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big], \tag{3}
$$

$$
v_{t+1}(s) = v_t(s), \qquad \forall s \neq s_t, \tag{4}
$$

where $t = 0, 1, 2, \ldots$ and $\gamma \in (0, 1)$ is the discount rate.

At time $t$:

  • $v_t(s_t)$ is the estimated state value of $v_\pi(s_t)$
    • the value of the visited state $s_t$ is updated by (3) towards the TD target defined below
    • the unvisited states $s \neq s_t$ are not updated, as stated in (4)
  • $\alpha_t(s_t)$ is the learning rate for state $s_t$ at time $t$
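As a concrete illustration, here is a minimal sketch of update (3)-(4) in Python. The environment/policy interface (`env.reset`, `env.step`, `policy`) is an assumption made only for this example, not something defined in these notes.

```python
import collections

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, num_steps=10000):
    """Minimal TD(0) sketch for estimating v_pi, assuming a Gym-like interface:
    env.reset() -> state, env.step(action) -> (next_state, reward, done, info),
    policy(state) -> action. These interfaces are illustrative assumptions."""
    v = collections.defaultdict(float)  # v_t(s), initialized to 0
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        # TD target: r_{t+1} + gamma * v_t(s_{t+1}); terminal states have value 0
        td_target = r + (0.0 if done else gamma * v[s_next])
        # TD error: delta_t = v_t(s_t) - TD target
        td_error = v[s] - td_target
        # Equation (3): update only the visited state s_t; all other states
        # keep their values, which is exactly equation (4)
        v[s] = v[s] - alpha * td_error
        s = env.reset() if done else s_next
    return dict(v)
```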

Algorithm properties

From equation (3) above we can identify two important quantities:

  • TD target: $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
    • compared to the MC target: the sampled return $g_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
  • TD error: $\delta_t \doteq v_t(s_t) - \bar{v}_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$
    • compared to the MC error: $v_t(s_t) - g_t$

Here $\doteq$ means "is defined as"; see the notation conventions in Sutton's book.

TD target

The purpose of the TD learning update is to drive $v(s_t)$ towards $\bar{v}_t$: as the number of iterations increases, the gap between them gradually decreases and eventually converges to 0, i.e., $v(s_t) \to \bar{v}_t$. This is why $\bar{v}_t$ is called the TD target.

A brief proof: subtracting $\bar{v}_t$ from both sides of (3) gives

$$
v_{t+1}(s_t) - \bar{v}_t = \big[v_t(s_t) - \bar{v}_t\big] - \alpha_t(s_t)\big[v_t(s_t) - \bar{v}_t\big] = \big[1 - \alpha_t(s_t)\big]\big[v_t(s_t) - \bar{v}_t\big],
$$

and hence

$$
\big|v_{t+1}(s_t) - \bar{v}_t\big| = \big|1 - \alpha_t(s_t)\big|\,\big|v_t(s_t) - \bar{v}_t\big| \le \big|v_t(s_t) - \bar{v}_t\big|.
$$

The last inequality holds because $0 < \alpha_t(s_t) \le 1$, so that $0 \le 1 - \alpha_t(s_t) < 1$. Moreover, the last expression shows that the gap $|v_t(s_t) - \bar{v}_t|$ shrinks geometrically (exponentially fast) and eventually converges to 0.

TD error

  • it reflects the difference between two time steps $t$ and $t+1$: $\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$

  • it reflects the difference between $v_t$ and $v_\pi$

    We know from Chapter 02 (state value) that

    $$v_\pi(s_t) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big].$$

    So, denoting by $\delta_{\pi,t} \doteq v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)$ the TD error computed with the true values, we have

    $$\mathbb{E}\big[\delta_{\pi,t} \mid S_t = s_t\big] = v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] = 0.$$

    • if $v_t = v_\pi$, then $\mathbb{E}[\delta_t \mid S_t = s_t] = 0$ (this is what we want to achieve)
    • if $v_t \neq v_\pi$, then in general $\mathbb{E}[\delta_t \mid S_t = s_t] \neq 0$
  • the TD error $\delta_t$ represents the information gap (innovation) obtained from the new experience sample $(s_t, r_{t+1}, s_{t+1})$
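To make the zero-mean property concrete, here is a small numerical check on a hypothetical two-state Markov reward process (the transition matrix, rewards, and discount rate below are made up purely for illustration): with the true values $v_\pi$, the sampled TD errors average to approximately zero.

```python
import numpy as np

# Hypothetical 2-state Markov reward process induced by a fixed policy pi
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # P[s, s'] = p(s' | s) under pi
r = np.array([1.0, -0.5])         # expected immediate reward r_pi(s)
gamma = 0.9

# True state values solve v_pi = r_pi + gamma * P @ v_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P, r)

rng = np.random.default_rng(0)
s, deltas = 0, []
for _ in range(100_000):
    s_next = rng.choice(2, p=P[s])
    # TD error computed with the *true* values v_pi
    deltas.append(v_pi[s] - (r[s] + gamma * v_pi[s_next]))
    s = s_next

print(v_pi)             # exact state values
print(np.mean(deltas))  # approximately 0, since E[delta_{pi,t} | s_t] = 0
```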

MC error & TD error

Here we directly quote a very intuitive derivation from Sutton & Barto's book (Equation 6.6):

Note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), the Monte Carlo error can be written as a sum of TD errors:

$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma\big(G_{t+1} - V(S_{t+1})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\big(G_{t+2} - V(S_{t+2})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}\big(G_T - V(S_T)\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1} \gamma^{k-t}\delta_k.
\end{aligned}
$$

Note that the $\delta_t$ here follows Sutton's convention, $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, which differs (by a sign) from the $\delta_t$ defined above.
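This identity is easy to verify numerically. The sketch below uses made-up rewards and a fixed (made-up) value array $V$, purely for illustration, and checks that the Monte Carlo error equals the discounted sum of TD errors.

```python
import numpy as np

gamma = 0.9
# Made-up episode of length T = 4: rewards R_1..R_T and fixed values V(S_0..S_T)
rewards = np.array([1.0, 0.0, -1.0, 2.0])
values = np.array([0.3, -0.2, 0.5, 0.1, 0.0])  # V(S_T) = 0 at the terminal state
T = len(rewards)

# Returns: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
G = np.zeros(T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

# Sutton-style TD errors: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
delta = rewards + gamma * values[1:] - values[:-1]

for t in range(T):
    mc_error = G[t] - values[t]
    td_sum = sum(gamma ** (k - t) * delta[k] for k in range(t, T))
    print(t, round(mc_error, 6), round(td_sum, 6))  # the two numbers coincide
```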

Others

  • the TD(0) algorithm in (3) is also called one-step TD
  • it only estimates the state values of a given policy
    • unlike MC methods, it does not estimate action values (TODO later: Sarsa)
    • unlike DP methods (PI, VI), it does not search for optimal policies (TODO later: Q-learning)

The idea of TD learning

Q: What is the TD algorithm doing, mathematically?

A: Given a policy $\pi$, it solves the Bellman equation of $\pi$ with a model-free method.

Recall the algorithms we have seen for solving the Bellman equation: the closed-form solution and the iterative solution. We have already used iteration to gradually approximate $v_\pi$; TD also proceeds iteratively, but differs from the earlier methods in two ways:

  1. The iteration here is incremental: each update only needs one new experience sample $(s_t, r_{t+1}, s_{t+1})$ to update $v_t(s_t)$, so the estimate gradually approaches $v_\pi$.
    • Unlike MC: MC must wait for a complete episode before it can compute the mean return and update the estimate, so MC is non-incremental.
  2. The iteration here is model-free: it does not require the model of the environment, only the next state and reward.
    • Unlike DP: DP needs the model of the environment in order to iterate towards $v_\pi$, so DP is model-based.

Use RM to solve Bellman Equation

By the definition of the state value of a policy $\pi$:

$$v_\pi(s) = \mathbb{E}\big[R + \gamma G \mid S = s\big], \qquad s \in \mathcal{S},$$

where $G$ is the discounted return obtained from the next state $S'$:

$$\mathbb{E}\big[G \mid S = s\big] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s') = \mathbb{E}\big[v_\pi(S') \mid S = s\big].$$

  • for the derivation of the first equation, see here

So we have

$$v_\pi(s) = \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big], \qquad s \in \mathcal{S},$$

  • sometimes called the Bellman expectation equation

Solve the Bellman equation by RM: define

$$g(v(s)) \doteq v(s) - \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big],$$

so that solving the Bellman equation amounts to finding the root of $g(v(s)) = 0$. What we can observe is the noisy measurement

$$\tilde{g}\big(v(s), \eta\big) = v(s) - \big(r + \gamma v_\pi(s')\big),$$

where $r$ and $s'$ are samples of $R$ and $S'$.

Then we can use RM to solve $g(v(s)) = 0$ by the following iterative update:

$$v_{k+1}(s) = v_k(s) - \alpha_k(s)\Big[v_k(s) - \big(r_k + \gamma v_\pi(s'_k)\big)\Big], \qquad k = 1, 2, 3, \ldots,$$

where

  • $v_k(s)$ is the estimated state value of $v_\pi(s)$ at the $k$-th step;
  • $r_k, s'_k$ are samples of $R$ and $S'$ at the $k$-th step;
  • $\alpha_k(s)$ is the learning rate at the $k$-th step.
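The following is a minimal sketch of this RM iteration for a single state, under two illustrative assumptions that are not part of the notes: a sampler `sample_transition(s)` returning `(r, s_next)` drawn from state `s`, and a dictionary `v_pi` containing the true values of the next states. The second assumption is exactly the unrealistic requirement that the modification in the next section removes.

```python
def rm_state_value(s, sample_transition, v_pi, gamma=0.9, num_iters=5000):
    """Robbins-Monro iteration for v_pi(s) of a single state s.
    sample_transition(s) -> (r, s_next) and the true values v_pi of the next
    states are assumed to be available, purely for illustration."""
    v = 0.0                               # v_k(s), initial guess
    for k in range(1, num_iters + 1):
        r, s_next = sample_transition(s)  # sample (r_k, s'_k) of (R, S')
        alpha = 1.0 / k                   # learning rate alpha_k(s)
        # v_{k+1}(s) = v_k(s) - alpha_k(s) * [v_k(s) - (r_k + gamma * v_pi(s'_k))]
        v = v - alpha * (v - (r + gamma * v_pi[s_next]))
    return v
```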

Modify RM and we get TD learning algorithm

  1. From the RM form above we can see that, to solve for $v_\pi(s)$, i.e., to find the root of $g(v(s)) = 0$, we need a set of samples $\{(s, r, s')\}$ and then iterate towards $v_\pi(s)$. Moreover, every state needs its own samples before the iterative solution can be carried out.
  2. It is easy to collect a large number of such samples by interacting with the environment one step at a time, many times. But if we are going to take a step anyway, why not keep interacting until the episode ends? Then we can exploit all the visits of a complete episode (just as MC updates episode by episode), and the samples can be written in the sequential form $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, i.e., $\{(s_t, r_{t+1}, s_{t+1})\}_t$, as in (1) and (2).
  3. After this change in data collection, the quantities above change accordingly: $v_k(s) \to v_t(s_t)$, the samples $(r_k, s'_k) \to (r_{t+1}, s_{t+1})$, and $\alpha_k(s) \to \alpha_t(s_t)$; at time $t$, only the value of the currently visited state $s_t$ is updated.
  4. Then, following the TD error interpretation, we replace the unknown $v_\pi(s_{t+1})$ with its current estimate $v_t(s_{t+1})$.
  5. In this way we have used RM, step by step, to solve the Bellman equation (BE) and finally derived the TD learning algorithm

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big],$$

where $v_t(s_{t+1})$ replaces the unknown $v_\pi(s_{t+1})$, which is exactly (3).

Convergence of TD learning

By the TD learning algorithm (3), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$ if:

  1. $\sum_t \alpha_t(s) = \infty$ for all $s \in \mathcal{S}$, which requires every state to be visited infinitely often (or sufficiently many times)
  2. $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$
    • In practice, we often use a small constant learning rate $\alpha_t(s) \equiv \alpha$ for all $t$ and $s$, even though it does not satisfy the above conditions.
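For instance (a standard example, not from the original notes), the schedule $\alpha_t(s) = 1/n_t(s)$, where $n_t(s)$ is the number of visits to state $s$ up to time $t$, satisfies both conditions whenever $s$ is visited infinitely often, because

$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \qquad\text{and}\qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$

whereas a constant $\alpha$ gives $\sum_t \alpha^2 = \infty$ and therefore violates the second condition.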

TD learning vs. MC learning

This section is perhaps best revisited after learning the basic Sarsa algorithm.

| TD/Sarsa learning | MC learning |
| --- | --- |
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps, because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it can directly estimate state/action values without any initial guess. |
| Low estimation variance: TD has lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa only requires samples of $r_{t+1}, s_{t+1}, a_{t+1}$. | High estimation variance: to estimate $q_\pi(s_t, a_t)$, we need samples of the return $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$. Suppose the length of each episode is $L$; the number of possible episodes grows exponentially in $L$. |

Bootstrapping means "estimating from one's own estimates": a new value is estimated from previously estimated values, somewhat like recursion. Non-bootstrapping means estimating values directly from samples, without relying on existing estimates.