Temporal-Difference Learning: State Values

Introduction

  1. This chapter is about the computation of $v_\pi$ in policy evaluation.
  2. Policy evaluation corresponds to prediction; policy improvement corresponds to control.
  3. The previous approaches are model-based: policy evaluation is done by repeatedly iterating the Bellman equation of the given policy $\pi$. The approach here is model-free. Unlike every-visit MC, however, it does not need a complete episode; the next state and reward are enough to start the iterative computation, which is a very elegant incremental manner.

Algorithm description

Notations

  • given a policy $\pi$, the aim is to estimate the state values $v_\pi(s)$ for all $s \in \mathcal{S}$
  • experience samples generated by following $\pi$, written either as a trajectory

$$
(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots) \tag{1}
$$

or as a set of transition samples

$$
\{(s_t, r_{t+1}, s_{t+1})\}_t \tag{2}
$$

Some notations of state values

  • $v_\pi(s)$: the value of state $s$ under policy $\pi$
  • $v_*(s)$: the value of state $s$ under the optimal policy
  • $v_\pi(s)$ is also called the true value of state $s$, which is what we want to estimate
  • $v(s)$: the estimated value of state $s$
  • $v_t(s)$: the estimated value of state $s$ at time $t$ (or iteration $t$) during the learning process

TD learning algorithm

The TD learning algorithm for estimating $v_\pi$ is

$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big], \tag{3}
$$

$$
v_{t+1}(s) = v_t(s), \qquad \forall s \neq s_t, \tag{4}
$$

where $t = 0, 1, 2, \ldots$ and $\gamma \in (0, 1)$ is the discount rate.

At time $t$:

  • $v_t(s_t)$ is the estimated state value of $v_\pi(s_t)$
    • the value of the visited state $s_t$ is updated by (3) towards the TD target defined below
    • the unvisited states $s \neq s_t$ are not updated, as stated in (4)
  • $\alpha_t(s_t)$ is the learning rate for state $s_t$ at time $t$
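As a concrete illustration, here is a minimal sketch of update (3)-(4) in Python. The environment/policy interface (`env.reset`, `env.step`, `policy`) is an assumption made only for this example, not something defined in these notes.

```python
import collections

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, num_steps=10000):
    """Minimal TD(0) sketch for estimating v_pi, assuming a Gym-like interface:
    env.reset() -> state, env.step(action) -> (next_state, reward, done, info),
    policy(state) -> action. These interfaces are illustrative assumptions."""
    v = collections.defaultdict(float)  # v_t(s), initialized to 0
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        # TD target: r_{t+1} + gamma * v_t(s_{t+1}); terminal states have value 0
        td_target = r + (0.0 if done else gamma * v[s_next])
        # TD error: delta_t = v_t(s_t) - TD target
        td_error = v[s] - td_target
        # Equation (3): update only the visited state s_t; all other states
        # keep their values, which is exactly equation (4)
        v[s] = v[s] - alpha * td_error
        s = env.reset() if done else s_next
    return dict(v)
```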

Algorithm properties

From equation (3) above we can identify two important quantities:

  • TD target: $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
    • compared to the MC target: the sampled return $g_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
  • TD error: $\delta_t \doteq v_t(s_t) - \bar{v}_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$
    • compared to the MC error: $v_t(s_t) - g_t$

Here $\doteq$ means "is defined as"; see the notation conventions in Sutton's book.

TD target

The purpose of the TD learning update is to drive $v(s_t)$ towards $\bar{v}_t$: as the number of iterations increases, the gap between them gradually decreases and eventually converges to 0, i.e., $v(s_t) \to \bar{v}_t$. This is why $\bar{v}_t$ is called the TD target.

A brief proof: subtracting $\bar{v}_t$ from both sides of (3) gives

$$
v_{t+1}(s_t) - \bar{v}_t = \big[v_t(s_t) - \bar{v}_t\big] - \alpha_t(s_t)\big[v_t(s_t) - \bar{v}_t\big] = \big[1 - \alpha_t(s_t)\big]\big[v_t(s_t) - \bar{v}_t\big],
$$

and hence

$$
\big|v_{t+1}(s_t) - \bar{v}_t\big| = \big|1 - \alpha_t(s_t)\big|\,\big|v_t(s_t) - \bar{v}_t\big| \le \big|v_t(s_t) - \bar{v}_t\big|.
$$

The last inequality holds because $0 < \alpha_t(s_t) \le 1$, so that $0 \le 1 - \alpha_t(s_t) < 1$. Moreover, the last expression shows that the gap $|v_t(s_t) - \bar{v}_t|$ shrinks geometrically (exponentially fast) and eventually converges to 0.

TD error

  • it reflects the difference between two time steps $t$ and $t+1$: $\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$

  • it reflects the difference between $v_t$ and $v_\pi$

    We know from Chapter 02 (state value) that

    $$v_\pi(s_t) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big].$$

    So, denoting by $\delta_{\pi,t} \doteq v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)$ the TD error computed with the true values, we have

    $$\mathbb{E}\big[\delta_{\pi,t} \mid S_t = s_t\big] = v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] = 0.$$

    • if $v_t = v_\pi$, then $\mathbb{E}[\delta_t \mid S_t = s_t] = 0$ (this is what we want to achieve)
    • if $v_t \neq v_\pi$, then in general $\mathbb{E}[\delta_t \mid S_t = s_t] \neq 0$
  • the TD error $\delta_t$ represents the information gap (innovation) obtained from the new experience sample $(s_t, r_{t+1}, s_{t+1})$
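To make the zero-mean property concrete, here is a small numerical check on a hypothetical two-state Markov reward process (the transition matrix, rewards, and discount rate below are made up purely for illustration): with the true values $v_\pi$, the sampled TD errors average to approximately zero.

```python
import numpy as np

# Hypothetical 2-state Markov reward process induced by a fixed policy pi
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # P[s, s'] = p(s' | s) under pi
r = np.array([1.0, -0.5])         # expected immediate reward r_pi(s)
gamma = 0.9

# True state values solve v_pi = r_pi + gamma * P @ v_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P, r)

rng = np.random.default_rng(0)
s, deltas = 0, []
for _ in range(100_000):
    s_next = rng.choice(2, p=P[s])
    # TD error computed with the *true* values v_pi
    deltas.append(v_pi[s] - (r[s] + gamma * v_pi[s_next]))
    s = s_next

print(v_pi)             # exact state values
print(np.mean(deltas))  # approximately 0, since E[delta_{pi,t} | s_t] = 0
```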

MC error & TD error

Here we directly quote a very intuitive derivation from Sutton & Barto's book (Equation 6.6):

Note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), the Monte Carlo error can be written as a sum of TD errors:

$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma\big(G_{t+1} - V(S_{t+1})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\big(G_{t+2} - V(S_{t+2})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}\big(G_T - V(S_T)\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1} \gamma^{k-t}\delta_k.
\end{aligned}
$$

Note that the $\delta_t$ here follows Sutton's convention, $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, which differs (by a sign) from the $\delta_t$ defined above.
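This identity is easy to verify numerically. The sketch below uses made-up rewards and a fixed (made-up) value array $V$, purely for illustration, and checks that the Monte Carlo error equals the discounted sum of TD errors.

```python
import numpy as np

gamma = 0.9
# Made-up episode of length T = 4: rewards R_1..R_T and fixed values V(S_0..S_T)
rewards = np.array([1.0, 0.0, -1.0, 2.0])
values = np.array([0.3, -0.2, 0.5, 0.1, 0.0])  # V(S_T) = 0 at the terminal state
T = len(rewards)

# Returns: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
G = np.zeros(T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

# Sutton-style TD errors: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
delta = rewards + gamma * values[1:] - values[:-1]

for t in range(T):
    mc_error = G[t] - values[t]
    td_sum = sum(gamma ** (k - t) * delta[k] for k in range(t, T))
    print(t, round(mc_error, 6), round(td_sum, 6))  # the two numbers coincide
```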

Others

  • the TD(0) algorithm in (3) is also called one-step TD
  • it only estimates the state values of a given policy
    • unlike MC methods, it does not estimate action values (TODO later: Sarsa)
    • unlike DP methods (PI, VI), it does not search for optimal policies (TODO later: Q-learning)

The idea of TD learning

Q: What is the TD algorithm doing, mathematically?

A: Given a policy $\pi$, it solves the Bellman equation of $\pi$ with a model-free method.

Recall the algorithms we have seen for solving the Bellman equation: the closed-form solution and the iterative solution. We have already used iteration to gradually approximate $v_\pi$; TD also proceeds iteratively, but differs from the earlier methods in two ways:

  1. The iteration here is incremental: each update only needs one new experience sample $(s_t, r_{t+1}, s_{t+1})$ to update $v_t(s_t)$, so the estimate gradually approaches $v_\pi$.
    • Unlike MC: MC must wait for a complete episode before it can compute the mean return and update the estimate, so MC is non-incremental.
  2. The iteration here is model-free: it does not require the model of the environment, only the next state and reward.
    • Unlike DP: DP needs the model of the environment in order to iterate towards $v_\pi$, so DP is model-based.

Use RM to solve Bellman Equation

By the definition of the state value of a policy $\pi$:

$$v_\pi(s) = \mathbb{E}\big[R + \gamma G \mid S = s\big], \qquad s \in \mathcal{S},$$

where $G$ is the discounted return obtained from the next state $S'$:

$$\mathbb{E}\big[G \mid S = s\big] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s') = \mathbb{E}\big[v_\pi(S') \mid S = s\big].$$

  • for the derivation of the first equation, see here

So we have

$$v_\pi(s) = \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big], \qquad s \in \mathcal{S},$$

  • sometimes called the Bellman expectation equation

Solve the Bellman equation by RM: define

$$g(v(s)) \doteq v(s) - \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big],$$

so that solving the Bellman equation amounts to finding the root of $g(v(s)) = 0$. What we can observe is the noisy measurement

$$\tilde{g}\big(v(s), \eta\big) = v(s) - \big(r + \gamma v_\pi(s')\big),$$

where $r$ and $s'$ are samples of $R$ and $S'$.

Then we can use RM to solve $g(v(s)) = 0$ by the following iterative update:

$$v_{k+1}(s) = v_k(s) - \alpha_k(s)\Big[v_k(s) - \big(r_k + \gamma v_\pi(s'_k)\big)\Big], \qquad k = 1, 2, 3, \ldots,$$

where

  • $v_k(s)$ is the estimated state value of $v_\pi(s)$ at the $k$-th step;
  • $r_k, s'_k$ are samples of $R$ and $S'$ at the $k$-th step;
  • $\alpha_k(s)$ is the learning rate at the $k$-th step.
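The following is a minimal sketch of this RM iteration for a single state, under two illustrative assumptions that are not part of the notes: a sampler `sample_transition(s)` returning `(r, s_next)` drawn from state `s`, and a dictionary `v_pi` containing the true values of the next states. The second assumption is exactly the unrealistic requirement that the modification in the next section removes.

```python
def rm_state_value(s, sample_transition, v_pi, gamma=0.9, num_iters=5000):
    """Robbins-Monro iteration for v_pi(s) of a single state s.
    sample_transition(s) -> (r, s_next) and the true values v_pi of the next
    states are assumed to be available, purely for illustration."""
    v = 0.0                               # v_k(s), initial guess
    for k in range(1, num_iters + 1):
        r, s_next = sample_transition(s)  # sample (r_k, s'_k) of (R, S')
        alpha = 1.0 / k                   # learning rate alpha_k(s)
        # v_{k+1}(s) = v_k(s) - alpha_k(s) * [v_k(s) - (r_k + gamma * v_pi(s'_k))]
        v = v - alpha * (v - (r + gamma * v_pi[s_next]))
    return v
```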

Modify RM and we get TD learning algorithm

  1. From the RM form above we can see that, to solve for $v_\pi(s)$, i.e., to find the root of $g(v(s)) = 0$, we need a set of samples $\{(s, r, s')\}$ and then iterate towards $v_\pi(s)$. Moreover, every state needs its own samples before the iterative solution can be carried out.
  2. It is easy to collect a large number of such samples by interacting with the environment one step at a time, many times. But if we are going to take a step anyway, why not keep interacting until the episode ends? Then we can exploit all the visits of a complete episode (just as MC updates episode by episode), and the samples can be written in the sequential form $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$, i.e., $\{(s_t, r_{t+1}, s_{t+1})\}_t$, as in (1) and (2).
  3. After this change in data collection, the quantities above change accordingly: $v_k(s) \to v_t(s_t)$, the samples $(r_k, s'_k) \to (r_{t+1}, s_{t+1})$, and $\alpha_k(s) \to \alpha_t(s_t)$; at time $t$, only the value of the currently visited state $s_t$ is updated.
  4. Then, following the TD error interpretation, we replace the unknown $v_\pi(s_{t+1})$ with its current estimate $v_t(s_{t+1})$.
  5. In this way we have used RM, step by step, to solve the Bellman equation (BE) and finally derived the TD learning algorithm

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big],$$

where $v_t(s_{t+1})$ replaces the unknown $v_\pi(s_{t+1})$, which is exactly (3).

Convergence of TD learning

By the TD learning algorithm (3), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$ if:

  1. $\sum_t \alpha_t(s) = \infty$ for all $s \in \mathcal{S}$, which requires every state to be visited infinitely often (or sufficiently many times)
  2. $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$
    • In practice, we often use a small constant learning rate $\alpha_t(s) \equiv \alpha$ for all $t$ and $s$, even though it does not satisfy the above conditions.
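For instance (a standard example, not from the original notes), the schedule $\alpha_t(s) = 1/n_t(s)$, where $n_t(s)$ is the number of visits to state $s$ up to time $t$, satisfies both conditions whenever $s$ is visited infinitely often, because

$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \qquad\text{and}\qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$

whereas a constant $\alpha$ gives $\sum_t \alpha^2 = \infty$ and therefore violates the second condition.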

TD learning vs. MC learning

This section is perhaps best revisited after learning the basic Sarsa algorithm.

| TD/Sarsa learning | MC learning |
| --- | --- |
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps, because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it can directly estimate state/action values without any initial guess. |
| Low estimation variance: TD has lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa only requires samples of $r_{t+1}, s_{t+1}, a_{t+1}$. | High estimation variance: to estimate $q_\pi(s_t, a_t)$, we need samples of the return $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$. Suppose the length of each episode is $L$; the number of possible episodes grows exponentially in $L$. |

Bootstrapping means "estimating from one's own estimates": a new value is estimated from previously estimated values, somewhat like recursion. Non-bootstrapping means estimating values directly from samples, without relying on existing estimates.