Temporal-Difference Learning of State Values
Introduction
- This chapter is about the computation carried out in policy evaluation, i.e., estimating the state values of a given policy.
- policy evaluation ↔ prediction, policy improvement ↔ control
- The earlier approach was model-based: the Bellman equation of $v_\pi$ is iterated repeatedly to evaluate the policy. The approach here is model-free; unlike every-visit MC, however, it does not need a complete episode: only the next state and reward are required to start the iterative computation, which makes it work in a very elegant incremental manner.
Algorithm description
Notations
- given a policy $\pi$, the aim is to estimate the state values $v_\pi(s)$ for all $s \in \mathcal{S}$
- experience samples: $(s_0, r_1, s_1, r_2, s_2, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$, generated by following $\pi$
Some notations of state values
- $v_\pi(s)$: the value of state $s$ under policy $\pi$; since the policy is given, this is the true value of state $s$
- $v_*(s)$: the value of state $s$ under the optimal policy
- $v(s)$: the estimated value of state $s$
- $v_t(s)$: the estimated value of state $s$ at time $t$ (or iteration $t$) during the learning process
TD learning algorithm
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big] \tag{1}$$
$$v_{t+1}(s) = v_t(s), \qquad \forall s \neq s_t \tag{2}$$
where $t = 0, 1, 2, \ldots$ (a minimal code sketch of this update is given after the list below)
At time $t$:
- $v_t(s_t)$ is the estimated state value of $s_t$
- the value of the visited state $s_t$ is updated by moving it toward the following TD target: $r_{t+1} + \gamma v_t(s_{t+1})$
- the unvisited states $s \neq s_t$ are not updated, as equation (2) states
- $\alpha_t(s_t)$ is the learning rate for state $s_t$ at time $t$
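The following is a minimal tabular sketch of the update (1)-(2) in Python. The function name `td0_update`, the toy states, and the constant `alpha` are my own illustrative choices, not part of the original algorithm statement.

```python
# Tabular TD(0) state-value update, following equations (1)-(2).
# `v` maps state -> current estimate v_t(s); only the visited state s_t is
# changed, every other entry is left untouched (equation (2)).

def td0_update(v, s_t, r_next, s_next, alpha, gamma):
    """One TD(0) step: move v[s_t] toward the TD target r + gamma * v[s']."""
    td_target = r_next + gamma * v[s_next]
    td_error = v[s_t] - td_target          # delta_t, in the sign convention used here
    v[s_t] = v[s_t] - alpha * td_error
    return td_error


if __name__ == "__main__":
    # Hypothetical stream of experience samples (s_t, r_{t+1}, s_{t+1})
    # generated by following the given policy pi.
    gamma = 0.9
    v = {"s1": 0.0, "s2": 0.0, "s3": 0.0}   # initial guesses (bootstrapping)
    samples = [("s1", 1.0, "s2"), ("s2", 0.0, "s3"), ("s3", 1.0, "s1")]
    for s, r, s_next in samples:
        td0_update(v, s, r, s_next, alpha=0.1, gamma=gamma)
    print(v)
```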
Algorithm properties
From the update (1) above we can identify two important quantities:
- TD target: $\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
- compared to the MC target: $g_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
- TD error: $\delta_t \doteq v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big) = v_t(s_t) - \bar v_t$
- compared to the MC error: $v_t(s_t) - g_t$
Here $\doteq$ means "defined as"; see the definition of symbols in Sutton's book.
TD target
In fact, what TD learning does here is drive the estimate toward $\bar v_t$: at every iteration the new estimate gets closer to the TD target, i.e., $|v_{t+1}(s_t) - \bar v_t| \le |v_t(s_t) - \bar v_t|$, and the gap eventually converges to 0. This is why $\bar v_t$ is called the TD target.
A simple proof is as follows:
$$
\begin{aligned}
v_{t+1}(s_t) - \bar v_t &= v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - \bar v_t\big] - \bar v_t \\
&= \big[1 - \alpha_t(s_t)\big]\big[v_t(s_t) - \bar v_t\big],
\end{aligned}
$$
and hence
$$\big|v_{t+1}(s_t) - \bar v_t\big| = \big|1 - \alpha_t(s_t)\big|\,\big|v_t(s_t) - \bar v_t\big| \le \big|v_t(s_t) - \bar v_t\big|.$$
The last inequality holds because $\alpha_t(s_t)$ is a small positive number, so $0 < 1 - \alpha_t(s_t) < 1$. Moreover, the last expression shows that the gap shrinks by a factor of $1 - \alpha_t(s_t)$ at every update, i.e., it decreases exponentially and converges to 0.
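As a concrete illustration, with numbers chosen only for this example, suppose the learning rate is held fixed at $\alpha_t(s_t) = 0.1$. Then each update shrinks the gap by the factor $1 - 0.1 = 0.9$,
$$\big|v_{t+1}(s_t) - \bar v_t\big| = 0.9\,\big|v_t(s_t) - \bar v_t\big|,$$
so if the same target were held fixed over $k$ consecutive updates of this state, the gap would decay to $0.9^k$ of its initial size, which is the exponential decrease mentioned above.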
TD error
- $\delta_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$
  it reflects the difference between the estimates at the two time steps $t$ and $t+1$
- $\delta_t$ also reflects the deficiency of the current estimate $v_t$
  it reflects the difference between $v_t$ and $v_\pi$
We know from Chapter 02 state value that
$$v_\pi(s_t) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big].$$
So, writing $\delta_{\pi,t} \doteq v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)$ for the TD error computed with the true values, we have
$$\mathbb{E}\big[\delta_{\pi,t} \mid S_t = s_t\big] = v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] = 0.$$
- if $v_t = v_\pi$, then the TD error is zero in expectation (this is what we want to achieve)
- if the TD error is not zero in expectation, then $v_t \neq v_\pi$
- the TD error can be interpreted as the innovation: the information gap, i.e., the new information obtained from the new experience sample $(s_t, r_{t+1}, s_{t+1})$
A small numerical check of the zero-mean property above is sketched below.
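The sketch below is a toy check, on a made-up two-state Markov reward process with the policy already folded into the transition probabilities, that the TD error built from the true values $v_\pi$ has zero mean. All numbers and names are invented for the illustration.

```python
# Check that the TD error computed with v_pi averages to zero.
import numpy as np

gamma = 0.9
# P[s, s']: transition probabilities under the given policy; r[s]: reward for leaving s.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 0.0])

# True state values solve v = r + gamma * P v (the Bellman equation).
v_pi = np.linalg.solve(np.eye(2) - gamma * P, r)

rng = np.random.default_rng(0)
s = 0
errors = []
for _ in range(100_000):
    s_next = rng.choice(2, p=P[s])
    delta = v_pi[s] - (r[s] + gamma * v_pi[s_next])   # TD error built from v_pi
    errors.append(delta)
    s = s_next

print(np.mean(errors))   # close to 0, since E[delta_{pi,t} | s_t] = 0
```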
MC error & TD error
Here I directly refer to a very intuitive derivation from Sutton's book (Eq. 6.6):
Note that if the array $V$ does not change during the episode (as it does not in Monte Carlo methods), the Monte Carlo error can be written as a sum of TD errors:
$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma\big(G_{t+1} - V(S_{t+1})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\big(G_{t+2} - V(S_{t+2})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}\big(G_T - V(S_T)\big) \\
&= \delta_t + \gamma\delta_{t+1} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k
\end{aligned}
$$
Note that the $\delta_t$ here follows Sutton's sign convention, $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, which is the negative of the TD error defined above.
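As a quick sanity check of this identity, the snippet below uses arbitrary made-up rewards and a fixed value table $V$ (with $V$ of the terminal state set to 0, which the identity relies on) and verifies that the Monte Carlo error equals the discounted sum of TD errors at every time step.

```python
# Numerical check of Sutton's identity (6.6): with V held fixed during the
# episode, G_t - V(S_t) equals the discounted sum of TD errors
# delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k).
gamma = 0.9
states = ["a", "b", "c", "terminal"]    # S_0 .. S_T (here T = 3)
rewards = [1.0, -2.0, 5.0]              # R_1 .. R_T
V = {"a": 0.3, "b": -1.2, "c": 2.0, "terminal": 0.0}   # V(terminal) = 0 by convention

# Returns computed backwards: G_T = 0, G_t = R_{t+1} + gamma * G_{t+1}.
G = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    G[t] = rewards[t] + gamma * G[t + 1]

for t in range(len(rewards)):
    mc_error = G[t] - V[states[t]]
    td_sum = sum(
        gamma ** (k - t) * (rewards[k] + gamma * V[states[k + 1]] - V[states[k]])
        for k in range(t, len(rewards))
    )
    print(t, round(mc_error, 6), round(td_sum, 6))   # the two columns match
```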
Others
- the TD algorithm in (1)-(2) is known as TD(0), also called one-step TD
- it only estimates the state values of a given policy, i.e., it solves the prediction problem, not the control problem
The idea of TD learning
Q: What is the TD algorithm doing, mathematically?
A: Given a policy $\pi$, it solves the Bellman equation of $v_\pi$ with a model-free method.
Recall the ways we solved the Bellman equation before: the closed-form solution and the iterative solution. We previously used iteration to gradually approach $v_\pi$; TD also works iteratively, but with some differences (a side-by-side sketch is given after this list):
- the iteration here is incremental: each step needs only one new experience sample $(s_t, r_{t+1}, s_{t+1})$, after which $v(s_t)$ can be updated immediately, so the estimate gradually approaches $v_\pi$
- this differs from MC, which must wait for a complete episode before it can compute the (mean of the) discounted returns and update; MC is non-incremental
- the iteration here is model-free: no model of the environment is needed, only the next state and the reward
- this differs from DP, which needs the full model of the environment to iterate toward $v_\pi$; DP is model-based
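The sketch below, on a hypothetical recorded episode with made-up states and rewards, contrasts when the two methods are allowed to update: TD(0) updates after every single transition, whereas MC can only update after the whole episode has ended and the returns are available.

```python
# TD(0) vs. every-visit MC on one recorded episode: incremental vs. non-incremental.
gamma = 0.9
episode = [("s1", 1.0, "s2"), ("s2", 0.0, "s3"), ("s3", 2.0, "terminal")]

# TD(0): one update per sample (s_t, r_{t+1}, s_{t+1}), as the data arrives.
v_td = {"s1": 0.0, "s2": 0.0, "s3": 0.0, "terminal": 0.0}
for s, r, s_next in episode:
    v_td[s] -= 0.1 * (v_td[s] - (r + gamma * v_td[s_next]))

# Every-visit MC: returns G_t only exist after the episode ends, so all
# updates happen afterwards.
v_mc = {"s1": 0.0, "s2": 0.0, "s3": 0.0}
g = 0.0
returns = []
for s, r, _ in reversed(episode):
    g = r + gamma * g
    returns.append((s, g))
for s, g in returns:
    v_mc[s] += 0.1 * (g - v_mc[s])     # constant-step-size MC update toward G_t

print(v_td, v_mc)
```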
Use RM to solve Bellman Equation
By definition of the state value of a policy $\pi$:
$$v_\pi(s) = \mathbb{E}\big[R + \gamma G \mid S = s\big], \qquad s \in \mathcal{S},$$
where $G$ is the discounted return obtained from the next state $S'$:
$$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s') = \mathbb{E}\big[v_\pi(S') \mid S = s\big].$$
- for the derivation of the first equation, see here
So we have
$$v_\pi(s) = \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big], \qquad s \in \mathcal{S}. \tag{3}$$
- equation (3) is sometimes called the Bellman expectation equation
Solve the Bellman equation by RM: define
$$g\big(v(s)\big) \doteq v(s) - \mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big],$$
so that solving (3) amounts to finding the root of $g(v(s)) = 0$; what we can actually observe is the noisy measurement
$$\tilde g\big(v(s)\big) = v(s) - \big(r + \gamma v_\pi(s')\big) = g\big(v(s)\big) + \underbrace{\Big(\mathbb{E}\big[R + \gamma v_\pi(S') \mid S = s\big] - \big(r + \gamma v_\pi(s')\big)\Big)}_{\eta}.$$
Then we can use the RM algorithm to solve $g(v(s)) = 0$ by the following iterative update (a numerical sketch follows the list below):
$$v_{k+1}(s) = v_k(s) - \alpha_k\, \tilde g\big(v_k(s)\big) = v_k(s) - \alpha_k\Big[v_k(s) - \big(r_k + \gamma v_\pi(s_k')\big)\Big], \qquad k = 1, 2, 3, \ldots$$
where
- $v_k(s)$ is the estimated state value of $s$ at the $k$th step;
- $r_k, s_k'$ are the samples of $R$ and $S'$ obtained at the $k$th step;
- $\alpha_k$ is the learning rate at the $k$th step.
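Below is a small Robbins-Monro sketch for a single state $s$, under the same assumption as the derivation above: the next-state values $v_\pi(s')$ are taken as exactly known, which is precisely the unrealistic assumption that the modifications in the next subsection remove. The transition probabilities, rewards, and successor values are made up for the example.

```python
# Robbins-Monro iteration for one state s, assuming v_pi(s') is known.
import numpy as np

gamma = 0.9
rng = np.random.default_rng(1)

# Made-up behaviour of state s under the given policy:
next_states = np.array([0, 1])
p_next = np.array([0.7, 0.3])          # p(s' | s) with the policy folded in
reward_mean = 1.0
v_pi_next = np.array([3.0, -1.0])      # assumed known v_pi(s') of the two successors

v = 0.0                                 # v_k(s), arbitrary initial guess
for k in range(1, 10_001):
    s_next = rng.choice(next_states, p=p_next)        # sample s'_k
    r = reward_mean + rng.normal(0.0, 0.5)            # noisy reward sample r_k
    alpha = 1.0 / k                                    # satisfies the RM step-size conditions
    v = v - alpha * (v - (r + gamma * v_pi_next[s_next]))

true_value = reward_mean + gamma * p_next @ v_pi_next
print(v, true_value)    # the iterate should end up close to the true value
```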
Modify RM to obtain the TD learning algorithm
- From the RM form above we can see that, to solve for $v_\pi(s)$, i.e., to find the root of $g(v(s)) = 0$, we need a set of samples $\{(s, r, s')\}$ of $(R, S')$ collected at state $s$ and then iterate; moreover, to solve for all states, every state must have its own samples.
- Naturally, we could simply interact with the environment one step at a time, many times, and collect a large number of such samples. But if one step is enough to update, why not keep interacting until the episode ends? Then we can use every visit along a complete trajectory (updating episode by episode, just like MC), and the samples can be written in the form $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by following $\pi$.
- After this change in how the data are collected, the variables above change accordingly: $v_k(s) \to v_t(s_t)$, $r_k \to r_{t+1}$, $s_k' \to s_{t+1}$, and $\alpha_k \to \alpha_t(s_t)$; at time $t$ only the visited state $s_t$ is updated, and the other states remain unchanged.
- Then, in view of the TD error, we replace the unknown $v_\pi(s_{t+1})$ on the right-hand side with its current estimate $v_t(s_{t+1})$.
- At this point, by solving the Bellman equation with RM step by step, we have finally derived the TD learning algorithm (1)-(2).
Convergence of TD learning
By the TD learning algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
- $\sum_t \alpha_t(s) = \infty$ for all $s$ requires that every state is visited infinitely often (or sufficiently many times)
- $\sum_t \alpha_t^2(s) < \infty$ for all $s$ requires that the learning rate for each state eventually diminishes
- In practice, we often use a small constant learning rate $\alpha$ for all $s$, even though it may not satisfy the conditions above (a simple diminishing schedule that does satisfy them is sketched below)
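One simple per-state schedule that satisfies both conditions is $\alpha_t(s) = 1/N_t(s)$, where $N_t(s)$ counts the visits to $s$ so far; this is my own illustration, since the text does not prescribe a particular schedule.

```python
# Diminishing per-state learning rate alpha_t(s) = 1 / N_t(s).
# For each state, sum_n 1/n diverges while sum_n 1/n^2 converges,
# which is exactly what the convergence conditions require.
from collections import defaultdict

visit_count = defaultdict(int)

def diminishing_alpha(state):
    """Return the learning rate for `state` on its next visit."""
    visit_count[state] += 1
    return 1.0 / visit_count[state]
```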
TD learning vs. MC learning
This section is perhaps best revisited after learning the basic Sarsa algorithm.
| TD/Sarsa learning | MC learning |
| --- | --- |
| Online: TD learning is online. It can update the state/action values immediately after receiving a reward and the next state. | Offline: MC learning is offline. It has to wait until an episode has been completely collected. |
| Continuing tasks: Since TD learning is online, it can handle both episodic and continuing tasks. | Episodic tasks: Since MC learning is offline, it can only handle episodic tasks that have terminal states. |
| Bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value. Hence, it requires initial guesses. | Non-bootstrapping: MC does not bootstrap, because it can directly estimate state/action values without any initial guess. |
| Low estimation variance: the estimation variance of TD is lower than that of MC because fewer random variables are involved. For instance, to estimate $q_\pi(s_t, a_t)$, Sarsa only requires samples of $R_{t+1}$, $S_{t+1}$, and $A_{t+1}$. | High estimation variance: to estimate $q_\pi(s_t, a_t)$, we need samples of $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$. Suppose the length of each episode is $L$; then there are on the order of $\lvert\mathcal{A}\rvert^L \lvert\mathcal{S}\rvert^L$ possible episodes. |
Bootstrapping means estimating a value from existing estimates of values, which is somewhat like recursion; non-bootstrapping means estimating values directly from samples, without relying on any existing estimates. The variance row of the table is illustrated numerically below.
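The sketch below is a rough numerical illustration of the variance row, on a deliberately simple stream of i.i.d. made-up rewards with a fixed next-state value estimate; it is not a faithful MDP, only a way to see that the MC return aggregates many more random variables than the TD target does.

```python
# Sample variance of the MC return (a discounted sum of many random rewards)
# versus the TD target (one random reward plus a fixed gamma * v(s') term).
import numpy as np

gamma = 0.9
rng = np.random.default_rng(2)
n_episodes, length = 10_000, 20

mc_returns = []
td_targets = []
v_next = 1.0 / (1.0 - gamma)            # some fixed estimate of the next-state value
for _ in range(n_episodes):
    rewards = rng.normal(loc=1.0, scale=1.0, size=length)    # R_1, ..., R_L
    g = sum(gamma ** k * rewards[k] for k in range(length))  # MC return
    mc_returns.append(g)
    td_targets.append(rewards[0] + gamma * v_next)           # TD target

print(np.var(mc_returns), np.var(td_targets))   # the MC variance is clearly larger
```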