Temporal-Difference Sarsa
Introduction
- The TD algorithm introduced before estimates state values
- now we introduce Sarsa to directly estimate action values
- the aim is to estimate the action values of a given policy
Algorithm
Suppose we have some experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$.
We can use the following Sarsa algorithm to estimate the action values of a given policy $\pi$:
$$
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big], \\
q_{t+1}(s, a) &= q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t),
\end{aligned}
$$
where $t = 0, 1, 2, \ldots$
- $q_t(s_t, a_t)$ is an estimate of $q_\pi(s_t, a_t)$
- $\alpha_t(s_t, a_t)$ is the learning rate depending on $(s_t, a_t)$
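As a concrete illustration, here is a minimal sketch of this update rule in Python; the table layout (a 2-D array indexed as `q[state, action]`) and the function name `sarsa_update` are assumptions for illustration, not part of the original algorithm statement.

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa update of a tabular action-value estimate q[state, action].

    Only the entry of the visited pair (s, a) is changed; all other entries
    are left untouched, exactly as in the update rule above.
    """
    td_target = r + gamma * q[s_next, a_next]   # r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
    q[s, a] += alpha * (td_target - q[s, a])    # equivalent to q_t - alpha * (q_t - target)
    return q

# illustrative usage on a toy 5-state, 4-action table
q = np.zeros((5, 4))
q = sarsa_update(q, s=0, a=1, r=-1.0, s_next=2, a_next=3, alpha=0.1, gamma=0.9)
```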
Some questions
- Why is this algorithm called Sarsa? Sarsa is the abbreviation of state-action-reward-state-action, because each step of the algorithm involves $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
- What is the relationship between Sarsa and the TD learning introduced before? Sarsa estimates action values, whereas the earlier TD algorithm estimates state values; in other words, Sarsa is the action-value version of TD learning.
- What does Sarsa do mathematically? Sarsa is a stochastic approximation algorithm to solve the following equation (another expression of the Bellman equation in terms of action values):
$$q_\pi(s, a) = \mathbb{E}\big[R + \gamma q_\pi(S', A') \,\big|\, s, a\big], \quad \forall s, a.$$
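To make the stochastic-approximation reading explicit: the TD target $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ is a noisy sample of the right-hand-side expectation at $(s_t, a_t)$, and each update moves the current estimate a small step toward that sample:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[\underbrace{q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)}_{\text{sampled version of } q(s_t, a_t) - \mathbb{E}[R + \gamma q(S', A') \mid s_t, a_t]}\Big].
$$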
Convergence
By the Sarsa algorithm, $q_t(s, a)$ converges with probability 1 to the action value $q_\pi(s, a)$ as $t \to \infty$ for all $(s, a)$, if $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a)$.
Remarks
In short, the action values of a given policy can be found by Sarsa as long as the learning rates satisfy the above conditions.
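For example, the schedule $\alpha_t(s, a) = 1/n_t(s, a)$, where $n_t(s, a)$ counts how many times $(s, a)$ has been visited, satisfies both conditions because
$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} < \infty.$$
In practice, a small constant learning rate is often used instead; strict convergence is lost, but the estimate can keep tracking a policy that changes over time.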
Implementation
The implementation can be compared with the Algorithm section above to see the modification made here.
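Since the details are not written out here, below is a minimal tabular sketch of one common way to implement Sarsa for policy search, interleaving the Sarsa update with an $\varepsilon$-greedy policy improvement step. The Gym-style environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(s_next, r, done)`), the function names, and the hyperparameter values are all assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q, s, epsilon, rng):
    """Sample an action from an epsilon-greedy policy derived from the q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))   # explore: uniform random action
    return int(np.argmax(q[s]))                # exploit: greedy action

def sarsa_control(env, num_states, num_actions, num_episodes=500,
                  alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Sarsa that evaluates and improves an epsilon-greedy policy online."""
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)                     # assumed interface
            a_next = epsilon_greedy(q, s_next, epsilon, rng)  # policy improvement is implicit here
            # Sarsa update: move q(s, a) toward the TD target r + gamma * q(s', a')
            td_target = r + gamma * q[s_next, a_next] * (not done)
            q[s, a] += alpha * (td_target - q[s, a])
            s, a = s_next, a_next
    return q
```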
n-step Sarsa
n-step Sarsa Core Idea
Very naturally, inspired by the earlier trade-offs BGD > MBGD > SGD (a trade-off in batch size) and PI > TPI > VI (a trade-off in the number of policy evaluation iterations), we can likewise interpolate between MC (which uses a complete episode) and Sarsa (which uses a single step within an episode) by looking ahead n steps.
The definition of action value is
$$q_\pi(s, a) = \mathbb{E}\big[G_t \mid S_t = s, A_t = a\big].$$
The discounted return $G_t$ can be rewritten in different forms as
$$
\begin{aligned}
G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), && \text{(Sarsa)} \\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\
&\;\;\vdots \\
G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), && \text{(n-step Sarsa)} \\
&\;\;\vdots \\
G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots && \text{(MC)}
\end{aligned}
$$
Note
Here $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$; the superscripts only indicate different expansion forms of the same return. If it is unclear why they are all equal, observe that $q_\pi(S_{t+k}, A_{t+k})$ here is the true action value, not the estimate $q_t$, so every expansion is exact; they differ only in how many steps they look ahead.
n-step Sarsa Algorithm
- Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big]$
- MC aims to solve: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\big]$
- n-step Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(n)} \mid S_t = s, A_t = a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\big]$
- n-step Sarsa algorithm (a code sketch follows this list):
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]$$
- becomes Sarsa when $n = 1$
- becomes MC learning when $n = \infty$
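Below is a minimal sketch of the n-step update for a single state-action pair, assuming the $n$ subsequent rewards and the pair $(s_{t+n}, a_{t+n})$ have already been collected; the function and variable names are illustrative.

```python
def n_step_sarsa_update(q, s, a, rewards, s_n, a_n, alpha, gamma):
    """n-step Sarsa update of q[s, a].

    rewards: the list [r_{t+1}, ..., r_{t+n}] collected after taking (s_t, a_t).
    (s_n, a_n): the state-action pair reached n steps later, at time t + n.
    """
    n = len(rewards)
    # n-step TD target: r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})
    target = sum(gamma**k * r_k for k, r_k in enumerate(rewards)) + gamma**n * q[s_n, a_n]
    q[s, a] += alpha * (target - q[s, a])
    return q
```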
Remarks
- n-step Sarsa needs the data $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$
- n-step Sarsa only makes a small modification to the policy evaluation step
- the q-value of $(s_t, a_t)$ cannot be updated before the n-th subsequent step; we must wait until all the required data has been collected
- if n is large, it behaves more like MC (large variance, small bias); if n is small, it behaves more like Sarsa (small variance, large bias due to the initial guess)
Expected Sarsa
Expected Sarsa Core Idea
Recalling the three formulas we learned with the Bellman equation, we can write the following two equations:
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big], \\
q_\pi(s, a) &= \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big].
\end{aligned}
$$
Inspired by these, we obtained several algorithms:
- From the first equation (bootstrapping on state values), we derived the TD algorithm for state values in the previous section
- From the second equation (bootstrapping on action values), we derived Sarsa
- The drawback of Sarsa is that it only uses the q-value of a single action at the next state. We can instead use the expectation of the q-values over all actions at the next state (which is exactly the next state value). Replacing a single sampled q-value with this expectation is clearly more accurate (lower variance), and it is also easy to implement in code, being just a vector product. This gives Expected Sarsa.
- Later, we will see that we can go further and replace the averaging operation with taking the maximum (equivalent to making the policy greedy), which gives Q-learning (the three TD targets are summarized after this list)
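To summarize the progression in this list, the three algorithms differ only in the TD target they use:
$$
\begin{aligned}
\text{Sarsa:} \quad & r_{t+1} + \gamma\, q_t(s_{t+1}, a_{t+1}), \\
\text{Expected Sarsa:} \quad & r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a), \\
\text{Q-learning:} \quad & r_{t+1} + \gamma \max_a q_t(s_{t+1}, a).
\end{aligned}
$$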
Expected Sarsa Algorithm
Compared with Sarsa, the main modification made by Expected Sarsa is to replace the sampled next action value $q_t(s_{t+1}, a_{t+1})$ in the TD target with its expected value under the policy.
Expected Sarsa algorithm
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, \mathbb{E}\big[q_t(s_{t+1}, A)\big]\big)\Big]$$
Note that
$$\mathbb{E}\big[q_t(s_{t+1}, A)\big] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
is the expected value of $q_t(s_{t+1}, a)$ under the current policy $\pi_t$, i.e., an estimate of the state value of $s_{t+1}$.
Features:
- needs more computation than Sarsa
- has less variance than Sarsa because it reduces the number of random variables (it no longer depends on the sampled $a_{t+1}$)
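The 'vector product' mentioned in the core idea is just a dot product between the policy's action probabilities at $s_{t+1}$ and the corresponding row of the q-table. A minimal sketch, where the array names and shapes are assumptions for illustration:

```python
import numpy as np

def expected_sarsa_update(q, pi, s, a, r, s_next, alpha, gamma):
    """One Expected Sarsa update.

    q:  action-value table, shape [num_states, num_actions]
    pi: policy table with pi[s, a] = pi(a | s), shape [num_states, num_actions]
    Note that the sampled next action a_{t+1} is no longer needed.
    """
    # Expected q-value at the next state under the current policy:
    # sum_a pi(a | s_next) * q(s_next, a), computed as a single dot product.
    v_next = float(np.dot(pi[s_next], q[s_next]))
    q[s, a] += alpha * (r + gamma * v_next - q[s, a])
    return q
```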
How Sutton's book evaluates Expected Sarsa
“For example, suppose $\pi$ is the greedy policy while behavior is more exploratory; then Expected Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of the other more-well-known TD control algorithms.”
In other words, Expected Sarsa is a good compromise (between a greedy policy and an exploratory policy). It can subsume Q-learning (Q-learning also makes this compromise, but it handles it via off-policy learning, likewise letting the behavior policy be more exploratory and the target policy more greedy) as well as Sarsa, and in many cases it outperforms both algorithms (Figure 6.3 in the book illustrates this fact).