Temporal-Difference Sarsa

Introduction

  • the TD algorithm introduced before estimates state values
  • now we introduce Sarsa to directly estimate action values
  • the aim is to estimate the action values of a given policy

Algorithm

Suppose we have some experience samples $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$ generated by following a given policy $\pi$.

We can use the following Sarsa algorithm to estimate the action values of a given policy $\pi$:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big],
$$

$$
q_{t+1}(s, a) = q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t),
$$

where

  • $q_t(s_t, a_t)$ is an estimate of $q_\pi(s_t, a_t)$
  • $\alpha_t(s_t, a_t)$ is the learning rate depending on $(s_t, a_t)$

Some Questions

  1. Why is this algorithm called Sarsa? Sarsa is the abbreviation of state-action-reward-state-action, because each step of the algorithm involves the tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
  2. What is the relationship between Sarsa and the TD learning introduced before? Sarsa estimates action values, whereas that TD algorithm estimates state values; in other words, Sarsa is the action-value version of TD learning.
  3. What does Sarsa do mathematically? Sarsa is a stochastic approximation algorithm for solving the following equation (another expression of the Bellman equation in terms of action values): $q_\pi(s, a) = \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big]$ for all $(s, a)$.
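To make point 3 concrete, here is a brief sketch of the stochastic-approximation view (the function $g$ below is introduced only for this illustration and is not part of the original notes):

$$
\begin{aligned}
g\big(q(s, a)\big) &\doteq q(s, a) - \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big], \\
\tilde g\big(q(s, a)\big) &\doteq q(s, a) - \big(r + \gamma q(s', a')\big), \\
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\, \tilde g\big(q_t(s_t, a_t)\big).
\end{aligned}
$$

Solving $g(q(s, a)) = 0$ gives $q_\pi(s, a)$, and the last line, a Robbins-Monro iteration driven by the noisy sample $\tilde g$, is exactly the Sarsa update written above.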

Convergence

By the Sarsa algorithm, $q_t(s, a)$ converges with probability 1 to the action value $q_\pi(s, a)$ as $t \to \infty$ for all $(s, a)$, provided that $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a)$.

Remarks

In short, the action values of a given policy can be found by Sarsa, provided that the learning rates satisfy the above conditions.
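As a simple illustration (not from the original notes): if every pair $(s, a)$ is visited infinitely often and we set $\alpha_t(s, a) = 1/k$, where $k$ is the number of visits to $(s, a)$ so far, then

$$
\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty,
$$

so both conditions hold. In practice, $\alpha$ is often simply set to a small constant; the convergence guarantee then no longer strictly applies, but the estimate keeps adapting to the most recent data.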

Implementation

You can compare this with the implementation of the earlier algorithm to see the improvements made here.
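A minimal sketch in Python, assuming a hypothetical tabular environment with `env.num_states`, `env.num_actions`, `env.reset()` and `env.step(a)` returning `(s_next, r, done)` (these names are assumptions, not part of the original notes), and a fixed policy stored as a matrix with `policy[s, a]` $= \pi(a \mid s)$:

```python
import numpy as np

def sarsa_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, num_episodes=1000):
    """Estimate q_pi for a fixed policy pi with the Sarsa update (a sketch)."""
    q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = np.random.choice(env.num_actions, p=policy[s])
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = np.random.choice(env.num_actions, p=policy[s_next])
            # TD target uses the sampled next state-action pair (s', a');
            # terminal states contribute no bootstrapped value.
            td_target = r + gamma * q[s_next, a_next] * (not done)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s, a = s_next, a_next
    return q
```

Compared with the TD algorithm for state values, the only structural changes are that the table is indexed by (state, action) and the next action $a_{t+1}$ must also be sampled before each update.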

n-step Sarsa

n-step Sarsa Core Idea

Quite naturally, inspired by the earlier trade-offs BGD > MBGD > SGD (a trade-off in batch size) and PI > TPI > VI (a trade-off in the number of policy-evaluation iterations), we can also interpolate between MC (which uses a complete episode) and Sarsa (which uses a single step of an episode) by looking ahead n steps.

The definition of action value is

$$
q_\pi(s, a) = \mathbb{E}\big[G_t \mid S_t = s, A_t = a\big].
$$

The discounted return $G_t$ can be rewritten in different forms as

$$
\begin{aligned}
G_t = G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\
G_t = G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\
&\;\;\vdots \\
G_t = G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\
&\;\;\vdots \\
G_t = G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
\end{aligned}
$$

Note

The $G_t^{(1)}, G_t^{(2)}, \ldots, G_t^{(\infty)}$ here are merely different expansions of the same return $G_t$. If it is unclear why they are all equal, observe the term $q_\pi(S_{t+n}, A_{t+n})$: it is the true action value, not an estimate such as $q_t(S_{t+n}, A_{t+n})$, so all of these expressions equal the same $G_t$; they only differ in how many steps they look ahead.

n-step Sarsa Algorithm

  • Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(1)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\big]$
  • MC aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(\infty)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s, a\big]$
  • n-step Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(n)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\big]$
  • n-step Sarsa algorithm (see the sketch after this list): $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]$
    • becomes Sarsa when $n = 1$
    • becomes MC learning when $n = \infty$ (in particular, with $\alpha_t = 1$ the update gives $q_{t+1}(s_t, a_t) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$)
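A sketch of a single n-step update, under an assumed trajectory layout: a list of `(s, a, r_next)` tuples where the tuple at index $k$ stores $(s_k, a_k, r_{k+1})$ (this layout and the function name are illustrative assumptions):

```python
def n_step_sarsa_update(q, trajectory, t, n, gamma, alpha):
    """Update q(s_t, a_t) with the n-step Sarsa target (a sketch).

    Requires the trajectory to already contain steps t, t+1, ..., t+n,
    i.e. the update can only be performed n steps after visiting (s_t, a_t).
    """
    s_t, a_t, _ = trajectory[t]
    # n-step target: r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n}
    #                + gamma^n q(s_{t+n}, a_{t+n})
    g = 0.0
    for k in range(n):
        _, _, r_next = trajectory[t + k]
        g += (gamma ** k) * r_next
    s_tn, a_tn, _ = trajectory[t + n]
    g += (gamma ** n) * q[s_tn, a_tn]
    q[s_t, a_t] -= alpha * (q[s_t, a_t] - g)
    return q
```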

Remarks

  1. n-step Sarsa needs the data $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$.
  2. n-step Sarsa is just a small modification of the policy evaluation step.
  3. We cannot update the q value before the (t+n)-th step; we must wait until all of these samples have been collected.
  4. If n is large, it behaves more like MC (large variance, small bias); if n is small, it behaves more like Sarsa (small variance, large bias due to the initial guess).

Expected Sarsa

Expected Sarsa Core Idea

Recalling the three formulas from when we studied the Bellman equation, we can write the following two formulas:

$$
v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big],
$$

$$
q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big].
$$

Inspired by these, we obtain several algorithms:

  • From the first formula (bootstrapping on state values), we derived the TD learning of state values in the previous section.
  • From the second formula (bootstrapping on action values), we derived Sarsa.
  • A drawback of Sarsa, however, is that it only considers the q value of a single action at the next state. We can instead use the expectation of the q values over all actions at the next state (which is exactly the state value of the next state). Replacing a single sampled q value with its expectation is clearly more accurate (lower variance), and it is easy to implement in code: just a vector product. This leads to Expected Sarsa.
  • Later, we will see that if we further replace this averaging operation with taking the maximum (equivalently, setting the policy to the greedy policy), we obtain Q-learning.

Expected Sarsa Algorithm

Compared with Sarsa, the main modification made by Expected Sarsa is that the TD target $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ is replaced by $r_{t+1} + \gamma \mathbb{E}\big[q_t(s_{t+1}, A)\big]$.

Expected Sarsa algorithm:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}\big[q_t(s_{t+1}, A)\big]\big)\Big]
$$

Note that

$$
\mathbb{E}\big[q_t(s_{t+1}, A)\big] = \sum_a \pi(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})
$$

is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi$.

Features:

  • needs more computation than Sarsa, since the expectation over all actions must be computed
  • has less variance than Sarsa because it reduces the number of random variables: Sarsa requires $(r_{t+1}, s_{t+1}, a_{t+1})$, whereas Expected Sarsa only requires $(r_{t+1}, s_{t+1})$
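A minimal sketch of one Expected Sarsa update, assuming the same tabular layout as before (`q` of shape `(num_states, num_actions)` and `policy[s, a]` $= \pi(a \mid s)$; the function name is an illustrative assumption):

```python
import numpy as np

def expected_sarsa_update(q, policy, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One Expected Sarsa update step (a sketch)."""
    # The expectation over the next action is a single vector product:
    # sum_a pi(a|s') q(s', a), i.e. the estimated state value of s'.
    expected_q = policy[s_next] @ q[s_next]
    td_target = r + gamma * expected_q
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```

Note that no next action $a_{t+1}$ is needed; the expectation is just the vector product mentioned in the core idea, so the extra cost is one dot product per update.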

How Sutton's book evaluates Expected Sarsa

“For example, suppose $\pi$ is the greedy policy while behavior is more exploratory; then Expected Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of the other more-well-known TD control algorithms.”

In other words, Expected Sarsa is a good compromise (between the greedy policy and an exploratory policy). It can subsume Q-learning (Q-learning also makes such a compromise, but it does so by being off-policy, likewise hoping for a more exploratory behavior policy and a more greedy target policy) as well as Sarsa, and in many cases it outperforms both algorithms (Figure 6.3 in the book illustrates this fact).