Temporal-Difference Sarsa

Introduction

  • the TD algorithm introduced before estimates state values
  • now we introduce Sarsa to directly estimate action values
  • the aim is to estimate the action values of a given policy

Algorithm

Suppose we have some experience samples $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$ generated by following a given policy $\pi$.

We can use the following Sarsa algorithm to estimate the action values of a given policy $\pi$:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big],
$$

$$
q_{t+1}(s, a) = q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t),
$$

where

  • $q_t(s_t, a_t)$ is an estimate of $q_\pi(s_t, a_t)$
  • $\alpha_t(s_t, a_t)$ is the learning rate depending on $(s_t, a_t)$

Some Questions

  1. Why is this algorithm called Sarsa? Sarsa is the abbreviation of state-action-reward-state-action, because each step of the algorithm involves the tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
  2. What is the relationship between Sarsa and the TD learning introduced before? Sarsa estimates action values, whereas that TD algorithm estimates state values; in other words, Sarsa is the action-value version of TD learning.
  3. What does Sarsa do mathematically? Sarsa is a stochastic approximation algorithm for solving the following equation (another expression of the Bellman equation in terms of action values): $q_\pi(s, a) = \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big]$ for all $(s, a)$.
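To make point 3 concrete, here is a brief sketch of the stochastic-approximation view (the function $g$ below is introduced only for this illustration and is not part of the original notes):

$$
\begin{aligned}
g\big(q(s, a)\big) &\doteq q(s, a) - \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big], \\
\tilde g\big(q(s, a)\big) &\doteq q(s, a) - \big(r + \gamma q(s', a')\big), \\
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\, \tilde g\big(q_t(s_t, a_t)\big).
\end{aligned}
$$

Solving $g(q(s, a)) = 0$ gives $q_\pi(s, a)$, and the last line, a Robbins-Monro iteration driven by the noisy sample $\tilde g$, is exactly the Sarsa update written above.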

Convergence

By the Sarsa algorithm, $q_t(s, a)$ converges with probability 1 to the action value $q_\pi(s, a)$ as $t \to \infty$ for all $(s, a)$, provided that $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a)$.

Remarks

In short, the action values of a given policy can be found by Sarsa, provided that the learning rates satisfy the above conditions.
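As a simple illustration (not from the original notes): if every pair $(s, a)$ is visited infinitely often and we set $\alpha_t(s, a) = 1/k$, where $k$ is the number of visits to $(s, a)$ so far, then

$$
\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty,
$$

so both conditions hold. In practice, $\alpha$ is often simply set to a small constant; the convergence guarantee then no longer strictly applies, but the estimate keeps adapting to the most recent data.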

Implementation

You can compare this with the implementation of the earlier algorithm to see the improvements made here.
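A minimal sketch in Python, assuming a hypothetical tabular environment with `env.num_states`, `env.num_actions`, `env.reset()` and `env.step(a)` returning `(s_next, r, done)` (these names are assumptions, not part of the original notes), and a fixed policy stored as a matrix with `policy[s, a]` $= \pi(a \mid s)$:

```python
import numpy as np

def sarsa_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, num_episodes=1000):
    """Estimate q_pi for a fixed policy pi with the Sarsa update (a sketch)."""
    q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = np.random.choice(env.num_actions, p=policy[s])
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = np.random.choice(env.num_actions, p=policy[s_next])
            # TD target uses the sampled next state-action pair (s', a');
            # terminal states contribute no bootstrapped value.
            td_target = r + gamma * q[s_next, a_next] * (not done)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s, a = s_next, a_next
    return q
```

Compared with the TD algorithm for state values, the only structural changes are that the table is indexed by (state, action) and the next action $a_{t+1}$ must also be sampled before each update.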

n-step Sarsa

n-step Sarsa Core Idea

Quite naturally, inspired by the earlier trade-offs BGD > MBGD > SGD (a trade-off in batch size) and PI > TPI > VI (a trade-off in the number of policy-evaluation iterations), we can also interpolate between MC (which uses a complete episode) and Sarsa (which uses a single step of an episode) by looking ahead n steps.

The definition of action value is

$$
q_\pi(s, a) = \mathbb{E}\big[G_t \mid S_t = s, A_t = a\big].
$$

The discounted return $G_t$ can be rewritten in different forms as

$$
\begin{aligned}
G_t = G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\
G_t = G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\
&\;\;\vdots \\
G_t = G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\
&\;\;\vdots \\
G_t = G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
\end{aligned}
$$

Note

The $G_t^{(1)}, G_t^{(2)}, \ldots, G_t^{(\infty)}$ here are merely different expansions of the same return $G_t$. If it is unclear why they are all equal, observe the term $q_\pi(S_{t+n}, A_{t+n})$: it is the true action value, not an estimate such as $q_t(S_{t+n}, A_{t+n})$, so all of these expressions equal the same $G_t$; they only differ in how many steps they look ahead.

n-step Sarsa Algorithm

  • Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(1)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\big]$
  • MC aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(\infty)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s, a\big]$
  • n-step Sarsa aims to solve: $q_\pi(s, a) = \mathbb{E}\big[G_t^{(n)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\big]$
  • n-step Sarsa algorithm (see the sketch after this list): $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]$
    • becomes Sarsa when $n = 1$
    • becomes MC learning when $n = \infty$ (in particular, with $\alpha_t = 1$ the update gives $q_{t+1}(s_t, a_t) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$)
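A sketch of a single n-step update, under an assumed trajectory layout: a list of `(s, a, r_next)` tuples where the tuple at index $k$ stores $(s_k, a_k, r_{k+1})$ (this layout and the function name are illustrative assumptions):

```python
def n_step_sarsa_update(q, trajectory, t, n, gamma, alpha):
    """Update q(s_t, a_t) with the n-step Sarsa target (a sketch).

    Requires the trajectory to already contain steps t, t+1, ..., t+n,
    i.e. the update can only be performed n steps after visiting (s_t, a_t).
    """
    s_t, a_t, _ = trajectory[t]
    # n-step target: r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n}
    #                + gamma^n q(s_{t+n}, a_{t+n})
    g = 0.0
    for k in range(n):
        _, _, r_next = trajectory[t + k]
        g += (gamma ** k) * r_next
    s_tn, a_tn, _ = trajectory[t + n]
    g += (gamma ** n) * q[s_tn, a_tn]
    q[s_t, a_t] -= alpha * (q[s_t, a_t] - g)
    return q
```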

Remarks

  1. n-step Sarsa needs the data $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$.
  2. n-step Sarsa is just a small modification of the policy evaluation step.
  3. We cannot update the q value before the (t+n)-th step; we must wait until all of these samples have been collected.
  4. If n is large, it behaves more like MC (large variance, small bias); if n is small, it behaves more like Sarsa (small variance, large bias due to the initial guess).

Expected Sarsa

Expected Sarsa Core Idea

Recalling the three formulas from when we studied the Bellman equation, we can write the following two formulas:

$$
v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big],
$$

$$
q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big].
$$

Inspired by these, we obtain several algorithms:

  • From the first formula (bootstrapping on state values), we derived the TD learning of state values in the previous section.
  • From the second formula (bootstrapping on action values), we derived Sarsa.
  • A drawback of Sarsa, however, is that it only considers the q value of a single action at the next state. We can instead use the expectation of the q values over all actions at the next state (which is exactly the state value of the next state). Replacing a single sampled q value with its expectation is clearly more accurate (lower variance), and it is easy to implement in code: just a vector product. This leads to Expected Sarsa.
  • Later, we will see that if we further replace this averaging operation with taking the maximum (equivalently, setting the policy to the greedy policy), we obtain Q-learning.

Expected Sarsa Algorithm

Compared with Sarsa, the main modification made by Expected Sarsa is that the TD target $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ is replaced by $r_{t+1} + \gamma \mathbb{E}\big[q_t(s_{t+1}, A)\big]$.

Expected Sarsa algorithm:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}\big[q_t(s_{t+1}, A)\big]\big)\Big]
$$

Note that

$$
\mathbb{E}\big[q_t(s_{t+1}, A)\big] = \sum_a \pi(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})
$$

is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi$.

Features:

  • needs more computation than Sarsa, since the expectation over all actions must be computed
  • has less variance than Sarsa because it reduces the number of random variables: Sarsa requires $(r_{t+1}, s_{t+1}, a_{t+1})$, whereas Expected Sarsa only requires $(r_{t+1}, s_{t+1})$
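A minimal sketch of one Expected Sarsa update, assuming the same tabular layout as before (`q` of shape `(num_states, num_actions)` and `policy[s, a]` $= \pi(a \mid s)$; the function name is an illustrative assumption):

```python
import numpy as np

def expected_sarsa_update(q, policy, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One Expected Sarsa update step (a sketch)."""
    # The expectation over the next action is a single vector product:
    # sum_a pi(a|s') q(s', a), i.e. the estimated state value of s'.
    expected_q = policy[s_next] @ q[s_next]
    td_target = r + gamma * expected_q
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```

Note that no next action $a_{t+1}$ is needed; the expectation is just the vector product mentioned in the core idea, so the extra cost is one dot product per update.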

How Sutton's book evaluates Expected Sarsa

“For example, suppose $\pi$ is the greedy policy while behavior is more exploratory; then Expected Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of the other more-well-known TD control algorithms.”

In other words, Expected Sarsa is a good compromise (between the greedy policy and an exploratory policy). It can subsume Q-learning (Q-learning also makes such a compromise, but it does so by being off-policy, likewise hoping for a more exploratory behavior policy and a more greedy target policy) as well as Sarsa, and in many cases it outperforms both algorithms (Figure 6.3 in the book illustrates this fact).