Temporal-Difference Q-learning

Introduction

Sarsa estimates the action values of a given policy $π$
Q-learning estimates the optimal action values

Algorithm

Sarsa

$$
\begin{aligned}
q_{k+1}(s_t, a_t) &= q_k(s_t, a_t) - α_k(s_t, a_t) \left[ q_k(s_t, a_t) - \left( r_{t+1} + γ q_k(s_{t+1}, a_{t+1}) \right) \right], \\
q_{k+1}(s, a) &= q_k(s, a), \quad \forall (s, a) \neq (s_t, a_t).
\end{aligned}
$$

Q-learning

$$
\begin{aligned}
q_{k+1}(s_t, a_t) &= q_k(s_t, a_t) - α_k(s_t, a_t) \left[ q_k(s_t, a_t) - \left( r_{t+1} + γ \max_{a \in A} q_k(s_{t+1}, a) \right) \right], \\
q_{k+1}(s, a) &= q_k(s, a), \quad \forall (s, a) \neq (s_t, a_t).
\end{aligned}
$$

The TD target in Q-learning is $r_{t+1} + γ \max_{a \in A} q_k(s_{t+1}, a)$.
The TD target in Sarsa is $r_{t+1} + γ q_k(s_{t+1}, a_{t+1})$.
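The only difference between the two updates is the TD target. A minimal Python sketch of that difference follows; the function names and the small nested-list table `q` are illustrative assumptions, not part of the original notes.

```python
def sarsa_target(q, r_next, s_next, a_next, gamma):
    """Sarsa bootstraps from the action a_next that was actually selected."""
    return r_next + gamma * q[s_next][a_next]


def q_learning_target(q, r_next, s_next, gamma):
    """Q-learning bootstraps from the largest action value in s_next."""
    return r_next + gamma * max(q[s_next])


def td_update(q, s, a, target, alpha):
    """Shared rule: q(s, a) <- q(s, a) - alpha * (q(s, a) - target)."""
    q[s][a] = q[s][a] - alpha * (q[s][a] - target)


# Hypothetical table with two states and two actions, for illustration only.
q = [[0.0, 0.0], [1.0, 3.0]]
print(sarsa_target(q, r_next=1.0, s_next=1, a_next=0, gamma=0.9))
print(q_learning_target(q, r_next=1.0, s_next=1, gamma=0.9))
```

With the same transition, Sarsa uses the value of the sampled next action, while Q-learning uses the maximum over actions, so the two targets coincide only when the sampled action happens to be the greedy one.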

What does Q-learning do mathematically?

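It aims to solve the Bellman optimality equation in terms of action values:
$q(s, a) = E[R_{t+1} + γ \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a], \forall s, a.$
Given $S_t = s$ and $A_t = a$, the expectation is taken over the reward and the next state only, so it does not depend on any policy.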
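To make the fixed-point reading concrete, here is a small sketch that applies the right-hand side of the Bellman optimality equation repeatedly until the table stops changing. It assumes a known model, which Q-learning itself does not need: the chain world, `step`, and all constants are hypothetical, and because the toy model is deterministic the expectation reduces to a single term.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 5, 2
TARGET = N_STATES - 1


def step(s, a):
    """Hypothetical chain world: action 0 = left, 1 = right, reward -1 per move."""
    s_next = max(0, s - 1) if a == 0 else min(TARGET, s + 1)
    return -1.0, s_next


def bellman_optimality_backup(q):
    """Apply the right-hand side of the Bellman optimality equation once."""
    new_q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(TARGET):  # the target state is terminal, its values stay 0
        for a in range(N_ACTIONS):
            r, s_next = step(s, a)
            new_q[s][a] = r + GAMMA * max(q[s_next])
    return new_q


q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(200):
    q = bellman_optimality_backup(q)

# At the fixed point, one more backup leaves the table unchanged.
backed_up = bellman_optimality_backup(q)
residual = max(
    abs(q[s][a] - backed_up[s][a])
    for s in range(N_STATES)
    for a in range(N_ACTIONS)
)
print(q[TARGET - 1], residual)
```

Q-learning targets the same fixed point, but replaces the model-based backup with sampled transitions and a step size.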

Off-policy vs. On-policy

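A TD learning task generally involves two policies:

Behavior policy: used to generate experience samples.
Target policy: the policy that is continually updated towards the optimal policy.

With these, we can define on-policy and off-policy learning:

On-policy: the behavior policy and the target policy are the same.
Off-policy: the behavior policy and the target policy are different.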

Advantages of off-policy learning

It can search for the optimal policy using experience samples generated by any other policy (the behavior policy).

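As an example, consider the following situation: our behavior policy is fairly random (put more kindly, exploratory: it tends to explore), whereas our target policy is fairly stable (put more kindly, exploitative: it tends to use the information already gathered).

If we update with an on-policy method here, the policy may well stop improving, because the experience generated by the target policy is stable and lacks variety.
With an off-policy method, where the behavior policy generates rich and varied experience, we can still keep searching for the optimal policy.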

Sarsa and MC are on-policy

Sarsa aims to evaluate a given policy $π$ by solving $q_{π}(s, a) = E_{π}[R + γ q_{π}(S', A') \mid s, a], \forall s, a.$
MC aims to evaluate a given policy $π$ by solving $q_{π}(s, a) = E_{π}[G_{t} \mid S_{t} = s, A_{t} = a], \forall s, a,$ where $G_{t} = R_{t+1} + γ R_{t+2} + \dots + γ^{T-t-1} R_{T}$ is the return from time $t$.
In both cases, $π$ is both the behavior policy and the target policy.
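The return $G_t$ used by MC can be computed for every time step with one backward pass over an episode's rewards. This is a minimal sketch; the function name and the convention that `rewards[t]` stores $R_{t+1}$ are assumptions made for illustration.

```python
def returns_per_step(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(returns_per_step([1.0, 1.0, 1.0], gamma=0.5))
```

MC has to wait until the episode ends before any $G_t$ is available, whereas Sarsa bootstraps from $q_{π}(S', A')$ after a single step.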

Q-learning is off-policy

Q-learning aims to solve the Bellman optimality equation (BOE) $q(s, a) = E[R_{t+1} + γ \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a], \forall s, a.$
The algorithm is $q_{k+1}(s_t, a_t) = q_k(s_t, a_t) - α_k(s_t, a_t) \left[ q_k(s_t, a_t) - \left( r_{t+1} + γ \max_{a \in A} q_k(s_{t+1}, a) \right) \right],$ which requires only $(s_t, a_t, r_{t+1}, s_{t+1})$.
The behavior policy is the one that generates $a_t$ in $s_t$, and it can be any policy.

Implementation

Version 1

Algorithm: Policy search by Q-learning (on-policy version) with $ε$-greedy

Initialization: initial policy $π_0(a \mid s)$ and value function $q(s, a)$ for all $(s, a)$; step size $α \in (0, 1]$.
For each episode, do:
  Initialize state $s_0$.
  Select action $a_0$ using $π_0(s_0)$.
  While $s_t$ is not the target state, do:
    Execute $a_t$, observe $r_{t+1}, s_{t+1}$.
    Select $a_{t+1}$ using $π_t(s_{t+1})$.
    Update $q(s_t, a_t)$:
      $q(s_t, a_t) \leftarrow q(s_t, a_t) - α \left[ q(s_t, a_t) - \left( r_{t+1} + γ \max_{a} q(s_{t+1}, a) \right) \right]$
    Update policy for $s_t$:
      $π_{t+1}(a \mid s_t) \leftarrow \begin{cases} 1 - ε + \frac{ε}{|A(s_t)|}, & \text{if } a = \arg\max_{a'} q(s_t, a') \\ \frac{ε}{|A(s_t)|}, & \text{otherwise} \end{cases}$
    $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
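The Version 1 loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the five-state chain world, the `step` function, the reward of -1 per move, the uniform initial policy, and all constants are hypothetical choices made so the example runs on its own.

```python
import random

GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1
N_STATES, N_ACTIONS = 5, 2
TARGET = N_STATES - 1


def step(s, a):
    """Hypothetical chain world: action 0 = left, 1 = right, reward -1 per move."""
    s_next = max(0, s - 1) if a == 0 else min(TARGET, s + 1)
    return -1.0, s_next


def sample_action(policy, s, rng):
    """Draw an action from the stochastic policy pi(. | s)."""
    return rng.choices(list(range(N_ACTIONS)), weights=policy[s])[0]


def q_learning_on_policy(n_episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    # Assumed initial policy pi_0: uniform over actions in every state.
    policy = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_episodes):
        s = 0
        a = sample_action(policy, s, rng)
        while s != TARGET:
            r, s_next = step(s, a)
            a_next = sample_action(policy, s_next, rng)
            # Q-learning update: bootstrap from the max over next actions.
            q[s][a] -= ALPHA * (q[s][a] - (r + GAMMA * max(q[s_next])))
            # Epsilon-greedy policy update for the state just visited.
            greedy = max(range(N_ACTIONS), key=lambda b: q[s][b])
            for b in range(N_ACTIONS):
                policy[s][b] = EPSILON / N_ACTIONS
            policy[s][greedy] += 1.0 - EPSILON
            s, a = s_next, a_next
    return q, policy


q, policy = q_learning_on_policy()
print([max(range(N_ACTIONS), key=lambda b: q[s][b]) for s in range(TARGET)])
```

Here the same $ε$-greedy policy both generates the actions and is the one being improved, which is what makes this version on-policy.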

Version 2

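The advantage of off-policy learning is that exploration is already built in: we can simply use an exploratory (random) behavior policy $π_{b}$ to generate the experience samples. This lets us drop the $ε$-greedy part of the algorithm above, because exploration is already taken care of.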

Algorithm: Optimal policy search by Q-learning (off-policy version)

Initialization: $q(s, a)$, $α \in (0, 1]$.
For each episode generated by $π_b$, do:
  For $t = 0, 1, 2, \dots$ in the episode, do:
    Observe experience: $(s_t, a_t, r_{t+1}, s_{t+1})$
    Update $q(s_t, a_t)$:
      $q(s_t, a_t) \leftarrow q(s_t, a_t) - α \left[ q(s_t, a_t) - \left( r_{t+1} + γ \max_{a} q(s_{t+1}, a) \right) \right]$
    Update target policy $π_T$ for $s_t$:
      $π_T(a \mid s_t) \leftarrow \begin{cases} 1, & \text{if } a = \arg\max_{a'} q(s_t, a') \\ 0, & \text{otherwise} \end{cases}$
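A minimal Python sketch of Version 2 follows. It assumes a uniform-random behavior policy $π_b$ and the same hypothetical five-state chain world with a reward of -1 per move; `step`, `generate_episode`, and the constants are illustrative names, not from the original notes.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
N_STATES, N_ACTIONS = 5, 2
TARGET = N_STATES - 1


def step(s, a):
    """Hypothetical chain world: action 0 = left, 1 = right, reward -1 per move."""
    s_next = max(0, s - 1) if a == 0 else min(TARGET, s + 1)
    return -1.0, s_next


def generate_episode(rng):
    """Roll out one episode with a uniform-random behavior policy pi_b."""
    episode, s = [], 0
    while s != TARGET:
        a = rng.randrange(N_ACTIONS)
        r, s_next = step(s, a)
        episode.append((s, a, r, s_next))
        s = s_next
    return episode


def q_learning_off_policy(n_episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    target_policy = [0] * N_STATES  # greedy action index per state
    for _ in range(n_episodes):
        for s, a, r, s_next in generate_episode(rng):
            # The update never looks at how a was chosen, only at (s, a, r, s').
            q[s][a] -= ALPHA * (q[s][a] - (r + GAMMA * max(q[s_next])))
            # Deterministic greedy target policy for the visited state.
            target_policy[s] = max(range(N_ACTIONS), key=lambda b: q[s][b])
    return q, target_policy


q, target_policy = q_learning_off_policy()
print(target_policy[:TARGET])
```

The behavior policy never changes during training, yet the greedy target policy still improves, because the update only consumes transitions.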

Version 3

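The algorithms above clearly contain both a policy evaluation part and a policy improvement part. In practice, however, the policy improvement part does not need to be written out at all, because the policy evaluation step already updates the q-values greedily (that is, as long as the Q-table exists, we can keep updating it). In addition, the version above does not say how $π_{b}$ is obtained. Section 6.5 of Sutton's book gives a clearer description of the algorithm: use a policy that is greedy with respect to the Q obtained in the previous step as $π_{b}$, and add an $ε$-greedy decay so that the model moves gradually from exploration at the start towards exploitation. We therefore modify the algorithm once more: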

Algorithm: Q-learning (off-policy TD control)

Initialization: $q(s, a)$, $α \in (0, 1]$, $ε \in (0, 1]$, decay rate $φ \in (0, 1]$.
For each episode, do:
  Initialize state $s_0$.
  Select action $a_0$ using $ε$-greedy w.r.t. $q(s_0, a)$.
  While $s_t$ is not the target state, do:
    Execute $a_t$, observe $r_{t+1}, s_{t+1}$.
    Select $a_{t+1}$ using $ε$-greedy w.r.t. $q(s_{t+1}, a)$.
    Update $q(s_t, a_t)$:
      $q(s_t, a_t) \leftarrow q(s_t, a_t) - α \left[ q(s_t, a_t) - \left( r_{t+1} + γ \max_{a} q(s_{t+1}, a) \right) \right]$
    $ε \leftarrow ε \times φ$
    $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
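Version 3 can be sketched in Python as below. The chain world, `step`, the starting $ε = 1$, the decay factor 0.999, and the choice to decay $ε$ after every step are assumptions made for this illustration; they are one reasonable reading of the pseudocode, not the only one.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
N_STATES, N_ACTIONS = 5, 2
TARGET = N_STATES - 1


def step(s, a):
    """Hypothetical chain world: action 0 = left, 1 = right, reward -1 per move."""
    s_next = max(0, s - 1) if a == 0 else min(TARGET, s + 1)
    return -1.0, s_next


def epsilon_greedy_action(q, s, epsilon, rng):
    """With probability epsilon pick uniformly at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda b: q[s][b])


def q_learning(n_episodes=500, epsilon=1.0, phi=0.999, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy_action(q, s, epsilon, rng)
        while s != TARGET:
            r, s_next = step(s, a)
            a_next = epsilon_greedy_action(q, s_next, epsilon, rng)
            # The target uses max over actions, not the sampled a_next.
            q[s][a] -= ALPHA * (q[s][a] - (r + GAMMA * max(q[s_next])))
            epsilon *= phi  # shift gradually from exploration to exploitation
            s, a = s_next, a_next
    return q, epsilon


q, final_epsilon = q_learning()
print([max(range(N_ACTIONS), key=lambda b: q[s][b]) for s in range(TARGET)])
```

No explicit policy table is stored: the behavior policy is recomputed from `q` whenever an action is needed, and the target policy is implicit in the `max` inside the update.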

Explanation

w.r.t. stands for "with respect to"; here it means that actions are selected according to the Q-values. Written mathematically:
$π(a \mid s) = \begin{cases} 1 - ε + \frac{ε}{|A(s)|}, & \text{if } a = \arg\max_{a'} q(s, a') \\ \frac{ε}{|A(s)|}, & \text{otherwise} \end{cases}$
Here $π$ is our behavior policy $π_{b}$, and "select action" simply means sampling from $π_{b}(\cdot \mid s_{current})$.
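The $ε$-greedy distribution and the sampling step can be sketched as follows. The helper names and the example Q-values are hypothetical; ties in the argmax are broken by taking the first maximal action, which is an assumption the formula itself leaves open.

```python
import random


def epsilon_greedy_probs(q_row, epsilon):
    """Return pi(. | s) for one state, given its row of action values."""
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])  # first argmax on ties
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def select_action(q_row, epsilon, rng):
    """Sample one action from the epsilon-greedy behavior policy."""
    probs = epsilon_greedy_probs(q_row, epsilon)
    return rng.choices(list(range(len(q_row))), weights=probs)[0]


rng = random.Random(0)
q_row = [0.0, 2.0, 1.0, -1.0]  # hypothetical action values for one state
print(epsilon_greedy_probs(q_row, epsilon=0.2))
print(select_action(q_row, epsilon=0.2, rng=rng))
```

Every action keeps probability of at least $ε / |A(s)|$, so the behavior policy can still reach any state-action pair.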

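This approach has several advantages:

The behavior policy is relatively exploratory (random): it is $ε$-greedy with respect to the previous step's Q.

The target policy is relatively exploitative (greedy): it is greedy with respect to the current step's Q.

Through the $ε$-greedy decay ($ε$ is repeatedly multiplied by $φ$, so $ε \to 0$ when $φ < 1$), we move gradually from exploration to exploitation, which gives a smooth transition.

We are still off-policy, because the behavior policy and the target policy are different, so the earlier advantages are retained.

That said, we can also keep a small $ε$ throughout (or stop it from converging to 0 late in training by adding a small bias), which preserves the ability to explore indefinitely ($π_{b}$ samples $ε$-greedily from Q, while $π_{T}$ is the $\arg\max$ of Q).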
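One simple way to keep $ε$ from reaching 0 is to clip the decayed value at a floor. This is a sketch of that idea only; the function name and the floor value 0.05 are assumptions, and an additive bias would work just as well.

```python
def decay_epsilon(epsilon, phi, epsilon_min=0.05):
    """Multiply epsilon by phi, but never let it drop below epsilon_min."""
    return max(epsilon_min, epsilon * phi)


epsilon = 1.0
for _ in range(10_000):
    epsilon = decay_epsilon(epsilon, phi=0.99)
print(epsilon)
```

With the floor in place, the behavior policy keeps a fixed minimum amount of exploration no matter how long training runs.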
