07 Temporal-Difference Methods
- Classic RL algorithms
- Algorithms:
- TD learning of state values
- Sarsa: TD learning of action values
- Q-learning: TD learning of optimal action values
- on-policy & off-policy
- Unified point of view
on-policy vs. off-policy
In reinforcement learning there are two policies. The behavior policy is the one used to generate the experience data; the target policy is the one we keep improving and hope will converge to an optimal policy. If the two are the same policy, the algorithm is on-policy; if they are allowed to be different, it is off-policy.
- on-policy: you play the game yourself and study your own moves to work out the optimal strategy;
- off-policy: someone else plays, and you watch their games to work out the optimal strategy, occasionally exchanging ideas.
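To make the distinction concrete, here is a minimal tabular sketch (not from the lecture; the environment layout, array shapes, and helper names are illustrative assumptions). Sarsa bootstraps from the action actually chosen by the behavior policy, while Q-learning bootstraps from the greedy target policy.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Behavior policy: explore with probability eps, otherwise act greedily."""
    return int(rng.integers(len(q_row))) if rng.random() < eps else int(np.argmax(q_row))

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: a_next is sampled from the SAME epsilon-greedy policy that
    generated the data, so the policy being evaluated is the behavior policy."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the data may come from any behavior policy; the target uses
    the greedy (target) policy via the max over actions."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```

Because the max in `q_learning_update` does not depend on how the actions were chosen, Q-learning can learn optimal action values from experience generated by a different, exploratory behavior policy.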
Outline
- Motivating example
- TD learning of state values
- TD learning of action values
- TD learning of optimal action values
- A unified point of view
- Summary
A unified point of view
All the algorithms introduced above can be unified in the following expression:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\left[q_t(s_t, a_t) - \bar{q}_t\right]$$
Expression of TD target
Here, $\bar{q}_t$ is the TD target, which is different for different algorithms:
| Algorithm | Expression of $\bar{q}_t$ |
|---|---|
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ |
| n-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
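As a hedged illustration (not the lecture's reference code; the function and variable names are assumptions), the TD targets in the table above and the unified update can be written for a tabular action-value array `q` of shape `(n_states, n_actions)`:

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma):
    # r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
    return r + gamma * q[s_next, a_next]

def n_step_sarsa_target(q, rewards, s_n, a_n, gamma):
    # rewards = [r_{t+1}, ..., r_{t+n}]; bootstrap from q_t(s_{t+n}, a_{t+n})
    g = sum(gamma**k * rk for k, rk in enumerate(rewards))
    return g + gamma**len(rewards) * q[s_n, a_n]

def expected_sarsa_target(q, r, s_next, pi_row, gamma):
    # pi_row[a] = pi_t(a | s_{t+1}); expectation over the next action
    return r + gamma * float(np.dot(pi_row, q[s_next]))

def q_learning_target(q, r, s_next, gamma):
    # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
    return r + gamma * float(np.max(q[s_next]))

def monte_carlo_target(rewards, gamma):
    # full discounted return of the rest of the episode; no bootstrapping
    return sum(gamma**k * rk for k, rk in enumerate(rewards))

def unified_update(q, s, a, q_bar, alpha):
    # q_{t+1}(s_t,a_t) = q_t(s_t,a_t) - alpha_t(s_t,a_t) * [q_t(s_t,a_t) - q_bar_t]
    q[s, a] -= alpha * (q[s, a] - q_bar)
```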
Equation to solve
| Algorithm | Equation to solve |
|---|---|
| Monte Carlo | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\right]$ |
| n-step Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\right]$ |
| Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$ |
| Expected Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi}\left[q_\pi(S_{t+1}, A_{t+1})\right] \mid S_t = s, A_t = a\right]$ |
| Q-learning | BOE: $q(s,a) = \mathbb{E}\left[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a\right]$ |

Here, BE denotes a Bellman equation of $q_\pi$ and BOE the Bellman optimality equation.
Need data
| Algorithm | Data needed |
|---|---|
| Monte Carlo | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$, i.e., the rest of the episode starting from $(s_t, a_t)$ |
| n-step Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$ |
| Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
| Q-learning | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
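To show how the data in the table above is collected and fed into the unified update, here is a hedged sketch of one Sarsa episode, assuming a hypothetical tabular environment exposing `reset()` and `step(a) -> (s_next, r, done)`; none of these names come from the lecture.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    # same helper as in the earlier sketch
    return int(rng.integers(len(q_row))) if rng.random() < eps else int(np.argmax(q_row))

def run_sarsa_episode(env, q, alpha=0.1, gamma=0.9, eps=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    s = env.reset()
    a = epsilon_greedy(q[s], eps, rng)                 # a_t from the behavior policy
    done = False
    while not done:
        s_next, r, done = env.step(a)                  # observe r_{t+1}, s_{t+1}
        a_next = epsilon_greedy(q[s_next], eps, rng)   # a_{t+1}: the fifth element Sarsa needs
        q_bar = r + gamma * q[s_next, a_next]          # Sarsa TD target
        q[s, a] -= alpha * (q[s, a] - q_bar)           # unified update form
        s, a = s_next, a_next
```

Q-learning would use the same loop but needs only $(s_t, a_t, r_{t+1}, s_{t+1})$ per update: the target becomes `r + gamma * q[s_next].max()`, and `a_next` is needed only for acting, not for the update.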
Summary
- Various TD learning algorithms
- Expressions, mathematical interpretations, implementations, relationships
- A unified point of view
Later
From tabular representation to function representation ⇒ 08 Value Function Methods