07 Temporal-Difference Methods
- Classic RL algorithms
- Algorithms:
- TD learning of state values
- Sarsa: TD learning of action values
- Q-learning: TD learning of optimal action values
- on-policy & off-policy
- Unified point of view
on-policy vs. off-policy
In reinforcement learning there are two policies. The behavior policy is the one used to generate the experience data; the target policy is the one we keep improving and hope will converge to an optimal policy. If the two are the same policy, the algorithm is on-policy; if they are allowed to be different, it is off-policy.
- on-policy: you play the game yourself and study your own moves to work out the optimal strategy;
- off-policy: someone else plays, and you watch their games to work out the optimal strategy, occasionally exchanging ideas.
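To make the distinction concrete, here is a minimal tabular sketch (not from the lecture; the environment layout, array shapes, and helper names are illustrative assumptions). Sarsa bootstraps from the action actually chosen by the behavior policy, while Q-learning bootstraps from the greedy target policy.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Behavior policy: explore with probability eps, otherwise act greedily."""
    return int(rng.integers(len(q_row))) if rng.random() < eps else int(np.argmax(q_row))

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: a_next is sampled from the SAME epsilon-greedy policy that
    generated the data, so the policy being evaluated is the behavior policy."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the data may come from any behavior policy; the target uses
    the greedy (target) policy via the max over actions."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```

Because the max in `q_learning_update` does not depend on how the actions were chosen, Q-learning can learn optimal action values from experience generated by a different, exploratory behavior policy.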
Outline
- Motivating example
- TD learning of state values
- TD learning of action values
- TD learning of optimal action values
- A unified point of view
- Summary
A unified point of view
All the algorithms introduced above can be unified in the following expression:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\left[q_t(s_t, a_t) - \bar{q}_t\right]$$
Expression of TD target
Here, $\bar{q}_t$ is the TD target, which is different for different algorithms:
| Algorithm | Expression of $\bar{q}_t$ |
|---|---|
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ |
| n-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
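As a hedged illustration (not the lecture's reference code; the function and variable names are assumptions), the TD targets in the table above and the unified update can be written for a tabular action-value array `q` of shape `(n_states, n_actions)`:

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma):
    # r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
    return r + gamma * q[s_next, a_next]

def n_step_sarsa_target(q, rewards, s_n, a_n, gamma):
    # rewards = [r_{t+1}, ..., r_{t+n}]; bootstrap from q_t(s_{t+n}, a_{t+n})
    g = sum(gamma**k * rk for k, rk in enumerate(rewards))
    return g + gamma**len(rewards) * q[s_n, a_n]

def expected_sarsa_target(q, r, s_next, pi_row, gamma):
    # pi_row[a] = pi_t(a | s_{t+1}); expectation over the next action
    return r + gamma * float(np.dot(pi_row, q[s_next]))

def q_learning_target(q, r, s_next, gamma):
    # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
    return r + gamma * float(np.max(q[s_next]))

def monte_carlo_target(rewards, gamma):
    # full discounted return of the rest of the episode; no bootstrapping
    return sum(gamma**k * rk for k, rk in enumerate(rewards))

def unified_update(q, s, a, q_bar, alpha):
    # q_{t+1}(s_t,a_t) = q_t(s_t,a_t) - alpha_t(s_t,a_t) * [q_t(s_t,a_t) - q_bar_t]
    q[s, a] -= alpha * (q[s, a] - q_bar)
```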
Equation to solve
| Algorithm | Equation to solve |
|---|---|
| Monte Carlo | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\right]$ |
| n-step Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\right]$ |
| Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$ |
| Expected Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi}\left[q_\pi(S_{t+1}, A_{t+1})\right] \mid S_t = s, A_t = a\right]$ |
| Q-learning | BOE: $q(s,a) = \mathbb{E}\left[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a\right]$ |

Here, BE denotes a Bellman equation of $q_\pi$ and BOE the Bellman optimality equation.
Need data
| Algorithm | Data needed |
|---|---|
| Monte Carlo | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$, i.e., the rest of the episode starting from $(s_t, a_t)$ |
| n-step Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$ |
| Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
| Q-learning | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
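To show how the data in the table above is collected and fed into the unified update, here is a hedged sketch of one Sarsa episode, assuming a hypothetical tabular environment exposing `reset()` and `step(a) -> (s_next, r, done)`; none of these names come from the lecture.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    # same helper as in the earlier sketch
    return int(rng.integers(len(q_row))) if rng.random() < eps else int(np.argmax(q_row))

def run_sarsa_episode(env, q, alpha=0.1, gamma=0.9, eps=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    s = env.reset()
    a = epsilon_greedy(q[s], eps, rng)                 # a_t from the behavior policy
    done = False
    while not done:
        s_next, r, done = env.step(a)                  # observe r_{t+1}, s_{t+1}
        a_next = epsilon_greedy(q[s_next], eps, rng)   # a_{t+1}: the fifth element Sarsa needs
        q_bar = r + gamma * q[s_next, a_next]          # Sarsa TD target
        q[s, a] -= alpha * (q[s, a] - q_bar)           # unified update form
        s, a = s_next, a_next
```

Q-learning would use the same loop but needs only $(s_t, a_t, r_{t+1}, s_{t+1})$ per update: the target becomes `r + gamma * q[s_next].max()`, and `a_next` is needed only for acting, not for the update.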
Summary
- Various TD learning algorithms
- Expressions, mathematical interpretations, implementations, relationships
- A unified point of view
Later
From tabular representation to function representation ⇒ 08 Value Function Methods