07 Temporal-Difference Methods
- Classic RL algorithms
- Algorithms:
  - TD learning of state values
  - Sarsa: TD learning of action values
  - Q-learning: TD learning of optimal action values
  - on-policy & off-policy
- Unified point of view
on-policy vs. off-policy
In reinforcement learning there are two policies. One is the behavior policy, which generates the experience data; the other is the target policy, the policy we keep improving and hope will converge to the optimal policy. If the two policies must be the same policy, the method is on-policy; if they are allowed to differ, it is off-policy, as the sketch below illustrates.
- on-policy: you play chess yourself and reflect on your own games to work out the best strategy;
- off-policy: someone else plays, you watch their games to work out the best strategy, and you exchange ideas occasionally
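As a concrete contrast, here is a minimal sketch of my own (not code from the lecture; the function names are invented). The two tabular update rules differ only in which policy supplies the bootstrapped value: Sarsa uses the action actually taken by the behavior policy, so the improved policy and the data-generating policy coincide, while Q-learning bootstraps from the greedy target policy regardless of how the data were collected.

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: a_next is the action the behavior policy actually takes, and the
    # same epsilon-greedy policy is then improved from q (behavior == target).
    q[s, a] += alpha * (r + gamma * q[s_next, a_next] - q[s, a])

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    # Off-policy: the max is taken under the greedy target policy, so the data
    # may come from a different, exploratory behavior policy (behavior != target allowed).
    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
```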
Outline
- Motivating example
- TD learning of state values
- TD learning of action values
- TD learning of optimal action values
- A unified point of view
- Summary
A unified point of view
All the algorithms introduced above can be unified in the following expression:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \bar{q}_t\big]$$
Expression of TD target
Here $\bar{q}_t$ is the TD target, which differs from algorithm to algorithm:

Algorithm | Expression of the TD target $\bar{q}_t$
---|---
Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
n-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n})$
Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$
Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$
Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$
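To make the unification concrete, here is a minimal sketch (my own illustration; names such as `td_target` and `unified_update` are invented): every algorithm runs the same update $q(s_t, a_t) \leftarrow q(s_t, a_t) - \alpha\,[q(s_t, a_t) - \bar{q}_t]$ and only the construction of the TD target changes.

```python
import numpy as np

def td_target(q, transition, gamma, algorithm, pi=None):
    """Build the TD target of one tabular algorithm (illustrative sketch only)."""
    s, a, r, s_next, a_next = transition
    if algorithm == "Sarsa":
        return r + gamma * q[s_next, a_next]
    if algorithm == "Expected Sarsa":
        return r + gamma * float(np.dot(pi[s_next], q[s_next]))  # expectation under pi
    if algorithm == "Q-learning":
        return r + gamma * np.max(q[s_next])                      # greedy target policy
    raise ValueError(f"unknown algorithm: {algorithm}")

def mc_target(rewards, gamma):
    """Monte Carlo target: discounted return of the remaining episode.
    n-step Sarsa truncates this sum after n rewards and adds gamma**n * q[s_{t+n}, a_{t+n}]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def unified_update(q, s, a, target, alpha):
    """q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - alpha * (q_t(s_t, a_t) - target)."""
    q[s, a] -= alpha * (q[s, a] - target)
```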
Equation to solve
Algorithm | Equation to solve
---|---
Monte Carlo | BE: $q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\right]$
n-step Sarsa | BE: $q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\right]$
Sarsa | BE: $q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$
Expected Sarsa | BE: $q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})}\left[q_\pi(S_{t+1}, A_{t+1})\right] \mid S_t = s, A_t = a\right]$
Q-learning | BOE: $q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\right]$

(BE: Bellman equation of $q_\pi$; BOE: Bellman optimality equation.)
Data needed

Algorithm | Data needed
---|---
Monte Carlo | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$ (the rest of the episode starting from $(s_t, a_t)$)
n-step Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$
Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$
Expected Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1})$
Q-learning | $(s_t, a_t, r_{t+1}, s_{t+1})$
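The rows above map directly onto how much of a trajectory an implementation has to collect before it can update. Below is a toy, self-contained sketch of my own (the two-state dynamics and the uniform behavior policy are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma, alpha = 2, 2, 0.9, 0.1
q_sarsa = np.zeros((num_states, num_actions))
q_qlearning = np.zeros((num_states, num_actions))

def step(s, a):
    """Invented toy dynamics, purely for illustration."""
    s_next = (s + a) % num_states
    return s_next, (1.0 if s_next == 0 else 0.0)

# One interaction step of the behavior policy (uniform here for brevity;
# in practice it would typically be epsilon-greedy in q).
s = 0
a = int(rng.integers(num_actions))
s_next, r = step(s, a)
a_next = int(rng.integers(num_actions))

# Sarsa consumes the full tuple (s, a, r, s_next, a_next):
q_sarsa[s, a] += alpha * (r + gamma * q_sarsa[s_next, a_next] - q_sarsa[s, a])
# Q-learning (and Expected Sarsa) only needs (s, a, r, s_next):
q_qlearning[s, a] += alpha * (r + gamma * np.max(q_qlearning[s_next]) - q_qlearning[s, a])
# n-step Sarsa and Monte Carlo would keep stepping for n steps, or to the end
# of the episode, before forming their targets.
```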
Summary
- Various TD learning algorithms
- Expressions, mathematical interpretations, implementations, and relationships
- A unified point of view
Later
tabular representation to function representation ⇒ 08 Value Function Methods