07 Temporal-Difference Methods

  • Classic RL algorithms
  • Algorithms:
    • TD learning of state values
    • Sarsa: TD learning of action values
    • Q-learning: TD learning of optimal action values
      • on-policy & off-policy
    • Unified point of view

on-policy vs. off-policy

Reinforcement learning involves two policies: the behavior policy, which is used to generate the experience data, and the target policy, which is the policy we keep improving and hope will converge to the optimal policy. If the two are the same policy, the algorithm is on-policy; if they are allowed to differ, it is off-policy. A minimal sketch contrasting the two follows the list below.

  • on-policy: you play chess yourself and work out the best strategy from your own games;
  • off-policy: someone else plays the games, and you work out the best strategy by watching them, occasionally exchanging ideas.
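A minimal tabular sketch of the distinction, assuming a NumPy array `q` of action values; the helper names (`epsilon_greedy`, `alpha`, `gamma`) are illustrative, not from the slides. Sarsa's next action a' must be drawn from the very policy being improved (on-policy), whereas Q-learning's target takes a max over actions, so its transitions may come from any behavior policy (off-policy).

```python
import numpy as np

def epsilon_greedy(q, s, n_actions, eps=0.1):
    """Illustrative behavior policy used to generate experience."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(q[s]))

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: a_next is sampled from the SAME policy being improved.
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: the target maximizes over actions, so the transition
    # (s, a, r, s') may come from any behavior policy.
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] -= alpha * (q[s, a] - td_target)
```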

Outline

  1. Motivating example
  2. TD learning of state values
  3. TD learning of action values
  4. TD learning of optimal action values
  5. A unified point of view
  6. Summary

A unified point of view

All the algorithms introduced above can be unified in the following expression:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[\, q_t(s_t, a_t) - \bar{q}_t \,\big],$$

where $\bar{q}_t$ is the TD target.

Expression of TD target

The TD target $\bar{q}_t$ is different for different algorithms:

| Algorithm | Expression of $\bar{q}_t$ |
| --- | --- |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ |
| n-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
Equation to solve

Here BE denotes the Bellman equation of $q_\pi$ for a given policy $\pi$, and BOE the Bellman optimality equation.

| Algorithm | Equation to solve |
| --- | --- |
| Monte Carlo | BE: $q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\big]$ |
| n-step Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\big]$ |
| Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big]$ |
| Expected Sarsa | BE: $q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi}\big[q_\pi(S_{t+1}, A_{t+1})\big] \mid S_t = s, A_t = a\big]$ |
| Q-learning | BOE: $q(s,a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\big]$ |
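One step the table leaves implicit: Expected Sarsa's TD target is exactly the expectation of Sarsa's TD target over the next action,

$$\mathbb{E}_{A_{t+1} \sim \pi_t(\cdot \mid s_{t+1})}\big[q_t(s_{t+1}, A_{t+1})\big] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a),$$

so Sarsa, n-step Sarsa, Expected Sarsa, and Monte Carlo all estimate $q_\pi$ of a given policy (a Bellman equation) and need a separate policy improvement step, while only Q-learning's $\max_a$ turns the equation into a Bellman optimality equation, whose solution is the optimal action value directly.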

Data needed

| Algorithm | Data needed |
| --- | --- |
| Monte Carlo | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$, a whole episode starting from $(s_t, a_t)$ |
| n-step Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$ |
| Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ |
| Expected Sarsa | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
| Q-learning | $(s_t, a_t, r_{t+1}, s_{t+1})$ |
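A brief sketch of collecting exactly the samples each row lists, reusing the `epsilon_greedy` helper from the earlier sketch; the environment interface (`env.step` returning `(s_next, r)`, `env.is_terminal`) is an illustrative assumption, not a real library API. Sarsa additionally needs the next action $a'$, Q-learning (and Expected Sarsa) only need $(s, a, r, s')$, so such transitions can come from any behavior policy, and Monte Carlo needs a whole episode starting from $(s, a)$.

```python
def collect_sarsa_sample(env, q, s, a, eps=0.1):
    """Return (s, a, r, s', a') with a' drawn from the current policy."""
    s_next, r = env.step(s, a)                # illustrative environment API
    a_next = epsilon_greedy(q, s_next, q.shape[1], eps)
    return s, a, r, s_next, a_next

def collect_q_learning_sample(env, q, s, a):
    """Return (s, a, r, s'); no next action is needed, so any behavior
    policy (or a replay buffer) can supply such transitions."""
    s_next, r = env.step(s, a)
    return s, a, r, s_next

def collect_mc_episode(env, q, s, a, eps=0.1, max_steps=1000):
    """Return the reward sequence of one episode starting from (s, a)."""
    rewards = []
    for _ in range(max_steps):
        s_next, r = env.step(s, a)
        rewards.append(r)
        if env.is_terminal(s_next):           # illustrative terminal check
            break
        s, a = s_next, epsilon_greedy(q, s_next, q.shape[1], eps)
    return rewards
```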

Summary

  1. Various TD learning algorithms
  2. Expressions, mathematical interpretations, implementations, relationships
  3. A unified point of view

Later

From tabular representation to function representation: 08 Value Function Methods