The Simplest Actor-Critic: Q Actor-Critic (QAC)
Revisit
Revisit the idea of policy gradient introduced in the last lecture.
- A scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$.
- The gradient-ascent algorithm maximizing $J(\theta)$ is
  $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)\big]$$
- The stochastic gradient-ascent algorithm is
  $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t)$$
This expression is very important! We can directly see the actor and the critic in it:
- The policy-update expression above corresponds to the actor (a minimal sketch of this update is shown below)!
- The algorithm estimating $q_t(s_t, a_t)$ corresponds to the critic!
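Here is a minimal sketch of the actor update alone, assuming a softmax policy over linear state-action features. The names `phi` (feature map), `actions`, and the scalar `q_t` are placeholder inputs introduced for illustration, not defined in the lecture; `q_t` is whatever estimate the critic supplies.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi(a|s, theta): softmax over linear preferences theta^T phi(s, a)."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_step(theta, phi, s_t, a_t, q_t, actions, alpha=0.001):
    """One stochastic gradient-ascent step of the actor:
    theta <- theta + alpha * grad_theta ln pi(a_t|s_t, theta) * q_t(s_t, a_t).
    For a softmax policy, grad ln pi(a|s) = phi(s, a) - sum_b pi(b|s) phi(s, b).
    """
    probs = softmax_policy(theta, phi, s_t, actions)
    expected_phi = sum(p * phi(s_t, b) for p, b in zip(probs, actions))
    grad_log_pi = phi(s_t, a_t) - expected_phi
    return theta + alpha * grad_log_pi * q_t   # q_t is supplied by the critic
```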
How to get $q_t(s_t, a_t)$?
So far, we have studied two ways to estimate action values:
- Monte Carlo learning: If MC is used, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient.
- Temporal difference learning: If TD is used, such algorithms are usually called actor-critic (both estimates are contrasted in the sketch after this list).
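To make the two options concrete, here is a small sketch contrasting the two estimates of $q_t(s_t, a_t)$; the function names and the `rewards`/`q_next` inputs are hypothetical.

```python
def mc_return(rewards, gamma=0.99):
    """Monte Carlo estimate of q_t(s_t, a_t): the discounted return
    G = r_{t+1} + gamma*r_{t+2} + ..., available only after the episode ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sarsa_target(r, q_next, gamma=0.99):
    """TD (SARSA) estimate of q_t(s_t, a_t): bootstrap from the critic's
    current estimate q(s_{t+1}, a_{t+1}) instead of waiting for the episode to end."""
    return r + gamma * q_next
```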
Core Idea
Use a one-step TD estimate in place of the Monte Carlo return $G$.
Implementation
- Critic: SARSA with value function approximation
- Actor: policy gradient
- Q Actor-Critic (QAC) is the simplest actor-critic algorithm; it reveals the core idea of AC (see the sketch after this list).
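A minimal sketch of the full QAC loop under stated assumptions: a gym-like environment whose `reset()` returns a state and whose `step(a)` returns `(s', r, done)`, a feature map `phi(s, a)`, and linear approximators for both the critic's q values and the actor's preferences. These interfaces are illustrative, not from the lecture.

```python
import numpy as np

def qac(env, phi, n_actions, gamma=0.99, alpha_w=0.01, alpha_theta=0.001,
        n_episodes=1000):
    """QAC sketch: SARSA critic q(s, a; w) = w^T phi(s, a),
    softmax policy-gradient actor with preferences theta^T phi(s, a)."""
    dim = len(phi(env.reset(), 0))
    w = np.zeros(dim)        # critic parameters
    theta = np.zeros(dim)    # actor parameters

    def pi(s):
        prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(n_episodes):
        s = env.reset()
        a = np.random.choice(n_actions, p=pi(s))
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = np.random.choice(n_actions, p=pi(s2))

            # Critic (SARSA + value function approximation):
            # move q(s, a; w) toward the one-step target r + gamma * q(s', a'; w)
            q_sa = w @ phi(s, a)
            q_target = r if done else r + gamma * (w @ phi(s2, a2))
            w = w + alpha_w * (q_target - q_sa) * phi(s, a)

            # Actor (policy gradient):
            # theta <- theta + alpha * grad ln pi(a|s, theta) * q(s, a; w)
            probs = pi(s)
            expected_phi = sum(p * phi(s, b) for p, b in zip(probs, range(n_actions)))
            grad_log_pi = phi(s, a) - expected_phi
            theta = theta + alpha_theta * grad_log_pi * (w @ phi(s, a))

            s, a = s2, a2
    return theta, w
```

Typically the actor step size is chosen smaller than the critic's, so the critic can track the value of the current policy while the policy changes slowly.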
Later
The problem that remains here
Because the critic keeps evaluating its own estimates (bootstrapping), the estimated q values can grow larger and larger. A baseline is therefore introduced so that $Q$ becomes $Q - \text{baseline}$, i.e., the advantage.
Q Actor-Critic (QAC) ⇒ Advantage Actor-Critic (A2C)
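A minimal sketch of how the update changes in A2C: the critic now learns a state value $v(s; w)$, and the TD error $\delta_t = r_{t+1} + \gamma v(s_{t+1}) - v(s_t)$ serves as an estimate of the advantage $q(s_t, a_t) - v(s_t)$, replacing the raw q value in the actor step. The feature maps `phi_v` and `phi_pi` are assumed inputs, not from the lecture.

```python
import numpy as np

def a2c_step(w, theta, phi_v, phi_pi, s, a, r, s2, done, n_actions,
             gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One advantage actor-critic update (sketch): the TD error on v(s; w)
    is used as the advantage estimate in the actor's policy-gradient step."""
    v_s = w @ phi_v(s)
    v_s2 = 0.0 if done else w @ phi_v(s2)
    delta = r + gamma * v_s2 - v_s                        # advantage estimate

    w = w + alpha_w * delta * phi_v(s)                    # critic: TD(0) on v

    prefs = np.array([theta @ phi_pi(s, b) for b in range(n_actions)])
    prefs -= prefs.max()
    probs = np.exp(prefs)
    probs /= probs.sum()
    expected_phi = sum(p * phi_pi(s, b) for p, b in zip(probs, range(n_actions)))
    grad_log_pi = phi_pi(s, a) - expected_phi
    theta = theta + alpha_theta * grad_log_pi * delta     # actor uses the advantage
    return w, theta
```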