Simplest Q Actor-Critic (QAC)

Revisit

Revisit the idea of policy gradient introduced in the last lecture.

  1. A scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$.
  2. The gradient-ascent algorithm maximizing $J(\theta)$ is
     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S, A)\big]$$
  3. The stochastic gradient-ascent algorithm is
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t, a_t)$$
     This expression is very important! We can directly see the actor and the critic from it:
    • The policy-update expression above corresponds to the actor!
    • The algorithm estimating $q_t(s_t, a_t)$ corresponds to the critic!
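To make the actor step concrete, here is a minimal NumPy sketch assuming a softmax policy over linear state-action features; the function names and the placeholder critic value `q_sa` are illustrative, not from the lecture.

```python
# A minimal sketch of the stochastic gradient-ascent (actor) step, assuming a
# softmax policy over linear state-action features; the names and the placeholder
# critic value q_sa are illustrative, not from the lecture.
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s, theta) for all actions; phi has shape (n_actions, n_features)."""
    logits = phi @ theta
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta ln pi(a|s, theta) = phi(s,a) - sum_b pi(b|s, theta) phi(s,b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros(n_features)                      # policy parameters
phi = rng.normal(size=(n_actions, n_features))    # features of the current state
a = rng.choice(n_actions, p=softmax_policy(theta, phi))
q_sa, alpha = 1.0, 0.1                            # placeholder critic estimate of q(s_t, a_t), step size
theta = theta + alpha * grad_log_pi(theta, phi, a) * q_sa   # actor update
```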

How to get $q_t(s_t, a_t)$?

So far, we have studied two ways to estimate action values:

  1. Monte Carlo learning: If MC is used, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient.
  2. Temporal difference learning: If TD is used, such algorithms are usually called actor-critic; the two targets are contrasted right after this list.
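Concretely, the two choices differ only in how $q_t(s_t, a_t)$ is estimated (standard formulas, written out here as an aside rather than quoted from the lecture):

$$\text{MC (REINFORCE): } q_t(s_t, a_t) \approx G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

$$\text{TD (Sarsa-style critic): } q_t(s_t, a_t) \approx q(s_t, a_t, w), \quad w_{t+1} = w_t + \alpha_w \big[r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]\, \nabla_w q(s_t, a_t, w_t)$$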

Core Idea

Replace the return $G_t$ used by MC with a one-step TD target.

Implementation

Later
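Until this section is filled in, here is a rough, self-contained sketch under the standard QAC assumptions (a Sarsa-style TD critic with tabular one-hot features and a softmax actor); the toy chain MDP and all names are made up purely for illustration.

```python
# A rough, self-contained QAC sketch, assuming a Sarsa-style TD critic with
# tabular (one-hot) features and a softmax actor; the toy chain MDP and all
# names here are made up for illustration only.
import numpy as np

n_states, n_actions = 5, 2                 # chain: action 0 = left, 1 = right
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.05
rng = np.random.default_rng(0)

def step(s, a):
    """Toy dynamics: move along the chain, reward 1 only at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def policy(theta, s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

q = np.zeros((n_states, n_actions))        # critic: q(s, a, w) with one-hot features
theta = np.zeros((n_states, n_actions))    # actor parameters

s = 0
a = rng.choice(n_actions, p=policy(theta, s))
for t in range(5000):
    s_next, r = step(s, a)
    a_next = rng.choice(n_actions, p=policy(theta, s_next))

    # Critic (value update, Sarsa): w <- w + alpha_w * [r + gamma q(s',a') - q(s,a)] grad_w q
    td_error = r + gamma * q[s_next, a_next] - q[s, a]
    q[s, a] += alpha_w * td_error

    # Actor (policy update): theta <- theta + alpha_theta * grad ln pi(a|s) * q(s, a)
    grad_log_pi = -policy(theta, s)
    grad_log_pi[a] += 1.0                  # grad of ln softmax w.r.t. theta[s, :]
    theta[s] += alpha_theta * grad_log_pi * q[s, a]

    s, a = s_next, a_next
```

Note that both updates use the same on-policy sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$: the critic refines $q$ by TD, and the actor immediately uses that $q$ in its gradient step.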

A problem with this approach

Because the critic keeps evaluating its own estimates (bootstrapping), $q$ can keep growing larger and larger. So a baseline is introduced, turning $Q$ into $Q - \text{baseline}$, i.e., the advantage.
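In symbols (standard definitions, added here as a reminder rather than quoted from the lecture): subtracting a baseline that depends only on the state does not change the expected gradient, because $\mathbb{E}_{A \sim \pi(\cdot|s)}\big[\nabla_\theta \ln \pi(A|s, \theta)\big] = 0$; choosing the baseline $v_\pi(s)$ turns $q_\pi(s, a)$ into the advantage used by A2C:

$$A_\pi(s, a) = q_\pi(s, a) - v_\pi(s), \qquad \theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, A_t(s_t, a_t)$$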
