Simplest Q Actor-Critic (QAC)

Revisit

Revisit the idea of policy gradient introduced in the last lecture.

  1. A scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$.
  2. The gradient-ascent algorithm maximizing $J(\theta)$ is
     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S, A)\big]$$
  3. The stochastic gradient-ascent algorithm is
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t, a_t)$$
     This expression is very important! We can directly see the actor and the critic from it:
    • The policy-update expression above corresponds to the actor!
    • The algorithm estimating $q_t(s_t, a_t)$ corresponds to the critic!
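To make the actor step concrete, here is a minimal NumPy sketch assuming a softmax policy over linear state-action features; the function names and the placeholder critic value `q_sa` are illustrative, not from the lecture.

```python
# A minimal sketch of the stochastic gradient-ascent (actor) step, assuming a
# softmax policy over linear state-action features; the names and the placeholder
# critic value q_sa are illustrative, not from the lecture.
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s, theta) for all actions; phi has shape (n_actions, n_features)."""
    logits = phi @ theta
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta ln pi(a|s, theta) = phi(s,a) - sum_b pi(b|s, theta) phi(s,b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros(n_features)                      # policy parameters
phi = rng.normal(size=(n_actions, n_features))    # features of the current state
a = rng.choice(n_actions, p=softmax_policy(theta, phi))
q_sa, alpha = 1.0, 0.1                            # placeholder critic estimate of q(s_t, a_t), step size
theta = theta + alpha * grad_log_pi(theta, phi, a) * q_sa   # actor update
```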

How to get $q_t(s_t, a_t)$?

So far, we have studied two ways to estimate action values:

  1. Monte Carlo learning: If MC is used, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient.
  2. Temporal difference learning: If TD is used, such algorithms are usually called actor-critic; the two targets are contrasted right after this list.
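Concretely, the two choices differ only in how $q_t(s_t, a_t)$ is estimated (standard formulas, written out here as an aside rather than quoted from the lecture):

$$\text{MC (REINFORCE): } q_t(s_t, a_t) \approx G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

$$\text{TD (Sarsa-style critic): } q_t(s_t, a_t) \approx q(s_t, a_t, w), \quad w_{t+1} = w_t + \alpha_w \big[r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]\, \nabla_w q(s_t, a_t, w_t)$$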

Core Idea

Replace the return $G_t$ used by MC with a one-step TD target.

Implementation

Later
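Until this section is filled in, here is a rough, self-contained sketch under the standard QAC assumptions (a Sarsa-style TD critic with tabular one-hot features and a softmax actor); the toy chain MDP and all names are made up purely for illustration.

```python
# A rough, self-contained QAC sketch, assuming a Sarsa-style TD critic with
# tabular (one-hot) features and a softmax actor; the toy chain MDP and all
# names here are made up for illustration only.
import numpy as np

n_states, n_actions = 5, 2                 # chain: action 0 = left, 1 = right
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.05
rng = np.random.default_rng(0)

def step(s, a):
    """Toy dynamics: move along the chain, reward 1 only at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def policy(theta, s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

q = np.zeros((n_states, n_actions))        # critic: q(s, a, w) with one-hot features
theta = np.zeros((n_states, n_actions))    # actor parameters

s = 0
a = rng.choice(n_actions, p=policy(theta, s))
for t in range(5000):
    s_next, r = step(s, a)
    a_next = rng.choice(n_actions, p=policy(theta, s_next))

    # Critic (value update, Sarsa): w <- w + alpha_w * [r + gamma q(s',a') - q(s,a)] grad_w q
    td_error = r + gamma * q[s_next, a_next] - q[s, a]
    q[s, a] += alpha_w * td_error

    # Actor (policy update): theta <- theta + alpha_theta * grad ln pi(a|s) * q(s, a)
    grad_log_pi = -policy(theta, s)
    grad_log_pi[a] += 1.0                  # grad of ln softmax w.r.t. theta[s, :]
    theta[s] += alpha_theta * grad_log_pi * q[s, a]

    s, a = s_next, a_next
```

Note that both updates use the same on-policy sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$: the critic refines $q$ by TD, and the actor immediately uses that $q$ in its gradient step.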

A problem with this approach

Because the critic keeps evaluating its own estimates (bootstrapping), $q$ can keep growing larger and larger. So a baseline is introduced, turning $Q$ into $Q - \text{baseline}$, i.e., the advantage.
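In symbols (standard definitions, added here as a reminder rather than quoted from the lecture): subtracting a baseline that depends only on the state does not change the expected gradient, because $\mathbb{E}_{A \sim \pi(\cdot|s)}\big[\nabla_\theta \ln \pi(A|s, \theta)\big] = 0$; choosing the baseline $v_\pi(s)$ turns $q_\pi(s, a)$ into the advantage used by A2C:

$$A_\pi(s, a) = q_\pi(s, a) - v_\pi(s), \qquad \theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, A_t(s_t, a_t)$$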
