Advantage Actor-Critic (A2C)
Introduction
The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} \big(q_\pi(s,a) - b(s)\big)\, \nabla \pi(a\mid s,\theta)$$

The baseline can be any function as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$$\sum_{a} b(s)\, \nabla \pi(a\mid s,\theta) = b(s)\, \nabla \sum_{a} \pi(a\mid s,\theta) = b(s)\, \nabla 1 = 0$$
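The same statement can also be written in the expectation form used in the rest of these notes (with $\eta$ the state distribution and $A \sim \pi(\cdot\mid S,\theta)$; whether this is an equality or a proportionality depends on how $J$ and $\eta$ are defined):

$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\Big[\nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)\Big]$$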
Baseline invariance
Note that the Intro above briefly described how a baseline is introduced (adapted from Sutton & Barto, Section 13.4): adding a random variable or function that does not depend on the action keeps the original policy gradient equation unchanged and does not invalidate anything proved before. Below we analyze this invariance with respect to the baseline.
Property: the policy gradient is invariant to an additional baseline:

$$\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, q_\pi(S,A)\big] = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)\big]$$

Here the additional baseline $b(s)$ is a scalar function of $s \in \mathcal{S}$.
Why is it valid?

Because the expected value of the baseline term is zero:

$$\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, b(S)\big] = 0$$

Details:

$$
\begin{aligned}
\mathbb{E}_{A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, b(S)\big]
&= \sum_{a} \pi(a\mid S,\theta)\,\nabla_\theta \ln \pi(a\mid S,\theta)\, b(S) \\
&= b(S) \sum_{a} \nabla_\theta \pi(a\mid S,\theta)
 = b(S)\, \nabla_\theta \sum_{a} \pi(a\mid S,\theta)
 = b(S)\, \nabla_\theta 1 = 0
\end{aligned}
$$
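As a quick sanity check of the identity above, the following sketch (not from the original notes; it assumes a softmax policy over five actions in a single state and an arbitrary baseline value) verifies numerically that $\mathbb{E}_{A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid s,\theta)\, b(s)\big]$ is the zero vector:

```python
import numpy as np

# Numerical check: for a softmax policy pi(a|s) = exp(theta_a) / sum_b exp(theta_b),
# the expectation over actions of grad_theta log pi(a|s) * b(s) is the zero vector.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # preferences for 5 actions in some state s
b = 3.7                             # arbitrary baseline value b(s)

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# grad_theta log pi(a|s) = onehot(a) - pi  (gradient of log-softmax);
# row a of the matrix below holds grad_theta log pi(a|s).
grad_log_pi = np.eye(len(theta)) - pi

expectation = (pi[:, None] * grad_log_pi * b).sum(axis=0)
print(expectation)   # approximately [0, 0, 0, 0, 0]
```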
Why is it useful?

The gradient is $\nabla_\theta J(\theta) = \mathbb{E}[X(S,A)]$, where

$$X(S,A) \doteq \nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)$$

- we have that $\mathbb{E}[X]$ is invariant to $b(S)$, from above;
- we want a $b(S)$ that can minimize $\operatorname{var}(X)$; usually $b(S) = v_\pi(S)$ is chosen, which is a suboptimal baseline (the optimal baseline is recalled right after this list).
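For reference, the scalar baseline that minimizes $\operatorname{var}(X)$ state by state has the following standard closed form (not derived in these notes; stated here only to show why $v_\pi$ is a convenient suboptimal substitute):

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2\big]}, \qquad s \in \mathcal{S}$$

Dropping the weights $\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2$ gives the simpler choice $b(s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)$, which is exactly the baseline used below.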
Why introduce a baseline

- Introducing a baseline does not change $\mathbb{E}[X]$, i.e., it does not change the original policy gradient equation.
- The purpose of choosing a (near-)optimal baseline is to reduce $\operatorname{var}(X)$; usually $v_\pi(S)$ is chosen as the baseline. We want every sampled $x$ to be as close to $\mathbb{E}[X]$ as possible, which reduces $\operatorname{var}(X)$.
Algorithm
The gradient-ascent algorithm is:

$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A\mid S,\theta_t)\,\big(q_\pi(S,A) - v_\pi(S)\big)\Big] = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A\mid S,\theta_t)\,\delta_\pi(S,A)\Big]$$

where $\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$ is called the advantage function.

The stochastic version is:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(q_t(s_t,a_t) - v_t(s_t)\big)$$

Referring to TD methods, we can estimate the advantage $q_t(s_t,a_t) - v_t(s_t)$ by an MC error (a sampled return minus $v_t(s_t)$) or replace it by the TD error:

$$\delta_t \doteq r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

So the algorithm can be written in the TD version:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)\big)$$

Benefit: we only need one network to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.
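To make the TD version concrete, here is a minimal self-contained sketch. The 5-state chain environment, the tabular softmax actor, the tabular critic, and all hyperparameters are assumptions made for this illustration, not part of the notes:

```python
import numpy as np

# Minimal tabular one-step actor-critic on a toy 5-state chain MDP (illustrative only).
n_states, n_actions, gamma = 5, 2, 0.9
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
v = np.zeros(n_states)                    # critic: state-value estimates v(s)
alpha_theta, alpha_v = 0.1, 0.1
rng = np.random.default_rng(0)

def pi(s):
    """Softmax policy pi(.|s, theta)."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Chain dynamics: a=1 moves right, a=0 moves left; reaching the last state gives +1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        p = pi(s)
        a = rng.choice(n_actions, p=p)          # sample from the stochastic policy
        s_next, r, done = step(s, a)

        # TD error, used as the advantage estimate: delta = r + gamma*v(s') - v(s)
        delta = r + (0.0 if done else gamma * v[s_next]) - v[s]

        # critic update (TD(0)):  v(s) <- v(s) + alpha_v * delta
        v[s] += alpha_v * delta

        # actor update:  theta[s] <- theta[s] + alpha_theta * delta * grad log pi(a|s)
        # for a tabular softmax: grad_{theta[s]} log pi(a|s) = onehot(a) - pi(.|s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi

        s = s_next

print("state values:", np.round(v, 2))
print("P(move right):", np.round([pi(s)[1] for s in range(n_states)], 2))
```

Each transition yields one TD error $\delta_t$, which drives both the critic's TD(0) update and the actor's policy-gradient step; note that only $v$ is learned, never $q$.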
Analysis
Same as the analysis in the last lecture, the stochastic update can be rewritten as

$$\theta_{t+1} = \theta_t + \alpha\, \beta_t\, \nabla_\theta \pi(a_t\mid s_t,\theta_t), \qquad \beta_t \doteq \frac{q_t(s_t,a_t) - v_t(s_t)}{\pi(a_t\mid s_t,\theta_t)},$$

so the coefficient $\beta_t$ can well balance exploration and exploitation; but what matters here is the relative value (the advantage value $q_t(s_t,a_t) - v_t(s_t)$) rather than the absolute value (the action value $q_t(s_t,a_t)$).
Implementation
The implementation is also simple, compared with the implementation in the last lecture; a sketch with function approximation follows the list below.

- on-policy
- $\pi(a\mid s,\theta_t)$ is stochastic and therefore exploratory, so there is no need to use $\epsilon$-greedy.
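A hedged PyTorch-style sketch of the per-transition update with function approximation, assuming a discrete action space; the network architectures, learning rate, and the `(s, a, r, s_next, done)` interface are illustrative assumptions. It reflects the two points above: actions are sampled directly from the stochastic policy (no $\epsilon$-greedy), and only a single value network $v(s)$ is trained.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sketch only: obs_dim, n_actions, network sizes and lr are assumptions.
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # v(s) only
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def select_action(s):
    # The policy is stochastic, so we sample from it directly: no epsilon-greedy.
    logits = actor(torch.as_tensor(s, dtype=torch.float32))
    return Categorical(logits=logits).sample().item()

def update(s, a, r, s_next, done):
    """One on-policy update from the transition (s, a, r, s_next)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error as the advantage estimate: delta = r + gamma * v(s') - v(s).
    v_s = critic(s).squeeze()
    v_next = torch.zeros(()) if done else critic(s_next).squeeze().detach()
    delta = r + gamma * v_next - v_s

    log_prob = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -log_prob * delta.detach()   # policy-gradient (actor) term
    critic_loss = delta.pow(2)                # semi-gradient TD (critic) term

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```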
Later
on-policy ⇒ off-policy