Monte Carlo Policy Gradient (REINFORCE)
Introduction
- The gradient-ascent algorithm maximizing $J(\theta)$ is
  $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t).$$
- Since the true gradient is unknown, we can replace it by a stochastic one:
  $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_\pi(s_t, a_t).$$
- Furthermore, since $q_\pi(s_t, a_t)$ is unknown, it can be replaced by an estimate:
  $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t)$$
  (a concrete example of $\nabla_\theta \ln \pi$ follows this list).
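As a concrete illustration (an assumed parameterization, not part of the original derivation), for a tabular softmax policy $\pi(a \mid s, \theta) = e^{\theta_{s,a}} / \sum_b e^{\theta_{s,b}}$, the term $\nabla_\theta \ln \pi(a \mid s, \theta)$ has the simple component-wise form

$$\frac{\partial \ln \pi(a \mid s, \theta)}{\partial \theta_{s,b}} = \mathbb{1}[b = a] - \pi(b \mid s, \theta),$$

which the code sketches later in this section use.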
Implementation
If $q_t(s_t, a_t)$ is estimated by Monte Carlo estimation, i.e., by the discounted return collected from $(s_t, a_t)$ to the end of the episode,
$$q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} r_k,$$
then the algorithm is called REINFORCE.
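A minimal sketch of this Monte Carlo estimate, assuming `rewards[k]` holds the reward received after taking $a_k$ in $s_k$ (0-indexed); this is a hypothetical helper, not code from the notes:

```python
# Monte Carlo estimate of q_t(s_t, a_t): the discounted return observed from
# step t to the end of the episode (Version-1-style direct summation).
def mc_q_estimate(rewards, t, gamma=0.99):
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
```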
Version 1
Version 2
As can be seen, many terms in the computation of $g$ above are recomputed repeatedly. Version 2 optimizes this part of the computation and, following the earlier idea, introduces the constant $C$ so that the final update formula becomes more concise (an episode-level sketch follows the review note below).
Review: MC is offline.
Here we cannot update while collecting data; the update can only be performed after the episode has ended. The TD-based methods introduced later allow incremental (online) updates.
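Below is a minimal, hypothetical REINFORCE sketch in the spirit of Version 2: returns are accumulated backwards so nothing is recomputed, and the parameters are updated only once the episode is complete. The tabular softmax parameterization, the `env.reset()` / `env.step()` interface returning `(next_state, reward, done)`, the shape of `theta` (`n_states x n_actions`), and the hyperparameters are all assumptions for illustration; the constant $C$ from the earlier notes is omitted.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s, theta) for a tabular softmax parameterization (assumed form)."""
    prefs = theta[s] - theta[s].max()        # shift for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def run_episode(env, theta):
    """Collect one full episode under the current policy (on-policy)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False             # assumed env interface
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)    # a_t ~ pi(.|s_t, theta_t)
        s_next, r, done = env.step(a)        # assumed to return (s', r, done)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    """Update theta only after the episode is complete (MC is offline)."""
    g = 0.0
    # Walk backwards so each return reuses the next one: g_t = r_{t+1} + gamma * g_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g                   # q_t(s_t, a_t) by Monte Carlo
        s, a = states[t], actions[t]
        grad_log_pi = -softmax_policy(theta, s)      # d ln pi(a|s)/d theta[s,b] = 1[b=a] - pi(b|s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * g * grad_log_pi          # theta <- theta + alpha * q_t * grad ln pi
    return theta
```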
Remarks
How to do sampling?
- How to sample $S$?
  - $S \sim d_\pi$, where the on-policy distribution $d_\pi$ is the long-run behavior under $\pi$.
  - In practice, people usually do not care about it.
- How to sample $A$?
  - $A \sim \pi(A \mid S, \theta)$. Hence, $a_t$ should be sampled following $\pi(\theta_t)$ at $s_t$.
  - Therefore, policy gradient methods are on-policy (a minimal sampling sketch follows this list).
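For instance, sampling $a_t$ on-policy just means drawing it from the current policy itself (toy probabilities, assumed for illustration):

```python
import numpy as np

pi_at_s = np.array([0.2, 0.5, 0.3])     # pi(.|s_t, theta_t) over 3 actions
a_t = np.random.choice(3, p=pi_at_s)    # a_t ~ pi(.|s_t, theta_t): no separate behavior policy
```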
How to interpret this algorithm?
Since
$$\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)},$$
the algorithm can be rewritten as
$$\theta_{t+1} = \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right)}_{\beta_t} \nabla_\theta \pi(a_t \mid s_t, \theta_t) = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t \mid s_t, \theta_t).$$
Applying a first-order Taylor expansion to $\pi$, when $\alpha \beta_t$ is sufficiently small:
$$\pi(a_t \mid s_t, \theta_{t+1}) \approx \pi(a_t \mid s_t, \theta_t) + \big(\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big)^{\top} (\theta_{t+1} - \theta_t) = \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \,\big\| \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big\|^2.$$
Interpretation:
- If $\beta_t > 0$, then the probability of choosing $(s_t, a_t)$ is increased: $\pi(a_t \mid s_t, \theta_{t+1}) > \pi(a_t \mid s_t, \theta_t)$.
- If $\beta_t < 0$, then the probability of choosing $(s_t, a_t)$ is decreased: $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$.
$\beta_t$ can balance exploration and exploitation.
- $\beta_t$ is proportional to $q_t(s_t, a_t)$: a greater $q_t(s_t, a_t)$ ⇒ a greater $\beta_t$ ⇒ a greater $\pi(a_t \mid s_t, \theta_{t+1})$, so the algorithm intends to exploit actions with greater values.
- $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$: a smaller $\pi(a_t \mid s_t, \theta_t)$ ⇒ a greater $|\beta_t|$ ⇒ a larger change in $\pi(a_t \mid s_t, \theta_{t+1})$, so the algorithm intends to explore actions with lower probabilities if $q_t > 0$, and to give up actions with lower probabilities if $q_t < 0$ (a numeric check of both effects follows this list).
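A quick numeric check of this interpretation, using a hypothetical single-state softmax policy over three actions (all numbers are illustrative assumptions):

```python
import numpy as np

def pi(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

theta = np.zeros(3)      # three actions, initially uniform policy
a, alpha = 0, 0.1        # the sampled action and the step size

for q in (+1.0, -1.0):
    grad_log_pi = -pi(theta)
    grad_log_pi[a] += 1.0                          # grad of ln pi(a|theta) for softmax
    theta_new = theta + alpha * q * grad_log_pi    # one stochastic gradient-ascent step
    print(f"q={q:+.0f}: pi(a) {pi(theta)[a]:.3f} -> {pi(theta_new)[a]:.3f}")
# The probability of the sampled action rises when q > 0 and falls when q < 0,
# matching the interpretation above.
```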