Monte Carlo Policy Gradient (REINFORCE)

Introduction

  1. The gradient-ascent algorithm maximizing $J(\theta)$ is
     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)\big].$$
  2. Since the true gradient (the expectation) is unknown, we can replace it by a stochastic one:
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_\pi(s_t, a_t).$$
  3. Furthermore, since $q_\pi(s_t, a_t)$ is unknown, it can be replaced by an estimate $q_t(s_t, a_t)$:
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t).$$
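As a concrete illustration of step 3, here is a minimal sketch of the stochastic update for a linear softmax policy. The feature layout `phi_s`, the step size `alpha`, and the return estimate `q_est` are assumed names for illustration, not part of the original notes.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(.|s, theta) for a linear softmax policy; phi_s has shape (n_actions, n_features)."""
    logits = phi_s @ theta
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_pi(theta, phi_s, a):
    """grad_theta ln pi(a|s, theta) = phi(s, a) - sum_a' pi(a'|s) phi(s, a')."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s

def policy_gradient_step(theta, phi_s, a, q_est, alpha=0.01):
    """One stochastic ascent step: theta <- theta + alpha * grad ln pi(a|s, theta) * q(s, a)."""
    return theta + alpha * q_est * grad_log_pi(theta, phi_s, a)
```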

Implementation

If $q_t(s_t, a_t)$ is estimated by Monte Carlo estimation, i.e., by the discounted return $q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$ collected along the sampled episode, then the algorithm is called REINFORCE.
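A minimal sketch of this Monte Carlo estimate, assuming a finished episode whose rewards are stored in a list `rewards` with discount `gamma` (both assumed names): each $q_t$ is the sum of the discounted rewards that follow step $t$.

```python
def mc_returns_naive(rewards, gamma):
    """Return estimates q_t: for each t, sum the discounted rewards that follow step t."""
    T = len(rewards)
    q = []
    for t in range(T):
        g = 0.0
        for k in range(t, T):               # rewards[k] is the reward received after step k's action
            g += (gamma ** (k - t)) * rewards[k]
        q.append(g)
    return q
```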

Version 1

Version 2

As can be seen, the computation of g above repeats a lot of work. We can optimize away this redundancy and, following the earlier idea, also introduce the constant C, so that the final update formula becomes more concise.
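A minimal sketch of that optimization (same assumed `rewards`/`gamma` names as above; the constant C is left out of this sketch): the return is accumulated backwards, so each $q_t$ is obtained from $q_{t+1}$ in constant time instead of being re-summed.

```python
def mc_returns_backward(rewards, gamma):
    """Same q_t values as the naive version, computed in one backward pass: g <- r + gamma * g."""
    q = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        q[t] = g
    return q
```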

Recap: MC is offline

Here we cannot update while collecting data: we must wait until the episode ends before updating. The TD-based methods introduced later can do incremental (online) updates.
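To make the offline nature concrete, here is a sketch of one REINFORCE iteration built on the earlier sketches; a full episode is generated first, and only then are the returns and the parameter updates computed. The `env.reset()`/`env.step()` interface and the feature map `phi` are assumptions for illustration, not part of the original notes.

```python
def reinforce_episode(env, theta, phi, gamma=0.99, alpha=0.01):
    """Run one episode with the current policy, then update theta from the collected data."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # data collection: no updates yet
        phi_s = phi(s)                                # features for state s, shape (n_actions, n_features)
        a = np.random.choice(len(phi_s), p=softmax_policy(theta, phi_s))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    q = mc_returns_backward(rewards, gamma)           # returns need the finished episode
    for s, a, q_t in zip(states, actions, q):         # updates happen only after the episode ends
        theta = policy_gradient_step(theta, phi(s), a, q_t, alpha)
    return theta
```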

Remarks

How to do sampling?

  • How to sample $S$?
    • $S \sim d_\pi$, where the on-policy distribution $d_\pi$ is a long-run behavior under $\pi$.
    • In practice, people usually do not care about it.
  • How to sample $A$?
    • $A \sim \pi(A \mid S, \theta)$. Hence, $a_t$ should be sampled following $\pi(\theta_t)$ at $s_t$.
    • Therefore, policy gradient methods are on-policy.
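As a small illustration of the second point, the action is drawn from the current policy's own distribution (reusing the assumed `softmax_policy` sketch from above):

```python
import numpy as np

def sample_action(theta, phi_s, rng):
    """Draw a_t ~ pi(.|s_t, theta_t): actions come from the current policy itself (on-policy)."""
    return rng.choice(len(phi_s), p=softmax_policy(theta, phi_s))

# Example: rng = np.random.default_rng(0); a_t = sample_action(theta, phi(s_t), rng)
```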

How to interpret this algorithm?

Since
$$\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)},$$
the algorithm can be rewritten as
$$\theta_{t+1} = \theta_t + \alpha \underbrace{\frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}}_{\beta_t}\, \nabla_\theta \pi(a_t \mid s_t, \theta_t) = \theta_t + \alpha \beta_t\, \nabla_\theta \pi(a_t \mid s_t, \theta_t).$$

Here we apply a first-order Taylor expansion to $\pi(a_t \mid s_t, \theta_{t+1})$.

When $\alpha \beta_t$ is sufficiently small:
$$\pi(a_t \mid s_t, \theta_{t+1}) \approx \pi(a_t \mid s_t, \theta_t) + \big(\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big)^{\top} (\theta_{t+1} - \theta_t) = \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big\|\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big\|^2.$$

Interpretation:

  • If $\beta_t > 0$, then the probability of choosing $(s_t, a_t)$ is increased: $\pi(a_t \mid s_t, \theta_{t+1}) > \pi(a_t \mid s_t, \theta_t)$.
  • If $\beta_t < 0$, then the probability of choosing $(s_t, a_t)$ is decreased: $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$.

$\beta_t$ can balance exploration and exploitation.

  • $\beta_t$ is proportional to $q_t(s_t, a_t)$: when the value estimate is large, $\beta_t$ is large, so the update intends to exploit actions with greater values.
  • $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$: when the current probability is small, $\beta_t$ is large in magnitude, so the update intends to explore actions with lower probabilities if $q_t > 0$, and intends to give up actions with lower probabilities if $q_t < 0$.
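A tiny numerical check of this trade-off, with made-up numbers for illustration:

```python
def beta(q_est, pi_a):
    """beta_t = q_t(s_t, a_t) / pi(a_t | s_t, theta_t)."""
    return q_est / pi_a

# Same positive value estimate, different current probabilities:
print(beta(1.0, 0.5))    #  2.0  -> modest push for an already-likely action (exploitation)
print(beta(1.0, 0.1))    # 10.0  -> strong push toward a rarely chosen action (exploration)
print(beta(-1.0, 0.1))   # -10.0 -> strong push away from a rarely chosen, low-value action
```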