Monte Carlo Policy Gradient (REINFORCE)

Introduction

  1. The gradient-ascent algorithm maximizing $J(\theta)$ is
     $$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)\big].$$
  2. Since the true gradient (the expectation) is unknown, we can replace it by a stochastic one:
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_\pi(s_t, a_t).$$
  3. Furthermore, since $q_\pi(s_t, a_t)$ is unknown, it can be replaced by an estimate $q_t(s_t, a_t)$:
     $$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t).$$
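As a concrete illustration of step 3, here is a minimal sketch of the stochastic update for a linear softmax policy. The feature layout `phi_s`, the step size `alpha`, and the return estimate `q_est` are assumed names for illustration, not part of the original notes.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(.|s, theta) for a linear softmax policy; phi_s has shape (n_actions, n_features)."""
    logits = phi_s @ theta
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_pi(theta, phi_s, a):
    """grad_theta ln pi(a|s, theta) = phi(s, a) - sum_a' pi(a'|s) phi(s, a')."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s

def policy_gradient_step(theta, phi_s, a, q_est, alpha=0.01):
    """One stochastic ascent step: theta <- theta + alpha * grad ln pi(a|s, theta) * q(s, a)."""
    return theta + alpha * q_est * grad_log_pi(theta, phi_s, a)
```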

Implementation

If $q_t(s_t, a_t)$ is estimated by Monte Carlo estimation, i.e., by the discounted return $q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$ collected along the sampled episode, then the algorithm is called REINFORCE.
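A minimal sketch of this Monte Carlo estimate, assuming a finished episode whose rewards are stored in a list `rewards` with discount `gamma` (both assumed names): each $q_t$ is the sum of the discounted rewards that follow step $t$.

```python
def mc_returns_naive(rewards, gamma):
    """Return estimates q_t: for each t, sum the discounted rewards that follow step t."""
    T = len(rewards)
    q = []
    for t in range(T):
        g = 0.0
        for k in range(t, T):               # rewards[k] is the reward received after step k's action
            g += (gamma ** (k - t)) * rewards[k]
        q.append(g)
    return q
```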

Version 1

Version 2

As can be seen, the computation of g above repeats a lot of work. We can optimize away this redundancy and, following the earlier idea, also introduce the constant C, so that the final update formula becomes more concise.
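A minimal sketch of that optimization (same assumed `rewards`/`gamma` names as above; the constant C is left out of this sketch): the return is accumulated backwards, so each $q_t$ is obtained from $q_{t+1}$ in constant time instead of being re-summed.

```python
def mc_returns_backward(rewards, gamma):
    """Same q_t values as the naive version, computed in one backward pass: g <- r + gamma * g."""
    q = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        q[t] = g
    return q
```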

Recap: MC is offline

Here we cannot update while collecting data: we must wait until the episode ends before updating. The TD-based methods introduced later can do incremental (online) updates.
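To make the offline nature concrete, here is a sketch of one REINFORCE iteration built on the earlier sketches; a full episode is generated first, and only then are the returns and the parameter updates computed. The `env.reset()`/`env.step()` interface and the feature map `phi` are assumptions for illustration, not part of the original notes.

```python
def reinforce_episode(env, theta, phi, gamma=0.99, alpha=0.01):
    """Run one episode with the current policy, then update theta from the collected data."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # data collection: no updates yet
        phi_s = phi(s)                                # features for state s, shape (n_actions, n_features)
        a = np.random.choice(len(phi_s), p=softmax_policy(theta, phi_s))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    q = mc_returns_backward(rewards, gamma)           # returns need the finished episode
    for s, a, q_t in zip(states, actions, q):         # updates happen only after the episode ends
        theta = policy_gradient_step(theta, phi(s), a, q_t, alpha)
    return theta
```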

Remarks

How to do sampling?

  • How to sample $S$?
    • $S \sim d_\pi$, where the on-policy distribution $d_\pi$ is a long-run behavior under $\pi$.
    • In practice, people usually do not care about it.
  • How to sample $A$?
    • $A \sim \pi(A \mid S, \theta)$. Hence, $a_t$ should be sampled following $\pi(\theta_t)$ at $s_t$.
    • Therefore, policy gradient methods are on-policy.
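As a small illustration of the second point, the action is drawn from the current policy's own distribution (reusing the assumed `softmax_policy` sketch from above):

```python
import numpy as np

def sample_action(theta, phi_s, rng):
    """Draw a_t ~ pi(.|s_t, theta_t): actions come from the current policy itself (on-policy)."""
    return rng.choice(len(phi_s), p=softmax_policy(theta, phi_s))

# Example: rng = np.random.default_rng(0); a_t = sample_action(theta, phi(s_t), rng)
```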

How to interpret this algorithm?

Since
$$\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)},$$
the algorithm can be rewritten as
$$\theta_{t+1} = \theta_t + \alpha \underbrace{\frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}}_{\beta_t}\, \nabla_\theta \pi(a_t \mid s_t, \theta_t) = \theta_t + \alpha \beta_t\, \nabla_\theta \pi(a_t \mid s_t, \theta_t).$$

Here we apply a first-order Taylor expansion to $\pi(a_t \mid s_t, \theta_{t+1})$.

When $\alpha \beta_t$ is sufficiently small:
$$\pi(a_t \mid s_t, \theta_{t+1}) \approx \pi(a_t \mid s_t, \theta_t) + \big(\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big)^{\top} (\theta_{t+1} - \theta_t) = \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big\|\nabla_\theta \pi(a_t \mid s_t, \theta_t)\big\|^2.$$

Interpretation:

  • If $\beta_t > 0$, then the probability of choosing $(s_t, a_t)$ is increased: $\pi(a_t \mid s_t, \theta_{t+1}) > \pi(a_t \mid s_t, \theta_t)$.
  • If $\beta_t < 0$, then the probability of choosing $(s_t, a_t)$ is decreased: $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$.

$\beta_t$ can balance exploration and exploitation.

  • $\beta_t$ is proportional to $q_t(s_t, a_t)$: when the value estimate is large, $\beta_t$ is large, so the update intends to exploit actions with greater values.
  • $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$: when the current probability is small, $\beta_t$ is large in magnitude, so the update intends to explore actions with lower probabilities if $q_t > 0$, and intends to give up actions with lower probabilities if $q_t < 0$.
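A tiny numerical check of this trade-off, with made-up numbers for illustration:

```python
def beta(q_est, pi_a):
    """beta_t = q_t(s_t, a_t) / pi(a_t | s_t, theta_t)."""
    return q_est / pi_a

# Same positive value estimate, different current probabilities:
print(beta(1.0, 0.5))    #  2.0  -> modest push for an already-likely action (exploitation)
print(beta(1.0, 0.1))    # 10.0  -> strong push toward a rarely chosen action (exploration)
print(beta(-1.0, 0.1))   # -10.0 -> strong push away from a rarely chosen, low-value action
```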