Proof of the Policy Gradient Theorem

A few remarks before the proof

I read Sutton & Barto's proof (the boxed derivation in Section 13.2 of the book) and followed the reasoning of its first four steps, which I found very clear and correct. But at the fifth step, where the part highlighted in red in step four of the derivation below is expanded again and written in the following form, I found it very hard to understand:

$$\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q_\pi(x, a)$$

  • Here $\Pr(s \to x, k, \pi)$ is a probability: the probability of transitioning from state $s$ to state $x$ in $k$ steps under $\pi$. It is really the same quantity as the one used in the earlier proof about average reward.
  • However, from the step-four derivation we are about to see, there is no way to reach the above form directly. So from step five onward I take a different approach, a matrix-form derivation, which shows the process more clearly and in the end proves that the expression above is indeed correct (understood from the viewpoint of a series expansion).

Step 1: Unroll to iterative form (refer to Sutton & Barto)

We can prove the policy gradient theorem from first principles.

  • $v_\pi(s)$ is implicitly a function of $\theta$; we write $v(s)$ for simplicity
  • all gradients are with respect to $\theta$; we write $\nabla$ instead of $\nabla_\theta$ for simplicity
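
A sketch of the unrolling, following the boxed proof in Sutton & Barto Section 13.2 but keeping a discount factor $\gamma$ (set $\gamma = 1$ for the undiscounted episodic case in the book), and assuming the standard Bellman form $q(s,a) = \sum_{s'} p(s'|s,a)\big(r(s,a,s') + \gamma v(s')\big)$:

$$
\begin{aligned}
\nabla v(s) &= \nabla \Big[ \sum_a \pi(a|s)\, q(s,a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s)\, q(s,a) + \pi(a|s)\, \nabla q(s,a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s)\, q(s,a) + \pi(a|s)\, \nabla \sum_{s'} p(s'|s,a)\big(r(s,a,s') + \gamma v(s')\big) \Big] \\
&= \sum_a \nabla\pi(a|s)\, q(s,a) \;+\; \gamma \sum_{s'} P_\pi(s'|s)\, \nabla v(s')
\end{aligned}
$$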

where $P_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s,a)$ is the one-step state-transition probability under $\pi$. This recursion is the iterative form we now turn into matrices.

Step 2: Turn into matrix form

Referring to our earlier notes on probability and matrices, we know that any conditional probability can easily be turned into a matrix, so let us just do it:

  1. $V$, with $V[s] = v(s)$, is an (n,1) vector
    • the shape is the same as $\nabla V$
    • where $n$ is the number of states
  2. $\Pi$, with $\Pi[s,a] = \pi(a|s)$, is an (n,m) matrix
    • the shape is the same as $\nabla\Pi$
    • where $m$ is the number of actions
  3. $Q$, with $Q[s,a] = q(s,a)$, is an (n,m) matrix
    • the shape is the same as $\Pi$
  4. $P$, with $P[s,a,s'] = p(s'|s,a)$, is an (n,m,n) tensor

Now, we can rewrite the last equation above in matrix form.
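
Writing $G$ for the (n,1) result of the first bmm and $P_\pi$ for the (n,n) result of the second one (shorthand introduced here for readability; both are spelled out step by step below), the recursion from Step 1 becomes

$$\nabla V \;=\; \underbrace{\operatorname{squeeze}\big(\operatorname{bmm}(\nabla\Pi,\, Q)\big)}_{G\,:\,(n,1)} \;+\; \gamma\, \underbrace{\operatorname{squeeze}\big(\operatorname{bmm}(\Pi,\, P)\big)}_{P_\pi\,:\,(n,n)}\; \nabla V$$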

where $\operatorname{bmm}(\cdot, \cdot)$ denotes batch matrix multiplication (bmm).

We demonstrate the above matrix equation in a more detailed form.

  • first, we calculate the (n,1) term $\sum_a \nabla\pi(a|s)\, q(s,a)$
    • $\nabla\Pi$ is an (n,m) matrix; we unsqueeze it to an (n,1,m) tensor
    • $Q$ is an (n,m) matrix; we unsqueeze it to an (n,m,1) tensor
    • then we do bmm, and the result is an (n,1,1) tensor
    • finally, we squeeze it to an (n,1) vector

Why do we do bmm here?

We notice that $\sum_a \nabla\pi(a|s)\, q(s,a)$ is the dot-product-sum of the two tensors $\nabla\Pi$ (n,m) along dim=1 and $Q$ (n,m) along dim=1, and the result should have the same shape as $\nabla V$, which is (n,1). So we compute (n,1,m) @ (n,m,1) = (n,1,1) and then squeeze it to an (n,1) vector.
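
A minimal numpy shape check of this first term (the names dPi and Q are placeholders for $\nabla\Pi$ and $Q$; numpy's @ acts as bmm on 3-D arrays):

# sketch: the first (n,1) term via batched matmul
import numpy as np

n, m = 4, 3
dPi = np.random.rand(n, m)   # stands for the (n,m) matrix of grad pi(a|s)
Q = np.random.rand(n, m)     # stands for the (n,m) matrix of q(s,a)

# (n,1,m) @ (n,m,1) -> (n,1,1), then squeeze to (n,1)
term1 = (dPi[:, None, :] @ Q[:, :, None]).squeeze(-1)

# the same thing written as an explicit per-state dot-product-sum over dim=1
term1_check = (dPi * Q).sum(axis=1, keepdims=True)
assert np.allclose(term1, term1_check)
print(term1.shape)           # (4, 1)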

  • second, we calculate the (n,1) term $\gamma \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla v(s')$
    • $\Pi$ is an (n,m) matrix; we unsqueeze it to an (n,1,m) tensor
    • $P$ is an (n,m,n) tensor; we just use it directly
    • $\nabla V$ is an (n,1) vector; use it directly (broadcast)
    • then we do bmm: $\operatorname{bmm}(\Pi, P)$ is an (n,1,n) tensor, and we squeeze it to (n,n), which is exactly $P_\pi$
    • finally, just do an ordinary matrix multiplication with $\nabla V$, and the result is an (n,1) vector

Why do we do bmm here?

  • $\sum_{s'} p(s'|s,a)\, \nabla v(s')$ is the dot-product-sum of $P$ (n,m,n) along dim=2 and $\nabla V$ (n,1) along dim=0; actually, we don't need to do bmm here, a plain matmul gives (n,m,n) x (n,1) = (n,m,1), which we squeeze to an (n,m) tensor
  • $\sum_a \pi(a|s)\,(\cdots)$ is the dot-product-sum of the two tensors $\Pi$ (n,m) along dim=1 and the (n,m) tensor returned by the last step, so we get (n,1,m) @ (n,m,1) = (n,1,1), then we squeeze it to an (n,1) vector
  • Note that the order of the sum over $a$ and the sum over $s'$ is not important, because the result is the same
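
A matching numpy sketch for the second term (again with placeholder names; note how $\operatorname{bmm}(\Pi, P)$ produces exactly the $P_\pi$ matrix used from here on):

# sketch: the second (n,1) term, and the P_pi matrix it produces
import numpy as np

n, m = 4, 3
gamma = 0.9
Pi = np.random.rand(n, m)    # stands for pi(a|s); rows need not be normalized for a shape check
P = np.random.rand(n, m, n)  # stands for p(s'|s,a)
dV = np.random.rand(n, 1)    # stands for the (n,1) vector of grad v(s)

# bmm((n,1,m), (n,m,n)) -> (n,1,n), squeeze to (n,n): P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)
P_pi = (Pi[:, None, :] @ P).squeeze(1)
term2 = gamma * P_pi @ dV    # ordinary matmul, shape (n,1)

# swapping the order of the two sums gives the same result
inner = (P @ dV).squeeze(-1)                              # (n,m,n) x (n,1) -> (n,m,1) -> (n,m)
term2_check = gamma * (Pi * inner).sum(axis=1, keepdims=True)
assert np.allclose(term2, term2_check)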

Step 3: Solve the matrix equation

Directly unroll the matrix equation:

$$\nabla V = G + \gamma P_\pi \nabla V = G + \gamma P_\pi G + (\gamma P_\pi)^2 \nabla V = \cdots = \sum_{k=0}^{K-1} (\gamma P_\pi)^k\, G + (\gamma P_\pi)^K\, \nabla V$$

What is a geometric series?

It is simply the sum of the first $K$ terms of a geometric progression, written here in matrix form; in essence it is just $1 + r + r^2 + \cdots + r^{K-1} = \frac{1-r^K}{1-r}$, with the scalar $r$ replaced by the matrix $\gamma P_\pi$.

We notice from the Appendix that $\sum_{k=0}^{\infty} (\gamma P_\pi)^k = (I - \gamma P_\pi)^{-1}$, and that the tail term $(\gamma P_\pi)^K \nabla V$ can be dropped as $K \to \infty$.

Then we have another beautiful iterative form here:

$$\nabla V = \sum_{k=0}^{\infty} (\gamma P_\pi)^k\, G$$

And we can also obtain the final matrix form from above:

$$\nabla V = (I - \gamma P_\pi)^{-1}\, G$$
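
A quick numerical check of this geometric-series identity on a made-up row-stochastic $P_\pi$ (for $\gamma < 1$ the spectral radius of $\gamma P_\pi$ is below 1, so the series converges):

# sketch: (I - gamma*P_pi)^{-1} G equals the summed geometric series
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)  # random (n,n) row-stochastic matrix
G = rng.normal(size=(n, 1))               # stands for the first (n,1) term

closed_form = np.linalg.solve(np.eye(n) - gamma * P_pi, G)

series = np.zeros((n, 1))
power = np.eye(n)
for _ in range(500):                      # truncate the infinite series
    series += power @ G
    power = (gamma * P_pi) @ power

print(np.max(np.abs(closed_form - series)))  # ~1e-16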

Understanding the formula from the remarks before the proof

The expression above is really just the matrix form of the formula at the very beginning, and what is worth noting (and also somewhat hard to see at a glance) is the conversion below from that probability to a matrix.

It becomes very clear once understood from the series-expansion viewpoint:
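
Reading the series entry-wise (using the shorthand $P_\pi$ and $G$ from above; this identification is exactly the point being made):

$$\big[(\gamma P_\pi)^k\big]_{s,x} = \gamma^k \Pr(s \to x, k, \pi), \qquad \text{so} \qquad \nabla v(s) = \Big[\sum_{k=0}^{\infty} (\gamma P_\pi)^k\, G\Big]_s = \sum_{x} \sum_{k=0}^{\infty} \gamma^k \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q(x,a),$$

which is the formula from the opening remarks (with $\gamma = 1$ in Sutton & Barto's undiscounted form).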

Step 4: Obtain the final form (refer to Sutton & Barto)

Finally, we can obtain the final form of the policy gradient theorem.

  • matrix form: $\nabla J = h^\top \nabla V = h^\top (I - \gamma P_\pi)^{-1} G = \eta^\top G$
  • element-wise form: $\nabla J = \sum_s \eta(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s)$

It is then immediate that

$$\nabla J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s)$$

where

  • $\eta(s)$ is the expected time steps spent (or number of visits) in state $s$; for details see the Appendix
  • $\mu(s) = \eta(s) \big/ \sum_{s'} \eta(s')$ is just the sum-to-one normalized $\eta(s)$; we call it the on-policy distribution
  • note that the $s'$ here does not mean the next state; it is just a variable of notation to distinguish the two sums
  • $\sum_{s'} \eta(s') = C$ is just a constant, so we can ignore it
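
As a sanity check of this final form, here is a small self-contained numerical experiment; the MDP, the per-state softmax parameterization, and every variable name are made up for illustration, and the gradient given by the theorem (keeping the constant, i.e. with an equals sign) is compared against finite differences of $J(\theta) = h^\top V$:

# sketch: verify grad J = sum_s eta(s) sum_a q(s,a) grad pi(a|s) on a random MDP
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[s,a,s'] = p(s'|s,a)
R = rng.normal(size=(n, m))                  # r(s,a)
h = np.full(n, 1.0 / n)                      # start-state distribution
theta = rng.normal(size=(n, m))              # softmax policy parameters

def softmax_pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # Pi[s,a] = pi(a|s)

def J(theta):
    Pi = softmax_pi(theta)
    P_pi = np.einsum("sa,sax->sx", Pi, P)    # P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)
    r_pi = np.einsum("sa,sa->s", Pi, R)      # expected one-step reward
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return h @ v

# gradient via the theorem: dJ/dtheta[s,b] = eta(s) * sum_a q(s,a) * dpi(a|s)/dtheta[s,b]
Pi = softmax_pi(theta)
P_pi = np.einsum("sa,sax->sx", Pi, P)
v = np.linalg.solve(np.eye(n) - gamma * P_pi, np.einsum("sa,sa->s", Pi, R))
q = R + gamma * np.einsum("sax,x->sa", P, v)
eta = h @ np.linalg.inv(np.eye(n) - gamma * P_pi)   # unnormalized expected visits
grad = np.zeros((n, m))
for s in range(n):
    for b in range(m):
        dpi = Pi[s] * ((np.arange(m) == b) - Pi[s, b])  # softmax Jacobian row
        grad[s, b] = eta[s] * (dpi @ q[s])

# finite-difference check of dJ/dtheta
eps, fd = 1e-6, np.zeros((n, m))
for s in range(n):
    for b in range(m):
        t = theta.copy(); t[s, b] += eps
        fd[s, b] = (J(t) - J(theta)) / eps

print(np.max(np.abs(grad - fd)))             # tiny, the two gradients agree

Note that the check uses the unnormalized $\eta(s)$ directly (an equality, no $\propto$), which is exactly the point about the constant $C$ discussed below.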

What is Q.E.D?

It has several names:

  1. Greek: ὅπερ ἔδει δεῖξαι (hoper edei deixai), used by Euclid and Archimedes
  2. Latin, used in English texts: Q.E.D. (quod erat demonstrandum), a translation of the Greek, used by Sutton & Barto
  3. Chinese: 证毕, which I use 😏

TL;DR

Using the matrix form, we can quickly summarize the above proof process. (Note that the $\nabla$ here is just the same as the operator $\nabla_\theta$; it does not denote a derivative with respect to anything else. In short, $\nabla = \nabla_\theta$.)
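
In the shorthand of Steps 2 to 4 (with $h$ the start-state distribution from the Appendix, $\eta^\top = h^\top (I - \gamma P_\pi)^{-1}$, $C = \sum_s \eta(s)$ and $\mu = \eta / C$), the whole proof compresses to

$$\nabla V = G + \gamma P_\pi \nabla V \;\Longrightarrow\; \nabla V = (I - \gamma P_\pi)^{-1} G \;\Longrightarrow\; \nabla J = h^\top \nabla V = \eta^\top G = C\, \mu^\top G \;\propto\; \mu^\top G.$$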

Conclusion

From the Appendix, we notice that

  • without-discounting case: each row of the normalized $(I - P_\pi)^{-1}$ is the same; in other words, each row is the stationary distribution $d_\pi^\top$. Therefore, we can just take one value from the vector $\nabla V$ to represent the gradient of the scalar objective function $J(\theta)$.

    • $d_\pi$ is the stationary distribution; it is an (n,1) vector
    • $G$ is an (n,1) vector as we mentioned before
    • $C$ is just a constant; it will be very large because without discounting it is roughly the total number of time steps $K$
  • discounting case: each row of the normalized $(I - \gamma P_\pi)^{-1}$ is different, so we can not just take one value from the vector $\nabla V$ to represent the gradient of the scalar objective function $J(\theta)$. We can use a weighted sum $h^\top \nabla V$ over the start distribution to represent the scalar objective function.

    • $h$ is the current distribution of states; it is an (n,1) vector
    • each row of the normalized $(I - \gamma P_\pi)^{-1}$ is an on-policy distribution; it is an (n,n) matrix
    • $C = (I - \gamma P_\pi)^{-1}\mathbf{1}$ is just a constant (n,1) vector; each row is the same ($\frac{1}{1-\gamma}$), so we can take it as a scalar
  • unified form

    • $\mu_{\text{mix}}$ is the mixed on-policy distribution; it is a (1,n) vector
    • when $\gamma = 1$, each row of the normalized $(I - P_\pi)^{-1}$ is $d_\pi^\top$, so $\mu_{\text{mix}}^\top = d_\pi^\top$
    • when $\gamma < 1$, each row of the normalized $(I - \gamma P_\pi)^{-1}$ is different, so $\mu_{\text{mix}}^\top = h^\top (I - \gamma P_\pi)^{-1} / C$
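
Putting the three bullets together in one line (a compact reading in the shorthand used throughout; $\mathbf{1}$ is the all-ones vector):

$$\nabla J \;=\; C\, \mu_{\text{mix}}^\top\, G, \qquad \mu_{\text{mix}}^\top \;=\; \frac{h^\top (I - \gamma P_\pi)^{-1}}{C}, \qquad C \;=\; h^\top (I - \gamma P_\pi)^{-1} \mathbf{1},$$

which reduces to $\mu_{\text{mix}} = d_\pi$ when $\gamma = 1$ and depends on the start distribution $h$ when $\gamma < 1$.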

Some further thoughts

At first I did not notice that the difference between the discounted and undiscounted cases would lead to such a large difference in the final result; I had indeed overlooked this part when thinking it through. The Appendix below shows that the value of this constant $C$ is in fact $1/(1-\gamma)$ when $\gamma < 1$. When doing optimization (gradient ascent) in the next section, I think the factor $C$ deserves some consideration: when $\gamma$ is very close to 1, it is fine to work with the proportional form; when $\gamma$ is clearly smaller than 1, we can consider using the original expression, i.e. substituting $C$ back into the gradient and taking this constant into account during gradient ascent, instead of folding everything into the learning rate $\alpha$, which makes tuning harder. That is, the correct gradient form we can consider later is (note the equals sign here, not a proportionality):
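
One way to write it, in the element-wise notation of Step 4 and with $C = 1/(1-\gamma)$ for $\gamma < 1$:

$$\nabla J(\theta) \;=\; \frac{1}{1-\gamma} \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s) \;=\; \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \mu,\, a \sim \pi}\big[\, q_\pi(s,a)\, \nabla \ln \pi(a|s) \,\big]$$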

Moreover, this result can be cross-checked from another direction:

  1. The whole derivation above uses one metric, while in an earlier chapter we proved the result for another metric as well; that is to say, if we take that other quantity as the metric, our policy gradient becomes the corresponding expression (and, once again, with an equals sign here);
  2. Suppose we were not clever enough to derive that second metric at the start; we could still reason only with the first metric and put our attention on the discount factor $\gamma$. The consequence is that we focus only on the immediate reward (recall the definition of the discount factor when $\gamma$ was first introduced), and we notice that the policy gradients of the two metrics are in fact one and the same expression.

In addition, I think there is a problem with introducing the $\gamma^t$ factor in the final gradient step of the REINFORCE pseudocode in Section 13.3 of Sutton's book; a side effect is that the step sizes $\alpha$ shown in the curves of Figure 13.1 end up far too small. The answer to Exercise 13.2 also mentions that this point is unclear. My guess: because the constant $C$ is not taken into account, the learning rate has to be set rather large (for example, with $C = 100$ an $\alpha$ that would originally be 0.1 would have to be set to $\alpha = 10$ to have any effect, but by default everyone keeps $\alpha < 1$), and that is why they introduced this factor. The correct approach should be to take $C$ into account; I will try it out in practice later and see how it works.

Appendix

Notations about long-run behavior

We define some notation for the long-run ($K \to \infty$) behavior used in the above proof.

  • $N = \sum_{k=0}^{K-1} (\gamma P_\pi)^k$ is an (n,n) matrix, $\eta^\top = h^\top N$ is a (1,n) vector
    • $N[s, s']$ is the time steps spent (number of visits) starting from $s$ and ending at $s'$
    • $\eta[s']$ is the time steps spent (number of visits) (starting from the default start distribution $h$) ending at $s'$
  • $D$ (the row-normalized $N$) is an (n,n) matrix, $\mu$ (the normalized $\eta$) is a (1,n) vector
    • $D[s, s']$ is the probability of visits starting from $s$ and ending at $s'$
    • $\mu[s']$ is the probability of visits (starting from the default start distribution $h$) ending at $s'$
  • $C = N\mathbf{1}$ (the row sums of $N$) is an (n,1) vector
    • $C[s]$ is the total visits starting from $s$
    • We find that
      • $\gamma = 1$: $C[s] \approx K$
      • $\gamma < 1$: $C[s] = \frac{1-\gamma^K}{1-\gamma} \approx \frac{1}{1-\gamma}$
    • where $K$ is just the number of iterations or total visits (usually a large number, because we let $K \to \infty$)

The on-policy distribution (refer to Sutton & Barto)

Let

  • $h(s)$ denote the probability that an episode begins in each state $s$
  • $\eta(s)$ denote the number of time steps spent, on average, in state $s$ in a single episode

Time is spent in a state $s$,

  • if episodes start in $s$
  • or if transitions are made into $s$ from a preceding state $\bar{s}$

Then we have the following equations, which can be solved for the expected number of visits $\eta(s)$:

  • without-discounting case: $\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a|\bar{s})\, p(s|\bar{s},a)$
  • discounting case: $\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a|\bar{s})\, p(s|\bar{s},a)$

The on-policy distribution is $\mu(s) = \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$ for all $s$.

Matrix form:

  • Iterative form: $\eta^\top = h^\top + \gamma\, \eta^\top P_\pi$
  • Solution form: $\eta^\top = h^\top (I - \gamma P_\pi)^{-1}$
  • And $\mu^\top = \eta^\top \big/ \sum_s \eta(s)$
  • where $h$ is an (n,1) vector (the start distribution above)

Some intuition

The main idea here is really just the expected number of visits: $\eta(s)$ is the expectation of how many times state $s$ is visited. Computing it is also simple, from a recursive point of view. If we now read $\eta(s)$ as probability × 1 (time step) spent in a state $s$, then all we need to do is add up all the probabilities, which consist of two parts (see the numpy sketch after this list):

  1. the probability of the current state, $d(s) \times 1$ (the $h(s)$ above)
  2. for every other state that can jump into the current state: its expected number of visits × the transition probability × 1, summed up (because if some $\bar{s}$ is visited $k$ times and its transition probability into $s$ is 0.1, then of course $s$ should also be visited $0.1k$ times through it)
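
A small numpy sketch of exactly this recursion (the start distribution h, the transition matrix P_pi, and $\gamma$ are all made up; the iterative form and the solution form should agree, and the visits should sum to $1/(1-\gamma)$):

# sketch: expected visits eta from the recursion vs. the closed-form solution
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)  # P_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar,a)
h = np.full(n, 1.0 / n)                   # probability that an episode begins in each state

# solution form: eta^T = h^T (I - gamma * P_pi)^{-1}
eta = np.linalg.solve((np.eye(n) - gamma * P_pi).T, h)

# iterative form: eta(s) = h(s) + gamma * sum_{s_bar} eta(s_bar) * P_pi[s_bar, s]
eta_it = h.copy()
for _ in range(1000):
    eta_it = h + gamma * (eta_it @ P_pi)

mu = eta / eta.sum()                      # the on-policy distribution
print(np.allclose(eta, eta_it))           # True
print(eta.sum())                          # ~ 1/(1 - gamma) = 10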

Convergence of the long-run behavior

Just modify the example in the Appendix, and we will get the same result $d_\pi$ (but in matrix form).

The following code demonstrates that the three methods below give the same representation of the stationary distribution.

  • method_multiply computes $P_\pi^k$, the transition matrix multiplied by itself $k$ times
  • inverse_norm_direct computes the inverse of $I - P_\pi$ (or $I - \gamma P_\pi$) directly, and we normalize it by the sum of each row
  • inverse_norm_iterative builds the geometric series of the transition matrix iteratively, and we normalize it by the sum of each row

The three results are the same, but

  • the first one is the most efficient, easy to calculate
  • the second one is the most accurate, but hard to calculate
  • the third one is more intuitive, but needs a lot of iterations

Talk is cheap, show the code.

# File: P_pi_k.py
import numpy as np
 
def method_multiply(P_pi: np.ndarray, k: int):
    # P_pi^k: multiply the transition matrix by itself k times
    res = np.eye(P_pi.shape[0])
    for _ in range(k):
        res = res @ P_pi
    return res
 
def inverse_norm_direct(M: np.ndarray):
    # invert M directly, then normalize each row by its sum
    inv = np.linalg.inv(M)
    print(np.sum(inv, axis=1))
    return inv / np.sum(inv, axis=1, keepdims=True)
 
def inverse_norm_iterative(M: np.ndarray):
    # build the geometric series I + (I - M) + (I - M)^2 + ... iteratively;
    # its limit is M^{-1}, then normalize each row by its sum
    I = np.eye(M.shape[0])
    inv = np.zeros(M.shape)
    cnt = 0
    # iterate until convergence
    while True:
        cnt += 1
        inv_new = I + inv @ (I - M)
        if np.allclose(inv, inv_new):
            break
        inv = inv_new
    print(f"converged after {cnt} iterations")
    print(np.sum(inv, axis=1))
    return inv / np.sum(inv, axis=1, keepdims=True)
 
 
P_pi = np.array([
    [0.3, 0.1, 0.6, 0],
    [0.1, 0.3, 0, 0.6],
    [0.1, 0, 0.3, 0.6],
    [0, 0.1, 0.1, 0.8]
])
 
k = 30
d_pi3 = method_multiply(P_pi, k)
print(f"Multiply {k} times")
print(d_pi3)
 
# obtain the normalized inverse of (I - P_pi)
I = np.eye(P_pi.shape[0])
 
print("Inverse norm direct")
print(inverse_norm_direct(I - P_pi))
 
print("Inverse norm iterative")
print(inverse_norm_iterative(I - P_pi))
 
print("-" * 50)
# discount factor
gamma = 0.9
 
print("Inverse norm direct gamma")
print(inverse_norm_direct(I - gamma * P_pi))
 
print("Inverse norm iterative gamma")
print(inverse_norm_iterative(I - gamma * P_pi))
Multiply 30 times
[[0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793]]
Inverse norm direct
[-2.48770265e+16 -2.48770265e+16 -2.48770265e+16 -2.48770265e+16]
[[0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]]
Inverse norm iterative
converged after 100003 iterations
[100002.00000024 100002.00000024 100002.00000024 100002.00000024]
[[0.03449732 0.10837206 0.13301266 0.72411796]
 [0.03448353 0.10838586 0.13300231 0.7241283 ]
 [0.03448353 0.10837157 0.1330166  0.7241283 ]
 [0.03448181 0.10837329 0.13300281 0.72414209]]
--------------------------------------------------
Inverse norm direct gamma
[10. 10. 10. 10.]
[[0.17184995 0.08842402 0.19435892 0.5453671 ]
 [0.04039756 0.21987641 0.10779272 0.63193331]
 [0.04039756 0.08289011 0.24477902 0.63193331]
 [0.02596986 0.09731781 0.11332663 0.7633857 ]]
Inverse norm iterative gamma
converged after 92 iterations
[9.9993144 9.9993144 9.9993144 9.9993144]
[[0.17185937 0.08842265 0.19436313 0.54535485]
 [0.04039797 0.21988405 0.10779099 0.63192699]
 [0.04039797 0.08288836 0.24478668 0.63192699]
 [0.02596928 0.09731705 0.11332528 0.76338839]]
>>> (1 - 0.9**91)/(1-0.9)
9.99931440386759