Proof of the Policy Gradient Theorem

A few remarks before the proof

I read Sutton & Barto's proof (the boxed derivation in Section 13.2 of the book) and followed the reasoning of its first four steps, which I found very clear and correct. But at the fifth step, where the part highlighted in red in step four of the derivation below is expanded again and written in the following form, I found it very hard to understand:

$$\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q_\pi(x, a)$$

  • Here $\Pr(s \to x, k, \pi)$ is a probability: the probability of transitioning from state $s$ to state $x$ in $k$ steps under $\pi$. It is really the same quantity as the one used in the earlier proof about average reward.
  • However, from the step-four derivation we are about to see, there is no way to reach the above form directly. So from step five onward I take a different approach, a matrix-form derivation, which shows the process more clearly and in the end proves that the expression above is indeed correct (understood from the viewpoint of a series expansion).

Step 1: Unroll to iterative form (refer to Sutton & Barto)

We can prove the policy gradient theorem from first principles.

  • $v_\pi(s)$ is implicitly a function of $\theta$; we write $v(s)$ for simplicity
  • all gradients are with respect to $\theta$; we write $\nabla$ instead of $\nabla_\theta$ for simplicity
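
A sketch of the unrolling, following the boxed proof in Sutton & Barto Section 13.2 but keeping a discount factor $\gamma$ (set $\gamma = 1$ for the undiscounted episodic case in the book), and assuming the standard Bellman form $q(s,a) = \sum_{s'} p(s'|s,a)\big(r(s,a,s') + \gamma v(s')\big)$:

$$
\begin{aligned}
\nabla v(s) &= \nabla \Big[ \sum_a \pi(a|s)\, q(s,a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s)\, q(s,a) + \pi(a|s)\, \nabla q(s,a) \Big] \\
&= \sum_a \Big[ \nabla\pi(a|s)\, q(s,a) + \pi(a|s)\, \nabla \sum_{s'} p(s'|s,a)\big(r(s,a,s') + \gamma v(s')\big) \Big] \\
&= \sum_a \nabla\pi(a|s)\, q(s,a) \;+\; \gamma \sum_{s'} P_\pi(s'|s)\, \nabla v(s')
\end{aligned}
$$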

where $P_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s,a)$ is the one-step state-transition probability under $\pi$. This recursion is the iterative form we now turn into matrices.

Step 2: Turn into matrix form

Referring to our earlier notes on probability and matrices, we know that any conditional probability can easily be turned into a matrix, so let us just do it:

  1. $V$, with $V[s] = v(s)$, is an (n,1) vector
    • the shape is the same as $\nabla V$
    • where $n$ is the number of states
  2. $\Pi$, with $\Pi[s,a] = \pi(a|s)$, is an (n,m) matrix
    • the shape is the same as $\nabla\Pi$
    • where $m$ is the number of actions
  3. $Q$, with $Q[s,a] = q(s,a)$, is an (n,m) matrix
    • the shape is the same as $\Pi$
  4. $P$, with $P[s,a,s'] = p(s'|s,a)$, is an (n,m,n) tensor

Now, we can rewrite the last equation above in matrix form.
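
Writing $G$ for the (n,1) result of the first bmm and $P_\pi$ for the (n,n) result of the second one (shorthand introduced here for readability; both are spelled out step by step below), the recursion from Step 1 becomes

$$\nabla V \;=\; \underbrace{\operatorname{squeeze}\big(\operatorname{bmm}(\nabla\Pi,\, Q)\big)}_{G\,:\,(n,1)} \;+\; \gamma\, \underbrace{\operatorname{squeeze}\big(\operatorname{bmm}(\Pi,\, P)\big)}_{P_\pi\,:\,(n,n)}\; \nabla V$$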

where $\operatorname{bmm}(\cdot, \cdot)$ denotes batch matrix multiplication (bmm).

We demonstrate the above matrix equation in a more detailed form.

  • first, we calculate the (n,1) term $\sum_a \nabla\pi(a|s)\, q(s,a)$
    • $\nabla\Pi$ is an (n,m) matrix; we unsqueeze it to an (n,1,m) tensor
    • $Q$ is an (n,m) matrix; we unsqueeze it to an (n,m,1) tensor
    • then we do bmm, and the result is an (n,1,1) tensor
    • finally, we squeeze it to an (n,1) vector

Why do we do bmm here?

We notice that $\sum_a \nabla\pi(a|s)\, q(s,a)$ is the dot-product-sum of the two tensors $\nabla\Pi$ (n,m) along dim=1 and $Q$ (n,m) along dim=1, and the result should have the same shape as $\nabla V$, which is (n,1). So we compute (n,1,m) @ (n,m,1) = (n,1,1) and then squeeze it to an (n,1) vector.
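
A minimal numpy shape check of this first term (the names dPi and Q are placeholders for $\nabla\Pi$ and $Q$; numpy's @ acts as bmm on 3-D arrays):

# sketch: the first (n,1) term via batched matmul
import numpy as np

n, m = 4, 3
dPi = np.random.rand(n, m)   # stands for the (n,m) matrix of grad pi(a|s)
Q = np.random.rand(n, m)     # stands for the (n,m) matrix of q(s,a)

# (n,1,m) @ (n,m,1) -> (n,1,1), then squeeze to (n,1)
term1 = (dPi[:, None, :] @ Q[:, :, None]).squeeze(-1)

# the same thing written as an explicit per-state dot-product-sum over dim=1
term1_check = (dPi * Q).sum(axis=1, keepdims=True)
assert np.allclose(term1, term1_check)
print(term1.shape)           # (4, 1)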

  • second, we calculate the (n,1) term $\gamma \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, \nabla v(s')$
    • $\Pi$ is an (n,m) matrix; we unsqueeze it to an (n,1,m) tensor
    • $P$ is an (n,m,n) tensor; we just use it directly
    • $\nabla V$ is an (n,1) vector; use it directly (broadcast)
    • then we do bmm: $\operatorname{bmm}(\Pi, P)$ is an (n,1,n) tensor, and we squeeze it to (n,n), which is exactly $P_\pi$
    • finally, just do an ordinary matrix multiplication with $\nabla V$, and the result is an (n,1) vector

Why do we do bmm here?

  • $\sum_{s'} p(s'|s,a)\, \nabla v(s')$ is the dot-product-sum of $P$ (n,m,n) along dim=2 and $\nabla V$ (n,1) along dim=0; actually, we don't need to do bmm here, a plain matmul gives (n,m,n) x (n,1) = (n,m,1), which we squeeze to an (n,m) tensor
  • $\sum_a \pi(a|s)\,(\cdots)$ is the dot-product-sum of the two tensors $\Pi$ (n,m) along dim=1 and the (n,m) tensor returned by the last step, so we get (n,1,m) @ (n,m,1) = (n,1,1), then we squeeze it to an (n,1) vector
  • Note that the order of the sum over $a$ and the sum over $s'$ is not important, because the result is the same
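
A matching numpy sketch for the second term (again with placeholder names; note how $\operatorname{bmm}(\Pi, P)$ produces exactly the $P_\pi$ matrix used from here on):

# sketch: the second (n,1) term, and the P_pi matrix it produces
import numpy as np

n, m = 4, 3
gamma = 0.9
Pi = np.random.rand(n, m)    # stands for pi(a|s); rows need not be normalized for a shape check
P = np.random.rand(n, m, n)  # stands for p(s'|s,a)
dV = np.random.rand(n, 1)    # stands for the (n,1) vector of grad v(s)

# bmm((n,1,m), (n,m,n)) -> (n,1,n), squeeze to (n,n): P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)
P_pi = (Pi[:, None, :] @ P).squeeze(1)
term2 = gamma * P_pi @ dV    # ordinary matmul, shape (n,1)

# swapping the order of the two sums gives the same result
inner = (P @ dV).squeeze(-1)                              # (n,m,n) x (n,1) -> (n,m,1) -> (n,m)
term2_check = gamma * (Pi * inner).sum(axis=1, keepdims=True)
assert np.allclose(term2, term2_check)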

Step 3: Solve the matrix equation

Directly unroll the matrix equation:

$$\nabla V = G + \gamma P_\pi \nabla V = G + \gamma P_\pi G + (\gamma P_\pi)^2 \nabla V = \cdots = \sum_{k=0}^{K-1} (\gamma P_\pi)^k\, G + (\gamma P_\pi)^K\, \nabla V$$

What is a geometric series?

It is simply the sum of the first $K$ terms of a geometric progression, written here in matrix form; in essence it is just $1 + r + r^2 + \cdots + r^{K-1} = \frac{1-r^K}{1-r}$, with the scalar $r$ replaced by the matrix $\gamma P_\pi$.

We notice from the Appendix that $\sum_{k=0}^{\infty} (\gamma P_\pi)^k = (I - \gamma P_\pi)^{-1}$, and that the tail term $(\gamma P_\pi)^K \nabla V$ can be dropped as $K \to \infty$.

Then we have another beautiful iterative form here:

$$\nabla V = \sum_{k=0}^{\infty} (\gamma P_\pi)^k\, G$$

And we can also obtain the final matrix form from above:

$$\nabla V = (I - \gamma P_\pi)^{-1}\, G$$
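
A quick numerical check of this geometric-series identity on a made-up row-stochastic $P_\pi$ (for $\gamma < 1$ the spectral radius of $\gamma P_\pi$ is below 1, so the series converges):

# sketch: (I - gamma*P_pi)^{-1} G equals the summed geometric series
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)  # random (n,n) row-stochastic matrix
G = rng.normal(size=(n, 1))               # stands for the first (n,1) term

closed_form = np.linalg.solve(np.eye(n) - gamma * P_pi, G)

series = np.zeros((n, 1))
power = np.eye(n)
for _ in range(500):                      # truncate the infinite series
    series += power @ G
    power = (gamma * P_pi) @ power

print(np.max(np.abs(closed_form - series)))  # ~1e-16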

Understanding the formula from the remarks before the proof

The expression above is really just the matrix form of the formula at the very beginning, and what is worth noting (and also somewhat hard to see at a glance) is the conversion below from that probability to a matrix.

It becomes very clear once understood from the series-expansion viewpoint:
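
Reading the series entry-wise (using the shorthand $P_\pi$ and $G$ from above; this identification is exactly the point being made):

$$\big[(\gamma P_\pi)^k\big]_{s,x} = \gamma^k \Pr(s \to x, k, \pi), \qquad \text{so} \qquad \nabla v(s) = \Big[\sum_{k=0}^{\infty} (\gamma P_\pi)^k\, G\Big]_s = \sum_{x} \sum_{k=0}^{\infty} \gamma^k \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\, q(x,a),$$

which is the formula from the opening remarks (with $\gamma = 1$ in Sutton & Barto's undiscounted form).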

Step 4: Obtain the final form (refer to Sutton & Barto)

Finally, we can obtain the final form of the policy gradient theorem.

  • matrix form: $\nabla J = h^\top \nabla V = h^\top (I - \gamma P_\pi)^{-1} G = \eta^\top G$
  • element-wise form: $\nabla J = \sum_s \eta(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s)$

It is then immediate that

$$\nabla J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s)$$

where

  • $\eta(s)$ is the expected time steps spent (or number of visits) in state $s$; for details see the Appendix
  • $\mu(s) = \eta(s) \big/ \sum_{s'} \eta(s')$ is just the sum-to-one normalized $\eta(s)$; we call it the on-policy distribution
  • note that the $s'$ here does not mean the next state; it is just a variable of notation to distinguish the two sums
  • $\sum_{s'} \eta(s') = C$ is just a constant, so we can ignore it
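
As a sanity check of this final form, here is a small self-contained numerical experiment; the MDP, the per-state softmax parameterization, and every variable name are made up for illustration, and the gradient given by the theorem (keeping the constant, i.e. with an equals sign) is compared against finite differences of $J(\theta) = h^\top V$:

# sketch: verify grad J = sum_s eta(s) sum_a q(s,a) grad pi(a|s) on a random MDP
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[s,a,s'] = p(s'|s,a)
R = rng.normal(size=(n, m))                  # r(s,a)
h = np.full(n, 1.0 / n)                      # start-state distribution
theta = rng.normal(size=(n, m))              # softmax policy parameters

def softmax_pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # Pi[s,a] = pi(a|s)

def J(theta):
    Pi = softmax_pi(theta)
    P_pi = np.einsum("sa,sax->sx", Pi, P)    # P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)
    r_pi = np.einsum("sa,sa->s", Pi, R)      # expected one-step reward
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return h @ v

# gradient via the theorem: dJ/dtheta[s,b] = eta(s) * sum_a q(s,a) * dpi(a|s)/dtheta[s,b]
Pi = softmax_pi(theta)
P_pi = np.einsum("sa,sax->sx", Pi, P)
v = np.linalg.solve(np.eye(n) - gamma * P_pi, np.einsum("sa,sa->s", Pi, R))
q = R + gamma * np.einsum("sax,x->sa", P, v)
eta = h @ np.linalg.inv(np.eye(n) - gamma * P_pi)   # unnormalized expected visits
grad = np.zeros((n, m))
for s in range(n):
    for b in range(m):
        dpi = Pi[s] * ((np.arange(m) == b) - Pi[s, b])  # softmax Jacobian row
        grad[s, b] = eta[s] * (dpi @ q[s])

# finite-difference check of dJ/dtheta
eps, fd = 1e-6, np.zeros((n, m))
for s in range(n):
    for b in range(m):
        t = theta.copy(); t[s, b] += eps
        fd[s, b] = (J(t) - J(theta)) / eps

print(np.max(np.abs(grad - fd)))             # tiny, the two gradients agree

Note that the check uses the unnormalized $\eta(s)$ directly (an equality, no $\propto$), which is exactly the point about the constant $C$ discussed below.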

What is Q.E.D?

It has several names:

  1. Greek: ὅπερ ἔδει δεῖξαι (hoper edei deixai), used by Euclid and Archimedes
  2. Latin, used in English texts: Q.E.D. (quod erat demonstrandum), a translation of the Greek, used by Sutton & Barto
  3. Chinese: 证毕, which I use 😏

TL;DR

Using the matrix form, we can quickly summarize the above proof process. (Note that the $\nabla$ here is just the same as the operator $\nabla_\theta$; it does not denote a derivative with respect to anything else. In short, $\nabla = \nabla_\theta$.)
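
In the shorthand of Steps 2 to 4 (with $h$ the start-state distribution from the Appendix, $\eta^\top = h^\top (I - \gamma P_\pi)^{-1}$, $C = \sum_s \eta(s)$ and $\mu = \eta / C$), the whole proof compresses to

$$\nabla V = G + \gamma P_\pi \nabla V \;\Longrightarrow\; \nabla V = (I - \gamma P_\pi)^{-1} G \;\Longrightarrow\; \nabla J = h^\top \nabla V = \eta^\top G = C\, \mu^\top G \;\propto\; \mu^\top G.$$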

Conclusion

From the Appendix, we notice that

  • without-discounting case: each row of the normalized $(I - P_\pi)^{-1}$ is the same; in other words, each row is the stationary distribution $d_\pi^\top$. Therefore, we can just take one value from the vector $\nabla V$ to represent the gradient of the scalar objective function $J(\theta)$.

    • $d_\pi$ is the stationary distribution; it is an (n,1) vector
    • $G$ is an (n,1) vector as we mentioned before
    • $C$ is just a constant; it will be very large because without discounting it is roughly the total number of time steps $K$
  • discounting case: each row of the normalized $(I - \gamma P_\pi)^{-1}$ is different, so we can not just take one value from the vector $\nabla V$ to represent the gradient of the scalar objective function $J(\theta)$. We can use a weighted sum $h^\top \nabla V$ over the start distribution to represent the scalar objective function.

    • $h$ is the current distribution of states; it is an (n,1) vector
    • each row of the normalized $(I - \gamma P_\pi)^{-1}$ is an on-policy distribution; it is an (n,n) matrix
    • $C = (I - \gamma P_\pi)^{-1}\mathbf{1}$ is just a constant (n,1) vector; each row is the same ($\frac{1}{1-\gamma}$), so we can take it as a scalar
  • unified form

    • $\mu_{\text{mix}}$ is the mixed on-policy distribution; it is a (1,n) vector
    • when $\gamma = 1$, each row of the normalized $(I - P_\pi)^{-1}$ is $d_\pi^\top$, so $\mu_{\text{mix}}^\top = d_\pi^\top$
    • when $\gamma < 1$, each row of the normalized $(I - \gamma P_\pi)^{-1}$ is different, so $\mu_{\text{mix}}^\top = h^\top (I - \gamma P_\pi)^{-1} / C$
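
Putting the three bullets together in one line (a compact reading in the shorthand used throughout; $\mathbf{1}$ is the all-ones vector):

$$\nabla J \;=\; C\, \mu_{\text{mix}}^\top\, G, \qquad \mu_{\text{mix}}^\top \;=\; \frac{h^\top (I - \gamma P_\pi)^{-1}}{C}, \qquad C \;=\; h^\top (I - \gamma P_\pi)^{-1} \mathbf{1},$$

which reduces to $\mu_{\text{mix}} = d_\pi$ when $\gamma = 1$ and depends on the start distribution $h$ when $\gamma < 1$.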

Some further thoughts

At first I did not notice that the difference between the discounted and undiscounted cases would lead to such a large difference in the final result; I had indeed overlooked this part when thinking it through. The Appendix below shows that the value of this constant $C$ is in fact $1/(1-\gamma)$ when $\gamma < 1$. When doing optimization (gradient ascent) in the next section, I think the factor $C$ deserves some consideration: when $\gamma$ is very close to 1, it is fine to work with the proportional form; when $\gamma$ is clearly smaller than 1, we can consider using the original expression, i.e. substituting $C$ back into the gradient and taking this constant into account during gradient ascent, instead of folding everything into the learning rate $\alpha$, which makes tuning harder. That is, the correct gradient form we can consider later is (note the equals sign here, not a proportionality):
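
One way to write it, in the element-wise notation of Step 4 and with $C = 1/(1-\gamma)$ for $\gamma < 1$:

$$\nabla J(\theta) \;=\; \frac{1}{1-\gamma} \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s) \;=\; \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \mu,\, a \sim \pi}\big[\, q_\pi(s,a)\, \nabla \ln \pi(a|s) \,\big]$$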

Moreover, this result can be cross-checked from another direction:

  1. The whole derivation above uses one metric, while in an earlier chapter we proved the result for another metric as well; that is to say, if we take that other quantity as the metric, our policy gradient becomes the corresponding expression (and, once again, with an equals sign here);
  2. Suppose we were not clever enough to derive that second metric at the start; we could still reason only with the first metric and put our attention on the discount factor $\gamma$. The consequence is that we focus only on the immediate reward (recall the definition of the discount factor when $\gamma$ was first introduced), and we notice that the policy gradients of the two metrics are in fact one and the same expression.

In addition, I think there is a problem with introducing the $\gamma^t$ factor in the final gradient step of the REINFORCE pseudocode in Section 13.3 of Sutton's book; a side effect is that the step sizes $\alpha$ shown in the curves of Figure 13.1 end up far too small. The answer to Exercise 13.2 also mentions that this point is unclear. My guess: because the constant $C$ is not taken into account, the learning rate has to be set rather large (for example, with $C = 100$ an $\alpha$ that would originally be 0.1 would have to be set to $\alpha = 10$ to have any effect, but by default everyone keeps $\alpha < 1$), and that is why they introduced this factor. The correct approach should be to take $C$ into account; I will try it out in practice later and see how it works.

Appendix

Notations about long-run behavior

We define some notation for the long-run ($K \to \infty$) behavior used in the above proof.

  • $N = \sum_{k=0}^{K-1} (\gamma P_\pi)^k$ is an (n,n) matrix, $\eta^\top = h^\top N$ is a (1,n) vector
    • $N[s, s']$ is the time steps spent (number of visits) starting from $s$ and ending at $s'$
    • $\eta[s']$ is the time steps spent (number of visits) (starting from the default start distribution $h$) ending at $s'$
  • $D$ (the row-normalized $N$) is an (n,n) matrix, $\mu$ (the normalized $\eta$) is a (1,n) vector
    • $D[s, s']$ is the probability of visits starting from $s$ and ending at $s'$
    • $\mu[s']$ is the probability of visits (starting from the default start distribution $h$) ending at $s'$
  • $C = N\mathbf{1}$ (the row sums of $N$) is an (n,1) vector
    • $C[s]$ is the total visits starting from $s$
    • We find that
      • $\gamma = 1$: $C[s] \approx K$
      • $\gamma < 1$: $C[s] = \frac{1-\gamma^K}{1-\gamma} \approx \frac{1}{1-\gamma}$
    • where $K$ is just the number of iterations or total visits (usually a large number, because we let $K \to \infty$)

The on-policy distribution (refer to Sutton & Barto)

Let

  • $h(s)$ denote the probability that an episode begins in each state $s$
  • $\eta(s)$ denote the number of time steps spent, on average, in state $s$ in a single episode

Time is spent in a state $s$,

  • if episodes start in $s$
  • or if transitions are made into $s$ from a preceding state $\bar{s}$

Then we have the following equations, which can be solved for the expected number of visits $\eta(s)$:

  • without-discounting case: $\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a|\bar{s})\, p(s|\bar{s},a)$
  • discounting case: $\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a|\bar{s})\, p(s|\bar{s},a)$

The on-policy distribution is $\mu(s) = \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$ for all $s$.

Matrix form:

  • Iterative form: $\eta^\top = h^\top + \gamma\, \eta^\top P_\pi$
  • Solution form: $\eta^\top = h^\top (I - \gamma P_\pi)^{-1}$
  • And $\mu^\top = \eta^\top \big/ \sum_s \eta(s)$
  • where $h$ is an (n,1) vector (the start distribution above)

Some intuition

The main idea here is really just the expected number of visits: $\eta(s)$ is the expectation of how many times state $s$ is visited. Computing it is also simple, from a recursive point of view. If we now read $\eta(s)$ as probability × 1 (time step) spent in a state $s$, then all we need to do is add up all the probabilities, which consist of two parts (see the numpy sketch after this list):

  1. the probability of the current state, $d(s) \times 1$ (the $h(s)$ above)
  2. for every other state that can jump into the current state: its expected number of visits × the transition probability × 1, summed up (because if some $\bar{s}$ is visited $k$ times and its transition probability into $s$ is 0.1, then of course $s$ should also be visited $0.1k$ times through it)
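
A small numpy sketch of exactly this recursion (the start distribution h, the transition matrix P_pi, and $\gamma$ are all made up; the iterative form and the solution form should agree, and the visits should sum to $1/(1-\gamma)$):

# sketch: expected visits eta from the recursion vs. the closed-form solution
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)  # P_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar,a)
h = np.full(n, 1.0 / n)                   # probability that an episode begins in each state

# solution form: eta^T = h^T (I - gamma * P_pi)^{-1}
eta = np.linalg.solve((np.eye(n) - gamma * P_pi).T, h)

# iterative form: eta(s) = h(s) + gamma * sum_{s_bar} eta(s_bar) * P_pi[s_bar, s]
eta_it = h.copy()
for _ in range(1000):
    eta_it = h + gamma * (eta_it @ P_pi)

mu = eta / eta.sum()                      # the on-policy distribution
print(np.allclose(eta, eta_it))           # True
print(eta.sum())                          # ~ 1/(1 - gamma) = 10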

Convergence of the long-run behavior

Just modify the example in the Appendix, and we will get the same result $d_\pi$ (but in matrix form).

The following code demonstrates that the three methods below give the same representation of the stationary distribution.

  • method_multiply computes $P_\pi^k$, the transition matrix multiplied by itself $k$ times
  • inverse_norm_direct computes the inverse of $I - P_\pi$ (or $I - \gamma P_\pi$) directly, and we normalize it by the sum of each row
  • inverse_norm_iterative builds the geometric series of the transition matrix iteratively, and we normalize it by the sum of each row

The three results are the same, but

  • the first one is the most efficient, easy to calculate
  • the second one is the most accurate, but hard to calculate
  • the third one is more intuitive, but needs a lot of iterations

Talk is cheap, show the code.

# File: P_pi_k.py
import numpy as np
 
def method_multiply(P_pi: np.ndarray, k: int):
    # P_pi^k: multiply the transition matrix by itself k times
    res = np.eye(P_pi.shape[0])
    for _ in range(k):
        res = res @ P_pi
    return res
 
def inverse_norm_direct(M: np.ndarray):
    # invert M directly, then normalize each row by its sum
    inv = np.linalg.inv(M)
    print(np.sum(inv, axis=1))
    return inv / np.sum(inv, axis=1, keepdims=True)
 
def inverse_norm_iterative(M: np.ndarray):
    # build the geometric series I + (I - M) + (I - M)^2 + ... iteratively;
    # its limit is M^{-1}, then normalize each row by its sum
    I = np.eye(M.shape[0])
    inv = np.zeros(M.shape)
    cnt = 0
    # iterate until convergence
    while True:
        cnt += 1
        inv_new = I + inv @ (I - M)
        if np.allclose(inv, inv_new):
            break
        inv = inv_new
    print(f"converged after {cnt} iterations")
    print(np.sum(inv, axis=1))
    return inv / np.sum(inv, axis=1, keepdims=True)
 
 
P_pi = np.array([
    [0.3, 0.1, 0.6, 0],
    [0.1, 0.3, 0, 0.6],
    [0.1, 0, 0.3, 0.6],
    [0, 0.1, 0.1, 0.8]
])
 
k = 30
d_pi3 = method_multiply(P_pi, k)
print(f"Multiply {k} times")
print(d_pi3)
 
# obtain the normalized inverse of (I - P_pi)
I = np.eye(P_pi.shape[0])
 
print("Inverse norm direct")
print(inverse_norm_direct(I - P_pi))
 
print("Inverse norm iterative")
print(inverse_norm_iterative(I - P_pi))
 
print("-" * 50)
# discount factor
gamma = 0.9
 
print("Inverse norm direct gamma")
print(inverse_norm_direct(I - gamma * P_pi))
 
print("Inverse norm iterative gamma")
print(inverse_norm_iterative(I - gamma * P_pi))
Multiply 30 times
[[0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793] 
 [0.03448276 0.10837438 0.13300493 0.72413793]]
Inverse norm direct
[-2.48770265e+16 -2.48770265e+16 -2.48770265e+16 -2.48770265e+16]
[[0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]
 [0.03448276 0.10837438 0.13300493 0.72413793]]
Inverse norm iterative
converged after 100003 iterations
[100002.00000024 100002.00000024 100002.00000024 100002.00000024]
[[0.03449732 0.10837206 0.13301266 0.72411796]
 [0.03448353 0.10838586 0.13300231 0.7241283 ]
 [0.03448353 0.10837157 0.1330166  0.7241283 ]
 [0.03448181 0.10837329 0.13300281 0.72414209]]
--------------------------------------------------
Inverse norm direct gamma
[10. 10. 10. 10.]
[[0.17184995 0.08842402 0.19435892 0.5453671 ]
 [0.04039756 0.21987641 0.10779272 0.63193331]
 [0.04039756 0.08289011 0.24477902 0.63193331]
 [0.02596986 0.09731781 0.11332663 0.7633857 ]]
Inverse norm iterative gamma
converged after 92 iterations
[9.9993144 9.9993144 9.9993144 9.9993144]
[[0.17185937 0.08842265 0.19436313 0.54535485]
 [0.04039797 0.21988405 0.10779099 0.63192699]
 [0.04039797 0.08288836 0.24478668 0.63192699]
 [0.02596928 0.09731705 0.11332528 0.76338839]]
>>> (1 - 0.9**91)/(1-0.9)
9.99931440386759