Gradient of the PG metric

Introduction

Preview

I think this will be the best section of these notes. It contains plenty of proofs, so enjoy 👀

Given a metric, we next derive its gradient and then apply gradient-based methods to optimize the metric.

Simply put, the unified gradient of the metric is given by

$$
\nabla_{\theta} J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla_{\theta} \pi(a \mid s, \theta)
$$

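Important

Professor Zhao's slides do not explain the derivation of this formula very clearly. I spent about a day writing my own proof, done in matrix form and drawing heavily on Sutton's book ⇒ policy_gradient_proof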

Turn into Expectation

$$
\begin{aligned}
\nabla_{\theta} J(\theta)
&\propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a) \nabla_{\theta} \pi(a \mid s, \theta) \\
&= E_{S_{t} \sim \mu} \left[ \sum_{a} q_{\pi}(S_{t}, a) \nabla_{\theta} \pi(a \mid S_{t}, \theta) \right]
&& \text{(replace } s \text{ by the sample } S_{t} \sim \mu \text{)} \\
&= E_{S_{t} \sim \mu} \left[ \sum_{a} \pi(a \mid S_{t}, \theta) \, q_{\pi}(S_{t}, a) \frac{\nabla_{\theta} \pi(a \mid S_{t}, \theta)}{\pi(a \mid S_{t}, \theta)} \right] \\
&= E_{S_{t} \sim \mu, A_{t} \sim \pi} \left[ q_{\pi}(S_{t}, A_{t}) \frac{\nabla_{\theta} \pi(A_{t} \mid S_{t}, \theta)}{\pi(A_{t} \mid S_{t}, \theta)} \right]
&& \text{(replace } a \text{ by the sample } A_{t} \sim \pi \text{)} \\
&= E_{S_{t} \sim \mu, A_{t} \sim \pi} \left[ G_{t} \frac{\nabla_{\theta} \pi(A_{t} \mid S_{t}, \theta)}{\pi(A_{t} \mid S_{t}, \theta)} \right]
&& \text{(} q_{\pi}(S_{t}, A_{t}) \doteq E[G_{t} \mid S_{t}, A_{t}] \text{)} \\
&= E_{S_{t} \sim \mu, A_{t} \sim \pi} \left[ G_{t} \nabla_{\theta} \ln \pi(A_{t} \mid S_{t}, \theta) \right]
\end{aligned}
$$

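Note how the $\sum_{s}$ and the $\sum_{a}$ above are pulled out one at a time and turned into an $E$.

We first pull out $\sum_{s} \mu(s)$. To make this explicit, the outer $E$ carries the subscript $S \sim \mu$, which says which random variable the expectation is taken over. Every lowercase $s$ becomes an uppercase $S$, which also records the distribution of $S$ (the on-policy distribution).

Before the second step we multiply and divide by the action probability $\pi(a \mid S, \theta)$, so that $\sum_{a} \pi(a \mid S, \theta)$ can be pulled out as well. The outer $E$ then gets the extra subscript $A \sim \pi$, and every lowercase $a$ becomes an uppercase $A$.

Why rewrite the gradient as an expectation? Because in the end we want to use MC methods to sample a large number of state-action pairs cheaply. Once the gradient is an expectation, we can update by samples just as before.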
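To make the "update by samples" point concrete, here is a minimal NumPy sketch for a single fixed state. It compares the sum form $\sum_{a} q_{\pi}(s, a) \nabla_{\theta} \pi(a \mid s, \theta)$ with a sample average of $q_{\pi}(s, A) \nabla_{\theta} \ln \pi(A \mid s, \theta)$ over $A \sim \pi$. Everything numeric is an assumption made for illustration: the sizes `m` and `k`, the random features `phi`, and the made-up action values `q` (they are not true action values of any MDP). The policy is the softmax-linear one from the "About the policy gradient" section, and sampling $S_{t} \sim \mu$ is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 3                              # hypothetical: 4 actions, 3 features
phi = rng.normal(size=(m, k))            # phi[a] = feature vector of (s, a) for one fixed s
theta = rng.normal(size=k)
q = rng.normal(size=m)                   # made-up stand-ins for q_pi(s, a)

# Softmax-linear policy pi(a | s, theta) for this state.
h = phi @ theta
pi = np.exp(h - h.max())
pi /= pi.sum()

# grad_theta pi(a | s, theta) via the softmax Jacobian d pi_a / d h_b = pi_a (delta_ab - pi_b).
jacobian = np.diag(pi) - np.outer(pi, pi)
grad_pi = jacobian @ phi                 # row a: grad_theta pi(a | s, theta), shape (m, k)
grad_log_pi = grad_pi / pi[:, None]      # row a: grad_theta ln pi(a | s, theta)

# Sum form: sum_a q(s, a) * grad pi(a | s, theta).
sum_form = q @ grad_pi

# Expectation form evaluated exactly: sum_a pi(a) * q(a) * grad ln pi(a).
exact_expectation = (pi * q) @ grad_log_pi

# Expectation form estimated from samples A ~ pi.
actions = rng.choice(m, size=200_000, p=pi)
mc_estimate = (q[actions][:, None] * grad_log_pi[actions]).mean(axis=0)

print("sum form        :", sum_form)
print("exact E[...]    :", exact_expectation)
print("sample estimate :", mc_estimate)
```

The first two vectors agree up to floating-point error, and the sample estimate approaches them as the number of samples grows. This is why the expectation form is the convenient one for stochastic updates.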

About the policy gradient

$$
\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a'} e^{h(s, a', \theta)}}, \qquad h(s, a, \theta) = \theta^{T} \phi(s, a)
$$

where

- $\phi(s, a) \in \mathbb{R}^{d}$ is the feature vector of the state-action pair $(s, a)$
- $\theta \in \mathbb{R}^{d}$ is the parameter vector of the policy

$$
\begin{aligned}
\nabla_{\theta} \ln \pi(a \mid s, \theta)
&= \nabla_{\theta} \ln \frac{e^{h(s, a, \theta)}}{\sum_{a'} e^{h(s, a', \theta)}} \\
&= \nabla_{\theta} \left( h(s, a, \theta) - \ln \sum_{a'} e^{h(s, a', \theta)} \right) \\
&= \nabla_{\theta} h(s, a, \theta) - \nabla_{\theta} \ln \sum_{a'} e^{h(s, a', \theta)} \\
&= \phi(s, a) - \frac{1}{\sum_{a''} e^{h(s, a'', \theta)}} \nabla_{\theta} \sum_{a'} e^{h(s, a', \theta)} \\
&= \phi(s, a) - \frac{1}{\sum_{a''} e^{h(s, a'', \theta)}} \sum_{a'} e^{h(s, a', \theta)} \nabla_{\theta} h(s, a', \theta) \\
&= \phi(s, a) - \sum_{a'} \frac{e^{h(s, a', \theta)}}{\sum_{a''} e^{h(s, a'', \theta)}} \phi(s, a') \\
&= \phi(s, a) - \sum_{a'} \pi(a' \mid s, \theta) \phi(s, a')
\end{aligned}
$$
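As a sanity check on the closed form $\nabla_{\theta} \ln \pi(a \mid s, \theta) = \phi(s, a) - \sum_{a'} \pi(a' \mid s, \theta) \phi(s, a')$, the sketch below compares it with a central finite-difference gradient for one state. The function names `log_pi` and `grad_log_pi`, the sizes, and the random features are all hypothetical choices made for this illustration.

```python
import numpy as np

def log_pi(theta, phi):
    """ln pi(a | s, theta) for every action of one state; phi has shape (m, k)."""
    h = phi @ theta                        # preferences h(s, a, theta) = theta^T phi(s, a)
    h = h - h.max()                        # shift for stability; log-softmax is unchanged
    return h - np.log(np.exp(h).sum())

def grad_log_pi(theta, phi):
    """Closed form: row a is phi(s, a) - sum_a' pi(a' | s, theta) phi(s, a')."""
    pi = np.exp(log_pi(theta, phi))
    return phi - pi @ phi                  # (m, k) minus a broadcast (k,) vector

rng = np.random.default_rng(0)
m, k = 5, 3                                # hypothetical: 5 actions, 3 features
phi = rng.normal(size=(m, k))
theta = rng.normal(size=k)

analytic = grad_log_pi(theta, phi)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros((m, k))
for j in range(k):
    step = np.zeros(k)
    step[j] = eps
    numeric[:, j] = (log_pi(theta + step, phi) - log_pi(theta - step, phi)) / (2 * eps)

print("max abs difference:", np.abs(analytic - numeric).max())
```

A useful side check: weighting the rows of `analytic` by $\pi$ and summing gives the zero vector, since $\sum_{a} \pi(a \mid s, \theta) \nabla_{\theta} \ln \pi(a \mid s, \theta) = \nabla_{\theta} \sum_{a} \pi(a \mid s, \theta) = 0$.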

Remarks:

- Such a form based on the softmax function can be realized by a neural network:
  - the input is $s$ (continuous);
  - the output has $|\mathcal{A}|$ entries, one probability $\pi(a \mid s, \theta)$ per discrete action $a$;
  - the activation of the output layer is the softmax function.
- $\pi(a \mid s, \theta) \in [0, 1]$ for all $a \in \mathcal{A}$ and $\sum_{a} \pi(a \mid s, \theta) = 1$.
- Since softmax outputs are strictly positive, the parameterized policy is stochastic and hence exploratory.
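The remarks above can be sketched as a tiny network in NumPy. This is only an illustration under assumed settings: the layer sizes, the single tanh hidden layer, the random initialisation, and the names `params` and `policy` are all hypothetical, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden_dim, num_actions = 4, 8, 3       # hypothetical sizes

# Randomly initialised parameters theta of the policy network.
params = {
    "W1": rng.normal(scale=0.5, size=(state_dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(scale=0.5, size=(hidden_dim, num_actions)),
    "b2": np.zeros(num_actions),
}

def policy(state, params):
    """Map a continuous state to pi(. | s, theta), one probability per action."""
    hidden = np.tanh(state @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]   # |A| outputs before the activation
    logits = logits - logits.max()                  # shift for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()            # softmax output layer

state = rng.normal(size=state_dim)                  # a continuous state s
probs = policy(state, params)
action = rng.choice(num_actions, p=probs)           # stochastic, hence exploratory

print("pi(. | s):", probs, "sampled action:", action)
```

Because every probability is strictly positive, repeated sampling from `probs` will eventually try every action.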

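Some matrix thoughts

As in earlier sections, this can also be written in matrix form. From the definitions above, reading $n$ as the number of states, $m$ as the number of actions and $k$ as the number of features, we know:

$\phi$ is an (n,m,k) tensor

$\theta$ is a (k,1) vector

$\pi = \mathrm{softmax}(\phi \times \theta, dim = 1)$ is an (n,m) matrix

We can then write directly (the shape of each term is written in front of it; the (n,1,k) product is broadcast over the action axis):
$(n, m, k) \nabla_{\theta} \ln \pi = (n, m, k) \phi - (n, 1, m) \pi @ (n, m, k) \phi$
However, the shape of $\nabla_{\theta} \ln \pi$ should be (k,1). In the forward pass, $\theta$ takes a dot product with every (1,k) slice of the (n,m,k) tensor $\phi$ (this is the broadcast). In the backward pass the (n,m,k) gradient therefore has to be reduced over its first two axes: mean(dim=(0,1)) gives a (k,) gradient. Equivalently, reshape the tensor to (n*m, k) and take mean(dim=0). Strictly speaking, backpropagation through a broadcast sums the copies; the reduction is a mean when the scalar being differentiated is the mean of $\ln \pi$ over all n×m pairs.

TL;DR: the broadcast in the forward pass makes n×m copies of $\theta$, so the backward pass averages the n×m per-pair gradients.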
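The shape bookkeeping can be checked with a short NumPy sketch (NumPy's `axis` plays the role of `dim`). The sizes `n`, `m`, `k` and the random `phi` and `theta` are assumptions for illustration. The last part checks numerically that the mean over the first two axes equals the gradient of the scalar "mean of $\ln \pi$ over all pairs".

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 3, 5                       # hypothetical: states, actions, features
phi = rng.normal(size=(n, m, k))
theta = rng.normal(size=(k, 1))

def log_policy(theta):
    """ln pi for every (state, action) pair, shape (n, m); softmax over axis 1."""
    h = (phi @ theta)[..., 0]           # (n, m, k) @ (k, 1) -> (n, m, 1) -> (n, m)
    h = h - h.max(axis=1, keepdims=True)
    return h - np.log(np.exp(h).sum(axis=1, keepdims=True))

pi = np.exp(log_policy(theta))          # (n, m), each row sums to 1

# (n, m, k) = (n, m, k) - (n, 1, m) @ (n, m, k); the (n, 1, k) product broadcasts over actions.
grad_log_pi = phi - pi[:, None, :] @ phi

# Reduce the per-pair gradients to a single (k,) vector, in two equivalent ways.
mean_grad = grad_log_pi.mean(axis=(0, 1))
mean_grad_flat = grad_log_pi.reshape(n * m, k).mean(axis=0)

# Finite-difference gradient of the scalar mean of ln pi over all n*m pairs.
eps = 1e-6
numeric = np.zeros(k)
for j in range(k):
    step = np.zeros((k, 1))
    step[j, 0] = eps
    numeric[j] = (log_policy(theta + step).mean() - log_policy(theta - step).mean()) / (2 * eps)

print(grad_log_pi.shape, mean_grad.shape)
print("max abs difference:", np.abs(mean_grad - numeric).max())
```

Replacing `.mean` with `.sum` in both places gives the gradient of the summed log-probabilities instead; which reduction is appropriate depends on the scalar objective being differentiated.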
