Two metrics for policy gradient

Metric 1: Average value

The first metric is the average state value, or simply the average value, denoted $\bar{v}_\pi$:

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$$

  • $\bar{v}_\pi$ is a weighted average of the state values.
  • $d(s) \geq 0$ is the weight for state $s$. Since $\sum_{s \in \mathcal{S}} d(s) = 1$, we can interpret $d(s)$ as a probability distribution.
  • Then, the metric can be written as $\bar{v}_\pi = \mathbb{E}[v_\pi(S)]$, where $S \sim d$ (a small numerical sketch follows after this list).
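
As a concrete illustration, here is a minimal sketch (not from the original notes; all numbers are made up) of this weighted average for a toy problem with three states, assuming the distribution `d` and the state values `v_pi` are given:

```python
import numpy as np

# Toy example with 3 states: d is a probability distribution over the states,
# v_pi contains the state values v_pi(s) under some fixed policy pi.
d = np.array([0.2, 0.5, 0.3])      # weights d(s): nonnegative and summing to 1
v_pi = np.array([1.0, 4.0, 2.5])   # state values v_pi(s)

# Average state value: v_bar = sum_s d(s) v_pi(s) = E[v_pi(S)] with S ~ d.
v_bar = np.dot(d, v_pi)
print(v_bar)                       # 0.2*1.0 + 0.5*4.0 + 0.3*2.5 = 2.95
```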

How to select the distribution $d$?

Case 1: $d$ is independent of the policy $\pi$.

  • It is then relatively easy to calculate the gradient of the metric, because $d$ does not depend on $\theta$.
  • In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.
  • How to select $d_0$?
    • Uniform (all states equally weighted): one trivial way is to treat all the states as equally important and hence select $d_0(s) = 1/|\mathcal{S}|$.
    • One-hot (only one state matters): another important case is that we are only interested in a specific state $s_0$. For example, the episodes in some tasks always start from the same state $s_0$, so we only care about the long-term return starting from $s_0$. In this case, $d_0(s_0) = 1$ and $d_0(s) = 0$ for $s \neq s_0$ (both choices are shown in the sketch after this list).
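
For concreteness, a small sketch of the two ways to pick $d_0$ (the state-space size of 4 and the index of the state of interest are arbitrary choices for illustration):

```python
import numpy as np

n_states = 4                        # |S|, chosen arbitrarily for illustration

# Uniform: every state gets the same weight 1/|S|.
d0_uniform = np.full(n_states, 1.0 / n_states)   # [0.25, 0.25, 0.25, 0.25]

# One-hot: only the state of interest s0 (index 0 here) gets weight 1.
s0 = 0
d0_onehot = np.zeros(n_states)
d0_onehot[s0] = 1.0                 # [1, 0, 0, 0]

# With the one-hot choice, the metric reduces to v_pi(s0) itself.
```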

Case 2: $d$ depends on the policy $\pi$.

  • A common way is to select $d$ as $d_\pi(s)$, the stationary distribution under $\pi$, which satisfies $d_\pi^T P_\pi = d_\pi^T$, where $P_\pi$ is the state transition probability matrix under $\pi$ (a sketch for computing $d_\pi$ is given after this list).
  • The interpretation of selecting $d_\pi$ is as follows:
    • $d_\pi$ reflects the long-run behavior of the Markov decision process under a given policy $\pi$.
    • If one state is frequently visited in the long run, it is more important and deserves more weight.
    • If a state is hardly visited, then we give it less weight.
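
The stationary distribution can be computed from the state transition matrix $P_\pi$ induced by the policy. Below is a minimal sketch (the matrix `P_pi` is made up) that solves $d_\pi^T P_\pi = d_\pi^T$ by power iteration; it assumes the chain is ergodic so that the iteration converges:

```python
import numpy as np

# Made-up state transition matrix under a fixed policy pi:
# P_pi[s, s2] = probability of moving from state s to state s2 in one step.
P_pi = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
])

# Power iteration: repeatedly apply d^T <- d^T P_pi until it stops changing.
d_pi = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])  # start from the uniform distribution
for _ in range(1000):
    d_pi = d_pi @ P_pi

print(d_pi)                  # stationary distribution: d_pi @ P_pi ≈ d_pi
print(d_pi @ P_pi - d_pi)    # ≈ 0

# States that are visited more often in the long run receive larger weights d_pi(s).
```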

An important equivalent expression of average value

You will see the following metric often in the literature:

$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]$$

  • Question: What is its relationship to the metric $\bar{v}_\pi$ we introduced just now?
  • Answer: They are the same. That is because

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s\right] = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) = \bar{v}_\pi.$$

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward, denoted $\bar{r}_\pi$:

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}[r_\pi(S)], \quad S \sim d_\pi,$$

where

$$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s, a), \qquad r(s, a) = \mathbb{E}[R \mid s, a] = \sum_{r} r\, p(r \mid s, a).$$

Remarks:

  • $\bar{r}_\pi$ is simply a weighted average of the immediate rewards.
  • $r_\pi(s)$ is the average immediate reward that can be obtained from state $s$.
  • $d_\pi(s)$ is the stationary distribution under policy $\pi$ (a numerical sketch follows after these remarks).
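
Putting the pieces together, here is a small sketch (with a made-up two-state, two-action MDP; the stationary distribution is assumed to be already computed, e.g. by power iteration as above) of how $\bar{r}_\pi$ is obtained from $\pi(a|s)$, $r(s,a)$, and $d_\pi$:

```python
import numpy as np

# Made-up MDP with 2 states and 2 actions, used only for illustration.
r_sa = np.array([                  # r_sa[s, a] = r(s, a), expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([                    # pi[s, a] = pi(a|s)
    [0.9, 0.1],
    [0.2, 0.8],
])
d_pi = np.array([0.4, 0.6])        # stationary distribution under pi (assumed given)

# r_pi(s) = sum_a pi(a|s) r(s, a): the average immediate reward from state s.
r_pi = (pi * r_sa).sum(axis=1)     # [0.9, 1.6]

# Average reward: r_bar = sum_s d_pi(s) r_pi(s).
r_bar = d_pi @ r_pi
print(r_bar)                       # 0.4*0.9 + 0.6*1.6 = 1.32
```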

An important equivalent expression of average reward

Suppose an agent follows a given policy and generates a trajectory with the rewards $(R_{t+1}, R_{t+2}, \ldots)$.

The average single-step reward along this trajectory is

$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\Big|\, S_t = s_0\right],$$

where $s_0$ is the starting state of the trajectory.

An important fact (for more detail, see the proof) is that the starting state $s_0$ does not matter:

$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\Big|\, S_t = s_0\right] = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right] = \sum_{s} d_\pi(s)\, r_\pi(s) = \bar{r}_\pi.$$

What does this fact tell us?

It tells us a convenient way to compute the average reward quickly: the expectation in this expression is exactly what makes a Monte Carlo estimate natural, as sketched below. Building on this, we will eventually arrive, step by step, at the famous REINFORCE algorithm, which is in fact Monte Carlo policy gradient.
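
To make this concrete, here is a minimal Monte Carlo sketch (the three-state, two-action MDP and all numbers are made up) that estimates $\bar{r}_\pi$ by averaging the rewards collected along one long trajectory and compares the estimate with the exact value $d_\pi^T r_\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up MDP with 3 states and 2 actions; rewards are deterministic for simplicity.
P = np.array([                     # P[a, s, s2] = p(s2 | s, a)
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.1, 0.4, 0.5], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]],
])
r_sa = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])   # r(s, a)
pi = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])     # pi(a | s)

# Exact average reward: r_bar = d_pi^T r_pi.
P_pi = np.einsum("sa,ast->st", pi, P)                   # P_pi[s, s2] = sum_a pi(a|s) p(s2|s,a)
r_pi = (pi * r_sa).sum(axis=1)                          # r_pi(s) = sum_a pi(a|s) r(s, a)
d_pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):                                   # stationary distribution by power iteration
    d_pi = d_pi @ P_pi
r_bar_exact = d_pi @ r_pi

# Monte Carlo estimate: average the rewards along one long trajectory.
n_steps = 100_000
s = 0                              # starting state; by the fact above it does not matter
total_reward = 0.0
for _ in range(n_steps):
    a = rng.choice(2, p=pi[s])
    total_reward += r_sa[s, a]
    s = rng.choice(3, p=P[a, s])

print("Monte Carlo estimate:", total_reward / n_steps)
print("exact average reward:", r_bar_exact)             # the two should be close
```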

Summary

Remarks:

About $\bar{v}_\pi$ and $\bar{r}_\pi$:

  1. all these metrics are functions of the policy $\pi$
  2. since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$
  3. different values of $\theta$ can generate different metric values
  4. we can search for the optimal value of $\theta$ to maximize these metrics

Basic idea of policy gradient methods

Our policy is represented as a function parameterized by $\theta$, so the search for an optimal policy naturally becomes a search for an optimal $\theta$. To define the optimality of $\theta$, we proposed the two metrics above; the optimal $\theta$ can then be found by gradient-based optimization, e.g., gradient ascent $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$ on the chosen metric $J(\theta)$, as sketched below.
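
Below is a minimal sketch of this idea, assuming a tabular softmax policy $\pi(a|s) \propto \exp(\theta_{s,a})$ on a made-up MDP and using $J(\theta) = \bar{r}_\pi$ as the metric. Since the expression for $\nabla_\theta J$ has not been derived yet, the gradient is approximated here by finite differences; the MDP, the learning rate, and this finite-difference substitute are all illustrative assumptions, meant only to show that the metric is a function of $\theta$ and can be maximized by gradient ascent:

```python
import numpy as np

# Made-up MDP with 3 states and 2 actions, used only to illustrate that
# J(theta) is a function of theta and can be maximized by gradient ascent.
P = np.array([                     # P[a, s, s2] = p(s2 | s, a)
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.1, 0.4, 0.5], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]],
])
r_sa = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])   # r(s, a)
n_states, n_actions = 3, 2

def metric(theta):
    """J(theta) = average reward r_bar under the softmax policy pi(a|s) ∝ exp(theta[s, a])."""
    pi = np.exp(theta)
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,ast->st", pi, P)               # state transition matrix under pi
    d_pi = np.full(n_states, 1.0 / n_states)
    for _ in range(300):                                # stationary distribution by power iteration
        d_pi = d_pi @ P_pi
    r_pi = (pi * r_sa).sum(axis=1)
    return d_pi @ r_pi

theta = np.zeros((n_states, n_actions))
alpha, eps = 0.5, 1e-5
for step in range(200):
    grad = np.zeros_like(theta)
    for i in range(n_states):                           # finite-difference approximation of grad J
        for j in range(n_actions):
            e = np.zeros_like(theta)
            e[i, j] = eps
            grad[i, j] = (metric(theta + e) - metric(theta - e)) / (2 * eps)
    theta += alpha * grad                               # gradient ascent: theta <- theta + alpha * grad J

print("maximized average reward:", metric(theta))
```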

About discounting:

  1. the metrics can be defined in either the discounted case, where $\gamma \in [0, 1)$, or the undiscounted case, where $\gamma = 1$
  2. the undiscounted case is nontrivial
  3. we only consider the discounted case so far.

About the relationship between $\bar{r}_\pi$ and $\bar{v}_\pi$:

  1. the two metrics are equivalent (not equal) to each other
  2. specifically, in the discounted case where $\gamma < 1$, it holds that $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$
  3. they can be maximized simultaneously

About the relationship $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$

First, recall the two definitions (with $d = d_\pi$ in $\bar{v}_\pi$):

$$\bar{v}_\pi = \sum_{s} d_\pi(s)\, v_\pi(s) = d_\pi^T v_\pi, \qquad \bar{r}_\pi = \sum_{s} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$

The derivation is then as follows. Start from the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ and multiply both sides on the left by $d_\pi^T$:

$$d_\pi^T v_\pi = d_\pi^T r_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi,$$

where the second equality uses the stationarity property $d_\pi^T P_\pi = d_\pi^T$. So we have

$$\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi.$$
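
As a quick sanity check of this relationship, here is a small numerical sketch on a randomly generated MDP (the sizes and the value of $\gamma$ are arbitrary; $d$ in $\bar{v}_\pi$ is taken to be the stationary distribution $d_\pi$):

```python
import numpy as np

# Numerical check of r_bar = (1 - gamma) * v_bar on a randomly generated MDP.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)             # p(s2 | s, a): each row sums to 1
r_sa = rng.random((n_states, n_actions))      # r(s, a)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)           # pi(a | s)

P_pi = np.einsum("sa,ast->st", pi, P)         # state transition matrix under pi
r_pi = (pi * r_sa).sum(axis=1)                # r_pi(s)

# v_pi from the Bellman equation v = r_pi + gamma * P_pi v.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Stationary distribution d_pi by power iteration.
d_pi = np.full(n_states, 1.0 / n_states)
for _ in range(2000):
    d_pi = d_pi @ P_pi

v_bar = d_pi @ v_pi
r_bar = d_pi @ r_pi
print(r_bar, (1 - gamma) * v_bar)             # the two numbers agree up to numerical error
```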