Two metrics for policy gradient

Metric 1: Average value

The first metric is the average state value, or simply the average value, denoted $\bar{v}_\pi$:

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$$

  • $\bar{v}_\pi$ is a weighted average of the state values.
  • $d(s) \geq 0$ is the weight for state $s$. Since $\sum_{s \in \mathcal{S}} d(s) = 1$, we can interpret $d(s)$ as a probability distribution.
  • Then, the metric can be written as $\bar{v}_\pi = \mathbb{E}[v_\pi(S)]$, where $S \sim d$ (a small numerical sketch follows after this list).
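
As a concrete illustration, here is a minimal sketch (not from the original notes; all numbers are made up) of this weighted average for a toy problem with three states, assuming the distribution `d` and the state values `v_pi` are given:

```python
import numpy as np

# Toy example with 3 states: d is a probability distribution over the states,
# v_pi contains the state values v_pi(s) under some fixed policy pi.
d = np.array([0.2, 0.5, 0.3])      # weights d(s): nonnegative and summing to 1
v_pi = np.array([1.0, 4.0, 2.5])   # state values v_pi(s)

# Average state value: v_bar = sum_s d(s) v_pi(s) = E[v_pi(S)] with S ~ d.
v_bar = np.dot(d, v_pi)
print(v_bar)                       # 0.2*1.0 + 0.5*4.0 + 0.3*2.5 = 2.95
```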

How to select the distribution $d$?

Case 1: $d$ is independent of the policy $\pi$.

  • It is then relatively easy to calculate the gradient of the metric, because $d$ does not depend on $\theta$.
  • In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.
  • How to select $d_0$?
    • Uniform (all states equally weighted): one trivial way is to treat all the states as equally important and hence select $d_0(s) = 1/|\mathcal{S}|$.
    • One-hot (only one state matters): another important case is that we are only interested in a specific state $s_0$. For example, the episodes in some tasks always start from the same state $s_0$, so we only care about the long-term return starting from $s_0$. In this case, $d_0(s_0) = 1$ and $d_0(s) = 0$ for $s \neq s_0$ (both choices are shown in the sketch after this list).
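
For concreteness, a small sketch of the two ways to pick $d_0$ (the state-space size of 4 and the index of the state of interest are arbitrary choices for illustration):

```python
import numpy as np

n_states = 4                        # |S|, chosen arbitrarily for illustration

# Uniform: every state gets the same weight 1/|S|.
d0_uniform = np.full(n_states, 1.0 / n_states)   # [0.25, 0.25, 0.25, 0.25]

# One-hot: only the state of interest s0 (index 0 here) gets weight 1.
s0 = 0
d0_onehot = np.zeros(n_states)
d0_onehot[s0] = 1.0                 # [1, 0, 0, 0]

# With the one-hot choice, the metric reduces to v_pi(s0) itself.
```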

Case 2: $d$ depends on the policy $\pi$.

  • A common way is to select $d$ as $d_\pi(s)$, the stationary distribution under $\pi$, which satisfies $d_\pi^T P_\pi = d_\pi^T$, where $P_\pi$ is the state transition probability matrix under $\pi$ (a sketch for computing $d_\pi$ is given after this list).
  • The interpretation of selecting $d_\pi$ is as follows:
    • $d_\pi$ reflects the long-run behavior of the Markov decision process under a given policy $\pi$.
    • If one state is frequently visited in the long run, it is more important and deserves more weight.
    • If a state is hardly visited, then we give it less weight.
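
The stationary distribution can be computed from the state transition matrix $P_\pi$ induced by the policy. Below is a minimal sketch (the matrix `P_pi` is made up) that solves $d_\pi^T P_\pi = d_\pi^T$ by power iteration; it assumes the chain is ergodic so that the iteration converges:

```python
import numpy as np

# Made-up state transition matrix under a fixed policy pi:
# P_pi[s, s2] = probability of moving from state s to state s2 in one step.
P_pi = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
])

# Power iteration: repeatedly apply d^T <- d^T P_pi until it stops changing.
d_pi = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])  # start from the uniform distribution
for _ in range(1000):
    d_pi = d_pi @ P_pi

print(d_pi)                  # stationary distribution: d_pi @ P_pi ≈ d_pi
print(d_pi @ P_pi - d_pi)    # ≈ 0

# States that are visited more often in the long run receive larger weights d_pi(s).
```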

An important equivalent expression of average value

You will see the following metric often in the literature:

$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]$$

  • Question: What is its relationship to the metric $\bar{v}_\pi$ we introduced just now?
  • Answer: They are the same. That is because

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s\right] = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) = \bar{v}_\pi.$$

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward, denoted $\bar{r}_\pi$:

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}[r_\pi(S)], \quad S \sim d_\pi,$$

where

$$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s, a), \qquad r(s, a) = \mathbb{E}[R \mid s, a] = \sum_{r} r\, p(r \mid s, a).$$

Remarks:

  • $\bar{r}_\pi$ is simply a weighted average of the immediate rewards.
  • $r_\pi(s)$ is the average immediate reward that can be obtained from state $s$.
  • $d_\pi(s)$ is the stationary distribution under policy $\pi$ (a numerical sketch follows after these remarks).
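
Putting the pieces together, here is a small sketch (with a made-up two-state, two-action MDP; the stationary distribution is assumed to be already computed, e.g. by power iteration as above) of how $\bar{r}_\pi$ is obtained from $\pi(a|s)$, $r(s,a)$, and $d_\pi$:

```python
import numpy as np

# Made-up MDP with 2 states and 2 actions, used only for illustration.
r_sa = np.array([                  # r_sa[s, a] = r(s, a), expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([                    # pi[s, a] = pi(a|s)
    [0.9, 0.1],
    [0.2, 0.8],
])
d_pi = np.array([0.4, 0.6])        # stationary distribution under pi (assumed given)

# r_pi(s) = sum_a pi(a|s) r(s, a): the average immediate reward from state s.
r_pi = (pi * r_sa).sum(axis=1)     # [0.9, 1.6]

# Average reward: r_bar = sum_s d_pi(s) r_pi(s).
r_bar = d_pi @ r_pi
print(r_bar)                       # 0.4*0.9 + 0.6*1.6 = 1.32
```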

An important equivalent expression of average reward

Suppose an agent follows a given policy and generates a trajectory with the rewards $(R_{t+1}, R_{t+2}, \ldots)$.

The average single-step reward along this trajectory is

$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\Big|\, S_t = s_0\right],$$

where $s_0$ is the starting state of the trajectory.

An important fact (for more detail, see the proof) is that the starting state $s_0$ does not matter:

$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\Big|\, S_t = s_0\right] = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right] = \sum_{s} d_\pi(s)\, r_\pi(s) = \bar{r}_\pi.$$

What does this fact tell us?

It tells us a convenient way to compute the average reward quickly: the expectation in this expression is exactly what makes a Monte Carlo estimate natural, as sketched below. Building on this, we will eventually arrive, step by step, at the famous REINFORCE algorithm, which is in fact Monte Carlo policy gradient.
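
To make this concrete, here is a minimal Monte Carlo sketch (the three-state, two-action MDP and all numbers are made up) that estimates $\bar{r}_\pi$ by averaging the rewards collected along one long trajectory and compares the estimate with the exact value $d_\pi^T r_\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up MDP with 3 states and 2 actions; rewards are deterministic for simplicity.
P = np.array([                     # P[a, s, s2] = p(s2 | s, a)
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.1, 0.4, 0.5], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]],
])
r_sa = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])   # r(s, a)
pi = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])     # pi(a | s)

# Exact average reward: r_bar = d_pi^T r_pi.
P_pi = np.einsum("sa,ast->st", pi, P)                   # P_pi[s, s2] = sum_a pi(a|s) p(s2|s,a)
r_pi = (pi * r_sa).sum(axis=1)                          # r_pi(s) = sum_a pi(a|s) r(s, a)
d_pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):                                   # stationary distribution by power iteration
    d_pi = d_pi @ P_pi
r_bar_exact = d_pi @ r_pi

# Monte Carlo estimate: average the rewards along one long trajectory.
n_steps = 100_000
s = 0                              # starting state; by the fact above it does not matter
total_reward = 0.0
for _ in range(n_steps):
    a = rng.choice(2, p=pi[s])
    total_reward += r_sa[s, a]
    s = rng.choice(3, p=P[a, s])

print("Monte Carlo estimate:", total_reward / n_steps)
print("exact average reward:", r_bar_exact)             # the two should be close
```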

Summary

Remarks:

About $\bar{v}_\pi$ and $\bar{r}_\pi$:

  1. all these metrics are functions of the policy $\pi$
  2. since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$
  3. different values of $\theta$ can generate different metric values
  4. we can search for the optimal value of $\theta$ to maximize these metrics

Basic idea of policy gradient methods

Our policy is represented as a function parameterized by $\theta$, so the search for an optimal policy naturally becomes a search for an optimal $\theta$. To define the optimality of $\theta$, we proposed the two metrics above; the optimal $\theta$ can then be found by gradient-based optimization, e.g., gradient ascent $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$ on the chosen metric $J(\theta)$, as sketched below.
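
Below is a minimal sketch of this idea, assuming a tabular softmax policy $\pi(a|s) \propto \exp(\theta_{s,a})$ on a made-up MDP and using $J(\theta) = \bar{r}_\pi$ as the metric. Since the expression for $\nabla_\theta J$ has not been derived yet, the gradient is approximated here by finite differences; the MDP, the learning rate, and this finite-difference substitute are all illustrative assumptions, meant only to show that the metric is a function of $\theta$ and can be maximized by gradient ascent:

```python
import numpy as np

# Made-up MDP with 3 states and 2 actions, used only to illustrate that
# J(theta) is a function of theta and can be maximized by gradient ascent.
P = np.array([                     # P[a, s, s2] = p(s2 | s, a)
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.1, 0.4, 0.5], [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]],
])
r_sa = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])   # r(s, a)
n_states, n_actions = 3, 2

def metric(theta):
    """J(theta) = average reward r_bar under the softmax policy pi(a|s) ∝ exp(theta[s, a])."""
    pi = np.exp(theta)
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,ast->st", pi, P)               # state transition matrix under pi
    d_pi = np.full(n_states, 1.0 / n_states)
    for _ in range(300):                                # stationary distribution by power iteration
        d_pi = d_pi @ P_pi
    r_pi = (pi * r_sa).sum(axis=1)
    return d_pi @ r_pi

theta = np.zeros((n_states, n_actions))
alpha, eps = 0.5, 1e-5
for step in range(200):
    grad = np.zeros_like(theta)
    for i in range(n_states):                           # finite-difference approximation of grad J
        for j in range(n_actions):
            e = np.zeros_like(theta)
            e[i, j] = eps
            grad[i, j] = (metric(theta + e) - metric(theta - e)) / (2 * eps)
    theta += alpha * grad                               # gradient ascent: theta <- theta + alpha * grad J

print("maximized average reward:", metric(theta))
```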

About discounting:

  1. the metrics can be defined in either the discounted case, where $\gamma \in [0, 1)$, or the undiscounted case, where $\gamma = 1$
  2. the undiscounted case is nontrivial
  3. we only consider the discounted case so far.

About the relationship between $\bar{r}_\pi$ and $\bar{v}_\pi$:

  1. the two metrics are equivalent (not equal) to each other
  2. specifically, in the discounted case where $\gamma < 1$, it holds that $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$
  3. they can be maximized simultaneously

About the relationship $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$

First, recall the two definitions (with $d = d_\pi$ in $\bar{v}_\pi$):

$$\bar{v}_\pi = \sum_{s} d_\pi(s)\, v_\pi(s) = d_\pi^T v_\pi, \qquad \bar{r}_\pi = \sum_{s} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$

The derivation is then as follows. Start from the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ and multiply both sides on the left by $d_\pi^T$:

$$d_\pi^T v_\pi = d_\pi^T r_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi,$$

where the second equality uses the stationarity property $d_\pi^T P_\pi = d_\pi^T$. So we have

$$\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi.$$
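
As a quick sanity check of this relationship, here is a small numerical sketch on a randomly generated MDP (the sizes and the value of $\gamma$ are arbitrary; $d$ in $\bar{v}_\pi$ is taken to be the stationary distribution $d_\pi$):

```python
import numpy as np

# Numerical check of r_bar = (1 - gamma) * v_bar on a randomly generated MDP.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)             # p(s2 | s, a): each row sums to 1
r_sa = rng.random((n_states, n_actions))      # r(s, a)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)           # pi(a | s)

P_pi = np.einsum("sa,ast->st", pi, P)         # state transition matrix under pi
r_pi = (pi * r_sa).sum(axis=1)                # r_pi(s)

# v_pi from the Bellman equation v = r_pi + gamma * P_pi v.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Stationary distribution d_pi by power iteration.
d_pi = np.full(n_states, 1.0 / n_states)
for _ in range(2000):
    d_pi = d_pi @ P_pi

v_bar = d_pi @ v_pi
r_bar = d_pi @ r_pi
print(r_bar, (1 - gamma) * v_bar)             # the two numbers agree up to numerical error
```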