Two metrics for policy gradient
Metric 1: Average value
The first metric is the average state value, or simply the average value, denoted $\bar{v}_\pi$:
$$\bar{v}_\pi \doteq \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$$
- $\bar{v}_\pi$ is a weighted average of the state values.
- $d(s) \ge 0$ is the weight for state $s$. Since $\sum_{s \in \mathcal{S}} d(s) = 1$, we can interpret $d(s)$ as a probability distribution.
- Then, the metric can be written as $\bar{v}_\pi = \mathbb{E}[v_\pi(S)]$, where $S \sim d$.
How to select the distribution $d$?
Case 1: $d$ is independent of the policy $\pi$.
- The gradient of the metric is easier to calculate in this case, because the weights $d(s)$ do not depend on $\theta$.
- We then specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.
- How to select $d_0$?
- Uniform (all states weighted equally): one trivial way is to treat all states as equally important and hence select $d_0(s) = 1/|\mathcal{S}|$.
- Onehot (only one state): another important case is that we are only interested in a specific state $s_0$. For example, the episodes in some tasks always start from the same state $s_0$, so we only care about the long-term return starting from $s_0$. In this case, $d_0(s_0) = 1$ and $d_0(s) = 0$ for all $s \ne s_0$. A small numerical sketch of both choices follows.
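To make the two choices concrete, here is a minimal sketch (not from the original notes) that computes $\bar{v}_\pi^0$ for a hypothetical 4-state problem; the state values `v_pi` are assumed to be given.

```python
import numpy as np

# Hypothetical state values v_pi(s) for a 4-state problem (assumed to be given).
v_pi = np.array([3.0, 1.0, 2.0, 4.0])
n_states = len(v_pi)

# Uniform weights: every state is treated as equally important.
d0_uniform = np.full(n_states, 1.0 / n_states)
v_bar_uniform = d0_uniform @ v_pi      # = mean of the state values

# One-hot weights: only the starting state s0 matters.
s0 = 0
d0_onehot = np.zeros(n_states)
d0_onehot[s0] = 1.0
v_bar_onehot = d0_onehot @ v_pi        # = v_pi(s0)

print(v_bar_uniform, v_bar_onehot)     # 2.5 3.0
```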
Case 2: $d$ depends on the policy $\pi$.
- A common way is to select $d$ as $d_\pi(s)$, the stationary distribution under $\pi$, which satisfies $d_\pi^T P_\pi = d_\pi^T$, where $P_\pi$ is the state transition probability matrix. (A sketch of how to compute $d_\pi$ is given after this list.)
- The interpretation of selecting $d_\pi$ is as follows:
- $d_\pi$ reflects the long-run behavior of the Markov decision process under a given policy $\pi$.
- If one state is frequently visited in the long run, it is more important and deserves more weight.
- If a state is hardly visited, then we give it less weight.
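As a reference, a minimal sketch (not from the original notes) of computing $d_\pi$ by power iteration, assuming the state transition matrix `P_pi` under a fixed policy is available; the matrix below is hypothetical.

```python
import numpy as np

# Hypothetical state transition matrix under a fixed policy:
# P_pi[s, s'] = probability of moving from state s to state s'.
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.2, 0.3, 0.5],
                 [0.4, 0.0, 0.6]])

# Power iteration: repeatedly apply d^T <- d^T P_pi until convergence.
d = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])   # start from the uniform distribution
for _ in range(10_000):
    d_next = d @ P_pi
    if np.max(np.abs(d_next - d)) < 1e-12:
        break
    d = d_next

print(d)          # stationary distribution d_pi, satisfying d_pi^T P_pi = d_pi^T
print(d.sum())    # ~1.0
```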
An important equivalent expression of average value
You will often see the following metric in the literature:
$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$$
- Question: what is its relationship to the metric $\bar{v}_\pi$ introduced just now?
- Answer: they are the same. That is because
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right] = \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right] = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) = \bar{v}_\pi,$$
where the first equality conditions on the initial state $S_0 \sim d$ and the second uses the definition of the state value $v_\pi(s)$. A numerical check of this equivalence is sketched below.
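The equivalence also suggests a sampling-based estimate of $\bar{v}_\pi$: draw $S_0 \sim d$, roll the policy out, and average the discounted returns. A minimal sketch under a hypothetical 3-state chain (not from the original notes); the exact value $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ is computed for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical Markov chain induced by a fixed policy, with expected rewards r_pi(s).
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])
d = np.full(3, 1.0 / 3.0)                 # d chosen independent of the policy (uniform)

# Exact value: v_pi = (I - gamma * P_pi)^{-1} r_pi, then bar v_pi = sum_s d(s) v_pi(s).
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
v_bar_exact = d @ v_pi

# Monte Carlo: sample S_0 ~ d, roll out, and average the (truncated) discounted returns.
returns = []
for _ in range(5_000):
    s = rng.choice(3, p=d)
    g, discount = 0.0, 1.0
    for _ in range(150):                  # truncate the infinite sum (gamma^150 is negligible)
        g += discount * r_pi[s]           # expected reward at each visited state
        discount *= gamma
        s = rng.choice(3, p=P_pi[s])
    returns.append(g)

print(v_bar_exact, np.mean(returns))      # the two values should be close
```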
Metric 2: Average reward
The second metric is the average one-step reward, or simply the average reward, denoted $\bar{r}_\pi$:
$$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}[r_\pi(S)], \quad S \sim d_\pi,$$
where
$$r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a), \qquad r(s, a) \doteq \mathbb{E}[R \mid s, a].$$
Remarks:
- $\bar{r}_\pi$ is simply a weighted average of the immediate rewards.
- $r_\pi(s)$ is the average immediate reward that can be obtained starting from state $s$.
- $d_\pi$ is the stationary distribution under policy $\pi$. (A small numerical sketch follows.)
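As a small illustration (not from the notes), the following sketch computes $r_\pi(s)$ and $\bar{r}_\pi$ for a hypothetical 2-state, 2-action problem; `d_pi` is assumed to have been computed as in the earlier power-iteration sketch.

```python
import numpy as np

# Hypothetical 2-state, 2-action problem.
pi = np.array([[0.7, 0.3],      # pi(a|s): row s, column a
               [0.4, 0.6]])
r_sa = np.array([[1.0, 0.0],    # r(s, a) = E[R | s, a]
                 [0.5, 2.0]])
d_pi = np.array([0.25, 0.75])   # stationary distribution (assumed already computed)

# r_pi(s) = sum_a pi(a|s) r(s, a)
r_pi = (pi * r_sa).sum(axis=1)

# bar r_pi = sum_s d_pi(s) r_pi(s)
r_bar = d_pi @ r_pi
print(r_pi, r_bar)
```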
An important equivalent expression of average reward
Suppose an agent follows a given policy and generates a trajectory with rewards $(R_{t+1}, R_{t+2}, \ldots)$.
The average single-step reward along this trajectory is
$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\middle|\, S_t = s_0\right],$$
where $s_0$ is the starting state of the trajectory.
An important fact (for more detail, see the proof) is that
$$\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k} \,\middle|\, S_t = s_0\right] = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right] = \sum_{s} d_\pi(s)\, r_\pi(s) = \bar{r}_\pi.$$
What does this fact tell us? It gives us a convenient way to compute the average reward: the starting state does not matter in the long run, and the expectation is exactly what a Monte Carlo method estimates by sampling, so $\bar{r}_\pi$ can be approximated by averaging the rewards along a single long trajectory (see the sketch below). We will later see, step by step, that the well-known REINFORCE algorithm is in fact Monte Carlo policy gradient.
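A minimal sketch of this Monte Carlo idea under the same hypothetical chain as in the power-iteration sketch (none of this is from the original notes): simulate one long trajectory under the policy and average the rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Markov chain induced by a fixed policy, with expected rewards r_pi(s).
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.2, 0.3, 0.5],
                 [0.4, 0.0, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])

# Monte Carlo: average the rewards collected along a single long trajectory.
s = 0                                   # starting state; in the long run it does not matter
n_steps = 200_000
total = 0.0
for _ in range(n_steps):
    total += r_pi[s]                    # (sampling a reward with mean r_pi(s) would also work)
    s = rng.choice(3, p=P_pi[s])
r_bar_mc = total / n_steps

# Exact value for comparison: bar r_pi = d_pi^T r_pi, with d_pi from power iteration.
d_pi = np.full(3, 1.0 / 3.0)
for _ in range(10_000):
    d_pi = d_pi @ P_pi
r_bar_exact = d_pi @ r_pi

print(r_bar_mc, r_bar_exact)            # the two values should be close
```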
Summary
Remarks:
About $\bar{v}_\pi$ and $\bar{r}_\pi$:
- all these metrics are functions of $\pi$
- since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$
- different values of $\theta$ can generate different metric values
- we can search for the optimal value of $\theta$ to maximize these metrics
Basic idea of policy gradient methods
Our policy is represented as a function parameterized by $\theta$, so searching for the optimal policy naturally becomes searching for the optimal $\theta$; to define the optimality of $\theta$, we introduced the two metrics above. Policy gradient methods then maximize a chosen metric $J(\theta)$ by gradient ascent: $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$.
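To illustrate the idea on the simplest possible case, here is a hypothetical sketch (not from the notes): a one-state problem (a 2-armed bandit) with a softmax policy, where the metric reduces to $J(\theta) = \sum_a \pi_\theta(a)\, r(a)$ and its exact gradient is ascended directly.

```python
import numpy as np

# Hypothetical toy problem: one state, two actions with known expected rewards r(a).
# Softmax policy: pi_theta(a) = exp(theta_a) / sum_b exp(theta_b).
r = np.array([1.0, 2.0])
theta = np.zeros(2)
alpha = 0.1

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(2000):
    p = pi(theta)
    J = p @ r                        # J(theta) = sum_a pi_theta(a) r(a)
    grad = p * (r - J)               # exact gradient: dJ/dtheta_a = pi_theta(a) (r(a) - J)
    theta = theta + alpha * grad     # gradient ascent: theta <- theta + alpha * grad J(theta)

print(pi(theta))                     # the policy concentrates on the better action (a = 1)
```

In real problems the gradient cannot be computed exactly and is instead estimated from samples, which is where the policy gradient theorem and Monte Carlo estimation come in.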
About discounting:
- the metrics can be defined in either the discounted case, where $\gamma \in (0, 1)$, or the undiscounted case, where $\gamma = 1$
- the undiscounted case is nontrivial
- we only consider the discounted case so far.
About the relationship between $\bar{v}_\pi$ and $\bar{r}_\pi$:
- the two metrics are equivalent (not equal) to each other
- specifically, in the discounted case where $\gamma \in (0, 1)$, it holds that $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$
- they can be maximized simultaneously
About the relationship between $\bar{r}_\pi$ and $\bar{v}_\pi$
First, recall the two definitions in vector form:
$$\bar{v}_\pi = d_\pi^T v_\pi, \qquad \bar{r}_\pi = d_\pi^T r_\pi,$$
where $v_\pi$ and $r_\pi$ are the vectors of state values and expected one-step rewards, and $d_\pi$ satisfies $d_\pi^T P_\pi = d_\pi^T$.
Then the derivation is as follows. Substituting the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ gives
$$\bar{v}_\pi = d_\pi^T v_\pi = d_\pi^T r_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi = \bar{r}_\pi + \gamma\, \bar{v}_\pi.$$
So we have $\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi$.
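A quick numerical check of this identity, using the same hypothetical chain as in the earlier sketches (a sketch, not part of the original notes):

```python
import numpy as np

gamma = 0.9

# Hypothetical Markov chain induced by a fixed policy, with expected rewards r_pi(s).
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.2, 0.3, 0.5],
                 [0.4, 0.0, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])

# Stationary distribution d_pi (power iteration) and state values v_pi (Bellman equation).
d_pi = np.full(3, 1.0 / 3.0)
for _ in range(10_000):
    d_pi = d_pi @ P_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

r_bar = d_pi @ r_pi                    # bar r_pi
v_bar = d_pi @ v_pi                    # bar v_pi with d = d_pi

print(r_bar, (1 - gamma) * v_bar)      # the two numbers agree: bar r_pi = (1 - gamma) bar v_pi
```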