Deterministic Policy Gradient (DPG)
Introduction
Note
Every $\mu$ appearing in this section denotes a mapping (the deterministic policy), not the on-policy distribution mentioned in earlier chapters.
Up to now, the policies used in the policy gradient methods have all been stochastic, since $\pi(a|s,\theta) > 0$ for every $(s, a)$.
Can we use deterministic policies in policy gradient methods? The benefit is that deterministic policies can naturally handle continuous action spaces.
The ways to represent a policy:
- Up to now, a general policy has been denoted as $\pi(a|s,\theta)$, which can be either stochastic or deterministic.
- Now, the deterministic policy is specifically denoted as $a = \mu(s, \theta)$.
- $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$.
- $\mu$ can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$.
- We may write $\mu(s, \theta)$ in short as $\mu(s)$.
Some intuition
Previously, $\pi$ could take a state as input and output a probability distribution over actions, because the actions were discrete (e.g., the robot moves up, down, left, or right); precisely because the action set is discrete, every action has some probability of being selected. But when the action to output is continuous (e.g., the robot's velocity), the problem looks more like regression: we can only output a single velocity rather than sampling among a few candidate velocities, because the velocity is continuous and its domain contains infinitely many values. That is exactly a deterministic policy (see the sketch below).
If we still want action probabilities over a continuous action space, we can use a Gaussian distribution (Section 13.7 of Sutton's book explains how to obtain the probabilities).
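To make the contrast concrete, here is a minimal sketch (assuming PyTorch; the class names and layer sizes are illustrative, not from the lecture) of a stochastic policy over discrete actions versus a deterministic policy $\mu(s, \theta)$ over a continuous action:

```python
# A minimal sketch, assuming PyTorch. Class names and layer sizes are illustrative.
import torch
import torch.nn as nn

class StochasticDiscretePolicy(nn.Module):
    """pi(a|s, theta): outputs a probability for each discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)   # a distribution to sample from

class DeterministicPolicy(nn.Module):
    """a = mu(s, theta): maps a state directly to one continuous action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim))

    def forward(self, s):
        return self.net(s)                          # a single action, no sampling

s = torch.randn(1, 3)                          # an example state with 3 features
probs = StochasticDiscretePolicy(3, 4)(s)      # probabilities of 4 discrete actions
action = DeterministicPolicy(3, 1)(s)          # one continuous action, e.g. a velocity
```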
Theorem of deterministic policy gradient
- The policy gradient theorems introduced before are only valid for stochastic policies.
- If the policy is required to be deterministic, we need to derive a new policy gradient theorem.
- The ideas and procedures are similar.
Consider the metric of average state value in the discounted case:
$$J(\theta) = \mathbb{E}[v_\mu(s)] = \sum_{s\in\mathcal{S}} d_0(s)\, v_\mu(s),$$
where $d_0(s)$ is a probability distribution satisfying $\sum_{s\in\mathcal{S}} d_0(s) = 1$.
How to select $d_0$?
- Same as in the last lecture.
- Normal case: a uniform distribution over states; special cases: a one-hot distribution concentrated on a single starting state of interest, or the stationary distribution of the behavior policy (see the small numeric example below).
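As a tiny numeric illustration of the metric under different choices of $d_0$ (the value vector below is made-up toy data, not from the lecture):

```python
# Toy illustration of J(theta) = sum_s d0(s) * v_mu(s); the values v_mu are made up.
import numpy as np

v_mu = np.array([1.0, 4.0, 2.0, 3.0])          # hypothetical v_mu(s) for 4 states

d0_uniform = np.full(4, 1 / 4)                 # normal case: uniform over states
d0_onehot = np.array([0.0, 1.0, 0.0, 0.0])     # special case: one starting state of interest

print(np.dot(d0_uniform, v_mu))                # 2.5
print(np.dot(d0_onehot, v_mu))                 # 4.0, i.e. simply v_mu of that state
```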
In the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\left(\nabla_a q_\mu(s, a)\right)\Big|_{a=\mu(s)} = \mathbb{E}_{S\sim\rho_\mu}\left[\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\Big|_{a=\mu(S)}\right].$$
Here $\rho_\mu$ is a state distribution under policy $\mu$.
One important difference from the stochastic case:
- The gradient does not involve the distribution of the action $A$.
- As a result, the deterministic policy gradient method is off-policy: the samples used to estimate the gradient need not be generated by the target policy $\mu$.
For details, see Proof. A small numerical check of the gradient expression is sketched below.
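The chain-rule structure of the expression can be verified numerically. Below is a small sketch (assuming PyTorch autograd; the one-dimensional $\mu$ and $q$ chosen here are arbitrary toy functions, not from the lecture) comparing the direct gradient of $q(s, \mu(s, \theta))$ with the product form $\nabla_\theta \mu(s)\,(\nabla_a q(s,a))|_{a=\mu(s)}$:

```python
# A sketch (not from the lecture) checking the deterministic policy gradient expression
# nabla_theta mu(s) * (nabla_a q(s, a))|_{a=mu(s)} on a toy 1-D example with PyTorch autograd.
# Here mu(s) = theta * s and q(s, a) = -(a - 2s)^2 are arbitrary illustrative choices.
import torch

theta = torch.tensor(0.5, requires_grad=True)
s = torch.tensor(1.5)

def mu(s, theta):                 # deterministic policy: a scalar action
    return theta * s

def q(s, a):                      # a made-up action value, differentiable in a
    return -(a - 2.0 * s) ** 2

# Direct gradient of q(s, mu(s, theta)) with respect to theta
direct = torch.autograd.grad(q(s, mu(s, theta)), theta)[0]

# Chain-rule form from the theorem: (d mu / d theta) * (d q / d a) at a = mu(s)
a = mu(s, theta).detach().requires_grad_(True)
dq_da = torch.autograd.grad(q(s, a), a)[0]
dmu_dtheta = torch.autograd.grad(mu(s, theta), theta)[0]
chain = dmu_dtheta * dq_da

print(direct.item(), chain.item())   # the two numbers coincide
```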
Algorithm of deterministic policy gradient
Based on the policy gradient, the gradient-ascent algorithm for maximizing $J(\theta)$ is
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \mathbb{E}_{S\sim\rho_\mu}\left[\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\Big|_{a=\mu(S)}\right].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu(s_t)\left(\nabla_a q_\mu(s_t, a)\right)\Big|_{a=\mu(s_t)}.$$
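In practice, this sample-based ascent step is often implemented by descending the negative of $q(s_t, \mu(s_t, \theta))$ with a standard optimizer, since automatic differentiation applies exactly the chain rule above. A minimal sketch, assuming PyTorch and hypothetical `actor` (for $\mu$) and `critic` (for $q$) modules:

```python
# A hedged sketch of one stochastic gradient-ascent step on theta, assuming PyTorch.
# `actor` plays the role of mu(s, theta) and `critic` the role of q_mu(s, a); both are
# placeholders for whatever function approximators are actually used.
import torch

def actor_update_step(actor, critic, actor_optimizer, s_t):
    # Maximizing q(s_t, mu(s_t, theta)) over theta = minimizing its negative.
    loss = -critic(s_t, actor(s_t)).mean()
    actor_optimizer.zero_grad()
    loss.backward()              # autograd applies the chain rule of the theorem
    actor_optimizer.step()       # only the actor's parameters are updated here
```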
Implementation
Compared with the stochastic actor-critic implementation introduced before:
- $\mu$ is the deterministic target policy.
- $\beta$ is a stochastic behavior policy.
- This is an off-policy implementation where the behavior policy $\beta$ may be different from $\mu$.
- $\beta$ can also be replaced by $\mu$ plus exploration noise.
- How to select the function to represent $q(s, a, w)$?
- Linear function: $q(s, a, w) = \phi^T(s, a)\, w$, where $\phi(s, a)$ is the feature vector.
- Neural networks: the deep deterministic policy gradient (DDPG) method (a rough sketch follows below).
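Below is a rough, simplified sketch (assuming PyTorch) of the kind of off-policy actor-critic update that DDPG-style methods build on: the critic $q(s, a, w)$ is updated toward a TD target $r + \gamma\, q(s', \mu(s'))$, the actor is updated along the deterministic policy gradient, and the behavior policy adds exploration noise to $\mu$. The network sizes, learning rates, and noise scale are illustrative assumptions; the replay buffer and target networks of full DDPG are omitted.

```python
# A rough, simplified sketch (assuming PyTorch) of the off-policy updates that DDPG-style
# methods build on. Network sizes, learning rates, and the noise scale are illustrative;
# the replay buffer and target networks used by full DDPG are omitted.
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim))                     # mu(s, theta)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                             # q(s, a, w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=-1))

def behavior_action(s, noise_std=0.1):
    # Behavior policy beta: the deterministic policy plus exploration noise (off-policy).
    with torch.no_grad():
        return actor(s) + noise_std * torch.randn(s.shape[0], action_dim)

def update(s, a, r, s_next):
    # Critic: TD update toward r + gamma * q(s', mu(s')).
    with torch.no_grad():
        target = r + gamma * q(s_next, actor(s_next))
    critic_loss = (q(s, a) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend q(s, mu(s, theta)), i.e. follow the deterministic policy gradient.
    actor_loss = -q(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```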