Deterministic Policy Gradient (DPG)

Introduction

Note

Every $\mu$ that appears in this section denotes a mapping (the deterministic policy), not the on-policy distribution discussed in earlier chapters.

Up to now, the policies used in the policy gradient methods have all been stochastic, since $\pi(a|s,\theta) > 0$ for every $(s,a)$.

Can we use deterministic policies in policy gradient methods? One benefit is that they can naturally handle continuous action spaces.

The ways to represent a policy:

  • Up to now, a general policy has been denoted as $\pi(a|s,\theta)$, which can be either stochastic or deterministic.
  • Now, the deterministic policy is specifically denoted as $a = \mu(s,\theta)$.
    • $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$.
    • $\mu$ can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$ (see the sketch after this list).
    • We may write $\mu(s,\theta)$ in short as $\mu(s)$.
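As a concrete illustration of the neural-network representation above, here is a minimal PyTorch sketch of a deterministic policy $\mu(s,\theta)$. The class name DeterministicPolicy, the state/action dimensions, and the action bound are illustrative assumptions, not anything specified in the lecture.

```python
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    """mu(s, theta): maps a state directly to one continuous action (no distribution)."""
    def __init__(self, state_dim=3, action_dim=1, action_bound=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),    # squash to (-1, 1)
        )
        self.action_bound = action_bound             # scale to the actual action range

    def forward(self, state):
        return self.action_bound * self.net(state)   # a = mu(s, theta)

mu = DeterministicPolicy()
s = torch.randn(3)   # an example state
a = mu(s)            # a single deterministic action, not a probability distribution
print(a)
```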

Some intuition

Previously, the policy took a state as input and output a probability distribution over actions. That was possible because the actions were discrete (e.g., the robot should move up, down, left, or right), so each action could be assigned some probability of being selected. When the action to be output is continuous (e.g., the robot's velocity), the problem looks more like regression: we output a single velocity rather than a few candidate velocities to sample from, because there are infinitely many velocities in the domain. This is exactly a deterministic policy.

If we still want action probabilities in the continuous-action setting, we can parameterize the policy with a Gaussian distribution (Section 13.7 of Sutton's book explains how to obtain the probabilities), as sketched below.
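For contrast with the deterministic case, the sketch below illustrates the Gaussian parameterization mentioned above: the network outputs the mean (and a learnable standard deviation) of a normal distribution over a continuous action, so action probabilities (densities) remain available. The class name, dimensions, and the state-independent standard deviation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a|s, theta) = N(mean(s), std^2): a stochastic policy over a continuous action."""
    def __init__(self, state_dim=3):
        super().__init__()
        self.mean = nn.Linear(state_dim, 1)           # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(1))   # state-independent log std (one common choice)

    def forward(self, state):
        return torch.distributions.Normal(self.mean(state), self.log_std.exp())

pi = GaussianPolicy()
dist = pi(torch.randn(3))
a = dist.sample()             # actions are sampled, so exploration comes for free
log_prob = dist.log_prob(a)   # density needed by the stochastic policy gradient
```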

Theorem of deterministic policy gradient

  • The policy gradient theorems introduced before are only valid for stochastic policies.
  • If the policy is required to be deterministic, we need to derive a new policy gradient theorem.
  • The ideas and procedures are similar.

Consider the metric of average state value in the discounted case:

$$J(\theta) = \mathbb{E}[v_\mu(s)] = \sum_{s \in \mathcal{S}} d_0(s)\, v_\mu(s),$$

where $d_0(s)$ is a probability distribution satisfying $\sum_{s \in \mathcal{S}} d_0(s) = 1$.

How to select $d_0$?

  • The same as in the last lecture: $d_0$ is selected to be independent of $\mu$.
  • Normal case: a uniform distribution over the states. Special cases: a one-hot distribution concentrated on a state of interest, or the stationary distribution of a behavior policy (a toy computation follows this list).
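To make the metric concrete, the toy computation below uses made-up values of $v_\mu$ for three states and shows how the choice of $d_0$ (uniform versus one-hot) weights them in $J(\theta)$.

```python
import numpy as np

v_mu = np.array([1.0, 4.0, 2.5])          # assumed state values v_mu(s) for 3 states

d0_uniform = np.ones(3) / 3               # normal case: uniform distribution
d0_onehot  = np.array([0.0, 1.0, 0.0])    # special case: only one state is of interest

print(d0_uniform @ v_mu)   # J(theta) = sum_s d0(s) v_mu(s) = 2.5
print(d0_onehot  @ v_mu)   # J(theta) = v_mu of the chosen state = 4.0
```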

In the discounted case where $\gamma \in (0,1)$, the gradient of $J(\theta)$ is

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s) \big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S \sim \rho_\mu}\Big[\nabla_\theta \mu(S) \big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$

Here, $\rho_\mu$ is a state distribution under policy $\mu$.
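Note that the per-state term $\nabla_\theta \mu(s)\,(\nabla_a q_\mu(s,a))|_{a=\mu(s)}$ is just the chain rule applied to $q_\mu(s,\mu(s,\theta))$, which is also how the gradient is typically computed in practice (backpropagating through the critic into the actor). The sketch below checks this on a toy linear policy and a hand-picked differentiable $q$; both are assumptions used only for illustration.

```python
import torch

theta = torch.tensor([0.5, -1.0], requires_grad=True)   # parameters of a toy linear policy
s = torch.tensor([1.0, 2.0])                             # an example state

def mu(s):                       # deterministic policy: a = mu(s, theta) = theta^T s (scalar action)
    return theta @ s

def q(s, a):                     # a hand-picked differentiable action value q(s, a)
    return -(a - s.sum()) ** 2   # maximized at a = s[0] + s[1]

# Backpropagating through q(s, mu(s)) with respect to theta ...
q(s, mu(s)).backward()
print(theta.grad)                # ... gives grad_theta mu(s) * (dq/da)|_{a=mu(s)}

# The same value written out with the chain rule: grad_theta mu(s) = s, dq/da = -2(a - s.sum())
a = (theta @ s).detach()
print(s * (-2.0 * (a - s.sum())))
```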

One important difference from the stochastic case:

  • The gradient does not involve the distribution of the action $A$.
  • As a result, the deterministic policy gradient method is off-policy.

For details, see the Proof.

Algorithm of deterministic policy gradient

Based on the policy gradient, the gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \mathbb{E}_{S \sim \rho_\mu}\Big[\nabla_\theta \mu(S) \big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$

The corresponding stochastic gradient-ascent algorithm is

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu(s_t) \big(\nabla_a q_\mu(s_t,a)\big)\big|_{a=\mu(s_t)}.$$

Implementation

The resulting implementation is a deterministic actor-critic algorithm: the critic $q(s,a,w)$ is updated by TD learning, the actor $\mu(s,\theta)$ is updated by the stochastic gradient-ascent rule above, and the experience samples are generated by a behavior policy $\beta$.

Compared with the stochastic actor-critic implementation:

  • $\mu(s,\theta)$ is the deterministic target policy.
  • $\beta$ is the stochastic behavior policy.
  • This is an off-policy implementation where the behavior policy $\beta$ may be different from $\mu$.
  • $\beta$ can also be replaced by $\mu$.
  • How to select the function to represent $q(s,a,w)$? (A minimal sketch follows this list.)
    • Linear function: $q(s,a,w) = \phi^T(s,a)\, w$, where $\phi(s,a)$ is the feature vector.
    • Neural networks: the deep deterministic policy gradient (DDPG) method.
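As referenced in the list above, here is a minimal sketch of one update step of a deterministic actor-critic, combining the linear critic $q(s,a,w)=\phi^T(s,a)\,w$ with a behavior policy obtained by adding Gaussian exploration noise to $\mu$. The feature map, the stand-in environment, the exploration noise, and all hyperparameters are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- toy setup (all choices below are illustrative assumptions) ----------------
state_dim = 2
theta = rng.normal(size=state_dim)   # actor parameters: mu(s, theta) = theta^T s (scalar action)
w = np.zeros(6)                      # critic parameters for the linear critic q(s, a, w) = phi(s, a)^T w
alpha_theta, alpha_w, gamma, noise_std = 1e-3, 1e-2, 0.9, 0.2

def mu(s, theta):                    # deterministic target policy
    return theta @ s

def phi(s, a):                       # feature vector phi(s, a) for the linear critic
    return np.array([s[0], s[1], a, a * a, a * s[0], a * s[1]])

def q(s, a, w):                      # q(s, a, w) = phi(s, a)^T w
    return phi(s, a) @ w

def grad_a_q(s, a, w):               # (d/da) q(s, a, w), available in closed form for this phi
    return np.array([0.0, 0.0, 1.0, 2.0 * a, s[0], s[1]]) @ w

def env_step(s, a):                  # stand-in environment: reward and next state are made up
    r = -(a - s.sum()) ** 2
    return r, np.clip(s + 0.1 * a, -1.0, 1.0)

# --- one update, following a behavior policy beta = mu + exploration noise -----
s = rng.uniform(-1.0, 1.0, size=state_dim)
a = mu(s, theta) + noise_std * rng.normal()   # behavior policy beta (off-policy)
r, s_next = env_step(s, a)

# critic: one-step TD update of w, bootstrapping with the target policy's action mu(s')
delta = r + gamma * q(s_next, mu(s_next, theta), w) - q(s, a, w)
w = w + alpha_w * delta * phi(s, a)

# actor: gradient ascent along grad_theta mu(s) * (d/da) q(s, a, w) at a = mu(s)
theta = theta + alpha_theta * s * grad_a_q(s, mu(s, theta), w)
```

Replacing the hand-coded linear critic and actor with neural networks (and backpropagating through $q(s,\mu(s,\theta),w)$ as in the earlier chain-rule sketch) is the basic idea behind DDPG.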