Deterministic Policy Gradient (DPG)
Introduction
Note
Every $\mu$ appearing in this section denotes a mapping (the deterministic policy), not the on-policy distribution mentioned in earlier chapters.
Up to now, the policies used in the policy gradient methods have all been stochastic, since $\pi(a|s,\theta) > 0$ for every $(s, a)$.
Can we use deterministic policies in policy gradient methods? The benefit is that deterministic policies can naturally handle continuous action spaces.
The ways to represent a policy:
- Up to now, a general policy has been denoted as $\pi(a|s,\theta)$, which can be either stochastic or deterministic.
- Now, the deterministic policy is specifically denoted as $a = \mu(s, \theta)$.
- $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$.
- $\mu$ can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$.
- We may write $\mu(s, \theta)$ in short as $\mu(s)$.
Some intuition
Previously, $\pi$ could take a state as input and output a probability distribution over actions, because the actions were discrete (e.g., the robot moves up, down, left, or right); precisely because the action set is discrete, every action has some probability of being selected. But when the action to output is continuous (e.g., the robot's velocity), the problem looks more like regression: we can only output a single velocity rather than sampling among a few candidate velocities, because the velocity is continuous and its domain contains infinitely many values. That is exactly a deterministic policy (see the sketch below).
If we still want action probabilities over a continuous action space, we can use a Gaussian distribution (Section 13.7 of Sutton's book explains how to obtain the probabilities).
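To make the contrast concrete, here is a minimal sketch (assuming PyTorch; the class names and layer sizes are illustrative, not from the lecture) of a stochastic policy over discrete actions versus a deterministic policy $\mu(s, \theta)$ over a continuous action:

```python
# A minimal sketch, assuming PyTorch. Class names and layer sizes are illustrative.
import torch
import torch.nn as nn

class StochasticDiscretePolicy(nn.Module):
    """pi(a|s, theta): outputs a probability for each discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)   # a distribution to sample from

class DeterministicPolicy(nn.Module):
    """a = mu(s, theta): maps a state directly to one continuous action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim))

    def forward(self, s):
        return self.net(s)                          # a single action, no sampling

s = torch.randn(1, 3)                          # an example state with 3 features
probs = StochasticDiscretePolicy(3, 4)(s)      # probabilities of 4 discrete actions
action = DeterministicPolicy(3, 1)(s)          # one continuous action, e.g. a velocity
```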
Theorem of deterministic policy gradient
- The policy gradient theorems introduced before are only valid for stochastic policies.
- If the policy is required to be deterministic, we need to derive a new policy gradient theorem.
- The ideas and procedures are similar.
Consider the metric of average state value in the discounted case:
$$J(\theta) = \mathbb{E}[v_\mu(s)] = \sum_{s\in\mathcal{S}} d_0(s)\, v_\mu(s),$$
where $d_0(s)$ is a probability distribution satisfying $\sum_{s\in\mathcal{S}} d_0(s) = 1$.
How to select $d_0$?
- Same as in the last lecture.
- Normal case: a uniform distribution over states; special cases: a one-hot distribution concentrated on a single starting state of interest, or the stationary distribution of the behavior policy (see the small numeric example below).
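As a tiny numeric illustration of the metric under different choices of $d_0$ (the value vector below is made-up toy data, not from the lecture):

```python
# Toy illustration of J(theta) = sum_s d0(s) * v_mu(s); the values v_mu are made up.
import numpy as np

v_mu = np.array([1.0, 4.0, 2.0, 3.0])          # hypothetical v_mu(s) for 4 states

d0_uniform = np.full(4, 1 / 4)                 # normal case: uniform over states
d0_onehot = np.array([0.0, 1.0, 0.0, 0.0])     # special case: one starting state of interest

print(np.dot(d0_uniform, v_mu))                # 2.5
print(np.dot(d0_onehot, v_mu))                 # 4.0, i.e. simply v_mu of that state
```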
In the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\left(\nabla_a q_\mu(s, a)\right)\Big|_{a=\mu(s)} = \mathbb{E}_{S\sim\rho_\mu}\left[\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\Big|_{a=\mu(S)}\right].$$
Here $\rho_\mu$ is a state distribution under policy $\mu$.
One important difference from the stochastic case:
- The gradient does not involve the distribution of the action $A$.
- As a result, the deterministic policy gradient method is off-policy: the samples used to estimate the gradient need not be generated by the target policy $\mu$.
For details, see Proof. A small numerical check of the gradient expression is sketched below.
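The chain-rule structure of the expression can be verified numerically. Below is a small sketch (assuming PyTorch autograd; the one-dimensional $\mu$ and $q$ chosen here are arbitrary toy functions, not from the lecture) comparing the direct gradient of $q(s, \mu(s, \theta))$ with the product form $\nabla_\theta \mu(s)\,(\nabla_a q(s,a))|_{a=\mu(s)}$:

```python
# A sketch (not from the lecture) checking the deterministic policy gradient expression
# nabla_theta mu(s) * (nabla_a q(s, a))|_{a=mu(s)} on a toy 1-D example with PyTorch autograd.
# Here mu(s) = theta * s and q(s, a) = -(a - 2s)^2 are arbitrary illustrative choices.
import torch

theta = torch.tensor(0.5, requires_grad=True)
s = torch.tensor(1.5)

def mu(s, theta):                 # deterministic policy: a scalar action
    return theta * s

def q(s, a):                      # a made-up action value, differentiable in a
    return -(a - 2.0 * s) ** 2

# Direct gradient of q(s, mu(s, theta)) with respect to theta
direct = torch.autograd.grad(q(s, mu(s, theta)), theta)[0]

# Chain-rule form from the theorem: (d mu / d theta) * (d q / d a) at a = mu(s)
a = mu(s, theta).detach().requires_grad_(True)
dq_da = torch.autograd.grad(q(s, a), a)[0]
dmu_dtheta = torch.autograd.grad(mu(s, theta), theta)[0]
chain = dmu_dtheta * dq_da

print(direct.item(), chain.item())   # the two numbers coincide
```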
Algorithm of deterministic policy gradient
Based on the policy gradient, the gradient-ascent algorithm for maximizing $J(\theta)$ is
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \mathbb{E}_{S\sim\rho_\mu}\left[\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\Big|_{a=\mu(S)}\right].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu(s_t)\left(\nabla_a q_\mu(s_t, a)\right)\Big|_{a=\mu(s_t)}.$$
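In practice, this sample-based ascent step is often implemented by descending the negative of $q(s_t, \mu(s_t, \theta))$ with a standard optimizer, since automatic differentiation applies exactly the chain rule above. A minimal sketch, assuming PyTorch and hypothetical `actor` (for $\mu$) and `critic` (for $q$) modules:

```python
# A hedged sketch of one stochastic gradient-ascent step on theta, assuming PyTorch.
# `actor` plays the role of mu(s, theta) and `critic` the role of q_mu(s, a); both are
# placeholders for whatever function approximators are actually used.
import torch

def actor_update_step(actor, critic, actor_optimizer, s_t):
    # Maximizing q(s_t, mu(s_t, theta)) over theta = minimizing its negative.
    loss = -critic(s_t, actor(s_t)).mean()
    actor_optimizer.zero_grad()
    loss.backward()              # autograd applies the chain rule of the theorem
    actor_optimizer.step()       # only the actor's parameters are updated here
```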
Implementation
Compared with the stochastic actor-critic implementation introduced before:
- $\mu$ is the deterministic target policy.
- $\beta$ is a stochastic behavior policy.
- This is an off-policy implementation where the behavior policy $\beta$ may be different from $\mu$.
- $\beta$ can also be replaced by $\mu$ plus exploration noise.
- How to select the function to represent $q(s, a, w)$?
- Linear function: $q(s, a, w) = \phi^T(s, a)\, w$, where $\phi(s, a)$ is the feature vector.
- Neural networks: the deep deterministic policy gradient (DDPG) method (a rough sketch follows below).
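Below is a rough, simplified sketch (assuming PyTorch) of the kind of off-policy actor-critic update that DDPG-style methods build on: the critic $q(s, a, w)$ is updated toward a TD target $r + \gamma\, q(s', \mu(s'))$, the actor is updated along the deterministic policy gradient, and the behavior policy adds exploration noise to $\mu$. The network sizes, learning rates, and noise scale are illustrative assumptions; the replay buffer and target networks of full DDPG are omitted.

```python
# A rough, simplified sketch (assuming PyTorch) of the off-policy updates that DDPG-style
# methods build on. Network sizes, learning rates, and the noise scale are illustrative;
# the replay buffer and target networks used by full DDPG are omitted.
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim))                     # mu(s, theta)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                             # q(s, a, w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=-1))

def behavior_action(s, noise_std=0.1):
    # Behavior policy beta: the deterministic policy plus exploration noise (off-policy).
    with torch.no_grad():
        return actor(s) + noise_std * torch.randn(s.shape[0], action_dim)

def update(s, a, r, s_next):
    # Critic: TD update toward r + gamma * q(s', mu(s')).
    with torch.no_grad():
        target = r + gamma * q(s_next, actor(s_next))
    critic_loss = (q(s, a) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend q(s, mu(s, theta)), i.e. follow the deterministic policy gradient.
    actor_loss = -q(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```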