Proof of Deterministic Policy Gradient
Notations
Since the policy is deterministic, we denote it by $a = \mu(s; \theta)$:
- $\mu$ is the deterministic policy function.
- $\theta$ is the parameter of the policy function.
- $\mu$ is a mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$.
$\pi(a|s; \theta)$ vs. $\mu(s; \theta)$
So far, we have used the notation $\pi(a|s; \theta)$ to represent the stochastic policy,
- $\pi(a|s; \theta)$ is the probability of taking action $a$ in state $s$.
- Input: one state $s$; output: a probability distribution over actions.
and now we use $\mu(s; \theta)$ to represent the deterministic policy.
- $\mu(s; \theta)$ is just the action taken in state $s$.
- Input: one state $s$; output: one action.
we have
$$\pi(a|s; \theta) = \begin{cases} 1, & a = \mu(s; \theta), \\ 0, & \text{otherwise,} \end{cases}$$
i.e. a deterministic policy can be viewed as a "one-hot" stochastic policy.
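To make the contrast concrete, here is a minimal sketch (the linear parameterization and Gaussian noise are illustrative assumptions, not part of the derivation): the stochastic policy returns a different sampled action on each call, while the deterministic policy always returns the same action for the same state.

```python
# Minimal sketch: stochastic vs. deterministic policy.
# The linear form and Gaussian noise are assumptions for illustration only.
import jax
import jax.numpy as jnp

def stochastic_policy(theta, s, key):
    """pi(a|s; theta): defines a distribution over actions; here we sample from it."""
    mean = theta @ s                                   # assumed linear mean
    return mean + jax.random.normal(key, mean.shape)   # a ~ N(mean, I)

def deterministic_policy(theta, s):
    """mu(s; theta): maps the state directly to one action."""
    return theta @ s

s = jnp.array([1.0, -0.5])          # one state
theta = jnp.ones((1, 2))            # policy parameter
print(stochastic_policy(theta, s, jax.random.PRNGKey(0)))  # a sampled action
print(deterministic_policy(theta, s))                      # always the same action
```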
Action value
Remember the stochastic case: $v_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s; \theta)}\big[q_\pi(s, a)\big] = \sum_a \pi(a|s; \theta)\, q_\pi(s, a)$, where $q_\pi(s, a)$ refers to the action value; the expectation is needed because the action is sampled from the stochastic policy. Now the policy is deterministic (or one-hot), so the action is fixed and the $\mathbb{E}$ (or $\sum_a$) operation can be removed:
$$v_\mu(s) = q_\mu\big(s, \mu(s; \theta)\big)$$
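A quick numeric check of the statement above (the action values and the chosen action are made up for illustration): with a one-hot $\pi(\cdot|s)$, the sum over actions collapses to the single entry picked by $\mu(s)$.

```python
import jax.numpy as jnp

q_s = jnp.array([0.3, 1.7, -0.2, 0.5])   # made-up q_mu(s, a) for 4 actions in one state s
mu_s = 1                                  # deterministic policy picks action index 1
pi_s = jnp.zeros(4).at[mu_s].set(1.0)     # the same policy viewed as a one-hot distribution

v_stochastic_form = jnp.sum(pi_s * q_s)   # sum_a pi(a|s) q_mu(s, a)
v_deterministic_form = q_s[mu_s]          # q_mu(s, mu(s)): the sum/expectation is removed
assert jnp.isclose(v_stochastic_form, v_deterministic_form)
```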
Theorem of DPG
As we mentioned before, in the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)} = \mathbb{E}_{s \sim \rho_\mu}\Big[\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}\Big]$$
Here $\rho_\mu$ is the state distribution under policy $\mu$.
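The per-state term $\nabla_\theta \mu(s)\,\big(\nabla_a q_\mu(s, a)\big)\big|_{a=\mu(s)}$ is exactly the chain-rule gradient of $q(s, \mu(s; \theta))$ when $q$ is treated as a fixed function of $(s, a)$. A small autodiff check of that identity, using an assumed linear $\mu$ and an arbitrary smooth $q$ (both purely illustrative):

```python
import jax
import jax.numpy as jnp

def mu(theta, s):                         # assumed deterministic policy: linear in theta
    return theta @ s                      # action in R^2

def q(s, a):                              # some fixed, differentiable action-value function
    return -jnp.sum((a - s[:2]) ** 2)

s = jnp.array([0.5, -1.0, 2.0])
theta = 0.1 * jnp.ones((2, 3))

# Chain rule applied directly by autodiff: grad_theta q(s, mu(s; theta)).
lhs = jax.grad(lambda th: q(s, mu(th, s)))(theta)

# The DPG per-state term: (d mu / d theta) contracted with (grad_a q)|_{a = mu(s)}.
grad_a_q = jax.grad(q, argnums=1)(s, mu(theta, s))   # shape (2,)
jac_mu = jax.jacobian(mu)(theta, s)                  # shape (2, 2, 3): d mu_i / d theta_jk
rhs = jnp.einsum('i,ijk->jk', grad_a_q, jac_mu)

assert jnp.allclose(lhs, rhs)   # the two expressions agree
```

The full theorem additionally weights these per-state terms by the state distribution $\rho_\mu$, which is where the recursion in the derivation below comes in.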
Derivation of DPG
Turn the element-wise relation
$$\nabla_\theta v_\mu(s) = \underbrace{\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}}_{u(s)} + \gamma \sum_{s'} p_\mu(s'|s)\, \nabla_\theta v_\mu(s')$$
into matrix form:
- $n = |\mathcal{S}|$ is the size of the state space
- $k$ is the dimension of the parameter $\theta$
- $\nabla_\theta v_\mu(s)$ is a $(k, 1)$ vector when $s$ is fixed; stacking over all states, $\nabla_\theta v_\mu$ is an $(n, k)$ matrix
- $p_\mu(\cdot|s)$ is an $(n, 1)$ vector when $s$ is fixed; stacking over all states, $P_\mu$ is an $(n, n)$ matrix
- $u(s)$ is a $(k, 1)$ vector when $s$ is fixed; stacking over all states, $u$ is an $(n, k)$ matrix
Then we have:
$$\nabla_\theta v_\mu = u + \gamma P_\mu \nabla_\theta v_\mu \quad\Longrightarrow\quad \nabla_\theta v_\mu = (I_n - \gamma P_\mu)^{-1} u$$
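The closed form uses the fact that $(I_n - \gamma P_\mu)$ is invertible and $(I_n - \gamma P_\mu)^{-1} = \sum_{t \ge 0} \gamma^t P_\mu^t$, since $P_\mu$ is row-stochastic and $\gamma < 1$. A quick numeric sanity check with a made-up 3-state transition matrix and a random $u$ (values are illustrative only):

```python
import jax
import jax.numpy as jnp

n, k, gamma = 3, 2, 0.9
P = jax.random.uniform(jax.random.PRNGKey(0), (n, n))
P = P / P.sum(axis=1, keepdims=True)                  # made-up row-stochastic P_mu
u = jax.random.normal(jax.random.PRNGKey(1), (n, k))  # made-up (n, k) matrix u

# Closed form: grad_theta v_mu = (I_n - gamma * P_mu)^{-1} u.
closed_form = jnp.linalg.solve(jnp.eye(n) - gamma * P, u)

# The same quantity via the (truncated) Neumann series sum_t gamma^t P_mu^t u.
series = sum(gamma**t * jnp.linalg.matrix_power(P, t) @ u for t in range(300))

assert jnp.allclose(closed_form, series, atol=1e-3)
```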
TL;DR
- Just denote the deterministic policy as a one-hot distribution, $\pi(a|s; \theta) = \mathbb{1}[a = \mu(s; \theta)]$, i.e. an $(n, m)$ one-hot matrix (with $m = |\mathcal{A}|$), the same as the shape of $\pi$ before
- $q_\mu(s, a)$ keeps the same shape as $q_\pi(s, a)$ before
Then we can directly use the result in the policy gradient proof:
$$\nabla_\theta J(\theta) = \sum_{s} \rho_\pi(s) \sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a)$$
Just replace $\sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a)$ by $\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$, and divide by the normalization factor of the state distribution.
And we know that actually, when the policy is deterministic and just returns one value, our element-wise form becomes
$$\sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a) = \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$$
Then we have the deterministic policy gradient, unified form:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\mu}\Big[\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}\Big]$$
Element-wise form:
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$$
- $\rho_\mu$ is the state distribution under policy $\mu$
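In practice this is how a deterministic actor-critic update uses the theorem: sample states as a stand-in for $\rho_\mu$, hold the critic fixed, and do gradient ascent on $q(s, \mu(s; \theta))$; by the chain-rule identity above, that ascends exactly along $\nabla_\theta \mu(s)\,\big(\nabla_a q_\mu(s, a)\big)\big|_{a=\mu(s)}$. A minimal sketch with an assumed linear actor and a hand-written stand-in critic (a learned critic would replace it):

```python
import jax
import jax.numpy as jnp

def mu(theta, s):                          # assumed linear deterministic actor
    return theta @ s

def q(s, a):                               # stand-in critic; a learned q-network would go here
    return -jnp.sum((a - 0.5 * s[:2]) ** 2)

def dpg_step(theta, states, lr=0.1):
    """One gradient-ascent step on the sample average of q(s, mu(s; theta)),
    i.e. a sample-based version of the element-wise DPG form."""
    objective = lambda th: jnp.mean(jax.vmap(lambda s: q(s, mu(th, s)))(states))
    return theta + lr * jax.grad(objective)(theta)

states = jax.random.normal(jax.random.PRNGKey(0), (32, 3))  # stand-in samples from rho_mu
theta = jnp.zeros((2, 3))
for _ in range(200):
    theta = dpg_step(theta, states)
# theta now (approximately) picks the action that maximizes the fixed critic in each state.
```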