Proof of Deterministic Policy Gradient

Notations

Since the policy is deterministic,

  • $\mu(s;\theta)$ is the deterministic policy function.
  • $\theta$ is the parameter of the policy function.
  • $\mu(\cdot\,;\theta)$ is a mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$.

$\pi(a \mid s;\theta)$ vs. $\mu(s;\theta)$

So far, we have used the notation $\pi(a \mid s;\theta)$ to represent the stochastic policy:

  • $\pi(a \mid s;\theta)$ is the probability of taking action $a$ in state $s$.
  • input: one state $s$; output: a probability distribution over actions.

and now we use $\mu(s;\theta)$ to represent the deterministic policy:

  • $\mu(s;\theta)$ is just the action taken in state $s$.
  • input: one state $s$; output: one action.

For a discrete action space, we have

$$\pi(a \mid s;\theta) = \mathbb{1}\{a = \mu(s;\theta)\},$$

i.e., the deterministic policy is the one-hot distribution that puts all probability on the action $\mu(s;\theta)$.
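As a purely illustrative sketch of the two interfaces (the linear parameterization, the sizes, and the softmax/argmax choices below are assumptions, not part of the derivation): a stochastic policy maps a state to a distribution, a deterministic policy maps it to a single action.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, num_actions = 4, 3
theta = rng.normal(size=(state_dim, num_actions))   # made-up linear parameters

def pi(s, theta):
    """Stochastic policy: one state in, a distribution over actions out."""
    logits = s @ theta
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def mu(s, theta):
    """Deterministic policy: one state in, one action out."""
    return int(np.argmax(s @ theta))

s = rng.normal(size=state_dim)
print(pi(s, theta))                      # a probability vector summing to 1
print(mu(s, theta))                      # a single action index

# Viewing mu as a degenerate stochastic policy: a one-hot distribution
one_hot = np.eye(num_actions)[mu(s, theta)]
print(one_hot)                           # probability 1 on the action mu picks
```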

Action value

Remember the stochastic case: $v_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s;\theta)}\!\left[Q_\pi(s,a)\right]$, where $Q_\pi(s,a)$ refers to the action value; the expectation is needed because the action is sampled from the stochastic policy. Now the policy is deterministic (or one-hot), so the action is fixed and the expectation operation can be removed:

$$v_\mu(s) = Q_\mu\!\left(s, \mu(s;\theta)\right).$$
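A quick NumPy check of this collapse; the Q values and the chosen action below are made up for illustration.

```python
import numpy as np

num_actions = 4
Q_s = np.array([1.0, -0.5, 2.0, 0.3])   # made-up Q_mu(s, a) for one fixed state s
a_mu = 2                                 # made-up action chosen by mu(s; theta)

pi_s = np.eye(num_actions)[a_mu]         # the one-hot policy at state s
v_stochastic = np.sum(pi_s * Q_s)        # E_{a ~ pi}[ Q(s, a) ]
v_deterministic = Q_s[a_mu]              # Q_mu(s, mu(s; theta)), no expectation

assert np.isclose(v_stochastic, v_deterministic)
print(v_stochastic, v_deterministic)     # both 2.0
```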

Theorem of DPG

As we mentioned before, in the discounted case where $\gamma \in (0,1)$, the gradient of the objective $J(\theta)$ is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}\right].$$

Here $\rho^\mu$ is the state distribution under policy $\mu$.
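To make the formula concrete, here is a minimal NumPy sketch with an assumed linear policy $\mu(s;\theta) = \theta^\top s$ and a toy quadratic critic (both hypothetical, not part of the proof): it computes the per-state term $\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}$ analytically and checks it against a finite-difference gradient of $Q(s,\mu(s;\theta))$ with respect to $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim = 5
theta = rng.normal(size=state_dim)       # policy parameters (k = state_dim)
w = rng.normal(size=state_dim)           # parameters of the toy critic below

def mu(s, theta):
    """Assumed linear deterministic policy with a scalar action."""
    return float(theta @ s)

def Q(s, a):
    """Toy critic, differentiable in the action."""
    return -(a - w @ s) ** 2

def dQ_da(s, a):
    """Analytic gradient of the toy critic w.r.t. the action."""
    return -2.0 * (a - w @ s)

s = rng.normal(size=state_dim)

# Per-state DPG term: grad_theta mu(s;theta) * grad_a Q(s,a) |_{a = mu(s;theta)}
dpg_grad = s * dQ_da(s, mu(s, theta))    # grad_theta mu = s for a linear policy

# Central finite differences of Q(s, mu(s; theta)) w.r.t. theta
eps = 1e-6
fd_grad = np.array([
    (Q(s, mu(s, theta + eps * e)) - Q(s, mu(s, theta - eps * e))) / (2 * eps)
    for e in np.eye(state_dim)
])

print(np.max(np.abs(dpg_grad - fd_grad)))   # ~0: the chain rule checks out
```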

Derivation of DPG

Differentiating $v_\mu(s) = Q_\mu(s,\mu(s;\theta))$ through both the action and the next-state value gives the element-wise recursion

$$\nabla_\theta v_\mu(s) = \nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)} + \gamma \sum_{s'} p\!\left(s' \mid s, \mu(s;\theta)\right)\nabla_\theta v_\mu(s').$$

Turn it into matrix form:

  • $n = |\mathcal{S}|$ is the size of the state space.
  • $k$ is the dimension of the parameter $\theta$.
  • $\nabla_\theta v_\mu(s)$ is a $(k,1)$ vector when $s$ is fixed; stacking the rows over all states, $\nabla_\theta v_\mu$ is an $(n,k)$ matrix.
  • $p(\cdot \mid s, \mu(s;\theta))$ is an $(n,1)$ vector when $s$ is fixed; stacking over all states, the transition matrix $P_\mu$ is an $(n,n)$ matrix.
  • $\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}$ is a $(k,1)$ vector when $s$ is fixed; stacking the rows over all states gives an $(n,k)$ matrix $G$.

Then we have the matrix form:

$$\nabla_\theta v_\mu = G + \gamma P_\mu\,\nabla_\theta v_\mu \quad\Longrightarrow\quad \nabla_\theta v_\mu = (I - \gamma P_\mu)^{-1} G.$$
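A quick sanity check of this matrix identity, with a random row-stochastic matrix standing in for $P_\mu$ and a random matrix standing in for $G$ (both made up, since no concrete MDP is specified here): iterating the recursion converges to the closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, gamma = 6, 3, 0.9                  # made-up sizes and discount factor

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # random row-stochastic stand-in for P_mu
G = rng.normal(size=(n, k))              # random stand-in for the (n, k) matrix G

# Closed form: grad_theta v_mu = (I - gamma * P)^(-1) G
closed_form = np.linalg.solve(np.eye(n) - gamma * P, G)

# Fixed-point iteration of X = G + gamma * P X converges to the same matrix
X = np.zeros((n, k))
for _ in range(2000):
    X = G + gamma * P @ X

print(np.max(np.abs(X - closed_form)))   # ~0: both solve the matrix equation
```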

TL;DR

  • just denote $\pi(a \mid s;\theta) = \mathbb{1}\{a = \mu(s;\theta)\}$ as a one-hot $(n,m)$ matrix (one row per state, with $m = |\mathcal{A}|$), the same shape as $\pi$ before
  • $Q_\mu(s,a)$ is likewise an $(n,m)$ matrix, the same shape as $Q_\pi$ before (see the shape check below)
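A minimal shape check of this claim (the state/action counts and all values are made up): the one-hot policy matrix has the same $(n,m)$ shape as a stochastic policy table, so the same matrix algebra applies.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 4                              # made-up sizes: |S| = 5, |A| = 4

Pi = rng.random((n, m))                  # a stochastic policy table, rows sum to 1
Pi /= Pi.sum(axis=1, keepdims=True)

mu_actions = rng.integers(0, m, size=n)  # mu(s; theta) as one action index per state
Pi_onehot = np.eye(m)[mu_actions]        # deterministic policy as a one-hot (n, m) matrix

Q = rng.normal(size=(n, m))              # action values, same (n, m) shape as before

v_pi = (Pi * Q).sum(axis=1)              # v_pi(s) = sum_a pi(a|s) Q(s, a)
v_mu = (Pi_onehot * Q).sum(axis=1)       # the same formula with the one-hot policy

assert np.allclose(v_mu, Q[np.arange(n), mu_actions])
print(Pi.shape, Pi_onehot.shape, Q.shape)    # all (n, m): the same algebra applies
```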

Then we can directly use the result in the policy gradient proof: just replace $\sum_a \nabla_\theta \pi(a \mid s;\theta)\,Q_\pi(s,a)$ by $\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}$, and divide by the normalizing factor of the discounted state distribution.

And we know that actually, when the policy is deterministic and just returns one action, the element-wise form (by the chain rule) is

$$\nabla_\theta Q_\mu\!\left(s, \mu(s;\theta)\right) = \nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}.$$

Then we have the deterministic policy gradient, unified form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}\right]$$

Element-wise form:

$$\nabla_\theta J(\theta) = \sum_{s} \rho^\mu(s)\,\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)}$$

  • $\rho^\mu$ is the state distribution under policy $\mu$.
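For completeness, here is one way to connect the matrix-form result above to these formulas, assuming (as in the usual setup, though not stated explicitly here) that the objective is the expected value of the start state, $J(\theta) = d_0^\top v_\mu$ with $d_0$ the initial-state distribution, and that $\rho^\mu$ is the discounted visitation measure $\rho^\mu(s) = \sum_{t \ge 0} \gamma^t \Pr(s_t = s)$:

$$
\begin{aligned}
\nabla_\theta J(\theta)
  &= (\nabla_\theta v_\mu)^\top d_0
   = \bigl((I - \gamma P_\mu)^{-1} G\bigr)^{\!\top} d_0
   = G^\top \underbrace{(I - \gamma P_\mu^\top)^{-1} d_0}_{\rho^\mu} \\
  &= \sum_s \rho^\mu(s)\,\nabla_\theta \mu(s;\theta)\,\nabla_a Q_\mu(s,a)\big|_{a=\mu(s;\theta)},
\end{aligned}
$$

which recovers exactly the element-wise form above.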