Proof of Deterministic Policy Gradient
Notations
Since the policy is deterministic, we denote it by $a = \mu(s; \theta)$:
- $\mu$ is the deterministic policy function.
- $\theta$ is the parameter of the policy function.
- $\mu$ is a mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$.
$\pi(a|s; \theta)$ vs. $\mu(s; \theta)$
So far, we have used the notation $\pi(a|s; \theta)$ to represent the stochastic policy,
- $\pi(a|s; \theta)$ is the probability of taking action $a$ in state $s$.
- Input: one state $s$; output: a probability distribution over actions.
and now we use $\mu(s; \theta)$ to represent the deterministic policy.
- $\mu(s; \theta)$ is just the action taken in state $s$.
- Input: one state $s$; output: one action.
we have
$$\pi(a|s; \theta) = \begin{cases} 1, & a = \mu(s; \theta), \\ 0, & \text{otherwise,} \end{cases}$$
i.e. a deterministic policy can be viewed as a "one-hot" stochastic policy.
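To make the contrast concrete, here is a minimal sketch (the linear parameterization and Gaussian noise are illustrative assumptions, not part of the derivation): the stochastic policy returns a different sampled action on each call, while the deterministic policy always returns the same action for the same state.

```python
# Minimal sketch: stochastic vs. deterministic policy.
# The linear form and Gaussian noise are assumptions for illustration only.
import jax
import jax.numpy as jnp

def stochastic_policy(theta, s, key):
    """pi(a|s; theta): defines a distribution over actions; here we sample from it."""
    mean = theta @ s                                   # assumed linear mean
    return mean + jax.random.normal(key, mean.shape)   # a ~ N(mean, I)

def deterministic_policy(theta, s):
    """mu(s; theta): maps the state directly to one action."""
    return theta @ s

s = jnp.array([1.0, -0.5])          # one state
theta = jnp.ones((1, 2))            # policy parameter
print(stochastic_policy(theta, s, jax.random.PRNGKey(0)))  # a sampled action
print(deterministic_policy(theta, s))                      # always the same action
```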
Action value
Remember the stochastic case: $v_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s; \theta)}\big[q_\pi(s, a)\big] = \sum_a \pi(a|s; \theta)\, q_\pi(s, a)$, where $q_\pi(s, a)$ refers to the action value; the expectation is needed because the action is sampled from the stochastic policy. Now the policy is deterministic (or one-hot), so the action is fixed and the $\mathbb{E}$ (or $\sum_a$) operation can be removed:
$$v_\mu(s) = q_\mu\big(s, \mu(s; \theta)\big)$$
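A quick numeric check of the statement above (the action values and the chosen action are made up for illustration): with a one-hot $\pi(\cdot|s)$, the sum over actions collapses to the single entry picked by $\mu(s)$.

```python
import jax.numpy as jnp

q_s = jnp.array([0.3, 1.7, -0.2, 0.5])   # made-up q_mu(s, a) for 4 actions in one state s
mu_s = 1                                  # deterministic policy picks action index 1
pi_s = jnp.zeros(4).at[mu_s].set(1.0)     # the same policy viewed as a one-hot distribution

v_stochastic_form = jnp.sum(pi_s * q_s)   # sum_a pi(a|s) q_mu(s, a)
v_deterministic_form = q_s[mu_s]          # q_mu(s, mu(s)): the sum/expectation is removed
assert jnp.isclose(v_stochastic_form, v_deterministic_form)
```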
Theorem of DPG
As we mentioned before, in the discounted case where $\gamma \in (0, 1)$, the gradient of $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)} = \mathbb{E}_{s \sim \rho_\mu}\Big[\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}\Big]$$
Here $\rho_\mu$ is the state distribution under policy $\mu$.
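The per-state term $\nabla_\theta \mu(s)\,\big(\nabla_a q_\mu(s, a)\big)\big|_{a=\mu(s)}$ is exactly the chain-rule gradient of $q(s, \mu(s; \theta))$ when $q$ is treated as a fixed function of $(s, a)$. A small autodiff check of that identity, using an assumed linear $\mu$ and an arbitrary smooth $q$ (both purely illustrative):

```python
import jax
import jax.numpy as jnp

def mu(theta, s):                         # assumed deterministic policy: linear in theta
    return theta @ s                      # action in R^2

def q(s, a):                              # some fixed, differentiable action-value function
    return -jnp.sum((a - s[:2]) ** 2)

s = jnp.array([0.5, -1.0, 2.0])
theta = 0.1 * jnp.ones((2, 3))

# Chain rule applied directly by autodiff: grad_theta q(s, mu(s; theta)).
lhs = jax.grad(lambda th: q(s, mu(th, s)))(theta)

# The DPG per-state term: (d mu / d theta) contracted with (grad_a q)|_{a = mu(s)}.
grad_a_q = jax.grad(q, argnums=1)(s, mu(theta, s))   # shape (2,)
jac_mu = jax.jacobian(mu)(theta, s)                  # shape (2, 2, 3): d mu_i / d theta_jk
rhs = jnp.einsum('i,ijk->jk', grad_a_q, jac_mu)

assert jnp.allclose(lhs, rhs)   # the two expressions agree
```

The full theorem additionally weights these per-state terms by the state distribution $\rho_\mu$, which is where the recursion in the derivation below comes in.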
Derivation of DPG
Turn the element-wise relation
$$\nabla_\theta v_\mu(s) = \underbrace{\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}}_{u(s)} + \gamma \sum_{s'} p_\mu(s'|s)\, \nabla_\theta v_\mu(s')$$
into matrix form:
- $n = |\mathcal{S}|$ is the size of the state space
- $k$ is the dimension of the parameter $\theta$
- $\nabla_\theta v_\mu(s)$ is a $(k, 1)$ vector when $s$ is fixed; stacking over all states, $\nabla_\theta v_\mu$ is an $(n, k)$ matrix
- $p_\mu(\cdot|s)$ is an $(n, 1)$ vector when $s$ is fixed; stacking over all states, $P_\mu$ is an $(n, n)$ matrix
- $u(s)$ is a $(k, 1)$ vector when $s$ is fixed; stacking over all states, $u$ is an $(n, k)$ matrix
Then we have:
$$\nabla_\theta v_\mu = u + \gamma P_\mu \nabla_\theta v_\mu \quad\Longrightarrow\quad \nabla_\theta v_\mu = (I_n - \gamma P_\mu)^{-1} u$$
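The closed form uses the fact that $(I_n - \gamma P_\mu)$ is invertible and $(I_n - \gamma P_\mu)^{-1} = \sum_{t \ge 0} \gamma^t P_\mu^t$, since $P_\mu$ is row-stochastic and $\gamma < 1$. A quick numeric sanity check with a made-up 3-state transition matrix and a random $u$ (values are illustrative only):

```python
import jax
import jax.numpy as jnp

n, k, gamma = 3, 2, 0.9
P = jax.random.uniform(jax.random.PRNGKey(0), (n, n))
P = P / P.sum(axis=1, keepdims=True)                  # made-up row-stochastic P_mu
u = jax.random.normal(jax.random.PRNGKey(1), (n, k))  # made-up (n, k) matrix u

# Closed form: grad_theta v_mu = (I_n - gamma * P_mu)^{-1} u.
closed_form = jnp.linalg.solve(jnp.eye(n) - gamma * P, u)

# The same quantity via the (truncated) Neumann series sum_t gamma^t P_mu^t u.
series = sum(gamma**t * jnp.linalg.matrix_power(P, t) @ u for t in range(300))

assert jnp.allclose(closed_form, series, atol=1e-3)
```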
TL;DR
- Just denote the deterministic policy as a one-hot distribution, $\pi(a|s; \theta) = \mathbb{1}[a = \mu(s; \theta)]$, i.e. an $(n, m)$ one-hot matrix (with $m = |\mathcal{A}|$), the same as the shape of $\pi$ before
- $q_\mu(s, a)$ keeps the same shape as $q_\pi(s, a)$ before
Then we can directly use the result in the policy gradient proof:
$$\nabla_\theta J(\theta) = \sum_{s} \rho_\pi(s) \sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a)$$
Just replace $\sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a)$ by $\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$, and divide by the normalization factor of the state distribution.
And we know that actually, when the policy is deterministic and just returns one value, our element-wise form becomes
$$\sum_a \nabla_\theta \pi(a|s; \theta)\, q_\pi(s, a) = \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$$
Then we have the deterministic policy gradient, unified form:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\mu}\Big[\nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}\Big]$$
Element-wise form:
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s, a)\big)\big|_{a = \mu(s)}$$
- $\rho_\mu$ is the state distribution under policy $\mu$
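In practice this is how a deterministic actor-critic update uses the theorem: sample states as a stand-in for $\rho_\mu$, hold the critic fixed, and do gradient ascent on $q(s, \mu(s; \theta))$; by the chain-rule identity above, that ascends exactly along $\nabla_\theta \mu(s)\,\big(\nabla_a q_\mu(s, a)\big)\big|_{a=\mu(s)}$. A minimal sketch with an assumed linear actor and a hand-written stand-in critic (a learned critic would replace it):

```python
import jax
import jax.numpy as jnp

def mu(theta, s):                          # assumed linear deterministic actor
    return theta @ s

def q(s, a):                               # stand-in critic; a learned q-network would go here
    return -jnp.sum((a - 0.5 * s[:2]) ** 2)

def dpg_step(theta, states, lr=0.1):
    """One gradient-ascent step on the sample average of q(s, mu(s; theta)),
    i.e. a sample-based version of the element-wise DPG form."""
    objective = lambda th: jnp.mean(jax.vmap(lambda s: q(s, mu(th, s)))(states))
    return theta + lr * jax.grad(objective)(theta)

states = jax.random.normal(jax.random.PRNGKey(0), (32, 3))  # stand-in samples from rho_mu
theta = jnp.zeros((2, 3))
for _ in range(200):
    theta = dpg_step(theta, states)
# theta now (approximately) picks the action that maximizes the fixed critic in each state.
```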