Reinforcement Learning Notes


09 Policy Gradient Methods

Sep 30, 2025

Gap: from value-based to policy-based
Contents:
1. Metrics to define optimal policies: $J(\theta) = \bar{v}_{\pi}$ or $\bar{r}_{\pi}$
2. Policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E}[q_{\pi}(S, A) \nabla_{\theta} \ln \pi(A \mid S, \theta)]$
3. Gradient-ascent algorithm (REINFORCE): $\theta_{t+1} = \theta_{t} + \alpha q_{\pi}(s_{t}, a_{t}) \nabla_{\theta} \ln \pi(a_{t} \mid s_{t}, \theta_{t})$
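
To make the update in item 3 concrete, here is a minimal sketch of one gradient-ascent step. It assumes a tabular softmax policy, which these notes do not specify. The names `softmax_policy`, `grad_log_pi`, and `policy_gradient_step` are hypothetical, and `q_estimate` stands in for $q_{\pi}(s_{t}, a_{t})$, which has to be estimated in practice.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(. | s, theta) for a tabular softmax policy (an assumed parameterization).

    theta has shape (n_states, n_actions) and holds action preferences.
    """
    prefs = theta[s] - theta[s].max()  # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of ln pi(a | s, theta) with respect to theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)  # -pi(b | s) for every action b
    g[s, a] += 1.0                    # +1 for the action that was taken
    return g

def policy_gradient_step(theta, s, a, q_estimate, alpha):
    """One stochastic gradient-ascent step: theta + alpha * q * grad ln pi."""
    return theta + alpha * q_estimate * grad_log_pi(theta, s, a)

# Usage: a positive q estimate raises the probability of the sampled action.
theta = np.zeros((2, 3))
new_theta = policy_gradient_step(theta, s=0, a=1, q_estimate=2.0, alpha=0.1)
```

With a positive `q_estimate` the step raises the preference for the sampled action and lowers the others in the same state. Rows of `theta` for other states are untouched.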

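A short reflection

The improvement here is in the same spirit as the one in the previous chapter, 08 Value Function Methods. There we focused on how to obtain the value function: we gave it a parameter $w$ and optimized $J(w)$ to approximate the true value function. Now we likewise give the policy a parameter $\theta$ and optimize $J(\theta)$ in the hope of finding an optimal policy.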
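
The parallel between the two chapters can be written side by side. This is only a sketch: it assumes the chapter 08 objective $J(w)$ is an error measure that is minimized by gradient descent, which is not shown in this note, while $J(\theta)$ here is a performance metric that is maximized by gradient ascent.

$$
\begin{aligned}
\text{value function approximation:}\quad & w_{t+1} = w_{t} - \alpha \nabla_{w} J(w_{t}) && (\text{minimize } J(w)) \\
\text{policy gradient:}\quad & \theta_{t+1} = \theta_{t} + \alpha \nabla_{\theta} J(\theta_{t}) && (\text{maximize } J(\theta))
\end{aligned}
$$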

Outline

Basic idea of policy gradient
Metrics to define optimal policies
Gradients of the metrics
Gradient ascent algorithm
Summary

Summary

Metrics for optimality
Gradients of the metrics
Gradient ascent algorithm
- REINFORCE
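
As an illustration of the REINFORCE item above, the following is a minimal sketch that replaces the unknown $q_{\pi}(s_{t}, a_{t})$ with the Monte Carlo return of the sampled episode. The corridor environment, the tabular softmax policy, the hyperparameters, and all function names are assumptions made for this example; none of them come from the notes.

```python
import numpy as np

N_STATES, GOAL = 2, 2   # hypothetical corridor: states 0 and 1, terminal state 2
LEFT, RIGHT = 0, 1

def step(s, a):
    """Assumed dynamics: RIGHT moves toward GOAL, LEFT moves back (floored at 0)."""
    s_next = s + 1 if a == RIGHT else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0  # reward only on reaching the goal
    return s_next, reward, s_next == GOAL

def policy(theta, s):
    """Tabular softmax policy pi(. | s, theta)."""
    prefs = theta[s] - theta[s].max()
    e = np.exp(prefs)
    return e / e.sum()

def run_episode(theta, rng, max_steps=50):
    """Sample one episode from state 0; returns a list of (state, action, reward)."""
    s, traj = 0, []
    for _ in range(max_steps):
        a = rng.choice(2, p=policy(theta, s))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    return traj

def reinforce(n_episodes=3000, alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))
    for _ in range(n_episodes):
        traj = run_episode(theta, rng)
        g = 0.0
        # Walk the episode backwards so g is the discounted return from step t.
        for s, a, r in reversed(traj):
            g = r + gamma * g
            grad = -policy(theta, s)   # gradient of ln pi(a | s) w.r.t. theta[s]
            grad[a] += 1.0
            theta[s] += alpha * g * grad   # theta <- theta + alpha * g_t * grad ln pi
    return theta

theta = reinforce()
p_right = [policy(theta, s)[RIGHT] for s in range(N_STATES)]
```

After training, `p_right` holds the probability of moving right in each non-terminal state, which should lean toward the goal. Using the sampled return in place of $q_{\pi}$ keeps the update unbiased in expectation but noisy, which is one motivation for the actor-critic methods mentioned under Later.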

Later

policy-based plus value-based ⇒ 10 Actor-Critic Methods


