Idea of Policy Gradient

Introduction

Motivation

Inspired by the earlier tabular methods, we now decide to move beyond table-based representations.

  • Just as the state values, which used to be stored in a table indexed by state, can instead be represented by a function whose input is the state,
  • the idea is to likewise replace the table of action probabilities (indexed by state) with a function whose input is the state.
  • The action probabilities form an m-dimensional vector, where m is the number of actions.
  • The action probabilities are exactly the policy.
  • Previously, the action probabilities of all states were stored in a table $\pi(a|s)$, whose rows are indexed by states and whose columns are indexed by actions.
  • Now, policies can be represented by parameterized functions:

    $\pi(a \mid s, \theta)$

where $\theta$ is a parameter vector.

  • the function can be, e.g., a neural network whose input is $s$, whose output is the probability of taking each action, and whose parameter is $\theta$; a minimal sketch is given after this list.
  • advantage: when the state space is large, the tabular representation is inefficient in terms of both storage and generalization, whereas a parameterized function can be stored compactly and can generalize across states.
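
To make the idea concrete, here is a minimal NumPy sketch of a parameterized policy. The names (`SoftmaxPolicy`, `state_dim`, `num_actions`) and the choice of a linear layer followed by a softmax are assumptions for illustration only; any differentiable function that maps a state to a probability vector over the m actions would serve as $\pi(a|s,\theta)$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """A parameterized policy pi(a|s, theta): a linear map of state
    features followed by a softmax over the m actions."""
    def __init__(self, state_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        # theta collects all trainable parameters (a single weight matrix here).
        self.theta = 0.01 * rng.standard_normal((num_actions, state_dim))

    def action_probabilities(self, state):
        """Input: a state feature vector; output: an m-dimensional probability vector."""
        logits = self.theta @ state
        return softmax(logits)

# Example: 4-dimensional state features, 3 actions.
policy = SoftmaxPolicy(state_dim=4, num_actions=3)
s = np.array([1.0, 0.0, -0.5, 2.0])
print(policy.action_probabilities(s))  # three probabilities summing to 1
```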

Differences between tabular and function representations

  1. how to define an optimal policy?
    • tabular case: a policy is optimal if it can maximize every state value
    • function case: a policy is optimal if it can maximize certain scalar metrics
  2. how to access the probability of an action?
    • tabular case: the probability of taking action $a$ at state $s$ can be accessed directly by looking up the entry $\pi(a|s)$ in the table
    • function case: we need to calculate the value of $\pi(a|s,\theta)$ given the function structure and the parameter $\theta$
  3. how to update policies?
    • tabular case: a policy can be updated by directly changing the entries in the table
    • function case: a policy cannot be updated in this way anymore. Instead, it can only be updated by changing the parameter $\theta$ (see the sketch after this list)
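
The contrast between the two representations can be made concrete with a small sketch. The feature dimension and the softmax-over-linear-scores form of $\pi(a|s,\theta)$ below are assumptions for illustration; the point is only that the tabular policy is read and written entry by entry, while the parameterized policy can only be evaluated and updated through $\theta$.

```python
import numpy as np

# Tabular case: pi is a |S| x |A| table; entries are accessed and updated directly.
num_states, num_actions = 5, 3
pi_table = np.full((num_states, num_actions), 1.0 / num_actions)

s, a = 2, 1
prob = pi_table[s, a]                       # access: a direct table lookup
pi_table[s] = np.array([0.0, 1.0, 0.0])     # update: overwrite the entries for state s

# Function case: pi(a|s, theta) must be computed; only theta can be changed.
state_dim = 4                               # assumed feature dimension
theta = np.zeros((num_actions, state_dim))

def pi(a, state_features, theta):
    """Evaluate pi(a|s, theta) as a softmax over linear scores."""
    logits = theta @ state_features
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[a]

phi_s = np.array([1.0, 0.0, -0.5, 2.0])     # feature vector of some state s
prob = pi(a, phi_s, theta)                  # access: requires evaluating the function
theta = theta + 0.01                        # update: changing theta changes pi(a|s, theta)
                                            # for all states at once
```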

Basic idea of policy gradient

  1. Define a metric (or objective function) $J(\theta)$ to represent the quality of a policy.
  2. Use gradient-based optimization algorithms to search for the optimal policy:

     $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$

     where $\alpha$ is a step size (a minimal sketch of this update loop is given at the end of this section).

Although the idea is simple, the implementation is not trivial.

  1. What appropriate metrics should be used? See PG metric.
  2. How to calculate the gradient of $J(\theta)$? See PG metric gradient.
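
The following toy sketch illustrates only the gradient-ascent update itself. The metric and its gradient (`toy_grad_J`) are placeholders chosen so the loop runs; how to define $J(\theta)$ and how to estimate $\nabla_\theta J(\theta)$ from sampled data is exactly what the two questions above and the later sections address.

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha):
    """One policy-gradient update: theta_{t+1} = theta_t + alpha * grad_J(theta_t)."""
    return theta + alpha * grad_J(theta)

# Placeholder gradient of a toy metric J(theta) = -0.5 * ||theta||^2 (maximized at 0).
# In practice, grad_J is estimated from sampled experience.
def toy_grad_J(theta):
    return -theta

theta = np.ones(6)
for t in range(100):
    theta = gradient_ascent_step(theta, toy_grad_J, alpha=0.1)
print(theta)  # close to the maximizer of the toy metric
```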