Idea of Policy Gradient
Introduction
Motivation
Inspired by the earlier tabular methods, we decide to move beyond table-based representations.
- State values originally stored in a table (indexed by state) ⇒ state values represented by a function (whose input is the state)
- The natural next step: can we likewise replace the table of action probabilities (indexed by state) with a function whose input is the state and whose output is the action probabilities?
- The action probabilities form an $m$-dimensional vector, where $m$ is the number of actions
- These action probabilities are exactly the policy
- Previously, the action probabilities of all states are stored in a table $\pi(a \mid s)$, with one entry per state-action pair:

| State/Action | $a_1$ | $a_2$ | $a_3$ | $a_4$ | $a_5$ |
|---|---|---|---|---|---|
| $s_1$ | $\pi(a_1 \mid s_1)$ | $\pi(a_2 \mid s_1)$ | $\pi(a_3 \mid s_1)$ | $\pi(a_4 \mid s_1)$ | $\pi(a_5 \mid s_1)$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $s_n$ | $\pi(a_1 \mid s_n)$ | $\pi(a_2 \mid s_n)$ | $\pi(a_3 \mid s_n)$ | $\pi(a_4 \mid s_n)$ | $\pi(a_5 \mid s_n)$ |
- Now, policies can be represented by parameterized functions:
$$\pi(a \mid s, \theta)$$
where $\theta$ is a parameter vector.
- the function can be, e.g., a neural network, whose input is $s$, output is the probability of taking each action, and parameter is $\theta$ (see the sketch after this list)
- advantage: when the state space is large, the tabular representation is inefficient in terms of both storage and generalization
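As a concrete illustration (not from the original notes), here is a minimal sketch of such a parameterized policy: a small neural network that maps a state vector to a probability for each of the $m$ actions via a softmax. The use of PyTorch, the layer sizes, and the state/action dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Parameterized policy pi(a | s, theta): state in, action probabilities out."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        # theta corresponds to all weights and biases of these layers
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax guarantees a valid probability distribution over the m actions.
        return torch.softmax(self.net(state), dim=-1)

# Usage: query pi(. | s, theta) for a hypothetical 4-dimensional state and 5 actions.
policy = PolicyNetwork(state_dim=4, num_actions=5)
probs = policy(torch.randn(4))   # tensor of shape (5,), entries sum to 1
```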
Difference between tabular and function representations
- how to define an optimal policy?
- tabular case: a policy is optimal if it can maximize every state value
- function case: a policy is optimal if it can maximize certain scalar metrics
- how to access the probability of an action?
- tabular case: the probability of taking action $a$ at state $s$ can be directly accessed by looking up the tabular policy
- function case: we need to calculate the value of $\pi(a \mid s, \theta)$ given the function structure and the parameter $\theta$
- how to update policies?
- tabular case: a policy can be updated by directly changing the entries in the table
- function case: a policy cannot be updated in this way anymore. Instead, it can only be updated by changing the parameter $\theta$ (the contrast between the two cases is sketched below)
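The following sketch contrasts the two cases under some illustrative assumptions: a table of 10 states and 5 actions, and, for the function case, a softmax policy over linear action preferences built from hypothetical state features. The feature matrix and update rule are stand-ins, not part of the original notes.

```python
import numpy as np

num_states, num_actions, feat_dim = 10, 5, 3

# --- Tabular case: probabilities are entries of a table ---
policy_table = np.full((num_states, num_actions), 1.0 / num_actions)
prob_tab = policy_table[3, 2]                 # access: direct lookup of pi(a_2 | s_3)
policy_table[3] = np.eye(num_actions)[2]      # update: directly overwrite the row of s_3

# --- Function case: probabilities are computed from the parameter theta ---
features = np.random.randn(num_states, feat_dim)   # hypothetical state features
theta = np.zeros((feat_dim, num_actions))           # the policy parameter

def pi(s: int, theta: np.ndarray) -> np.ndarray:
    """pi(. | s, theta): softmax over linear action preferences features[s] @ theta."""
    prefs = features[s] @ theta
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()

prob_fun = pi(3, theta)[2]                     # access: requires evaluating the function
theta += 0.1 * np.random.randn(*theta.shape)   # update: the policy changes only through theta
```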
Basic idea of policy gradient
- Define a metric (or objective function) to represent the quality of a policy: $J(\theta)$, a scalar function of the parameter $\theta$.
- Use gradient-based optimization algorithms to search for the optimal policy (a minimal sketch is given after this list):
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$$
where $\alpha$ is a step size (learning rate).
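Below is a minimal sketch of the gradient-ascent search $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$. The function `estimate_grad_J` is a hypothetical placeholder: in an actual policy-gradient method the gradient would be estimated from sampled data, which is exactly the question addressed in the later notes.

```python
import numpy as np

def estimate_grad_J(theta: np.ndarray) -> np.ndarray:
    # Placeholder surrogate: pretend J(theta) = -||theta - 1||^2,
    # whose gradient is 2 * (1 - theta). A real estimator would use trajectories.
    return 2.0 * (1.0 - theta)

theta = np.zeros(8)        # the policy parameter vector
alpha = 0.1                # step size (learning rate)

for t in range(100):
    theta = theta + alpha * estimate_grad_J(theta)   # ascend the metric J(theta)

print(theta)               # converges toward the maximizer of the surrogate J
```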
Although the idea is simple, the implementation is not trivial.
- What appropriate metrics should be used? See PG metric.
- How to calculate the gradient of $J(\theta)$? See PG metric gradient.