Idea of Policy Gradient

Introduction

Motivation

Inspired by the earlier tabular methods, we now decide to move beyond table-based representations.

  • Just as the state values, which used to be stored in a table indexed by state, can instead be represented by a function whose input is the state,
  • the idea is to likewise replace the table of action probabilities (indexed by state) with a function whose input is the state.
  • The action probabilities form an m-dimensional vector, where m is the number of actions.
  • The action probabilities are exactly the policy.
  • Previously, the action probabilities of all states were stored in a table $\pi(a|s)$, whose rows are indexed by states and whose columns are indexed by actions.
  • Now, policies can be represented by parameterized functions:

    $\pi(a \mid s, \theta)$

where $\theta$ is a parameter vector.

  • the function can be, e.g., a neural network whose input is $s$, whose output is the probability of taking each action, and whose parameter is $\theta$; a minimal sketch is given after this list.
  • advantage: when the state space is large, the tabular representation is inefficient in terms of both storage and generalization, whereas a parameterized function can be stored compactly and can generalize across states.
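
To make the idea concrete, here is a minimal NumPy sketch of a parameterized policy. The names (`SoftmaxPolicy`, `state_dim`, `num_actions`) and the choice of a linear layer followed by a softmax are assumptions for illustration only; any differentiable function that maps a state to a probability vector over the m actions would serve as $\pi(a|s,\theta)$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """A parameterized policy pi(a|s, theta): a linear map of state
    features followed by a softmax over the m actions."""
    def __init__(self, state_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        # theta collects all trainable parameters (a single weight matrix here).
        self.theta = 0.01 * rng.standard_normal((num_actions, state_dim))

    def action_probabilities(self, state):
        """Input: a state feature vector; output: an m-dimensional probability vector."""
        logits = self.theta @ state
        return softmax(logits)

# Example: 4-dimensional state features, 3 actions.
policy = SoftmaxPolicy(state_dim=4, num_actions=3)
s = np.array([1.0, 0.0, -0.5, 2.0])
print(policy.action_probabilities(s))  # three probabilities summing to 1
```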

Differences between tabular and function representations

  1. how to define an optimal policy?
    • tabular case: a policy is optimal if it can maximize every state value
    • function case: a policy is optimal if it can maximize certain scalar metrics
  2. how to access the probability of an action?
    • tabular case: the probability of taking action $a$ at state $s$ can be accessed directly by looking up the entry $\pi(a|s)$ in the table
    • function case: we need to calculate the value of $\pi(a|s,\theta)$ given the function structure and the parameter $\theta$
  3. how to update policies?
    • tabular case: a policy can be updated by directly changing the entries in the table
    • function case: a policy cannot be updated in this way anymore. Instead, it can only be updated by changing the parameter $\theta$ (see the sketch after this list)
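
The contrast between the two representations can be made concrete with a small sketch. The feature dimension and the softmax-over-linear-scores form of $\pi(a|s,\theta)$ below are assumptions for illustration; the point is only that the tabular policy is read and written entry by entry, while the parameterized policy can only be evaluated and updated through $\theta$.

```python
import numpy as np

# Tabular case: pi is a |S| x |A| table; entries are accessed and updated directly.
num_states, num_actions = 5, 3
pi_table = np.full((num_states, num_actions), 1.0 / num_actions)

s, a = 2, 1
prob = pi_table[s, a]                       # access: a direct table lookup
pi_table[s] = np.array([0.0, 1.0, 0.0])     # update: overwrite the entries for state s

# Function case: pi(a|s, theta) must be computed; only theta can be changed.
state_dim = 4                               # assumed feature dimension
theta = np.zeros((num_actions, state_dim))

def pi(a, state_features, theta):
    """Evaluate pi(a|s, theta) as a softmax over linear scores."""
    logits = theta @ state_features
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[a]

phi_s = np.array([1.0, 0.0, -0.5, 2.0])     # feature vector of some state s
prob = pi(a, phi_s, theta)                  # access: requires evaluating the function
theta = theta + 0.01                        # update: changing theta changes pi(a|s, theta)
                                            # for all states at once
```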

Basic idea of policy gradient

  1. Define a metric (or objective function) $J(\theta)$ to represent the quality of a policy.
  2. Use gradient-based optimization algorithms to search for the optimal policy:

     $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$

     where $\alpha$ is a step size (a minimal sketch of this update loop is given at the end of this section).

Although the idea is simple, the implementation is not trivial.

  1. What appropriate metrics should be used? See PG metric.
  2. How to calculate the gradient of $J(\theta)$? See PG metric gradient.
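
The following toy sketch illustrates only the gradient-ascent update itself. The metric and its gradient (`toy_grad_J`) are placeholders chosen so the loop runs; how to define $J(\theta)$ and how to estimate $\nabla_\theta J(\theta)$ from sampled data is exactly what the two questions above and the later sections address.

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha):
    """One policy-gradient update: theta_{t+1} = theta_t + alpha * grad_J(theta_t)."""
    return theta + alpha * grad_J(theta)

# Placeholder gradient of a toy metric J(theta) = -0.5 * ||theta||^2 (maximized at 0).
# In practice, grad_J is estimated from sampled experience.
def toy_grad_J(theta):
    return -theta

theta = np.ones(6)
for t in range(100):
    theta = gradient_ascent_step(theta, toy_grad_J, alpha=0.1)
print(theta)  # close to the maximizer of the toy metric
```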