Advantage Actor-Critic (A2C)

Introduction

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \bigl(q_\pi(s,a) - b(s)\bigr)\,\nabla \pi(a \mid s, \theta)$$

The baseline can be any function, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$$\sum_a b(s)\,\nabla \pi(a \mid s, \theta) = b(s)\,\nabla \sum_a \pi(a \mid s, \theta) = b(s)\,\nabla 1 = 0$$
Baseline invariance

Note that the introduction above gives a brief account of how the baseline is introduced (following Sutton & Barto, Section 13.4): subtracting a random variable or function that is independent of the action keeps the original policy-gradient equation intact and invalidates nothing proven before. Below we analyze this baseline invariance.

Property: the policy gradient is invariant to an additional baseline $b(S)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta)\, q_\pi(S,A)\bigr] = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta)\,\bigl(q_\pi(S,A) - b(S)\bigr)\bigr]$$

Here the additional baseline $b(S)$ is a scalar function of $S$.

Why is it valid?

Cause: the baseline term has zero expectation,

$$\mathbb{E}_{S\sim\eta,\,A\sim\pi}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta)\, b(S)\bigr] = 0$$

Details:

$$\begin{aligned}
\mathbb{E}_{S\sim\eta,\,A\sim\pi}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta)\, b(S)\bigr]
&= \sum_s \eta(s) \sum_a \pi(a \mid s, \theta)\,\nabla_\theta \ln \pi(a \mid s, \theta)\, b(s) \\
&= \sum_s \eta(s)\, b(s) \sum_a \nabla_\theta \pi(a \mid s, \theta) \\
&= \sum_s \eta(s)\, b(s)\,\nabla_\theta \sum_a \pi(a \mid s, \theta)
= \sum_s \eta(s)\, b(s)\,\nabla_\theta 1 = 0
\end{aligned}$$
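
As a quick numerical check of this identity, the sketch below (an illustration, not part of the original notes) assumes a softmax policy with one logit per action and verifies that the baseline term $\sum_a \pi(a \mid s,\theta)\,\nabla_\theta \ln \pi(a \mid s,\theta)\, b(s)$ is zero for an arbitrary $b(s)$:

```python
import numpy as np

# Hypothetical example: softmax policy over 4 actions, parameters theta are the logits for one state.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)          # policy parameters for a single state
b = 3.7                             # arbitrary baseline value b(s)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a softmax policy, grad_theta ln pi(a) = onehot(a) - pi(theta).
    g = -pi(theta)
    g[a] += 1.0
    return g

# Expectation over actions of grad ln pi(a) * b(s), taken under the policy itself.
p = pi(theta)
baseline_term = sum(p[a] * grad_log_pi(theta, a) * b for a in range(4))
print(baseline_term)   # ~[0, 0, 0, 0] up to floating-point error
```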

Why is it useful?

The gradient is $\nabla_\theta J(\theta) = \mathbb{E}[X]$, where

$$X(S,A) \doteq \nabla_\theta \ln \pi(A \mid S, \theta)\,\bigl(q_\pi(S,A) - b(S)\bigr)$$

The expectation $\mathbb{E}[X]$ is invariant to $b(S)$, but the variance $\mathrm{var}(X)$ is not, so a good baseline can reduce the variance of the stochastic gradient samples.

Why introduce a baseline?

  1. Introducing a baseline does not change $\mathbb{E}[X]$, i.e. it does not change the original policy-gradient equation.
  2. The purpose of choosing an optimal baseline is to reduce $\mathrm{var}(X)$. In practice $b(s) = v_\pi(s)$ is usually chosen as the baseline, so that each sample of $X$ stays as close as possible to its expectation $\mathbb{E}[X]$, which reduces the variance of the stochastic gradient (see the sketch after this list).
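
The following sketch (a hypothetical single-state bandit, not from the original notes) estimates the policy gradient with and without the baseline $b = v_\pi(s)$; both estimates agree in the mean, but the baselined samples have lower variance:

```python
import numpy as np

# Hypothetical single-state bandit: 3 actions with known action values q(a), softmax policy.
rng = np.random.default_rng(1)
q = np.array([1.0, 5.0, 10.0])     # action values q_pi(s, a)
theta = np.zeros(3)                # policy logits (uniform policy)

z = np.exp(theta - theta.max())
p = z / z.sum()                    # pi(a | s, theta)
v = p @ q                          # state value v_pi(s), used as the baseline
G = np.eye(3) - p                  # row a is grad_theta ln pi(a | s, theta) for a softmax policy

def gradient_samples(baseline, n=100_000):
    # One-sample estimates X = grad ln pi(A) * (q(A) - b), with A ~ pi.
    actions = rng.choice(3, size=n, p=p)
    return (q[actions] - baseline)[:, None] * G[actions]

x_plain = gradient_samples(baseline=0.0)
x_base = gradient_samples(baseline=v)

print("mean without baseline:", x_plain.mean(axis=0))   # both approximate the true gradient
print("mean with baseline:   ", x_base.mean(axis=0))
print("total variance without baseline:", x_plain.var(axis=0).sum())
print("total variance with baseline:   ", x_base.var(axis=0).sum())  # noticeably smaller
```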

Algorithm

The gradient-ascent algorithm is:

$$\theta_{t+1} = \theta_t + \alpha\,\mathbb{E}_{S\sim\eta,\,A\sim\pi}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta_t)\,\bigl(q_\pi(S,A) - v_\pi(S)\bigr)\bigr] = \theta_t + \alpha\,\mathbb{E}\bigl[\nabla_\theta \ln \pi(A \mid S, \theta_t)\,\delta_\pi(S,A)\bigr]$$

where $\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$ is called the advantage function.

The stochastic version is:

$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\,\delta_t(s_t,a_t), \qquad \delta_t(s_t,a_t) = q_t(s_t,a_t) - v_t(s_t)$$

where $q_t$ and $v_t$ are estimates of $q_\pi$ and $v_\pi$. Referring to TD methods, we can approximate $q_t(s_t,a_t) - v_t(s_t)$ either by the MC error $g_t - v_t(s_t)$ (with $g_t$ the discounted return) or by the TD error:

$$\delta_t = r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

So the algorithm can be written in the TD version:

$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\,\bigl(r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)\bigr)$$

Benefit: we only need one network to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.
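
A minimal sketch of the TD-version update, assuming PyTorch, a discrete action space, and two small illustrative networks `policy_net` (actor, parameters $\theta$) and `value_net` (critic, approximating $v_\pi$); all names and hyperparameters here are assumptions, not from the original notes:

```python
import torch
import torch.nn as nn

# Illustrative networks: the actor outputs action logits, the critic outputs a scalar state value.
obs_dim, n_actions, gamma = 4, 2, 0.99
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def a2c_update(s, a, r, s_next, done):
    """One-step TD actor-critic update from a single transition (s, a, r, s_next)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    v_s = value_net(s).squeeze(-1)
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * value_net(s_next).squeeze(-1)
    # The TD error plays the role of the advantage delta_t.
    delta = td_target - v_s

    # Critic: move v(s_t) toward the TD target by minimizing the squared TD error.
    critic_loss = delta.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on ln pi(a_t | s_t) * delta_t (delta is detached: it is a weight, not a path).
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * delta.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Only the critic `value_net` approximates $v_\pi$; the TD error it produces is used directly as the advantage estimate in the actor update.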

Analysis

Same as the analysis in the last lecture: since $\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \nabla_\theta \pi(a_t \mid s_t, \theta_t) / \pi(a_t \mid s_t, \theta_t)$, the update step is proportional to $\delta_t(s_t,a_t)$ and inversely proportional to $\pi(a_t \mid s_t, \theta_t)$, so it can well balance exploration and exploitation; but what matters here is the relative value (the advantage $\delta_t(s_t,a_t) = q_t(s_t,a_t) - v_t(s_t)$) rather than the absolute value (the action value $q_t(s_t,a_t)$).
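
To make the exploitation side concrete, the short sketch below (an illustration, not from the original notes) applies one manual step $\theta \leftarrow \theta + \alpha\,\nabla_\theta \ln\pi(a_t \mid s_t,\theta)\,\delta_t$ to a softmax policy and shows that a positive advantage increases $\pi(a_t \mid s_t)$ while a negative one decreases it; the numbers are arbitrary:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.2, -0.1, 0.0])     # logits of a softmax policy for one state
alpha, a_t = 0.5, 1                    # step size and the sampled action

for delta_t in (+2.0, -2.0):           # positive vs. negative advantage
    grad_log_pi = -softmax(theta)
    grad_log_pi[a_t] += 1.0            # grad of ln pi(a_t | s_t, theta) for a softmax policy
    theta_new = theta + alpha * grad_log_pi * delta_t
    print(f"delta_t={delta_t:+}: pi(a_t) {softmax(theta)[a_t]:.3f} -> {softmax(theta_new)[a_t]:.3f}")
```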

Implementation

The implementation is also simple, compared with the Implementation of the last lecture; a minimal training-loop sketch follows the list below.

  • on-policy
  • $\pi(a \mid s, \theta)$ is stochastic, so there is no need to use $\varepsilon$-greedy exploration
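
A minimal training-loop sketch, assuming Gymnasium's CartPole-v1 and reusing the illustrative `policy_net` and `a2c_update` from the sketch in the Algorithm section (all of this is an assumed setup, not the original implementation):

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")

for episode in range(500):
    s, _ = env.reset()
    done = False
    while not done:
        # On-policy: actions are sampled from the current stochastic policy itself,
        # so no epsilon-greedy exploration is needed.
        with torch.no_grad():
            logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a2c_update(s, a, r, s_next, float(terminated))   # bootstrap is cut only at true termination
        s = s_next
```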

Later

on-policy → off-policy