Advantage Actor-Critic (A2C)
Introduction
The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} \big(q_\pi(s,a) - b(s)\big)\, \nabla \pi(a\mid s,\theta)$$

The baseline can be any function as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$$\sum_{a} b(s)\, \nabla \pi(a\mid s,\theta) = b(s)\, \nabla \sum_{a} \pi(a\mid s,\theta) = b(s)\, \nabla 1 = 0$$
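The same statement can also be written in the expectation form used in the rest of these notes (with $\eta$ the state distribution and $A \sim \pi(\cdot\mid S,\theta)$; whether this is an equality or a proportionality depends on how $J$ and $\eta$ are defined):

$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\Big[\nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)\Big]$$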
Baseline invariance
Note that the Intro above briefly described how a baseline is introduced (adapted from Sutton & Barto, Section 13.4): adding a random variable or function that does not depend on the action keeps the original policy gradient equation unchanged and does not invalidate anything proved before. Below we analyze this invariance with respect to the baseline.
Property: the policy gradient is invariant to an additional baseline:

$$\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, q_\pi(S,A)\big] = \mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)\big]$$

Here the additional baseline $b(s)$ is a scalar function of $s \in \mathcal{S}$.
Why is it valid?

Because the expected value of the baseline term is zero:

$$\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, b(S)\big] = 0$$

Details:

$$
\begin{aligned}
\mathbb{E}_{A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid S,\theta)\, b(S)\big]
&= \sum_{a} \pi(a\mid S,\theta)\,\nabla_\theta \ln \pi(a\mid S,\theta)\, b(S) \\
&= b(S) \sum_{a} \nabla_\theta \pi(a\mid S,\theta)
 = b(S)\, \nabla_\theta \sum_{a} \pi(a\mid S,\theta)
 = b(S)\, \nabla_\theta 1 = 0
\end{aligned}
$$
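As a quick sanity check of the identity above, the following sketch (not from the original notes; it assumes a softmax policy over five actions in a single state and an arbitrary baseline value) verifies numerically that $\mathbb{E}_{A\sim\pi}\big[\nabla_\theta \ln \pi(A\mid s,\theta)\, b(s)\big]$ is the zero vector:

```python
import numpy as np

# Numerical check: for a softmax policy pi(a|s) = exp(theta_a) / sum_b exp(theta_b),
# the expectation over actions of grad_theta log pi(a|s) * b(s) is the zero vector.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # preferences for 5 actions in some state s
b = 3.7                             # arbitrary baseline value b(s)

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# grad_theta log pi(a|s) = onehot(a) - pi  (gradient of log-softmax);
# row a of the matrix below holds grad_theta log pi(a|s).
grad_log_pi = np.eye(len(theta)) - pi

expectation = (pi[:, None] * grad_log_pi * b).sum(axis=0)
print(expectation)   # approximately [0, 0, 0, 0, 0]
```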
Why is it useful?

The gradient is $\nabla_\theta J(\theta) = \mathbb{E}[X(S,A)]$, where

$$X(S,A) \doteq \nabla_\theta \ln \pi(A\mid S,\theta)\,\big(q_\pi(S,A) - b(S)\big)$$

- we have that $\mathbb{E}[X]$ is invariant to $b(S)$, from above;
- we want a $b(S)$ that can minimize $\operatorname{var}(X)$; usually $b(S) = v_\pi(S)$ is chosen, which is a suboptimal baseline (the optimal baseline is recalled right after this list).
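For reference, the scalar baseline that minimizes $\operatorname{var}(X)$ state by state has the following standard closed form (not derived in these notes; stated here only to show why $v_\pi$ is a convenient suboptimal substitute):

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2\big]}, \qquad s \in \mathcal{S}$$

Dropping the weights $\lVert \nabla_\theta \ln \pi(A\mid s,\theta_t)\rVert^2$ gives the simpler choice $b(s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)$, which is exactly the baseline used below.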
Why introduce a baseline

- Introducing a baseline does not change $\mathbb{E}[X]$, i.e., it does not change the original policy gradient equation.
- The purpose of choosing a (near-)optimal baseline is to reduce $\operatorname{var}(X)$; usually $v_\pi(S)$ is chosen as the baseline. We want every sampled $x$ to be as close to $\mathbb{E}[X]$ as possible, which reduces $\operatorname{var}(X)$.
Algorithm
The gradient-ascent algorithm is:

$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A\mid S,\theta_t)\,\big(q_\pi(S,A) - v_\pi(S)\big)\Big] = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A\mid S,\theta_t)\,\delta_\pi(S,A)\Big]$$

where $\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$ is called the advantage function.

The stochastic version is:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(q_t(s_t,a_t) - v_t(s_t)\big)$$

Referring to TD methods, we can estimate the advantage $q_t(s_t,a_t) - v_t(s_t)$ by an MC error (a sampled return minus $v_t(s_t)$) or replace it by the TD error:

$$\delta_t \doteq r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

So the algorithm can be written in the TD version:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t\mid s_t,\theta_t)\,\big(r_{t+1} + \gamma\, v_t(s_{t+1}) - v_t(s_t)\big)$$

Benefit: we only need one network to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.
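To make the TD version concrete, here is a minimal self-contained sketch. The 5-state chain environment, the tabular softmax actor, the tabular critic, and all hyperparameters are assumptions made for this illustration, not part of the notes:

```python
import numpy as np

# Minimal tabular one-step actor-critic on a toy 5-state chain MDP (illustrative only).
n_states, n_actions, gamma = 5, 2, 0.9
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
v = np.zeros(n_states)                    # critic: state-value estimates v(s)
alpha_theta, alpha_v = 0.1, 0.1
rng = np.random.default_rng(0)

def pi(s):
    """Softmax policy pi(.|s, theta)."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Chain dynamics: a=1 moves right, a=0 moves left; reaching the last state gives +1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        p = pi(s)
        a = rng.choice(n_actions, p=p)          # sample from the stochastic policy
        s_next, r, done = step(s, a)

        # TD error, used as the advantage estimate: delta = r + gamma*v(s') - v(s)
        delta = r + (0.0 if done else gamma * v[s_next]) - v[s]

        # critic update (TD(0)):  v(s) <- v(s) + alpha_v * delta
        v[s] += alpha_v * delta

        # actor update:  theta[s] <- theta[s] + alpha_theta * delta * grad log pi(a|s)
        # for a tabular softmax: grad_{theta[s]} log pi(a|s) = onehot(a) - pi(.|s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi

        s = s_next

print("state values:", np.round(v, 2))
print("P(move right):", np.round([pi(s)[1] for s in range(n_states)], 2))
```

Each transition yields one TD error $\delta_t$, which drives both the critic's TD(0) update and the actor's policy-gradient step; note that only $v$ is learned, never $q$.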
Analysis
Same as the analysis in the last lecture, the stochastic update can be rewritten as

$$\theta_{t+1} = \theta_t + \alpha\, \beta_t\, \nabla_\theta \pi(a_t\mid s_t,\theta_t), \qquad \beta_t \doteq \frac{q_t(s_t,a_t) - v_t(s_t)}{\pi(a_t\mid s_t,\theta_t)},$$

so the coefficient $\beta_t$ can well balance exploration and exploitation; but what matters here is the relative value (the advantage value $q_t(s_t,a_t) - v_t(s_t)$) rather than the absolute value (the action value $q_t(s_t,a_t)$).
Implementation
The implementation is also simple, compared with the implementation in the last lecture; a sketch with function approximation follows the list below.

- on-policy
- $\pi(a\mid s,\theta_t)$ is stochastic and therefore exploratory, so there is no need to use $\epsilon$-greedy.
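A hedged PyTorch-style sketch of the per-transition update with function approximation, assuming a discrete action space; the network architectures, learning rate, and the `(s, a, r, s_next, done)` interface are illustrative assumptions. It reflects the two points above: actions are sampled directly from the stochastic policy (no $\epsilon$-greedy), and only a single value network $v(s)$ is trained.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sketch only: obs_dim, n_actions, network sizes and lr are assumptions.
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # v(s) only
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def select_action(s):
    # The policy is stochastic, so we sample from it directly: no epsilon-greedy.
    logits = actor(torch.as_tensor(s, dtype=torch.float32))
    return Categorical(logits=logits).sample().item()

def update(s, a, r, s_next, done):
    """One on-policy update from the transition (s, a, r, s_next)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error as the advantage estimate: delta = r + gamma * v(s') - v(s).
    v_s = critic(s).squeeze()
    v_next = torch.zeros(()) if done else critic(s_next).squeeze().detach()
    delta = r + gamma * v_next - v_s

    log_prob = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -log_prob * delta.detach()   # policy-gradient (actor) term
    critic_loss = delta.pow(2)                # semi-gradient TD (critic) term

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```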
Later
on-policy ⇒ off-policy