Off-policy Actor Critic

Introduction

  • Policy gradient is on-policy, because the gradient $\nabla_\theta J(\theta)=\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S,\theta)\,q_\pi(S,A)\big]$ is an expectation over actions drawn from the policy $\pi$ being optimized, so the samples must be generated by $\pi$ itself.
  • We can convert it to an off-policy method by using importance sampling.

Illustrative Example

Consider a random variable $X \in \{+1, -1\}$. If the probability distribution of $X$ is

$$p_0(X=+1)=0.5,\qquad p_0(X=-1)=0.5,$$

then the expectation of $X$ is

$$\mathbb{E}_{X\sim p_0}[X]=(+1)\times 0.5+(-1)\times 0.5=0.$$

Question: how to estimate $\mathbb{E}_{X\sim p_0}[X]$ by using some samples $\{x_i\}_{i=1}^{n}$ of $X$?

Case 1 (familiar)

The samples are generated according to $p_0$: $\{x_i\}_{i=1}^{n}\sim p_0$.

Then, the average value can converge to the expectation:

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i \longrightarrow \mathbb{E}_{X\sim p_0}[X]=0 \quad \text{as } n\to\infty.$$

[Figure: Samples and Average Value — the running average of the samples converges to $\mathbb{E}_{X\sim p_0}[X]=0$.]

code: Figure Samples Average

Case 2 (new)

The samples are generated according to another distribution $p_1$:

$$p_1(X=+1)=0.8,\qquad p_1(X=-1)=0.2.$$

The expectation is

$$\mathbb{E}_{X\sim p_1}[X]=(+1)\times 0.8+(-1)\times 0.2=0.6.$$

If we use the average of the samples, then unsurprisingly

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i \longrightarrow \mathbb{E}_{X\sim p_1}[X]=0.6 \neq \mathbb{E}_{X\sim p_0}[X]=0.$$

Can we use $\{x_i\}\sim p_1$ to estimate $\mathbb{E}_{X\sim p_0}[X]$?

  • Why do that? We want to estimate $\mathbb{E}_{A\sim\pi}[\,\cdot\,]$, where $\pi$ is the target policy, based on samples generated by a behavior policy $\beta$ (off-policy: the samples we get follow $\beta$, but the expectation we want is under $\pi$).
  • How to do that? We cannot directly use the plain average $\bar{x}$ above; we need to re-weight the samples by the ratio of the target distribution to the behavior distribution, see Importance sampling.

Importance sampling

Note that

$$\mathbb{E}_{X\sim p_0}[X]=\sum_{x} p_0(x)\,x=\sum_{x} p_1(x)\,\underbrace{\frac{p_0(x)}{p_1(x)}\,x}_{f(x)}=\mathbb{E}_{X\sim p_1}[f(X)].$$

Then we can estimate $\mathbb{E}_{X\sim p_0}[X]$ by estimating $\mathbb{E}_{X\sim p_1}[f(X)]$.

How to estimate $\mathbb{E}_{X\sim p_1}[f(X)]$? Easy. Let

$$\bar{f}=\frac{1}{n}\sum_{i=1}^{n} f(x_i)=\frac{1}{n}\sum_{i=1}^{n}\frac{p_0(x_i)}{p_1(x_i)}\,x_i,\qquad x_i\sim p_1.$$

Therefore, $\bar{f}$ is a good approximation of $\mathbb{E}_{X\sim p_0}[X]$. The ratio $\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight.

  • If $p_1=p_0$, the importance weight is one and $\bar{f}$ becomes $\bar{x}$.
  • If $p_0(x_i)\ge p_1(x_i)$, $x_i$ can be sampled more often by $p_0$ than by $p_1$. The importance weight ($\ge 1$) emphasizes the importance of this sample, and vice versa.
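
As a quick numerical check of the example above (a minimal sketch; the sample size and seed are arbitrary choices, not from the original note), the plain average of samples drawn from $p_1$ settles near $0.6$, while the importance-weighted average $\bar{f}$ recovers $\mathbb{E}_{X\sim p_0}[X]=0$:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# p0 is the target distribution, p1 is the distribution the samples come from.
p0 = {1: 0.5, -1: 0.5}
p1 = {1: 0.8, -1: 0.2}

# Draw samples from p1.
x = rng.choice([1, -1], size=n, p=[p1[1], p1[-1]])

# Plain average: converges to E_{p1}[X] = 0.6, not E_{p0}[X] = 0.
plain_avg = x.mean()

# Importance-weighted average: f(x) = p0(x)/p1(x) * x, converges to E_{p0}[X] = 0.
weights = np.where(x == 1, p0[1] / p1[1], p0[-1] / p1[-1])
is_avg = (weights * x).mean()

print(f"plain average       ~ {plain_avg:.3f}  (close to 0.6)")
print(f"importance-weighted ~ {is_avg:.3f}  (close to 0.0)")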

You might ask: if we already know $p_0$, why not just compute the expectation under $p_0$ directly instead of going through this estimation?

  1. When $x$ is discrete: note that we often cannot enumerate all values of $x$; we can only obtain $x$ one sample at a time (or sample a batch from a replay buffer). In that case importance sampling lets us use those samples to update our estimate of the desired expectation.
  2. When $x$ is continuous: for example, the next note (AC DPG) discusses how to handle continuous actions. Integrating the action probability $\pi(a|s,\theta)$ over the entire action space is very difficult, but evaluating the probability of one particular action is much simpler (TD, deterministic). In that case we can use importance sampling to correct the current gradient estimate (multiplying by $\frac{\pi(a|s,\theta)}{\beta(a|s)}$); for more details on the continuous case, see Section 13.7 of Sutton's book. A small sketch of this point follows the list below.
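
To make point 2 concrete, here is a minimal sketch assuming hypothetical 1-D Gaussian target and behavior policies (the means and standard deviations are made up): evaluating the density of the single action that was actually taken, and hence the importance ratio $\pi(a|s,\theta)/\beta(a|s)$, is cheap, while an exact expectation under $\pi$ would require integrating over the whole action space.

import numpy as np

def gaussian_pdf(a, mu, sigma):
    # Density of a 1-D Gaussian policy at action a
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical Gaussian policies for one state s (made-up parameters)
mu_pi, sigma_pi = 0.3, 1.0      # target policy pi(.|s, theta)
mu_beta, sigma_beta = 0.0, 1.5  # behavior policy beta(.|s)

# One action generated by the behavior policy (e.g., stored in a replay buffer)
a = np.random.default_rng(0).normal(mu_beta, sigma_beta)

# Evaluating the density of this single action under each policy is cheap,
# so the importance ratio pi(a|s)/beta(a|s) is cheap too ...
ratio = gaussian_pdf(a, mu_pi, sigma_pi) / gaussian_pdf(a, mu_beta, sigma_beta)
print(f"a = {a:.3f}, importance ratio pi/beta = {ratio:.3f}")
# ... whereas computing an expectation under pi exactly would require
# integrating over the whole (continuous) action space.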

TL;DR

One-sentence summary

When we use samples from one distribution $p_1$ to estimate the expectation under another distribution $p_0$, we need to weight the sample average by the ratio $\frac{p_0(x)}{p_1(x)}$. Element-wise, we simply take the $i$-th sample $x_i$, divide it by $p_1(x_i)$, and multiply it by $p_0(x_i)$.

If $\{x_i\}\sim p_1$ with $p_1(X=+1)=0.8$ and $p_1(X=-1)=0.2$, then the plain average converges to $0.6$, while the importance-sampling estimate converges to $\mathbb{E}_{X\sim p_0}[X]=0$:

[Figure: Importance Sampling — samples from $p_1$, their plain average (→ 0.6), and the importance-sampling estimate (→ 0).]

code: Figure Importance Sampling

Theorem of off-policy policy gradient

  • Suppose $\beta$ is the behavior policy that generates experience samples.
  • Our goal is to use these samples to update a target policy $\pi(a|s,\theta)$ that maximizes the metric $J(\theta)=\mathbb{E}_{S\sim d_\beta}[v_\pi(S)]=\sum_{s\in\mathcal{S}} d_\beta(s)\,v_\pi(s)$, where $d_\beta$ is the stationary distribution under the behavior policy $\beta$.

In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is

$$\nabla_\theta J(\theta)=\mathbb{E}_{S\sim\rho,\,A\sim\beta}\left[\frac{\pi(A|S,\theta)}{\beta(A|S)}\,\nabla_\theta \ln \pi(A|S,\theta)\,q_\pi(S,A)\right],$$

where $\beta$ is the behavior policy and $\rho$ is the state distribution.

Note that the state distribution $\rho$ here is distinguished from the on-policy distribution mentioned earlier: here the experience is generated off-policy, with actions drawn from $\beta$ rather than from $\pi$.
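
As a sanity check of the importance-weighting step for a single fixed state (not a proof of the full theorem; the softmax logits, behavior probabilities, and action values below are made-up numbers), the expectation of $\frac{\pi(A|s,\theta)}{\beta(A|s)}\nabla_\theta\ln\pi(A|s,\theta)\,q_\pi(s,A)$ under $A\sim\beta$ coincides with the expectation of $\nabla_\theta\ln\pi(A|s,\theta)\,q_\pi(s,A)$ under $A\sim\pi$:

import numpy as np

# One state, three actions; softmax (tabular) target policy with logits theta.
theta = np.array([0.2, -0.5, 1.0])          # made-up logits
pi = np.exp(theta) / np.exp(theta).sum()    # pi(a|s, theta)
beta = np.array([0.6, 0.3, 0.1])            # made-up behavior policy beta(a|s)
q = np.array([1.0, -2.0, 0.5])              # made-up action values q_pi(s, a)

# grad_theta ln pi(a|s,theta) for a softmax policy is (one_hot(a) - pi); row a below.
grad_log_pi = np.eye(3) - pi

# On-policy expectation: E_{A~pi}[ grad ln pi(A|s) * q(s,A) ]
on_policy = sum(pi[a] * grad_log_pi[a] * q[a] for a in range(3))

# Off-policy expectation with importance weights: E_{A~beta}[ (pi/beta) * grad ln pi * q ]
off_policy = sum(beta[a] * (pi[a] / beta[a]) * grad_log_pi[a] * q[a] for a in range(3))

print("on-policy  gradient:", on_policy)
print("off-policy gradient:", off_policy)   # identical to the on-policy one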

Algorithm of off-policy actor-critic

The corresponding stochastic gradient-ascent algorithm is

$$\theta_{t+1}=\theta_t+\alpha_\theta\,\frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)}\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\,q_t(s_t,a_t).$$

Similar to the on-policy case, $q_t(s_t,a_t)$ is replaced by the advantage $q_t(s_t,a_t)-v_t(s_t)$, which is further approximated by the TD error

$$\delta_t = r_{t+1}+\gamma\,v_t(s_{t+1})-v_t(s_t).$$

Then, the algorithm becomes

$$\theta_{t+1}=\theta_t+\alpha_\theta\,\frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)}\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\,\delta_t.$$

Implementation

Compared with the on-policy actor-critic implementation, the only change in the update is the importance weight $\frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)}$, and the transitions are generated by the behavior policy $\beta$ (e.g., read from a replay buffer) instead of by $\pi$ itself. A minimal sketch is given below.
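
Below is a minimal single-transition sketch of the update rules above, using a tabular softmax actor and a tabular TD(0) critic. The function name off_policy_ac_step, the state/action sizes, and the step sizes are illustrative assumptions, not code from this note; weighting the critic's TD update by the importance ratio is shown here as one common choice, not something the note prescribes.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def off_policy_ac_step(theta, v, transition, beta_prob,
                       gamma=0.9, alpha_theta=0.01, alpha_w=0.05):
    """One off-policy actor-critic update from a single transition (s, a, r, s_next)
    generated by a behavior policy, with beta_prob = beta(a|s)."""
    s, a, r, s_next = transition

    pi_s = softmax(theta[s])            # pi(.|s, theta)
    rho = pi_s[a] / beta_prob           # importance weight pi(a|s,theta)/beta(a|s)

    # TD error (advantage estimate): delta = r + gamma*v(s') - v(s)
    delta = r + gamma * v[s_next] - v[s]

    # Critic: TD(0) update of the state value, here also weighted by rho.
    v[s] += alpha_w * rho * delta

    # Actor: theta <- theta + alpha * rho * grad ln pi(a|s,theta) * delta,
    # where grad ln pi for a softmax policy is one_hot(a) - pi(.|s).
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_log_pi
    return theta, v

# Tiny usage example: 3 states, 2 actions, one transition from a replay buffer.
theta = np.zeros((3, 2))    # actor parameters (logits per state)
v = np.zeros(3)             # critic (state values)
theta, v = off_policy_ac_step(theta, v, transition=(0, 1, 1.0, 2), beta_prob=0.5)
print(theta[0], v[0])

In a deep-RL version the structure is the same: the behavior probability $\beta(a_t|s_t)$ is typically stored alongside the transition in the replay buffer, and the ratio multiplies the actor's log-probability gradient (and, depending on the variant, the critic's TD error).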

Later

  • discrete action space → continuous action space (covered next, in the deterministic actor-critic / DPG note)

Appendix

Figure Samples Average

import numpy as np
import matplotlib.pyplot as plt
 
# Generate random values +1 or -1, with probability 0.5 for each
np.random.seed(1337)  # Set random seed to ensure reproducibility
num_samples = 200
samples = np.random.choice([1, -1], size=num_samples)
 
# Calculate the average
# cumulative_avg = np.cumsum(samples) / np.arange(1, num_samples + 1)
cumulative_avg = np.zeros(num_samples)
mean = 0  # Initial mean
for k in range(1, num_samples + 1):
    mean = mean - (1 / k) * (mean - samples[k - 1])
    cumulative_avg[k - 1] = mean
 
# Create plot
plt.figure(figsize=(6, 4))
plt.scatter(range(num_samples), samples, marker='d', facecolors='none', edgecolors='blue', label="samples")
plt.plot(range(num_samples), cumulative_avg, color='orangered', label="average", linewidth=2)
 
# Turn off grid
plt.grid(False)
 
# Add a gray dashed line at Value = 0
plt.axhline(y=0, color='gray', linestyle=':', alpha=0.6)
 
# Set legend and labels
plt.xlabel("Sample index")
plt.ylabel("Value")
plt.legend()
plt.ylim(-2, 2)
 
plt.title("Samples and Average Value")
# plt.show()
plt.savefig("Figure_Samples_Average.png", dpi=300, bbox_inches='tight')

Figure Importance Sampling

import numpy as np
import matplotlib.pyplot as plt
 
# Generate random values +1 or -1, with probabilities 0.8 and 0.2 (distribution p1)
np.random.seed(42)  # Set random seed to ensure reproducibility
num_samples = 200
samples = np.random.choice([1, -1], size=num_samples, p=[0.8, 0.2])
 
# Calculate the average
avg = np.zeros(num_samples)
mean = 0  # Initial mean
for i, sample in enumerate(samples):
    alpha = 1 / (i + 1)
    mean = mean - alpha * (mean - sample)
    avg[i] = mean
 
# Importance sampling
# Calculate the average using importance sampling
avg_imp = np.zeros(num_samples)
mean_imp = 0  # Initial mean
for i, sample in enumerate(samples):
    alpha = 1 / (i + 1)
    # Update the mean using importance sampling: weight each sample by p0(x)/p1(x)
    ratio = 0.5 / (0.8 if sample == 1 else 0.2)
    mean_imp = mean_imp - alpha * (mean_imp - ratio * sample)
    avg_imp[i] = mean_imp
 
# Create plot
plt.figure(figsize=(6, 4))
plt.scatter(range(num_samples), samples, marker='o', facecolors='none', edgecolors='orangered', label="samples")
plt.plot(range(num_samples), avg, label="average", linestyle=':')
plt.plot(range(num_samples), avg_imp, label="importance sampling", color='green')
 
# Turn off grid
plt.grid(False)
 
# Add a gray dashed line at Value = 0
plt.axhline(y=0, color='gray', linestyle=':', alpha=0.6)
 
# Set legend and labels
plt.xlabel("Sample index")
plt.ylabel("Value")
plt.legend()
plt.ylim(-2.5, 2.5)
 
plt.title("Importance Sampling")
# plt.show()
plt.savefig("Figure_Importance_Sampling.png", dpi=300, bbox_inches='tight')