Reinforcement Learning Notes

Sep 30, 2025, 3 min read

Example of Stochastic Approximation

Review

Recall the law of large numbers and the mean estimation problem:

- Consider a random variable $X$.
- Suppose that we collected a sequence of i.i.d. samples $\{x_j\}_{j=1}^{N}$ from $X$.
- The aim is to estimate $E[X]$.
- The expectation of $X$ can be approximated by $E[X] \approx \bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j$.
- This is the basic idea of Monte Carlo estimation.
- $\lim_{N \to \infty} \bar{x} = E[X]$
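The approximation above can be tried numerically. The following is a minimal sketch, not part of the original notes: the helper name `monte_carlo_mean` and the Gaussian test distribution (mean 5, standard deviation 2) are assumptions chosen only for illustration.

```python
import random


def monte_carlo_mean(sample, n):
    """Average n i.i.d. draws produced by the zero-argument callable `sample`."""
    return sum(sample() for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the run is repeatable


def draw():
    # Hypothetical X ~ N(5, 2^2), so E[X] = 5 (an assumption for this demo).
    return rng.gauss(5.0, 2.0)


# The estimate should move closer to 5 as N grows.
for n in (10, 1_000, 100_000):
    print(n, monte_carlo_mean(draw, n))
```

The printed estimates depend on the seed, but the error should shrink roughly like $1/\sqrt{N}$ as more samples are averaged.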

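To reiterate: why do we care so much about mean estimation?

This was raised once before, from the angle of the two probability distributions in the environment model: we want something that can stand in for the model's probabilistic form, and that something is the expectation $E$. Beyond that angle, most quantities in reinforcement learning, such as action values and gradients, are themselves defined as expectations.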

New question: how to calculate the mean $\bar{x}$?

- Non-incremental: the trivial approach. Wait until all samples are collected, then calculate the average.
  - Drawback: no estimate is available while the samples are still being collected.
- Incremental: update the average after each sample arrives.

Incremental mean estimation

Suppose $w_{k + 1}$ is the average of the first $k$ samples

$$w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_i, \quad k = 1, 2, \dots$$

hence

$$w_k = \frac{1}{k-1} \sum_{i=1}^{k-1} x_i, \quad k = 2, 3, \dots$$

Then we can express $w_{k + 1}$ in terms of $w_{k}$ and $x_{k}$

$$
w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_i = \frac{1}{k} \left( x_k + \sum_{i=1}^{k-1} x_i \right) = \frac{1}{k} \left( x_k + (k-1) w_k \right) = w_k - \frac{1}{k} (w_k - x_k)
$$

Verification:

$$
\begin{aligned}
w_2 &= x_1 \\
w_3 &= w_2 - \frac{1}{2} (w_2 - x_2) = \frac{1}{2} (x_1 + x_2) \\
w_4 &= w_3 - \frac{1}{3} (w_3 - x_3) = \frac{1}{3} (x_1 + x_2 + x_3) \\
&\;\;\vdots \\
w_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} x_i
\end{aligned}
$$
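The same verification can be run numerically. This is a minimal sketch under stated assumptions: the function name `incremental_mean` and the sample values are hypothetical, and the starting value $w_1 = 0$ is arbitrary because the $k = 1$ step replaces it with $x_1$.

```python
def incremental_mean(samples):
    """Yield w_2, w_3, ... using w_{k+1} = w_k - (1/k) * (w_k - x_k)."""
    w = 0.0  # w_1: arbitrary, the k = 1 step overwrites it with x_1
    for k, x in enumerate(samples, start=1):
        w = w - (w - x) / k
        yield w


xs = [2.0, 4.0, 9.0, 1.0]  # hypothetical samples
for k, w in enumerate(incremental_mean(xs), start=1):
    batch = sum(xs[:k]) / k  # non-incremental average of the first k samples
    print(f"w_{k + 1} = {w:.4f}   batch mean of first {k} samples = {batch:.4f}")
```

Each printed pair should agree, and the incremental version never needs to store the earlier samples.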

Remarks:

$$w_{k+1} = w_k - \frac{1}{k} (w_k - x_k) = \left( 1 - \frac{1}{k} \right) w_k + \frac{1}{k} x_k$$

- A mean estimate is available immediately after each sample is received.
- $\lim_{k \to \infty} w_k = E[X]$

A more general form (with a constant step size it is known as the exponential moving average, EMA, used to estimate the mean of a time series, e.g., smoothing stock prices):

$$w_{k+1} = w_k - \alpha_k (w_k - x_k) = \alpha_k x_k + (1 - \alpha_k) w_k$$

- Here $\frac{1}{k}$ is replaced by $\alpha_k$.
- If $\{\alpha_k\}$ satisfies some mild conditions, then $\lim_{k \to \infty} w_k = E[X]$.
- This algorithm is a special case of stochastic approximation (SA) and also a special case of stochastic gradient descent (SGD).
- In the next lecture, we will see that TD algorithms have a similar (but more complex) form.
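The effect of the step size can be explored with a short experiment. This is a sketch under assumptions that are not in the notes: the function name `sa_mean`, the Gaussian samples with mean 5, and the three step-size schedules are all illustrative. The "mild conditions" are commonly taken to be the Robbins-Monro conditions ($\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$); the first two schedules satisfy them, while the constant one does not.

```python
import random


def sa_mean(samples, step_size):
    """Run w_{k+1} = w_k - a_k * (w_k - x_k) with a_k = step_size(k), from w_1 = 0."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - step_size(k) * (w - x)
    return w


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.gauss(5.0, 2.0) for _ in range(50_000)]  # hypothetical X with E[X] = 5

print("a_k = 1/k      :", sa_mean(xs, lambda k: 1.0 / k))    # exact sample mean
print("a_k = k^-0.75  :", sa_mean(xs, lambda k: k ** -0.75))  # decaying step size
print("a_k = 0.1 (EMA):", sa_mean(xs, lambda k: 0.1))         # constant step size
```

With a constant step size the estimate keeps tracking recent samples, so it typically hovers around $E[X]$ rather than settling on it. That behaviour is useful for a drifting time series and is the usual reason for choosing an EMA.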

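Similarity between EMA and SGD

They are indeed very similar. If $(w_k - x_k)$ is viewed as a gradient (it is the gradient of $\frac{1}{2}(w - x_k)^2$ at $w = w_k$), the update is a gradient descent step. Written in mini-batch form, it looks even more like the usual backpropagation-based training loop in the torch framework.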
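The analogy can be made concrete with a plain-Python sketch of mini-batch SGD on the objective $J(w) = E[\frac{1}{2}(w - X)^2]$, whose per-sample gradient is $(w - x)$. The function name `sgd_mean`, its parameters, and the sample values are assumptions for illustration; this is not torch code, only the same update written by hand.

```python
def sgd_mean(samples, batch_size=1, lr=lambda k: 1.0 / k):
    """Minimise E[(w - X)^2 / 2] by mini-batch SGD, starting from w = 0."""
    w = 0.0
    starts = range(0, len(samples), batch_size)
    for k, start in enumerate(starts, start=1):
        batch = samples[start:start + batch_size]
        # Mini-batch gradient: the average of the per-sample gradients (w - x).
        grad = sum(w - x for x in batch) / len(batch)
        w = w - lr(k) * grad  # gradient descent step with learning rate lr(k)
    return w


xs = [2.0, 4.0, 9.0, 1.0]  # hypothetical samples
print(sgd_mean(xs))                # batch size 1, lr = 1/k: the incremental mean
print(sgd_mean(xs, batch_size=2))  # same update applied to mini-batch averages
```

With `batch_size=1` and learning rate $1/k$, the loop is exactly the incremental mean update, which is why the algorithm can be read as a special case of SGD.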
