Stochastic Gradient Descent (SGD)
Algorithm
Suppose we aim to solve the following optimization problem:
$$\min_w J(w) = \mathbb{E}[f(w, X)]$$
- $w$ is the parameter to be optimized
- $X$ is the random variable. The expectation is with respect to $X$
- $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is a scalar function
Method 1: gradient descent (GD)
$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$
- Drawback: it requires the distribution of $X$ to calculate the expectation

Method 2: batch gradient descent (BGD)
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i) \approx \mathbb{E}[\nabla_w f(w_k, X)]$$
- Drawback: it requires many samples in each iteration for each $w_k$

Method 3: stochastic gradient descent (SGD) (batch_size = 1)
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
where $x_k$ is a sample of $X$ collected at time $k$
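To make the difference concrete, here is a minimal Python sketch; the objective $f(w, X) = \frac{1}{2}(w - X)^2$ with $X \sim \mathcal{N}(5, 1)$ is an illustrative assumption, not from the original text. GD needs the distribution of $X$ (here, its mean), while SGD only draws samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy objective: f(w, X) = 0.5 * (w - X)^2 with X ~ N(5, 1),
# so the true gradient is E[grad_w f(w, X)] = w - E[X] = w - 5.
E_X = 5.0
alpha = 0.1

w_gd = w_sgd = 0.0
for k in range(100):
    # GD: requires the distribution of X to evaluate the expected gradient.
    w_gd -= alpha * (w_gd - E_X)
    # SGD: replaces the expectation with the gradient at one sample x_k.
    x_k = rng.normal(E_X, 1.0)
    w_sgd -= alpha * (w_sgd - x_k)

print(w_gd, w_sgd)  # both approach 5; the SGD iterate is noisier
```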
Examples
We consider an example:
$$\min_w J(w) = \mathbb{E}[f(w, X)] = \mathbb{E}\left[\frac{1}{2} \|w - X\|^2\right]$$
where $f(w, X) = \frac{1}{2} \|w - X\|^2$
Exercise 1: Show that the optimal solution is $w^* = \mathbb{E}[X]$.
The optimal solution must satisfy:
$$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = \mathbb{E}[w - X] = w - \mathbb{E}[X] = 0$$
which gives $w^* = \mathbb{E}[X]$.
Exercise 2: Write out the GD algorithm for solving this problem.
The GD algorithm is:
$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[w_k - X] = w_k - \alpha_k (w_k - \mathbb{E}[X])$$
Exercise 3: Write out the SGD algorithm for solving this problem.
The SGD algorithm is:
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k)$$
- It is the same as the algorithm in the Incremental mean estimation problem
- Mean estimation is a special case of SGD
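To see the equivalence concretely, here is a minimal sketch (the data and setup are assumed for illustration) checking that SGD with step size $\alpha_k = 1/k$ reproduces the incremental mean estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(3.0, 2.0, 1000)  # i.i.d. samples of X (illustrative)

w = 0.0
for k, x_k in enumerate(samples, start=1):
    # SGD on f(w, x) = 0.5 * (w - x)^2 with alpha_k = 1/k
    w = w - (1.0 / k) * (w - x_k)

# With alpha_k = 1/k, w_k is exactly the running average of the first k samples.
print(w, np.mean(samples))  # the two values match
```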
Convergence
Question: Does $w_k \to w^*$ as $k \to \infty$?
Consider using the RM algorithm to minimize $J(w)$:
Convert the optimization problem to a root-finding problem:
$$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0$$
Let $g(w) = \nabla_w J(w)$. What we can measure is a noisy observation of $g$:
$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}$$
Then, the RM algorithm for solving $g(w) = 0$ is:
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$
- It is exactly the same as the SGD algorithm
- SGD is a special RM algorithm
- SGD's convergence is guaranteed by the RM convergence theorem
- For the SGD convergence pattern, see here
Convergence of SGD
In the SGD algorithm, if
- $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
- $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
- $\{x_k\}_{k=1}^{\infty}$ is i.i.d.;

then $w_k \to w^*$ as $k \to \infty$ with probability 1, where $w^*$ is the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$.
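For instance, the classic choice $a_k = 1/k$ satisfies the second condition, since
$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$$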
BGD, MBGD and SGD
Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^{n}$ of $X$ (a code sketch of the three updates follows this list):
- BGD: $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$. All the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$
- MBGD: $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j)$, where $\mathcal{I}_k$ is a subset of $\{1, \dots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ times i.i.d. samplings
- SGD: $w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$, where $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$ at time $k$
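Below is a minimal Python sketch of the three updates, reusing the mean-estimation objective $f(w, x) = \frac{1}{2}\|w - x\|^2$ from the example above; the data, step size, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(3.0, 2.0, 500)  # the sample set {x_i} (illustrative)

# Mean-estimation objective: f(w, x) = 0.5 * (w - x)^2,
# so grad_w f(w, x) = w - x.
def grad(w, x):
    return w - x

alpha, m = 0.1, 32  # step size and mini-batch size (assumed values)
w_bgd = w_mbgd = w_sgd = 0.0

for k in range(300):
    # BGD: average the gradient over all n samples.
    w_bgd -= alpha * np.mean(grad(w_bgd, data))
    # MBGD: average over a mini-batch of m i.i.d. draws from the data.
    batch = rng.choice(data, size=m)
    w_mbgd -= alpha * np.mean(grad(w_mbgd, batch))
    # SGD: a single randomly drawn sample per iteration.
    w_sgd -= alpha * grad(w_sgd, rng.choice(data))

print(w_bgd, w_mbgd, w_sgd, np.mean(data))  # all approach the sample mean
```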
Compare MBGD with BGD and SGD
This is very similar to the earlier PI > TPI > VI comparison: MBGD is a compromise between BGD and SGD (BGD > MBGD > SGD). BGD's per-iteration computation is too expensive, SGD's is cheap but too random, and MBGD lies in between; its computational cost can be controlled by adjusting the batch_size (i.e., $m$ above).
- Compared with SGD (equivalent to batch_size = 1), MBGD has less randomness (it is more robust), because it averages over a batch rather than using a single sample
- Compared with BGD (equivalent to batch_size = n), MBGD requires less computation, because it uses only a batch of samples rather than all of them

Summary: just as TPI was obtained in policy evaluation by reducing the number of iterative solution steps, MBGD is obtained here by reducing the batch_size.
- Also: for the different effects of adjusting the batch_size, see here
Exercise 4
Given some numbers $\{x_i\}_{i=1}^{n}$, our aim is to calculate the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. This problem can be equivalently stated as the following optimization problem:
$$\min_w J(w) = \frac{1}{2n} \sum_{i=1}^{n} \|w - x_i\|^2$$
Use the BGD, MBGD and SGD algorithms to solve this problem.
The three algorithms for solving this problem are, respectively:
- BGD: $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x})$
- MBGD: $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} (w_k - x_j) = w_k - \alpha_k (w_k - \bar{x}_k^{(m)})$
- SGD: $w_{k+1} = w_k - \alpha_k (w_k - x_k)$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{x}_k^{(m)} = \frac{1}{m} \sum_{j \in \mathcal{I}_k} x_j$.
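As a quick check on the BGD form, with a constant step size $\alpha \in (0, 1)$ the update is a contraction toward $\bar{x}$:
$$w_{k+1} - \bar{x} = (1 - \alpha)(w_k - \bar{x}) \quad\Longrightarrow\quad w_k - \bar{x} = (1 - \alpha)^k (w_0 - \bar{x}) \to 0$$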