Stochastic Gradient Descent (SGD)

Algorithm

Suppose we aim to solve the following optimization problem:

$$\min_w J(w) = \mathbb{E}[f(w, X)]$$

  • $w$ is the parameter to be optimized
  • $X$ is a random variable. The expectation is with respect to $X$
  • $w$ and $X$ can be either scalars or vectors. The function $f$ is a scalar function

Method 1: gradient descent (GD)

$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

  • Drawback: requires the distribution of $X$ to calculate the expectation

Method 2: batch gradient descent (BGD)

$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$

  • Drawback: requires many samples in each iteration for each $w_k$

Method 3: stochastic gradient descent (SGD) (batch_size = 1)

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

Examples

We consider an example:

$$\min_w J(w) = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right]$$

where

$$f(w, X) = \frac{1}{2}\|w - X\|^2, \qquad \nabla_w f(w, X) = w - X$$

Exercise 1: Show that the optimal solution is $w^* = \mathbb{E}[X]$.

The optimal solution must satisfy:

$$\nabla_w J(w) = \mathbb{E}[w - X] = w - \mathbb{E}[X] = 0 \implies w^* = \mathbb{E}[X]$$

Exercise 2: Write out the GD algorithm for solving this problem.

The GD algorithm is:

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[w_k - X]$$

Exercise 3: Write out the SGD algorithm for solving this problem.

The SGD algorithm is:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k)$$
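As a quick numerical check of the SGD update above, the sketch below runs it on the mean-estimation example. The distribution of $X$ (here Gaussian with mean 5), the sample count, and the step size $a_k = 1/k$ are illustrative assumptions, not from the text:

```python
import random

random.seed(0)

# Illustrative assumption: X ~ N(5, 1), so E[X] = 5.
samples = [random.gauss(5.0, 1.0) for _ in range(5000)]

# SGD: w_{k+1} = w_k - a_k * (w_k - x_k), with step size a_k = 1/k.
w = 0.0
for k, x_k in enumerate(samples, start=1):
    a_k = 1.0 / k
    w = w - a_k * (w - x_k)

print(w)  # close to E[X] = 5
```

With $a_k = 1/k$ this recursion reproduces the running sample mean of $x_1, \dots, x_k$, which is why $w_k$ approaches $\mathbb{E}[X]$ as $k$ grows.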

Convergence

Question: Does $w_k \to w^*$ as $k \to \infty$?

Consider using the Robbins-Monro (RM) algorithm to minimize $J(w)$:

Convert the optimization problem to a root-finding problem:

$$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0$$

What we can measure is:

$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}$$

Then, the RM algorithm is:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$

  • It is exactly the same as the SGD algorithm
  • SGD is a special case of the RM algorithm
  • Hence SGD's convergence is guaranteed by the RM convergence theorem
  • For the SGD convergence pattern, see here

Convergence of SGD

In SGD, if

  1. $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
  2. $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
  3. $\{x_k\}_{k=1}^{\infty}$ is i.i.d.;

Then, as $k \to \infty$, $w_k$ converges with probability 1 to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$.
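Condition 2 is the familiar Robbins-Monro step-size condition, and $a_k = 1/k$ satisfies it. A quick numeric illustration of the two partial sums (the cutoff $N = 10^6$ is arbitrary):

```python
# For a_k = 1/k: sum a_k grows without bound (diverges like ln N),
# while sum a_k^2 stays bounded (converges to pi^2/6 ~ 1.645).
N = 10**6
s1 = sum(1.0 / k for k in range(1, N + 1))     # ~ ln(N) + 0.577, keeps growing
s2 = sum(1.0 / k**2 for k in range(1, N + 1))  # approaches pi^2 / 6
print(s1, s2)
```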

BGD, MBGD and SGD

Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^{n}$ of $X$:

$$\text{BGD:} \quad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$

$$\text{MBGD:} \quad w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j)$$

$$\text{SGD:} \quad w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

  • BGD: all the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$
  • MBGD: $\mathcal{I}_k$ is a subset of $\{1, \dots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings
  • SGD: $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$ at time $k$

Compare MBGD with BGD and SGD

This is very similar to the earlier PI > TPI > VI comparison: MBGD is in fact a compromise between BGD and SGD (BGD > MBGD > SGD). BGD requires too much computation per iteration, SGD requires very little but is highly random, and MBGD sits in between; the amount of computation can be controlled by adjusting the batch size (the $m$ above).

  • Compared with SGD (equivalent to batch_size = 1), MBGD has less randomness (it is more robust), because it averages over a batch instead of relying on a single sample
  • Compared with BGD (equivalent to batch_size = n), MBGD requires less computation, because it uses only a batch of samples instead of all of them

Summary: just as TPI was obtained in policy evaluation by reducing the number of iterations of the iterative solver, MBGD is obtained here by reducing the batch size.

  • In addition: for the different effects of adjusting batch_size, see here
Exercise 4

Given some numbers $\{x_i\}_{i=1}^{n}$, our aim is to calculate the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. This problem can be equivalently stated as the following optimization problem:

$$\min_w J(w) = \frac{1}{2n} \sum_{i=1}^{n} \|w - x_i\|^2$$

Use the BGD, MBGD and SGD algorithms to solve this problem.

The three algorithms for solving this problem are, respectively:

  • BGD: $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x})$
  • MBGD: $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} (w_k - x_j) = w_k - \alpha_k (w_k - \bar{x}_k^{(m)})$
  • SGD: $w_{k+1} = w_k - \alpha_k (w_k - x_k)$

where $\bar{x}_k^{(m)} = \frac{1}{m} \sum_{j \in \mathcal{I}_k} x_j$ and $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$ at time $k$.
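The three updates can be sketched side by side on this mean-estimation problem. The data (Gaussian samples), the step size $\alpha_k = 1/k$, and the batch size $m = 10$ are illustrative assumptions:

```python
import random

random.seed(1)

n = 100
data = [random.gauss(5.0, 2.0) for _ in range(n)]  # samples {x_i}
x_bar = sum(data) / n                              # target: the sample mean

def run(grad_estimate, iters=2000):
    """Generic iteration w_{k+1} = w_k - a_k * g_k with a_k = 1/k."""
    w = 0.0
    for k in range(1, iters + 1):
        w -= (1.0 / k) * grad_estimate(w)
    return w

m = 10  # mini-batch size (assumption)

w_bgd  = run(lambda w: sum(w - x for x in data) / n)                    # all n samples
w_mbgd = run(lambda w: sum(w - x for x in random.sample(data, m)) / m)  # batch of m
w_sgd  = run(lambda w: w - random.choice(data))                         # single sample

print(w_bgd, w_mbgd, w_sgd)  # all three approach x_bar
```

BGD lands on $\bar{x}$ deterministically, while MBGD and SGD fluctuate around it, with MBGD's estimate noticeably less noisy than SGD's, matching the comparison above.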