Stochastic Gradient Descent (SGD)
Algorithm
Suppose we aim to solve the following optimization problem:
$$\min_w J(w) = \mathbb{E}[f(w, X)]$$
- $w$ is the parameter to be optimized
- $X$ is the random variable. The expectation is with respect to $X$
- $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is a scalar function
Method 1: gradient descent (GD)
$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$
- Drawback: it requires the distribution of $X$ to calculate the expectation

Method 2: batch gradient descent (BGD)
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i) \approx \mathbb{E}[\nabla_w f(w_k, X)]$$
- Drawback: it requires many samples in each iteration for each $w_k$

Method 3: stochastic gradient descent (SGD) (batch_size = 1)
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
where $x_k$ is a sample of $X$ collected at time $k$
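To make the difference concrete, here is a minimal Python sketch; the objective $f(w, X) = \frac{1}{2}(w - X)^2$ with $X \sim \mathcal{N}(5, 1)$ is an illustrative assumption, not from the original text. GD needs the distribution of $X$ (here, its mean), while SGD only draws samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy objective: f(w, X) = 0.5 * (w - X)^2 with X ~ N(5, 1),
# so the true gradient is E[grad_w f(w, X)] = w - E[X] = w - 5.
E_X = 5.0
alpha = 0.1

w_gd = w_sgd = 0.0
for k in range(100):
    # GD: requires the distribution of X to evaluate the expected gradient.
    w_gd -= alpha * (w_gd - E_X)
    # SGD: replaces the expectation with the gradient at one sample x_k.
    x_k = rng.normal(E_X, 1.0)
    w_sgd -= alpha * (w_sgd - x_k)

print(w_gd, w_sgd)  # both approach 5; the SGD iterate is noisier
```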
Examples
We consider an example:
$$\min_w J(w) = \mathbb{E}[f(w, X)] = \mathbb{E}\left[\frac{1}{2} \|w - X\|^2\right]$$
where $f(w, X) = \frac{1}{2} \|w - X\|^2$
Exercise 1: Show that the optimal solution is $w^* = \mathbb{E}[X]$.
The optimal solution must satisfy:
$$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = \mathbb{E}[w - X] = w - \mathbb{E}[X] = 0$$
which gives $w^* = \mathbb{E}[X]$.
Exercise 2: Write out the GD algorithm for solving this problem.
The GD algorithm is:
$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[w_k - X] = w_k - \alpha_k (w_k - \mathbb{E}[X])$$
Exercise 3: Write out the SGD algorithm for solving this problem.
The SGD algorithm is:
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k)$$
- It is the same as the algorithm in the Incremental mean estimation problem
- Mean estimation is a special case of SGD
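To see the equivalence concretely, here is a minimal sketch (the data and setup are assumed for illustration) checking that SGD with step size $\alpha_k = 1/k$ reproduces the incremental mean estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(3.0, 2.0, 1000)  # i.i.d. samples of X (illustrative)

w = 0.0
for k, x_k in enumerate(samples, start=1):
    # SGD on f(w, x) = 0.5 * (w - x)^2 with alpha_k = 1/k
    w = w - (1.0 / k) * (w - x_k)

# With alpha_k = 1/k, w_k is exactly the running average of the first k samples.
print(w, np.mean(samples))  # the two values match
```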
Convergence
Question: Does $w_k \to w^*$ as $k \to \infty$?
Consider using the RM algorithm to minimize $J(w)$:
Convert the optimization problem to a root-finding problem:
$$\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0$$
Let $g(w) = \nabla_w J(w)$. What we can measure is a noisy observation of $g$:
$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big)}_{\eta}$$
Then, the RM algorithm for solving $g(w) = 0$ is:
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k)$$
- It is exactly the same as the SGD algorithm
- SGD is a special RM algorithm
- SGD's convergence is guaranteed by the RM convergence theorem
- For the SGD convergence pattern, see here
Convergence of SGD
In the SGD algorithm, if
- $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
- $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
- $\{x_k\}_{k=1}^{\infty}$ is i.i.d.;

then $w_k \to w^*$ as $k \to \infty$ with probability 1, where $w^*$ is the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$.
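For instance, the classic choice $a_k = 1/k$ satisfies the second condition, since
$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$$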
BGD, MBGD and SGD
Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^{n}$ of $X$ (a code sketch of the three updates follows this list):
- BGD: $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$. All the samples are used in every iteration. When $n$ is large, $\frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$
- MBGD: $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j)$, where $\mathcal{I}_k$ is a subset of $\{1, \dots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ times i.i.d. samplings
- SGD: $w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$, where $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$ at time $k$
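Below is a minimal Python sketch of the three updates, reusing the mean-estimation objective $f(w, x) = \frac{1}{2}\|w - x\|^2$ from the example above; the data, step size, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(3.0, 2.0, 500)  # the sample set {x_i} (illustrative)

# Mean-estimation objective: f(w, x) = 0.5 * (w - x)^2,
# so grad_w f(w, x) = w - x.
def grad(w, x):
    return w - x

alpha, m = 0.1, 32  # step size and mini-batch size (assumed values)
w_bgd = w_mbgd = w_sgd = 0.0

for k in range(300):
    # BGD: average the gradient over all n samples.
    w_bgd -= alpha * np.mean(grad(w_bgd, data))
    # MBGD: average over a mini-batch of m i.i.d. draws from the data.
    batch = rng.choice(data, size=m)
    w_mbgd -= alpha * np.mean(grad(w_mbgd, batch))
    # SGD: a single randomly drawn sample per iteration.
    w_sgd -= alpha * grad(w_sgd, rng.choice(data))

print(w_bgd, w_mbgd, w_sgd, np.mean(data))  # all approach the sample mean
```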
Compare MBGD with BGD and SGD
This is very similar to the earlier PI > TPI > VI comparison: MBGD is a compromise between BGD and SGD (BGD > MBGD > SGD). BGD's per-iteration computation is too expensive, SGD's is cheap but too random, and MBGD lies in between; its computational cost can be controlled by adjusting the batch_size (i.e., $m$ above).
- Compared with SGD (equivalent to batch_size = 1), MBGD has less randomness (it is more robust), because it averages over a batch rather than using a single sample
- Compared with BGD (equivalent to batch_size = n), MBGD requires less computation, because it uses only a batch of samples rather than all of them

Summary: just as TPI was obtained in policy evaluation by reducing the number of iterative solution steps, MBGD is obtained here by reducing the batch_size.
- Also: for the different effects of adjusting the batch_size, see here
Exercise 4
Given some numbers $\{x_i\}_{i=1}^{n}$, our aim is to calculate the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. This problem can be equivalently stated as the following optimization problem:
$$\min_w J(w) = \frac{1}{2n} \sum_{i=1}^{n} \|w - x_i\|^2$$
Use the BGD, MBGD and SGD algorithms to solve this problem.
The three algorithms for solving this problem are, respectively:
- BGD: $w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x})$
- MBGD: $w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} (w_k - x_j) = w_k - \alpha_k (w_k - \bar{x}_k^{(m)})$
- SGD: $w_{k+1} = w_k - \alpha_k (w_k - x_k)$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{x}_k^{(m)} = \frac{1}{m} \sum_{j \in \mathcal{I}_k} x_j$.
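As a quick check on the BGD form, with a constant step size $\alpha \in (0, 1)$ the update is a contraction toward $\bar{x}$:
$$w_{k+1} - \bar{x} = (1 - \alpha)(w_k - \bar{x}) \quad\Longrightarrow\quad w_k - \bar{x} = (1 - \alpha)^k (w_0 - \bar{x}) \to 0$$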