Monte-Carlo with epsilon-greedy

Soft Policy

deterministic policy: greedy policy(only choose the max q action, other unconsidered action probability is just 0)
stochastic policy: soft policy(the probability to take any action is positive, sample action from probability)
with soft policy, sufficiently long episodes can guarantee that all state-action pairs are visited
then we do not need to have a large number of episodes starting from every state-action pair(remove the exploring starts condition)

$ε$ -greedy Policy

Greedy policy:

π (a ∣ s) \leftarrow {1, 0, if a = ar g max_{a} q (s, a) otherwise

$ε$ -greedy policy:

π (a ∣ s) \leftarrow {1 - ε + \frac{ε}{∣ A ( s ) ∣}, \frac{ε}{∣ A ( s ) ∣}, if a = ar g max_{a} q (s, a) otherwise

where:

$ε \in [0, 1]$ is a small positive number
$A (s)$ is the set of all possible actions in state $s$
$∣ A (s) ∣$ is the number of actions in $A (s)$ .

Core Idea

hard to soft这里的idea很简单，就是从greedy action的概率1中挖了一块 $ε$ 出来分给所有的action（包括greedy action，q最大的那个action），从而让所有action都有概率被选上。

$ε \to 0$ 很显然就会变得更greedy，更倾向于选择q最大的action

p_greedy → 1, p_others → 0

more exploitation less exploration

$ε \to 1$ 就会变得更random，因为被挖出来分配到所有action的概率越来越平均

p_greedy → 1/|A|, p_others → 1/|A|

more exploration less exploitation

利用和探索其实自己的理解又加深了一点

$greedy exploitation$ 就是倾向在已知的最好的action中选择

$random exploration$ 就是倾向在所有的action中随机选择

上一章所提及到的确保每个 $(s, a)$ 都能被访问到的exploring-starts现在也不需要用了，因为我们自带了exploring的功能，不是100%greedy，而是带了 $ε$ 的random exploration在里面，所以就确保了只要次数够多（一个step足够多的episode），每个 $(s, a)$ 都能被访问到。

Algorithm

Algorithm MC ε -Greedy (a variant of MC Exploring Starts) Initialization: Initial policy π_{0} (a ∣ s), value q (s, a) for all (s, a) . Returns (s, a) \leftarrow 0, Num (s, a) \leftarrow 0. For each episode, do: Select starting (s_{0}, a_{0}) Generate episode s_{0}, a_{0}, r_{1}, s_{1}, \dots, s_{T - 1}, a_{T - 1}, r_{T} following current policy π . g \leftarrow 0 For t = T - 1, T - 2, \dots, 0 : g \leftarrow γ g + r_{t + 1} Returns (s_{t}, a_{t}) \leftarrow Returns (s_{t}, a_{t}) + g Num (s_{t}, a_{t}) \leftarrow Num (s_{t}, a_{t}) + 1 q (s_{t}, a_{t}) \leftarrow \frac{Returns ( s _{t} , a _{t} )}{Num ( s _{t} , a _{t} )} (Policy evaluation) π (a ∣ s_{t}) \leftarrow {1 - ε + \frac{ε}{∣ A ( s _{t} ) ∣}, \frac{ε}{∣ A ( s _{t} ) ∣}, if a = ar g max_{a} q (s_{t}, a) otherwise (Policy improvement)

simple modify of MC Exploring Starts
1. just remove the line of selecting starting $(s_{0}, a_{0})$ under exploring-starts condition
2. change the policy improvement formula to $ε$ -greedy policy

Optimality vs exploration

牺牲了 $ε$ 的最优性来保证探索率，实际上只要 $ε$ 足够小还是可以趋近于最优的
也可以用decay的方法，一开始设置一个较大的 $ε$ ，然后随着episode的增加逐渐减小 $ε$ ，这样可以在开始的时候更多的探索，然后在后面逐渐趋近于最优
可以唯一确定的是只要 $ε$ 不为零，那么最终的state value肯定不是最优的，因为总是有一部分action的概率是随机的(总可能进入forbidden的区域获得惩罚)，如果太大state value甚至会变成负数（比如环境中陷阱要多于安全的区域，总是乱走那么就会变成负数）
在 $ε$ -greedy学习完之后我们希望直接取greedy的最大值进行决策（也叫optimal $ε$ -greedy），相当于就是不希望再带着随机性去探索，而是希望转成对应的greedy policy，并且这个策略和最优策略是相同的(consistent)，赵老师举了一个例子就是当 $ε \geq 0.2$ 时转换后的策略和最优策略怎么都等价不了了，这个告诉我们如果要用 $ε$ -greedy的话 $ε$ 要尽量小，但是又不能为0，所以这个是一个trade-off的问题

Reinforcement Learning Notes

Explorer

MC epsilon-greedy

Monte-Carlo with epsilon-greedy

Soft Policy

$ε$ -greedy Policy

Algorithm

Optimality vs exploration

Graph View

Table of Contents

Backlinks

Reinforcement Learning Notes

Explorer

MC epsilon-greedy

Monte-Carlo with epsilon-greedy

Soft Policy

ε-greedy Policy

Algorithm

Optimality vs exploration

Graph View

Table of Contents

Backlinks

$ε$ -greedy Policy