Monte-Carlo with Exploring Starts
To make MC Basic more efficient
- Visit: every time a state-action pair appears in an episode, we call that occurrence a visit of the pair
- Methods to use the data: the first-visit method and the every-visit method
first visit & every visit?
- first-visit: only consider the first visit of a pair, collect all the episodes starting from that visit, and use the average return to approximate the action value
- this is essentially the approach of MC Basic
- the problem is that the agent has to wait until all of these episodes are collected before it can move on, so it is very inefficient
- every-visit: uses the return of a single episode to approximate the action value
With this method, the policy can be improved episode by episode.
The algorithm in this section uses the every-visit method.
In one sentence:
first-visit means that when (s,a) appears for the first time in an episode, we use the return following that visit to update q(s,a); every-visit means that every visit of (s,a) is used to update q(s,a).
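As a concrete illustration, here is a minimal Python sketch (not from the lecture; the episode, the state/action names, and gamma are made up) contrasting which returns the two methods collect from a single episode:

```python
from collections import defaultdict

gamma = 0.9
# Hypothetical episode: a list of (state, action, reward) steps, for illustration only.
episode = [("s1", "a1", 0), ("s2", "a2", 0), ("s1", "a1", 1), ("s3", "a1", 1)]

# Compute the discounted return following each step, working backward.
returns_forward = []
g = 0.0
for (s, a, r) in reversed(episode):
    g = r + gamma * g
    returns_forward.append(((s, a), g))
returns_forward.reverse()          # restore the original (forward) order

returns_first = defaultdict(list)  # first-visit: at most one return per (s, a) per episode
returns_every = defaultdict(list)  # every-visit: one return per occurrence of (s, a)
seen = set()
for (sa, g) in returns_forward:
    returns_every[sa].append(g)    # every visit contributes
    if sa not in seen:             # only the first visit contributes
        returns_first[sa].append(g)
        seen.add(sa)

print(dict(returns_first))         # one return per visited pair
print(dict(returns_every))         # possibly several returns per pair, e.g. for (s1, a1)
```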
Generalized policy iteration
In fact, many RL algorithms consist of these two steps, policy evaluation and policy improvement (this is the general idea).
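A rough sketch of this general idea (the evaluate and improve functions are hypothetical placeholders, not anything defined in the lecture):

```python
def generalized_policy_iteration(evaluate, improve, policy, iterations=100):
    """Alternate between policy evaluation and policy improvement."""
    for _ in range(iterations):
        q = evaluate(policy)   # policy evaluation: estimate q(s, a) under the current policy
        policy = improve(q)    # policy improvement: e.g. act greedily with respect to q
    return policy
```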
Algorithm
Core idea
To make full use of the return of every visit in every episode (without recomputing returns from scratch), we fill in the values incrementally and keep a running average, similar to filling an array in dynamic programming.
- The episode is processed in reverse order, which also improves efficiency: the return of a later step can be reused directly to compute the return of an earlier step, as shown in the sketch below.
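A minimal Python sketch of this core idea, assuming an episode is given as a list of (state, action, reward) steps; the containers q, returns_sum, returns_cnt, policy and the action set are illustrative assumptions, not the lecture's notation:

```python
from collections import defaultdict

gamma = 0.9
actions = ["up", "down", "left", "right", "stay"]  # assumed action set
q = defaultdict(float)            # action-value estimates q(s, a)
returns_sum = defaultdict(float)  # running sum of returns for each (s, a)
returns_cnt = defaultdict(int)    # number of returns collected for each (s, a)
policy = {}                       # greedy policy: state -> action

def update_from_episode(episode):
    """Every-visit MC update over one episode, processed backward."""
    g = 0.0
    for (s, a, r) in reversed(episode):
        g = r + gamma * g          # reuse the return of the later step directly
        returns_sum[(s, a)] += g   # every visit contributes to the average
        returns_cnt[(s, a)] += 1
        q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]  # running average of returns
        policy[s] = max(actions, key=lambda act: q[(s, act)])  # greedy policy improvement
```

This way each episode immediately updates q(s, a) for every visited pair and improves the policy at the visited states, which is the episode-by-episode improvement mentioned above.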
What is exploring-starts?
It means that an episode can start from any state-action pair.
We generate sufficiently many episodes starting from different state-action pairs
to ensure that every pair can possibly be selected.
If no episode starts from a certain state-action pair, that pair will never be updated.
Both MC Basic and MC Exploring Starts need this assumption.
In one sentence:
Suppose we want to obtain the return of some visit (s,a). There are two ways: the first is to guarantee that an episode starts from that pair (this is the meaning of "starts"); the second is to hope that the pair is visited somewhere in the middle of an episode. However, since the policy here is still deterministic, we cannot guarantee that every pair will be visited this way, so we need the exploring-starts condition.
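A minimal sketch of the exploring-starts condition, reusing update_from_episode, policy, and actions from the sketch above; the state set and the env_step environment model are hypothetical:

```python
import itertools
import random

states = ["s1", "s2", "s3"]  # assumed state set

def generate_episode(start_state, start_action, env_step, length=20):
    """Roll out one episode whose first (state, action) pair is forced
    (the exploring start); later actions follow the current policy.
    env_step(s, a) -> (next_state, reward) is an assumed environment model."""
    episode, s, a = [], start_state, start_action
    for _ in range(length):
        s_next, r = env_step(s, a)
        episode.append((s, a, r))
        s = s_next
        a = policy.get(s, random.choice(actions))  # deterministic policy; random if unseen
    return episode

def run_exploring_starts(env_step, sweeps=100):
    # Every (s, a) pair is used as a starting point, so every pair is visited
    # and updated; a pair that never starts an episode and is never reached by
    # the deterministic policy would otherwise never have its value updated.
    for _ in range(sweeps):
        for s0, a0 in itertools.product(states, actions):
            update_from_episode(generate_episode(s0, a0, env_step))
    return policy
```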
Later
- MC Exploring Starts is a method to make MC Basic more efficient
- however, in theory, we assume that we can select the best action only if every action value of every state is well explored
- in practice, for many applications, we cannot explore all the state-action pairs (especially those involving a continuous state or action space, or physical interaction with the environment)