Monte-Carlo Basic
The idea is to convert policy iteration into a model-free algorithm.
Policy iteration has two steps: policy evaluation, which solves $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$, and policy improvement, which computes $\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$.
The elementwise form of the policy improvement step is
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s') \Big] = \arg\max_{\pi} \sum_{a} \pi(a \mid s)\, q_{\pi_k}(s,a), \quad s \in \mathcal{S},$$
so the key quantity to compute is the action value $q_{\pi_k}(s,a)$.
Why do we focus less on policy evaluation here and more on policy improvement?
Policy iteration is built around the policy: before we can obtain the next policy, what we care about most is the value of $q$. As for policy evaluation, in policy iteration we solve the first equation in its iterative form to obtain $v_{\pi_k}$, which is then used in the subsequent policy improvement computation. In essence, we ultimately still have to compute the $q$ values, so we might as well estimate the $Q$ values directly in the policy evaluation stage by sampling and averaging (the MC method). This is exactly what lets us avoid relying on the environment model and instead estimate via Monte Carlo sampling.
There are two expressions for the action value $q_{\pi_k}(s,a)$:
- expression 1 (requires the model): $q_{\pi_k}(s,a) = \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s')$
- expression 2 (model-free, obtained from data, i.e., samples of experience): $q_{\pi_k}(s,a) = \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right]$
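To see that the two expressions describe the same quantity, recall the definition of the action value as the expected return; a short standard derivation in the same notation:

$$
\begin{aligned}
q_{\pi_k}(s,a) &= \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s').
\end{aligned}
$$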
The core idea of how MC Basic replaces the environment model
When the model is unavailable, we can use data (samples, experience):
- starting from $(s,a)$, following policy $\pi_k$, generate an episode
- the return of this episode is $g(s,a)$, which is a sample of $G_t$ (note that $G_t$ is a random variable), and $q_{\pi_k}(s,a) = \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right]$
- suppose we have a set of episodes and hence returns $\{ g^{(i)}(s,a) \}_{i=1}^{N}$; then we can estimate $q_{\pi_k}(s,a)$ by averaging the returns, $q_{\pi_k}(s,a) \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$, where $N$ is the number of episodes (see the sketch below)
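A minimal Python sketch of this sampling-and-averaging step follows. The environment interface `env.step(state, action) -> (next_state, reward)` and the deterministic `policy` mapping are assumptions made for illustration only, not something given in the course.

```python
import numpy as np

def mc_estimate_q(env, policy, s, a, num_episodes=30, episode_len=100, gamma=0.9):
    """Estimate q_pi(s, a) by averaging the discounted returns of episodes
    that start from (s, a) and then follow `policy`."""
    returns = []
    for _ in range(num_episodes):
        g, discount = 0.0, 1.0
        state, action = s, a
        for _ in range(episode_len):  # sufficiently long, not infinitely long
            next_state, reward = env.step(state, action)  # assumed env interface
            g += discount * reward
            discount *= gamma
            state = next_state
            action = policy[state]  # deterministic policy: state -> action
        returns.append(g)
    return float(np.mean(returns))  # q_pi(s, a) ~ average of the sampled returns
```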
Algorithm
At the $k$-th iteration, MC Basic performs policy evaluation, estimating $q_{\pi_k}(s,a)$ for every $(s,a)$ as the average return of episodes starting from $(s,a)$ under $\pi_k$, and policy improvement, $\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a \mid s)\, q_{\pi_k}(s,a)$, whose solution is the greedy policy $\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s,a)$. Or we can combine the policy evaluation and policy improvement steps into one: $\pi_{k+1}(s) = \arg\max_a \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$ (sketched below).
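Putting the two steps together, here is a sketch of the full MC Basic loop under the same assumed environment interface, reusing the hypothetical `mc_estimate_q` from the sketch above; `states`, `actions`, and the hyperparameters are placeholders.

```python
def mc_basic(env, states, actions, gamma=0.9, num_iterations=10,
             num_episodes=30, episode_len=100):
    """Model-free policy iteration: evaluate q_{pi_k}(s, a) by Monte Carlo,
    then improve the policy greedily with respect to the estimates."""
    policy = {s: actions[0] for s in states}  # arbitrary initial deterministic policy
    q = {}
    for _ in range(num_iterations):
        # policy evaluation: MC estimate of q_{pi_k}(s, a) for every pair
        for s in states:
            for a in actions:
                q[(s, a)] = mc_estimate_q(env, policy, s, a,
                                          num_episodes, episode_len, gamma)
        # policy improvement: greedy policy w.r.t. the estimated action values
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
    return policy, q
```

Every iteration re-estimates $q_{\pi_k}(s,a)$ from scratch for all state-action pairs, which is why MC Basic is instructive but inefficient, as noted under Later.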
Notes
- MC Basic is a variant of policy iteration
- the model-free algorithm is built on model-based policy iteration, so it is necessary to understand policy iteration first
- why does MC Basic estimate action values instead of state values?
- state values cannot improve policy directly
- when models are unavailable, we should directly estimate action values
- policy iteration is convergent ⇒ MC Basic is convergent
- episodes should be sufficiently long, but they do not have to be infinitely long
A thought from Prof. Zhao on how to study
You will not see the name MC Basic anywhere else, because it is a name coined by Prof. Zhao. When studying, you should separate the most essential core idea from the other elements that make an algorithm look more complicated.
Later
- MC Basic is useful for revealing the core idea, but it is not practical due to its low efficiency