Monte-Carlo Basic
The idea is to convert policy iteration into a model-free algorithm.
Policy iteration has two steps: policy evaluation, which solves $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$, and policy improvement, which computes $\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$.
The elementwise form of the policy improvement step is
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s') \Big] = \arg\max_{\pi} \sum_{a} \pi(a \mid s)\, q_{\pi_k}(s,a), \quad s \in \mathcal{S},$$
so the key quantity to compute is the action value $q_{\pi_k}(s,a)$.
Why do we focus less on policy evaluation here and more on policy improvement?
Policy iteration is built around the policy: before we can obtain the next policy, what we care about most is the value of $q$. As for policy evaluation, in policy iteration we solve the first equation in its iterative form to obtain $v_{\pi_k}$, which is then used in the subsequent policy improvement computation. In essence, we ultimately still have to compute the $q$ values, so we might as well estimate the $Q$ values directly in the policy evaluation stage by sampling and averaging (the MC method). This is exactly what lets us avoid relying on the environment model and instead estimate via Monte Carlo sampling.
There are two expressions for the action value $q_{\pi_k}(s,a)$:
- expression 1 (requires the model): $q_{\pi_k}(s,a) = \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s')$
- expression 2 (model-free, obtained from data, i.e., samples of experience): $q_{\pi_k}(s,a) = \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right]$
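To see that the two expressions describe the same quantity, recall the definition of the action value as the expected return; a short standard derivation in the same notation:

$$
\begin{aligned}
q_{\pi_k}(s,a) &= \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s').
\end{aligned}
$$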
The core idea of how MC Basic replaces the environment model
When the model is unavailable, we can use data (samples, experience):
- starting from $(s,a)$, following policy $\pi_k$, generate an episode
- the return of this episode is $g(s,a)$, which is a sample of $G_t$ (note that $G_t$ is a random variable), and $q_{\pi_k}(s,a) = \mathbb{E}\left[ G_t \mid S_t = s, A_t = a \right]$
- suppose we have a set of episodes and hence returns $\{ g^{(i)}(s,a) \}_{i=1}^{N}$; then we can estimate $q_{\pi_k}(s,a)$ by averaging the returns, $q_{\pi_k}(s,a) \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$, where $N$ is the number of episodes (see the sketch below)
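A minimal Python sketch of this sampling-and-averaging step follows. The environment interface `env.step(state, action) -> (next_state, reward)` and the deterministic `policy` mapping are assumptions made for illustration only, not something given in the course.

```python
import numpy as np

def mc_estimate_q(env, policy, s, a, num_episodes=30, episode_len=100, gamma=0.9):
    """Estimate q_pi(s, a) by averaging the discounted returns of episodes
    that start from (s, a) and then follow `policy`."""
    returns = []
    for _ in range(num_episodes):
        g, discount = 0.0, 1.0
        state, action = s, a
        for _ in range(episode_len):  # sufficiently long, not infinitely long
            next_state, reward = env.step(state, action)  # assumed env interface
            g += discount * reward
            discount *= gamma
            state = next_state
            action = policy[state]  # deterministic policy: state -> action
        returns.append(g)
    return float(np.mean(returns))  # q_pi(s, a) ~ average of the sampled returns
```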
Algorithm
At the $k$-th iteration, MC Basic performs policy evaluation, estimating $q_{\pi_k}(s,a)$ for every $(s,a)$ as the average return of episodes starting from $(s,a)$ under $\pi_k$, and policy improvement, $\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a \mid s)\, q_{\pi_k}(s,a)$, whose solution is the greedy policy $\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s,a)$. Or we can combine the policy evaluation and policy improvement steps into one: $\pi_{k+1}(s) = \arg\max_a \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$ (sketched below).
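Putting the two steps together, here is a sketch of the full MC Basic loop under the same assumed environment interface, reusing the hypothetical `mc_estimate_q` from the sketch above; `states`, `actions`, and the hyperparameters are placeholders.

```python
def mc_basic(env, states, actions, gamma=0.9, num_iterations=10,
             num_episodes=30, episode_len=100):
    """Model-free policy iteration: evaluate q_{pi_k}(s, a) by Monte Carlo,
    then improve the policy greedily with respect to the estimates."""
    policy = {s: actions[0] for s in states}  # arbitrary initial deterministic policy
    q = {}
    for _ in range(num_iterations):
        # policy evaluation: MC estimate of q_{pi_k}(s, a) for every pair
        for s in states:
            for a in actions:
                q[(s, a)] = mc_estimate_q(env, policy, s, a,
                                          num_episodes, episode_len, gamma)
        # policy improvement: greedy policy w.r.t. the estimated action values
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
    return policy, q
```

Every iteration re-estimates $q_{\pi_k}(s,a)$ from scratch for all state-action pairs, which is why MC Basic is instructive but inefficient, as noted under Later.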
Notes
- MC Basic is a variant of policy iteration
- the model-free algorithm is built on model-based policy iteration, so it is necessary to understand policy iteration first
- why does MC Basic estimate action values instead of state values?
- state values cannot improve policy directly
- when models are unavailable, we should directly estimate action values
- policy iteration is convergent ⇒ MC Basic is convergent
- episodes should be sufficiently long, but they do not have to be infinitely long
A thought from Prof. Zhao on how to study
You will not see the name MC Basic anywhere else, because it is a name coined by Prof. Zhao. When studying, you should separate the most essential core idea from the other elements that make an algorithm look more complicated.
Later
- MC Basic is useful for revealing the core idea, but it is not practical due to its low efficiency