Monte-Carlo with Exploring Starts
To make MC Basic more efficient
- Visit: every time a state-action pair appears in an episode, we call that occurrence a visit of the pair
- Methods to use the data: the first-visit method and the every-visit method
first visit & every visit?
- first-visit: only consider the first visit of a pair, collect all the episodes starting from that visit, and use the average return to approximate the action value
- this is essentially the approach of MC Basic
- the problem is that the agent has to wait until all of these episodes are collected before it can move on, so it is very inefficient
- every-visit: uses the return of a single episode to approximate the action value
With this method, the policy can be improved episode by episode.
The algorithm in this section uses the every-visit method.
In one sentence:
first-visit means that when (s,a) appears for the first time in an episode, we use the return following that visit to update q(s,a); every-visit means that every visit of (s,a) is used to update q(s,a).
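As a concrete illustration, here is a minimal Python sketch (not from the lecture; the episode, the state/action names, and gamma are made up) contrasting which returns the two methods collect from a single episode:

```python
from collections import defaultdict

gamma = 0.9
# Hypothetical episode: a list of (state, action, reward) steps, for illustration only.
episode = [("s1", "a1", 0), ("s2", "a2", 0), ("s1", "a1", 1), ("s3", "a1", 1)]

# Compute the discounted return following each step, working backward.
returns_forward = []
g = 0.0
for (s, a, r) in reversed(episode):
    g = r + gamma * g
    returns_forward.append(((s, a), g))
returns_forward.reverse()          # restore the original (forward) order

returns_first = defaultdict(list)  # first-visit: at most one return per (s, a) per episode
returns_every = defaultdict(list)  # every-visit: one return per occurrence of (s, a)
seen = set()
for (sa, g) in returns_forward:
    returns_every[sa].append(g)    # every visit contributes
    if sa not in seen:             # only the first visit contributes
        returns_first[sa].append(g)
        seen.add(sa)

print(dict(returns_first))         # one return per visited pair
print(dict(returns_every))         # possibly several returns per pair, e.g. for (s1, a1)
```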
Generalized policy iteration
In fact, many RL algorithms consist of these two steps, policy evaluation and policy improvement (this is the general idea).
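A rough sketch of this general idea (the evaluate and improve functions are hypothetical placeholders, not anything defined in the lecture):

```python
def generalized_policy_iteration(evaluate, improve, policy, iterations=100):
    """Alternate between policy evaluation and policy improvement."""
    for _ in range(iterations):
        q = evaluate(policy)   # policy evaluation: estimate q(s, a) under the current policy
        policy = improve(q)    # policy improvement: e.g. act greedily with respect to q
    return policy
```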
Algorithm
Core idea
To make full use of the return of every visit in every episode (without recomputing returns from scratch), we fill in the values incrementally and keep a running average, similar to filling an array in dynamic programming.
- The episode is processed in reverse order, which also improves efficiency: the return of a later step can be reused directly to compute the return of an earlier step, as shown in the sketch below.
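A minimal Python sketch of this core idea, assuming an episode is given as a list of (state, action, reward) steps; the containers q, returns_sum, returns_cnt, policy and the action set are illustrative assumptions, not the lecture's notation:

```python
from collections import defaultdict

gamma = 0.9
actions = ["up", "down", "left", "right", "stay"]  # assumed action set
q = defaultdict(float)            # action-value estimates q(s, a)
returns_sum = defaultdict(float)  # running sum of returns for each (s, a)
returns_cnt = defaultdict(int)    # number of returns collected for each (s, a)
policy = {}                       # greedy policy: state -> action

def update_from_episode(episode):
    """Every-visit MC update over one episode, processed backward."""
    g = 0.0
    for (s, a, r) in reversed(episode):
        g = r + gamma * g          # reuse the return of the later step directly
        returns_sum[(s, a)] += g   # every visit contributes to the average
        returns_cnt[(s, a)] += 1
        q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]  # running average of returns
        policy[s] = max(actions, key=lambda act: q[(s, act)])  # greedy policy improvement
```

This way each episode immediately updates q(s, a) for every visited pair and improves the policy at the visited states, which is the episode-by-episode improvement mentioned above.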
What is exploring-starts?
It means that an episode can start from any state-action pair.
We generate sufficiently many episodes starting from different state-action pairs
to ensure that every pair can possibly be selected.
If no episode starts from a certain state-action pair, that pair will never be updated.
Both MC Basic and MC Exploring Starts need this assumption.
In one sentence:
Suppose we want to obtain the return of some visit (s,a). There are two ways: the first is to guarantee that an episode starts from that pair (this is the meaning of "starts"); the second is to hope that the pair is visited somewhere in the middle of an episode. However, since the policy here is still deterministic, we cannot guarantee that every pair will be visited this way, so we need the exploring-starts condition.
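A minimal sketch of the exploring-starts condition, reusing update_from_episode, policy, and actions from the sketch above; the state set and the env_step environment model are hypothetical:

```python
import itertools
import random

states = ["s1", "s2", "s3"]  # assumed state set

def generate_episode(start_state, start_action, env_step, length=20):
    """Roll out one episode whose first (state, action) pair is forced
    (the exploring start); later actions follow the current policy.
    env_step(s, a) -> (next_state, reward) is an assumed environment model."""
    episode, s, a = [], start_state, start_action
    for _ in range(length):
        s_next, r = env_step(s, a)
        episode.append((s, a, r))
        s = s_next
        a = policy.get(s, random.choice(actions))  # deterministic policy; random if unseen
    return episode

def run_exploring_starts(env_step, sweeps=100):
    # Every (s, a) pair is used as a starting point, so every pair is visited
    # and updated; a pair that never starts an episode and is never reached by
    # the deterministic policy would otherwise never have its value updated.
    for _ in range(sweeps):
        for s0, a0 in itertools.product(states, actions):
            update_from_episode(generate_episode(s0, a0, env_step))
    return policy
```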
Later
- MC Exploring Starts is a method to make MC Basic more efficient
- however, in theory, we assume that we can select the best action only if every action value of every state is well explored
- in practice, for many applications, we cannot explore all the state-action pairs (especially those involving a continuous state or action space, or physical interaction with the environment)