State value
Notations
Consider the following single-step process:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$
- $t, t+1$: discrete time instances
- $S_t$: state at time $t$
- $A_t$: the action taken at state $S_t$
- $R_{t+1}$: the reward obtained after taking $A_t$
- $S_{t+1}$: the state transited to after taking $A_t$
Why are $S$, $A$, and $R$ written in uppercase here?
Because they are all random variables in the sense of probability theory, which means we can take their expectations.
| Step | Governed Probability Distribution |
|---|---|
| $S_t \to A_t$ | $\pi(A_t = a \mid S_t = s)$ (the policy) |
| $S_t, A_t \to R_{t+1}$ | $p(R_{t+1} = r \mid S_t = s, A_t = a)$ (the reward model) |
| $S_t, A_t \to S_{t+1}$ | $p(S_{t+1} = s' \mid S_t = s, A_t = a)$ (the state transition model) |
We can write down these probabilities $p$ in the example above only because we have not left the model: we assume we know the environment model (model ⇔ probability distributions).
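To make "model ⇔ probability distributions" concrete, here is a minimal sketch that writes the three distributions as explicit probability tables and samples one step $(S_t, A_t, R_{t+1}, S_{t+1})$. Every state, action, reward, and probability in it is made up for illustration.

```python
import random

# Hypothetical environment model and policy for a tiny MDP; all states,
# actions, rewards, and probabilities below are made up for illustration.
policy = {"s1": {"left": 0.5, "right": 0.5}}                                    # pi(a | s)
reward_model = {("s1", "left"): {0: 1.0}, ("s1", "right"): {1: 1.0}}            # p(r | s, a)
transition_model = {("s1", "left"): {"s1": 1.0}, ("s1", "right"): {"s2": 1.0}}  # p(s' | s, a)

def sample(dist):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# One step of the process S_t --A_t--> R_{t+1}, S_{t+1}
s_t = "s1"
a_t = sample(policy[s_t])                        # A_t     ~ pi(. | S_t)
r_next = sample(reward_model[(s_t, a_t)])        # R_{t+1} ~ p(. | S_t, A_t)
s_next = sample(transition_model[(s_t, a_t)])    # S_{t+1} ~ p(. | S_t, A_t)
print(s_t, a_t, r_next, s_next)
```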
Consider the following multi-step process:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The discounted return:
- $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $\gamma \in (0, 1)$ is the discount rate.
- $G_t$ is also a random variable since the $R$'s are random variables.
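As a quick numerical check of the formula, the sketch below computes $G_t$ for a short, hypothetical reward sequence (the rewards and $\gamma$ are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence along one trajectory, with gamma = 0.9:
# G_t = 0 + 0.9*0 + 0.81*0 + 0.729*1 + 0.6561*1 = 1.3851
print(discounted_return([0, 0, 0, 1, 1], gamma=0.9))
```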
Definition: the expectation (or expected value, or mean value) of $G_t$ is defined as the state-value function or simply state value:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$
Remarks:
- It is a function of $s$. It is a conditional expectation with the condition that ==the state starts from $s$==.
- It is based on the policy $\pi$. For a different policy, the state value may be different.
- $\mathbb{E}[G_t \mid S_t = s]$ is the same as $\mathbb{E}_\pi[G_t \mid S_t = s]$; the subscript just makes the dependence on the policy $\pi$ explicit.
return & state value
In practice these distributions are usually not deterministic, so starting from one state there are often many possible trajectories; averaging the returns over all possible trajectories gives the state value. Of course, if only a single trajectory can ever be produced from a state, then the state value equals the return, but that requires all three distributions mentioned above to be deterministic.
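Here is a minimal sketch of that averaging idea under a hypothetical model with a stochastic policy at state `s1` (all names and numbers are made up): sample many trajectories starting from `s1`, compute each trajectory's discounted return, and average.

```python
import random

def sample(dist):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# Hypothetical model: only the policy at s1 is stochastic; everything else is deterministic.
policy = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 1.0}, "s3": {"a1": 1.0}}
reward_model = {("s1", "a1"): {0: 1.0}, ("s1", "a2"): {-1: 1.0},
                ("s2", "a1"): {1: 1.0}, ("s3", "a1"): {1: 1.0}}
transition_model = {("s1", "a1"): {"s2": 1.0}, ("s1", "a2"): {"s3": 1.0},
                    ("s2", "a1"): {"s2": 1.0}, ("s3", "a1"): {"s3": 1.0}}

def sample_return(s, gamma, steps=50):
    """Sample one trajectory starting from s and compute its (truncated) discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(steps):
        a = sample(policy[s])
        g += discount * sample(reward_model[(s, a)])
        discount *= gamma
        s = sample(transition_model[(s, a)])
    return g

# Monte Carlo estimate of the state value v_pi(s1): the mean of the sampled returns.
n, gamma = 10_000, 0.9
v_s1 = sum(sample_return("s1", gamma) for _ in range(n)) / n
print(v_s1)
```

If the policy at `s1` were deterministic as well, every sampled return would be identical and the average would simply reproduce that single return, which is exactly the state value = return case described above.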
Example
A brief note
In all three figures above the environment is deterministic; policy 1 and policy 2 are also deterministic, and only policy 3 is stochastic. Colors distinguish each trajectory and the computation of its discounted return. In the end, this again shows that policy 1 is the best, policy 3 is second, and policy 2 is the worst, which matches our intuition from the figures.
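The figures themselves are not reproduced here, but the pattern of the stochastic-policy computation is worth spelling out. Suppose (hypothetically, not with the figures' actual numbers) that from state $s$ the stochastic policy produces one trajectory with probability $0.5$ and reward sequence $1, 1, 1, \dots$, and another trajectory with probability $0.5$ and reward sequence $-1, 1, 1, \dots$; the state value is then the probability-weighted average of the two returns:
$$
v_{\pi}(s) = 0.5\underbrace{\left(1 + \gamma + \gamma^2 + \cdots\right)}_{\text{return of trajectory 1}} + 0.5\underbrace{\left(-1 + \gamma + \gamma^2 + \cdots\right)}_{\text{return of trajectory 2}} = \frac{\gamma}{1-\gamma}.
$$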