Return
A trajectory is a state-action-reward chain:

$$s_1 \xrightarrow[r_1]{a_1} s_2 \xrightarrow[r_2]{a_2} s_3 \xrightarrow[r_3]{a_3} \cdots$$
The return of this trajectory is the sum of all the rewards collected along it:

$$\text{return} = r_1 + r_2 + r_3 + \cdots$$
A different policy gives a different trajectory, and hence, in general, a different return.
What is the return good for?
Comparing the two trajectories above, the return of the first is greater than the return of the second, so we can tentatively conclude that policy 1 is better than policy 2. The return is thus a way to measure how good a policy is.
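A minimal sketch of this comparison in Python (the two reward sequences below are made up for illustration; they are not the actual example trajectories from the lecture):

```python
# Hypothetical reward sequences collected along one trajectory
# under each of two policies (made-up numbers, illustration only).
rewards_policy1 = [0, 0, 0, 1]    # trajectory under policy 1
rewards_policy2 = [0, -1, 0, 1]   # trajectory under policy 2

# The (undiscounted) return is simply the sum of the rewards.
return1 = sum(rewards_policy1)    # 1
return2 = sum(rewards_policy2)    # 0

# return1 > return2, so policy 1 looks better by this criterion.
print(return1, return2)
```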
Discounted return
A trajectory may be infinite: it never terminates and keeps collecting rewards forever.
The return then diverges! So we introduce a discount rate $\gamma \in [0, 1)$ and compute the discounted return instead:

$$G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots = \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}.$$
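As a minimal sketch (assuming we truncate the infinite reward sequence after finitely many steps), the discounted return can be computed like this:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ... over a finite
    (truncated) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An infinite trajectory that collects reward 1 forever: the plain sum
# diverges, but the discounted return converges to 1 / (1 - gamma).
rewards = [1] * 1000                      # truncation of the infinite tail
print(discounted_return(rewards, 0.9))    # ~10.0 = 1 / (1 - 0.9)
```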
Supplementary proof: summing the series and taking its limit (standard calculus).
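A short derivation of that limit, via the usual geometric-series argument (the constant reward $r$ is an assumption made here for the illustrative case):

$$\sum_{t=0}^{n} \gamma^{t} = \frac{1-\gamma^{\,n+1}}{1-\gamma} \;\longrightarrow\; \frac{1}{1-\gamma} \quad \text{as } n \to \infty, \quad 0 \le \gamma < 1,$$

so if every reward along the trajectory equals the same constant $r$,

$$G = \sum_{t=0}^{\infty} \gamma^{t}\, r = \frac{r}{1-\gamma} < \infty.$$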
What introducing $\gamma$ accomplishes:
- Since each reward is just a bounded constant, the return (the sum of trajectory rewards) is now easily made to converge.
- We can balance near-future and far-future rewards, as can be seen directly from the formula above (see the numeric sketch after this list):
  - If $\gamma$ is close to 0, the return is dominated by rewards in the near future.
  - If $\gamma$ is close to 1, more future rewards are taken into account, and the return pays attention to the far future.
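A small numeric sketch of this trade-off (the reward sequence is made up: a single reward of 1 arriving 10 steps in the future):

```python
# Same made-up trajectory, different discount rates: a single reward
# of 1 arrives at step 10; everything before it is 0.
rewards = [0] * 10 + [1]

for gamma in (0.1, 0.5, 0.9, 0.99):
    g = sum((gamma ** t) * r for t, r in enumerate(rewards))
    print(f"gamma = {gamma:<4}  ->  discounted return = {g:.6f}")

# gamma = 0.1  ->  ~0.000000  (near-sighted: the far-future reward vanishes)
# gamma = 0.99 ->  ~0.904382  (far-sighted: the far-future reward survives)
```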