Return

A trajectory is a state-action-reward chain:

The return of this trajectory is the sum of all the rewards collected along the trajectory:

A different policy gives a different trajectory:

What is the return used for?

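In the example above, the return of the first trajectory is greater than the return of the second, so we can tentatively conclude that policy 1 is better than policy 2. The return is therefore what we use to evaluate how good a policy is.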
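The comparison above can be sketched in a few lines of Python. The reward sequences below are hypothetical placeholders, not the values from the example in these notes, and the function name trajectory_return is introduced here only for illustration.

```python
def trajectory_return(rewards):
    """Undiscounted return: the sum of all rewards along one trajectory."""
    return sum(rewards)


# Hypothetical reward sequences produced by following two different policies.
rewards_policy_1 = [0, 0, 0, 1]
rewards_policy_2 = [0, -1, 0, 1]

return_1 = trajectory_return(rewards_policy_1)
return_2 = trajectory_return(rewards_policy_2)

# The policy whose trajectory collects the larger return is judged better.
better = "policy 1" if return_1 > return_2 else "policy 2"
print(return_1, return_2, better)
```

With these placeholder rewards, policy 1 is preferred because its trajectory never collects the negative reward.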

Discounted return

A trajectory may be infinite:

The return then diverges, so we need a discount rate gamma in [0, 1) to compute the discounted return:

Supplementary proof: sum the series and take the limit of its partial sums (standard calculus).
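A sketch of that calculus step, written in general form. The symbols r_t (reward at step t), R_max (a bound on the absolute value of every reward), and k (the step at which a constant reward of 1 begins) are assumptions introduced here for illustration, not notation taken from these notes.

```latex
% Partial sums of the geometric series, for 0 \le \gamma < 1:
S_n = \sum_{t=0}^{n} \gamma^{t} = \frac{1 - \gamma^{n+1}}{1 - \gamma},
\qquad
\lim_{n \to \infty} S_n = \frac{1}{1 - \gamma}.

% If every reward satisfies |r_t| \le R_{\max}, the discounted return converges absolutely:
\left| \sum_{t=0}^{\infty} \gamma^{t} r_t \right|
\le R_{\max} \sum_{t=0}^{\infty} \gamma^{t}
= \frac{R_{\max}}{1 - \gamma}.

% Example: the reward is 0 for the first k steps and 1 at every step afterwards:
\sum_{t=k}^{\infty} \gamma^{t} = \gamma^{k} \sum_{t=0}^{\infty} \gamma^{t} = \frac{\gamma^{k}}{1 - \gamma}.
```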

What introducing gamma achieves

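  1. Because each reward is just a constant, discounting easily makes the return (the sum of the rewards along the trajectory) converge.
  2. We can balance far-future and near-future rewards, as the discounted return formula shows directly:
    1. When gamma is close to 0, the return emphasizes near-future rewards.
    2. When gamma is close to 1, more future rewards are taken into account, so the return emphasizes the far future.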
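Both effects can be checked numerically with a minimal sketch. The function name discounted_return and the reward sequences are hypothetical, and the infinite trajectory is approximated by a long finite one.

```python
def discounted_return(rewards, gamma):
    """Discounted return: sum of gamma**t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Effect 1: convergence. A constant reward of 1 at every step, truncated to
# 1000 steps, gives a return close to the limit 1 / (1 - gamma).
gamma = 0.9
approx = discounted_return([1] * 1000, gamma)
limit = 1 / (1 - gamma)
print(approx, limit)

# Effect 2: near versus far future. A small immediate reward is followed by
# a large reward ten steps later (hypothetical values).
rewards = [1] + [0] * 9 + [10]
short_sighted = discounted_return(rewards, 0.1)  # delayed reward barely counts
far_sighted = discounted_return(rewards, 0.9)    # delayed reward counts heavily
print(short_sighted, far_sighted)
```

With gamma = 0.1 the delayed reward contributes almost nothing, while with gamma = 0.9 it dominates the return.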