Return
A trajectory is a state-action-reward chain:

$$s_1 \xrightarrow[r_1]{a_1} s_2 \xrightarrow[r_2]{a_2} s_3 \xrightarrow[r_3]{a_3} \cdots$$
The return of this trajectory is the sum of all the rewards collected along it:

$$\text{return} = r_1 + r_2 + r_3 + \cdots$$
A different policy gives a different trajectory, and hence, in general, a different return.
What is the return good for?
Comparing the two trajectories above, the return of the first is greater than the return of the second, so we can tentatively conclude that policy 1 is better than policy 2. The return is thus a way to measure how good a policy is.
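A minimal sketch of this comparison in Python (the two reward sequences below are made up for illustration; they are not the actual example trajectories from the lecture):

```python
# Hypothetical reward sequences collected along one trajectory
# under each of two policies (made-up numbers, illustration only).
rewards_policy1 = [0, 0, 0, 1]    # trajectory under policy 1
rewards_policy2 = [0, -1, 0, 1]   # trajectory under policy 2

# The (undiscounted) return is simply the sum of the rewards.
return1 = sum(rewards_policy1)    # 1
return2 = sum(rewards_policy2)    # 0

# return1 > return2, so policy 1 looks better by this criterion.
print(return1, return2)
```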
Discounted return
A trajectory may be infinite: it never terminates and keeps collecting rewards forever.
The return then diverges! So we introduce a discount rate $\gamma \in [0, 1)$ and compute the discounted return instead:

$$G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots = \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}.$$
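As a minimal sketch (assuming we truncate the infinite reward sequence after finitely many steps), the discounted return can be computed like this:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ... over a finite
    (truncated) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An infinite trajectory that collects reward 1 forever: the plain sum
# diverges, but the discounted return converges to 1 / (1 - gamma).
rewards = [1] * 1000                      # truncation of the infinite tail
print(discounted_return(rewards, 0.9))    # ~10.0 = 1 / (1 - 0.9)
```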
Supplementary proof: summing the series and taking its limit (standard calculus).
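A short derivation of that limit, via the usual geometric-series argument (the constant reward $r$ is an assumption made here for the illustrative case):

$$\sum_{t=0}^{n} \gamma^{t} = \frac{1-\gamma^{\,n+1}}{1-\gamma} \;\longrightarrow\; \frac{1}{1-\gamma} \quad \text{as } n \to \infty, \quad 0 \le \gamma < 1,$$

so if every reward along the trajectory equals the same constant $r$,

$$G = \sum_{t=0}^{\infty} \gamma^{t}\, r = \frac{r}{1-\gamma} < \infty.$$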
What introducing $\gamma$ accomplishes:
- Since each reward is just a bounded constant, the return (the sum of trajectory rewards) is now easily made to converge.
- We can balance near-future and far-future rewards, as can be seen directly from the formula above (see the numeric sketch after this list):
  - If $\gamma$ is close to 0, the return is dominated by rewards in the near future.
  - If $\gamma$ is close to 1, more future rewards are taken into account, and the return pays attention to the far future.
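A small numeric sketch of this trade-off (the reward sequence is made up: a single reward of 1 arriving 10 steps in the future):

```python
# Same made-up trajectory, different discount rates: a single reward
# of 1 arrives at step 10; everything before it is 0.
rewards = [0] * 10 + [1]

for gamma in (0.1, 0.5, 0.9, 0.99):
    g = sum((gamma ** t) * r for t, r in enumerate(rewards))
    print(f"gamma = {gamma:<4}  ->  discounted return = {g:.6f}")

# gamma = 0.1  ->  ~0.000000  (near-sighted: the far-future reward vanishes)
# gamma = 0.99 ->  ~0.904382  (far-sighted: the far-future reward survives)
```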