Prove the equation for the average reward of a policy $\pi$:
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1}R_{t+1}\right]=\sum_{s\in\mathcal{S}}d_\pi(s)\,r_\pi(s)=\bar{r}_\pi$$
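Before the formal proof, the identity can be checked numerically on a small ergodic chain: compute $\sum_s d_\pi(s)\,r_\pi(s)$ from the stationary distribution of the chain induced by $\pi$, and compare it with the long-run time average of rewards along one trajectory. This is a minimal sketch; the two-state transition matrix `P_pi` and rewards `r_pi` below are hypothetical values chosen for illustration, not taken from the text.

```python
import numpy as np

# Hypothetical two-state chain induced by a fixed policy pi.
P_pi = np.array([[0.9, 0.1],     # row s: P(S_{t+1} = s' | S_t = s) under pi
                 [0.2, 0.8]])
r_pi = np.array([1.0, -0.5])     # r_pi(s): expected one-step reward in state s

# Stationary distribution d_pi, obtained here by power iteration (d = d P).
d_pi = np.array([0.5, 0.5])
for _ in range(1000):
    d_pi = d_pi @ P_pi
r_bar = d_pi @ r_pi              # sum_s d_pi(s) r_pi(s)

# Empirical long-run average of expected rewards along one sampled trajectory.
rng = np.random.default_rng(0)
s, total, n = 0, 0.0, 200_000
for _ in range(n):
    total += r_pi[s]                     # E[R_{t+1} | S_t = s]
    s = rng.choice(2, p=P_pi[s])         # sample the next state
print(r_bar, total / n)                  # the two numbers should nearly agree
```

For an ergodic chain the printed values coincide up to sampling noise, which is exactly what the proof below establishes.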
Proof
Step 1: The equation is valid for any starting state $s_0$
First, we prove that the equation (introduced in the PG metric: average reward section) is valid for any starting state $s_0$:
$$\bar{r}_\pi=\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1}R_{t+1}\,\Big|\,S_0=s_0\right]$$
Proof:
$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1}R_{t+1}\,\Big|\,S_0=s_0\right]
&=\lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}\mathbb{E}\left[R_{t+1}\mid S_0=s_0\right] &&\text{(Linearity of expectation)}\\
&=\lim_{t\to\infty}\mathbb{E}\left[R_{t+1}\mid S_0=s_0\right] &&\text{(Cesàro mean)}\\
&=\lim_{t\to\infty}\sum_{s\in\mathcal{S}}\mathbb{E}\left[R_{t+1}\mid S_t=s,S_0=s_0\right]p^{(t)}(s\mid s_0) &&\text{(Law of total expectation)}\\
&=\lim_{t\to\infty}\sum_{s\in\mathcal{S}}\mathbb{E}\left[R_{t+1}\mid S_t=s\right]p^{(t)}(s\mid s_0) &&\text{(Markov memoryless property)}\\
&=\lim_{t\to\infty}\sum_{s\in\mathcal{S}}r_\pi(s)\,p^{(t)}(s\mid s_0) &&\text{(Definition of }r_\pi(s)\text{)}\\
&=\sum_{s\in\mathcal{S}}r_\pi(s)\,d_\pi(s) &&\text{(Stationary distribution)}\\
&=\bar{r}_\pi
\end{aligned}$$
where $p^{(t)}(s\mid s_0)$ denotes the probability of transitioning from $s_0$ to $s$ in exactly $t$ steps, and note that
$$\lim_{t\to\infty}p^{(t)}(s\mid s_0)=d_\pi(s)$$
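This limit is the standard convergence of an ergodic Markov chain to its stationary distribution: $p^{(t)}(s\mid s_0)$ is the $(s_0,s)$ entry of the $t$-step transition matrix $P_\pi^t$, and every row of $P_\pi^t$ tends to $d_\pi$ regardless of $s_0$. A quick sketch, reusing the hypothetical two-state chain from the earlier snippet (the matrix is illustrative, and ergodicity is assumed):

```python
import numpy as np

# Hypothetical two-state chain induced by pi (same matrix as above).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

# p^(t)(s | s_0) is entry (s_0, s) of P_pi^t; for an ergodic chain both rows
# converge to the same stationary distribution d_pi, independent of s_0.
for t in (1, 5, 20, 100):
    Pt = np.linalg.matrix_power(P_pi, t)
    print(t, Pt[0], Pt[1])
```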
Step 2: The equation is valid for any state distribution $d$
Next, consider an arbitrary state distribution $d$. By the law of total expectation, we have