Step 1: policy evaluation
To calculate the state value of $\pi_k$, solve the Bellman equation in matrix-vector form:
$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
Note that $v_{\pi_k}$ is the state value function of $\pi_k$ here.
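Solving this equation is itself a fixed-point iteration. Below is a minimal NumPy sketch, assuming the model of $\pi_k$ has already been collapsed into a reward vector `r_pi` and a transition matrix `P_pi`; those names and the function `policy_evaluation` are illustrative, not from the source.

```python
import numpy as np

def policy_evaluation(r_pi, P_pi, gamma, threshold=1e-8):
    """Solve v = r_pi + gamma * P_pi @ v by fixed-point iteration.

    r_pi : (n,) expected immediate reward of each state under the policy.
    P_pi : (n, n) state-transition matrix under the policy.
    """
    v = np.zeros(len(r_pi))                  # arbitrary initial guess v^(0)
    while True:
        v_next = r_pi + gamma * P_pi @ v     # one sweep of the Bellman equation
        if np.max(np.abs(v_next - v)) < threshold:
            return v_next
        v = v_next
```

For $\gamma < 1$ the right-hand side is a contraction mapping, so the iteration converges from any initial guess.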
Step 2: policy improvement
$$\pi_{k+1} = \arg\max_{\pi}\left(r_{\pi} + \gamma P_{\pi} v_{\pi_k}\right)$$
The maximization is component-wise, i.e., the arg max is taken independently for each state.
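Because the maximization decomposes state by state, a deterministic greedy policy attains it. Here is a sketch of this step, assuming a tabular model stored as an $|S|\times|A|$ expected-reward matrix `R` and an $|S|\times|A|\times|S|$ transition tensor `P` (shapes and names are assumptions of mine):

```python
import numpy as np

def policy_improvement(v, R, P, gamma):
    """Greedy (component-wise) improvement against the evaluated values v.

    v : (n,) state values v_{pi_k}.
    R : (n, m) expected reward, R[s, a] = sum_r p(r|s,a) * r.
    P : (n, m, n) transition probabilities, P[s, a, s'] = p(s'|s,a).
    Returns one greedy action per state, i.e. the deterministic policy pi_{k+1}.
    """
    q = R + gamma * np.einsum("san,n->sa", P, v)   # q_k(s, a) for every state-action pair
    return np.argmax(q, axis=1)                    # arg max over actions, state by state
```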
Q1: How to calculate the state value of $\pi_k$? By embedding another iterative algorithm (a value-iteration-style sweep) inside the policy evaluation step.
Q2: Why is $\pi_{k+1}$ better than $\pi_k$? Because $\pi_{k+1}$ maximizes $r_{\pi} + \gamma P_{\pi} v_{\pi_k}$, it can be shown that $v_{\pi_{k+1}} \geq v_{\pi_k}$ element-wise.
Q3: Why does such an iterative algorithm reach an optimal policy? Because $v_{\pi_k}$ eventually converges to $v^*$.
Q4: What is the relationship between policy iteration and value iteration? See Compare vi and pi.
Algorithm
Algorithm: Policy Iteration
Input: probability models $p(r \mid s, a)$ and $p(s' \mid s, a)$ for all $(s, a)$.
Initialization: initial guess $\pi_0$.

While $\|v_{\pi_k} - v_{\pi_{k-1}}\| >$ threshold:
    Policy evaluation:
        Initialization: arbitrary initial guess $v_{\pi_k}^{(0)}$.
        While $\|v_{\pi_k}^{(j)} - v_{\pi_k}^{(j-1)}\| >$ threshold:
            For each state $s \in \mathcal{S}$:
                $v_{\pi_k}^{(j+1)}(s) \leftarrow \sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}^{(j)}(s')\right]$
    Policy improvement:
        For each state $s \in \mathcal{S}$:
            For each action $a \in \mathcal{A}$:
                $q_k(s, a) \leftarrow \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}(s')$   (Q-value)
            $a_k^*(s) \leftarrow \arg\max_a q_k(s, a)$   (greedy action)
            $\pi_{k+1}(a \mid s) \leftarrow \begin{cases} 1 & \text{if } a = a_k^*(s) \\ 0 & \text{otherwise} \end{cases}$   (policy update)
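Putting the two loops together, here is a minimal end-to-end sketch of the boxed algorithm in NumPy. The tabular representation (`R`, `P` as in the earlier snippets), the zero-initialized policy, and the exact stopping rules are my assumptions, not part of the original pseudocode.

```python
import numpy as np

def policy_iteration(R, P, gamma, threshold=1e-8):
    """Policy iteration for a tabular MDP.

    R : (n, m) expected reward, R[s, a] = sum_r p(r|s,a) * r.
    P : (n, m, n) transition probabilities, P[s, a, s'] = p(s'|s,a).
    Returns the greedy action for each state and the corresponding state values.
    """
    n, m = R.shape
    actions = np.zeros(n, dtype=int)          # initial guess pi_0: action 0 in every state
    v_prev = np.full(n, np.inf)               # forces at least one outer iteration
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v for the current policy
        r_pi = R[np.arange(n), actions]       # (n,)
        P_pi = P[np.arange(n), actions]       # (n, n)
        v = np.zeros(n)                       # arbitrary initial guess v^(0)
        while True:
            v_next = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_next - v)) < threshold:
                v = v_next
                break
            v = v_next
        # Policy improvement: greedy action w.r.t. q_k(s, a)
        q = R + gamma * np.einsum("san,n->sa", P, v)
        actions = np.argmax(q, axis=1)
        # Outer stopping rule: the policy's state values have stopped changing
        if np.max(np.abs(v - v_prev)) < threshold:
            return actions, v
        v_prev = v
```

Calling `policy_iteration(R, P, gamma=0.9)` on any tabular MDP in this format returns a deterministic greedy policy together with its state values.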