Step 1: policy evaluation
To calculate the state value of $\pi_k$, solve the Bellman equation in matrix-vector form:
$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
Note that $v_{\pi_k}$ is the state value function of $\pi_k$ here.
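Solving this equation is itself a fixed-point iteration. Below is a minimal NumPy sketch, assuming the model of $\pi_k$ has already been collapsed into a reward vector `r_pi` and a transition matrix `P_pi`; those names and the function `policy_evaluation` are illustrative, not from the source.

```python
import numpy as np

def policy_evaluation(r_pi, P_pi, gamma, threshold=1e-8):
    """Solve v = r_pi + gamma * P_pi @ v by fixed-point iteration.

    r_pi : (n,) expected immediate reward of each state under the policy.
    P_pi : (n, n) state-transition matrix under the policy.
    """
    v = np.zeros(len(r_pi))                  # arbitrary initial guess v^(0)
    while True:
        v_next = r_pi + gamma * P_pi @ v     # one sweep of the Bellman equation
        if np.max(np.abs(v_next - v)) < threshold:
            return v_next
        v = v_next
```

For $\gamma < 1$ the right-hand side is a contraction mapping, so the iteration converges from any initial guess.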
Step 2: policy improvement
$$\pi_{k+1} = \arg\max_{\pi}\left(r_{\pi} + \gamma P_{\pi} v_{\pi_k}\right)$$
The maximization is component-wise, i.e., the arg max is taken independently for each state.
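Because the maximization decomposes state by state, a deterministic greedy policy attains it. Here is a sketch of this step, assuming a tabular model stored as an $|S|\times|A|$ expected-reward matrix `R` and an $|S|\times|A|\times|S|$ transition tensor `P` (shapes and names are assumptions of mine):

```python
import numpy as np

def policy_improvement(v, R, P, gamma):
    """Greedy (component-wise) improvement against the evaluated values v.

    v : (n,) state values v_{pi_k}.
    R : (n, m) expected reward, R[s, a] = sum_r p(r|s,a) * r.
    P : (n, m, n) transition probabilities, P[s, a, s'] = p(s'|s,a).
    Returns one greedy action per state, i.e. the deterministic policy pi_{k+1}.
    """
    q = R + gamma * np.einsum("san,n->sa", P, v)   # q_k(s, a) for every state-action pair
    return np.argmax(q, axis=1)                    # arg max over actions, state by state
```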
Q1: How to calculate the state value of $\pi_k$? By embedding another iterative algorithm (a value-iteration-style sweep) inside the policy evaluation step.
Q2: Why is $\pi_{k+1}$ better than $\pi_k$? Because $\pi_{k+1}$ maximizes $r_{\pi} + \gamma P_{\pi} v_{\pi_k}$, it can be shown that $v_{\pi_{k+1}} \geq v_{\pi_k}$ element-wise.
Q3: Why does such an iterative algorithm reach an optimal policy? Because $v_{\pi_k}$ eventually converges to $v^*$.
Q4: What is the relationship between policy iteration and value iteration? See Compare vi and pi.
Algorithm
Algorithm: Policy Iteration
Input: probability models $p(r \mid s, a)$ and $p(s' \mid s, a)$ for all $(s, a)$.
Initialization: initial guess $\pi_0$.

While $\|v_{\pi_k} - v_{\pi_{k-1}}\| >$ threshold:
    Policy evaluation:
        Initialization: arbitrary initial guess $v_{\pi_k}^{(0)}$.
        While $\|v_{\pi_k}^{(j)} - v_{\pi_k}^{(j-1)}\| >$ threshold:
            For each state $s \in \mathcal{S}$:
                $v_{\pi_k}^{(j+1)}(s) \leftarrow \sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}^{(j)}(s')\right]$
    Policy improvement:
        For each state $s \in \mathcal{S}$:
            For each action $a \in \mathcal{A}$:
                $q_k(s, a) \leftarrow \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi_k}(s')$   (Q-value)
            $a_k^*(s) \leftarrow \arg\max_a q_k(s, a)$   (greedy action)
            $\pi_{k+1}(a \mid s) \leftarrow \begin{cases} 1 & \text{if } a = a_k^*(s) \\ 0 & \text{otherwise} \end{cases}$   (policy update)
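Putting the two loops together, here is a minimal end-to-end sketch of the boxed algorithm in NumPy. The tabular representation (`R`, `P` as in the earlier snippets), the zero-initialized policy, and the exact stopping rules are my assumptions, not part of the original pseudocode.

```python
import numpy as np

def policy_iteration(R, P, gamma, threshold=1e-8):
    """Policy iteration for a tabular MDP.

    R : (n, m) expected reward, R[s, a] = sum_r p(r|s,a) * r.
    P : (n, m, n) transition probabilities, P[s, a, s'] = p(s'|s,a).
    Returns the greedy action for each state and the corresponding state values.
    """
    n, m = R.shape
    actions = np.zeros(n, dtype=int)          # initial guess pi_0: action 0 in every state
    v_prev = np.full(n, np.inf)               # forces at least one outer iteration
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v for the current policy
        r_pi = R[np.arange(n), actions]       # (n,)
        P_pi = P[np.arange(n), actions]       # (n, n)
        v = np.zeros(n)                       # arbitrary initial guess v^(0)
        while True:
            v_next = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_next - v)) < threshold:
                v = v_next
                break
            v = v_next
        # Policy improvement: greedy action w.r.t. q_k(s, a)
        q = R + gamma * np.einsum("san,n->sa", P, v)
        actions = np.argmax(q, axis=1)
        # Outer stopping rule: the policy's state values have stopped changing
        if np.max(np.abs(v - v_prev)) < threshold:
            return actions, v
        v_prev = v
```

Calling `policy_iteration(R, P, gamma=0.9)` on any tabular MDP in this format returns a deterministic greedy policy together with its state values.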