Policy Iteration

Given a random initial policy $\pi_0$, repeat the following two steps for $k = 0, 1, 2, \dots$

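  • Step 1: policy evaluation. Calculate the state value of $\pi_k$ by solving the Bellman equation

    $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$

    Note that $v_{\pi_k}$ is a state value function here.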

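  • Step 2: policy improvement

    $\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$

    The maximization is component-wise, that is, it is carried out separately for each state.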

  • Q1: how to calculate the state value of $\pi_k$? Embed an iterative algorithm, as in value iteration, in the policy evaluation step to calculate $v_{\pi_k}$

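  • Q2: why is $\pi_{k+1}$ better than $\pi_k$? Because $\pi_{k+1}$ is obtained by taking the max in the improvement step, so $v_{\pi_{k+1}} \ge v_{\pi_k}$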

  • Q3: why does such an iterative algorithm reach an optimal policy? Because the sequence $v_{\pi_k}$ finally converges to the optimal state value $v^*$

  • Q4: what is the relationship between policy iteration and value iteration? See "Compare vi and pi"
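
The two steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the algorithm as given in these notes: the model is assumed to be known and stored in two hypothetical NumPy arrays, `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected immediate rewards); policies are assumed to be deterministic; and the two-state MDP, the discount rate 0.9, and the stopping tolerance are made up for the example. Policy evaluation uses the embedded iterative scheme from Q1.

```python
import numpy as np


def policy_evaluation(P, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Step 1: iterate v <- r_pi + gamma * P_pi v until the update is below tol."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]  # (S, S): transition matrix under the policy
    r_pi = R[idx, policy]  # (S,): immediate reward under the policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


def policy_improvement(P, R, v, gamma):
    """Step 2: for each state separately, pick the action with the largest q-value."""
    q = R + gamma * P @ v  # (S, A): q(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] v[s']
    return np.argmax(q, axis=1)


def policy_iteration(P, R, gamma=0.9, max_rounds=1_000):
    """Alternate evaluation and improvement until the policy stops changing."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy (assumption)
    v = np.zeros(n_states)
    for _ in range(max_rounds):
        v = policy_evaluation(P, R, policy, gamma)
        new_policy = policy_improvement(P, R, v, gamma)
        if np.array_equal(new_policy, policy):
            break  # policy is stable, so v is the value of the returned policy
        policy = new_policy
    return policy, v


# Made-up two-state, two-action MDP, used only for illustration.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0  # state 1, action 0: move to state 0
P[1, 1, 1] = 1.0  # state 1, action 1: stay in state 1
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

policy, v = policy_iteration(P, R, gamma=0.9)
print("policy:", policy)
print("state values:", v)
```

In this toy model the best behavior is to move to state 1 and stay there, so the loop should settle on action 1 in both states, with state values close to 19 and 20. The stopping test compares successive policies; comparing successive value vectors would be another reasonable choice.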

Algorithm