Reinforcement Learning Notes

VI&PI truncated policy iteration

Apr 01, 2025

Truncated Policy Iteration

Compare value iteration and policy iteration

Policy iteration: start from $π_{0}$

Policy evaluation (PE) $v_{π_{k}} = r_{π_{k}} + γ P_{π_{k}} v_{π_{k}}$
Policy improvement (PI) $π_{k + 1} = \arg\max_{π} (r_{π} + γ P_{π} v_{π_{k}})$

Value iteration: start from $v_{0}$

Policy update (PU) $π_{k + 1} = \arg\max_{π} (r_{π} + γ P_{π} v_{k})$
Value update (VU) $v_{k + 1} = r_{π_{k + 1}} + γ P_{π_{k + 1}} v_{k}$
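
As a rough illustration of the PU and VU steps, the following minimal Python sketch runs value iteration on a two-state, two-action toy MDP. The MDP itself, the names `P`, `R`, `GAMMA`, and all function names are assumptions made for this example and are not part of the original notes; `R[s][a]` stands for the expected reward $\sum_{r} p (r ∣ s, a) r$.

```python
# Hypothetical toy MDP: P[s][a][s2] = p(s2 | s, a), R[s][a] = expected reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [
    [0.0, 0.0],  # nothing is earned in state 0
    [1.0, 0.0],  # staying in state 1 earns reward 1
]
GAMMA = 0.9


def q_value(s, a, v):
    """q(s, a) = R[s][a] + gamma * sum over s2 of p(s2 | s, a) * v[s2]."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def policy_update(v):
    """PU: greedy policy with respect to the current estimate v_k."""
    return [max(range(len(P[s])), key=lambda a: q_value(s, a, v))
            for s in range(len(P))]


def value_update(pi, v):
    """VU: one Bellman backup under the freshly updated policy."""
    return [q_value(s, pi[s], v) for s in range(len(P))]


def value_iteration(v, tol=1e-10, max_iters=10_000):
    """Alternate PU and VU until the value estimate stops changing."""
    pi = policy_update(v)
    for _ in range(max_iters):
        pi = policy_update(v)
        v_next = value_update(pi, v)
        if max(abs(x - y) for x, y in zip(v_next, v)) < tol:
            return pi, v_next
        v = v_next
    return pi, v


pi_star, v_star = value_iteration([0.0, 0.0])
print(pi_star, v_star)
```

On this toy problem the greedy policy should settle on "move" in state 0 and "stay" in state 1, with values approaching $1 / (1 - γ) = 10$ in state 1 and $γ \cdot 10 = 9$ in state 0.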

PE=policy evaluation, PI=policy improvement, PU=policy update, VU=value update

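Policy iteration: $π_{0} \xrightarrow{PE} v_{π_{0}} \xrightarrow{PI} π_{1} \xrightarrow{PE} v_{π_{1}} \xrightarrow{PI} π_{2} \xrightarrow{PE} v_{π_{2}} \xrightarrow{PI} \dots$

Value iteration: $u_{0} \xrightarrow{PU} π_{1}^{'} \xrightarrow{VU} u_{1} \xrightarrow{PU} π_{2}^{'} \xrightarrow{VU} u_{2} \xrightarrow{PU} \dots$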

Algorithm

Algorithm: Truncated Policy Iteration

Input: probability models $p (r ∣ s, a)$ and $p (s^{'} ∣ s, a)$, max iterations $j_{truncate}$.

Initialization: initial guess $π_{0}$.

While $v_{k}$ has not converged:

- Policy evaluation:
  - Initialize $v_{k}^{(0)}$ arbitrarily.
  - For $j = 0$ to $j_{truncate} - 1$:
    - For each $s \in S$: $v_{k}^{(j + 1)} (s) \leftarrow \sum_{a} π_{k} (a ∣ s) \left[ \sum_{r} p (r ∣ s, a) r + γ \sum_{s^{'}} p (s^{'} ∣ s, a) v_{k}^{(j)} (s^{'}) \right]$
  - $v_{k} \leftarrow v_{k}^{(j_{truncate})}$ (update value function)
- Policy improvement:
  - For each $s \in S$:
    - For each $a \in A (s)$: $q_{k} (s, a) \leftarrow \sum_{r} p (r ∣ s, a) r + γ \sum_{s^{'}} p (s^{'} ∣ s, a) v_{k} (s^{'})$ (Q-computation)
    - $a_{k}^{*} (s) \leftarrow \arg\max_{a} q_{k} (s, a)$ (greedy selection)
    - $π_{k + 1} (a ∣ s) \leftarrow \begin{cases} 1 & \text{if } a = a_{k}^{*} (s) \\ 0 & \text{otherwise} \end{cases}$ (policy update)
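
The algorithm can be sketched in plain Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the policy is stored as one deterministic action per state, `R[s][a]` holds the expected reward $\sum_{r} p (r ∣ s, a) r$, each evaluation phase is warm-started from the previous $v_{k}$ (the pseudocode allows an arbitrary start), and the loop stops once the policy is stable and the value change falls below `tol`. The toy MDP and every name in the code are hypothetical.

```python
def backup(P, R, gamma, s, a, v):
    """R[s][a] + gamma * sum over s2 of p(s2 | s, a) * v[s2]."""
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def truncated_policy_iteration(P, R, gamma, j_truncate, tol=1e-8, max_outer=100_000):
    """Return (policy, values, number of outer iterations used)."""
    n = len(P)
    pi = [0] * n      # initial guess pi_0: action 0 everywhere (arbitrary)
    v = [0.0] * n
    for k in range(1, max_outer + 1):
        # Policy evaluation, truncated after j_truncate sweeps.
        v_new = list(v)  # warm start from the previous estimate (assumption)
        for _ in range(j_truncate):
            v_new = [backup(P, R, gamma, s, pi[s], v_new) for s in range(n)]
        # Policy improvement: greedy action in every state.
        pi_new = [max(range(len(P[s])),
                      key=lambda a: backup(P, R, gamma, s, a, v_new))
                  for s in range(n)]
        delta = max(abs(x - y) for x, y in zip(v_new, v))
        stable = pi_new == pi
        pi, v = pi_new, v_new
        if stable and delta < tol:
            return pi, v, k
    return pi, v, max_outer


# Hypothetical two-state MDP: action 0 stays, action 1 switches state;
# only staying in state 1 is rewarded.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 0.0], [1.0, 0.0]]

results = {j: truncated_policy_iteration(P, R, 0.9, j) for j in (1, 10, 100)}
for j, (pi, v, outer) in results.items():
    print(f"j_truncate={j}: policy={pi}, outer iterations={outer}")
```

Running it with several values of `j_truncate` is a quick way to see the trade-off: more evaluation sweeps per outer iteration should mean fewer outer iterations, at the cost of more work inside each one.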

Policy Iteration (PI) vs. Value Iteration (VI) vs. Truncated Policy Iteration (TPI)

$$
\begin{array}{rll}
 & v_{π_{1}}^{(0)} & = v_{0} \\
\text{value iteration} \leftarrow & v_{π_{1}}^{(1)} & = r_{π_{1}} + γ P_{π_{1}} v_{π_{1}}^{(0)} \\
 & v_{π_{1}}^{(2)} & = r_{π_{1}} + γ P_{π_{1}} v_{π_{1}}^{(1)} \\
 & \vdots & \\
\text{truncated policy iteration} \leftarrow & v_{π_{1}}^{(j)} & = r_{π_{1}} + γ P_{π_{1}} v_{π_{1}}^{(j - 1)} \\
 & \vdots & \\
\text{policy iteration} \leftarrow & v_{π_{1}}^{(\infty)} & = r_{π_{1}} + γ P_{π_{1}} v_{π_{1}}^{(\infty)}
\end{array}
$$
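
The sequence of iterates above can be reproduced numerically. The sketch below assumes a hypothetical two-state policy $π_{1}$ given directly in matrix form (`P_pi1`, `r_pi1`), takes $v_{0} = 0$, and assumes $π_{1}$ is greedy with respect to $v_{0}$, which holds here only because of a tie. None of these names or numbers come from the original notes.

```python
def evaluation_iterates(P_pi, r_pi, v0, gamma, n_sweeps):
    """Return [v^(0), ..., v^(n_sweeps)] with v^(j+1) = r_pi + gamma * P_pi v^(j)."""
    iterates = [list(v0)]
    for _ in range(n_sweeps):
        v = iterates[-1]
        iterates.append([
            r + gamma * sum(p * x for p, x in zip(row, v))
            for r, row in zip(r_pi, P_pi)
        ])
    return iterates


# Hypothetical pi_1: state 0 moves to state 1, state 1 stays and earns reward 1.
P_pi1 = [[0.0, 1.0], [0.0, 1.0]]
r_pi1 = [0.0, 1.0]

iterates = evaluation_iterates(P_pi1, r_pi1, v0=[0.0, 0.0], gamma=0.9, n_sweeps=400)

v_value_iteration = iterates[1]    # one sweep: what value iteration keeps
v_truncated = iterates[5]          # j sweeps (here j = 5): truncated policy iteration
v_policy_iteration = iterates[-1]  # many sweeps: stands in for the limit v_{pi_1}

print(v_value_iteration, v_truncated, v_policy_iteration)
```

The last iterate only approximates $v_{π_{1}}^{(\infty)}$, since a program cannot run infinitely many sweeps; in practice policy iteration also stops its evaluation step at some tolerance.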

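In practice, measured by convergence per outer iteration, the three algorithms rank PI > TPI > VI. TPI also makes the close relationship between value iteration and policy iteration explicit:

The policy evaluation step of PI and the value update step of VI use the same formula: both update the state-value function through the Bellman equation. The difference is that policy evaluation in PI must be iterated to convergence, whereas the value update in VI performs only a single iteration.
Consider the parameter $j_{truncate}$ in TPI. When it is 1 (and the order of policy evaluation and policy improvement is swapped, with the initial value $v_{k}^{(0)}$ of each evaluation inheriting the most recent $v$), TPI reduces to VI. When it is $\infty$, TPI becomes PI. TPI is therefore a compromise between PI and VI.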
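
The two limiting cases can be written out as a short derivation. It assumes the warm start $v_{k + 1}^{(0)} = v_{k}$ and that the policy is updated before the single evaluation sweep, as described in the note above; the indexing follows the value iteration equations earlier in these notes.

$$
\begin{aligned}
j_{truncate} = 1: \quad & π_{k + 1} = \arg\max_{π} (r_{π} + γ P_{π} v_{k}), \\
& v_{k + 1} = v_{k + 1}^{(1)} = r_{π_{k + 1}} + γ P_{π_{k + 1}} v_{k} = \max_{π} (r_{π} + γ P_{π} v_{k}), \\
j_{truncate} \to \infty: \quad & v_{k} = \lim_{j \to \infty} v_{k}^{(j)} = v_{π_{k}}, \quad \text{where } v_{π_{k}} = r_{π_{k}} + γ P_{π_{k}} v_{π_{k}}.
\end{aligned}
$$

The first case is exactly the PU and VU pair of value iteration; the second is the exact policy evaluation of policy iteration.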
