Stationary distribution
Explanation
- distribution: a probability distribution over the states
- stationary: describes the long-run behavior
- summary: after the agent follows a policy for a long time, the probability that the agent is in any given state can be described by this distribution (formalized right below this list).
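In symbols, the summary above can be written as a limit (a standard formalization; the notation $S_k$ for the state at time step $k$ is introduced here only for this sketch):

$$d_\pi(s) \;=\; \lim_{k \to \infty} \Pr(S_k = s), \qquad s \in \mathcal{S}$$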
Remarks:
- Stationary distribution is also called
    - steady-state distribution
    - limiting distribution
- It is critical for understanding the value function method
- It is also important for the policy gradient method (see the equation sketch after this list)
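To make the policy-gradient remark concrete, one standard objective used there is the average state value, which weights each state's value by the stationary distribution (a sketch of the standard formula; $\bar{v}_\pi$ is not otherwise defined in these notes):

$$\bar{v}_\pi \;=\; \sum_{s \in \mathcal{S}} d_\pi(s)\, v_\pi(s) \;=\; d_\pi^T v_\pi$$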
Let $n_\pi(s)$ denote the number of times that $s$ has been visited in a very long episode generated by $\pi$. Then, $d_\pi(s)$ can be approximated by

$$d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s' \in \mathcal{S}} n_\pi(s')}$$
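A minimal sketch of this visit-count approximation, assuming we can simulate a long episode from the same transition matrix $P_\pi$ used later in the Example and Appendix; the function name `estimate_d_pi_by_counting`, the episode length, and the random seed are illustrative choices, not from the original notes:

```python
import numpy as np

def estimate_d_pi_by_counting(P_pi: np.ndarray, n_steps: int = 100_000, seed: int = 0) -> np.ndarray:
    """Illustrative sketch: estimate d_pi(s) ~= n_pi(s) / sum_s' n_pi(s') from one long episode."""
    rng = np.random.default_rng(seed)
    n_states = P_pi.shape[0]
    counts = np.zeros(n_states)
    s = rng.integers(n_states)                # arbitrary start state
    for _ in range(n_steps):
        counts[s] += 1                        # record the visit to state s
        s = rng.choice(n_states, p=P_pi[s])   # step according to row s of P_pi
    return counts / counts.sum()              # normalize visit counts into a distribution

# Same P_pi as in the Appendix; with enough steps the estimate should be close to
# the exact stationary distribution [0.0345, 0.1084, 0.1330, 0.7241].
P_pi = np.array([
    [0.3, 0.1, 0.6, 0.0],
    [0.1, 0.3, 0.0, 0.6],
    [0.1, 0.0, 0.3, 0.6],
    [0.0, 0.1, 0.1, 0.8],
])
print(estimate_d_pi_by_counting(P_pi))
```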
Example
Recall the matrix-vector form of the Bellman equation from earlier:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

where $P_\pi$ is the state transition probability matrix under $\pi$. Since we assume the distribution over states can be described by a stationary distribution $d_\pi$, $d_\pi$ is clearly a solution of the following equation (the distribution does not change under a state transition):

$$d_\pi^T = d_\pi^T P_\pi$$

From linear algebra, $d_\pi$ is therefore an eigenvector of $P_\pi^T$ (a left eigenvector of $P_\pi$) corresponding to the eigenvalue $\lambda = 1$.
Linear algebra refresher: eigenvalues and eigenvectors
Suppose $A$ is a square matrix. If there is a nonzero vector $x$ such that $Ax = \lambda x$, then $\lambda$ is an eigenvalue of $A$ and $x$ is a corresponding eigenvector. Intuitively, after eigendecomposition $A$ yields $n$ eigenvectors; on each of these, $A$ acts purely as a stretch, and the stretch factor is the corresponding eigenvalue. Solving for the eigenvalues is also straightforward: solve $\det(A - \lambda I) = 0$. (One can also compute them iteratively with a program.)
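A quick NumPy sanity check of these definitions; the $2 \times 2$ matrix below is an arbitrary example chosen only for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # small symmetric example matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    x = eigenvectors[:, i]                 # the i-th eigenvector is a column
    # A only stretches x by the factor lam: A @ x == lam * x
    assert np.allclose(A @ x, lam * x)

print(eigenvalues)  # the roots of det(A - lambda*I) = 0, here 3 and 1
```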
For example, let $P_\pi$ be

$$P_\pi = \begin{bmatrix} 0.3 & 0.1 & 0.6 & 0 \\ 0.1 & 0.3 & 0 & 0.6 \\ 0.1 & 0 & 0.3 & 0.6 \\ 0 & 0.1 & 0.1 & 0.8 \end{bmatrix}$$

Solving gives $d_\pi \approx [0.0345, 0.1084, 0.1330, 0.7241]^T$ (see the code in the Appendix for the computation).
Appendix
Code for the example above
```python
# File: eigenvalue.py
import numpy as np


def method_eigen(P_pi: np.ndarray):
    # to solve d_pi^T = d_pi^T @ P_pi, find the eigenvectors of P_pi^T
    eigenvalues, eigenvectors = np.linalg.eig(P_pi.T)
    # find the eigenvector corresponding to the eigenvalue 1 (\lambda = 1)
    for i in range(len(eigenvalues)):
        if np.isclose(eigenvalues[i], 1):
            d_pi = eigenvectors[:, i].real
            break
    # normalize so the entries sum to 1
    d_pi = d_pi / d_pi.sum()
    return d_pi


def method_iter(P_pi: np.ndarray):
    # power iteration: start from a uniform distribution and repeatedly apply P_pi
    d_pi = np.array([0.25, 0.25, 0.25, 0.25])
    for i in range(30):
        d_pi = d_pi @ P_pi
    d_pi = d_pi / d_pi.sum()
    return d_pi


if __name__ == "__main__":
    P_pi = np.array([
        [0.3, 0.1, 0.6, 0],
        [0.1, 0.3, 0, 0.6],
        [0.1, 0, 0.3, 0.6],
        [0, 0.1, 0.1, 0.8]
    ])
    # solve d_pi.T = d_pi.T @ P_pi
    d_pi1 = method_eigen(P_pi)
    print(d_pi1)
    d_pi2 = method_iter(P_pi)
    print(d_pi2)
    if np.allclose(d_pi1, d_pi2):
        print('Both methods give the same result')
    else:
        print('The results are different')
```
Running `python eigenvalue.py` produces the following terminal output:
[0.03448276 0.10837438 0.13300493 0.72413793]
[0.03448276 0.10837438 0.13300493 0.72413793]
Both methods give the same result