Stationary distribution

Explanation

  • distribution: the distribution of the state
  • stationary: the long-run behavior
  • summary: after the agent runs a long time following a policy, the probability that the agent is at any state can be described by this distribution.

Remarks:

Let denote the number of times that has been visited in a very long episode generated by . Then, can be approximated by

Example

还记得我们之前的Bellman Equation Matrix Form:

其中就是状态转移的概率矩阵,又由于我们假设可以用stationary distribution来描述关于各个state的分布,那么显然是下面方程的一个解(分布不随状态转移而改变):

根据线性代数的知识我们显然知道此时是(特征值时对应的特征向量。

复习线代:Eigenvalues(特征值) and Eigenvectors(特征向量)

假设A是square matrix,, 如果存在,那么就是A的特征值,就是A的特征向量;简单理解就是A经过特征分解之后我们就可以得到n个特征向量,矩阵对于该特征向量而言就是伸缩作用,伸缩的倍数就是特征值;要解方程也比较简单,解即可;(当然也可以通过程序的方法来迭代解)

比如

可以解得(过程见Appendix的代码)

Appendix

上述例子代码

# File: eigenvalue.py
import numpy as np
 
def method_eigen(P_pi: np.ndarray):
    # to solve the problem, we need to find the eigenvalue of P_pi
    eigenvalues, eigenvectors = np.linalg.eig(P_pi.T)
 
    # find the eigenvector corresponding to the eigenvalue 1 (\lambda = 1)
    for i in range(len(eigenvalues)):
        if np.isclose(eigenvalues[i], 1):
            d_pi = eigenvectors[:, i].real
            break
 
    # norm to sum 1
    d_pi = d_pi / d_pi.sum()
    return d_pi
 
def method_iter(P_pi: np.ndarray):
    d_pi = np.array([0.25, 0.25, 0.25, 0.25])
    for i in range(30):
        d_pi_new = d_pi @ P_pi
        d_pi = d_pi_new
 
    d_pi = d_pi / d_pi.sum()
    return d_pi
 
if __name__ == "__main__":
    P_pi = np.array([
        [0.3, 0.1, 0.6, 0],
        [0.1, 0.3, 0, 0.6],
        [0.1, 0, 0.3, 0.6],
        [0, 0.1, 0.1, 0.8]
    ])
 
    # to solve d_pi.T = d_pi.T @ P_pi
 
    d_pi1 = method_eigen(P_pi)
    print(d_pi1)
 
    d_pi2 = method_iter(P_pi)
    print(d_pi2)
 
    if np.allclose(d_pi1, d_pi2):
        print('Both methods give the same result')
    else:
        print('The results are different')
python eigenvalue.py

Terminal output:

[0.03448276 0.10837438 0.13300493 0.72413793]
[0.03448276 0.10837438 0.13300493 0.72413793]
Both methods give the same result