Stationary distribution
Explanation
- distribution: a probability distribution over the states
- stationary: describes the long-run behavior
- summary: after the agent follows a policy for a long time, the probability that the agent is in any given state can be described by this distribution (formalized right below this list).
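In symbols, the summary above can be written as a limit (a standard formalization; the notation $S_k$ for the state at time step $k$ is introduced here only for this sketch):

$$d_\pi(s) \;=\; \lim_{k \to \infty} \Pr(S_k = s), \qquad s \in \mathcal{S}$$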
Remarks:
- Stationary distribution is also called
    - steady-state distribution
    - limiting distribution
- It is critical for understanding the value function method
- It is also important for the policy gradient method (see the equation sketch after this list)
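To make the policy-gradient remark concrete, one standard objective used there is the average state value, which weights each state's value by the stationary distribution (a sketch of the standard formula; $\bar{v}_\pi$ is not otherwise defined in these notes):

$$\bar{v}_\pi \;=\; \sum_{s \in \mathcal{S}} d_\pi(s)\, v_\pi(s) \;=\; d_\pi^T v_\pi$$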
Let $n_\pi(s)$ denote the number of times that $s$ has been visited in a very long episode generated by $\pi$. Then, $d_\pi(s)$ can be approximated by

$$d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s' \in \mathcal{S}} n_\pi(s')}$$
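A minimal sketch of this visit-count approximation, assuming we can simulate a long episode from the same transition matrix $P_\pi$ used later in the Example and Appendix; the function name `estimate_d_pi_by_counting`, the episode length, and the random seed are illustrative choices, not from the original notes:

```python
import numpy as np

def estimate_d_pi_by_counting(P_pi: np.ndarray, n_steps: int = 100_000, seed: int = 0) -> np.ndarray:
    """Illustrative sketch: estimate d_pi(s) ~= n_pi(s) / sum_s' n_pi(s') from one long episode."""
    rng = np.random.default_rng(seed)
    n_states = P_pi.shape[0]
    counts = np.zeros(n_states)
    s = rng.integers(n_states)                # arbitrary start state
    for _ in range(n_steps):
        counts[s] += 1                        # record the visit to state s
        s = rng.choice(n_states, p=P_pi[s])   # step according to row s of P_pi
    return counts / counts.sum()              # normalize visit counts into a distribution

# Same P_pi as in the Appendix; with enough steps the estimate should be close to
# the exact stationary distribution [0.0345, 0.1084, 0.1330, 0.7241].
P_pi = np.array([
    [0.3, 0.1, 0.6, 0.0],
    [0.1, 0.3, 0.0, 0.6],
    [0.1, 0.0, 0.3, 0.6],
    [0.0, 0.1, 0.1, 0.8],
])
print(estimate_d_pi_by_counting(P_pi))
```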
Example
Recall the matrix-vector form of the Bellman equation from earlier:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

where $P_\pi$ is the state transition probability matrix under $\pi$. Since we assume the distribution over states can be described by a stationary distribution $d_\pi$, $d_\pi$ is clearly a solution of the following equation (the distribution does not change under a state transition):

$$d_\pi^T = d_\pi^T P_\pi$$

From linear algebra, $d_\pi$ is therefore an eigenvector of $P_\pi^T$ (a left eigenvector of $P_\pi$) corresponding to the eigenvalue $\lambda = 1$.
Linear algebra refresher: eigenvalues and eigenvectors
Suppose $A$ is a square matrix. If there is a nonzero vector $x$ such that $Ax = \lambda x$, then $\lambda$ is an eigenvalue of $A$ and $x$ is a corresponding eigenvector. Intuitively, after eigendecomposition $A$ yields $n$ eigenvectors; on each of these, $A$ acts purely as a stretch, and the stretch factor is the corresponding eigenvalue. Solving for the eigenvalues is also straightforward: solve $\det(A - \lambda I) = 0$. (One can also compute them iteratively with a program.)
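A quick NumPy sanity check of these definitions; the $2 \times 2$ matrix below is an arbitrary example chosen only for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # small symmetric example matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    x = eigenvectors[:, i]                 # the i-th eigenvector is a column
    # A only stretches x by the factor lam: A @ x == lam * x
    assert np.allclose(A @ x, lam * x)

print(eigenvalues)  # the roots of det(A - lambda*I) = 0, here 3 and 1
```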
For example, let $P_\pi$ be

$$P_\pi = \begin{bmatrix} 0.3 & 0.1 & 0.6 & 0 \\ 0.1 & 0.3 & 0 & 0.6 \\ 0.1 & 0 & 0.3 & 0.6 \\ 0 & 0.1 & 0.1 & 0.8 \end{bmatrix}$$

Solving gives $d_\pi \approx [0.0345, 0.1084, 0.1330, 0.7241]^T$ (see the code in the Appendix for the computation).
Appendix
Code for the example above
```python
# File: eigenvalue.py
import numpy as np


def method_eigen(P_pi: np.ndarray):
    # to solve d_pi^T = d_pi^T @ P_pi, find the eigenvectors of P_pi^T
    eigenvalues, eigenvectors = np.linalg.eig(P_pi.T)
    # find the eigenvector corresponding to the eigenvalue 1 (\lambda = 1)
    for i in range(len(eigenvalues)):
        if np.isclose(eigenvalues[i], 1):
            d_pi = eigenvectors[:, i].real
            break
    # normalize so the entries sum to 1
    d_pi = d_pi / d_pi.sum()
    return d_pi


def method_iter(P_pi: np.ndarray):
    # power iteration: start from a uniform distribution and repeatedly apply P_pi
    d_pi = np.array([0.25, 0.25, 0.25, 0.25])
    for i in range(30):
        d_pi = d_pi @ P_pi
    d_pi = d_pi / d_pi.sum()
    return d_pi


if __name__ == "__main__":
    P_pi = np.array([
        [0.3, 0.1, 0.6, 0],
        [0.1, 0.3, 0, 0.6],
        [0.1, 0, 0.3, 0.6],
        [0, 0.1, 0.1, 0.8]
    ])
    # solve d_pi.T = d_pi.T @ P_pi
    d_pi1 = method_eigen(P_pi)
    print(d_pi1)
    d_pi2 = method_iter(P_pi)
    print(d_pi2)
    if np.allclose(d_pi1, d_pi2):
        print('Both methods give the same result')
    else:
        print('The results are different')
```
Running `python eigenvalue.py` produces the following terminal output:
[0.03448276 0.10837438 0.13300493 0.72413793]
[0.03448276 0.10837438 0.13300493 0.72413793]
Both methods give the same result