Bellman Equation
Derivation
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \ldots$$
The return can be written as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$$
Then the state value will be:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]$$
- the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$ is the mean of the immediate rewards
- the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$ is the mean of the future rewards
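As a quick numerical check of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a minimal Python sketch; the reward sequence and the discount factor $\gamma = 0.9$ are made-up values used only for illustration:

```python
# Minimal sketch: verify G_t = R_{t+1} + gamma * G_{t+1} on a made-up finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, -1.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

def discounted_return(rs, gamma):
    """Direct definition: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rs))

G_t = discounted_return(rewards, gamma)
G_t1 = discounted_return(rewards[1:], gamma)  # G_{t+1}: the return starting one step later
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12  # the recursion holds
print(G_t)
```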
First we calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$
- the term $\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ considers all the rewards that can be obtained when taking action $a$ at state $s$
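As a quick worked example with made-up numbers (two actions; the policy and reward probabilities below are assumptions for illustration only): suppose $\pi(a_1 \mid s) = 0.8$, $\pi(a_2 \mid s) = 0.2$, and the reward is deterministically $1$ under $a_1$ and $0$ under $a_2$, i.e. $p(r{=}1 \mid s, a_1) = 1$ and $p(r{=}0 \mid s, a_2) = 1$. Then
$$\mathbb{E}[R_{t+1} \mid S_t = s] = 0.8 \cdot (1 \cdot 1) + 0.2 \cdot (1 \cdot 0) = 0.8$$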
Then we calculate the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$:
$$\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}$$
- the step $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ uses the Markov memoryless property
- the step $p(s' \mid s) = \sum_a p(s' \mid s, a)\, \pi(a \mid s)$ considers all possible actions, so it can be expanded like this
- $v_\pi(s')$ does not depend on $a$, so we can exchange the order of the summations over $a$ and $s'$ here
Therefore we have:
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}$$
- $v_\pi(s)$ and $v_\pi(s')$ are state values to be calculated. Bootstrapping!
- $\pi(a \mid s)$ is the given policy. Solving this equation is called policy evaluation.
- $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model.
Even if we do not know the dynamic model, we can still compute the state values; such methods are called model-free algorithms.
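To make the equation concrete, below is a minimal Python sketch that evaluates it by fixed-point iteration on a toy problem; the two-state model (`p_s`, `r`), the policy `pi`, and $\gamma = 0.9$ are all hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with a known model (the model-based setting).
gamma = 0.9
# p_s[a, s, s2] = p(s2 | s, a);  r[a, s] = sum_r p(r | s, a) * r (expected immediate reward)
p_s = np.array([[[0.8, 0.2],    # action 0: rows are current state s, columns next state s2
                 [0.1, 0.9]],
                [[0.5, 0.5],    # action 1
                 [0.0, 1.0]]])
r = np.array([[1.0, 0.0],       # action 0
              [0.0, 2.0]])      # action 1
pi = np.array([[0.7, 0.3],      # pi(a | s=0)
               [0.4, 0.6]])     # pi(a | s=1); columns index actions

# Iterate v(s) <- sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s2 p(s2|s,a) v(s2) ]
v = np.zeros(2)
for _ in range(1000):
    q = r + gamma * (p_s @ v)         # q[a, s] = r(s, a) + gamma * sum_s2 p(s2|s,a) v(s2)
    v_new = np.sum(pi.T * q, axis=0)  # weight each action by pi(a|s)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)  # converged state values v_pi(s) for this hypothetical MDP
```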
Exercise
- write out the Bellman equations for each state
- solve the state values
Solve the above equations from the last one to the first:
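The concrete states and rewards of this exercise are not reproduced here, so as an illustration of the idea, consider a hypothetical deterministic chain $s_1 \to s_2 \to s_3 \to s_3$ under the given policy, with immediate rewards $r_1, r_2, r_3$. Each Bellman equation then contains only one new unknown, so the system can be solved from the last state backwards:
$$\begin{aligned}
v_\pi(s_3) &= r_3 + \gamma v_\pi(s_3) \;\Rightarrow\; v_\pi(s_3) = \frac{r_3}{1 - \gamma}, \\
v_\pi(s_2) &= r_2 + \gamma v_\pi(s_3), \\
v_\pi(s_1) &= r_1 + \gamma v_\pi(s_2).
\end{aligned}$$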
Matrix-vector form
Recall that the Bellman equation can be rewritten as:
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s')$$
$$r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$
- the second term of this formula recalls the third line of the derivation of $\mathbb{E}[G_{t+1} \mid S_t = s]$ above
Decomposing the Bellman equation into these three formulas makes it easier to memorize.
Matrix form:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
where $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^T$ and $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^T$
- $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ss'} = p_\pi(s' \mid s)$
- $P_\pi$ is also called the state transition matrix
Can the expressions for $r_\pi$ and $P_\pi$ be written in matrix form?
See Probability&Matrix.
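As an illustration of how $r_\pi$ and $P_\pi$ can be assembled from the model and the policy, here is a minimal Python sketch reusing the hypothetical two-state MDP from the earlier sketch (`p_s`, `r`, `pi` are the same made-up arrays):

```python
import numpy as np

# Same hypothetical arrays as in the earlier sketch:
# p_s[a, s, s2] = p(s2 | s, a), r[a, s] = sum_r p(r | s, a) * r, pi[s, a] = pi(a | s)
p_s = np.array([[[0.8, 0.2], [0.1, 0.9]],
                [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# [r_pi]_s = sum_a pi(a|s) * r(s, a)
r_pi = np.einsum('sa,as->s', pi, r)
# [P_pi]_{s s2} = sum_a pi(a|s) * p(s2 | s, a); each row of P_pi sums to 1
P_pi = np.einsum('sa,ast->st', pi, p_s)

print(r_pi)              # [0.7, 1.2] for these made-up numbers
print(P_pi)              # the state transition matrix under pi
print(P_pi.sum(axis=1))  # rows sum to 1
```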
Solve the state values
Why solve state values
- Given a policy, finding the corresponding state values is called policy evaluation. It is a fundamental problem in RL and the foundation for finding better policies.
- It is important to understand how to solve the Bellman equation.
- The closed-form solution: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
- The iterative solution $v_{k+1} = r_\pi + \gamma P_\pi v_k$ approximates the fixed point: $v_k \to v_\pi$ as $k \to \infty$
For details, see Proof.
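Finally, a minimal Python sketch comparing the two solutions on the same hypothetical MDP (the `r_pi` and `P_pi` values below are the ones produced by the construction sketch above):

```python
import numpy as np

gamma = 0.9
# Values produced by the r_pi / P_pi construction above (hypothetical MDP).
P_pi = np.array([[0.71, 0.29],
                 [0.04, 0.96]])
r_pi = np.array([0.7, 1.2])

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k, starting from v_0 = 0
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * (P_pi @ v)

print(v_closed)  # closed-form state values
print(v)         # iterative state values (matches the closed form up to numerical tolerance)
```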