Bellman Equation
Derivation
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \ldots$$
The return can be written as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$$
Then the state value will be:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]$$
- the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$ is the mean of the immediate rewards
- the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$ is the mean of the future rewards
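As a quick numerical check of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a minimal Python sketch; the reward sequence and the discount factor $\gamma = 0.9$ are made-up values used only for illustration:

```python
# Minimal sketch: verify G_t = R_{t+1} + gamma * G_{t+1} on a made-up finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, -1.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

def discounted_return(rs, gamma):
    """Direct definition: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rs))

G_t = discounted_return(rewards, gamma)
G_t1 = discounted_return(rewards[1:], gamma)  # G_{t+1}: the return starting one step later
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12  # the recursion holds
print(G_t)
```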
First we calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$
- the term $\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ considers all the rewards that can be obtained when taking action $a$ at state $s$
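As a quick worked example with made-up numbers (two actions; the policy and reward probabilities below are assumptions for illustration only): suppose $\pi(a_1 \mid s) = 0.8$, $\pi(a_2 \mid s) = 0.2$, and the reward is deterministically $1$ under $a_1$ and $0$ under $a_2$, i.e. $p(r{=}1 \mid s, a_1) = 1$ and $p(r{=}0 \mid s, a_2) = 1$. Then
$$\mathbb{E}[R_{t+1} \mid S_t = s] = 0.8 \cdot (1 \cdot 1) + 0.2 \cdot (1 \cdot 0) = 0.8$$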
Then we calculate the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$:
$$\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}$$
- the step $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ uses the Markov memoryless property
- the step $p(s' \mid s) = \sum_a p(s' \mid s, a)\, \pi(a \mid s)$ considers all possible actions, so it can be expanded like this
- $v_\pi(s')$ does not depend on $a$, so we can exchange the order of the summations over $a$ and $s'$ here
Therefore we have:
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}$$
- $v_\pi(s)$ and $v_\pi(s')$ are state values to be calculated. Bootstrapping!
- $\pi(a \mid s)$ is the given policy. Solving this equation is called policy evaluation.
- $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model.
Even if we do not know the dynamic model, we can still compute the state values; such methods are called model-free algorithms.
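To make the equation concrete, below is a minimal Python sketch that evaluates it by fixed-point iteration on a toy problem; the two-state model (`p_s`, `r`), the policy `pi`, and $\gamma = 0.9$ are all hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with a known model (the model-based setting).
gamma = 0.9
# p_s[a, s, s2] = p(s2 | s, a);  r[a, s] = sum_r p(r | s, a) * r (expected immediate reward)
p_s = np.array([[[0.8, 0.2],    # action 0: rows are current state s, columns next state s2
                 [0.1, 0.9]],
                [[0.5, 0.5],    # action 1
                 [0.0, 1.0]]])
r = np.array([[1.0, 0.0],       # action 0
              [0.0, 2.0]])      # action 1
pi = np.array([[0.7, 0.3],      # pi(a | s=0)
               [0.4, 0.6]])     # pi(a | s=1); columns index actions

# Iterate v(s) <- sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s2 p(s2|s,a) v(s2) ]
v = np.zeros(2)
for _ in range(1000):
    q = r + gamma * (p_s @ v)         # q[a, s] = r(s, a) + gamma * sum_s2 p(s2|s,a) v(s2)
    v_new = np.sum(pi.T * q, axis=0)  # weight each action by pi(a|s)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)  # converged state values v_pi(s) for this hypothetical MDP
```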
Exercise
- write out the Bellman equations for each state
- solve the state values
Solve the above equations from the last one to the first:
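The concrete states and rewards of this exercise are not reproduced here, so as an illustration of the idea, consider a hypothetical deterministic chain $s_1 \to s_2 \to s_3 \to s_3$ under the given policy, with immediate rewards $r_1, r_2, r_3$. Each Bellman equation then contains only one new unknown, so the system can be solved from the last state backwards:
$$\begin{aligned}
v_\pi(s_3) &= r_3 + \gamma v_\pi(s_3) \;\Rightarrow\; v_\pi(s_3) = \frac{r_3}{1 - \gamma}, \\
v_\pi(s_2) &= r_2 + \gamma v_\pi(s_3), \\
v_\pi(s_1) &= r_1 + \gamma v_\pi(s_2).
\end{aligned}$$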
Matrix-vector form
Recall that the Bellman equation can be rewritten as:
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s')$$
$$r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$
- the second term of this formula recalls the third line of the derivation of $\mathbb{E}[G_{t+1} \mid S_t = s]$ above
Decomposing the Bellman equation into these three formulas makes it easier to memorize.
Matrix form:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
where $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^T$ and $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^T$
- $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ss'} = p_\pi(s' \mid s)$
- $P_\pi$ is also called the state transition matrix
Can the expressions for $r_\pi$ and $P_\pi$ be written in matrix form?
See Probability&Matrix.
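As an illustration of how $r_\pi$ and $P_\pi$ can be assembled from the model and the policy, here is a minimal Python sketch reusing the hypothetical two-state MDP from the earlier sketch (`p_s`, `r`, `pi` are the same made-up arrays):

```python
import numpy as np

# Same hypothetical arrays as in the earlier sketch:
# p_s[a, s, s2] = p(s2 | s, a), r[a, s] = sum_r p(r | s, a) * r, pi[s, a] = pi(a | s)
p_s = np.array([[[0.8, 0.2], [0.1, 0.9]],
                [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# [r_pi]_s = sum_a pi(a|s) * r(s, a)
r_pi = np.einsum('sa,as->s', pi, r)
# [P_pi]_{s s2} = sum_a pi(a|s) * p(s2 | s, a); each row of P_pi sums to 1
P_pi = np.einsum('sa,ast->st', pi, p_s)

print(r_pi)              # [0.7, 1.2] for these made-up numbers
print(P_pi)              # the state transition matrix under pi
print(P_pi.sum(axis=1))  # rows sum to 1
```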
Solve the state values
Why solve state values
- Given a policy, finding the corresponding state values is called policy evaluation. It is a fundamental problem in RL and the foundation for finding better policies.
- It is important to understand how to solve the Bellman equation.
- The closed-form solution: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
- The iterative solution $v_{k+1} = r_\pi + \gamma P_\pi v_k$ approximates the fixed point: $v_k \to v_\pi$ as $k \to \infty$
For details, see Proof.
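Finally, a minimal Python sketch comparing the two solutions on the same hypothetical MDP (the `r_pi` and `P_pi` values below are the ones produced by the construction sketch above):

```python
import numpy as np

gamma = 0.9
# Values produced by the r_pi / P_pi construction above (hypothetical MDP).
P_pi = np.array([[0.71, 0.29],
                 [0.04, 0.96]])
r_pi = np.array([0.7, 1.2])

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k, starting from v_0 = 0
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * (P_pi @ v)

print(v_closed)  # closed-form state values
print(v)         # iterative state values (matches the closed form up to numerical tolerance)
```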