Bellman Equation

Derivation

Consider a random trajectory:

$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$$

The return can be written as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$$

Then the state value is:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$$

  • the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$ is the mean of the immediate rewards
  • the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$ is the mean of the future rewards

First we calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

  • the term $\sum_r p(r \mid s, a)\, r$ considers all the rewards that can be obtained when taking action $a$ at state $s$

Then we calculate the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$:

$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}
$$

  • the second line uses the Markov (memoryless) property: $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$
  • the last line considers all possible actions, so $p(s' \mid s)$ expands as $\sum_a p(s' \mid s, a)\, \pi(a \mid s)$
  • $v_\pi(s')$ is irrelevant to $a$, so the order of the two sums can be exchanged

Therefore we have the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}$$

  • $v_\pi(s)$ and $v_\pi(s')$ are the state values to be calculated. Bootstrapping!
  • $\pi(a \mid s)$ is the given policy. Solving this equation is called policy evaluation.
  • $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the dynamic model.

Even if we do not know this dynamic model, we can still compute the state values; this is what model-free algorithms do.
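As a quick sanity check, the sketch below (not part of the original notes; the MDP, names, and numbers are made-up assumptions) evaluates the right-hand side of the elementwise Bellman equation for a hypothetical 2-state MDP and verifies that hand-computed state values satisfy it.

```python
# Hypothetical 2-state MDP used only for illustration.
gamma = 0.9

# Policy pi(a|s)
pi = {
    "s1": {"go": 1.0},
    "s2": {"stay": 1.0},
}

# Reward model p(r|s,a): for each (s, a), a dict {r: probability}
p_r = {
    ("s1", "go"): {0.0: 1.0},
    ("s2", "stay"): {1.0: 1.0},
}

# Transition model p(s'|s,a): for each (s, a), a dict {s': probability}
p_s = {
    ("s1", "go"): {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}

def bellman_rhs(v, s):
    """Right-hand side of the Bellman equation at state s, given value estimates v."""
    total = 0.0
    for a, pi_as in pi[s].items():
        immediate = sum(prob * r for r, prob in p_r[(s, a)].items())        # sum_r p(r|s,a) r
        future = sum(prob * v[s_next] for s_next, prob in p_s[(s, a)].items())  # sum_s' p(s'|s,a) v(s')
        total += pi_as * (immediate + gamma * future)
    return total

# Hand-computed values for this toy MDP: v(s2) = 1/(1-gamma) = 10, v(s1) = gamma * 10 = 9.
v_pi = {"s1": 9.0, "s2": 10.0}
for s in v_pi:
    assert abs(v_pi[s] - bellman_rhs(v_pi, s)) < 1e-9  # v_pi satisfies the equation at every state
```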

Exercise

(Figure of the example states and policy omitted.)

  • write out the Bellman equations for each state
  • solve the state values

Solve the above equations from the last one back to the first.
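Since the exercise figure is not reproduced above, here is a hypothetical instance of this back-substitution: assume a deterministic policy that moves along a chain $s_1 \to s_2 \to s_3 \to s_4$ and then stays at $s_4$, with immediate rewards $0, 1, 1, 1$ respectively. Solving from $s_4$ backwards:

$$
\begin{aligned}
v_\pi(s_4) &= 1 + \gamma v_\pi(s_4) &&\Rightarrow\ v_\pi(s_4) = \frac{1}{1-\gamma} \\
v_\pi(s_3) &= 1 + \gamma v_\pi(s_4) &&= 1 + \frac{\gamma}{1-\gamma} = \frac{1}{1-\gamma} \\
v_\pi(s_2) &= 1 + \gamma v_\pi(s_3) &&= \frac{1}{1-\gamma} \\
v_\pi(s_1) &= 0 + \gamma v_\pi(s_2) &&= \frac{\gamma}{1-\gamma}
\end{aligned}
$$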

Matrix-vector form

Recall that:

$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$

This can be rewritten as:

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s'),$$

where

$$r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$

  • the second term of this rewritten formula recalls the third line of the derivation above, $\sum_{s'} v_\pi(s')\, p(s' \mid s)$

The Bellman equation is easier to remember when decomposed into these three formulas.

Matrix form:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

where $v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^T$ and $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^T$, and

  • $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$
    • also called the state transition matrix

Can its expression also be written in matrix form?
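As a minimal illustration (the model, policy, and numbers below are hypothetical, not from the notes), the following numpy sketch builds $r_\pi$ and $P_\pi$ from a small model and policy according to the definitions above:

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical model: expected reward r(s,a) = sum_r p(r|s,a) r, and transitions p(s'|s,a).
expected_r = np.array([[0.0, 1.0],
                       [0.5, 0.0],
                       [1.0, 1.0]])              # shape (n_states, n_actions)
p_trans = np.zeros((n_states, n_actions, n_states))
p_trans[0, 0] = [0.0, 1.0, 0.0]
p_trans[0, 1] = [0.0, 0.0, 1.0]
p_trans[1, 0] = [0.5, 0.5, 0.0]
p_trans[1, 1] = [0.0, 0.0, 1.0]
p_trans[2, 0] = [0.0, 0.0, 1.0]
p_trans[2, 1] = [0.0, 0.0, 1.0]

# Hypothetical policy pi(a|s), one row per state.
pi = np.array([[0.5, 0.5],
               [1.0, 0.0],
               [0.0, 1.0]])

# r_pi(s) = sum_a pi(a|s) * r(s,a)
r_pi = np.sum(pi * expected_r, axis=1)                       # shape (n_states,)

# [P_pi]_{ij} = sum_a pi(a|s_i) * p(s_j|s_i, a)
P_pi = np.einsum("sa,sat->st", pi, p_trans)                  # shape (n_states, n_states)

assert np.allclose(P_pi.sum(axis=1), 1.0)                    # each row is a distribution over s'
```

With $r_\pi$ and $P_\pi$ in hand, the matrix-vector Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ can be solved as described in the next subsection.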

Solve the state values

Why solve state values

  • Given a policy, finding the corresponding state values is called policy evaluation. It is a fundamental problem in RL and the foundation for finding better policies.
  • It is important to understand how to solve the Bellman equation.
  • The closed-form solution: $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
  • The iterative solution to approximate the fixed point: $v_{k+1} = r_\pi + \gamma P_\pi v_k$, with $v_k \to v_\pi$ as $k \to \infty$ (see the sketch at the end of this section)

For details, see the Proof.
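As an illustration of both solution methods (the chain MDP below is hypothetical), here is a numpy sketch that computes $v_\pi$ in closed form and by fixed-point iteration and checks that the two agree:

```python
import numpy as np

gamma = 0.9
# Hypothetical P_pi and r_pi: a chain s1 -> s2 -> s3, with s3 absorbing.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 1.0, 1.0])

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k, converging to v_pi as k -> infinity
v = np.zeros(3)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)                      # [9., 10., 10.] for this hypothetical chain
assert np.allclose(v, v_closed)      # both methods yield the same state values
```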