Action value

  • State value: the average return the agent can get starting from a state
  • Action value: the average return the agent can get starting from a state and taking an action

Definition:

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

  • $q_\pi(s, a)$ is a function of the state-action pair $(s, a)$
  • the function depends on the policy $\pi$

Recall that:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

Hence,

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

We can also expand $q_\pi(s, a)$. Referring back to the derivation of the state value, the Bellman equation reads

$$v_\pi(s) = \sum_a \pi(a \mid s)\Big[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\Big],$$

so the bracketed term is exactly the action value:

$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$
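The two relations can be checked numerically. Below is a minimal sketch on a made-up 2-state, 2-action MDP (all names and numbers are hypothetical, chosen only for illustration): action 0 stays in place with reward 0, action 1 jumps to the other state with reward 1, and the policy is uniform.

```python
gamma = 0.9
pi = [[0.5, 0.5], [0.5, 0.5]]          # pi[s][a]: uniform policy

def step(s, a):
    """Deterministic toy model: returns (reward, next_state)."""
    return (1.0, 1 - s) if a == 1 else (0.0, s)

# Solve for v_pi by iterating the Bellman equation to convergence.
v = [0.0, 0.0]
for _ in range(1000):
    v = [sum(pi[s][a] * (step(s, a)[0] + gamma * v[step(s, a)[1]])
             for a in (0, 1)) for s in (0, 1)]

# Action values from state values: q(s,a) = r(s,a) + gamma * v(s').
q = [[step(s, a)[0] + gamma * v[step(s, a)[1]] for a in (0, 1)]
     for s in (0, 1)]

# v is the policy-weighted mix of q: v(s) = sum_a pi(a|s) q(s,a).
for s in (0, 1):
    assert abs(sum(pi[s][a] * q[s][a] for a in (0, 1)) - v[s]) < 1e-6

print(v, q)   # v converges to 5.0 in both states
```

Here $v = 0.5/(1-\gamma) = 5$, giving $q(s,0) = 0 + 0.9 \cdot 5 = 4.5$ and $q(s,1) = 1 + 0.9 \cdot 5 = 5.5$, whose policy-weighted average is again 5.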

Matrix form

Less is more: everything about action values condenses into six matrix formulas. Let $\Pi$ be the policy matrix, $P$ the transition matrix, and $r$ the vector of expected immediate rewards. The two elementwise relations above become

$$v_\pi = \Pi\, q_\pi, \qquad q_\pi = r + \gamma P v_\pi$$

All in one: substituting each relation into the other gives the Bellman equation in $v$ and in $q$, together with their closed-form solutions:

$$v_\pi = \Pi r + \gamma\, \Pi P\, v_\pi, \qquad q_\pi = r + \gamma\, P \Pi\, q_\pi$$

$$v_\pi = (I - \gamma\, \Pi P)^{-1}\, \Pi r, \qquad q_\pi = (I - \gamma\, P \Pi)^{-1}\, r$$
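The matrix relations can be verified numerically. This is a sketch with a made-up MDP (2 states, 2 actions; the same toy dynamics as before, with $\Pi$ as the $|S| \times |S||A|$ policy matrix and $P$ as the $|S||A| \times |S|$ transition matrix — the layout is one possible convention, not the only one):

```python
import numpy as np

gamma = 0.9
# Rows of P are indexed by (s, a) pairs: (0,0), (0,1), (1,0), (1,1).
P = np.array([[1.0, 0.0],   # (s=0, a=0): stay in state 0
              [0.0, 1.0],   # (s=0, a=1): move to state 1
              [0.0, 1.0],   # (s=1, a=0): stay in state 1
              [1.0, 0.0]])  # (s=1, a=1): move to state 0
r = np.array([0.0, 1.0, 0.0, 1.0])      # expected immediate reward per (s, a)
Pi = np.array([[0.5, 0.5, 0.0, 0.0],    # pi(a|s=0)
               [0.0, 0.0, 0.5, 0.5]])   # pi(a|s=1)

# Closed-form solutions of the two Bellman equations.
v = np.linalg.solve(np.eye(2) - gamma * Pi @ P, Pi @ r)
q = np.linalg.solve(np.eye(4) - gamma * P @ Pi, r)

assert np.allclose(v, Pi @ q)              # v = Pi q
assert np.allclose(q, r + gamma * P @ v)   # q = r + gamma P v
print(v, q)
```

Solving either closed form independently and confirming the two cross-relations hold is a quick consistency check that all six formulas describe the same fixed point.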

Two sides of the coin: state value and action value determine each other. Given $q_\pi$, averaging over the policy yields $v_\pi$; given $v_\pi$, a one-step look-ahead through the model yields $q_\pi$:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) \quad\longleftrightarrow\quad q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Three basic formulas obtained during the derivation of the state value:

$$G_t = R_{t+1} + \gamma\, G_{t+1}$$

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$$

Some interesting observations

A final interesting observation: the action value and the policy have exactly the same shape ($|S| \times |A|$). In other words, what we are really doing is aggregating the value of each action from the action-value table, weighted by the policy's probabilities, to obtain the value of the current state. This is very intuitive, and it also elegantly puts the action probabilities (the policy) and the action values in one-to-one correspondence. It essentially resolves a confusion I had when first learning reinforcement learning about the relationship between the Q-table and the policy. Of course, one can also directly pick the action with the largest action value, which yields the most sensible policy for the current state; multiplying that policy with $q$ then also produces the largest state value — the best of both worlds.
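The "pick the best action" idea can be sketched as extracting a greedy policy from an action-value table (the q values below are made up for illustration; note that greedy selection on a fixed $q_\pi$ is one step of policy improvement — the q-table would have to be recomputed under the new policy to continue):

```python
# Hypothetical action-value table q[s][a] for 2 states, 2 actions.
q = [[4.5, 5.5],
     [4.5, 5.5]]

# Greedy policy: put all probability on the action with the largest q-value.
pi = []
for qs in q:
    best = max(range(len(qs)), key=lambda a: qs[a])
    pi.append([1.0 if a == best else 0.0 for a in range(len(qs))])

# The resulting state value is the policy-weighted mix of q, which here
# equals max_a q(s, a) because the policy is deterministic.
v = [sum(p * qa for p, qa in zip(pi[s], q[s])) for s in range(len(q))]
assert v[0] == max(q[0])
print(pi, v)
```

The policy and the q-table share one row per state and one column per action, which is exactly the shape correspondence discussed above.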