Action value

  • State value: the average return the agent can get starting from a state
  • Action value: the average return the agent can get starting from a state and taking an action

Definition:

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

  • $q_\pi(s, a)$ is a function of the state-action pair $(s, a)$
  • the function depends on the policy $\pi$

Recall that:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

Hence,

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

We can also expand $q_\pi(s, a)$. Referring back to the derivation of the state value, the Bellman equation reads

$$v_\pi(s) = \sum_a \pi(a \mid s)\Big[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\Big],$$

so the bracketed term is exactly the action value:

$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$
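The two relations can be checked numerically. Below is a minimal sketch on a made-up 2-state, 2-action MDP (all names and numbers are hypothetical, chosen only for illustration): action 0 stays in place with reward 0, action 1 jumps to the other state with reward 1, and the policy is uniform.

```python
gamma = 0.9
pi = [[0.5, 0.5], [0.5, 0.5]]          # pi[s][a]: uniform policy

def step(s, a):
    """Deterministic toy model: returns (reward, next_state)."""
    return (1.0, 1 - s) if a == 1 else (0.0, s)

# Solve for v_pi by iterating the Bellman equation to convergence.
v = [0.0, 0.0]
for _ in range(1000):
    v = [sum(pi[s][a] * (step(s, a)[0] + gamma * v[step(s, a)[1]])
             for a in (0, 1)) for s in (0, 1)]

# Action values from state values: q(s,a) = r(s,a) + gamma * v(s').
q = [[step(s, a)[0] + gamma * v[step(s, a)[1]] for a in (0, 1)]
     for s in (0, 1)]

# v is the policy-weighted mix of q: v(s) = sum_a pi(a|s) q(s,a).
for s in (0, 1):
    assert abs(sum(pi[s][a] * q[s][a] for a in (0, 1)) - v[s]) < 1e-6

print(v, q)   # v converges to 5.0 in both states
```

Here $v = 0.5/(1-\gamma) = 5$, giving $q(s,0) = 0 + 0.9 \cdot 5 = 4.5$ and $q(s,1) = 1 + 0.9 \cdot 5 = 5.5$, whose policy-weighted average is again 5.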

Matrix form

Less is more: everything about action values condenses into six matrix formulas. Let $\Pi$ be the policy matrix, $P$ the transition matrix, and $r$ the vector of expected immediate rewards. The two elementwise relations above become

$$v_\pi = \Pi\, q_\pi, \qquad q_\pi = r + \gamma P v_\pi$$

All in one: substituting each relation into the other gives the Bellman equation in $v$ and in $q$, together with their closed-form solutions:

$$v_\pi = \Pi r + \gamma\, \Pi P\, v_\pi, \qquad q_\pi = r + \gamma\, P \Pi\, q_\pi$$

$$v_\pi = (I - \gamma\, \Pi P)^{-1}\, \Pi r, \qquad q_\pi = (I - \gamma\, P \Pi)^{-1}\, r$$
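The matrix relations can be verified numerically. This is a sketch with a made-up MDP (2 states, 2 actions; the same toy dynamics as before, with $\Pi$ as the $|S| \times |S||A|$ policy matrix and $P$ as the $|S||A| \times |S|$ transition matrix — the layout is one possible convention, not the only one):

```python
import numpy as np

gamma = 0.9
# Rows of P are indexed by (s, a) pairs: (0,0), (0,1), (1,0), (1,1).
P = np.array([[1.0, 0.0],   # (s=0, a=0): stay in state 0
              [0.0, 1.0],   # (s=0, a=1): move to state 1
              [0.0, 1.0],   # (s=1, a=0): stay in state 1
              [1.0, 0.0]])  # (s=1, a=1): move to state 0
r = np.array([0.0, 1.0, 0.0, 1.0])      # expected immediate reward per (s, a)
Pi = np.array([[0.5, 0.5, 0.0, 0.0],    # pi(a|s=0)
               [0.0, 0.0, 0.5, 0.5]])   # pi(a|s=1)

# Closed-form solutions of the two Bellman equations.
v = np.linalg.solve(np.eye(2) - gamma * Pi @ P, Pi @ r)
q = np.linalg.solve(np.eye(4) - gamma * P @ Pi, r)

assert np.allclose(v, Pi @ q)              # v = Pi q
assert np.allclose(q, r + gamma * P @ v)   # q = r + gamma P v
print(v, q)
```

Solving either closed form independently and confirming the two cross-relations hold is a quick consistency check that all six formulas describe the same fixed point.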

Two sides of the coin: state value and action value determine each other. Given $q_\pi$, averaging over the policy yields $v_\pi$; given $v_\pi$, a one-step look-ahead through the model yields $q_\pi$:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) \quad\longleftrightarrow\quad q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Three basic formulas obtained during the derivation of the state value:

$$G_t = R_{t+1} + \gamma\, G_{t+1}$$

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$$

Some interesting observations

A final interesting observation: the action value and the policy have exactly the same shape ($|S| \times |A|$). In other words, what we are really doing is aggregating the value of each action from the action-value table, weighted by the policy's probabilities, to obtain the value of the current state. This is very intuitive, and it also elegantly puts the action probabilities (the policy) and the action values in one-to-one correspondence. It essentially resolves a confusion I had when first learning reinforcement learning about the relationship between the Q-table and the policy. Of course, one can also directly pick the action with the largest action value, which yields the most sensible policy for the current state; multiplying that policy with $q$ then also produces the largest state value — the best of both worlds.
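The "pick the best action" idea can be sketched as extracting a greedy policy from an action-value table (the q values below are made up for illustration; note that greedy selection on a fixed $q_\pi$ is one step of policy improvement — the q-table would have to be recomputed under the new policy to continue):

```python
# Hypothetical action-value table q[s][a] for 2 states, 2 actions.
q = [[4.5, 5.5],
     [4.5, 5.5]]

# Greedy policy: put all probability on the action with the largest q-value.
pi = []
for qs in q:
    best = max(range(len(qs)), key=lambda a: qs[a])
    pi.append([1.0 if a == best else 0.0 for a in range(len(qs))])

# The resulting state value is the policy-weighted mix of q, which here
# equals max_a q(s, a) because the policy is deterministic.
v = [sum(p * qa for p, qa in zip(pi[s], q[s])) for s in range(len(q))]
assert v[0] == max(q[0])
print(pi, v)
```

The policy and the q-table share one row per state and one column per action, which is exactly the shape correspondence discussed above.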