State value
Notations
Consider the following single-step process:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$
- $t, t+1$: discrete time instances
- $S_t$: state at time $t$
- $A_t$: the action taken at state $S_t$
- $R_{t+1}$: the reward obtained after taking $A_t$
- $S_{t+1}$: the state transited to after taking $A_t$
Why are $S$, $A$, and $R$ written in uppercase here?
Because they are all random variables in the sense of probability theory, which means we can take their expectations.
| Step | Governed Probability Distribution |
|---|---|
| $S_t \to A_t$ | $\pi(A_t = a \mid S_t = s)$ (the policy) |
| $S_t, A_t \to R_{t+1}$ | $p(R_{t+1} = r \mid S_t = s, A_t = a)$ (the reward model) |
| $S_t, A_t \to S_{t+1}$ | $p(S_{t+1} = s' \mid S_t = s, A_t = a)$ (the state transition model) |
We can write down these probabilities $p$ in the example above only because we have not left the model: we assume we know the environment model (model ⇔ probability distributions).
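To make "model ⇔ probability distributions" concrete, here is a minimal sketch that writes the three distributions as explicit probability tables and samples one step $(S_t, A_t, R_{t+1}, S_{t+1})$. Every state, action, reward, and probability in it is made up for illustration.

```python
import random

# Hypothetical environment model and policy for a tiny MDP; all states,
# actions, rewards, and probabilities below are made up for illustration.
policy = {"s1": {"left": 0.5, "right": 0.5}}                                    # pi(a | s)
reward_model = {("s1", "left"): {0: 1.0}, ("s1", "right"): {1: 1.0}}            # p(r | s, a)
transition_model = {("s1", "left"): {"s1": 1.0}, ("s1", "right"): {"s2": 1.0}}  # p(s' | s, a)

def sample(dist):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# One step of the process S_t --A_t--> R_{t+1}, S_{t+1}
s_t = "s1"
a_t = sample(policy[s_t])                        # A_t     ~ pi(. | S_t)
r_next = sample(reward_model[(s_t, a_t)])        # R_{t+1} ~ p(. | S_t, A_t)
s_next = sample(transition_model[(s_t, a_t)])    # S_{t+1} ~ p(. | S_t, A_t)
print(s_t, a_t, r_next, s_next)
```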
Consider the following multi-step process:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \dots$$
The discounted return:
- $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $\gamma \in (0, 1)$ is the discount rate.
- $G_t$ is also a random variable since the $R$'s are random variables.
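As a quick numerical check of the formula, the sketch below computes $G_t$ for a short, hypothetical reward sequence (the rewards and $\gamma$ are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence along one trajectory, with gamma = 0.9:
# G_t = 0 + 0.9*0 + 0.81*0 + 0.729*1 + 0.6561*1 = 1.3851
print(discounted_return([0, 0, 0, 1, 1], gamma=0.9))
```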
Definition: the expectation (or expected value, or mean value) of $G_t$ is defined as the state-value function or simply state value:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$
Remarks:
- It is a function of $s$. It is a conditional expectation with the condition that ==the state starts from $s$==.
- It is based on the policy $\pi$. For a different policy, the state value may be different.
- $\mathbb{E}[G_t \mid S_t = s]$ is the same as $\mathbb{E}_\pi[G_t \mid S_t = s]$; the subscript just makes the dependence on the policy $\pi$ explicit.
return & state value
In practice these distributions are usually not deterministic, so starting from one state there are often many possible trajectories; averaging the returns over all possible trajectories gives the state value. Of course, if only a single trajectory can ever be produced from a state, then the state value equals the return, but that requires all three distributions mentioned above to be deterministic.
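Here is a minimal sketch of that averaging idea under a hypothetical model with a stochastic policy at state `s1` (all names and numbers are made up): sample many trajectories starting from `s1`, compute each trajectory's discounted return, and average.

```python
import random

def sample(dist):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# Hypothetical model: only the policy at s1 is stochastic; everything else is deterministic.
policy = {"s1": {"a1": 0.5, "a2": 0.5}, "s2": {"a1": 1.0}, "s3": {"a1": 1.0}}
reward_model = {("s1", "a1"): {0: 1.0}, ("s1", "a2"): {-1: 1.0},
                ("s2", "a1"): {1: 1.0}, ("s3", "a1"): {1: 1.0}}
transition_model = {("s1", "a1"): {"s2": 1.0}, ("s1", "a2"): {"s3": 1.0},
                    ("s2", "a1"): {"s2": 1.0}, ("s3", "a1"): {"s3": 1.0}}

def sample_return(s, gamma, steps=50):
    """Sample one trajectory starting from s and compute its (truncated) discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(steps):
        a = sample(policy[s])
        g += discount * sample(reward_model[(s, a)])
        discount *= gamma
        s = sample(transition_model[(s, a)])
    return g

# Monte Carlo estimate of the state value v_pi(s1): the mean of the sampled returns.
n, gamma = 10_000, 0.9
v_s1 = sum(sample_return("s1", gamma) for _ in range(n)) / n
print(v_s1)
```

If the policy at `s1` were deterministic as well, every sampled return would be identical and the average would simply reproduce that single return, which is exactly the state value = return case described above.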
Example
A brief note
In all three figures above the environment is deterministic; policy 1 and policy 2 are also deterministic, and only policy 3 is stochastic. Colors distinguish each trajectory and the computation of its discounted return. In the end, this again shows that policy 1 is the best, policy 3 is second, and policy 2 is the worst, which matches our intuition from the figures.
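The figures themselves are not reproduced here, but the pattern of the stochastic-policy computation is worth spelling out. Suppose (hypothetically, not with the figures' actual numbers) that from state $s$ the stochastic policy produces one trajectory with probability $0.5$ and reward sequence $1, 1, 1, \dots$, and another trajectory with probability $0.5$ and reward sequence $-1, 1, 1, \dots$; the state value is then the probability-weighted average of the two returns:
$$
v_{\pi}(s) = 0.5\underbrace{\left(1 + \gamma + \gamma^2 + \cdots\right)}_{\text{return of trajectory 1}} + 0.5\underbrace{\left(-1 + \gamma + \gamma^2 + \cdots\right)}_{\text{return of trajectory 2}} = \frac{\gamma}{1-\gamma}.
$$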