Example of Value Function

From table to function

So far, all state values and action values have been represented by tables. For example, state values:

State	$s_{1}$	$s_{2}$	$\dots$	$s_{n}$
Value	$v_{π} (s_{1})$	$v_{π} (s_{2})$	$\dots$	$v_{π} (s_{n})$

Another example, action values:

State/Action	$a_{1}$	$a_{2}$	$a_{3}$	$\dots$	$a_{m}$
$s_{1}$	$q_{π} (s_{1}, a_{1})$	$q_{π} (s_{1}, a_{2})$	$q_{π} (s_{1}, a_{3})$	$\dots$	$q_{π} (s_{1}, a_{m})$
$⋮$	$⋮$	$⋮$	$⋮$	$⋱$	$⋮$
$s_{n}$	$q_{π} (s_{n}, a_{1})$	$q_{π} (s_{n}, a_{2})$	$q_{π} (s_{n}, a_{3})$	$\dots$	$q_{π} (s_{n}, a_{m})$
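As a rough sketch of what the two tables look like in code, the snippet below stores them as plain Python lists. The sizes and the values written into the tables are hypothetical and chosen only for illustration.

```python
# Hypothetical sizes, chosen only for illustration.
n_states, n_actions = 4, 3

# State-value table: v_table[i] stands for v_pi(s_{i+1}).
v_table = [0.0] * n_states

# Action-value table: one row per state, one column per action,
# so q_table[i][j] stands for q_pi(s_{i+1}, a_{j+1}).
q_table = [[0.0] * n_actions for _ in range(n_states)]

# Reading and writing are plain index operations.
v_table[2] = 1.5
q_table[2][1] = 0.7

print("stored state values: ", len(v_table))
print("stored action values:", len(q_table) * len(q_table[0]))
```

The number of stored entries grows with the number of states and actions, which is the storage issue discussed next.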

Pros and cons

Advantage: intuitive and easy to analyze
Disadvantage: hard to handle large or continuous state or action spaces

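Tabular methods also face two concrete problems:

storage: when the state space and action space are continuous, we would need to store infinitely many values (infinitely many rows and columns).

generalization: because the table is discrete, updating one state-action pair does not generalize to unseen states or actions.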

Sutton and Barto's book opens Part II, Approximate Solution Methods, with an example: the number of possible camera images is much larger than the number of atoms in the universe. This is why we need approximate solution methods.
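To get a feel for the scale in the camera-image example, the count can be estimated with a few lines of arithmetic. The image format (100 × 100 pixels, 256 gray levels) is an assumption made here for illustration, and 10^80 is a commonly quoted rough estimate of the number of atoms in the observable universe, not a figure taken from the book.

```python
import math

# Assumed image format (not from the book): 100 x 100 pixels, 256 gray levels.
width, height, levels = 100, 100, 256

# The number of distinct images is levels ** (width * height).
# Work with its base-10 exponent instead of building the huge integer.
log10_num_images = width * height * math.log10(levels)

# Commonly quoted rough estimate: about 10^80 atoms in the observable universe.
log10_num_atoms = 80

print(f"possible images ~ 10^{log10_num_images:.0f}")
print(f"atoms           ~ 10^{log10_num_atoms}")
print("more images than atoms:", log10_num_images > log10_num_atoms)
```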

An example

Consider an example:

There are $n$ states: $s_{1}, s_{2}, \dots, s_{n}$
The state values are $v_{π} (s_{1}), v_{π} (s_{2}), \dots, v_{π} (s_{n})$ , where $π$ is a given policy
$n$ is very large (or even infinite)
We want a simple curve to approximate these values (linear, polynomial basis, or Fourier basis?)


Use a simple straight line to fit the dots:

$$\hat{v} (s, w) = a s + b = \underbrace{[s, 1]}_{ϕ^{T} (s)} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{w} = ϕ^{T} (s) w$$

$w$ is the parameter vector
$ϕ (s)$ is the feature vector of $s$
$\hat{v} (s, w)$ is linear in $w$
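The straight-line approximator can be sketched in a few lines. The function names `phi` and `v_hat` and the numeric values of `a` and `b` are hypothetical; the state is treated as a scalar index, as in the formula above.

```python
def phi(s):
    """Feature vector [s, 1] of a scalar state index s."""
    return [float(s), 1.0]

def v_hat(s, w):
    """Linear value estimate phi(s)^T w."""
    return sum(f * wi for f, wi in zip(phi(s), w))

a, b = -0.5, 3.0   # hypothetical slope and intercept
w = [a, b]         # parameter vector

s = 4
print(v_hat(s, w), a * s + b)  # both expressions compute the same line
```

Because `v_hat` is a dot product with `w`, scaling or adding parameter vectors scales or adds the estimates, which is what "linear in $w$" means here.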

Difference between the tabular and function methods

How to retrieve the value of a state?

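This question is mainly about the representation and the role of $ϕ (s)$:
When the values are represented by a table, we can read the value directly using the state index $s$
When the values are represented by a function, we need to input the state index $s$, compute $ϕ (s)$, and then calculate the value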


For example, $s \to ϕ (s) \to ϕ^{T} (s) w = \hat{v} (s, w)$

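This solves the storage problem: instead of storing the value of every state in $S$ (possibly infinitely many), we only need to store a lower-dimensional parameter vector $w$.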
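A small sketch of the retrieval difference, under stated assumptions: the "true" state values below are made up for illustration, and the line is fitted with ordinary least squares, which is one possible fitting choice and not something these notes prescribe.

```python
import math

n = 1000  # hypothetical number of states, indexed 0..n-1

# Made-up state values: a downward trend plus a small wiggle.
v_table = [10.0 - 0.01 * s + 0.1 * math.sin(s) for s in range(n)]

# Ordinary least-squares fit of the line a*s + b (closed form for one input).
mean_s = sum(range(n)) / n
mean_v = sum(v_table) / n
cov_sv = sum((s - mean_s) * (v - mean_v) for s, v in zip(range(n), v_table))
var_s = sum((s - mean_s) ** 2 for s in range(n))
a = cov_sv / var_s
b = mean_v - a * mean_s
w = [a, b]

def v_hat(s, w):
    """Compute phi(s)^T w with phi(s) = [s, 1]."""
    return s * w[0] + 1.0 * w[1]

s = 123
print("table lookup:  ", v_table[s])    # exact, needs n stored numbers
print("function value:", v_hat(s, w))   # approximate, needs len(w) numbers
print("stored numbers:", len(v_table), "vs", len(w))
```

The table returns the stored value exactly, while the function returns a nearby value computed from two parameters.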

How to update the value of a state?

tabular: directly, rewrite the value in the table
function: indirectly, update the parameter vector $w$


This solves the generalization problem: we can generalize to unseen states or actions, because $\hat{v} (s, w)$ is updated by changing $w$ ⇒ the values of neighboring states also change.
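The contrast between the two kinds of update can be sketched as follows. The update rule for $w$ is an assumption made for this illustration (a single gradient step on the squared error between a target and the current estimate); the state, target, and step size are hypothetical.

```python
def phi(s):
    """Feature vector [s, 1] of a scalar state index s."""
    return [float(s), 1.0]

def v_hat(s, w):
    """Linear value estimate phi(s)^T w."""
    return sum(f * wi for f, wi in zip(phi(s), w))

# Tabular update: overwrite one entry, every other entry is untouched.
v_table = [0.0] * 10
v_table[5] = 1.0

# Function update (assumed rule): one gradient step on
# 0.5 * (target - v_hat(s, w)) ** 2, i.e. w <- w + alpha * error * phi(s).
w_old = [0.0, 0.0]
s, target, alpha = 5, 1.0, 0.01   # hypothetical state, target, step size
error = target - v_hat(s, w_old)
w_new = [wi + alpha * error * f for wi, f in zip(w_old, phi(s))]

print("table, neighbor s=6:       ", v_table[6])
print("function, s=5 before/after:", v_hat(5, w_old), v_hat(5, w_new))
print("function, s=6 before/after:", v_hat(6, w_old), v_hat(6, w_new))
```

Only state 5 was visited, yet the estimate for state 6 moves too, because both estimates are computed from the same `w`.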

Every coin has two sides

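Using a function also brings some problems:
We cannot represent all state values exactly
That is why the method is called function approximation
One way to reduce the approximation error is to use high-order (e.g. polynomial) or non-linear functions (e.g. neural networks)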

High-order curves:

$$\hat{v} (s, w) = a s^{2} + b s + c = \underbrace{[s^{2}, s, 1]}_{ϕ^{T} (s)} \underbrace{\begin{bmatrix} a \\ b \\ c \end{bmatrix}}_{w} = ϕ^{T} (s) w$$

The dimensions of $ϕ (s)$ and $w$ are increased
$\hat{v} (s, w)$ is still linear in $w$ ; the nonlinearity in $s$ is contained in $ϕ (s)$
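A minimal sketch of the polynomial case, assuming a scalar state. The helper name `poly_features` and the parameter values are hypothetical.

```python
def poly_features(s, degree):
    """Feature vector [s^degree, ..., s, 1] of a scalar state s."""
    return [float(s) ** k for k in range(degree, -1, -1)]

def v_hat(s, w):
    """Estimate phi(s)^T w; the polynomial degree is inferred from len(w)."""
    return sum(f * wi for f, wi in zip(poly_features(s, len(w) - 1), w))

w = [0.5, -1.0, 2.0]  # hypothetical a, b, c

s = 3
print(poly_features(s, 2))              # the features are nonlinear in s
print(v_hat(s, w))                      # a*s^2 + b*s + c
print(v_hat(s, [2 * wi for wi in w]))   # doubling w doubles the estimate
print(v_hat(2 * s, w))                  # doubling s does not
```

Raising the degree only lengthens the feature vector and `w`; the estimate stays a dot product, so it remains linear in `w`.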

An interesting observation

In a neural network, we have

$Y = σ (W X + b)$

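The pre-activation $W X + b$ is linear in $W$ ; the nonlinearity comes from $σ$

This is exactly the reverse of the order used here. Compare:

Basis-function method: first apply a nonlinearity to obtain each feature ⇒ then combine the features linearly (once; this could be repeated, but it would look odd)

Neural-network method: first combine the features linearly ⇒ then apply a nonlinear activation ⇒ combine linearly again … (repeated multiple times)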
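The two orderings can be placed side by side in a short sketch. Everything here is illustrative: the polynomial features, the single hidden layer, the choice of `tanh` as the activation $σ$, and all weights are assumptions, not part of these notes.

```python
import math

def basis_model(s, w):
    """Nonlinear features first (fixed in advance), then one linear combination."""
    features = [s ** 2, s, 1.0]                        # nonlinear step
    return sum(f * wi for f, wi in zip(features, w))   # linear step

def tiny_network(s, W1, b1, w2, b2):
    """Linear map, then nonlinear activation, then another linear map."""
    hidden = [math.tanh(wi * s + bi) for wi, bi in zip(W1, b1)]  # linear, then tanh
    return sum(h * wi for h, wi in zip(hidden, w2)) + b2         # linear again

# Hypothetical parameters.
print(basis_model(2.0, [1.0, 0.0, 0.0]))
print(tiny_network(0.5, [1.0, 2.0], [0.0, 0.0], [1.0, 1.0], 0.5))
```

In the basis-function model only `w` would be learned, whereas in the network the weights inside the nonlinearity (`W1`, `b1`) would be learned as well.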

Summary

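Idea: use parameterized functions to approximate state values and action values
Key difference: how the value $v (s)$ is retrieved and updated
Advantages:
1. Storage: the dimension of $w$ can be much smaller than $|S|$, the number of states
2. Generalization: when a state $s$ is visited, the parameter $w$ is updated, so the values of other, unvisited states are updated as well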
