State value estimation with value function approximation

Objective function

  • $v_\pi(s)$: the true state value of state $s$ under policy $\pi$
  • $\hat v(s, w)$: the estimated state value, where $w$ is a parameter vector
  • goal: find an optimal $w$ so that $\hat v(s, w)$ can best approximate $v_\pi(s)$ for every $s$
  • this is a policy evaluation problem; later we will extend it to policy improvement

To find the optimal $w$, we need two steps:

  1. define an objective function
  2. derive algorithms for optimizing the objective function

The objective function

  • goal: find the best $w$ that can minimize
    $$J(w) = \mathbb{E}\big[\big(v_\pi(S) - \hat v(S, w)\big)^2\big]$$
  • the expectation is with respect to the random variable $S \in \mathcal{S}$

How do we compute the expectation above?

The expectation can also be written as $\sum_{s \in \mathcal{S}} p(s)\,\big(v_\pi(s) - \hat v(s, w)\big)^2$, so to evaluate it we must first know the distribution of the random variable $S$. Previously, we computed expectations by Monte Carlo simulation with a large number of samples; here, simulating directly to compute the expectation is not very practical. As a compromise, we can instead obtain the distribution of $S$ (possibly by simulation), and once the distribution is known the expectation becomes a probability-weighted average.

What is the probability distribution of $S$?

  • This is new; in the tabular method, we did not need to consider this.
  • There are several ways to define the probability distribution of $S$:
    • uniform distribution: each state has probability $1/|\mathcal{S}|$
    • stationary distribution: $\{d_\pi(s)\}_{s \in \mathcal{S}}$, where $\pi$ is a given policy

Uniform distribution

  • treat all the states as equally important
  • the objective function becomes
    $$J(w) = \frac{1}{|\mathcal{S}|}\sum_{s \in \mathcal{S}} \big(v_\pi(s) - \hat v(s, w)\big)^2$$
  • drawback: the states may not be equally important
    • some states may be rarely visited by a policy
    • this choice does not consider the real dynamics of the Markov process under the given policy

Stationary distribution

  • a stationary distribution describes the long-run behavior of a Markov process
  • it satisfies $d_\pi^T P_\pi = d_\pi^T$, i.e., it is invariant under the state-transition matrix $P_\pi$ induced by the policy
  • let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$
  • the objective function becomes a weighted squared error:
    $$J(w) = \sum_{s \in \mathcal{S}} d_\pi(s)\,\big(v_\pi(s) - \hat v(s, w)\big)^2$$
  • more frequently visited states have higher values of $d_\pi(s)$
    • their weights in the objective function are also higher than those of rarely visited states
  • for more detail, see stationary distribution; a small sketch for estimating $d_\pi$ empirically follows this list
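A rough sketch of estimating $d_\pi$ by counting state visits along one long trajectory, and of evaluating the weighted squared error. The helper `sample_next_state` and the value arrays are hypothetical names introduced here for illustration, not part of the original notes.

```python
import numpy as np

def estimate_stationary_distribution(sample_next_state, n_states, s0=0, n_steps=100_000):
    """Estimate d_pi by counting state visits along one long trajectory.

    sample_next_state(s) is assumed to sample the next state under the given
    policy pi and the environment dynamics (a hypothetical helper).
    """
    counts = np.zeros(n_states)
    s = s0
    for _ in range(n_steps):
        counts[s] += 1
        s = sample_next_state(s)
    return counts / counts.sum()

def weighted_squared_error(v_true, v_hat_values, d):
    """J(w) = sum_s d(s) * (v_pi(s) - v_hat(s, w))^2."""
    return float(np.sum(d * (v_true - v_hat_values) ** 2))

# Uniform weighting treats all states equally:
#   d_uniform = np.ones(n_states) / n_states
# Stationary weighting emphasizes states that the policy actually visits often:
#   d_pi = estimate_stationary_distribution(sample_next_state, n_states)
```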

Optimization algorithms

  • use a gradient-descent algorithm to minimize the objective function:
    $$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)$$
  • the true gradient involves the calculation of an expectation:
    $$\nabla_w J(w) = \mathbb{E}\big[\nabla_w \big(v_\pi(S) - \hat v(S, w)\big)^2\big] = -2\,\mathbb{E}\big[\big(v_\pi(S) - \hat v(S, w)\big)\nabla_w \hat v(S, w)\big]$$

Then we have

$$w_{k+1} = w_k + 2\alpha_k\,\mathbb{E}\big[\big(v_\pi(S) - \hat v(S, w_k)\big)\nabla_w \hat v(S, w_k)\big]$$

Use the stochastic gradient to replace the true gradient (the coefficient $2$ merges into $\alpha_t$):

$$w_{t+1} = w_t + \alpha_t \big(v_\pi(s_t) - \hat v(s_t, w_t)\big)\nabla_w \hat v(s_t, w_t)$$

  • we don’t know the true state value $v_\pi(s_t)$ here; solution:
    • replace $v_\pi(s_t)$ by an approximate value, following the ideas of MC and TD learning
    • use MC: $v_\pi(s_t)$ is replaced by the discounted return $g_t$ of the episode starting from $s_t$
    • use TD: $v_\pi(s_t)$ is replaced by the TD target $r_{t+1} + \gamma \hat v(s_{t+1}, w_t)$, which yields
      $$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big]\nabla_w \hat v(s_t, w_t)$$

Implementation

Note that this only estimates the state values of a given policy, i.e., it performs policy evaluation.
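A minimal sketch of such an implementation, assuming transitions are sampled by following the given policy and the approximator exposes its value and gradient as callables; `sample_episode`, `v_hat`, and `grad_v_hat` are hypothetical names introduced here.

```python
import numpy as np

def td_policy_evaluation(sample_episode, v_hat, grad_v_hat, w0,
                         alpha=0.01, gamma=0.9, num_episodes=500):
    """Semi-gradient TD(0): w <- w + alpha * [r + gamma*v_hat(s') - v_hat(s)] * grad v_hat(s).

    sample_episode() is assumed to yield transitions (s, r, s_next, done)
    generated by following the given policy pi (a hypothetical helper).
    """
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(num_episodes):
        for s, r, s_next, done in sample_episode():
            target = r + (0.0 if done else gamma * v_hat(s_next, w))
            w = w + alpha * (target - v_hat(s, w)) * grad_v_hat(s, w)
    return w
```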

Selection of function approximators

How to select the function $\hat v(s, w)$?

  • use a linear function approximator
    • $\hat v(s, w) = \phi^T(s)\, w$ (linear in the parameter $w$)
    • $\phi(s)$: feature vector (e.g., polynomial basis, Fourier basis) of state $s$
    • widely used in the past
  • use a neural network as a nonlinear function approximator
    • input: state
    • output: estimated state value
    • parameters: weights and biases of the neural network

Linear function approximator

Linear function approximator: $\hat v(s, w) = \phi^T(s)\, w$, then we have

$$\nabla_w \hat v(s, w) = \phi(s)$$

Substituting this gradient into the TD algorithm

$$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big]\nabla_w \hat v(s_t, w_t)$$

yields

$$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \phi^T(s_{t+1})\, w_t - \phi^T(s_t)\, w_t\big]\phi(s_t),$$

which is the algorithm of TD learning with linear function approximation (TD-Linear).
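A compact sketch of the TD-Linear update in NumPy. The feature map `phi` and the transition stream are placeholders introduced here for illustration.

```python
import numpy as np

def td_linear(transitions, phi, dim, alpha=0.005, gamma=0.9):
    """TD-Linear: w <- w + alpha * [r + gamma*phi(s')^T w - phi(s)^T w] * phi(s).

    transitions: iterable of (s, r, s_next) tuples generated by the given policy.
    phi: feature map, phi(s) -> np.ndarray of shape (dim,).
    """
    w = np.zeros(dim)
    for s, r, s_next in transitions:
        td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
        w = w + alpha * td_error * phi(s)
    return w
```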

Pros and cons

  • drawback: it is difficult to choose appropriate feature vectors (the feature function is hard to design)
  • advantage: better theoretical properties; linear function approximation is easier to analyze and interpret than nonlinear approximation
    • the tabular representation is a special case of the linear function representation

Tabular representation (special case)

Show that tabular representation is a special case of linear function representation

  • Consider a special feature vector for state $s$: $\phi(s) = e_s \in \mathbb{R}^{|\mathcal{S}|}$ (one-hot encoding), where $e_s$ is a vector whose $s$-th entry is 1 and all other entries are 0
  • In this case, $\hat v(s, w) = e_s^T w = w(s)$, where $w(s)$ is the $s$-th entry of $w$
  • Recall that the TD-Linear algorithm is
    $$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \phi^T(s_{t+1})\, w_t - \phi^T(s_t)\, w_t\big]\phi(s_t)$$
    When $\phi(s_t) = e_{s_t}$, the above algorithm becomes
    $$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t)\big]e_{s_t}$$
    This is a vector equation that merely updates the $s_t$-th entry of $w_t$
  • Multiplying $e_{s_t}^T$ on both sides of the equation gives
    $$w_{t+1}(s_t) = w_t(s_t) + \alpha_t \big[r_{t+1} + \gamma w_t(s_{t+1}) - w_t(s_t)\big],$$
    which is exactly the tabular TD algorithm (called TD-Table here)

Summary: TD-Linear becomes TD-Table if we select a special feature vector (one-hot encoding)
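A quick numerical check of this equivalence on a made-up three-state stream of transitions (an illustrative assumption): TD-Linear with one-hot features and the tabular TD update produce identical estimates on the same data.

```python
import numpy as np

n_states, gamma, alpha = 3, 0.9, 0.1
one_hot = lambda s: np.eye(n_states)[s]          # phi(s) = e_s

# a made-up stream of (s, r, s_next) transitions, for illustration only
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 0)] * 100

w = np.zeros(n_states)   # TD-Linear parameter vector
v = np.zeros(n_states)   # TD-Table: one value per state

for s, r, s_next in transitions:
    phi_s, phi_next = one_hot(s), one_hot(s_next)
    w = w + alpha * (r + gamma * phi_next @ w - phi_s @ w) * phi_s
    v[s] = v[s] + alpha * (r + gamma * v[s_next] - v[s])

assert np.allclose(w, v)   # the two algorithms coincide
```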

Illustrative examples

Consider a 5x5 grid-world example:

[Figure: the 5x5 grid world]

  • Given a policy that takes each action with equal probability, i.e., $\pi(a|s) = 0.2$ for any $s$ and $a$
  • The rewards are $r_{\text{boundary}} = -1$, $r_{\text{forbidden}} = -1$, and $r_{\text{target}} = 1$; the discount rate is $\gamma = 0.9$
  • The goal is to estimate the state values of this policy
    • There are 25 state values in total
    • TD-Table: 25 parameters
    • TD-Linear: fewer than 25 parameters

Using the policy evaluation methods introduced earlier, we can first obtain the ground-truth state values.
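One way to compute the ground truth, assuming the state-transition matrix $P_\pi$ and reward vector $r_\pi$ under the policy have been assembled from the grid-world model, is to solve the Bellman equation in closed form.

```python
import numpy as np

def closed_form_state_values(P_pi, r_pi, gamma=0.9):
    """Solve v_pi = r_pi + gamma * P_pi @ v_pi  =>  v_pi = (I - gamma*P_pi)^{-1} r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# v_true = closed_form_state_values(P_pi, r_pi)   # 25 ground-truth state values
```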

[Figures: ground-truth state values, 2D and 3D views]

Below, we use TD-Table and TD-Linear, respectively, to estimate these state values, where:

  • 500 episodes were generated following the uniform policy
  • each episode has 500 steps and starts from a randomly selected state-action pair (following a uniform distribution)
  • TD-Table: updates one entry of the value table per step, using a constant learning rate $\alpha$
  • TD-Linear: uses a feature vector $\phi(s)$ and a constant learning rate $\alpha$; one possible feature construction is sketched below
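One possible feature construction for the 5x5 grid (an assumption for illustration, not necessarily the exact basis used in the original experiment): map the normalized cell coordinates through a low-order polynomial basis, so TD-Linear needs only 3 or 6 parameters instead of 25.

```python
import numpy as np

def phi_linear(s, n=5):
    """Linear-in-position features for a grid state s = (row, col): phi(s) = [1, x, y]^T."""
    x, y = s[0] / (n - 1), s[1] / (n - 1)     # normalize coordinates to [0, 1]
    return np.array([1.0, x, y])

def phi_quadratic(s, n=5):
    """A richer polynomial basis: [1, x, y, x^2, y^2, xy]^T (6 parameters)."""
    x, y = s[0] / (n - 1), s[1] / (n - 1)
    return np.array([1.0, x, y, x * x, y * y, x * y])

# w = td_linear(transitions, phi_linear, dim=3)   # reuse the TD-Linear sketch above
# v_hat = np.array([phi_linear((i, j)) @ w for i in range(5) for j in range(5)])
```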

The results of my own implementation are shown below:

[Figures: state values estimated by TD-Table and by TD-Linear]

RMSE comparison:

[Figure: RMSE comparison of TD-Table and TD-Linear]

More Details

Summary of the story

  • objective function: $J(w) = \mathbb{E}\big[\big(v_\pi(S) - \hat v(S, w)\big)^2\big]$
  • gradient-descent algorithm: $w_{t+1} = w_t + \alpha_t \big(v_\pi(s_t) - \hat v(s_t, w_t)\big)\nabla_w \hat v(s_t, w_t)$
  • the true value function $v_\pi(s_t)$ is replaced by an approximation, yielding the TD algorithm: $w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big]\nabla_w \hat v(s_t, w_t)$

Theoretical analysis (optional)

The TD algorithm with function approximation,

$$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat v(s_{t+1}, w_t) - \hat v(s_t, w_t)\big]\nabla_w \hat v(s_t, w_t),$$

does not actually minimize the objective function $J(w)$ defined earlier. To see what it does minimize, consider the following objective functions:

  • Objective function 1: the true value error
    $$J_E(w) = \big\|\hat v(w) - v_\pi\big\|_D^2 = \big(\hat v(w) - v_\pi\big)^T D\,\big(\hat v(w) - v_\pi\big),$$
    where $\hat v(w)$ is the vector of estimated state values and $D = \mathrm{diag}\big(d_\pi(s_1), \dots, d_\pi(s_n)\big)$ is a diagonal matrix whose entries are the stationary distribution values of the corresponding states
  • Objective function 2: the Bellman error
    $$J_{BE}(w) = \big\|\hat v(w) - \big(r_\pi + \gamma P_\pi \hat v(w)\big)\big\|_D^2 \doteq \big\|\hat v(w) - T_\pi\big(\hat v(w)\big)\big\|_D^2,$$
    where $T_\pi(x) \doteq r_\pi + \gamma P_\pi x$ is the Bellman operator
  • Objective function 3: the projected Bellman error
    $$J_{PBE}(w) = \big\|\hat v(w) - M\,T_\pi\big(\hat v(w)\big)\big\|_D^2,$$
    where $M$ is a projection matrix onto the space of value functions representable by the approximator
  • The TD-Linear algorithm minimizes the projected Bellman error
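For a linear approximator $\hat v(w) = \Phi w$, where the rows of $\Phi$ are the feature vectors $\phi^T(s)$, these quantities can be computed directly. The sketch below uses the standard $D$-weighted projection $M = \Phi(\Phi^T D \Phi)^{-1}\Phi^T D$ onto the feature span; the specific matrices `Phi`, `P_pi`, `r_pi`, and `d` are assumed inputs.

```python
import numpy as np

def projection_matrix(Phi, d):
    """D-weighted projection onto span(Phi): M = Phi (Phi^T D Phi)^{-1} Phi^T D."""
    D = np.diag(d)
    return Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

def value_errors(w, Phi, d, r_pi, P_pi, v_pi, gamma=0.9):
    """Return (true value error, Bellman error, projected Bellman error), all D-weighted."""
    D = np.diag(d)
    v_hat = Phi @ w                          # vector of estimated state values
    T_v = r_pi + gamma * P_pi @ v_hat        # Bellman operator applied to v_hat
    M = projection_matrix(Phi, d)
    sq_norm = lambda x: float(x @ D @ x)     # ||x||_D^2
    return sq_norm(v_hat - v_pi), sq_norm(v_hat - T_v), sq_norm(v_hat - M @ T_v)
```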