Bellman Optimality Equation
Introduction
Bellman optimality equation (elementwise form):

$$v(s) = \max_{\pi}\sum_a \pi(a|s)\left(\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v(s')\right),\quad \forall s\in\mathcal{S}$$

- $p(r|s,a)$, $p(s'|s,a)$ are known
- $v(s)$, $v(s')$ are to be calculated
- $\pi(a|s)$ is unknown, to be optimized
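To make the elementwise form concrete, here is a minimal numpy sketch that evaluates $q(s,a)=\sum_r p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v(s')$ for a hypothetical two-state, two-action model; all probabilities below are made-up illustrative numbers, not from the source.

```python
import numpy as np

gamma = 0.9                      # discount rate
rewards = np.array([0.0, 1.0])   # the possible reward values r

# Hypothetical toy model (made-up numbers):
# p_r[s, a, i] = p(r_i | s, a),  p_s[s, a, t] = p(s_t | s, a)
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

v = np.zeros(2)                  # a guess of v(s') for every next state

# q(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')
q = p_r @ rewards + gamma * (p_s @ v)
print(q)                         # shape (2, 2): one entry per (state, action)
```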
Two details

Where did the subscript go? Recall that our Bellman equation actually has the form

$$v_\pi(s) = \sum_a \pi(a|s)\left(\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v_\pi(s')\right)$$

Why does the subscript $\pi$ disappear in the optimality equation? Because the policy there is one definite policy (the optimal one, to be found later), so $v$ no longer depends on a particular $\pi$.

What does $\max_\pi$ mean? It means the resulting state value is at least as large as under every other $\pi$. This is easier to see if you treat $\pi(a|s)$ as a parameter: with $m$ actions, each state carries $m$ probabilities, and adjusting that probability distribution adjusts the state value. For every state, we want the distribution that yields the largest state value.
Matrix-vector form

$$v = \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$$
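Here $r_\pi$ and $P_\pi$ are the same objects as in the matrix-vector Bellman equation:

$$[r_\pi]_s = \sum_a \pi(a|s)\sum_r p(r|s,a)\,r,\qquad [P_\pi]_{ss'} = \sum_a \pi(a|s)\,p(s'|s,a)$$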
Questions:
- Does a solution exist? (Existence)
- How do we solve it? (Algorithm)
- Is the solution unique? (Uniqueness)
- Is the solution optimal? (Optimality)
Maximization on the right-hand side
Fix all $v(s')$; then every $q(s,a)$ is fixed, and we solve

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a)$$

Consider $\sum_a \pi(a|s)\,q(s,a)$. We have

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a) = \max_{a\in\mathcal{A}(s)} q(s,a)$$

where the optimality is achieved when

$$\pi(a|s) = \begin{cases} 1, & a = a^* \\ 0, & a \neq a^* \end{cases}$$

where $a^* = \arg\max_a q(s,a)$.
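A tiny made-up numeric instance: with $m = 3$ actions and fixed values $q(s,a_1)=1$, $q(s,a_2)=3$, $q(s,a_3)=2$,

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a) = \max_{\substack{c_1+c_2+c_3=1 \\ c_i \ge 0}} \left(c_1\cdot 1 + c_2\cdot 3 + c_3\cdot 2\right) = 3$$

achieved at $(c_1,c_2,c_3) = (0,1,0)$, i.e. all probability on the action with the largest $q$ value.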
Plain-language description

Current state ⇒ next state:
- First fix the next-state values, so that the q values are also fixed when we solve.
- For the current state, suppose there are m actions; the probabilities of choosing the m actions sum to 1.
- For the current state, we now also know the q values (m of them); the current state value we are solving for is just a weighted average of the q values.
- Obviously, putting all the weight on the largest q value yields the largest current state value (see the sketch below).
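The same argument as a minimal numpy sketch (the $q$ values are made up):

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])      # fixed q(s, a) for one state, m = 3 actions

# A policy at this state is a probability vector over the m actions;
# the state value is the corresponding weighted average of the q values.
uniform = np.ones(3) / 3           # an arbitrary stochastic policy
greedy = np.eye(3)[np.argmax(q)]   # all weight on the largest q value

print(uniform @ q)                 # 2.0
print(greedy @ q)                  # 3.0 == q.max(): the best achievable value
```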
Rewrite as v=f(v)
The BOE is $v = \max_\pi(r_\pi + \gamma P_\pi v)$. Let

$$f(v) := \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$$

Then the BOE becomes

$$v = f(v)$$

where

$$[f(v)]_s = \max_{\pi}\sum_a \pi(a|s)\,q(s,a),\quad s\in\mathcal{S}$$

Think of $v$ as a vector: $[f(v)]_s$ just selects the $s$-th element of the vector $f(v)$.
Preliminary: Contraction mapping theorem

For any equation of the form $x = f(x)$ where $f$ is a contraction mapping, i.e. $\|f(x_1) - f(x_2)\| \le \gamma\|x_1 - x_2\|$ for some $\gamma\in(0,1)$: a fixed point $x^*$ with $f(x^*) = x^*$ exists, it is unique, and the iteration $x_{k+1} = f(x_k)$ converges to $x^*$ exponentially fast from any initial $x_0$.
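A one-dimensional illustration with a made-up map: $f(x) = 0.5x + 1$ is a contraction with $\gamma = 0.5$, so iterating it from any starting point converges to the unique fixed point $x^* = 2$.

```python
# Contraction demo: f(x) = 0.5 x + 1 has the unique fixed point x* = 2.
x = 10.0
for k in range(30):
    x = 0.5 * x + 1.0              # the error |x - 2| halves on every step
print(x)                           # ~2.0
```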
Solution
- $f(v) = \max_\pi(r_\pi + \gamma P_\pi v)$ is a contraction mapping satisfying $\|f(v_1) - f(v_2)\| \le \gamma\|v_1 - v_2\|$
- where $\gamma\in(0,1)$ is the discount rate
- a solution $v^* = f(v^*)$ exists
- the solution is unique
- algo: $v_{k+1} = f(v_k)$
- $v_k$ converges to $v^*$ exponentially fast (see the sketch below)
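A minimal value-iteration sketch of $v_{k+1} = f(v_k)$, reusing the same made-up toy model as in the first sketch (illustrative numbers, not from the source):

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])
# Same hypothetical two-state, two-action model as above (made-up numbers).
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

def f(v):
    # [f(v)]_s = max_a [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]
    q = p_r @ rewards + gamma * (p_s @ v)
    return q.max(axis=1)

v = np.zeros(2)
for k in range(1000):
    v_next = f(v)
    if np.max(np.abs(v_next - v)) < 1e-10:
        break                      # exponential convergence: stop early
    v = v_next
print(v)                           # v*, the unique solution of the BOE
```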
Optimality
Suppose $v^*$ is the solution to the BOE. It satisfies

$$v^* = \max_{\pi}\left(r_\pi + \gamma P_\pi v^*\right)$$

Suppose

$$\pi^* = \arg\max_{\pi}\left(r_\pi + \gamma P_\pi v^*\right)$$

Then

$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$

so $v^* = v_{\pi^*}$ is the state value of $\pi^*$, and the BOE is a special Bellman equation.

Policy optimality: $v^* \ge v_\pi$ for any policy $\pi$, hence $\pi^*$ is an optimal policy.
Greedy Optimal Policy
$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \neq a^*(s) \end{cases}$$

where

$$a^*(s) = \arg\max_a q^*(s,a),\qquad q^*(s,a) = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v^*(s')$$
policy_star = np.eye(num_actions)[action_star]  # one-hot: all probability mass on a*(s)
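A hedged end-to-end sketch on the same made-up toy model as above: solve the BOE for $v^*$, form $q^*$, and read off the one-hot greedy policy (`num_actions` and `action_star` above are the hypothetical names this snippet makes concrete).

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])
# Same hypothetical toy model as above (made-up numbers).
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

# Solve the BOE by fixed-point iteration to obtain v*.
v_star = np.zeros(2)
for _ in range(1000):
    v_star = (p_r @ rewards + gamma * (p_s @ v_star)).max(axis=1)

# q*(s,a), then put all probability on a*(s) = argmax_a q*(s,a).
q_star = p_r @ rewards + gamma * (p_s @ v_star)
num_actions = q_star.shape[1]
action_star = q_star.argmax(axis=1)             # a*(s) for every state
policy_star = np.eye(num_actions)[action_star]  # one-hot rows, i.e. pi*(a|s)
print(policy_star)
```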