Bellman Optimality Equation
Introduction
Bellman optimality equation (elementwise form):

$$v(s) = \max_{\pi}\sum_a \pi(a|s)\left(\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v(s')\right),\quad \forall s\in\mathcal{S}$$

- $p(r|s,a)$, $p(s'|s,a)$ are known
- $v(s)$, $v(s')$ are to be calculated
- $\pi(a|s)$ is unknown, to be optimized
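To make the elementwise form concrete, here is a minimal numpy sketch that evaluates $q(s,a)=\sum_r p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v(s')$ for a hypothetical two-state, two-action model; all probabilities below are made-up illustrative numbers, not from the source.

```python
import numpy as np

gamma = 0.9                      # discount rate
rewards = np.array([0.0, 1.0])   # the possible reward values r

# Hypothetical toy model (made-up numbers):
# p_r[s, a, i] = p(r_i | s, a),  p_s[s, a, t] = p(s_t | s, a)
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

v = np.zeros(2)                  # a guess of v(s') for every next state

# q(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')
q = p_r @ rewards + gamma * (p_s @ v)
print(q)                         # shape (2, 2): one entry per (state, action)
```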
Two details

Where did the subscript go? Recall that our Bellman equation actually has the form

$$v_\pi(s) = \sum_a \pi(a|s)\left(\sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v_\pi(s')\right)$$

Why does the subscript $\pi$ disappear in the optimality equation? Because the policy there is one definite policy (the optimal one, to be found later), so $v$ no longer depends on a particular $\pi$.

What does $\max_\pi$ mean? It means the resulting state value is at least as large as under every other $\pi$. This is easier to see if you treat $\pi(a|s)$ as a parameter: with $m$ actions, each state carries $m$ probabilities, and adjusting that probability distribution adjusts the state value. For every state, we want the distribution that yields the largest state value.
Matrix-vector form

$$v = \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$$
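Here $r_\pi$ and $P_\pi$ are the same objects as in the matrix-vector Bellman equation:

$$[r_\pi]_s = \sum_a \pi(a|s)\sum_r p(r|s,a)\,r,\qquad [P_\pi]_{ss'} = \sum_a \pi(a|s)\,p(s'|s,a)$$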
Questions:
- Does a solution exist? (Existence)
- How do we solve it? (Algorithm)
- Is the solution unique? (Uniqueness)
- Is the solution optimal? (Optimality)
Maximization on the right-hand side
Fix all $v(s')$; then every $q(s,a)$ is fixed, and we solve

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a)$$

Consider $\sum_a \pi(a|s)\,q(s,a)$. We have

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a) = \max_{a\in\mathcal{A}(s)} q(s,a)$$

where the optimality is achieved when

$$\pi(a|s) = \begin{cases} 1, & a = a^* \\ 0, & a \neq a^* \end{cases}$$

where $a^* = \arg\max_a q(s,a)$.
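A tiny made-up numeric instance: with $m = 3$ actions and fixed values $q(s,a_1)=1$, $q(s,a_2)=3$, $q(s,a_3)=2$,

$$\max_{\pi}\sum_a \pi(a|s)\,q(s,a) = \max_{\substack{c_1+c_2+c_3=1 \\ c_i \ge 0}} \left(c_1\cdot 1 + c_2\cdot 3 + c_3\cdot 2\right) = 3$$

achieved at $(c_1,c_2,c_3) = (0,1,0)$, i.e. all probability on the action with the largest $q$ value.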
Plain-language description

Current state ⇒ next state:
- First fix the next-state values, so that the q values are also fixed when we solve.
- For the current state, suppose there are m actions; the probabilities of choosing the m actions sum to 1.
- For the current state, we now also know the q values (m of them); the current state value we are solving for is just a weighted average of the q values.
- Obviously, putting all the weight on the largest q value yields the largest current state value (see the sketch below).
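The same argument as a minimal numpy sketch (the $q$ values are made up):

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])      # fixed q(s, a) for one state, m = 3 actions

# A policy at this state is a probability vector over the m actions;
# the state value is the corresponding weighted average of the q values.
uniform = np.ones(3) / 3           # an arbitrary stochastic policy
greedy = np.eye(3)[np.argmax(q)]   # all weight on the largest q value

print(uniform @ q)                 # 2.0
print(greedy @ q)                  # 3.0 == q.max(): the best achievable value
```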
Rewrite as v=f(v)
The BOE is $v = \max_\pi(r_\pi + \gamma P_\pi v)$. Let

$$f(v) := \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$$

Then the BOE becomes

$$v = f(v)$$

where

$$[f(v)]_s = \max_{\pi}\sum_a \pi(a|s)\,q(s,a),\quad s\in\mathcal{S}$$

Think of $v$ as a vector: $[f(v)]_s$ just selects the $s$-th element of the vector $f(v)$.
Preliminary: Contraction mapping theorem

For any equation of the form $x = f(x)$ where $f$ is a contraction mapping, i.e. $\|f(x_1) - f(x_2)\| \le \gamma\|x_1 - x_2\|$ for some $\gamma\in(0,1)$: a fixed point $x^*$ with $f(x^*) = x^*$ exists, it is unique, and the iteration $x_{k+1} = f(x_k)$ converges to $x^*$ exponentially fast from any initial $x_0$.
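A one-dimensional illustration with a made-up map: $f(x) = 0.5x + 1$ is a contraction with $\gamma = 0.5$, so iterating it from any starting point converges to the unique fixed point $x^* = 2$.

```python
# Contraction demo: f(x) = 0.5 x + 1 has the unique fixed point x* = 2.
x = 10.0
for k in range(30):
    x = 0.5 * x + 1.0              # the error |x - 2| halves on every step
print(x)                           # ~2.0
```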
Solution
- $f(v) = \max_\pi(r_\pi + \gamma P_\pi v)$ is a contraction mapping satisfying $\|f(v_1) - f(v_2)\| \le \gamma\|v_1 - v_2\|$
- where $\gamma\in(0,1)$ is the discount rate
- a solution $v^* = f(v^*)$ exists
- the solution is unique
- algo: $v_{k+1} = f(v_k)$
- $v_k$ converges to $v^*$ exponentially fast (see the sketch below)
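A minimal value-iteration sketch of $v_{k+1} = f(v_k)$, reusing the same made-up toy model as in the first sketch (illustrative numbers, not from the source):

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])
# Same hypothetical two-state, two-action model as above (made-up numbers).
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

def f(v):
    # [f(v)]_s = max_a [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]
    q = p_r @ rewards + gamma * (p_s @ v)
    return q.max(axis=1)

v = np.zeros(2)
for k in range(1000):
    v_next = f(v)
    if np.max(np.abs(v_next - v)) < 1e-10:
        break                      # exponential convergence: stop early
    v = v_next
print(v)                           # v*, the unique solution of the BOE
```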
Optimality
Suppose $v^*$ is the solution to the BOE. It satisfies

$$v^* = \max_{\pi}\left(r_\pi + \gamma P_\pi v^*\right)$$

Suppose

$$\pi^* = \arg\max_{\pi}\left(r_\pi + \gamma P_\pi v^*\right)$$

Then

$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$

so $v^* = v_{\pi^*}$ is the state value of $\pi^*$, and the BOE is a special Bellman equation.

Policy optimality: $v^* \ge v_\pi$ for any policy $\pi$, hence $\pi^*$ is an optimal policy.
Greedy Optimal Policy
$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \neq a^*(s) \end{cases}$$

where

$$a^*(s) = \arg\max_a q^*(s,a),\qquad q^*(s,a) = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\,v^*(s')$$
policy_star = np.eye(num_actions)[action_star]  # one-hot: all probability mass on a*(s)
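A hedged end-to-end sketch on the same made-up toy model as above: solve the BOE for $v^*$, form $q^*$, and read off the one-hot greedy policy (`num_actions` and `action_star` above are the hypothetical names this snippet makes concrete).

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])
# Same hypothetical toy model as above (made-up numbers).
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]])

# Solve the BOE by fixed-point iteration to obtain v*.
v_star = np.zeros(2)
for _ in range(1000):
    v_star = (p_r @ rewards + gamma * (p_s @ v_star)).max(axis=1)

# q*(s,a), then put all probability on a*(s) = argmax_a q*(s,a).
q_star = p_r @ rewards + gamma * (p_s @ v_star)
num_actions = q_star.shape[1]
action_star = q_star.argmax(axis=1)             # a*(s) for every state
policy_star = np.eye(num_actions)[action_star]  # one-hot rows, i.e. pi*(a|s)
print(policy_star)
```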