Deep Q-Network (DQN)

Introduction

Deep Q-learning or deep Q-network (DQN) is one of the earliest and most successful algorithms that introduce deep neural networks into reinforcement learning.

The role of the neural network is to serve as a nonlinear function approximator (as mentioned before). Unlike the following Q-learning update with function approximation,

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t),$$

which performs one incremental update per sample, DQN trains the network by minimizing a loss function.

Loss Function

DQN aims to minimize the objective/loss function

$$J(w) = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right],$$

where $(S, A, R, S')$ are random variables.
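As a concrete illustration, the following is a minimal PyTorch sketch that estimates this objective with a sample mean over a mini-batch of $(s, a, r, s')$ tuples; the `q_net` interface, the tensor shapes, and `gamma` are assumptions made only for this sketch. Note that the same parameter $w$ also appears inside the max term, which is exactly the difficulty discussed in the next sections.

```python
import torch

def dqn_objective_estimate(q_net, s, a, r, s_next, gamma=0.99):
    """Sample-mean estimate of J(w) = E[(R + gamma * max_a q(S', a, w) - q(S, A, w))^2].

    q_net is assumed to map a batch of states to a (batch, num_actions) tensor of q-values.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # q(S, A, w) for the taken actions
    target = r + gamma * q_net(s_next).max(dim=1).values    # R + gamma * max_a q(S', a, w)
    return ((target - q_sa) ** 2).mean()                    # empirical mean approximates the expectation
```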

Understanding the notation $\mathbb{E}[\cdot]$

  • If written as $\mathbb{E}_{X \sim p}[\cdot]$, it denotes the expectation with respect to the distribution $p$
  • If written as $\mathbb{E}[\cdot]$ without a subscript, it denotes the expectation with respect to the distributions of the random variables inside the brackets, i.e., the capital letters such as $S, A, R, S'$
  • The objective above can therefore also be written as $J(w) = \mathbb{E}_{(S,A) \sim d,\; R \sim p(\cdot \mid S,A),\; S' \sim p(\cdot \mid S,A)}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right]$, where $d$ is the distribution of the state-action pair $(S, A)$

Minimize loss function

Core Idea

Temporarily fix the parameters of a previous copy of the network to produce the target, compare it with the output of the current network, compute the Bellman MSE loss between the two, and then update the current network's parameters.

  • To calculate the gradient of the objective function, note that the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in the target $y \triangleq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$
  • Since the optimal action $a^* = \arg\max_{a} \hat{q}(S', a, w)$ also depends on $w$, the gradient of the target is inconvenient to compute
  • We therefore need to fix $w$ inside the target $y$ when calculating the gradient, so that the dependence of $y$ on $w$ disappears
  • Solution: use two networks
    • main network: $\hat{q}(s, a, w)$
    • target network: $\hat{q}(s, a, w_T)$, where $w_T$ is the (temporarily) fixed target network parameter
  • Objective function in this case: $J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right)^2\right]$
  • Gradient of $J$ (a worked comparison of the two gradients is given after this list): $\nabla_w J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)\right]$, up to a constant coefficient
  • Basic idea of DQN: use a gradient-descent algorithm to minimize the objective function
  • The optimization process involves some important techniques (replay buffer, target network, etc.)
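To make the "fix $w$ in the target" step explicit, here is a worked comparison of the two gradients, written as a sketch under the assumption that $\hat{q}$ is differentiable in $w$ (the constant factor $-2$ is the one usually dropped):

```latex
% Differentiating the original objective directly, w appears in both terms:
\nabla_w J(w) = -2\,\mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\big)
                \big(\nabla_w \hat{q}(S, A, w) - \gamma\, \nabla_w \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)\big)\Big]
% The second gradient term is inconvenient because the maximizing action itself changes with w.
% Fixing the parameter inside the target at w_T removes that term:
\nabla_w J(w) = -2\,\mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\big)\,
                \nabla_w \hat{q}(S, A, w)\Big]
```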

Two Networks

  • Two networks, a main network and a target network
  • Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively
    • they are set to be the same initially
  • In every iteration, draw a mini-batch of samples $\{(s, a, r, s')\}$ from the replay buffer
  • For every $(s, a, r, s')$, calculate the desired output as $y_T \triangleq r + \gamma \max_{a \in \mathcal{A}(s')} \hat{q}(s', a, w_T)$. Therefore, we obtain a mini-batch of data $\{(s, a, y_T)\}$
  • Use $\{(s, a, y_T)\}$ to train the main network so as to minimize the MSE loss $(y_T - \hat{q}(s, a, w))^2$ (a single training iteration is sketched below)
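A single iteration of this two-network procedure might look like the following sketch; `q_net`, `target_net`, and the already-drawn mini-batch tensors are assumptions made for illustration:

```python
import torch

def train_step(q_net, target_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One training iteration: build y_T with the fixed target parameters w_T,
    then fit q(s, a, w) to y_T with an MSE loss."""
    with torch.no_grad():                                    # w_T is held fixed: no gradient here
        y_T = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # q(s, a, w) for the sampled actions
    loss = ((y_T - q_sa) ** 2).mean()                        # MSE over the mini-batch {(s, a, y_T)}
    optimizer.zero_grad()
    loss.backward()                                          # gradient flows only through q(s, a, w)
    optimizer.step()                                         # update w; w_T is synced periodically elsewhere
    return loss.item()
```

Periodically (e.g., every fixed number of iterations) $w_T$ is reset to $w$; the full loop is shown in the Implementation section.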

Experience replay

  • After we have collected some experience samples, we do NOT use them in the order they were collected (recall the discussion of why we use TD learning)
  • Instead, we store them in a set called the replay buffer
  • Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer
  • This draw of samples, called experience replay, should follow a uniform distribution (a minimal buffer implementation is sketched below)
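A minimal replay buffer with uniform sampling could look like the following sketch; the capacity and the tuple layout `(s, a, r, s_next)` are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)     # oldest samples are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutively collected samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```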

Why do we use experience replay? Why must the sampling follow a uniform distribution?

Answer:

This approach naturally and mathematically resolves the question of the distribution of $(S, A)$ that appears in $J(w)$:

  • $R \sim p(R \mid S, A)$, $S' \sim p(S' \mid S, A)$: $R$ and $S'$ are determined by the system model
  • $(S, A) \sim d$: the pair $(s, a)$ is an index and is treated as a single random variable
  • The distribution $d$ of the state-action pair is assumed to be uniform
    • Why a uniform distribution? Because we have no prior knowledge about which pairs matter more
    • Can we use a stationary distribution as before? No, since no policy is given here
  • However, the state-action samples are not collected uniformly, because they are generated consecutively by a specific behavior policy
  • Therefore, we store these samples in a replay buffer and then sample them uniformly, which breaks the correlation between consecutive samples

Revisit the tabular case

  • Why does tabular Q-learning not require experience replay?
    • Because it does not involve any distribution of the state-action pair $(S, A)$ or of $R, S'$
    • In the earlier tabular case we always used the RM (Robbins-Monro) approach to avoid computing the expectation directly; each update only needs a single sample, so no distribution was ever involved
  • Why does deep Q-learning involve distributions?
    • It is determined by how our loss function is defined
    • We define a scalar objective function $J(w)$, in which the single parameter $w$ is shared by all $(s, a)$ and the expectation is taken over their distribution
    • The tabular case aims to solve a set of equations (e.g., the BE or BOE), one for every $(s, a)$ (the value of one entry is updated at a time)
    • The deep case aims to optimize a single scalar objective function (the weights are updated for a whole mini-batch at a time)
  • Can we use experience replay in tabular Q-learning?
    • Yes, we can, and it makes the learning more sample-efficient (see the sketch below)
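For instance, a tabular Q-learning update driven by replayed samples might look like the sketch below; the hyperparameters and the buffer layout are illustrative assumptions, and the point is only that the same stored $(s, a, r, s')$ tuples can be reused many times:

```python
import random
import numpy as np

def tabular_q_learning_with_replay(buffer, n_states, n_actions,
                                   gamma=0.9, alpha=0.1, epochs=10):
    """Tabular Q-learning that replays stored samples instead of using them in
    the order they were collected.  `buffer` is a list of (s, a, r, s_next)
    tuples gathered by some behavior policy."""
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):                          # each stored sample is reused many times
        for s, a, r, s_next in random.sample(buffer, len(buffer)):
            target = r + gamma * q[s_next].max()     # TD target from the current table
            q[s, a] += alpha * (target - q[s, a])    # standard tabular Q-learning update
    return q
```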

Implementation

The algorithm here differs slightly from the one in the original DQN paper (NIPS 2013 Deep Learning Workshop).
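Below is a compact end-to-end sketch of the procedure described in these notes (replay buffer + target network + epsilon-greedy behavior policy, hence off-policy). The toy one-hot MDP, the network sizes, and all hyperparameters are made up purely so that the sketch is self-contained and runnable; it is not the exact pseudocode of the original paper.

```python
import random
import torch
import torch.nn as nn

# A toy deterministic MDP (5 one-hot states, 2 actions) used only to make the
# sketch self-contained; replace it with a real environment in practice.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
P = torch.randint(0, N_STATES, (N_STATES, N_ACTIONS))   # next-state table
R = torch.rand(N_STATES, N_ACTIONS)                     # reward table

def one_hot(s):
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

q_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())          # w_T = w initially
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer, s = [], 0

for step in range(2000):
    # behavior policy: epsilon-greedy w.r.t. the main network (off-policy learning)
    if random.random() < 0.1:
        a = random.randrange(N_ACTIONS)
    else:
        with torch.no_grad():
            a = q_net(one_hot(s)).argmax().item()
    r, s_next = R[s, a].item(), P[s, a].item()
    buffer.append((s, a, r, s_next))                     # store the experience sample
    s = s_next

    if len(buffer) >= 64:
        batch = random.sample(buffer, 64)                # uniform draw: experience replay
        bs = torch.stack([one_hot(x[0]) for x in batch])
        ba = torch.tensor([x[1] for x in batch])
        br = torch.tensor([x[2] for x in batch])
        bs2 = torch.stack([one_hot(x[3]) for x in batch])
        with torch.no_grad():                            # y_T = r + gamma * max_a q(s', a, w_T)
            y_T = br + GAMMA * target_net(bs2).max(dim=1).values
        q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
        loss = ((y_T - q_sa) ** 2).mean()                # MSE between y_T and q(s, a, w)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 100 == 0:                                  # periodically sync w_T <- w
        target_net.load_state_dict(q_net.state_dict())

# the greedy policy is read directly from the learned q-values; no separate policy update
policy = [q_net(one_hot(i)).argmax().item() for i in range(N_STATES)]
```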

Some remarks

  • The version above is the off-policy version, because the behavior policy and the target policy are clearly different (or, seen from another angle: we always learn from samples drawn from the replay buffer, which matches the name "off-policy")
  • Why is there no policy-update step written above? The reason is the same as for Version 3 of Q-learning discussed before: we learn the optimal q-values from a large number of state-action pairs, so there is no need to update a policy during training (the greedy policy is obtained directly by taking the argmax of the learned q-values)