Deep Q-Network (DQN)
Introduction
Deep Q-learning or deep Q-network (DQN) is one of the earliest and most successful algorithms that introduce deep neural networks into reinforcement learning.
The role of the neural network is to serve as a nonlinear function approximator (as mentioned before). This is different from the tabular Q-learning algorithm seen previously:

$$q_{t+1}(s_t,a_t) = q_t(s_t,a_t) - \alpha_t(s_t,a_t)\Big[q_t(s_t,a_t) - \big(r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1},a)\big)\Big],$$

which updates the q-value of one state-action pair at a time, whereas DQN updates the parameter $w$ of a value network $\hat q(s,a,w)$.
Loss Function
DQN aims to minimize the objective/loss function:

$$J(w) = \mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w) - \hat q(S, A, w)\big)^2\Big],$$

where $(S, A, R, S')$ are random variables.
Understanding the notation $\mathbb{E}[\cdot]$:
- If written as $\mathbb{E}_{X \sim p}[\cdot]$, it denotes the expectation with respect to the distribution $p$.
- If no subscript is written, as in $\mathbb{E}[\cdot]$, it denotes the expectation with respect to the current distributions of the random variables, which are usually the capital letters inside the brackets, such as $S, A, R, S'$.
- The objective above can therefore also be written with explicit subscripts:

$$J(w) = \mathbb{E}_{(S,A)\sim d,\; R \sim p(\cdot \mid S,A),\; S' \sim p(\cdot \mid S,A)}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w) - \hat q(S, A, w)\big)^2\Big]$$
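As a small numerical illustration (a toy sketch with made-up numbers, not from the source), the expectation in $J(w)$ is never computed exactly in practice; it is approximated by an empirical average of squared TD errors over sampled transitions $(s, a, r, s')$:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# Toy stand-in for \hat{q}(s, a, w): a table with 3 states and 2 actions.
q_hat = rng.normal(size=(3, 2))

# Sampled transitions (s, a, r, s'), e.g. drawn from a replay buffer.
samples = [(0, 1, 1.0, 2), (1, 0, 0.0, 0), (2, 1, 0.5, 1)]

# Monte Carlo estimate of J(w): the mean of squared TD errors over the samples.
td_errors = [r + gamma * q_hat[s_next].max() - q_hat[s, a]
             for (s, a, r, s_next) in samples]
J_estimate = np.mean(np.square(td_errors))
print(J_estimate)
```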
Minimizing the Loss Function
Core Idea
The core idea: temporarily fix the parameters of an earlier copy of the network and use its output as the target, compute the Bellman MSE loss between this target and the output of the current network, and then update the current network's parameters.
- To calculate the gradient of the objective function $J(w)$, note that the parameter $w$ appears not only in $\hat q(S,A,w)$ but also in the target $y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w)$
- Since the optimal action value $\max_{a} \hat q(S', a, w)$ inside the target also depends on $w$, the gradient is nontrivial to compute directly
- We need to fix $w$ inside $y$ when calculating the gradient (i.e., make the dependence of $y$ on $w$ disappear)
- solution ⇒ two networks:
    - main network: $\hat q(s, a, w)$
    - target network: $\hat q(s, a, w_T)$, where $w_T$ is the fixed target-network parameter
- objective function in this case:

$$J(w) = \mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w_T) - \hat q(S, A, w)\big)^2\Big]$$

- gradient of $J(w)$:

$$\nabla_w J(w) = -2\,\mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat q(S', a, w_T) - \hat q(S, A, w)\big)\,\nabla_w \hat q(S, A, w)\Big]$$
- basic idea of DQN: use a gradient-descent algorithm to minimize the objective function
- the optimization process involves some important techniques (replay buffer, target network, etc.); a code sketch of the fixed-target gradient follows this list
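A minimal PyTorch sketch of the fixed-target gradient (the tiny linear networks, tensor shapes, and variable names are assumptions for illustration): the target is computed from the target network under `torch.no_grad()`, so after `backward()` only the main network's parameter $w$ receives a gradient, exactly as in the formula above.

```python
import torch
import torch.nn as nn

gamma = 0.99
main_net   = nn.Linear(4, 2)                        # stands for \hat{q}(s, ., w)
target_net = nn.Linear(4, 2)                        # stands for \hat{q}(s, ., w_T)
target_net.load_state_dict(main_net.state_dict())   # w_T = w initially

s, s_next = torch.randn(8, 4), torch.randn(8, 4)
a = torch.randint(0, 2, (8, 1))
r = torch.randn(8)

with torch.no_grad():                                # fix w_T: the target is a constant w.r.t. w
    y = r + gamma * target_net(s_next).max(dim=1).values

q_sa = main_net(s).gather(1, a).squeeze(1)           # \hat{q}(s, a, w)
loss = ((y - q_sa) ** 2).mean()
loss.backward()

print(main_net.weight.grad is None)                  # False: w receives a gradient
print(target_net.weight.grad is None)                # True: w_T is treated as a constant
```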
Two Networks
- Two networks are used: a main network and a target network
- Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively
- They are set to be the same initially
- In every iteration, draw a mini-batch of samples $\{(s, a, r, s')\}$ from the replay buffer
- For every $(s, a, r, s')$, calculate the target value $y_T \doteq r + \gamma \max_{a \in \mathcal{A}(s')} \hat q(s', a, w_T)$; we therefore obtain a mini-batch of labeled data $\{(s, a, y_T)\}$
- Use $\{(s, a, y_T)\}$ to train the main network so as to minimize the MSE loss $\frac{1}{N}\sum_{i=1}^{N} \big(y_{T,i} - \hat q(s_i, a_i, w)\big)^2$ over the mini-batch; a full training-loop sketch is given in the Implementation section below
Experience replay
- After we have collected some experience samples, we do NOT use them in the order they were collected (recall why we use TD learning)
- Instead, we store them in a set called the replay buffer $\mathcal{B} \doteq \{(s, a, r, s')\}$
- Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer
- The drawing of samples, called experience replay, should follow a uniform distribution
Why do we use experience replay? Why must the sampling follow a uniform distribution?
Answer:
This method naturally and mathematically resolves the question of what distribution the pair $(S, A)$ in $J(w)$ should follow:
- $R \sim p(R \mid S, A)$, $S' \sim p(S' \mid S, A)$: $R$ and $S'$ are determined by the system model
- $(S, A) \sim d$: the pair $(S, A)$ is an index and is treated as a single random variable
- The distribution $d$ of the state-action pair $(S, A)$ is assumed to be uniform
- Why a uniform distribution? Because we have no prior knowledge about which state-action pairs are more important
- Can we use a stationary distribution as before? No, since no policy is given here
- However, the state-action samples are not collected uniformly, because they are generated consecutively by a specific behavior policy
- Therefore, we use a replay buffer to store these samples and then sample from it uniformly, which breaks the correlation between consecutive samples (a minimal buffer sketch is given below)
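A minimal replay-buffer sketch in Python (the class name `ReplayBuffer` and the `capacity` default are my own choices, not from the source): transitions are stored in arrival order but drawn uniformly at random, which is exactly what breaks the correlation between consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s') and samples them uniformly at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # oldest samples are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling: every stored transition is equally likely to be drawn,
        # regardless of the (correlated) order in which it was collected.
        return random.sample(self.buffer, batch_size)

# Usage: fill the buffer with correlated experience, then draw an uncorrelated mini-batch.
buf = ReplayBuffer()
for t in range(1000):
    buf.add(s=t, a=t % 2, r=1.0, s_next=t + 1)    # placeholder transitions
mini_batch = buf.sample(32)
```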
Revisit the tabular case
- Why does tabular Q-learning not require experience replay?
- Because it does not require any distribution of the state or of the state-action pair
- In the tabular case, we always use the RM (Robbins-Monro) approach to avoid computing the expectation directly: a single sample is enough for each update, so no distribution is ever involved
- Why does Deep Q-learning involve distributions?
- It is determined by our loss function:
- DQN defines a scalar objective function $J(w)$, where the expectation is taken over the distribution of $(S, A)$ for all state-action pairs
- the tabular case aims to solve a set of equations (e.g., the BE or BOE), one for every $(s, a)$ (the q-value of a single state-action pair is updated with one sample at a time)
- the deep case aims to optimize a single scalar objective function (the network weights are updated with a mini-batch at a time)
- Can we use experience replay in tabular Q-learning?
- Yes, we can, and doing so is more sample efficient since stored samples can be reused (see the sketch below)
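A toy sketch of this last point (the state/action encoding, step sizes, and buffer contents are made up for illustration): tabular Q-learning can replay stored transitions, applying the ordinary update to each replayed sample, and replaying the same data several times reuses it for better sample efficiency.

```python
import random
from collections import defaultdict

gamma, alpha = 0.9, 0.1
num_actions = 4
q = defaultdict(float)                    # tabular q-values, indexed as q[(s, a)]

def q_learning_update(s, a, r, s_next):
    # Standard tabular Q-learning update applied to one stored sample.
    target = r + gamma * max(q[(s_next, b)] for b in range(num_actions))
    q[(s, a)] += alpha * (target - q[(s, a)])

# Replaying the same stored transitions several times reuses the data.
replay_buffer = [(0, 1, 1.0, 1), (1, 0, 0.0, 2), (2, 3, 1.0, 0)]   # toy (s, a, r, s') samples
for sweep in range(10):
    for s, a, r, s_next in random.sample(replay_buffer, len(replay_buffer)):
        q_learning_update(s, a, r, s_next)
```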
Implementation
This differs slightly from the algorithm in the original DQN paper (NIPS 2013 Deep Learning Workshop).
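Below is a hedged end-to-end sketch of the procedure described in this section, written in PyTorch; the environment is replaced by placeholder transitions, and the network sizes, hyperparameters, and the synchronization interval `sync_every` are assumptions rather than the original paper's settings.

```python
import random
import torch
import torch.nn as nn

gamma, batch_size, sync_every = 0.99, 32, 100
state_dim, num_actions = 4, 2

main_net   = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net.load_state_dict(main_net.state_dict())            # w_T = w initially
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)

# Replay buffer filled by some behavior policy; here: random placeholder transitions.
buffer = [(torch.randn(state_dim), random.randrange(num_actions),
           random.random(), torch.randn(state_dim)) for _ in range(1000)]

for step in range(1, 1001):
    batch  = random.sample(buffer, batch_size)                # uniform experience replay
    s      = torch.stack([b[0] for b in batch])
    a      = torch.tensor([b[1] for b in batch]).unsqueeze(1)
    r      = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                                     # target uses the fixed w_T
        y_T = r + gamma * target_net(s_next).max(dim=1).values

    q_sa = main_net(s).gather(1, a).squeeze(1)                # \hat{q}(s, a, w)
    loss = nn.functional.mse_loss(q_sa, y_T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:                                # periodically set w_T <- w
        target_net.load_state_dict(main_net.state_dict())
```

In an actual agent, the buffer would be filled online by an ε-greedy behavior policy interacting with the environment.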
Some Remarks
- The version above is off-policy, because the behavior policy and the target policy are clearly different (or, seen from another angle: we always learn from samples drawn from the replay buffer, which fits the name off-policy)
- Why is there no policy-update step written out above? The reason is the same as for Version 3 earlier: we learn the optimal q-values from a large number of state-action pairs, so no policy needs to be updated during training (the policy is obtained directly by taking the argmax over the q-values, as in the snippet below)
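For completeness, a one-step sketch of that final argmax (the tiny `main_net` here merely stands in for the trained value network):

```python
import torch
import torch.nn as nn

main_net = nn.Linear(4, 2)                       # stands in for the trained \hat{q}(s, ., w)
state = torch.randn(1, 4)                        # a single state, shape (1, state_dim)
greedy_action = main_net(state).argmax(dim=1)    # pi(s) = argmax_a \hat{q}(s, a, w)
print(greedy_action.item())
```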