Deep Q-Network (DQN)

Introduction

Deep Q-learning or deep Q-network (DQN) is one of the earliest and most successful algorithms that introduce deep neural networks into reinforcement learning.

The role of the neural network is to serve as a nonlinear function approximator (as mentioned before). Unlike the following Q-learning update with function approximation,

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t),$$

which performs one incremental update per sample, DQN trains the network by minimizing a loss function.

Loss Function

DQN aims to minimize the objective/loss function

$$J(w) = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right],$$

where $(S, A, R, S')$ are random variables.
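As a concrete illustration, the following is a minimal PyTorch sketch that estimates this objective with a sample mean over a mini-batch of $(s, a, r, s')$ tuples; the `q_net` interface, the tensor shapes, and `gamma` are assumptions made only for this sketch. Note that the same parameter $w$ also appears inside the max term, which is exactly the difficulty discussed in the next sections.

```python
import torch

def dqn_objective_estimate(q_net, s, a, r, s_next, gamma=0.99):
    """Sample-mean estimate of J(w) = E[(R + gamma * max_a q(S', a, w) - q(S, A, w))^2].

    q_net is assumed to map a batch of states to a (batch, num_actions) tensor of q-values.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # q(S, A, w) for the taken actions
    target = r + gamma * q_net(s_next).max(dim=1).values    # R + gamma * max_a q(S', a, w)
    return ((target - q_sa) ** 2).mean()                    # empirical mean approximates the expectation
```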

Understanding the notation $\mathbb{E}[\cdot]$

  • If written as $\mathbb{E}_{X \sim p}[\cdot]$, it denotes the expectation with respect to the distribution $p$
  • If written as $\mathbb{E}[\cdot]$ without a subscript, it denotes the expectation with respect to the distributions of the random variables inside the brackets, i.e., the capital letters such as $S, A, R, S'$
  • The objective above can therefore also be written as $J(w) = \mathbb{E}_{(S,A) \sim d,\; R \sim p(\cdot \mid S,A),\; S' \sim p(\cdot \mid S,A)}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right]$, where $d$ is the distribution of the state-action pair $(S, A)$

Minimize loss function

Core Idea

Temporarily fix the parameters of a previous copy of the network to produce the target, compare it with the output of the current network, compute the Bellman MSE loss between the two, and then update the current network's parameters.

  • To calculate the gradient of the objective function, note that the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in the target $y \triangleq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$
  • Since the optimal action $a^* = \arg\max_{a} \hat{q}(S', a, w)$ also depends on $w$, the gradient of the target is inconvenient to compute
  • We therefore need to fix $w$ inside the target $y$ when calculating the gradient, so that the dependence of $y$ on $w$ disappears
  • Solution: use two networks
    • main network: $\hat{q}(s, a, w)$
    • target network: $\hat{q}(s, a, w_T)$, where $w_T$ is the (temporarily) fixed target network parameter
  • Objective function in this case: $J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right)^2\right]$
  • Gradient of $J$ (a worked comparison of the two gradients is given after this list): $\nabla_w J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)\right]$, up to a constant coefficient
  • Basic idea of DQN: use a gradient-descent algorithm to minimize the objective function
  • The optimization process involves some important techniques (replay buffer, target network, etc.)
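To make the "fix $w$ in the target" step explicit, here is a worked comparison of the two gradients, written as a sketch under the assumption that $\hat{q}$ is differentiable in $w$ (the constant factor $-2$ is the one usually dropped):

```latex
% Differentiating the original objective directly, w appears in both terms:
\nabla_w J(w) = -2\,\mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\big)
                \big(\nabla_w \hat{q}(S, A, w) - \gamma\, \nabla_w \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)\big)\Big]
% The second gradient term is inconvenient because the maximizing action itself changes with w.
% Fixing the parameter inside the target at w_T removes that term:
\nabla_w J(w) = -2\,\mathbb{E}\Big[\big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\big)\,
                \nabla_w \hat{q}(S, A, w)\Big]
```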

Two Networks

  • Two networks, a main network and a target network
  • Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively
    • they are set to be the same initially
  • In every iteration, draw a mini-batch of samples $\{(s, a, r, s')\}$ from the replay buffer
  • For every $(s, a, r, s')$, calculate the desired output as $y_T \triangleq r + \gamma \max_{a \in \mathcal{A}(s')} \hat{q}(s', a, w_T)$. Therefore, we obtain a mini-batch of data $\{(s, a, y_T)\}$
  • Use $\{(s, a, y_T)\}$ to train the main network so as to minimize the MSE loss $(y_T - \hat{q}(s, a, w))^2$ (a single training iteration is sketched below)
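A single iteration of this two-network procedure might look like the following sketch; `q_net`, `target_net`, and the already-drawn mini-batch tensors are assumptions made for illustration:

```python
import torch

def train_step(q_net, target_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One training iteration: build y_T with the fixed target parameters w_T,
    then fit q(s, a, w) to y_T with an MSE loss."""
    with torch.no_grad():                                    # w_T is held fixed: no gradient here
        y_T = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # q(s, a, w) for the sampled actions
    loss = ((y_T - q_sa) ** 2).mean()                        # MSE over the mini-batch {(s, a, y_T)}
    optimizer.zero_grad()
    loss.backward()                                          # gradient flows only through q(s, a, w)
    optimizer.step()                                         # update w; w_T is synced periodically elsewhere
    return loss.item()
```

Periodically (e.g., every fixed number of iterations) $w_T$ is reset to $w$; the full loop is shown in the Implementation section.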

Experience replay

  • After we have collected some experience samples, we do NOT use them in the order they were collected (recall the discussion of why we use TD learning)
  • Instead, we store them in a set called the replay buffer
  • Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer
  • This draw of samples, called experience replay, should follow a uniform distribution (a minimal buffer implementation is sketched below)
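A minimal replay buffer with uniform sampling could look like the following sketch; the capacity and the tuple layout `(s, a, r, s_next)` are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)     # oldest samples are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutively collected samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```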

Why do we use experience replay? Why must the sampling follow a uniform distribution?

Answer:

This approach naturally and mathematically resolves the question of the distribution of $(S, A)$ that appears in $J(w)$:

  • $R \sim p(R \mid S, A)$, $S' \sim p(S' \mid S, A)$: $R$ and $S'$ are determined by the system model
  • $(S, A) \sim d$: the pair $(s, a)$ is an index and is treated as a single random variable
  • The distribution $d$ of the state-action pair is assumed to be uniform
    • Why a uniform distribution? Because we have no prior knowledge about which pairs matter more
    • Can we use a stationary distribution as before? No, since no policy is given here
  • However, the state-action samples are not collected uniformly, because they are generated consecutively by a specific behavior policy
  • Therefore, we store these samples in a replay buffer and then sample them uniformly, which breaks the correlation between consecutive samples

Revisit the tabular case

  • Why does tabular Q-learning not require experience replay?
    • Because it does not involve any distribution of the state-action pair $(S, A)$ or of $R, S'$
    • In the earlier tabular case we always used the RM (Robbins-Monro) approach to avoid computing the expectation directly; each update only needs a single sample, so no distribution was ever involved
  • Why does deep Q-learning involve distributions?
    • It is determined by how our loss function is defined
    • We define a scalar objective function $J(w)$, in which the single parameter $w$ is shared by all $(s, a)$ and the expectation is taken over their distribution
    • The tabular case aims to solve a set of equations (e.g., the BE or BOE), one for every $(s, a)$ (the value of one entry is updated at a time)
    • The deep case aims to optimize a single scalar objective function (the weights are updated for a whole mini-batch at a time)
  • Can we use experience replay in tabular Q-learning?
    • Yes, we can, and it makes the learning more sample-efficient (see the sketch below)
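For instance, a tabular Q-learning update driven by replayed samples might look like the sketch below; the hyperparameters and the buffer layout are illustrative assumptions, and the point is only that the same stored $(s, a, r, s')$ tuples can be reused many times:

```python
import random
import numpy as np

def tabular_q_learning_with_replay(buffer, n_states, n_actions,
                                   gamma=0.9, alpha=0.1, epochs=10):
    """Tabular Q-learning that replays stored samples instead of using them in
    the order they were collected.  `buffer` is a list of (s, a, r, s_next)
    tuples gathered by some behavior policy."""
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):                          # each stored sample is reused many times
        for s, a, r, s_next in random.sample(buffer, len(buffer)):
            target = r + gamma * q[s_next].max()     # TD target from the current table
            q[s, a] += alpha * (target - q[s, a])    # standard tabular Q-learning update
    return q
```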

Implementation

The algorithm here differs slightly from the one in the original DQN paper (NIPS 2013 Deep Learning Workshop).
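Below is a compact end-to-end sketch of the procedure described in these notes (replay buffer + target network + epsilon-greedy behavior policy, hence off-policy). The toy one-hot MDP, the network sizes, and all hyperparameters are made up purely so that the sketch is self-contained and runnable; it is not the exact pseudocode of the original paper.

```python
import random
import torch
import torch.nn as nn

# A toy deterministic MDP (5 one-hot states, 2 actions) used only to make the
# sketch self-contained; replace it with a real environment in practice.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
P = torch.randint(0, N_STATES, (N_STATES, N_ACTIONS))   # next-state table
R = torch.rand(N_STATES, N_ACTIONS)                     # reward table

def one_hot(s):
    x = torch.zeros(N_STATES)
    x[s] = 1.0
    return x

q_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())          # w_T = w initially
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer, s = [], 0

for step in range(2000):
    # behavior policy: epsilon-greedy w.r.t. the main network (off-policy learning)
    if random.random() < 0.1:
        a = random.randrange(N_ACTIONS)
    else:
        with torch.no_grad():
            a = q_net(one_hot(s)).argmax().item()
    r, s_next = R[s, a].item(), P[s, a].item()
    buffer.append((s, a, r, s_next))                     # store the experience sample
    s = s_next

    if len(buffer) >= 64:
        batch = random.sample(buffer, 64)                # uniform draw: experience replay
        bs = torch.stack([one_hot(x[0]) for x in batch])
        ba = torch.tensor([x[1] for x in batch])
        br = torch.tensor([x[2] for x in batch])
        bs2 = torch.stack([one_hot(x[3]) for x in batch])
        with torch.no_grad():                            # y_T = r + gamma * max_a q(s', a, w_T)
            y_T = br + GAMMA * target_net(bs2).max(dim=1).values
        q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
        loss = ((y_T - q_sa) ** 2).mean()                # MSE between y_T and q(s, a, w)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 100 == 0:                                  # periodically sync w_T <- w
        target_net.load_state_dict(q_net.state_dict())

# the greedy policy is read directly from the learned q-values; no separate policy update
policy = [q_net(one_hot(i)).argmax().item() for i in range(N_STATES)]
```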

Some remarks

  • The version above is the off-policy version, because the behavior policy and the target policy are clearly different (or, seen from another angle: we always learn from samples drawn from the replay buffer, which matches the name "off-policy")
  • Why is there no policy-update step written above? The reason is the same as for Version 3 of Q-learning discussed before: we learn the optimal q-values from a large number of state-action pairs, so there is no need to update a policy during training (the greedy policy is obtained directly by taking the argmax of the learned q-values)