Q-learning with Value Function Approximation

Introduction

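The Q-learning algorithm with value function approximation is

$$
w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t).
$$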
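As a rough illustration, one update step can be sketched as follows: the parameter vector moves along the gradient of the approximate action value, scaled by the TD error whose target takes the maximum over next actions. The linear form of the approximator, the one-hot feature map `one_hot_phi`, the function names, and the values of `alpha` and `gamma` are all assumptions made for this sketch, not part of the original notes. Terminal states are not handled.

```python
def q_hat(w, features):
    """Linear approximation (assumed): q_hat(s, a, w) = phi(s, a) . w"""
    return sum(wi * xi for wi, xi in zip(w, features))


def one_hot_phi(s, a, n_states=2, n_actions=2):
    """Hypothetical feature map: one-hot encoding of the (state, action) pair."""
    x = [0.0] * (n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x


def q_learning_vfa_step(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new weight vector."""
    # TD target: reward plus discounted maximum of q_hat over next actions.
    max_next = max(q_hat(w, phi(s_next, b)) for b in actions)
    td_error = r + gamma * max_next - q_hat(w, phi(s, a))
    # For a linear approximator, the gradient w.r.t. w is phi(s, a) itself.
    grad = phi(s, a)
    return [wi + alpha * td_error * gi for wi, gi in zip(w, grad)]


# One transition (s=0, a=1, r=1.0, s'=1), starting from zero weights.
w = [0.0] * 4
w = q_learning_vfa_step(w, one_hot_phi, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(w)
```

With one-hot features this reduces to the tabular update: only the weight for the visited pair (s=0, a=1) changes.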

Core Idea

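Replace $\hat{q}(s_{t+1}, a_{t+1}, w_t)$ with $\max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t)$ in the TD target of Sarsa with value function approximation (VF Sarsa).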
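The only difference between the two algorithms is the TD target, which a small sketch can make concrete. The function names, the dictionary `q_next` of estimated action values at the next state, and the numbers are made up for illustration.

```python
def sarsa_target(r, gamma, q_next, a_next):
    """Sarsa: bootstrap from the action actually taken at the next state."""
    return r + gamma * q_next[a_next]


def q_learning_target(r, gamma, q_next):
    """Q-learning: bootstrap from the best action at the next state."""
    return r + gamma * max(q_next.values())


# Hypothetical estimates of q_hat(s_{t+1}, a, w_t) for each action a.
q_next = {"left": 0.5, "right": 2.0}

# Suppose the behavior policy happened to pick "left" at the next state.
y_sarsa = sarsa_target(r=1.0, gamma=0.9, q_next=q_next, a_next="left")
y_q = q_learning_target(r=1.0, gamma=0.9, q_next=q_next)
print(y_sarsa, y_q)
```

The two targets coincide only when the next action taken is the one with the largest estimated value.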

Implementation

Compared with Version 1 and Version 3

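Some Remarks

Professor Zhao's slides describe the algorithm above as on-policy, but I do not think that is right. I believe it is off-policy, for the following reasons: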

  • behavior policy: $\epsilon$-greedy
  • target policy: greedy (the $\max$ over actions in the TD target)

The two policies are different, so the algorithm is off-policy.
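To illustrate the distinction drawn here, the sketch below writes the two policies as separate functions over a list of estimated action values. The function names, the value list, and the seed are assumptions for illustration only. The ε-greedy rule used is the common variant that picks a uniformly random action with probability ε and the greedy action otherwise.

```python
import random


def greedy_action(q_values):
    """Target policy (as argued above): always pick the highest-valued action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
    """Behavior policy: explore uniformly with probability epsilon, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


# Hypothetical action-value estimates for a single state.
q_values = [0.1, 0.3, 0.9]
rng = random.Random(0)

# Sample the behavior policy many times and compare with the greedy choice.
samples = [epsilon_greedy_action(q_values, epsilon=0.5, rng=rng) for _ in range(1000)]
non_greedy = sum(1 for a in samples if a != greedy_action(q_values))
print(greedy_action(q_values), non_greedy)
```

Whenever ε > 0, some sampled actions differ from the greedy one, so the policy that generates the data is not the policy evaluated in the TD target.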