Reinforcement Learning Notes

Sep 30, 2025

Q-learning with Value Function

Introduction

The Q-learning algorithm with value function approximation is

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a' \in A(s_{t+1})} \hat{q}(s_{t+1}, a', w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$
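As a rough illustration of the update rule above, here is a minimal sketch of a single step. It assumes a linear approximator $\hat{q}(s, a, w) = w^\top \phi(s, a)$, so the gradient is simply $\phi(s, a)$. The feature map `one_hot` and the helper names `q_hat` and `q_learning_update` are hypothetical and not part of the original notes.

```python
import numpy as np

def q_hat(w, phi, s, a):
    """Linear approximation (an assumption): q_hat(s, a, w) = w . phi(s, a)."""
    return float(np.dot(w, phi(s, a)))

def q_learning_update(w, phi, s, a, r, s_next, actions, alpha, gamma):
    """Return w_{t+1} after one Q-learning step with a linear q_hat."""
    # The TD target uses the maximum approximate value at the next state.
    target = r + gamma * max(q_hat(w, phi, s_next, b) for b in actions)
    td_error = target - q_hat(w, phi, s, a)
    # For a linear q_hat, the gradient with respect to w is phi(s, a).
    return w + alpha * td_error * phi(s, a)

def one_hot(s, a, n_actions=2, dim=4):
    """Hypothetical one-hot features over 2 states x 2 actions."""
    v = np.zeros(dim)
    v[s * n_actions + a] = 1.0
    return v

w0 = np.zeros(4)
w1 = q_learning_update(w0, one_hot, s=0, a=1, r=1.0, s_next=1,
                       actions=[0, 1], alpha=0.5, gamma=0.9)
print(w1)
```

With all weights starting at zero, only the component for the pair $(s_t, a_t)$ moves, by $\alpha$ times the TD error.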

Core Idea

Replace $\hat{q}(s_{t+1}, a_{t+1}, w_t)$ with $\max_{a' \in A(s_{t+1})} \hat{q}(s_{t+1}, a', w_t)$ in VF Sarsa.
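To make the substitution concrete, the following sketch compares the two TD targets for the same transition. The numbers in `q_next` (standing in for $\hat{q}(s_{t+1}, \cdot, w_t)$) and the function names are made up for illustration.

```python
def sarsa_target(r, gamma, q_next, a_next):
    """Sarsa bootstraps from the action actually selected at s_{t+1}."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    """Q-learning bootstraps from the best action at s_{t+1}."""
    return r + gamma * max(q_next)

# Hypothetical approximate action values at s_{t+1}, one entry per action.
q_next = [1.0, 3.0, 2.0]

print(sarsa_target(0.0, 0.9, q_next, a_next=0))
print(q_learning_target(0.0, 0.9, q_next))
```

The two targets coincide only when the selected action $a_{t+1}$ happens to be the maximizing one.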

Implementation

Compared with Version 1 and Version 3

Algorithm: Q-learning with Function Approximation and ε-Greedy

Initialization: initial parameters $w_0$, policy $\pi_0(a \mid s)$, step size $\alpha \in (0, 1]$, discount factor $\gamma$.

For each episode, do:
    Initialize state $s_0$.
    Select action $a_0$ using $\pi_0(s_0)$.
    While $s_t$ is not the target state, do:
        Execute $a_t$, observe $r_{t+1}$, $s_{t+1}$.
        Select $a_{t+1}$ using $\pi_t(s_{t+1})$.
        Update parameters (update q-value):
            $$w_{t+1} \leftarrow w_t + \alpha \left[ r_{t+1} + \gamma \max_{a' \in A(s_{t+1})} \hat{q}(s_{t+1}, a', w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$
        Update policy for $s_t$:
            $$\pi_{t+1}(a \mid s_t) \leftarrow \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s_t)|}, & \text{if } a = \arg\max_{a'} \hat{q}(s_t, a', w_{t+1}) \\ \dfrac{\varepsilon}{|A(s_t)|}, & \text{otherwise} \end{cases}$$
        $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
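The loop above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not in the original notes: a hypothetical 5-state chain environment (`env_step`) with a reward of -1 per step, one-hot features `phi` (so the linear approximator reduces to the tabular case), and arbitrary values for `ALPHA`, `GAMMA`, and `EPS`.

```python
import numpy as np

N_STATES, N_ACTIONS, TARGET = 5, 2, 4      # hypothetical 5-state chain
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1          # assumed hyperparameters
rng = np.random.default_rng(0)

def phi(s, a):
    """One-hot feature vector for the pair (s, a)."""
    v = np.zeros(N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v

def q_hat(s, a, w):
    """Linear approximator: q_hat(s, a, w) = w . phi(s, a)."""
    return float(w @ phi(s, a))

def env_step(s, a):
    """Action 0 moves left, action 1 moves right; every step costs -1."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return -1.0, s_next

w = np.zeros(N_STATES * N_ACTIONS)
pi = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)   # pi_0: uniform

for episode in range(500):
    s = 0
    a = rng.choice(N_ACTIONS, p=pi[s])
    while s != TARGET:
        r, s_next = env_step(s, a)
        a_next = rng.choice(N_ACTIONS, p=pi[s_next])   # drawn from pi_t
        # Q-learning target: max over a', not q_hat(s_next, a_next, w).
        target = r + GAMMA * max(q_hat(s_next, b, w) for b in range(N_ACTIONS))
        # The gradient of a linear q_hat with respect to w is phi(s, a).
        w = w + ALPHA * (target - q_hat(s, a, w)) * phi(s, a)
        # Epsilon-greedy policy update for s_t, using the new parameters.
        greedy = int(np.argmax([q_hat(s, b, w) for b in range(N_ACTIONS)]))
        pi[s] = EPS / N_ACTIONS
        pi[s, greedy] += 1.0 - EPS
        s, a = s_next, a_next

# Greedy action in each non-target state after training.
greedy_actions = [int(np.argmax([q_hat(s, b, w) for b in range(N_ACTIONS)]))
                  for s in range(TARGET)]
print(greedy_actions)
```

Note that `a_next` is sampled but never enters the target; it only determines which action is executed next, which is the one place where this loop differs from VF Sarsa.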

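Some Remarks

Zhao's lecture slides describe the algorithm above as on-policy, but I do not think that is right. I believe it is off-policy, for the following reason:

behavior policy: $\varepsilon$-greedy with respect to $\arg\max \hat{q}$

target policy: greedy, $\arg\max \hat{q}$

The behavior policy that generates the samples and the target policy that appears in the update (through the $\max$) are different, so the algorithm is off-policy.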
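The difference between the two policies can be seen numerically in a small sketch. The action values in `q` are made up, and the helper names are hypothetical.

```python
import numpy as np

def greedy_probs(q_values):
    """Target policy: all probability mass on the arg max action."""
    p = np.zeros(len(q_values))
    p[int(np.argmax(q_values))] = 1.0
    return p

def epsilon_greedy_probs(q_values, eps):
    """Behavior policy: eps/|A| on every action, plus 1 - eps on the arg max."""
    n = len(q_values)
    p = np.full(n, eps / n)
    p[int(np.argmax(q_values))] += 1.0 - eps
    return p

# Hypothetical approximate action values for one state.
q = [0.2, 1.5, -0.3]
target = greedy_probs(q)
behavior = epsilon_greedy_probs(q, eps=0.1)
print(target)
print(behavior)
```

For any $\varepsilon > 0$ the behavior policy assigns nonzero probability to non-greedy actions, while the target policy never does.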
