Reinforcement Learning Notes


Apr 01, 2025

Sarsa with Value Function

Introduction

So far, we have only considered state value estimation, that is,

$$\hat{v}(s) \approx v_{\pi}(s), \quad s \in \mathcal{S}$$

To search for optimal policies, we need to estimate action values.

The Sarsa algorithm with value function approximation is

$$w_{t+1} = w_{t} + \alpha_{t} \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_{t}) - \hat{q}(s_{t}, a_{t}, w_{t}) \right] \nabla_{w} \hat{q}(s_{t}, a_{t}, w_{t})$$

This is the same as the algorithm introduced previously in this lecture, except that $\hat{v}$ is replaced by $\hat{q}$.
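As a minimal sketch of a single update step, assume a linear approximator $\hat{q}(s, a, w) = \phi(s, a)^{\top} w$, so that the gradient $\nabla_{w} \hat{q}(s, a, w)$ is simply the feature vector $\phi(s, a)$. The linear form and the names `sarsa_update`, `phi_sa`, and `phi_next` are assumptions made for illustration; they do not appear in the notes.

```python
import numpy as np

def sarsa_update(w, phi_sa, reward, phi_next, alpha, gamma):
    """One Sarsa step for a linear q_hat(s, a, w) = phi(s, a) @ w (assumed form)."""
    # TD error: r_{t+1} + gamma * q_hat(s_{t+1}, a_{t+1}, w_t) - q_hat(s_t, a_t, w_t)
    td_error = reward + gamma * (phi_next @ w) - (phi_sa @ w)
    # For a linear approximator, grad_w q_hat(s_t, a_t, w_t) equals phi(s_t, a_t)
    return w + alpha * td_error * phi_sa

# Hypothetical 3-dimensional features for (s_t, a_t) and (s_{t+1}, a_{t+1})
w = np.zeros(3)
phi_sa = np.array([1.0, 0.0, 0.0])
phi_next = np.array([0.0, 1.0, 0.0])
w_new = sarsa_update(w, phi_sa, reward=1.0, phi_next=phi_next, alpha=0.5, gamma=0.9)
```

With all weights at zero the TD error equals the reward, so only the component of `w` that is active in `phi_sa` moves, by `alpha * reward`.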

Core Idea

Replace $\hat{v}(s, w)$ with $\hat{q}(s, a, w)$ in the value function approximation algorithm for state values.
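One way to picture this replacement, again assuming a linear approximator: the state features $\phi(s)$ are copied into the block that belongs to action $a$, giving state-action features $\phi(s, a)$, so a single $w$ holds one block of weights per action. The helper `state_action_features` and this block layout are hypothetical choices, not something the notes prescribe.

```python
import numpy as np

def state_action_features(phi_s, action, num_actions):
    """Place phi(s) in the block for `action`; all other blocks stay zero."""
    d = phi_s.shape[0]
    phi_sa = np.zeros(d * num_actions)
    phi_sa[action * d:(action + 1) * d] = phi_s
    return phi_sa

phi_s = np.array([1.0, 2.0])                    # hypothetical state features
w = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])    # 3 actions x 2 features
# q_hat(s, a, w) = phi(s, a) @ w, evaluated for every action
q_values = [state_action_features(phi_s, a, 3) @ w for a in range(3)]
```

Under this layout, $\hat{q}(s, a, w)$ is the inner product of $\phi(s)$ with the weight block of action $a$, which is why the state value update carries over with only the features changed.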

Implementation

Compare this with the version in "Implementation" to see the changes made here.

Algorithm: Sarsa with function approximation and ε-greedy policies
Initialization: initial parameters $w_{0}$, initial policy $\pi_{0}(a \mid s)$, step size $\alpha \in (0, 1]$, discount factor $\gamma$.
For each episode, do:
  Initialize state $s_{0}$. Select action $a_{0}$ using $\pi_{0}(s_{0})$.
  While $s_{t}$ is not the target state, do:
    Execute $a_{t}$, observe $r_{t+1}, s_{t+1}$.
    Select $a_{t+1}$ using $\pi_{t}(s_{t+1})$.
    Update parameters (update q-value):
    $$w_{t+1} \leftarrow w_{t} + \alpha \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_{t}) - \hat{q}(s_{t}, a_{t}, w_{t}) \right] \nabla_{w} \hat{q}(s_{t}, a_{t}, w_{t})$$
    Update policy for $s_{t}$:
    $$\pi_{t+1}(a \mid s_{t}) \leftarrow \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s_{t})|}, & \text{if } a = \arg\max_{a'} \hat{q}(s_{t}, a', w_{t+1}) \\ \dfrac{\varepsilon}{|\mathcal{A}(s_{t})|}, & \text{otherwise} \end{cases}$$
    $s_{t} \leftarrow s_{t+1}$, $a_{t} \leftarrow a_{t+1}$
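The loop above can be sketched in Python as follows. Everything environment-specific is an assumption made for illustration: a hypothetical 5-state chain with the target at the right end, a reward of -1 per step (0 on reaching the target), and one-hot features $\phi(s, a)$, which make the linear $\hat{q}$ equivalent to a table. Instead of storing $\pi_{t}$ explicitly, the sketch recomputes the ε-greedy distribution from the current $w$ whenever an action is needed; with one-hot features this matches the per-state policy update, but for general features it is a simplification.

```python
import numpy as np

N_STATES, N_ACTIONS, TARGET = 5, 2, 4   # hypothetical chain: action 0 = left, 1 = right

def features(s, a):
    """One-hot phi(s, a), so q_hat(s, a, w) = phi(s, a) @ w is linear in w."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def q_hat(s, a, w):
    return features(s, a) @ w

def step(s, a):
    """Toy dynamics: move left or right; reward -1 per step, 0 on reaching the target."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return (0.0 if s_next == TARGET else -1.0), s_next

def epsilon_greedy(s, w, eps, rng):
    """Sample from the epsilon-greedy policy induced by q_hat(s, ., w)."""
    probs = np.full(N_ACTIONS, eps / N_ACTIONS)
    greedy = int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))
    probs[greedy] += 1.0 - eps
    return int(rng.choice(N_ACTIONS, p=probs))

def sarsa_fa(episodes=300, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, w, eps, rng)
        while s != TARGET:
            r, s_next = step(s, a)
            # a_{t+1} is chosen with the parameters from before the update
            a_next = epsilon_greedy(s_next, w, eps, rng)
            # Weights for the target state are never updated, so q_hat there stays 0
            td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
            w = w + alpha * td_error * features(s, a)   # gradient is phi(s_t, a_t)
            s, a = s_next, a_next
    return w

w = sarsa_fa()
greedy_actions = [int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))
                  for s in range(TARGET)]
```

On this toy chain one would expect `greedy_actions` to point right, toward the target, in most states, although that is not guaranteed for every seed or hyperparameter choice.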


