Reward

Reward is one of the concepts most distinctive to RL:

  • a real number the agent receives after taking an action
  • a positive reward represents encouragement
  • a negative reward represents punishment

What does a zero reward mean? Can a positive reward mean punishment?

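a1. A zero reward (no punishment) is still encouragement to some degree.

a2. Nothing requires a reward to be positive in order to signify encouragement; the sign is only a mathematical convention. If positive numbers represent punishment, the agent minimizes; if positive numbers represent encouragement, the agent maximizes.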
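The sign convention in a2 can be checked with a small sketch. The action names and reward values below are made up for illustration; the point is only that maximizing a reward and minimizing its negation (a cost) select the same action.

```python
# Hypothetical rewards for three actions in one state (illustrative values).
rewards = {"left": -1.0, "stay": 0.0, "right": 1.0}

# Convention 1: larger numbers mean encouragement, so the agent maximizes.
best_by_reward = max(rewards, key=rewards.get)

# Convention 2: larger numbers mean punishment (a cost), so the agent minimizes.
costs = {action: -r for action, r in rewards.items()}
best_by_cost = min(costs, key=costs.get)

print(best_by_reward, best_by_cost)  # both conventions pick "right"
```

Either convention works as long as the optimization direction matches it.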

Reward can be interpreted as a human-machine interface: it is how we tell the agent what behavior we want.

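Reward matters even more for today's large models: GRPO in DeepSeek-R1 is a good example of using reinforcement learning to align an LLM with the capabilities humans want it to have.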

  • Intuition: at state $s$, taking action $a$, the reward is $-1$
  • Math: $p(r = -1 \mid s, a) = 1$ and $p(r \neq -1 \mid s, a) = 0$
  • Remarks:
    • this is a deterministic case; in general the reward transition can be stochastic
    • for example, if you study hard you will get a reward, but how much is uncertain
    • the reward depends on the current state and action, not on the next state
      • one state-action pair gets a reward of $-1$
      • another gets a reward of $0$, even if both lead to the same next state
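
The remarks above can be sketched by storing the reward model as a table of conditional probabilities $p(r \mid s, a)$. The state and action names (`s1`, `up`, `stay`, `student`, `study_hard`) and all probabilities are hypothetical, chosen only to show deterministic and stochastic entries side by side. Note that the table is keyed by state and action only; the next state never appears.

```python
import random

# Reward model p(r | s, a): maps (state, action) to {reward: probability}.
# All names and numbers are illustrative assumptions.
reward_model = {
    ("s1", "up"): {-1.0: 1.0},    # deterministic: always -1
    ("s1", "stay"): {0.0: 1.0},   # deterministic: always 0
    ("student", "study_hard"): {1.0: 0.5, 2.0: 0.3, 5.0: 0.2},  # stochastic
}

def is_deterministic(state, action):
    """True when a single reward value carries all the probability mass."""
    return max(reward_model[(state, action)].values()) == 1.0

def expected_reward(state, action):
    """Expected immediate reward: sum over r of r * p(r | s, a)."""
    return sum(r * p for r, p in reward_model[(state, action)].items())

def sample_reward(state, action, rng=random):
    """Draw one reward from p(r | s, a)."""
    dist = reward_model[(state, action)]
    return rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
```

Calling `sample_reward("s1", "up")` always returns the same value, while `sample_reward("student", "study_hard")` varies from call to call, which is the "study hard" remark in code form.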