Reward
is one of the most distinctive concepts in RL
- a real number the agent receives after taking an action
- a positive reward represents encouragement
- a negative reward represents punishment
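What does a zero reward mean? Can a positive reward mean punishment?
a1. A zero reward (no punishment) is encouragement to some degree.
a2. There is no strict rule that a reward must be positive to signal encouragement; the sign is only a mathematical convention. If positive numbers represent punishment, the agent minimizes; if positive numbers represent encouragement, the agent maximizes.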
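The sign convention in a2 can be checked with a small sketch. The action names and reward values below are made up for illustration; the only point is that maximizing a reward and minimizing its negation (a cost) select the same action.

```python
# Hypothetical rewards for three actions in one state (illustrative values only).
rewards = {"left": -1.0, "stay": 0.0, "right": 1.0}

# Convention 1: positive means encouragement, so the agent maximizes reward.
best_by_reward = max(rewards, key=rewards.get)

# Convention 2: positive means punishment, so the agent minimizes cost = -reward.
costs = {action: -r for action, r in rewards.items()}
best_by_cost = min(costs, key=costs.get)

print(best_by_reward, best_by_cost)
```

Both conventions pick the same action here. Between "left" (-1) and "stay" (0) alone, "stay" would be preferred, which is the sense in which a zero reward is a mild encouragement.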
Reward, interpreted as a human-machine interface
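Reward has become even more important for today's large models: GRPO in DeepSeek-R1 is a good example of using reinforcement learning to align an LLM with the capabilities humans want it to have.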
- Intuition: at a given state, taking a given action yields a reward of -1
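- Math: $p(r = -1 \mid s, a) = 1$ and $p(r \neq -1 \mid s, a) = 0$ for the state $s$ and action $a$ in question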
- Remarks:
- this is a deterministic case; in general the reward transition can be stochastic
- e.g., if you study hard you will get a reward, but how much is uncertain
- the reward depends on the current state and action, not on the next state
- one state-action pair gets a reward of -1
- another state-action pair gets a reward of 0
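The remarks above can be sketched as a small reward model. This is a minimal illustration, assuming the reward is described by a conditional distribution p(r | s, a); the dictionary `reward_model`, the helper functions, and all state names, action names, and numbers are hypothetical and not taken from the notes.

```python
import random

# Hypothetical reward model p(r | s, a): maps (state, action) to {reward: probability}.
# All names and numbers are illustrative assumptions.
reward_model = {
    ("s", "a_bad"): {-1.0: 1.0},              # deterministic: always -1
    ("s", "a_ok"): {0.0: 1.0},                # deterministic: always 0
    ("study", "hard"): {1.0: 0.5, 2.0: 0.5},  # stochastic: positive, amount uncertain
}

def sample_reward(state, action, rng):
    """Draw one reward from p(r | state, action); the next state is not an input."""
    dist = reward_model[(state, action)]
    values = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(values, weights=probs, k=1)[0]

def expected_reward(state, action):
    """Expected reward: sum over r of r * p(r | state, action)."""
    return sum(r * p for r, p in reward_model[(state, action)].items())

rng = random.Random(0)
print(sample_reward("s", "a_bad", rng))   # always -1.0
print(expected_reward("study", "hard"))   # 1.5
```

The lookup key is the (state, action) pair only, so two actions taken in the same state can receive different rewards regardless of where the agent ends up.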