Last updated on August 25, 2025
A simple memo (cheat sheet) for learning Reinforcement Learning.
Terminologies
- State $s$, Action $a$, Reward $r$
- Policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$
  $$\pi(a \mid s) = P(A = a \mid S = s)$$
- Trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$
- Discounted return
  $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$
- Action-value function
  $$Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid S_t = s_t, A_t = a_t\right]$$
- Optimal action-value function
  $$Q^\star(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)$$
- State-value function
  $$V_\pi(s_t) = \mathbb{E}_{A}\left[Q_\pi(s_t, A)\right] = \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$
- $\mathbb{E}_S\left[V_\pi(S)\right]$ evaluates how good the policy $\pi$ is.
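As a quick illustration of the discounted return defined above, a minimal Python sketch (the reward list and $\gamma$ here are made-up examples):

```python
# Discounted return U_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...,
# computed backwards over a finite trajectory of rewards.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # ≈ [2.62, 1.8, 2.0]
```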
Value-based Learning
- Goal: Approximate the Q function to maximize the total reward.
Temporal Difference (TD) Learning
$$Q(s_t, a_t; w) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; w)$$
- Prediction: $Q(s_t, a_t; w_t)$
- TD target:
  $$y_t = r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; w_t)$$
- Loss:
  $$L_t = \frac{1}{2}\left[Q(s_t, a_t; w) - y_t\right]^2$$
- Update (gradient descent):
  $$w_{t+1} = w_t - \alpha \cdot \left.\frac{\partial L_t}{\partial w}\right|_{w = w_t}$$
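A minimal numpy sketch of one TD update step, assuming a hypothetical linear model $Q(s, a; w) = w^\top \phi(s, a)$ with a one-hot feature map (the sizes, hyperparameters, and transition are illustrative placeholders):

```python
import numpy as np

def phi(s, a, n_states=4, n_actions=2):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def td_step(w, s, a, r, s_next, a_next, gamma=0.99, alpha=0.1):
    q_pred = w @ phi(s, a)                     # prediction Q(s_t, a_t; w)
    y = r + gamma * (w @ phi(s_next, a_next))  # TD target y_t (treated as a constant)
    grad = (q_pred - y) * phi(s, a)            # ∂L_t/∂w with L_t = 1/2·(Q − y_t)²
    return w - alpha * grad                    # gradient-descent update

w = np.zeros(4 * 2)
w = td_step(w, s=0, a=1, r=1.0, s_next=2, a_next=0)
```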
SARSA
- Goal: Learn the action-value function $Q_\pi$.
Tabular Version
- Goal: Directly learn $Q_\pi(s, a)$.
- Algorithm:
  - Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  - Sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
  - TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$
  - TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$
  - Update: $Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$
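A minimal sketch of tabular SARSA on a toy chain MDP; the ε-greedy behaviour policy, the environment, and all hyperparameters are illustrative choices rather than anything prescribed above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.1, 0.1

def eps_greedy(s):
    """ε-greedy behaviour policy with random tie-breaking."""
    if rng.random() < eps or Q[s, 0] == Q[s, 1]:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

def env_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1), s2 == n_states - 1  # (s', r, done)

for _ in range(200):                              # episodes
    s, a, done = 0, eps_greedy(0), False
    while not done:
        s2, r, done = env_step(s, a)
        a2 = eps_greedy(s2)                       # sample a_{t+1} ~ π(·|s_{t+1})  (on-policy)
        y = r + gamma * Q[s2, a2] * (not done)    # TD target
        Q[s, a] -= alpha * (Q[s, a] - y)          # Q(s_t,a_t) ← Q(s_t,a_t) − α·δ_t
        s, a = s2, a2

print(np.round(Q, 2))
```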
Neural Network Version
- Goal: Approximate $Q_\pi(s, a)$ by the value network $q(s, a; w)$.
- Algorithm:
  - Observe a transition $(s_t, a_t, r_t, s_{t+1})$ and sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
  - TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; w)$
  - TD error: $\delta_t = q(s_t, a_t; w) - y_t$
  - Update: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; w)}{\partial w}$
Q-Learning
- Goal: Learn the optimal action-value function $Q^\star$.
Tabular Version
- Goal: Directly learn $Q^\star(s, a)$.
- Algorithm:
  - Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  - TD target: $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$
  - TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$
  - Update: $Q^\star(s_t, a_t) \leftarrow Q^\star(s_t, a_t) - \alpha \cdot \delta_t$
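Compared with SARSA, only the target changes: the sampled next action is replaced by a max over actions. A minimal sketch of one tabular update (the table shape, hyperparameters, and transition are placeholders):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, gamma=0.99, alpha=0.1):
    """One tabular Q-learning step on a Q table of shape (n_states, n_actions)."""
    y = r + gamma * Q[s_next].max() * (not done)  # TD target uses max_a Q(s_{t+1}, a)
    delta = Q[s, a] - y                           # TD error
    Q[s, a] -= alpha * delta                      # move Q(s_t, a_t) toward the target
    return Q

Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=0.0, s_next=1, done=False)
```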
DQN Version
- Goal: Approximate $Q^\star(s, a)$ by the DQN $Q(s, a; w)$.
- Policy: Choose $a_t = \arg\max_a Q(s_t, a; w)$.
- Algorithm:
  - Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  - TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; w)$
  - TD error: $\delta_t = Q(s_t, a_t; w) - y_t$
  - Update: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}$
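A minimal PyTorch sketch of one DQN gradient step on a single transition; the network size, optimizer, and inputs are illustrative, and a practical DQN would add pieces not listed above (replay buffer, target network):

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    q_sa = q_net(s)[a]                                      # Q(s_t, a_t; w)
    with torch.no_grad():                                   # TD target is held fixed
        y = r + gamma * q_net(s_next).max() * (1.0 - done)
    loss = 0.5 * (q_sa - y) ** 2                            # 1/2·δ_t²
    opt.zero_grad()
    loss.backward()                                         # gradient δ_t·∂Q/∂w
    opt.step()

dqn_step(torch.rand(n_obs), a=1, r=1.0, s_next=torch.rand(n_obs), done=0.0)
```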
Policy-based Learning
- Goal: Learn $\theta$ that maximizes $J(\theta) = \mathbb{E}_S\left[V(S; \theta)\right]$.
- Idea: Approximate the policy function $\pi(a \mid s)$ by a policy network $\pi(a \mid s; \theta)$.
- Learn the policy network by policy gradient.
Policy Gradient
$$\frac{\partial V(s_t; \theta)}{\partial \theta} = \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \theta)}{\partial \theta} \cdot Q_\pi(s_t, A_t)\right]$$
- Policy gradient with baseline: suppose $b$ does not depend on $A_t$; then (a quick numerical check is sketched below)
  $$\begin{aligned}\frac{\partial V(s_t; \theta)}{\partial \theta} &= \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \theta)}{\partial \theta} \cdot \big(Q_\pi(s_t, A_t) - b\big)\right] \\ &= \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \theta)}{\partial \theta} \cdot \big(Q_\pi(s_t, A_t) - V_\pi(s_t)\big)\right]\end{aligned}$$
  where the second line takes $b = V_\pi(s_t)$.
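The baseline claim can be checked numerically: for any $b$ that does not depend on $A_t$, $\mathbb{E}_{A \sim \pi}\left[\partial \ln \pi(A \mid s; \theta) / \partial \theta\right] = 0$, so the $b$ term contributes nothing to the gradient. A small PyTorch sketch with a made-up softmax policy over four actions:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)      # logits of a softmax policy π(a; θ)
probs = torch.softmax(theta, dim=0)

# Σ_a π(a)·∂log π(a)/∂θ = ∂(Σ_a π(a))/∂θ = ∂1/∂θ = 0, so multiplying by any
# constant baseline b leaves the expected policy gradient unchanged.
total = torch.zeros_like(theta)
for a in range(4):
    grad_log_pi = torch.autograd.grad(torch.log(probs[a]), theta, retain_graph=True)[0]
    total += probs[a].detach() * grad_log_pi
print(total)  # ≈ tensor([0., 0., 0., 0.])
```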
REINFORCE
- Goal: Approximate $Q_\pi(s_t, A_t)$ by the observed return $u_t$ and $V_\pi(s_t)$ by the value network $v(s_t; w)$.
- Algorithm:
  1. Play a game to the end and observe the trajectory:
     $$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n$$
  2. Compute $u_t = \sum_{i=t}^{n} \gamma^{\,i-t} \cdot r_i$ and $\delta_t = v(s_t; w) - u_t$
  3. Update the policy network by:
     $$\theta \leftarrow \theta - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}$$
  4. Update the value network by:
     $$w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; w)}{\partial w}$$
  5. Repeat steps 2 to 4 for $t = 1, \ldots, n$
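A minimal PyTorch sketch of REINFORCE with baseline over one finished episode, using plain SGD so the per-timestep updates match steps 2 to 4 above; the two tiny networks, learning rates, and the toy trajectory are placeholders:

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, n_actions))
value = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, 1))
opt_pi = torch.optim.SGD(policy.parameters(), lr=1e-2)  # step size β
opt_v = torch.optim.SGD(value.parameters(), lr=1e-2)    # step size α

def reinforce_update(states, actions, rewards):
    returns, u = [], 0.0
    for r in reversed(rewards):                 # u_t = Σ_{i=t}^n γ^{i−t}·r_i
        u = r + gamma * u
        returns.append(u)
    returns.reverse()
    for s, a, u_t in zip(states, actions, returns):
        v = value(s).squeeze()
        delta = (v - u_t).detach()              # δ_t = v(s_t; w) − u_t
        logp = torch.log_softmax(policy(s), dim=-1)[a]
        opt_pi.zero_grad(); (delta * logp).backward(); opt_pi.step()        # θ ← θ − β·δ_t·∂lnπ/∂θ
        opt_v.zero_grad(); (0.5 * (v - u_t) ** 2).backward(); opt_v.step()  # w ← w − α·δ_t·∂v/∂w

states = [torch.rand(n_obs) for _ in range(3)]
reinforce_update(states, actions=[0, 1, 0], rewards=[0.0, 0.0, 1.0])
```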
Actor-Critic
- Goal: Approximate the policy function $\pi(a \mid s)$ by the policy network $\pi(a \mid s; \theta)$ and the state-value function $V_\pi(s)$ by the value network $v(s; w)$.
- Actor: Update the policy network $\pi(a \mid s; \theta)$ by policy gradient to increase $V(s; \theta, w)$.
- Critic: Update the value network $v(s; w)$ by TD learning to better estimate the return.
- Algorithm:
  - Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  - TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; w)$
  - TD error: $\delta_t = v(s_t; w) - y_t$
  - Update the policy network (actor) by:
    $$\theta \leftarrow \theta - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}$$
  - Update the value network (critic) by:
    $$w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; w)}{\partial w}$$
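A minimal PyTorch sketch of one actor-critic step on a single transition, again with plain SGD so the updates match the two rules above; network shapes, learning rates, and the fake transition are placeholders:

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, 1))
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-2)    # β
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-2)  # α

def actor_critic_step(s, a, r, s_next, done):
    v = critic(s).squeeze()
    with torch.no_grad():
        y = r + gamma * critic(s_next).squeeze() * (1.0 - done)  # TD target y_t
    delta = (v - y).detach()                                     # TD error δ_t
    logp = torch.log_softmax(actor(s), dim=-1)[a]
    opt_actor.zero_grad(); (delta * logp).backward(); opt_actor.step()          # θ ← θ − β·δ_t·∂lnπ/∂θ
    opt_critic.zero_grad(); (0.5 * (v - y) ** 2).backward(); opt_critic.step()  # w ← w − α·δ_t·∂v/∂w

actor_critic_step(torch.rand(n_obs), a=1, r=0.5, s_next=torch.rand(n_obs), done=0.0)
```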
TRPO (Trust Region Policy Optimization)
$$J(\theta) = \mathbb{E}_S\left[V_\pi(S)\right] = \mathbb{E}_S\left[\mathbb{E}_{A \sim \pi(\cdot \mid S;\, \theta_{\text{old}})}\left[\frac{\pi(A \mid S; \theta)}{\pi(A \mid S; \theta_{\text{old}})} \cdot Q_\pi(S, A)\right]\right]$$
- Algorithm:
  - Controlled by the policy $\pi(\cdot \mid s; \theta_{\text{old}})$, the agent plays a game to the end and observes a trajectory:
    $$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n$$
  - For $i = 1, 2, \ldots, n$, compute the discounted returns $u_i = \sum_{k=i}^{n} \gamma^{\,k-i} \cdot r_k$
  - Approximation:
    $$\tilde{L}(\theta \mid \theta_{\text{old}}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid s_i; \theta)}{\pi(a_i \mid s_i; \theta_{\text{old}})} \cdot u_i$$
  - Maximization:
    $$\theta_{\text{new}} \leftarrow \arg\max_{\theta} \tilde{L}(\theta \mid \theta_{\text{old}}) \quad \text{s.t.} \quad \lVert \theta - \theta_{\text{old}} \rVert < \Delta$$
- Goal:
  $$\arg\max_{\theta'} \; \mathbb{E}_{s \sim \nu_{\theta},\, a \sim \pi_{\theta}(\cdot \mid s)}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_{\theta}(a \mid s)} \cdot A_{\pi_{\theta}}(s, a)\right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\big) < \Delta$$
  where $A_{\pi_\theta}(s, a)$ is the advantage function and $\nu_\theta$ is the state-visitation distribution under $\pi_\theta$.
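A minimal PyTorch sketch of evaluating the surrogate objective $\tilde{L}(\theta \mid \theta_{\text{old}})$ from a logged trajectory, assuming a hypothetical categorical policy network; the constrained maximization itself (natural gradient / line search in practice) is not shown:

```python
import torch
import torch.nn as nn

n_obs, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(n_obs, 32), nn.ReLU(), nn.Linear(32, n_actions))

def log_probs(states, actions):
    return torch.stack([torch.log_softmax(policy(s), dim=-1)[a]
                        for s, a in zip(states, actions)])

def surrogate(states, actions, returns, old_logps):
    """L̃(θ|θ_old) = (1/n)·Σ_i π(a_i|s_i;θ) / π(a_i|s_i;θ_old) · u_i."""
    ratios = torch.exp(log_probs(states, actions) - old_logps)  # importance weights
    return (ratios * returns).mean()

states = [torch.rand(n_obs) for _ in range(3)]
actions = [0, 1, 1]
returns = torch.tensor([2.0, 1.5, 1.0])          # u_i from the observed trajectory
with torch.no_grad():                            # log π(a_i|s_i; θ_old), before any update
    old_logps = log_probs(states, actions)
(-surrogate(states, actions, returns, old_logps)).backward()  # ascend L̃ ⇔ descend −L̃
```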
PPO (Proximal Policy Optimization)
- PPO-penalty:
  $$\arg\max_{\theta'} \; \mathbb{E}_{s \sim \nu_{\theta},\, a \sim \pi_{\theta}(\cdot \mid s)}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_{\theta}(a \mid s)} \cdot A^{\mathrm{GAE}}_{\pi_{\theta}}(s, a) - \beta \cdot D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\big)\right]$$
  - $\beta \leftarrow \beta / 2$ if $D_{\mathrm{KL}} < \delta / 1.5$
  - $\beta \leftarrow \beta \times 2$ if $D_{\mathrm{KL}} > \delta \times 1.5$
- PPO-clip:
  $$\arg\max_{\theta'} \; \mathbb{E}_{s \sim \nu_{\theta},\, a \sim \pi_{\theta}(\cdot \mid s)}\left[\min\left(\frac{\pi_{\theta'}(a \mid s)}{\pi_{\theta}(a \mid s)}\, A^{\mathrm{GAE}}_{\pi_{\theta}}(s, a),\; \operatorname{clip}\!\left(\frac{\pi_{\theta'}(a \mid s)}{\pi_{\theta}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\mathrm{GAE}}_{\pi_{\theta}}(s, a)\right)\right]$$
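A minimal PyTorch sketch of the PPO-clip objective for one batch; the log-probabilities, advantages, and ε here are made-up placeholders (the advantages would normally come from GAE, defined next):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, adv, eps=0.2):
    """Negative PPO-clip objective, to be minimized by a standard optimizer."""
    ratio = torch.exp(new_logp - old_logp)               # π_θ'(a|s) / π_θ(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip(ratio, 1−ε, 1+ε)
    return -torch.min(ratio * adv, clipped * adv).mean() # pessimistic min of the two terms

old_logp = torch.log(torch.tensor([0.5, 0.3, 0.8]))      # from the behaviour policy π_θ
new_logp = torch.log(torch.tensor([0.6, 0.2, 0.7]))      # from the current policy π_θ'
adv = torch.tensor([1.0, -0.5, 2.0])                     # advantage estimates
print(ppo_clip_loss(new_logp, old_logp, adv))
```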
GAE (Generalized Advantage Estimation)
$$\begin{aligned}A^{(k)}_t &= r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k}) - V(s_t) \\ &= \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{k-1} \delta_{t+k-1}\end{aligned}$$
$$A^{\mathrm{GAE}}_t = (1 - \lambda)\left(A^{(1)}_t + \lambda A^{(2)}_t + \lambda^2 A^{(3)}_t + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
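A minimal numpy sketch of computing $A^{\mathrm{GAE}}_t$ from per-step TD errors; the rewards, value estimates, $\gamma$, and $\lambda$ are made up for the example:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t^GAE = Σ_{l≥0} (γλ)^l·δ_{t+l}, with δ_t = r_t + γ·V(s_{t+1}) − V(s_t).

    `values` holds one extra entry: the estimate for the state after the last
    reward (use 0.0 if the episode terminated there).
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv, running = np.zeros_like(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

print(gae(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.3, 0.0]))
```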
GRPO (Group Relative Policy Optimization)
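GRPO (introduced with DeepSeekMath and used for DeepSeek-R1) drops the learned value baseline: for each prompt it samples a group of outputs, scores them with a reward model, and uses the reward normalized within the group as the advantage, otherwise reusing a PPO-style clipped objective. A minimal sketch of the group-relative advantage (the reward values are made up):

```python
import numpy as np

def group_relative_advantages(rewards):
    """A_i = (r_i − mean(r)) / std(r), computed within one group of sampled
    outputs for the same prompt; no value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1., -1., -1.,  1.]
```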
References
- 王树森, 黎或君, 张志华. 深度强化学习 (Deep Reinforcement Learning). 人民邮电出版社, 2022. https://github.com/wangshusen/DRL
- 张伟楠, 沈键, 俞勇. 动手学强化学习 (Hands-on Reinforcement Learning). 人民邮电出版社, 2022. https://github.com/boyu-ai/Hands-on-RL
- https://www.bilibili.com/video/BV1rooaYVEk8
- https://www.bilibili.com/video/BV15cZYYvEhz