Reinforcement Learning Memo

Last updated on August 25, 2025

A simple memo for learning Reinforcement Learning.

Terminologies

  1. State $s$, Action $a$, Reward $r$
  2. Policy $\pi: (s, a) \to [0, 1]$

    $$\pi(a \mid s) = \mathbb{P}(A = a \mid S = s)$$

  3. Trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T$
  4. Discounted return (see the example after this list)

    $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$

  5. Action-value function

    $$Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S = s_t, A = a_t]$$

  6. Optimal action-value function

    $$Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)$$

  7. State-value function

    $$V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$

    • $\mathbb{E}_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.

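As a quick check on the discounted return, here is a minimal Python sketch that computes $U_t$ for every step of a toy reward sequence (the rewards and $\gamma = 0.9$ are made-up numbers):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute U_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # walk backwards: U_t = r_t + gamma * U_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0]))    # [2.62, 1.8, 2.0]
```
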
Value-based Learning

  • Goal: Approximate the Q function to maximize the total reward.

Temporal Difference (TD) Learning

$$Q(s_t, a_t; \mathbf{w}) \approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w})$$

  • Prediction: $Q(s_t, a_t; \mathbf{w}_t)$
  • TD target:

$$y_t = r_t + \gamma \cdot Q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$$

  • Loss:

$$L_t = \frac{1}{2}\left[Q(s_t, a_t; \mathbf{w}) - y_t\right]^2$$

  • Gradient descent (a minimal sketch follows this list):

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \left.\frac{\partial L_t}{\partial \mathbf{w}}\right\vert_{\mathbf{w} = \mathbf{w}_t}$$

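A single TD(0) gradient step, sketched with a linear model $Q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$ in NumPy (the feature vectors, step size, and function name `td_step` are assumptions for illustration):

```python
import numpy as np

def td_step(w, phi, r, phi_next, alpha=0.01, gamma=0.99):
    """One TD(0) update for a linear Q(s, a; w) = w @ phi(s, a)."""
    prediction = w @ phi                    # Q(s_t, a_t; w)
    target = r + gamma * (w @ phi_next)     # TD target y_t, treated as a constant
    delta = prediction - target             # TD error
    # dL/dw = delta * phi, since L = 0.5 * (w @ phi - y_t)^2 with y_t held fixed
    return w - alpha * delta * phi

w = np.zeros(4)
w = td_step(w, phi=np.array([1.0, 0.0, 0.0, 1.0]), r=1.0,
            phi_next=np.array([0.0, 1.0, 1.0, 0.0]))
```
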
SARSA

  • Goal: Learn the action-value function $Q_\pi$.

Tabular Version

  • Goal: Directly learn $Q_\pi(s, a)$.
  • Algorithm (a minimal sketch follows this list):
    1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$
    2. Sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
    3. TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$
    4. TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$
    5. Update: $Q_\pi(s_t, a_t) \gets Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$

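A tabular SARSA update in NumPy; the 5-state, 2-action table, the toy transition, and the name `sarsa_update` are illustrative assumptions:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA update on a Q table of shape (n_states, n_actions)."""
    y = r + gamma * Q[s_next, a_next]   # TD target uses the *sampled* next action
    delta = Q[s, a] - y                 # TD error
    Q[s, a] -= alpha * delta
    return Q

Q = np.zeros((5, 2))                    # toy problem: 5 states, 2 actions
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```
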
Neural Network Version

  • Goal: Approximate $Q_\pi(s, a)$ by the value network $q(s, a; \mathbf{w})$.
  • Algorithm (see the sketch below):
    1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$ and sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
    2. TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$
    3. TD error: $\delta_t = q(s_t, a_t; \mathbf{w}) - y_t$
    4. Update: $\mathbf{w} \gets \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$

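The same update with a small network in PyTorch; detaching the TD target reproduces the update above, where only the prediction is differentiated (the 4-dimensional state, 2 actions, and network size are arbitrary assumptions):

```python
import torch
import torch.nn as nn

q = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # q(s, .; w), one output per action
opt = torch.optim.SGD(q.parameters(), lr=1e-2)
gamma = 0.99

def sarsa_nn_step(s, a, r, s_next, a_next):
    """One neural SARSA step; only the prediction q(s_t, a_t; w) carries gradient."""
    y = r + gamma * q(s_next)[a_next].detach()   # TD target y_t (no gradient)
    delta = q(s)[a] - y                          # TD error delta_t
    opt.zero_grad()
    (0.5 * delta ** 2).backward()                # dL/dw = delta_t * dq/dw
    opt.step()

sarsa_nn_step(torch.randn(4), a=0, r=1.0, s_next=torch.randn(4), a_next=1)
```
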
Q-Learning

  • Goal: Learn the optimal action-value function $Q^*$.

Tabular Version

  • Goal: Directly learn $Q^*(s, a)$.
  • Algorithm (a minimal sketch follows this list):
    1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$
    2. TD target: $y_t = r_t + \gamma \cdot \max_a Q^*(s_{t+1}, a)$
    3. TD error: $\delta_t = Q^*(s_t, a_t) - y_t$
    4. Update: $Q^*(s_t, a_t) \gets Q^*(s_t, a_t) - \alpha \cdot \delta_t$

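Tabular Q-learning differs from tabular SARSA only in the target, which maximizes over next actions instead of using a sampled one; a NumPy sketch with the same toy table as before:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: bootstrap from the greedy next action."""
    y = r + gamma * Q[s_next].max()     # TD target uses max_a Q(s_{t+1}, a)
    delta = Q[s, a] - y                 # TD error
    Q[s, a] -= alpha * delta
    return Q

Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```
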
DQN Version

  • Goal: Approximate $Q^*(s, a)$ by the DQN $Q(s, a; \mathbf{w})$.
  • Policy: Choose $a_t = \argmax_a Q(s_t, a; \mathbf{w})$.
  • Algorithm (see the sketch below):
    1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$
    2. TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$
    3. TD error: $\delta_t = Q(s_t, a_t; \mathbf{w}) - y_t$
    4. Update: $\mathbf{w} \gets \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$

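A single-transition DQN sketch in PyTorch; a practical DQN would add experience replay, a target network, and epsilon-greedy exploration, all omitted here, and the state/action dimensions are arbitrary assumptions:

```python
import torch
import torch.nn as nn

Q = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, .; w)
opt = torch.optim.Adam(Q.parameters(), lr=1e-3)
gamma = 0.99

def act(s):
    """Greedy policy: a_t = argmax_a Q(s_t, a; w)."""
    with torch.no_grad():
        return int(Q(s).argmax())

def dqn_step(s, a, r, s_next):
    """One Q-learning step on the network; the max-target is detached."""
    y = r + gamma * Q(s_next).max().detach()   # TD target
    delta = Q(s)[a] - y                        # TD error
    opt.zero_grad()
    (0.5 * delta ** 2).backward()              # dL/dw = delta_t * dQ/dw
    opt.step()

s, s_next = torch.randn(4), torch.randn(4)
dqn_step(s, act(s), r=1.0, s_next=s_next)
```
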
Policy-based Learning

  • Goal: Learn $\mathbf{\theta}$ that maximizes $J(\mathbf{\theta}) = \mathbb{E}_S[V(S; \mathbf{\theta})]$.
  • Idea: Approximate the policy function $\pi(a \mid s)$ by the policy network $\pi(a \mid s; \mathbf{\theta})$.
    • Learn the policy network by policy gradient.

Policy Gradient

  • Policy gradient:

$$\frac{\partial V(s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} = \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} \cdot Q_\pi(s_t, A_t)\right]$$

  • Policy gradient with baseline: Suppose $b$ is independent of $A_t$; then (see the check below)

$$\begin{aligned} \frac{\partial V(s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} &= \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} \cdot \left(Q_\pi(s_t, A_t) - b\right)\right] \\ &= \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} \cdot \left(Q_\pi(s_t, A_t) - V_\pi(s_t)\right)\right] \end{aligned}$$

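The baseline does not bias the gradient: the extra term vanishes because the policy sums to one over actions, as this short check shows.

$$\mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} \cdot b\right] = b \sum_{a} \pi(a \mid s_t; \mathbf{\theta}) \cdot \frac{\partial \ln \pi(a \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} = b \sum_{a} \frac{\partial \pi(a \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}} = b \cdot \frac{\partial}{\partial \mathbf{\theta}} \sum_{a} \pi(a \mid s_t; \mathbf{\theta}) = \mathbf{0}$$

Choosing $b = V_\pi(s_t)$ therefore keeps the gradient unbiased while typically reducing its variance.
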
REINFORCE

  • Goal: Approximate $Q_\pi(s_t, A_t)$ by the observed return $u_t$ and $V_\pi(s_t)$ by the value network $v(s_t; \mathbf{w})$.
  • Algorithm (a minimal sketch follows this list):
    1. Play a game to the end and observe the trajectory:

    $$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n$$

    2. Compute $u_t = \sum_{i=t}^n \gamma^{i-t} \cdot r_i$ and $\delta_t = v(s_t; \mathbf{w}) - u_t$
    3. Update the policy network by:

    $$\mathbf{\theta} \gets \mathbf{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}}$$

    4. Update the value network by:

    $$\mathbf{w} \gets \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}$$

    5. Repeat steps 2 to 4 for $t = 1, \dots, n$

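An end-to-end sketch of REINFORCE with a learned baseline in PyTorch (the two-action task, 4-dimensional states, network sizes, and the single optimizer step per episode are illustrative choices, not from the text):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # pi(a | s; theta), logits
value = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # v(s; w), the baseline
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(value.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """Process one finished episode: compute u_t, then update theta and w for every t."""
    u, returns = 0.0, []
    for r in reversed(rewards):            # u_t = r_t + gamma * u_{t+1}
        u = r + gamma * u
        returns.insert(0, u)
    opt_pi.zero_grad(); opt_v.zero_grad()
    for s, a, u_t in zip(states, actions, returns):
        delta = value(s).squeeze() - u_t                   # delta_t = v(s_t; w) - u_t
        log_pi = torch.log_softmax(policy(s), dim=-1)[a]   # ln pi(a_t | s_t; theta)
        (delta.detach() * log_pi).backward()               # grad for theta: delta_t * d ln pi / d theta
        (0.5 * delta ** 2).backward()                      # grad for w: delta_t * d v / d w
    opt_pi.step(); opt_v.step()                            # descent steps, matching the updates above

states = [torch.randn(4) for _ in range(3)]
reinforce_update(states, actions=[0, 1, 0], rewards=[1.0, 0.0, 2.0])
```
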
Actor-Critic

  • Goal: Approximate the policy function $\pi(a \mid s)$ by the policy network $\pi(a \mid s; \mathbf{\theta})$ and the state-value function $V_\pi(s)$ by the value network $v(s; \mathbf{w})$.
    • Actor: Update the policy network $\pi(a \mid s; \mathbf{\theta})$ using the policy gradient to increase $V(s; \mathbf{\theta}, \mathbf{w})$
    • Critic: Update the value network $v(s; \mathbf{w})$ using TD learning to better estimate the return
  • Algorithm (a minimal sketch follows this list):
    1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$
    2. TD target: $y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w})$
    3. TD error: $\delta_t = v(s_t; \mathbf{w}) - y_t$
    4. Update the policy network (actor) by:

    $$\mathbf{\theta} \gets \mathbf{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \mathbf{\theta})}{\partial \mathbf{\theta}}$$

    5. Update the value network (critic) by:

    $$\mathbf{w} \gets \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}$$

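A per-transition actor-critic step in PyTorch (network sizes, learning rates, and the toy transition are assumptions; the critic here estimates the state value, matching the TD target above):

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # pi(a | s; theta), logits
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # v(s; w)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next):
    """One update from a single transition (s_t, a_t, r_t, s_{t+1})."""
    y = r + gamma * critic(s_next).squeeze().detach()   # TD target y_t
    delta = critic(s).squeeze() - y                     # TD error delta_t
    log_pi = torch.log_softmax(actor(s), dim=-1)[a]
    opt_actor.zero_grad()
    (delta.detach() * log_pi).backward()                # actor grad: delta_t * d ln pi / d theta
    opt_actor.step()
    opt_critic.zero_grad()
    (0.5 * delta ** 2).backward()                       # critic grad: delta_t * d v / d w
    opt_critic.step()

actor_critic_step(torch.randn(4), a=0, r=1.0, s_next=torch.randn(4))
```
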
TRPO (Trust Region Policy Optimization)

  • Trust region algorithm: Find $\mathbf{\theta}^* = \argmax_{\mathbf{\theta}} J(\mathbf{\theta})$.
    1. Approximation: Given $\mathbf{\theta}_\text{old}$, construct $L(\mathbf{\theta} \mid \mathbf{\theta}_\text{old})$, an approximation to $J(\mathbf{\theta})$ in the trust region $\mathcal{N}(\mathbf{\theta}_\text{old})$
    2. Maximization: In the trust region, find $\mathbf{\theta}_\text{new}$ by:

    $$\mathbf{\theta}_\text{new} \gets \argmax_{\mathbf{\theta} \in \mathcal{N}(\mathbf{\theta}_\text{old})} L(\mathbf{\theta} \mid \mathbf{\theta}_\text{old})$$

  • Objective function:

$$J(\mathbf{\theta}) = \mathbb{E}_S[V_\pi(S)] = \mathbb{E}_S\left[\mathbb{E}_{A \sim \pi(\cdot \mid S; \mathbf{\theta}_\text{old})}\left[\frac{\pi(A \mid S; \mathbf{\theta})}{\pi(A \mid S; \mathbf{\theta}_\text{old})} \cdot Q_\pi(S, A)\right]\right]$$

  • Algorithm (a minimal sketch follows this list):
    1. Controlled by the policy $\pi(\cdot \mid s; \mathbf{\theta}_\text{old})$, the agent plays a game to the end and observes a trajectory:

    $$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n$$

    2. For $i = 1, 2, \ldots, n$, compute the discounted returns $u_i = \sum_{k=i}^n \gamma^{k-i} \cdot r_k$
    3. Approximation:

    $$\tilde{L}(\mathbf{\theta} \mid \mathbf{\theta}_\text{old}) = \frac{1}{n} \sum_{i=1}^n \frac{\pi(a_i \mid s_i; \mathbf{\theta})}{\pi(a_i \mid s_i; \mathbf{\theta}_\text{old})} \cdot u_i$$

    4. Maximization:

    $$\mathbf{\theta}_\text{new} \gets \argmax_{\mathbf{\theta}} \tilde{L}(\mathbf{\theta} \mid \mathbf{\theta}_\text{old}) \quad \text{s.t.} \quad \Vert \mathbf{\theta} - \mathbf{\theta}_\text{old} \Vert < \Delta$$

  • Goal:

$$\begin{aligned} & \argmax_{\mathbf{\theta}'} \; \mathbb{E}_{s \sim \nu_{\mathbf{\theta}},\, a \sim \pi_{\mathbf{\theta}}(\cdot \mid s)} \left[\frac{\pi_{\mathbf{\theta}'}(a \mid s)}{\pi_{\mathbf{\theta}}(a \mid s)} \cdot A_{\pi_{\mathbf{\theta}}}(s, a)\right] \\ & \text{s.t.} \quad D_{KL}\left(\pi_{\mathbf{\theta}}(\cdot \mid s) \,\Vert\, \pi_{\mathbf{\theta}'}(\cdot \mid s)\right) < \Delta \end{aligned}$$

    where $A_{\pi_{\mathbf{\theta}}}(s, a)$ is the advantage function and $\nu_{\mathbf{\theta}}$ is the state visitation distribution under $\pi_{\mathbf{\theta}}$.

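To make step 3 above concrete, a toy Monte Carlo estimate of the surrogate $\tilde{L}$ in NumPy (the probabilities and returns are made-up numbers; the constrained maximization over $\mathbf{\theta}$, e.g. via conjugate gradient and a line search as in the full TRPO algorithm, is omitted):

```python
import numpy as np

def surrogate(pi_new, pi_old, returns):
    """Monte Carlo estimate of L~(theta | theta_old) over one trajectory.

    pi_new[i] = pi(a_i | s_i; theta), pi_old[i] = pi(a_i | s_i; theta_old).
    """
    ratios = pi_new / pi_old                 # importance sampling ratios
    return np.mean(ratios * returns)

pi_old = np.array([0.50, 0.40, 0.70])        # toy action probabilities under theta_old
pi_new = np.array([0.60, 0.35, 0.75])        # the same actions under a candidate theta
u = np.array([2.0, 1.5, 1.0])                # discounted returns u_i
print(surrogate(pi_new, pi_old, u))          # maximize over theta s.t. ||theta - theta_old|| < Delta
```
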
PPO (Proximal Policy Optimization)

  • PPO-penalty:

    $$\argmax_{\mathbf{\theta}'} \; \mathbb{E}_{s \sim \nu_{\mathbf{\theta}},\, a \sim \pi_{\mathbf{\theta}}(\cdot \mid s)} \left[\frac{\pi_{\mathbf{\theta}'}(a \mid s)}{\pi_{\mathbf{\theta}}(a \mid s)} \cdot A^{GAE}_{\pi_{\mathbf{\theta}}}(s, a) - \beta \cdot D_{KL}\left(\pi_{\mathbf{\theta}}(\cdot \mid s) \,\Vert\, \pi_{\mathbf{\theta}'}(\cdot \mid s)\right)\right]$$

    The penalty coefficient $\beta$ is adapted toward a target KL divergence $\delta$:
    • $\beta \gets \beta / 2 \quad \text{if } D_{KL} < \delta / 1.5$
    • $\beta \gets \beta \times 2 \quad \text{if } D_{KL} > \delta \times 1.5$
  • PPO-clip (see the sketch below):

    $$\argmax_{\mathbf{\theta}'} \; \mathbb{E}_{s \sim \nu_{\mathbf{\theta}},\, a \sim \pi_{\mathbf{\theta}}(\cdot \mid s)} \left[\min\left(\frac{\pi_{\mathbf{\theta}'}(a \mid s)}{\pi_{\mathbf{\theta}}(a \mid s)} A^{GAE}_{\pi_{\mathbf{\theta}}}(s, a),\; \operatorname{clip}\left(\frac{\pi_{\mathbf{\theta}'}(a \mid s)}{\pi_{\mathbf{\theta}}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{GAE}_{\pi_{\mathbf{\theta}}}(s, a)\right)\right]$$

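A minimal PPO-clip loss in PyTorch; the function name, $\epsilon = 0.2$, and the toy tensors are illustrative, and in practice the log-probabilities come from the policy network and the advantages from GAE:

```python
import torch

def ppo_clip_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """Negative PPO-clip objective for logged actions (minimize with any optimizer)."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())          # pi_theta' / pi_theta
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clip(ratio, 1-eps, 1+eps) * A
    return -torch.mean(torch.min(unclipped, clipped))            # maximize => minimize the negative

loss = ppo_clip_loss(log_pi_new=torch.log(torch.tensor([0.60, 0.30])),
                     log_pi_old=torch.log(torch.tensor([0.50, 0.40])),
                     advantages=torch.tensor([1.0, -0.5]))
```
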
GAE (Generalized Advantage Estimation)

$$\begin{aligned} A_t^{(k)} &= r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t) \\ &= \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{k-1} \delta_{t+k-1} \end{aligned}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the one-step TD residual. Then

$$A^{GAE}_t = (1 - \lambda)\left(A_t^{(1)} + \lambda A_t^{(2)} + \lambda^2 A_t^{(3)} + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

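The infinite sum collapses into the backward recursion $A^{GAE}_t = \delta_t + \gamma\lambda A^{GAE}_{t+1}$, which is how GAE is computed on a finite trajectory; a NumPy sketch with made-up rewards and value estimates:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_t^GAE for a trajectory; values = [V(s_1), ..., V(s_n), V(s_{n+1})]."""
    n = len(rewards)
    adv = np.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD residual delta_t
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = running
    return adv

print(gae([1.0, 0.0, 2.0], values=[0.5, 0.4, 0.9, 0.0]))
```
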
GRPO (Group Relative Policy Optimization)

