Reinforcement Learning Memo

Last updated on July 11, 2025

A simple memo for learning Reinforcement Learning.

Terminologies

  1. State $s$, Action $a$, Reward $r$
  2. Policy $\pi: (s, a) \to [0, 1]$

    $$\pi(a | s) = \mathbb{P}(A = a | S = s)$$

  3. Trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T$
  4. Discounted return

    $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$

  5. Action-value function

    $$Q_\pi(s_t, a_t) = \mathbb{E}[U_t | S = s_t, A = a_t]$$

  6. Optimal action-value function

    $$Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)$$

  7. State-value function

    $$V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \sum_{a \in A} \pi(a | s_t) \cdot Q_\pi(s_t, a)$$

    • $\mathbb{E}_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.
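The definitions above can be sketched in a few lines of code. This is a minimal toy example with made-up rewards, policy probabilities, and Q-values (none of them come from the memo); it only shows how $U_t$ and $V_\pi(s)$ are computed from their definitions.

```python
GAMMA = 0.9  # discount factor gamma, an assumed toy value


def discounted_return(rewards, gamma=GAMMA):
    """U_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...

    Accumulate from the last reward backwards, so each step
    applies one extra factor of gamma.
    """
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u


def state_value(policy_s, q_s):
    """V_pi(s) = sum over a of pi(a|s) * Q_pi(s, a)."""
    return sum(policy_s[a] * q_s[a] for a in policy_s)


# Toy trajectory rewards r_t, r_{t+1}, r_{t+2}:
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62

# Toy policy pi(a|s) and action values Q_pi(s, a) at one state:
policy_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 1.0, "right": 3.0}
print(state_value(policy_s, q_s))  # 0.25*1 + 0.75*3 = 2.5
```

Summing backwards avoids computing powers of $\gamma$ explicitly; it is the same nesting used by the Bellman-style recursion $U_t = R_t + \gamma U_{t+1}$.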

Overview

References

  1. 王树森, 黎或君, 张志华. Deep Reinforcement Learning (深度强化学习). Posts & Telecom Press, 2022. https://github.com/wangshusen/DRL
  2. 张伟楠, 沈键, 俞勇. Hands-on Reinforcement Learning (动手学强化学习). Posts & Telecom Press, 2022. https://github.com/boyu-ai/Hands-on-RL
  3. https://www.bilibili.com/video/BV1rooaYVEk8
  4. https://www.bilibili.com/video/BV15cZYYvEhz

Author: Nuoyan Chen
Posted on: July 11, 2025
https://cny123222.github.io/2025/07/11/Reinforcement-Learning-Memo/