A simple memo for learning Reinforcement Learning.
Terminology
- State $s$, Action $a$, Reward $r$
- Policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$
  $$\pi(a \mid s) = P(A = a \mid S = s)$$
- Trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$
- Discounted return
  $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$
- Action-value function
  $$Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid S_t = s_t, A_t = a_t\right]$$
- Optimal action-value function
  $$Q^*(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)$$
- State-value function
  $$V_\pi(s_t) = \mathbb{E}_A\left[Q_\pi(s_t, A)\right] = \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$
- $\mathbb{E}_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is overall (a numeric sketch of these definitions follows below).
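To make these definitions concrete, here is a minimal Python sketch. The two-state MDP, the policy table $\pi(a \mid s)$, and the $Q_\pi$ values are made-up illustrative numbers, not anything from this memo or its references.

```python
import numpy as np

gamma = 0.9  # discount factor γ (illustrative value)

def discounted_return(rewards, gamma):
    """U_t = R_t + γ R_{t+1} + γ² R_{t+2} + ... for a finite reward sequence,
    computed backwards via the recursion U_t = R_t + γ U_{t+1}."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Rewards r_1, r_2, r_3 from one short trajectory (made up)
print(discounted_return([1.0, 0.0, 2.0], gamma))  # 1 + 0.9*0 + 0.81*2 = 2.62

# Tabular policy π(a|s) and action values Q_π(s, a) for 2 states × 2 actions
pi = np.array([[0.5, 0.5],   # π(· | s=0)
               [0.2, 0.8]])  # π(· | s=1)
Q = np.array([[1.0, 3.0],    # Q_π(s=0, ·)
              [0.0, 2.0]])   # Q_π(s=1, ·)

# State value: V_π(s) = E_A[Q_π(s, A)] = Σ_a π(a|s) · Q_π(s, a)
V = (pi * Q).sum(axis=1)
print(V)  # [2.0, 1.6]
```

Averaging $V_\pi$ over states then gives $\mathbb{E}_S[V_\pi(S)]$, the single number used above to compare how good different policies are.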
Overview
References
- 王树森, 黎彧君, 张志华. 深度强化学习 (Deep Reinforcement Learning). 人民邮电出版社, 2022. https://github.com/wangshusen/DRL
- 张伟楠, 沈键, 俞勇. 动手学强化学习 (Hands-on Reinforcement Learning). 人民邮电出版社, 2022. https://github.com/boyu-ai/Hands-on-RL
- https://www.bilibili.com/video/BV1rooaYVEk8
- https://www.bilibili.com/video/BV15cZYYvEhz