机器学习:笔记整理

Last updated on April 21, 2026

本文为 SJTU-CS3612 机器学习课程的笔记整理,主要聚焦于公式及其推导过程。

Lecture 1: Linear Model

1.1 Linear regression

假设有 n 个样本的数据集 \{x_{i1}, x_{i2}, \dots, x_{ip}, y_i\}_{i=1}^n(并约定 x_{i0} = 1),则线性模型可以表示为:

y_i = \beta_0 \cdot 1+\beta_1 x_{i 1}+\beta_2 x_{i 2} +\cdots+\beta_p x_{i p}+\varepsilon_i =\sum_{j=0}^p x_{i j} \beta_j+\varepsilon_i

或者表示为矩阵形式:

\mathbf{y}=\mathbf{X}^{\top} \boldsymbol{\beta}+\boldsymbol{\varepsilon}

其中

\mathbf{X}^\top =\begin{bmatrix} \mathbf{x}_1^{\top} \\ \mathbf{x}_2^{\top} \\ \vdots \\ \mathbf{x}_n^{\top} \end{bmatrix}=\begin{bmatrix} 1 & x_{11} & \cdots & x_{1 p} \\ 1 & x_{21} & \cdots & x_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n 1} & \cdots & x_{n p} \end{bmatrix}, \quad \mathbf{y}=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{\beta}=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol{\varepsilon}=\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

1.2 Logistic regression

假设有 n 个样本的数据集 \{x_{i1}, x_{i2}, \dots, x_{ip}, y_i\}_{i=1}^n,其中标签 y_i \in \{0, 1\}。设

\operatorname{Pr}(y_i = 1|X_i) = p_i, \quad \operatorname{Pr}(y_i = 0|X_i) = 1 - p_i

\operatorname{logit}\left(p_i\right)=\log \frac{p_i}{1-p_i}=X_i^{\top} \beta

那么

p_i=\operatorname{sigmoid}\left(X_i^{\top} \beta\right)=\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}=\frac{1}{1+e^{-X_i^{\top} \beta}}

设每个样本的真实标签为 y_i^*,那么预测正确的概率为

\operatorname{Pr}\left(y_i=y_i^* | X_i\right)=p_i^{y_i^*}\left(1-p_i\right)^{1-y_i^*}=\left(\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}\right)^{y_i^*}\left(1-\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}\right)^{1-y_i^*}=\frac{e^{y_i^* X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}

采用最大似然估计,目标是所有样本全部预测正确的概率最大,即

\operatorname{Pr}(\beta)=\prod_{i=1}^n p_i^{y_i^*}\left(1-p_i\right)^{1-y_i^*}=\prod_{i=1}^n \frac{e^{y_i^* X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}

取对数,得

\log \operatorname{Pr}(\beta)=\sum_{i=1}^n\left[y_i^* X_i^{\top} \beta-\log \left(1+\exp X_i^{\top} \beta\right)\right]

优化目标变为

\argmax_\beta \operatorname{Pr}(\beta) \equiv \argmax_\beta \log \operatorname{Pr}(\beta)

1.3 Classification

在分类问题中,标签一般为 y_i \in \{+1, -1\},那么可以认为

\operatorname{Pr}(y_i = +1|X_i) = \frac{1}{1+e^{-X_i^{\top} \beta}}, \quad \operatorname{Pr}(y_i = -1|X_i) = \frac{1}{1+e^{X_i^{\top} \beta}}

合并起来,得到

p\left(y_i\right)=\frac{1}{1+e^{-y_i X_i^{\top} \beta}}

1.4 Perceptron

一个基础的感知机,可以表示为:

y_i=\operatorname{sign}\left(X_i^{\top} \beta\right)

其中 \operatorname{sign}(\cdot) 是符号函数。

1.5 Three models

Generative models

给定观测变量 XX 和目标变量 YY,生成式模型建模联合概率 pθ(X,Y)p_\theta(X, Y)。推理时,使用条件概率公式:

pθ(YX)=pθ(X,Y)pθ(X)=pθ(X,Y)Ypθ(X,Y)p_\theta(Y | X)=\frac{p_\theta(X, Y)}{p_\theta(X)}=\frac{p_\theta(X, Y)}{\sum_{Y^{\prime}} p_\theta\left(X, Y^{\prime}\right)}

假设 gθg_\theta 是一个生成式网络,给定输入

hN(0,I)h \sim \mathrm{N}(0, I)

期望的输出为:

X=gθ(h)+εX = g_\theta(h) + \varepsilon

输出 XX 的概率为:

pθ(X)=p(h)pθ(Xh)dhp_\theta(X)=\int p(h) p_\theta(X | h) \mathrm{d} h

要增大 pθ(X)p_\theta(X),就是要增大 p(h)p(h)pθ(Xh)p_\theta(X | h)

  1. 先考虑增大 p(h)p(h):由于 hN(0,I)h \sim N(0, I),即

    p(h)=12πexp(12h2)p(h) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\| h \|^2\right)

    logp(h)=12h2+constant\log p(h)=-\frac{1}{2}\|h\|^2+\mathrm{constant}

  2. 再考虑增大 pθ(Xh)p_\theta(X | h):假设

    Xgθ(h)=εN(0,σ2I)X-g_\theta(h)=\varepsilon \sim \mathrm{N}\left(0, \sigma^2 \mathbf{I}\right)

    那么

    XhN(gθ(h),σ2I)X | h \sim \mathrm{N}\left(g_\theta(h), \sigma^2 \mathbf{I}\right)

    从而

    logpθ(Xh)=Xgθ(h)22σ2+constant\log p_\theta(X | h)=-\frac{\left\|X-g_\theta(h)\right\|^2}{2 \sigma^2}+\mathrm{constant}

Discriminative models

给定观测变量 XX 和目标变量 YY,判别式模型建模条件概率 P(YX)P(Y|X)。设输入为 XX,网络的输出为 KK 维向量 fθ(X)f_\theta(X)。经过 Softmax 层,有

pθ(y=kX)=pk=1Z(θ)exp(fθ(k)(X))p_\theta(y=k | X)= p_k = \frac{1}{Z(\theta)} \exp (f_\theta^{(k)}(X))

其中

Z(θ)=kexp(fθ(k)(X))Z(\theta) = \sum_k \exp (f_\theta^{(k)}(X))

Descriptive models

描述式模型建模输入数据分布 pθ(X)p_\theta(X)。可以计算为:

pθ(X)=1Z(θ)exp(fθ(X))p_\theta(X)=\frac{1}{Z(\theta)} \exp(f_\theta(X))

其中

Z(θ)=exp(fθ(x))dxZ(\theta) = \int \exp (f_\theta(x)) \mathrm{d} x

1.7 Loss functions

为了从数据集 {(Xi,yi)}i=1n\left\{\left(X_i, y_i\right)\right\}_{i=1}^n 中学习 β\beta,我们最小化损失函数

L(β)=i=1nL(yi,Xiβ)\mathscr{L}(\beta)=\sum_{i=1}^n L\left(y_i, X_i^{\top} \beta\right)

其中 L(yi,Xiβ)L\left(y_i, X_i^{\top} \beta\right) 是对每个训练样本的损失。

Loss function for least squares regression

在线性回归中,我们常使用最小平方损失:

L(yi,Xiβ)=(yiXiβ)2L(y_i, X_i^{\top} \beta)=\left(y_i-X_i^{\top} \beta\right)^2

这一损失函数同样可以从最大似然估计中得到。先假设误差服从正态分布,即

yiXiβ=εiN(0,σ2I)y_i-X_i^{\top} \beta=\varepsilon_i \sim \mathrm{N}(0,\sigma^2I)

那么有

yiXiN(Xiβ,σ2I)y_i|X_i \sim \mathrm{N}(X_i^\top \beta, \sigma^2I)

Pr(yiXi,β)=12πσexp[(yiXiβ)22σ2]\mathrm{Pr}(y_i|X_i, \beta) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[-\frac{(y_i - X_i^\top \beta)^2}{2\sigma^2}\right]

取负对数似然为损失函数,有:

L(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = \frac{(y_i-X_i^{\top} \beta)^2}{2\sigma^2}+\mathrm{constant}

从而

\mathscr{L}(\beta)=\sum_{i=1}^n L(y_i, X_i^{\top} \beta)=\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-X_i^{\top} \beta)^2+\mathrm{constant}

因此,最小二乘法的本质是最大似然估计。

Loss function for robust linear regression

考虑到最小平方损失对 outliers 较为敏感,可以将损失函数中的平方换为绝对值,即:

L(y_i, X_i^{\top} \beta)=| y_i-X_i^{\top} \beta |

将两种损失结合,得到 Huber loss:

L(y_i, X_i^{\top} \beta)= \begin{cases}\frac{1}{2}(y_i-X_i^{\top} \beta)^2, & \text { if }|y_i-X_i^{\top} \beta| \leq \delta \\ \delta|y_i-X_i^{\top} \beta|-\frac{\delta^2}{2}, & \text { otherwise }\end{cases}

其中 \delta 是选取的截断值,在 \delta 之外采用绝对值损失。
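
下面给出 Huber loss 的一个 NumPy 示意实现(非讲义内容,函数名与 delta 的默认取值均为自拟,仅供参考):

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Huber loss:|残差| <= delta 时用平方损失,否则用线性(绝对值)损失。"""
    r = y - y_pred
    quad = 0.5 * r ** 2
    lin = delta * np.abs(r) - 0.5 * delta ** 2
    return np.where(np.abs(r) <= delta, quad, lin)
```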

Loss function for logistic regression with 0/1 responses

前面推导过,当标签 yi{0,1}y_i \in \{0, 1\} 时,

Pr(yiXi,β)=exp(yiXiβ)1+exp(Xiβ)\operatorname{Pr}\left(y_i | X_i, \beta\right) =\frac{\exp (y_i X_i^\top \beta)}{1+\exp (X_i^{\top} \beta)}

取负对数似然为损失函数,有:

L(yi,Xiβ)=logPr(yiXi,β)=[yiXiβlog(1+exp(Xiβ))]L(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = -\left[y_i X_i^{\top} \beta-\log (1+\exp (X_i^{\top} \beta))\right]

Loss function for logistic regression with ±\pm responses

前面推导过,当标签 yi{+1,1}y_i \in \{+1, -1\} 时,

Pr(yiXi,β)=11+exp(yiXiβ)\operatorname{Pr}\left(y_i | X_i, \beta\right) =\frac{1}{1+\exp (-y_i X_i^{\top} \beta)}

同样取负对数似然为损失函数,有:

L(yi,Xiβ)=logPr(yiXi,β)=log[1+exp(yiXiβ)]L(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = \log \left[1+\exp \left(-y_i X_i^{\top} \beta\right)\right]

该损失被称为 logistic loss。

Loss functions for classification

对于 y_i \in \{+1, -1\} 的分类问题,除了 logistic loss,还有其他的损失函数可以选择:

\begin{aligned} & \text { Logistic loss }=\log \left(1+\exp \left(-y_i X_i^{\top} \beta\right)\right), \\ & \text { Exponential loss }=\exp \left(-y_i X_i^{\top} \beta\right), \\ & \text { Hinge loss }=\max \left(0,1-y_i X_i^{\top} \beta\right), \\ & \text { Zero-one loss }=1\left(y_i X_i^{\top} \beta<0\right) \end{aligned}

若以 m_i = y_i X_i^\top \beta(称为 margin)为横轴、L(y_i, X_i^\top \beta) 为纵轴作图,可以比较这几种损失。当 y_i 和 X_i^\top \beta 同号时分类正确:当 y_i = +1 时,我们希望 X_i^\top \beta 越大越好;当 y_i = -1 时,我们希望 X_i^\top \beta 越小(越负)越好。也就是说,我们希望 margin m_i 越大越好;相应地,这些损失函数都随 m_i 减小而增大。
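
下面用一小段 NumPy 代码示意这四种损失随 margin 的变化(变量名为自拟,仅用于验证上述单调性):

```python
import numpy as np

m = np.linspace(-3, 3, 7)            # margin m_i = y_i X_i^T beta
logistic = np.log(1 + np.exp(-m))    # logistic loss
exponential = np.exp(-m)             # exponential loss
hinge = np.maximum(0, 1 - m)         # hinge loss
zero_one = (m < 0).astype(float)     # zero-one loss
```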

1.8 Least Squares

考虑线性模型 Y=Xβ+εY = X\beta + \varepsilon,优化目标是:

β^=arg minβYXβ2\hat{\beta}=\argmin_\beta\Vert Y-X \beta\Vert^2

考虑到损失函数

L(β)=YXβ2L(\beta) = \Vert Y-X \beta\Vert^2

为凸函数,在 β=β^\beta = \hat{\beta} 处有一阶条件:

βL(β)=0\frac{\partial}{\partial \beta} L(\beta) = 0

下面对 L(β)L(\beta) 求导,展开得:

L(β)=(YXβ)(YXβ)=YYYXββXY+βXXβ\begin{aligned} L(\beta) & = (Y - X\beta)^\top(Y - X\beta) \\ & = Y^{\top} Y-Y^{\top} X \beta-\beta^{\top} X^{\top} Y+\beta^{\top} X^{\top} X \beta \\ \end{aligned}

考虑到 βXY\beta^{\top} X^{\top} Y 是标量,有 βXY=YXβ\beta^{\top} X^{\top} Y=Y^{\top} X \beta,从而

L(β)=YY2YXβ+βXXβL(\beta) = Y^{\top} Y-2 Y^{\top} X \beta+\beta^{\top} X^{\top} X \beta

假设 AA 是对称矩阵,那么有:

β(βAβ)=2Aβ\frac{\partial}{\partial \beta}\left(\beta^{\top} A \beta\right)=2 A \beta

β(bβ)=b\frac{\partial}{\partial \beta}\left(b^{\top} \beta\right)=b

根据矩阵求导法则,有:

Lβ=2XY+2XXβ\frac{\partial L}{\partial \beta}=-2 X^{\top} Y+2 X^{\top} X \beta

Lβ=0\dfrac{\partial L}{\partial \beta} = 0,得:

2X(YXβ)=0\begin{equation} 2 X^\top(Y-X \beta)=0 \end{equation}

如果 X 列满秩,那么 X^\top X 可逆,解得:

\hat{\beta}=\left(X^\top X\right)^{-1} X^\top Y

因此,

\hat{Y}=X \hat{\beta}=X\left(X^\top X\right)^{-1} X^\top Y

可以看出,最小二乘法的实质是将 Y 投影到 X 的列空间上,其投影为 X\hat{\beta}。因此,残差 Y - X\hat{\beta} 与 X 的列空间正交,这可以从 (1) 式中看出。
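
下面是正规方程解的一个最小 NumPy 草图(数据为随机生成、变量名自拟;数值上更稳妥的做法是用 np.linalg.lstsq 而非显式求逆):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # 第一列为截距项
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # 正规方程 (X^T X) beta = X^T Y
Y_hat = X @ beta_hat                          # Y 在 X 列空间上的投影
```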

Distribution of β^\hat{\beta}

下面讨论训练得到的参数 β^\hat{\beta} 的稳定性。设用无穷个样本得到的参数为 βtrue\beta_\text{true},那么我们有:

β^=(XX)1XY=(XX)1X(Xβtrue+ε)=βtrue+(XX)1Xε\begin{aligned} \hat{\beta} & =(X^\top X)^{-1} X^\top Y \\ & =(X^\top X)^{-1} X^\top(X \beta_{\text {true}}+\varepsilon) \\ & =\beta_{\text {true}}+(X^\top X)^{-1} X^\top \varepsilon \end{aligned}

可以看出,β^\hat{\beta} 服从正态分布,其均值为:

E[β^]=βtrue\mathbb{E}[\hat{\beta}]=\beta_{\text {true}}

协方差矩阵为:

Var(β^)=E[(β^E[β^])(β^E[β^])]=E[((XX)1Xε)((XX)1Xε)]=(XX)1XE[εε]X(XX)1,=(XX)1Xσ2IX(XX)1=σ2(XX)1\begin{aligned} \mathrm{Var}(\hat{\beta}) & = \mathbb{E}\left[(\hat{\beta} - \mathbb{E}[\hat{\beta}])(\hat{\beta} - \mathbb{E}[\hat{\beta}])^\top\right] \\ & = \mathbb{E}\left[((X^\top X)^{-1} X^\top \varepsilon )((X^\top X)^{-1} X^\top \varepsilon)^\top\right] \\ & = (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon \varepsilon^\top] X (X^\top X)^{-1, \top} \\ & = (X^\top X)^{-1} X^\top \cdot \sigma^2 I \cdot X (X^\top X)^{-1} \\ & = \sigma^2 (X^\top X)^{-1} \end{aligned}

因此,

β^N(βtrue,σ2(XX)1)\hat{\beta} \sim N\left(\beta_\text{true}, \sigma^2 (X^\top X)^{-1}\right)

1.9 Kullback-Leibler divergence and cross entropy

Coding and entropy

一个分布 p(x)p(x) 的不确定性用熵来衡量:

H(p)=Ep[logp(X)]=xp(x)[logp(x)]H(p)=\mathbb{E}_p[-\log p(X)]=\sum_x p(x)[-\log p(x)]

熵等于用 logp(x)-\log p(x) 长度来编码达到的最短编码长度。

Kullback-Leibler divergence and cross entropy

交叉熵,是对于分布 p(x)p(x),使用另一个分布 q(x)q(x)logq(x)-\log q(x) 长度进行编码,所得到的编码长度,即:

CE(pq)=Ep[logq(X)]=xp(x)logq(x)\mathrm{CE}(p \Vert q) = \mathbb{E}_p[-\log q(X)]=-\sum_x p(x) \log q(x)

不难得到,交叉熵不小于熵,即:

\mathrm{CE}(p \Vert q) \ge H(p)

当且仅当 p(x) = q(x),即两个分布相等时,等号成立。

定义 KL 散度为交叉熵和熵的差,即:

\mathrm{KL}(p\Vert q) = \mathrm{CE}(p \Vert q) - H(p) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0

当且仅当 p(x) = q(x) 时,等号成立。一般来说,p(x) 表示真实分布,q(x) 表示模型输出的预测分布。KL 散度衡量了两个分布之间的差异,但它不是严格意义上的距离,因为 KL 散度不对称,即:

\operatorname{KL}(p \Vert q) \neq \operatorname{KL}(q \Vert p)
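
下面用 NumPy 在一个自拟的离散分布上验证熵、交叉熵与 KL 散度之间的关系(数值仅为示例):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.25, 0.25])

H_p = -np.sum(p * np.log(p))        # 熵 H(p)
CE_pq = -np.sum(p * np.log(q))      # 交叉熵 CE(p||q),不小于 H(p)
KL_pq = np.sum(p * np.log(p / q))   # KL(p||q) = CE(p||q) - H(p) >= 0
KL_qp = np.sum(q * np.log(q / p))   # 一般不等于 KL(p||q),体现不对称性
```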

对于条件概率分布,KL 散度的定义是:

KL(p(yx)q(yx))= def Ep(x,y)[logp(YX)q(YX)]=Ep(x)Ep(yx)[logp(YX)q(YX)]\operatorname{KL}(p(y | x) | q(y | x)) \stackrel{\text { def }}{=} \mathbb{E}_{p(x, y)}\left[\log \frac{p(Y | X)}{q(Y | X)}\right] = \mathbb{E}_{p(x)} \mathbb{E}_{p(y | x)}\left[\log \frac{p(Y | X)}{q(Y | X)}\right]

对于多变量分布,KL 散度为:

KL(p(x,y)q(x,y))=KL(p(x)q(x))+KL(p(yx)q(yx))\begin{equation} \operatorname{KL}(p(x, y) | q(x, y))=\operatorname{KL}(p(x) | q(x))+\operatorname{KL}(p(y | x) | q(y | x)) \end{equation}

1.10 Maximum likelihood

下面证明最大似然估计的正确性。优化目标是:

L(θ)=1ni=1nlogpθ(yiXi)\mathscr{L}(\theta)=\frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i | X_i)

Pdata(X,y)P_\text{data}(X, y) 是样本分布,则有:

maxθL(θ)=maxθ1ni=1nlogpθ(yiXi)=maxθEPdata[logpθ(yX)]=minθ{EPdata[logpθ(yX)]}\begin{aligned} \max_\theta \mathscr{L}(\theta) & = \max _\theta \frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i | X_i) \\ & = \max _\theta \mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right] \\ & = \min _\theta\left\{-\mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right]\right\} \\ \end{aligned}

由于 EPdata[logPdata(yX)]\mathrm{E}_{P_{\text {data}}}\left[\log P_{\text {data}}(y | X)\right] 是与 θ\theta 无关的常数,

maxθL(θ)=minθEPdata [logPdata(yX)]EPdata[logpθ(yX)]=minθKL(Pdata(yX)pθ(yX))\begin{aligned} \max_\theta \mathscr{L}(\theta) & = \min _\theta \mathbb{E}_{P_{\text {data }}}\left[\log P_{\text {data}}(y | X)\right]-\mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right] \\ & = \min _\theta \mathrm{KL}\left(P_{\text {data}}(y | X) \Vert p_\theta(y | X)\right) \end{aligned}

因此,最大似然估计的实质是:最小化预测分布和真实数据分布的 KL 散度,即让预测分布尽可能接近真实数据分布。

1.11 Kullback-Leibler of conditionals

下面证明多变量分布的 KL 散度公式,即式 (2)。

KL(p(x,y)q(x,y))=Ep[logp(x,y)q(x,y)]=Ep[logp(x)p(yx)q(x)q(yx)]=Ep[logp(x)q(x)]+Ep[logp(yx)q(yx)]=KL(p(x)q(x))+KL(p(yx)q(yx))\begin{aligned} \mathrm{KL}(p(x, y) | q(x, y)) & =\mathbb{E}_p\left[\log \frac{p(x, y)}{q(x, y)}\right] \\ & =\mathbb{E}_p\left[\log \frac{p(x) p(y | x)}{q(x) q(y | x)}\right] \\ & =\mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right]+\mathbb{E}_p\left[\log \frac{p(y | x)}{q(y | x)}\right] \\ & =\mathrm{KL}(p(x) | q(x))+\mathrm{KL}(p(y | x) | q(y | x)) \end{aligned}

1.13 Gradient of log-likelihood

Discriminative model

前面提到过,判别式模型的输出为:

pθ(y=kX)=pk=1Z(θ)exp(fθ(k)(X))p_\theta(y=k | X)=p_k=\frac{1}{Z(\theta)} \exp (f_\theta^{(k)}(X))

其中

Z(θ)=kexp(fθ(k)(X))Z(\theta)=\sum_k \exp (f_\theta^{(k)}(X))

对对数概率求梯度,有:

θlogpθ(yX)=θfθ(k)(X)θlogZ(θ)=θfθ(k)(X)1Z(θ)θZ(θ)=θfθ(k)(X)1Z(θ)θ[kexp(fθ(k)(X))]=θfθ(k)(X)[kexp(fθ(k)(X))Z(θ)θfθ(k)(X)]=θfθ(k)(X)[kpkθfθ(k)(X)]=k(1(y=k)pk)θfθ(k)(X)=θfθ(X)(Yp)=θfθ(X)(YEθ(YX))\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(y | X) & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{\partial}{\partial \theta} \log Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta} Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta}\left[\sum_{k^{\prime}} \exp (f_\theta^{(k^{\prime})}(X))\right] \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\left[\sum_{k^{\prime}} \frac{\exp (f_\theta^{(k^{\prime})}(X))}{Z(\theta)} \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X)\right] \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\left[\sum_{k^{\prime}} p_{k^{\prime}} \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X)\right] \\ & =\sum_{k^{\prime}}\left(1(y=k^{\prime})-p_{k^{\prime}}\right) \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)^{\top}(Y-p) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)^{\top}\left(Y-\mathbb{E}_\theta(Y | X)\right) \end{aligned}

其中 YYyy 的独热形式,即:

Y=[Y1,Y2,,Yn],Yk={1,k=k0,kkY=\left[Y_1, Y_2, \ldots, Y_n\right], \quad Y_{k^{\prime}}=\left\{\begin{array}{cc} 1, & k^{\prime}=k \\ 0, & k^{\prime} \neq k \end{array}\right.

Descriptive model

前面提到,描述式模型建模数据分布,即:

pθ(X)=1Z(θ)exp(fθ(X))p_\theta(X)=\frac{1}{Z(\theta)} \exp (f_\theta(X))

其中

Z(θ)=exp(fθ(x))dxZ(\theta)=\int \exp (f_\theta(x)) \mathrm{d} x

对对数概率求导,有:

θlogpθ(X)=θfθ(X)θlogZ(θ)=θfθ(X)1Z(θ)θZ(θ)=θfθ(X)1Z(θ)exp(fθ(X))θfθ(X)dx=θfθ(X)1Z(θ)exp(fθ(X))θfθ(X)dx=θfθ(X)Xpθ(X)θfθ(X)=θfθ(X)Eθ[θfθ(X)]\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(X) & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{\partial}{\partial \theta} \log Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta} Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{1}{Z(\theta)} \int \exp (f_\theta(X)) \frac{\partial}{\partial \theta} f_\theta(X) \mathrm{d} x \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\int \frac{1}{Z(\theta)} \exp \left(f_\theta(X)\right) \frac{\partial}{\partial \theta} f_\theta(X) \mathrm{d} x \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\sum_X p_\theta(X) \frac{\partial}{\partial \theta} f_\theta(X) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\mathbb{E}_\theta\left[\frac{\partial}{\partial \theta} f_\theta(X)\right] \end{aligned}

Generative model

对于生成式模型 pθ(h,X)p_\theta(h, X),有:

θlogpθ(X)=1pθ(X)θpθ(h,X)dh=1pθ(X)[θpθ(h,X)]dh=1pθ(X)[θlogpθ(h,X)]pθ(h,X)dh=[θlogpθ(h,X)]pθ(h,X)pθ(X)dh=[θlogpθ(h,X)]pθ(hX)dh=Epθ(hX)[θlogpθ(h,X)]\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(X) & =\frac{1}{p_\theta(X)} \frac{\partial}{\partial \theta} \int p_\theta(h, X) \mathrm{d} h \\ & =\frac{1}{p_\theta(X)} \int\left[\frac{\partial}{\partial \theta} p_\theta(h, X)\right] \mathrm{d} h \\ & =\frac{1}{p_\theta(X)} \int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] p_\theta(h, X) \mathrm{d} h \\ & =\int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] \frac{p_\theta(h, X)}{p_\theta(X)} \mathrm{d} h \\ & =\int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] p_\theta(h | X) \mathrm{d} h \\ & =\mathbb{E}_{p_\theta(h | X)}\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] \end{aligned}

其中

pθ(hi,Xi)=p(hi)pθ(Xihi)=N(hi0,I)N(Xigθ(hi),σ2I)p_\theta(h_i, X_i) = p(h_i) \cdot p_\theta(X_i | h_i) = \mathrm{N}(h_i | \mathbf{0}, \mathbf{I}) \cdot \mathrm{N}\left(X_i | g_\theta(h_i), \sigma^2 \mathbf{I}\right)

logpθ(hi,Xi)=12σ2Xigθ(hi)212hi2+constant\log p_\theta\left(h_i, X_i\right)=-\frac{1}{2 \sigma^2}\left\|X_i-g_\theta\left(h_i\right)\right\|^2-\frac{1}{2}\left\|h_i\right\|^2+\mathrm{constant}

Optimizing logistic regression via gradient ascent

在 logistic regression 中,对数似然函数是:

l(\beta)=\log L(\beta)=\sum_{i=1}^n\left[y_i X_i^{\top} \beta-\log \left(1+\exp X_i^{\top} \beta\right)\right]

为了最大化 l(\beta),计算梯度:

l^{\prime}(\beta)=\sum_{i=1}^n\left[y_i X_i-\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}} X_i\right]=\sum_{i=1}^n\left(y_i-p_i\right) X_i

其中

p_i=\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}=\frac{1}{1+e^{-X_i^{\top} \beta}}

我们用这个梯度更新 \beta 以最大化 l(\beta):

\beta^{(t+1)}=\beta^{(t)}+\gamma_t \sum_{i=1}^n\left(y_i-p_i\right) X_i

可以看出,该算法从错误 y_i - p_i 中学习。
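
下面给出用梯度上升训练 logistic regression 的一个示意实现(学习率 lr、迭代次数等超参数为自拟,非讲义内容):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """梯度上升最大化对数似然;y 取 0/1,X 每行为一个样本(含截距列)。"""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta += lr * X.T @ (y - p)   # 梯度 = sum_i (y_i - p_i) X_i
    return beta
```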

1.14 Langevin

在梯度下降算法中,加入随机噪声扰动,有利于跳出局部最优解,这就是 Langevin 动力学。

Brownian motion, Δt\sqrt{\Delta t} notation, second order Taylor expansion

XtX_ttt 时刻粒子的位置,对应机器学习中的参数 β\beta。在布朗运动中,我们有:

Xt+Δt=Xt+σεtΔtX_{t+\Delta t}=X_t+\sigma \varepsilon_t \sqrt{\Delta t}

其中 εtN(0,1)\varepsilon_t \sim N(0, 1)。设初始位置 X0=xX_0 = x,并将时间 [0,t][0, t] 平均分成 nn 段,即 Δt=t/n\Delta t = t / n,那么有:

Xt=x+i=1nσεitn=x+σt1ni=1nεi N(x,σ2tI)X_t=x+\sum_{i=1}^n \sigma \varepsilon_i \sqrt{\frac{t}{n}}=x+\sigma \sqrt{t} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i \sim \mathrm{~N}\left(x, \sigma^2 t \mathbf{I}\right)

Langevin: energy and entropy

对于描述式模型,从 p_\theta(X) 中采样可以用:

X^{(t+1)} = X^{(t)} + \eta \cdot \frac{\partial}{\partial X} f_\theta(X) + \lambda \varepsilon

对于生成式模型,从 p_\theta(h_i | X_i) 中采样 h_i 可以用:

h^{(t+1)} = h^{(t)} + \eta \cdot \frac{\partial}{\partial h} \log p_\theta(h, X_i) + \lambda \varepsilon
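
下面是 Langevin 采样的一个最小草图(假设可以计算目标 log 密度的梯度 grad_log_p;这里按常见写法把噪声系数取为 sqrt(2·step),对应讲义中更一般的 λ;函数名与步长为自拟):

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step=1e-2, n_steps=1000, rng=None):
    """Langevin dynamics:x <- x + step * grad_log_p(x) + sqrt(2*step) * 噪声。"""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

# 例:从标准正态分布采样,log p(x) = -||x||^2/2 + const,其梯度为 -x
sample = langevin_sample(lambda x: -x, x0=np.zeros(2))
```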

1.17 Linear Discriminant Analysis (LDA)

LDA 的目标是学习一个线性分类器,将 XiX_i 投影到 z=Xiβz = X_i^\top \beta,使得类间方差最大化、类内方差最小化。

设所有样本为 Ω\Omega,其中正类样本为 Ω+\Omega^+,负类样本为 Ω\Omega^-,且:

  • XiΩ+,p(Xiy=+1)N(μ+,Σ+)\forall X_i \in \Omega^{+}, p\left(X_i \mid y=+1\right) \sim \mathrm{N}\left(\mu^{+}, \Sigma^{+}\right)
  • XiΩ,p(Xiy=1)N(μ,Σ)\forall X_i \in \Omega^{-}, p\left(X_i \mid y=-1\right) \sim \mathrm{N}\left(\mu^{-}, \Sigma^{-}\right)

那么类间方差为:

σbetween 2=(EiΩ+[Xiβ]EiΩ[Xiβ])2=(EiΩ+[Xi]βEiΩ[Xi]β)2=((μ+)β(μ)β)2=[(μ+μ)β]2\begin{aligned} \sigma_{\text {between }}^2 & =\left(E_{i \in \Omega^{+}}[X_i^{\top} \beta]-E_{i \in \Omega^{-}}[X_i^{\top} \beta]\right)^2 \\ & =\left(E_{i \in \Omega^{+}}[X_i^{\top}] \beta-E_{i \in \Omega^{-}}[X_i^{\top}] \beta\right)^2 \\ & =((\mu^{+})^{\top} \beta-(\mu^{-})^{\top} \beta)^2 \\ & =\left[(\mu^{+}-\mu^{-})^{\top} \beta\right]^2 \end{aligned}

类内方差为:

σwithin 2=npos σpos 2+nneg σneg 2,npos =Ω+,nneg =Ωσpos 2=EiΩ+[(XiβEiΩ+[Xiβ])2]=EiΩ+[(βXiEiΩ+[βXi])(XiβEiΩ+[Xiβ])]=EiΩ+[β(XiEiΩ+[Xi])(XiEiΩ+[Xi])β]=βEiΩ+[(XiEiΩ+[Xi])(XiEiΩ+[Xi])]β=βΣ+βσneg 2=βΣβ\begin{aligned} \sigma_{\text {within }}^2 & =n_{\text {pos }} \sigma_{\text {pos }}^2+n_{\text {neg }} \sigma_{\text {neg }}^2, \quad n_{\text {pos }}=\left|\Omega^{+}\right|, n_{\text {neg }}=\left|\Omega^{-}\right| \\ \sigma_{\text {pos }}^2 & =E_{i \in \Omega^{+}}\left[\left(X_i^{\top} \beta-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top} \beta]\right)^2\right] \\ & =E_{i \in \Omega^{+}}\left[\left(\beta^{\top} X_i-E_{i^{\prime} \in \Omega^{+}}[\beta^{\top} X_{i^{\prime}}]\right)\left(X_i^{\top} \beta-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top} \beta]\right)\right] \\ & =E_{i \in \Omega^{+}}\left[\beta^{\top}\left(X_i-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}]\right)\left(X_i^{\top}-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top}]\right) \beta\right] \\ & =\beta^{\top} E_{i \in \Omega^{+}}\left[\left(X_i-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}]\right)\left(X_i^{\top}-E_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top}\right]\right)] \beta \\ & =\beta^{\top} \Sigma^{+} \beta \\ \sigma_{\text {neg }}^2 & =\beta^{\top} \Sigma^{-} \beta \end{aligned}

从而优化目标是最大化

S=σbetween 2σwithin 2=[(μ+μ)β]2nposβΣ+β+nnegβΣβ=(β(μ+μ))((μ+μ)β)β(nposΣ++nnegΣ)β=βSBββSWβ\begin{aligned} S & =\frac{\sigma_{\text {between }}^2}{\sigma_{\text {within }}^2} \\ & =\frac{[(\mu^{+}-\mu^{-})^{\top} \beta]^2}{n_{\mathrm{pos}} \beta^{\top} \Sigma^{+} \beta+n_{\mathrm{neg}} \beta^{\top} \Sigma^{-} \beta} \\ & =\frac{(\beta^{\top}(\mu^{+}-\mu^{-}))((\mu^{+}-\mu^{-})^{\top} \beta)}{\beta^{\top}(n_{\mathrm{pos}} \Sigma^{+}+n_{\mathrm{neg}} \Sigma^{-}) \beta} \\ & =\frac{\beta^{\top} S_B \beta}{\beta^{\top} S_W \beta} \end{aligned}

其中

SB=(μ+μ)(μ+μ)SW=nposΣ++nnegΣ\begin{aligned} S_B & =(\mu^{+}-\mu^{-})(\mu^{+}-\mu^{-})^{\top} \\ S_W & =n_{\mathrm{pos}} \Sigma^{+}+n_{\mathrm{neg}} \Sigma^{-} \end{aligned}

由于我们只关心 β\beta 的方向,不妨设 βSWβ=1\beta^\top S_W \beta = 1,从而目标变为:

maxββSBβ, s.t. βSWβ=1\max _\beta \beta^{\top} S_B \beta, \quad \text { s.t. } \quad \beta^{\top} S_W \beta=1

使用拉格朗日乘子法:

L=βSBβλ(βSWβ1)Lβ=2SBβ2λSWβ=0SBβ=λSWβSW1SBβ=λβ\begin{aligned} & L=\beta^{\top} S_B \beta-\lambda(\beta^{\top} S_W \beta-1) \\ \Rightarrow \quad & \frac{\partial L}{\partial \beta}=2 S_B \beta-2 \lambda S_W \beta=0 \\ \Rightarrow \quad & S_B \beta=\lambda S_W \beta \\ \Rightarrow \quad & S_W^{-1} S_B \beta=\lambda \beta \end{aligned}

代入 S_B,可以解出:

\beta \propto S_W^{-1}(\mu^{+}-\mu^{-})
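
下面按 \beta \propto S_W^{-1}(\mu^{+}-\mu^{-}) 给出一个 NumPy 示意实现(函数名自拟;np.cov 的 bias=True 对应按样本数归一化的协方差,再乘以样本数即讲义中的加权 S_W):

```python
import numpy as np

def lda_direction(X_pos, X_neg):
    """两类 LDA 的投影方向:beta ∝ S_W^{-1} (mu+ - mu-),返回单位向量。"""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_w = len(X_pos) * np.cov(X_pos, rowvar=False, bias=True) \
        + len(X_neg) * np.cov(X_neg, rowvar=False, bias=True)
    beta = np.linalg.solve(S_w, mu_p - mu_n)
    return beta / np.linalg.norm(beta)
```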

Lecture 2: Support Vector Machines

2.1 Margin and support vectors

考虑分类问题 yi{1,+1}y_i \in\{-1,+1\},使用分类器

y^=sign(wx+b)={+1,wx+b01,wx+b<0\hat{y}=\operatorname{sign}(w^{\top} x+b)= \begin{cases}+1, & w^{\top} x+b \geq 0 \\ -1, & w^{\top} x+b<0\end{cases}

定义 margin:

γi=yi(wxi+b)\gamma_i=y_i(w^{\top} x_i+b)

γi\gamma_i 表示了分类质量,当 γi>0\gamma_i > 0 时分类正确,且越大意味着越自信。

但是,考虑到 γi\gamma_iwwbb 的增大而增大,我们约束 w=1\Vert w \Vert = 1,即 margin 的定义变为:

γi=yi(wXi+b)w=yi[(ww)Xi+bw]\gamma_i=\frac{y_i(w^{\top} X_i+b)}{\|w\|}= y_i\left[\left(\frac{w}{\|w\|}\right)^{\top} X_i+\frac{b}{\|w\|}\right]

因此,目标函数是:

maxw,bmini=1nγi\max_{w, b}\min_{i=1}^n \gamma_i

事实上,在分类正确的情况下,γi\gamma_i 等于 XiX_iwx+b=0w^\top x+b = 0 的距离。可以验证,将 XiX_i 向分类面推动 γi\gamma_i 距离得到的点 XiγiwwyiX_i-\gamma_i \dfrac{w}{\|w\|} y_i 恰好在分类面上,即:

w(Xiγiwwyi)+b=(wXi+b)yi2w(wXi+b)ww2=(wXi+b)(wXi+b)=0w^{\top}\left(X_i-\gamma_i \frac{w}{\|w\|} y_i\right)+b=\left(w^{\top} X_i+b\right)-y_i^2 \frac{w^{\top}\left(w^{\top} X_i+b\right) w}{\|w\|^2}=\left(w^{\top} X_i+b\right)-\left(w^{\top} X_i+b\right)=0

2.2 Margin classifier

假设所有样本都能分类正确。为了最大化 margin,优化目标是:

maxτ,w,bτ s.t. i,yi(wXi+b)τw=1\begin{aligned} & \max _{\tau, w, b} \, \tau \\ \text { s.t. } \quad & \forall i, y_i(w^{\top} X_i+b) \geq \tau \\ & \|w\|=1 \end{aligned}

为了简化计算,将目标重写为:

maxτ,w,bτw s.t. i,yi(wXi+b)τ\begin{aligned} & \max _{\tau, w, b} \frac{\tau}{\|w\|} \\ \text { s.t. } \quad & \forall i, y_i\left(w^{\top} X_i+b\right) \geq \tau \end{aligned}

由于我们可以任意缩放 wwbb 而不改变超平面,不妨固定 τ=1\tau = 1,得到的优化目标等价为:

minw,b12w2 s.t. i,yi(wXi+b)1\begin{aligned} & \min _{w, b} \frac{1}{2}\|w\|^2 \\ \text { s.t. } \quad & \forall i, y_i\left(w^{\top} X_i+b\right) \geq 1 \end{aligned}

Lagrange multipliers

考虑如下的优化问题:

minwf(w) s.t. hk(w)=0,k=1,2,,K\begin{aligned} & \min _w f(w) \\ \text { s.t. } \quad & h_k(w)=0, k=1,2, \ldots, K \end{aligned}

我们可以最小化

L(w,λ)=f(w)+k=1Kλkhk(w)L(w, \lambda)=f(w)+\sum_{k=1}^K \lambda_k h_k(w)

从而要求

Lw=0,k,Lλk=0\frac{\partial L}{\partial w}=0, \quad \forall k, \frac{\partial L}{\partial \lambda_k}=0

考虑更一般的优化问题:

minwf(w) s.t. gk(w)0,k=1,2,,Khl(w)=0,l=1,2,,L\begin{aligned} & \min _w f(w) \\ \text { s.t. } \quad & g_k(w) \leq 0, k=1,2, \ldots, K \\ & h_l(w)=0, l=1,2, \ldots, L \end{aligned}

L(w,α,β)=f(w)+k=1Kαkgk(w)+l=1Lβlhl(w)L(w, \alpha, \beta)=f(w)+\sum_{k=1}^K \alpha_k g_k(w)+\sum_{l=1}^L \beta_l h_l(w)

当满足

minwmaxα,β:αk0L(w,α,β)=maxα,β:αk0minwL(w,α,β)\min _w \max _{\alpha, \beta: \alpha_k \geq 0} L(w, \alpha, \beta)=\max _{\alpha, \beta: \alpha_k \geq 0} \min _w L(w, \alpha, \beta)

时,KKT 条件为:

p,wpL(w,α,β)=0l,βlL(w,α,β)=0k,αkgk(w)=0k,gk(w)0k,αk0\begin{aligned} \forall p, \quad \frac{\partial}{\partial w_p} L(w, \alpha, \beta) & =0 \\ \forall l, \quad \frac{\partial}{\partial \beta_l} L(w, \alpha, \beta) & =0 \\ \forall k, \quad \alpha_k g_k(w) & =0 \\ \forall k, \quad g_k(w) & \leq 0 \\ \forall k, \quad \alpha_k & \geq 0 \end{aligned}

Learning and support vectors

之前已经推出,SVM 的优化目标是:

minw,b12w2 s.t. i,yi(wXi+b)+10\begin{aligned} & \min _{w, b} \frac{1}{2}\|w\|^2 \\ \text { s.t. } \quad & \forall i,-y_i\left(w^{\top} X_i+b\right)+1 \leq 0 \end{aligned}

根据 KKT 条件,对于

yi(wXi+b)+1=0-y_i\left(w^{\top} X_i+b\right)+1=0

的样本,有

αi>0\alpha_i > 0

这些样本被称为 support vectors(支持向量)。

Optimization

SVM 的 Lagrangian 是:

L(w,b,α)=12w2i=1nαi[yi(wXi+b)1]L(w, b, \alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1\right]

分别对 wwbb 求导,有:

L(w,b,α)w=wi=1nαiyiXiL(w,b,α)b=i=1nαiyi\begin{aligned} \frac{\partial L(w, b, \alpha)}{\partial w}&=w-\sum_{i=1}^n \alpha_i y_i X_i \\ \frac{\partial L(w, b, \alpha)}{\partial b}&=-\sum_{i=1}^n \alpha_i y_i \end{aligned}

由 KKT 条件,有:

L(w,b,α)w=0L(w,b,α)b=0\begin{aligned} &\frac{\partial L(w, b, \alpha)}{\partial w}=0\\ &\frac{\partial L(w, b, \alpha)}{\partial b}=0 \end{aligned}

从而有

{w=i=1nαiyiXii=1nαiyi=0\left\{ \begin{aligned} & \quad w=\sum_{i=1}^n \alpha_i y_i X_i \\ & \quad \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}\right.

因此,

L(w,b,α)=12w2i=1nαi[yi(wXi+b)1]=12i=1nj=1nyiyjαiαjXiXji=1nj=1nyiyjαiαjXiXjbi=1nαiyi+i=1nαi=i=1nαi12i=1nj=1nyiyjαiαjXiXjbi=1nαiyi=i=1nαi12i=1nj=1nyiyjαiαjXiXj\begin{aligned} L(w, b, \alpha) & =\frac{1}{2}\|w\|^2-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1\right] \\ & =\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-\sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-b \sum_{i=1}^n \alpha_i y_i+\sum_{i=1}^n \alpha_i \\ & =\sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-b \sum_{i=1}^n \alpha_i y_i \\ & =\sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j \end{aligned}

这样,我们可以用下式优化出 α\alpha

maxαW(α),W(α)=i=1nαi12i=1nj=1nyiyjαiαjXi,Xj s.t. i,αi0i=1nαiyi=0\begin{aligned} \max _\alpha W(\alpha), \quad W(\alpha)= & \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j\left\langle X_i, X_j\right\rangle \\ \text { s.t. } \quad & \forall i, \alpha_i \geq 0 \\ & \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}

接着可以求出 ww,进而求出 bb

b=12(mini:yi=+1wXi+maxi:yi=1wXi)b=-\frac{1}{2}\left(\min _{i: y_i=+1} w^{\top} X_i+\max _{i: y_i=-1} w^{\top} X_i\right)

因此,推理过程可以写为:

wXi+b=(j=1nαjyjXj)Xi+b=j=1nαjyjXj,Xi+b=j:αj>0αjyjXj,Xi+b=j:αj>0αjyjXj,Xi+b\begin{aligned} w^{\top} X_i+b= & \left(\sum_{j=1}^n \alpha_j y_j X_j\right)^\top X_i+b \\ = & \sum_{j=1}^n \alpha_j y_j\left\langle X_j, X_i\right\rangle+b \\ = & \sum_{j: \alpha_j>0} \alpha_j y_j\left\langle X_j, X_i\right\rangle+b \\ = & \left\langle \sum_{j: \alpha_j>0} \alpha_j y_j X_j, X_i\right\rangle+b \end{aligned}

也就是说,只有支持向量对 ww 有贡献。

2.3 Kernel-based SVM

考虑非线性分类问题,以拟合立方函数为例。设特征向量是:

ϕ(x)=[x1,x2,,xp,x12,x22,,xp2,x13,x23,,xp3]\phi(x)=\left[x_1, x_2, \ldots, x_p, x_1^2, x_2^2, \ldots, x_p^2, x_1^3, x_2^3, \ldots, x_p^3\right]^{\top}

此时线性 SVM 可以表示为:

wϕ(x)+b=i=1p(w3i2xi+w3i1xi2+w3ixi3)+bw^{\top} \phi(x)+b=\sum_{i=1}^p\left(w_{3 i-2} x_i+w_{3 i-1} x_i^2+w_{3 i} x_i^3\right)+b

对于一般的非线性问题,ϕ(x)\phi(x)xx 的高维投影。对于推理过程,有:

wϕ(Xi)+b=(j=1nαjyjϕ(Xj))ϕ(Xi)+b=j=1nαjyjϕ(Xj),ϕ(Xi)+b=j:αj>0αjyjϕ(Xj),ϕ(Xi)+b=j:αj>0αjyjK(Xj,Xi)+b\begin{aligned} w^{\top} \phi\left(X_i\right)+b & =\left(\sum_{j=1}^n \alpha_j y_j \phi\left(X_j\right)\right)^{\top} \phi\left(X_i\right)+b \\ & =\sum_{j=1}^n \alpha_j y_j\left\langle\phi\left(X_j\right), \phi\left(X_i\right)\right\rangle+b \\ & =\sum_{j: \alpha_j>0} \alpha_j y_j\left\langle\phi\left(X_j\right), \phi\left(X_i\right)\right\rangle+b \\ & =\sum_{j: \alpha_j>0} \alpha_j y_j K\left(X_j, X_i\right)+b \end{aligned}

其中

K(Xi,Xj)= def ϕ(Xi)ϕ(Xj)K\left(X_i, X_j\right) \stackrel{\text { def }}{=} \phi\left(X_i\right)^{\top} \phi\left(X_j\right)

被称为核(Kernel)。对之前求解的式子,只要将 Xi,Xj\left\langle X_i, X_j\right\rangle 替换为 K(Xi,Xj)K(X_i, X_j) 即可。

有时,K(Xi,Xj)K\left(X_i, X_j\right) 相对 ϕ(Xi)\phi\left(X_i\right) 更好求。以

K(Xi,Xj)= def (XiXj)2K\left(X_i, X_j\right) \stackrel{\text { def }}{=}\left(X_i^{\top} X_j\right)^2

为例,我们有:

K(Xi,Xj)=(k=1pXikXjk)(k=1pXikXjk)=k=1pl=1pXikXjkXilXjl=k,l(XikXil)(XjkXjl)\begin{aligned} K\left(X_i, X_j\right) & =\left(\sum_{k=1}^p X_{i k} X_{j k}\right)\left(\sum_{k=1}^p X_{i k} X_{j k}\right) \\ & =\sum_{k=1}^p \sum_{l=1}^p X_{i k} X_{j k} X_{i l} X_{j l} \\ & =\sum_{k, l}\left(X_{i k} X_{i l}\right)\left(X_{j k} X_{j l}\right) \end{aligned}

ϕ(Xi)=[Xi1Xi1Xi1Xi2Xi1XipXi2Xi1Xi2Xi2Xi2XipXipXi1XipXi2XipXip]\phi\left(X_i\right)=\begin{bmatrix} X_{i 1} X_{i 1} \\ X_{i 1} X_{i 2} \\ \vdots \\ X_{i 1} X_{i p} \\ X_{i 2} X_{i 1} \\ X_{i 2} X_{i 2} \\ \vdots \\ X_{i 2} X_{i p} \\ \vdots \\ \vdots \\ X_{i p} X_{i 1} \\ X_{i p} X_{i 2} \\ \vdots \\ X_{i p} X_{i p} \end{bmatrix}

可以看出,计算 ϕ(Xi)\phi(X_i) 的复杂度为 O(p2)O(p^2),而计算 K(Xi,Xj)K(X_i, X_j) 的复杂度是 O(p)O(p)。因此,我们可以只计算 K(Xi,Xj)K(X_i, X_j) 而不用关心 ϕ(Xi)\phi(X_i) 的计算。

K(Xi,Xj)K(X_i, X_j) 是核函数的充分必要条件是:

  • 对称性

    K(Xi,Xj)=K(Xj,Xi)K\left(X_i, X_j\right)=K\left(X_j, X_i\right)

  • 半正定:对于核矩阵 KK,其中 Kij=K(Xi,Xj)K_{ij} = K(X_i, X_j),对于任意向量 zz,有

    zKz=ijziKijzj0z^\top K z =\sum_i \sum_j z_i K_{i j} z_j \ge 0

2.4 Common kernels

  • RBF kernel:即 Gaussian kernel

    K(Xi,Xj)=exp(XiXj22σ2)K\left(X_i, X_j\right)=\exp \left(-\frac{\left\|X_i-X_j\right\|^2}{2 \sigma^2}\right)

    其中 σ\sigma 是超参数

    • 衡量了 XiX_iXjX_j 的相似度
  • Simple polynomial kernel

    K(Xi,Xj)=(XiXj)dK\left(X_i, X_j\right)=\left(X_i^{\top} X_j\right)^d

    其中 dd 是超参数

  • Cosine similarity kernel

    K(Xi,Xj)=XiXjXiXjK\left(X_i, X_j\right)=\frac{X_i^{\top} X_j}{\left\|X_i\right\|\left\|X_j\right\|}

  • Sigmoid kernel

    K(Xi,Xj)=tanh(αXiXj+c)K\left(X_i, X_j\right)=\tanh \left(\alpha X_i^{\top} X_j+c\right)

    其中 α\alphacc 是超参数

2.5 With outliers

在现实生活中,我们不能保证全部正确分类。此时的优化目标是:

minξ,w,b12w2+Ci=1nξi s.t. yi(wXi+b)1ξi,i=1,2,,nξi0,i=1,2,,n\begin{aligned} \min _{\xi, w, b} \quad& \frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i \\ \text { s.t. } \quad & y_i\left(w^{\top} X_i+b\right) \geq 1-\xi_i, \quad i=1,2, \ldots, n \\ & \xi_i \geq 0, \quad i=1,2, \ldots, n \end{aligned}

等价于

minξ,w,b12w2+Ci=1nmax(0,1yi(wXi+b))\min _{\xi, w, b} \frac{1}{2}\|w\|^2+C \sum_{i=1}^n \max \left(0,1-y_i\left(w^{\top} X_i+b\right)\right)

其中 max(0,1yi(wXi+b))\max \left(0,1-y_i\left(w^{\top} X_i+b\right)\right) 称为 hinge loss。

此时的 Lagrangian 是:

L(w,b,ξ,α,β)=12w2+Ci=1nξii=1nαi[yi(wXi+b)1+ξi]i=1nβiξiminw,b,ξ:ξi0maxα,β:αi0,βi0L(w,b,ξ,α,β)\begin{gathered} L(w, b, \xi, \alpha, \beta)=\frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1+\xi_i\right]-\sum_{i=1}^n \beta_i \xi_i \\ \min _{w, b, \xi: \xi_i \geq 0} \max _{\alpha, \beta: \alpha_i \geq 0, \beta_i \geq 0} L(w, b, \xi, \alpha, \beta) \end{gathered}

L(w,b,ξ,α,β)w=0,L(w,b,ξ,α,β)b=0\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial w}=0, \quad\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial b}=0

可以得到优化 α\alpha 的目标:

maxαW(α),W(α)=i=1nαi12i=1nj=1nyiyjαiαjXi,Xj s.t. i,0αiCi=1nαiyi=0\begin{aligned} \max _\alpha W(\alpha), \quad W(\alpha)= & \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j\left\langle X_i, X_j\right\rangle \\ \text { s.t. } \quad & \forall i, 0 \leq \alpha_i \leq C \\ & \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}

其 KKT 条件是:

αi=0yi(wXi+b)>1αi=Cyi(wXi+b)<10<αi<Cyi(wXi+b)=1\begin{aligned} \alpha_i=0 & \Rightarrow y_i\left(w^{\top} X_i+b\right)>1 \\ \alpha_i=C & \Rightarrow y_i\left(w^{\top} X_i+b\right)<1 \\ 0<\alpha_i<C & \Rightarrow y_i\left(w^{\top} X_i+b\right)=1 \end{aligned}

Lecture 3: Kernels and Regularized Learning

3.1 Over-fitting & under-fitting

| Training set | Validation set | Performance |
| --- | --- | --- |
| error too large | irrelevant | Underfitting |
| error small | error too large | Overfitting |
| error small | error small | Ideal (good generalization) |

3.2 Ridge Regression

为了防止过拟合,我们引入 Ridge regression,其损失函数是:

\ell(\beta)=\|Y-X \beta\|^2+\lambda\|\beta\|^2

其中 \lambda > 0,\lambda\|\beta\|^2 是惩罚项。对损失求导,有:

\ell(\beta)=\|Y-X \beta\|^2+\lambda\|\beta\|^2=(Y-X \beta)^{\top}(Y-X \beta)+\lambda \beta^{\top} \beta

0=\left.\frac{\partial \ell(\beta)}{\partial \beta}\right|_{\beta=\hat{\beta}_\lambda}=-2 X^{\top}\left(Y-X \hat{\beta}_\lambda\right)+2 \lambda \hat{\beta}_\lambda

从而得到 \beta 的解析解为:

\hat{\beta}_\lambda=\left(X^{\top} X+\lambda I_p\right)^{-1} X^{\top} Y
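
下面是 Ridge regression 闭式解的一个最小 NumPy 草图(函数名与 λ 的默认取值为自拟):

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Ridge 闭式解:beta = (X^T X + lam * I)^{-1} X^T Y。"""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```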

3.3 Kernel Regression

在推理时,有:

f(x)=ϕ(x)βf(x) = \phi(x)^\top \beta

可以证明,

β=iciϕ(xi)\beta=\sum_i c_i \phi\left(x_i\right)

从而

f(x)=i=1nciK(x,xi)f(x)=\sum_{i=1}^n c_i K\left(x, x_i\right)

在其中加入正则项,就是要最小化:

i=1nyij=1ncjK(xi,xj)2+λi,jcicjK(xi,xj)\sum_{i=1}^n\|y_i-\sum_{j=1}^n c_j K\left(x_i, x_j\right)\|^2+\lambda \sum_{i, j} c_i c_j K\left(x_i, x_j\right)

Kij=K(xi,xj)K_{ij} = K(x_i, x_j),那么损失函数可以写作:

(c)=i=1nyij=1ncjKij2+λi,jcicjKij=YKc2+λcKc\ell(c)=\sum_{i=1}^n\|y_i-\sum_{j=1}^n c_j K_{i j}\|^2+\lambda \sum_{i, j} c_i c_j K_{i j}=\|Y-K c\|^2+\lambda c^{\top} K c

其中惩罚项:

cKc=i,jcicjK(xi,xj)=i,jcicjϕ(xi)ϕ(xj)=(iciϕ(xi))(iciϕ(xi))=β2\begin{aligned} c^{\top} K c & =\sum_{i, j} c_i c_j K\left(x_i, x_j\right) \\ & =\sum_{i, j} c_i c_j \phi\left(x_i\right)^{\top} \phi\left(x_j\right) \\ & =\left(\sum_i c_i \phi\left(x_i\right)\right)^{\top}\left(\sum_i c_i \phi\left(x_i\right)\right) \\ & =\|\beta\|^2 \end{aligned}

可以求解出 cc 的估计值为:

c^λ=(K+λIn)1Y\hat{c}_\lambda=\left(K+\lambda I_n\right)^{-1} Y

3.4 Spline Regression

xRx \in \mathbb{R},线性样条函数的形式是:

f(x)=α0+j=1pαjmax(0,xkj)f(x)=\alpha_0+\sum_{j=1}^p \alpha_j \max \left(0, x-k_j\right)

从而最小化的目标是:

i=1nyiα0j=1pαjmax(0,xikj)2+λj=1pαj2\sum_{i=1}^n\|y_i-\alpha_0-\sum_{j=1}^p \alpha_j \max \left(0, x_i-k_j\right)\|^2+\lambda \sum_{j=1}^p \alpha_j^2

其中

yiα0j=1pαjmax(0,xikj)=yi[1,max(0,xik1),max(0,xik2),,max(0,xikp)][α0α1α2αp]y_i-\alpha_0-\sum_{j=1}^p \alpha_j \max \left(0, x_i-k_j\right)=y_i-\left[1, \max \left(0, x_i-k_1\right), \max \left(0, x_i-k_2\right), \ldots, \max \left(0, x_i-k_p\right)\right] \begin{bmatrix}\alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix}

Relations to the Ridge regression

X~ij=max(0,xikj)Z=[1n X~]D=diag(0,1,,1)\begin{aligned} \widetilde{X}_{i j} & =\max \left(0, x_i-k_j\right) \\ Z & =\left[1_n ~ \widetilde{X}\right] \\ D & =\operatorname{diag}(0,1, \ldots, 1) \end{aligned}

从而目标函数可以写作:

(α)=YZα2+λDα2\ell(\alpha)=\|Y-Z \alpha\|^2+\lambda\|D \alpha\|^2

解得 α\alpha 的估计值为:

α^λ=(ZZ+λD)1ZY\hat{\alpha}_\lambda=\left(Z^{\top} Z+\lambda D\right)^{-1} Z^{\top} Y

Relations to the Kernel regression

kjk_j 替换为 xjx_j,令 K(xi,xj)=max(0,xix1)K\left(x_i, x_j\right)=\max \left(0, x_i-x_1\right),那么有:

f^(x)=j=1nα^jx,xj=j=1nα^jK(x,xj)\hat{f}(x)=\sum_{j=1}^n \hat{\alpha}_j\left\langle x, x_j\right\rangle=\sum_{j=1}^n \hat{\alpha}_j K\left(x, x_j\right)

3.5 Lasso regression

Lasso 回归的目标是优化

β^λ=argminβ[12YXβ22+λβ1]\hat{\beta}_\lambda=\arg \min _\beta\left[\frac{1}{2}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_1}\right]

其中

β1=j=1pβj\|\beta\|_{\ell_1}=\sum_{j=1}^p | \beta_j |

对于一般的 pp,没有解析解,但对于 p=1p = 1,存在解析解:

β^λ={(Y,Xλ)/X22, if Y,X>λ(Y,X+λ)/X22, if Y,X<λ0 if Y,Xλ\hat{\beta}_\lambda= \begin{cases}(\langle\mathbf{Y}, \mathbf{X}\rangle-\lambda) /\|\mathbf{X}\|_{\ell_2}^2, & \text { if }\langle\mathbf{Y}, \mathbf{X}\rangle>\lambda \\ (\langle\mathbf{Y}, \mathbf{X}\rangle+\lambda) /\|\mathbf{X}\|_{\ell_2}^2, & \text { if }\langle\mathbf{Y}, \mathbf{X}\rangle<-\lambda \\ 0 & \text { if }|\langle\mathbf{Y}, \mathbf{X}\rangle| \leq \lambda\end{cases}

β^λ=sign(γ^)max(0,γ^λ/X22)\hat{\beta}_\lambda=\operatorname{sign}(\hat{\gamma}) \max \left(0,|\hat{\gamma}|-\lambda /\|\mathbf{X}\|_{\ell_2}^2\right)

其中

γ^=Y,X/X22\hat{\gamma}=\langle\mathbf{Y}, \mathbf{X}\rangle /\|\mathbf{X}\|_{\ell_2}^2

Ridge regression 希望没有主导特征,而 Lasso regression 希望特征稀疏。

3.6 Primal form of Lasso

Lasso 回归有两种等价形式:

  • Primal form of Lasso

min YXβ22/2 subject to β1t\min ~ \|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2 \quad \text{ subject to } \quad \|\beta\|_{\ell_1} \leq t

  • Dual form of Lasso

min YXβ22/2+λβ1\min ~ \|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2+\lambda\|\beta\|_{\ell_1}

接下来证明两种形式等价。设两个式子有不相等的解

β^λ=argminYXβ22/2+λβ1\hat{\beta}_\lambda=\operatorname{argmin}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2+\lambda\|\beta\|_{\ell_1}

β^=argminβYXβ22/2 s.t. β1t\hat{\beta}=\underset{\beta}{\operatorname{argmin}}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2 \quad \text { s.t. } \quad\|\beta\|_{\ell_1} \leq t

t=β^λ1t=\|\hat{\beta}_\lambda\|_{\ell_1}

从而有

YXβ^22/2<YXβ^λ22/2,β^1β^λ1\|\mathbf{Y}-\mathbf{X} \hat{\beta}\|_{\ell_2}^2 / 2<\|\mathbf{Y}-\mathbf{X} \hat{\beta}_\lambda\|_{\ell_2}^2 / 2, \quad\|\hat{\beta}\|_{\ell_1} \leq\|\hat{\beta}_\lambda\|_{\ell_1}

进而

YXβ^22/2+λβ^1<YXβ^λ22/2+λβ^λ1\|\mathbf{Y}-\mathbf{X} \hat{\beta}\|_{\ell_2}^2 / 2+\lambda\|\hat{\beta}\|_{\ell_1}<\|\mathbf{Y}-\mathbf{X} \hat{\beta}_\lambda\|_{\ell_2}^2 / 2+\lambda\|\hat{\beta}_\lambda\|_{\ell_1}

这与第一个式子矛盾,因此 β^λ=β^\hat{\beta}_\lambda = \hat{\beta}

3.7 Coordinate descent for Lasso solution path

多变量 Lasso 回归的求解思路是:将每一个特征维度看作一维的 Lasso 回归,从而用以下算法求解。

\begin{aligned} &\textbf { for } ~\lambda=10^a, 10^{a-\Delta}, 10^{a-2 \Delta}, 10^{a-3 \Delta}, \ldots, 10^b \textbf{ ~do }\\ &\quad \textbf { for } \text{~Feature dimension } j=1,2, \ldots, p \textbf{ ~do }\\ &\quad \quad \text { Compute the residual, } \textstyle \mathbf{R}_j=\mathbf{Y}-\sum_{k \neq j} \mathbf{X}_k \beta_k \text {; }\\ &\quad \quad \text { Update the parameter of the } j \text {-th dimension, } \beta_j=\operatorname{sign}\left(\hat{\gamma}_j\right) \max \left(0,\left|\hat{\gamma}_j\right|-\lambda /\left\|\mathbf{X}_j\right\|_{\ell_2}^2\right) \text {, where }\\ &\hat{\gamma}_j=\left\langle\mathbf{R}_j, \mathbf{X}_j\right\rangle /\left\|\mathbf{X}_j\right\|_{\ell_2}^2\\ &\quad \textbf{ end }\\ &\textbf{ end } \end{aligned}
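
下面按上述算法给出坐标下降的一个 Python 示意实现(仅展示单个 λ 下的求解;求解路径时可按讲义从大到小扫描 λ,并用上一个解热启动;扫描次数等参数为自拟):

```python
import numpy as np

def soft_threshold(gamma, t):
    """软阈值:sign(gamma) * max(0, |gamma| - t)。"""
    return np.sign(gamma) * max(0.0, abs(gamma) - t)

def lasso_coordinate_descent(X, Y, lam, n_sweeps=100):
    """最小化 (1/2)||Y - X beta||^2 + lam * ||beta||_1 的坐标下降。"""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                    # ||X_j||^2
    for _ in range(n_sweeps):
        for j in range(p):
            R_j = Y - X @ beta + X[:, j] * beta[j]   # 去掉第 j 维贡献后的残差
            gamma_j = X[:, j] @ R_j / col_sq[j]
            beta[j] = soft_threshold(gamma_j, lam / col_sq[j])
    return beta
```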

3.8 Bayesian regression

由贝叶斯公式,考虑 \beta 的后验概率

P(βX,Y)=P(βX)P(YX,β)P(YX)=P(β)P(YX,β)/C\begin{aligned} P(\beta | \mathbf{X}, \mathbf{Y}) & = \frac{P(\beta | \mathbf{X}) P(\mathbf{Y} | \mathbf{X}, \beta)}{P(\mathbf{Y} | \mathbf{X})} \\ & = P(\beta)P(\mathbf{Y} | \mathbf{X}, \beta) / C \end{aligned}

从而

logP(βX,Y)=logP(β)+logP(YX,β)+C\log P(\beta | \mathbf{X}, \mathbf{Y}) = \log P(\beta) + \log P(\mathbf{Y} | \mathbf{X}, \beta) + C

βN(0,τ2Ip)\beta \sim \mathrm{N}\left(0, \tau^2 \mathbf{I}_p\right)

YN(Xβ,σ2I)\mathbf{Y} \sim N(\mathbf{X}\beta, \sigma^2\mathbf{I})

那么

logP(βX,Y)=12σ2YXβ2212τ2β22+C\log P(\beta | \mathbf{X}, \mathbf{Y})=-\frac{1}{2 \sigma^2}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2-\frac{1}{2 \tau^2}\|\beta\|_{\ell_2}^2+C

最大化后验概率(即 MAP 估计),得到:

β^=(XX+σ2τ2Ip)1XY\hat{\beta}=\left(\mathbf{X}^{\top} \mathbf{X}+\frac{\sigma^2}{\tau^2} \mathbf{I}_p\right)^{-1} \mathbf{X}^{\top} \mathbf{Y}

对应 λ=σ2/τ2\lambda=\sigma^2 / \tau^2 的 Ridge regression。

3.9 SVM and ridge logistic regression

3.10 Linear Version

设优化目标

i=1nL(yi;xiβ)+λβ2\sum_{i=1}^n L\left(y_i ; x_i^{\top} \beta\right)+\lambda\|\beta\|^2

那么解空间的形式为:

β^=i=1nαixi\hat{\beta}=\sum_{i=1}^n \alpha_i x_i

采用反证法,设最优解

β~=i=1nαixi+k=1Kκkxk\tilde{\beta}=\sum_{i=1}^n \alpha_i x_i+\sum_{k=1}^K \kappa_k x_k

其中 xkxix_k \perp x_i,那么

xiTβ~=xiT(j=1nαjxj+k=1Kκkxk)=j=1nαjxiTxj+k=1KκkxiTxk=j=1nαjxiTxj+k=1Kκk0=j=1nαjxiTxj=xiTβ^\begin{aligned} x_i^T \tilde{\beta} & =x_i^T\left(\sum_{j=1}^n \alpha_j x_j+\sum_{k=1}^K \kappa_k x_k\right) \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j+\sum_{k=1}^K \kappa_k x_i^T x_k \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j+\sum_{k=1}^K \kappa_k 0 \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j \\ & =x_i^T \hat{\beta} \end{aligned}

从而

i=1nL(yi;xiβ^)+λβ^2i=1nL(yi;xiβ~)+λβ~2\sum_{i=1}^n L(y_i ; x_i^{\top} \hat{\beta})+\lambda\|\hat{\beta}\|^2 \leq \sum_{i=1}^n L(y_i ; x_i^{\top} \tilde{\beta})+\lambda\|\tilde{\beta}\|^2

因此,β^\hat{\beta} 才是最优解。

3.11 Feature version

设优化目标

i=1nL(yi;ϕ(xi)β)+λβ2\sum_{i=1}^n L(y_i ; \phi(x_i)^{\top} \beta)+\lambda\|\beta\|^2

那么解空间的形式为:

β^=i=1nαiϕ(xi)\hat{\beta}=\sum_{i=1}^n \alpha_i \phi(x_i)

证明过程与 3.10 中类似,在此省略。

3.12 Gaussian Process and Bayesian Estimation

Linear version

对于 Y=Xβ+ϵY=X \beta+\epsilon,其中 βN(0,τ2Ip),ϵN(0,σ2In)\beta \sim \mathrm{N}\left(0, \tau^2 I_p\right), \epsilon \sim \mathrm{N}\left(0, \sigma^2 I_n\right),且 β\betaϵ\epsilon 相互独立。为了考察 β\beta 的稳定性,我们希望得到后验分布 Pr[βY,X]\operatorname{Pr}[\beta | Y, X]

  • 引理:设 X1X_1X2X_2 是两个多维随机变量,且

    [X1X2]N([μ1μ2],[Σ11Σ12Σ21Σ22])\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)

    Pr[X2X1]N(μ2+Σ21Σ111(X1μ1),Σ22Σ21Σ111Σ12]\operatorname{Pr}\left[X_2 | X_1\right] \sim N\left(\mu_2+\Sigma_{21} \Sigma_{11}^{-1}\left(X_1-\mu_1\right), \Sigma_{22}-\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}\right]

首先,求解 Y=Xβ+ϵY = X\beta + \epsilon 的分布。其均值为:

E[Y]=XE[β]+E[ϵ]=0\operatorname{E}[Y]=X \operatorname{E}[\beta]+\operatorname{E}[\epsilon]=0

方差为:

Var[Y]=Var[Xβ]+Var[ϵ]=XVar[β]X+σ2In=τ2XX+σ2In\begin{aligned} \operatorname{Var}[Y] & =\operatorname{Var}[X \beta]+\operatorname{Var}[\epsilon] \\ & =X \operatorname{Var}[\beta] X^\top+\sigma^2 I_n \\ & =\tau^2 X X^\top+\sigma^2 I_n \end{aligned}

因此,YY 的分布是:

YN(0,τ2XX+σ2In)Y \sim N\left(0, \tau^2 X X^\top+\sigma^2 I_n\right)

又因为

Cov(Y,β)=Ei[(YiEj[Yj])(βiEj[βj])]=Ei[(Xβi+ϵiEj[Xβj+ϵj])(βiEj[βj])]=Ei[(Xβi+ϵiEj[Xβj]Ej[ϵj])(βi)] because Ej[βj]=0=Ei[(Xβi+ϵiEj[Xβj])βi] because Ej[ϵj]=0=Ei[Xβiβi]+Ei[ϵiβi]Ei[Ej[Xβj]βi]=Ei[Xβiβi]+0Ei[0βi]=XEi[βiβi]=X(τ2Ip)=τ2X\begin{aligned} \operatorname{Cov}(Y, \beta) & =E_i\left[\left(Y_i-E_j\left[Y_j\right]\right)\left(\beta_i-E_j\left[\beta_j\right]\right)^{\top}\right] \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j+\epsilon_j\right]\right)\left(\beta_i-E_j\left[\beta_j\right]\right)^{\top}\right] \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j\right]-E_j\left[\epsilon_j\right]\right)\left(\beta_i\right)^{\top}\right] \quad \text { because } E_j\left[\beta_j\right]=0 \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j\right]\right) \beta_i^{\top}\right] \quad \text { because } E_j\left[\epsilon_j\right]=0 \\ & =E_i\left[X \beta_i \beta_i^{\top}\right]+E_i\left[\epsilon_i \beta_i^{\top}\right]-E_i\left[E_j\left[X \beta_j\right] \beta_i^{\top}\right] \\ & =E_i\left[X \beta_i \beta_i^{\top}\right]+0-E_i\left[0 \beta_i^{\top}\right] \\ & =X E_i\left[\beta_i \beta_i^{\top}\right] \\ & =X\left(\tau^2 I_p\right) \\ & =\tau^2 X \end{aligned}

因此,

[Yβ]N([00],[τ2XXT+σ2Inτ2Xτ2XTτ2Ip])\begin{bmatrix} Y \\ \beta \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\begin{bmatrix} \tau^2 X X^T+\sigma^2 I_n & \tau^2 X \\ \tau^2 X^T & \tau^2 I_p \end{bmatrix}\right)

从而

Pr[βY,X]=N(τ2XT(τ2XXT+σ2In)1Y,τ2Ipτ2XT(τ2XXT+σ2In)1τ2X)\begin{equation} \operatorname{Pr}[\beta | Y, X]=N\left(\tau^2 X^T\left(\tau^2 X X^T+\sigma^2 I_n\right)^{-1} Y, \tau^2 I_p-\tau^2 X^T\left(\tau^2 X X^T+\sigma^2 I_n\right)^{-1} \tau^2 X\right) \end{equation}

因此,ridge regression 可以看作最大化 β\beta 的后验概率。

p(βY,X)p(β)p(YX,β)exp(12τ2β2)exp(12σ2YXβ2)=exp(12[1σ2YXβ2+1τ2β2])\begin{aligned} p(\beta | Y, X) & \propto p(\beta) p(Y | X, \beta) \\ & \propto \exp \left(-\frac{1}{2 \tau^2}|\beta|^2\right) \exp \left(-\frac{1}{2 \sigma^2}|Y-X \beta|^2\right) \\ & =\exp \left(-\frac{1}{2}\left[\frac{1}{\sigma^2}|Y-X \beta|^2+\frac{1}{\tau^2}|\beta|^2\right]\right) \end{aligned}

β^=argmaxβlog(p(βY,X))=(XTX+λIp)1(XTY)\hat{\beta}=\underset{\beta}{\operatorname{argmax}} \log (p(\beta | Y, X))=\left(X^T X+\lambda I_p\right)^{-1}\left(X^T Y\right)

该结果与 (3) 式中求出的 Pr[βY,X]\operatorname{Pr}[\beta | Y, X] 的均值一致。

Feature version

对于 yi=ϕ(xi)β+ϵiy_i=\phi\left(x_i\right)^{\top} \beta+\epsilon_i,我们同样求解后验分布 Pr[βY,X]\operatorname{Pr}[\beta | Y, X]。设

ϕ(X)=[ϕ(x1)ϕ(x2)ϕ(xn)]n×d\phi(X)=\begin{bmatrix} \phi\left(x_1\right)^{\top} \\ \phi\left(x_2\right)^{\top} \\ \ldots \\ \phi\left(x_n\right)^{\top} \end{bmatrix}_{n \times d}

那么只需要将 (3) 式中的 XX 替换为 ϕ(X)\phi(X) 即可,得到:

Pr[βY,X]=N(τ2ϕ(X)T(τ2ϕ(X)ϕ(X)T+σ2In)1Y,V)\operatorname{Pr}[\beta | Y, X]=N\left(\tau^2 \phi(X)^T\left(\tau^2 \phi(X) \phi(X)^T+\sigma^2 I_n\right)^{-1} Y, V\right)

其中

V=τ2Ipτ2ϕ(X)T(τ2ϕ(X)ϕ(X)T+σ2In)1τ2ϕ(X)V=\tau^2 I_p-\tau^2 \phi(X)^T\left(\tau^2 \phi(X) \phi(X)^T+\sigma^2 I_n\right)^{-1} \tau^2 \phi(X)

Kernel version

f(x)=ϕ(x)βf(x)=\phi(x)^{\top} \beta,那么

Y=[y1y2yn]=[f(x1)f(x2)f(xn)]+ϵY=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}=\begin{bmatrix} f\left(x_1\right) \\ f\left(x_2\right) \\ \vdots \\ f\left(x_n\right) \end{bmatrix}+\epsilon

由于

Cov(f(x),f(x))=Cov(ϕ(x)Tβ,ϕ(x)Tβ)=τ2ϕ(x)Tϕ(x)\operatorname{Cov}\left(f(x), f\left(x^{\prime}\right)\right)=\operatorname{Cov}\left(\phi(x)^T \beta, \phi\left(x^{\prime}\right)^T \beta\right)=\tau^2 \phi(x)^T \phi\left(x^{\prime}\right)

K(x,x)=τ2ϕ(x)Tϕ(x)K\left(x, x^{\prime}\right) = \tau^2 \phi(x)^T \phi\left(x^{\prime}\right)

那么 YY 的分布可以写作:

YN(0,K+σ2In)Y \sim N\left(\mathbf{0}, \boldsymbol{K}+\sigma^2 \boldsymbol{I}_{\boldsymbol{n}}\right)

YN([00],[τ2ϕ(x1)Tϕ(x1)+σ2τ2ϕ(x1)Tϕ(xn)τ2ϕ(xn)Tϕ(x1)τ2ϕ(xn)Tϕ(xn)+σ2])Y \sim N\left(\begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix},\begin{bmatrix} \tau^2 \phi\left(x_1\right)^T \phi\left(x_1\right)+\sigma^2 & \dots & \tau^2 \phi\left(x_1\right)^T \phi\left(x_n\right) \\ \vdots & \ddots & \vdots \\ \tau^2 \phi\left(x_n\right)^T \phi\left(x_1\right) & \ldots & \tau^2 \phi\left(x_n\right)^T \phi\left(x_n\right)+\sigma^2 \end{bmatrix}\right)

从而

[Yf(x0)]=N([00],[K+σ2InK(x,x0)K(x0,x)K(x0,x0)](n+1)×(n+1))\begin{bmatrix} Y \\ f\left(x_0\right) \end{bmatrix}=N\left(\begin{bmatrix} \mathbf{0} \\ 0 \end{bmatrix},\begin{bmatrix} \boldsymbol{K}+\sigma^2 \boldsymbol{I}_{\boldsymbol{n}} & \boldsymbol{K}\left(\boldsymbol{x}, x_0\right) \\ \boldsymbol{K}\left(x_0, \boldsymbol{x}\right) & K\left(x_0, x_0\right) \end{bmatrix}_{(n+1) \times(n+1)}\right)

因此,

\operatorname{Pr}\left[f\left(x_0\right) | Y, X\right] \sim N\left(K\left(x_0, x\right)\left(K+\sigma^2 I_n\right)^{-1} Y, K\left(x_0, x_0\right)-K\left(x_0, x\right)\left(K+\sigma^2 I_n\right)^{-1} K\left(x_0, x\right)^T\right)
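
下面是 GP 回归后验均值与方差的一个 NumPy 草图(高斯核参数 gamma、噪声方差 sigma2 为自拟超参数;假设 X_train、X_test 为二维数组,每行一个样本):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """高斯核 K(a, b) = exp(-gamma * ||a - b||^2)。"""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_predict(X_train, Y_train, X_test, sigma2=0.1, gamma=1.0):
    """零均值先验下,f(x0) 的后验均值与协方差。"""
    n = len(X_train)
    K = rbf_kernel(X_train, X_train, gamma)
    K_star = rbf_kernel(X_test, X_train, gamma)              # K(x0, x)
    alpha = np.linalg.solve(K + sigma2 * np.eye(n), Y_train)
    mean = K_star @ alpha
    v = np.linalg.solve(K + sigma2 * np.eye(n), K_star.T)
    cov = rbf_kernel(X_test, X_test, gamma) - K_star @ v
    return mean, cov
```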

Marginal likelihood

我们已经推出,YY 的边缘分布为:

YN(0,Kγ+σ2In)Y \sim \mathrm{N}\left(0, K_\gamma+\sigma^2 I_n\right)

其中 γ\gamma 是高斯核参数。从而,γ\gamma 的边缘似然函数为:

1(2π)n/2Σγ1/2exp(12YTΣγ1Y)\frac{1}{(2 \pi)^{n / 2}\left|\Sigma_\gamma\right|^{1 / 2}} \exp \left(-\frac{1}{2} Y^T \Sigma_\gamma^{-1} Y\right)

其中 Σγ=Kγ+σ2In\Sigma_\gamma=K_\gamma+\sigma^2 I_n。取对数,得:

l=12YTΣγ1Y12log(Σγ)n2log(2π)l=-\frac{1}{2} Y^T \Sigma_\gamma^{-1} Y-\frac{1}{2} \log \left(\left|\Sigma_\gamma\right|\right)-\frac{n}{2} \log (2 \pi)

可以由最大化边缘似然求解出 γ\gamma

Lecture 4: Neural Networks

4.1 Neural networks

4.1.1 Two-layer perceptron

设双层感知机的输出 yi{0,1}y_i \in \{0, 1\}hi=(hik,k=1,,d)h_i=\left(h_{i k}, k=1, \ldots, d\right)^{\top} 经过 logistic regression 得到,而每个 hikh_{ik}Xi=(xij,j=1,,p)X_i=\left(x_{i j}, j=1, \ldots, p\right)^{\top} 经过 logistic regression 得到,即:

yiBernoulli(pi),pi=sigmoid(hiβ)=sigmoid(k=1dβkhik),hik=sigmoid(Xiαk)=sigmoid(j=1pαkjxij).\begin{aligned} & y_i \sim \operatorname{Bernoulli}\left(p_i\right), \\ & p_i=\operatorname{sigmoid}\left(h_i^{\top} \beta\right)=\operatorname{sigmoid}\left(\sum_{k=1}^d \beta_k h_{i k}\right), \\ & h_{i k}=\operatorname{sigmoid}\left(X_i^{\top} \alpha_k\right)=\operatorname{sigmoid}\left(\sum_{j=1}^p \alpha_{k j} x_{i j}\right) . \end{aligned}

Back-propagation

我们使用最大似然估计,所有样本分类正确的概率为:

P=ipiyi(1pi)1yilogP=i[yilogpi+(1yi)log(1pi)]logP=i=1n{yi{Alog[1+exp(A)]}+(1yi){log1log[1+exp(A)]}} where A=k=1dβkhiklogP=i=1n{yiAlog[1+exp(A)]} where A=k=1dβkhik\begin{aligned} & P=\prod_i p_i^{y_i}\left(1-p_i\right)^{1-y_i} \\ \Longrightarrow \quad & \log P=\sum_i\left[y_i \log p_i+\left(1-y_i\right) \log \left(1-p_i\right)\right] \\ \Longrightarrow \quad & \log P=\sum_{i=1}^n\left\{y_i\{A-\log [1+\exp (A)]\}+\left(1-y_i\right)\{\log 1-\log [1+\exp (A)]\}\right\} \quad \text { where } \quad A=\sum_{k=1}^d \beta_k h_{i k} \\ \Longrightarrow \quad & \log P=\sum_{i=1}^n\left\{y_i A-\log [1+\exp (A)]\right\} \quad \text { where } \quad A=\sum_{k=1}^d \beta_k h_{i k} \end{aligned}

即对数似然为:

\mathscr{L}(\beta, \alpha)=\sum_{i=1}^n\left\{y_i \sum_{k=1}^d \beta_k h_{i k}-\log \left[1+\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)\right]\right\}

求梯度,得:

\begin{aligned} & \frac{\partial \mathscr{L}}{\partial \beta}=\sum_{i=1}^n\left(y_i-p_i\right) h_i \\ & \frac{\partial \mathscr{L}}{\partial \alpha_k}=\frac{\partial \mathscr{L}}{\partial h_k} \frac{\partial h_k}{\partial \alpha_k}=\sum_{i=1}^n \frac{\partial \mathscr{L}}{\partial h_{i k}} \frac{\partial h_{i k}}{\partial \alpha_k}=\sum_{i=1}^n\left(y_i-p_i\right) \beta_k h_{i k}\left(1-h_{i k}\right) X_i \end{aligned}
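
下面按上述梯度公式给出双层感知机的一个 NumPy 训练草图(隐层宽度 d、学习率、迭代次数等均为自拟超参数,仅用于演示梯度的形状与方向):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer(X, y, d=8, lr=0.1, n_iter=2000, seed=0):
    """sigmoid 隐层 + logistic 输出的双层感知机,梯度上升最大化对数似然。"""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    alpha = rng.normal(scale=0.1, size=(d, p))   # 隐层参数 alpha_k
    beta = rng.normal(scale=0.1, size=d)         # 输出层参数 beta
    for _ in range(n_iter):
        H = sigmoid(X @ alpha.T)                 # h_{ik},形状 (n, d)
        p_hat = sigmoid(H @ beta)                # p_i
        err = y - p_hat                          # y_i - p_i
        grad_beta = H.T @ err
        grad_alpha = ((err[:, None] * beta[None, :]) * H * (1 - H)).T @ X
        beta += lr * grad_beta / n
        alpha += lr * grad_alpha / n
    return alpha, beta
```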

Rectified linear unit (ReLU)

在神经网络中,非线性常常通过 ReLU 函数 max(0,a)\max(0, a) 引入。例如:

yiBernoulli(pi),pi=sigmoid(hiβ)=sigmoid(k=1dβkhik),hik=max(Xiαk,0)=max(j=1pαkjxij,0).\begin{aligned} & y_i \sim \operatorname{Bernoulli}\left(p_i\right), \\ & p_i=\operatorname{sigmoid}\left(h_i^{\top} \beta\right)=\operatorname{sigmoid}\left(\sum_{k=1}^d \beta_k h_{i k}\right), \\ & h_{i k}=\max \left(X_i^{\top} \alpha_k, 0\right)=\max \left(\sum_{j=1}^p \alpha_{k j} x_{i j}, 0\right) . \end{aligned}

对于 ReLU 函数的反向传播,有:

hikαk=1(αkXi>0)Xi=1(hik>0)Xi\frac{\partial h_{i k}}{\partial \alpha_k}=\mathbf{1}\left(\alpha_k^{\top} X_i>0\right) \cdot X_i=\mathbf{1}\left(h_{i k}>0\right) \cdot X_i

4.1.2 Multi-layer network

设多层感知机的每一层为:

hl=fl(sl)sl=Wlhl1+bl\begin{aligned} h_l & =f_l\left(s_l\right) \\ s_l & =W_l h_{l-1}+b_l \end{aligned}

其中层数 l=1,,Ll=1, \ldots, Lh0=Xh_0 = X 是输入向量,hLh_L 用于预测 YY

Chain rule in matrix form

为了推导反向传播,我们先回顾向量的求导法则。设 Y=(yi)m×1,X=(xj)n×1Y=\left(y_i\right)_{m \times 1}, X=\left(x_j\right)_{n \times 1},如果 Y=h(X)Y=h(X),那么

YX=(yixj)m×n.\frac{\partial Y}{\partial X^{\top}}=\left(\frac{\partial y_i}{\partial x_j}\right)_{m \times n}.

如果 Y=h(X)Y=h(X)X=g(Z)X=g(Z),那么

yizj=kyixkxkzj\frac{\partial y_i}{\partial z_j}=\sum_k \frac{\partial y_i}{\partial x_k} \frac{\partial x_k}{\partial z_j}

YZ=YXXZ\frac{\partial Y}{\partial Z^{\top}}=\frac{\partial Y}{\partial X^{\top}} \frac{\partial X}{\partial Z^{\top}}

  • Y=AXY=A X,那么 yi=kaikxky_i=\sum_k a_{i k} x_k,从而 yi/xj=aij\partial y_i / \partial x_j=a_{i j}。因此,Y/X=A\partial Y / \partial X^{\top}=A.
  • Y=XSXY=X^{\top} S X,其中 SS 是对称矩阵,那么 Y/X=2SX\partial Y / \partial X=2 S X.
  • S=I,Y=X2S=I, Y=\|X\|^2,那么 Y/X=2X\partial Y / \partial X=2 X.
Multi-layer back-propagation

对于多层感知机,链式法则可以写为:

Lsl=Lhlhlsl=LhlflLhl1=Lhlhlslslhl1=LhlflWl\begin{gathered} \frac{\partial L}{\partial s_l^{\top}}=\frac{\partial L}{\partial h_l^{\top}} \frac{\partial h_l}{\partial s_l^{\top}}=\frac{\partial L}{\partial h_l^{\top}} f_l^{\prime} \\ \frac{\partial L}{\partial h_{l-1}^{\top}}=\frac{\partial L}{\partial h_l^{\top}} \frac{\partial h_l}{\partial s_l^{\top}} \frac{\partial s_l}{\partial h_{l-1}^{\top}}=\frac{\partial L}{\partial h_l^{\top}} f_l^{\prime} W_l \end{gathered}

其中

Wl=slhl1=[sl,1hl1,1sl,1hl1,2sl,1hl1,3sl,1hl1,4sl,1hl1,5sl,2hl1,1sl,2hl1,2sl,2hl1,3sl,2hl1,4sl,2hl1,5sl,3hl1,1sl,3hl1,2sl,3hl1,3sl,3hl1,4sl,3hl1,5sl,4hl1,1sl,4hl1,2sl,4hl1,3sl,4hl1,4sl,4hl1,5sl,5hl1,1sl,5hl1,2sl,5hl1,3sl,5hl1,4sl,5hl1,5]fl=hlsl=[hl1sl100000hl2sl200000hl3sl300000hl4sl400000hl5sl5]\begin{gathered} W_l=\frac{\partial s_l}{\partial h_{l-1}^{\top}}=\begin{bmatrix} \frac{\partial s_{l, 1}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 2}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 3}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 4}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 5}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,5}} \end{bmatrix} \\ f_l^{\prime}=\frac{\partial h_l}{\partial s_l^{\top}}=\begin{bmatrix} \frac{\partial h_{l 1}}{\partial s_{l 1}} & 0 & 0 & 0 & 0 \\ 0 & \frac{\partial h_{l 2}}{\partial s_{l 2}} & 0 & 0 & 0 \\ 0 & 0 & \frac{\partial h_{l 3}}{\partial s_{l 3}} & 0 & 0 \\ 0 & 0 & 0 & \frac{\partial h_{l 4}}{\partial s_{l 4}} & 0 \\ 0 & 0 & 0 & 0 & \frac{\partial h_{l 5}}{\partial s_{l 5}} \end{bmatrix} \end{gathered}

如果 ff 是 Sigmoid 函数,对角线元素 hlk/slk\partial h_{l k} / \partial s_{l k} 的值为 hlk(1hlk)h_{l k}\left(1-h_{l k}\right);如果 ff 是 ReLU 函数,hlk/slk\partial h_{l k} / \partial s_{l k} 的值为 1(hlk>0)1\left(h_{l k}>0\right).

L/hl1\partial L/\partial h_{l-1}^{\top} 转置,得:

Lhl1=WlflLhl.\frac{\partial L}{\partial h_{l-1}}=W_l^{\top} f_l^{\prime} \frac{\partial L}{\partial h_l} .

接着,推导 L/Wl\partial L / \partial W_l. 首先,对 WlW_l 的第 kk 行求梯度:

(LWlk)1×K=(Lslk)1×1(slkWlk)1×K=(Lslk)1×1(hl1)1×K.\left(\frac{\partial L}{\partial W_{l k}}\right)_{1 \times K}=\left(\frac{\partial L}{\partial s_{l k}}\right)_{1 \times 1}\left(\frac{\partial s_{l k}}{\partial W_{l k}}\right)_{1 \times K}=\left(\frac{\partial L}{\partial s_{l k}}\right)_{1 \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K} .

再将所有行的结果整合,得到:

(LWl)K×K=(Lsl)K×1(hl1)1×K=(fl)K×K(Lhl)K×1(hl1)1×K\left(\frac{\partial L}{\partial W_l}\right)_{K \times K}=\left(\frac{\partial L}{\partial s_l}\right)_{K \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K}=\left(f_l^{\prime}\right)_{K \times K}\left(\frac{\partial L}{\partial h_l}\right)_{K \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K}

同理,

Lbl=flLhl.\frac{\partial L}{\partial b_l}= f_l^{\prime} \frac{\partial L}{\partial h_l} .

Δhl=Lhl,ΔWl=LWl,Dl=fl,\Delta h_l=\frac{\partial L}{\partial h_l}, \Delta W_l=\frac{\partial L}{\partial W_l}, D_l=f_l^{\prime},

那么之前的结果可以写为:

Δhl1=WlDlΔhl,ΔWl=DlΔhlhl1,\begin{aligned} & \Delta h_{l-1}=W_l^{\top} D_l \Delta h_l, \\ & \Delta W_l=D_l \Delta h_l h_{l-1}^{\top}, \end{aligned}

加入 blb_l 可以写为:

(ΔWl,Δbl)=Δ(Wl,bl)=DlΔhl(hl1,1).\left(\Delta W_l, \Delta b_l\right)=\Delta\left(W_l, b_l\right)=D_l \Delta h_l\left(h_{l-1}^{\top}, 1\right) .

4.1.3 Stochastic gradient descent (SGD)

Mini-batch

使用在小批量上的梯度下降法. 设 L(θ)=1ni=1nLi(θ)\mathscr{L}(\theta)=\frac{1}{n} \sum_{i=1}^n L_i(\theta) 是在一个小批量上的平均损失,那么 SGD 的公式为:

θt+1=θtηtL(θt).\theta_{t+1}=\theta_t-\eta_t \mathscr{L}^{\prime}\left(\theta_t\right).

Momentum, Adagrad, RMSprop, Adam

为了解决优化过程中的振荡问题,我们在 SGD 中引入动量 (Momentum),其更新公式为:

\begin{aligned} & v_t=\gamma v_{t-1}+\eta_t g_t, \\ & \theta_t=\theta_{t-1}-v_t, \end{aligned}

其中 g_t 是当前小批量上的平均梯度,v_t 是动量。

在另一个方向,有 Adagrad 方法:

\begin{aligned} & G_t=G_{t-1}+g_t^2 \\ & \theta_{t+1}=\theta_t-\eta_t \frac{g_t}{\sqrt{G_t+\varepsilon}} \end{aligned}

RMSProp 对 Adagrad 作了如下改进:

G_t=\beta G_{t-1}+(1-\beta) g_t^2

Adam 方法是对 RMSProp 和 Momentum 方法的结合,即:

\begin{aligned} & v_t=\gamma v_{t-1}+(1-\gamma) g_t, \\ & G_t=\beta G_{t-1}+(1-\beta) g_t^2, \\ & v_t \leftarrow v_t /(1-\gamma), G_t \leftarrow G_t /(1-\beta), \\ & \theta_{t+1}=\theta_t-\eta_t \frac{v_t}{\sqrt{G_t+\varepsilon}} . \end{aligned}
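
下面是单步 Adam 更新的一个示意实现。注意:标准 Adam 的偏差修正使用 1-\gamma^t 与 1-\beta^t(讲义中简写为 1-\gamma、1-\beta);超参数默认值取常见设置,均为自拟,非讲义内容:

```python
import numpy as np

def adam_step(theta, g, state, lr=1e-3, gamma=0.9, beta=0.999, eps=1e-8):
    """单步 Adam 更新;state = (v, G, t) 保存一阶/二阶矩与步数。"""
    v, G, t = state
    t += 1
    v = gamma * v + (1 - gamma) * g          # 一阶矩(动量)
    G = beta * G + (1 - beta) * g ** 2       # 二阶矩(梯度平方的滑动平均)
    v_hat = v / (1 - gamma ** t)             # 偏差修正
    G_hat = G / (1 - beta ** t)
    theta = theta - lr * v_hat / np.sqrt(G_hat + eps)
    return theta, (v, G, t)
```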

4.2 Convolutional neural networks

4.2.1 Convolution, kernels, filters

基本的卷积神经网络可以表示为:

s=Wh+bsuvw=i=1Rj=1Rk=1CWijkwht(u1)p+i,t(v1)p+j,k+bw\begin{aligned} s & =W \otimes h+b \\ s_{u v w} & =\sum_{i=1}^R \sum_{j=1}^R \sum_{k=1}^C W_{i j k}^w h_{t(u-1)-p+i, t(v-1)-p+j, k}+b^w \end{aligned}

其中卷积核尺寸为 R×R×CR \times R \times Ctt 是 stride,pp 是 padding,WijkwW_{i j k}^w 表示第 ww 个卷积核的 (i,j,k)(i, j, k) 位置。

接着使用 ReLU 层:

h=max{s,0} i.e. hijk=max{sijk,0}\begin{array}{ll} & h=\max \{s, \mathbf{0}\} \\ \text { i.e. } & h_{i j k}=\max \left\{s_{i j k}, 0\right\} \end{array}

然后作最大池化(max-pooling)或平均池化(average-pooling)。

重点:要会算输入和输出的形状。
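
下面给出一个计算卷积层输出形状的小工具函数,用的是常见公式 (H + 2p - R) / t + 1(向下取整);示例中的具体数值为自拟:

```python
def conv_output_shape(h, w, c_in, kernel=3, stride=1, padding=0, n_filters=16):
    """计算卷积层输出特征图形状 (H_out, W_out, C_out)。"""
    h_out = (h + 2 * padding - kernel) // stride + 1
    w_out = (w + 2 * padding - kernel) // stride + 1
    return h_out, w_out, n_filters

# 例:224x224x3 输入,7x7 卷积、stride 2、padding 3、64 个卷积核 -> (112, 112, 64)
print(conv_output_shape(224, 224, 3, kernel=7, stride=2, padding=3, n_filters=64))
```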

4.2.2 Softmax layer for classification

经典的 CNN 结构:x -> Conv -> ReLU -> Max-Pooling -> Conv -> ReLU -> Max-Pooling -> FC -> ReLU -> FC -> ReLU -> FC -> Softmax.

4.2.3 Alex net, VGG net, inception net

4.2.4 Batch normalization layer

Batch normalization 是在通道维度上做归一化。

μ=1ni=1nxi;μd=1ni=1nxid; with channels d=1,2,,Dσd2=1ni=1n(xidμd)2x^id=xidμdσdyid=βd+γdx^id.\begin{aligned} \mu & =\frac{1}{n} \sum_{i=1}^n x_i ; \quad \mu_d=\frac{1}{n} \sum_{i=1}^n x_{i d} ; \quad \text { with channels } d=1,2, \ldots, D \\ \sigma_d^2 & =\frac{1}{n} \sum_{i=1}^n\left(x_{i d}-\mu_d\right)^2 \\ \hat{x}_{i d} & =\frac{x_{i d}-\mu_d}{\sigma_d} \\ y_{i d} & =\beta_d+\gamma_d \hat{x}_{i d} . \end{aligned}

4.2.5 Residual net

4.3 Recurrent neural networks

4.3.1 RNN

4.3.2 LSTM

4.3.3 GRU

4.3.4 Encoder-decoder, thought vector

4.4 Generator and GAN

4.4.1 Encoder and decoder again

4.4.2 Supervised decoder

4.4.3 GAN

4.4.4 Geometry of VAE

4.4.5 ELBO

Lecture 5: Reinforcement Learning

Lecture 6: Visualization, EM Algorithm and Shapley Values

参考资料

本文参考上海交通大学《机器学习》课程 CS3612 张拳石老师的 PDF 讲义整理。

