机器学习:笔记整理

Last updated on June 3, 2026 pm

本文为 SJTU-CS3612 机器学习课程的笔记整理,主要聚焦于公式及其推导过程。

Lecture 1: Linear Model

1.1 Linear regression

假设有 nn 个样本的数据集 {xi1,xi2,,xip,yi}i=1n\{x_{i1}, x_{i2}, \dots, x_{ip}, y_i\}_{i=1}^n,则线性模型可以表示为:

yi=β01+β1xi1+β2xi2++βpxip+εi=j=0pxijβj+εiy_i = \beta_0 1+\beta_1 x_{i 1}+\beta_2 x_{i 2} +\cdots+\beta_p x_{i p}+\varepsilon_i =\sum_{j=0}^p x_{i j} \beta_j+\varepsilon_i

或者表示为矩阵形式:

y=Xβ+ε\mathbf{y}=\mathbf{X}^{\top} \boldsymbol{\beta}+\boldsymbol{\varepsilon}

其中

X=[x1x2xn]=[1x11x1p1x21x2p1xn1xnp],y=[y1y2yn],β=[β1β2βp],ε=[ε1ε2εn]\mathbf{X}^\top =\begin{bmatrix} \mathbf{x}_1^{\top} \\ \mathbf{x}_2^{\top} \\ \vdots \\ \mathbf{x}_n^{\top} \end{bmatrix}=\begin{bmatrix} 1 & x_{11} & \cdots & x_{1 p} \\ 1 & x_{21} & \cdots & x_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n 1} & \cdots & x_{n p} \end{bmatrix}, \quad \mathbf{y}=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{\beta}=\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol{\varepsilon}=\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

1.2 Logistic regression

假设有 nn 个样本的数据集 {(Xi,yi)}i=1n\{(X_i, y_i)\}_{i=1}^n,其中标签 yi{0,1}y_i \in \{0, 1\}. 设模型对样本 XiX_i 的预测概率

Pr(yi=1Xi)=pi,Pr(yi=0Xi)=1pi\operatorname{Pr}(y_i = 1|X_i) = p_i, \quad \operatorname{Pr}(y_i = 0|X_i) = 1 - p_i

Logistic regression 的分类公式是:

pi=sigmoid(Xiβ)=eXiβ1+eXiβ=11+eXiβp_i=\operatorname{s igmoid}\left(X_i^{\top} \beta\right)=\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}=\frac{1}{1+e^{-X_i^{\top} \beta}}

设每个样本的真实标签为 yiy_i^*,那么预测正确的概率为

Pr(yi=yiXi)=piyi(1pi)1yi=(eXiβ1+eXiβ)yi(1eXiβ1+eXiβ)1yi=eyiXiβ1+eXiβ\operatorname{Pr}\left(y_i=y_i^* | X_i\right)=p_i^{y_i^*}\left(1-p_i\right)^{1-y_i^*}=\left(\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}\right)^{y_i^*}\left(1-\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}\right)^{1-y_i^*}=\frac{e^{y_i^* X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}

采用最大似然估计(MLE),目标是所有样本全部预测正确的概率最大,即

Pr(β)=i=1npiyi(1pi)1yi=i=1neyiXiβ1+eXiβ\operatorname{Pr}(\beta)=\prod_{i=1}^n p_i^{y_i^*}\left(1-p_i\right)^{1-y_i^*}=\prod_{i=1}^n \frac{e^{y_i^* X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}

取对数,得

logPr(β)=i=1n[yiXiβlog(1+expXiβ)]\log \operatorname{Pr}(\beta)=\sum_{i=1}^n\left[y_i^* X_i^{\top} \beta-\log \left(1+\exp X_i^{\top} \beta\right)\right]

即优化目标变为

arg maxβPr(β)arg maxβlogPr(β)\argmax_\beta \operatorname{Pr}(\beta) \equiv \argmax_\beta \log \operatorname{Pr}(\beta)

1.3 Classification

如果分类标签 yi{+1,1}y_i \in \{+1, -1\},那么 logistic regression 变为:

Pr(yi=+1Xi)=11+eXiβ,Pr(yi=1Xi)=11+eXiβ\operatorname{Pr}(y_i = +1|X_i) = \frac{1}{1+e^{-X_i^{\top} \beta}}, \quad \operatorname{Pr}(y_i = -1|X_i) = \frac{1}{1+e^{X_i^{\top} \beta}}

合并起来,得到

p(yi)=11+eyiXiβp\left(y_i\right)=\frac{1}{1+e^{-y_iX_i^{\top} \beta}}

1.4 Perceptron

一个基础的感知机,可以表示为:

yi=sign(Xiβ)y_i=\operatorname{sign}\left(X_i^{\top} \beta\right)

其中 sign()\operatorname{sign}(\cdot) 是符号函数。

1.5 Three models

Generative models

给定观测变量 XX 和目标变量 YY生成式模型建模联合概率 pθ(X,Y)p_\theta(X, Y)。用生成式模型做分类时,使用条件概率公式:

pθ(YX)=pθ(X,Y)pθ(X)=pθ(X,Y)Ypθ(X,Y)p_\theta(Y | X)=\frac{p_\theta(X, Y)}{p_\theta(X)}=\frac{p_\theta(X, Y)}{\sum_{Y^{\prime}} p_\theta\left(X, Y^{\prime}\right)}

假设 gθg_\theta 是一个生成式网络,给定输入

hN(0,I)h \sim \mathrm{N}(0, I)

期望的输出为:

X=gθ(h)+εX = g_\theta(h) + \varepsilon

输出 XX 的概率为:

pθ(X)=p(h)pθ(Xh)dhp_\theta(X)=\int p(h) p_\theta(X | h) \mathrm{d} h

要增大 logpθ(X)\log p_\theta(X),就是要增大 logp(h)\log p(h)logpθ(Xh)\log p_\theta(X | h).

  1. logp(h)\log p(h):由于 hN(0,I)h \sim N(0, I),即

    p(h)=12πexp(12h2)p(h) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\| h \|^2\right)

    从而

    logp(h)=12h2+constant\log p(h)=-\frac{1}{2}\|h\|^2+\mathrm{constant}

  2. logpθ(Xh)\log p_\theta(X | h):假设

    Xgθ(h)=εN(0,σ2I)X-g_\theta(h)=\varepsilon \sim \mathrm{N}\left(0, \sigma^2 \mathbf{I}\right)

    那么

    XhN(gθ(h),σ2I)X | h \sim \mathrm{N}\left(g_\theta(h), \sigma^2 \mathbf{I}\right)

    从而

    logpθ(Xh)=Xgθ(h)22σ2+constant\log p_\theta(X | h)=-\frac{\left\|X-g_\theta(h)\right\|^2}{2 \sigma^2}+\mathrm{constant}

Discriminative models

给定观测变量 XX 和目标变量 YY判别式模型建模条件概率 P(YX)P(Y|X). 设输入为 XX,网络的输出为 KK 维向量 fθ(X)f_\theta(X). 最后经过 Softmax 层,得到分类为每一类的概率:

pθ(y=kX)=pk=1Z(θ)exp(fθ(k)(X))p_\theta(y=k | X)= p_k = \frac{1}{Z(\theta)} \exp (f_\theta^{(k)}(X))

其中

Z(θ)=kexp(fθ(k)(X))Z(\theta) = \sum_k \exp (f_\theta^{(k)}(X))

Descriptive models

描述式模型建模输入数据分布 pθ(X)p_\theta(X). 可以表示为:

pθ(X)=1Z(θ)exp(fθ(X))p_\theta(X)=\frac{1}{Z(\theta)} \exp(f_\theta(X))

其中

Z(θ)=exp(fθ(x))dxZ(\theta) = \int \exp (f_\theta(x)) \mathrm{d} x

1.7 Loss functions

为了从数据集 {(Xi,yi)}i=1n\left\{\left(X_i, y_i\right)\right\}_{i=1}^n 中学习 β\beta,我们最小化损失函数

L(β)=i=1nL(yi,Xiβ)\mathscr{L}(\beta)=\sum_{i=1}^n L\left(y_i, X_i^{\top} \beta\right)

其中 L(yi,Xiβ)L\left(y_i, X_i^{\top} \beta\right) 是对每个训练样本的损失.

Loss function for least squares regression

在线性回归中,我们常使用最小平方损失:

L(yi,Xiβ)=(yiXiβ)2L(y_i, X_i^{\top} \beta)=\left(y_i-X_i^{\top} \beta\right)^2

这一损失函数同样可以从最大似然估计中得到. 先假设误差服从正态 分布,即

yiXiβ=εiN(0,σ2I)y_i-X_i^{\top} \beta=\varepsilon_i \sim \mathrm{N}(0,\sigma^2I)

那么有

yiXiN(Xiβ,σ2I)y_i|X_i \sim \mathrm{N}(X_i^\top \beta, \sigma^2I)

Pr(yiXi,β)=12πσexp[(yiXiβ)22σ2]\mathrm{Pr}(y_i|X_i, \beta) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[-\frac{(y_i - X_i^\top \beta)^2}{2\sigma^2}\right]

取负对数似然为损失函数,有:

L(yi,Xiβ)=logPr(yiXi,β)=(yiXiβ)22σ2+constantL(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = \frac{(y_i-X_i^{\top} \beta)^2}{2\sigma^2}+\mathrm{constant}

从而

L(β)=i=1nL(yi,Xiβ)=2σ2i=1n[(yiXiβ)22σ2+constant]=i=1n(yiXiβ)2\mathscr{L}(\beta)=\sum_{i=1}^n L(y_i, X_i^{\top} \beta)=2\sigma^2 \sum_{i=1}^n\left[\frac{(y_i-X_i^{\top} \beta)^2}{2\sigma^2}+\mathrm{constant}\right]=\sum_{i=1}^n(y_i-X_i^{\top} \beta)^2

因此,最小二乘法的本质是最大似然估计.

Loss function for robust linear regression

考虑到最小平方损失对 outliers 较为敏感,可以将损失函数中的平方换为绝对值,即:

L(yi,Xiβ)=yiXiβL(y_i, X_i^{\top} \beta)=| y_i-X_i^{\top} \beta |

将两种损失结合,得到 Huber loss:

L(yi,Xiβ)={12(yiXiβ)2, if yiXiβδδyiXiβδ22, otherwise L(y_i, X_i^{\top} \beta)= \begin{cases}\frac{1}{2}(y_i-X_i^{\top} \beta)^2, & \text { if }|y_i-X_i^{\top} \beta| \leq \delta \\ \delta|y_i-X_i^{\top} \beta|-\frac{\delta^2}{2}, & \text { otherwise }\end{cases}

其中 δ\delta 是选取的截断值,在 δ\delta 之外采用绝对值损失.

Loss function for logistic regression with 0/1 responses

前面推导过,当标签 yi{0,1}y_i \in \{0, 1\} 时,

Pr(yiXi,β)=exp(yiXiβ)1+exp(Xiβ)\operatorname{Pr}\left(y_i | X_i, \beta\right) =\frac{\exp (y_i X_i^\top \beta)}{1+\exp (X_i^{\top} \beta)}

为了最大化这一概率,取负对数似然为损失函数:

L(yi,Xiβ)=logPr(yiXi,β)=[yiXiβlog(1+exp(Xiβ))]L(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = -\left[y_i X_i^{\top} \beta-\log (1+\exp (X_i^{\top} \beta))\right]

Loss function for logistic regression with ±\pm responses

前面推导过,当标签 yi{+1,1}y_i \in \{+1, -1\} 时,

Pr(yiXi,β)=11+exp(yiXiβ)\operatorname{Pr}\left(y_i | X_i, \beta\right) =\frac{1}{1+\exp (-y_i X_i^{\top} \beta)}

同样取负对数似然为损失函数,有:

L(yi,Xiβ)=logPr(yiXi,β)=log[1+exp(yiXiβ)]L(y_i, X_i^{\top} \beta)= -\log \operatorname{Pr}(y_i |X_i, \beta) = \log \left[1+\exp \left(-y_i X_i^{\top} \beta\right)\right]

该损失被称为 logistic loss.

Loss functions for classification

对于 yi{+1,1}y_i \in \{+1, -1\} 的分类问题,除了 logistic loss,还有其他的损失函数可以选择:

 Logistic loss =log(1+exp(yiXiβ)), Exponential loss =exp(yiXiβ), Hinge loss =max(0,1yiXiβ), Zero-one loss =1(yiXiβ<0)\begin{aligned} & \text { Logistic loss }=\log \left(1+\exp \left(-y_i X_i^{\top} \beta\right)\right), \\ & \text { Exponential loss }=\exp \left(-y_i X_i^{\top} \beta\right), \\ & \text { Hinge loss }=\max \left(0,1-y_i X_i^{\top} \beta\right), \\ & \text { Zero-one loss }=1\left(y_i X_i^{\top} \beta<0\right) \end{aligned}

图中横轴表示 mi=yiXiβm_i = y_iX_i^\top\beta(称为 margin),纵轴表示 L(yi,Xiβ)L(y_i, X_i^\top\beta). 当 yiy_iXiβX_i^\top\beta 同号时,表明分类正确,所以当 yi=+1y_i = +1 时,我们希望 XiβX_i^\top\beta 是越正越好;当 yi=1y_i = -1 时,我们希望 XiβX_i^\top\beta 是越负越好. 也就是说,我们希望 mi=yiXiβm_i = y_iX_i^\top\beta 越大越好,即 mim_i 越小,损失函数越大.

1.8 Least Squares

考虑线性模型 Y=Xβ+εY = X\beta + \varepsilon,优化目标是:

β^=arg minβL(β)=arg minβYXβ2\hat{\beta}=\argmin_\beta L(\beta) = \argmin_\beta\Vert Y-X \beta\Vert^2

考虑到 L(β)L(\beta) 为凸函数,在 β=β^\beta = \hat{\beta} 处有一阶条件:

βL(β)=0\frac{\partial}{\partial \beta} L(\beta) = 0

下面对 L(β)L(\beta) 求导. 展开得:

L(β)=(YXβ)(YXβ)=YYYXββXY+βXXβ\begin{aligned} L(\beta) & = (Y - X\beta)^\top(Y - X\beta) \\ & = Y^{\top} Y-Y^{\top} X \beta-\beta^{\top} X^{\top} Y+\beta^{\top} X^{\top} X \beta \\ \end{aligned}

考虑到 βXY\beta^{\top} X^{\top} Y 是标量,有 βXY=YXβ\beta^{\top} X^{\top} Y=Y^{\top} X \beta,从而

L(β)=YY2YXβ+βXXβL(\beta) = Y^{\top} Y-2 Y^{\top} X \beta+\beta^{\top} X^{\top} X \beta

假设 AA 是对称矩阵,那么有:

β(βAβ)=2Aβ\frac{\partial}{\partial \beta}\left(\beta^{\top} A \beta\right)=2 A \beta

β(bβ)=b\frac{\partial}{\partial \beta}\left(b^{\top} \beta\right)=b

根据矩阵求导法则,有:

Lβ=2XY+2XXβ\frac{\partial L}{\partial \beta}=-2 X^{\top} Y+2 X^{\top} X \beta

Lβ=0\dfrac{\partial L}{\partial \beta} = 0,得:

2X(YXβ)=0\begin{equation} 2 X^\top(Y-X \beta)=0 \end{equation}

如果 XX 列满秩,那么 XXX^\top X 可逆,解得:

β^=(XX)1XY\hat{\beta}=\left(X^\top X\right)^{-1} X^\top Y

因此,

Y^=Xβ^=X(XX)1XY\hat{Y}=X \hat{\beta}=X\left(X^\top X\right)^{-1} X^\top Y

可以看出,最小二乘法的实质是将 YY 投影到 XX 的列空间上,其投影为 Xβ^X\hat{\beta};而误差 ε=YXβ\varepsilon = Y - X\beta 和列空间正交,这可以从 (1) 式中看出。

Distribution of β^\hat{\beta}

下面讨论训练得到的参数 β^\hat{\beta} 的稳定性. 设用无穷个样本得到的参数为 βtrue\beta_\text{true},那么我们有:

β^=(XX)1XY=(XX)1X(Xβtrue+ε)=βtrue+(XX)1Xε\begin{aligned} \hat{\beta} & =(X^\top X)^{-1} X^\top Y \\ & =(X^\top X)^{-1} X^\top(X \beta_{\text {true}}+\varepsilon) \\ & =\beta_{\text {true}}+(X^\top X)^{-1} X^\top \varepsilon \end{aligned}

可以看出,β^\hat{\beta} 服从正态分布,其均值为:

E[β^]=βtrue\mathbb{E}[\hat{\beta}]=\beta_{\text {true}}

协方差矩阵为:

Var(β^)=E[(β^E[β^])(β^E[β^])]=E[((XX)1Xε)((XX)1Xε)]=(XX)1XE[εε]X(XX)1,=(XX)1Xσ2IX(XX)1=σ2(XX)1\begin{aligned} \mathrm{Var}(\hat{\beta}) & = \mathbb{E}\left[(\hat{\beta} - \mathbb{E}[\hat{\beta}])(\hat{\beta} - \mathbb{E}[\hat{\beta}])^\top\right] \\ & = \mathbb{E}\left[((X^\top X)^{-1} X^\top \varepsilon )((X^\top X)^{-1} X^\top \varepsilon)^\top\right] \\ & = (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon \varepsilon^\top] X (X^\top X)^{-1, \top} \\ & = (X^\top X)^{-1} X^\top \cdot \sigma^2 I \cdot X (X^\top X)^{-1} \\ & = \sigma^2 (X^\top X)^{-1} \end{aligned}

因此,

β^N(βtrue,σ2(XX)1)\hat{\beta} \sim N\left(\beta_\text{true}, \sigma^2 (X^\top X)^{-1}\right)

1.9 Kullback-Leibler divergence and cross entropy

Coding and entropy

一个分布 p(x)p(x) 的不确定性用熵来衡量:

H(p)=Ep[logp(X)]=xp(x)[logp(x)]H(p)=\mathbb{E}_p[-\log p(X)]=\sum_x p(x)[-\log p(x)]

熵等于用 logp(x)-\log p(x) 长度来编码达到的最短编码长度.

Kullback-Leibler divergence and cross entropy

交叉熵,是对于分布 p(x)p(x),使用另一个分布 q(x)q(x)logq(x)-\log q(x) 长度进行编码,所得到的编码长度,即:

CE(pq)=Ep[logq(X)]=xp(x)logq(x)\mathrm{CE}(p | q) = \mathbb{E}_p[-\log q(X)]=-\sum_x p(x) \log q(x)

不难得到,交叉熵不小于熵,即:

CE(pq)H(p)\mathrm{CE}(p | q) \ge H(p)

当且仅当 p(x)=q(x)p(x) = q(x),即两个分布相等时,等号成立.

定义 KL 散度为交叉熵和熵的差,即:

KL(pq)=CE(pq)H(p)=Ep[logp(X)q(X)]=xp(x)logp(x)q(x)0\mathrm{KL}(p | q) = \mathrm{CE}(p | q) - H(p) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0

当且仅当 p(x)=q(x)p(x) = q(x)时,等号成立. 一般来说,p(x)p(x) 表示真实分布,q(x)q(x) 表示模型输出的预测分布. KL 散度衡量了两个分布之间的距离,但是 KL 散度不对称,即:

KL(pq)KL(qp)\operatorname{KL}(p | q) \neq \operatorname{KL}(q | p)

对于条件概率分布,KL 散度的定义是:

KL(p(yx)q(yx))= def Ep(x,y)[logp(YX)q(YX)]=Ep(x)Ep(yx)[logp(YX)q(YX)]\operatorname{KL}(p(y | x) | q(y | x)) \stackrel{\text { def }}{=} \mathbb{E}_{p(x, y)}\left[\log \frac{p(Y | X)}{q(Y | X)}\right] = \mathbb{E}_{p(x)} \mathbb{E}_{p(y | x)}\left[\log \frac{p(Y | X)}{q(Y | X)}\right]

对于多变量分布,KL 散度为:

KL(p(x,y)q(x,y))=KL(p(x)q(x))+KL(p(yx)q(yx))\begin{equation} \operatorname{KL}(p(x, y) | q(x, y))=\operatorname{KL}(p(x) | q(x))+\operatorname{KL}(p(y | x) | q(y | x)) \end{equation}

1.10 Maximum likelihood

下面证明最大似然估计的正确性. 优化目标是:

L(θ)=1ni=1nlogpθ(yiXi)\mathscr{L}(\theta)=\frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i | X_i)

Pdata(X,y)P_\text{data}(X, y) 是样本分布,则有:

maxθL(θ)=maxθ1ni=1nlogpθ(yiXi)=maxθEPdata[logpθ(yX)]=minθ{EPdata[logpθ(yX)]}\begin{aligned} \max_\theta \mathscr{L}(\theta) & = \max _\theta \frac{1}{n} \sum_{i=1}^n \log p_\theta(y_i | X_i) \\ & = \max _\theta \mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right] \\ & = \min _\theta\left\{-\mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right]\right\} \\ \end{aligned}

由于 EPdata[logPdata(yX)]\mathbb{E}_{P_{\text {data}}}\left[\log P_{\text {data}}(y | X)\right] 是与 θ\theta 无关的常数,

maxθL(θ)=minθEPdata [logPdata(yX)]EPdata[logpθ(yX)]=minθKL(Pdata(yX)pθ(yX))\begin{aligned} \max_\theta \mathscr{L}(\theta) & = \min _\theta \mathbb{E}_{P_{\text {data }}}\left[\log P_{\text {data}}(y | X)\right]-\mathbb{E}_{P_{\text {data}}}\left[\log p_\theta(y | X)\right] \\ & = \min _\theta \mathrm{KL}\left(P_{\text {data}}(y | X) | p_\theta(y | X)\right) \end{aligned}

可以看出,最大似然估计的实质是:最小化预测分布和真实数据分布的 KL 散度,即让预测分布尽可能接近真实数据分布.

1.11 Kullback-Leibler of conditionals

下面证明多变量分布的 KL 散度公式,即式 (2).

KL(p(x,y)q(x,y))=Ep[logp(x,y)q(x,y)]=Ep[logp(x)p(yx)q(x)q(yx)]=Ep[logp(x)q(x)]+Ep[logp(yx)q(yx)]=KL(p(x)q(x))+KL(p(yx)q(yx))\begin{aligned} \mathrm{KL}(p(x, y) | q(x, y)) & =\mathbb{E}_p\left[\log \frac{p(x, y)}{q(x, y)}\right] \\ & =\mathbb{E}_p\left[\log \frac{p(x) p(y | x)}{q(x) q(y | x)}\right] \\ & =\mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right]+\mathbb{E}_p\left[\log \frac{p(y | x)}{q(y | x)}\right] \\ & =\mathrm{KL}(p(x) | q(x))+\mathrm{KL}(p(y | x) | q(y | x)) \end{aligned}

1.13 Gradient of log-likelihood

Discriminative model

前面提到过,判别式模型的输出为:

pθ(y=kX)=pk=1Z(θ)exp(fθ(k)(X))p_\theta(y=k | X)=p_k=\frac{1}{Z(\theta)} \exp (f_\theta^{(k)}(X))

其中

Z(θ)=kexp(fθ(k)(X))Z(\theta)=\sum_k \exp (f_\theta^{(k)}(X))

对对数概率求梯度,有:

θlogpθ(yX)=θfθ(k)(X)θlogZ(θ)=θfθ(k)(X)1Z(θ)θZ(θ)=θfθ(k)(X)1Z(θ)θ[kexp(fθ(k)(X))]=θfθ(k)(X)[kexp(fθ(k)(X))Z(θ)θfθ(k)(X)]=θfθ(k)(X)[kpkθfθ(k)(X)]=k(1(y=k)pk)θfθ(k)(X)=θfθ(X)(Yp)=θfθ(X)(YEθ(YX))\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(y | X) & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{\partial}{\partial \theta} \log Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta} Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta}\left[\sum_{k^{\prime}} \exp (f_\theta^{(k^{\prime})}(X))\right] \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\left[\sum_{k^{\prime}} \frac{\exp (f_\theta^{(k^{\prime})}(X))}{Z(\theta)} \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X)\right] \\ & =\frac{\partial}{\partial \theta} f_\theta^{(k)}(X)-\left[\sum_{k^{\prime}} p_{k^{\prime}} \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X)\right] \\ & =\sum_{k^{\prime}}\left(1(y=k^{\prime})-p_{k^{\prime}}\right) \frac{\partial}{\partial \theta} f_\theta^{(k^{\prime})}(X) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)^{\top}(Y-p) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)^{\top}\left(Y-\mathbb{E}_\theta(Y | X)\right) \end{aligned}

其中 YYyy 的独热形式,即:

Y=[Y1,Y2,,Yn],Yk={1,k=k0,kkY=\left[Y_1, Y_2, \ldots, Y_n\right], \quad Y_{k^{\prime}}=\left\{\begin{array}{cc} 1, & k^{\prime}=k \\ 0, & k^{\prime} \neq k \end{array}\right.

由此可见,判别式模型从误差中学习(learn from errors).

Descriptive model

前面提到,描述式模型建模数据分布,即:

pθ(X)=1Z(θ)exp(fθ(X))p_\theta(X)=\frac{1}{Z(\theta)} \exp (f_\theta(X))

其中

Z(θ)=exp(fθ(x))dxZ(\theta)=\int \exp (f_\theta(x)) \mathrm{d} x

对对数概率求导,有:

θlogpθ(X)=θfθ(X)θlogZ(θ)=θfθ(X)1Z(θ)θZ(θ)=θfθ(X)1Z(θ)exp(fθ(X))θfθ(X)dx=θfθ(X)1Z(θ)exp(fθ(X))θfθ(X)dx=θfθ(X)Xpθ(X)θfθ(X)=θfθ(X)Eθ[θfθ(X)]\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(X) & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{\partial}{\partial \theta} \log Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{1}{Z(\theta)} \frac{\partial}{\partial \theta} Z(\theta) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\frac{1}{Z(\theta)} \int \exp (f_\theta(X)) \frac{\partial}{\partial \theta} f_\theta(X) \mathrm{d} x \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\int \frac{1}{Z(\theta)} \exp \left(f_\theta(X)\right) \frac{\partial}{\partial \theta} f_\theta(X) \mathrm{d} x \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\sum_X p_\theta(X) \frac{\partial}{\partial \theta} f_\theta(X) \\ & =\frac{\partial}{\partial \theta} f_\theta(X)-\mathbb{E}_\theta\left[\frac{\partial}{\partial \theta} f_\theta(X)\right] \end{aligned}

Generative model

对于生成式模型 pθ(h,X)p_\theta(h, X),有:

θlogpθ(X)=1pθ(X)θpθ(h,X)dh=1pθ(X)[θpθ(h,X)]dh=1pθ(X)[θlogpθ(h,X)]pθ(h,X)dh=[θlogpθ(h,X)]pθ(h,X)pθ(X)dh=[θlogpθ(h,X)]pθ(hX)dh=Epθ(hX)[θlogpθ(h,X)]\begin{aligned} \frac{\partial}{\partial \theta} \log p_\theta(X) & =\frac{1}{p_\theta(X)} \frac{\partial}{\partial \theta} \int p_\theta(h, X) \mathrm{d} h \\ & =\frac{1}{p_\theta(X)} \int\left[\frac{\partial}{\partial \theta} p_\theta(h, X)\right] \mathrm{d} h \\ & =\frac{1}{p_\theta(X)} \int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] p_\theta(h, X) \mathrm{d} h \\ & =\int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] \frac{p_\theta(h, X)}{p_\theta(X)} \mathrm{d} h \\ & =\int\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] p_\theta(h | X) \mathrm{d} h \\ & =\mathbb{E}_{p_\theta(h | X)}\left[\frac{\partial}{\partial \theta} \log p_\theta(h, X)\right] \end{aligned}

其中

pθ(h,X)=p(h)pθ(Xh)=N(h0,I)N(Xgθ(h),σ2I)p_\theta(h, X) = p(h) \cdot p_\theta(X | h) = \mathrm{N}(h | \mathbf{0}, \mathbf{I}) \cdot \mathrm{N}\left(X | g_\theta(h), \sigma^2 \mathbf{I}\right)

logpθ(h,X)=12σ2Xgθ(h)212h2+constant\log p_\theta\left(h, X\right)=-\frac{1}{2 \sigma^2}\left\|X-g_\theta\left(h\right)\right\|^2-\frac{1}{2}\left\|h\right\|^2+\mathrm{constant}

Optimizing logistic regression via gradient ascent

在 logistic regression 中,对数似然函数是:

l(β)=logL(β)=i=1n[yiXiβlog(1+expXiβ)]l(\beta)=\log L(\beta)=\sum_{i=1}^n\left[y_i X_i^{\top} \beta-\log \left(1+\exp X_i^{\top} \beta\right)\right]

为了最大化 l(β)l(\beta),计算梯度:

l(β)=i=1n[yiXieXiβ1+eXiβXi]=i=1n(yipi)Xil^{\prime}(\beta)=\sum_{i=1}^n\left[y_i X_i-\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}} X_i\right]=\sum_{i=1}^n\left(y_i-p_i\right) X_i

其中

pi=eXiβ1+eXiβ=11+eXiβp_i=\frac{e^{X_i^{\top} \beta}}{1+e^{X_i^{\top} \beta}}=\frac{1}{1+e^{-X_i^{\top} \beta}}

我们用这个梯度更新 β\beta 以最大化 l(β)l(\beta)

β(t+1)=β(t)+γti=1n(yipi)Xi\beta^{(t+1)}=\beta^{(t)}+\gamma_t \sum_{i=1}^n\left(y_i-p_i\right) X_i

可以看出,该算法从错误 yipiy_i - p_i 中学习.

1.14 Langevin

在梯度下降算法中,加入随机噪声扰动,有利于跳出局部最优解,这就是 Langevin 动力学.

Brownian motion, Δt\sqrt{\Delta t} notation, second order Taylor expansion

XtX_ttt 时刻粒子的位置,对应线性模型中的参数 β\beta. 在布朗运动中,我们有:

Xt+Δt=Xt+σεtΔtX_{t+\Delta t}=X_t+\sigma \varepsilon_t \sqrt{\Delta t}

其中 εtN(0,1)\varepsilon_t \sim N(0, 1). 设初始位置 X0=xX_0 = x,并将时间 [0,t][0, t] 平均分成 nn 段,即 Δt=t/n\Delta t = t / n,那么有:

Xt=x+i=1nσεitn=x+σt1ni=1nεi N(x,σ2tI)X_t=x+\sum_{i=1}^n \sigma \varepsilon_i \sqrt{\frac{t}{n}}=x+\sigma \sqrt{t} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i \sim \mathrm{~N}\left(x, \sigma^2 t \mathbf{I}\right)

Langevin: energy and entropy

对于描述式模型,我们采样 pθ(X)p_\theta(X) 可以用:

X(t+1)=X(t)+ηXfθ(X)+λεX^{(t+1)} = X^{(t)} + \eta \cdot \frac{\partial}{\partial X} f_\theta(X) + \lambda \varepsilon

对于生成式模型,我们采样从 pθ(hiXi)p_\theta(h_i | X_i) 中采样 hih_i 可以用:

h(t+1)=h(t)+ηhpθ(h,Xi)+λεh^{(t+1)} = h^{(t)} + \eta \cdot \frac{\partial}{\partial h} p_\theta(h, X_i) + \lambda \varepsilon

1.17 Linear Discriminant Analysis (LDA)

LDA 的目标是学习一个线性分类器,将 XiX_i 投影到 z=Xiβz = X_i^\top \beta,使得类间方差最大化、类内方差最小化.

设所有样本为 Ω\Omega,其中正类样本为 Ω+\Omega^+,负类样本为 Ω\Omega^-,且:

  • XiΩ+,p(Xiy=+1)N(μ+,Σ+)\forall X_i \in \Omega^{+}, p\left(X_i | y=+1\right) \sim \mathrm{N}\left(\mu^{+}, \Sigma^{+}\right)
  • XiΩ,p(Xiy=1)N(μ,Σ)\forall X_i \in \Omega^{-}, p\left(X_i | y=-1\right) \sim \mathrm{N}\left(\mu^{-}, \Sigma^{-}\right)

那么类间方差为:

σbetween 2=(EiΩ+[Xiβ]EiΩ[Xiβ])2=(EiΩ+[Xi]βEiΩ[Xi]β)2=((μ+)β(μ)β)2=[(μ+μ)β]2\begin{aligned} \sigma_{\text {between }}^2 & =\left(\mathbb{E}_{i \in \Omega^{+}}[X_i^{\top} \beta]-\mathbb{E}_{i \in \Omega^{-}}[X_i^{\top} \beta]\right)^2 \\ & =\left(\mathbb{E}_{i \in \Omega^{+}}[X_i^{\top}] \beta-\mathbb{E}_{i \in \Omega^{-}}[X_i^{\top}] \beta\right)^2 \\ & =((\mu^{+})^{\top} \beta-(\mu^{-})^{\top} \beta)^2 \\ & =\left[(\mu^{+}-\mu^{-})^{\top} \beta\right]^2 \end{aligned}

类内方差为:

σwithin 2=npos σpos 2+nneg σneg 2,npos =Ω+,nneg =Ωσpos 2=EiΩ+[(XiβEiΩ+[Xiβ])2]=EiΩ+[(βXiEiΩ+[βXi])(XiβEiΩ+[Xiβ])]=EiΩ+[β(XiEiΩ+[Xi])(XiEiΩ+[Xi])β]=βEiΩ+[(XiEiΩ+[Xi])(XiEiΩ+[Xi])]β=βΣ+βσneg 2=βΣβ\begin{aligned} \sigma_{\text {within }}^2 & =n_{\text {pos }} \sigma_{\text {pos }}^2+n_{\text {neg }} \sigma_{\text {neg }}^2, \quad n_{\text {pos }}=\left|\Omega^{+}\right|, n_{\text {neg }}=\left|\Omega^{-}\right| \\ \sigma_{\text {pos }}^2 & =\mathbb{E}_{i \in \Omega^{+}}\left[\left(X_i^{\top} \beta-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top} \beta]\right)^2\right] \\ & =\mathbb{E}_{i \in \Omega^{+}}\left[\left(\beta^{\top} X_i-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[\beta^{\top} X_{i^{\prime}}]\right)\left(X_i^{\top} \beta-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top} \beta]\right)\right] \\ & =\mathbb{E}_{i \in \Omega^{+}}\left[\beta^{\top}\left(X_i-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}]\right)\left(X_i^{\top}-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top}]\right) \beta\right] \\ & =\beta^{\top} \mathbb{E}_{i \in \Omega^{+}}\left[\left(X_i-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}]\right)\left(X_i^{\top}-\mathbb{E}_{i^{\prime} \in \Omega^{+}}[X_{i^{\prime}}^{\top}\right]\right)] \beta \\ & =\beta^{\top} \Sigma^{+} \beta \\ \sigma_{\text {neg }}^2 & =\beta^{\top} \Sigma^{-} \beta \end{aligned}

从而优化目标是最大化

S=σbetween 2σwithin 2=[(μ+μ)β]2nposβΣ+β+nnegβΣβ=(β(μ+μ))((μ+μ)β)β(nposΣ++nnegΣ)β=βSBββSWβ\begin{aligned} S & =\frac{\sigma_{\text {between }}^2}{\sigma_{\text {within }}^2} \\ & =\frac{[(\mu^{+}-\mu^{-})^{\top} \beta]^2}{n_{\mathrm{pos}} \beta^{\top} \Sigma^{+} \beta+n_{\mathrm{neg}} \beta^{\top} \Sigma^{-} \beta} \\ & =\frac{(\beta^{\top}(\mu^{+}-\mu^{-}))((\mu^{+}-\mu^{-})^{\top} \beta)}{\beta^{\top}(n_{\mathrm{pos}} \Sigma^{+}+n_{\mathrm{neg}} \Sigma^{-}) \beta} \\ & =\frac{\beta^{\top} S_B \beta}{\beta^{\top} S_W \beta} \end{aligned}

其中

SB=(μ+μ)(μ+μ)SW=nposΣ++nnegΣ\begin{aligned} S_B & =(\mu^{+}-\mu^{-})(\mu^{+}-\mu^{-})^{\top} \\ S_W & =n_{\mathrm{pos}} \Sigma^{+}+n_{\mathrm{neg}} \Sigma^{-} \end{aligned}

由于我们只关心 β\beta 的方向,不妨设 βSWβ=1\beta^\top S_W \beta = 1,从而目标变为:

maxββSBβ, s.t. βSWβ=1\max _\beta \beta^{\top} S_B \beta, \quad \text { s.t. } \quad \beta^{\top} S_W \beta=1

使用 Lagrange 乘子法:

L=βSBβ+λ(βSWβ1)Lβ=2SBβ+2λSWβ=0SBβ=λSWβSW1SBβ=λβ\begin{aligned} & L=-\beta^{\top} S_B \beta+\lambda(\beta^{\top} S_W \beta-1) \\ \Rightarrow \quad & \frac{\partial L}{\partial \beta}=-2 S_B \beta+2 \lambda S_W \beta=0 \\ \Rightarrow \quad & S_B \beta=\lambda S_W \beta \\ \Rightarrow \quad & S_W^{-1} S_B \beta=\lambda \beta \end{aligned}

由此看出,β\betaSW1SBS_W^{-1} S_B 的特征向量. 代入 SBS_B,可以解出:

βSW1(μ+μ)\beta \propto S_W^{-1}(\mu^{+}-\mu^{-})

Lecture 2: Support Vector Machines

2.1 Margin and support vectors

考虑分类问题 yi{1,+1}y_i \in\{-1,+1\},使用分类器

y^=sign(wx+b)={+1,wx+b01,wx+b<0\hat{y}=\operatorname{sign}(w^{\top} x+b)= \begin{cases}+1, & w^{\top} x+b \geq 0 \\ -1, & w^{\top} x+b<0\end{cases}

定义 margin:

γi=yi(wxi+b)\gamma_i=y_i(w^{\top} x_i+b)

γi\gamma_i 表示了分类质量,当 γi>0\gamma_i > 0 时分类正确,且越大意味着越自信. 因此,目标函数可以表示为:

maxw,bmini=1,,nγi\max_{w, b}\min_{i=1, \dots, n} \gamma_i

但是,考虑到 γi\gamma_iwwbb 的增大而增大,我们约束 w=1\Vert w \Vert = 1,即 margin 的定义变为:

γi=yi(wXi+b)w=yi[(ww)Xi+bw]\gamma_i=\frac{y_i(w^{\top} X_i+b)}{\|w\|}= y_i\left[\left(\frac{w}{\|w\|}\right)^{\top} X_i+\frac{b}{\|w\|}\right]

事实上,在分类正确的情况下,γi\gamma_i 等于 XiX_iwx+b=0w^\top x+b = 0 的距离. 可以验证,将 XiX_i 向分类面推动 γi\gamma_i 距离得到的点 XiγiwwyiX_i-\gamma_i \dfrac{w}{\|w\|} y_i 恰好在分类面上,即:

w(Xiγiwwyi)+b=(wXi+b)yi2w(wXi+b)ww2=(wXi+b)(wXi+b)=0w^{\top}\left(X_i-\gamma_i \frac{w}{\|w\|} y_i\right)+b=\left(w^{\top} X_i+b\right)-y_i^2 \frac{w^{\top}\left(w^{\top} X_i+b\right) w}{\|w\|^2}=\left(w^{\top} X_i+b\right)-\left(w^{\top} X_i+b\right)=0

2.2 Margin classifier

假设所有样本都能分类正确. 为了最大化 margin,优化目标是:

maxτ,w,bτ s.t. i,yi(wXi+b)τw=1\begin{aligned} & \max _{\tau, w, b} \, \tau \\ \text { s.t. } \quad & \forall i, y_i(w^{\top} X_i+b) \geq \tau \\ & \|w\|=1 \end{aligned}

为了简化计算,将目标重写为:

maxτ,w,bτw s.t. i,yi(wXi+b)τ\begin{aligned} & \max _{\tau, w, b} \frac{\tau}{\|w\|} \\ \text { s.t. } \quad & \forall i, y_i\left(w^{\top} X_i+b\right) \geq \tau \end{aligned}

这时,由于我们可以缩放 wwbb 而不改变超平面,不妨固定 τ=1\tau = 1,得到的优化目标等价为:

minw,b12w2 s.t. i,yi(wXi+b)1\begin{aligned} & \min _{w, b} \frac{1}{2}\|w\|^2 \\ \text { s.t. } \quad & \forall i, y_i\left(w^{\top} X_i+b\right) \geq 1 \end{aligned}

Lagrange multipliers

考虑带等式约束的优化问题:

minwf(w) s.t. hk(w)=0,k=1,2,,K\begin{aligned} & \min _w f(w) \\ \text { s.t. } \quad & h_k(w)=0, k=1,2, \ldots, K \end{aligned}

我们可以最小化

L(w,λ)=f(w)+k=1Kλkhk(w)L(w, \lambda)=f(w)+\sum_{k=1}^K \lambda_k h_k(w)

从而要求

Lw=0,k,Lλk=0\frac{\partial L}{\partial w}=0, \quad \forall k, \frac{\partial L}{\partial \lambda_k}=0

加入不等式约束,即考虑一般的优化问题:

minwf(w) s.t. gk(w)0,k=1,2,,Khl(w)=0,l=1,2,,L\begin{aligned} & \min _w f(w) \\ \text { s.t. } \quad & g_k(w) \leq 0, k=1,2, \ldots, K \\ & h_l(w)=0, l=1,2, \ldots, L \end{aligned}

L(w,α,β)=f(w)+k=1Kαkgk(w)+l=1Lβlhl(w)L(w, \alpha, \beta)=f(w)+\sum_{k=1}^K \alpha_k g_k(w)+\sum_{l=1}^L \beta_l h_l(w)

当满足

minwmaxα,β:αk0L(w,α,β)=maxα,β:αk0minwL(w,α,β)\min _w \max _{\alpha, \beta: \alpha_k \geq 0} L(w, \alpha, \beta)=\max _{\alpha, \beta: \alpha_k \geq 0} \min _w L(w, \alpha, \beta)

时,有 KKT 条件:

p,wpL(w,α,β)=0l,βlL(w,α,β)=0k,αkgk(w)=0k,gk(w)0k,αk0\begin{aligned} \forall p, \quad \frac{\partial}{\partial w_p} L(w, \alpha, \beta) & =0 \\ \forall l, \quad \frac{\partial}{\partial \beta_l} L(w, \alpha, \beta) & =0 \\ \forall k, \quad \alpha_k g_k(w) & =0 \\ \forall k, \quad g_k(w) & \leq 0 \\ \forall k, \quad \alpha_k & \geq 0 \end{aligned}

Learning and support vectors

之前已经推出,SVM 的优化目标是:

minw,b12w2 s.t. i,yi(wXi+b)+10\begin{aligned} & \min _{w, b} \frac{1}{2}\|w\|^2 \\ \text { s.t. } \quad & \forall i,-y_i\left(w^{\top} X_i+b\right)+1 \leq 0 \end{aligned}

根据 KKT 条件,对于

yi(wXi+b)+1=0-y_i\left(w^{\top} X_i+b\right)+1=0

的样本,才可能有

αi>0\alpha_i > 0

这些样本被称为 support vectors(支持向量).

Optimization

SVM 的 Lagrangian 是:

L(w,b,α)=12w2i=1nαi[yi(wXi+b)1]L(w, b, \alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1\right]

由 KKT 条件,有:

L(w,b,α)w=wi=1nαiyiXi=0L(w,b,α)b=i=1nαiyi=0\begin{aligned} \frac{\partial L(w, b, \alpha)}{\partial w}&=w-\sum_{i=1}^n \alpha_i y_i X_i = 0 \\ \frac{\partial L(w, b, \alpha)}{\partial b}&=-\sum_{i=1}^n \alpha_i y_i = 0 \end{aligned}

从而有

{w=i=1nαiyiXii=1nαiyi=0\left\{ \begin{aligned} & \quad w=\sum_{i=1}^n \alpha_i y_i X_i \\ & \quad \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}\right.

代入 L(w,b,α)L(w, b, \alpha),得:

L(w,b,α)=12w2i=1nαi[yi(wXi+b)1]=12i=1nj=1nyiyjαiαjXiXji=1nj=1nyiyjαiαjXiXjbi=1nαiyi+i=1nαi=i=1nαi12i=1nj=1nyiyjαiαjXiXjbi=1nαiyi=i=1nαi12i=1nj=1nyiyjαiαjXiXj\begin{aligned} L(w, b, \alpha) & =\frac{1}{2}\|w\|^2-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1\right] \\ & =\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-\sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-b \sum_{i=1}^n \alpha_i y_i+\sum_{i=1}^n \alpha_i \\ & =\sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j-b \sum_{i=1}^n \alpha_i y_i \\ & =\sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j X_i^{\top} X_j \end{aligned}

进而,我们可以用下式优化出 α\alpha

maxαW(α),W(α)=i=1nαi12i=1nj=1nyiyjαiαjXi,Xj s.t. i,αi0i=1nαiyi=0\begin{aligned} \max _\alpha W(\alpha), \quad W(\alpha)= & \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j\left\langle X_i, X_j\right\rangle \\ \text { s.t. } \quad & \forall i, \alpha_i \geq 0 \\ & \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}

接着可以求出 ww,以及 bb

b=12(mini:yi=+1wXi+maxi:yi=1wXi)b=-\frac{1}{2}\left(\min _{i: y_i=+1} w^{\top} X_i+\max _{i: y_i=-1} w^{\top} X_i\right)

从而,inference 过程可以写为:

wXi+b=(j=1nαjyjXj)Xi+b=j=1nαjyjXj,Xi+b=j:αj>0αjyjXj,Xi+b=j:αj>0αjyjXj,Xi+b\begin{aligned} w^{\top} X_i+b= & \left(\sum_{j=1}^n \alpha_j y_j X_j\right)^\top X_i+b \\ = & \sum_{j=1}^n \alpha_j y_j\left\langle X_j, X_i\right\rangle+b \\ = & \sum_{j: \alpha_j>0} \alpha_j y_j\left\langle X_j, X_i\right\rangle+b \\ = & \left\langle \sum_{j: \alpha_j>0} \alpha_j y_j X_j, X_i\right\rangle+b \end{aligned}

也就是说,只有支持向量对 ww 有贡献.

2.3 Kernel-based SVM

考虑非线性分类问题,ϕ(x)\phi(x)xx 的高维特征向量. 对于推理过程,有:

wϕ(Xi)+b=(j=1nαjyjϕ(Xj))ϕ(Xi)+b=j=1nαjyjϕ(Xj),ϕ(Xi)+b=j:αj>0αjyjϕ(Xj),ϕ(Xi)+b=j:αj>0αjyjK(Xj,Xi)+b\begin{aligned} w^{\top} \phi\left(X_i\right)+b & =\left(\sum_{j=1}^n \alpha_j y_j \phi\left(X_j\right)\right)^{\top} \phi\left(X_i\right)+b \\ & =\sum_{j=1}^n \alpha_j y_j\left\langle\phi\left(X_j\right), \phi\left(X_i\right)\right\rangle+b \\ & =\sum_{j: \alpha_j>0} \alpha_j y_j\left\langle\phi\left(X_j\right), \phi\left(X_i\right)\right\rangle+b \\ & =\sum_{j: \alpha_j>0} \alpha_j y_j K\left(X_j, X_i\right)+b \end{aligned}

其中

K(Xi,Xj)= def ϕ(Xi)ϕ(Xj)K\left(X_i, X_j\right) \stackrel{\text { def }}{=} \phi\left(X_i\right)^{\top} \phi\left(X_j\right)

被称为核(Kernel). 也就是说,对之前求解的 SVM 结果,只要将 Xi,Xj\left\langle X_i, X_j\right\rangle 替换为 K(Xi,Xj)K(X_i, X_j) 即可.

有时,K(Xi,Xj)K\left(X_i, X_j\right) 相对 ϕ(Xi)\phi\left(X_i\right) 更好求. 事实上,我们只需要计算 K(Xi,Xj)K(X_i, X_j) 而不用关心 ϕ(Xi)\phi(X_i) 的计算.

K(Xi,Xj)K(X_i, X_j) 是核函数的充分必要条件是,同时满足:

  • 对称性K(Xi,Xj)=K(Xj,Xi)K\left(X_i, X_j\right)=K\left(X_j, X_i\right)

  • 半正定:设核矩阵 KK,其中 Kij=K(Xi,Xj)K_{ij} = K(X_i, X_j),那么 z,zKz0\forall z, z^\top K z \ge 0

2.4 Common kernels

  • RBF kernel:即 Gaussian kernel

    K(Xi,Xj)=exp(XiXj22σ2)K\left(X_i, X_j\right)=\exp \left(-\frac{\left\|X_i-X_j\right\|^2}{2 \sigma^2}\right)

    • 衡量了 XiX_iXjX_j 的相似度
  • Simple polynomial kernel

    K(Xi,Xj)=(XiXj)dK\left(X_i, X_j\right)=\left(X_i^{\top} X_j\right)^d

  • Cosine similarity kernel

    K(Xi,Xj)=XiXjXiXjK\left(X_i, X_j\right)=\frac{X_i^{\top} X_j}{\left\|X_i\right\|\left\|X_j\right\|}

  • Sigmoid kernel

    K(Xi,Xj)=tanh(αXiXj+c)K\left(X_i, X_j\right)=\tanh \left(\alpha X_i^{\top} X_j+c\right)

2.5 With outliers

在现实生活中,我们不能保证全部正确分类. 此时的优化目标是:

minξ,w,b12w2+Ci=1nξi s.t. yi(wXi+b)1ξi,i=1,2,,nξi0,i=1,2,,n\begin{aligned} \min _{\xi, w, b} \quad& \frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i \\ \text { s.t. } \quad & y_i\left(w^{\top} X_i+b\right) \geq 1-\xi_i, \quad i=1,2, \ldots, n \\ & \xi_i \geq 0, \quad i=1,2, \ldots, n \end{aligned}

这等价于

minξ,w,b12w2+Ci=1nmax(0,1yi(wXi+b))\min _{\xi, w, b} \quad \frac{1}{2}\|w\|^2+C \sum_{i=1}^n \max \left(0,1-y_i\left(w^{\top} X_i+b\right)\right)

注意到,max(0,1yi(wXi+b))\max \left(0,1-y_i\left(w^{\top} X_i+b\right)\right) 就是 hinge loss.

此时的 Lagrangian 是:

L(w,b,ξ,α,β)=12w2+Ci=1nξii=1nαi[yi(wXi+b)1+ξi]i=1nβiξiL(w, b, \xi, \alpha, \beta)=\frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i-\sum_{i=1}^n \alpha_i\left[y_i\left(w^{\top} X_i+b\right)-1+\xi_i\right]-\sum_{i=1}^n \beta_i \xi_i

目标是:

minw,b,ξ:ξi0maxα,β:αi0,βi0L(w,b,ξ,α,β)\min _{w, b, \xi: \xi_i \geq 0} \max _{\alpha, \beta: \alpha_i \geq 0, \beta_i \geq 0} L(w, b, \xi, \alpha, \beta)

L(w,b,ξ,α,β)w=0,L(w,b,ξ,α,β)b=0\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial w}=0, \quad\frac{\partial L(w, b, \xi, \alpha, \beta)}{\partial b}=0

可以得到优化 α\alpha 的目标:

maxαW(α),W(α)=i=1nαi12i=1nj=1nyiyjαiαjXi,Xj s.t. i,0αiCi=1nαiyi=0\begin{aligned} \max _\alpha W(\alpha), \quad W(\alpha)= & \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j\left\langle X_i, X_j\right\rangle \\ \text { s.t. } \quad & \forall i, 0 \leq \alpha_i \leq C \\ & \sum_{i=1}^n \alpha_i y_i=0 \end{aligned}

KKT 条件是:

αi=0yi(wXi+b)>1αi=Cyi(wXi+b)<10<αi<Cyi(wXi+b)=1\begin{aligned} \alpha_i=0 & \Rightarrow y_i\left(w^{\top} X_i+b\right)>1 \\ \alpha_i=C & \Rightarrow y_i\left(w^{\top} X_i+b\right)<1 \\ 0<\alpha_i<C & \Rightarrow y_i\left(w^{\top} X_i+b\right)=1 \end{aligned}

Lecture 3: Kernels and Regularized Learning

3.1 Over-fitting & under-fitting

training set validation set Performance
error too large irrelevant Underfitting
error small error too large Overfitting
error small error small Ideal
(Good generalization)

3.2 Ridge Regression

为了防止过拟合,Ridge regression 引入惩罚项 λβ2\lambda\|\beta\|^2,其中 λ>0\lambda > 0,即损失函数是:

(β)=YXβ2+λβ2\ell(\beta)=\|Y-X \beta\|^2+\lambda\|\beta\|^2

对损失求导,有:

(β)=YXβ2+λβ2=(YXβ)(YXβ)+λββ\ell(\beta)=\|Y-X \beta\|^2+\lambda\|\beta\|^2=(Y-X \beta)^{\top}(Y-X \beta)+\lambda \beta^{\top} \beta

0=(β)ββ=β^λ=2X(YXβ^λ)+2λβ^λ0=\left.\frac{\partial \ell(\beta)}{\partial \beta}\right|_{\beta=\hat{\beta}_\lambda}=-2 X^{\top}\left(Y-X \hat{\beta}_\lambda\right)+2 \lambda \hat{\beta}_\lambda

从而得到 β\beta 的解析解为:

β^λ=(XX+λIp)1XY\hat{\beta}_\lambda=\left(X^{\top} X+\lambda I_p\right)^{-1} X^{\top} Y

3.3 Kernel Regression

将 Ridge regression 中的 xx 换成 ϕ(x)\phi(x),并定义核 K(x,xi)=ϕ(x)ϕ(xi)K(x, x_i) = \phi(x)^\top \phi(x_i),就可以得到 kernel regression.

在 inference 时,有:

f(x)=ϕ(x)βf(x) = \phi(x)^\top \beta

可以证明(见 3.11 节),

β=iciϕ(xi)\beta=\sum_i c_i \phi\left(x_i\right)

从而

f(x)=i=1nciK(x,xi)f(x)=\sum_{i=1}^n c_i K\left(x, x_i\right)

因此,目标就是要最小化:

l(c)=i=1nyij=1ncjK(xi,xj)2+λi,jcicjK(xi,xj)l(c) = \sum_{i=1}^n\|y_i-\sum_{j=1}^n c_j K\left(x_i, x_j\right)\|^2+\lambda \sum_{i, j} c_i c_j K\left(x_i, x_j\right)

Kij=K(xi,xj)K_{ij} = K(x_i, x_j),那么损失函数可以写作:

(c)=i=1nyij=1ncjKij2+λi,jcicjKij=YKc2+λcKc\ell(c)=\sum_{i=1}^n\|y_i-\sum_{j=1}^n c_j K_{i j}\|^2+\lambda \sum_{i, j} c_i c_j K_{i j}=\|Y-K c\|^2+\lambda c^{\top} K c

其中惩罚项:

cKc=i,jcicjK(xi,xj)=i,jcicjϕ(xi)ϕ(xj)=(iciϕ(xi))(iciϕ(xi))=β2\begin{aligned} c^{\top} K c & =\sum_{i, j} c_i c_j K\left(x_i, x_j\right) \\ & =\sum_{i, j} c_i c_j \phi\left(x_i\right)^{\top} \phi\left(x_j\right) \\ & =\left(\sum_i c_i \phi\left(x_i\right)\right)^{\top}\left(\sum_i c_i \phi\left(x_i\right)\right) \\ & =\|\beta\|^2 \end{aligned}

对损失求导,有:

(c)=YKc2+λcKc=(YKc)(YKc)+λcKc0=(c)cc=c^λ=2K(YKc^λ)+2λKc^λ\begin{gathered} \ell(c)=\|Y-K c\|^2+\lambda c^{\top} K c=(Y-K c)^{\top}(Y-K c)+\lambda c^{\top} K c \\ 0=\left.\frac{\partial \ell(c)}{\partial c}\right|_{c=\hat{c}_\lambda}=-2 K^{\top}\left(Y-K \hat{c}_\lambda\right)+2 \lambda K \hat{c}_\lambda \end{gathered}

从而得到 cc 的解析解为:

c^λ=(K+λIn)1Y\hat{c}_\lambda=\left(K+\lambda I_n\right)^{-1} Y

3.4 Spline Regression

xRx \in \mathbb{R},线性样条函数的形式是:

f(x)=α0+j=1pαjmax(0,xkj)f(x)=\alpha_0+\sum_{j=1}^p \alpha_j \max \left(0, x-k_j\right)

从而最小化的目标是:

i=1nyiα0j=1pαjmax(0,xikj)2+λj=1pαj2\sum_{i=1}^n\|y_i-\alpha_0-\sum_{j=1}^p \alpha_j \max \left(0, x_i-k_j\right)\|^2+\lambda \sum_{j=1}^p \alpha_j^2

也可将拟合误差写成:

yiα0j=1pαjmax(0,xikj)=yi[1,max(0,xik1),max(0,xik2),,max(0,xikp)]αy_i-\alpha_0-\sum_{j=1}^p \alpha_j \max \left(0, x_i-k_j\right)=y_i-\left[1, \max \left(0, x_i-k_1\right), \max \left(0, x_i-k_2\right), \ldots, \max \left(0, x_i-k_p\right)\right] \alpha

其中

α=[α0,α1,α2,,αp].\alpha=\left[\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_p\right]^{\top}.

Relations to the Ridge regression

X~ij=max(0,xikj)Z=[1n X~]D=diag(0,1,,1)\begin{aligned} \widetilde{X}_{i j} & =\max \left(0, x_i-k_j\right) \\ Z & =\left[1_n ~ \widetilde{X}\right] \\ D & =\operatorname{diag}(0,1, \ldots, 1) \end{aligned}

那么目标函数可以写作:

(α)=YZα2+λDα2\ell(\alpha)=\|Y-Z \alpha\|^2+\lambda\|D \alpha\|^2

解得 α\alpha 的解析解为:

α^λ=(ZZ+λD)1ZY\hat{\alpha}_\lambda=\left(Z^{\top} Z+\lambda D\right)^{-1} Z^{\top} Y

Relations to the Kernel regression

kjk_j 替换为 xjx_j,令 K(xi,xj)=max(0,xixj)K\left(x_i, x_j\right)=\max \left(0, x_i-x_j\right),那么有:

f^(x)=j=1nα^jx,xj=j=1nα^jK(x,xj)\hat{f}(x)=\sum_{j=1}^n \hat{\alpha}_j\left\langle x, x_j\right\rangle=\sum_{j=1}^n \hat{\alpha}_j K\left(x, x_j\right)

3.5 Lasso regression

Lasso regression 的目标是优化

β^λ=arg minβ[12YXβ22+λβ1]\hat{\beta}_\lambda=\argmin_\beta\left[\frac{1}{2}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2+\lambda\|\beta\|_{\ell_1}\right]

其中

β1=j=1pβj\|\beta\|_{\ell_1}=\sum_{j=1}^p | \beta_j |

对于一般的 pp,没有解析解,但对于 p=1p = 1,存在解析解:

β^λ={(Y,Xλ)/X22, if Y,X>λ(Y,X+λ)/X22, if Y,X<λ0 if Y,Xλ\hat{\beta}_\lambda= \begin{cases}(\langle\mathbf{Y}, \mathbf{X}\rangle-\lambda) /\|\mathbf{X}\|_{\ell_2}^2, & \text { if }\langle\mathbf{Y}, \mathbf{X}\rangle>\lambda \\ (\langle\mathbf{Y}, \mathbf{X}\rangle+\lambda) /\|\mathbf{X}\|_{\ell_2}^2, & \text { if }\langle\mathbf{Y}, \mathbf{X}\rangle<-\lambda \\ 0 & \text { if }|\langle\mathbf{Y}, \mathbf{X}\rangle| \leq \lambda\end{cases}

β^λ=sign(γ^)max(0,γ^λ/X22)\hat{\beta}_\lambda=\operatorname{sign}(\hat{\gamma}) \max \left(0,|\hat{\gamma}|-\lambda /\|\mathbf{X}\|_{\ell_2}^2\right)

其中

γ^=Y,X/X22\hat{\gamma}=\langle\mathbf{Y}, \mathbf{X}\rangle /\|\mathbf{X}\|_{\ell_2}^2

Ridge regression 希望没有主导特征,而 Lasso regression 希望特征稀疏.

3.6 Primal form of Lasso

Lasso regression 有两种等价形式:

  • Primal form of Lasso

min YXβ22/2 subject to β1t\min ~ \|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2 \quad \text{ subject to } \quad \|\beta\|_{\ell_1} \leq t

  • Dual form of Lasso

min YXβ22/2+λβ1\min ~ \|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2+\lambda\|\beta\|_{\ell_1}

接下来证明两种形式等价. 设两个式子有不相等的解:

β^λ=argminYXβ22/2+λβ1\hat{\beta}_\lambda=\operatorname{argmin}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2+\lambda\|\beta\|_{\ell_1}

β^=argminβYXβ22/2 s.t. β1t\hat{\beta}=\underset{\beta}{\operatorname{argmin}}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2 / 2 \quad \text { s.t. } \quad\|\beta\|_{\ell_1} \leq t

t=β^λ1t=\|\hat{\beta}_\lambda\|_{\ell_1},那么有

YXβ^22/2<YXβ^λ22/2,β^1β^λ1\|\mathbf{Y}-\mathbf{X} \hat{\beta}\|_{\ell_2}^2 / 2<\|\mathbf{Y}-\mathbf{X} \hat{\beta}_\lambda\|_{\ell_2}^2 / 2, \quad\|\hat{\beta}\|_{\ell_1} \leq\|\hat{\beta}_\lambda\|_{\ell_1}

从而

YXβ^22/2+λβ^1<YXβ^λ22/2+λβ^λ1\|\mathbf{Y}-\mathbf{X} \hat{\beta}\|_{\ell_2}^2 / 2+\lambda\|\hat{\beta}\|_{\ell_1}<\|\mathbf{Y}-\mathbf{X} \hat{\beta}_\lambda\|_{\ell_2}^2 / 2+\lambda\|\hat{\beta}_\lambda\|_{\ell_1}

这与第一个式子矛盾,因此 β^λ=β^\hat{\beta}_\lambda = \hat{\beta}.

3.7 Coordinate descent for Lasso solution path

高维 Lasso regression 的求解思路是:将每一个特征维度看作一维的 Lasso regression,从而用以下算法求解.

 for  λ=10a,10aΔ,10a2Δ,10a3Δ,,10b  do  for  Feature dimension j=1,2,,p  do  Compute the residual, Rj=YkjXkβk Update the parameter of the j-th dimension, βj=sign(γ^j)max(0,γ^jλ/X22), where γ^j=Rj,Xj/Xj22 end  end \begin{aligned} &\textbf { for } ~\lambda=10^a, 10^{a-\Delta}, 10^{a-2 \Delta}, 10^{a-3 \Delta}, \ldots, 10^b \textbf{ ~do }\\ &\quad \textbf { for } \text{~Feature dimension } j=1,2, \ldots, p \textbf{ ~do }\\ &\quad \quad \text { Compute the residual, } \textstyle \mathbf{R}_j=\mathbf{Y}-\sum_{k \neq j} \mathbf{X}_k \beta_k \text {; }\\ &\quad \quad \text { Update the parameter of the } j \text {-th dimension, } \beta_j=\operatorname{sign}\left(\hat{\gamma}_j\right) \max \left(0,\left|\hat{\gamma}_j\right|-\lambda /\|\mathbf{X}\|_{\ell_2}^2\right) \text {, where } \hat{\gamma}_j=\left\langle\mathbf{R}_j, \mathbf{X}_j\right\rangle /\left\|\mathbf{X}_j\right\|_{\ell_2}^2\\ &\quad \textbf{ end }\\ &\textbf{ end } \end{aligned}

3.8 Bayesian regression

考虑最大似然估计

P(βX,Y)=P(βX)P(YX,β)P(YX)=P(β)P(YX,β)C\begin{aligned} P(\beta | \mathbf{X}, \mathbf{Y}) & = \frac{P(\beta | \mathbf{X}) P(\mathbf{Y} | \mathbf{X}, \beta)}{P(\mathbf{Y} | \mathbf{X})} \\ & = P(\beta)P(\mathbf{Y} | \mathbf{X}, \beta) \cdot C \end{aligned}

从而

logP(βX,Y)=logP(β)+logP(YX,β)+C\log P(\beta | \mathbf{X}, \mathbf{Y}) = \log P(\beta) + \log P(\mathbf{Y} | \mathbf{X}, \beta) + C

βN(0,τ2Ip)\beta \sim \mathrm{N}\left(0, \tau^2 \mathbf{I}_p\right)

YN(Xβ,σ2I)\mathbf{Y} \sim \mathrm{N}(\mathbf{X}\beta, \sigma^2\mathbf{I})

那么

logP(βX,Y)=12σ2YXβ2212τ2β22+C\log P(\beta | \mathbf{X}, \mathbf{Y})=-\frac{1}{2 \sigma^2}\|\mathbf{Y}-\mathbf{X} \beta\|_{\ell_2}^2-\frac{1}{2 \tau^2}\|\beta\|_{\ell_2}^2+C

要使似然函数最大,得到:

β^=(XX+σ2τ2Ip)1XY\hat{\beta}=\left(\mathbf{X}^{\top} \mathbf{X}+\frac{\sigma^2}{\tau^2} \mathbf{I}_p\right)^{-1} \mathbf{X}^{\top} \mathbf{Y}

对应 λ=σ2/τ2\lambda=\sigma^2 / \tau^2 的 Ridge regression.

3.9 SVM and ridge logistic regression

考虑 SVM 的损失函数

loss(β)=i=1nmax(0,1yiXiβ)+λ2β2\operatorname{loss}(\beta)=\sum_{i=1}^n \max \left(0,1-y_i X_i^{\top} \beta\right)+\frac{\lambda}{2}\|\beta\|^2

求梯度,得:

loss(β)=i=1n1(yiXiβ<1)yiXi+λβ\operatorname{loss}^{\prime}(\beta)=-\sum_{i=1}^n 1\left(y_i X_i^{\top} \beta<1\right) y_i X_i+\lambda \beta

该损失函数与 Ridge logistic regression 的形式类似,即:

loss(β)=i=1nlog[1+exp(yiXiβ)]+λ2β2\operatorname{loss}(\beta)=\sum_{i=1}^n \log \left[1+\exp \left(-y_i X_i^{\top} \beta\right)\right]+\frac{\lambda}{2}\|\beta\|^2

其梯度为:

loss(β)=i=1nsigmoid(yiXiβ)yiXi+λβ\operatorname{loss}^{\prime}(\beta)=-\sum_{i=1}^n \operatorname{sigmoid}\left(-y_i X_i^{\top} \beta\right) y_i X_i+\lambda \beta

3.10 Linear Version

若优化目标可以写作

i=1nL(yi;xiβ)+λβ2\sum_{i=1}^n L\left(y_i ; x_i^{\top} \beta\right)+\lambda\|\beta\|^2

那么解空间的形式为:

β^=i=1nαixi\hat{\beta}=\sum_{i=1}^n \alpha_i x_i

采用反证法,设最优解

β~=i=1nαixi+k=1Kκkxk\tilde{\beta}=\sum_{i=1}^n \alpha_i x_i+\sum_{k=1}^K \kappa_k x_k

其中 xkxix_k \perp x_i,那么

xiTβ~=xiT(j=1nαjxj+k=1Kκkxk)=j=1nαjxiTxj+k=1KκkxiTxk=j=1nαjxiTxj+k=1Kκk0=j=1nαjxiTxj=xiTβ^\begin{aligned} x_i^T \tilde{\beta} & =x_i^T\left(\sum_{j=1}^n \alpha_j x_j+\sum_{k=1}^K \kappa_k x_k\right) \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j+\sum_{k=1}^K \kappa_k x_i^T x_k \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j+\sum_{k=1}^K \kappa_k 0 \\ & =\sum_{j=1}^n \alpha_j x_i^T x_j \\ & =x_i^T \hat{\beta} \end{aligned}

从而

i=1nL(yi;xiβ^)+λβ^2i=1nL(yi;xiβ~)+λβ~2\sum_{i=1}^n L(y_i ; x_i^{\top} \hat{\beta})+\lambda\|\hat{\beta}\|^2 \leq \sum_{i=1}^n L(y_i ; x_i^{\top} \tilde{\beta})+\lambda\|\tilde{\beta}\|^2

因此,β^\hat{\beta} 才是最优解.

3.11 Feature version

若优化目标可以写作

i=1nL(yi;ϕ(xi)β)+λβ2\sum_{i=1}^n L(y_i ; \phi(x_i)^{\top} \beta)+\lambda\|\beta\|^2

那么解空间的形式为:

β^=i=1nαiϕ(xi)\hat{\beta}=\sum_{i=1}^n \alpha_i \phi(x_i)

证明过程与 3.10 中类似,在此省略.

3.12 Gaussian Process and Bayesian Estimation

Linear version

对于 Y=Xβ+ϵY=X \beta+\epsilon,其中 βN(0,τ2Ip),ϵN(0,σ2In)\beta \sim \mathrm{N}\left(0, \tau^2 I_p\right), \epsilon \sim \mathrm{N}\left(0, \sigma^2 I_n\right),且 β\betaϵ\epsilon 相互独立. 为了考察 β\beta 的稳定性,我们希望得到后验分布 Pr[βY,X]\operatorname{Pr}[\beta | Y, X].

  • 引理:设 X1X_1X2X_2 是两个多维随机变量,且

    [X1X2]N([μ1μ2],[Σ11Σ12Σ21Σ22])\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)

    Pr[X2X1]N(μ2+Σ21Σ111(X1μ1),Σ22Σ21Σ111Σ12]\operatorname{Pr}\left[X_2 | X_1\right] \sim N\left(\mu_2+\Sigma_{21} \Sigma_{11}^{-1}\left(X_1-\mu_1\right), \Sigma_{22}-\Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}\right]

首先,求解 Y=Xβ+ϵY = X\beta + \epsilon 的分布. 其均值为:

E[Y]=XE[β]+E[ϵ]=0\operatorname{E}[Y]=X \operatorname{E}[\beta]+\operatorname{E}[\epsilon]=0

方差为:

Var[Y]=Var[Xβ]+Var[ϵ]=XVar[β]X+σ2In=τ2XX+σ2In\begin{aligned} \operatorname{Var}[Y] & =\operatorname{Var}[X \beta]+\operatorname{Var}[\epsilon] \\ & =X \operatorname{Var}[\beta] X^\top+\sigma^2 I_n \\ & =\tau^2 X X^\top+\sigma^2 I_n \end{aligned}

因此,YY 的分布是:

YN(0,τ2XX+σ2In)Y \sim N\left(0, \tau^2 X X^\top+\sigma^2 I_n\right)

再求 YYβ\beta 的协方差:

Cov(Y,β)=Ei[(YiEj[Yj])(βiEj[βj])]=Ei[(Xβi+ϵiEj[Xβj+ϵj])(βiEj[βj])]=Ei[(Xβi+ϵiEj[Xβj]Ej[ϵj])(βi)]=Ei[(Xβi+ϵiEj[Xβj])βi]=Ei[Xβiβi]+Ei[ϵiβi]Ei[Ej[Xβj]βi]=Ei[Xβiβi]+0Ei[0βi]=XEi[βiβi]=X(τ2Ip)=τ2X\begin{aligned} \operatorname{Cov}(Y, \beta) & =E_i\left[\left(Y_i-E_j\left[Y_j\right]\right)\left(\beta_i-E_j\left[\beta_j\right]\right)^{\top}\right] \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j+\epsilon_j\right]\right)\left(\beta_i-E_j\left[\beta_j\right]\right)^{\top}\right] \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j\right]-E_j\left[\epsilon_j\right]\right)\left(\beta_i\right)^{\top}\right] \\ & =E_i\left[\left(X \beta_i+\epsilon_i-E_j\left[X \beta_j\right]\right) \beta_i^{\top}\right] \\ & =E_i\left[X \beta_i \beta_i^{\top}\right]+E_i\left[\epsilon_i \beta_i^{\top}\right]-E_i\left[E_j\left[X \beta_j\right] \beta_i^{\top}\right] \\ & =E_i\left[X \beta_i \beta_i^{\top}\right]+0-E_i\left[0 \beta_i^{\top}\right] \\ & =X E_i\left[\beta_i \beta_i^{\top}\right] \\ & =X\left(\tau^2 I_p\right) \\ & =\tau^2 X \end{aligned}

因此,

[Yβ]N([00],[τ2XXT+σ2Inτ2Xτ2XTτ2Ip])\begin{bmatrix} Y \\ \beta \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\begin{bmatrix} \tau^2 X X^T+\sigma^2 I_n & \tau^2 X \\ \tau^2 X^T & \tau^2 I_p \end{bmatrix}\right)

从而

Pr[βY,X]=N(τ2XT(τ2XXT+σ2In)1Y,τ2Ipτ2XT(τ2XXT+σ2In)1τ2X)\begin{equation} \operatorname{Pr}[\beta | Y, X]=N\left(\tau^2 X^T\left(\tau^2 X X^T+\sigma^2 I_n\right)^{-1} Y, \tau^2 I_p-\tau^2 X^T\left(\tau^2 X X^T+\sigma^2 I_n\right)^{-1} \tau^2 X\right) \end{equation}

可以验证,该后验的均值与 λ=σ2/τ2\lambda=\sigma^2 / \tau^2 的 Ridge regression 的求解结果相等. 因此,Ridge regression 可以看作最大化 β\beta 的后验概率.

Feature version

对于 yi=ϕ(xi)β+ϵiy_i=\phi\left(x_i\right)^{\top} \beta+\epsilon_i,我们同样求解后验分布 Pr[βY,X]\operatorname{Pr}[\beta | Y, X]. 设

ϕ(X)=[ϕ(x1)ϕ(x2)ϕ(xn)]n×d\phi(X)=\begin{bmatrix} \phi\left(x_1\right)^{\top} \\ \phi\left(x_2\right)^{\top} \\ \ldots \\ \phi\left(x_n\right)^{\top} \end{bmatrix}_{n \times d}

那么只需要将 (3) 式中的 XX 替换为 ϕ(X)\phi(X) 即可,得到:

Pr[βY,X]=N(τ2ϕ(X)T(τ2ϕ(X)ϕ(X)T+σ2In)1Y,V)\operatorname{Pr}[\beta | Y, X]=N\left(\tau^2 \phi(X)^T\left(\tau^2 \phi(X) \phi(X)^T+\sigma^2 I_n\right)^{-1} Y, V\right)

其中

V=τ2Ipτ2ϕ(X)T(τ2ϕ(X)ϕ(X)T+σ2In)1τ2ϕ(X)V=\tau^2 I_p-\tau^2 \phi(X)^T\left(\tau^2 \phi(X) \phi(X)^T+\sigma^2 I_n\right)^{-1} \tau^2 \phi(X)

不难发现,VV 中不含有 YY. 这表明 β\beta 的稳定性只和当前特征 ϕ(x)\phi(x) 有关,和高层任务 YY 无关.

Kernel version

f(x)=ϕ(x)βf(x)=\phi(x)^{\top} \beta,那么

Y=[y1y2yn]=[f(x1)f(x2)f(xn)]+ϵY=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}=\begin{bmatrix} f\left(x_1\right) \\ f\left(x_2\right) \\ \vdots \\ f\left(x_n\right) \end{bmatrix}+\epsilon

由于

Cov(f(x),f(x))=Cov(ϕ(x)Tβ,ϕ(x)Tβ)=τ2ϕ(x)Tϕ(x)\operatorname{Cov}\left(f(x), f\left(x^{\prime}\right)\right)=\operatorname{Cov}\left(\phi(x)^T \beta, \phi\left(x^{\prime}\right)^T \beta\right)=\tau^2 \phi(x)^T \phi\left(x^{\prime}\right)

K(x,x)=τ2ϕ(x)Tϕ(x)K\left(x, x^{\prime}\right) = \tau^2 \phi(x)^T \phi\left(x^{\prime}\right)

那么 YY 的分布可以写作:

YN(0,K+σ2In)Y \sim N\left(\mathbf{0}, \boldsymbol{K}+\sigma^2 \boldsymbol{I}_{\boldsymbol{n}}\right)

YN([00],[τ2ϕ(x1)Tϕ(x1)+σ2τ2ϕ(x1)Tϕ(xn)τ2ϕ(xn)Tϕ(x1)τ2ϕ(xn)Tϕ(xn)+σ2])Y \sim N\left(\begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix},\begin{bmatrix} \tau^2 \phi\left(x_1\right)^T \phi\left(x_1\right)+\sigma^2 & \dots & \tau^2 \phi\left(x_1\right)^T \phi\left(x_n\right) \\ \vdots & \ddots & \vdots \\ \tau^2 \phi\left(x_n\right)^T \phi\left(x_1\right) & \ldots & \tau^2 \phi\left(x_n\right)^T \phi\left(x_n\right)+\sigma^2 \end{bmatrix}\right)

从而

[Yf(x0)]=N([00],[K+σ2InK(x,x0)K(x0,x)K(x0,x0)](n+1)×(n+1))\begin{bmatrix} Y \\ f\left(x_0\right) \end{bmatrix}=N\left(\begin{bmatrix} \mathbf{0} \\ 0 \end{bmatrix},\begin{bmatrix} \boldsymbol{K}+\sigma^2 \boldsymbol{I}_{\boldsymbol{n}} & \boldsymbol{K}\left(\boldsymbol{x}, x_0\right) \\ \boldsymbol{K}\left(x_0, \boldsymbol{x}\right) & K\left(x_0, x_0\right) \end{bmatrix}_{(n+1) \times(n+1)}\right)

因此,

Pr[f(x0)Y,X]N(K(x0,x)(K+σ2In)1Y,K(x0,x0)K(x0,x)(K+σ2In)1K(x0,x)T)\operatorname{Pr}\left[f\left(x_0\right) | Y, X\right] \sim N\left(K\left(x_0, x\right)\left(K+\sigma^2 I_n\right)^{-1} Y, K\left(x_0, x_0\right)-K\left(x_0, x\right)\left(K+\sigma^2 I_n\right)^{-1} K\left(x_0, x\right)^T\right)

该方差表明了在测试样本 x0x_0 上的拟合误差大小,即该样本是否适用于这个网络.

Marginal likelihood

我们已经推出,YY 的边缘分布为:

YN(0,Kγ+σ2In)Y \sim \mathrm{N}\left(0, K_\gamma+\sigma^2 I_n\right)

其中 γ\gamma 是高斯核参数. 使用最大似然估计,我们希望得到使得真实的 YY 出现概率最大的参数 γ\gamma. 似然函数为:

1(2π)n/2Σγ1/2exp(12YTΣγ1Y)\frac{1}{(2 \pi)^{n / 2}\left|\Sigma_\gamma\right|^{1 / 2}} \exp \left(-\frac{1}{2} Y^T \Sigma_\gamma^{-1} Y\right)

其中 Σγ=Kγ+σ2In\Sigma_\gamma=K_\gamma+\sigma^2 I_n. 取对数,得:

l=12YTΣγ1Y12log(Σγ)n2log(2π)l=-\frac{1}{2} Y^T \Sigma_\gamma^{-1} Y-\frac{1}{2} \log \left(\left|\Sigma_\gamma\right|\right)-\frac{n}{2} \log (2 \pi)

接着由最大化 ll 即可解出 γ\gamma.

Lecture 4: Neural Networks

4.1 Neural networks

4.1.1 Two-layer perceptron

设双层感知机的输出 yi{0,1}y_i \in \{0, 1\}hi=(hik,k=1,,d)h_i=\left(h_{i k}, k=1, \ldots, d\right)^{\top} 经过 logistic regression 得到,而每个 hikh_{ik}Xi=(xij,j=1,,p)X_i=\left(x_{i j}, j=1, \ldots, p\right)^{\top} 经过 logistic regression 得到,即:

yiBernoulli(pi),pi=sigmoid(hiβ)=sigmoid(k=1dβkhik),hik=sigmoid(Xiαk)=sigmoid(j=1pαkjxij).\begin{aligned} & y_i \sim \operatorname{Bernoulli}\left(p_i\right), \\ & p_i=\operatorname{sigmoid}\left(h_i^{\top} \beta\right)=\operatorname{sigmoid}\left(\sum_{k=1}^d \beta_k h_{i k}\right), \\ & h_{i k}=\operatorname{sigmoid}\left(X_i^{\top} \alpha_k\right)=\operatorname{sigmoid}\left(\sum_{j=1}^p \alpha_{k j} x_{i j}\right) . \end{aligned}

Back-propagation

我们使用最大似然估计,所有样本分类正确的概率为:

P=i=1npiyi(1pi)1yilogP=i=1n[yilogpi+(1yi)log(1pi)]logP=i=1n{yi{Alog[1+exp(A)]}+(1yi){log1log[1+exp(A)]}} where A=k=1dβkhiklogP=i=1n{yiAlog[1+exp(A)]} where A=k=1dβkhik\begin{aligned} & P=\prod_{i=1}^n p_i^{y_i}\left(1-p_i\right)^{1-y_i} \\ \Longrightarrow \quad & \log P=\sum_{i=1}^n \left[y_i \log p_i+\left(1-y_i\right) \log \left(1-p_i\right)\right] \\ \Longrightarrow \quad & \log P=\sum_{i=1}^n\left\{y_i\{A-\log [1+\exp (A)]\}+\left(1-y_i\right)\{\log 1-\log [1+\exp (A)]\}\right\} \quad \text { where } \quad A=\sum_{k=1}^d \beta_k h_{i k} \\ \Longrightarrow \quad & \log P=\sum_{i=1}^n\left\{y_i A-\log [1+\exp (A)]\right\} \quad \text { where } \quad A=\sum_{k=1}^d \beta_k h_{i k} \end{aligned}

即对数似然为:

L(β,α)=i=1n{yik=1dβkhiklog[1+exp(k=1dβkhik)]}\mathscr{L}(\beta, \alpha)=\sum_{i=1}^n\left\{y_i \sum_{k=1}^d \beta_k h_{i k}-\log \left[1+\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)\right]\right\}

求梯度,得:

Lβ=i=1n(yipi)hiLαk=Lhkhkαk=i=1nLhikhikαk=i=1n(yipi)βkhik(1hik)Xi\begin{aligned} & \frac{\partial \mathscr{L}}{\partial \beta}=\sum_{i=1}^n\left(y_i-p_i\right) h_i \\ & \frac{\partial \mathscr{L}}{\partial \alpha_k}=\frac{\partial \mathscr{L}}{\partial h_k} \frac{\partial h_k}{\partial \alpha_k}=\sum_{i=1}^n \frac{\partial \mathscr{L}}{\partial h_{i k}} \frac{\partial h_{i k}}{\partial \alpha_k}=\sum_{i=1}^n\left(y_i-p_i\right) \beta_k h_{i k}\left(1-h_{i k}\right) X_i \end{aligned}

其中

Lhik=hiki=1n{yik=1dβkhiklog[1+exp(k=1dβkhik)]}=yiβkexp(k=1dβkhik)βk1+exp(k=1dβkhik)={yiexp(k=1dβkhik)1+exp(k=1dβkhik)}βk={yi11+exp(k=1dβkhik)}βk=(yipi)βkhikαk=αk(11+exp(αkXi))=exp(αkXi)Xi[1+exp(αkXi)]2=(11+exp(αkXi))(exp(αkXi)1+exp(αkXi))Xi=hik(1hik)Xi\begin{aligned} \frac{\partial \mathscr{L}}{\partial h_{i k}} & =\frac{\partial}{\partial h_{i k}} \sum_{i=1}^n\left\{y_i \sum_{k=1}^d \beta_k h_{i k}-\log \left[1+\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)\right]\right\} \\ & =y_i \beta_k-\frac{\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right) \beta_k}{1+\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)} \\ & =\left\{y_i-\frac{\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)}{1+\exp \left(\sum_{k=1}^d \beta_k h_{i k}\right)}\right\} \beta_k \\ & =\left\{y_i-\frac{1}{1+\exp \left(-\sum_{k=1}^d \beta_k h_{i k}\right)}\right\} \beta_k \\ & =\left(y_i-p_i\right) \beta_k \\ \frac{\partial h_{i k}}{\partial \alpha_k} & =\frac{\partial}{\partial \alpha_k}\left(\frac{1}{1+\exp \left(-\alpha_k^{\top} X_i\right)}\right) \\ & =\frac{\exp \left(-\alpha_k^{\top} X_i\right) X_i}{\left[1+\exp \left(-\alpha_k^{\top} X_i\right)\right]^2} \\ & =\left(\frac{1}{1+\exp \left(-\alpha_k^{\top} X_i\right)}\right)\left(\frac{\exp \left(-\alpha_k^{\top} X_i\right)}{1+\exp \left(-\alpha_k^{\top} X_i\right)}\right) X_i \\ & =h_{i k}\left(1-h_{i k}\right) X_i \end{aligned}

Rectified linear unit (ReLU)

在神经网络中,非线性常常通过 ReLU 函数 max(0,a)\max(0, a) 引入. 例如:

yiBernoulli(pi),pi=sigmoid(hiβ)=sigmoid(k=1dβkhik),hik=max(Xiαk,0)=max(j=1pαkjxij,0).\begin{aligned} & y_i \sim \operatorname{Bernoulli}\left(p_i\right), \\ & p_i=\operatorname{sigmoid}\left(h_i^{\top} \beta\right)=\operatorname{sigmoid}\left(\sum_{k=1}^d \beta_k h_{i k}\right), \\ & h_{i k}=\max \left(X_i^{\top} \alpha_k, 0\right)=\max \left(\sum_{j=1}^p \alpha_{k j} x_{i j}, 0\right) . \end{aligned}

对于 ReLU 函数的反向传播,有:

hikαk=1(αkXi>0)Xi=1(hik>0)Xi\frac{\partial h_{i k}}{\partial \alpha_k}=\mathbf{1}\left(\alpha_k^{\top} X_i>0\right) \cdot X_i=\mathbf{1}\left(h_{i k}>0\right) \cdot X_i

4.1.2 Multi-layer network

设多层感知机的每一层为:

hl=fl(sl)sl=Wlhl1+bl\begin{aligned} h_l & =f_l\left(s_l\right) \\ s_l & =W_l h_{l-1}+b_l \end{aligned}

其中层数 l=1,,Ll=1, \ldots, Lh0=Xh_0 = X 是输入向量,hLh_L 用于预测 YY.

Chain rule in matrix form

为了推导反向传播,我们先回顾向量的求导法则. 设 Y=(yi)m×1,X=(xj)n×1Y=\left(y_i\right)_{m \times 1}, X=\left(x_j\right)_{n \times 1}. 如果 Y=h(X)Y=h(X),那么

YX=(yixj)m×n.\frac{\partial Y}{\partial X^{\top}}=\left(\frac{\partial y_i}{\partial x_j}\right)_{m \times n}.

如果 Y=h(X)Y=h(X)X=g(Z)X=g(Z),那么

yizj=kyixkxkzj\frac{\partial y_i}{\partial z_j}=\sum_k \frac{\partial y_i}{\partial x_k} \frac{\partial x_k}{\partial z_j}

YZ=YXXZ\frac{\partial Y}{\partial Z^{\top}}=\frac{\partial Y}{\partial X^{\top}} \frac{\partial X}{\partial Z^{\top}}

  • Y=AXY=A X,那么 yi=kaikxky_i=\sum_k a_{i k} x_k,从而 yi/xj=aij\partial y_i / \partial x_j=a_{i j}. 因此,Y/X=A\partial Y / \partial X^{\top}=A.
  • Y=XSXY=X^{\top} S X,其中 SS 是对称矩阵,那么 Y/X=2SX\partial Y / \partial X=2 S X.
  • S=I,Y=X2S=I, Y=\|X\|^2,那么 Y/X=2X\partial Y / \partial X=2 X.
Multi-layer back-propagation

对于多层感知机,链式法则可以写为:

Lsl=Lhlhlsl=LhlflLhl1=Lhlhlslslhl1=LhlflWl\begin{gathered} \frac{\partial L}{\partial s_l^{\top}}=\frac{\partial L}{\partial h_l^{\top}} \frac{\partial h_l}{\partial s_l^{\top}}=\frac{\partial L}{\partial h_l^{\top}} f_l^{\prime} \\ \frac{\partial L}{\partial h_{l-1}^{\top}}=\frac{\partial L}{\partial h_l^{\top}} \frac{\partial h_l}{\partial s_l^{\top}} \frac{\partial s_l}{\partial h_{l-1}^{\top}}=\frac{\partial L}{\partial h_l^{\top}} f_l^{\prime} W_l \end{gathered}

其中

Wl=slhl1=[sl,1hl1,1sl,1hl1,2sl,1hl1,3sl,1hl1,4sl,1hl1,5sl,2hl1,1sl,2hl1,2sl,2hl1,3sl,2hl1,4sl,2hl1,5sl,3hl1,1sl,3hl1,2sl,3hl1,3sl,3hl1,4sl,3hl1,5sl,4hl1,1sl,4hl1,2sl,4hl1,3sl,4hl1,4sl,4hl1,5sl,5hl1,1sl,5hl1,2sl,5hl1,3sl,5hl1,4sl,5hl1,5]fl=hlsl=[hl1sl100000hl2sl200000hl3sl300000hl4sl400000hl5sl5]\begin{gathered} W_l=\frac{\partial s_l}{\partial h_{l-1}^{\top}}=\begin{bmatrix} \frac{\partial s_{l, 1}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 1}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 2}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 2}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 3}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 3}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 4}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 4}}{\partial h_{l-1,5}} \\ \frac{\partial s_{l, 5}}{\partial h_{l-1,1}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,2}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,3}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,4}} & \frac{\partial s_{l, 5}}{\partial h_{l-1,5}} \end{bmatrix} \\ f_l^{\prime}=\frac{\partial h_l}{\partial s_l^{\top}}=\begin{bmatrix} \frac{\partial h_{l 1}}{\partial s_{l 1}} & 0 & 0 & 0 & 0 \\ 0 & \frac{\partial h_{l 2}}{\partial s_{l 2}} & 0 & 0 & 0 \\ 0 & 0 & \frac{\partial h_{l 3}}{\partial s_{l 3}} & 0 & 0 \\ 0 & 0 & 0 & \frac{\partial h_{l 4}}{\partial s_{l 4}} & 0 \\ 0 & 0 & 0 & 0 & \frac{\partial h_{l 5}}{\partial s_{l 5}} \end{bmatrix} \end{gathered}

如果 ff 是 Sigmoid 函数,对角线元素 hlk/slk\partial h_{l k} / \partial s_{l k} 的值为 hlk(1hlk)h_{l k}\left(1-h_{l k}\right);如果 ff 是 ReLU 函数,hlk/slk\partial h_{l k} / \partial s_{l k} 的值为 1(hlk>0)1\left(h_{l k}>0\right).

L/hl1\partial L/\partial h_{l-1}^{\top} 转置,得:

Lhl1=WlflLhl.\frac{\partial L}{\partial h_{l-1}}=W_l^{\top} f_l^{\prime} \frac{\partial L}{\partial h_l} .

接着,推导 L/Wl\partial L / \partial W_l. 首先,对 WlW_l 的第 kk 行求梯度:

(LWlk)1×K=(Lslk)1×1(slkWlk)1×K=(Lslk)1×1(hl1)1×K.\left(\frac{\partial L}{\partial W_{l k}}\right)_{1 \times K}=\left(\frac{\partial L}{\partial s_{l k}}\right)_{1 \times 1}\left(\frac{\partial s_{l k}}{\partial W_{l k}}\right)_{1 \times K}=\left(\frac{\partial L}{\partial s_{l k}}\right)_{1 \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K} .

再将所有行的结果整合,得到:

(LWl)K×K=(Lsl)K×1(hl1)1×K=(fl)K×K(Lhl)K×1(hl1)1×K\left(\frac{\partial L}{\partial W_l}\right)_{K \times K}=\left(\frac{\partial L}{\partial s_l}\right)_{K \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K}=\left(f_l^{\prime}\right)_{K \times K}\left(\frac{\partial L}{\partial h_l}\right)_{K \times 1}\left(h_{l-1}^{\top}\right)_{1 \times K}

同理,

Lbl=flLhl.\frac{\partial L}{\partial b_l}= f_l^{\prime} \frac{\partial L}{\partial h_l} .

Δhl=Lhl,ΔWl=LWl,Dl=fl,\Delta h_l=\frac{\partial L}{\partial h_l}, \Delta W_l=\frac{\partial L}{\partial W_l}, D_l=f_l^{\prime},

那么之前的结果可以写为:

Δhl1=WlDlΔhl,ΔWl=DlΔhlhl1,\begin{aligned} & \Delta h_{l-1}=W_l^{\top} D_l \Delta h_l, \\ & \Delta W_l=D_l \Delta h_l h_{l-1}^{\top}, \end{aligned}

加入 blb_l 可以写为:

(ΔWl,Δbl)=Δ(Wl,bl)=DlΔhl(hl1,1).\left(\Delta W_l, \Delta b_l\right)=\Delta\left(W_l, b_l\right)=D_l \Delta h_l\left(h_{l-1}^{\top}, 1\right) .

4.1.3 Stochastic gradient descent (SGD)

Mini-batch

使用在小批量上的梯度下降法. 设 L(θ)=1ni=1nLi(θ)\mathscr{L}(\theta)=\frac{1}{n} \sum_{i=1}^n L_i(\theta) 是在一个小批量上的平均损失,那么 SGD 的公式为:

θt+1=θtηtL(θt).\theta_{t+1}=\theta_t-\eta_t \mathscr{L}^{\prime}\left(\theta_t\right).

Momentum, Adagrad, RMSprop, Adam

为了解决优化过程中的振荡问题,我们在 SGD 中引入动量 (Momentum),其更新公式为:

vt=γvt1+ηtgt,θt=θt1vt,\begin{aligned} & v_t=\gamma v_{t-1}+\eta_t g_t, \\ & \theta_t=\theta_{t-1}-v_t, \end{aligned}

其中 gtg_t 是当前小批量上的平均梯度,vtv_t 是动量.

在另一个方向上,为了解决不同维度梯度尺度不一致的问题,有 Adagrad 方法:

Gt=Gt1+gt2θt+1=θtηtgtGt+ε\begin{aligned} & G_t=G_{t-1}+g_t^2 \\ & \theta_{t+1}=\theta_t-\eta_t \frac{g_t}{\sqrt{G_t+\varepsilon}} \end{aligned}

这一方法使得高频维度学习率降低,稀疏维度保持大学习率. 但 Adagrad 存在学习率越来越小的问题,因此 RMSProp 对该方法作了改进:

Gt=βGt1+(1β)gt2G_t=\beta G_{t-1}+(1-\beta) g_t^2

Adam 方法是对 RMSProp 和 Momentum 方法的结合,即:

vt=γvt1+(1γ)gt,Gt=βGt1+(1β)gt2,vtvt/(1γt),GtGt/(1βt),θt+1=θtηtvtGt+ε.\begin{aligned} & v_t=\gamma v_{t-1}+(1-\gamma) g_t, \\ & G_t=\beta G_{t-1}+(1-\beta) g_t^2, \\ & v_t \leftarrow v_t /(1-\gamma^t), G_t \leftarrow G_t /(1-\beta^t), \\ & \theta_{t+1}=\theta_t-\eta_t \frac{v_t}{\sqrt{G_t+\varepsilon}} . \end{aligned}

4.2 Convolutional neural networks (CNN)

4.2.1 Convolution, kernels, filters

基本的卷积神经网络可以表示为:

s=Wh+bsuvw=i=1Rj=1Rk=1CWijkwht(u1)p+i,t(v1)p+j,k+bw\begin{aligned} s & =W \otimes h+b \\ s_{u v w} & =\sum_{i=1}^R \sum_{j=1}^R \sum_{k=1}^C W_{i j k}^w h_{t(u-1)-p+i, t(v-1)-p+j, k}+b^w \end{aligned}

其中卷积核尺寸为 R×R×CR \times R \times CWijkwW_{i j k}^w 表示第 ww 个卷积核的 (i,j,k)(i, j, k) 位置;tt 表示 stride,即每一步移动几格;pp 表示 padding,即外侧补 0 的圈数.

注意:每个卷积核的参数数量是 R×R×C+1R \times R \times C + 1;卷积核的第三个维度应等于输入通道数,卷积核的数量等于输出通道数. 要会计算卷积核的参数量,以及输入和输出的尺寸大小.

输出的尺寸为:

WF+2PS+1\left\lfloor \frac{W - F + 2P}{S} + 1 \right\rfloor

其中 WW 是原始尺寸(长宽),FF 是卷积核尺寸,PP 是 padding,SS 是 stride.

接着通过 ReLU 层:

h=max{s,0} i.e. hijk=max{sijk,0}\begin{array}{ll} & h=\max \{s, \mathbf{0}\} \\ \text { i.e. } & h_{i j k}=\max \left\{s_{i j k}, 0\right\} \end{array}

然后作最大池化(Max-Pooling)或平均池化(Average-Pooling). 该操作也有 kernel size RR、stride 和 padding 参数,表示在每个 R×RR \times R 的区域内选取最大元素或求平均值.

4.2.2 Softmax layer for classification

经典的 CNN 结构是:x \to Conv \to ReLU \to Max-Pooling \to Conv \to ReLU \to Max-Pooling \to FC \to ReLU \to FC \to ReLU \to FC \to Softmax \to CrossEntropy.

4.2.3 Alex net, VGG net, inception net

VGG16 的结构是 x \to (Conv \to ReLU \to Conv \to ReLU \to Max-Pooling) * 2 \to (Conv \to ReLU \to Conv \to ReLU \to Conv \to ReLU \to Max-Pooling) * 3 \to FC \to ReLU \to FC \to ReLU \to FC \to Softmax \to CrossEntropy.

4.2.4 Batch normalization layer

Batch normalization 是在通道维度上做归一化,即对每个通道单独做归一化.

μd=1ni=1nxid; with channels d=1,2,,Dσd2=1ni=1n(xidμd)2x^id=xidμdσdyid=βd+γdx^id\begin{aligned} \mu_d & =\frac{1}{n} \sum_{i=1}^n x_{i d} ; \quad \text { with channels } d=1,2, \ldots, D \\ \sigma_d^2 & =\frac{1}{n} \sum_{i=1}^n\left(x_{i d}-\mu_d\right)^2 \\ \hat{x}_{i d} & =\frac{x_{i d}-\mu_d}{\sigma_d} \\ y_{i d} & =\beta_d+\gamma_d \hat{x}_{i d} \end{aligned}

例如,设输入尺寸为 (N,H,W,C)(N, H, W, C),那么对第 cc 个通道,使用 N×H×WN \times H \times W 个位置统计 μc\mu_cσc\sigma_c.

4.2.5 Residual net

残差网络的基本结构是:

xl+1=xl+F(xl)x_{l+1}=x_l+F\left(x_l\right)

其中 FF 包含的结构是 BN \to ReLU \to Conv \to BN \to ReLU \to Conv.

相比于级联网络,残差网络性能更好、收敛更快. 一种解释是,nn 层残差网络等效于并联 2n2^n 个深度不等的级联模块,从而在训练初期可以先用浅层的模块进行拟合,优化更简单.

4.3 Recurrent neural networks

4.3.1 RNN

基本的 RNN 结构是:

st=f(Uxt+Wst1)ot=g(Vst)\begin{aligned} & s_t=f\left(U x_t+W s_{t-1}\right) \\ & o_t=g\left(V s_t\right) \end{aligned}

其中 xtx_toto_tsts_t 分别是 tt 时刻的输入、输出和状态.

4.3.2 LSTM

LSTM 的基本结构是:

Δct=f(Wc(ht1,xt)), Representation of input information it=f(Wi(ht1,xt)), Input gate ft=f(Wf(ht1,xt)), Forget gate ct=ct1ft+Δctit, Cell state vector ot=f(Wo(ht1,xt)), Output gate ht=otf(ct), Hidden state vector yt=f(Wht).\begin{aligned} & \Delta c_t=f\left(W_c\left(h_{t-1}, x_t\right)\right), \quad \text { Representation of input information } \\ & i_t=f\left(W_i\left(h_{t-1}, x_t\right)\right), \quad \text { Input gate } \\ & f_t=f\left(W_f\left(h_{t-1}, x_t\right)\right), \quad \text { Forget gate } \\ & c_t=c_{t-1} f_t+\Delta c_t i_t, \quad \text { Cell state vector } \\ & o_t=f\left(W_o\left(h_{t-1}, x_t\right)\right), \quad \text { Output gate } \\ & h_t=o_t f\left(c_t\right), \quad \text { Hidden state vector } \\ & y_t=f\left(W h_t\right) . \end{aligned}

其中 ctc_ttt 时刻的 cell states,即长期记忆;hth_ttt 时刻的 hidden states,即短期记忆.

4. 3.3 GRU

GRU 是对 LSTM 的简化,其基本结构是:

zt=f(Wz(ht1,xt)), Update gate rt=f(Wr(ht1,xt)), Reset gate h~t=f(W(xt,rtht1)), New information from the input ht=(1zt)ht1+zth~t, Hidden state yt=f(Whht).\begin{aligned} & z_t=f\left(W_z\left(h_{t-1}, x_t\right)\right), \quad \text { Update gate } \\ & r_t=f\left(W_r\left(h_{t-1}, x_t\right)\right), \quad \text { Reset gate } \\ & \tilde{h}_t=f\left(W\left(x_t, r_t h_{t-1}\right)\right), \quad \text { New information from the input } \\ & h_t=\left(1-z_t\right) h_{t-1}+z_t \tilde{h}_t, \quad \text { Hidden state } \\ & y_t=f\left(W_h h_t\right) . \end{aligned}

4.4 Generator and GAN

4.4.2 Supervised decoder

考虑用 Encode-Decoder 架构进行图像重建:Encoder(参数 ϕ\phi) 将输入图像 II 进行压缩,再经过 Decoder(参数 θ\theta) 对其进行重建,即

f=Encoderϕ(I)I^=Decoderθ(f)\begin{gathered} f = \operatorname{Encoder}_\phi (I) \\ \hat{I} = \operatorname{Decoder}_\theta(f) \end{gathered}

优化目标是:

minϕ,θ II^2\min_{\phi, \theta} ~ \Vert I - \hat{I} \Vert^2

4.4.3 GAN

GAN 中有一个判别器 DD 和一个生成器 GG. 输入 hN(0,σ2I)h \sim N(0, \sigma^2 I)GθG_\theta 输出 x^\hat{x},再用 DϕD_\phi 判断 x^\hat{x}xx^* 是否是真实样本,即 D(X)D(X) 是真实图像的概率.

hp(h)h \sim p(h)G(h)G(h) 的分布是 pθp_\theta,那么

V(D,G)=EPdata [logD(X)]+Epθ[log(1D(X))]=EPdata [logD(X)]+Ehp(h)[log(1D(G(h)))]V(D, G)= \mathbb{E}_{P_{\text {data }}}[\log D(X)]+\mathbb{E}_{p_\theta}[\log (1-D(X))] = \mathbb{E}_{P_{\text {data }}}[\log D(X)]+\mathbb{E}_{h \sim p(h)}[\log (1-D(G(h)))]

从而 GAN 的训练目标是:

minGmaxDV(D,G)\min _G \max _D V(D, G)

对于理想的判别器,有:

VD=0 Pdata (X)D(X)pθ(X)1D(X)=0 pθ(X)D(X)+Pdata (X)Pdata (X)D(X)=0 D(X)=Pdata (X)pθ(X)+Pdata (X)\begin{aligned} & \frac{\partial V}{\partial D}=0 \\ \Rightarrow ~ & \frac{P_{\text {data }}(X)}{D(X)}-\frac{p_\theta(X)}{1-D(X)}=0 \\ \Rightarrow ~ & -p_\theta(X) D(X)+P_{\text {data }}(X)-P_{\text {data }}(X) D(X)=0 \\ \Rightarrow ~ & D(X)=\frac{P_{\text {data }}(X)}{p_\theta(X)+P_{\text {data }}(X)} \end{aligned}

对于理想的生成器,θ\theta 要最小化 Pdata P_{\text {data }}pθp_\theta 之间的 JS 散度(一种具有对称性的 KL 散度变体). 令 pmix=(Pdata+pθ)/2p_{\operatorname{mix}}=\left(P_{\mathrm{data}}+p_\theta\right) / 2,那么有:

JSD(Pdata pθ)=KL(pθpmix )+KL(Pdata pmix )=X[pθ(X)logpθ(X)pmix (X)+Pdata (X)logPdata (X)pmix (X)]=H(pθ)H(Pdata )X[pθ(X)logpmix (X)+Pdata (X)logpmix (X)]=H(pθ)H(Pdata )X[pθ(X)logpθ(X)2(1D(X))+Pdata (X)logPdata (X)2D(X)]=H(pθ)H(Pdata )+H(pθ)+H(Pdata )+X[pθ(X)log[2(1D(X))]+Pdata (X)log(2D(X))]=X[pθ(X)log2+Pdata (X)log2]+X[pθ(X)log(1D(X))+Pdata (X)log(D(X))]=2log2+V(D,G)\begin{aligned} \operatorname{JSD}\left(P_{\text {data }} | p_\theta\right) & =\operatorname{KL}\left(p_\theta | p_{\text {mix }}\right)+\operatorname{KL}\left(P_{\text {data }} | p_{\text {mix }}\right) \\ & =\sum_X\left[p_\theta(X) \log \frac{p_\theta(X)}{p_{\text {mix }}(X)}+P_{\text {data }}(X) \log \frac{P_{\text {data }}(X)}{p_{\text {mix }}(X)}\right] \\ & =-H\left(p_\theta\right)-H\left(P_{\text {data }}\right)-\sum_X\left[p_\theta(X) \log p_{\text {mix }}(X)+P_{\text {data }}(X) \log p_{\text {mix }}(X)\right] \\ & =-H\left(p_\theta\right)-H\left(P_{\text {data }}\right)-\sum_X\left[p_\theta(X) \log \frac{p_\theta(X)}{2(1-D(X))}+P_{\text {data }}(X) \log \frac{P_{\text {data }}(X)}{2 D(X)}\right] \\ & =-H\left(p_\theta\right)-H\left(P_{\text {data }}\right)+H\left(p_\theta\right)+H\left(P_{\text {data }}\right)+\sum_X\left[p_\theta(X) \log [2(1-D(X))]+P_{\text {data }}(X) \log (2 D(X))\right] \\ & =\sum_X\left[p_\theta(X) \log 2+P_{\text {data }}(X) \log 2\right]+\sum_X\left[p_\theta(X) \log (1-D(X))+P_{\text {data }}(X) \log (D(X))\right] \\ & =2 \log 2+V(D, G) \end{aligned}

由此可见,降低 V(D,G)V(D, G) 的本质就是使得 pθp_\thetaPdataP_\text{data} 两个分布尽可能接近.

4.4.4 Geometry of VAE

VAE 分为 Encoder 和 Decoder 两部分. 输入 xx 经过 Encoder qϕq_\phi 压缩为低维向量 hh,再通过 Decoder pθp_\theta 重建为 x^\hat{x}. 优化目标是 xx^2\Vert x - \hat{x} \Vert^2 尽可能小.

VAE 的损失函数是:

L(ϕ,θ,X)=KL(qϕ(hX)pθ(hX))Eqϕ(hX)(logpθ(Xh))L(\phi, \theta, X)=\mathrm{KL}\left(q_\phi(h | X) | p_\theta(h | X)\right)-\mathbb{E}_{q_\phi(h | X)}\left(\log p_\theta(X | h)\right)

其中 pθ(hX)p_\theta(h|X) 可以认为是 N(0,I)N(\mathbf{0}, \mathbf{I}). 第一项是 qϕ(hX)q_\phi(h|X)N(0,I)N(\mathbf{0}, \mathbf{I}) 之间的 KL 散度,限制了 hh 的分布,用于防止过拟合及便于作为生成器;第二项是 pθ(Xh)p_\theta(X|h) 的重建误差,实质上是 XXpθ(Xh)p_\theta(X|h) 之间的平方损失.

该损失函数也可以写作

KL(Pdata(X)pθ(X))+KL(qϕ(hX)pθ(hX))=KL(Pdata(X)qϕ(hX)pθ(h,X))\operatorname{KL}\left(P_{\text {data}}(X) | p_\theta(X)\right)+\operatorname{KL}\left(q_\phi(h | X) | p_\theta(h | X)\right)=\operatorname{KL}\left(P_{\text {data}}(X) q_\phi(h | X) | p_\theta(h, X)\right)

可以看出,VAE 的目标是让 Encoder 和 Decoder 建模的联合概率分布尽可能接近.

下面证明两个损失函数等价:

θKL(Pdata (X)pθ(X))=θEPdata (X)(loghpθ(h,X))θH(Pdata (X))=θEPdata (X)(loghpθ(h,X))=EPdata (X)hp(h,X)hp(h,X)1pθ(h,X)θpθ(h,X)=EPdata (X)hp(hX)θlogpθ(h,X)=EPdata (X)Eqϕ(hX)θ(logpθ(Xh)+logp(h)) where p(hX)qϕ(hX)=EPdata (X)Eqϕ(hX)θlogpθ(Xh)=θEPdata (X)[Eqϕ(hX)logpθ(Xh)]\begin{aligned} \frac{\partial}{\partial \theta} \mathrm{KL}\left(P_{\text {data }}(X) | p_\theta(X)\right) & =\frac{\partial}{\partial \theta} \mathbb{E}_{P_{\text {data }}(X)}\left(-\log \sum_h p_\theta(h, X)\right)-\frac{\partial}{\partial \theta} H\left(P_{\text {data }}(X)\right) \\ & =\frac{\partial}{\partial \theta} \mathbb{E}_{P_{\text {data }}(X)}\left(-\log \sum_h p_\theta(h, X)\right) \\ & =-\mathbb{E}_{P_{\text {data }}(X)} \sum_h \frac{p(h, X)}{\sum_{h^{\prime}} p\left(h^{\prime}, X\right)} \frac{1}{p_\theta(h, X)} \frac{\partial}{\partial \theta} p_\theta(h, X) \\ & =-\mathbb{E}_{P_{\text {data }}(X)} \sum_h p(h | X) \frac{\partial}{\partial \theta} \log p_\theta(h, X) \\ & =-\mathbb{E}_{P_{\text {data }}(X)} \mathbb{E}_{q_\phi(h | X)} \frac{\partial}{\partial \theta}\left(\log p_\theta(X | h)+\log p(h)\right) \quad \text { where } p(h | X) \leftarrow q_\phi(h | X) \\ & =-\mathbb{E}_{P_{\text {data }}(X)} \mathbb{E}_{q_\phi(h | X)} \frac{\partial}{\partial \theta} \log p_\theta(X | h) \\ & =\frac{\partial}{\partial \theta} \mathbb{E}_{P_{\text {data }}(X)}\left[-\mathbb{E}_{q_\phi(h | X)} \log p_\theta(X | h)\right] \end{aligned}

4.4.5 ELBO

VAE 的优化目标是:

KL(Pdata (X)pθ(X))+KL(qϕ(hX)pθ(hX))=EPdata [logPdata (X)]EPdata [logpθ(X)]+KL(qϕ(hX)pθ(hX))\operatorname{KL}\left(P_{\text {data }}(X) | p_\theta(X)\right)+\operatorname{KL}\left(q_\phi(h | X) | p_\theta(h | X)\right) =\mathbb{E}_{P_{\text {data }}}\left[\log P_{\text {data }}(X)\right]-\mathbb{E}_{P_{\text {data }}}\left[\log p_\theta(X)\right]+\operatorname{KL}\left(q_\phi(h | X) | p_\theta(h | X)\right)

因此等效于最大化

EPdata [logpθ(X)]KL(qϕ(hX)pθ(hX))=1ni=1n[logpθ(Xi)KL(qϕ(hiXi)pθ(hiXi))]\mathbb{E}_{P_{\text {data }}}\left[\log p_\theta(X)\right]-\mathrm{KL}\left(q_\phi(h | X) | p_\theta(h | X)\right)=\frac{1}{n} \sum_{i=1}^n\left[\log p_\theta\left(X_i\right)-\mathrm{KL}\left(q_\phi\left(h_i | X_i\right) | p_\theta\left(h_i | X_i\right)\right)\right]

这被称为证据下界(ELBO).

Lecture 6: Clustering, Expectation-Maximization, Gaussian Mixture Model, Feature Visualization, and Machine Learning Theories

6.1 Principle component analysis (PCA)

PCA 的作用是:找到数据分布的主方向,从而实现降维.

将每一个样本 x(i)\mathbf{x}_{(i)} 投影到 w(k)\mathbf{w}_{(k)} 方向上,其投影长度为:

tk(i)=x(i),w(k) for i=1,,nk=1,,lt_{k(i)}=\left\langle\mathbf{x}_{(i)}, \mathbf{w}_{(k)}\right\rangle \quad \text { for } \quad i=1, \ldots, n \quad k=1, \ldots, l

6.1.1 First component

我们希望找到一个方向,来最大化 t1(i)t_{1(i)} 的方差,即:

w(1)=argmaxw=1Vari(t1)=argmaxw=1{i(t1(i)Ej[t1(j)])2}=argmaxw=1{i(t1(i)0)2}=argmaxw=1{ix(i),w2}\mathbf{w}_{(1)}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}} \operatorname{Var}_i\left(t_1\right)=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\sum_i\left(t_{1(i)}-\mathbb{E}_j\left[t_{1(j)}\right]\right)^2\right\}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\sum_i\left(t_{1(i)}-0\right)^2\right\}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\sum_i\left\langle\mathbf{x}_{(i)}, \mathbf{w}\right\rangle^2\right\}

进而写成矩阵形式:

w(1)=argmaxw=1{Xw2}=argmaxw=1{wXXw}\mathbf{w}_{(1)}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\|\mathbf{X} \mathbf{w}\|^2\right\}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w}\right\}

这等价为:

w(1)=argmaxw{wXXwww}\mathbf{w}_{(1)}=\operatorname{argmax}_{\mathbf{w}}\left\{\frac{\mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w}}{\mathbf{w}^{\top} \mathbf{w}}\right\}

使用 Lagrange 乘子法,得到:

0=w[wXXwλww]=2XXw2λwXXw=λw\begin{aligned} & 0=\dfrac{\partial}{\partial \mathbf{w}}\left[\mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w}-\lambda \mathbf{w}^{\top} \mathbf{w}\right]=2 \mathbf{X}^{\top} \mathbf{X} \mathbf{w}-2 \lambda \mathbf{w} \\ \Longrightarrow \quad & \mathbf{X}^{\top} \mathbf{X} \mathbf{w}=\lambda \mathbf{w} \end{aligned}

此时,目标函数值为 λ\lambda. 因此,w\mathbf{w}XX\mathbf{X}^{\top} \mathbf{X} 的最大特征值对应的特征向量.

6.1.2 Further components

求解第 kk 个主方向,可以先从 X\mathbf{X} 中减去前 k1k-1 个主方向:

X^k=Xs=1k1Xw(s)w(s)\hat{\mathbf{X}}_k=\mathbf{X}-\sum_{s=1}^{k-1} \mathbf{X} \mathbf{w}_{(s)} \mathbf{w}_{(s)}^{\top}

再求解第一主方向,即:

w(k)=argmaxw=1{X^kw2}=argmaxw{wX^kX^kwww}\mathbf{w}_{(k)}=\underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\left\{\|\hat{\mathbf{X}}_k \mathbf{w}\|^2\right\}=\operatorname{argmax}_{\mathbf{w}}\left\{\frac{\mathbf{w}^{\top} \hat{\mathbf{X}}_k^{\top} \hat{\mathbf{X}}_k \mathbf{w}}{\mathbf{w}^{\top} \mathbf{w}}\right\}

可以证明,X\mathbf{X} 的第 kk 个主方向就是 XX\mathbf{X}^{\top} \mathbf{X} 的第 kk 大特征值对应的特征向量.

W\mathbf{W} 的每一列是 X\mathbf{X} 的主方向,那么通过

T=XW\mathbf{T}=\mathbf{X W}

就可以实现数据降维.

6.2 Stochastic Neighbor Embedding (t-SNE) (in order to visualize the data distribution)

xix_i 通常是一个高维特征,我们想要将 xix_i 投影到低维空间(通常是二维)hih_i 中,来进行特征的可视化. 为了使得更相似的样本离得更近,我们首先需要定义样本 xjx_jxix_i 之间的相似度:

Pij=P(xjxi)=exjxi2/2σi2m=1nexmxi2/2σi2P_{i j} =P\left(x_j | x_i\right) =\frac{e^{-\left|x_j-x_i\right|^2 / 2 \sigma_i^2}}{\sum_{m=1}^n e^{-\left|x_m-x_i\right|^2 / 2 \sigma_i^2}}

再定义投影后 hjh_jhih_i 之间的相似度:

qij=P(hjhi)=(1+hjhi2)1m=1n(1+hmhi2)1q_{i j} =P\left(h_j | h_i\right) =\frac{\left(1+\left|h_j-h_i\right|^2\right)^{-1}}{\sum_{m=1}^n\left(1+\left|h_m-h_i\right|^2\right)^{-1}}

从而 t-SNE 的优化目标是:

min{hi} KL(PQ)=i,jPijlog(Pijqij)\min_{\left\{h_i\right\}} ~ \mathrm{KL}(P \| Q)=\sum_{i, j} P_{i j} \log \left(\frac{P_{i j}}{q_{i j}}\right)

这意味着高维空间中的邻近关系在降维后得到了保留.

6.2.1 Local Linear Embedding

为了在降维时保留局部线性关系,我们将 xix_i 表示为:

xijNiwijxjx_i \approx \sum_{j \in N_i} w_{i j} x_j

其中 NiN_i 表示 ii 的邻居. 通过 least-squares 可以求解 ww,即

min{wij}xijNiwijxj2\min _{\left\{w_{i j}\right\}}|x_i-\sum_{j \in N_i} w_{i j} x_j|^2

进而求解出 hih_i

min{hi}hijNiwijhj2\min _{\left\{h_i\right\}}|h_i-\sum_{j \in N_i} w_{i j} h_j|^2

6.3 The Expectation Maximization (EM) Algorithm

EM 算法的目标,是建模所有样本 X\mathbf{X} 的潜在分布. 以聚类问题为例,设想我们要对样本 X=(x1,x2,,xn)\mathbf{X} = (x_1, x_2, \dots, x_n) 进行聚类,那么要识别出 kk 个 cluster,其参数 θ=(μ1,σ12,μ2,σ22,,μk,σk2)\theta = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \dots, \mu_k, \sigma_k^2),并给每个 xix_i 一个类别标签 ziz_i,即 Z=(z1,z2,,zn)\mathbf{Z} = (z_1, z_2, \dots, z_n),其中 zi=jz_i = j 表示判断为第 jj 类. 故优化目标为:

maxθL(θ,X),L(θ,X)= def p(Xθ)=Zp(X,Zθ)dZ\max _\theta L(\theta, \mathbf{X}), \quad L(\theta, \mathbf{X}) \stackrel{\text { def }}{=} p(\mathbf{X} | \theta)=\int_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} | \theta) d \mathbf{Z}

  • Expectation step (E step):

Q(θθ(t))=EZp(ZX,θ(t))[logp(X,Zθ)]Q(\theta | \theta^{(t)})=\mathbb{E}_{\mathbf{Z} \sim p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})}[\log p(\mathbf{X}, \mathbf{Z} | \theta)]

  • Maximization step (M step):

θ(t+1)=arg maxθQ(θθ(t))\theta^{(t+1)}=\argmax_\theta Q(\theta | \theta^{(t)})

为了更好地计算 Q(θθ(t))Q(\theta | \theta^{(t)}),我们在实际使用中对其化简:

Q(θθ(t))=EZp(ZX,θ(t))[logp(X,Zθ)]=EZp(ZX,θ(t))[i=1nlogp(xi,ziθ)]=i=1nEZp(ZX,θ(t))[logp(xi,ziθ)]=i=1nEzip(zixi,θ(t))[logp(xi,ziθ)]=i=1nEzip(zixi,θ(t))[logp(zi)+logp(xizi,θ)]\begin{aligned} Q(\theta | \theta^{(t)}) & = \mathbb{E}_{\mathbf{Z} \sim p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})}[\log p(\mathbf{X}, \mathbf{Z} | \theta)] \\ & = \mathbb{E}_{\mathbf{Z} \sim p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})}\left[\sum_{i=1}^n \log p(x_i, z_i | \theta)\right] \\ & = \sum_{i=1}^n \mathbb{E}_{\mathbf{Z} \sim p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})}\left[ \log p(x_i, z_i | \theta)\right] \\ & = \sum_{i=1}^n \mathbb{E}_{z_i \sim p(z_i | x_i, \theta^{(t)})}\left[ \log p(x_i, z_i | \theta)\right] \\ & = \sum_{i=1}^n \mathbb{E}_{z_i \sim p(z_i | x_i, \theta^{(t)})}\left[ \log p(z_i) + \log p(x_i | z_i, \theta)\right] \end{aligned}

接下来,我们推导 Q(θθ(t))Q(\theta | \theta^{(t)}) 为何等效于我们的优化目标.

θlogp(Xθ)=1p(Xθ)θp(Xθ)=1p(Xθ)θZp(X,Zθ)=1p(Xθ)Zθp(X,Zθ)=1p(Xθ)Zp(X,Zθ)θlogp(X,Zθ)=Zp(X,Zθ)p(Xθ)θlogp(X,Zθ)=Zp(ZX,θ)θlogp(X,Zθ)Zp(ZX,θ(t))θlogp(X,Zθ), here we use θ(t) instead of θ.=θZp(ZX,θ(t))logp(X,Zθ)=θEZp(ZX,θ(t))[logp(X,Zθ)]\begin{aligned} \frac{\partial}{\partial \theta} \log p(\mathbf{X} | \theta) & =\frac{1}{p(\mathbf{X} | \theta)} \frac{\partial}{\partial \theta} p(\mathbf{X} | \theta) \\ & =\frac{1}{p(\mathbf{X} | \theta)} \frac{\partial}{\partial \theta} \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} | \theta) \\ & =\frac{1}{p(\mathbf{X} | \theta)} \sum_{\mathbf{Z}} \frac{\partial}{\partial \theta} p(\mathbf{X}, \mathbf{Z} | \theta) \\ & =\frac{1}{p(\mathbf{X} | \theta)} \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} | \theta) \frac{\partial}{\partial \theta} \log p(\mathbf{X}, \mathbf{Z} | \theta) \\ & =\sum_{\mathbf{Z}} \frac{p(\mathbf{X}, \mathbf{Z} | \theta)}{p(\mathbf{X} | \theta)} \frac{\partial}{\partial \theta} \log p(\mathbf{X}, \mathbf{Z} | \theta) \\ & =\sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta) \frac{\partial}{\partial \theta} \log p(\mathbf{X}, \mathbf{Z} | \theta) \\ & \approx \sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta^{(t)}) \frac{\partial}{\partial \theta} \log p(\mathbf{X}, \mathbf{Z} | \theta), \quad \text { here we use } \theta^{(t)} \text { instead of } \theta . \\ & =\frac{\partial}{\partial \theta} \sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta^{(t)}) \log p(\mathbf{X}, \mathbf{Z} | \theta) \\ & =\frac{\partial}{\partial \theta} \mathbb{E}_{\mathbf{Z} \sim p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})}[\log p(\mathbf{X}, \mathbf{Z} | \theta)] \end{aligned}

因此,最大化 Q(θθ(t))Q(\theta | \theta^{(t)}) 近似等效为最大化 logp(Xθ)\log p(\mathbf{X} | \theta).

6.3.1 Proof of the correctness

接下来,证明增大 Q(θθ(t))Q(\theta | \theta^{(t)}) 就一定能增大 logp(Xθ)\log p(\mathbf{X} | \theta).

logp(Xθ)=logp(X,Zθ)logp(ZX,θ)\log p(\mathbf{X} | \theta)=\log p(\mathbf{X}, \mathbf{Z} | \theta)-\log p(\mathbf{Z} | \mathbf{X}, \theta)

logp(Xθ)=[Zp(ZX,θ(t))]logp(X,Zθ)[Zp(ZX,θ(t))]logp(ZX,θ), because Zp(ZX,θ(t))=1=Q(θθ(t))+H(θθ(t))\begin{aligned} \log p(\mathbf{X} | \theta) & =\left[\sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})\right] \log p(\mathbf{X}, \mathbf{Z} | \theta)-\left[\sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})\right] \log p(\mathbf{Z} | \mathbf{X}, \theta), \quad \text { because } \sum_{\mathbf{Z}} p(\mathbf{Z} | \mathbf{X}, \theta^{(t)})=1 \\ & =Q(\theta | \theta^{(t)})+H(\theta | \theta^{(t)}) \end{aligned}

其中 H(θθ(t))=CE(p(ZX,θ(t))p(ZX,θ))H(\theta | \theta^{(t)}) = \mathrm{CE}\left(p(\mathbf{Z} | \mathbf{X}, \theta^{(t)}) | p(\mathbf{Z} | \mathbf{X}, \theta)\right).
θ=θ(t)\theta = \theta^{(t)},得:

logp(Xθ(t))=Q(θ(t)θ(t))+H(θ(t)θ(t))\log p(\mathbf{X} | \theta^{(t)})=Q(\theta^{(t)} | \theta^{(t)})+H(\theta^{(t)} | \theta^{(t)})

两式相减,得:

logp(Xθ)logp(Xθ(t))=Q(θθ(t))Q(θ(t)θ(t))+H(θθ(t))H(θ(t)θ(t))\log p(\mathbf{X} | \theta)-\log p(\mathbf{X} | \theta^{(t)})=Q(\theta | \theta^{(t)})-Q(\theta^{(t)} | \theta^{(t)})+H(\theta | \theta^{(t)})-H(\theta^{(t)} | \theta^{(t)})

由熵的性质知 H(θθ(t))H(θ(t)θ(t))H\left(\theta | \theta^{(t)}\right) \geq H\left(\theta^{(t)} | \theta^{(t)}\right),从而

logp(Xθ)logp(Xθ(t))Q(θθ(t))Q(θ(t)θ(t))\log p(\mathbf{X} | \theta)-\log p(\mathbf{X} | \theta^{(t)}) \geq Q(\theta | \theta^{(t)})-Q(\theta^{(t)} | \theta^{(t)})

因此,优化过程中 logp(Xθ)\log p(\mathbf{X} | \theta) 的增加量不小于 Q(θθ(t))Q(\theta | \theta^{(t)}) 的增加量.

6.3.2 Mixture of Gaussians, EM, K-means

K-means 聚类的算法如下:

  • 初始化 kk 个类别的中心点 μ1,μ2,,μk\mu_1, \mu_2, \dots, \mu_k
  • 对每个样本 xix_i,找到距离最近的中心点 zi=arg minkxiμkz_i = \argmin_k |x_i - \mu_k|
  • 更新中心点坐标:μk=Ei:zi=k[xi]\mu_k = \mathbb{E}_{i:z_i=k}[x_i]
  • 重复以上两步操作

K-means 算法是 EM 算法的一种变体:其分布采样是确定性的、不优化方差、目标函数中丢弃了 logp(zi)\log p(z_i).

参考资料

本文参考上海交通大学《机器学习》课程 CS3612 张拳石老师的 PDF 讲义整理。


机器学习:笔记整理
https://cny123222.github.io/2026/03/09/机器学习:笔记整理/
Author
Nuoyan Chen
Posted on
March 9, 2026
Licensed under