Mingbo

样本方差为什么除以N-1?（翻译）

发表于 2017-06-17

译者：程明波

译者注：由于历史原因，高斯分布(Gaussian Distribution)，正态分布(Normal Distribution)皆指概率密度函数形如$\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$的分布。文中我会采用正态分布的提法。

简介

本文，呼应标题，我将推导著名正态分布数据均值和方差的计算公式。如果一些读者对于这个问题的“为什么”并不感兴趣，仅仅是对“什么时候使用”感兴趣，那答案就非常简单了：

如果你想预估一份数据的均值和方差(典型情况)，那么方差公式除的是$N-1$，即：

\[\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2\]

另一种情况，如果整体的真实均值已知，那么方差公式除的就是$N$，即：

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2\]

然而，前一种情况，会是你遇到更典型的情形。一会儿，我会举一个预估高斯白噪音的离散程度例子。例子中高斯白噪音的均值是已知的0，这种情况下，我们只需要估计方差。

如果数据是正态分布，我们可以完全用均值$\mu$和方差$\sigma^2$刻画这个分布。其中，方差是标准差$\sigma$的平方，标准差代表了每个数据点偏离均值点的平均距离，也就是说，方差表示了数据离散程度。对于正态分布，68.3%的数据的值会介于$\mu-\sigma$和$\mu+\sigma$之间。下面图片展示是一个正态分布的概率密度函数，他的均值是$\mu=10$,方差是$\sigma^2=3^2=9$：

图1. 正态分布概率密度函数. 对于正态分布数据，68%的样本落在均值$\pm$方差。

通常，我们拿不到全部的全体数据。上面的例子中，典型的情况是我们有一些观察数据，但是，我们没有上图中x轴上所有可能的观察数据。例如我们可能有下面一些观察数据：

表1

观察数据ID	观察值
观察数据 1	10
观察数据 2	12
观察数据 3	7
观察数据 4	5
观察数据 5	11

现在如果我们通过把所有值相加并除以观察的次数，得到经验均值：

\[\mu=\frac{10+12+7+5+11}{5}=9\tag{1}\].

通常，我们会假设经验均值接近分布的未知的真实均值，因此，我们可以假设观测数据来自于均值为$\mu=9$的正态分布。在这个例子中，分布真实均值是10，也就是说，经验均值实际上接近于真实均值。

数据的方差计算如下：

\[\begin{aligned}\sigma^2&= \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2\\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4})\\&= 8.5.\end{aligned}\tag{2}\]

同样，我们一般假设经验方差接近于基于分布真实未知方差。在此例中，真实方差是9，所以，经验方差也是接近于真实方差。

那么我们手上的问题现在就是为什么我们用于计算经验均值和经验方差的公式是正确的。事实上，另一个我们经常用于计算方差的公式是这样定义的：

\[\begin{aligned}\sigma^2 &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4}) \\&= 6.8.\end{aligned}\tag{3}\]

公式(2)和公式(3)的唯一不同是前一个公式除的是$N-1$，而后一个除的是$N$。两个公式都是对的，只是根据不同的场景使用不同的公式。

接下来的部分，我们针对给定一个正态分布的样本集，完成对其未知方差和均值最好估计的完整推导。我们将会看到，一些情况下，方差除的是$N$，另一些情况除的是$N-1$。

用一个公式近似一个参数(均值或方差)叫做估计量。下面，我们定义一个分布的真实但未知的参数为$\hat{\mu}$和$\hat{\sigma}^2$。而估计量，例如，经验的平均和经验方差，定义为$\mu$和$\sigma^2$。

为了找到最优的估计量，首先，一个整体均值为$\mu$标准差为$\sigma$的正态分布，对于特定的观察点$x_i$，我们需要一个分析相似的表达式。对于一个已知参数的正态分布一般定义为$N(\mu,\sigma^2)$。似然函数为：

\[x_i \sim N(\mu,\sigma^2) \Rightarrow P(x_i; \mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}.\tag{4}\]

为了计算均值和方差，显然，我们需要这个分布一个以上的样本。接下来，设$\vec{x}=(x_1,x_2,\cdots,x_N)$为包含所有的可用样本的向量（例如：表一中所有的值）。如果所有这些样本统计独立，我们可以写出联合似然函数为所有似然函数的乘积：

\[\begin{aligned}P(\vec{x};\mu,\sigma^2)&=P(x_1,x_2,\cdots,x_n;\mu,\sigma^2)\\&=P(x_1;\mu,\sigma^2)P(x_2;\mu,\sigma^2)\cdots P(x_N;\mu,\sigma^2)\\&=\prod_{i=1}^{N}P(x_i;\mu,\sigma^2)\end{aligned}.\tag{5}\]

把公式(4)代入公式(5)，可得出联合概率密度函数的分析表达式：

\[\begin{aligned}P({\vec{x};\mu,\sigma})&=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\\&=\frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2}\end{aligned}.\tag{6}\]

公式(6)在接下来的部分将非常重要。我们会用它推导关于正态分布著名的估计量均值和方差。

最小方差，无偏估计量

决定一个估计量是不是“好”估计量，首先我们需要定义什么是真正的“好” 估计量。说一个估计量好，依赖于两个度量，叫做其偏差(bias)和方差(variance)(是的，我们要讨论均值估计量的方差，以及方差估计量的方差)。本节将简单的讨论这两个度量。

参数偏差

想象一下，如果我们能拿到全体不同的(互斥)数据子集。类比之前的的例子，假设，除了【表1】中的数据，我们还有完全不同观察结果表2及表3。那么，一个关于均值好的估计量，应该使得这个估计量平均下来等于真实的均值。我们可以接受其中一个自己的经验均值不等于真实均值，但是，一个好的估计量应该保证：对于所有子集均值估计的平均值等于真实均值。这个限制条件用数学化的表示，就是估计量的期望值(Expected Value)应该等于参数值：

\[E(\mu)=\hat{\mu}\qquad E(\sigma^2)=\hat{\sigma}^2.\tag{7}\]

如果满足上面的条件，那么这些估计量就被称之为“无偏估计”。反之，如果上面的条件不满足，这些估计量叫做“有偏的”，也就是说平均来看，他们或者低估或者高估了参数的真实值。

参数方差

无偏估计量保证平均来看，它们估计的值等于真是参数。但是，这并不意味着每次估计是一个好的估计。比如，如果真实均值为10，一个无偏估计量可以估计全体的其中一个子集的均值为50，而另一个均值为-30。期望的估计的值确实是10，也等于真是的参数值，但是，估计量的质量明显依赖每次估计的离散程度。对于全体5个不同子集，一个估计量产生的估计值(10,15,5,12,8)是无偏的和另一个估计量产生的估计值（50，-30，100，-90，20）（译者注：原文作者最后一个是10，我计算换成20，这样均值才是10）。但是第一个估计量的所有估计值明显比第二个估计量的估计值更接近真实值。

因此，一个好的估计量不仅需要有低偏差，同时也需要低方差。这个方差表示为平均平方误差的估计量：

\[Var(\mu)=E[(\hat{\mu}-\mu)^2]\]

\[Var(\sigma^2)=E[(\hat{\sigma}-\sigma)^2]\]

因此一个好的估计量是低偏差，低方差的。如果存在最优的估计量，那么这个估计应该是无偏的，而且方差比所有的其他可能估计量都要低。这样的一个估计量被称之为最小方差，无偏（MVU）估计量。下一节，我们将会针对一个正态分布推导均值和方差估计量的数学表达式。我们将会看到，一个正态分布的方差MVU估计量在一些假设下需要除以$N$，而在另一些假设下需要除以$N-1$。

最大似然估计

基于整体的一个子集，尽管有大量的获取一个参数估计量的技术，所有这些技术中最简单的可能就数最大似然估计了。

观察值$\vec{x}$的概率在公式(6)定义为$P(\vec{x};\mu,\sigma^2)$. 如果我们在此函数中固定$x$和$\sigma^2$，当使$\vec{x}$变化时，我们就可以获得图(1)的正态分布。但是，我们也可以固定$\vec{x}$，使$\mu$和（或）$\sigma^2$变化。比如，我们可以选择类似前面例子中的$\vec{x}=(10,12,7,5,11)$。我们选择固定$\mu=10$，同时使$\sigma^2$变化。图(2)展示了当$x$和$\mu$固定时，$\sigma^2$对于这个分布取不同值的变化曲线：

图 2. 此图表示了似然函数在特定观察数据$\vec{x}$，下固定$\mu=10$，$\sigma^2$变化曲线。

上图，我们通过固定$\mu=10$，令$\sigma^2$变化计算了$P(\vec{x};\sigma^2)$的似然函数。在结果曲线的每一个数据点代表了似然度，观察值$\vec{x}$是一个正态分布在参数$\sigma^2$下的样本。那么对应最大似然度的参数值最有可能是从我们定义的分布中产生数据的参数。因此，我们能通过找到似然度曲线的最大值决定最优的$\sigma^2$。在此例中，最大值在$\sigma^2=7.8$，这样标准差就是$\sqrt{(\sigma^2)=2.8}$。事实上，如果给定$\mu=10$，通过传统的方法计算，我们会发明方差就是7.8：

\[\frac{(10-10)^2+(12-10)^2+(7-10)^2+(5-10)^2+(11-10)^2}{5}=7.8\]

因此，基于样本数据的方差计算公式只需要简单的通过找到最大的似然函数的最高点。此外，除了固定$\mu$，我们可以使$\mu$和$\sigma^2$同时变化。然后找到两个估计量对应在两个维度的似然函数的最大值。

要找一个函数的最大值，也很简单，只需要求导使其等于0。如果想找一个有两个变量函数的最大值，我们需要计算每个变量的偏导，再把两个偏导全部设置为0。接下来，设$\hat{\mu}_{ML}$为通过极大似然方法得到的总体均值的最优估计量，设$\hat{\sigma}^2_ML$为方差的最优估计量。要最大化似然函数，我们可以简单的计算它的(偏)导数，然后赋值为0，如下：

\[\begin{aligned} &\hat{\mu}_{ML} = \arg\max_\mu P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0 \end{aligned}\]

及

\[\begin{aligned} &\hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0 \end{aligned}\]

下一节，我们将利用这个技术得到$\mu$和$\sigma^2$的MVU估计量。我们考虑两种情形：

第一种情形，我们假设分布的真正的均值$\hat{\mu}$是已知的，因此，我们只需要估计方差，那么问题就变成在参数为$\sigma^2$的一维的极大似然函数中对应找其最大值。这种情况不经常出现，但是，在实际应用中确实存在。例如，如果我们知道一个信号(比如：一幅图中一个像素的颜色值)本来应该有特定的值，但是，信号被白噪音污染了（均值为0的高斯噪音），这时分布的均值是已知的，我们只需要估计方差。

第二种情形就是处理均值和方差的真实值都不知道的情况。这种情况最常见，这时，我们需要基于样本数据估计均值和方差。

后面我们将看到，每种情形产生不同的MVU估计量。具体来说，第一种情形方差估计量需要除以$N$来标准化MVU。而第二种除的是$N-1$。

均值已知的方差估计

参数估计

如果分布的均值真实值已知，那么似然函数只有一个参数$\sigma^2$。求最大似然估计量也就是解决：

\[\hat{\sigma^2}_{ML}=\arg\max_{\sigma^2} P(\vec{x};\sigma^2).\tag{8}\]

但是，根据公式(6)的定义，如果计算$P(\vec{x};\sigma^2)$涉及到计算函数中指数的偏导。事实上，计算对数似然函数比计算似然函数本身的导数要简单的多。因为对数函数是单调递增函数，其最大值取值位置与原似然函数是一样的。因此我们用下面的式子替换：

\[\hat{\sigma}^2_{ML}=\arg\max_{\sigma^2}\log(P(\vec{x};\sigma^2)).\tag{9}\]

下面，我令$s=\sigma^2$简化式子。我们通过计算公式(6)的对数的导数赋值为0来最大化对数似然函数：

\[\begin{aligned}&\frac{\partial \log(P(\vec{x};\sigma^2))}{\partial \sigma^2}=0\\&\Leftrightarrow\frac{\partial\log(P(\vec{x};s))}{\partial s}=0\\&\Leftrightarrow\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi s)^{\frac{N}{2}}}e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2} \right)=0\\&\Leftrightarrow\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi)^{\frac{N}{2}}}\right)+\frac{\partial}{\partial s}\log\left(\frac{1}{\sqrt{s}^\frac{N}{2}}\right)+\frac{\partial}{\partial s} \log\left(e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2}\right )=0\\&\Leftrightarrow0+\frac{\partial}{\partial s}\log\left((s)^{-\frac{N}{2}}\right)+\frac{\partial}{\partial s}\left(-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2\right)=0\\&\Leftrightarrow -\frac{N}{2}\log (s)+\frac{1}{2 s^2}\sum_{i=1}^{N}(x_i-\mu)^2=0\\&\Leftrightarrow -\frac{N}{2s}+\frac{1}{2s^2}\sum_{i=1}^{N}(x_i-\mu)^2=0\\&\Leftrightarrow \frac{N}{2s^2}\left(-s+\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\right)=0\\&\Leftrightarrow\frac{N}{2s^2}\left(\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2-s\right)=0\end{aligned}\]

很明显，如果$N>0$，那么上面等式唯一的解就是：

\[s=\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2.\tag{10}\]

注意到，实际上$\hat{\sigma}^2$的极大似然估计估计量就是传统上一般计算方差的公式。这里标准化因子是$\frac{1}{N}$.

但是，极大似然估计并不保证得出的是一个无偏估计量。另外，就算得到的估计量是无偏的，极大似然估计也不能保证估计是最小方差，即MVU。因此，我们需要检查公式(10)的的估计量是否是无偏的。

表现评价

我们需要检查公式(7)的等式是否成立，来确定是否公式(10)中的估计量是无偏的。即判断：

\[E(s)=\hat{s}.\]

我们把公式(10)代入到$E(s)$，计算：

\[\begin{aligned}E[s] &= E \left[\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[x_i^2 - 2x_i \mu + \mu^2 \right]\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu E[x_i] + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu^2 + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right)\end{aligned}\]

另外，真实方差$\hat{s}$有一个非常重要的性质为$\hat{s}=E[x_i^2]-E[x_i]^2$，可变换公式为$E[x_i^2]=\hat{s}+E[x_i]^2=\hat{s}+\mu^2$。使用此性质我们可能从上面的公式推出：

\[\begin{aligned}E[s]&=\frac{1}{N}(N E[x_i^2]-N\mu^2)\\&=\frac{1}{N}(N\hat{s}+N\mu^2-N\mu^2)\\&=\frac{1}{N}(N\hat{s})\\&=\hat{s}\end{aligned}\]

满足了公式(7)的条件$E[s]=\hat s$，因此，我们得到的数据方差$\hat s$的统计量是无偏的。此外，因为极大似然估计的如果是一个无偏的估计量，那么也是最小方差(MVU)，也就是说，我们得到的估计量比任何一个其他的估计量都大。

因此，在分布真实均值已知的情况下，我们不用除以$N-1$，而是用除$N$计算正态分布的方差。

均值未知的方差估计

参数估计

上一节，分布的真实均值已知，因此，我们只需要估计数据的方差。但是，如果真实的均值未知，我们均值的估计量就也需要计算了。

此外，方差的估计量需要使用均值的估计量。我们会看到，这时，之前我们得到的方差的估计量就不再无偏了。我们一会儿会通过除以N-1，而不是N来稍微的增加方差估计量的值，从而使方差估计无偏。

与之前一样，基于log似然函数，我们用极大似然估计计算两个估计量。首先我们先计算$\hat\mu$的极大似然估计量：

\[\begin{aligned}&\frac{\partial \log(P(\vec{x}; s, \mu))}{\partial \mu} = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial \mu} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\&\Leftrightarrow -\frac{1}{2s}\frac{\partial}{\partial \mu} \left(\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\&\Leftrightarrow -\frac{1}{2s} \left(\sum_{i=1}^N -2(x_i - \mu) \right) = 0\\&\Leftrightarrow \frac{1}{s} \left(\sum_{i=1}^N (x_i - \mu) \right) = 0 \\&\Leftrightarrow \frac{N}{s} \left( \frac{1}{N} \sum_{i=1}^N (x_i) - \mu \right) = 0 \end{aligned}\]

显然，如果$N>0$，那么上面的等式只有一种解：

\[\mu=\frac{1}{N}\sum_{i=1}^{N}x_i.\tag{11}\]

注意到，实际的这是计算一个分布均值的著名公式。虽然我们知道这个公式，但我们现在证明了极大似然估计量估计了一个正态分布未知均值的真实值。现在我们先假定我们之前公式(10)计算的方差$\hat s$的估计量仍然是MVU方差估计量。但下一节我们会证明这个估计量已经是有偏的了。

表现评价

我们需要通过检查估计量$\mu$对真实$\hat \mu$的估计是否无偏来确定公式(7)的条件能否成立：

\[E[\mu]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_i\right]=\frac{1}{N}\sum_{i=1}^N E[x_i]=\frac{1}{N}N E[x_i]=\frac{1}{N} N \hat\mu=\hat\mu.\]

既然$E[\mu]=\hat\mu$，那么也就是说我们对分布均值的估计量是无偏的。因为极大似然估计可以保证在估计是无偏的情况下得到的是最小方差估计量，所以我们就已经是证明了$\mu$是均值的MVU估计量。

现在我们检查基于经验均值$\mu$，而不是真实均值$\hat\mu$的方差估计量$s$对真实方差$\hat s$的估计身上仍然是无偏的。我们只需要把得到的估计量$\mu$带入到之前在公式(10)推导出的公式：

\[\begin{aligned} s &= \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\&=\frac{1}{N}\sum_{i=1}^N \left(x_i - \frac{1}{N} \sum_{i=1}^N (x_i) \right)^2\\&=\frac{1}{N}\sum_{i=1}^N \left[x_i^2 - 2 x_i \frac{1}{N} \sum_{i=1}^N (x_i) + \left(\frac{1}{N} \sum_{i=1}^N (x_i) \right)^2 \right]\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{i=1}^N x_i}{N^2} + \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{i=1}^N x_i}{N^2} + \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\end{aligned}\]

现在我们需要再次检查公式(7)的条件是否成立，来决定估计量是否无偏：

\[\begin{aligned} E[s]&= E \left[ \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \right ]\\&= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \end{aligned}\]

记得我们在之前用过方差一个非常重要的性质，真实方差$\hat s$可以写成$\hat s = E[x_i^2]-E[x_i]^2$，即，$E[x_i^2]=\hat s + E[x_i]^2=\hat s +\mu^2$。利用这个性质我们可以推出：

\[\begin{aligned} E[s] &= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= s + \mu^2 - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= s + \mu^2 - \frac{E[\sum_{i=1}^N x_i^2 + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2}\\&= s + \mu^2 - \frac{E[N(s+\mu^2) + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + \sum_i^N \sum_{j\neq i}^N E[x_i] E[x_j]}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + N(N-1)\mu^2}{N^2}\\&= s + \mu^2 - \frac{N(s+\mu^2) + N^2\mu^2 -N\mu^2}{N^2}\\&= s + \mu^2 - \frac{s+\mu^2 + N\mu^2 -\mu^2}{N}\\&= s + \mu^2 - \frac{s}{N} - \frac{\mu^2}{N} - \mu^2 + \frac{\mu^2}{N}\\&= s - \frac{s}{N}\\&= s \left( 1 - \frac{1}{N} \right)\\&= s \left(\frac{N-1}{N} \right) \end{aligned}\]

显然$E[s]\neq\hat s$，上面公式可知分布的方差估计量不再是无偏的了。事实上，平均来看，这个估计量低估了真实方差，比例为$\frac{N-1}{N}$。当样本的数量趋于无穷时($N\rightarrow\infty$)，这个偏差趋近于0。但是对于小的样本集，这个偏差就意义了，需要被消除。

修正偏差

因为偏差不过是一个因子，我们只需通过对公式(10)的估计量乘以偏差的倒数。这样我们就可以定义一个如下的无偏的估计量$s\prime$：

\[\begin{aligned} s\prime &= \left ( \frac{N-1}{N} \right )^{-1} s\\s\prime &= \left ( \frac{N-1}{N} \right )^{-1} \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s\prime &=\left ( \frac{N}{N-1} \right ) \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s\prime &= \frac{1}{N-1}\sum_{i=1}^N(x_i - \mu)^2\end{aligned}\]

这个估计量现在就是无偏的了，事实上，这个公式与传统计算方差的公式非常像，不同的是除的是$N-1$而不是$N$。然而，你可能注意到这个估计量不再是最小方差估计量，但是这个估计量是所有无偏估计量中最小方差的一个。如果我们除以$N$，那么估计量就是有偏的了，如果我们除以$N-1$，估计量就不是最小方差估计量。但大体来说，一个有偏的估计量要比一个稍高一点方差的估计量要糟糕的多。因此，如果当总体的均值是未知的情况下，方差除的是$N-1$，而不是$N$。

总结

本文，我们推导了如果从分布数据中计算常见的方差和均值公式。此外，我们还证明了在方差估计中，标准化因子在总体均值已知时是$\frac{1}{N}$，在均值也需要估计时是$\frac{1}{N-1}$。

本文PDF

Decision Tree (ID3)

发表于 2017-05-21 分类于 Alogorithm

Preface

In April 9th, 2017, incident occurred in United Airlines where crew of UA beat up a passenger and dragged him out of the plane before which was about to take off attracted attention all around the world. Many would gave out doubt: why a company being so rude to passengers can exist in this world? Actually, UA is going well is just because they have an extremely precise emergency situation procedure which is calculate by compute depending on big-data analysis. Computer can help us make decisions though, it has no emotions, which is effective in most cases, but can not be approved by our human beings. Let's take a look at how algorithm make a decision: It is a decision tree, which simply represents the procedure of how UA algorithm make the decision. First of all, before taking off, four employees of UA need fly from Chicago to Kentucky. Then the algorithm check if there is any seats left, if so, passengers were safe for the moment. But UA3411 was full, the algorithm began assessing the importance of employees or passengers. Obviously, the algorithm think crew is more important due to business consideration. Then how to choose who should be evicted from the plane. The algorithm was more complicated than the tree I drew, however, Asian or not was one of the criterion. But why? Because Asian are pushovers. The passenger agreed at first, however, when he heard that he had to wait for one day, he realized that he could not treat his patient, then he refused. Then he was beat up and dragged off the plane.

As you have seen, it is a decision tree, which is similar to human decision-making process. Decision tree is a simple but powerful algorithm in machine learning. In fact, you are often using decision tree theory when making decision, for example

Introduction

Decision tree is a classification and regression algorithm, we build a tree through statistics. Today we only talk about how to classify dataset using Decision Tree. First we will introduce some information theory background knowledge, then we use iris data build a decision tree using IDC3 algorithm.

Iris data

Iris dataset is a very famous dataset deposited on UCI machine learning repository, which described three kinds of iris. there are four columns corresponding for features as followed： * sepal length in cm * sepal width in cm * petal length in cm * petal width in cm

The last column represents iris categories:

Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)

Here, our task is to use the dataset to train a model and generate a decision tree. During the process we need calculate some statistics values to decide how to generate a better one.

The dataset is very small so that you can easily download it and take a look.

Entropy and Information Gain

Entropy

Before Decision Tree, I'd like to talk about some concept in Information Theory. Entropy is a concept from thermodynamics at first, C.E.Shannon introduced which into information theory which represent redundancy in 1948. It sounds a very strange concept. In fact, it is very easy to understand. For example, during the knockout stages in world Cup Games, there are 16 teams. Now I let you guess which team will win the champion which assume I know the answer, how many times do you need to get the outcome? First of all, you cut 16 teams to 8-8 parts, you asked me if the team in first 8 teams or the other. I told you that the team was in the other 8 teams. Then you cut the the 8 teams again, you ask me if the team is in the first 4 teams or the other, I told you that the champion would be in the first 4 teams, and so forth and so on. And how many times is the entropy of who wining the champion.

\[ Entropy(champion) = {\rm log}_2^{16}=4 \]

That is, we can use 4 bits to represents which team will win the game. Clever you may ask why we divide team to two parts other than three or four parts. That is because we use binary represents the world in computer world. $ 2^4=16 $ means we can use 4 bit represents 16 conditions. We can use entropy represent all information in this world. And if you have known that which team will win the campion, the entropy is 0, because, you do not need any more information to deduce the outcome.

Entropy represents uncertainty indeed. Ancient China, we have to record history on bamboo slips, which demanded us decrease words. That means entropy of every single ancient Chinese character is higher than words we are saying today. That is, if we lost just some of these words, we would lose lots of stories. There are many songs starts with:"Yoo, yoo, check now", which barely offer us information, which means we can drop those words and interpret the these songs precisely as well. The entropy of these sentence is low.

Assume $X$ is discrete random variable, the distribution is: \[P(X=x_i)=p_i\] then the entropy of X is: \[H(X)=-\sum_{i=1}^{n}p_i {\rm log}_2 p_i\] where if p_0=0, we define 0log0 = 0.

It seems that the equation has nothing to do with the entropy we have calculated in the champion example. Now let's calculate the example. First of all $X$ represents the probability of each team which would win the game. we assume all teams were at the same level, so we have \[p(X=x_1)=p(X=x_2)=p(X=x_3)=\cdots = p(X=x_{16})=\frac{1}{16}\] the entropy is \[H(X)=-\sum_{i=1}^{16}\frac{1}{16}{\rm log}_2 \frac{1}{16}=-16\times\frac{1}{16}\times {\rm log}_2 {2^{-4}}=4\]

Bingo, the the answer is same. In fact, if we know some more information, the entropy is lower than 4. for example, the probability of Germany is higher than some Asian teams. #### Entropy and Iris Data Now we calculate entropy of Iris Data which will be used to fit a decision tree in following sections. We concern about the categories(setosa, versicolor and virginica). Remember the equation of how to calculate entropy: \[H(X)=-\sum_{i=1}^{n}p_i {\rm log}_2 p_i\]

Three kinds of flowers are all 50s, so the probability of each category is the same: \[p_1=p_2=p_3=\frac{50}{50+50+50}=\frac{1}{3}\] Then, the entropy is pretty easy to calculate \[H(X)=-1\times (\frac{1}{3}{\rm log}_2\frac{1}{3}+\frac{1}{3}{\rm log}_2\frac{1}{3}+\frac{1}{3}{\rm log}_2\frac{1}{3})=1.5850\] #### Conditional Entropy The meaning of Conditional Entropy is as its name. With respect with random variable$(X, Y)$, the joint distribution is \[P(X=x_i, Y=y_j)=p_{ij}, i=1,2,3\cdots m; j=1,2,3,\cdots n\] Conditional Entropy H(Y|X) represents that given we have known random variable $X$ , the disorder or uncertainty of $Y$. The definition is as followed: \[H(Y|X)=\sum_{i=1}^m p_i H(Y|X=x_i)\] Here, $p_i=P(X=x_i)$.

Conditional Entropy and Iris Data

We calculate some Conditional Entropy as examples. First of all, I random choose 15 columns of sepal length with respect to their categories. the result is as followed：

No.	sepal length in cm	categories
1	5.90	Iris-versicolor
2	7.20	Iris-virginica
3	5.00	Iris-versicolor
4	5.00	Iris-setosa
5	5.90	Iris-versicolor
6	5.70	Iris-setosa
7	5.20	Iris-versicolor
8	5.50	Iris-versicolor
9	4.80	Iris-setosa
10	4.60	Iris-setosa
11	6.50	Iris-versicolor
12	5.20	Iris-setosa
13	7.70	Iris-virginica
14	6.40	Iris-virginica
15	6.00	Iris-versicolor

The octave code is

%% octave
[a,b,c,d, cate] = textread("iris.data", "%f%f%f%f%s","delimiter", ",");

for i=1:15
  x = floor(rand()*150);
  fprintf('%f %s\n', a(x), cate{x} );
end

We just take this 15 items for examples, I assume that we divide sepal length into two parts: greater than mean and less than mean. The mean is \[mean = (5.90+7.2+\cdots+6.00)/15 = 5.7733\] There are 8 elements less then 5.7733 and 7 bigger ones. That is

mean	idx of greater than mean	idx of less than mean
5.7733	1,2,5,11,13,14,15	3,4,6,7,8,9,10,12

We let $x_1=greater$(1,2,5,11,13,14,15), $x_2=less$(3,4,6,7,8,9,10,12) then \[H(Y|X=x_1)=-(p_1 {\rm log}_2 p_1 + p_2 {\rm log}_2 p_2 + p_3 {\rm log}_2 p_3)=\frac{4}{7}{\rm log}_2\frac{4}{7}+\frac{3}{7}{\rm log}_2\frac{3}{7}+0{\rm log}_2 0=0.98523\] \[H(Y|X=x_2)=-(p_1 {\rm log}_2 p_1 + p_2 {\rm log}_2 p_2+p_3 {\rm log}_2 p_3)=\frac{3}{8}{\rm log}_2\frac{3}{8}+0{\rm log}_2 0+\frac{5}{8}{\rm log}_2\frac{5}{8}=0.95443\]

The Conditional Entropy then is \[H(Y|X)=\sum_{i=1}^{2}p_i H(Y|x_i)=\frac{7}{15}\times 0.98523+\frac{8}{15}\times 0.95443=0.96880\] #### Information Gain Just as its name implies, Information Gain means the information we have gained after adding some features. That is, we can vanish some uncertainty when we add some information. For example, I want you to guess an NBA player, the uncertainty is very high, however, there are only several persons in the list if I tell you that he is a Chinese. You gained information after knowing the Chinese feature to decrease the uncertainty. The calculation of Information Gain is \[IG(Y, X)= H(Y)-H(Y|X)\] Here, we want to decide $Y$ with feature $X$. It is easy, just Entropy of $Y$ minus Conditional Entropy $Y$ given $X$. The meaning is obvious too: $H(Y)$ represents uncertainty, $H(Y|X)$ represents uncertainty of $Y$ given $X$, the difference is the Information Gain. #### Information Gain and Iris Data In this section, I will apply Information Gain equations to the whole Iris data. First of all, let $Y$ represent categories of iris, and $X_1,X_2,X_3, X_4$ represent sepal length, sepal width, petal length petal width respectively.

We have computed that $H(Y)=1.0986$, next, we will calculate 4 Conditional Entropy $H(Y|X_1),H(Y|X_2),H(Y|X_3),H(Y|X_4)$. In light of continuousness of $X$, we divide them by mean of each feature. Then \[\overline{X_1}=5.8433,\,\overline{X_2}=3.0540,\,\overline{X_3}=3.7587,\,\overline{X_4}=1.1987\]

\[H(Y|X_1)=-\sum_{i=1}^3 p_i H(Y|X_{1i})=-(\frac{70}{150}(\frac{0}{70}{\rm log}_2\frac{0}{70}+\frac{26}{70}{\rm log}_2\frac{26}{70} +\frac{44}{70}{\rm log}_2\frac{44}{70})+\frac{80}{150}(\frac{50}{80}{\rm log}_2\frac{50}{80}+\frac{24}{80}{\rm log}_2\frac{24}{80}+\frac{6}{80}{\rm log}_2\frac{6}{80}))=1.09757\]

\[H(Y|X_2)=-\sum_{i=1}^3 p_i H(Y|X_{2i})=-(\frac{67}{150}(\frac{42}{67}{\rm log}_2\frac{42}{67}+\frac{8}{67}{\rm log}_2\frac{8}{67}+\frac{17}{67}{\rm log}_2\frac{17}{67}+\frac{83}{150}(\frac{8}{83}{\rm log}_2\frac{8}{83}+\frac{42}{83}{\rm log}_2\frac{42}{83}+\frac{33}{83}{\rm log}_2\frac{33}{83}))=1.32433\]

\[H(Y|X_3)=-\sum_{i=1}^3 p_i H(Y|X_{3i})=-(\frac{93}{150}(\frac{0}{93}{\rm log}_2\frac{0}{93}+\frac{43}{93}{\rm log}_2\frac{43}{93}+\frac{50}{93}{\rm log}_2\frac{50}{93}+\frac{57}{150}(\frac{50}{57}{\rm log}_2\frac{50}{57}+\frac{7}{57}{\rm log}_2\frac{7}{57}+\frac{0}{57}{\rm log}_2\frac{0}{57}))=0.821667\]

\[H(Y|X_4)=-\sum_{i=1}^3 p_i H(Y|X_{4i})=-(\frac{90}{150}(\frac{0}{90}{\rm log}_2\frac{0}{90}+\frac{40}{90}{\rm log}_2\frac{40}{90}+\frac{50}{90}{\rm log}_2\frac{50}{90}+\frac{60}{150}(\frac{50}{60}{\rm log}_2\frac{50}{60}+\frac{10}{60}{\rm log}_2\frac{10}{60}+\frac{0}{60}{\rm log}_2\frac{0}{60}))=0.854655 \] Information Gains is easy to get \[IG(Y, X_1)=H(Y)-H(Y|X_1)=1.5850-1.09757=0.487427\]

\[IG(Y, X_2)=H(Y)-H(Y|X_2)=1.5850-1.32433=0.260669\]

\[IG(Y, X_3)=H(Y)-H(Y|X_3)=1.5850-0.821667=0.763333\]

\[IG(Y, X_4)=H(Y)-H(Y|X_4)=1.5850-0.854655=0.730345\] By now, we find that $IG(Y, X_3)$ is bigger than others, which means feature $X_3$ supplies more information.

ID3(Iterative Dichotomiser 3)

ID3 algorithm was developed by Ross Quinlan in 1986, which is a very classic algorithm as well as C4.5 and CART. We First apply Information Gain of each feature with respect to iris data. Then to choose the maximum to divide data into 2 parts. For each part we apply Information Gain recursively until we put all parents data to one node. Now that we have know Information Gain from the last section, obviously we choose X3 as the feature dividing data into 2 parts in the first place.

Let's take a look at the first cut using feature $X_3$. We have 150 items at first, after comparing if $X_3>3.7587$, we divide data into two parts, one has 93 items, the other got 57. From the data, we know that there is no setosa in node B, meanwhile, no virginica in node C, which means that this feature is very good for split data due to exclude setosa and virginica.

	Node B	Node C
setosa	0	50
versicolor	43	7
virginica	50	0

The end condition of the algorithm is decided by IG. When IG is less then some threshold or if there is only one category left, we can end the algorithm. If IG less than some value(e.g. 0.01) and more than one category left simultaneously, we have to choose a final category to be the leaf, the rule is to set the category having samples more than the others.

Take Node H for example, we set IG threshold to 0.01 in the first place. Then we calculate the Information Gain for each feature, the biggest IG from feature 2(sepal width in cm), which is 0.003204 and less than 0.01. So we have to set H as a leaf. There are 0 Iris-setosa, 25 Iris-versicolor and 44 Iris-virginica in the leaf, so we set the bigger one(i.e. Iris-virginica) to the leaf.

Summary

Today we have talked about what is decision tree algorithm. Firstly, I introduce three background concept Entropy, Conditional Entropy and Information Gain. Next we apply ID3 algorithm to Iris data to build a decision.

One of the most significant advantages of decision tree is that we can explain the result. If the algorithm decided UA should beat the their passengers, they could trace the tree to find the path of reason chain. It is very useful to tell consumers why we recommend them something, under such circumstance, we can use decision tree to train a model.

There is a shortcoming that Information Gain tends to use feature with more values. In order to resolve the problem, Ross Quinlan improved the algorithm through Information Gain Rate Rather than IG. Breiman introduced CART algorithm subsequently, which can be applied to classification as well as regression. Recently, Scientists have developed more powerful algorithm such as Random Forest and Gradient Boosting Decision Tree etc.

Reference

《统计学习方法》，李航
《数学之美》，吴军
http://www.shogun-toolbox.org/static/notebook/current/DecisionTrees.html
https://en.wikipedia.org/

Appendix code

%% octave main function file
%% iris data dowload link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
[a,b,c,d, cate] = textread("iris.data", "%f%f%f%f%s","delimiter", ",");


%for i=1:15
%	x = floor(rand()*150);
%	fprintf('%f %s\n', a(x), cate{x} );
%end;

features = [a, b, c, d];
for i=1:length(features(1, :))
	col = features(:, i);
	me = mean(col);
	disp(me);
	feat(i).greater = find(col > me);
	feat(i).less = find(col <= me);
end

total = (1:150)';
decision(feat, length(features(1, :)), cate, total);
fprintf('\n')

%% octave: decsion tree file
function decision(feat, feat_size, cate, total)
	if length(total) == 0
		return
	end
	fprintf('(-%d-)', length(total));
	%plogp = @(x)[x*log2(x)];
	function e = plogp(pi)
		if pi == 0
			e = 0;
		else
			e = pi*log2(pi);
		end
	end

	function d = div(a, b)
		if b == 0
			d = 0;
		else
			d = a/b;
		end
	end

	debug = 0;

	function m = maxc(cate, cates, total)
		maxidx = 1;
		max_c = 0;
		for i=1:length(cates)
			c =find(strcmp(cate, cates{i}));
			cl = length(intersect(c, total));
			if debug == 1 fprintf('\n%d##%d  %s###',i, cl, char(cates{i})) end
			%if (debug == 1 && cl <10 && cl >0) disp(intersect(c, total)') end
			if cl > max_c
				max_c = cl;
				maxidx = i;
			end
		end
		if debug == 1 fprintf('\n****%d    %d******\n', maxidx, max_c) end
		%m = cates(maxidx);
		m = maxidx;
	end


	% compute h(y)
	cates = unique(cate);
	hx = 0;
	for i = 1:length(cates)
		c = find(strcmp(cate, cates{i}));
		rc = intersect(c, total);
		hx -= plogp(length(rc)/length(total));
	end
	%fprintf('hx = %f\n', hx)			
	% compute h(y|x)
	max_feature = 1;
	max_ig = 0;

	max_left = intersect(feat(1).greater, total);
	max_right = intersect(feat(1).less, total);
	for i=1:feat_size
		hxh = 0;
		hxl = 0;
		feat_greater = intersect(feat(i).greater, total);
		feat_less = intersect(feat(i).less, total);
		ge = length(feat_greater);
		le = length(feat_less);

		if (ge+le) == 0
			continue
		end

		for j = 1:length(cates);
			c = find(strcmp(cate, cates{j}));
			xh = length(intersect(feat_greater, c));
			xl = length(intersect(feat_less, c));
			hxh -= plogp(div(xh, ge));
			hxl -= plogp(div(xl, le));
		end
		% compute hx - h(y|x)
		hxy = (ge/(ge+le))*hxh + ((le)/(ge+le))*hxl;
		ig = hx - hxy;

		if ig > max_ig
			max_ig = ig;
			max_feature = i;
			max_left= feat_less;
			max_right = feat_greater;
		end

	end

	left = max_left;
	right = max_right;
	%fprintf('feature:ig  %d %f %d %d ------ \n', max_feature, max_ig, length(left), length(right));


	if debug == 1 printf("\033[0;32;1m-ig--%f \033[0m",  max_ig); end
	if(max_ig < 0.01)
		%fprintf('<%s>', char(maxc(cate, cates, total)))
		printf("\033[0;31;1m<%d>\033[0m",  maxc(cate, cates, total));
		return
	end
	fprintf("\033[0;34;1m#%d \033[0m",  max_feature);
	fprintf('{' )
	decision(feat, feat_size, cate, left);
	decision(feat, feat_size, cate, right);
	fprintf('}')
end

A Tutorial on Singular Value Decomposition

发表于 2017-05-01 分类于 Alogorithm

Preface

Under some circumstance, we want to compress data to save storage space. For example, when iPhone7 was released, many were trapped in a dilemma: Should I buy a 32G iPhone without enough free space or that of 128G with a lot of storage being wasted? I had been trapped in such dilemma indeed. I still remember that I only had 8G storage totally when I was using my first Android phone. What annoyed me most was my thousands of photos. Well, I confess that I was being always a mad picture taker. I knew that there were some technique which could compress a picture through reducing pixel. However, it is not enough, because, as you know, in some arbitrary position in a picture, we can tell that the picture share the same color. An extreme Example: if we have a pure color picture, what we just need know is the RGB value and the size, then reproducing the picture is done without extra effort. What I was dreaming is done perfectly by Singular Value Decomposition(SVD).

Introduction

Before SVD, in this article, I will introduce some mathmatical concepts in the first place which cover Linear transformation and EigenVector&EigenValue. This Background knowledge is meant to make SVD straightforward. You can skip if you are familar with this knowledge.

Linear transformation

Given a matrice $A$ and vector $\vec{x}$, we want to compute the mulplication of $A$ and $\vec{x}$

\[\vec{x}=\begin{pmatrix}1\\3\end{pmatrix}\qquad A=\begin{pmatrix}2 & 1 \\ -1 & 1 \end{pmatrix}\qquad\vec{y}=A\vec{x}\]

But when we do this multiplication, what happens? Acutually, when we multiply $A$ and $\vec{x}$, we are changing the coordinate axes of the vector $x$ to another new axes. Begin with a simpler example, we let

\[A=\begin{pmatrix}1 & 0\\ 0 &1\end{pmatrix}\]

then we have \[A\vec{x}=\begin{pmatrix}1 & 0\\ 0 &1\end{pmatrix}\begin{pmatrix}1\\3\end{pmatrix}=\begin{pmatrix}1\\3\end{pmatrix}\]

You may have noticed that we can always get the same $\vec{x}$ after left multiply by A. In this case, we use coordinate axes $i=\begin{pmatrix}1 \\ 0\end{pmatrix}$ and $j=\begin{pmatrix}0 \\ 1\end{pmatrix}$ as the figure below demonstrated. That is, if we want to represent $\begin{pmatrix}1\\3\end{pmatrix}$ under the coordination, we can calculate the transformation as followed:

\[\begin{align} A\vec{x}=1\cdot i + 3\cdot j = 1\cdot \begin{pmatrix}1 \\ 0\end{pmatrix} + 3\cdot \begin{pmatrix}0 \\ 1\end{pmatrix}=\begin{pmatrix}1\\3\end{pmatrix}\end{align}\]

As we know, we can put a vector to anywhere in space, and if we want to calculate sum of two vectors, the simplest way is to connect the to vector from one's head to the other's tail. Our example, we compute $A\vec{x}$ means add two vector(green imaginary lines) up. And the answer is still $\begin{pmatrix}1\\3\end{pmatrix}$.

Now we change $i=\begin{pmatrix}2\\ -1\end{pmatrix}$ and $j=\begin{pmatrix}1\\1\end{pmatrix}$ as the coordinate axes(the red vectors), which means $A=\begin{pmatrix}2 & 1 \\ -1 & 1\end{pmatrix}$. I put vectors(black ones) to this figure as well. We can see what happens when we change a new coordinate axes.

First of all, we multiply $j$ by $3$ and $i$ by 1. Then we move vector j and let the head of $i$ connect the tail of $3\cdot j$. We can now find what is the coordination of $1\cdot i+3\cdot j$(the blue one). We now verify the result using mutiplication of $A$ and $\vec{x}$:

\[A\vec{x}=\begin{pmatrix}2 & 1 \\ -1 & 1\end{pmatrix}\begin{pmatrix}1\\3\end{pmatrix}=1\cdot \begin{pmatrix}2 \\ -1\end{pmatrix} + 3\cdot \begin{pmatrix}1 \\ 1\end{pmatrix}=\begin{pmatrix}5\\2\end{pmatrix}\]

Here, you can imagine that matrice $A$ is just like a function $f(x)\rightarrow y$, when you subsitute $x$, we get the exact $y$ using the principle $f(x)\rightarrow y$. In fact, the multiplication is tranform the vector from one coordination to another.

Exercise

$A=\begin{pmatrix}1 & 2 \\ 3 & 4\end{pmatrix}$, draw the picture to stretch and rotate $x=\begin{pmatrix}1\\3\end{pmatrix}$.
Find a $A$ matrix to rotate $\vec{x}=\begin{pmatrix}1\\3\end{pmatrix}$ to $90^{\circ}$ and $180^{\circ}$.
what if $A=\begin{pmatrix}1 & 2 \\ 2 & 4\end{pmatrix}$.

EigenVector and EigenValue

EigenVector and EigenValue is an extremely important concept in linear algebra, and is commonly used everywhere including SVD we are talking today. However, many do not know how to interpret it. In fact, EigenVector and EigenValue is very easy as long as we know about what is linear transformation.

A Problem

Before start, let's take a look at a question: if we want to multiply matrices for 1000 times, how to calculate effectively? \[AAA\cdots A= \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\cdots \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}=\prod_{i=1}^{1000}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\]

Intuition

Last section, we have talked that if we multiply a vector by a matrix $A$, means that we use $A$ to stretch and rotate the vector in order to represent the vector in a new coordinate axes. However, there are some vectors for $A$, they can only be stretched but can not be rotated. Assume $A=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}$, let $\vec{x}=\begin{pmatrix}1 \\ -1\end{pmatrix}$. When we multiply $A$ and $\vec{x}$

\[A\vec{x}=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}\begin{pmatrix}1 \\ -1\end{pmatrix}=\begin{pmatrix}2 \\ -2\end{pmatrix}=2\cdot \begin{pmatrix}1 \\ -1\end{pmatrix}\]

It turns out we can choose any vector along $\vec{x}$, the outcome is the same, for example:

\[A\vec{x}=\begin{pmatrix}3 & 1 \\ 0 & 2\end{pmatrix}\begin{pmatrix}-3 \\ 3\end{pmatrix}=\begin{pmatrix}-6 \\ -6\end{pmatrix}=2\cdot \begin{pmatrix}-3 \\ 3\end{pmatrix}\]

We name vectors like $\begin{pmatrix}-3 \\ 3\end{pmatrix}$ and $\begin{pmatrix}1 \\ -1\end{pmatrix}$ EigenVectors and 2 the conresponse EigenValues. In practice, we usually choose unit eigenvectors(length equals to 1) given that there are innumerable EigenVectors along the line.

I won't cover how to compute these vectors and vaules and just list the answer as followed

\[\begin{align}&\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix} \begin{pmatrix}{-1}/{\sqrt(2)} \\ {1}/{\sqrt(2)}\end{pmatrix}= 2\begin{pmatrix}{-1}/{\sqrt(2)} \\ {1}/{\sqrt(2)}\end{pmatrix} \qquad\qquad\vec{x_1}=\begin{pmatrix}{-1}/{\sqrt(2)} \\ {1}/{\sqrt(2)}\end{pmatrix} &\lambda_1=2\\ &\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix} \begin{pmatrix}1 \\ 0\end{pmatrix}\qquad=\qquad 3\begin{pmatrix}1 \\ 0\end{pmatrix}\qquad\qquad\quad\,\,\,\vec{x_2}=\begin{pmatrix}1 \\ 0\end{pmatrix} &\lambda_2=3 \end{align}\] Notice that $|\vec{x_1}|=1$ and $|\vec{x_2}|=1$ #### EigenValue Decomposition If we put two EigenVectors and corresponding EigenValues together, we can get the following equation: \[AQ=\begin{pmatrix}3 & 1 \\0& 2\end{pmatrix} \begin{pmatrix} {-1}/{\sqrt(2)}&1\\ {1}/{\sqrt(2)}&0 \end{pmatrix}= \begin{pmatrix} {-1}/{\sqrt(2)}&1 \\ {1}/{\sqrt(2)}&0 \end{pmatrix} \begin{pmatrix} 2 & 0\\ 0 & 3 \end{pmatrix}=Q\Lambda \] Then we have $AQ=Q\Lambda$, the conclusion is still right if we introduce more dimensions, that is \[\begin{align} A\vec{x_1}=\lambda\vec{x_1}\\ A\vec{x_2}=\lambda\vec{x_2}\\ \vdots\qquad\\ A\vec{x_k}=\lambda\vec{x_k} \end{align}\]

\[Q= \begin{pmatrix} x_{11}& x_{21} &\cdots x_{k1}&\\ x_{12}& x_{22} &\cdots x_{k2}&\\ &\vdots&&\\ x_{1m}& x_{22} &\cdots x_{km}& \end{pmatrix} \qquad\Lambda= \begin{pmatrix} \lambda_1 & 0 & \cdots&0\\ 0 &\lambda_2&\cdots&0\\ \vdots&\vdots&\ddots\\ 0&\cdots&\cdots&\lambda_k \end{pmatrix}\]

If we do something on the equation $AQ=Q\Lambda$, then we have: \[AQQ^{-1}=A=Q\Lambda Q^{-1}\] It is EigenVaule Decomposition. #### Resolution Now, Let's look at the question in the beginning of this section \[\begin{align} AAA\cdots A&= \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\cdots \begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}=\prod_{i=1}^{1000}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}\\ AAA\cdots A &= Q\Lambda Q^{-1}Q\Lambda Q^{-1}Q\Lambda Q^{-1}\cdots Q\Lambda Q^{-1}=Q\prod_{i=1}^{1000}\begin{pmatrix}3& 1 \\0 & 2\end{pmatrix}Q^{-1}\\ AAA\cdots A &=Q\Lambda\Lambda\cdots \Lambda Q^{-1}= Q\begin{pmatrix}2^{1000} & 0 \\0 & 3^{1000}\end{pmatrix}Q^{-1} \end{align}\] The calculation is extremely simple using EVD.

Exercise

Research how to compute EigenVectors and EigenValues, then compute$\begin{pmatrix}1 & 2 & 3\\4 & 5 &6\\7 & 8 & 9\end{pmatrix}$.
Think about the decisive factor affects how many EigenValues we can get.

Singular Value Decompositon

Notice that EigenVector Decomposition is applied to decompose square matrices. Is there any approach to decompose non-square matrices? The answer is a YES, and the name is Singular Value Decompositon.

Intuition

First of all, let's take a look at what SVD looks like From the picture, we can find that matrice $A$ is decomposed to 3 components: $U$, $\Sigma$ and $V^{T}$. $U$ and$V^T$ are both sqaure matrices and $\Sigma$ has the same size as $A$. Still, I want to emphasize that $U$ and$V^T$ are both unitary matrix, which means the Determinant of $U$ and $V^T$ is 1 and $U^T=U^{-1}\quad V^T=V^{-1}$.

Deduction

In the Linear Transformation section, we can transform a vector to another coordinate axes. Assume you have a non-square matrice, and you want to transform A from vectors $V=(\vec{v_1}, \vec{v_2},\cdots,\vec{v_n})^T$ to antoher coordinate axes which is $U=(\vec{u_1}, \vec{u_2},\cdots,\vec{u_n})^T$, the thing is, $\vec{v_i}$ and $\vec{u_i}$ have unit length, and all directions are perpendicular, that is, each of $\vec{v_i}$ are at right angles to other $\vec{v_j}$, we name such matrices as orthogonal matrices. In addition, I need add a factor $\Sigma=(\sigma_1,\sigma_2, \sigma_3,\cdots,\sigma_n)$ which represent the times of each direction of $\vec{u_i}$, i.e., We need transform A from $V=(\vec{v_1}, \vec{v_2},\cdots,\vec{v_n})^T$ to $(\sigma_1 \vec{u_1},\sigma_2 \vec{u_2}, \sigma_3 \vec{u_3},...\sigma_n \vec{u_n})^T$. From the picture below we can find that we want to transform from the circle coordinate axes to the ellipse axes. \[\begin{align} \vec{v_1} \vec{v_2} \vec{v_3},...\vec{v_n} \qquad\rightarrow \qquad &\vec{u_1},\vec{u_2},\vec{u_3},...\vec{u_n}\\ &\sigma_1,\sigma_2, \sigma_3,...\sigma_n \end{align}\]

Recall that we can transform $A$ at every direction, then generate another direction as new coordinate direction. So we have \[ A \vec{v_1}=\sigma_1 \vec{u_1}\\ A \vec{v_2}=\sigma_2 \vec{u_2}\\ \vdots\\ A \vec{v_j}=\sigma_j \vec{u_j}\]

\[\begin{align} &\begin{pmatrix}\\A\\\end{pmatrix}\begin{pmatrix}\\ \vec{v_1},\vec{v_2},\cdots,\vec{v_n}\\\end{pmatrix}=\begin{pmatrix}\\ \vec{u_1}, \vec{u_2},\cdots,\vec{u_n}\\ \end{pmatrix}\begin{pmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{pmatrix}\\ &C^{m\times n}\qquad\quad C^{n\times n}\qquad\qquad\qquad C^{m\times n}\qquad \qquad \qquad C^{n\times n} \end{align}\] Which is \[A_{m\times n}V_{n\times n} = \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}\]

\[\begin{align} A_{m\times n}V_{n\times n} &= \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}\\ (A_{m\times n}V_{n\times n}V_{n\times n}^{-1} &= \hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{-1}\\ A_{m\times n}&=\hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{-1}\\&=\hat{U}_{m\times n}\hat{\Sigma}_{n\times n}V_{n\times n}^{T} \end{align}\]

We need do something to the equation in order to continue the deduction. First we stretch matrice $\hat{\Sigma}$ vertically to $m \times n$ size. Then stretch $\hat{U}$ horizonly to $m\times m$, we can set any value to the right entries.

Due to the fact we need calculate $U^{-1}$ and $V^{-1}$, the equation is adjusted to \[A_{m\times n} = U_{m\times m}\Sigma_{m\times n}V^T_{n\times n}\] For furture convenience, we need sort all $\sigma s$, which means: \[\sigma_1\geq\sigma_2\geq\sigma_3 \geq\cdots\geq \sigma_m\]. #### How to calculate $U$, $V^T$ and $\Sigma$ To Decompose matrice $A$, we need calculate $U$, $V^T$ and $\Sigma$. Remember that $U^T = U^{-1}$ and $V^T = V^{-1}$, we will use the property next.

\[\begin{align} A &= U\Sigma V^T\\ \end{align}\]

\[\begin{align} AA^T&=U\Sigma V^T(U\Sigma V^T)^T\\ &=U\Sigma V^TV\Sigma^T U^T\\ &=U\Sigma V^{-1}V\Sigma^T U^T\\ &=U\Sigma I\Sigma^T U^T\\ &=U\Sigma^2 U^T \end{align}\]

\[\begin{align} (AA^T)U&=(U\Sigma^2 U^T)U\\ &=(U\Sigma^2 )U^{-1}U\\ &=U\Sigma^2 \end{align}\]

\[\begin{align} A^TA &=(U\Sigma V^T)^TU\Sigma V^T\\ &=V\Sigma^T U^TU\Sigma V^T\\ &=V\Sigma^T U^TU\Sigma V^T\\ &=V\Sigma^T U^{-1}U\Sigma V^T\\ &=V\Sigma^T I\Sigma V^T\\ &=V\Sigma^2 V^T\\ \end{align}\]

\[\begin{align}(A^TA)V&=(V\Sigma^2 V^T)V\\ &=(V\Sigma^2)V^{-1}V\\ &=V\Sigma^2 \end{align}\]

Image Compression

Firstly, let's look at the process of compressing a picture, the left picture is original grayscale image. On the right, under different compress rate, we can see pictures after reproducing. Before compress, the size of the picture is 1775K byte. Then the picture is almost the same, when we compress which into 100K byte size, which means we can save 90% storage space

To compress a piture, you just decompose the matrice through SVD, then instead of using the original $U_{m\times m}$, $\Sigma_{m\times n}$ and $U_{n\times n}$, we shrink every matrice to new size $U_{m\times r}$, $\Sigma_{r\times r}$ and $U_{r\times n}$. The final $size(R)$ is still $m\times n$, but we abandon some entries since these entries are not so important than these we have reserved.

%% octave: core code of svd compressions
X = imread(filename);  
[U S V] = svd(double(X));
R = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';

Summary

Today we have learned mathmatics backgroud on SVD, including linear transformation and EigenVector&EigenVaule. Before SVD, we first talked about EigenValue Decomposition. Finally, Singular Vaule Decomposition is very easy to be deduced. In the last section, we took an example see how SVD be applied to image compression field.

Now, it comes to the topic how to save our storage of a 32G iPhone7, the coclusion is obvious: using SVD compress image to shrink the size of our photos.

Reference

https://www.youtube.com/watch?v=EokL7E6o1AE
https://www.youtube.com/watch?v=cOUTpqlX-Xs
https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw
https://yhatt.github.io/marp/
https://itunes.apple.com/cn/itunes-u/linear-algebra/id354869137
http://www.ams.org/samplings/feature-column/fcarc-svd
https://www.cnblogs.com/LeftNotEasy/archive/2011/01/19/svd-and-applications.html

怎样理解特征向量和特征值（翻译）

发表于 2017-04-16

原文地址：stackexchange

原文答案作者主页：Arturo Magidin

版权声明

本译文首发于我的个人博客chengmingbo.github.io, 版权属于原作者。 #### 简短的答案特征向量可以让线性变换的理解变得简单。它们是沿着坐标轴（方向）的线性变换包括简单的伸/缩以及翻转；特征值提供的是这些线性变换影响因子。如果你理解越多沿着坐标轴（方向）的线性变换行为，理解线性变换就变得越简单；所以你要做的是有足够多的线性无关的特征向量与单因素线性变换产生联系。

长一点儿的答案

这个世界上有非常多的问题可以通过线性变换来建模，而特征向量提供了非常简单的解决方案。例如，考虑线性微分方程: \[\frac{\mathrm d x}{\mathrm d t} = ax + by\] \[\frac{\mathrm d y}{\mathrm d t} = cx + dy\]

可以找到很多描述此微分方程的系统，比如，两个物种数量的增长相互影响。具体来说，可能物种$x$是物种$y$的捕食者；周围越多的物种$x$，意味着越少的物种$y$可以得到繁衍壮大；问题是周围的物种$y$越少，那么对于物种$x$来说食物就会越少，所以物种$x$的繁衍就会越少；但是接下来因为物种$x$对物种$y$的生存压力降低，很快会导致$y$物种数量的增长；但是这就意味这物种$x$的食物变多了，所以物种xx的数量也跟着增长；如此这般，循环往复。特定的物理现象也能形成这样的系统，比如粒子在运动的流体中，粒子的速度矢量取决于其所处的流体中位置。

直接解决这种系统是非常复杂的。但是，假设如果你可以不用去关注变量$x$和变量$y$而是转而关注$z$和$w$（这里$z$和$w$与$x$和$y$线性相关，也就是说，$z=\alpha x + \beta y$, $\alpha$和$\beta$是常量，同时$w=\gamma x + \delta y$， $\gamma$和$\delta$也是常量）。这样，我们的系统就变换成了如下的形式： \[\frac{\mathrm d z}{\mathrm d t} = \kappa w\] \[\frac{\mathrm d w}{\mathrm d t} = \lambda z\]

也就是说，你对系统做了解耦，这样你就可以单独的处理各个独立函数了。接下来就这个问题就变得非常简单：$z=Ae^{\kappa t}$，以及$w=Be^{\lambda t}$。下一步就是用$z$和$w$的公式，算出$x$和$y$。

这能做到么？事实上，这等于我们精确的找到了矩阵$\begin{pmatrix}a & b\\ c&d\end{pmatrix}$线性独立的两个特征向量！$z$和$w$是其特征向量，而$\kappa$和$\lambda$为相对应的特征值。通过使用一个表达式把$x$和$y$混合起来，然后解耦成两个互相独立的函数，问题现在变得非常简单了。

这就是我们希望使用特征向量及特征值的本质：通过线性变换把问题解耦成一系列沿着各个隔离方向的操作，使得各个方向问题都可独立解决。

大量的问题归根结底是解决线性独立操作，理解这些可以实实在在的帮助你理解矩阵/线性变换到底在做什么。

主成分分析（PCA）简明教程（翻译）

发表于 2017-04-05

作者：Lindsay I Smith

时间：2002.2.26

译者：程明波

英文文章地址

译文地址

版权声明

本译文首发于我的个人博客chengmingbo.github.io, 版权属于原作者。

第一章

前言

这是一篇帮助读者理解主成分分析（PCA）的教程。PCA是一种统计技术，在人脸识别和图像压缩等领域都有应用。同时，PCA也是一种高维数据模式发现的一种常用方法。

在讲PCA之前，本文先介绍了PCA用到的一些数学概念。其中包括标准差、协方差、特征向量和特征值等。这些背景知识意在帮助我们理解PCA部分，如果你对这些概念已经非常清晰可以跳过此部分。

示例贯穿于整个教程，以便于通过直观的例子讨论概念。如果你还想了解更多内容，霍华德.安东著有约翰威立国际出版公司出版的数学课本《Elementary Linear Algebra 5e》提供了非常好的这方面数学背景知识。

第二章

数学背景知识

本章试图介绍便于理解主成分分析计算过程所需的基本数学知识。各个主题互相独立，同时各主题会举例说明。理解为什么使用这些技术以及一些关于数据计算结果所告诉我们的比记住枯燥的数学原理更重要。尽管不是所有这些知识都应用于PCA，一些看起来不是直接相关知识是这些最重要技术的基石。

我安排其中的一节介绍统计学知识，主要着眼于分布的度量，或者说数据是怎样离散的，其他的部分主要讲矩阵代数的一些知识，包括特征值、特征向量以及一些PCA所需的重要矩阵性质。 #### 2.1 统计知识整个统计学都基于你有一个很大数据集的前提，以及你想分析关于数据集中各个数据点的关系。我会介绍一些量度这些数据的一些方法，帮助你理解这些数据本身。

2.1.1 标准差

要了解标准差，我们需要一个数据集。统计学经常会使用总体中的一些采样。以选举为例，总体就是一个国家的所有人，因此一个采样就是统计学家用来度量的总体的一个子集。统计学伟大之处是我们只需度量（例如电话调查等）总体中的采样，你就可以计算出最接近所有整体的度量。

本节我假设我们的数据集是某个很大总体的采样。本节后面会提供总体以及采样的更多信息。这是一个示例数据集： \[X=[1\, 2\,4\, 6\, 12\, 15\, 25\, 45\, 68\, 67\, 65\, 98]\] 我们简单地假设字符$X$表示包含这些所有数字的集合。如果我想表示这个集合中某个单独的数字，我会用$X$加上下标表示某个具体的数。例：$X_3$表示$X$集合中的第三个数，也就是数字$4$。注意，有些书用$X_0$表示第一个数字，我们这里用$X_1$。另外，我们用字符$n$表示集合中元素的数量。

我们可以计算一个集合的很多维度，比如，我们可以计算样本的均值。这里我假设读者明白什么是一个样本的均值。这里仅给出公式： \[\overline{X} =\frac{\sum_{i=1}^n X_i}{n}\] 注意，我们用字符$\overline{X}$标识集合的均值。这个公式表示：把所有的数字加起来再除以他们的数量。

很不幸，均值除了告诉我们某种中心以外，并没有告诉关于数据的更多信息。比如以下两个数据集合的均值(10)完全一样，但是显然它们区别很大。 \[[0\, 8\, 12\, 20]\quad 和 \quad[8\, 9\, 11\, 12]\]

那么，这两个集合有何不同呢？这两个集合的离散程度是不同的。一个数据集的标准差（Standard Deviation, 缩写SD）是衡量这个集合数据离散程度的一个指标。怎么计算呢？SD的定义是这样的：每个数据点到这份数据均值点的平均距离。计算每一个数据点到均值点的距离平方，然后相加，再除以$n-1$，再开方，公式如下： \[s=\sqrt{\frac{\sum_{i=1}^n (X-\overline{X})^2}{(n-1)}}\] 这里$s$常用来标识样本方差。可能有人会问：“为啥分母除以$n-1$而不是$n$呢？” 答案有点儿复杂，大体来说，如果你的数据集合是一个采样，比如，你取样于真实世界（比如调查500人的选举情况）得到一个子集，那么你就必须用$n-1$，因为这个结果比你用$n$更接近于你用全部的整体算出的标准差。但是，如果你不是计算一个样本的而是整体的标准差，这种情况下，你就应该除以$n$而不是$n-1$。如果想了解更多的关于标准差的内容，可以访问这里,链接文章用类似的方法讲了标准差，提供了不同分母计算的区别实验，同时还探讨了采样和总体的异同。

数据集1 \[\begin{array}{lrr} X & (X-\overline{X}) & (X-\overline{X})^2 \\ \hline 0 & -10 & 100\\ 8 & -2 & 4\\ 12 & 2 & 4\\ 20 & 10 & 100\\ \hline \bf{总计} & & 208\\ \hline \bf{除以(n-1)} & & 69.333\\ \hline \bf{平方根} & & 8.3266\\ \hline \end{array}\]

数据集2 \[\begin{array}{lrr} X & (X-\overline{X}) & (X-\overline{X})^2 \\ \hline 8 & -2 & 4\\ 9 & -1 & 1\\ 11 & 1 & 1\\ 12 & 2 & 4\\ \hline \bf{总计} & & 10\\ \hline \bf{除以(n-1)} & & 3.333\\ \hline \bf{平方根} & & 1.8.257\\ \hline \end{array}\]

\[\bf{表2.1 标准差计算}\]

从上表2.1，我们可以看到标准差的计算过程。

因此，和我们预想的一样，第一个数据的的标准差要比第二个大得多。原因是数据离散于均值点的程度更高。再举一个例子，数据集: \[[10\, 10\, 10\, 10]\] 的均值也是10，但是它的标准差是0，因为所有的数字是相同的。没有任何数据点偏离均值。

2.1.2 方差

方差是数据离散程度的另一个度量。实际上，它和标准差几乎相同，公式如下： \[s^2=\frac{\sum_{i=1}^n (X-\overline{X})^2}{(n-1)}\] 你会注意到方差就是标准差的平方，标识上也有$s$($s^2$)。$s^2$经常用来标识一个数据集的方差。方差和标准差都是用来衡量数据的离散程度。标准差使用的更普遍，方差也常使用。之所以介绍方差是因为下一节我们介绍的协方差是基于方差的。

练习

计算下列数据集的均值、标准差和方差。

[12 23 34 44 59 70 98]

[12 15 25 27 32 88 99]

[15 35 78 82 90 95 97]

2.1.3 协方差

我们之前介绍的前两种度量方式只针对纯1维情况。1维数据集合可能是这种形式：屋里所有人的身高，或者上学期计算机科目101的成绩等等。但许多数据集是大于1维的情况，对于这种数据集，统计分析的目标经常是分析不同的维度之间的关系。例如，我们可能有个数据集同时包含了课堂上学生的身高，以及他们论文的分数。接着我们就可以利用统计分析工具来观察学生的身高对他们的成绩是否有影响。

标准差和方差只是对单一维度的计算，因此，你只能对数据的每一个维度单独计算标准差。然而，如果有一种类似的度量能找到各维度相互在偏离均值的变化关系会非常有用。

协方差就是这样一种度量。协方差总是用来度量两个维度，如果你计算一个维度和他自己维度的协方差，这时协方差就退化为这一个维度的方差。因此，如果你有一个3维数据集$(x,y,z)$,协方差的计算公式与方差非常相似。方差的计算公式也可以这样表示： \[var(X)=\frac{\sum_{i=1}^n(X_i-\overline{X_i})(X_i-\overline{X_i})}{(n-1)}\] 这里我简单的对平方项进行了展开。有了以上知识，我们现在可以写出协方差的公式了： \[cov(X,Y)=\frac{\sum_{i=1}^n(X_i-\overline{X_i})(Y_i-\overline{Y_i})}{(n-1)}\]

除了第二个括号中的$X$全部被替换成了$Y$以外，协方差和方差的公式完全一样。我们可以这么表述：“对于每个数据项，把每个$x$和$x$均值的差与每个$y$和$y$均值的差相乘，再加和除以$(n-1)$”。协方差是怎样的一种工作机理呢？我们这里用一些数据来举例。想象你通过调查得到一个2维数据。假设我们问了一堆学生他们花在科目COSC241的总小时数，以及他们的学期末成绩。现在我们有了两个维度，第一个维度是$H$，标识学习的小时数，第二个维度是M，标识学生的成绩。图2.2展示了我们假设的数据以及两个维度学习小时数和成绩之间的协方差$cov(H,M)$。

这张图告诉我们什么呢？协方差的值没有它的符号重要（正或负）。如果值是正的，比如我们这里，那么意味着两个维度一起增减。即，一般来说，如果学习的小时数增加，那么这个学生最后取得的成绩就会高。

但是如果协方差的值是负的，那么如果其中一个维度增加，另一个维度就会减少。如果我们刚刚计算的协方差的结果是负值。那我们的说法就变成了随着学习小时数的增加，期末成绩会降低。

最后一种情况，如果协方差是$0$，那么说明两个维度是相互独立的。

我们很容易画一张图如图2.1.3，得出结论：学习成绩随着学习的小时数增加而增加。但是，只有两维或三维这种低维的奢侈情况，我们才能通过可视化观察趋势。由于在一个数据集中可以计算任意两个维度的协方差，这种技术经常是高维数据可视化非常困难的情况下寻找维度之间关系的一种方法。

你可能会问，$cov(X, Y)$与$cov(Y,X)$是否相等？简单一看我们就会发现，它们是完全相等的，因为两个式子计算的唯一不同是在$cov(Y,X)$中$(X_i-\overline{X_i})(Y_i-\overline{Y_i})$被替换成了$(Y_i-\overline{Y_i})(X_i-\overline{X_i})$。我们知道乘法满足交换率，也就是说，无论乘数和被乘数的位置怎么变化，结果都是一样，也就是说这两个协方差结果是相同的。

2.1.4

我们知道，协方差总是用来计算两个维度之间的关系。如果我们有一个超过2维的数据集合，那么我们要计算的协方差的值的数量就不止一个了。比如，一个三维的数据集（$x,y,z三个维度$)。你可以计算的协方差就有$cov(x,y)$、$cov(x,z)$ 和 $cov(x,z)$。事实上，对于一个$n$维的数据集，你可以计算$\frac{n!}{(n-2)! * 2}$不同的值。

数据： \[\begin{array}{lrr} &小时数(H)&成绩(M)\\ \hline 数据 & 9 & 39\\ & 15 &56 \\ & 25 &93 \\ & 14 &61 \\ & 10 &50 \\ & 18 &75 \\ & 0 &32 \\ & 16 &85 \\ & 5 &42 \\ & 19 &70 \\ & 16 &66 \\ & 20 &80 \\ \hline 总数&167&749\\ \hline 平均 & 13.92& 62.42\\ \hline \end{array}\]

协方差： \[\begin{array}{cc|c|c|c} H & M & (H_i - \overline{H}) & (M_i-\overline{M}) & (H_i-\overline{H})(M_i-\overline{M})\\ \hline 9 & 39 & -4.92& -23.42 &115.23\\ 15 & 56 & 1.08& -6.42 &-6.93\\ 25 & 93 & 11.08& -30.58 &338.83\\ 14 & 61 & 0.08& -1.42 &-0.11\\ 10 & 50 & -3.92& -12.42 &48.69\\ 18 & 75 & 4.08& 12.58 &51.33\\ 0 & 32 & -13.92& -30.42 &423.45\\ 16 & 85 & 2.08& -22.58 &46.97\\ 5 & 42 & -8.92& -20.42 &182.15\\ 19 & 70 & 5.08& -7.58 &38.51\\ 16 & 66 & 2.08& -3.58 &7.45\\ 20 & 80 & 6.08& 17.58 &106.89\\ \hline 总数 & & & & 1149.89\\ \hline 平均 & & & & 104.54\\ \end{array}\]

想求出所有不同维度的协方差，非常有用的方法是把他们全计算出来然后放入矩阵。我假设你对矩阵比较熟悉，以及矩阵怎样定义。因此，对于一个$n$维的数据集的协方差矩阵： \[C^{n\times n}=(c_{i,j}, c_{i,j}=cov(Dim_i, Dim_j))\], 这里$C^{n\times n}$是一个$n$行$n$列的矩阵，$Dim_x$ 是第$x$维。上面非常不美观的公式说的是，如果你有一个$n$维数据集，那么协方差矩阵就是一个$n$行$n$列的矩阵，矩阵的每一个元素是两个维度之间的协方差计算结果。例如，矩阵的第2行第三列就是维度2和维度3之间的协方差计算结果。

一个例子。我们假设有一个3维的数据集，分别使用$x$,$y$,$z$表示3个维度。那么协方差矩阵是一个3行3列的矩阵，矩阵中的元素就是： \[\begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \\ \end{pmatrix}\]

几个需要注意：主对角线计算的某一维和它自己的协方差，也就是这些维度的方差。剩下的元素，因为$cov(a,b)=cov(b,a)$，所以矩阵关于主对角线对称。

练习

计算以下关于$x$和$y$的2维数据集的协方差，然后描述一下协方差结果可能推导出数据什么方面的结论。 \[\begin{array}{c|c|c|c|c|c} 项目id & 1 & 2 & 3 & 4 & 5\\ \hline x & 10 & 39 & 19 & 23 & 28\\ y & 43 & 13 & 32 & 21 & 20\\ \hline \end{array}\]
计算下列3维数据的协方差矩阵： \[\begin{array}{c|c|c|c} 项目id & 1 & 2 & 3 \\ \hline x & 1 & -1 & 4\\ y & 2 & 1 & 3\\ z & 1 & 3 & -1\\ \hline \end{array}\]

2.2 矩阵代数

本节会介绍PCA所用到的一些矩阵代数的背景知识，我将重点介绍对给定矩阵计算特征向量和特征值的相关知识。这里我假设你了解矩阵的基本知识。 \[\begin{align}\begin{pmatrix} 2 & 3\\ 2 & 1\\ \end{pmatrix}\times \begin{pmatrix} 1\\\\ 3 \end{pmatrix}= \begin{pmatrix} 11\\\\ 5 \end{pmatrix} \end{align}\]

\[\begin{align}\begin{pmatrix} 2 & 3\\ 2 & 1\\ \end{pmatrix}\times \begin{pmatrix} 3\\\\ 2 \end{pmatrix}= \begin{pmatrix} 12\\\\ 8 \end{pmatrix}=4\times \begin{pmatrix} 3\\\\ 2 \end{pmatrix} \end{align}\] \[\bf{图2.2：非特征向量和1个特征向量}\] \[\begin{align} 2\times \begin{pmatrix} 3\\\\ 2 \end{pmatrix}= \begin{pmatrix} 6\\\\ 4 \end{pmatrix} \end{align}\]

\[\begin{align}\begin{pmatrix} 2 & 3\\ 2 & 1\\ \end{pmatrix}\times \begin{pmatrix} 6\\\\ 4 \end{pmatrix}= \begin{pmatrix} 24\\\\ 16 \end{pmatrix}=4\times \begin{pmatrix} 6\\\\ 4 \end{pmatrix} \end{align}\]

\[\bf{图2.3: 缩放特征向量后仍为特征向量}\]

2.2.1 特征向量

如你所知，只要两个矩阵的大小相容，你就可以将两个矩阵相乘。特征向量是矩阵相乘的的特殊形式。我们现在考虑如图2.2所示的矩阵和向量相乘的情况。

第一个例子中，计算结果不是整数与原始矩阵相乘的形式，但到了第二的例子，计算结果的就是一个整数乘以与左边完全相同的一个向量。为何能产生这样的结果呢？实际上，向量就是2维空间的一个矢量。向量$\begin{pmatrix}3\\2\end{pmatrix}$(第二个相乘的例子)代表从原点$(0,0)$一个指向$(3,2)$的一个箭头，另一个矩阵可以被认为是变换矩阵。如果你在向量的左边乘以一个矩阵，结果就是把这个向量从其原始位置进行了变换。

上面说得就是变换就是特这向量的本质。想象一个变换矩阵，以及一个在直线$y=x$上的向量，矩阵左乘这个向量。如果你发现结果仍然位于$y=x$这条直线上，那么这就向是量的自反射。这个向量（所有的乘子，因为我们不关心向量的大小）就是这个变换矩阵的一个特征向量。

这些特征向量有什么性质呢？第一你要知道的就是只有方矩阵才有特征向量。其次是不是所有的方矩阵都有特征向量。最后，如果一个$n\times n$矩阵只要有，那么就一定有$n$个特征向量。如果一个$3\times 3$的矩阵有特征向量，那就有3个。

特征向量的另一个性质是：如果我在相乘之前对其缩放一定量，那么我可以仍然得到相同的乘积形式（如图2.3）。这是因为如果你缩放一个的向量，你做的仅仅是把这个向量变长，而没有改变其方向。最后，一个矩阵的所有的特征向量都是垂直的。也就是说，无论你有多少维的向量，他们都是互相形成直角。另一个更数学化的说法叫正交。这么描述非常重要，原因是我们可以更方便表述这些垂直的正交向量，而不用在$x$轴和$y$轴的坐标系中描述。在PCA介绍部分我们会用到这些。

另一点重要的是，数学家们在寻找特征向量时，他们总喜欢找长度为1的特征向量。原因我们已经知道，向量的长度并不是影响因素，方向才是。所以为了使特征向量有标准形式，我们的一般做法是将其缩放成长度为1的向量。这样，所有的特征向量就都有相同的长度了。下面我们把例子中的向量标准化。 \[\begin{pmatrix} 3\\ 2 \end{pmatrix} \] 是一个特征向量，这个向量的长度为： \[\sqrt{(3^2+2^2)}=\sqrt{13}\] 所以我们把原始的向量除以这个长度，就得到了长度唯一的特征向量。 \[\begin{pmatrix} 3\\ 2 \end{pmatrix}\div\sqrt{13}= \begin{pmatrix} {3}/{\sqrt{13}}\\ {2}/{\sqrt{13}} \end{pmatrix} \]

怎么找到这些神秘的特征向量呢？很不幸，只有当矩阵足够小时，特征向量才好找，比如不超过$3\times 3$的矩阵。如果矩阵大小再变大，通常的做法是用复杂的迭代方式求解，这些方法此教程不会讲解。如果你想在程序中使用计算特征向量的方法，很多数学库都有实现，一个有用的数学库包。

如果想进一步了解特征向量和特征值以及正交等内容，请参考霍华德.安东著有约翰威立国际出版公司出版的数学课本《Elementary Linear Algebra 5e》，ISBN 0-471-85223-6。

2.2.2 特征值

特征值和特征向量高度相关，其实我们已经在图2.2看到过特征值。还记得被矩阵缩放以后的特征向量有相同的大小么？在那个例子中，这个值是4。这里4就是特征向量相关的特征值。无论我们对特征向量怎么缩放，我们始终得到的特征值都一直是一样的，如图2.3的例子特征值一直是4。

现在我们发现特征向量和特征值总是成对出现。如果你现在需要某个编程库计算特征向量，通常特征值也被同时计算出来了。

练习

对于下面的矩阵 \[\begin{pmatrix} 3&0&-1\\ -4&1&2\\ -6&0&-2 \end{pmatrix} \] 判断下面是否有此矩阵的特征向量，如果有，请求出对应的特征值。 \[ \begin{pmatrix} 2\\ 2\\ -1 \end{pmatrix}\ \begin{pmatrix} -1\\ 0\\ 2 \end{pmatrix}\ \begin{pmatrix} -1\\ 1\\ 3 \end{pmatrix}\ \begin{pmatrix} 0\\ 1\\ 0 \end{pmatrix}\ \begin{pmatrix} 3\\ 2\\ 1 \end{pmatrix} \]

第三章主成分分析（Principal Components Analysis）

终于到了主成分分析（PCA）部分了，PCA可以在数据中识别模式，并通过此种方式突出数据中相似和不同的部分。由于很难用图像表示高维数据，也就意味着在高维数据中寻找模式变得非常困难，这时，PCA就成了极为强大的数据分析工具。

另一个PCA的优点是，如果你通过其找到了数据中的模式，你还可以用来压缩数据，即，在不损失太多信息的前提下降低数据的维度。这个技术被用于图像压缩，我们在稍后的章节中会有涉及。

本章我们将针对一个数据集，一步一步实现PCA计算。这里我不准备描述为什么 PCA表现出色。我做的是为你提供每一步都发生了什么，这样，将来如果你想使用此技术时，就会有足够多的知识帮助你做决策。

3.1 方法

第一步：数据集

在我们这个简单的例子中，我会使用我编造的一个数据集。这个数据集只有两维，之所以选择这份数据是因为我可以通过画出图形来分析PCA的每一步都发生了什么。 ##### 第二步：减掉均值如果想实现PCA，我们首先要把每一维的数据减掉均值。就是说要对每一维求平均值，接着把每一维的每个数据都减掉均值。我们这里所有的$x$值都要减掉$\overline{x}$($x$维度所有数据的均值)，所有的$y$值都减掉$\overline{y}$。这样我们就构造了一个均值为0的数据集。 \[\begin{align} \bf{数据}= \begin{array}{c|c} x& y\\ \hline 2.5 & 2.4\\ 0.5 & 0.7\\ 2.2 & 2.9\\ 1.9 & 2.2\\ 3.1 & 3.0\\ 2.3 & 2.7\\ 2 & 1.6\\ 1 & 1.1\\ 1.5 & 1.6\\ 1.1 & 0.9\\ \end{array}\bf{调整后的数据=} \begin{array}{c|c} x& y\\ \hline 0.69 & 0.49\\ -1.31 & -1.21\\ 0.39 & 0.99\\ 0.09 & 0.29\\ 1.29 & 1.09\\ 0.49 & 0.79\\ 0.19 & -0.31\\ -0.81 & -0.81\\ -0.31 & -0.31\\ -0.71 & -1.01\\ \end{array} \end{align}\] \[图3.1：PCA示例数据，左边为原始数据，右边为减掉均值的数据\]

第三步：计算协方差矩阵

协方差矩阵我们在2.1.4小节已经讨论过。由于我们的数据是2维的，所有协方差矩阵就是$2\times 2$。协方差矩阵计算没特别说明的，我直接给出结果： \[cov=\begin{pmatrix} 0.616555556&0.615444444\\ 0.615444444&0.716555556 \end{pmatrix}\] 由于协方差矩阵非对角线元素都是正值，所以我们可以预期$x$和$y$一起增减。 ##### 第四步：计算协方差矩阵的特征向量和特征值协方差矩阵是方阵，所以我们可以计算其特征向量和特征值。这极为重要，因为他们可以告诉我们关于数据的有用信息。我一会儿会说明原因，现在我们来看一下特征值和特征向量： \[\begin{align} \bf{特征值}=\begin{pmatrix} 0.0490833989\\ 1.28402771 \end{pmatrix}\bf{\bf 特征向量}=\begin{pmatrix} -0.7351178656 & -0.6778873399\\ 0.677873399 & -0.735178656 \end{pmatrix} \end{align}\] 一定要注意到两个特征向量都是单位向量。也就是说他们的长度都是1。这个结果对PCA非常重要，幸运的是，大部分数学工具包计算特征向量提供的都是单位向量。

那么，这些计算结果都是什么意思呢？如果你观察图3.2的数据点，你会发现这些数据有非常强的模式。和我们用协方差预期的一致，这些数据确实一起增减。我利用数据同时也画了两个特征向量，这两个特征向量看起来像图3.2的对角线。我们在特征向量小节介绍过，两个特征向量是相互垂直的。但是，更重要的是特征向量为我们提供了数据中的模式信息。可以看出，其中一条线（译者注：大概45度倾角的这条线）看起来像画了拟合这些数据点的一条线。这个特征向量告诉我们两个数据维度沿着线的相关性（译者注：原文是两个数据集，我认为是两个数据维度）。第2个特征向量（译者注：大概135度倾角的这条线）给我们提供了另外一些重要性稍低的数据中的模式，数据点分布在线的两边。

因此，通过从特征矩阵中取出特征向量进行分析，我们已经提取出了刻画数据特点的线。接下来的步骤会包括数据变换以便于用我们这些线来表达数据。 ##### 第五步：选择成分及构建特征的向量现在我们来讨论数据压缩和降维的概念。如果你学习了前面小节的特征向量和特征值的相关信息，你会注意到特征向量是之间有很大不同。事实上，具有更大特征值对应的特征向量是数据集的主成分(Principal Component)。在我们这个例子中，对应更大特征值特的特征向量是基本拟合数据点的这条线。这维特征向量描述了数据维度之间最重要的关系。 \[图3.2:\bf{标准化（减掉均值）后的数据图以及协方差矩阵中特征向量图}\]

一般来说，一旦从协方差矩阵中找出特征向量，下一步我们要做的就是把他们对应的特征值从高到低排列。排序表明了成分（component）的重要性高低。现在，如果你如果愿意，可以忽略那些重要性没那么高的成分，你就会丢失一些信息，但是如果特征值非常小，你丢失的信息并不会太多。如果你扔掉一些成分，最终的数据集的维度会低于原始数据的维度。具体来说，如果你的原始数据有$n$维，因此你可以计算出$n$个特征向量和$n$个特征值，接下来如果你只选取前$p$维特征向量，那么最终的数据就变成了$p$维的数据集。

现在你需要做的是构造一个特征(feature)的向量，其实就是一个向量的矩阵。矩阵是通过挑选你希望留下的特征向量，组成一个每列1个特征向量的矩阵(译者注：最后一维我认为是第p维更好，可以与上面一段对应)。 \[\bf{特征的向量}=(eig_1,eig_2,eig_3,...,eig_p)\]

来看我们的例子，现在我们有2个特征向量，我们现在有两个选择，第一选择是两个特征向量都被用于构造特征的向量： \[cov=\begin{pmatrix} -0.77873399&-0.735178656\\ -0.735178656&0.677873399 \end{pmatrix}\] 或者，我们可以扔掉不重要的成分，那么特征的向量只有1列： \[cov=\begin{pmatrix} -0.677873399\\ -0.735178656 \end{pmatrix}\] 下一节我们将针对上面两种新的数据集进行讨论。 ##### 第六步：生成新数据集这是PCA的最后同时是最简单的一步。一旦我们选择了我们希望保留到成分（特征向量集），我们只需把特征的矩阵转置，左乘调整后的数据（原始数据减掉均值），然后再转置。 \[\bf{最终数据=行特征的向量} \times \bf{行调整后的数据}\]

这里$\bf{行特征的向量}$是特征的向量组成的矩阵进行转置，也就是说现在特征向量现在是以行的形式排列，最重要的特征向量在第一行。$\bf{行调整后的数据}$是经过均值调整后的数据，也进行了转置，也就是说数据项在每一列，而每行是一个独立的维度。抱歉数据转置可能来得有点儿突然，但是如果我们现在对特征向量的矩阵和数据进行转置，后面的公式就会简单很多，而不是一直带着个转置的上标符号$T$。$\bf{最终数据}$是最终的数据集合，其中每一列是一个数据项，每一行是一个维度。

做完这些我们可以得到什么呢？我们可以得到和我们选择向量完全相关的原始数据。我们的原始数据有$x$轴和$y$轴两个坐标的坐标系，所以我们的数据与这两个坐标的坐标系相关。其实你可以用任何你喜欢的两个坐标轴的坐标系来表示你的数据。如果坐标轴互相垂直，这种表示方法是最高效的，这就是为何特征向量间互相垂直这么重要。现在我们已经把我们的数据从跟$x$轴和$y$轴相关改为2个特征向量组成的坐标系相关。如果说我们已经通过降维构造了新的数据集，也就是说我们扔掉了一些特征向量，那么新数据只跟我们留下的特征向量相关。

为了展示我们的数据，我已经把两种可能的特征的向量都对数据做了变换。我已经对每种情况的结果进行的了转置，这样我就把数据恢复成表结构的组织形式。同时，我也把最终的数据点画了出来，这样我们就可以观察这些数据点与这些成分之间的关系。

两个特征向量都保留的情况转换后的结果见图3.3。这个图其实就是原始数据旋转后，这样特征向量就成了坐标轴。这种情况很好理解，因为我们在分解的过程中并没有丢失任何信息。

另外一种变换，我们只保留有最大特征值的特征向量，我们可以从图3.4中看的数据的结果。和预期的一样，这个数据只有一维。如果你用这份数据与两维特征向量都用变换后的数据对比，你会注意到，这个数据就是另一份数据的第一列。所以，如果你画出这个数据的图，这份数据只有一维，那么结果其实就是图3.3数据点$x$的坐标点。我们其实就是高效的抛弃了其他的坐标轴，也就是其他的特征向量。

那么我究竟做了什么呢? 本质上我们把数据进行了变换，使之可以用相关的模式进行表示，这些模式就是一些最适合描述这些数据之间关系的线。这么做非常有用，因为我们现在已经把数据点对每条线的贡献进行分类，然后进行组合。首先，我们仅仅有$x$轴和$y$轴，这还不错，但是每个$x$和$y$的数据点其实并无法告诉我们，每个数据点和其他数据点之间的关系。现在数据点的值可以精确告诉我们数据点处于趋势线的位置（上面或者下面）。如果是两个特征向量都用的情况，我们仅仅是把数据转换以便于我们使这些数据与特征向量相关，而不是$x$轴和$y$轴。但是只留一维特征向量的分解移除了较小特征向量的贡献，是我们的数据只与保留的一维数据相关。 ##### 3.1.1 把旧数据找回来显然，如果你用PCA对数据进行压缩，你一定想把原始数据恢复回来。（下一章我们看到例子）这些内容来自于这里。 \[\begin{align} \bf{转换后的数据}=\begin{array}{c|c} x&y\\ \hline -0.827970186 & -0.175115307\\ 1.77758033 & 0.142857227\\ -0.992197494 & 0.384374989\\ -0.274210416 & 0.1304117207\\ -1.67580142 & -0.209498461\\ -0.912949103 & 0.17528282444\\ 0.0991094375 & -0.349824698\\ 1.14457216 & 0.0464172582\\ 0.438046137 & 0.0177646297\\ 1.22382056 & -0.162675287\\ \end{array} \end{align}\] \[图3.3：应用了PCA分析后并使用了两个特征向量的数据表以及绘制的新数据点\]

转换后的数据(单一特征向量) \[\begin{array}{c} x\\ \hline -0.827970186\\ 1.77758033\\ -0.992197494\\ -0.274210416\\ -1.67580142\\ -0.912949103\\ 1.14457216\\ 0.438046137\\ 1.22382056 \end{array}\]

\[图3.4：只用最重要的特征向量数据转换的数据\]

所以，我们怎么把原来数据恢复回来？在我们进行恢复原始数据之前，回忆只有我们将所有特征向量进行转换才能精确的把数据恢复回来。如果我们在最后转换时减少特征向量，那么恢复的数据已经失去很多信息。回想一下，最后的变换是： \[\bf{最终数据=行特征的向量} \times \bf{行调整后的数据}\] 我们可以把公式反转过来，进而得到原始数据， \[\bf{行调整后的数据}=\bf{行特征的向量}^{-1}\times\bf{最终数据}\] 这里，$\bf{行特征的向量}^{-1}$是的$\bf{行特征的向量}$的逆。由于我们讨论的是特征向量，组成的特征的向量，所以，$\bf{行特征的向量}^{-1}$其实就是$\bf{行特征的向量}$的转置。当然，只有在矩阵中的所有元素是由单位特征向量组成是才成立。这样，恢复原始数据又变得容易了很多，现在公式变成了： \[\bf{行调整后的数据}=\bf{行特征的向量}^{T}\times\bf{最终数据}\] 这个公式在我们只保留部分特征向量的情况下仍然成立。也就是说，就算你扔掉了一下特征向量，上面的公式仍然成立。

我不会演示用所有特征向量恢复原始数据，因为这样计算的结果和开始的数据一模一样。但是，我们一起来看一下只保留了一维特征向量的情况下，是怎样损失信息的。图3.5展示了丢失信息的情况。我们把图中的数据点与图3.1对比一下就会发现，沿着主特征向量的变化被保留下来了（见图3.2特征向量及数据）沿着其他成分（另一个特征向量被扔掉了）的变化丢失了。

\[图3.5 从单独一维特征向量重新构造的数据\]

练习

协方差矩阵的特征向量为我们提供了什么呢？
我们在PCA计算的过程中，哪一步可以决定压缩数据，压缩可以起到什么效果呢？
举例说明PCA在图像处理怎样用主成分表示，同时调研一下人脸识别中“特征脸”(Eigenfaces)主题。

第四章计算机视觉应用

本章我们将简单的PCA在计算机视觉领域的应用，首先我们看一下图像是怎么表示的，然后我们我们看看能怎样用PCA处理这些图像。本章关于人脸识别的的信息主要来自于1997年IEEE 9月Vol 85, No. 9《Face Recognition: Eigenface, Elastic Matching, and Neural Nets》。图像表示来自于爱迪生-韦斯利出版社1987年出版的由Rafael C. Gonzalez 和Paul Wintz合著的《Digital Image Processing》想了解更多信息，KL变换相关知识也是非常好的参考。图像压缩相关知识来自于这里，此网站还提供了大量用不同数量特征向量重新构造图像的方法。

4.1 表示

在我们把一系列矩阵技术应用于计算机视觉上时，我们必须考虑图像的表示方法。一个正方形，$N\times N$的图像可以被表示成$N^2$维的向量。 \[X=(x_1,x_2,x_3,...,x_{N^2})\] 这里，第一行前$N$个$(x_1--x_n)$一个挨着一个的像素点组成了1维的图像，下$N$个元素是下一行，以此类推。每个像素点的值代表图像三原色的亮度，也可能是只是灰度图像，那么只需要1个单独的值即可表示。 #### 4.2 PCA寻找模式假设我们有20个图像。每个图像的像素非常高。对每个图像，我们都建立一个图像向量表示相应图像。接着我们就可以把所有的图像放到一个像这样的大矩阵中： \[图像矩阵= \begin{pmatrix} ImageVec_1\\ ImageVec_2\\ \vdots\\ ImageVec_{20}\\ \end{pmatrix}\] 我们现在就可以开始以这条图像矩阵为基始，应用PCA，先构造协方差矩阵，然后得到原始数据相关的特征向量。为什么用PCA分析有用呢？假设我们要做人脸识别，那我们的原始数据就是很多人脸。接下来的问题是，给一张我新的图片，那么这是原始人脸数据中谁的人脸呢（注意，这新的图片不是我们开始给的20个人脸图片）？计算机视觉的处理方法是衡量新的图片和原始图片的差别，但并不是在原始坐标系进行对比，而是在PCA分析的生成的坐标系下衡量。

在实际应用中，PCA生成的坐标系下识别人脸会好非常多，因为PCA分析已经提供了原始图片中不同和相似等相关性。主成分分析已经识别出了数据中的统计学模式。

因为所有的向量都是$N^2$维的，我们最后会得到$N^2$个特征向量，在实践中，我们可以扔掉其中不重要的一些特征向量，识别效果仍然非常好。 #### 4.3 PCA图像压缩使用PCA做图像压缩常常也被称作霍特林变换或者是KL变换（Karhunen-Leove transform）. 如果我们有20个图像，每个图像有$N^2$个像素，所以我们就可以构造$N^2$个向量，每个向量20维。每个向量由每个图片的相同像素点的图片亮度值组成。这与我们之前的例子不同，因为之前我们是有一个图像的向量，向量里的每项都是不同的像素。然而我们现在是有一个每个像素的向量，向量里的每项是都是来自于不同的图片。

如果现在我们在一个数据集上应用PCA，那么，我们将会得到20个特征向量，因为，每个向量都是20维的。如果想要压缩数据，我们可以选择只用其中一部分特征向量变换，假设是15个特征向量。这样我得到的最终数据只有15维，达到了节省空间的目的。但是，当要恢复原始数据是，图像已经丢了一些信息。这种压缩技术叫做有损压缩，因为解压后的图片已经不是和原始图片完全一样的图片了，一般来说会变差。

附录 A

实现代码

这份代码用于可替换Matlab的自由软件Scilab。我用这份代码生成了文章的所有例子。除了第一个宏，剩下的都是我(原文作者)写的。

// This macro taken from
// http://www.cs.montana.edu/ ̃harkin/courses/cs530/scilab/macros/cov.sci // No alterations made
// Return the covariance matrix of the data in x, where each column of x
// is one dimension of an n-dimensional data set.  That is, x has x columns
// and m rows, and each row is one sample.
//
// For example, if x is three dimensional and there are 4 samples.
// x=[123;456;789;101112]
// c=cov(x)
function [c]=cov (x)
// Get the size of the array
sizex=size(x);
// Get the mean of each column
meanx = mean (x, "r");
// For each pair of variables, x1, x2, calculate
// sum ((x1 - meanx1)(x2-meanx2))/(m-1)
for var = 1:sizex(2),
        x1 = x(:,var);
        mx1 = meanx (var);
        for ct = var:sizex (2),
                x2 = x(:,ct);
                mx2 = meanx (ct);
                v = ((x1 - mx1)’ * (x2 - mx2))/(sizex(1) - 1);
end, c=cv;
end,
cv(var,ct) = v;
cv(ct,var) = v;
// do the lower part of c also.
// This a simple wrapper function to get just the eigenvectors
// since the system call returns 3 matrices
function [x]=justeigs (x)
// This just returns the eigenvectors of the matrix
[a, eig, b] = bdiag(x);
x= eig;
// this function makes the transformation to the eigenspace for PCA
// parameters:
// adjusteddata = mean-adjusted data set
// eigenvectors = SORTED eigenvectors (by eigenvalue)
// dimensions  = how many eigenvectors you wish to keep
//
// The first two parameters can come from the result of calling
// PCAprepare on your data.
// The last is up to you.
function [finaldata] = PCAtransform(adjusteddata,eigenvectors,dimensions) finaleigs = eigenvectors(:,1:dimensions);
prefinaldata = finaleigs’*adjusteddata’;
finaldata = prefinaldata’;
// This function does the preparation for PCA analysis
// It adjusts the data to subtract the mean, finds the covariance matrix,
// and finds normal eigenvectors of that covariance matrix.
// It returns 4 matrices
// meanadjust = the mean-adjust data set
// covmat = the covariance matrix of the data
// eigvalues = the eigenvalues of the covariance matrix, IN SORTED ORDER
// normaleigs = the normalised eigenvectors of the covariance matrix,
// IN SORTED ORDER WITH RESPECT TO
// THEIR EIGENVALUES, for selection for the feature vector.
//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It’s got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data) // Calculates the mean adjusted matrix, only for 2 dimensional data
means = mean(data,"r");
meanadjusted = meanadjust(data);
covmat = cov(meanadjusted);
eigvalues = spec(covmat);
normaleigs = justeigs(covmat);
sorteigvalues = sorteigvectors(eigvalues’,eigvalues’);
sortnormaleigs = sorteigvectors(eigvalues’,normaleigs);
// This removes a specified column from a matrix
// A = the matrix
// n = the column number you wish to remove
function [columnremoved] = removecolumn(A,n)
inputsize = size(A);
numcols = inputsize(2);
temp = A(:,1:(n-1));
for var = 1:(numcols - n)
        temp(:,(n+var)-1) = A(:,(n+var));
columnremoved = temp;
// This finds the column number that has the
// highest value in it’s first row.
function [column] = highestvalcolumn(A)
inputsize = size(A);
numcols = inputsize(2);
maxval = A(1,1);
maxcol = 1;
for var = 2:numcols
        if A(1,var) > maxval
                maxval = A(1,var);
end,
        end,
column = maxcol
maxcol = var;
25
end,
// This sorts a matrix of vectors, based on the values of
// another matrix
//
// values = the list of eigenvalues (1 per column)
// vectors = The list of eigenvectors (1 per column)
//
// NOTE:  The values should correspond to the vectors
// so that the value in column x corresponds to the vector
// in column x.
function [sortedvecs] = sorteigvectors(values,vectors)
inputsize = size(values);
numcols  = inputsize(2);
highcol = highestvalcolumn(values);
sorted = vectors(:,highcol);
remainvec = removecolumn(vectors,highcol);
remainval = removecolumn(values,highcol);
for var = 2:numcols
        highcol = highestvalcolumn(remainval);
        sorted(:,var) = remainvec(:,highcol);
        remainvec = removecolumn(remainvec,highcol);
        remainval = removecolumn(remainval,highcol);
end,
sortedvecs = sorted;
// This takes a set of data, and subtracts
// the column mean from each column.
function [meanadjusted] = meanadjust(Data)
inputsize = size(Data);
numcols = inputsize(2);
means = mean(Data,"r");
tmpmeanadjusted = Data(:,1) - means(:,1);
for var = 2:numcols
        tmpmeanadjusted(:,var) = Data(:,var) - means(:,var);
meanadjusted = tmpmeanadjusted
end,

本文PDF

Naive Bayes

发表于 2017-03-18

Introduction

You are sitting in front your sceen, annoyed by a bunch of spam mails. You wonder if there are any appoaches to get rid of so much many offended emails. Last time you doped out a extremely good idea. You set a series of words to identify those emails: every mail invovled by words "coupon" was trown to trash. However, on one hand, there were only about 10% spam including "coupon", one the other hand, you had trashed two significant emails, due to which, you lost two business valued about two million dollars. The thing was that, your inbox seems being overrun by those spams. Who can rescue you from endless deleting spams everyday?

Intuition

Actually, you were close to the answer when you were putting all emails to trash which included the word "coupon". Today, we learn about an efficient method to solve the problem systematically. Maybe we can search through internet, find all words which are about advertisment in email. You can restrict that if and only if those which includes more than 4 word in the list of spam words can be put into trash can. Maybe finally you can design a rule system to recoginize those spams without miss important ones. But it seems so boring a job to do this, moreover, it may be not personalized. If some those who are working for saling discount stuffs, he/she may find it doesn't work through applying your effective rules. How about thinking of probablity of those words emerge in all your inbox. Some words such as "coupon" may contribute more but not entirety, simultaneously, affordable may contribute less but not none. Notice that profit emerge in spam emails and normal emails both sometimes. We let $y=0$ denote an normal email while $y=1$ the opposite. And if a word such as "coupon" emerges in a mail, we set $coupon=1$, otherwise $coupon=0$. Suppose you have an email, we define the probability to : \[\begin{align} &p1=p(y=0|free=1, discount=1, affordable=1, customer=0, KPI=0, budget=0,...,bias=0)=?\\ &p2=p(y=1|free=1, discount=1, affordable=1, customer=0, KPI=0, budget=0,...,bias=0)=? \end{align}\] Our aim is to decide which is bigger $p1$ or $p2$. Let's describe the problem as followed: > We want to decide the probality of a mail spam or normal when word "free" is in the mail, "discount" is in the mail, "affordable" is in the mail, customer is not in the mail, ..., "bias" is not.

The description above is only about one email. For some other emails, maybe "free" and "discount" both did not emerge at all.

Definitioin

To generalize the problem, we let $x_i$ denote a word in emails. Suppose we now know all words in your inbox, say 10 thousand words. Then we have $x_1$ denote if "free" is in a mail, $x_1=0$ denotes negative while $x_1=1$ denotes positive. So we have $x_1,x_2,x_3,...,x_{10000}$, which denotes the status of each word in a mail. Then the problem is transferred as followed: \[\begin{align} &p1=p(y=0|x_1=1, x_2=1, x_3=1, x_4=0, x_5=0, x_6=0,...,x_{10000}=0)=?\\ &p2=p(y=1|x_1=1, x_2=1, x_3=1, x_4=0, x_5=0, x_6=0,...,x_{10000}=0)=? \end{align}\] Suppose we have N words in your inbox, then we want to decide: \[\begin{align} &p1=p(y=0|x_1, x_2,...,x_{N})=?\\ &p2=p(y=1|x_1, x_2,...,x_{N})=? \end{align}\] So how to sovle the probability problem?

Naive Bayes

Recall Bayes rules: \[\begin{equation} p(y|x)=\frac{p(x,y)}{p(x)}=\frac{p(x|y)p(y)}{p(x)} \end{equation}\] we apply the equation to our problem, then we have: \[\begin{align} &p(y=1|x_1,x_2,...,x_N)\\ & =\frac{p(x_1,x_2,...,x_N,y=1)}{p(x_1,x_2,...,x_N)}\\ & =\frac{p(x_1,x_2,...,x_N|y)p(y=1)}{p(x_1,x_2,...,x_N)}\\ \\ &p(y=0|x_1,x_2,...,x_N)\\ & =\frac{p(x_1,x_2,...,x_N,y=0)}{p(x_1,x_2,...,x_N)}\\ & =\frac{p(x_1,x_2,...,x_N|y)p(y=0)}{p(x_1,x_2,...,x_N)} \end{align}\] Notice that {p(x_1,x_2,...,x_N) is positive, and a constant as well. so our aim is transferred to: \[\begin{align} &Max(p(y=1|x_1,x_2,...,x_N), p(y=0|x_1,x_2,...,x_N)\\ &=Max(p(x_1,x_2,...,x_N|y=1)p(y=1),p(x_1,x_2,...,x_N|y=0)p(y=0) \end{align}\]

First of all, we talk about how to get $p(y=0)$ and $p(y=1)$. We have $N=10000$, suppose there are $900$ spams and $91000$ normal emails, then: \[\begin{align} &p(y=0)=\frac{count(spam\ email)}{count(all\ emails)}=\frac{900}{10000}=0.09\\ &p(y=1)=\frac{count(normal\ email)}{count(all\ emails)}=\frac{9100}{10000}=0.91 \end{align}\] And right now our task left is to compute $p(x_1,x_2,...,x_N|y=0)$ and $p(x_1,x_2,...,x_N|y=1)$, that means we want to know if a mail is a normal one or not, what is the probability of the combination of these $N=10000$ words. In other word, $x_1=0\ or\ 1$,$x_2=0\ or\ 1$ and so on. So we have to compute $2*2^{10000}=2*1.995*10^{3010}$ probabilities under our circumstance. It seems the scale is so large that we can not handle it. So we have an extremely adventurous assumption: > Each x in an email only can be decided by y.

Then we have: \[\begin{align} p(x_1,x_2,...,x_N|y=0)=p(x_1|y=0)\cdot p(x_2|y=0)\cdots p(x_N|y=0)\\ p(x_1,x_2,...,x_N|y=1)=p(x_1|y=1)\cdot p(x_2|y=1)\cdots p(x_N|y=1) \end{align}\] Now, we just need compute 2*10000 probabilities, it is surely a mission possible now. So how to compute $p(x_i=0\ or\ 1|y=0\ or\ 1)$? Take word "free" for example, we want to compute: \[\begin{align} &p(free=0|y=0)\\ &=\frac{p(free=0, y=0)}{p(y=0)}\\ &=\frac{count\ of\ emails\ have\ no\ word\ "free"\ in\ normal\ emails}{count\ of\ normal\ emails}\\ &p(free=0|y=1)\\ &=\frac{p(free=0, y=1)}{p(y=1)}\\ &=\frac{count\ of\ emails\ have\ no\ word\ "free"\ in\ spams}{count\ of\ spam\ emails} \end{align}\] As long as we have computed all of these values, we can use the equation to decide which email should be trashed. Suppose we have computed the probabilites of $p1$ and $p2$: \[\begin{align} &p1\triangleq p(x_1,x_2,...,x_N|y=1)p(y=1)\triangleq p(y=0)\cdot \Pi_{i=1}^N p(x_i|y=0)=0.00091\\ &p2\triangleq p(x_1,x_2,...,x_N|y=0)p(y=0)\triangleq p(y=1)\cdot \Pi_{i=1}^N p(x_i|y=1)=0.00000032 \end{align}\] Then we consider that the email is more of a spam mail. ### Summary Today we have talked about how to run a Naive Bayes algorithm to decide if an email is a spam. We suppose each word in a mail is no of business of the other, which is a simple assumption named conditional independence assumption but not the reality(e.g. "coupon" maybe emerges with "save" and "money" due to their inner association). However, the algorthm is very effective. ### Future Thinking Suppose if a word "wooooo" haven't emerged in inbox, then the probability will reach $p1=p2=0$, how to solve it? Suppose you have a task to differiate oranges from apples and pears using color and shape, how to design the algorithm? suppose x is continuous rather than discrete, Naive Bayes still works or not?

Neural Network ABC

发表于 2017-03-11

Background

Deep learning is very popular recently, which is based on Neural Network, an old algorithm that had degraded for years but is resurging right now. We talk about some basic concept about Neural network today, hoping supply a intuitive perspective of it.

Before beginning, I'd like to introduce you an exicting product which help those who are blind see the world. BrianPort, which is invented by Wicab, uses you tougue to see the world. Tongue array contains 400 electrodes and is connected to the glasses. The product transfers from light to electric signal. More than 80% blind persons could pass through the block during the experiments.

In fact, Wicab takes advantage the mechanism of neural network of our brain. There are 86 billion neuron in our brain. We can smell, see, hear the world just because of these neurons. They are connect to each other to help us sense the world. Algorithm Neural Network is a way of mimic the mechanism of our brain.

Intuition

Let's start from the easiest model, we get $a_1$ in two steps: step1: $z_1=w_1x_1+w_2x_2+w_3x_3$ step2: $a_1=\frac{1}{1+e^{(-z)}}$ In addition, we add a bias $w_0$ to the calculate. After letting $x_0=1$, then: \[z=w_0x_0+w_1x_1+w_2x_2+w_3x_3\] We always add a bias at each layer but the last to Neural Network. If we contrast this model with logistic regression model, ww find that right now the to model is just the same: input every $x$ represents a feature. In logistic regression, we want to train a model $h_w(x)=\frac{1}{1+e^{-W^Tx}}$. The simpliest Neural Network, the model is a little complex, but if we do not take hidden layer into account, the model is just logistic regression.

Neural Network

To approach the authentic Neural Network, we add two more nerons($a_2^{(2)}$ and $a_1^{(3)}$) to logistic regression model. Notice that the model inner green triangle box is just like logistic regression demonstrated above. There are only two layers in Logistic Regression, in contrast, we can add more layers like L2 layer. In Neural Network, we call these layers hidden layers which are neither the input(e.g. layer have $x_1, x_2, x_2$), nor the output $h(x)$. The figure below has only one hidden layer, though we can add many hidden layers to the model. Look at the figure above, let's look at the definition of Neural Network, take $w_{12}^{(1)}$ for example, the subscript $_{12}$ represents the weight from the former layer $2nd$ unit to the current layer $1st$ unit. The superscript $^1$ represents former layer is layer L1. These $w$ are named weights of Neural Network. The sigmoid function $f=\frac{1}{1+e^{-x}}$ is activation function. We can choose other activation function such as symmetrical sigmoid $S(x)=\frac{1-e^{-x}}{1+e^{-x}}$. Now let's think about how to calculate $h(x)$, for the L2 layer, we have: \[\begin{align} & z_1^{(2)}=w_{10}^{(1)}x_0 + w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + w_{13}^{(1)}x_3\\ & z_2^{(2)}=w_{20}^{(1)}x_0 + w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + w_{23}^{(1)}x_3\\ & a_1^{(2)} = g(w_{10}^{(1)}x_0 + w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + w_{13}^{(1)}x_3)=g(z_1^{2})\\ & a_2^{(2)} = g(w_{20}^{(1)}x_0 + w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + w_{23}^{(1)}x_3)=g(z_2^{2}) \end{align}\] Here, $g()$ is the activation function. Notice that if we use matrices represent the equation, result will be simpler: \[a^{(2)} = g(z^{(2)}) = g(W^{(1)} a^{(1)})\] Here, we let $a_i^{(1)}=x_i$. We can conclude one more step, for layer k, we have: \[a^{(k)} = g(z^{(k)}) = g(W^{(k-1)} a^{(k-1)})\] Then for the L3 Layer, we have only one neural: \[\begin{align} h(x) = a_1^{3}=g(w_{10}^{(1)}a_0^{(2)} + w_{11}^{(1)}a_1^{(2)} + w_{12}^{(1)}a_2^{(2)})=g(z_1^{3}) \end{align}\] If we substitute $a_1^{2}$ and $a_2^{2}$ for elme $h(x)$, we have: \[\begin{align} h(x)=a_1^{3}=g(w_{10}^{(1)}\cdot 1 + w_{11}^{(1)} \cdot g(z_1^{(2)})+ w_{12}^{(1)}\cdot g(z_2^{(2)})) \end{align}\] The formula show that we use $g()$ function once and once again to nest the input, and compute the output eventaully. It is rather a non-linear classifier than linear classifier such as Linear Regression and Logistic Regression.

More Complicated Network

A Neural Network can be very complex, as long as we add more hidden layer into the network, the figure showed below is a neural network which has 20 layers, which means it has 1 input layer, 1 output layer and 18 hidden layers. From the connected weight we can imagine how much many weight we would calculate if we want to train such a big Neural Network. Notice that we add a bias subscript with zero on each layer except the output layer. And in each layer, we can add different amount of nerons. If we want to recognize numer image in zipcode from 0~9, we can design the Neural Network with 10 outputs in the output layer.

Simple Applications

This section, I'd like to construct a Neural Network to simulate a logic gate. Remember that bias $x_0$ is always $1$. Now let set $w_{10},$$w_{11}$ and $w_{12}$, and find what will h(x) become: \[w_{10}=-30\,,w_{11}=20\,,w_{12}=20\,\]

$x_1$	$x_2$	$z_1$	$a_1$
0	0	-30	0
0	1	-10	0
1	0	-10	0
1	1	10	1

Here we take advantge the property of sigmoid function, $g(-10)=4.5\times 10^{-5}\approx 0$ and $g(10)=0.99995\approx 1$. From the table we have constructed an $AND$ logic gate. It is easy to construct an $OR$ logic gate. We just set: \[w_{10}=-30\,,w_{11}=50\,,w_{12}=50\,\] Then we get an $OR$ logic gate. We can construct $NOR$ gate as well, just set: \[w_{10}=10\,,w_{11}=-20\,,w_{12}=-20\,\] Question: can we construct a $XOR$ gate? In fact, we can get a more powerful logic gate through adding more hidden layers. Only 2 layers of Neural Network can not construct a $XOR$ gate but 3 layers can. Neural Network shown below can implement function as $XOR$ logic gate. The weights matrices is as followed, we can testify through table listed. \[\begin{align} &W^{(1)}=\begin{bmatrix}-30&20&20\\ 10&-20&-20 \end{bmatrix}\\ &W^{(2)}=\begin{bmatrix}10&-20&-20 \end{bmatrix} \end{align}\]

$x_1$	$x_2$	$a_1^{(2)}$	$a_2^{(2)}$	$a_1^{(3)}$
0	0	0	1	0
0	1	0	0	1
1	0	0	0	1
1	1	1	0	0

From examples we have seen, hope you can gain intuition about Neural Network. We can generate more abstract features through adding hidden layers. ### Summerize Today we used Logistic Regression adding hidden layers to generate Neural Network. Then we talked about how to represent a Neural Network. In the end, we found that Neural Network can simulate logic gate. We do not talk about how to train a Neural Network here. Usually we use Backpropagation Algorithm to train a Neural Network.

Reference

https://www.coursera.org/learn/machine-learning
《Neural Networks》by Raul Rojas

Linear Regression for Trump

发表于 2017-03-04

Introduction

In the end of 2016, Trump occupied the presidency. We are always thinking about how he will build the wall between US and Mexico, Today, I'd like to compute how many resources he would cost if he truly began to build the wall using linear regression.

Data

The table below is some basic information about Chinese GreatWall as well as one item about Hadrian's Wall：

Age	people(1000)	Years	Length(KM)
Qin Dynasty	300	15	5000
Han Dynasty	500	100	5000
North Northern Dynasties	1800	12	2800
Sui Dynasty	1280	30	350
Ming Dynasty	3000	40	885
Hadrian's Wall	18	14	117

We use matrices represent the resources and the length of these walls. Each row of $x$ represents the quantity of (1000 people and year) men and years, and each row of $y$ denotes the length of walls(KiloMeter)

\[\begin{align} X=\begin{bmatrix} 5000\\ 5000\\ 2800\\ 350\\ 8851\\ 117 \end{bmatrix} y= \begin{bmatrix} 300\*15\\ 500\*100\\ 1800\*12\\ 1280\*30\\ 3000\*40\\ 18\*14 \end{bmatrix} \end{align}\]

Let's draw the picture of these data:

% octave
X=[5000;5000;2800;350;8851;117];
y=[300\*15;500\*100; 1800\*12;1280\*30;3000\*40; 18\*14];
xlabel("x (km)");
hold on;
ylabel("y (k people year)");
plot(X,y,"ro", "MarkerFaceColor", "b");

We hope find a mapping $x\rightarrow f(x)$ ($x$ denotes the length of walls, $y$ denotes the cost of resources)drawing a line as the figure above which meets all data best. When we encounter new data(e.g. new length of wall), we hope $f(x)$ will help us find how many resources will we cost. This is the goal of linear regression.

Definition

First of all, we assume that the line have the form as followed: \[h\_{\theta}(x)=\theta_0+\theta_1 x_1+\theta_2 x_2 + \cdots + \theta_nx_n\] Specifically, under our circumstance, there is only one $x$(the length of walls), then we will have the form as below: \[h\_{\theta}(x)=\theta_0+\theta_1x\] In fact, there are many factor infulence the outcome of the cost of wall besides people and time. Tools we use and the economy as well as terrain where building the wall. if we add these factors, maybe the assumption looks like: \[h\_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_2\] $x_1$denotes the length of wall, while $x_2$ and $x_3$ denote economy and terrian condition respectively. For simplicity, we consider only the length of wall.

In general, we add a $x_0=1$ to the equation, the we have: \[h\_{\theta}(x)=\theta_0x_0+\theta_1 x_1 + \cdots + \theta_nx_n =\sum\_{i=1}^{n}\theta_ix_i\] If we use matrices represent the equation, then \[\begin{align} \Theta=\begin{bmatrix} \theta_0\\ \theta_1\\ \cdots\\ \theta_n \end{bmatrix} \end{align}\] thus, \[\begin{align} h\_{\theta}(x)=\sum\_{i=1}^{n}\theta_ix_i= \begin{bmatrix} \theta_0\theta_1\cdots\theta_n \end{bmatrix} \begin{bmatrix} x_0\\ x_1\\ \cdots\\ x_n \end{bmatrix} =\Theta^T\overrightarrow{X} \end{align}\] Notice that $\Theta^T$ is a $1\times n$ matrices while $\overrightarrow{X}$ is a $n\times1$ matrices. We get a real number after multiply two matrices.

Cost Function

We hope find a way to find the best function $h\_\theta(x)$ fit these data. Notice that if $|(h\_\theta(x_i)-y_i)|=0$ for each $x$, then the function fit data absolutely. In practice, if we minimize $|(h\_\theta(x_i)-y_i)|$, we can find the best fit to original data. An optional cost function is square cost function: \[J(\theta)=\frac{1}{2m}\sum\_{i=1}^{m}(h\_\theta(x_i)-y_i)^2\] As long as we can minimize $J(\theta)$ with respect to $\theta$, can we find the best $\theta$ fit original data. The cost function looks like as followed:

% octave
m = size(X,1)
XX = [ones(m,1),X];
theta0 = linspace(-100,6000, 100);    
theta1 = linspace(-20,36, 100);
l0=length(theta0);                               
l1=length(theta1);                               
J_vs = zeros(l0,l1);                             
for i=1:l0                                             
    for j=1:l1                                         
        t = [theta0(i); theta1(j)];               
        J_vs(i,j) = cost(XX, y, t);             
    end                                                 
end                                                     
figure;                                                 
J_vs = J_vs';                                       
surfc(theta0, theta1, J_vs);                 
colorbar;                                             
xlabel('\theta_0');                                
ylabel('\theta_1');

The figure has a bowl shape, and we can find the minimum of it both using matrices approach and gradient descent.

Matrics Solution

Remember that the cost function $J(\theta)=\frac{1}{2m}\sum\_{i=1}^{m}(h\_\theta(x_i)-y_i)^2$, It is more convinient if we apply $J(\theta)$ matrics form, then: \[\begin{align} J(\theta)=\frac{1}{2m}(h\_\theta(x)-y)^T\cdot(h\_\theta(x)-y)\\ =\frac{1}{2m}(X\Theta-y)^T\cdot(X\Theta-y) \end{align}\] We can take the derivative of $J(\theta)$ and let it be $\overrightarrow{0}$. It is a little complex if we take the derivative of the last equation about $J(\theta)$. Let's do the derivation a easy way, but not a total Matric solution. If we do the derivation on the sum, we can get: \[\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum\_{i=1}^m(h\_\theta(x_i)-y_i)x_j\] Then, \[\begin{align} \frac{\partial}{\partial\Theta}J(\theta)= \begin{bmatrix} \frac{\partial}{\partial\theta_1}J(\theta)\\ \frac{\partial}{\partial\theta_2}J(\theta)\\ \frac{\partial}{\partial\theta_3}J(\theta)\\ \cdots\\ \frac{\partial}{\partial\theta_n}J(\theta) \end{bmatrix}=\overrightarrow{0} \end{align}\] Now, we transfer the sum into matrices form: \[\frac{\partial}{\partial\Theta}J(\theta)=X^T(X\Theta-y)=X^TX\Theta-X^Ty=\overrightarrow{0}\] thus, we can get the solution of $\Theta$: \[\Theta=(X^TX)^{(-1)}X^Ty\] Let's look at the solution of Matrices:

% octave
xlabel("x (km)");
hold on;
ylabel("y (k people year)");
plot(X,y,"ro", "MarkerFaceColor", "b");
a=linspace(0,10000,100);
t = pinv(XX'*XX)*XX'*y;
b=t'*[ones(1, length(a)); a];
plot(a,b);

And $\theta_0$ and $\theta_1$ is as followed: \[\begin{align} \Theta=\begin{bmatrix} 2573\\ 9.9\\ \end{bmatrix} \end{align}\] That means $h\_\theta(x)=2573+9.9x$, Right Now we subsitute 3169(km) for $x$: \[h\_\theta(3169)=2573+9.9\*3169=33946(k\ people\ year)\] In other word, Donald Trump need 3,394,600 people continously build 10 years in order to finish the GreatWall between US and Mexico. Surely he can stimulate 33,946,000 people, it only take him one year to finished the wall. ### Gradient Descent Solution Let's look at the cost function one more time. If we first choose a random $\Theta$, say we have $\theta_0\&\theta_1$ located in point 1. Assume you stand at ponit 1, you want to find a way to go down the vally. Surely you can not see the land-form completely. But you can look around at point 1 and find a steepest direction, then you go that way a baby step. Right now, you are at point 2 after the first step. You do the same find a steepest direction and make another baby step. After several steps, you are now standing at point 5. When you are looking around you find that where you are standing is almost flat. Now we can stop the iteration, and we have found a reasonable $\Theta$ which makes the $J(\theta)$ has a minmum value. What I described is a method named Gradient Descent which is very easy way to find minima of $J(\theta)$. The step is as followed:

Find a reasonable cost function J().
Take the partial derivative of $J(\theta)$ for each $\theta_j$
Choose a moderate $\alpha$ to constrain the step size.
Take a baby step, the direction is from the derivative and the size is determined by derivative and parameter $alpha$
update the value of $\Theta$, check if we find the minimum. If not, return to the step 4, otherwise, stop the algo

Concretely, the algorithm using octave is listed below:

% octave: The derivative function
function g = gradient(X, y, theta)
    m = size(X,1);
    hx = X*theta;
    g = (1/m)*X'*(hx-y);
end

% octave: I just take 100 steps without checking when to stop.
theta=[0;0];
alpha=0.00000003
for i=1:100
    g = gradient(XX,y,theta);
    theta = theta - alpha*g
end

Under the circumstance above, $\theta_0=0.0048$ and $\theta_1=10.33$. If we substitute the parameters for $h\_\theta(x)$, the answer that Trump would take to build the wall is 32745.56$$1000 people year. ### Summarize Today we have talk about how to use linear regression to model a problem of building Great Wall. We have also talk about how to resovle the problem through minimize the cost function both using Matrices solution and Gradient Descent.

Appendix

% octave --cost function:cost.m
function J=cost(X, y, theta)
m = size(X,1);
J=0;
hx=X*theta;
J = 1/(2*m)*(hx-y)'*(hx-y);

Linear Algebra (I)

发表于 2017-02-25

Motivation

When I was learning Linear Algebra in University, I found it was hard to understand what would applications of this course. I remember that we learned determinant on Chapter 1, then we learned a lot of other concept such as Rank and EigenValue. I could only memorize one after another theorems without knowing why these happened to appear this way. However, when I was learning matrices from lectures taught by Gilbert Strang, I found every conclusion is so nature that I could easily understand and associate a bunch of concepts with each other. I'd like to conclude what I had learned from him as well as hoping that someone can learn from that.

Why this subject exists?

First of all, what is the origin of linear algebra? Why we make the world increasingly complicated? The truth is that the purpose of mathmatics is to make the world simpler, not the opposite, Linear Algebra as well. For example, we want to solve equations with a lot of variables, the original way is to write all the parameter out： \[\begin{cases} 3x+4y+6z = 0\\\\ 2x+3y+4z = 24\\\\ x+y+z = 15 \end{cases}\]

If there are only 3 parameters, the problems is not so hard for us to handle. But what if there are 100 parameters? When we transfer the equation to matrices, the outcome becomes so grateful: \[\begin{align} \begin{bmatrix} 3&4&6\\\\ 2&3&4\\\\ 1&1&1 \end{bmatrix} \begin{bmatrix} x\\\\ y\\\\ z \end{bmatrix}=\begin{bmatrix}0\\\\24\\\\15\end{bmatrix} \end{align}\] ### Solve the equation group Now, we try to solve the problems, our instinction is to draw the three planes, under most circumstance, there will be a cross point, octave code and the plot is as followed:

octave:1> x=linspace(-20, 20, 100);
octave:2> [xx, yy] = meshgrid(x,y);
octave:3> z = 15-xx-yy;
octave:4> mesh(xx, yy, z);
octave:5> l = (-3*xx-4*yy)./6;
octave:6> mesh(xx,yy, l);
octave:7> t = (24-(2*xx+3*yy))./4;
octave:8> mesh(xx,yy,z);
octave:9> hold on;
octave:11> mesh(xx,yy,t);
octave:12> mesh(xx,yy,l);
octave:13> xlabel("x")
octave:14> ylabel("y")
octave:15> zlabel("z")

The answer of this equation group is that \[\begin{align} \begin{bmatrix} x\\\\ y\\\\ z \end{bmatrix}=\begin{bmatrix} -18\\\\ 72\\\ -39 \end{bmatrix} \end{align}\]x} \end{align}

Another perspective

We can find another perspective to solve this equation: see the equation as a combination of columns, that means: \[\begin{align} x\*\begin{bmatrix} 3\\\\ 2\\\\ 1 \end{bmatrix}+y\* \begin{bmatrix} 4\\\\ 3\\\\ 1 \end{bmatrix}+z\* \begin{bmatrix} 6\\\\ 4\\\\ 1 \end{bmatrix}= \begin{bmatrix} 0\\\\ 24\\\\ 15 \end{bmatrix} \end{align}\] Now, we can acquire the new vector through three vectors' combination, from the plot blow demonstrates that we need to find a combination of red&blue&black vectors to generate another vector(the green one). Obviously, $x=-18,y=72,z=-39$ is the only solution of the combination problem.

octave:1> a = quiver3(0, 0, 0, 3,2,1);
octave:2> hold on;
octave:3> b = quiver3(0, 0, 0, 4,3,1);
octave:4> c = quiver3(0, 0, 0, 6,4,1);
octave:5> set(b, "Color", "r");
octave:6> set(c, "Color", "k");
octave:7> set(a, "maxheadsize", 0.02);
octave:8> set(b, "maxheadsize", 0.02);
octave:9> set(c, "maxheadsize", 0.02);
octave:10> d = quiver3(0, 0, 0, 0,24,15);
octave:11> set(d, "Color", "g");
octave:12> set(d, "maxheadsize", 0.02);

How to solve equations using program, and is there always an answer of equations? An easy way is to do elimination of these there equations. What now I want to do right now is to change these equation as follows: \[\begin{cases} 3x+4y+6z = 0\\\\ (1/3)y+0z = 24\\\\ -z = 39 \end{cases}\]

Step1, calclate z, step2, calclate y, step3, calclate x. We first leave a variables z alone with an answer in the third equation. Then, we substitute z to the second equation, and get y. At last we substitute x and y to the first equation and get x. if we represent above through matrices language, it is like this: \[\begin{align} \begin{bmatrix} 3&4&6\\\\ 0&1/3&0\\\\ 0&0&-1 \end{bmatrix} \begin{bmatrix} x\\\\ y\\\\ z \end{bmatrix}= \begin{bmatrix} 0\\\\ 24\\\\ 39 \end{bmatrix} \end{align}\] The triangle at bottom-left side is all zeros which ensure we can solve equations one by one. The process is named back-substitution. ### Additional thinkings But what if the equation group looks like: \[\begin{align} \begin{bmatrix} 1&2&3\\\\ 1&2&3\\\\ 1&2&3 \end{bmatrix} \begin{bmatrix} x\\\\ y\\\\ z \end{bmatrix}= \begin{bmatrix} 5\\\\ 6\\\\ 7 \end{bmatrix} or \begin{bmatrix} 3&4\\\\ 4&5\\\\ 5&6\\\\ 6&7\\\\ 7&9 \end{bmatrix} \begin{bmatrix} x\\\\ y \end{bmatrix}= \begin{bmatrix} 5\\\\ 5\\\\ 5\\\\ 5\\\\ 5 \end{bmatrix} \end{align}\]
Next time we will talk about LU decomposition and Rank of a matrix.

Gradient Descent in Logistic Regression

发表于 2017-02-18

As a simple model, Logistic regression is very popular in Machine Learning, especially in computer industry while gradient descent is more of popularity as well among dozens of optimization methods. The aim of this article is to demonstrate how to reach these formulas conclusion.

1. Model a problem

First of all, we know that Logistic regression is a statistical model. Suppose we have a coin, we'd like to know the probability of the head and tail. Bernoulli distribution is a good distribution to model the coin. The hyphothsis is as followed: \[P(y=1|\mu)=\mu \qquad(1)\].

$y=1$ means the probability of the head, then $y=0$ is that of the tail. $\mu$ here is the parameter, if $\mu=0.5$, then the probability of head equals to 0.5, which means we have half the probabilty reach the head as well as the tail. Now we have: \[P(y=0|\mu)=1-\mu\qquad(2)\]

If we take a look at these two formula, we can conclude a more general formula: \[P(y|\mu)={\mu}^y\cdot{(1-\mu)}^{(1-y)} \qquad(3)\]

Let's test it, $y$ only have two value $0$ and $1$, if $y=1$, $P(y|\mu)=\mu$, othewise, $P(y|\mu)=1-\mu$. Suppose we have a dataset $D=\{y\_1,y\_2,y\_3...y\_m\}$ observed values, we want to estimate the parameter $\mu$, the problems become following question: \[P(\mu|D)=?\qquad (4)\]

2. Estimate Parameter

We use Bayes formula to transfer the problem to another: \[P(\mu|D)=\frac{P(D|\mu)\cdot P(\mu)}{P(D)}=\frac{P(y\_1,y\_2,y\_3...,y\_m|\mu)\cdot P(\mu)}{P(D)}\qquad (5)\]

Here denominator $P(D)$ is a constant as well as $P(\mu)$ if we see $\mu$ as variable rather than a distribution. Then we have: \[P(\mu|D)\triangleq P(y\_1,y\_2,y\_3...,y\_m|\mu) \qquad(6)\]

We can find a series of $\mu$(e.g, $\mu=0.1, \mu=0.72$), but we should find the maximum $P(\mu|D)$, because when $P(\mu|D)$ reaches its max means $\mu$ most likely is the right parameter. We assume that each $y$ is independent from the others given the parameter $\mu$. then we have: \[P(\mu|D)\triangleq P(y\_1|\mu)P(y\_2|\mu)P(y\_3|\mu)...,P(y\_m|\mu)={\prod}\_{i=1}^{m}P(y\_i|\mu)\qquad(7)\]

Now, the problem is to maximize the ${\prod}\_{i=1}^{m}P(y\_i|\mu)$, that is: \[L = \underset{\mu}{argmax}{\prod}\_{i=1}^{m}P(y\_i|\mu)=\underset{\mu}{argmax}{\prod}\_{i=1}^{m}[{\mu}^y\cdot{(1-\mu)}^{(1-y)}] \qquad(8)\]

It is a little hard to find the maximum, we change the problem to another way: \[L =\underset{\mu}{argmax}ln({\prod}\_{i=1}^{m}[{\mu}^y\cdot{(1-\mu)}^{(1-y)}])\qquad(9)\]

Then we have: \[L = \underset{\mu}{argmax}{\sum}\_{i=1}^{m}[{y ln(\mu}) + (1-y) ln({1-\mu)}]\qquad(10)\]

3. Logistic Regression

If we let $\mu=h\_{\theta}(x)$, then we have:

\[L = \underset{\theta}{argmax}{\sum}\_{i=1}^{m}[{y ln(h\_{\theta}(x)}) + (1-y) ln({1-h\_{\theta}(x))}]\qquad(11)\]

Here, \[h\_{\theta}(x)=\frac{1}{1+e^{-(\theta\_0 x\_0+\theta\_1 x\_1+\theta\_2 x\_2+\theta\_m x\_m)}}=\frac{1}{1+e^{-{\Theta}X}}\qquad (12)\]

$h\_{\theta}(x)$ is named sigmoid function and now we use parameter $\theta$ to estimate $\mu$. Simgoid function is a pretty good function, derivative of which is elegant. we let $\sigma=\frac{1}{1+e^{-x}}$, then:

\[\begin{align} \frac{\partial{\sigma}}{\partial x} &= \frac{-1}{(1+e^{-x})^2}\cdot e^{-x}\cdot(-1)\\\\ &=\frac{e^{-x}}{(1+e^{-x})^2}=\frac{1+e^{-x}-1}{(1+e^{-x})^2}\\\\ &=\frac{1}{1+e^{-x}}-\frac{1}{(1+e^{-x})^2}\\\\ &=\sigma(1-\sigma)\qquad \qquad (13) \end{align}\]

4. Gradient Descent

We transfer the maximizing to a minimizing problem, we define the cost function:

\[J(\theta) = -\frac{1}{m}{\sum}\_{i=1}^{m}[{yln(h\_{\theta}(x)}) + (1-y)ln({1-h\_{\theta}(x))}]\qquad (14) \]

if we want to minimize $J(\theta)$, we need know the gradient of $\frac{\partial J(\theta)}{\partial\theta\_j}$, that is: \[\begin{align} \frac{\partial J(\theta)}{\partial\theta\_j} &=\frac{\partial}{\partial\theta\_j}(-\frac{1}{m}{\sum}\_{i=1}^{m}[yln(h\_\theta(x)) + (1-y)ln(1-h\_\theta(x))])\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y}{h\_\theta(x)} + \frac{y-1}{1-{h\_\theta(x)}} ]\frac{\partial h\_\theta (x)}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[{\frac{y}{\sigma}} + \frac{y-1}{1-\sigma} ]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y-y\sigma}{\sigma(1-\sigma)} + \frac{\sigma y-\sigma}{\sigma(1-\sigma)}]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y-\sigma}{\sigma\cdot(1-\sigma)}]\frac{\partial\sigma}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y-\sigma}{\sigma(1-\sigma)}]\cdot\sigma(1-\sigma)\frac{\partial\Theta X}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[\frac{y-\sigma}{1}]\frac{\partial\Theta X}{\partial\theta\_j}\\\\ &= -\frac{1}{m}{\sum}\_{i=1}^{m}[y-\sigma]{X\_j}\\\\ &= \frac{1}{m}{\sum}\_{i=1}^{m}[h\_{\theta}(x)-y]{X\_j} \qquad(15) \end{align}\]

Notice that the gradient descent is same as that of square cost function in Linear Regression, which is an extremely grace equation. ### Summarize Today We have talked about how to derive cost function of Logistic Regression, then we get the gradient descent equation of it. Next step is to use gradient descent to iterate the minimum of the cost function.

Reference

https://www.coursera.org/learn/machine-learning/home/welcome
Parameter estimation for text analysis, Gregor Heinrich
Pattern Recognition and Machine Learning, Christopher M. Bishop

\(x_1\)	\(x_2\)	\(a_1^{(2)}\)	\(a_2^{(2)}\)	\(a_1^{(3)}\)
0	0	0	1	0
0	1	0	0	1
1	0	0	0	1
1	1	1	0	0

简介

最小方差，无偏估计量

参数偏差

参数方差

最大似然估计

均值已知的方差估计

参数估计

表现评价

均值未知的方差估计

参数估计

表现评价

修正偏差

总结

Preface

Introduction

Iris data

Entropy and Information Gain

Entropy

Conditional Entropy and Iris Data

ID3(Iterative Dichotomiser 3)

Summary

Reference

Appendix code

Preface

Introduction

Linear transformation

Exercise

EigenVector and EigenValue

A Problem

Intuition

Exercise

Singular Value Decompositon

Intuition

Deduction

Image Compression

Summary

Reference

版权声明

长一点儿的答案

版权声明

第一章

前言

第二章

数学背景知识

2.1.1 标准差

2.1.2 方差

练习

2.1.3 协方差

2.1.4

练习

2.2 矩阵代数

2.2.1 特征向量

2.2.2 特征值

练习

第三章 主成分分析（Principal Components Analysis）

3.1 方法

第一步：数据集

第三步：计算协方差矩阵

练习

第四章 计算机视觉应用

4.1 表示

附录 A

实现代码

Introduction

Intuition

Definitioin

Naive Bayes

Background

Intuition

Neural Network

More Complicated Network

Simple Applications

Reference

Introduction

Data

Definition

Cost Function

Matrics Solution

Appendix

Motivation

第三章主成分分析（Principal Components Analysis）

第四章计算机视觉应用