Preface

The graph Laplacian is familiar to computational science researchers; with it we can perform spectral analyses such as diffusion maps, eigenmaps, or spectral clustering. Here, we discuss how to generalize the graph Laplacian to its higher-order form, the Hodge Laplacian.

Graph laplacian

In the previous post we saw that the graph Laplacian can be obtained from the degree matrix \(D\) and the adjacency matrix \(A\) such that: \[\begin{equation} L_0 = D - A \end{equation}\]

We can find the adjacency matrix and degree matrix of the graph below with nine vertices:

such that:

\[\begin{equation} A = \begin{array}[r]{c |c c c c c c c c c } & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9\\ \hline 1& 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 2& 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 3& 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 4& 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0\\ 5& 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\ 6& 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0\\ 7& 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ 8& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\ 9& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ \end{array} \end{equation}\]

and

\[\begin{equation} D = \begin{array}[r]{c |c c c c c c c c c } & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9\\ \hline 1& 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 2& 0 & 2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 3& 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 4& 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0 & 0\\ 5& 0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 & 0\\ 6& 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0\\ 7& 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\ 8& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 9& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\ \end{array} \end{equation}\]

The graph Laplacian can thus be calculated as:

\[\begin{equation} L_0 = D - A = \begin{array}[r]{c |c c c c c c c c c } & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9\\ \hline 1& 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 2& -1 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 0\\ 3& 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 4& 0 & 0 & 0 & 2 & -1 & -1 & 0 & 0 & 0\\ 5& 0 & 0 & 0 & -1 & 2 & -1 & 0 & 0 & 0\\ 6& 0 & 0 & 0 & -1 & -1 & 3 & -1 & 0 & 0\\ 7& 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0\\ 8& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 9& 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1\\ \end{array} \end{equation}\]

However, there is another way to represent the graph Laplacian. Before that, we first introduce the incidence matrix of a graph. First, we can convert an undirected graph to a directed one by assigning edge directions in bookkeeping order. Next, we construct the incidence matrix: each row represents a vertex and each column an edge. We set the entry to 1 if the edge enters the vertex, -1 if it leaves the vertex, and 0 if the vertex and the edge are not incident. Thus the incidence matrix of a graph \(G\) is \(B_1\) such that: \[\begin{equation} B_1 = \begin{array}[r]{c | c c c c c c c} & \cdots & [4,5] & [4,6]& [5,6]& [6,7]& \cdots\\ \hline \vdots & \cdots & & \cdots & & \cdots & \\ 4 & \cdots & -1 & -1 & 0 & 0 & \cdots\\ 5 & \cdots & 1 & 0 & -1 & 0 & \cdots\\ 6 & \cdots & 0 & 1 & 1 & -1 & \cdots\\ 7 & \cdots & 0 & 0 & 0 & 1 & \cdots\\ \vdots & \cdots & & \cdots & & \cdots & \\ \end{array} \end{equation}\] Next, let's construct the graph Laplacian using \(B_1\). In fact, the graph Laplacian is: \[\begin{equation} L_0=B_1 B_1^\top = \begin{array}[r]{c |c c c c c c c c c } & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9\\ \hline 1& 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 2& -1 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 0\\ 3& 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 4& 0 & 0 & 0 & 2 & -1 & -1 & 0 & 0 & 0\\ 5& 0 & 0 & 0 & -1 & 2 & -1 & 0 & 0 & 0\\ 6& 0 & 0 & 0 & -1 & -1 & 3 & -1 & 0 & 0\\ 7& 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0\\ 8& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1\\ 9& 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1\\ \end{array} \end{equation}\] As we can see, this \(L_0\) is the same as the one obtained from the degree matrix and the adjacency matrix.
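To make this concrete, here is a minimal NumPy sketch (the variable names are mine) that builds \(B_1\) for the nine-vertex graph from its edge list, oriented in bookkeeping order, and checks that \(B_1 B_1^\top\) equals \(D - A\):

import numpy as np

# edges oriented by bookkeeping order (tail -> head), 1-based vertex labels
edges = [(1, 2), (2, 3), (4, 5), (4, 6), (5, 6), (6, 7), (8, 9)]
n = 9

B1 = np.zeros((n, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u - 1, j] = -1   # edge leaves u
    B1[v - 1, j] = 1    # edge enters v

A = np.zeros((n, n))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1
D = np.diag(A.sum(axis=1))

L0 = B1 @ B1.T
print(np.allclose(L0, D - A))   # True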

Zero eigenvalues of the graph Laplacian

We know that for applications such as the diffusion map or spectral clustering, we have to drop the first eigenvector because its entries are all the same and its eigenvalue is 0; the second eigenvector is also called the Fiedler vector. However, these applications usually work on a connected graph, while the graph shown here is disconnected. We can stop here and take a guess: is the zero eigenvalue still there, and if so, how many zero eigenvalues are there?

Now, let's perform the eigenvalue decomposition: \[\begin{equation} L_0 = Q\Lambda Q^\top \end{equation}\]

Interestingly, the smallest three eigenvalues are all zero, and we also have three connected components in the graph. Is this a coincidence or a property of the graph Laplacian decomposition?

Furthermore, the three eigenvectors corresponding to the zero eigenvalues are also interesting. Notice that the first eigenvector differentiates the component \(\{4,5,6,7\}\) from the rest of the vertices. Likewise, the second eigenvector selects the connected component \(\{8,9\}\) and the third the component \(\{1,2,3\}\).
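A quick numerical check, reusing the \(L_0\) from the sketch above (the tolerance is an arbitrary choice to absorb floating-point error):

evals, evecs = np.linalg.eigh(L0)                   # L0 is symmetric and PSD
print(np.sum(np.isclose(evals, 0.0, atol=1e-10)))   # 3 zero eigenvalues, one per connected component
# the zero eigenspace is spanned by the indicator vectors of the components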

In fact, in the field of Topological Data Analysis (TDA), \(L_0\) is a special case of the Hodge Laplacian (\(L_k\)). Connected components are called 0-dimensional cycles, and the graph Laplacian can capture these cycles. The 1-dimensional cycles correspond to holes; we will go into detail in the next section.

Hodge laplacian

We can view the graph Laplacian as the zero-order Hodge Laplacian, and the formula can be written as: \[\begin{equation} L_0 = \mathbf{0}^\top \mathbf{0} + B_1 B_1^\top \end{equation}\]

Similarly, we can obtain \(L_1\):

\[\begin{equation} L_1 = B_1^\top B_1 + B_2 B_2^\top \end{equation}\]

You may ask where \(B_2\) comes from. We know that \(B_1\) captures the relationship between vertices and edges; likewise, \(B_2\) captures the relationship between edges and triangles. We can also define \(B_3\) to capture the relationship between triangles and tetrahedra, and so on. So what is a triangle or a tetrahedron in a graph? We will not go into the theory of simplices and simplicial complexes here; we just need to know that three pairwise-connected vertices form a triangle, and four mutually connected vertices form a tetrahedron, the higher-order analogue of a triangle. To define \(B_2\) for a graph \(G\), we check the direction of an edge \(e_j\) relative to the triangle \(\bigtriangleup_q\) it belongs to: if it has the same direction as the triangle, the entry is \(1\); if the direction is opposite, the entry is \(-1\); otherwise the entry is zero. Specifically:

\[\begin{equation} {B}_2[j, q] = \begin{cases} 1 & \text{if } e_j \in \bigtriangleup_q \quad \text{with}\quad \text{same}\quad \text{direction} \\ -1 & \text{if } e_j \in \bigtriangleup_q \quad \text{with}\quad \text{opposite}\quad \text{direction} \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

With this definition, we can obtain \(B_2\) of the aforementioned graph: \[\begin{equation} B_2 = \begin{array}[r]{c | c } & [4,5,6]\\ \hline \vdots & \cdots \\ [4,5] & 1\\ [4,6] & -1\\ [5,6]& 1\\ [6,7] & 0\\ \vdots & \cdots \\ \end{array} \end{equation}\]
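As a small sketch reusing `B1` and `edges` from the earlier code (the graph has a single triangle, [4,5,6]), \(L_1\) can be assembled directly from the two incidence matrices:

# columns of B1 are the edges in bookkeeping order:
# [1,2], [2,3], [4,5], [4,6], [5,6], [6,7], [8,9]
B2 = np.zeros((len(edges), 1))         # one triangle: [4,5,6]
B2[edges.index((4, 5)), 0] = 1         # [4,5] agrees with the triangle orientation
B2[edges.index((4, 6)), 0] = -1        # [4,6] is oriented against it
B2[edges.index((5, 6)), 0] = 1         # [5,6] agrees

L1 = B1.T @ B1 + B2 @ B2.T             # Hodge 1-Laplacian
print(L1.shape)                        # (7, 7): one row/column per edge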

We next introduce the normalized form and the decomposition of the Hodge 1-Laplacian. The normalized form of the Hodge 1-Laplacian is given by:

\[\begin{equation} \mathcal{L}_1 = {D}_2 {B}_1^\top {D}_1^{-1} {B}_1 + {B}_2 {D}_3 {B}_2^\top {D}_2^{-1} \end{equation}\] where \({D}_1\) is the vertex degree matrix, \({D}_2\) is \(\max{(\text{diag}(|{B}_2| \mathbf{1}), \mathbf{I})}\) and \({D}_3\) is \(\frac{1}{3}\mathbf{I}\). Since the normalized \(L_1\) is not necessarily symmetric, we next define the symmetric normalized Hodge 1-Laplacian such that: \[\begin{equation} \begin{aligned} \mathcal{L}_1^s & = {D}_2^{-1/2} \mathcal{L}_1 {D}_2^{1/2}\\ & = {D}_2^{1/2} {B}_1^\top {D}_1^{-1} {B}_1 {D}_2^{1/2} + {D}_2^{-1/2} {B}_2 {D}_3 {B}_2^\top {D}_2^{-1/2} \end{aligned} \end{equation}\]

We use a graph with three holes to illustrate the Hodge 1-Laplacian:

We can then perform an eigenvalue decomposition of \(\mathcal{L}_1\): \[\begin{equation} \begin{aligned} \mathcal{L}_1 & = \mathbf{D}_2^{1/2} \mathcal{L}_1^s \mathbf{D}_2^{-1/2} \\ & = \mathbf{D}_2^{1/2} Q \Lambda Q^\top \mathbf{D}_2^{-1/2} \\ & = \mathbf{U} \Lambda \mathbf{U}^{-1} \end{aligned} \end{equation}\] Interestingly, the smallest three eigenvalues are again all zero, corresponding to the three holes, namely the three 1-dimensional cycles.

Looking at the eigenvectors, we also notice that the three eigenvectors with zero eigenvalues are localized around the three holes.

In Hodge theory, these eigenvectors with zero eigenvalues are called harmonic functions, or harmonics. The harmonic functions around holes are useful for analyses such as clustering to find the 1-dimensional cycles.

Conclusion

Today we reviewed the graph Laplacian and found that the zero eigenvalues and their corresponding eigenvectors can be used to find connected components. The higher-order graph Laplacian, the Hodge Laplacian, has similar properties: we presented the Hodge 1-Laplacian and its eigenvalue decomposition and found that the number of zero eigenvalues indicates the number of holes of a graph. Furthermore, the corresponding eigenvectors with zero eigenvalues are localized around the holes.

Diffusion map

Eigenvalue decomposition of the weighted graph Laplacian is used to obtain the diffusion map. The asymmetric matrix, or transition matrix, is \(P=D^{-1}W\). The decomposition is: \[\begin{equation} P = D^{-1/2}S\Lambda S^\top D^{1/2} \end{equation}\] In fact, a random walk on the graph is governed by the transition matrix such that: \[\begin{equation} p^{t+1}= p^t D^{-1}W = p^t P \end{equation}\] where \(p\) is an initial state of the graph (vertex weights). That is, after a certain number of random walk steps on the graph, we reach a steady state where further steps no longer change the weights. Let \(Q=D^{-1/2}S\), then \(Q^{-1}=S^{\top} D^{1/2}\), and we can write \(P = Q\Lambda Q^{-1}\). The random walk above can then be represented as: \[\begin{equation} P^t = Q\Lambda Q^{-1}Q\Lambda Q^{-1}\cdots Q\Lambda Q^{-1} = Q\Lambda^t Q^{-1} \end{equation}\] Since \(\Lambda\) is diagonal, the random walk on a graph can be easily calculated using the eigenvalue decomposition. The columns of \(Q\) are the diffusion map dimensions.
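A minimal sketch of this computation, assuming `W` is a symmetric affinity matrix (the function name is mine):

import numpy as np

def transition_power(W, t):
    """Return P^t for P = D^{-1} W via the symmetric eigendecomposition."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    D_sqrt = np.diag(np.sqrt(d))
    Ms = D_inv_sqrt @ W @ D_inv_sqrt        # P' = D^{-1/2} W D^{-1/2}, symmetric
    lam, S = np.linalg.eigh(Ms)
    Q = D_inv_sqrt @ S                      # right eigenvectors of P
    Q_inv = S.T @ D_sqrt
    return Q @ np.diag(lam ** t) @ Q_inv    # P^t = Q Lambda^t Q^{-1}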

Diffusion map example

To understand the diffusion map, we introduce an example dataset and run the diffusion map on it. The data is a single-cell fibroblast-to-neuron dataset with 392 cells. We first download the data from NCBI and load it into memory:

import requests
import pandas as pd

url = 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE67310&format=file&file=GSE67310%5FiN%5Fdata%5Flog2FPKM%5Fannotated.txt.gz'
r = requests.get(url, allow_redirects=True)
open('GSE67310.txt.gz', 'wb').write(r.content)
data = pd.read_csv('GSE67310.txt.gz', sep='\t', index_col=0)
data = data.loc[data['assignment'] != 'Fibroblast']
group = data['assignment']

As we can see from the code, we extract the cell types from the data and store them in the variable group. The rest of the data is the normalized gene expression count matrix.

For pre-processing, we take the log-normalized count matrix \(X\), revert it to counts, apply log geometric scaling, and store the result in the matrix \(Y\).

import numpy as np

X = np.array(data.iloc[:, 5:]).T
X = np.power(2, X[np.apply_along_axis(np.var, 1, X) > 0, :]) - 1
Y = np.log(gscale(X + 0.5)).T
The geometric scaling implementation is:
from sklearn.preprocessing import StandardScaler

def gscale(X: np.ndarray) -> np.ndarray:
    """Geometric scaling: divide by per-gene geometric means, then scale by the median ratio."""
    assert X.all() >= 0
    div_ = np.divide(X.T, np.apply_along_axis(lambda x: np.exp(np.mean(np.log(x))), 1, X)).T
    scale_ = np.apply_along_axis(np.median, 0, div_)
    sc = StandardScaler(with_mean=False)
    sc.fit(X)
    sc.scale_ = scale_
    return sc.transform(X)
After pre-processing, we run PCA for dimensionality reduction, use the principal components to calculate Euclidean distances between cells, and then run the diffusion map.

from scipy.spatial import distance_matrix

pc = run_pca(Y, 100)
R = distance_matrix(pc, pc)
d = diffusionMaps(R, 7)

We first take a look at the PCA figure, which shows the first 2 PCs of the dataset. PCA is a linear method and cannot reflect the cell fate differentiation process well. Since the diffusion map applies a Gaussian kernel to the distances, it is usually better at capturing cell differentiation events. Here we show diffusion dimensions 1,2 and 1,3. It is clear that dimensions 1,2 capture the differentiation from fibroblasts to neurons, and dimensions 1,3 capture the differentiation from fibroblasts to myocytes.

We can change the bandwidth from 7 to 100, i.e., use the distance to the 100th nearest neighbor as the bandwidth instead of the 7th. The following shows the diffusion map with bandwidth=100.

pseudo time by random walk

The eigenvalue decomposition of the graph Laplacian can also be used to infer a pseudotime that simulates the cell differentiation process. We set the initial state to \(1/m\) for the \(m\) progenitor cells (MEF) and \(0\) for all other cells. Next, we perform a random walk to get the new pseudotime \(u\) such that: \[\begin{equation} u = [\frac{1}{m}, \frac{1}{m}, \cdots, 0, \cdots 0](D^{-1}W)^t \end{equation}\] By testing different numbers of random walk steps we can check the new pseudotime \(u\). Here we show time steps equal to 1, 10, 100 and 200. From the figure we can notice that after a certain number of steps the pseudotime no longer changes, which means the random walk has reached its steady state.
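A small sketch of this initialization and propagation, assuming `W` is the affinity matrix and `progenitor_idx` holds the indices of the \(m\) progenitor (MEF) cells (both names are mine):

import numpy as np

def random_walk_pseudotime(W, progenitor_idx, t=100):
    P = W / W.sum(axis=1, keepdims=True)              # transition matrix D^{-1} W
    u = np.zeros(W.shape[0])
    u[progenitor_idx] = 1.0 / len(progenitor_idx)     # 1/m on progenitor cells, 0 elsewhere
    for _ in range(t):                                # t steps of random walk
        u = u @ P
    return u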

To test when we reach the steady state, I use the graph mentioned in my last post, Spectral analysis (1):

Here we randomly generate two initial states (\(v_1\), \(v_2\)) for the random walk such that: \[\begin{equation} \begin{aligned} v_1 &= [0.15,0.41,0.54,0.9,0.62,0.93,0.1,0.46,0.01,0.88]\\ v_2 &= [0.89,0.93,0.07,0.41,0.52,0.88,0.43,0.09,0.1,0.2] \end{aligned} \end{equation}\]

From the figure below we can see that \(v_1\) reaches the steady state in 30 steps whereas \(v_2\) reaches it in 100 steps. In any case, both initial states eventually reach a steady state that no longer changes.

spectral clustering

The eigenvectors of the graph Laplacian can also be used for clustering. From the figure we can find a clear cut using the 1st, 2nd and 3rd diffusion dimensions. In practice, we can use k-means, DBSCAN, Leiden, or Louvain algorithms to perform clustering on the diffusion dimensions.
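As an example, here is a minimal sketch assuming `d` is the dictionary returned by the `diffusionMaps` function in the appendix, with `psi` holding the diffusion dimensions; three clusters is an arbitrary choice:

from sklearn.cluster import KMeans

# use diffusion dimensions 1-3 (dimension 0 corresponds to the trivial top eigenvector)
emb = d['psi'][:, 1:4]
labels = KMeans(n_clusters=3, random_state=2022, n_init=10).fit_predict(emb)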

Appendix

PCA function
import scipy.sparse
from sklearn.decomposition import PCA, TruncatedSVD

def run_pca(mtx, n_components=2, random_state=2022):
    """Dimensionality reduction: TruncatedSVD for sparse input, PCA otherwise."""
    dm = None
    if scipy.sparse.issparse(mtx):
        clf = TruncatedSVD(n_components, random_state=random_state)
        dm = clf.fit_transform(mtx)
    else:
        pca = PCA(n_components=n_components, random_state=random_state)
        dm = pca.fit_transform(mtx)
    return dm
Affinity matrix function
import numpy as np

def affinity(R, k=7, sigma=None, log=False):
    """
    Gaussian affinity matrix constructor
    W = exp(-r_{ij}^2/sigma)
    """
    def top_k(lst, k=1):
        # distance to the k-th nearest neighbour, used as a local bandwidth
        assert len(lst) > k
        return np.partition(lst, k)[k]

    R = np.array(R)
    if not sigma:
        s = [top_k(R[:, i], k=k) for i in range(R.shape[1])]
        S = np.sqrt(np.outer(s, s))   # sigma_i * sigma_j
    else:
        S = sigma
    logW = -np.power(np.divide(R, S), 2)
    if log:
        return logW
    return np.exp(logW)
Diffusion map function
import numpy as np
from scipy.special import logsumexp

def diffusionMaps(R, k=7, sigma=None):
    """
    Diffusion map (Coifman, 2005)
    https://en.wikipedia.org/wiki/Diffusion_map
    ----------
    dic:
        psi: right eigenvectors of P = D^{-1/2} * evec
        phi: left eigenvectors of P = D^{1/2} * evec
        eig: eigenvalues
    """
    k = k - 1  ## k is the R-version index minus 1 for 0-based indexing
    logW = affinity(R, k, sigma, log=True)
    rs = np.exp([logsumexp(logW[i, :]) for i in range(logW.shape[0])])  ## d_ii = sum_j w_ij
    D = np.diag(np.sqrt(rs))         ## D^{1/2}
    Dinv = np.diag(1/np.sqrt(rs))    ## D^{-1/2}
    Ms = Dinv @ np.exp(logW) @ Dinv  ## D^{-1/2} W D^{-1/2}
    e = np.linalg.eigh(Ms)           ## eigendecomposition of P'
    evalue = e[0][::-1]
    evec = np.flip(e[1], axis=1)
    s = np.sum(np.sqrt(rs) * evec[:, 0])  # scaling
    # Phi is orthonormal under the weighted inner product
    # 0: Psi, 1: Phi, 2: eig
    dic = {'psi': s * Dinv @ evec, 'phi': (1/s) * D @ evec, "eig": evalue}
    return dic

Introduction

Newton used a glass prism to separate white sunlight into red, orange, yellow, green, blue, indigo, and violet. Light spectrum analysis helps scientists interpret the world; for example, we can detect the elements of our solar system as well as of far-away stars in the universe. Spectral analysis is also a field of mathematics. In graph theory, the Laplacian matrix is used to represent a graph. We can obtain features from the undirected graph below (wikipedia).

For example, we can check the degree of each vertex, which forms our Degree matrix \(D\) such that: \[\begin{equation} D = \begin{pmatrix} 2 & 0 & 0 & 0 & 0 & 0\\ 0 & 3 & 0 & 0 & 0 & 0\\ 0 & 0 & 2 & 0 & 0 & 0\\ 0 & 0 & 0 & 3 & 0 &0\\ 0 & 0 & 0 & 0 & 3 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \end{equation}\]

By checking the connections between all pairs of nodes, we can create an adjacency matrix:

\[\begin{equation} A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 0\\ 1 & 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 1 &1\\ 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix} \end{equation}\]

graph laplacian

The graph Laplacian \(L\) collects this information about the graph: \[\begin{equation} L=D-A =\begin{pmatrix} 2 & -1 & 0 & 0 & -1 & 0\\ -1 & 3 & -1 & 0 & -1 & 0\\ 0 & -1 & 2 & -1 & 0 & 0\\ 0 & 0 & -1 & 3 & -1 & -1\\ -1 & -1 & 0 & -1 & 3 & 0\\ 0 & 0 & 0 & -1 & 0 & 1 \end{pmatrix} \end{equation}\]

In fact, the graph Laplacian matrix is symmetric and positive semidefinite (PSD), which means that if we perform an eigenvalue decomposition, the eigenvalues are all real and nonnegative. We can normalize the graph Laplacian by left-multiplying by \(D^{-1}\), which gives all \(1\)s on the diagonal of the matrix, namely: \[\begin{equation} \text{norm}({L}) = D^{-1}L = \begin{pmatrix}1 & -0.50 & 0 & 0 & -0.50 & 0\\ -0.33 & 1 & -0.33 & 0 & -0.33 & 0\\ 0 & -0.50 & 1 & -0.50 & 0 & 0\\ 0 & 0 & -0.33 & 1 & -0.33 &-0.33\\ -0.33 & -0.33 & 0 & -0.33 & 1 & 0\\ 0 & 0 & 0 & -1 & 0 & 1 \end{pmatrix} \end{equation}\] This is the random-walk normalized Laplacian, closely related to the transition matrix introduced below. However, the matrix does not keep the symmetry and PSD properties after this normalization. We will come back to the spectrum of the normalized graph Laplacian.
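The following sketch reproduces these matrices with NumPy (the adjacency matrix is the one given above):

import numpy as np

A = np.array([[0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0, 0]])
D = np.diag(A.sum(axis=1))
L = D - A
L_norm = np.linalg.inv(D) @ L     # D^{-1} L: unit diagonal, no longer symmetric
evals = np.linalg.eigvalsh(L)     # real and nonnegative, the smallest is 0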

Weighted graph

In practice, graphs are usually weighted. The weight between vertices can be a Euclidean distance or another measure. The figure below shows the weights of the graph. Here we apply a Gaussian kernel to the Euclidean distances between vertices such that:

\[\begin{equation} w_{ij} =\exp (-r_{ij}^2/\sigma) = \exp \big(\frac{-\|x_i - x_j\|^2}{\sigma_i\sigma_j}\big) \end{equation}\] where \(r_{ij}\) is the Euclidean distance between vertices \(i\) and \(j\), and \(\sigma\) controls the bandwidth. We call the matrix \(W=(w_{ij})\) the Gaussian affinity matrix such that: \[\begin{equation} \begin{pmatrix} 0 & w_{12} & w_{13} & w_{14} & w_{15}& w_{16}\\ w_{21} & 0 & w_{23} & w_{24} & w_{25}& w_{26}\\ w_{31} & w_{32} & 0 & w_{34} & w_{35}& w_{36}\\ w_{41} & w_{42} & w_{43} & 0 & w_{45}& w_{46}\\ w_{51} & w_{52} & w_{53} & w_{54} & 0& w_{56}\\ w_{61} & w_{62} & w_{63} & w_{64} & w_{65}& 0 \end{pmatrix} \end{equation}\] The Gaussian kernel strongly down-weights pairs of vertices that are far apart. Similar to the unweighted case, we can also construct the graph Laplacian from the Gaussian affinity matrix. First, we need the weighted degrees based on \(W\) such that: \[\begin{equation} d_{ii} = \sum_j{w_{ij}} \end{equation}\] With the diagonal degree matrix and the affinity matrix, we now have the weighted Laplacian: \[\begin{equation} L = D - W \end{equation}\] Likewise, we next give the normalized form of the Laplacian: \[\begin{equation} \text{norm}{(L)}= D^{-1}L = I - D^{-1}W \end{equation}\] To facilitate the eigenvalue decomposition, we need a trick for the asymmetric matrix \(D^{-1}L\). Since the eigenvectors of \(D^{-1}L\) and \(D^{-1}W\) are the same, we apply the trick to \(P = D^{-1}W\) to simplify the problem. Let's construct \(P'\) such that: \[\begin{equation} P' = D^{1/2} P D^{-1/2} = D^{-1/2}WD^{-1/2} \end{equation}\] It is obvious that \(P'\) is symmetric because \(W\) is symmetric. We can perform an eigenvalue decomposition on \(P'\) such that: \[\begin{equation} P' = S\Lambda S^\top \end{equation}\] where \(S\) stores the eigenvectors of \(P'\) and the diagonal of \(\Lambda\) records the eigenvalues of \(P'\). We can then recover the decomposition of \(P\) such that: \[\begin{equation} P = D^{-1/2}S\Lambda S^\top D^{1/2} \end{equation}\] Let \(Q=D^{-1/2}S\), then \(Q^{-1}=S^{\top}D^{1/2}\). We therefore find the right and left eigenvectors of \(P\) such that: \[\begin{equation} \psi = D^{-1/2} S \qquad \phi = S^{\top}D^{1/2} \end{equation}\] In fact, the columns of \(\psi\) store the spectrum of the graph, which is also called the diffusion map dimensions.
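A minimal sketch of this symmetrization trick, assuming `W` is a Gaussian affinity matrix (the function name is mine; a fuller version appears in the diffusion map appendix above):

import numpy as np

def diffusion_coordinates(W):
    d = W.sum(axis=1)                          # weighted degrees d_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    P_sym = D_inv_sqrt @ W @ D_inv_sqrt        # P' = D^{-1/2} W D^{-1/2}
    lam, S = np.linalg.eigh(P_sym)             # symmetric eigendecomposition
    lam, S = lam[::-1], S[:, ::-1]             # sort by decreasing eigenvalue
    psi = D_inv_sqrt @ S                       # right eigenvectors of P = D^{-1} W (columns)
    phi = np.diag(np.sqrt(d)) @ S              # left eigenvectors (columns)
    return psi, phi, lam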

Fundamentals

Torch.tensor

import torch

t = torch.tensor([[1, 2, 4], [4, 5, 6]])
t.shape   # torch.Size([2, 3])
t.ndim    # 2
# kinds of data: scalar, vector, matrix, tensor

Creating tensors

random_tensor = torch.rand(size=(3, 4, 5))
zeros = torch.zeros(size=(3, 4))
ones = torch.ones(size=(3, 4))
zero_to_ten = torch.arange(start=0, end=10, step=1)
ten_zeros = torch.zeros_like(input=zero_to_ten) # same shape but all zeros

Float

torch.float32/torch.float
torch.float16
torch.half
torch.float64/torch.double

types specific for GPU or CPU

device = 'cuda' if torch.cuda.is_available() else 'cpu'
t = torch.tensor([1, 2, 3], device=device)

tensor operations

tensor_A = torch.tensor([[1, 2], [3, 4], [5, 6]],
                        dtype=torch.float32)
multiply
tensor = torch.tensor([1, 2, 3])
tensor + 10
torch.multiply(tensor, 10)
tensor * tensor                 # element-wise: tensor([1, 4, 9])
tensor @ tensor                 # matrix multiplication -> tensor(14)
torch.matmul(tensor, tensor)    # 1*1 + 2*2 + 3*3 = tensor(14); torch.mm is the 2-D matrix version
tensor_A.T                      # transpose
layer
# torch.nn.Linear implements y = x A^T + b
torch.manual_seed(42)
linear = torch.nn.Linear(in_features=2,   # in_features matches the inner dimension of the input
                         out_features=6)  # out_features describes the output dimension
x = tensor_A
output = linear(x)
x.shape, output, output.shape
other operations
tensor = torch.arange(10, 100, 10)      # tensor([10, 20, 30, 40, 50, 60, 70, 80, 90])
tensor.argmax()                         # 8
tensor.argmin()                         # 0
tensor.type(torch.float16)              # tensor([10., 20., ..., 90.], dtype=torch.float16)
tensor.reshape(new_shape)               # -1 asks PyTorch to infer that dimension automatically
tensor.view(new_shape)                  # returns a view with the new shape
torch.stack([tensor, tensor], dim=0)    # concatenate a sequence of tensors along a new dimension (dim)
tensor.squeeze()                        # remove all dimensions of size 1
torch.clamp(tensor, min=min, max=max)   # limit values to the range [min, max]
tensor.unsqueeze(dim=0)                 # add a dimension of size 1
tensor.permute(2, 0, 1)                 # e.g. torch.Size([224, 224, 3]) -> torch.Size([3, 224, 224])
x.unsqueeze_(0)                         # trailing underscore -> in-place operation
slice
 
x = torch.tensor([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])  # assumed example input, shape (1, 3, 3)
x[:, 0]       # tensor([[1, 2, 3]])
x[:, :, 1]    # tensor([[2, 5, 8]])
x[:, 1, 1]    # tensor([5])
x[0, 0, :]    # same as x[0][0] -> tensor([1, 2, 3])
numpy
tensor = torch.from_numpy(array)
random seed
torch.manual_seed(seed=RANDOM_SEED)
torch.random.manual_seed(seed=RANDOM_SEED)
Variable
from torch.autograd import Variable
# key attributes: .data, .grad, .grad_fn
x_tensor = torch.randn(10, 5)
y_tensor = torch.randn(10, 5)
x = Variable(x_tensor, requires_grad=True)
y = Variable(y_tensor, requires_grad=True)
z = torch.sum(x + y)
print(z.data)     # e.g. -2.1379
print(z.grad_fn)  # <SumBackward0 object at 0x10da636a0>
z.backward()
GPU
if torch.cuda.is_available():
    device = "cuda"   # Use NVIDIA GPU (if available)
elif torch.backends.mps.is_available():
    device = "mps"    # Use Apple Silicon GPU (if available)
else:
    device = "cpu"    # Default to CPU if no GPU is available

tensor.to(device)
tensor_on_gpu.cpu().numpy()

Neural network

torch.nn

Contains all of the building blocks for computational graphs (essentially a series of computations executed in a particular way).

torch.nn.Parameter

Stores tensors that can be used with nn.Module. If requires_grad=True gradients (used for updating model parameters via gradient descent) are calculated automatically, this is often referred to as "autograd".

torch.nn.Module

The base class for all neural network modules, all the building blocks for neural networks are subclasses. If you're building a neural network in PyTorch, your models should subclass nn.Module. Requires a forward() method be implemented.

torch.optim

Contains various optimization algorithms (these tell the model parameters stored in nn.Parameter how to best change to improve gradient descent and in turn reduce the loss).

def forward()

All nn.Module subclasses require a forward() method, this defines the computation that will take place on the data passed to the particular nn.Module (e.g. the linear regression formula above).

Define a net

class net(nn.Module):
    def __init__(self):
        super().__init__()
        ...
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias

## example
class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))
        self.bias = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float))
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias
Check module
torch.manual_seed(42)
model_0 = LinearRegressionModel()
list(model_0.parameters()) # tensor([0.3367], requires_grad=True)
model_0.state_dict() # OrderedDict([('weights', tensor([0.3367])), ('bias', tensor([0.1288]))])
with torch.inference_mode(): y_preds = model_0(X_test) # run inference

Training

loss_fn = nn.L1Loss() # MAE loss is same as L1Loss
optimizer = torch.optim.SGD(params=model_0.parameters(), lr=0.01) ## lr(learning rate)
1. Forward pass: the model goes through all of the training data once, performing its forward() function calculations. Code: model(x_train)
2. Calculate the loss: the model's outputs (predictions) are compared to the ground truth and evaluated to see how wrong they are. Code: loss = loss_fn(y_pred, y_train)
3. Zero gradients: the optimizer's gradients are set to zero (they are accumulated by default) so they can be recalculated for this training step. Code: optimizer.zero_grad()
4. Perform backpropagation on the loss: computes the gradient of the loss with respect to every model parameter to be updated (each parameter with requires_grad=True). Code: loss.backward()
5. Update the optimizer (gradient descent): update the parameters with requires_grad=True with respect to the loss gradients in order to improve them. Code: optimizer.step()
Training example
for epoch in range(epochs):
    model.train()
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

test

1. Forward pass: the model goes through all of the test data once, performing its forward() function calculations. Code: model(x_test)
2. Calculate the loss: the model's outputs (predictions) are compared to the ground truth and evaluated to see how wrong they are. Code: loss = loss_fn(y_pred, y_test)
3. Calculate evaluation metrics (optional): alongside the loss value you may want to calculate other evaluation metrics such as accuracy on the test set. Code: custom functions

Inference and save model

Inference

model_0.eval()  # Set the model in evaluation mode
with torch.inference_mode():
    y_preds = model_0(X_test)

torch.save

Saves a serialized object to disk using Python's pickle utility. Models, tensors and various other Python objects like dictionaries can be saved using torch.save.

torch.load

Uses pickle's unpickling features to deserialize and load pickled Python object files (like models, tensors or dictionaries) into memory. You can also set which device to load the object to (CPU, GPU etc).

torch.nn.Module.load_state_dict ## recommended

Loads a model's parameter dictionary (model.state_dict()) using a saved state_dict() object.

Examples

Example 1

torch.manual_seed(42)
epochs = 100  # Set the number of epochs

# Create empty loss lists to track values
train_loss_values = []
test_loss_values = []
epoch_count = []

for epoch in range(epochs):
    ### Training
    model_0.train()  # Put model in training mode (this is the default state of a model)

    # 1. Forward pass on train data using the forward() method inside
    y_pred = model_0(X_train)
    # 2. Calculate the loss (how different are our model's predictions to the ground truth)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()  # 3. Zero grad of the optimizer
    loss.backward()        # 4. Loss backwards
    optimizer.step()       # 5. Progress the optimizer

    ### Testing
    # Put the model in evaluation mode
    model_0.eval()

    with torch.inference_mode():
        # 1. Forward pass on test data
        test_pred = model_0(X_test)

        # 2. Calculate loss on test data
        # predictions come in torch.float datatype, so comparisons need to be done with tensors of the same type
        test_loss = loss_fn(test_pred, y_test.type(torch.float))

        # Print out what's happening
        if epoch % 10 == 0:
            epoch_count.append(epoch)
            train_loss_values.append(loss.detach().numpy())
            test_loss_values.append(test_loss.detach().numpy())
            print(f"Epoch: {epoch} | MAE Train Loss: {loss} | MAE Test Loss: {test_loss} ")

Example 2

torch.manual_seed(42)

# Set the number of epochs
epochs = 1000

# Put data on the available device
# Without this, an error will happen (not all model/data on device)
X_train = X_train.to(device)
X_test = X_test.to(device)
y_train = y_train.to(device)
y_test = y_test.to(device)

for epoch in range(epochs):
    ### Training
    model_1.train()  # train mode is on by default after construction

    # 1. Forward pass
    y_pred = model_1(X_train)
    # 2. Calculate loss
    loss = loss_fn(y_pred, y_train)

    # 3. Zero grad optimizer
    optimizer.zero_grad()

    # 4. Loss backward
    loss.backward()

    # 5. Step the optimizer
    optimizer.step()

    ### Testing
    model_1.eval()  # put the model in evaluation mode for testing (inference)
    # 1. Forward pass
    with torch.inference_mode():
        test_pred = model_1(X_test)

        # 2. Calculate the loss
        test_loss = loss_fn(test_pred, y_test)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Train loss: {loss} | Test loss: {test_loss}")

VAE

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=200, device=device):
        super(VAE, self).__init__()

        # encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, latent_dim),
            nn.LeakyReLU(0.2)
        )

        # latent mean and variance
        self.mean_layer = nn.Linear(latent_dim, 2)
        self.logvar_layer = nn.Linear(latent_dim, 2)

        # decoder
        self.decoder = nn.Sequential(
            nn.Linear(2, latent_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        x = self.encoder(x)
        mean, logvar = self.mean_layer(x), self.logvar_layer(x)
        return mean, logvar

    def reparameterization(self, mean, logvar):
        std = torch.exp(0.5 * logvar)             # convert log-variance to standard deviation
        epsilon = torch.randn_like(std).to(device)
        z = mean + std * epsilon
        return z

    def decode(self, x):
        return self.decoder(x)

    def forward(self, x):
        mean, logvar = self.encode(x)
        z = self.reparameterization(mean, logvar)
        x_hat = self.decode(z)
        return x_hat, mean, logvar

CNN

# Create a neural net class
class Net(nn.Module):
    # Constructor
    def __init__(self, num_classes=3):
        super(Net, self).__init__()

        # Our images are RGB, so input channels = 3. We'll apply 12 filters in the first convolutional layer
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, stride=1, padding=1)

        # We'll apply max pooling with a kernel size of 2
        self.pool = nn.MaxPool2d(kernel_size=2)

        # A second convolutional layer takes 12 input channels, and generates 12 outputs
        self.conv2 = nn.Conv2d(in_channels=12, out_channels=12, kernel_size=3, stride=1, padding=1)

        # A third convolutional layer takes 12 inputs and generates 24 outputs
        self.conv3 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=3, stride=1, padding=1)

        # A drop layer deletes 20% of the features to help prevent overfitting
        self.drop = nn.Dropout2d(p=0.2)

        # Our 128x128 image tensors will be pooled twice with a kernel size of 2. 128/2/2 is 32.
        # So our feature tensors are now 32 x 32, and we've generated 24 of them
        # We need to flatten these and feed them to a fully-connected layer
        # to map them to the probability for each class
        self.fc = nn.Linear(in_features=32 * 32 * 24, out_features=num_classes)

    def forward(self, x):
        # Use a relu activation function after layer 1 (convolution 1 and pool)
        x = F.relu(self.pool(self.conv1(x)))

        # Use a relu activation function after layer 2 (convolution 2 and pool)
        x = F.relu(self.pool(self.conv2(x)))

        # Select some features to drop after the 3rd convolution to prevent overfitting
        x = F.relu(self.drop(self.conv3(x)))

        # Only drop the features if this is a training pass
        x = F.dropout(x, training=self.training)

        # Flatten
        x = x.view(-1, 32 * 32 * 24)
        # Feed to fully-connected layer to predict class
        x = self.fc(x)
        # Return log_softmax tensor
        return F.log_softmax(x, dim=1)

LSTM

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size):
        super(LSTMClassifier, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1)
        self.hidden2out = nn.Linear(hidden_dim, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        self.dropout_layer = nn.Dropout(p=0.2)

    def init_hidden(self, batch_size):
        return (autograd.Variable(torch.randn(1, batch_size, self.hidden_dim)),
                autograd.Variable(torch.randn(1, batch_size, self.hidden_dim)))

    def forward(self, batch, lengths):
        self.hidden = self.init_hidden(batch.size(-1))
        embeds = self.embedding(batch)
        packed_input = pack_padded_sequence(embeds, lengths)
        outputs, (ht, ct) = self.lstm(packed_input, self.hidden)
        # ht is the last hidden state of the sequences
        # ht = (1 x batch_size x hidden_dim)
        # ht[-1] = (batch_size x hidden_dim)
        output = self.dropout_layer(ht[-1])
        output = self.hidden2out(output)
        output = self.softmax(output)

        return output

References

1. https://github.com/mrdbourke/pytorch-deep-learning
2. https://readmedium.com/@rekalantar/variational-auto-encoder-vae-pytorch-tutorial-dce2d2fe0f5f
3. https://github.com/MicrosoftDocs/ml-basics/blob/master/05b%20-%20Convolutional%20Neural%20Networks%20(PyTorch).ipynb
4. https://github.com/ritchieng/the-incredible-pytorch?tab=readme-ov-file
5. https://github.com/claravania/lstm-pytorch
6. https://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/

Introduction

Suppose we have two sets of variables corresponding to two aspects, such as height and weight, and we want to analyze the relationship between the two sets. There are several ways to measure this relationship. However, it is sometimes hard to handle datasets with different dimensions; that is, if \(X\in \mathbb{R}^m\) and \(Y\in \mathbb{R}^n\), how do we resolve the relationship?

Basics of CCA

Assume there are two sets of data \(X\) and \(Y\); the size of \(X\) is \(n \times p\), whereas the size of \(Y\) is \(n\times q\). That is, \(X\) and \(Y\) share the same number of rows but differ in the number of columns. The idea of CCA is simple: find the best match of \(X w_x\) and \(Y w_y\). Let's just set: \[X w_x = z_x\qquad\text{and}\qquad Y w_y = z_y\]

Where \(X\in \mathbb{R}^{n\times p}\), \(w_x \in \mathbb{R}^{p}\), \(z_x\in \mathbb{R}^n\), \(Y\in \mathbb{R}^{n\times q}\), \(w_y \in \mathbb{R}^{q}\), \(z_y\in \mathbb{R}^n\). \(w_x\) and \(w_y\) are often referred to as canonical weight vectors, and \(z_x\) and \(z_y\) are called images, canonical variates, or canonical scores. To simplify the problem, we assume \(X\) and \(Y\) are standardized to zero mean and unit variance. Our task is to maximize the cosine of the angle between \(z_x\) and \(z_y\), meaning:

\[\max_{z_x, z_y \in \mathbf{R^n}} <z_x, z_y>=\max \cos(z_x, z_y)=\max\frac{<z_x, z_y>}{\|z_x\|\|z_y\|}\]

subject to \(\|z_x\|_{2}=1\) and \(\|z_y\|_{2}=1\).

In fact, our task is just to project \(X\) and \(Y\) onto a new coordinate system after applying linear transformations to \(X\) and \(Y\).

Resolve CCA

There are many solutions to this problem. Before we start, we make some assumptions: each column vector of \(X\) is perpendicular to the others, which means \(X^\top X= I\); the same assumption holds for \(Y\) and for \(w_x, w_y\). We can find \(\min(p,q)\) canonical components, and the \(r\)-th component is orthogonal to all of the previous \(r-1\) components.

Resolve CCA through SVD

To solve the CCA problem using SVD, we first introduce the joint covariance matrix \(C\) such that: \[\begin{equation} C = \begin{pmatrix} C_{xx} & C_{xy}\\ C_{yx} & C_{yy}\\ \end{pmatrix} \end{equation}\] where \(C_{xx}=\frac{1}{n-1}X^\top X\) and \(C_{yy}=\frac{1}{n-1}Y^\top Y\) are the empirical covariance matrices of \(X\) and \(Y\) respectively, and \(C_{xy}=\frac{1}{n-1} X^\top Y\) is the covariance matrix between \(X\) and \(Y\).

We can next reformulate the CCA problem with two linear transformations \(w_x\) and \(w_y\) such that:

\[\begin{equation} w_x^\top C_{xx} w_x = I_p, \quad w_y^\top C_{yy} w_y = I_q, \quad w_x^\top C_{xy} w_y = D \end{equation}\] where \(I_p\) and \(I_q\) are the \(p\)-dimensional and \(q\)-dimensional identity matrices respectively, and \(D = \text{diag}(\gamma_i)\) is a diagonal matrix, so that:

\[\begin{equation} \begin{pmatrix} {w}_x^\top & { 0}\\ { 0} & {w}_y^\top \end{pmatrix} \begin{pmatrix} C_{xx} & C_{xy}\\ C_{yx} & C_{yy} \end{pmatrix} \begin{pmatrix} {w}_x & { 0}\\ { 0} & {w}_y \end{pmatrix} = \begin{pmatrix} I_p & D\\ D^\top & I_q \end{pmatrix}, \end{equation}\]

The canonical variates are: \[\begin{equation} Z_x = Xw_x, \quad Z_y = Y w_y \end{equation}\] The diagonal elements \(\gamma_i\) of \(D\) are the canonical correlations. Thus we find the linear compounds \({Z}_x\) and \({Z}_y\) that maximize the cross-correlations. Since both \(C_{xx}\) and \(C_{yy}\) are symmetric positive definite, we can perform a Cholesky decomposition on them to get: \[\begin{equation} C_{xx} = C_{xx}^{\top/2} C_{xx}^{1/2}, \quad C_{yy} = C_{yy}^{\top/2} C_{yy}^{1/2} \end{equation}\]

where \(C_{xx}^{\top/2}\) is the transpose of \(C_{xx}^{1/2}\). Applying the inverses of the square-root factors symmetrically to the joint covariance matrix \(C\), the matrix is transformed into: \[\begin{equation} \begin{pmatrix} C_{xx}^{-\top/2} & {\mathbf 0}\\ {\mathbf 0} & C_{yy}^{-\top/2} \end{pmatrix} \begin{pmatrix} C_{xx} & C_{xy}\\ C_{yx} & C_{yy} \end{pmatrix} \begin{pmatrix} C_{xx}^{-1/2} & {\mathbf 0}\\ {\mathbf 0} & C_{yy}^{-1/2} \end{pmatrix} = \begin{pmatrix} I_p & C_{xx}^{-1/2}C_{xy}C_{yy}^{-1/2}\\ C_{yy}^{-1/2}C_{yx}C_{xx}^{-1/2} & I_q \end{pmatrix}. \end{equation}\]

The canonical correlation problem is thus reduced to finding an SVD of a triple product: \[\begin{equation} U^{\top} (C_{xx}^{-1/2}C_{xy}C_{yy}^{-1/2}) V = D. \end{equation}\] The transformed joint covariance matrix can thus be diagonalized blockwise, e.g. by a two-sided Jacobi method, such that: \[\begin{equation} \begin{pmatrix} U^\top & {\mathbf 0}\\ {\mathbf 0} & V^\top \end{pmatrix} \begin{pmatrix} I_p & C_{xx}^{-1/2}C_{xy}C_{yy}^{-1/2}\\ C_{yy}^{-1/2}C_{yx}C_{xx}^{-1/2} & I_q \end{pmatrix} \begin{pmatrix} U & {\mathbf 0}\\ {\mathbf 0} & V \end{pmatrix} = \begin{pmatrix} I_p & D\\ D^\top & I_q \end{pmatrix} \end{equation}\]

with the desired transformation \({w}_x\) and \({w}_y\): \[\begin{equation} {w}_x = C_{xx}^{-1/2} U, \quad {w}_y = C_{yy}^{-1/2}V \end{equation}\] where the singular values \(\gamma_i\) are in descending order such that: \[\begin{equation} \gamma_1 \geq \gamma_2 \geq \cdots \geq 0. \end{equation}\]
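A compact NumPy sketch of this SVD recipe (function and variable names are mine; \(X\) and \(Y\) are assumed to be centered data matrices with the same number of rows, and rank deficiencies are ignored):

import numpy as np

def cca_svd(X, Y):
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1)
    Cyy = Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Cxx)            # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Lx, Cxy)            # Lx^{-1} Cxy
    K = np.linalg.solve(Ly, K.T).T          # Lx^{-1} Cxy (Ly^T)^{-1}
    U, gamma, Vt = np.linalg.svd(K)
    wx = np.linalg.solve(Lx.T, U)           # whitened-left singular vectors mapped back
    wy = np.linalg.solve(Ly.T, Vt.T)
    return wx, wy, gamma                    # canonical weights and canonical correlations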

Resolve CCA through the standard eigenvalue problem

The problem can be reformulated as: \[\begin{equation} \underset{w_x \in \mathbb{R}^p, w_y\in \mathbb{R}^q}{\arg \max} w_x^\top C_{xy} w_y \end{equation}\] subject to \(w_x^\top C_{xx} w_x = 1\) and \(w_y^\top C_{yy} w_y = 1\). The problem can be solved with the Lagrange multiplier technique. Let's construct the Lagrangian \(L\) such that: \[\begin{equation} L = w_x^\top C_{xy} w_y - \frac{\rho_1}{2} w_x^\top C_{xx} w_x - \frac{\rho_2}{2} w_y^\top C_{yy} w_y \end{equation}\]

Differentiating \(L\) with respect to \(w_x\) and \(w_y\) gives: \[\begin{equation} \begin{aligned} \frac{\partial L}{\partial w_x} = C_{xy} w_y - \rho_1 C_{xx}w_x = \mathbf{0}\\ \frac{\partial L}{\partial w_y} = C_{yx} w_x - \rho_2 C_{yy}w_y = \mathbf{0} \end{aligned} \end{equation}\]

By left-multiplying the above equations by \(w_x^\top\) and \(w_y^\top\), we have:

\[\begin{equation} \begin{aligned} w_x^\top C_{xy} w_y -\rho_1 w_x^\top C_{xx} w_x = 0\\ w_y^\top C_{yx} w_x -\rho_2 w_y^\top C_{yy} w_y = 0 \end{aligned} \end{equation}\] Since \(w_x^\top C_{xx} w_x = 1\) and \(w_y^\top C_{yy} w_y = 1\), we obtain \(\rho_1 = \rho_2 = \rho\). Substituting \(\rho\) back into the equations, we get: \[\begin{equation} w_x = \frac{C_{xx}^{-1}C_{xy}w_y}{\rho} \end{equation}\] Eventually we have the equation: \[\begin{equation} C_{yx} C_{xx}^{-1} C_{xy} w_y = \rho^2 C_{yy} w_y \end{equation}\] This is a generalized eigenvalue problem whose eigenvalues are all greater than or equal to zero. By solving it we can find \(w_y\) and then \(w_x\).
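The corresponding sketch for this eigenvalue route, using SciPy's generalized symmetric eigensolver (names are mine; regularization and zero correlations are ignored):

import numpy as np
from scipy.linalg import eigh

def cca_eig(X, Y):
    n = X.shape[0]
    Cxx, Cyy = X.T @ X / (n - 1), Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    M = Cxy.T @ np.linalg.solve(Cxx, Cxy)      # C_yx C_xx^{-1} C_xy
    rho2, Wy = eigh(M, Cyy)                    # generalized problem: M w = rho^2 C_yy w
    order = np.argsort(rho2)[::-1]             # largest correlations first
    rho = np.sqrt(np.clip(rho2[order], 0, None))
    Wy = Wy[:, order]
    Wx = np.linalg.solve(Cxx, Cxy @ Wy) / rho  # w_x = C_xx^{-1} C_xy w_y / rho
    return Wx, Wy, rho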

The blog's comment system

I wanted to get a comment system running for the blog. I previously used Disqus, but after redeploying, most of the page failed to display, and I didn't want to use Disqus anymore. I saw that someone creatively used GitHub as the backend for a comment system, namely Gitment. I set up Gitment on GitHub following the tutorial, only to learn that Gitment depends on a request service hosted by its author, who no longer maintains it. The steps were:

npm i --save gitment

Then modify the js yourself to connect to a server you host yourself. WTF?

/node_modules/gitment/dist/gitment.browser.js
For the detailed modification process, see: https://sherry0429.github.io/2019/02/12/gitment%E4%BF%AE%E5%A4%8D/

I then kept looking for other comment systems and found this article: https://wangjiezhe.com/posts/2018-10-29-Hexo-NexT-3/. Following its tutorial I installed utterances, which so far looks quite good. The only thing I have noticed is that the comments currently appear to be global across pages. I don't fully understand the issue-term option and don't want to dig into it for now, so I just set it to pathname.

No more fiddling with themes: obediently switching to NexT

Once again, at the start of this year, the maupassant theme of my Hexo blog stopped working. With so many things going on early in the year, I gave up on fixing the blog, which meant new posts could not be published; I just ignored the errors for a while.

Now that I have a bit of time, I decided to fix the blog properly. The simplest approach, which has always worked before, is to reinstall the whole environment: first hexo, then maupassant. This time it didn't work; hexo generate produced a pile of errors. I even suspected maupassant could not be fixed at all; searching the error keywords, I found nobody with the same problem. Finally I suspected that special characters in the formulas of my posts were breaking the markdown parsing. I modified the js of hexo-renderer-marked, but the problem remained. In the end I decided to switch to another theme, yet as soon as I migrated all the posts over, errors appeared again. The error was as follows:

INFO  Start processing
FATAL Something's wrong. Maybe you can find the solution here: http://hexo.io/docs/troubleshooting.html
Template render error: (unknown path) [Line 65, Column 565]
expected variable end
at Object._prettifyError (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/lib.js:36:11)
at Template.render (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/environment.js:542:21)
at Environment.renderString (/Users/chengmingbo/blog_deploy/blog/node_modules/nunjucks/src/environment.js:380:17
... ...

Fine, starting over from scratch: add the posts back one file at a time, running hexo generate each time. I finally found the problematic file. For now I just commented it out; later I will track down which special character caused the problem. At least the blog can be updated again.

Back and forth

After settling on the NexT theme, things went back and forth again: after adding the Disqus comment system, the blog showed a blank page with only the header. Adding features is fun for a moment, but debugging is hell. I really could not remember what change had broken the blog again, so I had to start over once more. While looking for themes I noticed that hexo-theme-apollo, the fourth most-starred theme, had stopped development, and one sentence from its author convinced me to stop fiddling with themes:

"Focusing on writing good content matters more than a beautiful blog style. Have fun, everyone."

Follow-up work

  1. Track down which special character caused hexo generate to fail.
  2. See whether the comment system can be restored; if not, leave it for now, as long as it doesn't get in the way of writing posts.

The pmf of the Poisson distribution is as follows:

\[f(X=k;\lambda)=\frac{\lambda^k e^{-\lambda}}{k!}\]

Our aim is to derive the expectation \(E(X)\) and the variance \(Var(X)\). Given the formula for the expectation: \[ E(X)=\sum_{k=0}^{\infty} k \frac{\lambda^k e^{-\lambda }}{k!} \]

Notice that when \(k=0\), the summand is equal to 0, that is:

\[\sum_{k=0}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}\Large|_{k=0}=0\]

Then the formula becomes:

\[E(X)=\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}\]

\[\begin{aligned}E(X)&=\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}=\sum_{k=1}^{\infty} \frac{\lambda^ke^{-\lambda}}{(k-1)!}\\&=\sum_{k=1}^{\infty} \frac{\lambda^{k-1}\lambda e^{-\lambda}}{(k-1)!}\\&=\lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}\end{aligned}\]

Now we take advantage of the Taylor expansion; recall that:

\[e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\cdots=\sum_{k=1}^{\infty}\frac{x^{k-1}}{(k-1)!}\]

Comparing this with \(E(X)\), we get:

\[E(X)=\lambda e^{-\lambda}e^\lambda=\lambda\]

Since \(Var(X)=E(X^2)-(E(X))^2\), we only need \(E(X^2)\). Given that:

\[E(X)=\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}=\lambda\]

we can use this formula to derive \(E(X^2)\):

\[\begin{aligned}E(X)=&\sum_{k=1}^{\infty} k \frac{\lambda^ke^{-\lambda}}{k!}=\lambda\\\Leftrightarrow&\sum_{k=1}^{\infty} k \frac{\lambda^k}{k!}=\lambda e^{\lambda}\\\Leftrightarrow&\frac{\partial\sum_{k=1}^{\infty} k \frac{\lambda^k}{k!}}{\partial \lambda}=\frac{\partial \lambda e^{\lambda}}{\partial \lambda}\\\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k-1}}{k!}=e^\lambda+\lambda e^\lambda\\\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k-1}e^{-\lambda}}{k!}=1+\lambda \\\Leftrightarrow&\sum_{k=1}^{\infty}k^2\frac{\lambda^{k}e^{-\lambda}}{k!}=\lambda+\lambda^2=E(X^2)\end{aligned}\]

then,

\[Var(X)=E(X^2)-(E(X))^2=\lambda+\lambda^2-(\lambda)^2=\lambda\]

Thus, we have proved that the expectation and the variance of the Poisson distribution are both \(\lambda\).
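As a quick numerical sanity check of this result (a sketch using scipy.stats; \(\lambda=3.5\) is an arbitrary choice):

from scipy.stats import poisson

lam = 3.5
mean, var = poisson.stats(lam, moments='mv')
print(mean, var)   # both equal 3.5, matching E(X) = Var(X) = lambda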

Preface

There are many classification algorithms, such as Logistic Regression, SVM, and Decision Trees. Today we'll talk about the Gaussian Discriminant Analysis (GDA) algorithm, which is not so popular. Actually, Logistic Regression often performs better than GDA because it can fit data from any exponential-family distribution. However, from this algorithm we can learn more about the Gaussian distribution, which is the most important distribution in statistics. Furthermore, if you want to understand Gaussian Mixture Models or Factor Analysis, GDA is a good start.

We first talk about the Gaussian distribution and the multivariate Gaussian distribution, where you'll see plots of Gaussian distributions with different parameters. Then we will learn the GDA classification algorithm. Finally, we'll apply GDA to a dataset and see how it performs.

Multivariate Gaussian Distribution

Gaussian Distribution

As we know, the pdf (probability density function) of the Gaussian distribution is a bell curve, which is determined by two parameters, \(\mu\) and \(\sigma^2\). The figure below shows a Gaussian distribution with \(\mu=0\) and \(\sigma^2=1\), which is often written \(\mathcal{N}(\mu, \sigma^2)\); thus, Figure 1 is distributed as \(\mathcal{N}(0,1)\). Figure 1. Gaussian distribution with \(\mu=0\) and \(\sigma^2=1\).

Actually, the parameters \(\mu\) and \(\sigma^2\) are exactly the mean and the variance of the distribution, so \(\sigma\) is the standard deviation. Let's look at the areas between the red lines and the magenta lines, which range from \(\mu\pm\sigma\) and \(\mu\pm2\sigma\) respectively. The area between the red lines accounts for 68.3% of the total area under the curve; that is, 68.3% of samples lie between \(\mu-\sigma\) and \(\mu+\sigma\). Likewise, 95.4% of samples lie between \(\mu-2\sigma\) and \(\mu+2\sigma\).

You may want to know how these two parameters influence the shape of the pdf of the Gaussian distribution. First of all, when we change \(\mu\) with \(\sigma^2\) fixed, the curve keeps its shape but moves along the random-variable axis.

Figure 2. Probability density function curves with \(\mu=\pm2\) and \(\sigma=1\).

So, what happens when we change \(\sigma\)? Figure 3 illustrates that a smaller \(\sigma\) leads to a sharper pdf; conversely, a larger \(\sigma\) gives a broader curve.

Figure 3. Probability density function curves with \(\mu=0\) and varying \(\sigma\).

Some may wonder what the form of \(p(x)\) is for a Gaussian distribution; I give it here so that you can compare the normal distribution with the multivariate Gaussian.

\[\mathcal{N(x|\mu, \sigma^2)}=\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2 } } \]

Multivariate Gaussian

For convenience, we first look at the form of the multivariate Gaussian distribution:

\[\mathcal{N(x|\mu, \Sigma)}=\frac{1}{ { (2\pi)}^{\frac{d}{2 } } |\Sigma|^{\frac{1}{2 } } } e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}\]

where \(\mu\) is the mean, \(\Sigma\) is the covariance matrix, and \(d\) is the dimension of the random variable \(x\). Specifically, for a 2-dimensional Gaussian distribution we have:

\[\mathcal{N(x|\mu, \Sigma)}=\frac{1}{ 2\pi |\Sigma|^{\frac{1}{2 } } } e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}\]

To get an intuition for the multivariate Gaussian distribution, we first look at a distribution with \(\mu=\begin{pmatrix}0\\0\end{pmatrix}\) and \(\Sigma=\begin{pmatrix}1&0\\0&1\end{pmatrix}\).

Figure 4. 2-dimensional gaussian distribution with \(\mu=\begin{pmatrix}0\\0\end{pmatrix}\) and \(\Sigma=\begin{pmatrix}1&0\\0&1\end{pmatrix}\).

Notice that the figure is no longer a curve but a 3-dimensional surface. Just as for the normal pdf, the covariance \(\Sigma\) determines the shape of the figure; however, there are 4 entries of \(\Sigma\) that can be changed in this example. Given that we need to compute \(|\Sigma|\) in the denominator and \(\Sigma^{-1}\), which requires a non-zero determinant, we must keep in mind that \(|\Sigma|\) should be positive.

1. change \(\mu\)

Rather than changing \(\Sigma\), we first look at how the contours change when varying \(\mu\). Figure 5 illustrates the contour variation: we only move the center of the contours, i.e. \(\mu\) determines the position of the pdf rather than its shape. Next, we will see how the entries of \(\Sigma\) influence the shape of the pdf.

Figure 5. Contours when change \(\mu\) with \(\Sigma=\begin{pmatrix}1&0\\0&1\end{pmatrix}\).

2. change diagonal entries of \(\Sigma\)

If we scale the diagonal entries, we can see from Figure 6 that samples concentrate into a smaller range when changing \(\Sigma\) from \(\begin{pmatrix}1&0\\0&1\end{pmatrix}\) to \(\begin{pmatrix}0.3&0\\0&0.3\end{pmatrix}\). Similarly, if we alter \(\Sigma\) to \(\begin{pmatrix}3&0\\0&3\end{pmatrix}\), the figure will spread out.

Figure 6. Density when scaling diagonal entries to 0.3.

What if we change only one entry of the diagonal? Figure 7 shows the variation of the density when changing \(\Sigma\) to \(\begin{pmatrix}1&0\\0&5\end{pmatrix}\). Notice that the parameter squashes and stretches the figure along the coordinate axes.

Figure 7. Density when scaling one of the diagonal entries.

3. change secondary diagonal entries of \(\Sigma\)

We now try changing the entries along the secondary diagonal. Figure 8 demonstrates that the variation of the density is no longer parallel to the \(X\) and \(Y\) axes, where \(\Sigma=\begin{pmatrix}1 &0.5\\0.5&1\end{pmatrix}\).

Figure 8. Density when scaling secondary diagonal entries to 0.5

When we set the secondary-diagonal entries to negative 0.5, the contours mirror those obtained with the positive value.

Figure 9. Density when scaling secondary diagonal entries to -0.5

In light of the importance of the determinant of \(\Sigma\), what happens if the determinant is close to zero? Informally, we can think of the determinant of a matrix as a volume; as the determinant becomes smaller, the density concentrates in a narrower region. Figure 10 illustrates this with \(\Sigma=\begin{pmatrix}1&0.99\\0.99&1\end{pmatrix}\).

Figure 10. Density when determinant is close to zero.

Gaussian Discriminant Analysis

Intuition

When the input features \(x\) are continuous variables, we can use GDA to classify data. First, let's take a look at how GDA does the job. Figure 11 shows two Gaussian distributions; they share the same covariance \(\Sigma=\begin{pmatrix}1&0\\0&1\end{pmatrix}\) and have means \(\mu_0=\begin{pmatrix}1\\1\end{pmatrix}\) and \(\mu_1=\begin{pmatrix}-1\\-1\end{pmatrix}\) respectively. Imagine you have some data that fall under the first and second Gaussian distributions. If we can find such distributions to fit the data, then we have the capacity to decide which distribution a new data point comes from, the first or the second.

Figure 11. Two gaussian distributions with respect to \(\mu_0=\begin{pmatrix}1\\1\end{pmatrix}\) and \(\mu_1=\begin{pmatrix}-1\\-1\end{pmatrix}\) , and \(\Sigma=\begin{pmatrix}1&0\\0&1\end{pmatrix}\)

Specifically, let's look at a concrete example. Figure 12 shows samples drawn from two Gaussian distributions: 100 blue '+'s and 100 red 'o's. Assume that we have such data to classify; we can apply GDA to solve the problem.

CODE:

%% octave
pkg load statistics

m=200;
n=2;
rp=mvnrnd([1 1],[1 0;0 1],m/2); % generate positive samples (label 1)
rn=mvnrnd([4 4],[1 0;0 1],m/2); % generate negative samples (label 0)
y=[ones(m/2,1);zeros(m/2,1)];

figure;hold on;

plot3(rp(:,1),rp(:,2),y(1:m/2,1),'b+');
plot3(rn(:,1),rn(:,2),y(m/2+1:m,1),'ro');
axis([-3 8 -3 8]);

Figure 12. 200 samples drawn from two Gaussian Distribution with parameters \(\mu_0=\begin{bmatrix}1\\1\end{bmatrix},\mu_1=\begin{bmatrix}4\\4\end{bmatrix},\Sigma=\begin{bmatrix}1&0\\0&1\end{bmatrix}\).

Definition

Now, let's define the algorithm. First we assume the discrete class variable \(y\) follows a Bernoulli distribution parameterized by \(\phi\); then we have:

\[y\sim {\rm Bernoulli}(\phi)\]

Concretely, the probability of \(y=1\) is \(\phi\), and of \(y=0\) is \(1-\phi\). We can combine the two cases into one equation:

\(p(y|\phi)=\phi^y(1-\phi)^{1-y}\)

Apparently, \(p(y=1|\phi)=\phi\) and \(p(y=0|\phi)=1-\phi\) given that y can only be \(0\) or \(1\).

Another assumption is that \(x\) follows different Gaussian distributions given different \(y\). We assume the two Gaussian distributions share the same covariance and have different means \(\mu\). Based on all of the above, then:

\[p(x|y=0)=\frac{1}{(2\pi)^{\frac{d}{2 } } |\Sigma|^{\frac{1}{2 } } }e^{-\frac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)}\]

\[p(x|y=1)=\frac{1}{(2\pi)^{\frac{d}{2 } } |\Sigma|^{\frac{1}{2 } } }e^{-\frac{1}{2}(x-\mu_1)^{T}\Sigma^{-1}(x-\mu_1)}\]

i.e. \(x|y=0 \sim \mathcal{N}(\mu_0,\Sigma)\) and \(x|y=1 \sim \mathcal{N}(\mu_1,\Sigma)\). Suppose we have \(m\) samples; it is hard to compute \(p(x^{(1)}, x^{(2)}, x^{(3)},\cdots,x^{(m)}|y=0)\) or \(p(x^{(1)}, x^{(2)}, x^{(3)},\cdots,x^{(m)}|y=1)\) directly. In general, we assume each \(p(x^{(i)}|y=0)\) is independent of any other \(p(x^{(j)}|y=0)\); then we have:

\[p(X|y=0)=\prod_{i=1\,y^{(i)}\neq1}^{m}\frac{1}{(2\pi)^{\frac{d}{2 } } |\Sigma|^{\frac{1}{2 } } }e^{-\frac{1}{2}(x^{(i)}-\mu_0)^{T}\Sigma^{-1}(x^{(i)}-\mu_0)}\]

Vice versa,

\[p(X|y=1)=\prod_{i=1\,y^{(i)}\neq 0}^{m}\frac{1}{(2\pi)^{\frac{d}{2 } } |\Sigma|^{\frac{1}{2 } } }e^{-\frac{1}{2}(x^{(i)}-\mu_1)^{T}\Sigma^{-1}(x^{(i)}-\mu_1)}\]

Here \(X=(x^{(1)}, x^{(2)}, x^{(3)},\cdots,x^{(m)})\). Now, we want to maximize \(p(X|y=0)\) and \(p(X|y=1)\). Why? Because we hope to find parameters that make \(p(X|y=0)p(X|y=1)\) largest, based on the assumption that the samples come from the two Gaussian distributions; under such parameters the observed samples are most likely to appear. Thus, our task is to maximize \(p(X|y=0)p(X|y=1)\), and we let

\[\mathcal{L}(\phi,\mu_0,\mu_1,\Sigma)=\arg\max p(X|y=0)p(X|y=1)=\arg\max\prod_{i=1}^{m}p(x^{(i)}, y^{(i)};\phi,\mu_0,\mu_1,\Sigma)\]

It is tough to maximize \(\mathcal{L}(\phi,\mu_0,\mu_1,\Sigma)\) directly. Notice that \(\log\) is monotonically increasing; thus, we can maximize \(\log\mathcal{L}(\phi,\mu_0,\mu_1,\Sigma)\) instead of \(\mathcal{L}(\phi,\mu_0,\mu_1,\Sigma)\), then:

\[\begin{aligned}\ell(\phi,\mu_0,\mu_1,\Sigma)&=\log\mathcal{L}(\phi,\mu_0,\mu_1,\Sigma)\\&=\arg\max\log\prod_{i=1}^{m}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)\\&=\arg\max\log\prod_{i=1}^{m}p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)p(y^{(i)};\phi)\\&=\arg\max\sum_{i=1}^{m}\left[\log p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)+\log p(y^{(i)};\phi)\right]\\&=\arg\max\sum_{i=1}^{m}\log p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)+\sum_{i=1}^{m}\log p(y^{(i)};\phi)\end{aligned}\]

By now, we have an objective function of the parameters \(\mu_0, \mu_1, \Sigma\) and \(\phi\). In the next section, we'll obtain these parameters by taking partial derivatives.

Solution

To estimate the four parameters, we take partial derivatives of \(\ell\) and set them to zero. We estimate \(\phi\) first, by letting \(\frac{\partial \ell}{\partial \phi}=0\):

\[\begin{aligned}\frac{\partial \ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial \phi}=0&\Rightarrow\frac{\partial\left(\sum_{i=1}^{m}\log p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)+\sum_{i=1}^{m}\log p(y^{(i)};\phi)\right)}{\partial \phi}=0\\&\Rightarrow\frac{\partial\sum_{i=1}^{m}\log p(y^{(i)};\phi)}{\partial \phi}=0\\&\Rightarrow\frac{\partial\sum_{i=1}^{m}\log\left(\phi^{y^{(i)}}(1-\phi)^{1-y^{(i)}}\right)}{\partial \phi}=0\\&\Rightarrow\frac{\partial\sum_{i=1}^{m}\left(y^{(i)}\log \phi+(1-y^{(i)})\log(1-\phi)\right)}{\partial \phi}=0\\&\Rightarrow\frac{\partial\sum_{i=1}^{m}\left(1\{y^{(i)}=1\}\log \phi+1\{y^{(i)}=0\}\log(1-\phi)\right)}{\partial \phi}=0\\&\Rightarrow\frac{\sum_{i=1}^{m}1\{y^{(i)}=1\}}{\phi}-\frac{\sum_{i=1}^{m}1\{y^{(i)}=0\}}{1-\phi}=0\\&\Rightarrow\phi=\frac{1}{m}\sum_{i=1}^{m}1\{y^{(i)}=1\}\end{aligned}\]
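
In code this estimator is simply the fraction of samples labeled 1. A one-line Octave sketch, assuming the label vector y from the data-generation code above:

%% octave
phi = mean(y == 1);   % fraction of samples with y = 1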

Note that \(\mu_0\) and \(\mu_1\) play symmetric roles in the equation, so we only need to derive one of them. Here we take the derivative with respect to \(\mu_0\); only the samples with \(y^{(i)}=0\) involve \(\mu_0\), so:

\[\begin{aligned}\frac{\partial \ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial \mu_0}=0&\Rightarrow\frac{\partial\sum_{i=1}^{m} \log p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)}{\partial \mu_0}=0\\&\Rightarrow\frac{\partial \sum_{i=1}^{m}1\{y^{(i)}=0\}\log\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)}}{\partial \mu_0}=0\\&\Rightarrow 0+\frac{\partial \sum_{i=1}^{m}1\{y^{(i)}=0\}\left(-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)}{\partial \mu_0}=0\end{aligned}\]

Using the identity \(\frac{\partial X^TAX}{\partial X}=(A+A^T)X\) with \(X=(x^{(i)}-\mu_0)\), we get:

\[\begin{aligned}\frac{\partial \ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial \mu_0}=0&\Rightarrow \frac{\partial \sum_{i=1}^{m}1\{y^{(i)}=0\}\left(-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)}{\partial \mu_0}=0\\&\Rightarrow\sum_{i=1}^{m}1\{y^{(i)}=0\}\left(-\frac{1}{2}\right)\left((\Sigma^{-1})^T+\Sigma^{-1}\right)(x^{(i)}-\mu_0)\cdot(-1)=0\\&\Rightarrow \sum_{i=1}^{m}1\{y^{(i)}=0\}x^{(i)}=\sum_{i=1}^{m}1\{y^{(i)}=0\}\mu_0\\&\Rightarrow\mu_0=\frac{\sum_{i=1}^{m}1\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^{m}1\{y^{(i)}=0\}}\end{aligned}\]

Similarly,

\[\mu_1=\frac{\sum_{i=1}^{m}1\{y^{(i)}=1\}x^{(i) } } {\sum_{i=1}^{m}1\{y^{(i)}=1\} } \]
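
These two estimators are just the per-class sample means. A minimal Octave sketch, assuming a sample matrix X (one row per sample, stacked as in the code later in this post) and the label vector y:

%% octave
mu_0 = mean(X(y == 0, :));   % mean of the samples with y = 0
mu_1 = mean(X(y == 1, :));   % mean of the samples with y = 1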

Before calculating \(\Sigma\), we first note two matrix-calculus facts: \(\frac{\partial|\Sigma|}{\partial\Sigma}=|\Sigma|\Sigma^{-1}\) (since \(\Sigma\) is symmetric) and \(\frac{\partial\, a^T\Sigma^{-1}a}{\partial\Sigma}=-\Sigma^{-1}aa^T\Sigma^{-1}\). Then

\[\begin{aligned}\frac{\partial \ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial \Sigma}=0&\Rightarrow\frac{\partial\left(\sum_{i=1}^{m} \log p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)+\sum_{i=1}^{m} \log p(y^{(i)};\phi)\right)}{\partial \Sigma}=0\\&\Rightarrow\frac{\partial \sum_{i=1}^{m}\log\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})}}{\partial \Sigma}=0\\&\Rightarrow\frac{\partial \sum_{i=1}^{m}\left(-\frac{d}{2}\log 2\pi-\frac{1}{2}\log|\Sigma|-\frac{1}{2}(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})\right)}{\partial \Sigma}=0\\&\Rightarrow-\frac{m}{2}\frac{1}{|\Sigma|}|\Sigma|\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}\left(\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\right)\Sigma^{-1}=0\\&\Rightarrow-\frac{m}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}\left(\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\right)\Sigma^{-1}=0\\&\Rightarrow\Sigma=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\end{aligned}\]
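
The shared covariance is therefore the average outer product of each sample's deviation from its own class mean. Here is a small Octave sketch of this formula, written as an explicit loop for clarity (it assumes X, y, mu_0, mu_1 as above; the code block later in this post computes the same thing in vectorized form):

%% octave
[m, d] = size(X);
Sigma = zeros(d, d);
for i = 1:m
  if y(i) == 1
    dev = X(i,:) - mu_1;
  else
    dev = X(i,:) - mu_0;
  end
  Sigma = Sigma + dev' * dev;   % outer product of the centered sample
end
Sigma = Sigma / m;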

Despite the tedium of the derivation, the results are quite elegant. Next, we will plug in these parameter estimates and see how well they perform.

Apply GDA

Note that the data drawn from the two Gaussian distributions are random, so if you run the code the outcome will differ slightly. In most cases, however, the distributions reconstructed from the estimated parameters are close to the original ones.

\[\begin{aligned}&\phi=0.5\\&\mu_0=\begin{bmatrix}4.0551\\4.1008\end{bmatrix}\\&\mu_1=\begin{bmatrix}0.85439\\1.03622\end{bmatrix}\\&\Sigma=\begin{bmatrix}1.118822&-0.058976\\-0.058976&1.023049\end{bmatrix}\end{aligned}\]

In Figure 13 we can see the contours of the two estimated Gaussian distributions, and most of the samples are correctly classified.

CODE:

%% octave 
phi=length(find(y==1))/m;            % fraction of samples with y = 1
mu_0 = sum(rn)/length(find(y==0))    % class mean for y = 0
mu_1 = sum(rp)/length(find(y==1))    % class mean for y = 1
X = [rp;rn];
X_mu1 = X(find(y==1),:)-mu_1;
X_mu0 = X(find(y==0),:)-mu_0;
X_mu = [X_mu1; X_mu0];               % each row centered by its own class mean
sigma = (X_mu'*X_mu)/m               % shared (pooled) covariance

[x1 y1]=meshgrid(linspace(-3,8,100)',linspace(-3,8,100)');
X1=[x1(:) y1(:)];
z1=mvnpdf(X1,mu_1,sigma);
contour(x1,y1,reshape(z1,100,100),8);
hold on;
z2=mvnpdf(X1,mu_0,sigma);
contour(x1,y1,reshape(z2,100,100),8);

Figure 13. Contours drawn from the estimated parameters.

In fact, we can compute the class-conditional probability of each data point to predict which distribution it more likely belongs to. For example, to decide whether \(x=\begin{bmatrix}0.88007\\3.9501\end{bmatrix}\) belongs to the left distribution or the right one, we evaluate \(x\) under both distributions:

\[\begin{aligned}p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg|y=0\right)=&\frac{1}{2\pi|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)}\\=&\frac{1}{2\pi\left|\begin{matrix}1.1188&-0.059\\-0.059&1.023\end{matrix}\right|^{\frac{1}{2}}}e^{-\frac{1}{2}\begin{bmatrix}-3.175\\-0.151\end{bmatrix}^T\begin{bmatrix}0.896&0.052\\0.052&0.98\end{bmatrix}\begin{bmatrix}-3.175\\-0.151\end{bmatrix}}\\=&\frac{1}{2\pi\sqrt{1.141}}e^{-\frac{1}{2}\times 9.11}=0.149\times 0.01=0.0015\end{aligned}\]

and

\[\begin{aligned}p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg| y=1\right)=&\frac{1}{2\pi|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_1)^{T}\Sigma^{-1}(x-\mu_1)}\\=&\frac{1}{2\pi\left|\begin{matrix}1.1188&-0.059\\-0.059&1.023\end{matrix}\right|^{\frac{1}{2}}}e^{-\frac{1}{2}\begin{bmatrix}0.03\\2.91\end{bmatrix}^T\begin{bmatrix}0.896&0.052\\0.052&0.98\end{bmatrix}\begin{bmatrix}0.03\\2.91\end{bmatrix}}\\=&\frac{1}{2\pi\sqrt{1.141}}e^{-\frac{1}{2}\times 8.336}=0.149\times 0.015=0.0022\end{aligned}\]

Since \(p(y=1)\) and \(p(y=0)\) are equal (both are \(0.5\)), we only need to compare \(p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg| y=1\right)\) with \(p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg| y=0\right)\). Apparently this data point is predicted to come from the left distribution, which is wrong. In fact, in this example it is the only data point that is classified incorrectly.
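
The arithmetic above can be double-checked with mvnpdf from the Octave statistics package. The sketch below hard-codes the estimated parameters listed earlier; since the data are random, your values will differ slightly.

%% octave
pkg load statistics
mu_0 = [4.0551 4.1008];  mu_1 = [0.85439 1.03622];
Sigma = [1.118822 -0.058976; -0.058976 1.023049];
x = [0.88007 3.9501];
p0 = mvnpdf(x, mu_0, Sigma)   % roughly 0.0015
p1 = mvnpdf(x, mu_1, Sigma)   % roughly 0.0022
p0 / p1                       % roughly 0.68 < 1, so the point falls on the blue side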

You may wonder what the blue line in Figure 14 means. It turns out that every data point below the blue line is assigned to the blue class, while every point above it is assigned to the red class. How does that work?

The blue line is the decision boundary; once we know its expression, making a decision becomes easy. In fact, GDA is a linear classifier, which we will show below. For the data point above, we simply divide one probability by the other and check whether the ratio is larger or smaller than 1. In our example the ratio is roughly 0.68, so the data point is assigned to the blue class.

\[\frac{p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg| y=0\right)p(y=0)}{p\left(x=\begin{bmatrix}0.88\\3.95\end{bmatrix}\Bigg|y=1\right)p(y=1)}=\frac{0.0015}{0.0022}=0.68182<1\]

Figure 14. Decision Boundary

It would be convenient to have an explicit expression for this ratio. Given a new \(x\), the prediction problem is then transformed as follows:

\[x\in \text{blue class}\ (y=1)\quad\Leftrightarrow\quad\mathcal{R}=\frac{p(x|y=1)p(y=1)}{p(x|y=0)p(y=0)} > 1\]

which is equivalent to

\[\mathcal{R}=\log\frac{p(x|y=1)p(y=1)}{p(x|y=0)p(y=0)} =\log\frac{\phi}{1-\phi}+\log\frac{\mathcal{N}(x;\mu_1,\Sigma)}{\mathcal{N}(x;\mu_0,\Sigma)}> 0\]

Then,

\[\begin{aligned}\mathcal{R}&=\log\frac{\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)}{\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right)}+\log\frac{\phi}{1-\phi}\\&=-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)+\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)+\log\frac{\phi}{1-\phi}\\&=-\frac{1}{2}x^T\Sigma^{-1}x+\mu_1^T\Sigma^{-1}x-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}x^T\Sigma^{-1}x-\mu_0^T\Sigma^{-1}x+\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0+\log\frac{\phi}{1-\phi}\\&=(\mu_1-\mu_0)^T\Sigma^{-1}x-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0+\log\frac{\phi}{1-\phi}\end{aligned}\]

Here \(\mu_1^T\Sigma^{-1}x=x^T\Sigma^{-1}\mu_1\), because the expression is a real number (for a real number \(a\), \(a=a^T\)) and \(\Sigma^{-1}\) is symmetric, so \(\Sigma^{-T}=\Sigma^{-1}\). Setting \(w^T=(\mu_1-\mu_0)^T\Sigma^{-1}\) and \(w_0=-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0+\log\frac{\phi}{1-\phi}\), we have:

\[\mathcal{R}=\log\frac{p(x|y=1)p(y=1)}{p(x|y=0)p(y=0)} =w^Tx+w_0\]

Plugging our estimated parameters into this formula gives:

\(\mathcal{R}=-3.0279x_1-3.1701x_2+15.575\)

Setting \(\mathcal{R}=0\) gives the decision boundary (Figure 14). Once we have the decision boundary formula, it is straightforward to predict whether a data point \(x\) belongs to the blue or the red class: if \(\mathcal{R}>0\), then \(x\in \text{blue class}\); otherwise, \(x\in \text{red class}\).
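
As a quick Octave sketch (again hard-coding the estimated parameters listed above), we can compute \(w\) and \(w_0\) and evaluate \(\mathcal{R}\) at the data point discussed earlier:

%% octave
mu_0 = [4.0551 4.1008];  mu_1 = [0.85439 1.03622];  phi = 0.5;
Sigma = [1.118822 -0.058976; -0.058976 1.023049];
w  = Sigma \ (mu_1' - mu_0');                                              % roughly [-3.03; -3.17]
w0 = -0.5*mu_1*(Sigma\mu_1') + 0.5*mu_0*(Sigma\mu_0') + log(phi/(1-phi));  % roughly 15.58
R  = @(x) w' * x(:) + w0;
R([0.88007; 3.9501])   % roughly 0.39 > 0, so predicted as the blue class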

Conclusion

Today, we have talked about the Gaussian distribution and its multivariate form. We then assumed two groups of data drawn from Gaussian distributions and applied Gaussian Discriminant Analysis to them. Out of 200 data points, only one is misclassified. In fact, GDA implies a logistic regression (LR) form for the posterior, but the converse does not hold; LR makes weaker assumptions and is therefore the more robust choice when we do not know the distribution of the data. However, if you know the data are drawn from Gaussian distributions, GDA is the better choice.
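
To see concretely why GDA implies the logistic form of the posterior, note that with \(w\) and \(w_0\) defined as above (a short derivation added here for completeness):

\[p(y=1|x)=\frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1)+p(x|y=0)p(y=0)}=\frac{1}{1+e^{-\mathcal{R}}}=\frac{1}{1+e^{-(w^Tx+w_0)}}\]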


Original author: Vincent Spruy

Translator: 程明波

Original English article

Chinese translation

Translator's note: for historical reasons, both "Gaussian distribution" and "normal distribution" refer to the distribution whose probability density function has the form \(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\). In this article I will use the term "normal distribution".

Introduction

In this article, as the title suggests, I will derive the well-known formulas for computing the mean and variance of normally distributed data. If you are not interested in the "why" of this question but only in "when to use which formula", the answer is quite simple:

If you need to estimate both the mean and the variance of a data set (the typical case), then divide by \(N-1\) in the variance formula:

\[\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2\]

In the other case, when the true mean of the population is known, divide by \(N\):

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2\]

However, the former is the case you will encounter more often. Later on, I will give an example of estimating the spread of Gaussian white noise; since the mean of that noise is known to be 0, we only need to estimate the variance there.
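
As a side note, these two normalizations map directly onto Octave's built-in var function, which accepts an optional normalization flag; note that both versions center the data at the sample mean, so the known-mean formula above still has to be written out by hand. A minimal sketch with a made-up data vector:

%% octave
x = randn(1, 1000);   % hypothetical data vector
var(x)                % normalizes by N-1
var(x, 1)             % normalizes by N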

If the data are normally distributed, we can completely characterize the distribution by its mean \(\mu\) and variance \(\sigma^2\). The variance is the square of the standard deviation \(\sigma\), which represents the average distance of each data point from the mean; in other words, the variance describes how spread out the data are. For a normal distribution, 68.3% of the data lie between \(\mu-\sigma\) and \(\mu+\sigma\). The figure below shows the probability density function of a normal distribution with mean \(\mu=10\) and variance \(\sigma^2=3^2=9\).

Figure 1. Probability density function of a normal distribution. For normally distributed data, about 68% of the samples fall within one standard deviation of the mean.

Usually we cannot obtain the data of the whole population. In the example above, the typical situation is that we have a number of observations, but we do not have all possible observations along the x-axis of the figure above. For example, we might have the following observations:

Table 1

Observation ID Observed value
Observation 1  10
Observation 2  12
Observation 3  7
Observation 4  5
Observation 5  11

Now, if we compute the empirical mean by summing all the values and dividing by the number of observations:

\[\mu=\frac{10+12+7+5+11}{5}=9\tag{1}\].

We usually assume that the empirical mean is close to the unknown true mean of the distribution, so we can assume that the observed data come from a normal distribution with mean \(\mu=9\). In this example, the true mean of the distribution is 10, so the empirical mean is indeed close to the true mean.

The variance of the data is then computed as follows:

\[\begin{aligned}\sigma^2&= \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2\\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4}\\&= 8.5.\end{aligned}\tag{2}\]

Again, we usually assume that the empirical variance is close to the true but unknown variance of the underlying distribution. In this example the true variance was 9, so the empirical variance is indeed close to the true variance.

The question at hand is then why the formulas we use to compute the empirical mean and the empirical variance are correct. In fact, another frequently used formula for the variance is defined as follows:

\[\begin{aligned}\sigma^2 &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \\&= \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{5} \\&= 6.8.\end{aligned}\tag{3}\]

The only difference between equation (2) and equation (3) is that the former divides by \(N-1\), whereas the latter divides by \(N\). Both formulas are actually correct; which one to use simply depends on the situation.
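
For reference, the numbers in equations (1)-(3) can be reproduced with a few lines of Octave using the data of Table 1 (just a sketch):

%% octave
x  = [10 12 7 5 11];
N  = numel(x);
mu = mean(x)                    % 9, as in equation (1)
sum((x - mu).^2) / (N - 1)      % 8.5, as in equation (2)
sum((x - mu).^2) / N            % 6.8, as in equation (3)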

In the following sections we will give, for a given set of samples from a normal distribution, a complete derivation of the best estimators of its unknown mean and variance. We will see that in some cases the variance should be divided by \(N\) and in other cases by \(N-1\).

A formula used to approximate a parameter (the mean or the variance) is called an estimator. In the following we denote the true but unknown parameters of the distribution by \(\hat{\mu}\) and \(\hat{\sigma}^2\). The estimators, i.e. the empirical mean and the empirical variance, are denoted by \(\mu\) and \(\sigma^2\).

To find the optimal estimators, we first need an analytical expression for the likelihood of observing a specific data point \(x_i\), given that the population is normally distributed with mean \(\mu\) and standard deviation \(\sigma\). A normal distribution with known parameters is usually denoted \(N(\mu,\sigma^2)\); its likelihood function is:

\[x_i \sim N(\mu,\sigma^2) \Rightarrow P(x_i; \mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}.\tag{4}\]

To compute the mean and the variance we obviously need more than one sample from this distribution. In the following, let \(\vec{x}=(x_1,x_2,\cdots,x_N)\) be the vector containing all available samples (e.g. all the values in Table 1). If all these samples are statistically independent, we can write their joint likelihood as the product of the individual likelihoods:

\[\begin{aligned}P(\vec{x};\mu,\sigma^2)&=P(x_1,x_2,\cdots,x_N;\mu,\sigma^2)\\&=P(x_1;\mu,\sigma^2)P(x_2;\mu,\sigma^2)\cdots P(x_N;\mu,\sigma^2)\\&=\prod_{i=1}^{N}P(x_i;\mu,\sigma^2)\end{aligned}.\tag{5}\]

Plugging equation (4) into equation (5) yields an analytical expression for this joint probability density function:

\[\begin{aligned}P({\vec{x};\mu,\sigma})&=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}\\&=\frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2}\end{aligned}.\tag{6}\]

Equation (6) will be important in the following sections, where we use it to derive the well-known estimators of the mean and the variance of a normal distribution.

Minimum-variance, unbiased estimators

To determine whether an estimator is a "good" estimator, we first need to define what a truly "good" estimator is. How good an estimator is depends on two measures, namely its bias and its variance (yes, we are going to talk about the variance of the estimator of the mean, and about the variance of the estimator of the variance). This section briefly discusses both measures.

Parameter bias

Imagine that we could obtain different (disjoint) subsets of the whole population. In analogy with the earlier example, suppose that besides the data in Table 1 we also had a Table 2 and a Table 3 with completely different observations. A good estimator of the mean would then be an estimator that, averaged over all these subsets, equals the true mean. We can accept that the empirical mean of one particular subset differs from the true mean, but a good estimator should guarantee that the average of the mean estimates over all subsets equals the true mean. Expressed mathematically, this constraint says that the expected value of the estimator should equal the true parameter value:

\[E(\mu)=\hat{\mu}\qquad E(\sigma^2)=\hat{\sigma}^2.\tag{7}\]

If the above condition holds, the estimators are called "unbiased estimators". If it does not hold, the estimators are said to be "biased", meaning that on average they either underestimate or overestimate the true value of the parameter.

Parameter variance

Unbiased estimators guarantee that, on average, their estimates equal the true parameter value. However, this does not mean that each individual estimate is a good one. For example, if the true mean is 10, an unbiased estimator could estimate the mean of one subset of the population as 50 and that of another as -30. The expected value of the estimates would indeed be 10, equal to the true parameter value, but the quality of the estimator clearly also depends on how spread out the individual estimates are. An estimator that produces the estimates (10, 15, 5, 12, 8) for five different subsets of the population is unbiased, and so is an estimator that produces (50, -30, 100, -90, 20) (translator's note: the last value in the original article was 10; I changed it to 20 so that the average is indeed 10). However, all estimates of the first estimator are clearly much closer to the true value than those of the second.

Therefore, a good estimator not only needs a low bias, it also needs a low variance. This variance is expressed as the mean squared error of the estimator:

\[Var(\mu)=E[(\hat{\mu}-\mu)^2]\]

\[Var(\sigma^2)=E[(\hat{\sigma}^2-\sigma^2)^2]\]

A good estimator is therefore one with both low bias and low variance. If an optimal estimator exists, it is an unbiased estimator whose variance is lower than that of any other possible estimator. Such an estimator is called the minimum-variance unbiased (MVU) estimator. In the next section we derive the mathematical expressions for the mean and variance estimators of a normal distribution. We will see that the MVU estimator of the variance of a normal distribution requires dividing by \(N\) under certain assumptions, and by \(N-1\) under others.

Maximum likelihood estimation

Although there are many techniques for obtaining an estimator of a parameter based on a subset of the population, the simplest of all is probably maximum likelihood estimation.

The probability of the observations \(\vec{x}\) was defined in equation (6) as \(P(\vec{x};\mu,\sigma^2)\). If we fix \(\mu\) and \(\sigma^2\) in this function and let \(\vec{x}\) vary, we obtain the normal distribution of Figure 1. However, we can also fix \(\vec{x}\) and let \(\mu\) and/or \(\sigma^2\) vary instead. For example, we can take \(\vec{x}=(10,12,7,5,11)\) as in the earlier example, fix \(\mu=10\), and let \(\sigma^2\) vary. Figure 2 shows the resulting curve for different values of \(\sigma^2\), with \(\vec{x}\) and \(\mu\) fixed:

Figure 2. The likelihood function for the specific observations \(\vec{x}\), with \(\mu=10\) fixed and \(\sigma^2\) varying.

In the figure above, we computed the likelihood function \(P(\vec{x};\sigma^2)\) by fixing \(\mu=10\) and letting \(\sigma^2\) vary. Each point on the resulting curve represents the likelihood that the observations \(\vec{x}\) are a sample from a normal distribution with parameter \(\sigma^2\). The parameter value with the maximum likelihood is therefore the one under which the data are most likely to have been generated by the distribution we defined, so we can determine the optimal \(\sigma^2\) by finding the maximum of the likelihood curve. In this example the maximum is at \(\sigma^2=7.8\), so the standard deviation is \(\sqrt{\sigma^2}\approx 2.8\). Indeed, if we compute the variance in the traditional way with \(\mu=10\) given, we find that it is exactly 7.8:

\[\frac{(10-10)^2+(12-10)^2+(7-10)^2+(5-10)^2+(11-10)^2}{5}=7.8\]

Thus, the formula for computing the variance from sample data is obtained simply by locating the peak of the maximum likelihood function. Moreover, instead of fixing \(\mu\), we can let both \(\mu\) and \(\sigma^2\) vary and then find the maximum of the likelihood function in both dimensions, yielding both estimators.
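
To make Figure 2 concrete, here is a small Octave sketch that evaluates the log-likelihood of the Table 1 data on a grid of \(\sigma^2\) values with \(\mu=10\) fixed (the grid range is an arbitrary choice of mine); the maximizer lands close to 7.8.

%% octave
pkg load statistics
x = [10 12 7 5 11];
mu = 10;
s_grid = linspace(0.5, 30, 2000);                                   % candidate values of sigma^2
loglik = arrayfun(@(s) sum(log(normpdf(x, mu, sqrt(s)))), s_grid);  % log-likelihood for each candidate
[~, idx] = max(loglik);
s_grid(idx)   % close to 7.8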

Finding the maximum of a function is also simple: just take its derivative and set it to zero. To find the maximum of a function of two variables, we compute the partial derivative with respect to each variable and set both of them to zero. In the following, let \(\hat{\mu}_{ML}\) denote the optimal estimator of the population mean obtained by maximum likelihood, and \(\hat{\sigma}^2_{ML}\) the optimal estimator of the variance. To maximize the likelihood function, we simply compute its (partial) derivatives and set them to zero, as follows:

\[\begin{aligned} &\hat{\mu}_{ML} = \arg\max_\mu P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0 \end{aligned}\]

\[\begin{aligned} &\hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0 \end{aligned}\]

In the next section we will use this technique to obtain the MVU estimators of both \(\mu\) and \(\sigma^2\). We consider two situations:

In the first situation we assume that the true mean \(\hat{\mu}\) of the distribution is known, so we only need to estimate the variance. The problem then becomes finding the maximum of a one-dimensional likelihood function parameterized by \(\sigma^2\). This situation does not occur very often, but it does exist in practice. For example, if we know that a signal (e.g. the colour value of a pixel in an image) should have a specific value, but the signal has been corrupted by white noise (Gaussian noise with zero mean), then the mean of the distribution is known and we only need to estimate the variance.

The second situation deals with the case in which neither the true mean nor the true variance is known. This is the most common situation, and here we need to estimate both the mean and the variance from the sample data.

We will see below that each situation yields a different MVU estimator. Specifically, in the first situation the variance estimator is normalized by \(N\), whereas in the second it is normalized by \(N-1\).

Estimating the variance when the mean is known

Parameter estimation

If the true mean of the distribution is known, the likelihood function has only a single parameter, \(\sigma^2\). Finding the maximum likelihood estimator then amounts to solving:

\[\hat{\sigma^2}_{ML}=\arg\max_{\sigma^2} P(\vec{x};\sigma^2).\tag{8}\]

However, computing the derivative of \(P(\vec{x};\sigma^2)\) as defined by equation (6) involves differentiating the exponential in the function. It is in fact much easier to differentiate the log-likelihood than the likelihood itself: since the logarithm is a monotonically increasing function, its maximum is located at the same position as the maximum of the original likelihood function. We therefore solve the following instead:

\[\hat{\sigma}^2_{ML}=\arg\max_{\sigma^2}\log(P(\vec{x};\sigma^2)).\tag{9}\]

In the following I write \(s=\sigma^2\) to simplify the notation. We maximize the log-likelihood by setting the derivative of the logarithm of equation (6) to zero:

\[\begin{aligned}&\frac{\partial \log(P(\vec{x};\sigma^2))}{\partial \sigma^2}=0\\&\Leftrightarrow\frac{\partial\log(P(\vec{x};s))}{\partial s}=0\\&\Leftrightarrow\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi s)^{\frac{N}{2}}}e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2} \right)=0\\&\Leftrightarrow\frac{\partial}{\partial s}\log\left(\frac{1}{(2\pi)^{\frac{N}{2}}}\right)+\frac{\partial}{\partial s}\log\left(s^{-\frac{N}{2}}\right)+\frac{\partial}{\partial s} \log\left(e^{-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2}\right )=0\\&\Leftrightarrow 0+\frac{\partial}{\partial s}\left(-\frac{N}{2}\log s\right)+\frac{\partial}{\partial s}\left(-\frac{1}{2s}\sum_{i=1}^{N}(x_i-\mu)^2\right)=0\\&\Leftrightarrow -\frac{N}{2s}+\frac{1}{2s^2}\sum_{i=1}^{N}(x_i-\mu)^2=0\\&\Leftrightarrow \frac{N}{2s^2}\left(\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2-s\right)=0\end{aligned}\]

Clearly, if \(N>0\), the only solution of this equation is:

\[s=\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2.\tag{10}\]

Note that the maximum likelihood estimator of \(\hat{\sigma}^2\) is indeed the formula traditionally used to compute the variance, here with normalization factor \(\frac{1}{N}\).

However, maximum likelihood estimation does not guarantee that the resulting estimator is unbiased. Conversely, even if the estimator is unbiased, maximum likelihood does not guarantee that it has minimum variance, i.e. that it is MVU. We therefore need to check whether the estimator in equation (10) is unbiased.

Performance evaluation

To determine whether the estimator in equation (10) is unbiased, we need to check whether the equality of equation (7) holds, i.e. whether

\[E(s)=\hat{s}.\]

Plugging equation (10) into \(E(s)\) and working it out gives:

\[\begin{aligned}E[s] &= E \left[\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[x_i^2 - 2x_i \mu + \mu^2 \right]\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu E[x_i] + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -2N \mu^2 + N \mu^2 \right)\\&= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right)\end{aligned}\]

Furthermore, the true variance \(\hat{s}\) has the very important property \(\hat{s}=E[x_i^2]-E[x_i]^2\), which can be rewritten as \(E[x_i^2]=\hat{s}+E[x_i]^2=\hat{s}+\mu^2\). Using this property in the expression above, we obtain:

\[\begin{aligned}E[s]&=\frac{1}{N}(N E[x_i^2]-N\mu^2)\\&=\frac{1}{N}(N\hat{s}+N\mu^2-N\mu^2)\\&=\frac{1}{N}(N\hat{s})\\&=\hat{s}\end{aligned}\]

The condition \(E[s]=\hat s\) of equation (7) is satisfied, so the estimator we obtained for the data variance \(\hat s\) is unbiased. Furthermore, since a maximum likelihood estimator that is unbiased is also minimum variance (MVU), this means that our estimator performs at least as well as any other possible estimator.

Therefore, when the true mean of the distribution is known, we compute the variance of normally distributed data by dividing by \(N\), not by \(N-1\).
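
This is easy to verify empirically. The Octave sketch below repeatedly draws small samples from \(N(10,9)\), estimates the variance using the known mean and the \(\frac{1}{N}\) normalization, and averages the estimates (the sample size and the number of trials are arbitrary choices of mine):

%% octave
N = 5; trials = 100000;
mu_true = 10; sigma_true = 3;
est = zeros(trials, 1);
for t = 1:trials
  x = mu_true + sigma_true * randn(N, 1);   % sample from N(10, 9)
  est(t) = sum((x - mu_true).^2) / N;       % known mean, divide by N
end
mean(est)   % close to the true variance 9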

Estimating the variance when the mean is unknown

Parameter estimation

In the previous section the true mean of the distribution was known, so we only needed to estimate the variance of the data. If the true mean is unknown, however, an estimator of the mean has to be computed as well.

Moreover, the variance estimator then has to make use of this mean estimator. We will see that, as a consequence, the variance estimator we obtained earlier is no longer unbiased. We can make it unbiased again by slightly increasing its value, namely by dividing by \(N-1\) instead of \(N\).

As before, we compute both estimators by maximum likelihood, based on the log-likelihood function. We first compute the maximum likelihood estimator of \(\hat\mu\):

\[\begin{aligned}&\frac{\partial \log(P(\vec{x}; s, \mu))}{\partial \mu} = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial \mu} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\&\Leftrightarrow \frac{\partial}{\partial \mu} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\&\Leftrightarrow -\frac{1}{2s}\frac{\partial}{\partial \mu} \left(\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\&\Leftrightarrow -\frac{1}{2s} \left(\sum_{i=1}^N -2(x_i - \mu) \right) = 0\\&\Leftrightarrow \frac{1}{s} \left(\sum_{i=1}^N (x_i - \mu) \right) = 0 \\&\Leftrightarrow \frac{N}{s} \left( \frac{1}{N} \sum_{i=1}^N x_i - \mu \right) = 0 \end{aligned}\]

Clearly, if \(N>0\), this equation has only one solution:

\[\mu=\frac{1}{N}\sum_{i=1}^{N}x_i.\tag{11}\]

Note that this is indeed the well-known formula for computing the mean of a distribution. Although we all know this formula, we have now proved that the maximum likelihood estimator does estimate the true, unknown mean of a normal distribution. For now, let us assume that the variance estimator derived earlier in equation (10) is still the MVU estimator of the variance of \(\hat s\); in the next section, however, we will show that this estimator is now biased.

Performance evaluation

To determine whether the condition of equation (7) holds, we check whether the estimator \(\mu\) of the true mean \(\hat \mu\) is unbiased:

\[E[\mu]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_i\right]=\frac{1}{N}\sum_{i=1}^N E[x_i]=\frac{1}{N}N E[x_i]=\frac{1}{N} N \hat\mu=\hat\mu.\]

Since \(E[\mu]=\hat\mu\), the estimator of the distribution mean is unbiased. Because maximum likelihood estimation guarantees minimum variance when the resulting estimator is unbiased, we have thereby shown that \(\mu\) is the MVU estimator of the mean.

We now check whether the variance estimator \(s\), which is based on the empirical mean \(\mu\) rather than the true mean \(\hat\mu\), is still an unbiased estimator of the true variance \(\hat s\). We simply plug the estimator \(\mu\) into the formula derived earlier in equation (10):

\[\begin{aligned} s &= \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\&=\frac{1}{N}\sum_{i=1}^N \left(x_i - \frac{1}{N} \sum_{j=1}^N x_j \right)^2\\&=\frac{1}{N}\sum_{i=1}^N \left[x_i^2 - \frac{2 x_i}{N} \sum_{j=1}^N x_j + \left(\frac{1}{N} \sum_{j=1}^N x_j \right)^2 \right]\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{j=1}^N x_j}{N^2} + \left(\frac{\sum_{j=1}^N x_j}{N} \right)^2\\&=\frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\end{aligned}\]

We again need to check whether the condition of equation (7) holds to determine whether this estimator is unbiased:

\[\begin{aligned} E[s]&= E \left[ \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \right ]\\&= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \end{aligned}\]

Recall the very important property of the variance used before: the true variance \(\hat s\) can be written as \(\hat s = E[x_i^2]-E[x_i]^2\), i.e. \(E[x_i^2]=\hat s + E[x_i]^2=\hat s +\mu^2\). Using this property, we can derive:

\[\begin{aligned} E[s] &= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= \hat s + \mu^2 - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2}\\&= \hat s + \mu^2 - \frac{E[\sum_{i=1}^N x_i^2 + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2}\\&= \hat s + \mu^2 - \frac{N(\hat s+\mu^2) + \sum_i^N \sum_{j\neq i}^N E[x_i] E[x_j]}{N^2}\\&= \hat s + \mu^2 - \frac{N(\hat s+\mu^2) + N(N-1)\mu^2}{N^2}\\&= \hat s + \mu^2 - \frac{N(\hat s+\mu^2) + N^2\mu^2 -N\mu^2}{N^2}\\&= \hat s + \mu^2 - \frac{\hat s+\mu^2 + N\mu^2 -\mu^2}{N}\\&= \hat s + \mu^2 - \frac{\hat s}{N} - \frac{\mu^2}{N} - \mu^2 + \frac{\mu^2}{N}\\&= \hat s - \frac{\hat s}{N}\\&= \hat s \left( 1 - \frac{1}{N} \right)\\&= \hat s \left(\frac{N-1}{N} \right) \end{aligned}\]

Clearly \(E[s]\neq\hat s\), so this estimator of the distribution variance is no longer unbiased. In fact, on average it underestimates the true variance by a factor of \(\frac{N-1}{N}\). As the number of samples approaches infinity (\(N\rightarrow\infty\)), this bias converges to zero. For small sample sets, however, the bias is significant and needs to be eliminated.
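
The bias factor shows up clearly in a quick simulation. The Octave sketch below repeats the earlier experiment, except that the empirical mean is used instead of the true mean; with \(N=5\) and true variance 9, the average estimate comes out near \(9\cdot\frac{4}{5}=7.2\) rather than 9:

%% octave
N = 5; trials = 100000;
mu_true = 10; sigma_true = 3;
est = zeros(trials, 1);
for t = 1:trials
  x = mu_true + sigma_true * randn(N, 1);
  est(t) = sum((x - mean(x)).^2) / N;   % empirical mean, divide by N
end
mean(est)   % close to 9 * (N-1)/N = 7.2, not 9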

Fixing the bias

Since the bias is merely a constant factor, we can simply multiply the estimator of equation (10) by the inverse of that factor. This defines the following unbiased estimator \(s'\):

\[\begin{aligned} s\prime &= \left ( \frac{N-1}{N} \right )^{-1} s\\s\prime &= \left ( \frac{N-1}{N} \right )^{-1} \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s\prime &=\left ( \frac{N}{N-1} \right ) \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\s\prime &= \frac{1}{N-1}\sum_{i=1}^N(x_i - \mu)^2\end{aligned}\]

This estimator is now unbiased, and indeed it closely resembles the traditional formula for computing the variance, except that it divides by \(N-1\) instead of \(N\). Note, however, that this estimator is no longer the minimum-variance estimator, although it is the one with the smallest variance among all unbiased estimators. If we divide by \(N\), the estimator is biased; if we divide by \(N-1\), it is not the minimum-variance estimator. In general, though, a biased estimator is much worse than an estimator with a slightly higher variance. Therefore, when the mean of the population is unknown, the variance should be divided by \(N-1\) instead of \(N\).

Conclusion

In this article we derived the well-known formulas for computing the mean and the variance of normally distributed data. Furthermore, we proved that the normalization factor of the variance estimator is \(\frac{1}{N}\) when the true mean of the population is known, and \(\frac{1}{N-1}\) when the mean also has to be estimated.

PDF of this article
