Introduction

We have:

  • a dataset X in R^D consisting of N vectors (N training examples)

  • N vectors X_1, …, X_N, where each X_i is a D-dimensional vector

Objective:

  • find a low-dimensional representation of the data that is as similar to X as possible.

Three important concepts

1. linear combination of the basis vectors

The first one is that every vector in R^D can be represented as a linear combination of the basis vectors.

X_n can be written as

    X_n = Σ_{i=1}^{D} β_in b_i

  • X_n is a linear combination of the D basis vectors b_i
  • b_1, …, b_D are an orthonormal basis of R^D
  • orthonormal: the vectors are pairwise orthogonal (perpendicular to each other) and have unit length
    (Source: Coursera, Mathematics for Machine Learning: PCA, Week 4)
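A minimal numerical sketch of this property: the basis, dimensions, and vector below are illustrative assumptions (a random orthonormal basis built via QR), not from the lecture.

```python
import numpy as np

# Concept 1: with an orthonormal basis b_1..b_D of R^D, any vector x is the
# sum of beta_i * b_i, where beta_i = b_i^T x.
D = 3
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.normal(size=(D, D)))  # columns b_1..b_D, orthonormal

x = np.array([2.0, -1.0, 0.5])
betas = B.T @ x          # beta_i = b_i^T x
x_rebuilt = B @ betas    # sum_i beta_i * b_i

print(np.allclose(x, x_rebuilt))  # the linear combination recovers x exactly
```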

2. orthogonal projection onto 1-D subspace

We can interpret β_in as the coordinate of the orthogonal projection of X_n onto the one-dimensional subspace spanned by the i-th basis vector b_i.
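A small sketch of this interpretation, with an assumed unit vector and data point in R^2:

```python
import numpy as np

# Concept 2: beta = b^T x is the coordinate of the orthogonal projection of
# x onto span{b}; the projection itself is beta * b.
b = np.array([3.0, 4.0]) / 5.0   # a unit-length basis vector in R^2
x = np.array([2.0, 1.0])

beta = b @ x                      # projection coordinate: b^T x
proj = beta * b                   # orthogonal projection of x onto span{b}

# Orthogonality check: the residual x - proj is perpendicular to b.
print(np.isclose((x - proj) @ b, 0.0))
```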

3. orthogonal projection of X onto the subspace spanned by the M basis vectors

Suppose we have orthonormal basis vectors b_1 to b_M (spanning an M-dimensional subspace of R^D), and we define B to be the D×M matrix whose columns are these orthonormal basis vectors.

Then the projection of X onto the subspace, written X̃, is

    X̃ = B B^T X

That means X̃ is the orthogonal projection of X onto the subspace spanned by the M basis vectors.

And B^T X gives the coordinates of X̃ with respect to the basis vectors collected in the matrix B. These coordinates are also called the code.
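A sketch of the projection and the code, using an assumed pair of orthonormal columns in R^3 (here the first two standard basis vectors, for readability):

```python
import numpy as np

# Concept 3: with M orthonormal columns in B (shape D x M), the code is
# B^T x (M numbers) and the projection is x_tilde = B @ B.T @ x.
D, M = 3, 2
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # b_1, b_2: orthonormal columns (a plane in R^3)
x = np.array([2.0, -1.0, 5.0])

code = B.T @ x                   # coordinates of x_tilde w.r.t. b_1, b_2
x_tilde = B @ code               # orthogonal projection onto span{b_1, b_2}

print(code)      # [ 2. -1.]
print(x_tilde)   # [ 2. -1.  0.]
```

Note that x_tilde still has D components, but it is fully described by the M numbers in the code.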

PCA

The key idea in PCA

Find a lower-dimensional representation X̃_n that can be expressed using fewer basis vectors, say M.

Assumptions:

  • The data is centered, i.e., the dataset has mean zero.
  • b_1, …, b_D are an orthonormal basis of R^D.

Generally, we can write any X~n\widetilde X_n in the following way:
    X̃_n = Σ_{i=1}^{M} β_in b_i + Σ_{i=M+1}^{D} β_in b_i

This entire expression is still living in R^D. We took our general way of writing any vector in R^D, which comes from property one, and split the sum from property one into two sums: one living in an M-dimensional subspace, the other living in a (D − M)-dimensional subspace that is the orthogonal complement of the first.

In PCA, we ignore the second term, so we keep only the component in the M-dimensional subspace.

    X̃_n = Σ_{i=1}^{M} β_in b_i
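This split can be sketched numerically. Everything below (the random orthonormal basis via QR, the dimensions) is an illustrative assumption:

```python
import numpy as np

# With a full orthonormal basis of R^D, the exact expansion of x separates
# into the first M terms (kept by PCA) and the remaining D - M terms
# (dropped); the two parts live in orthogonal subspaces.
D, M = 4, 2
rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.normal(size=(D, D)))  # columns b_1..b_D
x = rng.normal(size=D)

betas = B.T @ x
kept    = B[:, :M] @ betas[:M]     # sum over i = 1..M   (principal subspace)
dropped = B[:, M:] @ betas[M:]     # sum over i = M+1..D (orthogonal complement)

print(np.allclose(x, kept + dropped))  # the two parts recover x exactly
x_tilde = kept                          # PCA keeps only the first sum
```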

the principal subspace

And then we call the subspace that is spanned by the basis vectors b_1 to b_M the principal subspace. So b_1 to b_M span the principal subspace.

Although X̃_n is still a D-dimensional vector, it lives in an M-dimensional subspace of R^D, and only M coordinates, β_1n to β_Mn, are necessary to represent it.

These β values are the coordinates of the vector X̃_n. The β_in are also called the code: the coordinates of X̃_n with respect to the basis vectors b_1 to b_M.

Objective

The setting is now as follows. Assuming we have data X_1 to X_N, we want to find parameters β_in and orthonormal basis vectors b_i such that the average squared reconstruction error is minimised.

the average squared reconstruction error

J: the average squared reconstruction error
    J = (1/N) Σ_{n=1}^{N} ‖X_n − X̃_n‖²
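A sketch of computing J for a subspace spanned by orthonormal columns; the toy dataset and the two candidate subspaces below are assumptions for illustration:

```python
import numpy as np

def avg_sq_reconstruction_error(X, B):
    """X: (N, D) centered data; B: (D, M) orthonormal basis of the subspace."""
    X_tilde = X @ B @ B.T            # project every row onto the subspace
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))

X = np.array([[ 1.0,  1.0],
              [-1.0, -1.0]])         # toy centered dataset in R^2
B_good = np.array([[1.0], [1.0]]) / np.sqrt(2.0)  # subspace along the data
B_bad  = np.array([[1.0], [0.0]])                 # the x-axis

print(avg_sq_reconstruction_error(X, B_good))  # 0.0 (data lies in the subspace)
print(avg_sq_reconstruction_error(X, B_bad))   # 1.0
```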

example

Let’s have a look at an example. We have data living in two dimensions, and we want to find a good one-dimensional subspace such that the average squared reconstruction error between the original data points and their corresponding projections is minimised.

Here I’m plotting the original dataset with its corresponding projections onto one-dimensional subspaces, cycling through a couple of candidate subspaces. You can see that some of these projections are significantly more informative than others; in PCA we are going to find the best one. Our approach is to compute the partial derivatives of J with respect to the parameters.
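The cycling-through-subspaces demo can be sketched as a brute-force sweep; the synthetic dataset below (points spread along the direction (2, 1) with small noise) is a hypothetical stand-in for the lecture's data:

```python
import numpy as np

# Sweep candidate 1-D subspaces of R^2, parameterised by angle, and find the
# direction that minimises the average squared reconstruction error.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]])   # points along (2, 1)
X += 0.1 * rng.normal(size=X.shape)                       # small noise
X -= X.mean(axis=0)                                       # center the data

angles = np.linspace(0.0, np.pi, 180, endpoint=False)
errors = []
for theta in angles:
    b = np.array([np.cos(theta), np.sin(theta)])          # unit basis vector
    X_tilde = np.outer(X @ b, b)                          # project each row
    errors.append(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))

best = angles[int(np.argmin(errors))]
print(np.degrees(best))   # close to atan2(1, 2) ≈ 26.6 degrees
```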

The parameters are the β_in and the b_i.

We set the partial derivatives of J with respect to these parameters to zero and solve for the optimal parameters. One observation we can already make: the parameters only enter this loss function through X̃_n.

This means that in order to get our partial derivatives, we need to apply the chain rule:

    ∂J/∂β_in = (∂J/∂X̃_n)(∂X̃_n/∂β_in)   and   ∂J/∂b_i = (∂J/∂X̃_n)(∂X̃_n/∂b_i)

We can already compute the first part.
    ∂J/∂X̃_n = −(2/N) (X_n − X̃_n)^T
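This first factor can be checked numerically by comparing the analytic gradient against central finite differences, treating X̃_n as a free variable of J; the data and dimensions below are arbitrary test values:

```python
import numpy as np

# Check dJ/dx_tilde_n = -(2/N) (x_n - x_tilde_n)^T numerically.
rng = np.random.default_rng(3)
N, D = 5, 3
X = rng.normal(size=(N, D))
X_tilde = rng.normal(size=(N, D))

def J(X_tilde):
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))

analytic = -2.0 / N * (X - X_tilde)     # gradient w.r.t. each x_tilde_n

eps = 1e-6
numeric = np.zeros_like(X_tilde)
for n in range(N):
    for d in range(D):
        up, down = X_tilde.copy(), X_tilde.copy()
        up[n, d] += eps
        down[n, d] -= eps
        numeric[n, d] = (J(up) - J(down)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # gradients agree
```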
