Introduction
We have:
- a dataset consisting of N vectors (N training examples)
- N vectors x_1, …, x_N, where the x_n are D-dimensional vectors, x_n ∈ ℝ^D
Objective:
- find a low-dimensional representation x̃_n of the data that is as similar to x_n as possible.
Three important concepts
1. linear combination of the basis vectors
The first one is that every vector x in ℝ^D can be represented as a linear combination of the basis vectors.
x can be written as x = Σ_{i=1}^{D} β_i b_i.
- a linear combination of the D basis vectors
- b_1, …, b_D are an orthonormal basis of ℝ^D
- orthonormal: the basis vectors are mutually orthogonal (perpendicular to each other) and have unit length
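A minimal sketch of this property, using numpy and a made-up orthonormal basis of ℝ² (a 45-degree rotation of the standard basis): the coordinates β_i = b_i^T x recover x exactly as a linear combination of the basis vectors.

```python
import numpy as np

# A toy orthonormal basis of R^2 and an arbitrary vector x.
b1 = np.array([1.0, 1.0]) / np.sqrt(2)
b2 = np.array([1.0, -1.0]) / np.sqrt(2)
x = np.array([3.0, 1.0])

# Coordinates of x with respect to the basis: beta_i = b_i^T x.
beta1 = b1 @ x
beta2 = b2 @ x

# x is recovered exactly as the linear combination sum_i beta_i * b_i.
x_reconstructed = beta1 * b1 + beta2 * b2
print(np.allclose(x, x_reconstructed))  # True
```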
2. orthogonal projection onto 1-D subspace
We can interpret β_i b_i as the orthogonal projection of x onto the one-dimensional subspace spanned by the basis vector b_i, where β_i = b_i^T x.
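As a quick numerical sketch (toy vectors, numpy assumed): the projection of x onto span{b} is β b with β = b^T x, and the residual x − β b is orthogonal to the subspace.

```python
import numpy as np

b = np.array([1.0, 0.0])   # unit-length basis vector of the 1-D subspace
x = np.array([2.0, 3.0])

beta = b @ x               # coordinate of the projection: beta = b^T x
projection = beta * b      # the projected vector beta * b

# The residual x - projection is orthogonal to the subspace.
print(projection)                              # [2. 0.]
print(np.isclose((x - projection) @ b, 0.0))   # True
```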
3. orthogonal projection of X onto the subspace spanned by the M basis vectors
If we have an orthonormal basis b_1 to b_M of an M-dimensional subspace of ℝ^D and we define B = [b_1, …, b_M] to be the matrix that consists of these orthonormal basis vectors,
then the projection of x onto the subspace can be written as x̃ = B B^T x.
That means B B^T x is the orthogonal projection of x onto the subspace spanned by the M basis vectors.
And B^T x are the coordinates of x̃ with respect to the basis vectors collected in the matrix B. This is also called the code, so coordinates or code.
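A small sketch of this third concept (toy data, numpy assumed): B collects two orthonormal basis vectors of a 2-D subspace of ℝ³, B^T x gives the code, and B B^T x gives the projection.

```python
import numpy as np

# B holds orthonormal basis vectors column-wise, so B^T B = I.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])

code = B.T @ x      # coordinates ("code") of the projection, B^T x
x_tilde = B @ code  # x_tilde = B B^T x, the orthogonal projection of x

print(code)         # [1. 2.]
print(x_tilde)      # [1. 2. 0.]
```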
PCA
The key idea in PCA
To find a lower-dimensional representation of the data that can be expressed using fewer basis vectors, let’s say M.
Assumptions:
- The data is centered, that means the dataset has mean zero.
- b_1 to b_D are an orthonormal basis of ℝ^D.
Generally, we can write any x_n in the following way:
x_n = Σ_{i=1}^{M} β_in b_i + Σ_{i=M+1}^{D} β_in b_i
This entire expression is still living in ℝ^D. So we took our general way of writing any vector in ℝ^D, which comes from property one, and split the sum in property one into two sums. One is living in an M-dimensional subspace and the other one is living in a (D − M)-dimensional subspace, which is the orthogonal complement of this particular subspace.
In PCA, we ignore the second term, so we get rid of the sum from i = M + 1 to D and keep x̃_n = Σ_{i=1}^{M} β_in b_i.
the principal subspace
And then we call the subspace that is spanned by the basis vectors b_1 to b_M the principal subspace. So b_1 to b_M span the principal subspace.
Although x̃_n is still a D-dimensional vector, it lives in an M-dimensional subspace of ℝ^D, and only M coordinates, β_1n to β_Mn, are necessary to represent it.
So these β_in are the coordinates of this vector. The β_in are also called the code: the coordinates of x̃_n with respect to the basis vectors b_1 to b_M.
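To make the truncation concrete, here is a toy numpy example (made-up vector, standard basis of ℝ³, M = 2): x̃_n is still three-dimensional, but two coordinates suffice to describe it.

```python
import numpy as np

# Orthonormal basis of R^3 (standard basis, for illustration) and a toy x_n.
b = [np.array([1.0, 0.0, 0.0]),
     np.array([0.0, 1.0, 0.0]),
     np.array([0.0, 0.0, 1.0])]
x_n = np.array([2.0, -1.0, 4.0])
M = 2

beta = [bi @ x_n for bi in b]   # coordinates beta_in = b_i^T x_n

# PCA keeps only the first M terms of the expansion:
x_tilde = sum(beta[i] * b[i] for i in range(M))

print(x_tilde)                  # [ 2. -1.  0.] -- still 3-D, but in a 2-D subspace
print(np.array(beta[:M]))       # [ 2. -1.] -- the code: M numbers suffice
```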
Objective
And the setting now is as follows. Assuming we have data x_1 to x_N, we want to find parameters β_in and orthonormal basis vectors b_i such that the average squared reconstruction error is minimised.
the average squared reconstruction error
J: the average squared reconstruction error, J = (1/N) Σ_{n=1}^{N} ‖x_n − x̃_n‖²
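A short sketch of evaluating J for a candidate 1-D subspace, on synthetic centered data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = X - X.mean(axis=0)           # center the data (mean zero)
N = X.shape[0]

b = np.array([1.0, 0.0])         # candidate unit basis vector of a 1-D subspace
X_tilde = (X @ b)[:, None] * b   # project every x_n onto span{b}

# Average squared reconstruction error J = (1/N) * sum_n ||x_n - x_tilde_n||^2
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(J >= 0)                    # True
```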
example
Let’s have a look at an example. We have data living in two dimensions, and now we want to find a good one-dimensional subspace such that the average squared reconstruction error between the original data points and their corresponding projections is minimised.
Here I’m plotting the original dataset with its corresponding projections onto one-dimensional subspaces, cycling through a couple of candidate subspaces. You can see that some of these projections are significantly more informative than others, and in PCA we are going to find the best one. Our approach is to compute the partial derivatives of J with respect to the parameters.
The parameters are the β_in and the b_i.
We set the partial derivatives of J with respect to these parameters to zero and solve for the optimal parameters. But one observation we can already make: the parameters only enter this loss function through x̃_n.
This means that in order to get our partial derivatives, we need to apply the chain rule. So, ∂J/∂β_in (or ∂J/∂b_i) can be written as ∂J/∂x̃_n times ∂x̃_n/∂β_in (or ∂x̃_n/∂b_i).
We can already compute the first part: ∂J/∂x̃_n = −(2/N) (x_n − x̃_n)^T.
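As a sanity check on this first factor, here is a finite-difference sketch (random toy data, numpy assumed) comparing ∂J/∂x̃_n = −(2/N)(x_n − x̃_n)^T against a numerical gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
X = rng.normal(size=(N, D))        # toy data points x_n
X_tilde = rng.normal(size=(N, D))  # toy reconstructions x_tilde_n

def J(X_tilde):
    # Average squared reconstruction error (1/N) * sum_n ||x_n - x_tilde_n||^2
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# Analytic gradient with respect to x_tilde_0:
grad_analytic = -(2.0 / N) * (X[0] - X_tilde[0])

# Central finite-difference gradient:
eps = 1e-6
grad_fd = np.zeros(D)
for d in range(D):
    Xp = X_tilde.copy(); Xp[0, d] += eps
    Xm = X_tilde.copy(); Xm[0, d] -= eps
    grad_fd[d] = (J(Xp) - J(Xm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))  # True
```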