Is my Matlab code correct for applying PCA to data? Answers

【Question title】: Is my Matlab code correct for applying PCA to data?
【Posted】: 2014-09-24 11:15:42
【Question】:

I have the following code for computing PCA in Matlab:

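train_size = size(train,1); % number of training observations (rows)
test_size = size(test,1); % number of test observations (rows)
train_out = train';
test_out = test';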
% subtract off the mean for each dimension
mn = mean(train_out,2);
train_out = train_out - repmat(mn,1,train_size);
test_out = test_out - repmat(mn,1,test_size);
% calculate the covariance matrix
covariance = 1 / (train_size-1) * train_out * train_out';
% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);
% extract diagonal of matrix as vector
V = diag(V);
% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);
% project the original data set
out = PC' * train_out;
train_out = out';
out = PC' * test_out;
test_out = out';

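The train and test matrices have observations in rows and feature variables in columns. When I classify the original data (without PCA), I get much better results than with PCA, even when I keep all dimensions. When I tried doing PCA directly on the whole dataset (train + test), I noticed that the correlations between these new principal components and the previous ones are all close to 1 or close to -1, which I find strange. I am probably doing something wrong, but I just can't figure out what.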
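One check that may help narrow this down: when every component is kept, the projection is an orthogonal change of basis, so pairwise distances between observations should not change. The sketch below runs on synthetic data, not the asker's data; `X`, `sqdist`, and the seed are made up for the example.

```matlab
% Minimal sketch on synthetic data (X is a hypothetical stand-in for train).
rng(0);                               % fixed seed so the run is repeatable
X = randn(50, 4) * randn(4, 4);       % 50 observations, 4 correlated features
Xc = bsxfun(@minus, X, mean(X, 1));   % center each feature
C = (Xc' * Xc) / (size(Xc, 1) - 1);   % covariance matrix (features x features)
C = (C + C') / 2;                     % enforce exact symmetry before eig
[PC, D] = eig(C);                     % columns of PC are eigenvectors
[~, order] = sort(diag(D), 'descend');
PC = PC(:, order);                    % largest-variance component first
Y = Xc * PC;                          % scores with all dimensions kept

% PC should be orthogonal, so the projection is only a rotation/reflection.
orth_err = norm(PC' * PC - eye(size(PC, 2)));

% Squared pairwise distances before and after projection (base MATLAB only).
sqdist = @(A) bsxfun(@plus, sum(A.^2, 2), sum(A.^2, 2)') - 2 * (A * A');
dist_err = max(max(abs(sqdist(Xc) - sqdist(Y))));

fprintf('orthogonality error: %g\n', orth_err);
fprintf('max distance change: %g\n', dist_err);
```

If both numbers come out near machine precision, the projection is behaving as a pure rotation. A drop in accuracy would then more likely come from the classifier (some are not rotation invariant) or from preprocessing around the PCA step.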

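【Comments】:

Is there some restriction that prevents you from using svd? mathworks.ch/ch/help/matlab/ref/svd.html
No restriction, but the output should be the same, right? I guess I could try another implementation and see what I get.
As far as I know, this is the standard way to do PCA in MATLAB. So unless you want to tweak the procedure, I think it would be easier to use it. Since it looks like something is going wrong in your PCA computation, I hope svd gives you good results.
I tried three other implementations and they all give the same numbers; only the method I posted here gives a different sign for some principal components. But that should mean they all do the job correctly. I just don't understand why the classification accuracy is so much lower on the different data sets.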
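On the sign differences mentioned in the comments: an eigenvector is only defined up to sign, so eig-based and svd-based PCA can legitimately return components that differ by a factor of -1. The same ambiguity would produce correlations near 1 or -1 between matching components. Here is a minimal sketch on made-up data (the matrix `X` and all variable names are assumptions for illustration) that compares the two routes after aligning signs:

```matlab
rng(1);                                        % fixed seed for repeatability
X = randn(40, 3) * [3 0 0; 1 2 0; 0 1 0.5];    % hypothetical data, rows = observations
Xc = bsxfun(@minus, X, mean(X, 1));            % center each feature

% Route 1: eigen-decomposition of the covariance matrix.
C = (Xc' * Xc) / (size(Xc, 1) - 1);
C = (C + C') / 2;                              % enforce exact symmetry
[PC_eig, D] = eig(C);
[var_eig, order] = sort(diag(D), 'descend');
PC_eig = PC_eig(:, order);

% Route 2: SVD of the centered data; right singular vectors are the components.
[~, S, PC_svd] = svd(Xc, 'econ');
var_svd = diag(S).^2 / (size(Xc, 1) - 1);

% Dot product of matching columns is close to +1 or -1; use its sign to align.
signs = sign(sum(PC_eig .* PC_svd, 1));
PC_svd_aligned = bsxfun(@times, PC_svd, signs);

var_err = max(abs(var_eig - var_svd));               % variances should agree
pc_err = max(max(abs(PC_eig - PC_svd_aligned)));     % components agree up to sign
fprintf('variance mismatch: %g, component mismatch: %g\n', var_err, pc_err);
```

A sign flip changes the sign of the corresponding score column for train and test alike, so on its own it should not explain a difference in accuracy.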

Tags: matlab classification pca

【Solution 1】:

The code is correct, but it would be easier to use the princomp function:

var_frac = 0.99; % fraction of variance to keep (example value)
train_out = train; % work on copies so the original data stays untouched
test_out = test;
mn = mean(train_out); % per-feature mean of the training set
train_out = bsxfun(@minus,train_out,mn); % subtract the training mean
test_out = bsxfun(@minus,test_out,mn); % reuse the training mean for the test set
[coefs,scores,variances] = princomp(train_out,'econ'); % PCA
pervar = cumsum(variances) / sum(variances); % cumulative fraction of variance
dims = max(find(pervar < var_frac)); % last component count whose cumulative variance is below var_frac
train_out = train_out*coefs(:,1:dims); % dims - keep this many dimensions
test_out = test_out*coefs(:,1:dims); % result is in train_out and test_out
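
A note on the selection rule `max(find(pervar < var_frac))`: it picks the last component count whose cumulative variance is still below the target, so the kept components explain slightly less than `var_frac`. A small worked example with made-up variances (the numbers, and the alternative rule `find(pervar >= var_frac, 1)`, are illustrations and not part of the original answer):

```matlab
variances = [5; 3; 1.5; 0.5];                  % made-up eigenvalues, largest first
pervar = cumsum(variances) / sum(variances);   % cumulative fraction: 0.5 0.8 0.95 1
var_frac = 0.9;                                % example target

dims_below = max(find(pervar < var_frac));     % last count still below the target
dims_reach = find(pervar >= var_frac, 1);      % first count that reaches the target

% Edge case: the first component alone already exceeds the target (0.5 > 0.4).
dims_below_edge = max(find(pervar < 0.4));     % empty: no component count is selected
dims_reach_edge = find(pervar >= 0.4, 1);      % selects the first component

fprintf('below-target rule keeps %d, reach-target rule keeps %d\n', dims_below, dims_reach);
```

With these numbers the first rule keeps 2 components (80% of the variance) and the second keeps 3 (95%). Which one is appropriate depends on whether `var_frac` is meant as a ceiling or as a minimum.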

【Comments】: