Selecting the optimum number of features using PCA/LDA/MDS in scikit: answers

[Question title]: Selecting the optimum number of features using PCA/LDA/MDS in scikit
[Posted]: 2014-12-29 03:06:27
[Question description]:

I want to reduce the features of my data set using PCA, LDA and MDS, but I also want to retain 95% of the variance.

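I can't find a way to specify the desired variance in the parameters of the respective algorithms. One passage in the PCA API documentation (sklearn.decomposition.PCA) seems relevant: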

if n_components == 'mle', Minka's MLE is used to guess the dimension. If 0 < n_components < 1, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

But how can n_components be equal to 'mle' and to a fraction at the same time?

Setting n_components='mle' reduces the features from 40 to 39, which doesn't help.

[Question discussion]:

Tags: python scikit-learn feature-selection

[Solution 1]:

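The PCA object in sklearn.decomposition has an attribute called 'explained_variance_ratio_'. It is an array giving the fraction of the total variance that each principal component accounts for, in descending order.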

So, you can first create a PCA object and fit it to the data:

from sklearn.decomposition import PCA

pca_obj = PCA()                      # no n_components given, so all components are kept
x_trans = pca_obj.fit_transform(x)   # x is the data

Now we can keep adding up the variance fractions until we reach the desired value (0.95 in my case):

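s = pca_obj.explained_variance_ratio_
total = 0.0   # running sum of explained variance (avoids shadowing the built-in sum)
comp = 0      # number of components counted so far

for ratio in s:
    total += ratio
    comp += 1
    if total >= 0.95:
        break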

The required number of components is then the value of comp.
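As a side note, the fractional form of n_components quoted in the question should give the same count directly: n_components is either the string 'mle' or a float between 0 and 1, never both at once. Below is a minimal sketch that compares the two routes. It assumes a reasonably recent scikit-learn; the synthetic data from make_classification is only a stand-in for x, and the names cum_var, n_manual and pca_95 are mine, not from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the asker's 40-feature data set (an assumption).
x, _ = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=0)

# Manual route: cumulative explained variance, then the first position
# where the running total reaches 0.95 (same rule as the loop above).
full_pca = PCA(svd_solver="full").fit(x)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
n_manual = int(np.searchsorted(cum_var, 0.95)) + 1

# Direct route: pass the target fraction as n_components.
# svd_solver="full" is set explicitly so the fractional form is accepted.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(x)
x_reduced = pca_95.transform(x)

print(n_manual, pca_95.n_components_, x_reduced.shape)
```

The two counts should agree, apart from the edge case where the running total lands exactly on the threshold. This only covers PCA; as far as I know, MDS has no comparable explained-variance attribute, so the same trick does not carry over to it.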

[Discussion]: