【Title】: Truncated SVD and PCA
【Posted】: 2021-11-24 14:02:49
【Question】:

In theory, if the features have zero mean, PCA and SVD give the same projection. So I tried it in Python.

from sklearn import datasets
cancer = datasets.load_breast_cancer()

from sklearn.preprocessing import StandardScaler
# we can set our feature to have mean 0 by setting with_mean=False
scaler = StandardScaler(with_mean=False,with_std=False)
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

from sklearn.decomposition import PCA
pca=PCA(n_components=3,svd_solver='randomized') 
pca.fit(X_scaled) 
X_pca=pca.transform(X_scaled) 

from sklearn.decomposition import TruncatedSVD
svdm=TruncatedSVD(n_components=3,algorithm='randomized') 
svdm.fit(X_scaled) 
X_svdm=svdm.transform(X_scaled)

But when I print the results, they are different. Why is that?

print(X_pca)
print(X_svdm)
>>>[[1160.1425737  -293.91754364   48.57839763]
 [1269.12244319   15.63018184  -35.39453423]
 [ 995.79388896   39.15674324   -1.70975298]
 ...
 [ 314.50175618   47.55352518  -10.44240718]
 [1124.85811531   34.12922497  -19.74208742]
 [-771.52762188  -88.64310636   23.88903189]]
>>>[[2241.97427647  347.71556015  -27.53741942]
 [2372.40840267   56.90166991   23.86316187]
 [2101.8402797    11.94762737   30.41138602]
 ...
 [1424.53280954  -55.0217124    -3.5794351 ]
 [2231.65579282   19.99439854    3.31619182]
 [ 331.69302638   -5.29733966  -39.12136435]]

What do I need to fix to get the same result from both algorithms?

【Question comments】:

    Tags: python matrix scikit-learn pca svd


    【Answer 1】:

    From the help page for StandardScaler:

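    with_mean : bool, default=True. If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.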
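To make the effect of `with_mean` concrete, here is a minimal sketch on a small made-up matrix (the values are illustrative, not taken from the breast cancer data). It suggests that the question's `StandardScaler(with_mean=False, with_std=False)` leaves the data unchanged, so the column means stay nonzero, whereas `with_mean=True` subtracts them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix (illustrative data, not the breast cancer set).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [6.0, 60.0]])

# Configuration used in the question: no centering and no scaling.
X_noop = StandardScaler(with_mean=False, with_std=False).fit_transform(X)

# Centering only: subtract each column's mean, leave the variance alone.
X_centered = StandardScaler(with_mean=True, with_std=False).fit_transform(X)

unchanged = np.allclose(X_noop, X)            # the scaler did nothing
col_means_noop = X_noop.mean(axis=0)          # still the original means
col_means_centered = X_centered.mean(axis=0)  # approximately zero

print(unchanged)
print(col_means_noop)
print(col_means_centered)
```

In other words, `with_mean=False` switches centering off rather than asserting that the mean is already zero.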

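    To make PCA and truncated SVD give the same output, you need to center the data yourself: PCA centers its input internally, while TruncatedSVD does not. The example here also scales the features to unit variance. See also this post for details. So if you do: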

    # which is also the default
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_scaled = scaler.fit_transform(cancer.data)
    
    pca=PCA(n_components=3,svd_solver='randomized')
    pca.fit(X_scaled)
    X_pca=pca.transform(X_scaled)
    
    svdm=TruncatedSVD(n_components=3,algorithm='randomized')
    svdm.fit(X_scaled)
    X_svdm=svdm.transform(X_scaled)
    
    X_pca
    array([[ 9.19283683,  1.94858306, -1.12316567],
           [ 2.3878018 , -3.76817175, -0.52929196],
           [ 5.73389628, -1.0751738 , -0.55174751],
           ...,
           [ 1.25617928, -1.90229671,  0.56273027],
           [10.37479406,  1.67201011, -1.87702986],
           [-5.4752433 , -0.6706368 ,  1.49044385]])
    
    X_svdm
    array([[ 9.19283683,  1.94858307, -1.12316615],
           [ 2.3878018 , -3.76817174, -0.52929266],
           [ 5.73389628, -1.0751738 , -0.55174759],
           ...,
           [ 1.25617928, -1.90229671,  0.56273052],
           [10.37479406,  1.67201011, -1.87702935],
           [-5.4752433 , -0.67063679,  1.49044309]])
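The two arrays agree up to small numerical differences from the randomized solver. As a rough check of the underlying identity, the NumPy-only sketch below compares PCA scores (from the eigenvectors of the covariance matrix) with a plain SVD projection. The synthetic data and the helper names `svd_scores` and `pca_scores` are assumptions made for illustration; they are not part of scikit-learn. The sign of each component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with clearly nonzero column means (illustrative only).
X = rng.normal(loc=5.0, scale=[3.0, 2.0, 1.0, 0.5], size=(200, 4))


def svd_scores(A, k):
    """Project A onto its top-k right singular vectors, with no centering."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:k].T


def pca_scores(A, k):
    """PCA scores from the eigenvectors of the sample covariance matrix."""
    Ac = A - A.mean(axis=0)                   # PCA always centers first
    cov = Ac.T @ Ac / (A.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return Ac @ eigvecs[:, order]


k = 2
pca = pca_scores(X, k)
svd_raw = svd_scores(X, k)                         # SVD on uncentered data
svd_centered = svd_scores(X - X.mean(axis=0), k)   # SVD on centered data

# Component signs are arbitrary, so compare magnitudes.
match_raw = np.allclose(np.abs(pca), np.abs(svd_raw))
match_centered = np.allclose(np.abs(pca), np.abs(svd_centered))
print(match_raw, match_centered)
```

On this data the centered comparison matches and the uncentered one does not, with no unit-variance scaling involved. Scaling changes which components you get, but it is the centering that makes the two methods agree.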
    

    【Comments】:

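    • z = (x - u) / s, "where u is the mean of the training samples or zero if with_mean=False". So if we set with_mean=True, the features will also have mean 0?
    • Yes, if I understand your question correctly: that centers your features so their mean is zero, which is what you need for PCA.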