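Data standardization: across samples or across features? Answers

【Question title】: Data standardization, across samples or across features?
【Posted】: 2020-11-23 15:10:57
【Question description】:

I have data for 4 samples with 5 features each, stored in an array, data.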

import numpy as np

data = np.array([[1,1,1,1,0],
                 [0,0,0,0,0],
                 [1,1,1,1,0],
                 [1,0,0,0,0]])

print(data)

n_samples, n_features = data.shape  # (4, 5)

When I apply StandardScaler to it as follows, does it standardize the data across features or across samples?

from sklearn.preprocessing import StandardScaler, MinMaxScaler
result = StandardScaler().fit_transform(data)
print(result)

[[ 0.57735027  1.          1.          1.          0.        ]
 [-1.73205081 -1.         -1.         -1.          0.        ]
 [ 0.57735027  1.          1.          1.          0.        ]
 [ 0.57735027 -1.         -1.         -1.          0.        ]]

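In machine learning, what is the best practice: standardizing data across samples or across features?

【Question discussion】:

Tags: numpy tensorflow machine-learning keras scikit-learn

【Solution 1】:

With StandardScaler/MinMaxScaler, the data is scaled per feature: each column is scaled using statistics computed over all samples. This is the common best practice.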

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1,1,1,1,0],
                 [0,0,0,0,0],
                 [1,1,1,1,0],
                 [1,0,0,0,0]])

result = StandardScaler().fit_transform(data)
result

array([[ 0.57735027,  1.        ,  1.        ,  1.        ,  0.        ],
       [-1.73205081, -1.        , -1.        , -1.        ,  0.        ],
       [ 0.57735027,  1.        ,  1.        ,  1.        ,  0.        ],
       [ 0.57735027, -1.        , -1.        , -1.        ,  0.        ]])

You can verify this yourself:

(data - data.mean(0))/data.std(0).clip(1e-5)

array([[ 0.57735027,  1.        ,  1.        ,  1.        ,  0.        ],
       [-1.73205081, -1.        , -1.        , -1.        ,  0.        ],
       [ 0.57735027,  1.        ,  1.        ,  1.        ,  0.        ],
       [ 0.57735027, -1.        , -1.        , -1.        ,  0.        ]])
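
Another way to check which axis is used is to look at the statistics the fitted scaler stores. The sketch below assumes scikit-learn's documented `mean_` and `scale_` attributes on a fitted StandardScaler; the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1, 1, 1, 0],
                 [0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 0],
                 [1, 0, 0, 0, 0]])

scaler = StandardScaler()
result = scaler.fit_transform(data)

# One mean and one scale per column, i.e. per feature.
print(scaler.mean_.shape)   # (5,)
print(scaler.mean_)         # column means, same as data.mean(axis=0)
print(scaler.scale_)        # column standard deviations; the constant column gets 1.0

# After scaling, every column has mean 0.
print(result.mean(axis=0))
```

The stored arrays have length n_features rather than n_samples, which matches the column-wise manual computation above.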

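【Discussion】:

0,0,0,0,0 are the features of the second sample. However, the standardized values of this sample are not all zero (which is what you would expect if the computation ran across features). So what your answer shows is computed across samples; for example, the all-zero values are the samples of the last feature.
No, the standardized second row cannot be all zeros... for the first column and second row you have (0-mean_first_col)/std_first_col, which is not 0... check it with my second implementation
np.mean([0,0,0,0,0]) = 0. np.std([0,0,0,0,0]) = 0. So, (0-0)/0 = 0.
np.mean([0.577, -1.73, 0.577, 0.577]) = mean_first_col != 0 ... np.std([0.577, -1.73, 0.577, 0.577]) = std_first_col != 0 ... data.mean(0) is the column-wise mean
scikit-learn.org/stable/modules/generated/… standardizes FEATURES, not samples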
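
The disagreement in these comments comes down to which axis the mean and standard deviation are taken over. As a rough sketch, the two options can be compared side by side in NumPy. The helper `standardize` is hypothetical (not part of NumPy or scikit-learn), and replacing a zero standard deviation with 1 is an assumption made here to avoid dividing by zero; a plain floating-point 0/0 in NumPy gives NaN, not 0.

```python
import numpy as np

data = np.array([[1, 1, 1, 1, 0],
                 [0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 0],
                 [1, 0, 0, 0, 0]], dtype=float)

def standardize(x, axis):
    """Hypothetical helper: standardize x along `axis`, treating zero std as 1."""
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # assumption: avoid 0/0 for constant rows/columns
    return (x - mean) / std

per_feature = standardize(data, axis=0)  # statistics per column (what StandardScaler computes)
per_sample = standardize(data, axis=1)   # statistics per row (a different operation)

print(per_feature[1])  # second row: not all zeros
print(per_sample[1])   # second row: all zeros
```

Only the row-wise version turns the all-zero second sample into all zeros; the column-wise version reproduces the StandardScaler output shown in the answer.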