How to scale and predict a single sample the right way

[Question title]: How to scale and predict a single sample the right way
[Posted]: 2020-04-03 14:08:05
[Question]:

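Using a breast cancer dataset (5 features + 1 diagnosis column), I trained and tested a logistic regression model on standardized data (StandardScaler()). I load the model with pickle: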

import pickle
log = pickle.load(open('./log.pkl', 'rb'))

and I want to predict whether a new sample belongs to class 0 (benign) or class 1 (malignant).

The test data below belongs to class 1 (I tried several class 1 samples, and every one of them was classified as 0):

radius = 11.41
texture = 10.82
perimeter = 73.34
area = 403.3
smoothness = 0.09373

To build the sample and get a prediction, I tried the following:

import numpy as np
from sklearn.preprocessing import StandardScaler

temp = [radius, texture, perimeter, area, smoothness]
temp = np.array(temp).reshape((len(temp), 1))
scaler = StandardScaler()
temp = scaler.fit_transform(temp)

# print(log.predict(temp))   # results in: ValueError: X has 1 features per sample; expecting 5
print(log.predict(temp.T)) # results in: [0] which is wrong

# print(log.predict_proba(temp)) # results in: ValueError: X has 1 features per sample; expecting 5
print(log.predict_proba(temp.T)) # results in: [[9.99999972e-01 2.78352951e-08]] which does not seem right

I also tried:

new_sample = np.array([radius, texture, perimeter, area, smoothness])
# scaled_sample = scaler.fit_transform(new_sample.reshape(1, -1)) # resulting array: array([[0., 0., 0., 0., 0.]])
# scaled_sample = scaler.fit_transform(new_sample.reshape(1, -1).T) # same as below
scaled_sample = scaler.fit_transform(new_sample[:, np.newaxis])
print(log.predict(scaled_sample.T))  # results in [0] which is wrong 
print(log.predict_proba(scaled_sample.T)) # results in: [[9.99999972e-01 2.78352951e-08]] which differs from the predict_proba above, and seems off

What is the correct way to make this kind of prediction?

Thanks,

Best wishes, Birgit

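[Comments]:

Why do you say that 0 is the wrong predicted class? Maybe the model simply performs so badly that it always predicts 0? Are the two classes balanced?
Hi Márcio, the class distribution is about 55/45%, so I don't think imbalance is the issue. With a 70/30% train/test split, the model has an accuracy of 0.959, a precision of 0.963, a recall of 0.972, and an F1 of 0.972.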

Tags: python machine-learning scikit-learn classification

[Solution 1]:

According to the scikit-learn documentation for the predict function, your code could be simpler:

temp = np.array([[radius, texture, perimeter, area, smoothness]]) # use double brackets
scaler = StandardScaler()
print(log.predict(scaler.fit_transform(temp)))

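This is the right way to call predict: it expects a 2D array of shape (n_samples, n_features), so a single sample is one row. Note, however, that predict by itself says nothing about how well the model was fitted.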
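One caveat about the scaling step: a StandardScaler fitted on a single row subtracts that row from itself, so every feature becomes 0 (the array([[0., 0., 0., 0., 0.]]) noted in the question), and the prediction no longer depends on the sample. The scaling statistics normally have to come from the training data. The following is a minimal sketch of one way to arrange that, not the asker's actual setup. It assumes scikit-learn's bundled breast cancer dataset as stand-in data, with its first five columns as the five features and the labels flipped so that 1 = malignant, and it wraps the scaler and the classifier in a pipeline so that both are pickled together.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the first five columns of scikit-learn's bundled
# dataset are mean radius, texture, perimeter, area and smoothness.
data = load_breast_cancer()
X = data.data[:, :5]
# scikit-learn encodes malignant as 0; flip so that 1 = malignant, as in the question.
y = 1 - data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline learns mean and standard deviation from the training data only
# and stores them next to the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Persist scaler and classifier as one object (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(model))

# One new sample: raw, unscaled values in a 2D array of shape (1, n_features).
new_sample = np.array([[11.41, 10.82, 73.34, 403.3, 0.09373]])
pred = restored.predict(new_sample)
proba = restored.predict_proba(new_sample)
print(pred, proba)

# For contrast: a fresh scaler fitted on the single sample maps it to all zeros.
zeros = StandardScaler().fit_transform(new_sample)
print(zeros)
```

If the model was pickled without its scaler, the same idea applies with two objects: pickle the scaler that was fitted on the training set and call its transform (not fit_transform) on each new sample before predict.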

[Discussion]:

Using your solution on several new class 1 samples (and class 0 samples as well) gives the expected results! Thanks!