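【Posted】:2020-04-03 14:08:05
【Question】:
Using a breast cancer dataset (5 features + 1 diagnosis column), I trained and tested a logistic regression model on standardized data (StandardScaler()). I load the model with pickle: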
log = pickle.load(open('./log.pkl', 'rb'))
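and I want to predict whether a new sample belongs to class 0 (benign) or class 1 (malignant).
The test data below belongs to class 1 (I tried several class 1 samples, and every one of them was classified as 0):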
radius = 11.41
texture = 10.82
perimeter = 73.34
area = 403.3
smoothness = 0.09373
To build the sample and get a prediction, I tried the following:
temp = [radius, texture, perimeter, area, smoothness]
temp = np.array(temp).reshape((len(temp), 1))
scaler = StandardScaler()
temp = scaler.fit_transform(temp)
# print(log.predict(temp)) # results in: ValueError: X has 1 features per sample; expecting 5
print(log.predict(temp.T)) # results in: [0] which is wrong
# print(log.predict_proba(temp)) # results in: ValueError: X has 1 features per sample; expecting 5
print(log.predict_proba(temp.T)) # results in: [[9.99999972e-01 2.78352951e-08]] which does not seem right
I also tried:
new_sample = np.array([radius, texture, perimeter, area, smoothness])
# scaled_sample = scaler.fit_transform(new_sample.reshape(1, -1)) # resulting array: array([[0., 0., 0., 0., 0.]])
# scaled_sample = scaler.fit_transform(new_sample.reshape(1, -1).T) # same as below
scaled_sample = scaler.fit_transform(new_sample[:, np.newaxis])
print(log.predict(scaled_sample.T)) # results in [0] which is wrong
print(log.predict_proba(scaled_sample.T)) # results in: [[9.99999972e-01 2.78352951e-08]] which differs from the predict_proba above, and seems off
What is the correct way to make this kind of prediction?
Thanks,
Best wishes, Birgit
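Both attempts above fit a brand-new StandardScaler on the single sample, so the sample is never scaled with the statistics of the training data. The following is a minimal sketch of what those calls compute and of one common alternative: keeping the scaler and the model together in a pipeline and persisting both. It rests on several assumptions that are not in the question: scikit-learn's bundled `load_breast_cancer` data (first five columns) stands in for the original dataset, its labels are flipped to match the 0 = benign / 1 = malignant convention used here, and the names `pipe` and `restored` are purely illustrative. The original `log.pkl` contains only the model, so this approach would require refitting or having saved the training scaler.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data (assumption): the first five columns of scikit-learn's
# bundled dataset (mean radius, texture, perimeter, area, smoothness).
data = load_breast_cancer()
X_train = data.data[:, :5]
y_train = 1 - data.target  # flip so that 0 = benign, 1 = malignant, as in the question

new_sample = np.array([[11.41, 10.82, 73.34, 403.3, 0.09373]])  # shape (1, 5)

# Fitting a fresh scaler on a single row: each column's mean is the value
# itself, so every feature is mapped to 0.
zeros = StandardScaler().fit_transform(new_sample)

# Fitting on the transposed sample standardizes the five values against each
# other (one mean/std across radius, area, ...), not against the training data.
mixed = StandardScaler().fit_transform(new_sample.T).T

# Alternative sketch: fit scaler and model together, persist both, and let the
# pipeline apply the training-set scaling (transform only) to new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

restored = pickle.loads(pickle.dumps(pipe))  # stands in for writing/reading a .pkl file
pred = restored.predict(new_sample)
proba = restored.predict_proba(new_sample)

print(zeros)
print(mixed)
print(pred, proba)
```

With this layout the new sample is passed to `predict` unscaled, as a 2-D array of shape (1, 5), and the pipeline handles the scaling internally. If the model was trained without a pipeline, pickling the fitted scaler alongside the model and calling its `transform` (not `fit_transform`) on new samples should have the same effect.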
【Comments】:
-
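Why do you say that 0 is the wrong predicted class? Maybe the model simply performs so poorly that it always predicts 0? Are the two classes balanced?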
-
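Hi Márcio, the class distribution is about 55/45%, so I don't think this is an imbalance problem. With a 70/30% train/test split, the model has an accuracy of 0.959, a precision of 0.963, a recall of 0.972, and an F1 of 0.972.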
Tags: python machine-learning scikit-learn classification