【Posted】: 2018-09-25 17:54:11
【Question】:
Goal: determine whether rfq_num_of_dealers is a significant predictor of a completed trade (Done = 1).
My data:
df_Train_Test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 139025 entries, 0 to 139024
Data columns (total 2 columns):
rfq_num_of_dealers 139025 non-null float64
Done 139025 non-null uint8
dtypes: float64(1), uint8(1)
df_Train_Test = df_Train_Test[['rfq_num_of_dealers','Done']]
df_Train_Test_GrpBy = df_Train_Test.groupby(['rfq_num_of_dealers','Done']).size().reset_index(name='Count').sort_values(['rfq_num_of_dealers','Done'])
display(df_Train_Test_GrpBy)
The rfq_num_of_dealers column ranges from 0 to 21, and the Done column is either 0 or 1. Note that every rfq_num_of_dealers row has a Done value of 0 or 1; none are missing.
Logistic regression:
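from sklearn.model_selection import train_test_split
x = df_Train_Test[['rfq_num_of_dealers']]
y = df_Train_Test['Done']
# 1 Split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)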
# 2 Train and fit a logistic regression model on the training set.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression() # create instance of model
logmodel.fit(x_train,y_train) # fit model against the training data
# 3. Now predict values for the testing data.
predictions = logmodel.predict(x_test) # Predict off the test data (note fit model is off train data)
# 4 Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
# 5 Create a confusion matrix for the model.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions)) # The diagonals are the correct predictions
This produces the following error:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
The report and matrix are clearly wrong; note the right-hand column of the confusion matrix:
             precision    recall  f1-score   support

          0       0.92      1.00      0.96     41981
          1       0.00      0.00      0.00      3898

avg / total       0.84      0.92      0.87     45879
[[41981 0]
[ 3898 0]]
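As a sanity check on the numbers above, here is a minimal sketch that recomputes per-class precision and recall by hand from the confusion matrix. The helper name per_class_metrics is made up for illustration, and returning None for a 0/0 ratio is a choice made here; scikit-learn instead reports 0.0 and emits the warning shown earlier.

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
confusion = [[41981, 0],
             [3898, 0]]


def per_class_metrics(matrix):
    """Return {label: (precision, recall)}; None marks a 0/0 ratio."""
    metrics = {}
    for label in range(len(matrix)):
        true_positives = matrix[label][label]
        predicted_total = sum(row[label] for row in matrix)  # column sum
        actual_total = sum(matrix[label])                    # row sum
        precision = true_positives / predicted_total if predicted_total else None
        recall = true_positives / actual_total if actual_total else None
        metrics[label] = (precision, recall)
    return metrics


metrics = per_class_metrics(confusion)
for label, (precision, recall) in metrics.items():
    print(label, precision, recall)
```

The column for label 1 sums to zero, so its precision is 0 divided by 0. That appears to be what "labels with no predicted samples" refers to.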
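If Done is always either 1 or 0 and fully populated (these are the y labels), how can this error be raised? Is there any code I can run to determine exactly which y labels cause the error? Other output: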
display(pd.Series(predictions).value_counts())
0 45879
dtype: int64
display(pd.Series(predictions).describe())
count 45879.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
dtype: float64
display(y_test)
71738 0
39861 0
16567 0
81750 1
88513 0
16314 0
113822 0
. .
display(y_test.describe())
count 45879.000000
mean 0.084963
std 0.278829
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: Done, dtype: float64
display(y_test.value_counts())
0 41981
1 3898
Name: Done, dtype: int64
Could this be related to the fact that 12439 records have both rfq_num_of_dealers and Done equal to 0?
【Discussion】:
Tags: pandas numpy classification logistic-regression
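One way to see which labels are involved is to compare the labels present in y_test with those present in predictions. The sketch below does this with the counts copied from the value_counts() outputs above rather than the real Series, so the variable names are illustrative only. It also computes the accuracy of always predicting the most common label, as a baseline for the 0.92 in the report.

```python
from collections import Counter

# Label counts copied from the value_counts() outputs above.
y_true_counts = Counter({0: 41981, 1: 3898})
y_pred_counts = Counter({0: 45879})

# Labels that occur in y_test but are never predicted.
never_predicted = sorted(set(y_true_counts) - set(y_pred_counts))

# Accuracy of always predicting the most common label.
total = sum(y_true_counts.values())
majority_label, majority_count = y_true_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(never_predicted)
print(majority_label, round(baseline_accuracy, 3))
```

On the actual data, the same set difference could presumably be taken directly as set(y_test) - set(predictions).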