【Posted】: 2020-07-25 15:34:32
【Question】:
Update: here are links to the data, in case you want to reproduce the problem:
https://github.com/amandawang-dev/credit-worthiness-analysis/blob/master/credit_train.csv
https://github.com/amandawang-dev/credit-worthiness-analysis/blob/master/credit_test.csv
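I am trying to use sklearn's logistic regression model to predict whether a person's bank account credit is good or bad. The initial dataset looks like this:
Then I binarized the first column, 'Class' ('Good'=1, 'Bad'=0), and the dataset looks like this:
I then used the sklearn logistic regression model to predict on the test data (the test data is the dataset being predicted, and its 'Class' column is binarized as well) and tried to compute the confusion matrix. The code is below. The confusion matrix I got is: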
[[ 0 54]
[ 0 138]]
The accuracy score is 0.71875. I think the confusion matrix is wrong, because it shows no true positives. Does anyone know how to fix this? Thanks!
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

credit_train = pd.read_csv('credit_train.csv')
credit_test = pd.read_csv('credit_test.csv')

# Binarize the target: 'Good' -> 1, 'Bad' -> 0
credit_train["Class"] = (credit_train["Class"] == "Good").astype(int)
credit_test["Class"] = (credit_test["Class"] == "Good").astype(int)

# Train on a single feature
X = credit_train[['CreditHistory.Critical']]
y = credit_train['Class']
clf = LogisticRegression(random_state=0).fit(X, y)

# Predict on the test set
X_test = credit_test[['CreditHistory.Critical']]
y_test = credit_test['Class']
y_pred = clf.predict(X_test)

# In sklearn's confusion matrix, rows are true labels and columns are predicted labels
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
score = clf.score(X_test, y_test)
print(score)
print(cm)
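A note on reading the matrix: sklearn puts true labels on the rows and predicted labels on the columns, so with labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The sketch below does not load the real CSV files. It rebuilds label arrays from the counts in the matrix reported in the question (54 rows with true label 0, 138 with true label 1) and assumes a classifier that outputs 1 for every row. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# True labels reconstructed from the row sums of the reported matrix:
# 54 'Bad' (0) and 138 'Good' (1). This is NOT the real test file.
y_true = np.array([0] * 54 + [1] * 138)

# Assumed behaviour: the model predicts 'Good' (1) for every row.
y_pred = np.ones_like(y_true)

# Passing labels explicitly fixes the row/column order to [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("accuracy:", accuracy)
```

Read this way, the reported matrix has 138 true positives and 0 true negatives: every row was predicted as 1, and the accuracy of 0.71875 is simply the share of 'Good' rows in the test set (138 / 192).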
Data type of each column:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808 entries, 0 to 807
Data columns (total 17 columns):
Class 808 non-null int64
Duration 808 non-null int64
Amount 808 non-null int64
InstallmentRatePercentage 808 non-null int64
ResidenceDuration 808 non-null int64
Age 808 non-null int64
NumberExistingCredits 808 non-null int64
NumberPeopleMaintenance 808 non-null int64
Telephone 808 non-null int64
ForeignWorker 808 non-null int64
CheckingAccountStatus.lt.0 808 non-null int64
CheckingAccountStatus.0.to.200 808 non-null int64
CheckingAccountStatus.gt.200 808 non-null int64
CreditHistory.ThisBank.AllPaid 808 non-null int64
CreditHistory.PaidDuly 808 non-null int64
CreditHistory.Delay 808 non-null int64
CreditHistory.Critical 808 non-null int64
dtypes: int64(17)
memory usage: 107.4 KB
【Comments】:
-
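What is the relationship between Class and 'CreditHistory.Critical'? If the correlation is low, the classifier may just learn to predict the more common class.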
-
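Most likely you have a severe class imbalance (many more negative samples than positive ones) rather than a "wrong" confusion matrix. Class imbalance needs special handling.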
-
Could you provide a link to the data? Without a link to the dataset, nobody can reproduce your results.
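The two suggestions in the comments (check how the feature relates to the class, and handle class imbalance) can be sketched as follows. This is a minimal illustration on synthetic counts, not on the real credit_train.csv: the numbers are made up so that 'Good' (1) is the majority class for both values of the feature. `class_weight="balanced"` is one possible way to handle imbalance in `LogisticRegression`, not necessarily what the commenters had in mind.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (hypothetical counts, NOT the real file).
# Each tuple is (CreditHistory.Critical, Class).
rows = (
    [(0, 1)] * 350 + [(0, 0)] * 200    # feature == 0: 350 Good, 200 Bad
    + [(1, 1)] * 200 + [(1, 0)] * 50   # feature == 1: 200 Good, 50 Bad
)
df = pd.DataFrame(rows, columns=["CreditHistory.Critical", "Class"])

# Share of each class within each feature value (each row sums to 1).
rates = pd.crosstab(df["CreditHistory.Critical"], df["Class"], normalize="index")
print(rates)

X = df[["CreditHistory.Critical"]]
y = df["Class"]

# The only two inputs a single binary feature can take.
grid = pd.DataFrame({"CreditHistory.Critical": [0, 1]})

# Unweighted model versus a model that reweights classes by inverse frequency.
plain = LogisticRegression(random_state=0).fit(X, y)
balanced = LogisticRegression(random_state=0, class_weight="balanced").fit(X, y)

plain_pred = plain.predict(grid)
balanced_pred = balanced.predict(grid)
print("unweighted:", plain_pred)
print("balanced:  ", balanced_pred)
```

On these synthetic counts the unweighted model should predict 1 for both feature values, because 'Good' is the majority in each group, which reproduces the single-column confusion matrix from the question. The reweighted model can predict 0 for the group where 'Bad' is relatively more frequent. Whether this helps on the real data depends on how strongly the feature separates the classes there, and adding more of the 16 available feature columns may matter more than reweighting.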
Tags: python machine-learning scikit-learn statistics data-science