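【Question title】: decision tree sklearn: prediction accuracy 100%
【Posted】: 2017-05-31 02:41:58
【Question】:

My decision tree classifier reports an accuracy of 1.0, the exported tree has only a single node, and the confusion matrix has only a single element. I see the same problem with a random forest.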

    import numpy
    import pandas
    import sklearn.metrics
    # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    data = pandas.read_csv('nesarc_pds.csv', low_memory=False)

    # Convert the variables to numeric
    data['CONSUMER'] = pandas.to_numeric(data['CONSUMER'], errors='coerce')
    data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'], errors='coerce')
    data['S2DQ3C1'] = pandas.to_numeric(data['S2DQ3C1'], errors='coerce')
    data['S2DQ3C2'] = pandas.to_numeric(data['S2DQ3C2'], errors='coerce')
    data['S2DQ4C1'] = pandas.to_numeric(data['S2DQ4C1'], errors='coerce')
    data['S2DQ4C2'] = pandas.to_numeric(data['S2DQ4C2'], errors='coerce')
    data['S2DQ1'] = pandas.to_numeric(data['S2DQ1'], errors='coerce')
    data['S2DQ2'] = pandas.to_numeric(data['S2DQ2'], errors='coerce')
    data['SEX'] = pandas.to_numeric(data['SEX'], errors='coerce')

    # Subset the data to respondents who started drinking between ages 10 and 30
    sub1 = data[(data['S2AQ16A'] >= 10) & (data['S2AQ16A'] <= 30)]
    # Copy into a new DataFrame
    sub2 = sub1.copy()

    # Recode the missing-data codes as NaN
    sub2['S2AQ16A'] = sub2['S2AQ16A'].replace(99, numpy.nan)
    sub2['S2DQ3C1'] = sub2['S2DQ3C1'].replace(99, numpy.nan)
    sub2['S2DQ3C2'] = sub2['S2DQ3C2'].replace(9, numpy.nan)
    sub2['S2DQ4C1'] = sub2['S2DQ4C1'].replace(99, numpy.nan)
    sub2['S2DQ4C2'] = sub2['S2DQ4C2'].replace(9, numpy.nan)
    sub2['S2DQ1'] = sub2['S2DQ1'].replace(9, numpy.nan)
    sub2['S2DQ2'] = sub2['S2DQ2'].replace(9, numpy.nan)

    # Create a secondary variable for the number of siblings
    sub2['SIBNO'] = sub2['S2DQ3C1'] + sub2['S2DQ4C1']

    # Sibling drinking status, combining the data for brothers and sisters
    def SIBSTS(row):
        if any([row['S2DQ3C2'] == 1, row['S2DQ4C2'] == 1]):
            return 1
        elif all([row['S2DQ3C2'] == 2, row['S2DQ4C2'] == 2]):
            return 0
        else:
            return numpy.nan

    sub2['SIBSTS'] = sub2.apply(SIBSTS, axis=1)

    # Parental drinking status, combining the data for both parents
    def PRSTS(row):
        if any([row['S2DQ1'] == 1, row['S2DQ2'] == 1]):
            return 1
        elif all([row['S2DQ1'] == 2, row['S2DQ2'] == 2]):
            return 0
        else:
            return numpy.nan

    sub2['PRSTS'] = sub2.apply(PRSTS, axis=1)

    # Recode 'CONSUMER' into a new variable, DRSTS
    recode1 = {1: 1, 2: 1, 3: 0}
    sub2['DRSTS'] = sub2['CONSUMER'].map(recode1)

    # Recode the SEX variable into GEN
    recode2 = {1: 1, 2: 0}
    sub2['GEN'] = sub2['SEX'].map(recode2)

    data_clean = sub2.dropna()

    print(data_clean.dtypes)
    print(data_clean.describe())

    # Modeling and prediction

    # Split into training and testing sets
    predictors = data_clean[['S2AQ16A', 'SIBNO', 'SIBSTS', 'PRSTS', 'GEN']]
    targets = data_clean['DRSTS']

    pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

    print(pred_train.shape)
    print(pred_test.shape)
    print(tar_train.shape)
    print(tar_test.shape)

    # Build the model on the training data
    classifier = DecisionTreeClassifier()
    classifier = classifier.fit(pred_train, tar_train)

    predictions = classifier.predict(pred_test)

    print(sklearn.metrics.confusion_matrix(tar_test, predictions))
    print(sklearn.metrics.accuracy_score(tar_test, predictions))

    # Display the decision tree
    from IPython.display import Image
    import pydotplus
    from sklearn import tree

    # With out_file=None, export_graphviz returns the DOT source as a string,
    # which avoids the StringIO/BytesIO differences between Python 2 and 3
    dot_data = tree.export_graphviz(classifier, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    Image(graph.create_png())
    graph.write_pdf("iris.pdf")

Output: (not preserved in this copy)

Dataset used in the code: nesarc_pds

【Comments】:

    Tags: python-2.7 machine-learning scikit-learn classification decision-tree


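    【Answer 1】:

    After building the model on the training set, you should use the test set to measure the classifier's accuracy.

    The error is in this line: predictions=classifier.predict(pred_train)

    It should be: predictions=classifier.predict(pred_test)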
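To illustrate why the choice between pred_train and pred_test matters, here is a minimal sketch on synthetic data. The dataset comes from make_classification and is an assumption made purely for illustration; it is not the NESARC data from the question. An unpruned tree typically memorizes its training set, so the training accuracy is misleadingly high, while the held-out accuracy is the more honest estimate.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with some label noise (flip_y); a stand-in for
# the real predictors/targets, which are not available here.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# An unpruned tree keeps splitting until every training leaf is pure.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```

In this sketch the training score should be perfect and the held-out score noticeably lower. In the question, however, the score was 1.0 on the test split as well, which points to a different cause than evaluating on the training data.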

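    【Comments】:

    • Thanks for the help, @Darshan. I originally used the code you mention and got the same result, so I changed it as a check and forgot to change it back before posting. I have edited the question now. I run into the same problem with a random forest. I have shared a link to the dataset in case you want to try it.
    • If that is the case, it should not behave like this.
    • I will try to find the cause and let you know.
    • I ran your code and checked the dataset. Your target variable DRSTS takes only one value (1) and nothing else; targets.describe() shows this. Any classifier can therefore trivially predict 1 for every sample, which is why the accuracy is 1.
    • Thanks, @Darshan, you are right. You pulled me out of my frustration.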
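The diagnosis in the comments above can be turned into a quick check that runs before fitting. The following is a minimal sketch; the helper name check_target and the toy Series are hypothetical and are not part of the original code.

```python
import pandas as pd


def check_target(targets):
    """Return the per-class counts, raising if fewer than two classes remain.

    Hypothetical helper for illustration; NaN values are ignored by
    value_counts() and so do not count as a class.
    """
    counts = pd.Series(targets).value_counts()
    if counts.size < 2:
        raise ValueError("target has a single class: %s" % counts.to_dict())
    return counts


# A toy target that mimics the situation in the question: every row is 1.
toy_targets = pd.Series([1, 1, 1, 1, 1], name="DRSTS")

try:
    check_target(toy_targets)
except ValueError as err:
    message = str(err)
    print(message)

# A target with both classes passes the check and returns the counts.
balanced = check_target(pd.Series([1, 0, 1, 0, 1]))
print(balanced.to_dict())
```

Calling something like check_target(targets) just before train_test_split would have surfaced the problem immediately, instead of through a suspicious accuracy of 1.0.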
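    【Answer 2】:

    Change print(sklearn.metrics.accuracy_score(tar_test, predictions)) to print(sklearn.metrics.accuracy_score(tar_test, predictions, normalize=False)). According to the documentation: "If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples." If the count returned equals the number of targets in the test split, then the algorithm is indeed predicting every sample correctly (which would be very strange).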
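As a small illustration of the normalize parameter described above, the sketch below uses made-up labels (not data from the question) in which four of five predictions are correct.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: the third prediction is wrong.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Default (normalize=True): fraction of correctly classified samples.
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False: number of correctly classified samples.
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print("fraction correct:", fraction_correct)
print("count correct:", count_correct, "of", len(y_true))
```

Comparing the count against len(y_true) confirms whether every test sample was classified correctly, but it does not explain why; for that, the class distribution of the target still has to be inspected.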

    【Comments】:

    • An error in the data management step left the target variable with only one class. @Darshan pointed out the error.
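One way to locate this kind of data management error is to print the class counts after each cleaning step. The sketch below uses a toy DataFrame with invented values. It assumes, for illustration only, that respondents coded CONSUMER == 3 have no recorded age at first drink (S2AQ16A); the thread does not state the exact cause, so treat this as one plausible mechanism rather than a confirmed diagnosis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values invented for illustration).
# Assumption: CONSUMER == 3 rows have a missing age at first drink.
toy = pd.DataFrame({
    "CONSUMER": [1, 2, 3, 3, 1, 2],
    "S2AQ16A": [16, 18, np.nan, np.nan, 21, 15],
})
toy["DRSTS"] = toy["CONSUMER"].map({1: 1, 2: 1, 3: 0})


def class_counts(frame, label):
    """Print and return the DRSTS class counts at one pipeline stage."""
    counts = frame["DRSTS"].value_counts().sort_index().to_dict()
    print(label, counts)
    return counts


before = class_counts(toy, "before filter:")

# Same style of filter as in the question: comparisons with NaN are False,
# so every row without an age at first drink is dropped.
subset = toy[(toy["S2AQ16A"] >= 10) & (toy["S2AQ16A"] <= 30)]
after = class_counts(subset, "after filter:")
```

Under that assumption, class 0 is present before the filter and gone after it, which would leave the classifier with a single class to predict.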