【Question title】: Why does cross-validation for RandomForestRegressor fail in scikit-learn?
【Posted】: 2013-10-09 06:54:13
【Question】:

The training and test input files have the following format:

-1 1 11.10115101|u 11.10115101 |s 2 |reason k:0.116|pv pv1000|g 2230444827 |k k3|w k:0
-1 1 11.10115101|u 11.10115101 |s 0 |reason c:0.080|pv pv1000|g 2235873129 |k k0|w c:1
-1 1 11.10115101|u 11.10115101 |s 1 |reason h:0.054 o:0.073|pv pv1000|g 2236879382 |k k10|w h:1 o:21
-1 1 11.10115101|u 11.10115101 |s 0 |reason u:0.133|pv pv1000|g 2237638819 |k k5|w u:26
-1 1 11.10115101|u 11.10115101 |s 0 |reason o:0.086|pv pv1000|g 2237694729 |k k5|w o:11
-1 1 11.10115101|u 11.10115101 |s 2 |reason l:0.111|pv pv1000|g 2237821631 |k k3|w l:0

The code is below. The load_data() function loads the training or test data into a list of Python dicts and returns a tuple ([dict, ...], [0, 1, 0, ...]):

import argparse
import sys

from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer

parser = argparse.ArgumentParser()
parser.add_argument('-t', '--train', required=True, help='train file')
parser.add_argument('-e', '--test', required=True, help='test file')
ns = parser.parse_args(sys.argv[1:])

# load_data() is my own helper (not shown); it returns ([dict, ...], [0, 1, 0, ...])
f = open(ns.train)
inputs, targets = load_data(f)

print >>sys.stderr, 'load finish'
vec = DictVectorizer()
train = vec.fit_transform(inputs)
print >>sys.stderr, 'dict vectorizer finish'

print >>sys.stderr, 'training'
clf = RandomForestRegressor()
clf.fit(train.toarray(), targets)

print >>sys.stderr, 'testing'
f = open(ns.test)
test_inputs, test_targets = load_data(f)
test = vec.transform(test_inputs)
print cross_validation.cross_val_score(clf, test.toarray(), test_targets, scoring='roc_auc')
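
The load_data() helper itself is not shown in the question. For readers who want to reproduce the setup, here is a minimal sketch of what such a loader might look like. Everything in it is an assumption: the reading of the line format (first token is the label, each `|` starts a namespace, `name:value` tokens are numeric and bare tokens are indicators), the `namespace^name` key scheme, and the mapping of -1 to 0. It is written for Python 3, unlike the Python 2 code above.

```python
import io


def load_data(f):
    """Hypothetical loader: returns ([feature_dict, ...], [0 or 1, ...])."""
    inputs, targets = [], []
    for line in f:
        line = line.strip()
        if not line:
            continue
        head, *sections = line.split('|')
        # Assumption: the first token is the label, and -1 is the negative class.
        label = int(head.split()[0])
        targets.append(1 if label > 0 else 0)
        features = {}
        for section in sections:
            tokens = section.split()
            if not tokens:
                continue
            namespace, rest = tokens[0], tokens[1:]
            for token in rest:
                name, sep, value = token.partition(':')
                # "name:value" becomes a numeric feature; a bare token becomes an indicator.
                features[namespace + '^' + name] = float(value) if sep else 1.0
        inputs.append(features)
    return inputs, targets


# Usage with the first sample line from the question.
sample = io.StringIO(
    "-1 1 11.10115101|u 11.10115101 |s 2 |reason k:0.116|pv pv1000|g 2230444827 |k k3|w k:0\n"
)
inputs, targets = load_data(sample)
```

Note that the targets come back as a plain Python list, which is the detail that matters for the error below.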

Training works fine, but during cross-validation the last line of the code raises an exception:

  File "randomforest.py", line 72, in <module>
    print cross_validation.cross_val_score(clf, test.toarray(), test_targets, scoring='roc_auc')
  File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1058, in _cross_val_score
    y_train = y[train]
TypeError: only integer arrays with one element can be converted to an index

I wrote the code following the example in the manual, but it fails.

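【Question discussion】:

  • Please always report the full traceback. Also, what is test_targets? What are its type and shape? Does it have the same number of samples as the test_inputs variable? Apparently it is invalid. Finally, cross-validation is meant to be run on the development set, for model selection. It usually does not make much sense to run it on the final evaluation (test) set.
  • Sorry, I have added more code.
  • You still have not said anything about the nature of the test_targets variable: is it a numpy array, a Python list, or something else? If it is an array, what are its .shape and .dtype?
  • test_targets is a Python list.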

Tags: python scikit-learn


【Solution 1】:

This error matches the recently reported issue #2508.

One workaround is to add:

test_targets = np.asarray(test_targets)

before calling cross_val_score.
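
The traceback ends at `y_train = y[train]`, where `y` is the plain Python list and `train` is an integer index array. The sketch below reproduces that situation in isolation and shows why the conversion helps. The values are made up for illustration, and the exact wording of the TypeError differs between NumPy versions.

```python
import numpy as np

test_targets = [0, 1, 0, 1, 1, 0]   # plain Python list, as in the question
train_idx = np.array([0, 2, 4])     # integer index array, like one fold's indices

try:
    test_targets[train_idx]         # a list cannot be indexed by an integer array
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# After conversion, integer-array ("fancy") indexing selects the requested rows.
targets_array = np.asarray(test_targets)
y_train = targets_array[train_idx]
```

If the targets are already an ndarray, np.asarray returns them unchanged, so the call is safe to leave in place.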

【Discussion】:

    【Solution 2】:

    I computed the AUC a different way, like this:

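    from sklearn.metrics import roc_curve, auc
    # predict_proba() requires a classifier; RandomForestRegressor only has predict()
    preds = clf.predict_proba(test)
    fpr, tpr, thresholds = roc_curve(test_targets, preds[:, 1])
    roc_auc = auc(fpr, tpr)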
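
    A self-contained version of this approach might look like the sketch below. It is not the asker's script: the data is synthetic (make_classification stands in for the real files), the split sizes and n_estimators are arbitrary, and a RandomForestClassifier is swapped in because predict_proba is only available on classifiers. It targets a current scikit-learn release on Python 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve

# Synthetic stand-in for the question's data (assumption: binary 0/1 targets).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# A classifier is used here so that predict_proba is available.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# One column per class, ordered as in clf.classes_; column 1 is the positive class.
preds = clf.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, preds[:, 1])
roc_auc = auc(fpr, tpr)
print(roc_auc)
```

    This scores a single held-out set rather than cross-validating, so it yields one AUC value instead of one per fold.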
    

    【Discussion】:
