Answers to: key error "not in index" during cross-validation

[Question title]: key error not in index while cross validation
[Posted]: 2019-01-21 23:13:38
[Question]:

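I applied an SVM to my dataset. The dataset is multi-label, meaning each observation can have more than one label.

However, KFold cross-validation raises a "not in index" error.

It reports the indices from 601 to 6007 as not in index (I have 1...6008 data samples).

Here is my code: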

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(X_test[category], prediction)))
    print 'SVM Linear f1 measurement is {} '.format(f1_score(X_test[category], prediction, average='weighted'))
    print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])

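Actually, I do not know how to apply KFold cross-validation so that I get the F1 score and accuracy for each label separately. I have looked at this and this, but they did not help me apply it to my case.

To make this reproducible, here is a small sample of the data frame. The last seven columns are my labels: ADR, WD, ...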

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1

Update

When I did everything Vivek Kumar said, it raised this error

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]

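in the classifier part. Do you have any idea how to resolve it?

Several links on Stack Overflow say that I need to reshape the training data. I tried that as well, but without success (link). Thanks :)

[Question comments]: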

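Can you elaborate? You wrote that the error occurs when using KFold. Is it in the code you attached? On which line?
@ShaharA Thanks for your comment. The error is raised when it tries to do the KFold split, so on the early lines of the code. I posted the whole code to show what I want to use the splits for later. In fact, the code runs fine with train_test_split, but not with KFold.
Have you tried kf.split(X) without y?
@ShaharA Yes. It does not seem to be related to that argument.
I have also added a small sample of my data frame, so the problem is now reproducible.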

Tags: python scikit-learn cross-validation

[Solution 1]:

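train_index and test_index are integer positions based on the number of rows. Plain pandas indexing does not work by position, though, and newer versions of pandas are stricter about how you slice or select data.

You need to use .iloc to access the data by position. More information is available here

This is what you need: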
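The difference can be seen on a toy frame. This is a minimal sketch that assumes the failing code indexed the DataFrame with plain brackets, such as X[train_index]; the six-row frame is made up for illustration. With plain brackets, pandas reads an integer array as a list of column labels, so the lookup fails, while .iloc reads the same integers as row positions.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame, made up for illustration (not the asker's data)
X = pd.DataFrame({"sentences": ["a", "b", "c", "d", "e", "f"]})

kf = KFold(n_splits=3)
# Take the first fold only; KFold yields arrays of row positions
train_index, test_index = next(kf.split(X))

# .iloc treats the integers as row positions, which is what KFold means
X_test = X.iloc[test_index]

# Plain brackets treat the integers as column labels, and no such columns exist
try:
    X[test_index]
    bracket_lookup_failed = False
except KeyError:
    bracket_lookup_failed = True

print(X_test)
print("plain-bracket lookup raised KeyError:", bracket_lookup_failed)
```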

for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ...
    ...

    # TfidfVectorizer doesn't work with a DataFrame,
    # because iterating a DataFrame gives the column names, not the actual data
    # So specify explicitly the column name, to get the sentences

    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
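
Putting the pieces together, a per-label evaluation could look like the sketch below. It is only an illustration under stated assumptions: the twelve sentences and the two label columns are made up, stop_words is left out because the question does not show how it is defined, and the `scores` dictionary is a hypothetical helper for collecting results. Note that the metrics are computed against y_test[category]; X holds only the sentences column, so X_test[category] would raise a KeyError.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy data for illustration; labels alternate so every fold sees both classes
df = pd.DataFrame({
    "sentences": [
        "severe weight gain and hair loss",
        "I have no idea when this will end",
        "dizziness and nausea after the first dose",
        "my doctor changed the prescription last month",
        "constant headaches and dry mouth",
        "I started taking it in the spring",
        "bad insomnia and weight gain",
        "the pharmacy was closed on Sunday",
        "nausea and blurred vision every morning",
        "I will ask my doctor next week",
        "hair loss and dizziness for weeks",
        "my sister takes the same medication",
    ],
    "ADR": [1, 0] * 6,
    "others": [0, 1] * 6,
})
categories = ["ADR", "others"]
X = df[["sentences"]]
y = df[categories]

# Iterating a DataFrame yields its column names, so passing X_train itself
# to the vectorizer would look like a single sample
assert list(X) == ["sentences"]

SVC_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

kf = KFold(n_splits=3)
# Hypothetical container: one list of per-fold scores per label and metric
scores = {c: {"accuracy": [], "f1": []} for c in categories}

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    for category in categories:
        # Fit and score inside the fold loop so every fold is evaluated
        SVC_pipeline.fit(X_train["sentences"], y_train[category])
        prediction = SVC_pipeline.predict(X_test["sentences"])
        scores[category]["accuracy"].append(
            accuracy_score(y_test[category], prediction))
        scores[category]["f1"].append(
            f1_score(y_test[category], prediction, average="weighted"))

for category in categories:
    print(category,
          "mean accuracy:", np.mean(scores[category]["accuracy"]),
          "mean F1:", np.mean(scores[category]["f1"]))
```

Fitting inside the fold loop matters here: in the question's code the pipeline is fitted after the loop has finished, so only the last split would ever be used.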

[Comments]:

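Thank you very much for your answer. How should I then pass the training data to the classifier? With this I get the error ValueError: Found input variables with inconsistent numbers of samples: [1, 5408] on the line SVC_pipeline.fit(X_train, y_train[category]). Thanks for your time.
Thanks again. I actually tried it that way too, but it raises this error:
File "pandas_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc File "pandas_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc File "pandas_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'ADR '
X_train and y_train are even highlighted in yellow, which suggests something is wrong with them, but I cannot figure out what. I also saw your post here stackoverflow.com/questions/44429600/…, but it did not help :|
@sariaGoudarzi With the sample data and code you provided in the question above, I do not get this error. Please update the question with your current complete code (the code you are using after taking the hints from this answer) along with the full error stack trace.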