ValueError: A given column is not a column of the dataframe (answers)

【Question title】: ValueError: A given column is not a column of the dataframe
【Posted】: 2021-04-01 17:28:56
【Question description】:

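Hi everyone,

I am trying to create a pipeline using scikit-learn.

Basically, I have a Jupyter notebook that loads the data with pandas and splits the dataset to train and test the model.

My problem appears on this line: clf.fit(X_train, y_train). You can see the whole code in the Jupyter notebook in my GitHub repo.

Error log: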

----------------------------------------------------------------------
KeyError                             Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'survived'

During handling of the above exception, another exception occurred:

KeyError                             Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    446             for col in columns:
--> 447                 col_idx = all_columns.get_loc(col)
    448                 if not isinstance(col_idx, numbers.Integral):

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'survived'

The above exception was the direct cause of the following exception:

ValueError                           Traceback (most recent call last)
<ipython-input-16-17661ab0f723> in <module>
----> 1 clf.fit(X_train, y_train)
      2 print("model score: %.3f" % clf.score(X_test, y_test))

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

~/anaconda3/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    527         self._validate_transformers()
    528         self._validate_column_callables(X)
--> 529         self._validate_remainder(X)
    530 
    531         result = self._fit_transform(X, y, _fit_transform_one)

~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _validate_remainder(self, X)
    325         cols = []
    326         for columns in self._columns:
--> 327             cols.extend(_get_column_indices(X, columns))
    328 
    329         remaining_idx = sorted(set(range(self._n_features)) - set(cols))

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    454             raise ValueError(
    455                 "A given column is not a column of the dataframe"
--> 456             ) from e
    457 
    458         return column_indices

ValueError: A given column is not a column of the dataframe

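I checked that the column exists before passing the dataframe to be split into train and test sets.

Does anyone know how to solve this?

Thanks in advance! Cheers

【Question comments】:

Tags: pandas scikit-learn jupyter-notebook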

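【Solution 1】:

The error arises because you dropped the survived column at the very start, when defining X, while the preprocessing step still looks that column up (hence the KeyError: 'survived' in the traceback). You only checked for its presence in y_train.

Simply replace

X = df.drop('survived', axis=1)

with

X = df

after which your

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

prints

model score: 1.000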
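
To confirm which column the transformer cannot find, one option is to compare the configured column lists against X.columns before fitting. This is only a minimal sketch: the toy dataframe and the contents of numeric_features and categorical_features are assumptions, since the asker's notebook is not reproduced here.

```python
import pandas as pd

# Toy data standing in for the asker's dataset (values are made up).
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
    "survived": [0, 1, 1, 1],
})

X = df.drop("survived", axis=1)
y = df["survived"]

# Hypothetical column lists, as they might be passed to a ColumnTransformer.
# The target name has slipped into the numeric list.
numeric_features = ["age", "fare", "survived"]
categorical_features = ["sex"]

# Every configured column must exist in X, otherwise fit() raises the ValueError.
configured = numeric_features + categorical_features
missing = [col for col in configured if col not in X.columns]
print("columns missing from X:", missing)
```

If the list is non-empty, either X or the column lists need to change; checking y_train alone does not reveal the mismatch.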

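【Comments】:

Hi, I don't understand... since the target feature is still in X, couldn't that be a problem when training the model? Does anyone know why?

【Solution 2】: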

If you are using Kaggle's pipeline, the problem may lie in the preprocessor:

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

numerical_cols and categorical_cols should be lists of feature names, not datasets.
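
One way to keep the two lists consistent with the features is to derive them from X after the target has been dropped. The following is a sketch under assumptions: the toy dataframe, the choice of StandardScaler and OneHotEncoder as transformers, and LogisticRegression as the classifier are illustrative and not taken from the asker's notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the asker's dataset (values are made up).
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    "sex": ["male", "female", "female", "female",
            "male", "male", "female", "female"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

# Separate the target first, then build the column lists from X only,
# so the target name can never end up in them.
X = df.drop("survived", axis=1)
y = df["survived"]
numerical_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf.fit(X_train, y_train)  # no ValueError: every listed column exists in X
print("model score: %.3f" % clf.score(X_test, y_test))
```

With only eight rows the printed score is meaningless; the point is that fit() runs while the target stays out of the features.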

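Do not include your target column in X_train: the model then learns the target from the target itself (target leakage), so it overfits and reports 100% accuracy, yet it will be useless in production, where the target is not available.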
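
The effect can be illustrated on synthetic data in which the features are pure noise, so that no honest model should do much better than chance. This is a hedged sketch: the column names, the random data, and the use of DecisionTreeClassifier are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Two noise features that carry no information about the random target.
df = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "survived": rng.integers(0, 2, size=n),
})
y = df["survived"]


def holdout_score(X):
    """Fit a tree on a train split and return accuracy on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)


leaky_score = holdout_score(df)                             # target left in X
honest_score = holdout_score(df.drop("survived", axis=1))   # features only

print("with target in X:    %.3f" % leaky_score)
print("without target in X: %.3f" % honest_score)
```

The gap between the two scores comes entirely from the leaked column, which is why a perfect score obtained with the target inside X says nothing about how the model will behave on new data.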

【Comments】: