【Posted】: 2020-05-25 23:35:11
【Question】:
I have a dataframe X that contains integer, float, and string columns. I want to one-hot encode every column of "object" dtype, so I am trying this:
encoding_needed = X.select_dtypes(include='object').columns
ohe = preprocessing.OneHotEncoder()
X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str)) #need astype bc I imputed with 0, so some rows have a mix of zeroes and strings.
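However, I end up with IndexError: tuple index out of range. I don't quite understand why: according to the documentation, the encoder expects X: array-like, shape [n_samples, n_features], so I should be able to pass a dataframe. How can I one-hot encode only the columns listed in encoding_needed?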
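For context, OneHotEncoder.fit_transform returns a SciPy sparse matrix by default, and it produces one output column per category rather than one per input column, so the result generally cannot be assigned back onto the same column labels. A minimal sketch of one possible workaround follows. The toy frame, its column names, and the names new_columns, encoded_df, and X_encoded are hypothetical stand-ins, not the asker's confidential data.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy frame standing in for the confidential X.
X = pd.DataFrame({
    "age": [31, 45, 27],
    "score": [0.5, 1.25, 3.0],
    "city": ["Oslo", 0, "Lima"],  # the 0 mimics a value imputed with zero
    "plan": pd.Series(["basic", "pro", "basic"], dtype=object),
})

encoding_needed = X.select_dtypes(include="object").columns

ohe = preprocessing.OneHotEncoder()
# The default output is sparse; convert it to a dense NumPy array explicitly.
encoded = ohe.fit_transform(X[encoding_needed].astype(str)).toarray()

# One output column per (source column, category) pair.
new_columns = [
    f"{col}_{cat}"
    for col, cats in zip(encoding_needed, ohe.categories_)
    for cat in cats
]
encoded_df = pd.DataFrame(encoded, columns=new_columns, index=X.index)

# Swap the original object columns for their encoded counterparts.
X_encoded = pd.concat([X.drop(columns=encoding_needed), encoded_df], axis=1)
print(X_encoded.columns.tolist())
```

With 81 object columns the dense result may become wide, so keeping the sparse matrix and stacking it with the numeric columns could be preferable; that trade-off depends on the number of distinct categories.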
Edit:
The data is confidential, so I can't share it, and I can't create a dummy version either because it has 123 columns.
I can provide the following:
X.shape: (40755, 123)
encoding_needed.shape: (81,) and is a subset of columns.
Full stack trace:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-90-6b3e9fdb6f91> in <module>()
1 encoding_needed = X.select_dtypes(include='object').columns
2 ohe = preprocessing.OneHotEncoder()
----> 3 X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str))
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3365 self._setitem_frame(key, value)
3366 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367 self._setitem_array(key, value)
3368 else:
3369 # set column
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3393 indexer = self.loc._convert_to_indexer(key, axis=1)
3394 self._check_setitem_copy()
-> 3395 self.loc._setitem_with_indexer((slice(None), indexer), value)
3396
3397 def _setitem_frame(self, key, value):
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
592 # GH 7551
593 value = np.array(value, dtype=object)
--> 594 if len(labels) != value.shape[1]:
595 raise ValueError('Must have equal len keys and value '
596 'when setting with an ndarray')
IndexError: tuple index out of range
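A likely reading of this traceback: the value on the right-hand side is a SciPy sparse matrix, and the np.array(value, dtype=object) call shown at line 593 wraps it in a zero-dimensional object array instead of a 2-D one, so value.shape[1] does not exist. The sketch below reproduces just that step on a small stand-in matrix; the names encoded, wrapped, and dense are illustrative only.

```python
import numpy as np
from scipy import sparse

# Stand-in for the encoder output: a small 3x2 sparse matrix.
encoded = sparse.csr_matrix(np.eye(3, 2))

# The same conversion pandas applies at line 593 of the traceback.
wrapped = np.array(encoded, dtype=object)
print(wrapped.shape)  # empty tuple: the matrix became a single object element

try:
    wrapped.shape[1]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)

# Densifying first yields a regular 2-D array with a usable shape[1].
dense = encoded.toarray()
print(dense.shape)
```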
【Discussion】:
-
Please provide a sample of your data and the full error traceback, not just the last line.
-
@G.Anderson I updated the question.
Tags: python pandas scikit-learn one-hot-encoding