【Posted】: 2022-01-27 00:34:57
【Question】:
This is my X_train:
> print(type(X_train))
> print(X_train)
<class 'pandas.core.frame.DataFrame'>
      0  1  2  3  4  5         keyword
1386  2  1  1  0  1  1    bush%20fires
4048  0  1  1  0  1  0  forest%20fires
3086  0  0  0  0  0  0     electrocute
272   0  0  0  1  0  0      apocalypse
7462  0  0  0  0  0  0          wounds
...  .. .. .. .. .. ..             ...
4931  0  1  0  0  1  0          mayhem
3264  0  1  0  0  1  0        engulfed
1653  0  2  0  0  2  0       collapsed
2607  0  0  0  0  0  0       destroyed
2732  0  0  0  0  0  0      devastated

[6090 rows x 7 columns]
This is the preprocessing code I run on X_train:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_features = [0, 1, 2, 3, 4, 5]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_features = ["keyword"]
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("transformer", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor)]
)

X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
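Because I use OneHotEncoder on the "keyword" column, I expect a set of new columns to be added, one for each possible value of "keyword". I also expect my numeric columns to be preserved as they were.
But... this is what X_train looks like after preprocessing: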
> print(type(X_train))
> print(pd.DataFrame(X_train))
<class 'scipy.sparse.csr.csr_matrix'>
0
0 (0, 0)\t2.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
1 (0, 1)\t1.0\n (0, 2)\t1.0\n (0, 4)\t1.0\n ...
2 (0, 94)\t1.0
3 (0, 3)\t1.0\n (0, 13)\t1.0
4 (0, 223)\t1.0
... ...
6085 (0, 1)\t1.0\n (0, 4)\t1.0\n (0, 148)\t1.0
6086 (0, 1)\t1.0\n (0, 4)\t1.0\n (0, 99)\t1.0
6087 (0, 1)\t2.0\n (0, 4)\t2.0\n (0, 53)\t1.0
6088 (0, 80)\t1.0
6089 (0, 84)\t1.0
[6090 rows x 1 columns]
As you can see, OneHotEncoder did not work, and somehow the numeric columns have disappeared as well.
Why does this happen, and how can I fix it?
【Discussion】:
-
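OneHotEncoder is not being used because categorical_transformer is not included in the preprocessor's list of transformers; only numeric_transformer is. What do you mean when you say the numeric columns are gone?
-
Oops, I should have included it. Just updated.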
-
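The numeric columns are still present in the transformed data you show. Did you copy it incorrectly? Could you provide a smaller version of the dataset that we can copy and run to reproduce the problem?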
-
Oops, I have now put in the correct output.
-
For a cleaner representation, you might consider converting X_train to an array (X_train.toarray()) before passing it to pd.DataFrame.
Tags: python scikit-learn pipeline