【Posted】: 2020-07-02 21:18:06
【Question】:
Environment:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
Sample data:
X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1,0,1]})
Desired result: I would like to include the sklearn OneHotEncoder in my pipeline in this format:
encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)
# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)
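Challenge: OneHotEncoder encodes everything, including the numeric columns. I want to leave the numeric columns unchanged and encode only the categorical features, in an efficient way that is compatible with Pipeline().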
encoder = OneHotEncoder(drop='first', sparse=False)
encoder.fit(X_train)
encoder.transform(X_train)  # Column C is encoded too - this is what I want to avoid
Workaround (not ideal): I can solve the problem with pd.get_dummies(). However, that means I cannot include it in my pipeline. Or is there a way?
X_train = pd.get_dummies(X_train, drop_first=True)
【Discussion】:
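One possible approach, as a minimal sketch rather than a definitive answer: wrap the OneHotEncoder in sklearn's ColumnTransformer so that only the categorical columns are encoded and the rest pass through unchanged. The column list `categorical_cols` and the step name `'onehot'` are assumptions made for illustration. The `sparse` argument is left out on purpose, and `sparse_threshold=0` is used to force a dense result so that StandardScaler can center it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
# A 1-D target is used here (assumption) instead of a one-column DataFrame.
y_train = pd.Series([1, 0, 1], name='Y')

# Assumption: the categorical columns are listed by hand.
categorical_cols = ['A', 'B']

encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(drop='first'), categorical_cols)],
    remainder='passthrough',  # columns not listed above (here 'C') are kept as-is
    sparse_threshold=0,       # always return a dense array
)

pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', StandardScaler()),
                 ('Classifier', RandomForestClassifier(random_state=0))])
pipe.fit(X_train, y_train)

# Inspect the encoder step on its own: the one-hot columns come first,
# followed by the untouched numeric column 'C'.
encoded = encoder.fit_transform(X_train)
predictions = pipe.predict(X_train)
```

With this sample data, `encoded` has two columns for A, two for B (the first category of each is dropped), and the original values of C in the last column. Note that the Scaler step still standardizes every column, including the one-hot ones, because that is how the ideal pipeline in the question is laid out.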
Tags: python pandas scikit-learn one-hot-encoding