Sklearn's SimpleImputer doesn't work in a pipeline? Answers

【Question title】: Sklearn's SimpleImputer doesn't work in a pipeline?
【Posted】: 2019-01-15 10:26:00
【Question description】:

I have a pandas DataFrame with some NaN values in one particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

To handle them, I built the following pipeline:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression  # missing in the original snippet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

# Impute first, so the scaler and the classifier never see NaN.
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])

Now, when I pass this pipeline to RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

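The actual error is much longer than this; I can post the whole thing in an edit if necessary. In any case, I'm fairly sure this column is the only one that contains NaN. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline runs fine in my RandomizedSearchCV. I checked the documentation, but SimpleImputer seems to behave (almost) exactly like Imputer. What is the difference in behavior? How can I use an imputer in my pipeline without falling back on the deprecated Imputer?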
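For comparison, here is a minimal sketch of the setup described above, run on synthetic data. The data, the parameter grid, and the variable names are assumptions made for illustration; only the three pipeline steps come from the question. With NaN confined to the input features, a search like this would normally be expected to fit without the ValueError, which can help narrow down whether the problem lies in the pipeline or in the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 60 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)  # target is built before any NaN is injected

# Put NaN into one feature column only, as in the question.
X[rng.choice(60, size=10, replace=False), 1] = np.nan

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler(with_mean=True)),
                 ('logistic', LogisticRegression())])

# Hypothetical search space; the question does not show the real one.
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
search = RandomizedSearchCV(pipe,
                            param_distributions={'logistic__C': c_values},
                            n_iter=3, cv=3, random_state=0)
search.fit(X, y)
best_c = search.best_params_['logistic__C']
print(best_c, search.best_score_)
```

If this sketch runs but the real data still fails, the difference between the two data sets (for example NaN in the target, or infinite values) is the next thing to inspect.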

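【Question comments】:

If you run SimpleImputer on its own (rather than inside the pipeline), do you get the same error?
I hit the same error when passing SimpleImputer(strategy='constant', fill_value=0).
@FrédérandOuweric: Have you checked that the target variable contains no NaN values? An imputer only handles missing values in the input features.
I ran into the same problem. It turned out I had to specify missing_values=None explicitly. Honestly, I would have expected that to be the default behavior.
This issue appears to have been addressed here: github.com/scikit-learn/scikit-learn/issues/21112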
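The checks suggested in these comments can be sketched as follows. The toy DataFrame and target are assumptions for illustration. The sketch runs the imputer on its own, counts NaN in the target (which no transformer in the pipeline touches), and counts infinite feature values, since the error message also mentions infinity and SimpleImputer only fills entries that match missing_values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (assumption): one feature column with NaN and a target with one NaN.
X = pd.DataFrame({"some column": [1.0, np.nan, 3.0, np.nan, 5.0]})
y = pd.Series([0, 1, np.nan, 1, 0])

# 1) Run the imputer outside any pipeline and confirm no NaN is left.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
features_clean = not np.isnan(X_imputed).any()

# 2) Count NaN in the target; the imputer never sees y.
target_nan_count = int(y.isna().sum())

# 3) Count infinite values in the features; these are not imputed.
inf_count = int(np.isinf(X.to_numpy(dtype=float)).sum())

print(features_clean, target_nan_count, inf_count)
```

A non-zero target_nan_count or inf_count would point at the data rather than at SimpleImputer.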

Tags: scikit-learn pipeline sklearn-pandas

【Solution 1】:

SimpleImputer in make_pipeline

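from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnSelector is not defined in this answer; it is presumably a custom
# transformer that returns only the named DataFrame columns.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            # Here a single space, not NaN, is the marker for a missing value.
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            # In scikit-learn 1.2 and later this argument is named sparse_output.
            OneHotEncoder(sparse=False)
        ))
    ])
)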

SimpleImputer in a Pipeline

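# TypeSelector is not part of scikit-learn; it is presumably a custom
# transformer that keeps only the columns of the given dtype.
# The step labels were swapped in the original and are corrected here.
('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))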
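The fragment above relies on a TypeSelector class that the answer does not show. A minimal sketch of what such a helper might look like follows; the class body is an assumption, written here only so that the numeric branch can be run end to end on a toy DataFrame (also assumed).

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class TypeSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the DataFrame columns of one dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Nothing to learn; selection depends only on column dtypes.
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.dtype])


# Toy frame (assumption): one numeric column with NaN, one text column.
df = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0],
    "Name": ["a", "b", "c"],
})

numeric = Pipeline([
    ("Numeric Extractor", TypeSelector(np.number)),
    ("Impute Zero", SimpleImputer(strategy="constant", fill_value=0)),
])
numeric_out = numeric.fit_transform(df)
print(numeric_out)
```

Only the "Amount" column reaches the imputer, and its NaN is replaced with 0. In current scikit-learn, ColumnTransformer with make_column_selector covers the same need without a custom class.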

【Discussion】:

【Solution 2】:

I ran into the same problem, and this fixed it:

imputer = SimpleImputer(strategy = 'median', fill_value = 0)
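A small check of what this line does may be useful. According to the scikit-learn documentation, fill_value is used when strategy is 'constant', so with strategy='median' the fitted fill value should still be the column median. The sketch below, on an assumed toy array, compares the fitted statistics with and without fill_value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (assumption): the median of the observed values 1, 3, 8 is 3.
X = np.array([[1.0], [np.nan], [3.0], [8.0]])

with_fill = SimpleImputer(strategy="median", fill_value=0).fit(X)
without_fill = SimpleImputer(strategy="median").fit(X)

# statistics_ holds the value that will replace NaN in each column.
print(with_fill.statistics_, without_fill.statistics_)
```

If both imputers report the same statistic, the extra argument is not what changes the outcome, and the fix reported here may have come from something else that changed at the same time.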

【Discussion】: