Using reindex with fill_value for categorical and continuous features in the same dataframe: answers

【Question title】: Using reindex with fill_value for categorical and continuous features in same dataframe
【Posted】: 2017-06-29 06:56:27
【Question description】:

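I use pandas.get_dummies to encode categorical features for both fitting and classification, and I just noticed that, when classifying a new sample, Imputer() puts the mean value into the "off" categorical switches (the dummy columns) that dataframe.reindex() adds.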

I read this post, which suggests using fill_value=0 on the reindex call. That seems like a good solution, but I have one nagging question before I put this code into production.

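Does anyone know whether the pandas DataFrame.reindex function sets all NaNs to the value in fill_value, or only those in the new columns it adds? I want to make sure that any non-categorical data containing NaN is still handled by Imputer().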

【Question discussion】:

Tags: python pandas scikit-learn

【Solution 1】:

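If I understand your question correctly, I believe it fills the NaN values that reindexing introduces, and it does so in every column, not only in newly added ones.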

From [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html][1]:

import pandas as pd
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df = pd.DataFrame({
    'http_status': [200, 200, 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=index)

df

                http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

df.reindex(new_index, fill_value='missing') returns:

                  http_status   response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02

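Neither of these columns is new, yet the NaN values were still filled in. I would definitely test my interpretation before putting this into production. I am not sure I have the right context.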
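One way to test this interpretation is to check directly which cells fill_value touches. The sketch below reuses the example above, but with a numeric fill_value of 0 (chosen here purely for illustration), and asserts that only the rows for labels absent from the original index are filled.

```python
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df = pd.DataFrame(
    {'http_status': [200, 200, 404, 404, 301],
     'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=index)

filled = df.reindex(new_index, fill_value=0)

# Labels that did not exist in the original index.
added = [label for label in new_index if label not in df.index]
# Labels that were already present before reindexing.
kept = [label for label in new_index if label in df.index]

# Rows for the added labels are filled with 0 in every column.
assert (filled.loc[added] == 0).all().all()

# Rows for labels that already existed keep their original values.
assert (filled.loc[kept] == df.loc[kept]).all().all()
```

If both assertions pass, fill_value is applied per added label across all columns, and values that were already in the frame are left alone.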

Edit:

I should add that if a value was already "NaN" before reindexing, .reindex does not fill it:

import pandas as pd
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
# Note: 'NaN' below is a string placeholder, not a true missing value (numpy.nan).
df = pd.DataFrame({
    'http_status': [200, 'NaN', 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=index)

df.reindex(new_index) returns:

               http_status  response_time
Safari                404           0.07
Iceweasel             NaN            NaN
Comodo Dragon         NaN            NaN
IE10                  404           0.08
Chrome                NaN           0.02

while df.reindex(new_index, fill_value='missing') returns:

              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                NaN          0.02

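The http_status value for Chrome is unaffected by the reindexing: it stays NaN because the Chrome label already existed in the index.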
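In the situation from the question, reindex is applied to columns rather than rows. Below is a minimal sketch of that case, assuming a hypothetical frame with a categorical column `color` and a continuous column `size` that holds a real missing value (`numpy.nan`, not the string 'NaN'). All column names and data are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: one categorical and one continuous feature.
train = pd.DataFrame({'color': ['red', 'green', 'blue'],
                      'size': [1.0, 2.0, 3.0]})
train_cols = pd.get_dummies(train, columns=['color']).columns

# New sample: 'size' is genuinely missing and only one category is present.
sample = pd.DataFrame({'color': ['red'], 'size': [np.nan]})
encoded = pd.get_dummies(sample, columns=['color'])

# Align to the training columns; dummy columns that are absent are added as 0.
aligned = encoded.reindex(columns=train_cols, fill_value=0)

# The added dummy columns hold 0 ...
assert (aligned[['color_blue', 'color_green']] == 0).all().all()
# ... while the pre-existing NaN in 'size' is left untouched for the imputer.
assert aligned['size'].isna().all()
```

Under these assumptions, fill_value=0 only switches "off" the dummy columns that reindex had to create, so a later imputation step still sees the NaN in the continuous column.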

【Discussion】:

Thanks for the reply, @cptnhaddock. This was very helpful!