Count most frequent group with NaN values: answers

【Question title】: Count most frequent group with NaN values
【Posted】: 2018-12-10 10:40:29
【Question】:

Basically, I want to count the most frequent item in each group, grouping by 2 variables. I use this code:

dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])

This code works, but not for columns that contain NaN values, because NaN is a float while the other values are str. It raises this error:

'<' not supported between instances of 'float' and 'str'

I want to omit the NaN values and compute the mode count over the rest, so str(x) is not a solution. scipy.stats.mode(x, nan_policy='omit') does not work either; it fails with:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Could you suggest how to handle this? Thanks.

【Comments】:

Tags: python pandas dataframe scipy pandas-groupby

【Solution 1】:

I think you need dropna to remove the NaNs:

dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x.dropna())[1])

If you need NaN as the result for groups that contain only NaNs:

dfgrouped = (data[COLUMNS.copy()]
              .groupby(['Var1','Var2'])
              .agg(lambda x: None if x.isnull().all() else stats.mode(x.dropna())[1]))
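
If the shape of what stats.mode returns is the sticking point, a pandas-only variant may be easier to reason about. The sketch below is an illustrative alternative, not part of the answer above: mode_count is a hypothetical helper name and the sample frame is made up. Series.value_counts skips NaN by default, so an all-NaN group gives an empty result, which the helper maps to NaN.

```python
import numpy as np
import pandas as pd

def mode_count(s):
    """Return how often the most frequent non-NaN value occurs in s.

    Hypothetical helper: returns NaN when s has no non-NaN values.
    """
    counts = s.value_counts()  # dropna=True by default, sorted by count descending
    return counts.iloc[0] if len(counts) else np.nan

# Made-up sample data: one mixed group, one with a clear mode, one all-NaN
df = pd.DataFrame({
    'Var1': [1, 1, 1, 5, 5, 5, 7, 7, 7],
    'Var2': [2, 2, 2, 6, 6, 6, 8, 8, 8],
    'Value': [np.nan, 'hello', np.nan,
              'next', np.nan, 'next',
              np.nan, np.nan, np.nan],
})

res = df.groupby(['Var1', 'Var2'])['Value'].agg(mode_count)
print(res)
```

Because no values are compared or sorted against each other, mixing str and NaN in the same column should not trigger the '<' TypeError here.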

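【Comments】:

I don't want to drop groups that contain only NaN values. If that happens, the mode should be empty. So I tried your first suggestion (stats.mode(x.dropna())[1]), but it only works without the [1]. Do you have any suggestions?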

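【Solution 2】:

Drop NaN first

You can use dropna as an initial step, before performing the groupby. If you instead call dropna inside the aggregation, groups in which every value is NaN may cause stats.mode to raise an error.

Here is a minimal example: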

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([[1, 2, np.nan], [1, 2, 'hello'], [1, 2, np.nan],
                   [5, 6, 'next'], [5, 6, np.nan], [5, 6, 'next'],
                   [7, 8, np.nan], [7, 8, np.nan], [7, 8, np.nan]],
                  columns=['Var1', 'Var2', 'Value'])

res = df.dropna(subset=['Value'])\
        .groupby(['Var1', 'Var2'])\
        .agg(lambda x: stats.mode(x)[1][0])

print(res)

           Value
Var1 Var2       
1    2         1
5    6         2

Catch IndexError

If you need to keep the groups in which every value is NaN, you can catch IndexError:

def mode_calc(x):
    try:
        return stats.mode(x.dropna())[1][0]
    except IndexError:
        return np.nan

res = df.groupby(['Var1', 'Var2'])\
        .agg(mode_calc)

print(res)

           Value
Var1 Var2       
1    2       1.0
5    6       2.0
7    8       NaN
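
The two ideas above can also be combined without relying on an exception: drop the NaN rows first, aggregate, then reindex the result against every (Var1, Var2) pair found in the original frame. This is only a sketch under assumptions: the names all_groups and counts are illustrative, and value_counts is swapped in for stats.mode to count the most frequent value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan], [1, 2, 'hello'], [1, 2, np.nan],
                   [5, 6, 'next'], [5, 6, np.nan], [5, 6, 'next'],
                   [7, 8, np.nan], [7, 8, np.nan], [7, 8, np.nan]],
                  columns=['Var1', 'Var2', 'Value'])

# Every (Var1, Var2) pair in the data, including groups that are all NaN
all_groups = pd.MultiIndex.from_frame(df[['Var1', 'Var2']].drop_duplicates())

counts = (df.dropna(subset=['Value'])          # all-NaN groups vanish here
            .groupby(['Var1', 'Var2'])['Value']
            .agg(lambda s: s.value_counts().iloc[0])  # count of the top value
            .reindex(all_groups))              # bring the missing groups back as NaN

print(counts)
```

After the reindex, group (7, 8) is present again with a NaN count, while the other groups keep their counts.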

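【Comments】:

Unfortunately, I don't want to drop groups that contain only NaN values. If that happens, the mode should be empty. Any suggestions? Many thanks!
@hta, sure, take a look at the alternative I've added.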

【Solution 3】:

nan is of type float, and np.nan == np.nan evaluates to False. If you need the NaN values to be grouped together, you can try something like this:

# First replace nan values with something like 'Unavailable'
data.fillna('Unavailable', inplace=True)
# Then re-run your code
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])

This treats all the unavailable entries as a single value. Hope that helps.
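
One caveat worth checking before relying on this: after fillna, 'Unavailable' takes part in the count like any other string, so it can itself become the most frequent value. The sketch below uses a made-up sample frame, fills a copy instead of using inplace=True, and uses value_counts in place of stats.mode; all of these are illustrative assumptions rather than the code from the question.

```python
import numpy as np
import pandas as pd

# The two facts stated above
print(type(np.nan).__name__)  # float
print(np.nan == np.nan)       # False

df = pd.DataFrame({
    'Var1': [1, 1, 1, 5, 5, 5, 7, 7, 7],
    'Var2': [2, 2, 2, 6, 6, 6, 8, 8, 8],
    'Value': [np.nan, 'hello', np.nan,
              'next', np.nan, 'next',
              np.nan, np.nan, np.nan],
})

# Fill on a copy so the original NaN values stay available
filled = df.assign(Value=df['Value'].fillna('Unavailable'))

res = (filled.groupby(['Var1', 'Var2'])['Value']
             .agg(lambda s: s.value_counts().iloc[0]))
print(res)
# Group (1, 2): 'Unavailable' occurs twice and 'hello' once, so the count is 2
# Group (7, 8): only 'Unavailable', so the count is 3
```

If the NaN values should be ignored rather than counted, this placeholder approach gives a different answer from dropping them.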

【Comments】: