Pandas - Replacing Nulls with the most frequent value from groups

【Question title】: Pandas - Replacing Nulls with the most frequent value from groups
【Posted】: 2019-10-25 12:20:54
【Question】:

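I have a dataset with the following columns:

['sex', 'age', 'relationship_status']

The 'relationship_status' column contains some NaN values, and I want to replace each of them with the most frequent value in its group, where groups are defined by age and sex.

I know how to group and count the values: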

df2.groupby(['age','sex'])['relationship_status'].value_counts()

This returns:

age   sex     relationship_status
17.0  female  Married with kids       1
18.0  female  In relationship         5
              Married                 4
              Single                  4
              Married with kids       2
      male    In relationship         9
              Single                  5
              Married                 4
              Married with kids       4
              Divorced                3
.
.
.

86.0  female  In relationship         1
92.0  male    Married                 1
97.0  male    In relationship         1

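Again, what I need is this: whenever 'relationship_status' is null, the program should replace it with the most frequent value for that person's age and sex.

Can anyone suggest how to do this?

Kind regards.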

【Question comments】:

Tags: python pandas

【Solution 1】:

Something like this:

mode = df2.groupby(['age','sex'])['relationship_status'].agg(lambda x: pd.Series.mode(x)[0])
df2['relationship_status'].fillna(mode, inplace=True)

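【Comments】:

Hey @John-Zwinck, thanks for your answer. However, it doesn't seem to work. When I run the first line on its own, it raises: IndexError: index out of bounds. When I change (lambda x: pd.Series.mode(x)[0]) to (lambda x: pd.Series.mode(x)), it works, but then the second line raises: ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'. This is probably because mode returns several values when there is a tie, as in this case: 77.0 female [Divorced, Married, Married with kids]. Any idea how to proceed?
It sounds like some age+sex combinations have NaN for every relationship status value. You need to decide how to handle that and put it in the lambda. For example, you could change the lambda into a regular function and do whatever you want when the mode is empty.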
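
A minimal sketch of that suggestion, under a few assumptions: the helper name `most_frequent` and the toy data are made up for illustration, all-NaN groups are simply left as NaN, and ties are broken by taking the first mode. It also uses `transform` rather than `agg`, so that the per-group result lines up with the original rows before `fillna` is applied.

```python
import numpy as np
import pandas as pd


def most_frequent(values: pd.Series):
    """Return the most frequent non-null value, or NaN if the group has none."""
    modes = values.mode()  # NaN is ignored by default; ties are returned in sorted order
    return modes.iloc[0] if not modes.empty else np.nan


# Toy data (hypothetical): the (30, 'M') group contains only NaN.
df2 = pd.DataFrame({
    'age': [18, 18, 18, 18, 30, 30],
    'sex': ['F', 'F', 'F', 'F', 'M', 'M'],
    'relationship_status': ['Single', 'Single', 'Married', np.nan, np.nan, np.nan],
})

# transform broadcasts each group's result back onto the original row index,
# so fillna can align it row by row.
group_mode = df2.groupby(['age', 'sex'])['relationship_status'].transform(most_frequent)
df2['relationship_status'] = df2['relationship_status'].fillna(group_mode)
```

Returning NaN for empty groups keeps the "nothing to infer from" rows visibly missing; returning a placeholder string instead is an equally valid choice, depending on what the downstream analysis expects.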

【Solution 2】:

Try this; it returns 'ALL_NAN' when an (age, sex) subgroup contains nothing but NaN:

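import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 25, 25, 25, 25, 25],
    'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
    'status': ['married', np.nan, 'married', np.nan, np.nan, 'single'],
})

# Most frequent status of each (age, sex) group, repeated for every row of the group;
# 'ALL_NAN' if the group has no non-null value
group_mode = df.groupby(['age', 'sex'])['status'].transform(
    lambda x: x.mode()[0] if any(x.mode()) else 'ALL_NAN')

# Overwrite only the missing entries
df.loc[df['status'].isna(), 'status'] = group_mode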

Output:

   age sex   status
0   25   F  married
1   25   F  married
2   25   F  married
3   25   M   single
4   25   M   single
5   25   M   single
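
The sample data above has no subgroup made up entirely of NaN, so the 'ALL_NAN' branch is never taken. A small variation that does exercise it might look like this (the data is invented for illustration; the lambda is the same as in the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical data: every status in the (40, 'M') group is missing.
df = pd.DataFrame({
    'age': [25, 25, 25, 40, 40],
    'sex': ['F', 'F', 'F', 'M', 'M'],
    'status': ['married', np.nan, 'married', np.nan, np.nan],
})

group_mode = df.groupby(['age', 'sex'])['status'].transform(
    lambda x: x.mode()[0] if any(x.mode()) else 'ALL_NAN')

# Rows in the all-NaN group receive the placeholder instead of raising IndexError.
df.loc[df['status'].isna(), 'status'] = group_mode
```

Note that `any(...)` tests truthiness, so it would also take the placeholder branch if the mode itself were a falsy value such as `0` or an empty string. Checking `x.mode().empty` avoids that edge case.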

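【Comments】:

This version didn't work either; it returned IndexError: index out of bounds. Thanks for posting, though.
Check your data; a small change should fix it: lambda x: x.mode()[0] if any(x.mode()) else 'ALL_NAN'
Yes! Thank you! In case anyone is interested, I also found a similar solution: df2["relationship_status"] = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty"))
mode().empty is cleaner and better :) Please upvote/accept my answer.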
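
For completeness, here is a lambda-free sketch that starts from the `value_counts()` call in the question. It is not taken from either answer: the column names `n` and `group_mode` and the toy data are assumptions made for illustration. Groups with no non-null value never appear in the counts, so after a left merge they stay NaN without any special-casing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the (30, 'M') group contains only NaN.
df2 = pd.DataFrame({
    'age': [18, 18, 18, 18, 30, 30],
    'sex': ['F', 'F', 'F', 'F', 'M', 'M'],
    'relationship_status': ['Single', 'Single', 'Married', np.nan, np.nan, np.nan],
})

# Count each status per (age, sex) group; NaN statuses are dropped by default.
counts = (df2.groupby(['age', 'sex'])['relationship_status']
             .value_counts()
             .reset_index(name='n'))

# Keep the single most frequent status per group (ties: first after a stable sort).
top = (counts.sort_values('n', ascending=False, kind='stable')
             .drop_duplicates(['age', 'sex'])
             .rename(columns={'relationship_status': 'group_mode'})
             [['age', 'sex', 'group_mode']])

# A left merge keeps every original row, in its original order.
merged = df2.merge(top, on=['age', 'sex'], how='left')
filled = merged['relationship_status'].fillna(merged['group_mode'])
```

On large frames this avoids calling a Python function once per group, at the cost of a merge that resets the index; whether that trade-off matters depends on the data.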