pandas - filter on groups which have at least one column containing non-null values in a groupby (answers)

【Question title】: pandas - filter on groups which have at least one column containing non-null values in a groupby
【Posted】: 2022-04-29 06:00:04
【Question】:

I have the following python pandas dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'], 'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

>>> print(df)
  Id      A    B    C
0  1   TRUE  NaN  NaN
1  1   TRUE  NaN  NaN
2  1   TRUE  abc  NaN
3  2   TRUE  NaN  NaN
4  2   TRUE  NaN  NaN
5  3  FALSE  def  456

I would like to end up with the following dataframe:

>>> print(dfout)
  Id     A    B   C
0  1  TRUE  abc NaN

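The same Id value can appear on multiple rows. Each Id has a single value in column A, either TRUE or FALSE, across all of its rows. Columns B and C can hold any value, including NaN.
For each Id with A=TRUE, I want one row in dfout showing the maximum value seen in columns B and C. However, if NaN is the only value seen in columns B and C across all rows of an Id, that Id is excluded from dfout.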

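Id 1 has A=TRUE and has B=abc in its third row, so it meets the requirements.
Id 2 has A=TRUE, but columns B and C are NaN in both of its rows, so it does not.
Id 3 has A=FALSE, so it does not meet the requirements.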

I created a groupby on Id, then applied a mask to keep only the rows with A=TRUE. But I can't work out how to drop the rows of an Id whose B and C columns are NaN in every row.

grouped = df.groupby(['Id'])
mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
df.loc[mask].reset_index(drop=True)

  Id     A    B    C
0  1  TRUE  NaN  NaN
1  1  TRUE  NaN  NaN
2  1  TRUE  abc  NaN
3  2  TRUE  NaN  NaN
4  2  TRUE  NaN  NaN

I then tried a few things, such as:

df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull

but I get errors such as:

" TypeError: unhashable type: 'list' ".

I am using Python 3.6 and pandas 0.23.0, and I looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by
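For reference, the TypeError above most likely arises because `DataFrame.all` takes `axis` as its first positional argument, so the list `['B']` is interpreted as an axis rather than as a column selection. A minimal sketch of the mask-based approach the question was aiming for is given below; the variable names (`row_has_value`, `group_has_value`, `a_is_true`, `kept`) are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': ['1', '1', '1', '2', '2', '3'],
    'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'],
    'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],
    'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456'],
})

# Row-level flag: True where B or C holds a non-null value.
row_has_value = df[['B', 'C']].notna().any(axis=1)

# Broadcast to the group: True on every row of an Id that has
# at least one row with a non-null B or C.
group_has_value = row_has_value.groupby(df['Id']).transform('any')

# Column A holds strings in this data, so compare against 'TRUE'.
a_is_true = df['A'].eq('TRUE')

# Keep only rows whose Id satisfies both conditions.
kept = df.loc[a_is_true & group_has_value].reset_index(drop=True)
print(kept)
```

This keeps all rows of the qualifying Ids; collapsing them to one row per Id is a separate step.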

【Question comments】:

Tags: python pandas filter

【Solution 1】:

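The solution has three parts.

1. Filter the dataframe to keep the rows where column A is True
2. Group by Id and use first, which returns the first non-null value in each column
3. Use dropna on the resulting dataframe, restricted to columns B and C, with how = 'all'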

df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')
```
    Id  A       B   C
0   1   True    abc NaN
```
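
The three steps can also be run one at a time on the question's data. The following is a minimal sketch, assuming column A holds the strings 'TRUE'/'FALSE' as in the question (so the comparison uses 'TRUE' rather than the boolean True); the intermediate names `true_rows`, `first_non_null` and `dfout` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': ['1', '1', '1', '2', '2', '3'],
    'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'],
    'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],
    'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456'],
})

# Step 1: keep only rows where A is the string 'TRUE' (drops Id 3).
true_rows = df.loc[df['A'] == 'TRUE']

# Step 2: one row per Id, taking the first non-null value in each column.
first_non_null = true_rows.groupby('Id', as_index=False).first()

# Step 3: drop Ids where both B and C are still null (drops Id 2).
dfout = first_non_null.dropna(subset=['B', 'C'], how='all')
print(dfout)
```

Note that `first()` returns the first non-null value per column, not the maximum; for this data the two coincide because each kept Id has at most one non-null value per column.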

【Comments】:

Perfect, very clean; for my example I changed that bit to df['A'] == 'TRUE'
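
The change in the comment matters because column A in the question holds strings, and a string never compares equal to the boolean True. A small sketch of the difference follows; the Series `a` and the mapping to booleans are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Column A as in the question: strings, not booleans.
a = pd.Series(['TRUE', 'TRUE', 'FALSE'], name='A')

# Comparing strings with the boolean True matches nothing.
print((a == True).tolist())

# Comparing with the string 'TRUE' matches the intended rows.
print((a == 'TRUE').tolist())

# Alternatively, convert the strings to real booleans first.
a_bool = a.map({'TRUE': True, 'FALSE': False})
print(a_bool.tolist())
```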