Pandas: remove a group from the data when a value in the group meets a required condition (answers)

【Question title】: Pandas: remove group from the data when a value in the group meets a required condition
【Posted】: 2019-09-18 13:07:17
【Question description】:

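My data is divided into groups, and for each group I want to check whether any value in it is below 8. If that condition is met, the entire group should be removed from the dataset.

Note that the values I am referring to are in a different column from the grouping column.

Sample input: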

Groups Count
  1      7
  1      11
  1      9 
  2      12
  2      15
  2      21

Output:

Groups Count
  2      12
  2      15
  2      21

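【Question comments】:

Tags: python pandas dataframe grouping

【Solution 1】:

From the description in your question, a group should be removed as soon as at least one value in it is below 8. An equivalent statement is that a group should be removed whenever its minimum value is below 8.

With the filter function, the actual code can be reduced to a single line (see Filtration in the pandas groupby documentation). You can use the following code: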

# keep only groups whose smallest Count is at least 8 (no value below 8)
dfnew = df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
dfnew.reset_index(drop=True, inplace=True)  # reset index
dfnew = dfnew[['Groups', 'Count']]  # rearrange the column sequence
print(dfnew)

Output:
   Groups  Count
0       2     12
1       2     15
2       2     21
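
The choice of comparison operator only matters when a group's minimum is exactly 8. The following is a minimal sketch of that boundary case; the data is hypothetical (group 3 is not in the question) and the names `strict` and `inclusive` are illustrative only.

```python
import pandas as pd

# Hypothetical data: group 3 has a minimum of exactly 8.
df = pd.DataFrame({'Groups': [1, 1, 2, 2, 3, 3],
                   'Count': [7, 11, 12, 15, 8, 20]})

# min() > 8 also drops group 3, although it has no value below 8.
strict = df.groupby('Groups').filter(lambda g: g['Count'].min() > 8)

# min() >= 8 drops only groups that contain a value below 8.
inclusive = df.groupby('Groups').filter(lambda g: g['Count'].min() >= 8)

print(strict['Groups'].unique().tolist())     # [2]
print(inclusive['Groups'].unique().tolist())  # [2, 3]
```

Since the question asks to drop groups with a value below 8, the inclusive form matches the stated requirement.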

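【Discussion】:

This should be marked as the correct answer to the OP's question
Ah.. messed up my comment. This should be marked as the correct answer to the OP's question, because it is the most elegant way, using the built-in pandas groupby function. It is efficient, readable, and a one-liner. 1up
The comparison should be >=.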

【Solution 2】:

You can use isin, loc and unique to select the subset by inverting the mask. Finally, you can call reset_index:

print(df)

  Groups  Count
0       1      7
1       1     11
2       1      9
3       2     12
4       2     15
5       2     21

print(df.loc[df['Count'] < 8, 'Groups'].unique())
[1]

print(~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique()))

0    False
1    False
2    False
3     True
4     True
5     True
Name: Groups, dtype: bool

df1 = df[~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique())]
print(df1.reset_index(drop=True))

   Groups  Count
0       2     12
1       2     15
2       2     21
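
The isin steps above can be gathered into one reusable function. This is only a sketch: the helper name `drop_groups_with_low_values` and its parameters are hypothetical and not part of pandas or of the original answer.

```python
import pandas as pd


def drop_groups_with_low_values(df, group_col, value_col, threshold):
    """Hypothetical helper: drop every group that has a value below threshold."""
    # Labels of the groups that contain at least one offending value.
    bad_groups = df.loc[df[value_col] < threshold, group_col].unique()
    # Keep rows whose group label is not among them, then renumber the index.
    return df[~df[group_col].isin(bad_groups)].reset_index(drop=True)


df = pd.DataFrame({'Groups': [1, 1, 1, 2, 2, 2],
                   'Count': [7, 11, 9, 12, 15, 21]})
result = drop_groups_with_low_values(df, 'Groups', 'Count', 8)
print(result)
```

With the question's sample data this leaves only the three rows of group 2.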

【Discussion】:

【Solution 3】:

Create a Boolean Series from your condition, then use groupby + transform('any') to form a mask for the original DataFrame. This lets you simply slice the original DataFrame.

df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
#   Groups  Count
#3       2     12
#4       2     15
#5       2     21
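
To make the one-liner easier to follow, it can be split into its intermediate steps. This is a sketch on the question's sample data; the variable names `below`, `group_has_low` and `kept` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Groups': [1, 1, 1, 2, 2, 2],
                   'Count': [7, 11, 9, 12, 15, 21]})

# Step 1: row-level condition, True where Count is below 8.
below = df['Count'].lt(8)

# Step 2: broadcast "any True in this group" back to every row of the group.
group_has_low = below.groupby(df['Groups']).transform('any')

# Step 3: keep the rows whose group has no value below 8.
kept = df[~group_has_low]

print(group_has_low.tolist())  # [True, True, True, False, False, False]
print(kept)
```

Because the mask is aligned with the original index, the kept rows retain their original labels (3, 4, 5 here).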

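Although the groupby + filter syntax is more straightforward, it performs much worse with a large number of groups, so creating a Boolean mask with transform is preferred. In this example the improvement is more than 1000x. The .isin approach is very fast for a single column, but if you group on multiple columns you need to switch to a merge.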

import pandas as pd
import numpy as np

np.random.seed(123)
N = 50000
df = pd.DataFrame({'Groups': [*range(N//2)]*2,
                   'Count': np.random.randint(0, 1000, N)})

# Double check both are equivalent
(df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
  == df[~df.Count.lt(8).groupby(df.Groups).transform('any')]).all().all()
#True

%timeit df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
#8.15 s ± 80.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
#6.54 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique())]
#2.88 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
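
The answer notes that grouping on multiple columns requires a merge instead of .isin, but does not show it. One possible sketch is an anti-join using `DataFrame.merge` with `indicator=True`. The data and the column names `A` and `B` are hypothetical.

```python
import pandas as pd

# Hypothetical data grouped on two columns, 'A' and 'B'.
df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': [1, 1, 2, 1, 1],
                   'Count': [7, 11, 9, 12, 15]})
keys = ['A', 'B']

# Distinct key combinations that contain at least one Count below 8.
bad = df.loc[df['Count'] < 8, keys].drop_duplicates()

# A left merge with indicator=True adds a '_merge' column that is
# 'both' for rows whose keys appear in `bad` and 'left_only' otherwise.
merged = df.merge(bad, on=keys, how='left', indicator=True)

# Keep only rows whose key combination is not in `bad`.
result = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```

Note that merge returns a fresh default index, so if the original index matters it would have to be preserved separately, for example by calling reset_index before merging.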

【Discussion】: