(Pandas) drop duplicated groups created by GroupBy: answers

【Question title】: (Pandas) drop duplicated groups created by GroupBy
【Posted】: 2019-05-04 22:48:17
【Question description】:

I want to create groups by a custom ID and then eliminate the groups that are duplicated in certain columns.

For example, from

| id | A   | B  |
|----|-----|----|
| 1  | foo | 40 |
| 1  | bar | 50 |
| 2  | foo | 40 |
| 2  | bar | 50 |
| 2  | cod | 0  |
| 3  | foo | 40 |
| 3  | bar | 50 |

to

| id | A   | B  |
|----|-----|----|
| 1  | foo | 40 |
| 1  | bar | 50 |
| 2  | foo | 40 |
| 2  | bar | 50 |
| 2  | cod | 0  |

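Here I group by id and then drop group 3, because it is identical to group 1 when only columns A and B are considered. Group 2 shares some rows with them but is not an exact copy, so it stays.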
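To make the example reproducible, the input table can be built as follows. This is only a minimal sketch: the variable name `df` matches the one used in the answers, and the default integer row index is an assumption.

```python
import pandas as pd

# Example input from the question: groups 1 and 3 are identical on A and B
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
})

# Rows per group, showing that the groups have variable size
group_sizes = df.groupby("id").size()
print(group_sizes)
```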

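I have tried looping over the groups, but it is very slow even with only about 12,000 groups. One possible complication is that the groups have variable size.

This is the solution I have been working on, but it takes a long time and produces no apparent duplicate hits (which I know exist in this database):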

grps = datafinal.groupby('Form_id') 
unique_grps={}

first=True
for lab1, grp1 in grps:
    if first:
        unique_grps[lab1] = grp1
        first=False
        continue
    for lab2, grp2 in unique_grps.copy().items():
        if grp2[['A','B']].equals(grp1[['A','B']]):
            print("hit")
            continue
        unique_grps[lab1] = grp1
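One likely reason the loop never prints "hit" (an observation on the code above, not something the asker confirmed) is that `DataFrame.equals` also compares the row index. Each group keeps the row labels of the original frame, so two groups with the same values but different labels compare as unequal. Separately, the final assignment sits inside the inner loop, so a group is stored as soon as it differs from any one stored group. A minimal sketch of the index effect, using the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
})

g1 = df.loc[df["id"] == 1, ["A", "B"]]
g3 = df.loc[df["id"] == 3, ["A", "B"]]

# Same values, but the row labels differ (0, 1 versus 5, 6)
same_with_index = g1.equals(g3)

# Dropping the original labels compares the values only
same_values = g1.reset_index(drop=True).equals(g3.reset_index(drop=True))

print(same_with_index, same_values)
```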

【Question discussion】:

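Can't you just drop duplicates w.r.t. columns A and B first?
@timgeb I don't think that would work. Imagine group 2 contains one row of group 3 and group 1 contains the other. drop_duplicates would remove group 3 even though no single group fully duplicates it.
@timgeb It would also remove the first two rows of group 2, which I need to keep.
Ah, OK, thanks for the clarification.
Can you use the built-in drop_duplicates on the resulting DF with subset 'A', 'B'?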

Tags: python pandas pandas-groupby data-manipulation

【Solution 1】:

Use agg with tuple, then duplicated:

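# Collapse each group into one tuple of all its values, then flag repeated tuples
s = df.groupby('id').agg(tuple).sum(axis=1).duplicated()
# Keep only the rows whose id belongs to a group that is not a duplicate
df.loc[df.id.isin(s[~s].index)]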
Out[779]: 
   id    A   B
0   1  foo  40
1   1  bar  50
2   2  foo  40
3   2  bar  50
4   2  cod   0

More info: after this step, everything within each group is in a single tuple:

df.groupby('id').agg(tuple).sum(1)
Out[780]: 
id
1            (foo, bar, 40, 50)
2    (foo, bar, cod, 40, 50, 0)
3            (foo, bar, 40, 50)
dtype: object

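Update (sorting each group's values first, so that row order within a group does not matter; this needs the natsort package):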

from natsort import natsorted
s=df.groupby('id').agg(tuple).sum(1).map(natsorted).map(tuple).duplicated()

【Discussion】:

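Nice, this gets me 90% of the way there, but I need to specify the subset of columns to check for duplicates (for simplicity I did not add more columns to the example; here there are 5 columns, and the other 3 I don't care about).
Oh, but I can subset the dataframe in the first step, can't I? That would do it.
I think maybe .sort_values first, in case the row order within groups is inconsistent?
@user3207377 You can filter in the first step, and then you will get what you need.
@ALollz Indeed, I am now thinking about sorting the values. I'm not sure at which point, though.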
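The two points raised in these comments (restricting the comparison to a subset of columns, and sorting so that row order within a group does not matter) can be combined roughly as follows. This is a sketch, not the answerer's code: the helper name `drop_duplicate_groups`, its parameters, and the extra column `C` are hypothetical, and it tracks seen groups in a plain Python set rather than using agg(tuple).

```python
import pandas as pd


def drop_duplicate_groups(frame, by, cols):
    """Keep the first group for each distinct set of rows in `cols`."""
    # Sort by the compared columns so row order inside a group is irrelevant
    ordered = frame.sort_values(cols)
    seen = set()
    keep = []
    # groupby iterates in sorted label order, so the smallest label wins
    for label, grp in ordered.groupby(by):
        # Hashable key: a tuple of row tuples over the chosen columns only
        key = tuple(map(tuple, grp[cols].to_numpy().tolist()))
        if key not in seen:
            seen.add(key)
            keep.append(label)
    # Filter the original frame so its row order is preserved
    return frame[frame[by].isin(keep)]


df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
    "C": range(7),  # hypothetical extra column that is ignored
})

result = drop_duplicate_groups(df, by="id", cols=["A", "B"])
print(result)
```

Sorting before building the key keeps each row's A and B values paired, so two groups match only if they contain the same rows, not merely the same values.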

【Solution 2】:

You can use the unique_everseen recipe from the itertools docs (also available in the more_itertools library), together with pd.concat and groupby:

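from operator import itemgetter

import pandas as pd
from more_itertools import unique_everseen

def unique_key(x):
    # Hashable summary of a group's A and B columns, in row order
    return tuple(map(tuple, x[['A', 'B']].values.tolist()))

def jpp(df):
    # groupby yields (label, group) pairs; keep only the group frames
    groups = map(itemgetter(1), df.groupby('id'))
    # Keep the first group seen for each distinct key, then stitch them together
    return pd.concat(unique_everseen(groups, key=unique_key))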

print(jpp(df))

   id    A   B
0   1  foo  40
1   1  bar  50
2   2  foo  40
3   2  bar  50
4   2  cod   0
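If installing more_itertools is not an option, the idea behind the recipe can be written in a few lines. The generator below is a simplified stand-in, not the library's implementation, and the name `unique_everseen_sketch` is hypothetical. It assumes every key is hashable, which holds for the tuple-of-tuples key used here.

```python
import pandas as pd


def unique_everseen_sketch(iterable, key):
    """Yield items whose key has not been seen before (simplified recipe)."""
    seen = set()
    for item in iterable:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


def unique_key(grp):
    # Hashable summary of a group's A and B columns, in row order
    return tuple(map(tuple, grp[["A", "B"]].to_numpy().tolist()))


df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
})

groups = (grp for _, grp in df.groupby("id"))
result = pd.concat(list(unique_everseen_sketch(groups, key=unique_key)))
print(result)
```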

【Discussion】: