Remove rows that are in another dataframe when there are duplicates: answers

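【Question title】: Remove rows that are in another dataframe when there are duplicates
【Posted】: 2019-11-03 09:54:56
【Question description】:

I want to remove rows from one dataframe when the other dataframe contains the same rows. However, I don't want to remove every matching row, only as many as appear in the other dataframe. Consider this example: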

df1

   col1  col2
0     1    10
1     1    10
2     2    11
3     3    12
4     1    10

df2

   col1  col2
0     1    10
1     2    11
2     1    10
3     3    12
4     3    12

Expected output:

df1

   col1  col2
      1    10

df1 has 3 rows of 1,10 and df2 has 2 rows of 1,10, so 2 are removed from each, leaving 1 in df1. If df1 had 4 such rows, I would want two rows of 1,10 left in df1. The same applies to df2 below:

df2

   col1  col2
      3    12
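The expected outputs above amount to a multiset difference: each identical row cancels one-for-one against a row in the other frame. As a minimal sketch of that arithmetic only (plain tuples stand in for the dataframe rows, and the variable names are illustrative), `collections.Counter` subtraction keeps only the positive counts:

```python
from collections import Counter

# The rows of df1 and df2 from the question, written as tuples.
rows1 = [(1, 10), (1, 10), (2, 11), (3, 12), (1, 10)]
rows2 = [(1, 10), (2, 11), (1, 10), (3, 12), (3, 12)]

# Counter subtraction discards keys whose count drops to zero or below.
left_over_1 = Counter(rows1) - Counter(rows2)
left_over_2 = Counter(rows2) - Counter(rows1)

print(list(left_over_1.elements()))
print(list(left_over_2.elements()))
```

This only illustrates the counting rule; it does not preserve the dataframe index or dtypes.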

My attempt:

I was thinking of counting how many duplicates each dataframe has and building a new df1 and df2 by subtracting dupe_count, but I'd like to know whether there is a more efficient way.

df1g = df1.groupby(df1.columns.tolist()).size().reset_index(name='dupe_count')
df2g = df2.groupby(df2.columns.tolist()).size().reset_index(name='dupe_count')

【Question discussion】:

Tags: python pandas dataframe

【Solution 1】:

Here is another approach, using repeat:

# count of the rows
c1 = df1.groupby(['col1', 'col2']).size()
c2 = df2.groupby(['col1', 'col2']).size()

# repeat the rows by values
(c1.repeat((c1-c2).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#    col1  col2
# 0     1    10

(c2.repeat((c2-c1).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#    col1  col2
# 0     3    12
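The snippet above relies on c1 and c2 sharing the same keys; a key present in only one frame would produce NaN in the subtraction. Below is a self-contained sketch of the same repeat idea that aligns the counts first. The helper name `surplus_rows` is made up for illustration and is not part of pandas or of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 2, 3, 1], "col2": [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({"col1": [1, 2, 1, 3, 3], "col2": [10, 11, 10, 12, 12]})


def surplus_rows(left, right):
    """Rows of `left` left over after cancelling one-for-one against `right`.

    Hypothetical helper: assumes both frames share the same columns and
    that no key column contains NaN.
    """
    cols = left.columns.tolist()
    left_counts = left.groupby(cols).size()
    # Align right's counts to left's keys; keys absent from right count as 0.
    right_counts = right.groupby(cols).size().reindex(left_counts.index, fill_value=0)
    surplus = (left_counts - right_counts).clip(lower=0)
    # Repeat each key by its surplus, then turn the index back into columns.
    return left_counts.repeat(surplus.to_numpy()).index.to_frame(index=False)


df1_rest = surplus_rows(df1, df2)
df2_rest = surplus_rows(df2, df1)
print(df1_rest)
print(df2_rest)
```

Because the result is rebuilt from the group keys, the original row index and row order of the input are not kept.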

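【Discussion】:

Compared with the other answer, the only drawback is that it fails when some columns contain NaN: groupby().size() raises 'ValueError: Length of passed values is x, index implies 0'. Perhaps this will be fixed in 0.25+.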
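On the NaN point: pandas 1.1 added a `dropna` parameter to `groupby`. Assuming that version or later, the following sketch shows that the default drops rows whose key contains NaN from the counts, while `dropna=False` keeps them. The sample frame is made up for illustration, and this only covers the counting step, not the full repeat pipeline.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a NaN in one of the key columns.
df = pd.DataFrame({"col1": [1.0, 1.0, np.nan], "col2": [10, 10, 12]})

# Default behaviour: groups whose key contains NaN are excluded.
default_counts = df.groupby(["col1", "col2"]).size()

# dropna=False (pandas >= 1.1) keeps NaN keys as their own group.
nan_counts = df.groupby(["col1", "col2"], dropna=False).size()

print(default_counts.sum(), "of", len(df), "rows counted by default")
print(nan_counts.sum(), "of", len(df), "rows counted with dropna=False")
```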

【Solution 2】:

This is a non-trivial problem, but merge is your friend:

a, b = (df.assign(count=df.groupby([*df]).cumcount()) for df in (df1, df2))    
df1[a.merge(b, on=[*a], indicator=True, how='left').eval('_merge == "left_only"')]

   col1  col2
4     1    10

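The idea is to add a cumcount column that de-duplicates the rows by numbering each repeat of an identical row, so that every row gets a unique identifier. We can then see which rows fail to match in the subsequent merge.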

a
   col1  col2  count
0     1    10      0
1     1    10      1
2     2    11      0
3     3    12      0
4     1    10      2

b
   col1  col2  count
0     1    10      0
1     2    11      0
2     1    10      1
3     3    12      0
4     3    12      1

a.merge(b, on=[*a], indicator=True, how='left')
   col1  col2  count     _merge
0     1    10      0       both
1     1    10      1       both
2     2    11      0       both
3     3    12      0       both
4     1    10      2  left_only

_.eval('_merge == "left_only"')
0    False
1    False
2    False
3    False
4     True
dtype: bool

If you need the unmatched rows from both df1 and df2, use an outer merge:

out = a.merge(b, on=[*a], indicator=True, how='outer')
df1_filter = (
    out.query('_merge == "left_only"').drop(['count','_merge'], axis=1))
df2_filter = (
    out.query('_merge == "right_only"').drop(['count','_merge'], axis=1))

df1_filter
   col1  col2
4     1    10

df2_filter
   col1  col2
5     3    12
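For reuse, the cumcount-plus-merge idea can be wrapped in a small function. This is a sketch under stated assumptions rather than the answer's own code: the name `unmatched_rows` and the helper column `_occurrence` are hypothetical, and both frames are assumed to share the same columns with no NaN in them.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 2, 3, 1], "col2": [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({"col1": [1, 2, 1, 3, 3], "col2": [10, 11, 10, 12, 12]})


def unmatched_rows(left, right, tag="_occurrence"):
    """Rows of `left` with no one-for-one partner in `right` (hypothetical helper)."""
    cols = left.columns.tolist()
    # Number each repeat of an identical row 0, 1, 2, ... within its group.
    a = left.assign(**{tag: left.groupby(cols).cumcount()})
    b = right.assign(**{tag: right.groupby(cols).cumcount()})
    merged = a.merge(b, on=cols + [tag], how="left", indicator=True)
    # Keys in b are unique, so the left merge yields one row per left row,
    # in the same order; the mask is therefore applied by position.
    mask = (merged["_merge"] == "left_only").to_numpy()
    return left[mask]


df1_rest = unmatched_rows(df1, df2)
df2_rest = unmatched_rows(df2, df1)
print(df1_rest)
print(df2_rest)
```

Unlike the outer-merge variant, each result here keeps the index labels of its own input frame, and the helper column name avoids clashing with a real column called count.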

【Discussion】: