Remove rows with the least frequent value in a column, grouping by other columns, in a Pandas DataFrame

【Question title】: Remove rows with the least frequent value in a column grouping by other columns in a Pandas Dataframe
【Posted】: 2020-07-23 18:27:44
【Question】:

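I have a pandas DataFrame with inconsistent rows. In the example below, key1 and key2 are two values that must be unique together, so (key1, key2) is the primary key and should appear only once in the DataFrame, while info is a binary attribute of (key1, key2) and can be T or F. Unfortunately, some (key1, key2) pairs are repeated in the DataFrame, sometimes with info=T and sometimes with info=F, which is obviously an error.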

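To remove the duplicates I would like to apply the following reasoning: for each pair (key1, key2), count how many times info is T and how many times info is F, and then:

If the frequencies differ (which is most of the time), keep only one of the rows with the most frequent value between T and F, with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = "first"), where first would be a row whose info holds the most frequent value.
If instead 50% of the rows have info=T and 50% have info=F, remove all of them, because I don't know which one is correct, with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = False).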

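I don't know how to do this kind of filtering, because I want to keep 1 row in one case and 0 rows in the other, depending on the values of a specific column within each group of similar rows.

Desired behavior

Input: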

     key1  key2    info
0    a1    a2      T 
1    a1    a2      T #duplicated row of index 0
2    a1    a2      F #similar row of indexes 0 and 1 but inconsistent with info field
3    b1    b2      T 
4    b1    b2      T #duplicated row of index 3
5    b1    b3      T #not duplicated since key2 is different from indexes 3 and 4
6    c1    c2      T 
7    c1    c2      F #similar row of index 6 but inconsistent with info field
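
For reproducibility, the input table can be rebuilt roughly as follows. This is a minimal sketch: the variable name counts is an illustrative choice that does not appear in the original post. It also prints how often each info value occurs per (key1, key2) pair, which is the quantity the two rules above depend on.

```python
import pandas as pd

# Rebuild the example input from the table above.
df = pd.DataFrame(
    {
        "key1": ["a1", "a1", "a1", "b1", "b1", "b1", "c1", "c1"],
        "key2": ["a2", "a2", "a2", "b2", "b2", "b3", "c2", "c2"],
        "info": ["T", "T", "F", "T", "T", "T", "T", "F"],
    }
)

# Count how often each info value occurs for every (key1, key2) pair.
# `counts` is a hypothetical helper name used only for this illustration.
counts = df.groupby(["key1", "key2"])["info"].value_counts()
print(counts)
```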

Output:

     key1  key2     info
0    a1    a2       T # for(a1,a2) T:2 and F:1
3    b1    b2       T # for(b1,b2) T:2 and F:0
5    b1    b3       T # for(b1,b3) T:1 and F:0
                    # no rows for (c1,c2) because T:1 and F:1

Thanks

【Comments】:

Tags: python pandas dataframe duplicates

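【Solution 1】:

Use groupby with pd.Series.mode to get the modal value. pd.Series.mode returns all of the modes in the case of a tie, so this allows us to remove those cases with drop_duplicates, since we expect exactly one mode for each unique ['key1', 'key2'].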

import pandas as pd

(df.groupby(['key1', 'key2'])['info']
   .apply(pd.Series.mode)
   .reset_index()
   .drop_duplicates(['key1', 'key2'], keep=False)
   .drop(columns='level_2')
)

#  key1 key2 info
#0   a1   a2    T
#1   b1   b2    T
#2   b1   b3    T

The result of the groupby + mode is:

key1  key2   
a1    a2    0    T
b1    b2    0    T
      b3    0    T
c1    c2    0    F   # Tied mode so it gets 2 rows with the last
            1    T   # index level indicating the # of items tied for mode.
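
A possible variant of the same idea, sketched here under the assumption that a small helper is acceptable, makes the tie rule explicit instead of relying on the extra index level. The helper name unique_mode is hypothetical and not part of pandas; it returns the mode only when there is exactly one, and None otherwise, so tied groups can be removed with dropna.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["a1", "a1", "a1", "b1", "b1", "b1", "c1", "c1"],
        "key2": ["a2", "a2", "a2", "b2", "b2", "b3", "c2", "c2"],
        "info": ["T", "T", "F", "T", "T", "T", "T", "F"],
    }
)


def unique_mode(s):
    """Hypothetical helper: the single mode of s, or None when values tie."""
    modes = s.mode()
    return modes.iloc[0] if len(modes) == 1 else None


result = (
    df.groupby(["key1", "key2"])["info"]
    .agg(unique_mode)   # one value per (key1, key2) pair, None on a tie
    .dropna()           # tied pairs such as (c1, c2) are dropped here
    .reset_index()
)
print(result)
```

Unlike the drop_duplicates version, this returns one row per key pair directly, so there is no level_2 column to remove afterwards.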

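【Comments】:

【Solution 2】:

Another solution is to create two temporary columns that hold the count and max for the groups. Then filter out the rows whose group count does not equal the max (i.e., if you only have T and F values, this keeps only the value that occurs in at least 50% of the rows) and call drop_duplicates(). The final bit of logic is filtering out the [key1, key2] values where 50% are T and 50% are F. To do this, use drop_duplicates again, but on a different subset that includes count: if the counts are the same, you don't know which one to choose, as you mention in your question. Finally, drop the temporary count column.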

df['count'] = df.groupby(['key1', 'key2', 'info'])['info'].transform('count')
df['max'] = df.groupby(['key1', 'key2'])['count'].transform('max')
df = (df.loc[(df['count'] == df['max']), ['key1', 'key2', 'info','count']]
        .drop_duplicates(subset=['key1', 'key2','info'])
        .drop_duplicates(subset=['key1', 'key2', 'count'], keep=False)
        .drop('count', axis=1))

Output:

    key1    key2    info
0   a1      a2      T
3   b1      b2      T
5   b1      b3      T
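
To see why the second drop_duplicates removes the tied pair, it may help to inspect the two temporary columns on the example data. The sketch below is only illustrative: it builds them on a copy named tmp (a hypothetical name) so that the original df is left untouched.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["a1", "a1", "a1", "b1", "b1", "b1", "c1", "c1"],
        "key2": ["a2", "a2", "a2", "b2", "b2", "b3", "c2", "c2"],
        "info": ["T", "T", "F", "T", "T", "T", "T", "F"],
    }
)

# `tmp` is a hypothetical name; assign() returns a new frame with the extra column.
tmp = df.assign(
    count=df.groupby(["key1", "key2", "info"])["info"].transform("count")
)
# Largest per-value count within each (key1, key2) pair.
tmp["max"] = tmp.groupby(["key1", "key2"])["count"].transform("max")
print(tmp)
```

Rows 6 and 7 both have count equal to max, so they survive the first filter; because they also share the same (key1, key2, count), keep=False then drops both.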

【Comments】:

【Solution 3】:

Use groupby, value_counts, idxmax and idxmin:

df_ = df.groupby(["key1","key2"]).info.value_counts().unstack(level=2, fill_value=0)
df_max = df_.idxmax(axis=1)
df = df_max.loc[df_max!=df_.idxmin(axis=1)].reset_index(name='info')

print(df)
  key1 key2 info
0   a1   a2    T
1   b1   b2    T
2   b1   b3    T
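
The comparison of idxmax with idxmin works because, on a tie, both return the first column. If info could take more than two values (an assumption that goes beyond the question), a tie check on the maximum itself might be more direct. The following is a sketch under that assumption; the names counts, top, is_tied and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["a1", "a1", "a1", "b1", "b1", "b1", "c1", "c1"],
        "key2": ["a2", "a2", "a2", "b2", "b2", "b3", "c2", "c2"],
        "info": ["T", "T", "F", "T", "T", "T", "T", "F"],
    }
)

# One row per (key1, key2) pair, one column per info value, cells are counts.
counts = (
    df.groupby(["key1", "key2"])["info"]
    .value_counts()
    .unstack(level=2, fill_value=0)
)

# Most frequent info value per pair (first column wins on a tie).
top = counts.idxmax(axis=1)

# A pair is tied when more than one column reaches the row maximum.
is_tied = counts.eq(counts.max(axis=1), axis=0).sum(axis=1) > 1

result = top[~is_tied].reset_index(name="info")
print(result)
```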

【Comments】:

【Solution 4】:

This is just my take on it.

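import pandas as pd

df = pd.DataFrame(data=[["a1", "a2", "T"],
                        ["a1", "a2", "T"],
                        ["a1", "a2", "F"],
                        ["b1", "b2", "T"],
                        ["b1", "b2", "T"],
                        ["b1", "b3", "T"],
                        ["c1", "c2", "T"],
                        ["c1", "c2", "F"]], columns=["key1", "key2", "info"])
# Count the rows for each (key1, key2, info) combination.
df = df.groupby(["key1", "key2", "info"]).size().reset_index(name="count")
# Drop key pairs whose info values are tied (same count for T and F).
df = df.drop_duplicates(subset=["key1", "key2", "count"], keep=False)
# Keep the row with the highest count per key pair, so info follows the count
# rather than the alphabetical maximum of the info column.
df = df.loc[df.groupby(["key1", "key2"])["count"].idxmax()]
df = df.drop(columns="count").reset_index(drop=True)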

【Comments】: