Multiple Condition Duplicate Filter (maybe drop) Pandas DataFrame Python: Answers

【Question title】: Multiple Condition Duplicate Filter (maybe drop) Pandas DataFrame Python
【Posted】: 2020-12-03 02:09:35
【Question description】:

To start, I think these two questions are on the right track, but neither quite does what I want:

Pandas : remove SOME duplicate values based on conditions

How to conditionally remove duplicates from a pandas dataframe

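I have a very large DataFrame made up of tickets. Each ticket has several types of text fields. In some tickets, two different types of text field contain the same text. When that happens, I want to use only the DESCRIPTION type. A sample DataFrame: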

TICKETID    TYPE           TEXT
123         PROBLEMCODE    I want to use description for this item because it is a duplicate
123         DESCRIPTION    I want to use description for this item because it is a duplicate
123         CODE1          Other field
124         PROBLEMCODE    I need both here
124         DESCRIPTION    Because there are not duplicated
124         CODE1          Other field
125         PROBLEMCODE    I need both here
125         DESCRIPTION    I do not want to delete the above problem code because TICKETID is different
125         CODE1          This field is not super important but matches data and never know where problems arise

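Basically, I want to treat each TICKETID as its own entity. Compare the PROBLEMCODE and DESCRIPTION texts; if they are equal, filter out the PROBLEMCODE row and keep the DESCRIPTION row.

In my head, the pseudocode is: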

For a given TICKETID:
    if TYPE is PROBLEMCODE or DESCRIPTION
        if PROBLEMCODE TEXT == DESCRIPTION TEXT
            DROP the PROBLEMCODE row
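
The pseudocode above can be written without an explicit loop. The following is only a minimal sketch under stated assumptions: the texts are shortened, hypothetical stand-ins for the sample data, and each TICKETID is assumed to have at most one DESCRIPTION row (`Series.map` with a Series needs a unique index). It looks up each ticket's DESCRIPTION text and drops a PROBLEMCODE row only when its text matches.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the sample tickets
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 2,
    "TEXT": ["dup text", "dup text", "Other field",
             "I need both here", "not duplicated", "Other field"],
})

# DESCRIPTION text per ticket (assumes one DESCRIPTION row per TICKETID)
desc_text = (
    data.loc[data["TYPE"] == "DESCRIPTION"]
    .set_index("TICKETID")["TEXT"]
)

# A row is dropped only if it is a PROBLEMCODE row AND its text
# equals the DESCRIPTION text of the same ticket
is_problemcode = data["TYPE"] == "PROBLEMCODE"
same_as_description = data["TEXT"] == data["TICKETID"].map(desc_text)

result = data[~(is_problemcode & same_as_description)]
print(result)
```

Because the comparison is made against the DESCRIPTION text of the same ticket, identical texts on different tickets do not affect each other.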

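Obviously, looping over the DataFrame this way is inefficient. The previously posted questions suggest that pandas has plenty of tools for this. I just cannot figure out which combination of methods and assignments achieves it. I have tried: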

# to create a dup row
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then groupby ticket?
data.groupby(['TICKETID'])

# somehow compare true and false, but I can only do that in order of index (down the frame). 

# I am 99% percent sure looking at the other questions there should be a one or two liner 
# something like this that can accomplish:

dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] =='PROBLEMCODE' | 'DESCRIPTION')].duplicated(subset=['TEXT'])

# Then filter based on true false
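
The one-liner attempted above fails for two reasons: a `groupby` object cannot be combined with `&` as a boolean mask, and `|` binds more tightly than `==` in Python, so `'PROBLEMCODE' | 'DESCRIPTION'` is evaluated first and raises a `TypeError`. A hedged sketch of how the same idea might be expressed follows. It assumes each type appears at most once per ticket, and the texts are shortened, hypothetical stand-ins for the sample data.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the sample tickets
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124, 125, 125, 125],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 3,
    "TEXT": ["dup text", "dup text", "Other field",
             "I need both here", "not duplicated", "Other field",
             "I need both here", "different ticket", "some other text"],
})

# Use isin() instead of chaining == with |
pair = data[data["TYPE"].isin(["PROBLEMCODE", "DESCRIPTION"])]

# Including TICKETID in the subset plays the role of the groupby:
# texts are only compared within the same ticket
dup = pair.duplicated(subset=["TICKETID", "TEXT"], keep=False)

# Drop only the PROBLEMCODE half of each duplicated pair
drop_idx = pair.index[dup & (pair["TYPE"] == "PROBLEMCODE")]
dataTest = data.drop(drop_idx)
print(dataTest)
```

Tickets 124 and 125 share the same PROBLEMCODE text here, but both rows survive because the duplicate check is scoped to `TICKETID`.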

My expected output for the sample case removes only the TICKETID=123 PROBLEMCODE row, as shown below:

TICKETID    TYPE           TEXT
123         DESCRIPTION    I want to use description for this item because it is a duplicate
123         CODE1          Other field
124         PROBLEMCODE    I need both here
124         DESCRIPTION    Because there are not duplicated
124         CODE1          Other field
125         PROBLEMCODE    I need both here
125         DESCRIPTION    I do not want to delete the above problem code because TICKETID is different
125         CODE1          This field is not super important but matches data and never know where problems arise

Let me know if you need more information.

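【Question discussion】:

Hi, show us a sample of your expected output.
See the edit. It only drops the PROBLEMCODE row for ticket 123.
See my answer. But you only want to check for duplicates between PROBLEMCODE and DESCRIPTION, and keep DESCRIPTION?
Yes, if a given TICKETID has two duplicate text fields, keep the DESCRIPTION.
One moment, I'll test the answer against the main data.

Tags: python pandas dataframe duplicates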

【Solution 1】:

    import pandas as pd

    df = pd.DataFrame(
        {
            'ticket': [123, 123, 123, 124, 124, 124],
            'type': ['PROBLEMCODE', 'DESCRIPTION', 'code1', 'PROBLEMCODE', 'DESCRIPTION', 'code1'],
            'text': [' I want to use description fo', ' I want to use description fo', 'other',
                     'another str', 'second one', 'other'],
        }
    )
    print(df)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    2     123        code1                          other
    3     124  PROBLEMCODE                    another str
    4     124  DESCRIPTION                     second one
    5     124        code1                          other
    
    # `duplicates` holds every DESCRIPTION or PROBLEMCODE row whose
    # (ticket, text) pair occurs more than once in df
    duplicates = df[
        df['type'].isin(['DESCRIPTION', 'PROBLEMCODE'])
        & df.duplicated(subset=['ticket', 'text'], keep=False)
    ]
    
    print(duplicates)
       ticket         type                           text
    0     123  PROBLEMCODE   I want to use description fo
    1     123  DESCRIPTION   I want to use description fo
    
    # remove the duplicated rows from the main df (dropping by index label)
    df = df.drop(duplicates.index.tolist())
    print(df)

    # now concat the DESCRIPTION duplicates back onto df, which no longer
    # contains the duplicated DESCRIPTION and PROBLEMCODE rows
    result = pd.concat([
        duplicates[duplicates['type'] == 'DESCRIPTION'], df
    ]).sort_values(by='ticket').reset_index(drop=True)
    print(result)
       ticket         type                           text
    0     123  DESCRIPTION   I want to use description fo
    1     123        code1                          other
    2     124  PROBLEMCODE                    another str
    3     124  DESCRIPTION                     second one
    4     124        code1                          other

With this solution, whenever DESCRIPTION and PROBLEMCODE share the same ticket and text, the output keeps only the DESCRIPTION row.
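
One caveat with the approach above: `df.duplicated` is evaluated over all row types, so a PROBLEMCODE row whose text matches only a `code1` row on the same ticket would be dropped and never added back. The sketch below is one possible variant, not the answerer's method. It restricts the check to the two types, orders DESCRIPTION first, and lets `drop_duplicates(keep='first')` pick the survivor. The sample texts and ticket 125 are hypothetical, added only to exercise that edge case.

```python
import pandas as pd

# Hypothetical data; ticket 125 has a PROBLEMCODE that matches code1 only
df = pd.DataFrame({
    "ticket": [123, 123, 123, 124, 124, 124, 125, 125, 125],
    "type": ["PROBLEMCODE", "DESCRIPTION", "code1"] * 3,
    "text": ["dup", "dup", "other",
             "another str", "second one", "other",
             "same", "different", "same"],
})

# Only PROBLEMCODE and DESCRIPTION rows take part in the duplicate check
pair_mask = df["type"].isin(["DESCRIPTION", "PROBLEMCODE"])
pair = df[pair_mask]

# Put DESCRIPTION rows first so keep='first' prefers them
priority = pair["type"].map({"DESCRIPTION": 0, "PROBLEMCODE": 1})
pair = pair.loc[priority.sort_values(kind="stable").index]
kept = pair.drop_duplicates(subset=["ticket", "text"], keep="first")

# Reattach the untouched rows and restore the original row order
result = pd.concat([kept, df[~pair_mask]]).sort_index()
print(result)
```

Sorting by the original index at the end keeps the rows in their input order, whereas sorting by `ticket` alone does not guarantee the order of rows within a ticket.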

【Discussion】: