【Posted】: 2020-12-03 02:09:35
【Question】:
For starters, I think these two questions are on the right track, but neither quite gets me what I want.
Pandas : remove SOME duplicate values based on conditions
How to conditionally remove duplicates from a pandas dataframe
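I have a very large DataFrame made up of tickets. Each ticket has several types of text fields. In some tickets, two different types of text field contain the same text. When that happens, I only want to use the DESCRIPTION type. A sample DataFrame: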
TICKETID TYPE TEXT
123 PROBLEMCODE I want to use description for this item because it is a duplicate
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1 Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1 Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1 This field is not super important but matches data and never know where problems arise
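Basically, I want to treat each TICKETID as its own entity: compare the PROBLEMCODE and DESCRIPTION text, and if they are equal, filter out the PROBLEMCODE row and keep the DESCRIPTION row.
In my head, the pseudocode is: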
For a given TICKETID:
    if TYPE is PROBLEMCODE or DESCRIPTION:
        if PROBLEMCODE TEXT == DESCRIPTION TEXT:
            drop the PROBLEMCODE row
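Taken literally, that pseudocode is a per-ticket loop. A minimal sketch of it follows, assuming the column names from the sample table; the TEXT values are abbreviated and the helper name `drop_duplicate_problemcode` is made up for illustration.

```python
import pandas as pd

# Shortened stand-in for the sample table (TEXT values abbreviated).
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 2,
    "TEXT": ["same text", "same text", "Other field",
             "I need both here", "Because there are not duplicated", "Other field"],
})

def drop_duplicate_problemcode(group):
    """Hypothetical helper: drop PROBLEMCODE rows whose TEXT also
    appears in a DESCRIPTION row of the same ticket."""
    description_texts = group.loc[group["TYPE"] == "DESCRIPTION", "TEXT"].tolist()
    is_dup_problemcode = (group["TYPE"] == "PROBLEMCODE") & group["TEXT"].isin(description_texts)
    return group[~is_dup_problemcode]

# Explicit loop over tickets: easy to read, but slow on a very large frame.
pieces = [drop_duplicate_problemcode(g) for _, g in data.groupby("TICKETID", sort=False)]
looped = pd.concat(pieces)
print(looped)
```

Only the PROBLEMCODE row of ticket 123 is removed here; the original index labels are kept, so the dropped row is easy to spot.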
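Obviously, looping over the DataFrame this way is not efficient. The earlier questions point out that pandas has plenty of tools for this; I just can't figure out which combination of methods and assignments would accomplish it. I tried: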
# Flag rows whose TEXT duplicates an earlier row
data['Dup'] = data.duplicated(subset=['TEXT'])
# Then group by ticket?
data.groupby(['TICKETID'])
# Somehow compare True and False, but I can only do that in index order (down the frame).
# Looking at the other questions, I am 99% sure there should be a one- or two-liner,
# something like this (it does not work as written), that can accomplish it:
dataTest = data.loc[data.groupby(['TICKETID']) & (data['TYPE'] =='PROBLEMCODE' | 'DESCRIPTION')].duplicated(subset=['TEXT'])
# Then filter based on True/False
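One vectorized direction that seems to fit this idea is to let `duplicated` look at TICKETID and TEXT together, restricted to the two types being compared. The sketch below rebuilds the sample table and assumes each ticket has at most one row per TYPE; the names `pair`, `dup`, `to_drop`, and `result` are illustrative only.

```python
import pandas as pd

dup_text = "I want to use description for this item because it is a duplicate"
data = pd.DataFrame({
    "TICKETID": [123, 123, 123, 124, 124, 124, 125, 125, 125],
    "TYPE": ["PROBLEMCODE", "DESCRIPTION", "CODE1"] * 3,
    "TEXT": [
        dup_text, dup_text, "Other field",
        "I need both here", "Because there are not duplicated", "Other field",
        "I need both here",
        "I do not want to delete the above problem code because TICKETID is different",
        "This field is not super important but matches data and never know where problems arise",
    ],
})

# Only PROBLEMCODE and DESCRIPTION rows take part in the comparison.
pair = data[data["TYPE"].isin(["PROBLEMCODE", "DESCRIPTION"])]

# keep=False flags every row whose (TICKETID, TEXT) pair occurs more than once,
# so identical text on a different ticket is not treated as a duplicate.
dup = pair.duplicated(subset=["TICKETID", "TEXT"], keep=False)

# Drop only the PROBLEMCODE side of each duplicated pair.
to_drop = pair[dup & (pair["TYPE"] == "PROBLEMCODE")].index
result = data.drop(index=to_drop)
print(result)
```

Including TICKETID in `subset` is what keeps the PROBLEMCODE rows of tickets 124 and 125, even though they share the same text.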
My expected output for the sample case removes only the PROBLEMCODE row of TICKETID 123, like this:
TICKETID TYPE TEXT
123 DESCRIPTION I want to use description for this item because it is a duplicate
123 CODE1 Other field
124 PROBLEMCODE I need both here
124 DESCRIPTION Because there are not duplicated
124 CODE1 Other field
125 PROBLEMCODE I need both here
125 DESCRIPTION I do not want to delete the above problem code because TICKETID is different
125 CODE1 This field is not super important but matches data and never know where problems arise
Let me know if you need more information.
【Comments】:
-
Hi, could you show us a sample of your expected output?
-
See the edit. It would only remove the PROBLEMCODE row of ticket 123.
-
See my answer, but do you only want to check for duplicates between PROBLEMCODE and DESCRIPTION, and keep DESCRIPTION?
-
Yes, if a given TICKETID has two duplicated text fields, keep the DESCRIPTION.
-
Hang on, I'll test the answer against my main data.
Tags: python pandas dataframe duplicates