pandas: drop duplicates from a (grouped) DataFrame with a specific condition: answers

【Question title】: pandas drop duplicates from a dataframe (grouped) with some specific condition
【Posted】: 2017-04-05 20:17:14
【Question description】:

Hi all, I have a DataFrame whose contents look like this:

name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D   ###i added this row for better clarity

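My expected output should contain only unique rows per group, in this sense: for each mailid (the name column in this sample), the mv_str values should not be of a similar type, i.e. the first two words of one mv_str should not appear again as the first two words of any other row for that same user name.

Note: the comparison should be done at the user name level.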

name,mv_str
abc,Exorsist part1
abc,doc str 2D   ###3D and QA removed, as the first 2 words "doc str" matched
abc,doc flash   ###only the 1st word matches, the 2nd word does not
def,plastic
def,plastic income  #It should be present as only one word is matching
def,doc str 2D   ###this row should be there as this is for another User

Could anyone help me work out the logic? A code sample would also be a great help. Thanks.

【Question discussion】:

Tags: python string pandas group-by duplicates

【Solution 1】:

I think you first need to split the strings in the mv_str column on whitespace and create a new DataFrame df1:

df1 = df.mv_str.str.split(expand=True)
print (df1)
          0       1     2
0  Exorsist   part1  None
1       doc     str    2D
2       doc     str    3D
3       doc     str    QA
4       doc   flash  None
5   plastic    None  None
6   plastic  income  None
7       doc     str    2D
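
For anyone who wants to reproduce this step, here is a minimal sketch that rebuilds the sample frame from the CSV text in the question and applies the same split. The variable name csv_text is an assumption made for illustration; it is not part of the original answer.

```python
import io

import pandas as pd

# Sample data copied from the question (csv_text is a hypothetical name).
csv_text = """name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split on whitespace; expand=True spreads the words into columns 0, 1, 2.
# Rows with fewer words are padded with missing values.
df1 = df.mv_str.str.split(expand=True)
print(df1)
```

The columns of df1 are labelled with the integers 0, 1 and 2, which is why the later steps refer to them without quotes.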

Add df1 to the original DataFrame df with concat:

df = pd.concat([df, df1], axis=1)
print (df)
  name          mv_str         0       1     2
0  abc  Exorsist part1  Exorsist   part1  None
1  abc      doc str 2D       doc     str    2D
2  abc      doc str 3D       doc     str    3D
3  abc      doc str QA       doc     str    QA
4  abc       doc flash       doc   flash  None
5  def         plastic   plastic    None  None
6  def  plastic income   plastic  income  None
7  def      doc str 2D       doc     str    2D

Then call drop_duplicates on the columns name, 0 and 1, which keeps the first row of each set of duplicates:

print (df.drop_duplicates(['name',0,1]))
  name          mv_str         0       1     2
0  abc  Exorsist part1  Exorsist   part1  None
1  abc      doc str 2D       doc     str    2D
4  abc       doc flash       doc   flash  None
5  def         plastic   plastic    None  None
6  def  plastic income   plastic  income  None
7  def      doc str 2D       doc     str    2D
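
Which row survives is controlled by the keep parameter of drop_duplicates. The sketch below is only an illustration on a cut-down frame (three "doc str" rows for one user, built by hand rather than taken from the answer) and shows the three accepted values.

```python
import pandas as pd

# Hand-built subset: three rows that share name and the first two words.
df = pd.DataFrame({
    "name": ["abc", "abc", "abc"],
    "mv_str": ["doc str 2D", "doc str 3D", "doc str QA"],
    0: ["doc", "doc", "doc"],
    1: ["str", "str", "str"],
})

first = df.drop_duplicates(["name", 0, 1])               # default keep="first"
last = df.drop_duplicates(["name", 0, 1], keep="last")   # keep the last match
none = df.drop_duplicates(["name", 0, 1], keep=False)    # drop every duplicated row

print(first.mv_str.tolist())
print(last.mv_str.tolist())
print(none.mv_str.tolist())
```

The question asks for the first matching row per user, so the default is what the answer relies on.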

Remove columns 0, 1 and 2 with drop:

print (df.drop_duplicates(['name',0,1]).drop([0,1,2], axis=1))
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D

Or, better, remove the helper columns by selecting only the name and mv_str columns:

print (df.drop_duplicates(['name',0,1])[['name','mv_str']])
  name          mv_str
0  abc  Exorsist part1
1  abc      doc str 2D
4  abc       doc flash
5  def         plastic
6  def  plastic income
7  def      doc str 2D
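
As a possible variation (not the answerer's code), the same result can be obtained without adding helper columns to df: build a key from the first two words and filter with duplicated. This is a sketch under the assumption that "similar" means exactly "same name and same first two whitespace-separated words"; the names key, is_dup and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["abc"] * 5 + ["def"] * 3,
    "mv_str": ["Exorsist part1", "doc str 2D", "doc str 3D", "doc str QA",
               "doc flash", "plastic", "plastic income", "doc str 2D"],
})

# Key = first two words of mv_str (just one word if the string has only one).
key = df["mv_str"].str.split().str[:2].str.join(" ")

# Flag rows whose (name, key) pair was already seen earlier in the frame.
is_dup = df.assign(key=key).duplicated(["name", "key"])

# Keep the unflagged rows; df itself is left unchanged.
result = df[~is_dup]
print(result)
```

With this key, "plastic" and "plastic income" differ, so both rows for def are kept, and the "doc str 2D" row for def is kept because the comparison includes name.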

【Discussion】:

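@jezrael - could you explain in a few words what you did above? I am a beginner, so it is hard for me to follow.
Yes, I found another solution. Give me a moment and I will explain.
Please check my explanation; I am still not 100% sure that I understand you.
@jazrael: suppose I have another row for name def, such as def,doc str 2D ... then I lose that row here, which is not correct.
Yes, but why is the row def,plastic income not removed as well? I do not understand the condition.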