How to sort values within a Pandas DF Column and remove duplicates

【Question title】: How to sort values within Pandas DF Column and remove duplicates
【Posted】: 2018-03-17 23:05:46
【Question description】:

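This may be a very basic question, but I haven't been able to find an answer, so here goes...

Question:

Is there a way to sort the values alphabetically while also removing any duplicate instances?

Here is what I have: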

import pandas as pd

data = ['Car | Book | Apple','','Book | Car | Apple | Apple']
df = pd.DataFrame(data,columns=['Labels'])
print(df)

    Labels
0   Car | Book | Apple
1   
2   Book | Car | Apple | Apple

Expected output:

    Labels
0   Apple | Book | Car
1   
2   Apple | Book | Car

Thanks!

【Discussion】:

Tags: python python-3.x pandas sorting

【Solution 1】:

Use str.join after str.split

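# Remove the spaces first so that splitting on '|' yields clean tokens
df=df.replace({' ':''},regex=True)
# set() removes duplicates, but it does not guarantee any particular order
df.Labels.str.split('|').apply(set).str.join('|')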
Out[339]: 
0    Apple|Book|Car
1                  
2    Apple|Book|Car
Name: Labels, dtype: object

Adding sorted, based on the comments

df.Labels.str.split('|').apply(lambda x : sorted(set(x),reverse=False)).str.join(' | ')

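【Discussion】:

The OP also wants the values sorted.
The values within each row should be sorted alphabetically, e.g. Car | Book | Apple -> Apple | Book | Car. See my solution. AFAIK set does not sort. Do you have any documentation for that?
Your output is correct, but as far as I know set does not guarantee ordering. Maybe I'm wrong?
From the Python 3 docs for set, the examples seem to indicate that the items are not sorted. Here you explicitly sort the values before joining.
@J_Win yw :-) happy coding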
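
To illustrate the point raised in these comments, here is a minimal pure-Python sketch (the `labels` value is just a sample string, not part of the original answer). A set only removes duplicates and its iteration order is unspecified; it is the explicit call to sorted that produces the alphabetical order.

```python
# Pure-Python illustration; `labels` is a hypothetical sample value.
labels = "Book | Car | Apple | Apple"

tokens = labels.split(" | ")   # ['Book', 'Car', 'Apple', 'Apple']
unique = set(tokens)           # duplicates removed, iteration order unspecified
ordered = sorted(unique)       # a new list, in alphabetical order

print(" | ".join(ordered))     # Apple | Book | Car
```

So relying on the order in which a set happens to print is fragile, while sorted(set(...)) gives the same result every time.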

【Solution 2】:

One way is to use pd.Series.map with sorted and set, after splitting on |:

import pandas as pd

data = ['Car | Book | Apple','','Book | Car | Apple | Apple']
df = pd.DataFrame(data,columns=['Labels'])

df['Labels'] = df['Labels'].map(lambda x: ' | '.join(sorted(set(x.split(' | ')))))

#                Labels
# 0  Apple | Book | Car
# 1                    
# 2  Apple | Book | Car
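
If the column may also contain missing values, the lambda above would fail on them because NaN has no split method. The following is only a sketch under that assumption: `normalize_labels` is a hypothetical helper name, and the extra None entry in the sample data is not part of the question.

```python
import pandas as pd

def normalize_labels(value, sep="|"):
    """Hypothetical helper: de-duplicate and sort one delimited string."""
    if not isinstance(value, str):
        return value  # leave NaN / None untouched
    # Strip whitespace around each token and drop empty tokens
    tokens = {part.strip() for part in value.split(sep)} - {""}
    return " | ".join(sorted(tokens))

# Sample data extended with a missing value (an assumption, not in the question)
s = pd.Series(["Car | Book | Apple", "", None, "Book | Car | Apple | Apple"])
result = s.map(normalize_labels)
print(result.tolist())
```

Because the helper strips each token itself, it should also tolerate inconsistent spacing around the | separator.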

【Discussion】:

【Solution 3】:

df['Labels'].str.split('|') splits each string on | and returns a list:

#0             [Car ,  Book ,  Apple]
#1                                 []
#2    [Book ,  Car ,  Apple ,  Apple]
#Name: Labels, dtype: object

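Notice the extra spaces in the resulting list elements. One way to remove them is to apply str.strip() to each element of the list (in Python 3, map returns a lazy iterator, so wrap it in list to get the lists shown):

df['Labels'].str.split('|').apply(lambda x: list(map(str.strip, x)))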
#0           [Car, Book, Apple]
#1                           []
#2    [Book, Car, Apple, Apple]
#Name: Labels, dtype: object

Finally, we apply the set constructor to remove duplicates, sort the values, and then join them back together using " | " as the separator:

df['Labels'] = df['Labels'].str.split('|').apply(
    lambda x: " | ".join(sorted(set(map(str.strip, x))))
)
print(df)
#               Labels
#0  Apple | Book | Car
#1                    
#2  Apple | Book | Car
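
As a possible variation on the same split, strip, de-duplicate, sort, and join steps, the sketch below uses Series.explode (available in pandas 0.25 and later) instead of a per-row lambda over lists. It is not from the original answer, and it assumes the column holds only strings.

```python
import pandas as pd

df = pd.DataFrame({"Labels": ["Car | Book | Apple", "", "Book | Car | Apple | Apple"]})

# One row per token, keeping the original row index; then trim whitespace
tokens = df["Labels"].str.split("|").explode().str.strip()

# Regroup by the original index, de-duplicate, sort, and join back
df["Labels"] = tokens.groupby(level=0).agg(lambda g: " | ".join(sorted(set(g))))
print(df)
```

Whether this is faster than the apply version likely depends on the data, so it is worth timing both on a real column.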

【Discussion】: