One approach is to use frozenset to build order-insensitive keys:
# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(frozenset(line.strip().split()) for line in infile)

for value in uniques:
    print(*value)
Output (for the given input; sets are unordered, so the order of the lines may differ between runs):
10310 3466
5233 10310
10310 4583
19607 3466
1854 10310
3466 8579
10310 9572
10310 13056
10310 14982
5233 3466
17038 3466
15931 3466
10310 10841
937 3466
18720 3466
16310 10310
Alternatively, use sorted to convert each line into the same key:
# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(" ".join(sorted(line.strip().split())) for line in infile)

for value in uniques:
    print(value)
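Both approaches above print the pairs in arbitrary order, and the frozenset version collapses a line such as "7 7" into a single value. If the original line order matters, a minimal sketch along the following lines may help. It assumes whitespace-separated values, and the helper name dedupe_pairs and the sample data are hypothetical, not part of the original answer:

```python
def dedupe_pairs(lines):
    """Yield each line whose values have not been seen before, ignoring value order."""
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # A sorted tuple is hashable and keeps repeated values, unlike a frozenset.
        key = tuple(sorted(fields))
        if key not in seen:
            seen.add(key)
            yield " ".join(fields)


# Hypothetical sample input standing in for the lines of data.csv
sample = ["10310 3466\n", "3466 10310\n", "5233 10310\n", "7 7\n", "10310 3466\n"]
result = list(dedupe_pairs(sample))
print(result)  # ['10310 3466', '5233 10310', '7 7']
```

Because the function accepts any iterable of strings, an open file object can be passed in place of the sample list.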
To better understand the frozenset approach, consider the following code:
frozenset((1, 2)) == frozenset((2, 1))
Out[2]: True
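As you can see, the two frozensets are equal regardless of the order of the tuples used as input. The same holds for regular sets, but frozensets are hashable. From the documentation:
The frozenset type is immutable and hashable - its contents cannot be
altered after it is created; it can therefore be used as a dictionary
key or as an element of another set.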
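The hashability point can be checked directly. The short sketch below uses illustrative values taken from the sample output; it shows that a regular set cannot be placed inside another set, while two frozensets built in different orders collapse into one element:

```python
# A regular set is mutable and therefore unhashable,
# so trying to store it inside another set raises TypeError.
try:
    {set(("10310", "3466"))}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# frozensets are hashable, and equal frozensets hash equally,
# so both orderings end up as a single element.
pairs = {frozenset(("10310", "3466")), frozenset(("3466", "10310"))}

print(set_is_hashable)  # False
print(len(pairs))       # 1
```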
Note
To write the deduplicated lines to a new file, do the following:
# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(frozenset(line.strip().split()) for line in infile)

# change output.csv to the name of your output file
with open("output.csv", mode="w") as outfile:
    for value in uniques:
        outfile.write(f'{" ".join(value)}\n')