Copying a file without duplicate and blank lines in Python: answers

【Question title】: Copying a file without duplicate and blank lines in Python
【Posted】: 2016-12-06 21:15:09
【Question description】:

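I have written some Python code that copies an existing text file (.txt) to a new file with a different name in the same location. It copies all the text from the original file as expected: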

a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
for reform1 in a.readlines():
    b.write(reform1) #write the lines from 'reform1'
    reform1=a.readlines() #read the lines in the file
a.close() #close file a (file1)
b.close() #close file b (file2)

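I have now been asked to modify the new file so that duplicate lines and blank lines are removed from the copy (while the original file is kept intact), leaving the rest of the text (the unique lines) as it is. How can this be done?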

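【Question comments】:

What do you mean by removing duplicate lines? Remove every line that occurs more than once? Or only remove the repeats after the first occurrence?
You have to keep track of every line you have already seen and check each line against that record. Write a line only if it is not in the record.
Possible duplicate of Remove Duplicates from Text File
You definitely don't need the reform1=a.readlines() line. Also: does a line count as a "duplicate" if it has been seen anywhere before, or only if it is identical to the line immediately above it?
Thank you very much for the replies! I will remove the reform1=a.readlines() line and see how it works.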
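
The comments above raise two possible readings of "duplicate": a line seen anywhere earlier in the file, or a line identical to the one directly above it. A minimal sketch of both readings follows; the function names are made up for illustration, and it works on a list of lines rather than on the files themselves.

```python
from itertools import groupby


def drop_repeats_anywhere(lines):
    """Keep only the first occurrence of each line, wherever repeats appear."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def drop_adjacent_repeats(lines):
    """Collapse each run of identical consecutive lines into a single line."""
    return [line for line, _run in groupby(lines)]


sample = ["a\n", "a\n", "b\n", "a\n"]
print(drop_repeats_anywhere(sample))  # ['a\n', 'b\n']
print(drop_adjacent_repeats(sample))  # ['a\n', 'b\n', 'a\n']
```

The two readings only differ when a line comes back after something else has appeared in between, as the final "a" does here.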

Tags: python

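【Solution 1】:

This writes to 'file2.txt' all the lines of 'file1.txt' except those that consist only of whitespace or are duplicates. Order is preserved, on the assumption that only the first instance of a duplicated line should be written: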

seen = set()
with open('file1.txt') as f, open('file2.txt','w') as o:
    for line in f:
        if not line.isspace() and line not in seen:
            o.write(line)
            seen.add(line)

Note that str.isspace() is True for any whitespace (e.g. tabs), not just newlines; use if not line == '\n' for a stricter definition (assuming no '\r' line endings).
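
To make the difference between the two definitions concrete, here is a small sketch. The helper names are hypothetical; they only wrap the two checks mentioned above so they can be compared on a few sample lines.

```python
def is_blank_loose(line):
    # True when the line consists only of whitespace (spaces, tabs, newlines).
    return line.isspace()


def is_blank_strict(line):
    # True only when the line is exactly one newline character.
    return line == '\n'


# Compare the two checks on a few representative lines.
for line in ['\n', ' \t\n', '\r\n', 'text\n']:
    print(repr(line), is_blank_loose(line), is_blank_strict(line))
```

The checks disagree on lines such as ' \t\n' and '\r\n': the loose check treats them as blank, the strict one does not.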

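I used the with statement to handle opening and closing the files and read the file line by line, which is the most Pythonic way.

For simply copying a file in Python, you should use shutil, as explained here.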
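
For the plain copy with no filtering, a minimal sketch using shutil.copyfile could look like this. The wrapper name copy_unchanged is made up for illustration, and the file names in the usage comment are the ones from the question.

```python
import shutil


def copy_unchanged(src_path, dst_path):
    # shutil.copyfile copies the contents of src_path into dst_path,
    # creating or overwriting dst_path, and returns the destination path.
    return shutil.copyfile(src_path, dst_path)


# Usage, assuming file1.txt exists in the current directory:
# copy_unchanged('file1.txt', 'file2.txt')
```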

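【Discussion】:

One improvement might be to just check not line in o.readlines() rather than keeping a separate list
@PeterKrasinski readlines() returns a list, so it first has to iterate over every line in the file, and then, to check whether the string is present, it has to iterate over every item in that list again looking for a match
@Chris_Rands Thanks for your help. I got as far as the "seen.add(line)" part before hitting an error: AttributeError: 'dict' object has no attribute 'add'. If I remove the seen.add line, the program does part of what I want and removes the blank lines, but it does not remove the duplicate lines. I'm not sure how to fix this? Thanks
@motoverdi My mistake, try the edited code now, with seen = set() at the top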
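
Besides the cost described in the reply above, there is a second obstacle to checking against o.readlines(): in the answer, o is opened with mode 'w', and a write-only handle cannot be read. A small sketch that demonstrates this on a temporary file; the function name is hypothetical.

```python
import io
import os
import tempfile


def readlines_on_write_handle(path):
    """Return the error message raised by readlines() on a 'w' handle."""
    with open(path, 'w') as o:
        try:
            o.readlines()
        except io.UnsupportedOperation as exc:
            return str(exc)
    return None


# Try it on a throwaway file so that no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    print(readlines_on_write_handle(os.path.join(tmp, 'file2.txt')))
```

Keeping the lines already written in a separate set avoids both problems, since membership tests on a set do not rescan the output.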

【Solution 2】:

Try this:

import re
a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
exists = set()
for reform1 in a.readlines():
    if reform1 in exists:
        continue
    elif re.match(r'^\s$', reform1):
        continue
    else:
        b.write(reform1) #write the lines from 'reform1'
        exists.add(reform1)
a.close() #close file a (file1)
b.close() #close file b (file2)
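
To see which lines the pattern r'^\s$' in the code above actually skips, it can be tried on a few sample lines. This is only a sketch: the variant r'^\s*$' and the helper name are not part of the answer and are shown for comparison.

```python
import re

SINGLE = re.compile(r'^\s$')    # the pattern used in the answer
ANY_RUN = re.compile(r'^\s*$')  # hypothetical variant: any run of whitespace


def skipped_by(pattern, line):
    # True when the pattern matches, i.e. the line would not be written.
    return pattern.match(line) is not None


for line in ['\n', ' \n', '  \n', '\t\t\n', 'text\n']:
    print(repr(line), skipped_by(SINGLE, line), skipped_by(ANY_RUN, line))
```

The original pattern matches a bare '\n' and also ' \n', because $ may match just before a trailing newline, but it does not match a line holding two or more spaces or tabs.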

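【Discussion】:

This won't remove any blank lines, because every line contains at least '\n'. You could check elif reform1.strip() instead
How so? I use a regular expression to match any blank line
As a side note, this program may end up using too much memory, depending on the number of lines and the length of individual lines. An efficient approach would be to store a set of hashes of the lines rather than the lines themselves.
@sid-m Using hashes is a good idea. However, this code can first be optimized by not using readlines(), which builds the whole list rather than being lazy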
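
The two suggestions in the last comments (store hashes instead of lines, and iterate lazily instead of calling readlines()) could be combined roughly as follows. This is a sketch under assumptions: the function name is made up, SHA-256 is just one possible hash, and storing digests only saves memory when lines are typically longer than the 32-byte digest.

```python
import hashlib


def copy_unique_nonblank(src_path, dst_path):
    """Copy src to dst, skipping whitespace-only lines and repeated lines."""
    seen_digests = set()
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:  # iterating the file object reads one line at a time
            if line.isspace():
                continue
            # Remember a fixed-size digest instead of the full line text.
            digest = hashlib.sha256(line.encode('utf-8')).digest()
            if digest not in seen_digests:
                seen_digests.add(digest)
                dst.write(line)


# Usage, assuming file1.txt exists in the current directory:
# copy_unique_nonblank('file1.txt', 'file2.txt')
```

Two different lines with the same digest would be treated as duplicates; with SHA-256 that is extremely unlikely, but it is a trade-off the plain set of lines does not have.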

【Solution 3】:

Try:

a=open("file1.txt", "r") #existing file
b=open("file2.txt", "w") #file did not previously exist, hence "w"
seen = []
for reform1 in a.readlines():
    if reform1 not in seen and len(reform1) > 1:
        b.write(reform1) #write the lines from 'reform1'
        seen.append(reform1)
a.close() #close file a (file1)
b.close() #close file b (file2)

I used "len(reform1) > 1" because when I created a test file, the blank lines had 1 character, presumably a "\r" or "\n" character. Adjust as needed for your application.
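
As an example of the kind of adjustment that may be needed, the length test can be compared with a strip-based test on a few edge cases. The helper names are hypothetical, and the strip-based check is an alternative for comparison, not part of the answer.

```python
def keep_by_length(line):
    # The test used in the answer: keep lines longer than one character.
    return len(line) > 1


def keep_by_strip(line):
    # Hypothetical alternative: keep lines that contain any non-whitespace.
    return bool(line.strip())


# ' \n' is a whitespace-only line; 'x' is a final line with no newline.
for line in ['\n', ' \n', 'x', 'x\n']:
    print(repr(line), keep_by_length(line), keep_by_strip(line))
```

The length test keeps a whitespace-only line such as ' \n' and drops a one-character final line that has no trailing newline; the strip-based test handles both cases the other way round.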

【Discussion】: