Losing lines from two text files over iteration

【Question title】: Losing lines from two text files over iteration
【Posted】: 2013-11-11 01:34:19
【Question description】:

I have two text files (A and B) that look like this:

A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...

B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...

What I want to do is read both files and then produce a new text file like this:

1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...

Using a for loop, I wrote this function (in Python):

def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            pass  # write the line to the output text file
        else:
            break

Then I call the function like this:

for i in range(1,3):        
    find(o, i)
    find(r, i)

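What happens is that I lose some data: the first line containing a different number gets read, but it never ends up in the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.

Is there any way to avoid this?

Thanks in advance.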

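【Question discussion】:

What is arch? Is it a file object, a list, or something else?
It is a file object (a .txt file).
Is the leading number monotonic (always increasing or staying the same)?
Yes, the first number is monotonically increasing, but the last number is random.
The problem is that for line in l: does not start from the beginning of the file each time; it resumes from where you left off. The best approach is to read the files into a list first and then deal with the ordering.

Tags: python for-loop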

【Solution 1】:

If the files fit in memory:

with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
L.sort(key=lambda line: int(line.partition(' ')[0]))  # sort by 1st column
print("\n".join(L))  # print result

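This is an efficient approach if the total number of lines is under a million. Otherwise, and especially if you have many files that are already sorted, you could use heapq.merge() to combine them.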
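As a rough sketch of the heapq.merge() route: the snippet below assumes every input is already sorted by its first column, that every line ends with a newline, and that Python 3.5 or later is available (the key parameter was added to heapq.merge() in 3.5). The names first_column and merge_sorted_files are made up for illustration.

```python
import heapq
import io


def first_column(line):
    """Return the leading integer of a whitespace-separated line."""
    return int(line.split()[0])


def merge_sorted_files(files, out):
    """Lazily merge already-sorted, line-iterable inputs into `out`.

    Only one pending line per input is held in memory. For equal keys,
    lines from earlier inputs are emitted before lines from later ones.
    """
    for line in heapq.merge(*files, key=first_column):
        out.write(line)


# In-memory stand-ins for files A and B (hypothetical sample data).
a = io.StringIO("1 stringhere 5\n1 stringhere 3\n2 stringhere 4\n2 stringhere 4\n")
b = io.StringIO("1 stringhere 4\n1 stringhere 5\n2 stringhere 1\n2 stringhere 2\n")
result = io.StringIO()
merge_sorted_files([a, b], result)
print(result.getvalue(), end="")
```

With real files, the same function can be called with objects returned by open(); any number of inputs can be passed in the list.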

【Discussion】:

Thanks, this worked for these two files. If I have more text files, should I follow the same approach?
Thanks, that is indeed the case.

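【Solution 2】:

In your loop, you break when the number at the start of the line differs from i, but by then you have already consumed that line. So when the function is called a second time with i+1, it effectively starts at the second matching line.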
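The effect can be reproduced in a few lines. This is only an illustrative sketch: io.StringIO stands in for a real file, and take_while_first_is is a made-up helper that mirrors the loop from the question.

```python
import io


def take_while_first_is(fileobj, i):
    """Collect consecutive lines whose first column equals i."""
    taken = []
    for line in fileobj:
        if int(line.split()[0]) == i:
            taken.append(line)
        else:
            # The non-matching line has already been read from the file,
            # so the next call will not see it again.
            break
    return taken


f = io.StringIO("1 a\n1 b\n2 c\n2 d\n")
first = take_while_first_is(f, 1)
second = take_while_first_is(f, 2)
print(first)   # ['1 a\n', '1 b\n']
print(second)  # ['2 d\n']  ('2 c' was consumed by the first call)
```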

Either read the whole file into memory beforehand (see @J.F.Sebastian's answer) or, if that is not an option, replace your function with:

def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]): # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1) # Need to 'unread' the last read line
            break

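This version "rewinds" the cursor so that the next call to readline reads the correct line again. Note that mixing the implicit for line in l with seek calls is not recommended, which is why while True is used instead.

Example: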
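The code above is written for Python 2. On Python 3, text-mode files generally reject nonzero relative seeks such as seek(-len(line), 1), so a possible adaptation is to remember the position with tell() before each readline() and seek back to it. The sketch below assumes whitespace-separated columns and takes the output file as an extra parameter named out, which is not part of the original function.

```python
import io


def find(arch, i, out):
    """Copy consecutive lines whose first column equals i from arch to out.

    The first non-matching line is "unread" by seeking back to the
    position recorded before it was read.
    """
    while True:
        pos = arch.tell()        # position before reading the next line
        line = arch.readline()   # readline() keeps tell() usable
        if line and int(line.split()[0]) == i:
            out.write(line)
        else:
            arch.seek(pos)       # rewind so the next call sees this line
            break


# In-memory stand-ins for the two input files (hypothetical sample data).
o = io.StringIO("1 stringhere 1\n1 stringhere 2\n2 stringhere 1\n")
r = io.StringIO("1 stringhere 4\n2 stringhere 2\n")
result = io.StringIO()
for i in range(1, 3):
    find(o, i, result)
    find(r, i, result)
print(result.getvalue(), end="")
```

The loop uses readline() rather than for line in arch because, on Python 3, iterating over a text file disables tell() while the iteration is in progress.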

$ cat t.py
o = open("t1")
r = open("t2")
print o
print r


def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break

for i in range(1, 3):
    find(o, i)
    find(r, i)

$ cat t1 
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$

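【Discussion】:

I'm getting ValueError: invalid literal for int() with base 10: '' with this. Am I doing something wrong?
In my test I used a space as the separator, not a tab. I have just updated the code to match your initial code snippets. Maybe that was the problem? You can also get that error if a line is empty, but the if line != "" part should guard against that.
I used '\t' here too, and there are no empty lines in the text files. The first lines were printed, but the error occurred when the first number changed from 1 to 2.
Can you reproduce my example?
I managed to get it working; it seems I had overlooked part of your answer. Thanks a lot!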

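【Solution 3】:

There may be a less complicated way to achieve this. The following also keeps lines with the same leading number in the order in which they appear in the files, which is what you appear to want.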

lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))

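The first three lines read the input files into a single list of lines.

The fourth line makes sure every line ends with a newline. You can omit it if you are sure that both files end with a newline.

The fifth line defines the sort key as the integer value of the first word of each line.

The sixth line sorts the lines and writes the result to the output file.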
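The same steps can also be written so that every file is closed explicitly. This is a sketch rather than the answer's own code: the function names are invented, blank lines are skipped as an extra precaution, and it relies on Python's sort being stable, which is what keeps lines with equal leading numbers in their original order.

```python
def sorted_by_first_column(lines):
    """Sort lines by their leading integer; ties keep their input order."""
    return sorted(lines, key=lambda line: int(line.split()[0]))


def merge_files(in_paths, out_path):
    """Read all input files, sort by first column, write one output file."""
    lines = []
    for path in in_paths:
        with open(path) as f:  # the file is closed when the block ends
            # splitlines() drops newlines, so a missing final newline
            # in an input file does not glue two lines together.
            lines.extend(l for l in f.read().splitlines() if l.strip())
    with open(out_path, "w") as out:
        out.writelines(line + "\n" for line in sorted_by_first_column(lines))
```

A call might look like merge_files(['file_a.txt', 'file_b.txt'], 'out_file.txt'), using the file names from the answer above.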

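【Discussion】:

Don't use cmp; use key instead, as in my answer. cmp was removed in Python 3 (for a reason). Don't use "\r\n": Python converts "\n" to "\r\n" for you automatically when writing on Windows.
sorted(lines) may break the numeric ordering. Compare sorted("1 10 9".split()) and sorted(map(int, "1 10 9".split()))
@JFSebastian Thanks for your help! I have incorporated both of your suggestions into the answer.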
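The comparison suggested in the comment above can be run directly. The sample lines list is invented for illustration; everything else uses only built-ins.

```python
tokens = "1 10 9".split()

as_text = sorted(tokens)              # compares strings character by character
as_numbers = sorted(map(int, tokens)) # compares integer values

print(as_text)     # ['1', '10', '9']
print(as_numbers)  # [1, 9, 10]

# The same applies to whole lines: sort with an integer key, not as plain text.
lines = ["10 x", "9 y", "1 z"]
plain = sorted(lines)
by_key = sorted(lines, key=lambda line: int(line.split()[0]))
print(plain)   # ['1 z', '10 x', '9 y']
print(by_key)  # ['1 z', '9 y', '10 x']
```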