How can I compare two files in Python more efficiently? Answers

【Question title】: How can I compare two files in Python more efficiently?
【Posted】: 2014-10-29 11:34:12
【Question description】:

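I have to compare two large files, and I am running into performance problems.

Consider two files, X and Y.

X has 42,000 records, one word per line.

Y has 881,000 records, three words per line, i.e. three columns.

I want to compare each word in file X with the first word of each line in file Y.

If I find X_word in Y_first_column_word, I write the word from the second column of file Y to the output file.

Here is the code: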

to_file = open( output_file, 'w' )                # opening the file to write
f1      = open( input_file1, "rU" ).readlines()   # reading 1st file  42000 records
f2      = open( input_file2, "rU" ).readlines()   # reading 2nd file 881000 records

for i, w1 in enumerate( f1 ):
    for j, line in enumerate( f2 ):
        w2 = line.split(',')                      # splitting words from  2nd file
        if w1.strip() == w2[0].strip():           # removing trails
            if w2[1].strip() == '':               # when it is blank, get 1st column word 
                w2[1] = w2[0]
            print>>to_file, w2[1]

to_file.close()                                   # closing the file

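I tested it with sample data and it does what I need. But when I run it on the real data, it becomes unresponsive. My last attempt ran for 18 hours.

Is there a way to improve this code so that it runs more efficiently?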

【Question discussion】:

Tags: python performance file string-comparison

【Solution 1】:

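Your current approach is quadratic, O(N**2) (more precisely O(N*M)): for each of the 42,000 words in X it rescans and re-splits all 881,000 lines of Y, which is roughly 37 billion comparisons. If you store the contents of the second file in a dictionary, each lookup takes constant time on average and the whole job runs in linear time.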

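# Python 3.11 removed the "U" mode; strip() below already handles stray "\r".
with open(input_file1) as f1, open(input_file2) as f2, open(output_file, 'w') as to_file:
    # Map each first-column word to its second-column word (first column if blank).
    words_dict = {}
    for line in f2:
        k, v, _ = (col.strip() for col in line.split(','))
        words_dict[k] = v or k
    for word in f1:
        word = word.strip()
        if word in words_dict:
            to_file.write(words_dict[word] + '\n')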
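
A self-contained sketch of the dictionary approach, assuming comma-separated rows in Y as in the question's code. A plain dict keeps only the last row seen for each first-column word, whereas the nested loops in the question write one line for every matching row, so this variant stores a list per key to keep that behavior. The function name `match_words` and its parameters are hypothetical, and skipping lines with fewer than two columns is an assumption, not something the question specifies.

```python
from collections import defaultdict


def match_words(words_path, table_path, output_path):
    """Write the second column of every table row whose first column is a listed word."""
    # Index the large file once: first column -> list of second-column values.
    index = defaultdict(list)
    with open(table_path) as table:
        for line in table:
            cols = [c.strip() for c in line.split(",")]
            if len(cols) < 2:
                continue  # assumption: skip blank or malformed lines
            # Fall back to the first column when the second one is blank.
            index[cols[0]].append(cols[1] or cols[0])

    written = 0
    with open(words_path) as words, open(output_path, "w") as out:
        for word in words:
            # .get() avoids creating empty entries for words that are absent.
            for value in index.get(word.strip(), ()):
                out.write(value + "\n")
                written += 1
    return written
```

Building the index reads Y once, and each lookup is an average constant-time hash probe, so the total work grows with the combined size of the two files rather than with their product.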

【Discussion】: