How to compare files faster in Python? Answers

【Question title】: How to compare files quicker in Python?
【Posted】: 2015-08-11 22:29:12
【Question】:

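Is there any way to make this script faster? I am using one file to filter another: a line is printed whenever its ID column (index 2 in the code below, i.e. the third tab-separated field) also appears in the other file.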

import csv  # imported but never used
output = []
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()  # always empty: readlines() has already consumed the file
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()  # always empty, for the same reason

f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')

# For every line of the first file, scan the whole second file
for line1 in list1:
    separar = line1.split("\t")
    gene = separar[2]  # third tab-separated field
    for line2 in list2:
        separar2 = line2.split("\t")
        gene2 = separar2[2]
        if gene == gene2:
            print(line1)
            f3.write(line1)

Sample input (both files):

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

The following bash command line does the same job:

awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf
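The awk one-liner can be read as a two-pass filter: while the first file is being read (`FNR==NR`), it only records the third field of each line; for the second file, it prints the lines whose third field was recorded. A rough Python sketch of that logic follows. The function name `filter_by_column` and its parameters are illustrative assumptions, not part of the original script, and unlike awk it simply skips lines with fewer than three fields.

```python
def filter_by_column(keys_path, data_path, out_path, col=2):
    """Write the lines of data_path whose column `col` also occurs in keys_path.

    Returns the number of lines written. `col` is zero-based, so the
    default of 2 corresponds to awk's $3.
    """
    # First pass (awk: FNR==NR {a[$3]; next}): remember every key.
    wanted = set()
    with open(keys_path) as keys_file:
        for line in keys_file:
            fields = line.split()  # split on runs of whitespace, like awk's default
            if len(fields) > col:
                wanted.add(fields[col])

    # Second pass (awk: $3 in a): keep the lines whose key was seen.
    written = 0
    with open(data_path) as data_file, open(out_path, "w") as out_file:
        for line in data_file:
            fields = line.split()
            if len(fields) > col and fields[col] in wanted:
                out_file.write(line)
                written += 1
    return written
```

To mirror the command above, the call would be `filter_by_column('Neandertais.vcf', 'Phase1_missing.vcf', 'teste.vcf')`. Each matching line is written once and in the order it appears in the second file, whereas the nested loops in the script write a line once per matching line in the other file.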

My question is: how can I improve this Python script?

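【Comments】:

Have you tried it? If so, are the results the same? It would be better to show us some sample input and the desired output, rather than leaving us to work out whether two implementations in two different languages do the same thing.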

Tags: python file comparison two-columns

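【Solution 1】:

If you store the lines in dictionaries keyed on the column you are interested in, you can easily use Python's built-in set operations (which run at C speed) to find the matching lines. I tested a slightly modified version (the filenames were changed, and split('\t') was changed to split() because of stackoverflow formatting), and it seems to work fine: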

import collections

# Files are opened in binary mode ('rb'/'wb'), so lines and keys are bytes

infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'

def readfile(fname):
    '''
    Read in a file and return a dictionary of lines, keyed by the item
    in the third column (index 2)
    '''
    results = collections.defaultdict(list)
    # Read in binary mode -- it's quicker
    with open(fname, 'rb') as f:
        for line in f:
            parts = line.split(b"\t")
            # Skip blank or short lines that have no third column
            if len(parts) < 3:
                continue
            gene = parts[2]
            results[gene].append(line)
    return results

dict1 = readfile(infn1)
dict2 = readfile(infn2)

with open(outfn, 'wb') as outf:
    # Find keys that appear in both files (set order is arbitrary)
    for key in set(dict1) & set(dict2):
        # For these keys, print all the matching
        # lines in the first file
        for line in dict1[key]:
            print(line.rstrip().decode('utf-8', 'replace'))
            outf.write(line)
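
To see what the dictionary-plus-set approach produces, here is a small in-memory sketch of the same idea. The records are made up for illustration (not real VCF data), and `group_by_column` is a hypothetical helper name. It also shows one way to get a repeatable output order, since iterating over a set intersection gives no ordering guarantee.

```python
from collections import defaultdict


def group_by_column(lines, col=2):
    """Map each value found in column `col` to the list of lines carrying it."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split("\t")
        if len(parts) > col:  # ignore blank or short lines
            groups[parts[col]].append(line)
    return groups


# Made-up records standing in for the two input files.
file1 = ["1\t100\trsA\tC\tT\n", "1\t200\trsB\tG\tA\n", "1\t300\trsA\tC\tG\n"]
file2 = ["2\t900\trsA\tC\tT\n", "2\t950\trsZ\tT\tC\n"]

groups1 = group_by_column(file1)
groups2 = group_by_column(file2)

# Keys present in both files: a set intersection, as in the answer.
shared = set(groups1) & set(groups2)

# Walking groups1 (insertion-ordered in Python 3.7+) instead of the set
# visits keys in order of first appearance in file1.
matches = [line for key in groups1 if key in shared for line in groups1[key]]
```

Here `shared` contains only `rsA`, and `matches` holds the two `rsA` lines from `file1`. Lines that share a key come out grouped together, so the output is not strictly in file order when a key repeats.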

【Comments】: