Efficiently extracting a subset of a large file in Python: answers

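【Question title】: Efficiently extracting a subset of a large file in python
【Posted】: 2017-11-07 22:46:32
【Question】:

I have a large file containing several million lines of text. I want to randomly extract a smaller subset (250000 lines) from it. I wrote the following code, but it is very slow, in fact unusably slow. What can I do to speed it up?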

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            all_lines = in_file.readlines()
            total = len(all_lines)
            print("Total lines:", total)
            for i in range(new_len):
                line = np.random.choice(all_lines)
                out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                print("Done with", i, "lines")
                all_lines.remove(line)
        out_file.write("\n".join(out_lines))

【Question comments】:

Tags: python list file numpy

【Solution 1】:

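So, the problems:

all_lines = in_file.readlines() reads every line into memory, which is probably not the best approach... but if you are going to do that, then definitely do not call all_lines.remove(line): it is an O(N) operation, and because you run it inside a loop you end up with quadratic complexity.

I suspect you will get a huge performance improvement simply by doing the following: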

idx = np.arange(total, dtype=np.int32)
idx = np.random.choice(idx, size=new_len, replace=False)
for i in idx:
    out_file.write(all_lines[i])
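
Put together as a complete function, the index-sampling idea might look like the sketch below. The name `sample_lines_by_index`, the `out_path` and `seed` parameters, and the use of the standard library's `random.sample` in place of `np.random.choice(..., replace=False)` are assumptions made for illustration, not part of the answer. It still loads the whole file into memory, as the question's code does.

```python
import random

def sample_lines_by_index(in_path, out_path, new_len, seed=None):
    """Write new_len distinct, randomly chosen lines of in_path to out_path."""
    rng = random.Random(seed)
    with open(in_path, "r") as in_file:
        all_lines = in_file.readlines()  # whole file is held in memory
    # Draw all indices in one call instead of removing lines one by one.
    # Raises ValueError if new_len exceeds the number of lines.
    idx = rng.sample(range(len(all_lines)), new_len)
    with open(out_path, "w") as out_file:
        for i in idx:
            line = all_lines[i]
            # The last line of a file may lack a trailing newline.
            out_file.write(line if line.endswith("\n") else line + "\n")
    return len(idx)
```

Each index lookup is O(1), so the total work is linear in the file size plus the sample size rather than quadratic.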

【Comments】:

【Solution 2】:

You could also try using mmap:

https://docs.python.org/3.6/library/mmap.html
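
The answer gives no code, so the following is only one possible way mmap could be applied here: map the file, record the byte offset where each line starts, then copy out the randomly chosen lines by slicing. The function name `sample_lines_mmap` and its parameters are hypothetical. Note that `mmap` cannot map an empty file, and the list of offsets still takes memory proportional to the number of lines.

```python
import mmap
import random

def sample_lines_mmap(in_path, out_path, new_len, seed=None):
    """Sample new_len distinct lines through a read-only memory map."""
    rng = random.Random(seed)
    with open(in_path, "rb") as in_file, \
            mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        # First pass: byte offset at which every line starts.
        starts = [0]
        pos = mm.find(b"\n")
        while pos != -1:
            starts.append(pos + 1)
            pos = mm.find(b"\n", pos + 1)
        if starts[-1] == size:
            starts.pop()  # file ends with a newline, so no line starts at EOF
        chosen = rng.sample(range(len(starts)), new_len)
        with open(out_path, "wb") as out_file:
            for i in chosen:
                end = starts[i + 1] if i + 1 < len(starts) else size
                line = mm[starts[i]:end]  # slicing a mmap returns bytes
                out_file.write(line if line.endswith(b"\n") else line + b"\n")
    return len(chosen)
```

The operating system pages the file in on demand, so only the offsets and the lines being written need to live in Python objects.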

【Comments】:

【Solution 3】:

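You read in all the lines, hold them in memory, and then perform 250K operations on that large list. Each time you remove a line, Python has to search the list for it and shift the remaining entries, which costs O(N) per removal.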

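Instead, simply take a random sample as you go. For example, if you have 5 million lines, you want 5% of the file. Read the file one line at a time. Roll a random floating-point number. If it is below 0.05, write that line to the output.

With a sample this large, you will end up with output very close to the desired size.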
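
The streaming idea above can be sketched as follows. The function name `sample_lines_streaming` and its parameters are illustrative assumptions; `fraction` is the keep probability (0.05 in the 5% example).

```python
import random

def sample_lines_streaming(in_path, out_path, fraction, seed=None):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    kept = 0
    with open(in_path, "r") as in_file, open(out_path, "w") as out_file:
        for line in in_file:  # one line at a time, file never fully in memory
            if rng.random() < fraction:  # random() is uniform on [0, 1)
                out_file.write(line)
                kept += 1
    return kept
```

With 5,000,000 lines and `fraction=0.05` the expected count is 250,000, with a standard deviation of roughly 490 lines, so the output is close to the target size but not exactly equal to it.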

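【Comments】:

【Solution 4】:

Make use of Python's numpy library. The numpy.random.choice() function provides what you need: it draws a sample of lines of the required size in a single call. So your function would look like: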

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""

    with open(fname + "short.out", 'w') as out_file, open(fname, 'r') as in_file:
        out_file.write(''.join(np.random.choice(list(in_file), new_len, replace=False)))

get_shorter_subset('input.txt', 250000)

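【Comments】:

【Solution 5】:

Thanks for the answers. I put together a solution that draws a random number at each index (with probability corresponding to new_size/full_size) and keeps or discards each element based on that value. The code is: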

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            all_lines = in_file.readlines()
            total = len(all_lines)

            # Each line is kept with probability 1/freq (integer division)
            freq = total // new_len + 1
            print("Total lines:", total, "new freq:", freq)
            for i, line in enumerate(all_lines):
                t = np.random.randint(1, freq + 1)  # uniform over 1..freq
                if t == 1:
                    out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                if i % 10000 == 0:
                    print("Done with", i, "lines")

        out_file.write("\n".join(out_lines))
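
A per-line random draw like the one above produces roughly, not exactly, new_len lines. If an exact count matters, one alternative that is not mentioned in the thread is reservoir sampling (Algorithm R), which also reads the file one line at a time. The sketch below is an illustration under that assumption; the name `reservoir_sample_lines` and its parameters are hypothetical.

```python
import random

def reservoir_sample_lines(in_path, out_path, new_len, seed=None):
    """Write a uniform random sample of exactly new_len lines (or all lines if fewer)."""
    rng = random.Random(seed)
    reservoir = []
    with open(in_path, "r") as in_file:
        for i, line in enumerate(in_file):
            if i < new_len:
                reservoir.append(line)  # fill the reservoir first
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < new_len:
                    reservoir[j] = line  # replace with probability new_len / (i + 1)
    with open(out_path, "w") as out_file:
        for line in reservoir:
            out_file.write(line if line.endswith("\n") else line + "\n")
    return len(reservoir)
```

Only new_len lines are held in memory at any time, and the total number of lines does not need to be known in advance.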

【Comments】: