Reading txt files with multiple threads in Python: answers

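【Question title】: Read txt file with multi-threaded in python
【Posted】: 2011-10-16 00:19:43
【Question】:

I'm trying to read a file in Python (scan its lines and look for terms) and write the results, say, a counter for each term. I need to do this for a large number of files (more than 3000). Is it possible to do this multi-threaded? If so, how?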

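So, the scenario is like this:

Read each file and scan its lines.
Write the counters for all the files I've read to the same output file.

My second question is whether this improves read/write speed.

I hope that's clear enough. Thanks,

Ron.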

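【Question comments】:

Tags: python multithreading text-files

【Solution 1】:

I agree with @aix, multiprocessing is definitely the way to go. Either way you will be I/O bound: you can only read so fast, no matter how many parallel processes you run. But some speedup is easy to get.

Consider the following (input/ is a directory containing several .txt files from Project Gutenberg).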

import os
import time
from multiprocessing import Pool

def process_file(name):
    ''' Process one file: count number of lines and words '''
    linecount = 0
    wordcount = 0
    # errors='replace' keeps odd bytes from aborting the scan
    with open(name, 'r', errors='replace') as inp:
        for line in inp:
            linecount += 1
            # split() with no argument splits on any run of whitespace
            wordcount += len(line.split())

    return name, linecount, wordcount

def list_files(dirname):
    ''' Collect the path of every file below dirname '''
    return [os.path.join(root, name)
            for root, _, names in os.walk(dirname)
            for name in names]

def process_files_parallel(paths):
    ''' Process each file in parallel via Pool.map() '''
    # One pool for all files; the with block shuts the workers down
    with Pool() as pool:
        return pool.map(process_file, paths)

def process_files(paths):
    ''' Process each file sequentially via map() '''
    # list() forces evaluation, since map() is lazy in Python 3
    return list(map(process_file, paths))

if __name__ == '__main__':
    paths = list_files('input/')

    start = time.time()
    process_files(paths)
    print("process_files()", time.time() - start)

    start = time.time()
    process_files_parallel(paths)
    print("process_files_parallel()", time.time() - start)

When I run this on my dual-core machine there is a noticeable (but not 2x) speedup:

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591

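If the files are small enough to fit in memory, and you have a lot of processing to do that isn't I/O bound, then you should see an even better improvement.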
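The same approach can be extended to the term counting described in the question. What follows is only a minimal sketch under assumptions that are not in the original post: the terms in TERMS, the output name counts.txt, and the helper names count_terms and write_counts are all made up for illustration, and a term only matches a whole whitespace-separated word. Each worker returns its counters, and only the parent process writes the single output file, so no two processes ever write to it at once.

```python
import os
from collections import Counter
from multiprocessing import Pool

TERMS = ("whale", "ship", "sea")  # hypothetical search terms


def count_terms(path):
    """Return (path, Counter) with how often each term occurs in one file."""
    counts = Counter()
    with open(path, "r", errors="replace") as inp:
        for line in inp:
            # Case-insensitive match on whitespace-separated words only
            words = line.lower().split()
            for term in TERMS:
                counts[term] += words.count(term)
    return path, counts


def write_counts(results, out_path):
    """Write one line per input file; meant to run in the parent only."""
    with open(out_path, "w") as out:
        for path, counts in results:
            row = " ".join(f"{term}={counts[term]}" for term in TERMS)
            out.write(f"{path} {row}\n")


if __name__ == "__main__":
    in_dir = "input/"  # assumed directory of .txt files
    if os.path.isdir(in_dir):
        paths = sorted(os.path.join(in_dir, n) for n in os.listdir(in_dir))
        paths = [p for p in paths if os.path.isfile(p)]
        with Pool() as pool:
            results = pool.map(count_terms, paths)
        write_counts(results, "counts.txt")  # hypothetical output file
```

Collecting the results in the parent keeps the output file consistent without any locking, at the cost of holding all counters in memory until the end.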

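【Comments】:

If you have a lot of files, I think it will create too many processes. I got these timings: process_files() 16.5930001736, process_files_parallel() 100.887000084
Note that there is also pool.imap (the counterpart of itertools.imap on Python 2) if you're looking for a generator version there too. This is a nice illustration, by the way, good job.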
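To illustrate the pool.imap suggestion, here is a small sketch. The helper names count_lines and total_lines and the mapper parameter are assumptions made for this example, not part of the answer above. Because pool.imap is called the same way as the builtin map, the consuming function can take either one, and with pool.imap the results are handled one at a time as they become available instead of being collected into a full list first.

```python
import os
from multiprocessing import Pool


def count_lines(path):
    """Return (path, number of lines) for one file."""
    with open(path, "r", errors="replace") as inp:
        return path, sum(1 for _ in inp)


def total_lines(paths, mapper=map):
    """Sum line counts, consuming the mapped results one at a time."""
    total = 0
    for _path, n in mapper(count_lines, paths):
        total += n
    return total


if __name__ == "__main__":
    in_dir = "input/"  # assumed directory of .txt files
    if os.path.isdir(in_dir):
        paths = [os.path.join(in_dir, n) for n in os.listdir(in_dir)]
        paths = [p for p in paths if os.path.isfile(p)]
        with Pool() as pool:
            # pool.imap yields results lazily, in the order of the inputs
            print(total_lines(paths, mapper=pool.imap))
```

If the order of results does not matter, pool.imap_unordered can be passed in the same way.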

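【Solution 2】:

Yes, it should be possible to do this in a parallel manner.

However, in Python it is hard to achieve parallelism with multiple threads. For that reason, multiprocessing is the better default choice for doing things in parallel.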
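The usual explanation, which is not spelled out in this answer, is CPython's global interpreter lock (GIL): in standard builds only one thread executes Python bytecode at a time, so threads help with blocking I/O but not with CPU-bound work. The sketch below is one way to check this on your own machine; cpu_bound, timed_map, and the job sizes are hypothetical choices for illustration, and no timings are claimed here.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python busy work: sum of squares of the integers below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, jobs, workers=2):
    """Run cpu_bound over jobs with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(cpu_bound, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [2_000_000] * 4  # arbitrary workload size
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, elapsed = timed_map(cls, jobs)
        print(cls.__name__, round(elapsed, 3))
```

Both executors share the same map interface, so swapping one for the other is a one-word change when measuring a real workload.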

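It is hard to say what kind of speedup you can achieve. That depends on what fraction of the workload can be done in parallel (the more the better) and what fraction has to be done serially (the less the better).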
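This trade-off is commonly formalized as Amdahl's law (the name is not used in the answer itself): if a fraction p of the work can be parallelized across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch, with amdahl_speedup as a hypothetical helper name:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Half of the work parallelizable, two workers
    print(round(amdahl_speedup(0.5, 2), 2))
```

For example, if only half of the work can run in parallel, two workers give at most about a 1.33x speedup, and no number of workers can push it past 2x.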

【Comments】:

"However, in Python it is hard to achieve parallelism with multiple threads": could you give a reference for why? Nice answer, +1