How to achieve faster file I/O in Python? Answers

【Question title】: How to achieve Faster File I/O In Python?
【Posted】: 2019-03-18 09:24:28
【Question description】:

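I have a speed/efficiency question about Python:

I need to extract multiple fields from a nested JSON file (once written to .txt files they have ~64k lines, and the current snippet takes ~9 mins), where each line can contain floats and strings.

Normally I would just put all the data into numpy and save it with np.savetxt()..

I have resorted to simply assembling the lines as strings, but this is rather slow. So far I am doing:

Assemble each line as a string (extracting the desired fields from the JSON)
Write the string to the relevant file

I have several problems with this:

It leads to many separate file.write() calls, which are also slow (around 64k * 8 calls, for 8 files)

So my questions are:

What is a good routine for this kind of problem? One that balances speed vs memory-consumption for the most efficient disk writes.
Should I increase my DEFAULT_BUFFER_SIZE? (It is currently 8192.)

I have checked File I/O in Every Programming Language and python org: IO, but they did not help much (as I understand it, file I/O should already be buffered in Python 3.6.x). I found that my default DEFAULT_BUFFER_SIZE is 8192.

Here is part of my snippet: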

import json
import os
import re

from tqdm import tqdm_notebook

def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result

def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    # It's currently having 8 lines of file.write(), which is probably making it slow as writing to disk is  involving a lot of overheads as well
    features = ['meta_tags__twitter-data1', 'url', 'meta_tags__article-author', 'domain', 'title', 'published__$date',\
                'content', 'meta_tags__twitter-description']
    
    prefix = 'train' if is_train else 'test'
    
    feature_files = [open(os.path.join(path_to_data,'{}_{}.txt'.format(prefix, feat)),'w', encoding='utf-8')
                    for feat in features]
    
    with open(os.path.join(PATH_TO_RAW_DATA, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            for idx, features in enumerate(features):
                json_data = read_json_line(line)  

                content = json_data['meta_tags']["twitter:data1"].replace('\n', ' ').replace('\r', ' ').split()[0]
                feature_files[0].write(content + '\n')

                content = json_data['url'].split('/')[-1].lower()
                feature_files[1].write(content + '\n')

                content = json_data['meta_tags']['article:author'].split('/')[-1].replace('@','').lower()
                feature_files[2].write(content + '\n')

                content = json_data['domain']
                feature_files[3].write(content + '\n')

                content = json_data['title'].replace('\n', ' ').replace('\r', ' ').lower()
                feature_files[4].write(content + '\n')

                content = json_data['published']['$date']
                feature_files[5].write(content + '\n')

                content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
                content = strip_tags(content).lower()
                content = re.sub(r"[^a-zA-Z0-9]", " ", content)
                feature_files[6].write(content + '\n')

                content = json_data['meta_tags']["twitter:description"].replace('\n', ' ').replace('\r', ' ').lower()
                feature_files[7].write(content + '\n')

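【Question discussion】:

Why do you think 8 writes result in 8 physical writes to your hard disk? The file object itself buffers what is to be written. If it decides to hand the data to your OS, your OS may well wait a little before physically writing it, and even then your hard drive may hold the content in its own buffer for a while before it actually writes it...
See How often does python flush to a file?
Thanks Patrick, I will definitely check them. Is there any other way to improve the speed? I agree with your comment, but taking 9 minutes to write 64k lines, each with no more than 30 words (except one), is still slow.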

Tags: python python-3.x performance file-io io

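【Solution 1】:

From the comments:

Why do you think 8 writes result in 8 physical writes to your hard disk? The file object itself buffers what is to be written. If it decides to hand the data to your OS, your OS may well wait a little before physically writing it, and even then your hard drive may hold the content in its own buffer for a while before it actually writes it. See How often does python flush to a file?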
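
To make the buffering point concrete, here is a minimal sketch. It assumes a throwaway temporary directory, made-up line contents, and a hypothetical 1 MiB buffer size; only io.DEFAULT_BUFFER_SIZE and the buffering argument of open() come from the standard library, and write_lines is an illustrative helper, not something from the question.

```python
import io
import os
import tempfile

# Default buffer size used by buffered binary I/O (8192 in the question's setup).
print(io.DEFAULT_BUFFER_SIZE)

def write_lines(path, lines, buffering=-1):
    """Write each line with its own write() call; the file object batches them."""
    # buffering=-1 keeps the default policy; a larger integer requests a bigger buffer.
    with open(path, "w", encoding="utf-8", buffering=buffering) as out:
        for line in lines:
            out.write(line + "\n")  # goes to the in-memory buffer first

# 64k short lines, roughly the scale mentioned in the question (contents are made up).
lines = ["row {}".format(i) for i in range(64_000)]

with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, "default.txt")
    big_path = os.path.join(tmp, "big.txt")
    write_lines(default_path, lines)                     # default buffer
    write_lines(big_path, lines, buffering=1024 * 1024)  # hypothetical 1 MiB buffer
    with open(default_path, encoding="utf-8") as f:
        default_text = f.read()
    with open(big_path, encoding="utf-8") as f:
        big_text = f.read()
```

Wrapping the two write_lines calls in time.perf_counter() measurements is one way to check whether the buffer size makes a measurable difference on a given machine before changing it.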

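You should not use exceptions as control flow, nor recurse where it is not needed. Each recursion sets up a new stack frame for the function call, which costs resources and time, and all of it has to be unwound again afterwards.

The best approach would be to clean the data before feeding it into json.loads() ... the next best is to avoid the recursion ... try something along these lines: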

import json

def read_json_line(line=None):
    while line:  # an empty (or None) line is falsy, avoid endless loop
        try:
            return json.loads(line)
        except json.JSONDecodeError as e:
            # Index of the offending character, as reported by the decoder:
            idx_to_replace = e.pos
            if idx_to_replace >= len(line):
                break  # error at end of input: nothing left to slice away
            # Slice away the offending character:
            line = line[:idx_to_replace] + line[idx_to_replace + 1:]

    return None
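
Independently of the recursion, the question's loop wraps its body in `for idx, features in enumerate(features)`. For the first input line this runs the whole body, including the JSON parse, once per feature, and it rebinds the name `features`, which looks unintended. The following is a minimal sketch of the loop restructured to parse each line once and to close the output files reliably. It is an illustration under assumptions, not the asker's code: EXTRACTORS and split_to_files are hypothetical names, the sample records are made up, and only the domain and title fields from the question are handled.

```python
import json
import os
import tempfile
from contextlib import ExitStack

# Hypothetical subset of the question's fields, each mapped to an extractor.
EXTRACTORS = {
    "domain": lambda rec: rec["domain"],
    "title": lambda rec: rec["title"].replace("\n", " ").replace("\r", " ").lower(),
}

def split_to_files(lines, out_dir, prefix="train"):
    """Parse every line once and write one value per feature file."""
    with ExitStack() as stack:
        # ExitStack closes all output files, even if an exception is raised.
        outputs = {
            name: stack.enter_context(
                open(os.path.join(out_dir, "{}_{}.txt".format(prefix, name)),
                     "w", encoding="utf-8"))
            for name in EXTRACTORS
        }
        for line in lines:
            record = json.loads(line)  # parsed once per input line
            for name, extract in EXTRACTORS.items():
                outputs[name].write(extract(record) + "\n")

# Made-up sample input standing in for the JSON-lines file.
sample = [
    json.dumps({"domain": "example.com", "title": "Hello\nWorld"}),
    json.dumps({"domain": "example.org", "title": "Second Title"}),
]

with tempfile.TemporaryDirectory() as tmp:
    split_to_files(sample, tmp)
    with open(os.path.join(tmp, "train_domain.txt"), encoding="utf-8") as f:
        domains = f.read().splitlines()
    with open(os.path.join(tmp, "train_title.txt"), encoding="utf-8") as f:
        titles = f.read().splitlines()
```

Keeping the extractors in one mapping means each output file is tied to its field by name rather than by a positional index such as feature_files[3].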

【Discussion】: