Python Lazy Loading: Answers

【Question title】: Python Lazy Loading
【Posted】: 2017-01-01 14:16:23
【Question】:

The following code lazily prints the contents of a text file line by line, with each print stopping at '\n'.

   with open('eggs.txt', 'rb') as file:
       for line in file:
           print line

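Is there any way to lazily print the contents of a text file so that each print stops at ', ' instead?

(Or at any other character or string.)

I'm asking because I'm trying to read a file that consists of a single 2.9 GB line of comma-separated values.

P.S. My question is different from this one: Read large text files in Python, line by line without loading it in to memory. I'm asking how to stop at a character other than the newline ('\n').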

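【Question comments】:

@grael That's not related at all.
Wouldn't the split() function do the job just as well?
@TamasHegedus It's lazy because it doesn't load the whole text file into memory at once; it loads only a small part of it at a time (the part you are currently printing). That way, if the file is too large, you can still access its contents without running out of RAM.
@VaibhavBajaj That wouldn't be lazy, would it?
@DhruvPathak This question specifically asks how to stop at a character other than the newline.

Tags: python lazy-loading

【Solution 1】:

The following answer can be considered lazy, since it reads the file one character at a time: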

def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            char = f.read(1)
            if not char:
                print("End of file")
                yield word
                break
            elif char == ',':
                yield word
                word = ""
            else:
                word += char

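You could also read more characters at a time, for example 1000.

【Comments】:

This still loads the whole file into memory, namely into the list wordList.
@SvenMarnach, it loads one character at a time until memory is full, right?
That's a minor issue. You should put this code into a generator function and yield each word you find instead of appending it to a list, so the consumer can work through the pieces iteratively without loading them all into memory at once. The main problem with this approach is that it will be slow.
@SvenMarnach Is this better?
Yes, that's what I meant. You usually don't want a generator like this to print anything, only to yield (and you have an inconsistency, because the last word isn't printed). The other problem with this code is that it is fairly slow, but it is correct and simple.

【Solution 2】:

I don't think there is a built-in way to do this. You have to read the file block by block with file.read(block_size), split each block on commas, and manually rejoin the strings that span block boundaries.

Note that you can still run out of memory if you go a long time without encountering a comma. (The same problem applies to reading a file line by line when it contains a very long line.)

Here is an example implementation: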

def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
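
The question also asks about multi-character separators such as ', '. With block-wise reading, such a separator can straddle a block boundary, so the unfinished tail has to be carried over and split again together with the next block. The following is a minimal sketch of one way to do that, assuming a text-mode file object; the name split_stream and its parameters are illustrative and not part of the original answer.

```python
import io


def split_stream(file, sep=", ", block_size=16384):
    """Lazily yield the pieces of `file` that are separated by `sep`.

    Hypothetical helper for illustration; `sep` may be longer than
    one character.
    """
    if not sep:
        raise ValueError("empty separator")
    buffer = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        buffer += block
        # Every piece before the last separator is complete. The tail is
        # kept in the buffer because it may end in a partial separator.
        *pieces, buffer = buffer.split(sep)
        for piece in pieces:
            yield piece
    # Whatever follows the final separator (possibly an empty string).
    yield buffer


# A tiny block size forces the separator to straddle block boundaries.
demo = io.StringIO("spam, eggs, ham")
print(list(split_stream(demo, block_size=4)))  # ['spam', 'eggs', 'ham']
```

As with the implementation above, memory use is bounded by the longest piece rather than by the file size, so a very long run without a separator can still exhaust memory.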

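【Comments】:

Speed-wise, do you think it would be better if I preprocessed the file like this: g = open(file, "w").next().replace(",", "/n"); g2 = open(file, "w").write(g);g = None and then lazy-loaded it the normal way?
What I'm doing is temporarily loading the file into memory, replacing the commas with "\n", and then setting the variable to None to free the memory, because if the file stays in memory, further operations become slow (thrashing).
@RetroCode Loading the whole file into memory is exactly what you want to avoid. I don't think doing this would improve performance, no. (Side note: to unbind a name, use del name rather than assigning None.)

【Solution 3】:

This yields the file's characters one at a time, so memory is never overloaded.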

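def lazy_read():
    # Yield one character at a time, stopping at the first comma.
    with open('eggs.txt', 'r') as file:
        item = file.read(1)
        while item:
            if item == ',':
                # A bare return ends the generator. Raising StopIteration
                # inside a generator is a RuntimeError since Python 3.7 (PEP 479).
                return
            yield item
            item = file.read(1)

print(''.join(lazy_read()))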

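【Comments】:

What is exit()? If it's only a single character, why are you iterating over line? Also, your indentation is broken.

【Solution 4】:

Read the file through a buffer (Python 3):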

buffer_size = 2**12
delimiter = ','

with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""

    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)

        # end the loop if the end of the file has been reached
        if not chunk:
            break

        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)

        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)

    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')

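Note that, as currently written, this simply prints the contents of the original file unchanged. That is easy to change, though, for example by changing the end=delimiter argument passed to print() in the loop.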
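
As a rough illustration of that kind of change, the sketch below uses the same chunk-and-split loop to copy one stream to another while writing a different delimiter, for instance turning commas into newlines so the result can later be read line by line. The function name rewrite_delimiter and its parameters are assumptions made for this example and are not part of the answer.

```python
import io


def rewrite_delimiter(src, dst, delimiter=",", replacement="\n",
                      buffer_size=2**12):
    """Copy src to dst, writing `replacement` wherever `delimiter` occurred.

    Hypothetical helper; src and dst are text-mode file-like objects.
    """
    remaining = ""
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            break
        # Same splitting step as in the answer above.
        *items, remaining = (remaining + chunk).split(delimiter)
        for item in items:
            dst.write(item + replacement)
    # The text after the last delimiter gets no trailing replacement.
    dst.write(remaining)


src = io.StringIO("spam,eggs,ham")
dst = io.StringIO()
rewrite_delimiter(src, dst, buffer_size=4)
print(dst.getvalue())
```

With real files, src and dst would be two separate files opened in text mode, so only one buffer's worth of data plus the current item is held in memory at a time.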

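【Comments】:

f.read() is buffered anyway unless you disable buffering, so there is no need to do it again.
@dhke Reading character by character with f.read(1) will be very slow because of Python's function call overhead, so you should definitely read a larger buffer at a time. This also reduces the number of times you need to call str.split().
@dhke text = f.read() would read the entire file contents into memory, and so would f.read().split(','). The buffering you mention happens at a lower level; code using f.read() would need to be written carefully to take advantage of it, and it is not clear how to do that in a way that achieves what the question asks for.
@SvenMarnach That's a different problem, yes, but as implemented here the data is buffered twice.
A minor problem with this approach (which also affects my own code) is that it does not print the last item if that item is the empty string.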

【解决方案5】：

with open('eggs.txt', 'r') as file:
    for line in file:
        # Split each line on the ', ' separator and print the pieces.
        words = line.split(', ')
        for word in words:
            print(word)

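I'm not entirely sure I understand what you're asking. Do you mean something like this?

【Comments】:

This won't work, because he can't read a single 2.9 GB line with for line in file. Please read the comments.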