Decompress streaming BZ2 from memory in Python

[Question title]: Decompress streaming BZ2 from memory in Python
[Posted]: 2015-05-09 16:01:54
[Question]:

I have a blob of bz2-compressed CSV data in memory

compressed = load_from_network_service(...)

I would like to iterate over the decompressed stream of lines.

for line in bz2_decompress_stream(compressed):
    ...

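Does such a function exist?

In principle I could write to disk and then read the data back in with bz2.BZ2File, which seems to only want to consume a filename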

with open('tmp', 'wb') as f:  # binary mode, since compressed is bytes
    f.write(compressed)
with bz2.BZ2File('tmp') as f:
    for line in f:
        ...

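However, disk I/O is at a premium for my current application, so this is painful.

Presumably the bz2.BZ2Decompressor object might help here. My experience is that I give it my compressed data and it gives me back the entire decompressed result; it does not seem to stream. Perhaps this is a limitation of my data?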

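[Comments]:

"filename" is just misleading; you can also give it a file object. From the docs: "If filename is a str or bytes object, open the named file directly. Otherwise, filename should be a file object, which will be used to read or write the compressed data."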
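
A minimal sketch of what this comment describes, assuming Python 3. The sample data is made up and stands in for whatever load_from_network_service(...) returns. The compressed bytes are wrapped in io.BytesIO and handed to bz2.BZ2File, so no temporary file is involved:

```python
import bz2
import io

# Made-up sample standing in for the bytes from load_from_network_service(...)
compressed = bz2.compress(b"a,b,c\n1,2,3\n4,5,6\n")

# BZ2File accepts a file object in place of a filename
with bz2.BZ2File(io.BytesIO(compressed)) as f:
    # Each line comes back as bytes, trailing newline included
    lines = [line for line in f]

print(lines)
```

Iterating over the BZ2File object reads and decompresses as it goes, rather than materializing the whole decompressed payload first.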

Tags: python compression bzip2

[Solution 1]:

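There are two distinct problems:

1. Streaming
2. Not writing to disk

To solve 2., you are right, you can use bz2.BZ2Decompressor. But the solution to 1. depends entirely on what your first line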

compressed = load_from_network_service(...)

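really returns. If compressed is a string, there is not much you can do: you have to wait until all of it has been retrieved, and then decompress it. If instead it is an incrementally "filled" StringIO, then you can do something like this (untested):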

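import bz2

decompressor = bz2.BZ2Decompressor()
decompressed = b''  # holds the trailing, possibly incomplete, line
while not decompressor.eof:
    compressed_chunk = compressed.read(100)
    if not compressed_chunk:
        # Reached EOF
        break
    # Can be empty (even before the stream is exhausted):
    decompressed_chunk = decompressor.decompress(compressed_chunk)
    if decompressed_chunk:
        decompressed += decompressed_chunk
        # split(b'\n') leaves an unfinished last line as the final element
        new_lines = decompressed.split(b'\n')
        decompressed = new_lines[-1]
        for line in new_lines[:-1]:
            do_something(line)
if decompressed:
    # Last line had no trailing newline
    do_something(decompressed)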

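[Comments]:

I do receive a complete compressed string. I would like to stream the decompression, which still has value even when the full original input is in memory. It sounds like I should manually feed the data into the decompressor in pieces, and it will handle retaining blocks for me.
Yes, in that case putting compressed = StringIO(compressed) before my solution should be enough (on Python 3, use io.BytesIO(compressed) instead, since the data is bytes).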
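
Putting the answer and this last comment together, here is a sketch of the generator the question asks for, assuming Python 3 and a single bz2 stream held fully in memory. The name bz2_decompress_stream is taken from the question; the chunk_size parameter and the sample data are illustrative assumptions, not part of any library.

```python
import bz2
import io


def bz2_decompress_stream(compressed, chunk_size=100):
    """Yield decompressed lines (bytes, newline stripped) from in-memory bz2 data."""
    source = io.BytesIO(compressed)
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # partial line carried over between chunks
    while not decompressor.eof:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # input exhausted before end of stream (e.g. truncated data)
        # decompress() may return b"" until enough input has been fed
        pieces = (pending + decompressor.decompress(chunk)).split(b"\n")
        pending = pieces.pop()  # last piece may be an incomplete line
        for line in pieces:
            yield line
    if pending:
        yield pending  # final line without a trailing newline


# Made-up sample standing in for load_from_network_service(...)
sample = bz2.compress(b"a,b\n1,2\n3,4")
result = list(bz2_decompress_stream(sample))
print(result)
```

Because bzip2 compresses in blocks of 100 kB to 900 kB, a small input like this sample will typically come out of the decompressor all at once; incremental output only becomes visible on data spanning several blocks.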