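Efficient numpy.fromfile on zipped files? (Answers)

【Question title】: Efficient numpy.fromfile on zipped files?
【Posted】: 2013-04-04 16:25:16
【Question】:

I have some very large files (around 10 GB even when compressed). Each contains an ASCII header followed by what are, in principle, numpy.recarrays of about 3 MB each; let's call them "events". My first approach looked like this: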

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # fixed-length ASCII header
event_dtype = np.dtype([
        ('Id', '>u4'),                # simplified
        ('UnixTimeUTC', '>u4', 2),
        ('Data', '>i2', (1600, 1024)),
        ])
event = np.fromfile(f, dtype=event_dtype, count=1)  # fails: f is not a real file

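However, this does not work, because np.fromfile needs a real FILE object, since it makes low-level calls (I found a fairly old ticket: https://github.com/numpy/numpy/issues/1103).

As far as I understand, I have to do this instead: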

s = f.read( event_dtype.itemsize )
event = np.fromstring(s, dtype=event_dtype, count=1)

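Yes, it works! But isn't this terribly inefficient? Isn't memory allocated for s, and then garbage collected, for every single event? On my laptop I reach 16 events/s, i.e. ~50 MB/s.

I wonder whether anybody knows a clever way to allocate the memory once and then let numpy read directly into that memory.

By the way, I'm a physicist, so... still a newbie in this business.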

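【Comments】:

The I/O takes thousands of times longer than allocating and deallocating that string. You should profile the code to see where the bottleneck is, and then optimize that... guessing where the bottleneck is is a bad idea, and an even worse one if you aren't used to writing efficient code.
As long as you're fine with a read-only array, you can use numpy.frombuffer to avoid duplicating memory and simply use the string as the memory buffer.
@Bakuriu Thanks for putting that so clearly. I have no experience with profiling code. Good to hear that this kind of guessing is bad.
@Joe Kington Thanks for the clear example! I'll go for it.

Tags: python numpy zip fromfile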

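【Solution 1】:

@Bakuriu is probably right that this is a micro-optimization. Your bottleneck is almost certainly I/O, followed by decompression. Allocating the memory twice probably isn't significant.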
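
If you would rather measure than guess, a rough sketch like the following times the two steps separately. The helper name time_read_vs_convert and the tiny in-memory test file are assumptions made for illustration, and np.frombuffer(...).copy() stands in for the copying np.fromstring call from the question. Real numbers will depend entirely on your files and hardware.

```python
import gzip
import io
import time

import numpy as np


def time_read_vs_convert(fileobj, dtype, max_events):
    """Hypothetical helper: time read/decompress vs. array creation."""
    read_s = convert_s = 0.0
    n_events = 0
    for _ in range(max_events):
        t0 = time.perf_counter()
        s = fileobj.read(dtype.itemsize)   # I/O plus decompression
        t1 = time.perf_counter()
        if len(s) < dtype.itemsize:        # EOF or truncated record
            break
        np.frombuffer(s, dtype=dtype, count=1).copy()  # allocation plus copy
        t2 = time.perf_counter()
        read_s += t1 - t0
        convert_s += t2 - t1
        n_events += 1
    return n_events, read_s, convert_s


# Made-up test data: a 16-byte header followed by three small records.
demo_dtype = np.dtype([("Id", ">u4"), ("Data", ">i2", (4, 8))])
records = np.zeros(3, dtype=demo_dtype)
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode="wb") as out:
    out.write(b"#" * 16)
    out.write(records.tobytes())
payload.seek(0)

with gzip.GzipFile(fileobj=payload) as f:
    f.read(16)  # skip the header
    n_events, read_s, convert_s = time_read_vs_convert(f, demo_dtype, 10)
print(n_events, "events: read", read_s, "s, convert", convert_s, "s")
```

On a real file you would pass your own event_dtype and compare the two totals before deciding what to optimize.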

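However, if you want to avoid the extra memory allocation, you can use numpy.frombuffer to view the string as a numpy array.

This avoids duplicating memory (the string and the array share the same memory buffer), but the array will be read-only by default. You can then change it to allow writing if need be, although current NumPy versions only permit that when the underlying buffer is itself writable (a bytearray, for example, rather than an immutable bytes object).

In your case, just replace fromstring with frombuffer: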

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # fixed-length ASCII header
event_dtype = np.dtype([
        ('Id', '>u4'),                # simplified
        ('UnixTimeUTC', '>u4', 2),
        ('Data', '>i2', (1600, 1024)),
        ])
s = f.read(event_dtype.itemsize)  # raw bytes of one event
event = np.frombuffer(s, dtype=event_dtype, count=1)  # read-only view of s, no copy
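
To go through every event in the file rather than just the first, the same two lines can be wrapped in a loop. This is only a sketch: iter_events is a hypothetical helper name, and the small dtype and in-memory gzip file are made up so the example runs on its own.

```python
import gzip
import io

import numpy as np


def iter_events(fileobj, dtype, header_size):
    """Hypothetical helper: yield read-only one-record arrays until EOF."""
    fileobj.read(header_size)                # skip the fixed-length header
    while True:
        s = fileobj.read(dtype.itemsize)
        if len(s) < dtype.itemsize:          # EOF or truncated last record
            return
        yield np.frombuffer(s, dtype=dtype, count=1)  # view of s, no copy


# Made-up test data: a 16-byte header followed by three small records.
demo_dtype = np.dtype([("Id", ">u4"), ("Data", ">i2", (2, 3))])
records = np.zeros(3, dtype=demo_dtype)
records["Id"] = [1, 2, 3]
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode="wb") as out:
    out.write(b"#" * 16)
    out.write(records.tobytes())
payload.seek(0)

with gzip.GzipFile(fileobj=payload) as f:
    events = list(iter_events(f, demo_dtype, header_size=16))
ids = [int(ev["Id"][0]) for ev in events]
print(ids)
```

Each yielded array keeps its own bytes object alive, so the events stay valid after the loop moves on, but they are read-only.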

Just to prove that memory isn't duplicated with this approach:

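import numpy as np

# A bytearray is a mutable buffer, so the resulting array is writable.
# (With an immutable str/bytes object the array is read-only, and current
# NumPy refuses to set y.flags.writeable = True on it.)
x = bytearray(b"hello")
y = np.frombuffer(x, dtype=np.uint8)

# Prove that we're using the same memory
y[0] = 121  # ord("y")
print(x.decode())  # <-- Notice that we're changing y and printing x...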

This yields: yello instead of hello.

Whether or not it's a significant optimization in this particular case, it's a useful approach to be aware of.
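
As for the original wish to allocate the memory once and read straight into it, one possible sketch uses the readinto method that Python 3 file-like objects (including gzip.GzipFile) provide. This is not part of the approach above: read_events is a hypothetical helper, and the small dtype and in-memory file are made up for the example. Whether it is actually faster than frombuffer is something to measure, not assume.

```python
import gzip
import io

import numpy as np


def read_events(fileobj, dtype, header_size):
    """Hypothetical helper: refill one preallocated record for each event."""
    fileobj.read(header_size)                # skip the fixed-length header
    event = np.empty(1, dtype=dtype)         # allocated once, reused
    raw = memoryview(event.view(np.uint8))   # byte view of the same memory
    while True:
        filled = 0
        while filled < len(raw):             # tolerate short reads
            n = fileobj.readinto(raw[filled:])
            if not n:
                break
            filled += n
        if filled < len(raw):                # EOF or truncated last record
            return
        yield event                          # overwritten on the next step


# Made-up test data: a 16-byte header followed by three small records.
demo_dtype = np.dtype([("Id", ">u4"), ("Data", ">i2", (2, 3))])
records = np.zeros(3, dtype=demo_dtype)
records["Id"] = [10, 11, 12]
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode="wb") as out:
    out.write(b"#" * 16)
    out.write(records.tobytes())
payload.seek(0)

with gzip.GzipFile(fileobj=payload) as f:
    ids = [int(ev["Id"][0]) for ev in read_events(f, demo_dtype, header_size=16)]
print(ids)
```

The yielded array is writable, but it is the same object every time, so copy anything you need to keep before asking for the next event.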

【Comments】:

Big +1 for frombuffer! I had been trying to use fromfile on a zipped file for a while, and this was the key.