Compress/Zip numpy arrays in memory: answers

【Question title】: Compress/Zip numpy arrays in Memory
【Posted】: 2016-12-26 10:25:21
【Question】:

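My memory is too small for my data, so I tried packing it in memory.

The following code does work, but I have to remember the type of the data, which is awkward (there are lots of different data types).

Any better suggestions? A shorter running time would also be appreciated.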

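import numpy as np
import zlib

A = np.arange(10000)
dtype = A.dtype    # has to be remembered separately to restore the array

B = zlib.compress(A, 1)
# np.frombuffer replaces np.fromstring, which is deprecated for binary data
C = np.frombuffer(zlib.decompress(B), dtype)
np.testing.assert_allclose(A, C)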

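【Comments on the question】:

You may want to use the blosc package instead of Python's zlib and bz2 implementations for a significant speedup.
The speedup from blosc really is impressive, and the compression ratio is good too. You helped me a lot.
Good to know :). Some further pointers: blosc.set_nthreads(6). compr_arr = blosc.pack_array(numpy_arr); numpy_arr = blosc.unpack_array(compr_arr) preserves shape and dtype internally.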

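Tags: python numpy memory zip

【Solution 1】:

You could try numpy's built-in array compressor np.savez_compressed(). It saves you the trouble of keeping track of the data type, but will probably give performance similar to your method. Here is an example: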

import io
import numpy as np

A = np.arange(10000)
compressed_array = io.BytesIO()    # np.savez_compressed() requires a file-like object to write to
np.savez_compressed(compressed_array, A)

# load it back
compressed_array.seek(0)    # seek back to the beginning of the file-like object
decompressed_array = np.load(compressed_array)['arr_0']

>>> print(len(compressed_array.getvalue()))    # compressed array size
15364
>>> assert A.dtype == decompressed_array.dtype
>>> assert all(A == decompressed_array)

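Note that any size reduction depends on the distribution of the data. Random data is essentially incompressible, so you may not see much benefit from trying to compress it.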
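To make the point about data distribution concrete, here is a minimal sketch that compares the zlib ratio of a regular array with that of random integers. The helper name `ratio` and the chosen array sizes are assumptions for illustration only; the exact numbers will differ with your data and compression level.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Highly structured values versus nearly uniform random bits
regular = np.arange(10000, dtype=np.int64)
noise = rng.integers(0, 2**62, size=10000, dtype=np.int64)

def ratio(array, level=1):
    """Hypothetical helper: raw size divided by zlib-compressed size."""
    raw = array.tobytes()
    return len(raw) / len(zlib.compress(raw, level))

ratio_regular = ratio(regular)
ratio_noise = ratio(noise)
print(f"arange: {ratio_regular:.2f}x, random: {ratio_noise:.2f}x")
```

If the ratio on a sample of your own data is close to 1, compressing in memory mostly costs CPU time without freeing much RAM.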

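【Comments】:

The "file-like" object is interesting, but packing is 10 times slower. The data is compressible and not very noisy; I see average ratios of about 8 to 10.
@Okapi575 Yes, now that I have tested it with timeit, I can confirm that np.savez_compressed() is about 10 times slower. The only benefit is that the data type is saved automatically, but it would be easy to write a class that wraps zlib compression and decompression and stores the data type.
@Okapi575: I also tried bz2, but that was also much slower than zlib, even though it is a more efficient compressor.
bz2 is more efficient in my example and is probably still faster than writing things to disk. Good to know.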
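The wrapper class mentioned in the comments could look roughly like the sketch below. The class name `CompressedArray` and its attributes are hypothetical, not something from the thread. It assumes plain numeric dtypes (not object arrays) and also stores the shape, which the original zlib snippet loses.

```python
import zlib
import numpy as np

class CompressedArray:
    """Hypothetical helper: a zlib-compressed array plus the metadata needed to restore it."""

    def __init__(self, array, level=1):
        array = np.asarray(array)
        self.dtype = array.dtype
        self.shape = array.shape
        # tobytes() makes a temporary C-ordered copy of the raw data
        self.data = zlib.compress(array.tobytes(), level)

    def decompress(self):
        raw = zlib.decompress(self.data)
        # frombuffer returns a read-only view on the bytes; copy() makes it writable
        return np.frombuffer(raw, dtype=self.dtype).reshape(self.shape).copy()

original = np.arange(10000, dtype=np.float32).reshape(100, 100)
packed = CompressedArray(original)
restored = packed.decompress()
print(len(packed.data), original.nbytes)
```

Dropping the trailing copy() saves one allocation if a read-only result is acceptable.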

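【Solution 2】:

I want to post my final code in case it helps anyone. It can compress in RAM using different packing algorithms or, if there is not enough RAM, store the data in an hdf5 file. Any suggestions for speeding it up or improving the code are appreciated.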

import zlib,bz2
import numpy as np
import h5py
import os

class packdataclass():
    def __init__(self,packalg='nocompress',Filename=None):
        self.packalg=packalg
        if self.packalg=='hdf5_on_drive':
            self.Filename=Filename
            self.Running_Number=0
            if os.path.isfile(Filename):
                os.remove(Filename)
            with h5py.File(self.Filename,'w') as hdf5_file:
                hdf5_file.create_dataset("TMP_File", data="0")

    def clean_up(self):
        if self.packalg=='hdf5_on_drive':
            if os.path.isfile(self.Filename):
                os.remove(self.Filename)

    def compress (self, array):
        Returndict={'compression':self.packalg,'type':array.dtype}
        if array.dtype==np.bool_:    # np.bool_ is available in every NumPy version, unlike the np.bool alias
            Returndict['len_bool_array']=len(array)
            array=np.packbits(array.astype(np.uint8)) # pack 8 booleans into each uint8
            Returndict['type']='bitfield'
        if self.packalg == 'nocompress':
            Returndict['data'] = array

        elif self.packalg == 'zlib':
            Returndict['data'] = zlib.compress(array,1)

        elif self.packalg == 'bz2':
            Returndict['data'] = bz2.compress(array,1)
        elif self.packalg == 'hdf5_on_drive':
            with h5py.File(self.Filename,'r+') as hdf5_file:
                datatype=array.dtype
                Returndict['data']=str(self.Running_Number)
                hdf5_file.create_dataset(Returndict['data'], data=array, dtype=datatype, compression='gzip',compression_opts=4)
            self.Running_Number+=1

        else:
            raise ValueError("Algorithm for packing {} is unknown".format(self.packalg))

        return(Returndict)

    def decompress (self, data):

        if data['compression'] == 'nocompress':
            data_decompressed=data['data']
        else:
            if data['compression'] == 'zlib':
                data_decompressed = zlib.decompress(data['data'])

            elif data['compression'] == 'bz2':
                data_decompressed = bz2.decompress(data['data'])
            elif data['compression'] == 'hdf5_on_drive':
                with h5py.File(self.Filename, "r") as Readfile:
                    data_decompressed=np.array(Readfile[data['data']])
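            else:
                raise ValueError("Algorithm for unpacking {} is unknown".format(data['compression']))
            # np.frombuffer replaces np.fromstring, which is deprecated for binary data
            if isinstance(data['type'], str) and data['type']=='bitfield':
                data_decompressed =np.frombuffer(data_decompressed, np.uint8)
            else:
                data_decompressed =np.frombuffer(data_decompressed, data['type'])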

        if isinstance(data['type'], str) and data['type']=='bitfield':
            return np.unpackbits(data_decompressed).astype(bool)[:data['len_bool_array']]
        else:
            return(data_decompressed)
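The boolean branch of the class relies on np.packbits and np.unpackbits, and on remembering the original length because packing pads the last byte with zero bits. A minimal standalone sketch of that round trip, with a made-up `flags` array:

```python
import numpy as np

flags = np.array([True, False, True, True, False,
                  False, True, False, True, True])

# 10 booleans are packed into 2 bytes; the last byte is padded with zero bits
packed = np.packbits(flags)

# Unpack, drop the padding bits, and restore the boolean dtype
restored = np.unpackbits(packed)[:len(flags)].astype(bool)

print(flags.nbytes, packed.nbytes)
```

This alone shrinks a boolean array to about one eighth of its size before any zlib or bz2 pass.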

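【Comments】:

【Solution 3】:

You could try bcolz, which I found while googling for an answer to a similar problem: https://bcolz.readthedocs.io/en/latest/intro.html

It is an additional layer on top of numpy arrays that organizes the compression for you.

【Comments】:

It's a pity this project is rarely updated anymore.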