hdf to ndarray in numpy - fast way

【Question title】: hdf to ndarray in numpy - fast way
【Posted】: 2017-03-30 21:49:04
【Question description】:

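I am looking for a fast way to turn my collection of hdf files into a numpy array in which each row is a flattened version of an image. What I mean exactly:

Besides other information, my hdf files store images per frame. Each file holds 51 frames of 512x424 images. I now have more than 300 hdf files, and I want the image pixels stored as one single vector per frame, with all frames of all images stored in one numpy ndarray. The following picture should help to understand:

What I have so far is a very slow method, and I actually have no idea how to make it faster. The problem, as far as I can tell, is that my final array is accessed too often: I observe that the first files are loaded into the array very quickly, but the speed then drops fast (observed by printing the number of the current hdf file).

My current code: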

import glob
import os

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")

# predefine first row to use vstack later
numpy_data = np.ndarray((1,217088))

# search for all .hdf files
for idx, file in enumerate(glob.glob("*.hdf5")):
  f = h5py.File(file, 'r')
  # load all img data to imgs (=ndarray, but not flattened)
  imgs = f['img']['data'][:]

  # iterate over all frames (51)
  for frame in range(0, imgs.shape[0]):
    print("processing {}/{} (file/frame)".format(idx+1,frame+1))
    data = np.array(imgs[frame].flatten())
    numpy_data = np.vstack((numpy_data, data))

    # delete the first (uninitialized) row once a real row is stored
    if idx == 0 and frame == 0:
        numpy_data = np.delete(numpy_data, 0,0)

  f.close()

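For additional information: I need this for learning a decision tree. Since my hdf files are bigger than my RAM, I think converting them into a numpy array saves memory and is therefore better suited.

Thanks for every input.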

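【Discussion】:

Does your algorithm need more than one frame at a time? I would guess the slowdown comes from all the calls to vstack, and you might not need to do anything like that.
Also, I'm not sure what is going on with the if idx == 0 and frame == 0: condition. I'd imagine you just get a 0x217088 element array out of it.
Unfortunately, I will use a random forest, which uses the whole feature space. Maybe there is another way to feed the data to scikit-learn, but I don't know of one.
@Elliot The line you mention deletes the first, randomly initialized row.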

Tags: python numpy hdf5 h5py

【Solution 1】:

I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description correctly, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should give the 2d equivalent, with shape (51, 217088).
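As a quick check of the reshape idea, here is a minimal sketch that uses a small synthetic array in place of the real f['img']['data'][:]. The reduced shape (5, 4, 3) is an assumption made only to keep the example light. It suggests that one reshape call yields the same rows as flattening each frame separately.

```python
import numpy as np

# Stand-in for imgs = f['img']['data'][:]; the real shape would be (51, 512, 424).
imgs = np.arange(5 * 4 * 3, dtype=np.uint8).reshape(5, 4, 3)

# One reshape for the whole file; -1 lets numpy work out 4*3 on its own.
flat = imgs.reshape(imgs.shape[0], -1)

# Reference result: flatten frame by frame, as the loop in the question does.
per_frame = np.array([frame.flatten() for frame in imgs])

print(flat.shape)                        # (5, 12)
print(np.array_equal(flat, per_frame))   # True
```

Using -1 for the second dimension avoids hard-coding 217088, which helps if the image size ever changes.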

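If you must loop, don't use vstack (or some variant) to build a bigger array step by step. One, it is slow, and two, it is a pain to clean up the initial "dummy" entry. Use list appends, and do the stacking once, at the end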

alist = []
for frame in imgs:
    # flatten each 2d frame and collect it; no array is rebuilt here
    alist.append(frame.flatten())
data_array = np.vstack(alist)

vstack (and family) takes a list of arrays as input, so it can work with many at once. List append is much faster when done iteratively.
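To illustrate the difference, here is a small sketch with synthetic frames. The sizes are assumptions chosen so that it runs quickly. Both loops build the same array, but the first one copies the whole growing result on every iteration, while the second copies each frame only once.

```python
import numpy as np

n_frames, n_pixels = 200, 1000   # illustrative sizes, not the real 51 x 217088
frames = [np.full(n_pixels, i, dtype=np.uint8) for i in range(n_frames)]

# Slow pattern: every vstack allocates a new array and copies all previous rows.
grown = np.empty((0, n_pixels), dtype=np.uint8)   # zero rows, so no dummy row to delete
for frame in frames:
    grown = np.vstack((grown, frame))

# Faster pattern: collect in a list, stack once at the end.
collected = []
for frame in frames:
    collected.append(frame)
stacked = np.vstack(collected)

print(stacked.shape)                    # (200, 1000)
print(np.array_equal(grown, stacked))   # True
```

Starting from an array with zero rows, as in the slow variant here, would also remove the need for the np.delete step in the question, though it does not fix the repeated copying.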

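I doubt that putting everything into one array will help. I don't know exactly how the size of an hdf5 file relates to the size of the loaded array, but I expect they are of the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G pixels?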
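The estimate can be worked out from the numbers given in the question. The sketch below assumes exactly 300 files (the question says "more than 300"), and the two dtypes are assumptions, since the question does not state how the pixels are stored.

```python
n_files = 300
frames_per_file = 51
pixels_per_frame = 512 * 424          # 217088

total_pixels = n_files * frames_per_file * pixels_per_frame
print(total_pixels)                   # 3321446400, about 3.3e9

# Memory needed depends on the dtype of the final array.
for name, bytes_per_pixel in (("uint8", 1), ("float32", 4)):
    gib = total_pixels * bytes_per_pixel / 2**30
    print(f"{name}: {gib:.1f} GiB")   # uint8: 3.1 GiB, float32: 12.4 GiB
```

So the array may fit in RAM as uint8 on many machines, while a float32 copy of the same data is four times larger.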

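For a single file, h5py has provisions for loading chunks of an array that is too large to fit in memory. That suggests the problem usually runs the other way: the file holds more than fits in memory.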

Is it possible to load large data directly into numpy int8 array using h5py?
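The partial loading mentioned above can be sketched as follows. The example first writes a tiny stand-in file so that it is self-contained. The img/data layout mirrors the question, while the file name and the small shape (6, 4, 3) are assumptions for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file with the same group/dataset names as in the question.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("img/data", data=np.arange(72, dtype=np.uint8).reshape(6, 4, 3))

with h5py.File(path, "r") as f:
    dset = f["img"]["data"]      # a dataset handle only; no pixel data is read yet
    first_two = dset[0:2]        # reads just frames 0 and 1 into a numpy array
    print(dset.shape)            # (6, 4, 3)
    print(first_two.shape)       # (2, 4, 3)
```

Indexing with [:] reads the whole dataset, while a slice such as [0:2] reads only the requested frames, which is what makes it possible to work through a file that is larger than RAM.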

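【Discussion】:

【Solution 2】:

Do you really want to load all images into RAM rather than use a single HDF5 file instead? Accessing an HDF5 file can be quite fast if you don't make any mistakes (unnecessary fancy indexing, improper chunk cache size). If you want the numpy way, this would be a possibility: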

import glob
import os

import h5py
import numpy as np

os.chdir(os.path.join(os.getcwd(), "datasets"))
img_per_file = 51

# get all HDF5 files
files = glob.glob("*.hdf5")

# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data = np.empty((len(files)*img_per_file, 217088), dtype=np.uint8)

# Now read all the data
ii = 0
for i in range(0, len(files)):
    # open file i (the original opened files[0] on every pass)
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

Writing the data to a single HDF5 file would be quite similar:

f_out = h5py.File(File_Name_HDF5_out, 'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files)*img_per_file, 217088), chunks=(1, 217088), dtype='uint8')

# Now read all the data
ii = 0
for i in range(0, len(files)):
    # open file i (the original opened files[0] on every pass)
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

f_out.close()

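If you only want to access whole images afterwards, the chunk size should be okay. If not, you have to change it to fit your needs.

What you should do when accessing an HDF5 file:

Use a chunk size that fits your needs.
Set a proper chunk cache size. This can be done with the h5py low-level api or with h5py_cache. https://pypi.python.org/pypi/h5py-cache/1.0

Avoid any kind of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions.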

# Chunk size is [50,50] and we iterate over the first dimension
numpyArray=h5_dset[i,:] #slow
numpyArray=np.squeeze(h5_dset[i:i+1,:]) #does the same but is much faster
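The recommendations above can be tried out on a small file. The sketch below assumes h5py 2.9 or newer, where, as far as I know, h5py.File accepts the chunk cache size directly through the rdcc_nbytes keyword, so the separate h5py_cache package may not be needed. The file name, dataset name, shape and the 64 MiB cache size are all assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "chunked.hdf5")

# Write a small chunked dataset with one chunk per row (one "image" per chunk).
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(40, dtype=np.uint8).reshape(8, 5),
                     chunks=(1, 5))

# Reopen with a 64 MiB chunk cache (arbitrary value, tune it to your chunk size).
with h5py.File(path, "r", rdcc_nbytes=64 * 1024**2) as f:
    dset = f["data"]
    print(dset.chunks)                  # (1, 5)
    row = np.squeeze(dset[3:4, :])      # slice access, then drop the length-1 axis
    print(row)                          # [15 16 17 18 19]
```

Whether the slice form is actually faster than dset[3, :] depends on the h5py version and the chunk layout, so it is worth timing both on your own data.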

Edit: This shows how to read your data into a memmapped numpy array. I think your method expects data of format np.float32. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files)*img_per_file, 217088))

Everything else can be kept the same. If it works, I would also recommend using an SSD instead of a hard disk.
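A minimal end-to-end sketch of the memmap approach, using synthetic blocks in place of the data read from each HDF5 file. The tiny sizes and the file name are assumptions. Note that np.memmap writes a raw binary file without a header, so the dtype and shape have to be supplied again when the file is reopened.

```python
import os
import tempfile

import numpy as np

n_files, img_per_file, n_pixels = 3, 4, 6     # tiny illustrative sizes
path = os.path.join(tempfile.mkdtemp(), "Your_Data.dat")
shape = (n_files * img_per_file, n_pixels)

# Create the disk-backed array and fill it block by block.
data = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
for i in range(n_files):
    # Stand-in for imgs = f['img']['data'][:] read from file i.
    imgs = np.full((img_per_file, 2, 3), i, dtype=np.uint8)
    start = i * img_per_file
    data[start:start + img_per_file, :] = imgs.reshape(img_per_file, n_pixels)
data.flush()          # make sure everything is written to disk
del data              # release the mapping before reopening

# Later: reopen read-only with the same dtype and shape.
readonly = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
print(readonly.shape)       # (12, 6)
print(readonly[4, 0])       # 1.0
```

Only the rows that are actually touched get paged into RAM, which is why this can work for arrays larger than memory, as long as the consuming code does not make a full copy.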

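【Discussion】:

I will process this data with random forests/decision trees, and as far as I have found out, these methods need the whole data at once. That is why I think I can't use the chunked version. Or do I misunderstand how chunking hdf files works?
OK, did my first suggestion (simply reading the data into a numpy array) work for you?
It works very well for the purpose I asked about. But I don't know how to feed my data to the learning algorithm (decision tree). It reduces my dataset from 26GB to ~3GB in numpy binary format. Since this is only a subset of my actual dataset, which is about 20 times bigger, I don't know how to handle it without running out of memory.
Are you using this method? scikit-learn.org/stable/modules/generated/… It takes an array-like matrix. Maybe it accepts a memmapped numpy array or a dask array, and hopefully it won't copy most of the data internally.
I will edit my answer to show how to create an np.float32 memmapped array.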