h5py read time has random and harsh fluctuations in read speed

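【Question title】: h5py read time has random and harsh fluctuations in read speed
【Posted】: 2020-04-20 15:19:29
【Question】:

I am using h5py to read preprocessed data, which I then feed into a convolutional neural network. All input images are the same size. I am using the following read/write syntax: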

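import h5py

# Read: open read-only and copy the whole dataset into a NumPy array
with h5py.File(data_path, 'r') as x:
    numpy_array = x['key'][:]

# Write: 'a' appends to an existing file or creates a new one
x = h5py.File(data_path, 'a')
x.create_dataset('key', data=numpy_array)
x.close()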

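My dataset has about 500 samples. For some strange reason, during the first N iterations of training (N seems to vary), where each iteration reads from the HDF5 file, I see load times like this:

load data time: 0.10571813583374023

Then, suddenly, at iteration N+1, loading the data starts taking much longer:

load data time: 1.5208463668823242

Any ideas what could be causing this? Once the performance shift happens, it never goes back. Given that all the files are the same size, this makes no sense to me. Even after I go through all the samples and wrap around to the beginning, the files that originally read quickly now take longer to load.

Edit: here is the exact code sample using the h5py.File() as x syntax, along with example output showing the behavior.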


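import time

import h5py

# num_samples, camera_sensors, camera_prop and num_labels are assumed to be
# defined elsewhere in the asker's code.
def train(points_h5f, img_h5f, labels_h5f):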
    '''
    Populating dictionaries used by external libraries later on in code
    '''

    for i in range(num_samples):
        a = time.time()

        # Load points
        points = {}
        points['dict_key'] = {'points':points_h5f['points/point_{}'.format(i)][:]}

        # Load images
        images = {}
        for cam in camera_sensors:
            prop_d = {}
            for prop in camera_prop:
                prop_d[prop] = img_h5f['{}/{}/{}_{}'.format(cam,prop,prop,i)][:]                  
            images[cam] = prop_d

        # Load labels
        labels = []
        for j in range(num_labels):
            labels.append(labels_h5f['label_groups/label_{}_{}'.format(i,j)][:])

        b = time.time()

        print('Iteration: {} \nload data time: {}\n'.format(i, b-a))

with h5py.File('path/all_points.hdf5', 'r') as points_h5f:
    with h5py.File('path/all_images.hdf5', 'r') as img_h5f:
        with h5py.File('path/all_labels.hdf5', 'r') as labels_h5f:
            train(points_h5f, img_h5f, labels_h5f)


Output:

Iteration: 0
load data time: 0.09873628616333008

Iteration: 1
load data time: 0.09973263740539551

Iteration: 2
load data time: 0.09973430633544922

Iteration: 3
load data time: 0.1057431697845459
.
.
.

Iteration: 125
load data time: 0.09771347045898438

Iteration: 126
load data time: 0.24407505989074707

Iteration: 127
load data time: 1.0163114070892334

Iteration: 128
load data time: 1.0114076137542725

Iteration: 129
load data time: 1.0284936428070068

Iteration: 130
load data time: 1.1249558925628662

Iteration: 131
load data time: 1.025432825088501

.
.
. 

Iteration: 500
load data time: 1.114523423498758

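【Comments】:

How big is each sample array? (How big is the resulting HDF5 file?) Is each sample saved to a separate dataset (i.e. x[key1], x[key2], x[key3], etc.), or are you appending new sample data to the end of an existing dataset?
Another question: are you opening/closing the file to write in each "load data" loop? If so, that might be the culprit. I ran 2 tests. The first used with h5py.File() as x: to keep the file open for the entire data-writing process. It showed no performance degradation. The second test opened/closed the file on every loop. It showed time fluctuations similar to yours.
@kcw78 Thanks for your reply. Please see the edit to my post, where I provide the exact code snippet. FYI, I tried changing the open file -> close file structure in my code to h5py.File() as x and found that I hit the same problem. Did I interpret your comment correctly? That is, is your suggestion to go from x = h5py.File('path') to h5py.File() as x?
Yes, you interpreted my comment correctly. The code you added to your post reflects the open file -> close file approach inside different loops. After posting my second comment, I realized that I had run my test cases on a drive with OneDrive sync enabled. I found that it degrades I/O. I reran the tests on a different drive and could not reproduce the time fluctuations you experienced. I will post my tests so you can run them yourself. Maybe they will help you diagnose the bottleneck.
@kcw78 Thanks for the tip. I do have OneDrive, and turning it off did speed things up. However, I am still facing the issue. I think my question was unclear, so I added example output to the post and updated the sample code with the syntax you suggested. I am seeing a sudden performance drop of more than 10x, and once it happens it never goes away.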
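
To help diagnose the bottleneck mentioned above, one option is to time the three reads (points, images, labels) separately rather than as one total, so that a slowdown can be attributed to a single file. The following is only a minimal sketch: `timed`, `section_times` and `load_sample` are hypothetical names that do not appear in the original post, and the list operations are placeholders for the actual h5py lookups in train().

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed seconds, one entry per iteration
section_times = defaultdict(list)


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        section_times[label].append(time.perf_counter() - start)


def load_sample(i):
    # Placeholders for the three reads in train(); swap each body for the
    # corresponding points_h5f / img_h5f / labels_h5f lookups.
    with timed("points"):
        points = [i] * 10
    with timed("images"):
        images = [i] * 1000
    with timed("labels"):
        labels = [i] * 10
    return points, images, labels


for i in range(5):
    load_sample(i)

for label, times in section_times.items():
    print("{}: max {:.6f} s over {} iterations".format(label, max(times), len(times)))
```

If only one label's times jump at iteration N+1, that file (or the drive it lives on) is the place to look first.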

Tags: python hdf5 h5py

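【Solution 1】:

Here are 2 simple tests that create an HDF5 file containing 1000 float arrays of shape (200,200,3).

With Method 1, I consistently write 100 datasets in 0.17-0.20 seconds.

With Method 2, I consistently write 100 datasets in 0.23-0.25 seconds.

These timings are for writing to a hard disk drive; expect faster results on an SSD. Method 2 is slightly slower, but not by nearly as much as you are seeing.

Method 1: open the HDF5 file once using with ... as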

import h5py
import numpy as np
import time

num = 1000

# The file stays open for the whole write loop.
with h5py.File('SO_59555208_1.h5', 'w') as h5f:

    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    start = time.perf_counter()
    for cnt in range(num):
        # Report and reset the timer after every 100 datasets
        if cnt % (num/10) == 0 and cnt > 1:
            print('dataset count: {}/{}'.format(cnt, num))
            print('Elapsed time =', (time.perf_counter() - start))
            start = time.perf_counter()

        ds_name = 'key_' + str(cnt)
        # Create sample image data and add to a dataset
        img_data = np.random.rand(200*200*3,1).reshape(200,200,3)
        dset = h5f.create_dataset(ds_name, data=img_data)

print('dataset count: {}/{}'.format(cnt, num))
print('Elapsed time =', (time.perf_counter() - start))
print('DONE')

Method 2: open/close the HDF5 file to add each dataset

import h5py
import numpy as np
import time

num = 1000

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
start = time.perf_counter()
for cnt in range(num):
    # Report and reset the timer after every 100 datasets
    if cnt % (num/10) == 0 and cnt > 1:
        print('dataset count: {}/{}'.format(cnt, num))
        print('Elapsed time =', (time.perf_counter() - start))
        start = time.perf_counter()
    # Reopen the file in append mode for every dataset
    h5f = h5py.File('SO_59555208_m.h5', 'a')
    ds_name = 'key_' + str(cnt)
    # Create sample image data and add to a dataset
    img_data = np.random.rand(200*200*3,1).reshape(200,200,3)
    dset = h5f.create_dataset(ds_name, data=img_data)
    h5f.close()

print('dataset count: {}/{}'.format(cnt, num))
print('Elapsed time =', (time.perf_counter() - start))
print('DONE')

【Comments】: