Broadcast error HDF5: Can't broadcast (3, 2048, 1, 1) -> (4, 2048, 1, 1)

【Question title】: Broadcast error HDF5: Can't broadcast (3, 2048, 1, 1) -> (4, 2048, 1, 1)
【Posted】: 2020-08-18 04:26:35
【Question description】:

I am getting the following error:

TypeError: Can't broadcast (3, 2048, 1, 1) -> (4, 2048, 1, 1)

I am extracting features and placing them into an hdf5 dataset like this:

array_40 = hdf5_file.create_dataset(
                    f'{phase}_40x_arrays',  shape, maxshape=(None, args.batch_size, 2048, 1, 1))

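In (None, args.batch_size, 2048, 1, 1), None is used for the first dimension because the size of the dataset is not known in advance. args.batch_size is 4 in this case, and 2048, 1 and 1 are the number of extracted features and their spatial dimensions.

shape is defined as: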

shape = (dataset_length, args.batch_size, 2048, 1, 1)

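However, I am not sure what I can do with args.batch_size, which is 4 in this case. I cannot leave it as None, because that raises an illegal value error:

ValueError: Illegal value in chunk tuple

EDIT: Yes, you are absolutely right. I am trying to write to an hdf5 dataset incrementally. I have shown more of the code below. I am extracting features and storing them incrementally in an hdf5 dataset. Although the batch size is 4, ideally each item in a batch would be saved incrementally as its own instance/row.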

            shape = (dataset_length, 2048, 1, 1)
            all_shape = (dataset_length, 6144, 1, 1)
            labels_shape = (dataset_length)
            batch_shape = (1,)

            path = args.HDF5_dataset + f'{phase}.hdf5'

            #hdf5_file = h5py.File(path, mode='w')
            with h5py.File(path, mode='a') as hdf5_file:

                array_40 = hdf5_file.create_dataset(
                    f'{phase}_40x_arrays',  shape, maxshape=(None, 2048, 1, 1)
                )
                array_labels = hdf5_file.create_dataset(
                    f'{phase}_labels', labels_shape, maxshape=(None), dtype=string_type
                )
                array_batch_idx = hdf5_file.create_dataset(
                    f'{phase}_batch_idx', data=np.array([-1, ])
                )

                hdf5_file.close()

        # either a new or a checkpointed file exists
        # load the file and create references to the existing h5 datasets
        with h5py.File(path, mode='r+') as hdf5_file:
            array_40 = hdf5_file[f'{phase}_40x_arrays']
            array_labels = hdf5_file[f'{phase}_labels']
            array_batch_idx = hdf5_file[f'{phase}_batch_idx']

            batch_idx = int(array_batch_idx[0]+1)

            print("Batch ID is restarting from {}".format(batch_idx))

            dataloaders_dict = torch.utils.data.DataLoader(datasets_dict, batch_size=args.batch_size, sampler=SequentialSampler2(
                datasets_dict, batch_idx, args.batch_size),drop_last=True, num_workers=args.num_workers, shuffle=False)  # keep shuffle=False so the sampler works, also in case you restart


            for i, (inputs40x, paths40x, labels) in enumerate(dataloaders_dict):

                print(f'Batch ID: {batch_idx}')

                inputs40x = inputs40x.to(device)
                labels = labels.to(device)
                paths = paths40x

                x40 = resnet(inputs40x)

                # torch.Size([1, 2048, 1, 1]) batch, feats, 1l, 1l
                array_40[...] = x40.cpu()
                array_labels[batch_idx, ...] = labels[:].cpu()
                array_batch_idx[:,...] = batch_idx

                batch_idx +=1
                hdf5_file.flush()

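【Comments】:

This error strongly suggests that args.batch_size is not the same in the two places where you use it (it is 3 somewhere).
Thanks for your reply. I see, I should rephrase my question. How do I handle a variable size in that dimension? For example, I have 51 instances/rows in my dataset. With a batch size of 4, I can fill my hdf5 dataset 12 times, but the last batch, which contains 3 items, produces the error. I would like to be able to handle a variable input size in the args.batch_size dimension. If I leave it as None, I get the following error: ValueError: Illegal value in chunk tuple. I am not sure what I can do...
@Taran, I'm not an ML/AI person, so I don't use the pytorch DataLoader. As I understand it, it returns an iterable for accessing the data. Your code iterates over it with enumerate(). As you get each batch, you have to map the data in inputs40x, paths40x, labels to the next open rows in the matching HDF5 datasets. You can't use [...]; you need the indices of the rows for the batch. Use a position counter to do this.
Hi kcw78, thanks for your reply, you have been really helpful. The dataloader has a custom sequential sampler that lets it keep its order :) As for this problem, I dropped the last batch. I also fixed the code that appends each batch item, essentially by applying: `array_40[batch_idx*args.batch_size:(batch_idx+1)*args.batch_size, ...] = x40.cpu()`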
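
The position-counter idea from the comments can be sketched as follows. This is a minimal sketch, not the asker's pipeline: the in-memory file, the dataset name `feats`, the `features` array and the reduced feature count of 8 (instead of 2048) are all placeholders chosen for illustration. The point is that each write targets a slice exactly as long as the batch, so a short final batch of 3 rows needs no special handling and does not have to be dropped.

```python
import h5py
import numpy as np

n_rows, batch_size, n_feats = 51, 4, 8  # 8 features instead of 2048 to keep the demo small

# Hypothetical stand-in for the extracted features: row i is filled with the value i.
features = (np.arange(n_rows, dtype="f4").reshape(-1, 1, 1, 1)
            * np.ones((1, n_feats, 1, 1), dtype="f4"))

# In-memory HDF5 file (core driver without backing store, nothing is written to disk).
with h5py.File("demo_features.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("feats", shape=(n_rows, n_feats, 1, 1),
                            maxshape=(None, n_feats, 1, 1), dtype="f4")
    row = 0  # position counter: index of the next free row in the dataset
    for start in range(0, n_rows, batch_size):
        batch = features[start:start + batch_size]  # the last batch holds only 3 rows
        n = batch.shape[0]
        dset[row:row + n] = batch                   # target slice has exactly n rows
        row += n
    written = dset[...]                             # read everything back as a NumPy array

print(row)            # number of rows written
print(written.shape)  # shape of the data read back
```

With a real model, the batch would be the network output converted to a NumPy array before the assignment.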

Tags: python pytorch hdf5 h5py

【Solution 1】:

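I think you are confused about how the maxshape=() parameter is used. It sets the maximum allocated dataset size in each dimension. The first dataset dimension is set to dataset_length at creation, and maxshape[0]=None allows it to grow without limit. The second dataset dimension has size args.batch_size at creation. You specified the same size for maxshape, so you cannot increase this dimension.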
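
To make the role of maxshape concrete, here is a small sketch, assuming an in-memory file and placeholder batches of ones (the file name, the dataset name `feats` and the batch sizes are made up for the example). The dataset starts with 0 rows and is grown along axis 0 with `Dataset.resize` before each write; only axis 0 can grow, because only `maxshape[0]` is None.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver without backing store, nothing is written to disk).
with h5py.File("demo_resize.h5", "w", driver="core", backing_store=False) as f:
    # Start empty; axis 0 is unlimited, the other axes are fixed at their creation size.
    dset = f.create_dataset("feats", shape=(0, 2048, 1, 1),
                            maxshape=(None, 2048, 1, 1), dtype="f4")
    for n in (4, 4, 3):  # hypothetical batch sizes, including a short last batch
        batch = np.ones((n, 2048, 1, 1), dtype="f4")  # placeholder for real features
        old = dset.shape[0]
        dset.resize(old + n, axis=0)  # grow the dataset by n rows
        dset[old:old + n] = batch     # write into the newly added rows
    final_shape = dset.shape
    max_shape = dset.maxshape

print(final_shape)
print(max_shape)
```

Growing row by row like this is handy when the total length is not known up front; when dataset_length is known, allocating it once at creation and writing by slice avoids the repeated resize calls.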

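I am a bit confused by your example. It sounds like you are trying to write data to the dataset incrementally, args.batch_size rows/instances at a time. Your example has 51 rows/instances of data, and you want to write them in batches of args.batch_size=4. With 51 rows, you can write the first 48 rows (0-3, 4-7 ... 44-47) and are then stuck with the 3 remaining rows. Can't you solve this by adding a counter (call it nrows_left) and changing the batch size argument to min(args.batch_size, nrows_left)? That seems like the simplest solution to me.

Without more information, I can't write a complete example.
I will try to illustrate what I mean below: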

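# args.batch_size = 4
shape = (dataset_length, 2048, 1, 1)
array_40 = hdf5_file.create_dataset(
    f'{phase}_40x_arrays', shape, maxshape=(None, 2048, 1, 1))
nrows_left = dataset_length
rcnt = 0  # index of the next row to write
# number of batches, rounded up so that a short final batch is included
loopcnt = dataset_length // args.batch_size
if dataset_length % args.batch_size != 0:
    loopcnt += 1
for loop in range(loopcnt):
    # the last batch may hold fewer than args.batch_size rows
    nload = min(nrows_left, args.batch_size)
    # img_data stands for the array that holds the extracted features
    array_40[rcnt:rcnt + nload] = img_data[rcnt:rcnt + nload]
    rcnt += nload
    nrows_left -= nload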
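
For the 51-row case, the row ranges such a loop walks through can be checked without any HDF5 file. The helper `batch_ranges` below is a hypothetical name introduced only for this illustration; it uses the same min() idea to shorten the last batch.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        # min() shortens the final range when n_rows is not a multiple of batch_size
        yield start, min(start + batch_size, n_rows)


ranges = list(batch_ranges(51, 4))
print(len(ranges))   # 13 batches: 12 full ones plus one short one
print(ranges[0])     # (0, 4)
print(ranges[-1])    # (48, 51)
```

Each (start, stop) pair can be used directly as the slice on both the dataset and the source array.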

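【Discussion】:

EDIT: Yes, you are right. I am trying to write to an hdf5 dataset incrementally. I have shown more of the code below. I am extracting features and storing them incrementally in an hdf5 dataset. Although the batch size is 4, ideally each item in a batch would be saved as its own instance/row. I have updated the code above to reflect this task.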