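【Question title】: What is the difference between the two ways of accessing the hdf5 group in SVHN dataset?
【Posted】: 2019-08-29 05:10:19
【Question】:

I need to read the SVHN dataset and am trying to read the filename of the first image.

I am having a hard time understanding the structure of HDF5, and in particular the hierarchy of the SVHN dataset.

What is the difference between these two ways of reading the image name?

I came across method 1 in the definition of the getName() function in this script: https://github.com/bdiesel/tensorflow-svhn/blob/master/digit_struct.py

I played around with the HDF5 file and came up with method 2 while trying different things that gave the same result.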

# Both these methods read the first character of the name of the 1st
# image in svhn dataset
f = h5py.File(path_to_svhn_dataset,'r')

# method 1 
f[f['digitStruct']['name'][0][0]].value

# method 2
f[f['digitStruct']['name'].value[0].item()].value[0][0]

The first image is the file named "1.png". Both of the methods above, used to get the first character of the filename, give the integer equivalent of ASCII '1' -> 49.

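【Comments】:

I understand that you have the SVHN image data in an HDF5 file, but the schema and dataset format in the file are not clear to me. Can you share the details? Then maybe I can help. Also, you can always open the HDF5 file with HDFViewer to "see" what is inside (the group and dataset hierarchy, and the data).
I am having a hard time understanding the schema myself and cannot visualize it. I tried HDFViewer, but it did not work. The file is large, but if it helps, it can be accessed here: ufldl.stanford.edu/housenumbers There is also a Kaggle kernel where the dataset is hosted: kaggle.com/stanfordu/street-view-house-numbers Sorry that I cannot provide more here.
I looked at both locations yesterday and did not find any HDF5 (.h5) files. The first link has tar.gz and .mat files. The tar.gz file has png files and digitStruct.mat, but no HDF5. The same is true on the Kaggle site. Did you create the h5 file with another program?
@kcw78 Thanks for checking. digitStruct.mat is actually encoded in HDF5 format. I use h5py to read the digitStruct.mat file. The "path_to_svhn_dataset" in the code snippet I provided is the path to the digitStruct.mat file. Apologies for the confusion.
Palo, I made progress. I gave up on the files at stanford.edu. I downloaded the files from kaggle.com and was able to open test_digitStruct.mat; h5py.File() works fine. Now that I can see the schema and the data, I understand why there are so many questions about the SVHN data. This is not a simple data model. The /digitStruct/name dataset contains object references that point to datasets in the /#refs#/ group, and those datasets have object references to other datasets. It is complicated to untangle. I will write a complete answer that shows how these work and post it later.

Tags: python dataset hdf5 h5py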

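【Answer 1】:

First, the output of your 2 methods is slightly different.
Method 1: returns the full array (of the encoded filename)
Method 2: returns only the first element (character) of the array

Let's deconstruct your code to understand what you have.
The first part deals with h5py data objects.

f['digitStruct'] -> returns an h5py group object
f['digitStruct']['name'] -> returns an h5py dataset object
f['digitStruct']['name'].name -> returns the name (path) of the dataset object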

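Note:
The /digitStruct/name dataset contains "object references". Each array entry is a pointer to another h5py object (in this case another dataset). For example (spaces added to set off the 2 object references):
f[ f['digitStruct']['name'][0][0] ] -> returns the object referenced at [0][0]
So the outer f[ obj_ref ] works like any other lookup on the file: it returns the object that the reference points to.

f['digitStruct']['name'][0][0] is an object reference pointing to the dataset /#refs#/b. In other words, f[ f['digitStruct']['name'][0][0] ] refers to the same object as f['#refs#']['b'] or f['/#refs#/b'].

So much for h5py object references.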
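To make the reference mechanics concrete without the real file, here is a minimal sketch that builds a tiny in-memory HDF5 file laid out like the structure described above. The file name, the two sample names, and the helper `build_demo_file` are made up for illustration, and the uint16 storage is an assumption; only the /digitStruct/name and /#refs#/b paths come from the discussion above.

```python
import h5py
import numpy as np


def build_demo_file():
    """Build a small in-memory file that mimics the layout described above."""
    # driver="core" with backing_store=False keeps everything in memory.
    f = h5py.File("svhn_refs_demo.h5", "w", driver="core", backing_store=False)
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    texts = ["1.png", "2.png"]  # hypothetical file names
    name_ds = f.create_dataset("digitStruct/name", (len(texts), 1), dtype=ref_dtype)
    for i, (key, text) in enumerate(zip("bc", texts)):
        # One column of character codes per name (uint16 is an assumption).
        codes = np.array([[ord(ch)] for ch in text], dtype=np.uint16)
        target = f.create_dataset("#refs#/" + key, data=codes)
        name_ds[i, 0] = target.ref  # store a reference, not the data itself
    return f


demo = build_demo_file()
ref = demo["digitStruct"]["name"][0][0]   # an object reference
target = demo[ref]                        # dereference it through the file
is_reference = isinstance(ref, h5py.Reference)
target_path = target.name
print(is_reference, target_path)
demo.close()
```

The entry read from /digitStruct/name is only a pointer; indexing the file with it is what yields the dataset under /#refs#/.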
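Let's move on to getting the data from this object reference with method 1.

f[f['digitStruct']['name'][0][0]].value -> returns the entire /#refs#/b dataset as a NumPy array.

However, dataset.value is deprecated (and was removed in h5py 3.0); NumPy-style indexing is preferred, like this: f[f['digitStruct']['name'][0][0]][:] (gets the entire array)

Note: both of these return the entire array of encoded characters. At this point, getting the name is a matter of Python and NumPy functionality. Use this to return the filename as a string:
f[f['digitStruct']['name'][0][0]][:].tobytes().decode('ascii')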
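The decoding step can be tried without the HDF5 file. The sketch below uses the character codes of "1.png" in the same (5, 1) shape that method 1 returns. It assumes the codes are stored as 16-bit integers, which is common for MATLAB character data but should be checked with `dataset.dtype` on the real file; `decode_name` is a hypothetical helper, not part of h5py or NumPy.

```python
import numpy as np

# Character codes for "1.png", one per row.
# ASSUMPTION: uint16 storage; check dataset.dtype on your own file.
codes = np.array([[49], [46], [112], [110], [103]], dtype=np.uint16)


def decode_name(char_codes):
    """Join integer character codes into a str, whatever their integer dtype."""
    return "".join(chr(int(c)) for c in np.asarray(char_codes).ravel())


name = decode_name(codes)

# Decoding the raw bytes keeps every byte, so 16-bit codes leave a NUL
# character next to each letter even though the printed text looks the same.
raw = codes.tobytes().decode("ascii")

print(name)
print(len(name), len(raw))
```

Joining with `chr()` does not depend on the integer width, so it may be the safer choice when the string is later used as a file path.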

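Now let's deconstruct the object reference you use in method 2.

f['digitStruct']['name'].value -> returns the entire /digitStruct/name dataset as a NumPy array. It has 13,068 rows of object references

f['digitStruct']['name'].value[0] -> is the first row

f['digitStruct']['name'].value[0].item() -> copies that array element to a Python scalar

So all of these point to the same object:
Method 1: f['digitStruct']['name'][0][0]
Method 2: f['digitStruct']['name'].value[0].item()
and in this example both are the same as f['#refs#']['b'] or f['/#refs#/b'].

As with method 1, getting the string is a matter of Python and NumPy functionality.

f[f['digitStruct']['name'].value[0].item()][:].tobytes().decode('ascii')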
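The equivalence of the two access paths can be checked on a small stand-in file. This is a sketch under assumptions: the in-memory file and its single entry are invented for illustration, and `[()]` is used in place of `.value` so that it also runs on h5py 3.x.

```python
import h5py
import numpy as np

# Hypothetical in-memory file with one name dataset behind one reference.
f = h5py.File("svhn_methods_demo.h5", "w", driver="core", backing_store=False)
codes = np.array([[ord(ch)] for ch in "1.png"], dtype=np.uint16)
target = f.create_dataset("#refs#/b", data=codes)
names = f.create_dataset("digitStruct/name", (1, 1),
                         dtype=h5py.special_dtype(ref=h5py.Reference))
names[0, 0] = target.ref

# Method 1: index the dataset directly, so only row 0 is read.
ref_1 = names[0][0]
# Method 2: read the whole dataset into a NumPy array first, then take
# row 0 and convert its single element to a Python scalar.
ref_2 = names[()][0].item()

path_1 = f[ref_1].name
path_2 = f[ref_2].name
first_char = f[ref_2][()][0][0]   # first character code only
print(path_1, path_2, first_char)
f.close()
```

Both paths end at the same dataset. The practical difference is that method 2 loads every reference in /digitStruct/name before picking one, while method 1 reads only the requested row.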

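Yes, object references are complicated....
My recommendation:
Use NumPy indexing instead of .value to extract NumPy arrays from objects (as shown in the modified method 1 above).

Example code for completeness, with intermediate print statements to show what is happening.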

import h5py

# Both of these methods read the name of the 1st
# image in the SVHN dataset.
# Dataset.value was removed in h5py 3.0, so ds[()] is used wherever
# the original methods called .value.
f = h5py.File('test_digitStruct.mat', 'r')
print(f['digitStruct'])
print(f['digitStruct']['name'])
print(f['digitStruct']['name'].name)

# method 1
print('\ntest method 1')
print(f[f['digitStruct']['name'][0][0]])
print(f[f['digitStruct']['name'][0][0]].name)
# both of these get the entire array / filename:
print(f[f['digitStruct']['name'][0][0]][()])  # takes the place of .value
print(f[f['digitStruct']['name'][0][0]][:])   # same result as the line above
print(f[f['digitStruct']['name'][0][0]][:].tobytes().decode('ascii'))

# method 2
print('\ntest method 2')
print(f[f['digitStruct']['name'][()][0].item()])
print(f[f['digitStruct']['name'][()][0].item()].name)

# this only gets the first array member / character:
print(f[f['digitStruct']['name'][()][0].item()][()][0][0])
print(f[f['digitStruct']['name'][()][0].item()][()][0][0].tobytes().decode('ascii'))
# this gets the entire array / filename:
print(f[f['digitStruct']['name'][()][0].item()][:])
print(f[f['digitStruct']['name'][()][0].item()][:].tobytes().decode('ascii'))

f.close()

The output of the last 2 print statements is the same for each method:

[[ 49]
 [ 46]
 [112]
 [110]
 [103]]
1.png

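【Comments】:

Note that getting an object reference from the f['digitStruct']['name'] dataset uses NumPy indexing. I copied the [0][0] indexing format used in the original post. You can also use [0,0] (which may be more intuitive). Also, since this is standard NumPy indexing, the first index points to the row in the dataset. You can use [-1,0] (or [-1][0]) to get the last row.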
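The indexing forms mentioned in this comment can be wrapped in a small helper. The sketch below is illustrative only: the in-memory file, its three sample names, the /#refs#/n0... dataset names, and the `get_name` helper are assumptions made for the example and do not match the naming in the real digitStruct.mat.

```python
import h5py
import numpy as np

# Hypothetical in-memory file with three referenced name datasets.
f = h5py.File("svhn_index_demo.h5", "w", driver="core", backing_store=False)
file_names = ["1.png", "2.png", "3.png"]
names = f.create_dataset("digitStruct/name", (len(file_names), 1),
                         dtype=h5py.special_dtype(ref=h5py.Reference))
for i, text in enumerate(file_names):
    codes = np.array([[ord(ch)] for ch in text], dtype=np.uint16)
    names[i, 0] = f.create_dataset(f"#refs#/n{i}", data=codes).ref


def get_name(h5file, row):
    """Return the file name stored behind the reference in the given row."""
    ref = h5file["digitStruct/name"][row, 0]   # [row, 0] instead of [row][0]
    return "".join(chr(int(c)) for c in h5file[ref][()].ravel())


first = get_name(f, 0)
last = get_name(f, -1)   # a negative row index counts from the end
all_names = [get_name(f, i) for i in range(names.shape[0])]
print(first, last, all_names)
f.close()
```

Looping over `range(names.shape[0])` is one way to collect every file name; on the full dataset that means one small read per image.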