How do I combine HDF5 arrays into a table? (Answers)

【Question title】: How do I combine HDF5 arrays into a table?
【Posted】: 2020-10-28 03:03:34
【Question description】:

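I have an HDF5 file that contains 15 arrays and nothing else. Normally I would query this information with SQL, but the database is down, so all I have is the HDF5 file and PyTables. The only PyTables queries I can find that return whole "rows", rather than specific elements of a column, operate on tables, not arrays.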

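For now, I have been building a table from scratch in its own h5 file, filling each row individually and flushing every so often. This takes a very long time because there are 29 million rows. Here is the code I am using to create the table: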

import math
import h5py
import numpy as np
import tables

#Defining Table Structure
table_description = {
        'Column1':tables.FloatCol(),
        'Column2':tables.FloatCol(),
        'Column3':tables.FloatCol(),
         ....
        'Column15':tables.FloatCol()}

#Opening the HDF5 file
hdf5_file = h5py.File('File Path','r')

#Pulling out the arrays (the future columns)
Column1_array = np.array(hdf5_file.get('Column1'))
Column2_array = np.array(hdf5_file.get('Column2'))
Column3_array = np.array(hdf5_file.get('Column3'))
...
Column15_array = np.array(hdf5_file.get('Column15'))

#Creating a New H5 file
new_file = tables.open_file('new_table.h5','w')

#Creating a New Table in the File
tbl = new_file.create_table('/','Big_Table',table_description)

i = 0
row = tbl.row #A row pointer
while i < 29069765: #Since I know the length of the columns, I'm able to just index.
    row['Column1'] = Column1_array[i] #Filling each column in a row.
    row['Column2'] = Column2_array[i] #I have pulled each column out of the HDF5 file,
    row['Column3'] = Column3_array[i] #using h5py. 
    ...
    row['Column15'] = Column15_array[i]
    row.append() #Adding the row to the table
    i += 1
    if math.fmod(i,100) == 0: #Every 100 rows, I flush the table and the file
        tbl.flush()
        new_file.flush()

new_file.close()

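I have not started querying it yet, but I plan to use the Table.where() function on Big_Table.

Is there a faster way to combine all of these column arrays into one table and run multi-parameter queries on it?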

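【Question comments】:

How about putting these column arrays into a pandas DataFrame and writing the table from there? Each column should work as a pandas Series, and you can also column_stack the arrays into a single 2-D array. Filling in one number at a time, as in row['Column1'] = Column1_array[i], is bound to be slow. Column1_array is an array, and arrays work best when processed as a whole. Iterating over one like this is slower than iterating over a Python list!
I hadn't thought to try pandas! It has been slow, taking about an hour and a half to fill the table. I'll look into using pandas.
Filling with the row iterator is the slowest method. Since you have already created an array for each column, simply write each column's array to the matching field in Big_Table.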

Tags: python arrays hdf5 h5py pytables

【Solution 1】:

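I modified your example to read each of the 1-column tables from the original file and then write the data to a new HDF5 file with a single table. This uses get_node() to access each table object and the .read() method to read it as a NumPy array. The data is written to the new table with .modify_column(). Its parameters are column=, the data (e.g. Col_array), and colname=, the name of the column/field to write the data to (e.g. Column#). I also added a loop. This simplifies the code and reduces the memory footprint, because only one column of data is read and written at a time.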

import tables as tb
import numpy as np
import h5py

##Code to create the first HDF5 file used in my example
#hdf5_file = h5py.File('SO_62782315_1.h5','w')
#
#for cnt in range(1,16,1):
#    arr = np.random.rand(1000)
#    hdf5_file.create_dataset('Column'+str(cnt),data=arr)
#hdf5_file.close()

#Defining Table Structure
table_dt = np.dtype( [ 
               ('Column1', 'f8'), ('Column2', 'f8'), ('Column3', 'f8'),
               ('Column4', 'f8'), ('Column5', 'f8'), ('Column6', 'f8'),
               ('Column7', 'f8'), ('Column8', 'f8'), ('Column9', 'f8'),
               ('Column10', 'f8'), ('Column11', 'f8'), ('Column12', 'f8'),
               ('Column13', 'f8'), ('Column14', 'f8'), ('Column15', 'f8') ] )   

#Creating a New H5 file
new_file = tb.open_file('SO_62782315_2.h5','w')

#Creating a New Table in the File
tbl = new_file.create_table('/','Big_Table',table_dt)
# create an array of zeros, one row per source value, and append it to allocate space
table_arr = np.zeros(1000, dtype=table_dt)
tbl.append(table_arr)

#Open the existing HDF5 file with h5py
hdf5_file = h5py.File('SO_62782315_1.h5','r')

for cnt in range(1,16,1):
# alternate method (easier to program using a loop)
    Col_array = hdf5_file['Column'+str(cnt)][:]
    tbl.modify_column(column=Col_array, colname='Column'+str(cnt))
    new_file.flush()

new_file.close()
hdf5_file.close()

You can do everything with PyTables (tables) by making this small change to how the file is opened and how the array data is read.

#Open the existing HDF5 file with tables
hdf5_file = tb.open_file('SO_62782315_1.h5','r')

for cnt in range(1,16,1):
# alternate method (easier to program using a loop)
    Col_array = hdf5_file.get_node('/','Column'+str(cnt)).read()
    tbl.modify_column(column=Col_array, colname='Column'+str(cnt))
    new_file.flush()
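
The question ends by asking about multi-parameter queries with Table.where(). As a rough sketch of what that could look like once Big_Table is filled, the example below builds a tiny stand-in table and filters it on two columns at once. The file name, the three-column layout, the sample values, and the thresholds are all made up for illustration; only create_table(), append(), where() and read_where() are PyTables calls.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical three-column table standing in for Big_Table.
table_dt = np.dtype([('Column1', 'f8'), ('Column2', 'f8'), ('Column3', 'f8')])

data = np.zeros(5, dtype=table_dt)
data['Column1'] = [0.1, 0.6, 0.7, 0.9, 0.2]
data['Column2'] = [0.5, 0.1, 0.4, 0.8, 0.9]
data['Column3'] = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'query_demo.h5')

with tb.open_file(path, 'w') as h5:
    tbl = h5.create_table('/', 'Big_Table', description=table_dt)
    tbl.append(data)
    tbl.flush()

    # Conditions are combined with & and |; each one needs its own parentheses.
    condition = '(Column1 > 0.5) & (Column2 < 0.5)'

    # read_where() returns all matching rows at once as a structured array.
    matches = tbl.read_where(condition)

    # where() yields Row objects lazily, so copy values out inside the loop.
    col3_hits = [row['Column3'] for row in tbl.where(condition)]

print(matches['Column3'])
print(col3_hits)
```

For a table with 29 million rows, where() is probably the safer choice when many rows match, since read_where() materializes every hit in memory.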

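【Comments】:

This is so much simpler! It never even occurred to me to just modify the columns that already exist in the new table. Thank you!
Apologies in advance, I did not test the code I posted. I just realized that you are reading with h5py and writing with tables. That is not a problem, but my code did not work. Also, when you create a table from a description dictionary it has no rows, so you have to append blank rows to create space in the table. I have updated the answer, and it works with my dummy data.
On second thought: you might not have enough memory to create and append 29 million rows in one go. You may have to modify the creation of table_arr and the call to .append(): start with 1 million rows and repeat 29 times.
Thank you very much!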
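
The chunking idea from the last comments can be sketched as follows. This is not the answerer's code but one possible variant: instead of appending blank rows and then calling modify_column(), it slices the same row range out of every source dataset, assembles a structured block, and appends that. The file names, the dummy data, and the chunk size of 1000 rows are assumptions chosen so the example runs quickly; for the real file the chunk size would be closer to the 1 million rows suggested above.

```python
import os
import tempfile

import h5py
import numpy as np
import tables as tb

n_rows, n_cols = 2500, 15   # small stand-in for the 29 million rows
chunk_rows = 1000           # assumed chunk size; tune to available memory
names = ['Column' + str(c) for c in range(1, n_cols + 1)]
table_dt = np.dtype([(name, 'f8') for name in names])

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'source_arrays.h5')  # hypothetical file names
dst_path = os.path.join(workdir, 'big_table.h5')

# Build a dummy source file with one 1-D dataset per column.
with h5py.File(src_path, 'w') as src:
    for k, name in enumerate(names):
        src.create_dataset(name, data=np.arange(n_rows, dtype='f8') + k)

with h5py.File(src_path, 'r') as src, tb.open_file(dst_path, 'w') as dst:
    tbl = dst.create_table('/', 'Big_Table', description=table_dt,
                           expectedrows=n_rows)
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Only one chunk of each column is held in memory at a time.
        block = np.empty(stop - start, dtype=table_dt)
        for name in names:
            block[name] = src[name][start:stop]
        tbl.append(block)
    tbl.flush()
    total_rows = tbl.nrows
    last_row = tbl[n_rows - 1]

print(total_rows)
```

Slicing an h5py dataset reads only the requested range from disk, so peak memory is roughly one chunk of 15 columns rather than the whole file.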