How do I add new columns to an existing hdf5 file in pandas? (Answers)

【Question title】: How do I add new columns to an existing hdf5 file in pandas?
【Posted】: 2021-10-16 18:13:03
【Question description】:

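I have a large dataset with about 100 columns (54 million rows). I can't process everything in memory, so I want to process it column by column and store the output in a single hdf5 file. However, I am really struggling with this. When I try the following, I keep getting an error: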

store = pd.HDFStore('file.h5', mode='a')
# using an existing h5 file
store.put(key, column_frame, append=True) 
# also tried .append

The error is always the same: "cannot match existing table structure"

Thank you in advance.

【Question discussion】:

Tags: python pandas hdf5

【Solution 1】:

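As I understand it, you don't have enough RAM to create a dataframe with 54 million rows and 100 columns. Some questions: how will you load the data incrementally so that it fits in memory? Also, once the data is saved to an HDF5 file, how will you use it (again, since it won't fit in memory)? IMHO, the way you load and process the data will influence the way you write and read it.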

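After reading the documentation and running some simple tests, it is not clear (to me) whether you can append columns to an existing table in an HDF5 file, with either Pandas or PyTables. This appears to be a limitation of HDF5 "tables" (used for 2D heterogeneous data, similar to a Pandas dataframe). When I tried to append a dataframe Series as a new column, I got the same error message you did.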
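The failure described above can be reproduced with a small sketch. The file path, the key "data", and the column names "a" and "b" are made up for illustration, and the sketch assumes pandas raises ValueError on the mismatched append, which is what I see for this case:

```python
import os
import tempfile

import pandas as pd

# Hypothetical file path and key, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo_append.h5")

first = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
other = pd.DataFrame({"b": [4.0, 5.0, 6.0]})

error_message = None
with pd.HDFStore(path, mode="w") as store:
    # The first append creates a table whose only column is "a".
    store.append("data", first)
    try:
        # Appending a frame with a different column does not add a column;
        # pandas rejects it because it does not match the stored structure.
        store.append("data", other)
    except ValueError as exc:
        error_message = str(exc)
    # The table still holds only the rows from the first append.
    n_rows = store.get_storer("data").nrows

print(error_message)
print(n_rows)
```

In other words, append() extends a table with more rows of the same shape; it does not widen it.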

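Have you considered processing the rows in batches (say, 1M rows at a time)? There are several examples that show how to append rows. With this approach you can create a single HDF5 dataset (table) with 100 columns and 54M rows.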
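A minimal sketch of the batched row-append approach follows. The make_batch() helper, the key "data", the column names, and the batch sizes are all placeholders; in practice each batch would come from your real data source, and every batch needs the same columns and dtypes:

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "rows.h5")  # hypothetical path


def make_batch(start, size):
    """Stand-in for loading one batch of rows from the real data source."""
    idx = np.arange(start, start + size)
    return pd.DataFrame({"x": idx.astype("float64"), "y": idx * 2.0}, index=idx)


batch_size = 1000  # kept small here; the text above suggests about 1M rows
n_batches = 5

with pd.HDFStore(path, mode="w") as store:
    for i in range(n_batches):
        batch = make_batch(i * batch_size, batch_size)
        # append() adds the batch as new rows of the same table
        store.append("data", batch)

with pd.HDFStore(path, mode="r") as store:
    total_rows = store.get_storer("data").nrows
    # A table can be read back partially: only column "x", first 10 rows
    x_only = store.select("data", columns=["x"], start=0, stop=10)

print(total_rows)
print(x_only.shape)
```

Because the data is stored in table format, select() can pull a subset of columns or a row range, so the whole table never has to be in memory at once.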

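There is a way to write the data column by column, but you have to use a separate dataset (a "key" in Pandas terminology) for each column. The following code shows how to do this. It writes the data, then reads it back into a new dataframe. (It also writes the entire dataframe to 2 other files using 2 different methods, so you can compare the values.)

Code below: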

import pandas as pd

dates = ['2021-08-01','2021-08-02','2021-08-03','2021-08-04','2021-08-05',
         '2021-08-06','2021-08-07','2021-08-08','2021-08-09','2021-08-10' ]
precip = [ 0.0, 0.02, 0.0, 0.12, 0.0, 0.0, 1.11, 0.0, 0.0,  0.05]
temps = [ 80.0, 71.2, 77.5, 85.4, 83.3, 90.0, 78.9, 80.1, 72.4, 88.8]
weather_df = pd.DataFrame({'dates': dates, 'precip': precip, 'temps': temps})

# write entire dataframe to 1 Table with pd.HDFStore()
with pd.HDFStore('file1-a.h5', mode='w') as store1:
    store1.put("weather_data", weather_df, format='table')

# write entire dataframe to 1 Table with df.to_hdf()
weather_df.to_hdf('./file1-b.h5', key='weather_data', mode='w', format='table')

# write each series to a different Table with pd.HDFStore()
with pd.HDFStore('file2.h5', mode='w') as store2:
    store2.put('dates', weather_df['dates'], format='table')

with pd.HDFStore('file2.h5', mode='a') as store2:
    store2.put('precip', weather_df['precip'])

with pd.HDFStore('file2.h5', mode='a') as store2:
    store2.append('temps', weather_df['temps'])

# load the HDF5 data into a new dataframe
with pd.HDFStore('file2.h5', mode='r') as store3:
    w_df = pd.DataFrame({'dates': store3.get('dates'), 
                         'precip': store3.get('precip'), 
                         'temps': store3.get('temps')  })

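【Discussion】:

Thank you very much for your answer. Yes, the problem is that I don't have enough RAM to load all of the data into memory. I am using DASK dataframes to load and analyze the data. In the end I realized that the only way seems to be adding rows. The 'key' option makes it hard to compute and compare feature statistics easily, because I have to load each feature separately rather than in 1 dataframe. I will go with the row-append option. Thanks again.
RE: "I have to load each feature separately rather than in 1 dataframe": look at the last 4 lines of my code. They load 3 'keys' (columns) into 1 dataframe. The only limit on the number of 'keys' is having enough memory to hold them.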
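The same idea can be generalized so the keys do not have to be listed by hand. This is only a sketch: the file path and the two sample columns are made up, and it assumes every key in the file holds one Series with a shared index:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "columns.h5")  # hypothetical path
source = pd.DataFrame({"precip": [0.0, 0.02, 0.0], "temps": [80.0, 71.2, 77.5]})

# Write each column under its own key, as in the answer above.
with pd.HDFStore(path, mode="w") as store:
    for name in source.columns:
        store.put(name, source[name], format="table")

# Read back whichever keys are wanted and join them column-wise.
with pd.HDFStore(path, mode="r") as store:
    wanted = [key.lstrip("/") for key in store.keys()]  # or a hand-picked subset
    combined = pd.concat({name: store.get(name) for name in wanted}, axis=1)

print(combined)
```

Restricting `wanted` to a few names loads only those columns, which keeps memory use in check when comparing a handful of features.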