OverflowError with Pandas to_hdf: answers

【Question title】: OverflowError with Pandas to_hdf
【Posted】: 2017-12-22 16:08:16
【Question】:

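Python newbie here.

I am trying to save a large DataFrame to an HDF file with lz4 compression using to_hdf.

I use Windows 10, Python 3, Pandas 0.20.2.

I get the error "OverflowError: Python int too large to convert to C long".

None of the machine's resources are close to their limits (RAM, CPU, swap usage).

Previous posts discuss the dtype, but the example below shows that there is some other problem, potentially related to the size?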

import numpy as np
import pandas as pd


# sample dataframe to be saved, pardon my French
n = 500 * 1000 * 1000
df = pd.DataFrame({'col1': [999999999999999999] * n,
                   'col2': ['aaaaaaaaaaaaaaaaa'] * n,
                   'col3': [999999999999999999] * n,
                   'col4': ['aaaaaaaaaaaaaaaaa'] * n,
                   'col5': [999999999999999999] * n,
                   'col6': ['aaaaaaaaaaaaaaaaa'] * n})

# works fine
lim = 200 * 1000 * 1000
df[:lim].to_hdf('df.h5', key='table', complib='blosc:lz4', mode='w')

# works fine
lim = 300 * 1000 * 1000
df[:lim].to_hdf('df.h5', key='table', complib='blosc:lz4', mode='w')


# raises OverflowError
lim = 400 * 1000 * 1000
df[:lim].to_hdf('df.h5', key='table', complib='blosc:lz4', mode='w')


....
OverflowError: Python int too large to convert to C long

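【Comments】:

Do you really expect integer values of 999999999999999999? Or is this just a bad example? If the former, would using floating-point values hurt precision?
"Previous posts discuss the dtype": this issue is also about the dtype, since those integer values are too large to fit in a 4-byte integer. You may want to show the dtypes of the DataFrame.
Thanks for the comment, Evert. The example is meant to show that the problem is unrelated to the integer values or the dtype. There are 500M identical rows. Writing a smaller file of 300M rows works fine; 400M fails.

Tags: python pandas hdf5 lz4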

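【Answer 1】:

I ran into the same problem, and it does indeed seem to be related to the size of the DataFrame rather than to the dtype (I stored all the columns as strings and was able to save each of them to .h5 separately).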
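The error message refers to the C `long` type, which is 32 bits wide on 64-bit Windows (the asker's platform) and 64 bits wide on most Linux and macOS builds. The following is a minimal sketch for checking that width on the current machine, not a diagnosis taken from this thread: `c_long_max` and `fits_in_c_long` are hypothetical helper names, and which quantity pandas or PyTables actually pushes past the limit is not established here.

```python
import ctypes


def c_long_max():
    """Largest value a signed C long can hold on this platform."""
    bits = 8 * ctypes.sizeof(ctypes.c_long)
    return 2 ** (bits - 1) - 1


def fits_in_c_long(value):
    """True if value can be converted to a signed C long without overflow."""
    return -c_long_max() - 1 <= value <= c_long_max()


if __name__ == "__main__":
    print("C long is", 8 * ctypes.sizeof(ctypes.c_long), "bits wide")
    print("2**31 fits in a C long:", fits_in_c_long(2 ** 31))
```

If the script reports 32 bits, any count or size above 2**31 - 1 that gets converted to a C `long` would raise this kind of OverflowError, which would be consistent with a failure that depends on size rather than on dtype.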

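The solution that worked for me is to save the DataFrame in chunks using mode='a'. As the pandas documentation puts it: mode {'a', 'w', 'r+'}, default 'a'. 'a': append, an existing file is opened for reading and writing, and if the file does not exist it is created.

So the example code looks like this: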

batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', key='table', complib='blosc:lz4', mode='a')

【Comments】:

Hi, I have a similar problem, but when I try this solution it only saves the last batch. Is that normal?
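That behaviour is expected with the code above: mode='a' only keeps the file open for appending new objects, while each to_hdf call that reuses the same key in the default fixed format replaces the object stored under that key. One way to get every batch into a single table, offered as a hedged sketch rather than as something the posters tested, is to write with format='table' and append=True. The helper name `save_in_chunks`, the chunk size, and complevel=5 are assumptions for illustration. complevel is passed explicitly because the pandas documentation states that a complevel of 0 or None disables compression.

```python
import pandas as pd


def save_in_chunks(df, path, key="table", chunk_rows=1_000_000,
                   complib="blosc:lz4", complevel=5):
    """Write df to a single appendable HDF5 table, chunk_rows rows at a time.

    Hypothetical helper: name, chunk size and complevel are assumptions.
    """
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        chunk.to_hdf(
            path,
            key=key,
            # Start a fresh file for the first chunk, then reopen it for appending.
            mode="w" if start == 0 else "a",
            format="table",   # only the table format supports append=True
            append=True,      # add rows to the existing table instead of replacing it
            complib=complib,
            complevel=complevel,
        )
```

Usage would be `save_in_chunks(df, 'df.h5')` followed by `pd.read_hdf('df.h5', key='table')`. Requires PyTables. Note that an appendable table fixes the width of string columns when it is created, so a later chunk with longer strings than the first one may need `min_itemsize` to be set on the first write.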

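【Answer 2】:

As @Giovanni Maria Strampelli pointed out, @Artem Snorkovenko's answer only saves the last batch. The Pandas documentation states the following:

To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

Here is a possible workaround that saves all batches (adapted from @Artem Snorkovenko's answer):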

for i in range(len(df)):
    sr = df.iloc[i]  # row i as a pandas Series (positional, so any index works)
    sr.to_hdf('df.h5', key=f'table_{i}', complib='blosc:lz4', mode='a')

This code saves each Pandas Series object under a different key. Each key is indexed by i.

To load the existing .h5 file after saving, you can do the following:

i = 0
srl = []  # Series objects loaded so far
while True:
    # print(i)  # uncomment to check that the loop is progressing
    try:
        # read_hdf raises KeyError once table_<i> is not in the .h5 file
        sr = pd.read_hdf('df.h5', key=f'table_{i}')
    except KeyError:
        break  # i is past the last key, so every key has been loaded
    srl.append(sr)  # collect each Series to build the DataFrame at the end
    i += 1

df = pd.DataFrame(srl)  # build the DataFrame from the list of Series, one row each

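I used a while loop on the assumption that we do not know the exact length of the DataFrame stored in the .h5 file. If the length is known, a for loop can be used instead.

Note that I am not saving the DataFrame in chunks here. The loading procedure in its current form is therefore not suited to data saved in chunks, where each stored object would be a DataFrame. In my implementation each saved object is a Series, and the DataFrame is built from a list of Series. The code I provided can be adapted to save in chunks and to build the DataFrame from a list of DataFrame objects (a good starting point can be found in this Stack Overflow entry).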
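The adaptation described above might look like the following sketch, which stores one DataFrame chunk per key instead of one Series per row and rebuilds the full DataFrame with pd.concat. The function names `save_chunks_under_keys` and `load_chunks`, the `chunk_` key prefix, and the chunk size are all hypothetical. It lists the stored keys with `HDFStore.keys()` rather than probing until a read fails, and it assumes every key starting with the prefix ends in an integer.

```python
import pandas as pd


def save_chunks_under_keys(df, path, chunk_rows=1_000_000, prefix="chunk_"):
    """Store df as several DataFrames, one key per chunk (hypothetical helper)."""
    for n, start in enumerate(range(0, len(df), chunk_rows)):
        df.iloc[start:start + chunk_rows].to_hdf(
            path,
            key=f"{prefix}{n}",
            mode="w" if n == 0 else "a",  # new file first, then add more keys
        )


def load_chunks(path, prefix="chunk_"):
    """Read every chunk back in numeric key order and concatenate the chunks."""
    with pd.HDFStore(path, mode="r") as store:
        # store.keys() returns names with a leading slash, e.g. '/chunk_0'
        names = [k.lstrip("/") for k in store.keys()]
        names = [k for k in names if k.startswith(prefix)]
        names.sort(key=lambda k: int(k[len(prefix):]))
        return pd.concat([store[k] for k in names])
```

With chunks of a million rows or so, the file holds a few hundred keys for the DataFrame in the question rather than one key per row, which keeps both saving and loading manageable. Requires PyTables.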

【Comments】: