【Question Title】: HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!
【Posted】: 2017-02-18 13:07:34
【Question】:

I get the following error after using pandas.HDFStore().append():

ValueError: Trying to store a string with len [150] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

I am creating a pandas DataFrame and appending it to an HDF5 file as follows:

import pandas as pd

store = pd.HDFStore("test1.h5", mode='w')

hdf_key = "one_key"

columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=columns, index=False)

I get the error above: "ValueError: Trying to store a string with len [150] in [values_block_0] column but this column has a limit of [127]!"

Afterwards, I run:

store.get_storer(hdf_key).table.description

which outputs:

{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=127, shape=(5,), dflt=b'', pos=1),
  "values_block_1": Int64Col(shape=(5,), dflt=0, pos=2),
  "col1": StringCol(itemsize=20, shape=(), dflt=b'', pos=3),
  "col2": StringCol(itemsize=39, shape=(), dflt=b'', pos=4)}

What are values_block_0 and values_block_1?

So, following the StackOverflow question "Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex", I tried:

store.append(hdf_key, df, data_column=columns, index=False,  min_itemsize={"values_block_0":250})

This did not work. Now I get this error:

ValueError: Trying to store a string with len [250] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

What am I doing wrong?

EDIT: The following code, run as filename.py, produces the error ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column

import pandas as pd
store = pd.HDFStore("test1.h5", mode='w')
hdf_key = "one_key"

my_columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=my_columns, index=False, min_itemsize={"values_block_0":350})

Here is the full error:

(python-3) -bash:1008 $ python filename.py
Traceback (most recent call last):
  File "filename.py", line 50, in <module>
    store.append(hdf_key, dicts_into_df,  data_column=my_columns, index=False, min_itemsize={'values_block_0':350})
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 970, in append
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 4263, in write
    obj=obj, data_columns=data_columns, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3853, in write
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3535, in create_axes
    self.validate_min_itemsize(min_itemsize)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3174, in validate_min_itemsize
    "data_column" % k)
ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column

【Comments】:

    Tags: python pandas hdf5 pytables hdfstore


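    【Solution 1】:

    UPDATE:

    You misspelled the data_columns parameter: you wrote data_column, but it should be data_columns. As a result, you don't have any indexed (data) columns in your HDF store, and the HDF store added values_block_X columns instead.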

    In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
    

    The misspelled parameter is silently ignored (at least in the pandas version used here):

    In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)
    
    In [72]: store.get_storer('no_idx_wrong_dc').table
    Out[72]:
    /no_idx_wrong_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
      "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
    

    This is the same as:

    In [73]: store.append('no_idx_no_dc', df, index=False)
    
    In [74]: store.get_storer('no_idx_no_dc').table
    Out[74]:
    /no_idx_no_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
      "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
    

    Now let's spell it correctly:

    In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)
    
    In [76]: store.get_storer('no_idx_dc').table
    Out[76]:
    /no_idx_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "value": Float64Col(shape=(), dflt=0.0, pos=1),
      "count": Int64Col(shape=(), dflt=0, pos=2),
      "s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
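
The comparison above can be reproduced with a small helper. This is a minimal sketch, not part of the original answer: the function name stored_columns and the key "demo" are made up for illustration, and it assumes the optional PyTables dependency (the tables package) is installed. It writes a DataFrame to a fresh file and returns the column names of the resulting table, so the effect of data_columns can be checked directly.

```python
import pandas as pd


def stored_columns(path, df, **append_kwargs):
    """Append df to a fresh HDF5 file and return the table's column names.

    Hypothetical helper for illustration; needs the "tables" package.
    """
    with pd.HDFStore(path, mode="w") as store:
        # index=False mirrors the calls in the question and the answer.
        store.append("demo", df, index=False, **append_kwargs)
        return list(store.get_storer("demo").table.colnames)
```

Calling stored_columns("demo.h5", df) without data_columns should list generic values_block_N names, while stored_columns("demo.h5", df, data_columns=True) should list the DataFrame's own column names next to "index". Only a named data column can be used as a key in min_itemsize, which is why the question's second error complained about values_block_0.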
    

    OLD ANSWER:

    AFAIK, min_itemsize only takes effect on the first append, that is, when the table is created.

    Demo:

    In [33]: df
    Out[33]:
       num                 s
    0   11  aaaaaaaaaaaaaaaa
    1   12    bbbbbbbbbbbbbb
    2   13     ccccccccccccc
    3   14       ddddddddddd
    
    In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
    
    In [35]: store.append('test_1', df, data_columns=True)
    
    In [36]: store.get_storer('test_1').table.description
    Out[36]:
    {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "num": Int64Col(shape=(), dflt=0, pos=1),
      "s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}
    
    In [37]: df.loc[4] = [15, 'X'*200]
    
    In [38]: df
    Out[38]:
       num                                                  s
    0   11                                   aaaaaaaaaaaaaaaa
    1   12                                     bbbbbbbbbbbbbb
    2   13                                      ccccccccccccc
    3   14                                        ddddddddddd
    4   15  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
    
    In [39]: store.append('test_1', df, data_columns=True)
    ...
    skipped
    ...
    ValueError: Trying to store a string with len [200] in [s] column but
    this column has a limit of [16]!
    Consider using min_itemsize to preset the sizes on these columns    
    

    Now use min_itemsize, but still append to the existing object in the store:

    In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s':250})
    ...
    skipped
    ...
    ValueError: Trying to store a string with len [250] in [s] column but
    this column has a limit of [16]!
    Consider using min_itemsize to preset the sizes on these columns
    

    If we create a new object in the store instead, the following works:

    In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})
    

    Check the column sizes:

    In [42]: store.get_storer('test_2').table.description
    Out[42]:
    {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "num": Int64Col(shape=(), dflt=0, pos=1),
      "s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
    
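
Since the width has to be known before the first append, one option is to scan the data up front. The following is a minimal sketch under stated assumptions: string_widths and merged_min_itemsize are hypothetical helpers (not pandas API), lengths are measured in UTF-8 bytes, and the headroom of 50 is an arbitrary choice.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def string_widths(df):
    """Return {column: longest UTF-8 encoded length} for string-like columns."""
    widths = {}
    for name in df.columns:
        col = df[name]
        if is_object_dtype(col) or is_string_dtype(col):
            lengths = col.map(lambda v: len(str(v).encode("utf-8")))
            widths[name] = int(lengths.max())
    return widths


def merged_min_itemsize(frames, headroom=0):
    """Take the widest value per column across all frames, plus headroom."""
    sizes = {}
    for frame in frames:
        for name, width in string_widths(frame).items():
            sizes[name] = max(sizes.get(name, 0), width + headroom)
    return sizes


first = pd.DataFrame({"num": [11, 12], "s": ["a" * 16, "b" * 14]})
later = pd.DataFrame({"num": [15], "s": ["X" * 200]})

min_itemsize = merged_min_itemsize([first, later], headroom=50)
print(min_itemsize)  # {'s': 250}
```

The resulting dict would then be passed as min_itemsize on the very first store.append call for a key, together with data_columns=True so that the column names in the dict exist in the table.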

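    【Comments】:

    • Thanks. I'm still a bit confused about how to implement this solution when iterating over multiple DataFrames and appending, e.g. for chunk in pd.csv_reader(): store.append(key, chunk, data_columns) or for i in range: df=pd.Dataframe(); store.append(key, chunk, data_columns), as in the answers here: stackoverflow.com/questions/39925077/… It seems that you run the script and, if there is an error, store.append to a new key?
    • @ShanZhengYang, you either need to know the maximum length of the values_block_0 column, or use a value that is certain to hold the maximum length, for example: min_itemsize={"values_block_0":1000}
    • The problem with this approach (i.e. using min_itemsize={"values_block_0":1000}) is that I get this error: ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column. Only after the first error, ValueError: Trying to store a string with len [200] in [values_block_0] column but this column has a limit of [16]!, has been raised is values_block_0 recognized as a column.
    • Should I be using something other than value_block_0?
    • @ShanZhengYang, can you post code that produces ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column?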
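    【Solution 2】:

    I started getting this error at about the same time I updated Pandas from 18.1 to 22.0 (though this may be unrelated).

    I fixed the error in the existing HDF5 file by manually reading the DataFrame and then writing a new HDF5 file with a larger min_itemsize for the column mentioned in the error: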

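    import pandas as pd

    filename_hdf5 = r"C:\test.h5"  # raw string, so "\t" is not read as a tab
    # Read the whole table into memory, then rewrite it with a wider string column.
    df = pd.read_hdf(filename_hdf5, 'table_name')
    hdf = pd.HDFStore(filename_hdf5)
    hdf.put('table_name', df, format='table', data_columns=True, min_itemsize={'ColumnNameMentionedInError': 10})
    hdf.close()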
    

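    I then updated the existing code to set min_itemsize when the key is created.


    Extra for experts

    The error occurs when you try to append more rows to an existing table whose fixed column width is too narrow for the new data. The fixed column width was originally set from the longest string in the column at the time the DataFrame was first written.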
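
To catch the problem before an append fails, a chunk can be compared against the existing widths first. This is a minimal sketch with assumptions: columns_exceeding is a hypothetical helper, the limits dict is filled in by hand (for example from the itemsize values shown by store.get_storer(key).table.description), and lengths are measured in UTF-8 bytes.

```python
import pandas as pd


def columns_exceeding(df, limits):
    """Return {column: longest encoded length} for columns that would overflow.

    `limits` maps a column name to the itemsize of the existing table.
    """
    too_long = {}
    for name, limit in limits.items():
        lengths = df[name].map(lambda v: len(str(v).encode("utf-8")))
        longest = int(lengths.max())
        if longest > limit:
            too_long[name] = longest
    return too_long


chunk = pd.DataFrame({"num": [15], "s": ["X" * 200]})

print(columns_exceeding(chunk, {"s": 16}))   # {'s': 200}
print(columns_exceeding(chunk, {"s": 250}))  # {}
```

A non-empty result means the table has to be rewritten with a larger min_itemsize before this chunk can be appended.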

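    I think pandas should handle this error transparently, rather than leaving what is effectively a time bomb for all future append operations. The problem can take weeks or even years to surface.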
