【Posted】: 2017-03-28 13:12:27
【Question】:
I just want to archive some Pandas DataFrames in an HDF5 store (.h5 file). Below is the code I'm using.
import numpy as np
import pandas as pd

# Fake data over N runs
Data_N = []
for n in range(5):
    Data_N.append(np.random.randn(5000,15,125))
# Create HDFStore object
store = pd.HDFStore('test.h5')
# For each run:
for n in range(len(Data_N)):
    Data = Data_N[n]
    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []
    nanbuff = np.nan*np.zeros((1,len(Data[0,0])))
    for i in range(len(Data)):
        # Prepend a row of NaNs to each layer as a separator row
        Data_i = np.vstack((nanbuff,Data[i,:,:]))
        Data_subDFs.append(pd.DataFrame(data = Data_i))
    Data_DF = pd.concat(Data_subDFs)
    # Row and column labels for the DataFrame
    Data_rows = []
    for i in range(len(Data)):
        # list(...) so the concatenation also works on Python 3
        Data_rows.append(['Layer %d:' % (i+1)] + list(range(1,len(Data[0])+1)))
    Data_DF.index = sum(Data_rows,[])
    Data_DF.columns = range(1,len(Data[0,0])+1)
    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (n+1), Data_DF)
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)
# Save the HDF5 file
store.close()
This gives the following output:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]
If I use the second version of put, it gives:
TypeError: Passing an incorrect value to a table column. Expected a Col (or
subclass) instance and got: "ObjectAtom()". Please make use of the Col(), or
descendant, constructor to properly initialize columns.
If I use the third version of put, it gives:
ValueError: cannot have non-object label DataIndexableCol
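Can anyone explain the different versions, and why I can't save what I believe is a valid Pandas DataFrame to HDF5 without pickling?
If it helps, I don't think I need to be able to append to the DataFrame/store. I just want the best-performing way to save a DF through the Pandas HDF5 interface.
Thanks!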
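The `inferred_type->mixed-integer, key->axis1` part of the warning points at the row labels rather than at the numeric data. A minimal sketch of how this could be checked, assuming a small hypothetical label list in place of the real index (`labels` is an illustrative name, not taken from the code above):

```python
import pandas as pd

# Hypothetical miniature of the row labels built in the question:
# one string per layer followed by integer row numbers.
labels = ['Layer 1:', 1, 2, 3, 'Layer 2:', 1, 2, 3]

# infer_dtype reports how pandas classifies the values in an object container.
mixed_kind = pd.api.types.infer_dtype(labels)

# The same labels converted to strings are classified as a single type.
string_kind = pd.api.types.infer_dtype([str(x) for x in labels])

print(mixed_kind, string_kind)
```

The sketch only shows how pandas classifies the labels. Whether a string-only index removes the warning in a given pandas/PyTables combination is not shown here.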
Edit 1:
I updated the code after "For each run:" to this:
# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])
    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []
    for i in range(len(Data)):
        Data_i = Data[i,:,:]
        Data_subDFs.append(pd.DataFrame(data = Data_i))
    Data_DF = pd.concat(Data_subDFs)
    # Row and column labels for the DataFrame
    L1 = np.zeros((l*m,1), dtype=object) # Layer number
    L2 = np.zeros((l*m,1), dtype=object) # Row number
    for i in range(l):
        for j in range(m):
            L1[i*m + j,0] = 'Layer %d' % (i+1)
            L2[i*m + j,0] = '%d' % (j+1)
    Data_DF.index = np.hstack((L1,L2))
    Data_DF.columns = range(1,n+1)
    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
But this gives the same warning or error for each put line.
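Edit 1 keeps both label levels in a single 2-D object array. One alternative, sketched here under the assumption that two named index levels are wanted, is to give pandas one 1-D label array per level through `pd.MultiIndex.from_arrays`. The sizes and the names `layer_labels` and `row_labels` are illustrative stand-ins, not part of the original code:

```python
import numpy as np
import pandas as pd

l, m = 3, 4  # small stand-ins for the layer and row counts

# One 1-D label array per level, instead of one 2-D object array.
# Each layer label is repeated m times; the row labels cycle l times.
layer_labels = np.repeat(['Layer %d' % (i + 1) for i in range(l)], m)
row_labels = np.tile(['%d' % (j + 1) for j in range(m)], l)

# Build a two-level index from the parallel label arrays.
mindex = pd.MultiIndex.from_arrays([layer_labels, row_labels],
                                   names=['Layer', 'Row'])

print(mindex[:5])
```

The result can be assigned to `Data_DF.index` as long as its length matches the number of rows (`l*m`).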
Edit 2 (this works!):
# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])
    # Pandas DataFrame for "flattened" fake data
    Data_DF = pd.DataFrame(Data.reshape(l*m,n))
    # Layer and row labels
    layers = np.arange(1,l+1)
    rows = np.arange(1,m+1)
    # Pandas multi-index
    mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])
    # DataFrame multi-index and column labels
    Data_DF.index = mindex
    Data_DF.columns = range(1,n+1)
    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
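The third put line still gives the same error, but since the second one works, I assume the third is simply not a valid command in this case.
The second line is also much faster than the first, and both are far faster than the pickling route. Thanks!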
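The third `put` variant may be failing for a reason unrelated to the index: `data_columns=True` turns every column into a queryable table column, and the error text suggests that the integer column labels (`range(1,n+1)`) are what gets rejected. This is a guess and is not confirmed anywhere in the post. A minimal sketch under that assumption, using hypothetical string labels `c1`, `c2`, ... with small stand-in sizes, and skipping the write when PyTables is not installed:

```python
import os
import tempfile

import numpy as np
import pandas as pd

try:
    import tables  # PyTables, required by pd.HDFStore
    HAVE_PYTABLES = True
except ImportError:
    HAVE_PYTABLES = False

l, m, n = 3, 4, 5  # small stand-ins for 5000, 15, 125
Data = np.random.randn(l, m, n)

# Same construction as in Edit 2, but with string column labels (assumption).
Data_DF = pd.DataFrame(Data.reshape(l * m, n))
Data_DF.index = pd.MultiIndex.from_product(
    [np.arange(1, l + 1), np.arange(1, m + 1)], names=['Layer', 'Row'])
Data_DF.columns = ['c%d' % k for k in range(1, n + 1)]

layer_2 = None
if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sketch.h5')  # throwaway file name
        with pd.HDFStore(path, mode='w') as store:
            store.put('Data_DF_1', Data_DF, format='table', data_columns=True)
            # Table format allows on-disk queries, here on an index level.
            layer_2 = store.select('Data_DF_1', where='Layer == 2')
    print(layer_2)
```

If appending and on-disk queries are not needed, the plain `store.put(key, df)` call from Edit 2 stays the simpler choice; the table format mainly pays off when `select` with a `where` clause is useful.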
【Discussion】:
Tags: python pandas dataframe hdf5 hdf