【Posted】:2020-08-30 17:55:58
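【Question】:
I can't read a parquet file as a dask dataframe, although I can read the same file with pandas. Please advise, I can't figure out what I'm missing! dask version == 1.0.0, pyarrow version == 0.13.0, pandas version == 0.23.4
Sample of the parquet file: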
  UniqueReference             DateTime  Consumption
0            ABCD  2018-08-01 00:00:00            9
1            EFGH  2018-08-01 01:00:00            0
2            IJKL  2018-08-01 02:00:00            0
3            MNOP  2018-08-01 03:00:00            0
import pyarrow
import dask.dataframe as dd
data = dd.read_parquet('myfile.parquet', engine = 'pyarrow')
Error traceback:
TypeError Traceback (most recent call last)
<ipython-input-22-068eb0627791> in <module>
----> 1 data = dd.read_parquet('myfile.parquet', engine = 'pyarrow').compute()
C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1152
1153 return read(fs, fs_token, paths, columns=columns, filters=filters,
-> 1154 categories=categories, index=index, infer_divisions=infer_divisions)
1155
1156
C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
685 pandas_metadata = json.loads(schema.metadata[b'pandas'].decode('utf8'))
686 index_names, column_names, storage_name_mapping, column_index_names = (
--> 687 _parse_pandas_metadata(pandas_metadata)
688 )
689 else:
C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\parquet.py in _parse_pandas_metadata(pandas_metadata)
89 # index name
90 index_names = list(index_storage_names) # make a copy
---> 91 index_storage_names2 = set(index_storage_names)
92 column_names = [name for (storage_name, name)
93 in pairs if name not in index_storage_names2]
TypeError: unhashable type: 'dict'
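The last frame points at a likely cause: `set(index_storage_names)` fails if the `index_columns` entry of the file's pandas metadata holds a dict instead of a plain column name, which is how some pyarrow releases describe a default RangeIndex. The sketch below reproduces the same TypeError using only the standard library. The metadata shown is a hypothetical example written for illustration, not read from the file in the question, and `try_hash_index_columns` is a made-up helper name.

```python
import json

# Hypothetical pandas metadata, shaped like what a Parquet writer may store
# for a DataFrame with a default RangeIndex (an assumption, not the real file).
pandas_metadata = json.loads("""
{
  "index_columns": [
    {"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}
  ],
  "columns": [
    {"name": "UniqueReference", "field_name": "UniqueReference"},
    {"name": "DateTime", "field_name": "DateTime"},
    {"name": "Consumption", "field_name": "Consumption"}
  ]
}
""")


def try_hash_index_columns(metadata):
    """Mimic the failing step: build a set from the index column descriptors.

    Returns the TypeError message if hashing fails, otherwise None.
    """
    try:
        set(metadata["index_columns"])
    except TypeError as exc:
        return str(exc)
    return None


error_message = try_hash_index_columns(pandas_metadata)
print(error_message)  # unhashable type: 'dict'
```

If the index were stored as a plain column name (a string), the same step would succeed, which is consistent with the file being readable once the reader understands the dict form.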
【Discussion】:
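- What are your dask and pyarrow versions? If they are not the latest, please upgrade them.
- dask version == 1.0.0, pyarrow version == 0.13.0, pandas version == 0.23.4
- Could you post a sample of your parquet file and the column types you have? Something looks wrong there.
- Try pip install dask --upgrade. The reason I asked you to upgrade is that the current version of dask requires pyarrow > 0.14.0.
- It works with pyarrow == '0.15.1', pandas == '0.25.1', dask == '2.24.0'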
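To compare an environment against the combination reported as working in this thread, a minimal sketch along these lines may help. It uses only the standard library (`importlib.metadata`, Python 3.8+). The `REPORTED_WORKING` table just repeats the versions from the last comment and is not an official requirement list; `version_tuple`, `is_at_least` and `check_environment` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as working in the thread. Treated here as a reference
# combination only, not as official minimum requirements.
REPORTED_WORKING = {"dask": "2.24.0", "pyarrow": "0.15.1", "pandas": "0.25.1"}


def version_tuple(text):
    """Turn '0.15.1' into (0, 15, 1); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, reference):
    """True if the installed version is the reference version or newer."""
    return version_tuple(installed) >= version_tuple(reference)


def check_environment(reference=REPORTED_WORKING):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, wanted in reference.items():
        try:
            report[name] = is_at_least(version(name), wanted)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, ok in check_environment().items():
    print(name, ok)
```

With the versions from the question (dask 1.0.0, pyarrow 0.13.0, pandas 0.23.4), every entry would come back False, which matches the advice to upgrade.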
Tags: python-3.x pandas dask parquet