How to read Parquet files from remote HDFS in Python using Dask/pyarrow

【Question title】: How to read Parquet files from remote HDFS in Python using Dask/pyarrow
【Posted】: 2020-07-23 07:26:47
【Question description】:

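Please help me read Parquet files from a remote HDFS (that is, one set up on a Linux server) in Python, using either Dask or pyarrow.

If there is a better way to do this than the two options above, please suggest that as well.

I tried the following code: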

from dask import dataframe as dd
df = dd.read_parquet('webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',engine='pyarrow',storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})
print(df)

The error is:

KeyError: "Collision between inferred and specified storage options:\n- 'host'\n- 'port'"

【Question discussion】:

Tags: python dask parquet pyarrow webhdfs

【Solution 1】:

Take a look at this issue: https://github.com/dask/dask/issues/2757

Have you tried using three slashes?

df = dd.read_parquet('webhdfs:///10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',engine='pyarrow',storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})

【Discussion】:

Using three slashes gives the following error: OSError: Passed non-file path: 10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet
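
The OSError above is consistent with how URLs are split in general: after `scheme://`, everything up to the next slash is the network location, so a third slash leaves the network location empty and pushes the address into the path. The sketch below shows this with the standard library only. It does not call Dask or fsspec, and the address 10.0.0.1:50070 is a hypothetical stand-in for the masked values in the question.

```python
from urllib.parse import urlsplit

# Hypothetical address; the question masks the real host and port.
two_slashes = urlsplit("webhdfs://10.0.0.1:50070/home/user/dir/sample.parquet")
three_slashes = urlsplit("webhdfs:///10.0.0.1:50070/home/user/dir/sample.parquet")

# Two slashes: the address is parsed as host and port, the rest is the path.
print(two_slashes.hostname, two_slashes.port, two_slashes.path)

# Three slashes: no host or port is found, and the address becomes
# the first component of the path.
print(three_slashes.hostname, three_slashes.port, three_slashes.path)
```

So the three-slash form only makes sense when the URL carries the file path alone and the host and port are supplied some other way.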

【Solution 2】:

You need to provide the host/port either in the URL or in the kwargs, not both. Both of the following should work:

df = dd.read_parquet('webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow', storage_options={'user': 'xxxxx'})

df = dd.read_parquet('webhdfs:///home/user/dir/sample.parquet',
    engine='pyarrow', storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})
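
To make the "either the URL or the kwargs, not both" rule concrete, here is a minimal standard-library sketch of the kind of check that produces the KeyError in the question. It is an illustration under assumptions, not the actual Dask or fsspec code: the function name `merge_storage_options`, the address 10.0.0.1:50070, and the user `alice` are all hypothetical.

```python
from urllib.parse import urlsplit


def merge_storage_options(url, storage_options):
    """Illustrative only: combine options inferred from the URL with explicit ones."""
    parts = urlsplit(url)
    inferred = {}
    if parts.hostname:
        inferred["host"] = parts.hostname
    if parts.port:
        inferred["port"] = parts.port

    # Any key supplied both ways is treated as a conflict.
    collisions = sorted(set(inferred) & set(storage_options))
    if collisions:
        listing = "\n".join(f"- {key!r}" for key in collisions)
        raise KeyError(
            "Collision between inferred and specified storage options:\n" + listing
        )
    return {**inferred, **storage_options}


# Host and port only in the URL.
url_only = merge_storage_options(
    "webhdfs://10.0.0.1:50070/home/user/dir/sample.parquet",
    {"user": "alice"},
)

# Host and port only in the keyword options (note the three slashes).
kwargs_only = merge_storage_options(
    "webhdfs:///home/user/dir/sample.parquet",
    {"host": "10.0.0.1", "port": 50070, "user": "alice"},
)

print(url_only == kwargs_only)
```

In this sketch both forms resolve to the same set of connection options, while passing the host and port in both places raises the KeyError, which matches the behaviour reported in the question.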

【Discussion】: