pyarrow: read parquet via column index or order? Answers

【Question title】: pyarrow read parquet via column index or order?
【Posted】: 2021-01-10 23:48:34
【Question description】:

Is there a workaround for selectively reading a parquet file by column index rather than by column name?

The documentation shows reading by column name:

pq.read_table('example.parquet', columns=['one', 'three'])

What I am looking for is something like this:

pq.read_table('example.parquet', columns=[0, 2])

Similar question: Pandas Read/Write Parquet Data using Column Index

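Update: my attempt

This is redundant, since it reads the whole file just to get the column names; I might as well drop the columns in memory with pandas or numpy.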

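import pyarrow.parquet as pq

desired_cols = [0, 2]

# First pass: load the entire table only to learn the column names
pat = pq.read_table('file.parquet.gzip')
cols_names = pat.column_names
del pat

# Second pass: translate positions to names and read just those columns
desired_cols = [cols_names[c] for c in desired_cols]
pq.read_table('file.parquet.gzip', columns=desired_cols)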

"""
pyarrow.Table
anzsic06: string
year: int64
"""

【Question comments】:

Tags: parquet pyarrow

【Solution 1】:

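You can open the file as a ParquetFile, which gives you the schema without loading the underlying data. From there you can look up the names of the columns you want by index and load only those columns: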

import pyarrow.parquet as pq

# Load only the metadata and look up the column names by position:
pq_file = pq.ParquetFile('file.parquet')
column_indices = [1, 2]
column_names = [pq_file.schema[i].name for i in column_indices]

# Load the actual data for just those columns:
pq.read_table('file.parquet', columns=column_names)

See http://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
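
The approach above can be wrapped in a small helper. This is only a sketch, not part of the original answer: the function names `resolve_column_names` and `read_table_by_index` are made up for illustration, and it uses `pq.read_schema` to fetch the Arrow schema from the file footer, assuming the columns of interest are top-level (non-nested) fields.

```python
from typing import List, Sequence


def resolve_column_names(all_names: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Map positional indices to column names, keeping the requested order.

    Hypothetical helper: negative indices count from the end, as with lists.
    """
    resolved = []
    for i in indices:
        if not -len(all_names) <= i < len(all_names):
            raise IndexError(
                f"column index {i} out of range for {len(all_names)} columns"
            )
        resolved.append(all_names[i])
    return resolved


def read_table_by_index(path, indices):
    """Read only the columns at the given positions from a Parquet file.

    Hypothetical wrapper around the two steps shown in the answer.
    """
    # Imported here so resolve_column_names can be used without pyarrow.
    import pyarrow.parquet as pq

    # read_schema returns the Arrow schema without loading column data.
    top_level_names = pq.read_schema(path).names
    return pq.read_table(path, columns=resolve_column_names(top_level_names, indices))
```

With a file like the one in the question, `read_table_by_index('file.parquet', [0, 2])` would then play the role of the wished-for `columns=[0, 2]` call.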

【Comments】: