Collapsing row-groups in Parquet efficiently

【Question title】: Collapsing row-groups in Parquet efficiently
【Posted】: 2019-05-17 16:03:10
【Question】:

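I have a large Parquet file with many small row groups. I would like to produce a new Parquet file with a single (bigger) row group, and I am working in Python. I could do something like: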

import pyarrow.parquet as pq
table = pq.read_table('many_tiny_row_groups.parquet')
pq.write_table(table, 'one_big_row_group.parquet')

# Lots of row groups...
print(pq.ParquetFile('many_tiny_row_groups.parquet').num_row_groups)
# Now, only 1 row group...
print(pq.ParquetFile('one_big_row_group.parquet').num_row_groups)

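However, this requires reading the entire Parquet file into memory at once, which I would like to avoid. Is there some kind of "streaming" approach that keeps the memory footprint small?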

【Question comments】:

Tags: python memory compression parquet

【Answer 1】:

The fastparquet documentation describes how to iterate over a dataset that is too large to fit in memory. For reading, you can use:

from fastparquet import ParquetFile

pf = ParquetFile('myfile.parquet')
for df in pf.iter_row_groups():
    # df is a pandas DataFrame holding a single row group
    print(df.shape)
    # process sub-data-frame df

For writing, you can append to the file.

【Comments】:

Thanks for the suggestion, but I don't think fastparquet's append=True option will help me merge row groups: "append: bool (False) If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data." Inspecting the output confirms that you end up with multiple row groups.