Read CSV with PyArrow

【Question title】: Read CSV with PyArrow
【Posted】: 2019-02-24 01:38:50
【Question】:

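I have large CSV files that I ultimately want to convert to Parquet. Pandas won't help, both because of memory constraints and because it struggles with NULL values (which are common in my data). I checked the PyArrow docs and found tools for reading Parquet files, but nothing about reading CSVs. Did I miss something, or is this feature just not available in PyArrow?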

【Comments】:

Tags: python pyarrow

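【Answer 1】:

You can read the CSV in chunks with pd.read_csv(chunksize=...) and then write one chunk at a time with PyArrow.

One caveat is that, as you mentioned, Pandas will give inconsistent dtypes if a column is all nulls within one chunk, so you have to make sure the chunk size is larger than the longest run of nulls in your data.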
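To pick a chunk size that satisfies this caveat, it helps to measure the longest run of consecutive nulls in each column first. The following is a minimal sketch using only the standard library; the function name `longest_null_runs` and the default `null_values` tuple are assumptions made for illustration, and the tuple should be adjusted to whatever your data (and pandas) treats as null.

```python
import csv
import io


def longest_null_runs(rows, null_values=("", "NA", "NULL")):
    """Return {column: length of the longest run of consecutive null cells}.

    `rows` is any iterable of CSV lines (an open file, io.StringIO, ...).
    `null_values` is an illustrative assumption, not pandas' full default list.
    """
    reader = csv.DictReader(rows)
    longest = {name: 0 for name in reader.fieldnames or []}
    current = dict(longest)
    for row in reader:
        for name in longest:
            # Missing trailing cells come back as None from DictReader.
            if row[name] is None or row[name] in null_values:
                current[name] += 1
                longest[name] = max(longest[name], current[name])
            else:
                current[name] = 0
    return longest


sample = io.StringIO("a,b\n1,\n,\n,x\n4,\n")
print(longest_null_runs(sample))  # {'a': 2, 'b': 2}
```

Choosing a chunk size above the largest value returned rules out an all-null full chunk. Note that the final chunk can be shorter than the chunk size, so a run of nulls at the very end of the file may still produce an all-null chunk.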

The following script reads CSV from stdin and writes Parquet to stdout (Python 3).

#!/usr/bin/env python
import sys

import pandas as pd
import pyarrow
import pyarrow.parquet

# This has to be big enough you don't get a chunk of all nulls: https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        if writer is None:
            # Create the writer lazily so it can reuse the first chunk's schema.
            # Timestamps have issues if you don't convert to ms. https://github.com/dask/fastparquet/issues/82
            writer = pyarrow.parquet.ParquetWriter(sys.stdout.buffer, table.schema, coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    # No writer exists if the input produced no chunks.
    if writer is not None:
        writer.close()

if __name__ == "__main__":
    main()

【Comments】:

【Answer 2】:

We are working on this feature, and there is now a pull request for it: https://github.com/apache/arrow/pull/2576. You can help by testing it!

【Comments】: