【Question title】: Pyarrow s3fs partition by timestamp
【Posted】: 2018-08-11 15:30:06
【Question description】:

When writing parquet files to S3, is it possible to partition the s3fs filesystem by "YYYY/MM/DD/HH" using a timestamp field in the pyarrow table?

【Comments】:

    Tags: python pyarrow


    【Solution 1】:

    I was able to achieve this with the pyarrow write_to_dataset function, which lets you specify partition columns to create subdirectories.

    Example:

    import s3fs
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    # Placeholders: replace with your own credentials and bucket name.
    access_key = "<access_key>"
    secret_key = "<secret_key>"
    bucket_name = "<bucket_name>"
    
    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
    
    bucket_uri = 's3://{0}/{1}'.format(bucket_name, "data")
    
    data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z', '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
            'battles': [34, 25, 26, 57],
            'citys': ['london', 'newyork', 'boston', 'boston']}
    df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])
    # Parse the ISO-style strings into pandas timestamps.
    df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")
    # Derive one column per partition level from the timestamp.
    df['year'], df['month'], df['day'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.day
    table = pa.Table.from_pandas(df)
    # Each distinct (year, month, day) combination becomes its own subdirectory.
    pq.write_to_dataset(table, bucket_uri, filesystem=fs, partition_cols=['year', 'month', 'day'], use_dictionary=True,  compression='snappy', use_deprecated_int96_timestamps=True)
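
The question asks for partitioning down to the hour, while the example above stops at the day. As a rough sketch of how the same idea could extend one level further: the snippet below derives year, month, day and hour from one of the sample timestamps and prints the Hive-style `column=value` directory that a partitioned dataset would be expected to use for that row. Note that this naming differs from a bare "YYYY/MM/DD/HH" path. The helper names `partition_values` and `hive_partition_path` are hypothetical and are not part of pyarrow; they only illustrate the directory naming.

```python
from datetime import datetime


def partition_values(ts):
    """Return the partition column values for a single timestamp."""
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}


def hive_partition_path(ts):
    """Build a Hive-style relative directory from the partition values."""
    parts = partition_values(ts)
    return "/".join("{0}={1}".format(name, value) for name, value in parts.items())


# Same timestamp format as the sample data in the answer above.
ts = datetime.strptime("2018-03-04T14:12:15.653Z", "%Y-%m-%dT%H:%M:%S.%fZ")
print(hive_partition_path(ts))  # year=2018/month=3/day=4/hour=14
```

In the answer's code, the equivalent change would presumably be to add an hour column, for example `df['hour'] = df['date'].dt.hour`, and pass `partition_cols=['year', 'month', 'day', 'hour']`.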
    

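    【Comments】:

    【Solution 2】:

    As far as I know: no.

    It can read partitioned data, but there is nothing related to writing it.

    The write functions are documented in several places, and none of them has a partitioning option.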

    Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

    https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L941

    https://issues.apache.org/jira/browse/ARROW-1858

    【Comments】:
