Can I store a Parquet file with a dictionary column having mixed types in their values? Answers

【Question title】: Can I store a Parquet file with a dictionary column having mixed types in their values?
【Posted】: 2020-11-25 21:35:33
【Question description】:

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am running into some issues. One of the columns of my Pandas DF contains dictionaries like this:

import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field": "Value" },
        { "Field": "Value2" },
        { "Field": "Value3" }
    ]
})

df.to_parquet("test.parquet")

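This works perfectly fine. The problem arises when one of the nested values of the dictionary has a different type than the others. For example:

import pandas as pd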

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field": "Value" },
        { "Field": "Value2" },
        { "Field": ["Value3"] }
    ]
})

df.to_parquet("test.parquet")

This raises the following error:

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')

Note that for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.
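To make the mismatch concrete, here is a minimal sketch that counts the Python types found under Field and lists the offending rows. It works on the plain list of dictionaries rather than on the DataFrame, and the names `col_c`, `type_counts` and `mixed_rows` are illustrative only.

```python
from collections import Counter

# The ColC values from the failing example above.
col_c = [
    {"Field": "Value"},
    {"Field": "Value2"},
    {"Field": ["Value3"]},
]

# Count how many rows hold each Python type under "Field".
type_counts = Counter(type(d["Field"]).__name__ for d in col_c)

# Indices of the rows whose "Field" is not a plain string.
mixed_rows = [i for i, d in enumerate(col_c) if not isinstance(d["Field"], str)]

print(type_counts)
print(mixed_rows)
```

More than one key in `type_counts` means the column cannot be mapped to a single Arrow type without a union.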

Is there any workaround that would let me store this DF as a Parquet file?

【Discussion】:

Tags: python pandas dataframe parquet pyarrow

【Solution 1】:

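ColC is a UDT (user-defined type) with a single field named Field, whose type is a union of String and List of String.

In theory Arrow supports this, but in practice it has a hard time figuring out what the type of ColC is. Even if you explicitly provide the schema of the DataFrame, it does not work, because this kind of conversion (converting unions from pandas to Arrow/Parquet) is not supported yet.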

import pandas as pd
import pyarrow as pa

union_type = pa.union(
    [pa.field("0",pa.string()), pa.field("1", pa.list_(pa.string()))],
    'dense'
)
col_c_type = pa.struct(
    [
        pa.field('Field', union_type)
    ]
)

schema=pa.schema(
    [
        pa.field('ColA', pa.int32()),
        pa.field('ColB', pa.string()),
        pa.field('ColC', col_c_type),
    ]
)

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field": "Value" },
        { "Field": "Value2" },
        { "Field": ["Value3"] }
    ]
})

pa.Table.from_pandas(df, schema)

This gives you the following error:

('Sequence converter for type union[dense]<0: string=0, 1: list<item: string>=1> not implemented', 'Conversion failed for column ColC with type object')

Even if you build the Arrow table manually, it cannot be converted to Parquet (again, unions are not supported).

import io
import pyarrow as pa
import pyarrow.parquet as pq

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

xs = pa.array(["Value", "Value2", None], type=pa.string())
ys = pa.array([None, None, ["value3"]], type=pa.list_(pa.string()))
types = pa.array([0, 0, 1], type=pa.int8())

col_c = pa.UnionArray.from_sparse(types, [xs, ys])

table = pa.Table.from_arrays(
    [col_a, col_b, col_c],
    schema=pa.schema([
        pa.field('ColA', col_a.type),
        pa.field('ColB', col_b.type),
        pa.field('ColC', col_c.type),
    ])
)

with io.BytesIO() as buffer:
    pq.write_table(table, buffer)

Unhandled type for Arrow to Parquet schema conversion: sparse_union<0: string=0, 1: list<item: string>=1>

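I think your only option for now is to use a struct in which the string value and the list-of-strings value are stored in fields with different names.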

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field1": "Value" },
        { "Field1": "Value2" },
        { "Field2": ["Value3"] }
    ]
})

df.to_parquet('/tmp/hello')
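If the data arrives in the original mixed shape, the renaming above can be automated. The following is a minimal sketch, not part of the original answer: `split_mixed_field` is a hypothetical helper that routes a string to Field1 and a list to Field2, and it fills the unused field with None so every row has the same keys.

```python
def split_mixed_field(record, key="Field"):
    """Hypothetical helper: put a string under Field1 and a list under Field2."""
    value = record.get(key)
    if isinstance(value, list):
        return {"Field1": None, "Field2": value}
    return {"Field1": value, "Field2": None}


# The ColC values in their original, mixed-type shape.
raw = [
    {"Field": "Value"},
    {"Field": "Value2"},
    {"Field": ["Value3"]},
]

# Each row now has one string field and one list field, never a mix.
normalized = [split_mixed_field(r) for r in raw]

for row in normalized:
    print(row)
```

Passing `normalized` as the ColC column of the DataFrame should let Arrow infer a single struct type with a string field and a list-of-strings field, so `df.to_parquet` can be called as above. Code that reads the file back then has to check which of the two fields is non-null.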

【Discussion】: