【Question title】: Explode array of structs to columns in pyspark
【Posted】: 2019-10-12 07:23:16
【Question】:

I want to explode an array of structs into columns (one column per struct field). For example, this schema

    root
     |-- news_style_super: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- name: string (nullable = true)
     |    |    |    |-- sbox_ctr: double (nullable = true)
     |    |    |    |-- wise_ctr: double (nullable = true)

should be transformed into

     |-- name: string (nullable = true)
     |-- sbox_ctr: double (nullable = true)
     |-- wise_ctr: double (nullable = true)

How can I do this?

Tags: pyspark explode


【Solution 1】:
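    from pyspark.sql.functions import explode
    from pyspark.sql.types import ArrayType, DoubleType, LongType, StringType, StructType
    from pyspark.sql.utils import AnalysisException


    def hasColumn(df, path):
        # Not shown in the original answer; assumed here to test whether `path` resolves in df.
        try:
            df.select(path)
            return True
        except AnalysisException:
            return False


    def get_final_dataframe(pathname, df):
        """Follow a dotted path, exploding arrays on the way; return (DataFrame, kind) or (-1, -1)."""
        cur_names = pathname.split(".")
        root_name = cur_names[0]
        if len(cur_names) > 1:
            new_path_name = ".".join(cur_names[1:])
            child_path = ".".join(cur_names[:2])
            for field in df.schema.fields:
                if field.name == root_name:
                    if isinstance(field.dataType, ArrayType):
                        # Explode one array level, then retry the same path.
                        return get_final_dataframe(pathname, df.select(explode(root_name).alias(root_name)))
                    elif isinstance(field.dataType, StructType):
                        if hasColumn(df, child_path):
                            # Selecting "a.b" yields a column named "b", so drop the first path segment.
                            return get_final_dataframe(new_path_name, df.select(child_path))
                        return -1, -1
                    else:
                        return -1, -1
        else:
            for field in df.schema.fields:
                if field.name == root_name:
                    if isinstance(field.dataType, StringType):
                        return df, "string"
                    elif isinstance(field.dataType, (LongType, DoubleType)):
                        return df, "numeric"
                    else:
                        return df, -1
        # No field matched the current path segment.
        return -1, -1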

Then you can call it like this:

    key = "a.b.c.name"
    # key = "context.content_feature.tag.name"
    df2, field_type = get_final_dataframe(key, df1)
