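【Question title】: Parse a string of JSONs in PySpark
【Posted】: 2019-05-08 19:16:30
【Question description】:

I am trying to parse a column whose values are strings, each holding a list of JSON objects. Even after trying several schemas built with StructType, StructField and so on, I cannot get the schema to apply.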

[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]

[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]

[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]

Following the SO post below, I was able to derive the JSON schema, but it still does not work even after applying the from_json function:

Pyspark: Parse a column of json strings

Can you help?

【Comments】:

    Tags: pyspark fromjson


    【Solution 1】:

    You can parse the given JSON with the schema definition below, passing it as the schema when reading the JSON file into a DataFrame.

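    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>>
    >>> # Each array element is an object with two string fields.
    >>> dschema = StructType([
    ...         StructField("event", StringType(), True),
    ...         StructField("count", StringType(), True)])
    >>>
    >>> # Read the file with an explicit schema; each line holds a JSON array.
    >>> df = spark.read.json('/<json_file_path>/json_file.json', schema=dschema)
    >>>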
    >>> df.show()
    +------------------+-----+
    |             event|count|
    +------------------+-----+
    |       empCreation|  148|
    |     jobAssignment|    3|
    |locationAssignment|   77|
    |       empCreation|  334|
    |     jobAssignment|   33|
    |locationAssignment|   73|
    |       empCreation|   18|
    |     jobAssignment|   32|
    |locationAssignment|   72|
    +------------------+-----+
    
    >>>
    

    Contents of the JSON file:

    cat json_file.json
    [{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
    [{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]
    [{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
    

    【Comments】:

      【Solution 2】:

      Thanks a lot @Lakshmanan, but I only needed a small change to the schema, wrapping the struct in an ArrayType:

      eventCountSchema = ArrayType(StructType([StructField("event", StringType(),True),StructField("count", StringType(),True)]), True)

      Then apply this schema to the DataFrame column that holds the complex data type.

      【Comments】:
