[Posted]: 2020-12-19 07:06:01
[Question]:
I have data with the following schema. The indexes attribute is a struct whose fields are arrays, and each array element is itself a struct:
root
|-- id_num: string (nullable = true)
|-- indexes: struct (nullable = true)
| |-- customer_rating: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- data_sufficiency_indicator: boolean (nullable = true)
| | | |-- value: double (nullable = true)
| | | |-- version: string (nullable = true)
| |-- reputation: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- data_sufficiency_indicator: boolean (nullable = true)
| | | |-- low_value_reason: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- value: double (nullable = true)
| | | |-- version: string (nullable = true)
| |-- visibility: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- data_sufficiency_indicator: boolean (nullable = true)
| | | |-- low_value_reason: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- value: double (nullable = true)
| | | |-- version: string (nullable = true)
I want to convert the schema to the following format and put the data values into the corresponding columns:
root
|-- id_num: string (nullable = true)
|-- indexes_type: string (nullable = true) --> holds the name of each field of the indexes struct, one per row
|-- data_sufficiency_indicator: boolean (nullable = true)
|-- value: double (nullable = true)
|-- version: string (nullable = true)
|-- low_value_reason: string (nullable = true) --> each element in the array becomes a new row
Here is sample input data in JSON format:
{"id_num":"1234","indexes":{"visibility":[{"version":"2.0","data_sufficiency_indicator":true,"value":2.16,"low_value_reason":["low scores from reviews_and_visits","low scores from online_presence"]}],"customer_rating":[{"version":"2.0","data_sufficiency_indicator":false}],"reputation":[{"version":"2.0","data_sufficiency_indicator":false}]}}
{"id_num":"5678","indexes":{"visibility":[{"version":"2.0","data_sufficiency_indicator":true,"value":2.71,"low_value_reason":["low scores from reviews_and_visits","low scores from online_presence"]}],"customer_rating":[{"version":"2.0","data_sufficiency_indicator":false}]}}
{"id_num":"9876","indexes":{"visibility":[{"version":"2.0","data_sufficiency_indicator":true,"value":3.06}],"customer_rating":[{"version":"2.0","data_sufficiency_indicator":false}],"reputation":[{"version":"2.0","data_sufficiency_indicator":false}]}}
Expected output:
id_num | indexes_type    | version | data_sufficiency_indicator | value | low_value_reason
============================================================================================================
1234   | visibility      | 2.0     | true                       | 2.16  | low scores from reviews_and_visits
1234   | visibility      | 2.0     | true                       | 2.16  | low scores from online_presence
1234   | customer_rating | 2.0     | false                      |       |
1234   | reputation      | 2.0     | false                      |       |
5678   | visibility      | 2.0     | true                       | 2.71  | low scores from reviews_and_visits
5678   | visibility      | 2.0     | true                       | 2.71  | low scores from online_presence
5678   | customer_rating | 2.0     | false                      |       |
9876   | visibility      | 2.0     | true                       | 3.06  |
9876   | customer_rating | 2.0     | false                      |       |
9876   | reputation      | 2.0     | false                      |       |
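Any help with this use case is greatly appreciated. Also, is it possible to get this output without hard-coding the struct field names in the code, since they can go beyond what is shown in the sample?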
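To pin down the row-expansion rule being asked for, here is a minimal sketch in plain Python (not Spark code). It takes the first sample record, walks whatever keys appear under indexes without naming them, and emits one row per array element and per low_value_reason entry. The function name flatten_record is hypothetical, and the choice to keep a row with an empty low_value_reason when the array is missing is an assumption read off the expected output.

```python
import json

def flatten_record(record):
    """Yield one flat dict per (index type, array element, low_value_reason)."""
    indexes = record.get("indexes") or {}
    # Iterate over whatever keys are present, so no struct name is hard-coded.
    for index_type, elements in indexes.items():
        for element in elements or []:
            # Assumption: a missing or empty reason list still produces one row
            # with a null reason (similar in spirit to an outer explode).
            reasons = element.get("low_value_reason") or [None]
            for reason in reasons:
                yield {
                    "id_num": record.get("id_num"),
                    "indexes_type": index_type,
                    "version": element.get("version"),
                    "data_sufficiency_indicator": element.get("data_sufficiency_indicator"),
                    "value": element.get("value"),
                    "low_value_reason": reason,
                }

# First sample record from the question.
sample = json.loads(
    '{"id_num":"1234","indexes":{"visibility":[{"version":"2.0",'
    '"data_sufficiency_indicator":true,"value":2.16,"low_value_reason":'
    '["low scores from reviews_and_visits","low scores from online_presence"]}],'
    '"customer_rating":[{"version":"2.0","data_sufficiency_indicator":false}],'
    '"reputation":[{"version":"2.0","data_sufficiency_indicator":false}]}}'
)

rows = list(flatten_record(sample))
for row in rows:
    print(row)
```

This only illustrates the intended logic on a single parsed record; a Spark solution would need to express the same expansion over the DataFrame columns.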
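[Comments]:
-
Are you able to control the data loading, i.e. specify the schema when using spark.read.json(..)?
-
@jxc I'm not sure I fully understand your question, but I think I can. Right now I load the full JSON file and convert it to Parquet format, which gives me the schema above. Could you help me solve this?
Tags: arrays struct pyspark apache-spark-sql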