【Question title】: Hive SQL: join a different table as an array of structs
【Posted】: 2021-01-20 07:46:23
【Question description】:

Suppose we have a table header

id |  col1 | col2
1  |  "a" | "b"
2  |  "c" | "d"

and a table body

header_id | body_id | body_col
1         | 6       | "abc"
1         | 7       | "def"
2         | 8       | "ghi"
2         | 9       | "jkl"

I want to insert body into header as an array of structs. In JSON, the result would look like this:

{
  id: 1,
  col1: "a",
  col2: "b",
  body: [{body_id: 6, body_col: "abc"}, {body_id: 7, body_col: "def"}]
},
{
  id: 2,
  col1: "c",
  col2: "d",
  body: [{body_id: 8, body_col: "ghi"}, {body_id: 9, body_col: "jkl"}]
}

How can I achieve this? AFAIK collect_set / collect_list won't work, because they only collect a whole column into a single array.

【Question comments】:

    Tags: sql apache-spark hive


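    【Solution 1】:

    You first have to join the two dataframes. After that, collect_list is in fact the right tool for what you want. You just need to wrap body_id and body_col in a struct first, so that each collected element carries both fields.

    The code looks like this: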

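    import org.apache.spark.sql.functions.{collect_list, struct}
    import spark.implicits._ // assumes a SparkSession named spark; enables the 'body_id syntax

    // Rename header_id so both sides share the join key, then collect one struct per body row
    val result = header
        .join(body.withColumnRenamed("header_id", "id"), Seq("id"))
        .groupBy("id", "col1", "col2")
        .agg(collect_list(struct('body_id, 'body_col)) as "body")
    result.show(false)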
    
    +---+----+----+--------------------+
    |id |col1|col2|body                |
    +---+----+----+--------------------+
    |2  |c   |d   |[[8, ghi], [9, jkl]]|
    |1  |a   |b   |[[6, abc], [7, def]]|
    +---+----+----+--------------------+
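
Since the question asks about SQL, the same join and aggregation can be sketched as a query run through spark.sql. This is only an illustration under assumptions: the local SparkSession, the object name NestBodySql, the helper nestBody, and the temp views are made up for the example, and the sample rows are copied from the question. named_struct is used so the struct fields get explicit names. Whether collect_list accepts struct arguments on plain Hive depends on the Hive version, so treat that part as something to verify on your cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NestBodySql {
  // Hypothetical helper: registers the sample tables and runs the SQL version of the answer.
  def nestBody(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample rows copied from the question, registered as temp views.
    Seq((1, "a", "b"), (2, "c", "d"))
      .toDF("id", "col1", "col2")
      .createOrReplaceTempView("header")
    Seq((1, 6, "abc"), (1, 7, "def"), (2, 8, "ghi"), (2, 9, "jkl"))
      .toDF("header_id", "body_id", "body_col")
      .createOrReplaceTempView("body")

    // Join on the header key, then collect one named struct per body row.
    spark.sql(
      """SELECT h.id, h.col1, h.col2,
        |       collect_list(named_struct('body_id', b.body_id, 'body_col', b.body_col)) AS body
        |FROM header h
        |JOIN body b ON b.header_id = h.id
        |GROUP BY h.id, h.col1, h.col2""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("nest-body-sql").getOrCreate()
    nestBody(spark).show(false)
    spark.stop()
  }
}
```

Note that the inner join drops header rows that have no matching body rows, and that the order of elements inside each collected array is not guaranteed.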
    

    We can also print the schema of the result, which has exactly the structure of your JSON:

    root
     |-- id: string (nullable = true)
     |-- col1: string (nullable = true)
     |-- col2: string (nullable = true)
     |-- body: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- body_id: string (nullable = true)
     |    |    |-- body_col: string (nullable = true)
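
To check that the nested schema really serializes to the JSON shape from the question, one option is Dataset.toJSON, which renders each row as one JSON string. The sketch below rebuilds the example end to end so it can run on its own; the object name NestBodyToJson, the helper nested, and the local SparkSession are assumptions for illustration, and the sample columns are typed as integers and strings rather than the all-string schema printed above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NestBodyToJson {
  // Hypothetical helper: same join and aggregation as the answer, on the question's sample rows.
  def nested(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val header = Seq((1, "a", "b"), (2, "c", "d")).toDF("id", "col1", "col2")
    val body = Seq((1, 6, "abc"), (1, 7, "def"), (2, 8, "ghi"), (2, 9, "jkl"))
      .toDF("header_id", "body_id", "body_col")

    header
      .join(body.withColumnRenamed("header_id", "id"), Seq("id"))
      .groupBy("id", "col1", "col2")
      .agg(collect_list(struct(col("body_id"), col("body_col"))).as("body"))
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("nest-body-json").getOrCreate()
    // Each output line is one header row with its body array nested inside.
    nested(spark).toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

For writing files instead of printing, result.write.json with an output path of your choosing produces one JSON object per line in the same shape.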
    

    【Comments】:
