【Question title】: Flatten BigQuery nested field contents into new columns instead of rows
【Posted】: 2016-08-08 22:41:56
【Question】:

I have some BigQuery data in the following format:

"thing": [
  {
    "name": "gameLost",
    "params": [
      {
        "key": "total_games",
        "val": {
          "str_val": "3",
          "int_val": null
        }
      },
      {
        "key": "games_won",
        "val": {
          "str_val": "2",
          "int_val": null
        }
      },
      {
        "key": "game_time",
        "val": {
          "str_val": "44",
          "int_val": null
        }
      }
    ],
    "dt_a": "1470625311138000",
    "dt_b": "1470620345566000"
  }
]

I know that the FLATTEN() function will produce 3 rows of output, like this:

+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| thing.name | thing.dt_a       | thing.dt_b       | thing.params.key   | thing.params.val.str_val | thing.params.val.int_val |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+
| gameLost   | 1470625311138000 | 1470620345566000 | total_games_played | 3                        | null                     |
|            |                  |                  |                    |                          |                          |
| gameLost   | 1470625311138000 | 1470620345566000 | games_won          | 2                        | null                     |
|            |                  |                  |                    |                          |                          |
| gameLost   | 1470625311138000 | 1470620345566000 | game_time          | 44                       | null                     |
+------------+------------------+------------------+--------------------+--------------------------+--------------------------+

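The higher-level keys/values are repeated on a new row for each deeper-level object.

However, I need the deeper-level keys/values to be output as entirely new columns rather than as repeated rows, so that the result looks like this: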

+------------+------------------+------------------+--------------------+-----------+-----------+
| thing.name | thing.dt_a       | thing.dt_b       | total_games_played | games_won | game_time |
+------------+------------------+------------------+--------------------+-----------+-----------+
| gameLost   | 1470625311138000 | 1470620345566000 | 3                  | 2         | 44        |
+------------+------------------+------------------+--------------------+-----------+-----------+

How can I do this?
Thanks!

【Comments】:

    Tags: google-bigquery


    【Answer 1】:

    Standard SQL makes this easier to express (uncheck "Use Legacy SQL" under "Show Options"):

    WITH T AS (
      SELECT STRUCT(
        "gameLost" AS name,
        ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
          STRUCT("total_games", STRUCT("3", NULL)),
          STRUCT("games_won", STRUCT("2", NULL)),
          STRUCT("game_time", STRUCT("44", NULL))] AS params,
        1470625311138000 AS dt_a,
        1470620345566000 AS dt_b) AS thing
    )
    SELECT
      (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
      thing.params[OFFSET(0)].val.str_val AS total_games_played,
      thing.params[OFFSET(1)].val.str_val AS games_won,
      thing.params[OFFSET(2)].val.str_val AS game_time
    FROM T;
    +-------------------------------------------------------------------------+--------------------+-----------+-----------+
    |                                  thing                                  | total_games_played | games_won | game_time |
    +-------------------------------------------------------------------------+--------------------+-----------+-----------+
    | {"name":"gameLost","dt_a":"1470625311138000","dt_b":"1470620345566000"} | 3                  | 2         | 44        |
    +-------------------------------------------------------------------------+--------------------+-----------+-----------+
    

    If you don't know the order of the keys in the array, you can use subselects to extract the relevant values by key:

    WITH T AS (
      SELECT STRUCT(
        "gameLost" AS name,
        ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
          STRUCT("total_games", STRUCT("3", NULL)),
          STRUCT("games_won", STRUCT("2", NULL)),
          STRUCT("game_time", STRUCT("44", NULL))] AS params,
        1470625311138000 AS dt_a,
        1470620345566000 AS dt_b) AS thing
    )
    SELECT
      (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
      (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "total_games") AS total_games_played,
      (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "games_won") AS games_won,
      (SELECT val.str_val FROM UNNEST(thing.params) WHERE key = "game_time") AS game_time
    FROM T;
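
    The subselects above repeat the same lookup once per key. As a rough sketch, that lookup could be wrapped in a temporary SQL function. `ParamValue` and `wanted_key` are hypothetical names that do not appear in the original answer, and the `LIMIT 1` is an added assumption so that a duplicated key returns one (arbitrary) match instead of failing the scalar subquery.

    ```sql
    -- Hypothetical helper (not part of the original answer):
    -- returns str_val for the given key, or NULL when the key is absent.
    CREATE TEMP FUNCTION ParamValue(
      params ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>,
      wanted_key STRING
    ) AS (
      (SELECT val.str_val FROM UNNEST(params) WHERE key = wanted_key LIMIT 1)
    );

    -- Same sample row as in the queries above.
    WITH T AS (
      SELECT STRUCT(
        "gameLost" AS name,
        ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
          STRUCT("total_games", STRUCT("3", NULL)),
          STRUCT("games_won", STRUCT("2", NULL)),
          STRUCT("game_time", STRUCT("44", NULL))] AS params,
        1470625311138000 AS dt_a,
        1470620345566000 AS dt_b) AS thing
    )
    SELECT
      (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
      ParamValue(thing.params, "total_games") AS total_games_played,
      ParamValue(thing.params, "games_won") AS games_won,
      ParamValue(thing.params, "game_time") AS game_time
    FROM T;
    ```

    This should yield the same single row as the subselect version; the function only saves repetition when many keys have to be turned into columns.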
    

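    【Comments】:

    • Love the new features of Standard SQL!!! Really do! That said, I don't think you can rely on order/offset to retrieve the values for keys unless the keys are guaranteed to be in a specific order, which in my experience is usually not the case.
    • Thanks! I updated my answer to show how that can work as well.
    • Thanks @MikhailBerlyant and @ElliottBrossard! You got me started, but I ran into more problems when trying to apply this to my more complex data source. I've opened a new, related question here: stackoverflow.com/questions/38860534/…
    【Answer 2】:

    Try the following (Legacy SQL):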

    SELECT 
      thing.name AS name,
      thing.dt_a AS dt_a,
      thing.dt_b AS dt_b,
      MAX(IF(thing.params.key = "total_games_played", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS total_games_played,
      MAX(IF(thing.params.key = "games_won", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS games_won,
      MAX(IF(thing.params.key = "game_time", INTEGER(thing.params.val.str_val), 0)) WITHIN RECORD AS game_time,
    FROM YourTable  
    

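    For Standard SQL, you can try the following (inspired by Elliott's answer; the important difference is that the array is ordered by key, so the order of the values is guaranteed):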

    WITH Temp AS (
      SELECT 
        (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
        ARRAY(SELECT val.str_val AS val FROM UNNEST(thing.params) ORDER BY key) AS params
      FROM YourTable
    )
    SELECT 
      thing, 
      params[OFFSET(2)] AS total_games_played,
      params[OFFSET(1)] AS games_won,
      params[OFFSET(0)] AS game_time
    FROM Temp 
    

    Note: if params contains other keys, you should add a WHERE clause to the SELECT inside ARRAY.
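
    As a sketch of what that filter could look like: the inline sample row standing in for YourTable and the extra `level` key are assumptions made up purely for illustration.

    ```sql
    -- Illustrative stand-in for YourTable, with one extra key ("level")
    -- that should not end up in the pivoted output.
    WITH YourTable AS (
      SELECT STRUCT(
        "gameLost" AS name,
        ARRAY<STRUCT<key STRING, val STRUCT<str_val STRING, int_val INT64>>>[
          STRUCT("total_games", STRUCT("3", NULL)),
          STRUCT("level", STRUCT("7", NULL)),
          STRUCT("games_won", STRUCT("2", NULL)),
          STRUCT("game_time", STRUCT("44", NULL))] AS params,
        1470625311138000 AS dt_a,
        1470620345566000 AS dt_b) AS thing
    ),
    Temp AS (
      SELECT
        (SELECT AS STRUCT thing.* EXCEPT (params)) AS thing,
        ARRAY(
          SELECT val.str_val
          FROM UNNEST(thing.params)
          -- Keep only the keys that become columns, so the offsets stay stable.
          WHERE key IN ("total_games", "games_won", "game_time")
          ORDER BY key
        ) AS params
      FROM YourTable
    )
    SELECT
      thing,
      -- Sorted key order: game_time, games_won, total_games.
      params[OFFSET(2)] AS total_games_played,
      params[OFFSET(1)] AS games_won,
      params[OFFSET(0)] AS game_time
    FROM Temp;
    ```

    If one of the three keys is missing from a row, the remaining values shift to lower offsets, so for irregular data the key-based subselects from Answer 1 are the safer choice.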

    【Comments】:
