[Question title]: BigQuery: Union two different tables which are based on federated Google Spreadsheet
[Posted]: 2018-01-11 01:18:07
[Question]:

I have two different Google Spreadsheets:

One with 4 columns:

+------+------+------+------+
| Col1 | Col2 | Col5 | Col6 |
+------+------+------+------+
| ID1  | A    | B    | C    |
| ID2  | D    | E    | F    |
+------+------+------+------+

And one with the same 4 columns as the first file, plus 2 additional columns:

+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
+------+------+------+------+------+------+
| ID3  | G    | H    | J    | K    | L    |
| ID4  | M    | N    | O    | P    | Q    |
+------+------+------+------+------+------+

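I have configured both as federated sources in Google BigQuery, and now I need to create a view that combines the data from the two tables.

Both tables have a Col1 column containing an ID. This ID is unique across all tables, so there is no duplicated data.

The result table I am looking for is: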

+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
+------+------+------+------+------+------+
| ID1  | A    | NULL | NULL | B    | C    |
| ID2  | D    | NULL | NULL | E    | F    |
| ID3  | G    | H    | J    | K    | L    |
| ID4  | M    | N    | O    | P    | Q    |
+------+------+------+------+------+------+

For the columns that the first file does not have, I expect NULL values.

I am using standard SQL. Here is a statement that generates sample data:

#standardSQL

WITH table1 AS (
  SELECT "A" as Col1, "B" as Col2, "C" AS Col3
  UNION ALL
  SELECT "D" as Col1, "E" as Col2, "F" AS Col3
),

table2 AS (
  SELECT "G" as Col1, "H" as Col2, "J" AS Col3, "K" AS Col4, "L" AS Col5
  UNION ALL
  SELECT "M" as Col1, "N" as Col2, "O" AS Col3, "P" AS Col4, "Q" AS Col5
)

A simple UNION ALL does not work because the tables have a different number of columns:

SELECT * FROM table1
UNION ALL
SELECT * FROM table2

Error: Queries in UNION ALL have mismatched column count; query 1 has 3 columns, query 2 has 5 columns at [17:1]

The wildcard operator is not an option either, because it is not supported for federated sources:

SELECT * FROM `table*`

Error: External tables cannot be queried through prefix

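Of course this is sample data with only 3 to 5 columns; the real tables have 20 to 40. An approach where I have to explicitly SELECT field by field is therefore not feasible.

Is there a feasible way to combine these two tables?

[Comments]:

    Tags: sql google-sheets google-bigquery bigquery-standard-sql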


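    [Solution 1]:

    You can pass the rows through a UDF to handle cases where the column names are not aligned by position or the tables have different numbers of columns. Here is an example: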

    CREATE TEMP FUNCTION CoerceRow(json_row STRING)
    RETURNS STRUCT<Col1 STRING, Col2 STRING, Col3 STRING, Col4 STRING, Col5 STRING>
    LANGUAGE js AS """
    return JSON.parse(json_row);
    """;
    
    WITH table1 AS (
      SELECT "A" as Col5, "B" as Col3, "C" AS Col2
      UNION ALL
      SELECT "D" as Col5, "E" as Col3, "F" AS Col2
    ),
    
    table2 AS (
      SELECT "G" as Col1, "H" as Col2, "J" AS Col3, "K" AS Col4, "L" AS Col5
      UNION ALL
      SELECT "M" as Col1, "N" as Col2, "O" AS Col3, "P" AS Col4, "Q" AS Col5
    )
    SELECT CoerceRow(json_row).*
    FROM (
      SELECT TO_JSON_STRING(t1) AS json_row
      FROM table1 AS t1
      UNION ALL
      SELECT TO_JSON_STRING(t2) AS json_row
      FROM table2 AS t2
    );
    +------+------+------+------+------+
    | Col1 | Col2 | Col3 | Col4 | Col5 |
    +------+------+------+------+------+
    | NULL | C    | B    | NULL | A    |
    | NULL | F    | E    | NULL | D    |
    | G    | H    | J    | K    | L    |
    | M    | N    | O    | P    | Q    |
    +------+------+------+------+------+
    

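    Note that the CoerceRow function has to declare the explicit row type that you want in the output. Apart from that, the columns of the tables being unioned are simply matched by name.

    [Comments]: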

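    • I am looking for a way that does not require configuring the column names one by one. The real tables have up to 20 to 30 additional columns, so maintaining that schema can be hard. Is there a more automatic way? If the tables were "real" tables rather than external ones, would the wildcard be a suitable approach?
    • The wildcard approach works for "real" tables. Given the constraints in your problem statement, though, spelling out the schema once is your best option. You could also consider filing a feature request for a kind of "natural" union by column name.
    • Awesome!!! The definition of genius is taking the complex and making it simple (Albert Einstein). Sometimes it is obvious, but I still learn a lot from Elliott's posts! Thanks!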
    [Solution 2]:

    "Is there a feasible way to combine these two tables?"

    #standardSQL
    SELECT *, NULL AS Col5, NULL AS Col6 FROM table1
    UNION ALL
    SELECT * FROM table2  
    

    You can check this with your example:

    #standardSQL
    WITH table1 AS (
      SELECT "ID1" AS Col1, "A" AS Col2, "B" AS Col3, "C" AS Col4 
      UNION ALL
      SELECT "ID2", "D", "E", "F"
    ),
    table2 AS (
      SELECT "ID3" AS Col1, "G" AS Col2, "H" AS Col3, "J" AS Col4, "K" AS Col5, "L" AS Col6
      UNION ALL
      SELECT "ID4", "M", "N", "O", "P", "Q" 
    )
    SELECT *, NULL AS Col5, NULL AS Col6 FROM table1
    UNION ALL
    SELECT * FROM table2
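
In the layout described in the question, the columns missing from the first sheet (Col3 and Col4) sit in the middle rather than at the end, so NULLs appended after `*` would not line up by position. A minimal sketch of the explicit-column variant follows. It assumes every column is a STRING and uses inline sample data as a stand-in for the two federated tables; with real external tables, only the final SELECT would go into the view.

```sql
#standardSQL
-- Inline sample data shaped like the two sheets in the question
-- (an assumed stand-in for the federated tables).
WITH table1 AS (
  SELECT "ID1" AS Col1, "A" AS Col2, "B" AS Col5, "C" AS Col6
  UNION ALL
  SELECT "ID2", "D", "E", "F"
),
table2 AS (
  SELECT "ID3" AS Col1, "G" AS Col2, "H" AS Col3, "J" AS Col4, "K" AS Col5, "L" AS Col6
  UNION ALL
  SELECT "ID4", "M", "N", "O", "P", "Q"
)
-- UNION ALL matches columns by position, so both branches list the
-- columns in the same order; typed NULLs fill the gaps in table1.
SELECT
  Col1,
  Col2,
  CAST(NULL AS STRING) AS Col3,
  CAST(NULL AS STRING) AS Col4,
  Col5,
  Col6
FROM table1
UNION ALL
SELECT Col1, Col2, Col3, Col4, Col5, Col6
FROM table2
ORDER BY Col1;
```

Because both branches spell out the column order, the result no longer depends on how the columns happen to be arranged in either sheet. The cost is that the column list has to be maintained by hand, which is the trade-off raised in the comments.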
    

    [Comments]:

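    • I am looking for a way that does not require configuring the column names one by one. The real tables have up to 20 to 30 additional columns, so maintaining that schema can be hard. Is there a more automatic way? If the tables were "real" tables rather than external ones, would the wildcard be a suitable approach?
    • I have the same problem, and I have too many columns to configure all the names manually. An outer union option in BQ would be very useful.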