【Question title】: BigQuery: combine tables based on closest timestamp and matching value
【Posted】: 2016-11-03 23:45:14
【Question】:

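I have two tables. For each row of table numberTwo, I need to get the hint value from the row in table numberOne that has the same cod value and the closest time when comparing time1 and time2.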

To make it easier to understand, this is what I need to do:

Table numberOne:

|  id |  cod  |   hint  |           time1         |
---------------------------------------------------
|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |
|  3  |  CDE  |    X    | 2016-11-03 19:00:00 UTC |
|  4  |  CDE  |    Y    | 2016-11-03 19:30:00 UTC |
|  5  |  EFG  |    Z    | 2016-11-03 18:00:00 UTC |

Table numberTwo:

|  id |  cod  |   value  |         time2           |
----------------------------------------------------
|  1  |  ABC  |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  ABC  |   h323   | 2016-11-03 11:30:00 UTC |
|  3  |  ABC  |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  CDE  |   abce   | 2016-11-03 19:10:00 UTC |

So, for row #1 of table numberTwo, I would take all the rows in table numberOne with cod: ABC:

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |

Between these two, I would pick the one whose timestamp is closest to time2:

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |

After processing every row, I would end up with a table like this:

Desired table:

|  id |  cod  |   hint  |   value  |         time2           |
--------------------------------------------------------------
|  1  |  ABC  |    V    |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  ABC  |    W    |   h323   | 2016-11-03 11:30:00 UTC |
|  3  |  ABC  |    W    |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  CDE  |    X    |   abce   | 2016-11-03 19:10:00 UTC |

【Comments】:

    Tags: mysql sql google-bigquery


    【Solution 1】:

    For BigQuery Standard SQL, try the query below.

    You can uncomment the commented-out block with sample data for a quick test.

    WITH 
    /*    
    TableNumberOne AS (
      SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
      SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
      SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
      SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
      SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 
    ),
    TableNumberTwo AS (
      SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
      SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
      SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
      SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2 
    ),
    */
    tempTable AS (
      SELECT 
        t2.id, t2.cod, t2.value, t2.time2, t1.hint, 
        ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value 
                          ORDER BY ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS win
      FROM TableNumberTwo AS t2
      JOIN TableNumberOne AS t1
      ON t1.cod = t2.cod
    )
    SELECT id, cod, hint, value, time2
    FROM tempTable
    WHERE win = 1
    

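    【Comments】:

    • What if I need to keep the rows that have no match (their hint would be NULL...)? How would I do that?
    • You would use LEFT JOIN instead of JOIN.
    • Oh OK, now I'm starting to get it, haha. Thank you very much for your help, you're the best.
    • Is there any other way of doing this? Because if I use the left join (included in the other question), the billing tier, starting from 68, is basically infinite (Requires 4628414464 or higher.) and keeps going up, making it impossible to run the query.
    • Wow, that's bad. Even 68 is big enough. I'll think about it and come back later tonight.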
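For readers following the LEFT JOIN suggestion in the comments above, here is a minimal sketch of what that change might look like when applied to the Solution 1 query. It is untested against the asker's real data. The sample rows are trimmed from the question, and the row with cod 'ZZZ' is a hypothetical addition that has no match in TableNumberOne.

```sql
-- Sketch only: Solution 1 with LEFT JOIN, so rows of TableNumberTwo
-- that have no matching cod are kept with a NULL hint.
WITH
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2, 'ABC', 'W', TIMESTAMP '2016-11-03 12:00:00 UTC' UNION ALL
  SELECT 3, 'CDE', 'X', TIMESTAMP '2016-11-03 19:00:00 UTC'
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 4, 'CDE', 'abce', TIMESTAMP '2016-11-03 19:10:00 UTC' UNION ALL
  -- Hypothetical row (not in the question): its cod does not exist in TableNumberOne.
  SELECT 5, 'ZZZ', 'nomatch', TIMESTAMP '2016-11-03 10:00:00 UTC'
),
tempTable AS (
  SELECT
    t2.id, t2.cod, t2.value, t2.time2, t1.hint,
    -- For an unmatched row the time difference is NULL, but the row is
    -- alone in its partition and therefore still gets win = 1.
    ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value
                      ORDER BY ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS win
  FROM TableNumberTwo AS t2
  LEFT JOIN TableNumberOne AS t1
  ON t1.cod = t2.cod
)
SELECT id, cod, hint, value, time2
FROM tempTable
WHERE win = 1
ORDER BY id
```

With this sample data the 'ZZZ' row is kept in the output with a NULL hint, whereas the inner JOIN version would drop it. As the later comments report, this variant can be expensive on large tables.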
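    【Solution 2】:

    "Is there any other way of doing this? Because if I use the left join (included in the other question), the billing tier, starting from 68, is basically infinite (Requires 4628414464 or higher.) and keeps going up, making it impossible to run the query."

    I think there are a few places to experiment with:

    1 - ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND)) - this function runs for every combination of rows produced by the join. You may want to try converting the respective time field of each table to seconds in a separate sub-select and using that sub-select instead of the original table, so that you only need something like ABS(t2.time2inSeconds - t1.time1inSeconds).

    2 - The use of ROW_NUMBER() is another potential source of expense. See the query below, where I tried to rewrite the logic completely. It is very much a blind attempt, because I cannot test it to see whether it really fixes or improves anything. It would be great if you could try it and report the result (billing tier).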

    WITH 
    /*    
    TableNumberOne AS (
      SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
      SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
      SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
      SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
      SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 
    ),
    TableNumberTwo AS (
      SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
      SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
      SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
      SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2 
    ),
    */
    tempTable1 AS (
      SELECT 
        t2.id, t2.cod, t2.value, 
        MIN(ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS delta 
      FROM TableNumberTwo AS t2
      JOIN TableNumberOne AS t1
      ON t1.cod = t2.cod
      GROUP BY t2.id, t2.cod, t2.value
    ),
    tempTable2 AS (
      SELECT a.id, a.cod, a.value, a.time2, b.delta
      FROM TableNumberTwo AS a 
      JOIN tempTable1 AS b 
      ON a.id = b.id AND a.cod = b.cod AND a.value = b.value
    )
    SELECT a.id, a.cod, t1.hint, a.value, a.time2
    FROM tempTable2 AS a
    JOIN TableNumberOne AS t1
    ON t1.cod = a.cod AND ABS(TIMESTAMP_DIFF(a.time2, t1.time1, SECOND)) = delta   
    

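    Note: the query above still needs to be completed, because it can return several matching rows from TableNumberOne when they are equally close to the corresponding row in TableNumberTwo. For now, it is only meant to check whether the cost issue is fixed or at least improved.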
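One possible way to complete it, sketched under the assumption that any single one of the equally close hints is acceptable: group the final join by the TableNumberTwo row and keep one hint with MIN(), which is an arbitrary but deterministic choice. This is not part of the original answer. The second TableNumberOne row is a hypothetical tie created for illustration, and carrying time2 through the GROUP BY (instead of using tempTable2) is my own simplification.

```sql
-- Sketch only: collapse ties so each TableNumberTwo row yields exactly one output row.
WITH
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  -- Hypothetical row: exactly as far from 18:20 as row 1, which creates a tie.
  SELECT 2, 'ABC', 'W', TIMESTAMP '2016-11-03 18:40:00 UTC'
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2
),
tempTable1 AS (
  -- Smallest absolute time difference for each TableNumberTwo row.
  SELECT
    t2.id, t2.cod, t2.value, t2.time2,
    MIN(ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS delta
  FROM TableNumberTwo AS t2
  JOIN TableNumberOne AS t1
  ON t1.cod = t2.cod
  GROUP BY t2.id, t2.cod, t2.value, t2.time2
)
-- Both 'V' and 'W' match delta here; MIN(t1.hint) keeps only one of them.
SELECT a.id, a.cod, MIN(t1.hint) AS hint, a.value, a.time2
FROM tempTable1 AS a
JOIN TableNumberOne AS t1
ON t1.cod = a.cod AND ABS(TIMESTAMP_DIFF(a.time2, t1.time1, SECOND)) = a.delta
GROUP BY a.id, a.cod, a.value, a.time2
```

With the tie in this sample data, the query returns a single row with hint 'V'. If a different rule is wanted (for example, preferring the earlier time1), the aggregate in the final SELECT would have to change accordingly.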

    3 - By the way, the cause could also be skew in your data, etc.
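Point 1 above could be sketched roughly as follows, using UNIX_SECONDS to convert each timestamp once per row before the join. The sub-select names one and two are hypothetical, and time1inSeconds and time2inSeconds follow the naming used in point 1. Whether this actually lowers the billing tier is untested.

```sql
-- Sketch only: pre-compute epoch seconds so the join compares plain integers.
WITH
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2, 'ABC', 'W', TIMESTAMP '2016-11-03 12:00:00 UTC' UNION ALL
  SELECT 3, 'CDE', 'X', TIMESTAMP '2016-11-03 19:00:00 UTC' UNION ALL
  SELECT 4, 'CDE', 'Y', TIMESTAMP '2016-11-03 19:30:00 UTC' UNION ALL
  SELECT 5, 'EFG', 'Z', TIMESTAMP '2016-11-03 18:00:00 UTC'
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2, 'ABC', 'h323', TIMESTAMP '2016-11-03 11:30:00 UTC' UNION ALL
  SELECT 3, 'ABC', 'rewq', TIMESTAMP '2016-11-03 09:00:00 UTC' UNION ALL
  SELECT 4, 'CDE', 'abce', TIMESTAMP '2016-11-03 19:10:00 UTC'
),
-- Hypothetical sub-selects: each timestamp is converted exactly once per row.
one AS (
  SELECT cod, hint, UNIX_SECONDS(time1) AS time1inSeconds
  FROM TableNumberOne
),
two AS (
  SELECT id, cod, value, time2, UNIX_SECONDS(time2) AS time2inSeconds
  FROM TableNumberTwo
),
tempTable AS (
  SELECT
    t2.id, t2.cod, t2.value, t2.time2, t1.hint,
    ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value
                      ORDER BY ABS(t2.time2inSeconds - t1.time1inSeconds)) AS win
  FROM two AS t2
  JOIN one AS t1
  ON t1.cod = t2.cod
)
SELECT id, cod, hint, value, time2
FROM tempTable
WHERE win = 1
ORDER BY id
```

The logic is the same as in Solution 1; only the distance expression changes. The same substitution could be applied to the MIN-based query in point 2.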
