【Question title】: Querying in BigQuery?
【Posted】: 2016-02-09 03:26:36
【Question】:

I have a Package table in BigQuery that looks like this:

 Packageid  Scanid  dispatchid  timestamp   status
   p1         s1       null        t1        'in'
   p2         s1       xxx         t2        'in'
   p1         s2       yyy         t3        'pkin'
   p1         s3       sss         t4        'iwi'
   p1         s4       eee         t5        'lhp'
   p2         s2       uuuu        t6        'uio'
   p2         s3       null        t7        'jsk'

I want to retrieve the following details:

Packageid   Latest-Scanid   First-Dispatch-time  Last-Dispatch-time   latest-status

 p1            s4                 t3                 t5                 'lhp'
 p2            s3                 t2                 t6                 'jsk'  

First-Dispatch-time is the timestamp of the first scan of a package that has a dispatch id. Last-Dispatch-time is the timestamp of the last scan of that package that has a dispatch id.

Is there a way to produce the table above using a BigQuery query or a user-defined function in BigQuery?

【Comments】:

    Tags: sql database google-bigquery user-defined-functions


    【Solution 1】:

    One approach uses window functions and conditional aggregation:

    select packageid,
           max(case when seqnum = 1 then scanid end) as latest_scanid,
           min(case when dispatchid is not null then timestamp end) as first_dispatch_time,
           max(case when dispatchid is not null then timestamp end) as last_dispatch_time,
           max(case when seqnum = 1 then status end) as latest_status
    from (select t.*,
                 -- seqnum = 1 marks the most recent scan of each package
                 row_number() over (partition by packageid order by timestamp desc) as seqnum
          from t
         ) t
    group by packageid;
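
The same conditional aggregation can also be written without the subquery. The following is a minimal sketch in BigQuery standard SQL, not part of the original answer. The CTE name `package`, the column name `ts`, and the integer values 1..7 standing in for t1..t7 are assumptions made for illustration.

```sql
-- Sample rows from the question; integers 1..7 stand in for t1..t7 (assumption).
WITH package AS (
  SELECT 'p1' AS packageid, 's1' AS scanid,
         CAST(NULL AS STRING) AS dispatchid, 1 AS ts, 'in' AS status UNION ALL
  SELECT 'p2', 's1', 'xxx',  2, 'in'   UNION ALL
  SELECT 'p1', 's2', 'yyy',  3, 'pkin' UNION ALL
  SELECT 'p1', 's3', 'sss',  4, 'iwi'  UNION ALL
  SELECT 'p1', 's4', 'eee',  5, 'lhp'  UNION ALL
  SELECT 'p2', 's2', 'uuuu', 6, 'uio'  UNION ALL
  SELECT 'p2', 's3', NULL,   7, 'jsk'
)
SELECT
  packageid,
  -- Keep only the newest scan per package: order by ts descending, take one element.
  ARRAY_AGG(scanid ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS latest_scanid,
  -- MIN and MAX skip NULLs, so scans without a dispatchid are ignored.
  MIN(IF(dispatchid IS NOT NULL, ts, NULL)) AS first_dispatch_time,
  MAX(IF(dispatchid IS NOT NULL, ts, NULL)) AS last_dispatch_time,
  ARRAY_AGG(status ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS latest_status
FROM package
GROUP BY packageid
ORDER BY packageid;
```

With the sample rows this should give (p1, s4, 3, 5, lhp) and (p2, s3, 2, 6, jsk), which corresponds to the table the question asks for.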
    

    【Comments】:

      【Solution 2】:

      Note that this is written for SQL Server; it may or may not work in MySQL.

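      SELECT p.Packageid,
          -- latest scan and status come from the row with the highest timestamp
          (SELECT TOP 1 s.Scanid FROM Package s WHERE s.Packageid = p.Packageid ORDER BY s.[timestamp] DESC) AS [Latest-Scanid],
          -- only scans that carry a dispatch id count towards the dispatch times
          MIN(CASE WHEN p.dispatchid IS NOT NULL THEN p.[timestamp] END) AS [First-Dispatch-time],
          MAX(CASE WHEN p.dispatchid IS NOT NULL THEN p.[timestamp] END) AS [Last-Dispatch-time],
          (SELECT TOP 1 s.status FROM Package s WHERE s.Packageid = p.Packageid ORDER BY s.[timestamp] DESC) AS [latest-status]
      FROM Package p
      GROUP BY p.Packageid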
      

      【Comments】:

        【Solution 3】:

        The query below uses a "dirty" trick (see not_null_ts) that lets you drop the outer GROUP BY and compute everything in the inner SELECT instead:

        SELECT packageid, latest_scanid, first_dispatch_time, last_dispatch_time, latest_status
        FROM (
          SELECT packageid, 
            IF(dispatchid IS NULL, NULL, ts) AS not_null_ts,
            FIRST_VALUE(scanid) OVER(PARTITION BY packageid ORDER BY ts DESC) AS latest_scanid,
            MIN(not_null_ts) OVER(PARTITION BY packageid) AS first_dispatch_time,
            MAX(not_null_ts) OVER(PARTITION BY packageid) AS last_dispatch_time,
            FIRST_VALUE(status) OVER(PARTITION BY packageid ORDER BY ts DESC) AS latest_status,
            ROW_NUMBER() OVER(PARTITION BY packageid ORDER BY not_null_ts DESC) AS line
          FROM YourTable 
        )
        WHERE line = 1
        

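        I discovered some time ago that this trick works for me, but I don't think I have ever seen it explicitly documented, unless it is simply an obvious usage. I never gave it much thought.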
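
As far as I know, reusing the not_null_ts alias inside the same SELECT list relies on legacy SQL behaviour; standard SQL does not let one select-list expression refer to another's alias. A rough standard SQL sketch of the same window-only idea inlines the IF expression instead. The sample CTE, with integers 1..7 standing in for t1..t7, is an assumption added so the query is self-contained.

```sql
-- Sample rows from the question; integers 1..7 stand in for t1..t7 (assumption).
WITH YourTable AS (
  SELECT 'p1' AS packageid, 's1' AS scanid,
         CAST(NULL AS STRING) AS dispatchid, 1 AS ts, 'in' AS status UNION ALL
  SELECT 'p2', 's1', 'xxx',  2, 'in'   UNION ALL
  SELECT 'p1', 's2', 'yyy',  3, 'pkin' UNION ALL
  SELECT 'p1', 's3', 'sss',  4, 'iwi'  UNION ALL
  SELECT 'p1', 's4', 'eee',  5, 'lhp'  UNION ALL
  SELECT 'p2', 's2', 'uuuu', 6, 'uio'  UNION ALL
  SELECT 'p2', 's3', NULL,   7, 'jsk'
)
SELECT packageid, latest_scanid, first_dispatch_time, last_dispatch_time, latest_status
FROM (
  SELECT
    packageid,
    FIRST_VALUE(scanid) OVER (PARTITION BY packageid ORDER BY ts DESC) AS latest_scanid,
    -- The IF expression is repeated inline because the alias cannot be reused here.
    MIN(IF(dispatchid IS NULL, NULL, ts)) OVER (PARTITION BY packageid) AS first_dispatch_time,
    MAX(IF(dispatchid IS NULL, NULL, ts)) OVER (PARTITION BY packageid) AS last_dispatch_time,
    FIRST_VALUE(status) OVER (PARTITION BY packageid ORDER BY ts DESC) AS latest_status,
    -- Every row of a package carries the same computed values, so any single row will do.
    ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY ts DESC) AS line
  FROM YourTable
)
WHERE line = 1
ORDER BY packageid;
```

The outer WHERE line = 1 plays the role the GROUP BY plays in the aggregate version: it collapses each package to one row after the window functions have filled in the per-package values.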

        【Comments】:
