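[Posted]: 2020-04-08 05:23:14
[Question]:
I ended up solving this by downloading all the data and iterating over it in Python, but I would like to know whether there is a way to do it in BigQuery.
We have a table with begin and end dates: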
begin_date, end_date
'2016-02-19', '2016-02-19'
'2016-02-20', '2016-02-25'
'2016-02-21', '2016-02-25'
'2016-02-22', NULL
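For each date, we want the number of rows where begin_date is on or before that date and end_date is on or after it (or NULL). For a single date, that is: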
SELECT COUNT(*) FROM `table` WHERE begin_date <= '2016-12-19' AND (end_date >= '2016-12-19' OR end_date IS NULL)
So if I ran this by hand for every date of interest, the desired output would look something like this:
begin_date, count
2016-02-19, 1
2016-02-20, 1
2016-02-21, 2
2016-02-22, 3
2016-02-23, 3
2016-02-24, 3
2016-02-25, 3
2016-02-26, 1
etc.
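As a sanity check on the table above, the per-date count can be reproduced from the four sample rows with a brute-force sketch in Python. The names `rows` and `count_active` are made up for illustration, and the date window 2016-02-19 to 2016-02-26 is assumed from the listed output.

```python
from datetime import date, timedelta

# Sample rows from the question; None stands in for a NULL end_date.
rows = [
    (date(2016, 2, 19), date(2016, 2, 19)),
    (date(2016, 2, 20), date(2016, 2, 25)),
    (date(2016, 2, 21), date(2016, 2, 25)),
    (date(2016, 2, 22), None),
]

def count_active(rows, day):
    """Count rows with begin_date <= day AND (end_date >= day OR end_date IS NULL)."""
    return sum(
        1
        for begin, end in rows
        if begin <= day and (end is None or end >= day)
    )

# Assumed window: the dates shown in the desired output.
start, stop = date(2016, 2, 19), date(2016, 2, 26)
days = [start + timedelta(days=i) for i in range((stop - start).days + 1)]
counts = {day: count_active(rows, day) for day in days}

for day, n in counts.items():
    print(day.isoformat(), n)
```

This checks every row against every date, so it is only meant to pin down the expected result, not to scale.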
Creating the list of dates to iterate over is easy:
WITH dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example)
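What I am struggling with is applying the WHERE clause above across all of those dates. I see how partitioning with a range works when matching on a single column (like here), but I need to match on both begin_date and end_date at once.
I thought I could do something like this: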
SELECT
status_begin_date,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE (e >= status_begin_date OR e IS NULL)) AS cnt
FROM (
SELECT
status_begin_date,
ARRAY_AGG(status_end_date) OVER(ORDER BY status_begin_date) AS ends
FROM `table`
)
ORDER BY status_begin_date
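Taken from here. This works on the small example given in that StackOverflow answer, but I get a resource error when I run it on a table with a few hundred million rows. Is there a scalable solution in BigQuery?
[Comments]: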
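For reference, one way to avoid comparing every row with every date is an event-based running sum: add 1 on each begin_date, subtract 1 on the day after each end_date, and accumulate. The sketch below is in Python and is not taken from the thread. `active_counts` is a hypothetical helper, and whether the same idea carries over to a windowed SUM in BigQuery at this scale is an untested assumption.

```python
from datetime import date, timedelta
from itertools import accumulate

# Sample rows from the question; None stands in for a NULL end_date.
rows = [
    (date(2016, 2, 19), date(2016, 2, 19)),
    (date(2016, 2, 20), date(2016, 2, 25)),
    (date(2016, 2, 21), date(2016, 2, 25)),
    (date(2016, 2, 22), None),
]

def active_counts(rows, start, stop):
    """Return {day: number of active rows} for every day in [start, stop]."""
    n_days = (stop - start).days + 1
    delta = [0] * (n_days + 1)  # extra slot for intervals ending on the last day
    for begin, end in rows:
        if end is not None and end < begin:
            continue  # malformed interval, never active
        if begin > stop or (end is not None and end < start):
            continue  # interval lies entirely outside the window
        first = max((begin - start).days, 0)
        # A NULL end_date keeps the row active through the end of the window.
        last = n_days - 1 if end is None else min((end - start).days, n_days - 1)
        delta[first] += 1      # row becomes active
        delta[last + 1] -= 1   # row stops being active the day after `last`
    running = list(accumulate(delta[:n_days]))
    return {start + timedelta(days=i): running[i] for i in range(n_days)}

result = active_counts(rows, date(2016, 2, 19), date(2016, 2, 26))
for day, n in result.items():
    print(day.isoformat(), n)
```

Each row contributes two events regardless of how many days it spans, so the work grows with the number of rows plus the number of dates rather than their product.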
-
Can you try this code? SELECT begin_date, COUNT(*) FROM `table` CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY begin_date ORDER BY begin_date
-
Let me know if this is what you were looking for.
-
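@rmesteves Thanks, but this does not give the same results. I am not sure what the difference is. Sometimes the values are higher than expected and sometimes lower.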
-
@rmesteves You can see the difference in the results with:
WITH data AS (
SELECT DATE('2016-02-19') AS begin_date, DATE('2016-02-19') AS end_date
UNION ALL SELECT '2016-02-20', '2016-02-25'
UNION ALL SELECT '2016-02-21', '2016-02-25'
UNION ALL SELECT '2016-02-22', NULL
),
dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2016-02-19', '2016-02-26', INTERVAL 1 DAY)) AS example)
SELECT begin_date, COUNT(*)
FROM data CROSS JOIN dates
WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL)
GROUP BY begin_date
ORDER BY begin_date
-
Can you try my code with this small change? It might improve the results: SELECT example, COUNT(*) FROM `table` CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY example ORDER BY example
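The gap between the two GROUP BY choices in these comments can be checked on the four sample rows. The Python sketch below emulates the CROSS JOIN and WHERE filter, then groups the surviving pairs once by begin_date and once by the generated date. Variable names are illustrative, and this is only one plausible reading of why the first query's numbers looked off.

```python
from collections import Counter
from datetime import date, timedelta

# Sample rows from the question; None stands in for a NULL end_date.
rows = [
    (date(2016, 2, 19), date(2016, 2, 19)),
    (date(2016, 2, 20), date(2016, 2, 25)),
    (date(2016, 2, 21), date(2016, 2, 25)),
    (date(2016, 2, 22), None),
]

start, stop = date(2016, 2, 19), date(2016, 2, 26)
dates = [start + timedelta(days=i) for i in range((stop - start).days + 1)]

# CROSS JOIN + WHERE: keep (begin_date, example) pairs where the row is active on that day.
pairs = [
    (begin, day)
    for begin, end in rows
    for day in dates
    if begin <= day and (end is None or end >= day)
]

by_begin_date = Counter(begin for begin, _ in pairs)  # GROUP BY begin_date
by_example = Counter(day for _, day in pairs)         # GROUP BY example

print("GROUP BY begin_date:", [by_begin_date[b] for b, _ in rows])
print("GROUP BY example:   ", [by_example[d] for d in dates])
```

Grouping by begin_date counts how many days in the window each row covers, while grouping by example counts how many rows are active on each day, which is the quantity the question asks for.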
Tags: arrays date google-bigquery intervals