【Question title】: ClickHouse - how to group by with forward filling missing values
【Posted】: 2021-12-25 14:24:22
【Question】:

I have a table in ClickHouse with the following structure:

x_id | y_id | z_id | rank | timestamp
1231 | 1324 | 9412 | 1    | 2021-03-12 00:13:34
121  | 5524 | 765  | 21   | 2021-03-13 15:43:21
54   | 76   | 8822 | 125  | 2021-05-14 17:23:12
213  | 61   | 7651 | 51   | 2021-03-16 12:15:43
53   | 65   | 123  | 23   | 2021-03-12 13:28:54
1231 | 432  | 7651 | 1541 | 2021-03-12 16:54:24
...

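For some weeks there are no records for a particular group (x_id, y_id, z_id). In that case I need to take that group's previous rank (from the preceding week), if such a value exists.

For example: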

group_ids, rank, timestamp
(1, 1, 1, 25, '2021-03-12 00:13:34') -> group (1, 1, 1), week 2021-03-08
(2, 2, 2, 30, '2021-03-16 00:13:34') -> group (2, 2, 2), week 2021-03-15

no data for group (1, 1, 1) for week 2021-03-15 - fill from the previous week and set "week" as the current week:

(1, 1, 1, 25, 2021-03-15)

and so on ...
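
The week labels in this example are Monday-based, which matches `toStartOfWeek(..., 1)` as used in the query in this question. As a minimal sketch of that bucketing in application code (the helper name `week_start` is hypothetical and not part of ClickHouse), the mapping could look like this:

```python
from datetime import date, datetime, timedelta


def week_start(ts: datetime) -> date:
    """Return the Monday of the week containing ts.

    Hypothetical helper that mirrors Monday-based week truncation.
    """
    d = ts.date()
    # date.weekday() is 0 for Monday, so subtracting it lands on Monday.
    return d - timedelta(days=d.weekday())


print(week_start(datetime(2021, 3, 12, 0, 13, 34)))  # 2021-03-08
print(week_start(datetime(2021, 3, 16, 0, 13, 34)))  # 2021-03-15
```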

I then compute metrics over this data using a subquery:

SELECT
    week,
    SUM(CASE
        WHEN rank BETWEEN 1 AND 3 THEN 1
        ELSE 0
    END) AS metric1,
    /* ... */
FROM (
    SELECT min(rank) AS rank, toStartOfWeek(timestamp, 1) AS week FROM table GROUP BY week, x_id, y_id, z_id
) GROUP BY week ORDER BY week;


metric1 | metric2 |  week
0       |  2      |  2021-03-22
1       |  0      |  2021-03-29 
0       |  1      |  2021-04-05

Is it possible to build a query that forward fills the missing values?

【Question comments】:

    Tags: group-by missing-data clickhouse


    【Solution 1】:

    You can use the WITH FILL modifier: https://clickhouse.com/docs/en/sql-reference/statements/select/order-by/#orderby-with-fill

    I guess your query contains this part:

    ORDER BY week
    

    Just extend it to

    ORDER BY week WITH FILL STEP 7
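
Note that WITH FILL creates the missing rows but fills the other columns with default values (0 for numbers) rather than with the previous row's values. The following is a rough Python emulation of that behavior, not ClickHouse code: the function `with_fill` is hypothetical, and it assumes all input weeks are already aligned to the same 7-day grid.

```python
from datetime import date, timedelta


def with_fill(rows, step_days=7, default=0):
    """Emulate ORDER BY week WITH FILL STEP n for (week, value) rows.

    Hypothetical sketch: missing weeks between the first and last week
    get `default`, not the previous value.
    """
    by_week = dict(rows)
    if not by_week:
        return []
    week, last = min(by_week), max(by_week)
    out = []
    while week <= last:
        out.append((week, by_week.get(week, default)))
        week += timedelta(days=step_days)
    return out


rows = [(date(2021, 3, 8), 1), (date(2021, 3, 29), 1)]
for week, value in with_fill(rows):
    print(week, value)
```

The two weeks in between come back with 0, which is why this modifier alone does not forward fill.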
    

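    【Comments】:

      【Solution 2】:

      I think it is better to fill the gaps on the server side (in the application code that consumes this result).

      Nevertheless, on the ClickHouse side, consider this approach: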

      SELECT 
          week, 
          last_value(metric1) OVER w AS metric1
          /*, ..*/
      FROM (
          SELECT 
              toStartOfWeek(timestamp) week, 
              toNullable(minIf(rank, rank < 100)) metric1 
              /*, .. */
          FROM (
              /* Emulate the test dataset. */
              SELECT data.1 x_id, data.2 y_id, data.3 z_id, data.4 rank, toDateTime(data.5) timestamp
              FROM (
                  SELECT arrayJoin([
                      (1231, 1324, 9412, 1   , '2021-03-12 00:13:34'),
                      (121 , 5524, 765 , 21  , '2021-03-13 15:43:21'),
                      (54  , 76  , 8822, 125 , '2021-05-14 17:23:12'),
                      (213 , 61  , 7651, 51  , '2021-03-16 12:15:43'),
                      (53  , 65  , 123 , 23  , '2021-03-12 13:28:54'),
                      (1231, 432 , 7651, 1541, '2021-03-12 16:54:24')]) data
                  )
              )        
          GROUP BY week
          ORDER BY week WITH FILL STEP 7
      )
      WINDOW w AS (ORDER BY week ROWS BETWEEN 100 PRECEDING AND CURRENT ROW)
      SETTINGS allow_experimental_window_functions = 1
      
      /*
      ┌───────week─┬─metric1─┐
      │ 2021-03-07 │       1 │
      │ 2021-03-14 │      51 │
      │ 2021-03-21 │      51 │
      │ 2021-03-28 │      51 │
      │ 2021-04-04 │      51 │
      │ 2021-04-11 │      51 │
      │ 2021-04-18 │      51 │
      │ 2021-04-25 │      51 │
      │ 2021-05-02 │      51 │
      │ 2021-05-09 │       0 │
      └────────────┴─────────┘
      */
      
      

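      Notes:

      • The metric is made Nullable (see the toNullable call) so that missing weeks are filled with NULL instead of 0
      • The window size is set to 100, which is enough as long as gaps are shorter than 100 weeks (increase this value if needed)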


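      【Comments】:

      • Thanks for the answer! I added an example to my question. My main problem is that I need to group by week and several fields (week, x_id, y_id, z_id), create the missing weeks for each group of fields, and fill them from the previous week whenever a group has no value for a week. I would like to solve this in ClickHouse, but if that is not possible I will fill the gaps on the server side.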
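
The per-group requirement described in this comment can be handled in application code, as Solution 2 suggests. Below is a minimal Python sketch, not a definitive implementation. The function names (`week_start`, `forward_fill_weekly`, `metric1`) are hypothetical. It assumes Monday-based weeks, that a group is carried forward from its first appearance up to the last week present in the data, and that the rows fit in memory.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def week_start(ts):
    """Monday of the week containing ts (hypothetical helper)."""
    d = ts.date()
    return d - timedelta(days=d.weekday())


def forward_fill_weekly(rows):
    """rows: iterable of (x_id, y_id, z_id, rank, timestamp).

    Returns {week: {(x_id, y_id, z_id): rank}}, where a group with no rows
    in a week keeps the rank it had in its most recent earlier week.
    """
    # Step 1: minimum rank per (week, group), like the inner GROUP BY.
    observed = defaultdict(dict)
    for x_id, y_id, z_id, rank, ts in rows:
        week, group = week_start(ts), (x_id, y_id, z_id)
        prev = observed[week].get(group)
        observed[week][group] = rank if prev is None else min(prev, rank)
    if not observed:
        return {}

    # Step 2: walk every week in order, carrying the last known ranks forward.
    filled, current = {}, {}
    week, last = min(observed), max(observed)
    while week <= last:
        current.update(observed.get(week, {}))
        filled[week] = dict(current)
        week += timedelta(days=7)
    return filled


def metric1(filled):
    """Per week, count groups whose rank is between 1 and 3 inclusive."""
    return {
        week: sum(1 for r in ranks.values() if 1 <= r <= 3)
        for week, ranks in filled.items()
    }


rows = [
    (1, 1, 1, 25, datetime(2021, 3, 12, 0, 13, 34)),
    (2, 2, 2, 30, datetime(2021, 3, 16, 0, 13, 34)),
]
filled = forward_fill_weekly(rows)
print(filled[date(2021, 3, 15)])  # group (1, 1, 1) is carried over with rank 25
```

With the two example rows from the question, week 2021-03-15 contains both groups, and group (1, 1, 1) keeps rank 25 from the previous week.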