【Question title】: How to select data with percentages from ClickHouse?
【Posted】: 2022-01-15 12:22:10
【Question description】:

Given the following table:

CREATE TABLE main
(
    `job_id` UUID,
    `request_time` DateTime,
    `host_id` UInt8,
    `status_code` LowCardinality(String),
)
ENGINE = MergeTree
ORDER BY request_time
SETTINGS index_granularity = 8192

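I am trying to get every status per host together with its relative percentage. To do that, I need to count the rows grouped by host and status, then divide each count by the total number of rows for that host.

For example, this query would work in MySQL: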

SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    COUNT() / sub.host_total * 100 AS percent
FROM
    main
INNER JOIN (
    SELECT host_id, COUNT() AS host_total
    FROM main
    GROUP BY host_id
) AS sub ON (sub.host_id = main.host_id)
GROUP BY
    main.host_id,
    main.status_code

But ClickHouse throws:

DB::Exception: Unknown identifier: host_total; there are columns: host_id, status_code, count(): While processing host_id, status_code, count() AS status_count, (count() / host_total) * 100 AS percent. (UNKNOWN_IDENTIFIER)

Probably because correlated (dependent) subqueries are not supported.

Someone suggested that I use a CTE, so I tried this:

WITH sub AS (
    SELECT host_id, COUNT() AS host_total
    FROM main
    GROUP BY host_id
)
SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    COUNT() / (SELECT host_total FROM sub WHERE sub.host_id = main.host_id) * 100 AS percent
FROM
    main
GROUP BY
    main.host_id,
    main.status_code

But still no luck:

DB::Exception: Missing columns: 'main.host_id' while processing query: 'SELECT host_total FROM sub WHERE host_id = main.host_id', required columns: 'host_total' 'host_id' 'main.host_id' 'host_total' 'host_id' 'main.host_id': While processing (SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20: While processing count() / ((SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20): While processing (count() / ((SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20)) * 100 AS percent. (UNKNOWN_IDENTIFIER)

【Comments】:

    Tags: sql group-by common-table-expression percentage clickhouse


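    【Solution 1】:

    ClickHouse reports a misleading error here. See https://github.com/ClickHouse/ClickHouse/issues/4567

    The real problem is that host_total must either appear in the GROUP BY clause or be wrapped in an aggregate function: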

    -- status_code is LowCardinality(String), so the codes are inserted as string literals
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 1, '200');
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 1, '500');
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 1, '200');
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 2, '500');
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 2, '200');
    INSERT INTO main (request_time, host_id, status_code) VALUES (now(), 3, '500');
    
    SELECT
        main.host_id,
        main.status_code,
        COUNT() AS status_count,
        round((COUNT() / any(sub.host_total)) * 100, 2) AS percent
    FROM main
    INNER JOIN
    (
        SELECT
            host_id,
            COUNT() AS host_total
        FROM main
        GROUP BY host_id
    ) AS sub ON sub.host_id = main.host_id
    GROUP BY
        main.host_id,
        main.status_code
    ORDER BY
        main.host_id ASC,
        main.status_code ASC
    
    ┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
    │       1 │ 200         │            2 │   66.67 │
    │       1 │ 500         │            1 │   33.33 │
    │       2 │ 200         │            1 │      50 │
    │       2 │ 500         │            1 │      50 │
    │       3 │ 500         │            1 │     100 │
    └─────────┴─────────────┴──────────────┴─────────┘
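
The query above takes the "aggregate function" route by wrapping host_total in any(). As a sketch of the other route mentioned (listing host_total in the GROUP BY), something like the following should also be accepted. It assumes the same main table and sample rows; because host_total is constant for a given host_id, adding it to the grouping keys does not change the groups.

```sql
-- Sketch: same join as above, but host_total is a grouping key
-- instead of being wrapped in any().
SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    round((COUNT() / sub.host_total) * 100, 2) AS percent
FROM main
INNER JOIN
(
    -- total number of rows per host
    SELECT
        host_id,
        COUNT() AS host_total
    FROM main
    GROUP BY host_id
) AS sub ON sub.host_id = main.host_id
GROUP BY
    main.host_id,
    main.status_code,
    sub.host_total
ORDER BY
    main.host_id ASC,
    main.status_code ASC
```

Either form works for the same reason: every non-aggregated column in the SELECT list has to be a grouping key.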
    

    But there are better ways to solve it:

    Window functions

    SELECT
        host_id,
        status_code,
        status_count,
        round((status_count / host_total) * 100, 2) AS percent
    FROM
    (
        SELECT
            host_id,
            status_code,
            status_count,
            sum(status_count) OVER (PARTITION BY host_id) AS host_total
        FROM
        (
            SELECT
                host_id,
                status_code,
                COUNT() AS status_count
            FROM main
            GROUP BY
                host_id,
                status_code
        )
    )
    ORDER BY
        host_id ASC,
        status_code ASC
    
    ┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
    │       1 │ 200         │            2 │   66.67 │
    │       1 │ 500         │            1 │   33.33 │
    │       2 │ 200         │            1 │      50 │
    │       2 │ 500         │            1 │      50 │
    │       3 │ 500         │            1 │     100 │
    └─────────┴─────────────┴──────────────┴─────────┘
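
The nested subqueries above can probably be flattened, since ClickHouse evaluates window functions after GROUP BY. The following is a sketch under that assumption, using the same main table. Older releases gated window functions behind the allow_experimental_window_functions setting, so this may not run on an outdated server.

```sql
-- Sketch: apply the window sum directly to the per-group count.
SELECT
    host_id,
    status_code,
    count() AS status_count,
    round((status_count / (sum(status_count) OVER (PARTITION BY host_id))) * 100, 2) AS percent
FROM main
GROUP BY
    host_id,
    status_code
ORDER BY
    host_id ASC,
    status_code ASC
```

Here sum(status_count) OVER (PARTITION BY host_id) adds up the per-status counts within each host, which is the same host_total that the nested version computes in its middle subquery.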
    

    Arrays

    SELECT
        host_id,
        status_code,
        status_count,
        round((status_count / host_total) * 100, 2) AS percent
    FROM
    (
        SELECT
            host_id,
            sumMap([CAST(status_code, 'String')], [1]) AS ga,
            count() AS host_total
        FROM main
        GROUP BY host_id
    )
    ARRAY JOIN
        ga.1 AS status_code,
        ga.2 AS status_count
    
    ┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
    │       1 │ 200         │            2 │   66.67 │
    │       1 │ 500         │            1 │   33.33 │
    │       2 │ 200         │            1 │      50 │
    │       2 │ 500         │            1 │      50 │
    │       3 │ 500         │            1 │     100 │
    └─────────┴─────────────┴──────────────┴─────────┘
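
To see what the array approach relies on, it may help to look at sumMap on its own. The sketch below uses a made-up inline data set (three status codes produced by arrayJoin) rather than the main table; the name ga is reused only for illustration.

```sql
-- Sketch: sumMap takes an array of keys and an array of values per row
-- and sums the values per key across all rows.
SELECT sumMap([status_code], [1]) AS ga
FROM
(
    -- hypothetical sample: two '200' rows and one '500' row
    SELECT arrayJoin(['200', '500', '200']) AS status_code
)
```

The result should be a tuple of two arrays: the distinct keys ('200' and '500') and the matching sums (2 and 1). In the query above, ga.1 and ga.2 pick those two arrays apart, and ARRAY JOIN turns them back into one row per status code. That query has no ORDER BY, so the row order shown is not guaranteed; appending ORDER BY host_id, status_code would make it deterministic.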
    
    

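    【Comments】:

    • Thanks Denny! The first answer works like a charm, so I'll accept this answer. The second one produces strange numbers, and the third refuses to run (Illegal type of argument for aggregate function sumMap. (ILLEGAL_TYPE_OF_ARGUMENT)).
    • @WadeC.Blake I fixed the second and the third. It looks like you are using an outdated ClickHouse version.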