【Posted】: 2019-12-27 12:04:39
【Question】:
I have a table:
CREATE TABLE StatsFull (
Timestamp Int32,
Uid String,
ErrorCode Int32,
Name String,
Version String,
Date Date MATERIALIZED toDate(Timestamp),
Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192
I need to get the top 100 names, or the top 100 error codes, ranked by the number of unique Uids.
The obvious query is:
SELECT Name, uniq(Uid) AS cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100
But the data is too large, so I created an AggregatingMergeTree materialized view, since I don't need to filter the data by hour (only by date).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
(
Date,
ProductName,
ErrorCode,
Name,
Version
) SETTINGS index_granularity = 8192 AS
SELECT
Date,
ProductName,
ErrorCode,
Name,
Version,
uniqState(Uid) AS UniqUsers
FROM
StatsFull
GROUP BY
Date,
ProductName,
ErrorCode,
Name,
Version
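For readers unfamiliar with the -State / -Merge combinators used by this view, here is a minimal sketch that does not depend on the tables above. It assumes only the built-in numbers() table function; the column aliases state and distinct_values are made up for illustration. The inner query produces one partial uniq state per group, and the outer query merges those states into a single distinct count, which is what uniqMerge(UniqUsers) does with the states stored in StatsAggregated.

```sql
-- Inner query: one intermediate uniq state per group (three groups here).
-- Outer query: merge the partial states into one distinct count.
SELECT uniqMerge(state) AS distinct_values
FROM
(
    SELECT uniqState(number % 10) AS state
    FROM numbers(100)
    GROUP BY number % 3
);
```

Note that every partial state has to be kept until it is merged, so the more (Date, ProductName, ErrorCode, Name, Version) combinations the view holds, the more states a query has to read and merge.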
And my current query is:
SELECT Name FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
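The query used to run well, but the number of rows per day has grown over time, and now it uses too much memory. So I am looking for some optimization.
I found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but that is not what I need.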
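To make the difference concrete, here is a small sketch on hypothetical inline rows (the sample values 'A', 'B', 'u1', 'u2' and the column aliases are invented for illustration). topK ranks values by how many rows contain them, while the question asks for a ranking by the number of distinct Uids, so the two can disagree: 'A' occurs in more rows, but 'B' has more distinct users.

```sql
-- Hypothetical sample: 'A' appears in 3 rows, all from user u1;
-- 'B' appears in 2 rows, from two different users.
SELECT
    topK(1)(Name)            AS top_by_row_count,  -- ranks by row frequency
    uniqIf(Uid, Name = 'A')  AS users_of_a,        -- distinct users for 'A'
    uniqIf(Uid, Name = 'B')  AS users_of_b         -- distinct users for 'B'
FROM
(
    SELECT
        tupleElement(t, 1) AS Name,
        tupleElement(t, 2) AS Uid
    FROM
    (
        SELECT arrayJoin([('A', 'u1'), ('A', 'u1'), ('A', 'u1'),
                          ('B', 'u1'), ('B', 'u2')]) AS t
    )
);
```

On this sample, row frequency should favor 'A', whereas a ranking by unique users should favor 'B'.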
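【Comments】:
- Your example is very abstract. Can you provide a real example and the schema definition? You need to consider assigning the right primary key, partitioning, etc. in MergeTree, or rely on the capabilities of AggregatingMergeTree.
Tags: clickhouse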