【Posted】: 2020-10-12 04:42:59
【Question】:
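I have two queries that compute some statistics over a single table, agg_data. The second one finds the median duration for each msgdate. My expected output should have these 5 fields:
msgdate, avg-Total, avg-duration, stddev, and median. At the moment I combine the two with a UNION, which works. I will run this query in AWS Athena. Because the second query reads agg_data again to compute the median, the amount of data scanned doubles: with an input of 4 MB, the Athena history page shows 8 MB scanned.
I want to avoid the second scan to save cost. Can you help me get the same result while reading agg_data only once?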
Query 1: computes avg-Total, avg-duration, stddev
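-- Per-msgdate average and standard deviation of duration.
-- Aliases containing a hyphen must be double-quoted in Athena.
SELECT b.msgdate1 AS msgdate,
       ROUND(b.avrg, 3) AS "avg-Total",
       ROUND(AVG(b.duration), 3) AS "avg-duration",
       ROUND(b.stdv, 3) AS stddev
FROM
(
    SELECT AVG(a2.duration) OVER (PARTITION BY a2.msgdate) AS avrg,
           a2.duration AS duration,
           a2.msgdate AS msgdate1,
           -- stddev is NULL for a single-row partition; report 0 instead
           COALESCE(stddev(a2.duration) OVER (PARTITION BY a2.msgdate), 0) AS stdv
    FROM agg_data a2
) AS b
-- AVG(b.duration) is an aggregate, so the other selected columns must be grouped
GROUP BY b.msgdate1, b.avrg, b.stdv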
Query 2: computes the median
WITH RankedTable AS
(
    SELECT msgdate, duration,
           ROW_NUMBER() OVER (PARTITION BY msgdate ORDER BY duration) AS Rnk,
           COUNT(*) OVER (PARTITION BY msgdate) AS Cnt
    FROM agg_data
)
-- Cnt / 2 + 1 uses integer division: the middle row for an odd Cnt,
-- the upper of the two middle rows for an even Cnt
SELECT msgdate, duration AS median
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1 OR Cnt = 1
【Discussion】:
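One possible way to fold the two queries into a single read of agg_data is to rank the rows once in a CTE and then aggregate that CTE once, picking the median row with a conditional aggregate. The following is a minimal sketch, not a tested Athena query: it assumes duration is numeric, that "avg-Total" and "avg-duration" are both meant to be the per-msgdate average of duration (as they appear to be in Query 1), and that the median definition from Query 2 (the upper middle row for an even count) should be kept. The CTE name ranked is made up for illustration.

```sql
-- Sketch: agg_data is referenced exactly once, inside the CTE.
WITH ranked AS (
    SELECT msgdate,
           duration,
           ROW_NUMBER() OVER (PARTITION BY msgdate ORDER BY duration) AS rnk,
           COUNT(*)     OVER (PARTITION BY msgdate)                   AS cnt
    FROM agg_data
)
SELECT msgdate,
       -- Assumption: both averages are the per-msgdate mean of duration.
       ROUND(AVG(duration), 3)                 AS "avg-Total",
       ROUND(AVG(duration), 3)                 AS "avg-duration",
       -- stddev is NULL when a msgdate has a single row; report 0 instead.
       ROUND(COALESCE(STDDEV(duration), 0), 3) AS stddev,
       -- Exactly one row per msgdate has rnk = cnt / 2 + 1 (integer division),
       -- so MAX() simply returns that row's duration.
       MAX(CASE WHEN rnk = cnt / 2 + 1 THEN duration END) AS median
FROM ranked
GROUP BY msgdate
```

Because the table appears only once, the scanned-data figure in the Athena history page should correspond to a single pass, though that is worth confirming on the real data. If an approximate median is acceptable, Athena's approx_percentile(duration, 0.5) could replace the ranking step entirely, at the cost of no longer matching Query 2's exact result.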
Tags: mysql amazon-athena