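[Posted]: 2021-12-30 06:10:37
[Question]:
I am working on a query in which I want to get the last value submitted for a specific point in time. This should be reasonably easy with the LAST_VALUE function, but when I use it, AWS Athena does not recognize it as an aggregate function and fails with this error: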
'"first_value"(cv.delay) OVER (PARTITION BY cv.trip_id ORDER BY cv.timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)' must be an aggregate expression or appear in GROUP BY clause
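When I try adding the function to GROUP BY, the query also breaks, this time saying that GROUP BY cannot contain aggregate functions. So depending on whether it appears in SELECT or in GROUP BY, the same expression is treated as both non-aggregate and aggregate.
Has anyone gotten a query like this to work in Athena?
My query: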
select
    cv.trip_id as trip_id
    ,cv.route_id as route_id
    ,cv.route_long_name as route_long_name
    ,cv.route_short_name as routes_short_name
    ,cv.direction_id as direction_id
    -- ,cv.route_type as route_type
    ,max(cv.delay) as delay_max
    ,min(cv.delay) as delay_min
    ,from_unixtime(cv.timestamp)
    ,first_value(cv.delay)
        over(partition by cv.trip_id
             order by cv.timestamp desc
             rows between unbounded preceding and unbounded following)
        as last_delay
from stats_vehicles cv
where
    cv.year = 2021
    and cv.month = 10
    and cv.day = 22
group by cv.trip_id,
    cv.timestamp,
    cv.route_id,
    cv.route_long_name,
    cv.route_short_name,
    cv.direction_id
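A note on the error, offered as a sketch rather than a confirmed fix: in a grouped query, window functions are evaluated after GROUP BY, so their arguments may only reference grouping columns or aggregates. `cv.delay` is neither, which would explain the "must be an aggregate expression or appear in GROUP BY clause" message. One possible workaround, assuming the engine accepts an aggregate as the argument of a window function (standard SQL allows it; this has not been verified on Athena here), is to aggregate `delay` inside `first_value`. The alias `reported_at` and the choice of `max` to resolve several delays sharing one timestamp are assumptions made for illustration.

```sql
-- Sketch only: max(cv.delay) collapses each (trip_id, timestamp) group to a single value,
-- then first_value picks that value from the group with the latest timestamp per trip.
select
    cv.trip_id
    ,from_unixtime(cv.timestamp) as reported_at   -- hypothetical alias
    ,max(cv.delay) as delay_max
    ,min(cv.delay) as delay_min
    ,first_value(max(cv.delay))
        over (partition by cv.trip_id
              order by cv.timestamp desc
              rows between unbounded preceding and unbounded following)
        as last_delay
from stats_vehicles cv
where cv.year = 2021
  and cv.month = 10
  and cv.day = 22
group by cv.trip_id, cv.timestamp
```

Both `partition by` and `order by` reference only grouping columns here, which is what lets the window run over the grouped rows.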
Athena example for LAST_VALUE:
select venuestate, venueseats, venuename,
    last_value(venuename)
        over(partition by venuestate
             order by venueseats desc
             rows between unbounded preceding and unbounded following)
from (select * from venue where venueseats >0)
order by venuestate;
LAST_VALUE/FIRST_VALUE Athena Documentation
Update: I tried wrapping the query in a subquery to avoid having to add last_delay to the GROUP BY clause, but I get the same error as above.
Subquery statement:
select
    cv.last_delay,
    cv.stop_id,
    cv.trip_id
from
    (select
        cv.trip_id as trip_id
        ,cv.route_id as route_id
        ,cv.route_long_name as route_long_name
        ,cv.route_short_name as routes_short_name
        ,cv.direction_id as direction_id
        -- ,cv.route_type as route_type
        ,max(cv.delay) as delay_max
        ,min(cv.delay) as delay_min
        ,from_unixtime(cv.timestamp)
        ,first_value(cv.delay)
            over(partition by cv.trip_id
                 order by cv.timestamp desc
                 rows between unbounded preceding and unbounded following)
            as last_delay
    from stats_vehicles cv
    where
        cv.year = 2021
        and cv.month = 10
        and cv.day = 22
    group by cv.trip_id,
        cv.timestamp,
        cv.route_id,
        cv.route_long_name,
        cv.route_short_name,
        cv.direction_id
    ) as cv
group by
    cv.trip_id,
    cv.stop_id,
    cv.last_delay
I also tried using min as a placeholder aggregate function, but then it says you cannot nest window functions inside an aggregation:
Cannot nest window functions inside aggregation 'min': ["first_value"(cv.delay) OVER (PARTITION BY cv.trip_id ORDER BY cv.timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)]
Aggregate query:
select
    cv.trip_id as trip_id
    ,cv.route_id as route_id
    ,cv.route_long_name as route_long_name
    ,cv.route_short_name as routes_short_name
    ,cv.direction_id as direction_id
    ,max(cv.delay) as delay_max
    ,min(cv.delay) as delay_min
    ,from_unixtime(cv.timestamp)
    ,min(first_value(cv.delay)
        over(partition by cv.trip_id
             order by cv.timestamp desc
             rows between unbounded preceding and unbounded following)) as last_delay
from (select * from stats_vehicles
    where
        year = 2021
        and month = 10
        and day = 22) cv
group by cv.trip_id,
    cv.timestamp,
    cv.route_id,
    cv.route_long_name,
    cv.route_short_name,
    cv.direction_id
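The "Cannot nest window functions inside aggregation" error suggests doing the two steps at separate query levels. As a rough sketch, not a tested Athena query: compute the window function in an inner query that has no GROUP BY, then aggregate in the outer query. The alias `w` and the reduced column list are choices made here for brevity.

```sql
-- Sketch only: the window function runs on ungrouped rows first, the grouping happens afterwards.
select
    w.trip_id
    ,max(w.delay) as delay_max
    ,min(w.delay) as delay_min
    ,min(w.last_delay) as last_delay   -- constant within a trip, so min() is only a placeholder
from (
    select
        cv.trip_id
        ,cv.delay
        ,first_value(cv.delay)
            over (partition by cv.trip_id
                  order by cv.timestamp desc
                  rows between unbounded preceding and unbounded following)
            as last_delay
    from stats_vehicles cv
    where cv.year = 2021
      and cv.month = 10
      and cv.day = 22
) w
group by w.trip_id
```

This differs from the subquery attempt above in that the inner query contains the window function but no aggregates, so nothing forces `cv.delay` into a GROUP BY.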
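Here is some sample data. The data is sent from buses: it is a GPS signal from a sensor, reported every 60 seconds. Each bus has a route ID and a trip ID associated with it at any given moment. I have broken down every arrival at every stop at any given moment to get an estimated delay. The idea is to take the last delay report for each stop of each trip, in order to get the most accurate arrival time for each bus on each trip.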
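If the goal is only the last delay report per trip and stop, a window function may not be needed at all. A minimal sketch, assuming `stats_vehicles` has a `stop_id` column (it appears in the subquery attempt but not in the inner select list) and using Presto's `max_by(x, y)` aggregate, which returns the value of `x` for the row with the largest `y`. The alias `last_report_time` is hypothetical.

```sql
-- Sketch only: one output row per (trip_id, stop_id), carrying the delay from the latest report.
select
    cv.trip_id
    ,cv.stop_id                                        -- assumed column
    ,max_by(cv.delay, cv.timestamp) as last_delay      -- delay at the largest timestamp
    ,from_unixtime(max(cv.timestamp)) as last_report_time
    ,max(cv.delay) as delay_max
    ,min(cv.delay) as delay_min
from stats_vehicles cv
where cv.year = 2021
  and cv.month = 10
  and cv.day = 22
group by cv.trip_id, cv.stop_id
```

Because `max_by` is an ordinary aggregate, it can sit next to `max` and `min` in the same GROUP BY query without triggering either of the errors above.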
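[Comments]:
- Can you post some sample values and the expected output? At the moment I cannot follow your query. Are there multiple delay values for some trip_id? Can delay be null? Also, rows between unbounded preceding and unbounded following makes the window the entire partition, i.e. the value is the same for all rows with the same trip_id.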
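The commenter's point about the frame can be illustrated with a small self-contained query. The rows below are hypothetical values invented for the illustration, not the asker's data, and `ts` stands in for the `timestamp` column.

```sql
-- Sketch only: with the frame spanning the whole partition, first_value returns
-- the same delay (the one with the largest ts) on every row of a trip.
select
    t.trip_id
    ,t.ts
    ,t.delay
    ,first_value(t.delay)
        over (partition by t.trip_id
              order by t.ts desc
              rows between unbounded preceding and unbounded following)
        as last_delay
from (
    values
        ('trip_a', 100, 30),
        ('trip_a', 160, 45),
        ('trip_a', 220, 20),
        ('trip_b', 100, 5),
        ('trip_b', 160, 0)
) as t (trip_id, ts, delay)
order by t.trip_id, t.ts
```

Every `trip_a` row gets `last_delay = 20` and every `trip_b` row gets `last_delay = 0`, so the window adds a per-trip constant to each row rather than collapsing rows the way an aggregate would.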
Tags: sql amazon-athena presto