[Posted]: 2021-10-11 18:14:50
[Question]:
Given data like the following, where the date is a string in YYYYMMDD format:
| item | vietnamese | cost | unique_id | sales_date |
|---|---|---|---|---|
| fruits | trai cay | 10 | abc123 | 20211001 |
| fruits | trai cay | 8 | foo99 | 20211001 |
| fruits | trai cay | 9 | foo99 | 20211001 |
| vege | rau | 3 | rr1239 | 20211001 |
| vege | rau | 3 | rr1239 | 20211001 |
| fruits | trai cay | 12 | abc123 | 20211002 |
| fruits | trai cay | 14 | abc123 | 20211002 |
| fruits | trai cay | 8 | abc123 | 20211002 |
| fruits | trai cay | 5 | foo99 | 20211002 |
| vege | rau | 8 | rr1239 | 20211002 |
| vege | rau | 1 | rr1239 | 20211002 |
| vege | rau | 12 | ud9213 | 20211002 |
| vege | rau | 19 | r11759 | 20211002 |
| fruits | trai cay | 6 | foo99 | 20211003 |
| fruits | trai cay | 2 | abc123 | 20211003 |
| fruits | trai cay | 12 | abc123 | 20211003 |
| vege | rau | 1 | ud97863 | 20211003 |
| vege | rau | 9 | r112359 | 20211003 |
| fruits | trai cay | 6 | foo99 | 20211004 |
| fruits | trai cay | 2 | abc123 | 20211004 |
| fruits | trai cay | 12 | abc123 | 20211004 |
| vege | rau | 9 | r112359 | 20211004 |
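The goal is to:
- select at most N rows per sales_date within a given time range
- aggregate the data with group by on the item column

For example, at most 3 rows per day between '20211002' and '20211004':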
SELECT *
FROM
  (SELECT item,
          max(vietnamese) as vietnamese,
          sum(cost) as total_cost,
          array_agg(cost) as costs,
          array_agg(unique_id) as unique_ids,
          row_number() over (partition by max(sales_date) order by rand()) as row
   FROM mytable
   where sales_date between '20211002' and '20211004'
   GROUP BY item)
where row <= 3
limit 9
Note: the vietnamese column maps one-to-one to item, hence the max(vietnamese).
The result of the query above should look something like:
| item | vietnamese | costs | unique_ids |
|---|---|---|---|
| fruits | trai cay | [8] | [abc123] |
| vege | rau | [8, 1] | [rr1239, rr1239] |
| fruits | trai cay | [2, 12] | [abc123, abc123] |
| vege | rau | [1] | [ud97863] |
| fruits | trai cay | [6, 2, 12] | [foo99, abc123, abc123] |
The desired output, to be saved in Parquet format:
| item | vietnamese | costs | unique_ids | sales_date |
|---|---|---|---|---|
| fruits | trai cay | [8] | [abc123] | 20211002 |
| vege | rau | [8, 1] | [rr1239, rr1239] | 20211002 |
| fruits | trai cay | [2, 12] | [abc123, abc123] | 20211003 |
| vege | rau | [1] | [ud97863] | 20211003 |
| fruits | trai cay | [6, 2, 12] | [foo99, abc123, abc123] | 20211004 |
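The desired output keeps one row per item and sales_date, so one possible way to express "at most N rows per day, then aggregate" is to rank the raw rows inside a subquery and group afterwards. The following is only a sketch under assumptions: the aliases `sampled` and `rn` are made up for illustration, and because the ordering uses rand(), the rows picked (and therefore the arrays) will differ from run to run. In the query above, the window function is evaluated after GROUP BY item, so it ranks the aggregated item rows rather than the individual sales rows.

```sql
-- Sketch only: `sampled` and `rn` are illustrative names, not from the original post.
SELECT item,
       sales_date,
       max(vietnamese)      AS vietnamese,
       sum(cost)            AS total_cost,
       array_agg(cost)      AS costs,
       array_agg(unique_id) AS unique_ids
FROM (
    -- Number the raw rows within each day in random order.
    SELECT item,
           vietnamese,
           cost,
           unique_id,
           sales_date,
           row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
    FROM mytable
    WHERE sales_date BETWEEN '20211002' AND '20211004'
) AS sampled
-- Keep at most 3 raw rows per day before aggregating.
WHERE rn <= 3
GROUP BY item, sales_date
```

Grouping by both item and sales_date is what lets sales_date appear in the result without wrapping it in an aggregate.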
The aim is to save it to s3://somes3path/ with the following directory structure:
s3://somes3path/
    item=fruits/
        sales_date=20211002
        sales_date=20211003
    item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004
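How can I write the expected output into the directory structure listed above?
I have tried the following, but it does not save the data in the directory structure I expected: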
CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/',
       partitioned_by = ARRAY['item'],
       bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS
SELECT *
FROM
  (SELECT item,
          max(vietnamese) as vietnamese,
          sum(cost) as total_cost,
          array_agg(cost) as costs,
          array_agg(unique_id) as unique_ids,
          first(sales_date) as sales_date,
          row_number() over (partition by max(sales_date) order by rand()) as row
   FROM mytable
   where sales_date between '20211002' and '20211004'
   GROUP BY item)
where row <= 3
limit 9
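For context, bucketing in an Athena CTAS hashes rows into files inside each partition rather than creating one directory per value, which may be why no sales_date= directories appear. A possible direction, sketched here under assumptions, is to list both columns in partitioned_by instead. The table name `mytable_sampled` and the aliases `sampled` and `rn` are hypothetical, the S3 prefix is assumed to be empty before the statement runs, and the partition columns are placed last in the SELECT list in the same order as in partitioned_by.

```sql
-- Sketch only: `mytable_sampled`, `sampled` and `rn` are illustrative names.
CREATE TABLE somedb.mytable_sampled
WITH (
    format = 'PARQUET',
    external_location = 's3://somes3path/',
    -- Each listed column becomes a directory level: item=.../sales_date=.../
    partitioned_by = ARRAY['item', 'sales_date']
) AS
SELECT max(vietnamese)      AS vietnamese,
       sum(cost)            AS total_cost,
       array_agg(cost)      AS costs,
       array_agg(unique_id) AS unique_ids,
       -- Partition columns go last, in the order given in partitioned_by.
       item,
       sales_date
FROM (
    -- At most 3 randomly chosen raw rows per day.
    SELECT item,
           vietnamese,
           cost,
           unique_id,
           sales_date,
           row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
    FROM mytable
    WHERE sales_date BETWEEN '20211002' AND '20211004'
) AS sampled
WHERE rn <= 3
GROUP BY item, sales_date
```

A single CTAS statement can only write a limited number of partitions (100 at the time of writing), so a wide date range with many items may need to be split across several statements.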
[Comments]:
Tags: sql amazon-athena create-table database-partitioning