【Question title】: How to use "create table as" with dates that are aggregated through group by in SQL?
【Posted】: 2021-10-11 18:14:50
【Question】:

Given data that looks like the following, where the date is a string in YYYYMMDD format:

item	vietnamese	cost	unique_id	sales_date
fruits	trai cay	10	abc123	20211001
fruits	trai cay	8	foo99	20211001
fruits	trai cay	9	foo99	20211001
vege	rau	3	rr1239	20211001
vege	rau	3	rr1239	20211001
fruits	trai cay	12	abc123	20211002
fruits	trai cay	14	abc123	20211002
fruits	trai cay	8	abc123	20211002
fruits	trai cay	5	foo99	20211002
vege	rau	8	rr1239	20211002
vege	rau	1	rr1239	20211002
vege	rau	12	ud9213	20211002
vege	rau	19	r11759	20211002
fruits	trai cay	6	foo99	20211003
fruits	trai cay	2	abc123	20211003
fruits	trai cay	12	abc123	20211003
vege	rau	1	ud97863	20211003
vege	rau	9	r112359	20211003
fruits	trai cay	6	foo99	20211004
fruits	trai cay	2	abc123	20211004
fruits	trai cay	12	abc123	20211004
vege	rau	9	r112359	20211004

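The goal is to:

select at most N rows for each sales_date within a specific time range
aggregate the data with a group by on the item column

For example, at most 3 rows per day between '20211002' and '20211004':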

SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

Note: each item maps one-to-one to a value in the vietnamese column, hence the max(vietnamese).

The result of the above should look something like this:

item	vietnamese	costs	unique_ids
fruits	trai cay	[8]	[abc123]
vege	rau	[8, 1]	[rr1239, rr1239]
fruits	trai cay	[2, 12]	[abc123, abc123]
vege	rau	[1]	[ud97863]
fruits	trai cay	[6, 2, 12]	[foo99, abc123, abc123]

The desired output, to be saved in Parquet format, is:

item	vietnamese	costs	unique_ids	sales_date
fruits	trai cay	[8]	[abc123]	20211002
vege	rau	[8, 1]	[rr1239, rr1239]	20211002
fruits	trai cay	[2, 12]	[abc123, abc123]	20211003
vege	rau	[1]	[ud97863]	20211003
fruits	trai cay	[6, 2, 12]	[foo99, abc123, abc123]	20211004

The aim is to save it to s3://somes3path/ with the following directory structure:

s3://somes3path/
     item=fruits/
        sales_date=20211002
        sales_date=20211003
     item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004

How can I produce the desired output in the directory structure listed above?

I have tried the following, but it does not save the data in the directory structure I expected:

CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/', 
       partitioned_by = ARRAY['item'], 
       bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS 
SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            first(sales_date) as sales_date,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

【Question comments】:

Tags: sql amazon-athena create-table database-partitioning

【Solution 1】:

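Your output is partitioned only by item. If you change it to be partitioned by both item and sales_date, you will get the directory structure you want. Remove the bucketing, since it has no effect once you also partition on sales_date: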

WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/', 
  partitioned_by = ARRAY['item', 'sales_date']
)
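
For context, here is a rough sketch of how that WITH clause might fit into a complete CTAS statement. It is not the answerer's own query. It assumes the intent is to sample at most 3 rows per sales_date first and then aggregate per item and sales_date, as the desired output table suggests. The target name somedb.mytable_sampled and the aliases rn and sampled are hypothetical. Athena expects the partition columns to come last in the SELECT list, in the same order as in partitioned_by, and expects external_location to be an empty S3 prefix.

```sql
-- Sketch only: somedb.mytable_sampled is a hypothetical target table name.
CREATE TABLE somedb.mytable_sampled
WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/',
  partitioned_by = ARRAY['item', 'sales_date']
) AS
SELECT
    max(vietnamese)                  AS vietnamese,
    sum(cost)                        AS total_cost,
    -- Same ORDER BY in both aggregates so the two arrays line up.
    array_agg(cost ORDER BY rn)      AS costs,
    array_agg(unique_id ORDER BY rn) AS unique_ids,
    -- Partition columns go last, in the order given in partitioned_by.
    item,
    sales_date
FROM (
    SELECT item, vietnamese, cost, unique_id, sales_date,
           -- Number the rows of each day in random order.
           row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
    FROM mytable
    WHERE sales_date BETWEEN '20211002' AND '20211004'
) AS sampled
WHERE rn <= 3                        -- keep at most 3 rows per sales_date
GROUP BY item, sales_date
```

One way to check the resulting layout, again assuming the hypothetical table name above, is to list the file path behind each partition:

```sql
-- "$path" is Athena's pseudo-column holding the S3 location of each row's file.
SELECT DISTINCT item, sales_date, "$path"
FROM somedb.mytable_sampled
ORDER BY item, sales_date
```

The paths should then follow the item=.../sales_date=.../ pattern described in the question.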

【Comments】: