[Question title]: How to use "create table as" with dates that are aggregated through group by in SQL?
[Posted]: 2021-10-11 18:14:50
[Question description]:

Given data like the following, where the date is a string in YYYYMMDD format:

item    vietnamese  cost  unique_id  sales_date
fruits  trai cay    10    abc123     20211001
fruits  trai cay    8     foo99      20211001
fruits  trai cay    9     foo99      20211001
vege    rau         3     rr1239     20211001
vege    rau         3     rr1239     20211001
fruits  trai cay    12    abc123     20211002
fruits  trai cay    14    abc123     20211002
fruits  trai cay    8     abc123     20211002
fruits  trai cay    5     foo99      20211002
vege    rau         8     rr1239     20211002
vege    rau         1     rr1239     20211002
vege    rau         12    ud9213     20211002
vege    rau         19    r11759     20211002
fruits  trai cay    6     foo99      20211003
fruits  trai cay    2     abc123     20211003
fruits  trai cay    12    abc123     20211003
vege    rau         1     ud97863    20211003
vege    rau         9     r112359    20211003
fruits  trai cay    6     foo99      20211004
fruits  trai cay    2     abc123     20211004
fruits  trai cay    12    abc123     20211004
vege    rau         9     r112359    20211004

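The goal is to:

  • select at most N rows per sales_date within a given date range
  • aggregate the data with group by on the item column

For example, at most 3 rows per day between '20211002' and '20211004':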

SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

Note: item and vietnamese map one-to-one, which is why max(vietnamese) is used.

The result of the query above should look something like this:

item    vietnamese  costs       unique_ids
fruits  trai cay    [8]         [abc123]
vege    rau         [8, 1]      [rr1239, rr1239]
fruits  trai cay    [2, 12]     [abc123, abc123]
vege    rau         [1]         [ud97863]
fruits  trai cay    [6, 2, 12]  [foo99, abc123, abc123]

The desired output, to be saved in Parquet format:

item    vietnamese  costs       unique_ids               sales_date
fruits  trai cay    [8]         [abc123]                 20211002
vege    rau         [8, 1]      [rr1239, rr1239]         20211002
fruits  trai cay    [2, 12]     [abc123, abc123]         20211003
vege    rau         [1]         [ud97863]                20211003
fruits  trai cay    [6, 2, 12]  [foo99, abc123, abc123]  20211004

The aim is to save it to s3://somes3path/, with the following directory structure:

s3://somes3path/
     item=fruits/
        sales_date=20211002
        sales_date=20211003
     item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004

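How can I produce the expected output in the directory structure listed above?


I have tried the following, but it does not save the output in the directory structure I expected: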

CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/', 
       partitioned_by = ARRAY['item'], 
       bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS 
SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            first(sales_date) as sales_date,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

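[Question comments]:

    Tags: sql amazon-athena create-table database-partitioning


    [Solution 1]:

    Your output is partitioned only by item. If you change it to be partitioned by item and sales_date, you will get the directory structure you want. Remove the bucketing, because it has no effect once you partition on sales_date: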

    WITH (
      format = 'PARQUET',
      external_location = 's3://somes3path/', 
      partitioned_by = ARRAY['item', 'sales_date']
    ) 
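
    For context, here is a minimal sketch of how that WITH clause might fit into a complete statement. It is not taken from the question and rests on several assumptions: the target table name somedb.mytable_sampled and the aliases sampled and rn are hypothetical, and it assumes the intent (as the desired output suggests) is to sample at most 3 raw rows per sales_date first and aggregate by item and sales_date afterwards. In an Athena CTAS the partition columns have to come last in the SELECT list, in the same order as in partitioned_by, so item and sales_date are listed at the end.

    ```sql
    -- Sketch only; table name and aliases are hypothetical.
    CREATE TABLE somedb.mytable_sampled
    WITH (
      format = 'PARQUET',
      external_location = 's3://somes3path/',
      partitioned_by = ARRAY['item', 'sales_date']
    ) AS
    SELECT max(vietnamese)      AS vietnamese,
           sum(cost)            AS total_cost,
           array_agg(cost)      AS costs,
           array_agg(unique_id) AS unique_ids,
           item,                -- partition column 1
           sales_date           -- partition column 2
    FROM (
        -- Number the raw rows randomly within each day (assumed intent).
        SELECT item, vietnamese, cost, unique_id, sales_date,
               row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
        FROM mytable
        WHERE sales_date BETWEEN '20211002' AND '20211004'
    ) AS sampled
    WHERE rn <= 3                -- keep at most 3 rows per sales_date
    GROUP BY item, sales_date
    ```

    With Hive-style partitioning this should write files under prefixes such as s3://somes3path/item=fruits/sales_date=20211002/. Because the sampling uses rand(), which rows end up in each partition will vary between runs.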
    
