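【Question title】: Manipulate data in SQL (backfilling, pivoting)
【Posted】: 2021-04-26 14:55:38
【Question description】:

I have a table similar to this small example:

I want to change it to this format:

Here is a sample SQL script that creates the example input table: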

CREATE TABLE sample_table
(
    id INT,
    hr INT,
    tm DATETIME,
    score INT
);

INSERT INTO sample_table
VALUES (1, 0, '2021-01-21 00:26:45', 2765),
(1, 0, '2021-01-21 00:49:00', 2765),
(1, 5, '2021-01-21 07:47:03', 1593),
(1, 7, '2021-01-21 11:50:48', 1604),
(1, 7, '2021-01-21 12:00:32', 1604),
(2, 0, '2021-01-21 00:50:45', 3500),
(2, 2, '2021-01-21 01:49:00', 2897),
(2, 2, '2021-01-21 05:47:03', 2897),
(2, 4, '2021-01-21 09:30:48', 2400),
(2, 6, '2021-01-21 12:00:32', 1647);

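I tried a combination of LAG and CASE WHEN, without success so far. I'm looking for ideas on how to do this (which functions to use, etc.). A sample script for the manipulation would be great.

If there are multiple values for an id and hr, use the earliest one. For example, for id=1 and hr=7, hr_7 should use the value from 11:50. In this sample both records have the same value, but they can differ.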

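【Question discussion】:

  • Postgres or Redshift? Despite some shared ancient roots, they are very different.
  • Redshift. Once I have an idea for the manipulation, I can always translate the logic into something Redshift-friendly. I'm more after an idea.
  • hr_0 has two entries for id=1; what logic do you use to pick only one?
  • Ah, good point, I should have clarified. Ideally in that case I'd pick the earliest one based on 'tm'. @GeorgeJoseph
  • If you can, you may also want to include DDL or a markdown table for your expected output, since that follows the guidelines and keeps this post from becoming unsalvageable if the image links die.

Tags: sql amazon-redshift data-manipulation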


【Solution 1】:

I would suggest this logic:

with u as (   -- keep the earliest row per (id, hr); next_hr is the next hr that has a score
      select id, hr, tm, score,
             lead(hr) over (partition by id order by hr) as next_hr
      from (select t.*,
                   row_number() over (partition by id, hr order by tm asc) as seqnum
            from sample_table t
           ) s
      where seqnum = 1
     )
select id,
       max(case when hr <= 1 and (next_hr > 1 or next_hr is null) then score end) as hr_1,
       max(case when hr <= 2 and (next_hr > 2 or next_hr is null) then score end) as hr_2,
       max(case when hr <= 3 and (next_hr > 3 or next_hr is null) then score end) as hr_3,
       max(case when hr <= 4 and (next_hr > 4 or next_hr is null) then score end) as hr_4,
       max(case when hr <= 5 and (next_hr > 5 or next_hr is null) then score end) as hr_5,
       max(case when hr <= 6 and (next_hr > 6 or next_hr is null) then score end) as hr_6,
       max(case when hr <= 7 and (next_hr > 7 or next_hr is null) then score end) as hr_7,
       max(case when hr <= 8 and (next_hr > 8 or next_hr is null) then score end) as hr_8
from u
group by id;

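This first removes the duplicates, then adds a range (from hr up to next_hr) over which each score is valid. The conditional aggregation then uses that range.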
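To make the range idea concrete, here is a minimal sketch that runs only the deduplicate-and-lead step and prints the resulting (hr, next_hr) range for each score. It assumes SQLite (driven from Python's sqlite3 module) as a stand-in for Redshift, stores tm as ISO text, and uses a hypothetical helper name score_ranges; it illustrates the step and is not the answer's exact environment.

```python
import sqlite3

# Sample rows copied from the question: (id, hr, tm, score).
ROWS = [
    (1, 0, "2021-01-21 00:26:45", 2765),
    (1, 0, "2021-01-21 00:49:00", 2765),
    (1, 5, "2021-01-21 07:47:03", 1593),
    (1, 7, "2021-01-21 11:50:48", 1604),
    (1, 7, "2021-01-21 12:00:32", 1604),
    (2, 0, "2021-01-21 00:50:45", 3500),
    (2, 2, "2021-01-21 01:49:00", 2897),
    (2, 2, "2021-01-21 05:47:03", 2897),
    (2, 4, "2021-01-21 09:30:48", 2400),
    (2, 6, "2021-01-21 12:00:32", 1647),
]

# Keep the earliest row per (id, hr), then look ahead to the next hr of the same id.
RANGES_SQL = """
select id, hr, next_hr, score
from (select id, hr, score,
             lead(hr) over (partition by id order by hr) as next_hr
      from (select t.*,
                   row_number() over (partition by id, hr order by tm) as seqnum
            from sample_table t) s
      where seqnum = 1) u
order by id, hr
"""


def score_ranges(rows):
    """Hypothetical helper: load rows into an in-memory table and return the ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sample_table (id int, hr int, tm text, score int)")
    conn.executemany("insert into sample_table values (?, ?, ?, ?)", rows)
    result = conn.execute(RANGES_SQL).fetchall()
    conn.close()
    return result


for row in score_ranges(ROWS):
    print(row)  # (id, hr, next_hr, score); next_hr is None for the last score of an id
```

A score applies to hour h when hr <= h and next_hr is either greater than h or missing, which is the condition each max(case ...) column tests.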

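【Discussion】:

    【Solution 2】:

    Thanks for the test script; it makes life easy.

    Here is an idea for how to do this with PostgreSQL.

    In the first CTE, data, I cross join every distinct id with the hours 0 through 8, which gives rows like this: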

    id num
    1  0
    1  1
    ...
    1  8
    2  0
    ...
    2  8
    

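    Next, in the raw_data CTE, I left join that grid to the actual rows in sample_table, which guarantees at least one row for every hr in 0..8. I also number the rows within each (id, hr) by tm, earliest first, as rnk.

    Then I keep only rnk=1 and use max(score) over(partition by id1, grp) to carry the previous score forward, where grp is a running count of non-null scores.

    Finally, I group the data by id and pivot with the max(case ...) logic to get the expected output.

    Output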
    
    +-----+------+------+------+------+------+------+------+------+------+
    | id1 | hr_0 | hr_1 | hr_2 | hr_3 | hr_4 | hr_5 | hr_6 | hr_7 | hr_8 |
    +-----+------+------+------+------+------+------+------+------+------+
    |   1 | 2765 | 2765 | 2765 | 2765 | 2765 | 1593 | 1593 | 1604 | 1604 |
    |   2 | 3500 | 3500 | 2897 | 2897 | 2400 | 2400 | 1647 | 1647 | 1647 |
    +-----+------+------+------+------+------+------+------+------+------+
    
    
    
    /*
    CREATE TABLE sample_table
    (
        id INT,
        hr INT,
        tm timestamp,
        score INT
     );
     
    INSERT INTO sample_table
    VALUES (1, 0, '2021-01-21 00:26:45', 2765),
    (1, 0, '2021-01-21 00:49:00', 2765),
    (1, 5, '2021-01-21 07:47:03', 1593),
    (1, 7, '2021-01-21 11:50:48', 1604),
    (1, 7, '2021-01-21 12:00:32', 1604),
    (2, 0, '2021-01-21 00:50:45', 3500),
    (2, 2, '2021-01-21 01:49:00', 2897),
    (2, 2, '2021-01-21 05:47:03', 2897),
    (2, 4, '2021-01-21 09:30:48', 2400),
    (2, 6, '2021-01-21 12:00:32', 1647);
    */
    with data
      as (select b.id as id
                 ,f  as num
            from generate_series(0,8) f
            join (select distinct id from sample_table) as b          
              on 1=1
         )       
        ,raw_data
         as (
       select d.id as id1
              ,d.num as num1
              ,st.*
              ,row_number() over(partition by d.id,d.num order by st.tm asc) as rnk
         from data d
    left join sample_table st
           on d.id=st.id
          and d.num=st.hr
           )
         ,prep_data
          as (select id1
                    ,num1
                    ,max(score) over(partition by id1,grp) as earliest_score 
               from (select id1,num1,score
                           ,sum(case when score is not null then 1 else 0 end)
                            over(partition by id1 order by num1) as grp
                       from raw_data
                      where rnk=1  
                     )x
              )
    select id1
           ,max(case when num1=0 then earliest_score end) as hr_0
           ,max(case when num1=1 then earliest_score end) as hr_1
           ,max(case when num1=2 then earliest_score end) as hr_2
           ,max(case when num1=3 then earliest_score end) as hr_3
           ,max(case when num1=4 then earliest_score end) as hr_4
           ,max(case when num1=5 then earliest_score end) as hr_5
           ,max(case when num1=6 then earliest_score end) as hr_6
           ,max(case when num1=7 then earliest_score end) as hr_7
           ,max(case when num1=8 then earliest_score end) as hr_8
      from prep_data
    group by id1 
    order by id1;
    

    I tried to set the script up on db-fiddle, but it kept crashing on the PostgreSQL query I used.

    It does work on a PostgreSQL database, though: I ran it on the site below and it worked.

    https://extendsclass.com/postgresql-online.html
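The forward-fill step in this answer (a running count of non-null scores used as a group key) can be sketched in isolation as follows. This is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module, a recursive CTE stands in for generate_series (which may not be available in a given SQLite build), the input is already reduced to the earliest row per (id, hr), and the names scores, EARLIEST and forward_fill are hypothetical.

```python
import sqlite3

# Earliest score per (id, hr) from the question's sample data: (id, hr, score).
EARLIEST = [
    (1, 0, 2765), (1, 5, 1593), (1, 7, 1604),
    (2, 0, 3500), (2, 2, 2897), (2, 4, 2400), (2, 6, 1647),
]

FILL_SQL = """
with recursive hours(num) as (      -- hours 0..8, replacing generate_series(0,8)
    select 0 union all select num + 1 from hours where num < 8
),
grid as (                           -- every (id, hour) pair
    select i.id, h.num
    from (select distinct id from scores) i cross join hours h
),
marked as (                         -- grp goes up by one at each hour that has a score
    select g.id, g.num, s.score,
           count(s.score) over (partition by g.id order by g.num) as grp
    from grid g
    left join scores s on s.id = g.id and s.hr = g.num
)
select id, num,
       max(score) over (partition by id, grp) as filled  -- one non-null score per grp
from marked
order by id, num
"""


def forward_fill(rows):
    """Hypothetical helper: return (id, hour, filled_score) for hours 0..8."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table scores (id int, hr int, score int)")
    conn.executemany("insert into scores values (?, ?, ?)", rows)
    result = conn.execute(FILL_SQL).fetchall()
    conn.close()
    return result


for row in forward_fill(EARLIEST):
    print(row)
```

Because each grp contains exactly one non-null score followed by the empty hours after it, max(score) within the group returns that score for every hour in the group.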

    【Discussion】:

    • Clever, that worked! Thanks a lot, George, for paying such close attention to the problem and to the output! :>
    • Happy to help :-)
    【Solution 3】:
    DECLARE @columns VARCHAR(MAX) = '',
    @sql VARCHAR(MAX) = ''
    SELECT  @columns+=QUOTENAME(hr) + ',' 
    FROM (
    SELECT DISTINCT hr
    from sample_table
    ) M
    SET @columns = LEFT(@columns, LEN(@columns) - 1);
    --SELECT @columns
    SET @sql ='
    (SELECT * FROM   
    (
    select ID,hr,score from sample_table
    ) t 
    PIVOT(MAX(score)
    FOR hr IN ('+ @columns +')
    ) AS pivot_table) ';
    EXEC (@sql)
    
    output:
    ID  0   2   4   5   6   7
    1   2765    NULL    NULL    1593    NULL    1604
    2   3500    2897    2400    NULL    1647    NULL
    

    Try this.
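The pivot output above still contains NULL for hours without a record, while the question asks for those gaps to take the previous value. One possible follow-up, sketched here under assumptions, is to generate a COALESCE chain per column so that each hour falls back to the nearest earlier hour. The sketch runs on SQLite through Python's sqlite3 module; SQLite has no PIVOT operator, so conditional aggregation is swapped in for it, and the query text is built in Python in the same spirit as the dynamic @columns string. The names scores, EARLIEST, build_pivot_sql and pivot_filled are hypothetical, and the input is assumed to be reduced to the earliest row per (id, hr) already.

```python
import sqlite3

# Earliest score per (id, hr) from the question's sample data: (id, hr, score).
EARLIEST = [
    (1, 0, 2765), (1, 5, 1593), (1, 7, 1604),
    (2, 0, 3500), (2, 2, 2897), (2, 4, 2400), (2, 6, 1647),
]


def build_pivot_sql(max_hr):
    """Build a pivot query with one forward-filled column per hour 0..max_hr."""
    hours = range(int(max_hr) + 1)  # only integers are ever formatted into the SQL
    pivot_cols = ", ".join(
        f"max(case when hr = {h} then score end) as h{h}" for h in hours
    )

    def filled(h):
        if h == 0:
            return "h0 as hr_0"  # nothing earlier to fall back to
        # e.g. coalesce(h3, h2, h1, h0): the nearest earlier hour that has a value
        chain = ", ".join(f"h{k}" for k in range(h, -1, -1))
        return f"coalesce({chain}) as hr_{h}"

    filled_cols = ", ".join(filled(h) for h in hours)
    return (
        f"with p as (select id, {pivot_cols} from scores group by id) "
        f"select id, {filled_cols} from p order by id"
    )


def pivot_filled(rows, max_hr):
    """Hypothetical helper: return one row per id with columns hr_0..hr_<max_hr>."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table scores (id int, hr int, score int)")
    conn.executemany("insert into scores values (?, ?, ?)", rows)
    result = conn.execute(build_pivot_sql(max_hr)).fetchall()
    conn.close()
    return result


for row in pivot_filled(EARLIEST, 7):
    print(row)
```

The chain grows with the number of columns, so for wide ranges the range-based or running-count approaches in the other answers are likely easier to maintain.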

    【Discussion】:
