Manipulate data in SQL (backfilling, pivoting): answers

【Question title】: Manipulate data in SQL (backfilling, pivoting)
【Posted】: 2021-04-26 14:55:38
【Question】:

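I have a table similar to this small example:

I would like to change it to this format:

Here is a sample SQL script that creates the example input table: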

CREATE TABLE sample_table
(
    id INT,
    hr INT,
    tm DATETIME,
    score INT
);

INSERT INTO sample_table
VALUES (1, 0, '2021-01-21 00:26:45', 2765),
(1, 0, '2021-01-21 00:49:00', 2765),
(1, 5, '2021-01-21 07:47:03', 1593),
(1, 7, '2021-01-21 11:50:48', 1604),
(1, 7, '2021-01-21 12:00:32', 1604),
(2, 0, '2021-01-21 00:50:45', 3500),
(2, 2, '2021-01-21 01:49:00', 2897),
(2, 2, '2021-01-21 05:47:03', 2897),
(2, 4, '2021-01-21 09:30:48', 2400),
(2, 6, '2021-01-21 12:00:32', 1647);

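I tried a combination of LAG and CASE WHEN, with no success so far. I'm looking for ideas on how to do this manipulation (which functions to use, and so on). An example script would be great.

If there are multiple values for an id and hr, use the earliest one. For example, for id=1 and hr=7, hr_7 should use the value from 11:50. The two records happen to have the same value in this example, but they can differ.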

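【Comments】:

Postgres or Redshift? Although they share some ancient roots, they are very different.
Redshift. Once I have an idea of how to do the manipulation, I can always convert the logic into something Redshift-friendly. I'm mostly after an idea.
hr_0 has two entries for id=1; what logic do you use to pick just one?
Ah, good point, I should have clarified. Ideally, in that case I'd like to pick the earliest one based on 'tm'. @GeorgeJoseph
If you can, you may also want to include DDL or a markdown table for your expected output, since that follows the guidelines and keeps this post usable if the image links break.

Tags: sql amazon-redshift data-manipulation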

【Solution 1】:

I would suggest this logic:

with u as (   -- one row per (id, hr): the earliest by tm, plus the next hr present for that id
      select id, hr, tm, score,
             lead(hr) over (partition by id order by hr) as next_hr
      from (select t.*,
                   row_number() over (partition by id, hr order by tm asc) as seqnum
            from sample_table t
           ) d   -- some engines require an alias on a derived table
      where seqnum = 1
     )
select id,
       -- a score applies from its own hr up to, but not including, next_hr
       max(case when hr <= 1 and (next_hr > 1 or next_hr is null) then score end) as hr_1,
       max(case when hr <= 2 and (next_hr > 2 or next_hr is null) then score end) as hr_2,
       max(case when hr <= 3 and (next_hr > 3 or next_hr is null) then score end) as hr_3,
       max(case when hr <= 4 and (next_hr > 4 or next_hr is null) then score end) as hr_4,
       max(case when hr <= 5 and (next_hr > 5 or next_hr is null) then score end) as hr_5,
       max(case when hr <= 6 and (next_hr > 6 or next_hr is null) then score end) as hr_6,
       max(case when hr <= 7 and (next_hr > 7 or next_hr is null) then score end) as hr_7,
       max(case when hr <= 8 and (next_hr > 8 or next_hr is null) then score end) as hr_8
from u   -- read from the CTE; next_hr does not exist in the base table
group by id;

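This first removes the duplicates, then adds a range (hr up to next_hr) for the hours during which each score is valid. The conditional aggregation then uses this information.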
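As a way to try the idea without a Redshift cluster, here is a minimal sketch that runs the same three steps (deduplicate, compute next_hr, conditional aggregation) against an in-memory SQLite database from Python. The helper names make_sample_db and backfill_pivot are made up for this illustration, SQLite stands in for Redshift, tm is stored as ISO-formatted text, and an hr_0 column is added on the assumption that hour 0 is wanted as well. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Sample rows copied from the question: (id, hr, tm, score).
ROWS = [
    (1, 0, "2021-01-21 00:26:45", 2765),
    (1, 0, "2021-01-21 00:49:00", 2765),
    (1, 5, "2021-01-21 07:47:03", 1593),
    (1, 7, "2021-01-21 11:50:48", 1604),
    (1, 7, "2021-01-21 12:00:32", 1604),
    (2, 0, "2021-01-21 00:50:45", 3500),
    (2, 2, "2021-01-21 01:49:00", 2897),
    (2, 2, "2021-01-21 05:47:03", 2897),
    (2, 4, "2021-01-21 09:30:48", 2400),
    (2, 6, "2021-01-21 12:00:32", 1647),
]


def make_sample_db():
    """Create an in-memory SQLite database holding sample_table (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    # tm is stored as ISO text, which sorts chronologically as a string.
    conn.execute("CREATE TABLE sample_table (id INT, hr INT, tm TEXT, score INT)")
    conn.executemany("INSERT INTO sample_table VALUES (?, ?, ?, ?)", ROWS)
    return conn


def backfill_pivot(conn, max_hr=8):
    """Return one tuple per id: (id, hr_0, ..., hr_<max_hr>), backfilled."""
    # One conditional-aggregation column per hour; h comes from range(), not user input.
    cols = ",\n           ".join(
        f"MAX(CASE WHEN hr <= {h} AND (next_hr > {h} OR next_hr IS NULL) "
        f"THEN score END) AS hr_{h}"
        for h in range(max_hr + 1)
    )
    query = f"""
    WITH u AS (
        -- keep the earliest row per (id, hr), then look ahead to the next hr
        SELECT id, hr, score,
               LEAD(hr) OVER (PARTITION BY id ORDER BY hr) AS next_hr
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY id, hr ORDER BY tm) AS seqnum
              FROM sample_table t) d
        WHERE seqnum = 1
    )
    SELECT id,
           {cols}
    FROM u
    GROUP BY id
    ORDER BY id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in backfill_pivot(make_sample_db()):
        print(row)
```

Because the final SELECT reads from u rather than from the base table, every hour between a score's hr and its next_hr picks up that score, which is what produces the backfill.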

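【Comments】:

【Solution 2】:

Thanks for the test script, which made life easy.

Here is an idea of how to do this with PostgreSQL.

In the first block (data), I generate every combination of id and the hours 0 through 8, so that I get data like this: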

id num
1  0
1  1
...
1  8
2  0
...
2  8

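After that, in the raw_data block, I left join to the actual data in sample_table, which guarantees at least one row for each hour 0..8. I also rank the rows within each (id, hr) by tm, earliest first --> rnk

Then I keep the rows with rnk=1 and use max(score) over(partition by id1, grp) to carry the previous score forward.

Finally, I group the data by id and do the "pivot" with the max logic, which gives the expected output.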
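The grp step is the least obvious part, so here is a minimal sketch of it alone, run on an in-memory SQLite database from Python. It is an illustration under assumptions rather than the answer's exact query: the table name scores and the function forward_fill are hypothetical, only the already-deduplicated rows for id=1 are loaded, and a recursive CTE stands in for PostgreSQL's generate_series, which stock SQLite does not provide. A running count of non-null scores gives every known score and the empty hours after it the same grp, so the maximum over each grp is that known score.

```python
import sqlite3


def forward_fill(max_hr=8):
    """Return (num, grp, filled) for hours 0..max_hr (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (hr INT, score INT)")
    # Deduplicated rows for id=1 from the question's sample data.
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?)",
        [(0, 2765), (5, 1593), (7, 1604)],
    )
    query = """
    WITH RECURSIVE hours(num) AS (
        -- stand-in for generate_series(0, max_hr)
        SELECT 0
        UNION ALL
        SELECT num + 1 FROM hours WHERE num < ?
    ),
    joined AS (
        -- one row per hour; score is NULL where no data exists
        SELECT h.num, s.score
        FROM hours h
        LEFT JOIN scores s ON s.hr = h.num
    ),
    grouped AS (
        -- running count of known scores: increments only on non-null rows
        SELECT num, score,
               SUM(CASE WHEN score IS NOT NULL THEN 1 ELSE 0 END)
                   OVER (ORDER BY num) AS grp
        FROM joined
    )
    -- each grp holds exactly one non-null score, so MAX picks it up
    SELECT num, grp, MAX(score) OVER (PARTITION BY grp) AS filled
    FROM grouped
    ORDER BY num
    """
    return conn.execute(query, (max_hr,)).fetchall()


if __name__ == "__main__":
    for row in forward_fill():
        print(row)
```

With more than one id, both window clauses would also partition by the id, as the answer's query does with id1.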

Output

+-----+------+------+------+------+------+------+------+------+------+
| id1 | hr_0 | hr_1 | hr_2 | hr_3 | hr_4 | hr_5 | hr_6 | hr_7 | hr_8 |
+-----+------+------+------+------+------+------+------+------+------+
|   1 | 2765 | 2765 | 2765 | 2765 | 2765 | 1593 | 1593 | 1604 | 1604 |
|   2 | 3500 | 3500 | 2897 | 2897 | 2400 | 2400 | 1647 | 1647 | 1647 |
+-----+------+------+------+------+------+------+------+------+------+



/*
CREATE TABLE sample_table
(
    id INT,
    hr INT,
    tm timestamp,
    score INT
 );
 
INSERT INTO sample_table
VALUES (1, 0, '2021-01-21 00:26:45', 2765),
(1, 0, '2021-01-21 00:49:00', 2765),
(1, 5, '2021-01-21 07:47:03', 1593),
(1, 7, '2021-01-21 11:50:48', 1604),
(1, 7, '2021-01-21 12:00:32', 1604),
(2, 0, '2021-01-21 00:50:45', 3500),
(2, 2, '2021-01-21 01:49:00', 2897),
(2, 2, '2021-01-21 05:47:03', 2897),
(2, 4, '2021-01-21 09:30:48', 2400),
(2, 6, '2021-01-21 12:00:32', 1647);
*/
with data
  as (select b.id as id
             ,f  as num
        from generate_series(0,8) f
        join (select distinct id from sample_table) as b          
          on 1=1
     )       
    ,raw_data
     as (
   select d.id as id1
          ,d.num as num1
          ,st.*
          ,row_number() over(partition by d.id,d.num order by st.tm asc) as rnk
     from data d
left join sample_table st
       on d.id=st.id
      and d.num=st.hr
       )
     ,prep_data
      as (select id1
                ,num1
                ,max(score) over(partition by id1,grp) as earliest_score 
           from (select id1,num1,score
                       ,sum(case when score is not null then 1 else 0 end)
                        over(partition by id1 order by num1) as grp
                   from raw_data
                  where rnk=1  
                 )x
          )
select id1
       ,max(case when num1=0 then earliest_score end) as hr_0
       ,max(case when num1=1 then earliest_score end) as hr_1
       ,max(case when num1=2 then earliest_score end) as hr_2
       ,max(case when num1=3 then earliest_score end) as hr_3
       ,max(case when num1=4 then earliest_score end) as hr_4
       ,max(case when num1=5 then earliest_score end) as hr_5
       ,max(case when num1=6 then earliest_score end) as hr_6
       ,max(case when num1=7 then earliest_score end) as hr_7
       ,max(case when num1=8 then earliest_score end) as hr_8
  from prep_data
group by id1 
order by id1;

I tried to set the script up on db-fiddle, but it kept crashing on the PostgreSQL query I used.

It does work on a PostgreSQL database, though: I ran it at the link below and it worked.

https://extendsclass.com/postgresql-online.html

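【Comments】:

Clever, it worked! Thank you so much, George, for looking into this so carefully and paying such close attention to the output! :>
Happy to help :-)

【Solution 3】: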

DECLARE @columns VARCHAR(MAX) = '',
@sql VARCHAR(MAX) = ''
SELECT  @columns+=QUOTENAME(hr) + ',' 
FROM (
SELECT DISTINCT hr
from sample_table
) M
SET @columns = LEFT(@columns, LEN(@columns) - 1);
--SELECT @columns
SET @sql ='
(SELECT * FROM   
(
select ID,hr,score from sample_table
) t 
PIVOT(MAX(score)
FOR hr IN ('+ @columns +')
) AS pivot_table) ';
EXEC (@sql)

output:
ID  0   2   4   5   6   7
1   2765    NULL    NULL    1593    NULL    1604
2   3500    2897    2400    NULL    1647    NULL

Try this.

【Comments】: