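【Posted】: 2021-03-05 19:22:00
【Question】:
I want to pivot a transactional dataset into an SCD2 table, so as to capture the effective interval of each combination of values at the pivot grain.
Snowflake is the actual DBMS I'm using, but I've also tagged Oracle because the two dialects are nearly identical. That said, I can probably adapt a solution written for any DBMS.
I have working SQL, but it came out of trial and error, and I feel there must be a more elegant way that I'm missing, because it is very ugly and computationally expensive.
(Note: the second record in the input data "expires" the first record. It can be assumed that every day of interest appears as an add_dts at least once.) (Added as images at the end until I figure out why the table markup isn't working.)
Input: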
| Original_Grain | Pivot_Grain | Pivot_Column | Pivot_Attribute | Add_Dts |
|---|---|---|---|---|
| OG-1 | PG-1 | First_Col | A | 2020-01-01 |
| OG-1 | PG-1 | First_Col | B | 2020-01-02 |
| OG-2 | PG-1 | Second_Col | A | 2020-01-01 |
| OG-3 | PG-1 | Third_Col | C | 2020-01-02 |
| OG-3 | PG-1 | Third_Col | B | 2020-01-03 |
Output:
| Pivot_Grain | First_Col | Second_Col | Third_Col | From_Dt | To_Dt |
|---|---|---|---|---|---|
| PG-1 | A | A | NULL | 2020-01-01 | 2020-01-02 |
| PG-1 | B | A | C | 2020-01-02 | 2020-01-03 |
| PG-1 | B | A | B | 2020-01-03 | 9999-01-01 |
WITH INPUT AS
(   -- sample rows from the input table
    SELECT 'OG-1' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'First_Col' AS Pivot_Column,
           'A' AS Pivot_Attribute,
           TO_DATE('2020-01-01','YYYY-MM-DD') AS Add_Dts
    FROM dual
    UNION
    SELECT 'OG-1' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'First_Col' AS Pivot_Column,
           'B' AS Pivot_Attribute,
           TO_DATE('2020-01-02','YYYY-MM-DD')
    FROM dual
    UNION
    SELECT 'OG-2' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Second_Col' AS Pivot_Column,
           'A' AS Pivot_Attribute,
           TO_DATE('2020-01-01','YYYY-MM-DD')
    FROM dual
    UNION
    SELECT 'OG-3' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Third_Col' AS Pivot_Column,
           'C' AS Pivot_Attribute,
           TO_DATE('2020-01-02','YYYY-MM-DD')
    FROM dual
    UNION
    SELECT 'OG-3' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Third_Col' AS Pivot_Column,
           'B' AS Pivot_Attribute,
           TO_DATE('2020-01-03','YYYY-MM-DD')
    FROM dual
),
GET_NORMALIZED_RANGES AS
(   -- each record stays in effect until the next record for the same Original_Grain
    SELECT I.*,
           COALESCE(
               LEAD(Add_Dts) OVER (
                   PARTITION BY I.Original_Grain
                   ORDER BY I.Add_Dts),
               TO_DATE('9999-01-01','YYYY-MM-DD')
           ) AS Next_Add_Dts
    FROM INPUT I
),
GET_DISTINCT_ADD_DATES AS
(   -- every date on which anything changed
    SELECT DISTINCT Add_Dts AS Driving_Date
    FROM INPUT
),
NORMALIZED_EFFECTIVE_AT_EACH_POINT AS
(   -- the records in effect on each driving date
    SELECT GNR.*,
           GDAD.Driving_Date
    FROM GET_NORMALIZED_RANGES GNR
    INNER JOIN GET_DISTINCT_ADD_DATES GDAD
            ON GDAD.Driving_Date >= GNR.Add_Dts
           AND GDAD.Driving_Date <  GNR.Next_Add_Dts
),
PIVOT_EACH_POINT AS
(   -- PIVOT names its output columns after the quoted literals, hence "'First_Col'" etc.
    SELECT DISTINCT
           Pivot_Grain,
           Driving_Date,
           MAX("'First_Col'")  OVER (PARTITION BY Pivot_Grain, Driving_Date) AS First_Col,
           MAX("'Second_Col'") OVER (PARTITION BY Pivot_Grain, Driving_Date) AS Second_Col,
           MAX("'Third_Col'")  OVER (PARTITION BY Pivot_Grain, Driving_Date) AS Third_Col
    FROM NORMALIZED_EFFECTIVE_AT_EACH_POINT NEP
    PIVOT (MAX(Pivot_Attribute) FOR Pivot_Column IN ('First_Col','Second_Col','Third_Col'))
)
SELECT Pivot_Grain,
       Driving_Date AS From_Dt,
       COALESCE(
           LEAD(Driving_Date) OVER (PARTITION BY Pivot_Grain ORDER BY Driving_Date),
           TO_DATE('9999-01-01','YYYY-MM-DD')
       ) AS To_Dt,
       First_Col,
       Second_Col,
       Third_Col
FROM PIVOT_EACH_POINT
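【Discussion】:
-
The table markup works in the preview but not in the actual post.
-
We build our SCD1 and 2 (aka SCD6) tables by hashing the "tracked" columns, so each time we pull data, if the hash is unchanged the existing row's end date stays in the distant future; otherwise we alter the existing row and insert a new one. I'm pretty sure this is all done in the same MERGE statement.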
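To illustrate the hash-and-MERGE pattern this comment describes, here is a minimal sketch in Snowflake syntax. It is not the commenter's actual code: the tables `dim_pivot_scd2` and `stg_pivot`, their columns, and `load_dt` are hypothetical names, and the sketch assumes the staging table holds at most one row per `pivot_grain` per load. `HASH` is Snowflake's built-in function; Oracle would need something like `STANDARD_HASH` and its own `MERGE` form (a `WHERE` clause instead of `WHEN MATCHED AND`).

```sql
-- Hypothetical dimension and staging tables (names are assumptions).
CREATE TABLE dim_pivot_scd2 (
    pivot_grain VARCHAR, first_col VARCHAR, second_col VARCHAR, third_col VARCHAR,
    row_hash NUMBER(19,0), from_dt DATE, to_dt DATE
);
CREATE TABLE stg_pivot (
    pivot_grain VARCHAR, first_col VARCHAR, second_col VARCHAR, third_col VARCHAR,
    load_dt DATE
);

MERGE INTO dim_pivot_scd2 tgt
USING (
    -- Branch 1: every staged row, keyed by its business key.
    -- Matches the current row if one exists; otherwise it is a brand-new key.
    SELECT s.pivot_grain AS merge_key,
           s.pivot_grain, s.first_col, s.second_col, s.third_col, s.load_dt,
           HASH(s.first_col, s.second_col, s.third_col) AS row_hash
    FROM   stg_pivot s
    UNION ALL
    -- Branch 2: changed rows a second time with a NULL key. A NULL key can
    -- never satisfy the ON clause, so these rows fall through to the INSERT.
    SELECT NULL AS merge_key,
           s.pivot_grain, s.first_col, s.second_col, s.third_col, s.load_dt,
           HASH(s.first_col, s.second_col, s.third_col) AS row_hash
    FROM   stg_pivot s
    JOIN   dim_pivot_scd2 d
      ON   d.pivot_grain = s.pivot_grain
     AND   d.to_dt = TO_DATE('9999-01-01', 'YYYY-MM-DD')
     AND   d.row_hash <> HASH(s.first_col, s.second_col, s.third_col)
) src
ON  tgt.pivot_grain = src.merge_key
AND tgt.to_dt = TO_DATE('9999-01-01', 'YYYY-MM-DD')
-- Tracked columns changed: close the current version.
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
    UPDATE SET tgt.to_dt = src.load_dt
-- New key, or the new version of a changed key: open a row that runs to the far future.
WHEN NOT MATCHED THEN
    INSERT (pivot_grain, first_col, second_col, third_col, row_hash, from_dt, to_dt)
    VALUES (src.pivot_grain, src.first_col, src.second_col, src.third_col, src.row_hash,
            src.load_dt, TO_DATE('9999-01-01', 'YYYY-MM-DD'));
```

Rows whose hash is unchanged match the current version but trigger neither clause, so their end date stays in the far future, as the comment describes.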
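-
@SimeonPilgrim Yes, I'm familiar with that pattern. The crux of the problem is handling the date intervals across the combinations of values at the pivot grain.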
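The crux mentioned here, carrying each column's latest value forward across every change date at the pivot grain, could be sketched as below. This is an assumption-laden alternative, not the asker's method: it swaps the range join and `PIVOT` for conditional aggregation plus `LAST_VALUE ... IGNORE NULLS`. The names `input_rows`, `changes`, and the `*_chg` columns are made up for illustration. It assumes each `Pivot_Column` is fed by a single `Original_Grain` within a `Pivot_Grain` (as in the sample data) and that a real NULL attribute never needs to be told apart from "no change on this date".

```sql
WITH input_rows AS (
    -- the five sample rows from the question
    SELECT 'OG-1' AS original_grain, 'PG-1' AS pivot_grain, 'First_Col' AS pivot_column,
           'A' AS pivot_attribute, TO_DATE('2020-01-01', 'YYYY-MM-DD') AS add_dts FROM dual
    UNION ALL
    SELECT 'OG-1', 'PG-1', 'First_Col',  'B', TO_DATE('2020-01-02', 'YYYY-MM-DD') FROM dual
    UNION ALL
    SELECT 'OG-2', 'PG-1', 'Second_Col', 'A', TO_DATE('2020-01-01', 'YYYY-MM-DD') FROM dual
    UNION ALL
    SELECT 'OG-3', 'PG-1', 'Third_Col',  'C', TO_DATE('2020-01-02', 'YYYY-MM-DD') FROM dual
    UNION ALL
    SELECT 'OG-3', 'PG-1', 'Third_Col',  'B', TO_DATE('2020-01-03', 'YYYY-MM-DD') FROM dual
),
changes AS (
    -- one row per pivot grain and change date;
    -- NULL means "this column did not change on that date"
    SELECT pivot_grain,
           add_dts,
           MAX(CASE WHEN pivot_column = 'First_Col'  THEN pivot_attribute END) AS first_col_chg,
           MAX(CASE WHEN pivot_column = 'Second_Col' THEN pivot_attribute END) AS second_col_chg,
           MAX(CASE WHEN pivot_column = 'Third_Col'  THEN pivot_attribute END) AS third_col_chg
    FROM   input_rows
    GROUP  BY pivot_grain, add_dts
)
SELECT pivot_grain,
       -- forward-fill: latest non-NULL value up to and including this date.
       -- The frame is spelled out because Snowflake's default frame for
       -- LAST_VALUE spans the whole partition.
       LAST_VALUE(first_col_chg) IGNORE NULLS OVER (
           PARTITION BY pivot_grain ORDER BY add_dts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_col,
       LAST_VALUE(second_col_chg) IGNORE NULLS OVER (
           PARTITION BY pivot_grain ORDER BY add_dts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS second_col,
       LAST_VALUE(third_col_chg) IGNORE NULLS OVER (
           PARTITION BY pivot_grain ORDER BY add_dts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS third_col,
       add_dts AS from_dt,
       COALESCE(
           LEAD(add_dts) OVER (PARTITION BY pivot_grain ORDER BY add_dts),
           TO_DATE('9999-01-01', 'YYYY-MM-DD')
       ) AS to_dt
FROM   changes
ORDER  BY pivot_grain, from_dt;
```

On the sample rows this should give the three intervals in the expected output (A/A/NULL, then B/A/C, then B/A/B). It reads the input once and avoids the inequality join, though it does not merge consecutive intervals whose values happen to be identical.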
-
Markdown usually wants blank lines between different formatting blocks.
Tags: sql oracle pivot snowflake-cloud-data-platform scd