【发布时间】:2019-02-14 20:11:08
【问题描述】:
(这是 Redshift 特有的,应该考虑到它的柱状性质、排序顺序等)
当按时间戳排序时,我需要按类别从每列中获取第一个非 NULL 值。
本质上与 FIRST_VALUE() 相同,但作为一个聚合。
或者,将 COALESCE() 作为聚合。
然而,Redshift 没有更高版本的 PostgreSQL 或 Oracle 的优点。所以,我正在寻找选项来测试我的 1 亿行导入 :)
(我不喜欢我的任何一个选择,但我很难找到更好的选择。)
示例输入
category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------
01 | 001 | NULL | NULL | 4
01 | 010 | 7 | NULL | NULL
01 | 100 | NULL | 1 | 2
01 | 999 | 6 | 3 | 6
02 | 001 | 1 | NULL | NULL
02 | 010 | NULL | 2 | NULL
02 | 100 | NULL | 1 | 9
02 | 999 | 6 | 3 | 2
预期结果
category | value_a | value_b | value_c
----------+-------------------------+---------+---------
01 | 7 | 1 | 4
02 | 1 | 2 | 9
当前解决方案
SELECT DISTINCT
category,
FIRST_VALUE(value_a IGNORE NULLS)
OVER (PARTITION BY category
ORDER BY row_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
AS value_a,
FIRST_VALUE(value_b IGNORE NULLS)
OVER (PARTITION BY category
ORDER BY row_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
AS value_b,
FIRST_VALUE(value_c IGNORE NULLS)
OVER (PARTITION BY category
ORDER BY row_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
AS value_c
FROM
mytable
它有效,但 DISTINCT 可能适用于数百或数千行。不太理想。
如果它只用于一两列,这可能会起作用(但它是用于十几列,所以这很可怕)...
WITH
sorted_value_a AS
(
SELECT
category,
value_a,
ROW_NUMBER() OVER (PARTITION BY category
ORDER BY value_a IS NOT NULL, row_timestamp
)
AS row_ordinal
FROM
myTable
),
sorted_value_b AS
(
SELECT
category,
value_b,
ROW_NUMBER() OVER (PARTITION BY category
ORDER BY value_b IS NOT NULL, row_timestamp
)
AS row_ordinal
FROM
myTable
),
sorted_value_c AS
(
SELECT
category,
value_c,
ROW_NUMBER() OVER (PARTITION BY category
ORDER BY value_c IS NOT NULL, row_timestamp
)
AS row_ordinal
FROM
myTable
)
SELECT
*
FROM
sorted_value_a AS a
INNER JOIN
sorted_value_b AS b
ON b.category = a.category
INNER JOIN
sorted_value_c AS c
ON c.category = a.category
【问题讨论】:
标签: sql amazon-redshift aggregation