Condensing arrays in Presto

【Question title】: Condensing arrays in Presto
【Posted】: 2021-10-13 02:21:09
【Question description】:

I have a query that uses the array_agg() function to build arrays of strings:

SELECT array_agg(message) AS sequence
FROM mytable
GROUP BY id

It produces a table like this:

                 sequence
1 foo foo bar baz bar baz
2     foo bar bar bar baz
3 foo foo foo bar bar baz

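But my goal is to condense each array of strings so that no value appears more than once in a row. For example, the desired output looks like this: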

    sequence
1 foo bar baz bar baz
2 foo bar baz
3 foo bar baz

How can I do this with Presto SQL?

【Question discussion】:

Tags: sql presto

【Solution 1】:

You can do this in one of two ways:

Use the array_distinct function to remove duplicates from the resulting array:

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_distinct(array_agg(message)) AS sequence
FROM mytable
GROUP BY id

Use the DISTINCT qualifier in the aggregation to remove duplicate values before they are passed to array_agg:

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(DISTINCT message) AS sequence
FROM mytable
GROUP BY id

Both options produce the same result:

    sequence
-----------------
 [foo, bar, baz]
 [foo, bar, baz]
 [foo, bar, baz]
(3 rows)

Update: you can use the recently introduced MATCH_RECOGNIZE feature to collapse runs of repeated consecutive elements:

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(value) AS sequence
FROM mytable
 MATCH_RECOGNIZE(
    PARTITION BY id
    MEASURES A.message AS value
    PATTERN (A B*)
    DEFINE B AS message = PREV(message)
)
GROUP BY id
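
If MATCH_RECOGNIZE is not available in your version, the same "keep a value only when it differs from the previous one" idea can probably be expressed with the array lambda function reduce. The following is only a sketch, not part of the original answer. It assumes that array_agg returns the messages in the intended order (the sample data has no ordering column, so this is not guaranteed), and the CAST to varchar is added only so that the accumulator and element types line up.

```sql
WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT
  id,
  reduce(
    array_agg(CAST(message AS varchar)),
    -- start from an empty array of the same element type
    CAST(ARRAY[] AS array(varchar)),
    -- append x only when it differs from the last element kept so far
    (acc, x) -> IF(cardinality(acc) > 0 AND element_at(acc, -1) = x,
                   acc,
                   acc || x),
    -- return the accumulated array unchanged
    acc -> acc
  ) AS sequence
FROM mytable
GROUP BY id
```

Under the ordering assumption above, id 1 should come out as [foo, bar, baz, bar, baz], and ids 2 and 3 as [foo, bar, baz]. If the real table has a column that defines the order, array_agg(message ORDER BY that_column) (that_column is a hypothetical name) would make the result deterministic.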

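【Discussion】:

Thanks for your answer, but I think I left out an important point: I don't need distinct values, I need values that don't repeat consecutively. I have modified the first row of my data to illustrate this.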