postgresql split_part() returning only unique substrings separated by a delimiter? (Answers)

【Question title】: postgresql split_part() returning only unique substrings separated by a delimiter?
【Posted】: 2023-03-22 16:14:01
【Question】:

I know that split_part() can break a string of delimiter-separated elements into separate columns, like this:

SELECT split_part(col, ',', 1) AS col1
     , split_part(col, ',', 2) AS col2
     , split_part(col, ',', 3) AS col3
     , split_part(col, ',', 4) AS col4
FROM   tbl;

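However, in my case the concatenated strings contain redundant elements, i.e. the same element appears more than once. How can I retrieve only the unique string elements (substrings), without repeating the same substring?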
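
To make the problem concrete, here is a minimal sketch using a literal string. The value 'a,b,a,c' is an assumed example, not data from the question. split_part() addresses fields purely by position, so a repeated element is returned again:

```sql
-- Assumed sample value; v and s are illustrative names only.
select split_part(s, ',', 1) as col1
     , split_part(s, ',', 2) as col2
     , split_part(s, ',', 3) as col3
     , split_part(s, ',', 4) as col4
from   (values ('a,b,a,c')) as v(s);
-- col1 | col2 | col3 | col4
-- a    | b    | a    | c
```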

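【Comments on the question】:

Please provide sample data and the desired result as tabular text. You want to split into columns, so how do you want those duplicates handled?
I think my description is quite precise; if you had just read it, you would know what I mean by duplicate values. I want to split a string column of delimited values into separate columns, but without duplicate values, only unique ones.
Sample data would probably remove some of the ambiguity in your question. What if there are fewer than 4 unique elements? What do you expect for an input of 1,1,1,1,1 or 1,2,1,2,1,2,3,4,3,4 or 1,1,2,2,3,3,4,4,5,5?

Tags: sql string postgresql split distinct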

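【Solution 1】:

If I follow this correctly, you can first split the string into rows in a derived table, keep the distinct values, and then pivot them into columns. We need to keep track of the original position of each value, which is where with ordinality comes in handy: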

select t.*, x.*
from tbl t
cross join lateral (
    select 
        max(colx) filter(where rn = 1) col1,
        max(colx) filter(where rn = 2) col2,
        max(colx) filter(where rn = 3) col3,
        max(colx) filter(where rn = 4) col4
    from (
        -- number the distinct values by first position; partitioning by colx here would give every group rn = 1
        select colx, row_number() over(order by min(n)) rn
        from regexp_split_to_table(t.col, ',') with ordinality x(colx, n)
        group by colx
    ) x
) x

When there are duplicates, only the first occurrence of each value is kept.
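
As a quick check of the idea, here is a self-contained sketch that numbers the distinct values by their first position and pivots them into four columns. The inline table contents are assumptions made up for illustration:

```sql
-- Hypothetical sample rows; only the names tbl and col come from the question.
with tbl(col) as (
    values ('b,a,b,c,a,d'), ('x,x,x')
)
select t.col, x.*
from tbl t
cross join lateral (
    select
        max(colx) filter (where rn = 1) as col1,
        max(colx) filter (where rn = 2) as col2,
        max(colx) filter (where rn = 3) as col3,
        max(colx) filter (where rn = 4) as col4
    from (
        -- one row per distinct value, ranked by where it first appears
        select colx, row_number() over (order by min(n)) as rn
        from regexp_split_to_table(t.col, ',') with ordinality as s(colx, n)
        group by colx
    ) d
) x;
-- 'b,a,b,c,a,d' -> b, a, c, d
-- 'x,x,x'       -> x, NULL, NULL, NULL
```

Rows with fewer than four distinct values get NULL in the remaining columns, and a fifth or later distinct value is silently dropped.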

【Discussion】:

【Solution 2】:

I would convert the column into an array of distinct elements:

select elements[1] as col1, 
       elements[2] as col2, 
       elements[3] as col3, 
       elements[4] as col4
from (
  select array(select distinct on (e) e
               from unnest(string_to_array(col, ',')) with ordinality as c(e,idx) 
               order by e,idx)  as elements
  from tbl
) t
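
Note that distinct on (e) combined with order by e, idx yields the distinct elements sorted by value, not in the order they first appear in the string. If the original order matters, one possible variant (a sketch, not part of the answer) groups by element and orders by the smallest ordinality. The sample value is an assumption:

```sql
-- Hypothetical sample row for illustration.
with tbl(col) as (
    values ('b,a,b,c,a,d')
)
select elements[1] as col1,
       elements[2] as col2,
       elements[3] as col3,
       elements[4] as col4
from (
  select array(select e
               from unnest(string_to_array(col, ',')) with ordinality as c(e, idx)
               group by e
               order by min(idx)) as elements  -- first-occurrence order
  from tbl
) t;
-- -> b, a, c, d   (the distinct on version would give a, b, c, d)
```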

If you need to do this a lot, a function makes it more readable:

create function distinct_elements(p_input text, p_delim text)
  returns text[]
as
$$  
  select array(select distinct on (e) e
               from unnest(string_to_array(p_input, p_delim)) with ordinality as c(e,idx) 
               order by e,idx);
$$
language sql
immutable;

Then use it like this:

select elements[1] as col1, 
       elements[2] as col2, 
       elements[3] as col3, 
       elements[4] as col4
from (
  select distinct_elements(col, ',') as elements
  from tbl
) t;

A better approach, however, is to normalize your data model rather than storing comma-separated values in a single column.
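
What a normalized layout could look like depends on the actual data, which the question does not show. As a rough sketch under that caveat, with all table and column names being hypothetical, each value gets its own row and a unique constraint keeps duplicates out:

```sql
-- Hypothetical schema: one row per value instead of one delimited string.
create table parent (
    parent_id integer primary key
);

create table parent_value (
    parent_id integer not null references parent (parent_id),
    pos       integer not null,   -- original order of the value
    val       text    not null,
    primary key (parent_id, pos),
    unique (parent_id, val)       -- rejects repeated values per parent
);
```

With this layout the duplicates never get stored, so no splitting or de-duplication is needed at query time.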

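【Discussion】:

【Solution 3】:

"How can I retrieve only the unique string elements (substrings), without repeating the same substring?"

If that is what you want, I don't see why you would want four columns. Just use: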

select distinct part
from tbl
cross join lateral regexp_split_to_table(tbl.col, ',') part;

I am interpreting your question as removing duplicates across rows as well as within a given row.
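
To illustrate that interpretation, here is a small sketch with assumed sample rows. A value that occurs in several rows, or several times within one row, appears only once in the result:

```sql
-- Hypothetical sample rows for illustration.
with tbl(col) as (
    values ('a,b,a'), ('b,c')
)
select distinct part
from tbl
cross join lateral regexp_split_to_table(tbl.col, ',') as s(part)
order by part;
-- -> a, b, c   (one row each)
```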

【Discussion】: