Redshift: regexp to remove duplicates within column data

【Question title】: Redshift: regexp to remove duplicates within column data
【Posted】: 2018-04-27 17:20:50
【Question】:

I need to write a query in a Redshift database that removes duplicate values within a column.

select regexp_replace('GiftCard,GiftCard',  '([^,]*)(,\2)+($|,)', '\2\3')

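Expected result: GiftCard

Actual result: GiftCard,GiftCard

Basically, I want to search the values in the column and remove any that are repeated.

Can anyone help me with this?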

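【Comments】:

Does the data always look like , ? In what other formats do you see data in this column?
Yes, the data is always string1,string2,
I meant to ask whether the column always holds values like (string1,string1) or (string1,string2) (string1,string1). Do you just want to find the duplicated values and get a single string as output?
I think a Python UDF might work well here.
@hadooper. It will be arbitrary values and any number of strings.

Tags: sql regex amazon-redshift regexp-replace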

【Answer 1】:

I'm not sure this can be done with a regex query alone, but as Jon mentioned, a UDF would work well.
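To illustrate why a regex alone falls short, here is a minimal sketch using Python's re module (not Redshift's regexp_replace, whose regex dialect may differ). Note that the pattern in the question refers to \2 from inside group 2 itself; the sketch instead refers back to the captured word. The function name collapse_adjacent is hypothetical. Even with the backreference corrected, only duplicates that sit next to each other are collapsed.

```python
import re

def collapse_adjacent(s):
    # Hypothetical helper, for illustration only.
    # Group 1: start of string or a separating comma.
    # Group 2: one comma-free word.
    # (?:,\2)+ : one or more immediate repeats of that same word.
    # The lookahead makes sure the repeat ends at a comma or at the end.
    return re.sub(r'(^|,)([^,]*)(?:,\2)+(?=,|$)', r'\1\2', s)

print(collapse_adjacent('GiftCard,GiftCard'))
print(collapse_adjacent('Cat,Cat,Dog,Cat'))
```

The second call still returns a string containing Cat twice, because the two occurrences are separated by Dog. That is the case a set-based UDF handles and a single backreference pattern does not.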

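Just split the text on commas, build a set of the unique words, and return them in whatever format you need. The function would look something like this: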

CREATE FUNCTION f_unique_words (s text)
    RETURNS text
IMMUTABLE
AS $$
    return ','.join(set(s.split(',')))
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words('GiftCard,GiftCard');
[GiftCard]
> select f_unique_words('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[frog,GiftCard,Dog,Frog,Cat]
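As the second result shows, a set does not keep the words in their original order. If order matters, the function body could be written along the following lines. This is a minimal sketch in plain Python so it can be tried outside Redshift; the name unique_words_ordered is hypothetical, and the same body should fit inside the $$ ... $$ block of a plpythonu function.

```python
def unique_words_ordered(s):
    # Hypothetical helper: keep the first occurrence of each word,
    # preserving the order in which words appear in the input.
    seen = set()
    result = []
    for word in s.split(','):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return ','.join(result)

print(unique_words_ordered('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard'))
```

Like the original, this treats Frog and frog as different words.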

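The UDF approach depends on you having appropriate access to the cluster. To create the function, also make sure your user has been granted USAGE on the language plpythonu.

As a side note, if you want a case-insensitive version that does not turn all of the output into lowercase, you could do this: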

CREATE FUNCTION f_unique_words_ignore_case (s text)
    RETURNS text
IMMUTABLE
AS $$
    wordset = set(s.split(','))
    return ','.join(item for item in wordset if item.istitle() or item.title() not in wordset)
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words_ignore_case('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[GiftCard,Dog,Frog,Cat]
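The title-case check above relies on one of the variants being title-cased. For input such as 'CAT,cat', neither word is title-cased and 'Cat' is not in the set, so both would be kept. A possible alternative, sketched here in plain Python with the hypothetical name unique_words_first_spelling, compares lowercased keys and keeps whichever spelling appears first:

```python
def unique_words_first_spelling(s):
    # Hypothetical helper: case-insensitive de-duplication that keeps
    # the first spelling seen for each word and preserves input order.
    seen = set()
    result = []
    for word in s.split(','):
        key = word.lower()  # compare without regard to case
        if key not in seen:
            seen.add(key)
            result.append(word)
    return ','.join(result)

print(unique_words_first_spelling('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard'))
```

Which spelling to keep is a design choice; this sketch assumes first-seen is acceptable.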
