SQL: Find longest common string between rows

【Question title】: SQL: Find longest common string between rows
【Posted】: 2015-05-19 02:51:53
【Question description】:

I have a table T1:

 Col
-------
 1 THE APPLE
 THE APPLE
 THE APPLE 123
 THE APPLE 12/16
 BEST THE APPLE

I want T2:

 Result
--------
 THE APPLE

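I am using Redshift and am looking for a way to do fuzzy string matching in SQL. The column is at most 100 characters long, and I never have to compare more than 25 rows at a time.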

【Comments on the question】:

Take a look at this article: simple-talk.com/blogs/2014/12/20/…
What is the maximum length of all the values in this column?

Tags: sql string postgresql amazon-redshift

【Solution 1】:

If you are OK with getting the word that occurs in the most rows (words being delimited by spaces), you can use:

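-- For each space-delimited word, count the number of distinct rows it appears in
select word, count(distinct rn) as num_rows
from (
    -- One output row per word; rn numbers the source rows,
    -- so a word repeated within one row is counted once
    select unnest(string_to_array(col, ' ')) as word,
           row_number() over (order by col) as rn
    from tbl
) x
group by word
order by num_rows desc;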

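Fiddle: http://sqlfiddle.com/#!15/bc803/9/0

Note that this finds the word APPLE in 4 rows, not 5. That is because APPLE123 is a single word. APPLE 123 would be two words, one of which is APPLE, and that row would then be counted, but as the data stands it is not.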
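The counting logic of the query can be checked outside the database. The following is a minimal Python sketch, not the answer's own code: `rows_per_word` is a hypothetical helper, and the sample rows are an assumption chosen so that one row contains `APPLE123` as a single word, as described above.

```python
from collections import Counter

def rows_per_word(rows):
    """Count, for each space-delimited word, how many rows contain it."""
    counts = Counter()
    for row in rows:
        # set() counts a word once per row, mirroring count(distinct rn)
        counts.update(set(row.split(" ")))
    return counts

# Assumed sample data: the third row holds APPLE123 as one word
rows = ["1 THE APPLE", "THE APPLE", "THE APPLE123", "THE APPLE 12/16", "BEST THE APPLE"]
counts = rows_per_word(rows)
print(counts.most_common(2))  # [('THE', 5), ('APPLE', 4)]
```

With this data, THE is found in all five rows while APPLE is found in only four, which is the effect the note describes.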

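【Comments】:

You could probably also "recognize" APPLE123 by using regexp_split_to_table (instead of string_to_array and unnest) with a regular expression that splits on "real" words.
Can I do this for multiple words?
I don't know about doing this in a dynamic fashion, but if you are targeting "the 2 most common words in a row" you could use sqlfiddle.com/#!15/bc803/29/0 (and likewise for the 3 most common words, the 4 most common words, and so on). You have to add a join to the inline view for each additional word (right now I have 2, with the inline views labeled X and Y).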
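The regular-expression idea from the first comment can be sketched in Python as well. This is only an illustration under stated assumptions: a "real" word is taken to be a run of ASCII letters, `rows_per_token` is a hypothetical helper, and the sample rows are assumed data containing `APPLE123`. It splits on everything that is not a letter, which is roughly what a regexp_split_to_table call with a similar pattern would do.

```python
import re
from collections import Counter

def rows_per_token(rows, delimiter=r"[^A-Za-z]+"):
    """Count, for each run of letters, how many rows contain it."""
    counts = Counter()
    for row in rows:
        # Split on non-letters and drop the empty strings left at the edges
        tokens = {t for t in re.split(delimiter, row) if t}
        counts.update(tokens)
    return counts

# Assumed sample data: the third row holds APPLE123 as one word
rows = ["1 THE APPLE", "THE APPLE", "THE APPLE123", "THE APPLE 12/16", "BEST THE APPLE"]
counts = rows_per_token(rows)
print(counts["APPLE"])  # 5
```

Because digits now act as delimiters, APPLE123 contributes the token APPLE and the word is found in all five rows.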

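【Solution 2】:

This problem takes a fair amount of complexity to solve, and the running time grows steeply with the string length and the number of records. However, given that your table T1 is rather small, you might just get away with the following PL/pgSQL function.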

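Algorithm

1. Find the shortest value in T1(col). It is the longest possible match across all records, so it becomes the first candidate string.
2. Check whether the candidate occurs in all the other rows of T1. If it does, add the current candidate to the result set.
3. Move the candidate one position along the shortest value and go back to step 2, until the candidate reaches the end of the shortest string.
4. If any matching candidate was found, return from the function. Otherwise, shorten the candidate by 1, start again at the beginning of the shortest string, and go to step 2. If no more candidates can be extracted from the shortest string, return NULL.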
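The steps above can be sketched in Python to try the logic outside the database. This is a minimal sketch rather than a port of the function that follows: `longest_common_substrings` is a hypothetical name, and it returns an empty list where the description says NULL.

```python
def longest_common_substrings(rows):
    """Return the longest substrings of the shortest row that occur in every row."""
    if not rows:
        return []
    shortest = min(rows, key=len)                      # step 1
    for size in range(len(shortest), 0, -1):           # longest candidates first
        matches = []
        for start in range(len(shortest) - size + 1):  # step 3: slide along
            candidate = shortest[start:start + size]
            # step 2: all() stops at the first row that lacks the candidate
            if all(candidate in row for row in rows):
                matches.append(candidate)
        if matches:                                    # step 4: stop at this length
            return matches
    return []                                          # no common substring at all

t1 = ["1 THE APPLE", "THE APPLE", "THE APPLE 123", "THE APPLE 12/16", "BEST THE APPLE"]
print(longest_common_substrings(t1))  # ['THE APPLE']
```

Candidates are enumerated by position, so a substring that occurs twice in the shortest row is reported twice, for example `['a', 'a']` for the rows `aXa` and `bab`.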

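Code

The important part of the code below is the short-circuit when checking for matches: as soon as a single record's col does not match the candidate string, there is no need to check any further. For long strings, the comparison is therefore effectively between the shortest string and one other string, and the number of rows checked grows only once the candidate strings become short enough to be genuinely more common.

The string comparison is case sensitive; change LIKE to ILIKE if you want it to be case insensitive. As a bonus feature, you get multiple matching strings (all of the same length, obviously) when several are present in all rows. On the downside, once the search gets down to single-character matches (and possibly some strings of 2 or more characters), it will report the same string multiple times. You can remove those duplicates with SELECT DISTINCT *.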

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;       -- The shortest string in T1(col) so the longest possible match
  candidate varchar;       -- Candidate string to test
  sz_sh     integer;       -- Length of "shortest"
  l         integer := 1;  -- Starting position of "candidate" in "shortest"
  sz        integer;       -- Length of "candidate"
  fail      boolean;       -- True when a row of T1(col) does not contain "candidate"
  found_one boolean := false; -- Flag if we found at least one match
BEGIN
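  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
  LIMIT 1;

  -- Nothing to compare when T1 is empty or holds only NULLs
  IF shortest IS NULL THEN
    RETURN;
  END IF;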

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate.
    -- Note: LIKE treats any % or _ inside the candidate as a wildcard.
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- .. otherwise shorten the candidate
      sz := sz - 1;
      -- Exit with empty result when all candidates have been examined;
      -- the empty string is never tested because it would match every row
      EXIT WHEN sz < 1;
      l := 1;
    ELSE
      -- Move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql STABLE;  -- STABLE, not IMMUTABLE: the result depends on the contents of T1

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

Simple test

Create some test data:

INSERT INTO T1 
  SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

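This generates 100 rows, each 106 characters long, that yield the matches "hello" and "match" (any other match is very unlikely). The function produces the correct two strings in less than half a second (no-frills Ubuntu server, PG 9.3, i5 CPU, 4GB RAM).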
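The shape of this test data can be reproduced in Python to confirm the row count and row length. This is a sketch under assumptions: `random_hex32` is a hypothetical stand-in for md5(random()::text), since both yield 32 hexadecimal characters, and the fixed seed only makes the run repeatable.

```python
import random

def random_hex32(rng):
    """Stand-in for md5(random()::text): 32 random hexadecimal characters."""
    return format(rng.getrandbits(128), "032x")

def make_rows(seed=0, per_pattern=25):
    rng = random.Random(seed)
    m = lambda: random_hex32(rng)
    # The four layouts used by the four INSERT statements
    patterns = [
        lambda: "hello" + m() + m() + "match" + m(),
        lambda: m() + "match" + "hello" + m() + m(),
        lambda: "match" + m() + "hello" + m() + m(),
        lambda: m() + "hello" + m() + "match" + m(),
    ]
    return [make() for make in patterns for _ in range(per_pattern)]

rows = make_rows()
print(len(rows), {len(r) for r in rows})  # 100 {106}
print(all("hello" in r and "match" in r for r in rows))  # True
```

Each row is three 32-character hex strings plus the two 5-character markers, which gives the 106 characters mentioned above.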

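【Comments】:

I don't think it is NP-complete. I think the brute-force algorithm is something like O(#records with a given id * number of characters in a string^4). NP-complete problems have to deal with all combinations.
@GordonLinoff You are right, and I am actually surprised at how efficient it turned out to be. On the face of it, the short-cut makes it more like O(log(#recs) * #chars^2 * log(#chars)). Not bad at all. Answer updated.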
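To put a rough number on the worst case for the limits given in the question (100 characters, 25 rows), the candidates can simply be counted. This is a back-of-the-envelope sketch, not a complexity proof: `worst_case_checks` is a hypothetical helper, and it assumes that no comparison ever short-circuits and that no match is found before the single-character candidates.

```python
def worst_case_checks(shortest_len, n_rows):
    """Count candidate substrings (by position) and row comparisons without short-circuit."""
    # A string of length n has n - size + 1 candidates of each length size
    candidates = sum(shortest_len - size + 1 for size in range(1, shortest_len + 1))
    return candidates, candidates * n_rows

print(worst_case_checks(100, 25))  # (5050, 126250)
```

The candidate count is n(n+1)/2 for a shortest string of n characters, and each comparison is itself a substring search, so the actual cost depends on how early the short-circuit kicks in.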