【Question title】: Efficiently find a given subsequence in a string, maximizing the number of contiguous characters
【Posted】: 2016-09-08 13:57:58
【Question description】:

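Long problem description

Fuzzy string matcher utilities like fzf or CtrlP filter a list of strings for those that contain a given search string as a subsequence. As an example, suppose a user wants to find a specific photo in a list of files. To find the file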

/home/user/photos/2016/pyongyang_photo1.png

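it suffices to type ph2016png, because this search string is a subsequence of the file name. (Note that this is not LCS: the whole search string must be a subsequence of the file name.)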
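As a minimal sketch of the subsequence test itself (the function name `is_subsequence` is made up for illustration and does not come from fzf or CtrlP):

```python
def is_subsequence(query: str, text: str) -> bool:
    """Return True if every character of `query` occurs in `text`, in order."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first hit,
    # so each following character is only searched for to the right of it.
    return all(ch in remaining for ch in query)


path = "/home/user/photos/2016/pyongyang_photo1.png"
print(is_subsequence("ph2016png", path))  # True
print(is_subsequence("png2016", path))    # False
```

This runs in time linear in the length of `text`, which is why the check alone is the easy part of the problem.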

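It is trivial to check whether a given search string is a subsequence of another string, but I wonder how to efficiently obtain the best match. In the example above there are multiple possible matches. One is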

/home/user/<b>ph</b>otos/<b>2016</b>/<b>p</b>yo<b>n</b>gyan<b>g</b>_photo1.png

but the one the user probably had in mind is

/home/user/<b>ph</b>otos/<b>2016</b>/pyongyang_photo1.<b>png</b>

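To formalize this, I define the "best" match as the one composed of the smallest number of substrings. This number is 5 for the first example match and 3 for the second.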
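To make the counting concrete, here is a small sketch that counts the substrings of a match given the (0-based) positions it uses in the file name. The helper `count_segments` and the two position lists are illustrative; the positions were read off the two example matches by hand.

```python
def count_segments(positions):
    """Number of maximal runs of consecutive indices in a sorted list of positions."""
    if not positions:
        return 0
    # Every jump larger than 1 between neighbouring positions starts a new substring.
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)


path = "/home/user/photos/2016/pyongyang_photo1.png"
first_match = [11, 12, 18, 19, 20, 21, 23, 26, 31]   # ph | 2016 | p | n | g
second_match = [11, 12, 18, 19, 20, 21, 40, 41, 42]  # ph | 2016 | png

print(count_segments(first_match))   # 5
print(count_segments(second_match))  # 3
```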

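I came up with this because it would be interesting to obtain the best match in order to assign a score to each result for sorting. I am not interested in approximate solutions, though; my interest in this problem is primarily academic.

tl;dr problem description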

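Given strings s and t, find among the subsequences of t that are equal to s one that maximizes the number of pairs of elements that are contiguous in t.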
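One possible way to write this down, using 0-based string indexing and the notation n = |s|, m = |t| (introduced here for illustration, not part of the original post):

```latex
\[
\max_{0 \le i_1 < i_2 < \dots < i_n < m}
\;\bigl|\{\, k \in \{1, \dots, n-1\} : i_{k+1} = i_k + 1 \,\}\bigr|
\quad \text{subject to} \quad t[i_k] = s[k-1] \ \text{ for } k = 1, \dots, n .
\]
```

For n ≥ 1, a match with p contiguous pairs consists of exactly n − p substrings, so maximizing the number of contiguous pairs is the same as minimizing the number of substrings used in the long description.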

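What I have tried so far

For discussion, let's call the search query s and the string to test t. The solution of the problem is denoted fuzzy(s, t). I will use Python's string slicing notation. The simplest approach is as follows:

Since any solution must use all characters of s in order, an algorithm for this problem can start by searching for the first occurrence of s[0] in t (at index i) and then take the better of the two solutions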

t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurrence
                                   # of s[0] in t instead

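This is obviously not the best solution to this problem. On the contrary, it is the obvious brute-force one. (In an earlier version of this question I experimented with simultaneously searching for the last occurrence of s[-1] and using that information, but it turned out that this approach does not work.)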
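For reference, a runnable sketch of this use-or-skip recursion might look as follows. It returns only the number of substrings rather than the match itself, walks t one character at a time instead of jumping to the next occurrence of s[0], and the name `fuzzy_bruteforce` is made up here. It takes exponential time in the worst case, so it is meant as a correctness baseline for small inputs only.

```python
INF = float("inf")


def fuzzy_bruteforce(s: str, t: str):
    """Minimum number of substrings of t that spell s in order, or None if s does not match."""

    def best(si: int, ti: int, prev_used: bool):
        # si: index of the next character of s to place
        # ti: index of the next character of t to consider
        # prev_used: whether t[ti - 1] is part of the match
        if si == len(s):
            return 0
        if ti == len(t):
            return INF
        # Option 1: skip t[ti].
        result = best(si, ti + 1, False)
        # Option 2: use t[ti]; this opens a new substring unless t[ti - 1] was used too.
        if s[si] == t[ti]:
            result = min(result, best(si + 1, ti + 1, True) + (0 if prev_used else 1))
        return result

    answer = best(0, 0, False)
    return None if answer == INF else answer


print(fuzzy_bruteforce("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))  # 3
```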

→ My question is thus: what is the most efficient way to solve this problem?

【Question comments】:

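Nice one! Some clarifications: (1) You say the best match is the one with the smallest number of parts. Does that mean all characters of the input must be found, and in the correct order? (2) What if the same string has several possible matches, all with the same number of parts? Is the distance between the parts of interest?
All characters of the input must be found in the correct order. If several possible matches are optimal under this criterion, it suffices to return one of them.
The second part means that finding the first match is not enough; the whole string has to be scanned to confirm there is no (better) match, right?
Yes, unless there is a way to prove that no better match can exist. For example, if I search for photo in the situation outlined above, the entire query is contained in the file name as a single substring; obviously no match can be better than that.
For 2 strings of length n and m, this can be solved exactly in O(nm) time and space using Gotoh's algorithm, which aligns 2 strings with gap-open costs. All you need to do is specify a non-zero gap-open cost and forbid deletions (or, equivalently, assign a huge penalty to deletions).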
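The last comment points to an O(nm) dynamic program. The sketch below is not Gotoh's algorithm itself but a simplified DP in the same spirit, assuming a cost of 1 for every substring that is opened and no other costs. The name `fuzzy_dp` and the two-row layout are choices made for this illustration; because only the count is returned, it keeps two rows instead of the full table.

```python
INF = float("inf")


def fuzzy_dp(s: str, t: str):
    """Minimum number of substrings of t that spell s in order, or None if s does not match."""
    n, m = len(s), len(t)
    if n == 0:
        return 0
    # prev_end[j]:  best cost for the previous prefix of s with its last char matched at t[j-1]
    # prev_upto[j]: best cost for the previous prefix of s matched anywhere inside t[:j]
    prev_end = [INF] * (m + 1)
    prev_upto = [0] * (m + 1)  # the empty prefix matches everywhere at cost 0
    for i in range(1, n + 1):
        end = [INF] * (m + 1)
        upto = [INF] * (m + 1)
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                end[j] = min(prev_end[j - 1],       # extend the substring ending at t[j-2]
                             prev_upto[j - 1] + 1)  # open a new substring ("gap-open" cost)
            upto[j] = min(upto[j - 1], end[j])
        prev_end, prev_upto = end, upto
    return None if prev_upto[m] == INF else prev_upto[m]


print(fuzzy_dp("ph2016png", "/home/user/photos/2016/pyongyang_photo1.png"))  # 3
```

Recovering the actual substrings would require keeping the full table (or back-pointers), which brings the space back to O(nm).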

Tags: algorithm dynamic-programming string-matching fuzzy-search

【Solution 1】:

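I would suggest creating a search tree in which each node represents a character position in the haystack that matches one of the needle characters.

The top nodes are siblings and represent the occurrences of the first needle character in the haystack.

The children of a parent node are the nodes that represent occurrences of the next needle character in the haystack, but only those located after the position represented by that parent node.

This logically means that some children are shared by several parents, so the structure is not really a tree but a directed acyclic graph. Some sibling parents may even have exactly the same children. Other parents may have no children at all: they are dead ends, unless they are at the bottom of the graph, where the leaves represent positions of the last needle character.

Once this graph is built, a depth-first search through it can easily derive the number of segments still needed from a given node onwards, and then minimise that number among the alternatives.

I have added some comments to the Python code below. The code could probably still be improved, but it already seems quite efficient compared to your solution.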

def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of  
        # segments needed for any solution. If we find such case, 
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index, 
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the 
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content 
    # of the segments instead of only the count. It runs in linear time.

    if count == 1: # Deal with short-cut case 
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments

The function can be called with an optional third argument:

haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"

print (fuzzy_trincot(haystack, needle))

print (fuzzy_trincot(haystack, needle, True))

Output:

3
['ph', '2016', 'png']

As the function is optimised to return only the count, the second call adds a bit to the execution time.

【Comments】:

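At first sight, this solution sounds related to the Gotoh algorithm suggested in the comments above, and it also appears to have the same asymptotic behavior. (It should be better in practice, since the tree only reaches size n*m in exceptional cases.) It is much easier to understand, though. And, at least in this implementation, it is really quite fast! In practice, this is about on par with Nelxiost's solution, usually even a bit faster, except for cases like haystack = "a"*50, needle = "a"*50, for obvious reasons. Great!
Thanks. I like this question! I was looking at some random strings and just wanted to mention that fuzzy_pberndt fails on the test case ("abbcabaacbbabab", "ababb", 2).
Another one: fuzzy_pberndt and fuzzy_nelxiost both fail on the test case ('babaacb', 'abab', 2).
... and a really simple one: fuzzy_nelxiost fails on the test case ('babaa','aba',1)
I also tested fuzzy_j_random_hacker: it fails the test case ('bacbcbaab', 'bcba', 1). fuzzy_nelxiost and fuzzy_pberndt fail that test as well.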

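【Solution 2】:

This might not be the most efficient solution, but it is efficient and easy to implement. To illustrate, I will borrow your example. Let /home/user/photos/2016/pyongyang_photo1.png be the filename and ph2016png the input.

The first step (precomputing) is optional, but it may help speed up the next step (setup), especially if you apply the algorithm to many filenames.

Precomputing
Create a table that counts the occurrences of each character in the input. Since you are probably dealing only with ASCII characters, 256 entries are enough (maybe 128, or even fewer, depending on the character set).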

"ph2016png"
['p'] : 2
['h'] : 1
['2'] : 1
['0'] : 1
['b'] : 0
...
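In Python, such an occurrence table could be sketched with `collections.Counter` instead of a fixed 256-entry array. The helper `has_enough_characters` is a hypothetical name for the count check that the setup step performs.

```python
from collections import Counter

query = "ph2016png"
query_counts = Counter(query)  # characters that never occur simply count as 0

print(query_counts["p"], query_counts["h"], query_counts["b"])  # 2 1 0


def has_enough_characters(filename: str, counts: Counter) -> bool:
    """Cheap pre-filter: the filename must contain each query character often enough."""
    file_counts = Counter(filename)
    return all(file_counts[ch] >= needed for ch, needed in counts.items())


print(has_enough_characters("/home/user/photos/2016/pyongyang_photo1.png", query_counts))  # True
```

Passing this filter is necessary but not sufficient, since it ignores the order of the characters.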

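Setup
Split the filename into substrings by discarding the characters that are not present in the input. At the same time, check that each character of the input occurs the required number of times in the filename (if the precomputation was done). Finally, check that the characters of the input appear in order in the list of substrings: if you treat the list of substrings as a single string, then for any given character of that string, every character found before it in the input must also be found before it in that string. This can be done while the substrings are being created.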

"/home/user/photos/2016/pyongyang_photo1.png"
"h", "ph", "2016", "p", "ng", "ng", "ph", "1", "png"
'p' must come before "h", so throw this one away
"ph", "2016", "p", "ng", "ng", "ph", "1", "png"
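The splitting part of the setup can be sketched without regular expressions as shown below. The function name `split_on_foreign_chars` is made up for this example, and the ordering filter that throws away the leading "h" is deliberately left out.

```python
def split_on_foreign_chars(filename: str, query: str):
    """Split filename into maximal runs of characters that also occur in query."""
    allowed = set(query)
    pieces, current = [], []
    for ch in filename:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A character outside the query ends the current run.
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return pieces


print(split_on_foreign_chars("/home/user/photos/2016/pyongyang_photo1.png", "ph2016png"))
# ['h', 'ph', '2016', 'p', 'ng', 'ng', 'ph', '1', 'png']
```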

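Core
Match the longest substring against the input and keep track of the longest match. This match can be anchored at the beginning of the substring (for example, matching ababa (substring) against babaa (input) would yield aba rather than baba), because that is easier to implement, although it does not have to be. If you do not get a complete match, use the longest match to slice the substring again, then retry with the next longest substring.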

Since there is no instance of incomplete match with your example,
let's take something else, made to illustrate the point.
Let's take "babaaababcb" as the filename, and "ababb" as input.
Substrings : "abaaabab", "b"
Longest substring : "abaaabab"

If you keep the beginning of matches
Longest match : "aba"
Slice "abaaabab" into "aba", "aabab"
-> "aba", "aabab", "b"
Retry with "aabab"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

Otherwise (harder to implement, not necessarily better performing, as shown in this example)
Longest match : "abab"
Slice "abaaabab" into "abaa", "abab"
-> "abaa", "abab", "b"
Retry with "abaa"
-> "aba", "a", "abab", "b"
Retry with "abab" (complete match)

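If you do get a complete match, continue by splitting the input in two, together with the list of substrings, and repeat the matching with the longest substrings.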

With "ph2016png" as input
Longest substring : "2016"
Complete match
Match substrings "h", "ph" with input "ph"
Match substrings "p", "ng", "ng", "ph", "1", "png" with input "png"

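You are guaranteed to find the sequence that uses the fewest substrings, because you try the longest substrings first. This usually performs well if the input does not contain many short substrings from the filename.

【Comments】: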

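Nice. I especially like the setup step. In my solution, one has to search repeatedly for the next occurrence of the next character of the search string, which raises the question of how to store results for subsequent lookups. By removing the unnecessary characters, you avoid searching for characters altogether. I hacked an implementation (without thinking too much of optimization) of this method, and it is 2 to 5 times faster than mine.
Looking at your code, I realized that I forgot to add a part to the setup that actually uses the table of occurrences rather than a table of presence (I have edited). Also, I am not good at Python, but I am pretty sure there is an easy way to avoid the regex and improve performance.
@Philip, I tried the implementation you created for this, but I got an assertion error on the test case ("aba", "ab", 1).
@trincot Thanks! I did not "keep track of the longest match" properly. I have fixed that now, but then I noticed I misunderstood something else. @Nelxost: In the first example of the "Core" section, you eventually end up matching ['16', 'p', 'ng', 'ng', 'ph', '1'] against 61. None of the 2-character strings match, so you check, find that 16 has the best match, split it into 1/6 and repeat. Since no 2-character strings match any more, you fall back to single characters. But 1 is now present twice in the list. How do you decide which one to use without trying both?
@Phillip Well, you try the first one and see whether it works. Then you end up trying to match [] against "6" and ['6', 'p', 'n', 'g', 'n', 'g', 'p', 'h', '1'] against "". So you throw it away and match ['6', 'p', 'n', 'g', 'n', 'g', 'p', 'h', '1'] against "61".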