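Best way to determine if a sequence is in another sequence? (Answers)

【Question title】: Best way to determine if a sequence is in another sequence?
【Posted】: 2010-09-30 08:23:08
【Question】:

This is a generalization of the "string contains substring" problem to (more) arbitrary types.

Given a sequence (such as a list or tuple), what is the best way of determining whether another sequence is in it? As a bonus, it should return the index of the element where the subsequence starts:

Example usage (Sequence in Sequence):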

>>> seq_in_seq([5,6],  [4,'a',3,5,6])
3
>>> seq_in_seq([5,7],  [4,'a',3,5,6])
-1 # or None, or whatever

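So far, I just rely on brute force, and it seems slow, ugly, and clumsy.

【Question comments】:

Does this answer your question? Check for presence of a sliced list in Python

Tags: python algorithm sequence

【Solution 1】:

I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in the Python Cookbook, 2nd edition. The related code is at http://code.activestate.com/recipes/117214/

It finds all the correct subsequences in a given sequence, and should be used as an iterator: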

>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print(s)
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print(s)
(nothing)

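【Discussion】:

Note that the KMP implementation given on code.activestate was demonstrably slower by a factor of 30-500 for some (perhaps unrepresentative) input. Benchmarking to see whether the dumb built-in methods outperform it seems to be a good idea!
KMP is known to be about twice as slow as the naive algorithm in practice. Hence, for most purposes it is completely inappropriate, despite its good asymptotic worst-case runtime.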
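
A minimal sketch of the benchmark these comments suggest, comparing a slice-and-compare search against a pure-Python KMP. The function names `naive_find` and `kmp_find` and the test data are made up for illustration; neither function is the ActiveState recipe, and timings will vary with the input and the interpreter, so treat any result as indicative only.

```python
import timeit


def naive_find(sub, seq):
    """Return the first index of sub in seq, or -1 (slice and compare).

    Both arguments are assumed to be of the same type (e.g. two lists),
    because a list never compares equal to a tuple.
    """
    n, m = len(seq), len(sub)
    for i in range(n - m + 1):
        if seq[i:i + m] == sub:
            return i
    return -1


def kmp_find(sub, seq):
    """Return the first index of sub in seq, or -1 (Knuth-Morris-Pratt)."""
    m = len(sub)
    if m == 0:
        return 0
    # fail[q]: length of the longest proper prefix of sub[:q + 1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and sub[k] != sub[q]:
            k = fail[k - 1]
        if sub[k] == sub[q]:
            k += 1
        fail[q] = k
    q = 0
    for i, item in enumerate(seq):
        while q > 0 and sub[q] != item:
            q = fail[q - 1]
        if sub[q] == item:
            q += 1
        if q == m:
            return i - m + 1
    return -1


if __name__ == "__main__":
    haystack = list(range(10000))        # hypothetical test data
    needle = list(range(9990, 10000))    # match near the end
    for func in (naive_find, kmp_find):
        seconds = timeit.timeit(lambda: func(needle, haystack), number=20)
        print(func.__name__, round(seconds, 4))
```

It is worth repeating the measurement with inputs that resemble your real data (many partial matches, long needles, and so on), since that is where the two approaches differ most.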

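【Solution 2】:

Here is a brute-force approach, O(n*m) (similar to @mcella's answer). For small input sequences it might be faster than a pure-Python implementation of the Knuth-Morris-Pratt algorithm, O(n+m) (see @Gregg Lind's answer).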

#!/usr/bin/env python
def index(subseq, seq):
    """Return an index of `subseq`uence in the `seq`uence.

    Or `-1` if `subseq` is not a subsequence of the `seq`.

    The time complexity of the algorithm is O(n*m), where

        n, m = len(seq), len(subseq)

    >>> index([1,2], list(range(5)))
    1
    >>> index(list(range(1, 6)), list(range(5)))
    -1
    >>> index(list(range(5)), list(range(5)))
    0
    >>> index([1,2], [0, 1, 0, 1, 2])
    3
    """
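    i, n, m = -1, len(seq), len(subseq)
    try:
        while True:
            # Find the next candidate start: the next position of subseq[0]
            # that still leaves room for the whole subsequence.
            i = seq.index(subseq[0], i + 1, n - m + 1)
            if subseq == seq[i:i + m]:
                return i
    except ValueError:
        # seq.index() found no further candidate.
        return -1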

if __name__ == '__main__':
    import doctest; doctest.testmod()

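I wonder how large "small" is in this case?

【Discussion】:

【Solution 3】:

A simple approach: convert to strings and rely on string matching.

Example using lists of strings: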

>>> f = ["foo", "bar", "baz"]
>>> g = ["foo", "bar"]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True

Example using tuples of strings:

>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True

Example using lists of numbers:

>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True

【Discussion】:

I like it! For quick and dirty stuff, anyway. In general: def is_in(seq1, seq2): return str(list(seq1))[1:-1] in str(list(seq2))[1:-1] I guess it is not a good way to find the index of the match, though.
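
One caveat with the string trick is that a match can straddle element boundaries. The sketch below shows such a false positive and one possible workaround that wraps every element in a separator. The names `is_in_naive`, `encode`, `is_in_delimited` and `SEP` are invented for this example, and the workaround assumes the separator never appears in any element's repr(); it also compares by repr(), so `1` and `1.0` are treated as different.

```python
def is_in_naive(sub, full):
    # The one-liner from the comment above: compare the list reprs
    # with the surrounding brackets cut off.
    return str(list(sub))[1:-1] in str(list(full))[1:-1]


SEP = "\x00"  # assumption: never occurs in the repr() of any element


def encode(seq):
    # Surround every element's repr() with separators so that a match
    # can only begin and end on element boundaries.
    return SEP + SEP.join(repr(item) for item in seq) + SEP


def is_in_delimited(sub, full):
    sub = list(sub)
    if not sub:
        return True  # the empty sequence is contained in every sequence
    return encode(sub) in encode(full)


print(is_in_naive([1, 2], [11, 23]))      # True, although [1, 2] is not in there
print(is_in_delimited([1, 2], [11, 23]))  # False
```

The naive check matches because "1, 2" is a substring of "11, 23".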

【Solution 4】:

Same thing as string matching, sir... Knuth-Morris-Pratt string matching

【Discussion】:

【Solution 5】:

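>>> def seq_in_seq(subseq, seq):
...     offset = 0  # number of elements already sliced off the front of seq
...     while subseq[0] in seq:
...         index = seq.index(subseq[0])
...         if subseq == seq[index:index + len(subseq)]:
...             return offset + index  # index in the original sequence
...         else:
...             offset += index + 1
...             seq = seq[index + 1:]
...     else:
...         return -1
...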
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1

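Sorry, I'm not an algorithm expert; this is just the fastest thing I could come up with at the moment. At least I think it looks nice (to me), and I had fun coding it. ;-)

Most probably it is the same thing your brute-force approach is doing.

【Discussion】:

It is neat, but brute-forcey --> O(mn)

【Solution 6】:

Brute force may be fine for small patterns.

For larger ones, look at the Aho-Corasick algorithm.

【Discussion】:

Aho-Corasick would be great. I'm specifically looking for Python, or Pythonish, solutions... so if there were an implementation, that would be great. I'll poke around.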
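
For what a Pythonish Aho-Corasick might look like, here is a compact sketch that works on arbitrary sequences instead of strings. It is not taken from any library: `build_automaton` and `find_all` are names made up here, elements are assumed to be hashable, and patterns are assumed to be non-empty. Note that Aho-Corasick mainly pays off when searching for several patterns at once; for a single pattern it behaves much like KMP.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and outputs for the patterns."""
    goto = [{}]   # goto[state][item] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # numbers of the patterns that end at each state
    for number, pattern in enumerate(patterns):
        state = 0
        for item in pattern:
            if item not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][item] = len(goto) - 1
            state = goto[state][item]
        out[state].append(number)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for item, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and item not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(item, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(patterns, seq):
    """Yield (start_index, pattern_number) for every occurrence in seq."""
    patterns = [list(p) for p in patterns]
    goto, fail, out = build_automaton(patterns)
    state = 0
    for i, item in enumerate(seq):
        while state and item not in goto[state]:
            state = fail[state]
        state = goto[state].get(item, 0)
        for number in out[state]:
            yield i - len(patterns[number]) + 1, number


print(list(find_all([[5, 6], [3, 5]], [4, 'a', 3, 5, 6])))
# [(2, 1), (3, 0)]
```

Because `seq` is consumed one item at a time, this also works when the haystack is an iterator.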

【Solution 7】:

Here is another KMP implementation:

from itertools import tee

def seq_in_seq(seq1,seq2):
    '''
    Return the index where seq1 appears in seq2, or -1 if 
    seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm

    based heavily on code by Neale Pickett <neale@woozle.org>
    found at:  woozle.org/~neale/src/python/kmp.py

    >>> seq_in_seq(range(3),range(5))
    0
    >>> seq_in_seq(range(3)[-1:],range(5))
    2
    >>> seq_in_seq(range(6),range(5))
    -1
    '''
    def compute_prefix_function(p):
        m = len(p)
        pi = [0] * m
        k = 0
        for q in range(1, m):
            while k > 0 and p[k] != p[q]:
                k = pi[k - 1]
            if p[k] == p[q]:
                k = k + 1
            pi[q] = k
        return pi

    t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
    m,n = len(p),len(t)
    pi = compute_prefix_function(p)
    q = 0
    for i in range(n):
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]
        if p[q] == t[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1

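【Discussion】:

The tee calls don't seem to be good for anything, since the other element of tee's output 2-tuple is ignored. seq1 and seq2 are each copied into two new generators, one of which is instantiated into a list while the other is ignored.

【Solution 8】:

I'm a bit late to the party, but here is something simple using strings: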

>>> def seq_in_seq(sub, full):
...     f = ''.join([repr(d) for d in full]).replace("'", "")
...     s = ''.join([repr(d) for d in sub]).replace("'", "")
...     #return f.find(s) #<-- not reliable for finding indices in all cases
...     return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True

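As Ilya V. Schurov pointed out, the find method in this case will not return the correct index when the sequences contain multi-character strings or multi-digit numbers.

【Discussion】:

This solution is not reliable if the elements of the sequences have differing lengths: it is no longer obvious how to translate the returned index into an index in the initial sequence. Note also that the backtick syntax (`d`) was removed in Python 3 and is discouraged.
An example where it is unreliable even when all elements have the same size: sub='ab', full='aa','bb'

【Solution 9】:

For what it's worth, I tried using a deque like so: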

from collections import deque
from itertools import islice

def seq_in_seq(needle, haystack):
    """Generator of indices where needle is found in haystack."""
    needle = deque(needle)
    haystack = iter(haystack)  # Works with iterators/streams!
    length = len(needle)
    # Deque will automatically call deque.popleft() after deque.append()
    # with the `maxlen` set equal to the needle length.
    window = deque(islice(haystack, length), maxlen=length)
    if needle == window:
        yield 0  # Match at the start of the haystack.
    for index, value in enumerate(haystack, start=1):
        window.append(value)
        if needle == window:
            yield index

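One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming, it will still work (unlike the solutions that rely on slicing).

The solution is still brute force, O(n*m). Some simple local benchmarking showed it was about 100x slower than the C implementation of string searching in str.index.

【Discussion】:

【Solution 10】:

Another approach, using sets: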

>>> set([5,6]) == set([5,6]) & set([4,'a',3,5,6])
True

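【Discussion】:

This merely finds out whether the set is a subset of the sequence, not whether the elements actually appear in that order. set([5,6])== set([5,6])&set([4,'a',5,4,6]) returns True
This could be a first quick test, though: check that all elements are in the full list.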
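
A sketch of how that "first quick test" could be combined with an exact, order-aware check. The function name is reused from the question for convenience; the code assumes all elements are hashable, and the set test is only a necessary condition, never a sufficient one. Building set(full) costs a full pass over the haystack, so whether the pre-check saves any time depends on the data.

```python
def seq_in_seq(sub, full):
    """Return the first index at which sub occurs in full, or -1."""
    sub, full = list(sub), list(full)
    # Quick rejection: if some element of sub never occurs in full,
    # sub cannot be a contiguous run of full.
    if not set(sub) <= set(full):
        return -1
    # The set test ignores order and repetition, so confirm with an
    # exact slice comparison.
    m = len(sub)
    for i in range(len(full) - m + 1):
        if full[i:i + m] == sub:
            return i
    return -1


print(seq_in_seq([5, 6], [4, 'a', 3, 5, 6]))  # 3
print(seq_in_seq([5, 6], [4, 'a', 5, 4, 6]))  # -1: passes the set test, fails the order check
```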