Return strings that are not substrings of other strings - is it possible in less than O(n^2) time?

【Question title】: Return string that is not a substring of other strings - is it possible in time less than O(n^2)?
【Posted】: 2017-02-05 19:57:18
【Question】:

You are given an array of strings. You must return only those strings that are not substrings of any other string in the array. Input: ['abc','abcd','ab','def','efgd']. The output should be 'abcd' and 'efgd'. I came up with a solution in Python with O(n^2) time complexity. Is there a solution with lower time complexity? My solution:

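def sub(l, s):
    # Note: the characters of each candidate are sorted before the check, as in
    # the original post; this is why 'def' matches 'efgd' (sorted: 'defg').
    l1 = [''.join(sorted(x)) for x in l]
    for i in l1:
        if s in i:
            return True
    return False

def main(l):
    for i in range(len(l)):
        # Compare l[i] against every other element. l[:i] (rather than
        # l[0:i-1]) is the slice that leaves out only index i.
        if not sub(l[:i] + l[i+1:], l[i]):
            print(l[i])


main(['abc','abcd','ab','def','efgd'])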

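【Comments】:

Why is 'def' not in your expected output?
Posting your O(n^2) solution might make others more willing to help, since it shows that you made an attempt first.
There may be an O(n) solution using a suffix tree, where n is the total number of characters across all strings. I'm not sure, though.
Clearly, if the length of the elements is unbounded and n is the length of the list, then no bound on the algorithm is possible, because the elements can be arbitrarily large. If n is the sum of the lengths of the strings, there is an O(n) solution: build a generalized suffix tree of the strings, then traverse it and find all the leaves where strings end.
Hmm, correction: the traversal is not that simple. You have to find the maximal leaves (which correspond to elements of the list) and then make sure their branches do not contain other maximal leaves. This can still be done with one complicated traversal, or with two passes, and it is still O(n).

Tags: python string python-2.7 substring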

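【Solution 1】:

Using Aho-Corasick should get you an asymptotic runtime of O(n), at the cost of extra memory use and higher fixed-cost multipliers (ignored by big-O notation, but still meaningful). The algorithm's complexity is the sum of several components, but none of the components multiply, so it should be linear in all metrics (number of strings, length of strings, longest string, etc.).

With pyahocorasick, you do an initial pass to build an automaton that can scan for all the strings at once: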

import ahocorasick

# This code assumes no duplicates in mystrings (which would make them mutually
# substrings). Easy to handle if needed, but simpler to avoid for demonstration

mystrings = ['abc','abcd','ab','def','efgd']

# Build Aho-Corasick automaton, involves O(n) (in combined length of mystrings) work
# Allows us to do single pass scans of a string for all strings in mystrings
# at once
aut = ahocorasick.Automaton()
for s in mystrings:
    # mapping string to itself means we're informed directly of which substring
    # we hit as we scan
    aut.add_word(s, s)
aut.make_automaton()

# Initially, assume all strings are non-substrings
nonsubstrings = set(mystrings)

# Scan each of mystrings for substrings from other mystrings
# This only involves a single pass of each s in mystrings thanks to Aho-Corasick,
# so it's only O(n+m) work, where n is again combined length of mystrings, and
# m is the number of substrings found during the search
for s in mystrings:
    for _, substr in aut.iter(s):
        if substr != s:
            nonsubstrings.discard(substr)

# A slightly more optimized version of the above loop, but admittedly less readable:
# from operator import itemgetter
# getsubstr = itemgetter(1)
# for s in mystrings:
#     nonsubstrings.difference_update(filter(s.__ne__, map(getsubstr, aut.iter(s))))

for nonsub in nonsubstrings:
    print(nonsub)

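Note: Annoyingly, I'm on a machine without a compiler right now, so I can't install pyahocorasick to test this code, but I've used it before, and I believe this should work, modulo silly typos.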
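Since the snippet above was posted untested and depends on a compiled extension, the same idea can be sketched in pure Python with no dependencies. This is only an illustration under stated assumptions, not pyahocorasick's implementation: the function name `non_substrings` and its internal lists are hypothetical, duplicates are collapsed first, and the input is assumed to contain no empty strings.

```python
from collections import deque


def non_substrings(strings):
    """Return the strings that are not substrings of any other string.

    Illustrative pure-Python Aho-Corasick; all names here are hypothetical.
    Assumes non-empty strings; duplicates are collapsed up front.
    """
    words = list(dict.fromkeys(strings))  # de-duplicate, keep first-seen order

    # Trie stored as parallel lists indexed by state number.
    goto = [{}]   # goto[state][char] -> child state
    out = [-1]    # index of the word ending at this state, or -1
    for idx, word in enumerate(words):
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                out.append(-1)
            state = nxt
        out[state] = idx

    # Breadth-first pass: fail[s] is the longest proper suffix of s that is
    # also a trie path; dict_link[s] is the nearest state on the fail chain
    # where a word ends (0 if there is none).
    fail = [0] * len(goto)
    dict_link = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            target = fail[child]
            dict_link[child] = target if out[target] != -1 else dict_link[target]
            queue.append(child)

    # Scan every word once and mark each *other* word found inside it.
    is_sub = [False] * len(words)
    for idx, word in enumerate(words):
        state = 0
        for ch in word:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            hit = state if out[state] != -1 else dict_link[state]
            while hit:
                found = out[hit]
                if found != idx:
                    if is_sub[found]:
                        break  # the rest of this chain was marked earlier
                    is_sub[found] = True
                hit = dict_link[hit]
    return [w for w, sub in zip(words, is_sub) if not sub]


print(non_substrings(['abc', 'abcd', 'ab', 'def', 'efgd']))
# ['abcd', 'def', 'efgd']
```

With plain substring semantics 'def' is kept, because it does not occur inside 'efgd'. The early `break` relies on the fact that once a word is marked, every shorter word on its dictionary-link chain was marked in the same walk, which keeps the marking work bounded.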

【Comments】:

【Solution 2】:

Is memory a concern? You could turn to the tried and true... TRIE!

Build a suffix tree!

Given your input ['abc','abcd','ab','def','efgd']

We would have a tree like

              _
            / | \
           a  e  d
          /   |   \
         b*   f    e
        /     |     \
       c*     g      f*
      /       |
     d*       d*

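Using a DFS (depth-first search) of that tree, you would locate the deepest leaves: abcd, efgd, and def.

Tree traversal is pretty straightforward, and your time complexity is O(n*m), a real improvement over the O(n^2) time you had before.

With this approach, adding new keys becomes simple, and it is still easy to find the unique keys.

Consider adding the key deg

Your new tree would be approximately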

              _
            / | \
           a  e  d
          /   |   \
         b*   f    e
        /     |   / \
       c*     g  g*   f*
      /       |
     d*       d*

With this new tree, it is still a simple matter of performing a DFS to obtain the unique keys that are not prefixes of other keys.

from typing import List


class Trie(object):
    class Leaf(object):
        def __init__(self, data, is_key):
            self.data = data
            self.is_key = is_key
            self.children = []

        def __str__(self):
            return "{}{}".format(self.data, "*" if self.is_key else "")

    def __init__(self, keys):
        self.root = Trie.Leaf('', False)
        for key in keys:
            self.add_key(key)

    def add_key(self, key):
        self._add(key, self.root.children)

    def has_suffix(self, suffix):
        leaf = self._find(suffix, self.root.children)

        if not leaf:
            return False

        # This is only a suffix if the returned leaf has children and itself is not a key
        if not leaf.is_key and leaf.children:
            return True

        return False

    def includes_key(self, key):
        leaf = self._find(key, self.root.children)

        if not leaf:
            return False

        return leaf.is_key

    def delete(self, key):
        """
        If the key is present as a unique key as in it does not have any children nor are any of its nodes comprised of
         we should delete all of the nodes up to the root
        If the key is a prefix of another long key in the trie, unmark the leaf node
        if the key is present in the trie and contains no children but contains nodes that are keys we should delete all
         nodes up to the first encountered key
        :param key:
        :return:
        """

        if not key:
            raise KeyError

        self._delete(key, self.root.children, None)

    def _delete(self, key, children: List[Leaf], parents: (List[Leaf], None), key_idx=0, parent_key=False):
        if not parents:
            parents = [self.root]

        if key_idx >= len(key):
            return

        key_end = True if len(key) == key_idx + 1 else False
        suffix = key[key_idx]
        for leaf in children:
            if leaf.data == suffix:
                # we have encountered a leaf node that is a key we can't delete these
                # this means our key shares a common branch
                if leaf.is_key:
                    parent_key = True

                if key_end and leaf.children:
                    # We've encountered another key along the way
                    if parent_key:
                        leaf.is_key = False
                    else:
                        # delete all nodes recursively up to the top of the first node that has multiple children
                        self._clean_parents(key, key_idx, parents)
                elif key_end and not leaf.children:
                    # delete all nodes recursively up to the top of the first node that has multiple children
                    self._clean_parents(key, key_idx, parents)

                # Not at the key end so we need to keep traversing the tree down
                parents.append(leaf)
                self._delete(key, leaf.children, parents, key_idx + 1, key_end)

    def _clean_parents(self, key, key_idx, parents):
        stop = False
        while parents and not stop:
            p = parents.pop()

            # Need to stop processing a removal at a branch
            if len(p.children) > 1:
                stop = True

            # Locate our branch and kill its children
            for i in range(len(p.children)):
                if p.children[i].data == key[key_idx]:
                    p.children.pop(i)
                    break
            key_idx -= 1

    def _find(self, key, children: List[Leaf]):
        if not key:
            raise KeyError

        match = False
        if len(key) == 1:
            match = True

        suffix = key[0]
        for leaf in children:
            if leaf.data == suffix and not match:
                return self._find(key[1:], leaf.children)
            elif leaf.data == suffix and match:
                return leaf
        return None

    def _add(self, key, children: List[Leaf]):
        if not key:
            return

        is_key = False
        if len(key) == 1:
            is_key = True

        suffix = key[0]
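        for leaf in children:
            if leaf.data == suffix:
                # Mark an existing node as a key when the key ends here
                # (e.g. adding 'ab' after 'abc' has already created the node)
                if is_key:
                    leaf.is_key = True
                self._add(key[1:], leaf.children)
                break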
        else:
            children.append(Trie.Leaf(suffix, is_key))
            self._add(key[1:], children[-1].children)

        return

    @staticmethod
    def _has_children(leaf):
        return bool(leaf.children)


def main():
    keys = ['ba', 'bag', 'a', 'abc', 'abcd', 'abd', 'xyz']
    trie = Trie(keys)
    print(trie.includes_key('ba'))  # True
    print(trie.includes_key('b'))  # False
    print(trie.includes_key('dog'))  # False
    print(trie.has_suffix('b'))  # True
    print(trie.has_suffix('ab'))  # True
    print(trie.has_suffix('abd'))   # False

    trie.delete('abd')  # Should only remove the d
    trie.delete('a')    # should unmark a as a key
    trie.delete('ba')   # should remove the ba trie
    trie.delete('xyz')  # Should remove the entire branch
    trie.delete('bag')  # should only remove the g

    print(trie)

if __name__ == "__main__":
    main()

Note that the trie implementation above does not implement the DFS search; however, it gives you a solid starting point.
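The missing depth-first search could look roughly like the sketch below. It is a minimal illustration rather than an extension of the class above: it assumes a plain dict-based trie, and `keys_not_prefixes` and the `END` sentinel are hypothetical names. Keep in mind that a trie built from whole keys only detects prefix relationships, so a key that occurs in the middle or at the end of another key is not caught unless suffixes are inserted as well.

```python
def keys_not_prefixes(keys):
    """Return the keys that are not a prefix of any other key (sorted)."""
    END = object()  # hypothetical sentinel marking "a key ends at this node"

    # Build the trie as nested dicts: node[char] -> child node.
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = key

    # Iterative DFS: a key node without children is a deepest leaf.
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = [child for k, child in node.items() if k is not END]
        if END in node and not children:
            result.append(node[END])
        stack.extend(children)
    return sorted(result)


print(keys_not_prefixes(['abc', 'abcd', 'ab', 'def', 'efgd']))
# ['abcd', 'def', 'efgd']
print(keys_not_prefixes(['abc', 'abcd', 'ab', 'def', 'efgd', 'deg']))
# ['abcd', 'def', 'deg', 'efgd']
print(keys_not_prefixes(['bcd', 'abcd']))
# ['abcd', 'bcd']  ('bcd' is a substring of 'abcd' but not a prefix)
```

The last call shows the limitation: covering arbitrary substrings this way would mean inserting every suffix of every key, which costs time and memory quadratic in the key lengths.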

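【Comments】:

I'm fully aware that a trie is a heavyweight data structure for searching so few keys; however, understanding the trie and its usefulness allows one to perform the same search against a complete English dictionary in a fraction of the time it would take using only simple arrays and sets.
Nice. OOC, do you know how this compares to Aho-Corasick? In general, they're pretty similar (both are modifications that try to make a single-pass search for multiple strings possible), and I think Aho-Corasick does more up-front work, and probably uses more memory, to run faster during the scan, but I haven't looked at the details.
@ShadowRanger Aho-Corasick looks highly optimized, though; time for me to do some more learning.

【Solution 3】:

Use a set object to keep all the substrings. This is fast but memory-hungry; if every string is short, you can try this.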

import string
import random
from itertools import combinations

def get_substrings(w):
    return (w[s:e] for s, e in combinations(range(len(w)+1), 2))

def get_not_substrings(words):
    words = sorted(set(words), key=len, reverse=True)
    substrings = set()

    for w in words:
        if w not in substrings:
            yield w
            substrings.update(get_substrings(w))

words = ["".join(random.choice(string.ascii_lowercase) 
    for _ in range(random.randint(1, 12))) for _ in range(10000)]
res = list(get_not_substrings(words))

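【Comments】:

Whether it's faster in practice is an open question, but in big-O terms this is not an improvement. As the use of itertools.combinations shows, creating and storing every possible substring of a word is by definition an O(n^2) problem, solved once per string (it's O(n^2) in both memory and runtime, not to mention performing an O(n) slice operation for every combination of start/end indices, which arguably makes it O(n^3)); you have to assume that the length of any given string may be a large fraction of the total length, so the O(n^2) by itself ruins the whole thing.
It's not O(n^3), because n is the length of the list, not the length of each string.
If the work involved in an algorithm grows much faster with the length of the longest string than with the number of strings, then computing big-O in terms of the number of strings is useless. You have to consider both in a real algorithmic analysis; if the algorithm performs very badly in either the number of strings or the length of the strings, the fastest-growing factor is the one that matters. Big-O notation is not about the best case but about the worst case (or sometimes "the worst case excluding nearly impossible cases"; e.g., we call dict lookups O(1) when they could be O(n)).
The point is, if n is the length of the largest string, this is an O(n^3) algorithm, because n describes both the number of combinations (n^2) and the cost of each slice produced from those combinations, which is itself O(n). That is a single, consistent n.
If the string length is k and the list length is n, it is O(k^3 * n). So if the strings are short, it is fast.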
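The O(k^3) per-string estimate from these comments can be checked with a small count. The sketch below enumerates the same (start, end) index pairs that get_substrings produces for a string of length k, and totals both the number of slices and the number of characters those slices copy; `substring_cost` is a hypothetical helper added purely for illustration.

```python
from itertools import combinations


def substring_cost(k):
    """Return (number of slices, total characters copied) for length k."""
    pairs = list(combinations(range(k + 1), 2))  # same pairs as get_substrings
    n_slices = len(pairs)                         # k * (k + 1) / 2
    n_chars = sum(end - start for start, end in pairs)  # k(k+1)(k+2) / 6
    return n_slices, n_chars


for k in (4, 12, 100):
    print(k, substring_cost(k))
# 4 (10, 20)
# 12 (78, 364)
# 100 (5050, 171700)
```

The slice count grows quadratically and the copied characters grow cubically in k, which matches the O(k^3 * n) figure for n strings; for the 12-character upper bound used in the answer's test data, that is only a few hundred characters per word.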

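【Solution 4】:

Pop the first element. Iterate over each remaining element and check whether the shorter string is a substring of the longer string. Repeat. That should be O(n log n).

Edit: draft implementation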

def not_substrings(l):
    mask = [True]*len(l)
    for i in range(len(l)):
        if not mask[i]:
            continue
        for j in range(i+1, len(l)):
            if len(l[i]) > len(l[j]):
                if l[j] in l[i]:
                    mask[j] = False
            elif l[j] == l[i]:
                mask[j] = False
                mask[i] = False
            else:
                if l[i] in l[j]:
                    mask[i] = False
        if mask[i]:
            print(l[i])

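I haven't run this code, but it should be roughly correct. I don't know whether there is a way to do it without the mask, or what the time complexity of the [True]*len(l) statement is. I haven't done any rigorous analysis, but it looks like n log n to me, because each iteration only goes over the remaining part of the list rather than the whole list.

【Comments】:

Why would this run in O(n log n) time?
Repeat what? What do "shorter string" and "longer string" mean, and how do they relate to the element you popped? What does "repeat" refer to: popping the first element, or the "shorter string"/"longer string" check? Also: if you pop the elements one by one, you end up with O(n^2) complexity even without counting the time spent comparing strings.
If the strings have the same length, check whether they are the same string. If their lengths differ, only the shorter string can be a substring of the longer one.
Let me quickly write up a rough implementation and see if that makes it easier to understand.
Each iteration covering the remainder of the list is still equivalent to O(n^2); the sum of the numbers from 1 to n is n(n+1) / 2, which has a constant divisor, but that is irrelevant to big-O, so it is still O(n^2) once constant factors are dropped. For it to be O(n log n), each iteration would have to cover half (or some fixed fraction) of the list elements of the previous iteration, rather than shrinking by subtraction (e.g., by one element).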
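The arithmetic in the last comment is easy to verify. The sketch below counts the (i, j) pairs visited by a nested loop shaped like the one in not_substrings, and, for contrast, the number of steps taken by a loop that shrinks by a constant factor; both helper names are hypothetical and exist only for this illustration.

```python
def nested_loop_pairs(n):
    """Count the pairs visited by `for i in range(n): for j in range(i+1, n)`."""
    return sum(n - 1 - i for i in range(n))


def halving_steps(n):
    """Count the iterations of a loop that halves n until it reaches 1."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


for n in (10, 1000):
    print(n, nested_loop_pairs(n), n * (n - 1) // 2, halving_steps(n))
# 10 45 45 3
# 1000 499500 499500 9
```

Shrinking the inner range by one element per outer iteration only halves the constant factor, so the comparison count stays quadratic; a logarithmic factor shows up only when the remaining work shrinks by a fixed fraction at each step.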