Answers to: Algorithm to filter a set of all phrases containing in other phrase

【Question title】: Algorithm to filter a set of all phrases containing in other phrase
【Posted】: 2009-09-03 10:02:23
【Question】:

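Given a set of phrases, I would like to filter out every phrase that contains any other phrase in the set. "Contains" here means that a phrase holding all the words of another phrase should be filtered out. The order of the words within a phrase does not matter.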

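What I have so far is this:

Sort the set by the number of words in each phrase.
For each phrase X in the set:
1. For each phrase Y in the rest of the set:
  1. If all the words in X are in Y, then X is contained in Y, so discard Y.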

This is slow for a list of about 10k phrases. Are there any better options?
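
For reference, the steps described above can be sketched in Python roughly as follows. This is only an illustration: the function name `filter_supersets_naive` is made up here, phrases are assumed to be strings of whitespace-separated words, and each candidate is compared only against the phrases already kept, which gives the same result as discarding Y inside the inner loop.

```python
def filter_supersets_naive(phrases):
    """Keep only phrases that do not contain every word of another phrase."""
    # Represent each phrase as a set of words; word order is irrelevant
    # and exact duplicates collapse into one entry.
    word_sets = sorted({frozenset(p.split()) for p in phrases}, key=len)

    kept = []
    for candidate in word_sets:
        # If an already kept (shorter or equally long) phrase is a subset
        # of the candidate, the candidate contains it and is discarded.
        if not any(small <= candidate for small in kept):
            kept.append(candidate)
    return kept


result = filter_supersets_naive(["a b", "b a c", "d", "d e", "f g"])
```

The inner `any(...)` scan is what makes this quadratic in the number of phrases in the worst case.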

【Comments】:

How many phrases are in your set compared with all the phrases?

Tags: c# java c++ python algorithm

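【Solution 1】:

This is the problem of finding the minimal elements of a set of sets. The naive algorithm, which also serves as the problem definition, looks like this: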

set(s for s in sets if not any(other < s for other in sets))
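
As a small worked example of that one-liner (the sample phrases are invented for illustration), note that `other < s` is Python's proper-subset test, so a phrase never eliminates itself:

```python
# Each phrase becomes a frozenset of its words (assumed whitespace-separated).
sets = [frozenset(p.split()) for p in ["a b", "c b a", "d", "e d"]]

# Keep s only if no other set is a proper subset of it.
minimal = set(s for s in sets if not any(other < s for other in sets))
```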

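There are sub-quadratic algorithms for this (the original answer linked to one), but given that N is 10000, the efficiency of the implementation probably matters more. The best approach depends heavily on the distribution of the input data. Given that the input sets are natural-language phrases that mostly differ from one another, the approach suggested by redtuna should work well. Here is a Python implementation of that algorithm.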

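from collections import defaultdict
from functools import reduce  # reduce is not a builtin in Python 3

def find_minimal_phrases(phrases):
    # Make the phrases hashable. Build a list rather than using map(),
    # because the phrases are iterated twice and map() is one-shot in Python 3
    phrases = [frozenset(phrase) for phrase in phrases]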

    # Create a map to find all phrases containing a word
    phrases_containing = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            phrases_containing[word].add(phrase)

    minimal_phrases = []
    found_superphrases = set()
    # Iterate in order of increasing length so that minimal sets are found
    # first: if a is a proper superset of b, then len(a) > len(b)
    for phrase in sorted(phrases, key=len):
        if phrase not in found_superphrases:
            connected_phrases = [phrases_containing[word] for word in phrase]
            connected_phrases.sort(key=len)
            superphrases = reduce(set.intersection, connected_phrases)
            found_superphrases.update(superphrases)
            minimal_phrases.append(phrase)
    return minimal_phrases

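This is still quadratic, but on my machine it runs in 350 ms for a set of 10k phrases in which 50% are minimal and the words are drawn from an exponential distribution.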

【Discussion】:

【Solution 2】:

You could build an index that maps words to phrases and then do something like this:

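let matched = set of all phrases
for each word in the searched phrase
    let wordMatch = all phrases containing the current word
    let matched = intersection of matched and wordMatch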

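After this, matched contains every phrase that matches all the words in the target phrase. It can be optimized quite well by initializing matched to the set of phrases containing words[0] and then iterating only over words[1..words.length]. Filtering out phrases that are too short to match the target phrase may also improve performance.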

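Unless I'm mistaken, a simple implementation has a worst-case complexity (when the search phrase matches all phrases) of O(n·m), where n is the number of words in the search phrase and m is the number of phrases.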
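
A minimal Python sketch of this lookup, including the suggested optimization of seeding the result from the first word, might look as follows. The names `build_index` and `phrases_matching` are hypothetical, phrases are assumed to be frozensets of words, and treating an empty target as matching nothing is an assumption of this sketch rather than something the answer specifies.

```python
from collections import defaultdict


def build_index(phrases):
    """Map each word to the set of phrases (frozensets) that contain it."""
    index = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            index[word].add(phrase)
    return index


def phrases_matching(index, target):
    """Return every indexed phrase that contains all words of `target`."""
    words = list(target)
    if not words:
        return set()  # assumption: an empty target matches nothing
    # Seed with the phrases containing the first word; .get() avoids
    # adding new keys to the defaultdict for unknown words.
    matched = set(index.get(words[0], ()))
    for word in words[1:]:
        matched &= index.get(word, set())
        if not matched:
            break  # no phrase can match any more
    return matched


phrases = [frozenset(p.split()) for p in ["a b", "a b c", "b c", "d"]]
index = build_index(phrases)
hits = phrases_matching(index, frozenset({"a", "b"}))
```

The result includes the target phrase itself when it is part of the index, so a caller filtering supersets would need to exclude it.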

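【Discussion】:

【Solution 3】:

Your algorithm is quadratic in the number of phrases, which is probably what slows it down. Here I index the phrases by the words they contain, so that the common case stays below quadratic.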

# build index
foreach phrase: foreach word: phrases[word] += phrase

# use index to filter out phrases that contain all the words
# from another phrase
foreach phrase:
  foreach word: 
     if first word:
        siblings = phrases[word]
     else
        siblings = siblings intersection phrases[word]
  # siblings now contains every phrase that has at least all our words,
  # including the current phrase itself
  remove each sibling other than the current phrase from the output set of phrases

# done!
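
One possible Python rendering of the pseudocode above is sketched below. The function name `filter_with_index` is invented for illustration, phrases are assumed to be whitespace-separated strings, and empty phrases are dropped up front as an assumption. Note that the current phrase is excluded from its own siblings so that it does not remove itself.

```python
from collections import defaultdict


def filter_with_index(phrases):
    """Return the phrases (as frozensets) that contain no other phrase."""
    word_sets = {frozenset(p.split()) for p in phrases}
    word_sets.discard(frozenset())  # assumption: empty phrases are ignored

    # Build index: word -> set of phrases containing that word.
    index = defaultdict(set)
    for ws in word_sets:
        for word in ws:
            index[word].add(ws)

    output = set(word_sets)
    for ws in word_sets:
        # Phrases that contain every word of ws (a new set, index untouched).
        siblings = set.intersection(*(index[word] for word in ws))
        siblings.discard(ws)  # a phrase must not eliminate itself
        output -= siblings
    return output


result = filter_with_index(["a b", "b a c", "d", "d e", "f g", "a b"])
```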

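【Discussion】:

【Solution 4】:

Sort the words within each phrase, i.e. 'Z A' -> 'A Z'; after that it is easy to eliminate phrases by working from the shortest to the longer ones.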
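
The answer does not spell out the elimination step, so the following Python sketch is only one reading of it: each phrase is normalized to a sorted tuple of its unique words, duplicates are merged, and candidates are then checked from shortest to longest against the phrases already kept. The names `normalize` and `filter_sorted` are hypothetical.

```python
def normalize(phrase):
    """Canonical form: unique words in sorted order, 'Z A' -> ('A', 'Z')."""
    return tuple(sorted(set(phrase.split())))


def filter_sorted(phrases):
    """Keep canonical phrases that do not contain a shorter kept phrase."""
    # Normalizing makes 'Z A' and 'A Z' identical, so the set removes them
    # as duplicates; the sort key orders by length, then alphabetically.
    canonical = sorted({normalize(p) for p in phrases},
                       key=lambda t: (len(t), t))
    kept = []
    for cand in canonical:
        cand_words = set(cand)
        if not any(cand_words.issuperset(k) for k in kept):
            kept.append(cand)
    return kept


result = filter_sorted(["Z A", "A Z", "A B Z", "C"])
```

Normalization mainly helps with spotting duplicates and gives a stable order; the subset check itself is still a scan over the kept phrases.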

【Discussion】: