Searching keywords efficiently when the keywords are multi-word

【Question title】: Search keywords efficiently when keywords are multi-word
【Posted】: 2018-01-15 07:38:52
【Question】:

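I need to use Python to efficiently match a very large list of keywords (>1000000) against a string. I found some really good libraries that try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.

However, I have a peculiar requirement: in my context, the keyword 'XXXX YYYY' should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring, but XXXX and YYYY are both present in the string, and that is good enough for a match for me.

I know how to do this naively. What I am looking for is efficiency; are there any more good libraries for this?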

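【Comments】:

It would be good to know your naive solutions, so that we don't repeat them. One idea might be to remove everything from the string that is not in the keyword list, and then apply one of the fast libraries.
@Maciek By naive I mean converting the multi-word keyword into a list and matching each element separately, combined with an AND condition (that is, without using the fast libraries). Your suggestion assumes that YYYY occurs after XXXX, which may not be true either.
OK, I see. Your question wasn't clear.
Are you looking for space-separated patterns or something more general?
@tripleee Separation by spaces and punctuation suits my case best.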
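
For reference, the naive approach described in the comments above (split each keyword into words, then require every word to be present) could look roughly like the sketch below. The names `tokenize` and `naive_match` are made up for illustration, and lowercasing the text is an assumption; the question does not say whether matching should be case-sensitive.

```python
import re


def tokenize(text):
    """Split on whitespace and punctuation; lowercasing is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def naive_match(keywords, text):
    """Return the keywords whose words all occur in text, in any order."""
    text_words = tokenize(text)
    # One subset test per keyword: simple, but it touches every keyword.
    return [kw for kw in keywords if tokenize(kw) <= text_words]


keywords = ["XXXX YYYY", "YYYY ZZZZ", "XXXX"]
text = "XXXX is a very good indication of YYYY"
matches = naive_match(keywords, text)
print(matches)
```

This is the baseline to beat: with more than a million keywords, every input string costs a full pass over the keyword list.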

Tags: python string pattern-matching string-matching keyword-search

【Solution 1】:

What you are asking for sounds like a full text search task. There is a Python search package called whoosh. @derek's corpus can be indexed and searched in memory as shown below.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

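# A single TEXT field; stored=True keeps the original text so hits can display it.
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
# create_index() returns the index object, so a separate open_index() call is not needed.
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default query parser requires all terms to match (AND), in any order.
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)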

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

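You can also use FileStorage to persist your index, as described in How to index documents.

【Comments】:

【Solution 2】:

This falls into the "naive" camp, but here is an approach that uses sets, as food for thought: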

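docs = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for i in docs:
        # keep i if all n elements of the query are among its words
        if query_set & set(i.split(" ")) == query_set:
            res.append(i)
    return res

print(get_similar(query, docs))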

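This returns:

["Here's a sentence with dog and apple in it", "Here's a dog with and apple and a poodle in it", "Here's an apple with a dog to show that order is irrelevant"]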

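Of course, the time complexity here isn't great, since every document is still scanned for each query, but because hash/set operations are fast it is much quicker than using lists throughout.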
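
For the setup in the question (a very large keyword list and one input string), one way to avoid touching every keyword is an inverted index from each word to the keywords that contain it. The following is only a sketch of that idea, not something taken from the answers here; `tokenize`, `build_index` and `find_matches` are hypothetical names, and lowercasing is an assumption.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split on whitespace and punctuation; lowercasing is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(keywords):
    """Map each word to the ids of the keywords containing it.

    Also record how many distinct words each keyword has.
    """
    word_to_ids = defaultdict(list)
    sizes = []
    for kw_id, kw in enumerate(keywords):
        words = tokenize(kw)
        sizes.append(len(words))
        for w in words:
            word_to_ids[w].append(kw_id)
    return word_to_ids, sizes


def find_matches(text, keywords, word_to_ids, sizes):
    """Return keywords whose words all occur in text, in keyword-list order."""
    hits = defaultdict(int)
    for w in tokenize(text):
        # Only keywords sharing at least one word with the text are visited.
        for kw_id in word_to_ids.get(w, ()):
            hits[kw_id] += 1
    # A keyword matches when every one of its distinct words was seen.
    return [keywords[i] for i in sorted(hits) if hits[i] == sizes[i]]


keywords = ["dog apple", "dog poodle", "apple", "cat"]
word_to_ids, sizes = build_index(keywords)
text = "Here's an apple with a dog to show that order is irrelevant"
matches = find_matches(text, keywords, word_to_ids, sizes)
print(matches)
```

The index is built once; after that, the work per string depends on how many keywords share a word with it rather than on the total number of keywords. Very common words can still produce long candidate lists, so this is a starting point rather than a tuned solution.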

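The second part of my answer is that Elasticsearch is a good candidate for this, if you are willing to put in the effort and you are dealing with large amounts of data.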

【Comments】: