Efficient data structure to hold strings with wildcards: answers

【Question title】: Efficient data structure to hold strings with wildcards
【Posted】: 2015-02-21 21:23:25
【Question】:

This question is almost the opposite of "Efficient data structure for word lookup with wildcards".

Suppose we have a database of URLs:

http://aaa.com/
http://bbb.com/
http://ccc.com/
....

To check whether a URL is in the list, I can run a binary search and get the result in O(log n) time, where n is the size of the list.
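
For reference, the exact-match lookup described above might look roughly like the following. This is only a sketch: the helper name `contains` and the use of Python's `bisect` module are assumptions for illustration, not something the question specifies.

```python
from bisect import bisect_left

def contains(sorted_urls, url):
    """Return True if url is present in the sorted list (O(log n) comparisons)."""
    i = bisect_left(sorted_urls, url)
    return i < len(sorted_urls) and sorted_urls[i] == url

# The list must be kept sorted for the binary search to be valid.
urls = sorted(["http://aaa.com/", "http://bbb.com/", "http://ccc.com/"])

print(contains(urls, "http://bbb.com/"))
print(contains(urls, "http://ddd.com/"))
```

This works only for exact strings, which is why it stops being usable once the entries contain wildcards.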

This structure has worked well for years, but now I would like to allow wildcards in the database entries, for example:

http://*aaa.com/*
http://*bbb.com/*
http://*ccc.com/
....

A naive search would require a full scan, giving O(n) lookup time.

Which data structure can do the lookup in less than O(n)?

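【Comments】:

You can still do a binary search, but maintain the sorted list of known URLs with the strings compared from the end.
Query: http://test.ccc.com/ Result: true
Is http://sasccc.com a valid query, i.e. one without a dot separator?
Can you split the URL into a fixed number of fields, where each field is either wild or specified? Or do you need wildcards to be able to appear anywhere in the URL (e.g. http*://*ca*.c/*/*.html)?
Possible duplicate of "Efficient data structure for word lookup with wildcards"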

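Tags: performance algorithm optimization data-structures

【Solution 1】:

If all the URLs are known in advance, you can build a finite automaton, which answers each query in O(url length) time.

This finite automaton can be built from a regular expression: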

http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$

Here is some Python code. After re.compile(), each query is fast.

import re

# One alternation covering every wildcard entry; each '*' becomes '.*'.
# A raw string keeps the '\.' escapes intact.
urls = re.compile(r"http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$")

print(urls.match("http://testaaa.com/") is not None)           # True
print(urls.match("http://somethingbbb.com/dir") is not None)   # True
print(urls.match("http://ccc.com/") is not None)               # True
print(urls.match("http://testccc.com/") is not None)           # True
# The ccc entry has no trailing wildcard, so a path after '/' does not match.
print(urls.match("http://testccc.com/ddd") is not None)        # False
print(urls.match("http://ddd.com/") is not None)               # False
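
The combined pattern above is written by hand. With a real database it would presumably be generated from the wildcard entries. The following is a minimal sketch of that step, assuming `*` is the only wildcard character; the function name `compile_wildcards` is hypothetical.

```python
import re

def compile_wildcards(patterns):
    """Build one regex that matches a string if any wildcard pattern matches it."""
    alternatives = []
    for pattern in patterns:
        # Escape the literal pieces and join them with '.*' where '*' stood.
        pieces = [re.escape(piece) for piece in pattern.split("*")]
        alternatives.append("(?:" + ".*".join(pieces) + ")")
    return re.compile("|".join(alternatives))

matcher = compile_wildcards([
    "http://*aaa.com/*",
    "http://*bbb.com/*",
    "http://*ccc.com/",
])

# fullmatch() requires the whole URL to match, so no explicit '$' is needed.
print(matcher.fullmatch("http://testaaa.com/") is not None)
print(matcher.fullmatch("http://testccc.com/ddd") is not None)
```

One caveat: Python's `re` module is a backtracking matcher rather than a true DFA, so the O(url length) bound is not guaranteed with this approach. How well it scales to a very large pattern list would need to be measured.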

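【Discussion】:

I guess you can't re.compile a really huge string :)
If the regexp implementation isn't up to the task, you can always build the automaton yourself. That would also give you better control over the amount of memory used.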
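
One possible way to "build the automaton yourself" is sketched below. It is not the answerer's code: it stores the patterns in a trie whose `*` edges lead to self-looping states, then simulates the resulting NFA by tracking the set of active states. All names (`Node`, `build`, `matches`) are hypothetical, and `*` is assumed to be the only wildcard.

```python
class Node:
    __slots__ = ("children", "star", "is_star", "terminal")

    def __init__(self, is_star=False):
        self.children = {}      # literal character -> Node
        self.star = None        # state reached through a '*' in a pattern
        self.is_star = is_star  # star states loop on any character
        self.terminal = False   # True if some pattern ends here

def build(patterns):
    """Insert every pattern into a shared trie; common prefixes are merged."""
    root = Node()
    for pattern in patterns:
        node = root
        for ch in pattern:
            if ch == "*":
                if node.star is None:
                    node.star = Node(is_star=True)
                node = node.star
            else:
                if ch not in node.children:
                    node.children[ch] = Node()
                node = node.children[ch]
        node.terminal = True
    return root

def _closure(nodes):
    """Add star states reachable without input, since '*' may match nothing."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node.star is not None and node.star not in seen:
            seen.add(node.star)
            stack.append(node.star)
    return seen

def matches(root, url):
    """Return True if url matches at least one stored pattern."""
    active = _closure({root})
    for ch in url:
        following = set()
        for node in active:
            if node.is_star:
                following.add(node)  # '*' consumes this character
            child = node.children.get(ch)
            if child is not None:
                following.add(child)
        if not following:
            return False
        active = _closure(following)
    return any(node.terminal for node in active)

root = build(["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"])
print(matches(root, "http://testccc.com/"))
print(matches(root, "http://testccc.com/ddd"))
```

The cost per query is proportional to the URL length times the number of simultaneously active states. That number is often small when patterns share prefixes, but it is not bounded independently of n in the worst case; converting the NFA to a DFA would give the strict O(url length) bound at the price of more memory.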