Efficient way to find if a short string is present in a longer string in Python: answers

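【Question title】: Efficient way to find if a short string is present in a longer string in python
【Posted】: 2014-07-31 12:08:43
【Question description】:

I have a file of short strings that I have loaded into a list, short (1.5 million short strings, each of length 150). I want to count how many of these short strings occur in a longer string (about 5 million characters long), which is seq in the code. I am using the obvious implementation below. However, it seems to take a very long time (about a day) to run.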

count1=count2=0
for line in short:
    count1+=1
    if line in seq:
        count2+=1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

Is there a way to do this more efficiently?

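【Comments】:

Just brainstorming: you could build a trie of the short strings and use it while matching against seq. If many of your short strings share common prefixes, this could greatly reduce the number of checks.
Can you show the code that loads the file? That might also be part of your problem.
@RedX: The only problem is that a pure-Python trie may be very slow because of interpreter overhead.
I was thinking of building tries. But @nneonneo's approach works great.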
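
The trie idea from the comments above can be sketched roughly as follows. This is only an illustration in Python 3, not code from the thread: the names build_trie and words_found and the nested-dict layout are assumptions made for the example. As noted above, a pure-Python trie is likely to be slow on the real data, since the scan can take up to len(text) times the longest word length in dictionary steps.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word.

    Assumes the words are non-empty strings.
    """
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word  # remember which word ends at this node
    return root


def words_found(trie, text):
    """Return the set of trie words that occur somewhere in text."""
    found = set()
    for start in range(len(text)):
        node = trie
        i = start
        # Walk down the trie for as long as the text keeps matching.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                found.add(node[None])
    return found


trie = build_trie(["abc", "bcd", "xyz", "ab"])
matches = words_found(trie, "zabcdz")
print(len(matches), "of 4 strings are in long string.")
```

Strings that share a prefix share the same path in the trie, so a mismatch in the shared part rules out all of them in one step.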

Tags: python string performance find

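【Solution 1】:

Profile, and try different options. You won't get around iterating over your sequence of "test" strings, so for line in short is something you will most likely keep. The test if line in seq is, I think, implemented quite efficiently in CPython, but I don't think it is optimized for searching for small needles in a laaaaarge haystack. Your requirements are a bit extreme, and I guess it is exactly this test that takes most of the time and is the bottleneck of your code. For comparison, you might want to try the regex module for the needle-in-haystack search.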

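Edit:

A basic benchmark (no repetitions, no investigation of scaling behavior, no use of the profile module) for comparing the methods discussed in this thread: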

import string
import random
import time


def genstring(N):
    return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))


t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration


def method_1():
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in longstring:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration


def method_2():
    shortset = set()
    for i in xrange(len(longstring)-length_shortstring+1):
        shortset.add(longstring[i:i+length_shortstring])
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in shortset:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration


def method_3():
    shortset = set(
        longstring[i:i+length_shortstring] for i in xrange(
            len(longstring)-length_shortstring+1))
    count1 = len(shortstrings)
    count2 = sum(1 for ss in shortstrings if ss in shortset)
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration

Output:

$ python test.py 
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s

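(This is Python 2.7.3 on Linux, on an E5-2650 0 @ 2.00GHz)

There is only a slight difference between the method proposed by nneonneo and the refinement proposed by chepner. Under these conditions, running the original code is no fun anymore. Under slightly less extreme conditions, we can compare all three methods: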

length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5

$ python test1.py 
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s

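【Comments】:

This is the algorithm used for substring matching: effbot.org/zone/stringlib.htm. It is very fast and is based on a mix of efficient substring algorithms.
Worth noting: in the first output, the 0.0 was there because you disabled the call to method_1. At first I thought the benchmark must be broken :)
Oh, I thought that was obvious, which is why I wrote "under these conditions, running the original code is no fun anymore". Removed that pitfall. :)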

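【Solution 2】:

OK, I know you have already accepted another answer that works well, but for the sake of completeness, here is a filled-out version of what RedX suggested in the comments (I think):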

import itertools
PREFIXLEN = 50  #This will need to be adjusted for efficiency, consider doing a sensitivity study

commonpres = itertools.groupby(sorted(short), lambda x: x[0:PREFIXLEN])
survivors = []
precount = 0

for prefix, group in commonpres:
    precount += 1
    if prefix in seq:
        survivors.extend(group)

postcount = len(survivors)

actcount = 0
for survivor in survivors:
    if survivor in seq:
        actcount += 1

print "{} of {} strings are in long string.".format(actcount, len(short))
print "{} short strings ruled out by groups".format(len(short) - len(survivors))
print "{} total comparisons done".format(len(survivors) + precount)

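The idea here is to rule out as many strings as possible by their common prefixes before running the full check on all survivors. As an extreme example, say your 1.5 million short strings fall into 10 common prefixes. For simplicity, let's also assume each prefix has an equal share (150,000 strings). If we can eliminate two of those prefixes with 10 checks, we save 300,000 checks later. This is why PREFIXLEN needs to be "tuned". If it is too low, too many strings share each prefix and you won't save any checks (prefix of length 1 = 1.5 million checks). A PREFIXLEN that is too high gives you little benefit from eliminating a prefix, because each elimination removes only a few strings. I picked 50 arbitrarily; it may or may not help you.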

As I said before, this answer is fairly academic, so if anyone sees anything that could be improved, please comment or edit.

【Comments】:
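
To make the bookkeeping above concrete, here is a small self-contained sketch of the same prefix filter, written for Python 3. The function name count_with_prefix_filter and the toy data are assumptions made for illustration; they are not part of the answer. It returns both the number of matches and the total number of in checks performed, so different prefix lengths can be compared.

```python
import itertools


def count_with_prefix_filter(short, seq, prefixlen):
    """Count strings of short that occur in seq, testing each shared prefix once.

    Returns (matches, total number of substring checks performed).
    """
    # groupby only groups adjacent items, so the list must be sorted first.
    groups = itertools.groupby(sorted(short), key=lambda s: s[:prefixlen])
    survivors = []
    prefix_checks = 0
    for prefix, members in groups:
        prefix_checks += 1
        if prefix in seq:
            # The prefix occurs, so every member still needs a full check.
            survivors.extend(members)
    matches = sum(1 for s in survivors if s in seq)
    return matches, prefix_checks + len(survivors)


# Hypothetical toy data: four strings share the prefix "CCC", which is absent from seq.
seq = "GATTACAGATTTCA"
short = ["GATTAC", "GATTTT", "CCCAAA", "CCCGGG", "CCCTTT", "CCCACG", "ACAGAT"]
matches, checks = count_with_prefix_filter(short, seq, prefixlen=3)
print("{} of {} strings are in long string.".format(matches, len(short)))
print("{} total comparisons done".format(checks))
```

On this toy input the single failed check for "CCC" removes four strings at once, so the filter needs fewer checks than testing all seven strings directly. With other data or another prefix length it can just as easily need more.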

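【Solution 3】:

If the short strings have a constant length (you indicated they are 150 long), you can preprocess the long string to extract all of its substrings of that length, and then just do set lookups (expected constant time):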

shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
    shortset.add(seq[i:i+shortlen])

count1 = count2 = 0
for line in short:
    count1 += 1
    if line in shortset:
        count2 += 1

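The running time will probably be dominated by the preprocessing step (since it inserts nearly 5M strings, each of length 150), but this should still be faster than 1.5 million searches in a 5M-character string.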
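
For readers on Python 3, the same approach can be wrapped in a small function. This is a sketch under the assumption stated in the answer, namely that every short string has exactly shortlen characters; the name count_present and the toy data are made up for the example.

```python
def count_present(short, seq, shortlen):
    """Count how many strings in short occur in seq.

    Assumes every string in short has exactly shortlen characters.
    """
    # All substrings of seq with length shortlen (a sliding window).
    windows = {seq[i:i + shortlen] for i in range(len(seq) - shortlen + 1)}
    # Each membership test is an expected O(1) hash lookup.
    return sum(1 for s in short if s in windows)


short = ["ABC", "EFG", "ACE", "CDE"]
found = count_present(short, "ABCDEFG", shortlen=3)
print("{} of {} strings are in long string.".format(found, len(short)))
```

The set can hold up to len(seq) - shortlen + 1 distinct substrings, so memory use grows with both the length of seq and shortlen.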

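【Comments】:

It avoids things like if needle in haystack with one large string as the haystack. Now you have a large haystack in the form of a set. Profiling/benchmarking will tell how much this gains.
You have to do 1.5 million comparisons in any case, because you have to iterate over all of short. However, this does an expected O(1) lookup for each short string, so it should be much faster than searching a 5,000,000-character string.
Thanks. This works great. This approach actually runs in about 5 minutes.
@chepner: Hmm, possibly. Actually, I suspect a frozenset might even give a slight boost.
Also, count1 = len(short); count2 = len(shortset.intersection(short))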
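
One detail about the intersection variant in the last comment: a set intersection counts each distinct string once, whereas the loop counts every list entry, so the two only agree when short contains no duplicates. A minimal Python 3 sketch with made-up data (the variable names loop_count and distinct_count are illustrative):

```python
short = ["ABC", "ABC", "XYZ", "CDE"]  # note the duplicated "ABC"
seq = "ABCDE"
shortlen = 3

# frozenset, as suggested in the comments; lookups work the same as with set.
shortset = frozenset(seq[i:i + shortlen] for i in range(len(seq) - shortlen + 1))

loop_count = sum(1 for s in short if s in shortset)   # counts every list entry
distinct_count = len(shortset.intersection(short))    # counts distinct strings only

print(loop_count, distinct_count)
```

Which of the two numbers is wanted depends on whether repeated short strings should be counted once or once per occurrence.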