Applying a Regex to a Substring Without Using String Slice: Answers

【Question title】: Applying a Regex to a Substring Without using String Slice
【Posted】: 2011-06-09 09:56:23
【Question】:

I want to search for a regex match in a larger string, starting from a certain position, without using string slices.

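My background is that I want to iterate through a string, searching for matches of various regular expressions. A natural solution in Python would be to keep track of the current position within the string and use, e.g.,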

re.match(regex, largeString[pos:])

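in a loop. But for really large strings (~1 MB), the slicing in largeString[pos:] becomes expensive, because each slice copies the remainder of the string. I'm looking for a way around that.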

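Side note: funnily enough, an obscure corner of the Python documentation mentions an optional pos parameter for the match function (which is exactly what I want), yet the functions themselves don't seem to have it :-).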

【Comments on the question】:

Tags: python regex

【Answer 1】:

The variants that take pos and endpos parameters exist only as methods of compiled regular expression objects. Try this:

import re

# pos/endpos are only accepted by the methods of a compiled pattern
pattern = re.compile("match here")
text = "don't match here, but do match here"
start = text.find(",")  # begin searching at the comma
print(pattern.search(text, start).span())

...which outputs (25, 35)
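To connect this back to the loop described in the question, here is a minimal sketch of scanning a string by moving an offset instead of slicing. The token patterns and the `scan` helper are hypothetical, chosen only for illustration; the only library feature relied on is the `pos` argument of the compiled pattern's `match` method.

```python
import re

# Hypothetical token patterns, chosen only for illustration.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("SPACE", re.compile(r"\s+")),
]


def scan(text):
    """Return (kind, lexeme) pairs by matching at a moving offset, without slicing."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)  # match must start at pos; text[pos:] is never copied
            if m:
                if kind != "SPACE":
                    tokens.append((kind, m.group()))
                pos = m.end()  # end() is an index into the full string
                break
        else:
            raise ValueError(f"no pattern matches at offset {pos}")
    return tokens


print(scan("foo 42 bar7"))  # [('NAME', 'foo'), ('NUMBER', '42'), ('NAME', 'bar7')]
```

Because `m.end()` is reported relative to the whole string, it can be fed straight back in as the next `pos`.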

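【Comments】:

This is crazy! The pos parameter really does exist, but only on the object methods! I must have been blind... Thanks a lot, and thanks to the others as well.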

【Answer 2】:

The pos keyword is only available in the method versions. For example,

re.match("e+", "eee3", pos=1)

does not work (it raises a TypeError, since the module-level function has no pos parameter), but

pattern = re.compile("e+")
pattern.match("eee3", pos=1)

works.
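The difference can be checked directly. The following is a small sketch; the variable names `module_call_failed` and `m` are illustrative only.

```python
import re

pattern = re.compile("e+")

# The module-level function is re.match(pattern, string, flags=0),
# so an unexpected pos keyword is rejected with a TypeError.
try:
    re.match("e+", "eee3", pos=1)
    module_call_failed = False
except TypeError:
    module_call_failed = True

# The compiled pattern's method accepts pos (and endpos).
m = pattern.match("eee3", pos=1)

print(module_call_failed)  # True
print(m.span())            # (1, 3)
```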

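【Comments】:

... I was pretty sure that the only difference between the module functions and the object methods was the regex argument (and possibly the flags) :-/. My bad.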

【Answer 3】:

>>> import re
>>> m = re.compile("(o+)")
>>> m.match("oooo").span()
(0, 4)
>>> m.match("oooo",2).span()
(2, 4)
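One caveat worth knowing: passing pos is not completely equivalent to slicing. According to the Python documentation, '^' still refers to the real beginning of the string, not to the index where matching starts. The sketch below illustrates that, along with the optional endpos argument; the variable names are illustrative only.

```python
import re

text = "ab"
anchored = re.compile("^b")

via_pos = anchored.match(text, 1)     # '^' still refers to index 0 of text
via_slice = anchored.match(text[1:])  # the slice is a new string that starts with 'b'

print(via_pos)           # None
print(via_slice.span())  # (0, 1): offsets are relative to the slice

# endpos limits matching as if the string were only endpos characters long.
limited = re.compile("o+").match("oooo", 1, 3)
print(limited.span())    # (1, 3)
```

Note also that spans returned with pos are offsets into the full string, whereas spans from a sliced string have to be shifted by the slice start.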

【Comments】:

【Answer 4】:

You can also use a positive lookbehind, like this:

import re

test_string = "abcabdabe"

position = 3
# The lookbehind requires `position` characters before the match
a = re.search("(?<=.{" + str(position) + "})ab[a-z]", test_string)

print(a.group(0))

Output:

abd
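For comparison, here is a sketch that runs the lookbehind variant next to a compiled pattern searched from the same position. On this input both find the same match; the names `lookbehind` and `with_pos` are illustrative only.

```python
import re

test_string = "abcabdabe"
position = 3

# Lookbehind variant: require `position` characters before the match.
lookbehind = re.search("(?<=.{" + str(position) + "})ab[a-z]", test_string)

# Compiled-pattern variant: start searching at `position` directly.
with_pos = re.compile("ab[a-z]").search(test_string, position)

print(lookbehind.group(0), lookbehind.span())  # abd (3, 6)
print(with_pos.group(0), with_pos.span())      # abd (3, 6)
```

The two are not interchangeable in general: '.' does not match a newline unless re.DOTALL is set, so the lookbehind variant can behave differently when newlines occur in the text before the match.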

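【Comments】:

Thanks for the idea, but with long input strings this would lead to a very long lookbehind whenever I search near the end of the string :). I'll keep it in mind, though.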