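In the specific case of a file, if you can memory-map the file with mmap and you're working with bytestrings rather than Unicode, you can feed the memory-mapped file to re as if it were a bytestring and it will just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.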
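As a rough sketch of the mmap route (the helper name findall_in_mapped_file and its path argument are made up for illustration, and the file is assumed to be non-empty and readable), something along these lines should work with nothing but the standard library:

```python
import mmap
import re


def findall_in_mapped_file(pattern, path):
    """Return every match of a bytes pattern in the file at `path`.

    Hypothetical helper for illustration. `pattern` must be a bytes
    pattern, because the mapped file is searched as raw bytes.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Pages are loaded on demand by the
        # OS, so the file does not have to fit in RAM at once.
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # re accepts any object that supports the buffer protocol.
            return re.findall(pattern, mapped)
```

Calling findall_in_mapped_file(rb"\d+", "some_file.bin") would return a list of bytes objects. For very large result sets, iterating with re.finditer inside the with block is probably a better fit than building a list.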
If you can do that, it's a really nice option. If you can't, you have to turn to messier options.
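The third-party regex module (not re) offers partial-match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches are tricky to get right, and I don't know whether it interacts correctly with the other advanced features that regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.
If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could become a match if the data were extended: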
    In [10]: regex.search(r'1234', '12', partial=True)
    Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>
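It reports a partial match instead of a complete match whenever more data could change the match result, so, for example, regex.search(r'[\s\S]*', anything, partial=True) will always be a partial match.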
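To make the three possible outcomes concrete, here is a small sketch that uses fullmatch with a four-digit pattern. The describe helper is just an illustrative name, and the snippet assumes the third-party regex module is installed:

```python
import regex  # third-party module: pip install regex


def describe(match):
    """Summarise a match result as 'none', 'partial', or 'complete'."""
    if match is None:
        return "none"
    return "partial" if match.partial else "complete"


four_digits = regex.compile(r"\d{4}")

# A prefix of a possible match is reported as a partial match.
print(describe(four_digits.fullmatch("12", partial=True)))
# All four digits are present, so the match is complete.
print(describe(four_digits.fullmatch("1234", partial=True)))
# No amount of extra data can turn these inputs into a match.
print(describe(four_digits.fullmatch("a", partial=True)))
print(describe(four_digits.fullmatch("12345", partial=True)))
```

A streaming matcher only needs to keep the data behind a partial result; complete results can be reported right away, and data that produced no result can be dropped.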
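With this, you can keep a sliding window of data to match against, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would be confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object: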
    import regex

    def findall_over_file_with_caveats(pattern, file):
        # Caveats:
        # - doesn't support ^ or backreferences, and might not play well with
        #   advanced features I'm not aware of that regex provides and re doesn't.
        # - Doesn't do the careful handling that zero-width matches would need,
        #   so consider behavior undefined in case of zero-width matches.
        # - I have not bothered to implement findall's behavior of returning groups
        #   when the pattern has groups.
        # Unlike findall, produces an iterator instead of a list.

        # bytes window for bytes pattern, unicode window for unicode pattern
        # We assume the file provides data of the same type.
        window = pattern[:0]
        chunksize = 8192
        sentinel = object()

        last_chunk = False

        while not last_chunk:
            chunk = file.read(chunksize)
            if not chunk:
                last_chunk = True
            window += chunk

            match = sentinel
            for match in regex.finditer(pattern, window, partial=not last_chunk):
                if not match.partial:
                    yield match.group()

            if match is sentinel or not match.partial:
                # No partial match at the end (maybe even no matches at all).
                # Discard the window. We don't need that data.
                # The only cases I can find where we do this are if the pattern
                # uses unsupported features or if we're on the last chunk, but
                # there might be some important case I haven't thought of.
                window = window[:0]
            else:
                # Partial match at the end.
                # Discard all data not involved in the match.
                window = window[match.start():]
                if match.start() == 0:
                    # Our chunks are too small. Make them bigger.
                    chunksize *= 2
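The restriction on lookbehinds and \b can be reproduced with the standard re module alone, since the problem comes from the missing prefix rather than from partial matching itself. The sketch below (match_offsets and its dropped parameter are illustrative names, not part of any library) runs the same pattern against a full buffer and against a window whose first bytes have already been discarded:

```python
import re


def match_offsets(pattern, data, dropped=0):
    """Find matches after discarding the first `dropped` bytes of `data`.

    Offsets are reported relative to the original data, so results with
    and without the discarded prefix can be compared directly.
    """
    window = data[dropped:]
    return [m.start() + dropped for m in re.finditer(pattern, window)]


data = b"concat cat"

# \b: on the full buffer only the standalone word matches. Once "con" is
# dropped, the window starts with "cat" and a false match appears.
print(match_offsets(rb"\bcat", data))              # [7]
print(match_offsets(rb"\bcat", data, dropped=3))   # [3, 7]

# Lookbehind: the match at offset 3 depends on the bytes before it, so it
# is lost as soon as those bytes leave the window.
print(match_offsets(rb"(?<=con)cat", data))             # [3]
print(match_offsets(rb"(?<=con)cat", data, dropped=3))  # []
```

So a truncated window can produce both spurious matches and missed matches, which is why the proof of concept treats these constructs as unsupported.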