Python: finding a regular expression in a file

【Question title】: Python: find regexp in a file
【Posted】: 2011-02-14 05:44:32
【Question】:

Given:

f = open(...)  
r = re.compile(...)

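Needed:
How do I find the position (start and end) of the first match of the regular expression in a large file?
(starting from current_pos=...)

How can I do this?

I want to have this function: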

def find_first_regex_in_file(f, regexp, start_pos=0):  
   f.seek(start_pos)  

   .... (searching f for regexp starting from start_pos) HOW?  

   return [match_start, match_end]

The file "f" is expected to be big.

【Question comments】:

Could you show a more complete example of what you are trying to do, with some sample input and output?

Tags: python regex

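【Solution 1】:

One way to search through big files is to use the mmap library to map the file into a large block of memory. You can then search it without having to read it explicitly.

For example:

import mmap
import os
import re

size = os.stat(fn).st_size
f = open(fn, "rb")  # binary mode: the mapping exposes bytes
data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

m = re.search(br"867-?5309", data)  # bytes pattern, required on Python 3

This works well for very big files (I have done it for a file of 30+ GB, but you will need a 64-bit OS if your file is larger than a GB or two).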
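
To tie the mmap approach back to the function requested in the question, here is a minimal sketch, assuming Python 3 and a compiled bytes pattern. The signature (a file path instead of an open file object) and the `None` return value are assumptions made for illustration and are not part of the original answer.

```python
import mmap
import re


def find_first_regex_in_file(path, pattern, start_pos=0):
    """Return (start, end) of the first match at or after start_pos, or None.

    `pattern` must be compiled from bytes, e.g. re.compile(rb"...").
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # The pos argument starts the search at start_pos, while the
            # reported offsets stay relative to the start of the file.
            m = pattern.search(data, start_pos)
            return (m.start(), m.end()) if m else None
```

Calling it again with `start_pos` set to the previous match end walks through successive matches without reading the file explicitly.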

【Comments】:

Looks good, I will check it soon.
How would you search data coming from a socket?!
If you pass 0 as the size argument, it maps the whole file.

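【Solution 2】:

The following code works reasonably well with test files around 2 GB in size.

def search_file(pattern, filename, offset=0):
    # Binary mode, so offsets are byte positions; pattern must be a bytes pattern
    with open(filename, "rb") as f:
        f.seek(offset)
        line_start = offset  # tracked by hand: tell() is unreliable while iterating
        for line in f:
            m = pattern.search(line)
            if m:
                return line_start + m.start(), line_start + m.end()
            line_start += len(line)
    return None

Note that the regular expression must not span multiple lines.

【Comments】: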

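【Solution 3】:

Note: this has been tested on Python 2.7. You may need to tweak a few things on Python 3 to handle strings vs. bytes, but hopefully it will not be too painful.

Memory-mapped files may not be ideal for your situation (32-bit mode increases the chance of running out of contiguous virtual memory, you cannot read from pipes or other non-files, and so on).

Here is a solution that reads 128k blocks at a time. It works as long as your regex matches a string smaller than that size. Note also that you are not restricted to single-line regexes. This solution runs quite fast, although I suspect it will be marginally slower than using mmap. That probably depends more on what you do with the matches, and on the size and complexity of the regex you are searching for.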
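
The block-by-block idea described above can also be reduced to the function the question asks for. The following is a minimal sketch and not the answer's own code: `find_first_in_stream` is a hypothetical name, and it assumes that matches are shorter than `block_size` and that the stream's `read()` returns an empty value at end of input. Like any chunked search, it can be fooled by patterns whose result depends on text beyond the current buffer.

```python
import io
import re


def find_first_in_stream(stream, pattern, block_size=128 * 1024):
    """Return absolute (start, end) of the first match in stream, or None."""
    buf = stream.read(block_size)
    base = 0  # absolute offset of buf[0] within the stream
    while buf:
        chunk = stream.read(block_size)
        m = pattern.search(buf)
        # A match touching the end of the buffer might grow with more input,
        # so accept it only once the stream is exhausted.
        if m and (m.end() < len(buf) or not chunk):
            return base + m.start(), base + m.end()
        if not chunk:
            return None
        # Keep the held-back match, or else the last block_size items,
        # so a match straddling the block boundary is still found.
        keep_from = m.start() if m else max(0, len(buf) - block_size)
        base += keep_from
        buf = buf[keep_from:] + chunk
    return None


# Usage with a tiny block size to force a match across a block boundary
print(find_first_in_stream(io.StringIO("aaaa12345bbbb"), re.compile(r"\d+"), block_size=6))
```

The same function works on a binary stream if the pattern is compiled from bytes, in which case the offsets are byte positions.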

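This method makes sure to keep at most 2 blocks in memory. In some use cases you might want to enforce at least 1 match per block as a sanity check, but this method truncates the buffer so that no more than 2 blocks are held in memory. It also makes sure that a regex match that eats up to the end of the current block is not yielded. Instead, the last position is saved until either the real input is exhausted or a later block contains a match that ends before the end of the block. This gives better results for patterns such as "[^\n]+" or "xxx$". You can still break things if you have a lookahead at the end of the regex, such as xx(?!xyz) where yz is in the next block, but in most cases you can work around using such patterns.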
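
The lookahead caveat is easy to reproduce without any file. In this small illustration the sample strings are made up: the same pattern is run on a block that was cut off before "yz" arrived and on the complete text, and the two runs disagree about where the match is.

```python
import re

pattern = re.compile(r"xx(?!xyz)")

# Block cut off after "xxx": the lookahead cannot see the "yz" still to come,
# and the match does not touch the end of the block, so it would be yielded.
truncated = pattern.search("xxx")

# Complete text: "xx" at offset 0 is followed by "xyz", so it is rejected
# and the first real match starts one character later.
complete = pattern.search("xxxyz")

print(truncated.span(), complete.span())
```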

import re

def regex_stream(regex, stream, block_size=128*1024):
    """Yield match objects for regex, reading stream block_size at a time."""
    stream_read = stream.read
    finditer = regex.finditer
    block = stream_read(block_size)
    if not block:
        return
    lastpos = 0
    for mo in finditer(block):
        # A match that reaches the end of the block may continue in the
        # next block, so hold it back until more input has been read.
        if mo.end() != len(block):
            yield mo
            lastpos = mo.end()
        else:
            break
    while True:
        new_buffer = stream_read(block_size)
        if not new_buffer:
            break
        # Keep the text after the last yielded match, capped at block_size,
        # so that at most 2 blocks are held in memory.
        tail = block[lastpos:]
        if len(tail) > block_size:
            tail = block[-block_size:]
        # Plain concatenation works for both str and bytes streams.
        block = tail + new_buffer
        lastpos = 0
        for mo in finditer(block):
            if mo.end() != len(block):
                yield mo
                lastpos = mo.end()
            else:
                break
    # Input exhausted: matches that touch the end of the data are now final.
    if lastpos:
        block = block[lastpos:]
    for mo in finditer(block):
        yield mo

To test/explore, you can run this:

import io
import re

# Python 3 syntax; on Python 2.7 use cStringIO.StringIO, the print statement
# and .encode('string_escape') instead.
# NOTE: you can substitute a real file stream here for t_in but using this as a test
t_in = io.StringIO('testing this is a 1regexxx\nanother 2regexx\nmore 3regexes')
block_size = len('testing this is a regex')
re_pattern = re.compile(r'\dregex+', re.DOTALL)

def escape(s):
    # Show newlines as a literal \n in the output
    return s.encode('unicode_escape').decode('ascii')

for match_obj in regex_stream(re_pattern, t_in, block_size=block_size):
    print('found regex in block of len %s/%s: "%s[[[%s]]]%s"' % (
        len(match_obj.string),
        block_size,
        escape(match_obj.string[:match_obj.start()]),
        match_obj.group(),
        escape(match_obj.string[match_obj.end():])))

Here is the output:

found regex in block of len 46/23: "testing this is a [[[1regexxx]]]\nanother 2regexx\nmor"
found regex in block of len 46/23: "testing this is a 1regexxx\nanother [[[2regexx]]]\nmor"
found regex in block of len 14/23: "\nmore [[[3regex]]]es"

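This can be useful for quickly parsing a large XML file, which can be split up into mini-DOMs based on a sub-element as the root, instead of having to dive into handling callbacks and state when using a SAX parser. It also lets you filter through XML faster. But I have used it for plenty of other purposes as well. I am somewhat surprised that recipes like this are not more readily available on the net!

One more thing: parsing in unicode should work as long as the passed-in stream produces unicode strings. If you use character classes such as \w, you will need to add the re.U flag when building the pattern with re.compile. In this case block_size actually means a number of characters rather than a number of bytes.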
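
The difference between counting characters and counting bytes can be seen with the standard io classes. This is a small illustration, assuming Python 3 and UTF-8 encoded data; the sample text is made up.

```python
import io

text = "naïve café"

# A text stream counts characters: read(3) returns 3 whole characters.
chars = io.StringIO(text).read(3)

# A byte stream counts bytes: read(3) stops in the middle of the
# 2-byte UTF-8 encoding of "ï".
raw = io.BytesIO(text.encode("utf-8")).read(3)

print(repr(chars), repr(raw))
```

With a byte stream, a block boundary can therefore fall inside a multi-byte character, which matters if the blocks are decoded before matching.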

【Comments】: