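How to read a file and extract data between multiline patterns? Answers

【Question title】: How to read a file and extract data between multiline patterns?
【Posted】: 2016-03-09 10:23:07
【Question】:

I have a file from which I need to extract a piece of data that is delimited by fixed patterns, each of which may span multiple lines: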

some data ... [my opening pattern
is here
and can be multiline] the data 
I want to extract [my ending
pattern which can be
multiline as well] ... more data

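The patterns are fixed in the sense that their content is always the same, except that newlines may appear between the words.

If I could be sure that my patterns were formatted predictably, the solution would be simple.

Is there a way to match such "patterns" against a stream?

There is a question that is almost a duplicate, and its answers point toward buffering the input. What differs in my case is that I know the exact strings in the pattern, except that the words may also be separated by newlines (so no \w*-style matching is needed).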

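【Comments】:

stackoverflow.com/a/28644645/918959 is a solution that works for huge files. You just need to write a multiline regex that matches the brackets.
For one thing, you could remove all the \n characters from the text. Beyond that, if your text is very large and the opening and closing patterns may be far apart, the answer you referenced is the right approach.
What is the expected result? the data I want to extract?

Tags: python regex python-3.x pattern-matching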

【Solution 1】:

Are you looking for this?

>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
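
The question says the delimiter wording is known exactly and only the whitespace between words varies, so the regex can also be built from the phrases themselves rather than from generic brackets. The following is a minimal sketch under that assumption; phrase_to_regex and extract_between are hypothetical helper names, not part of the original answer.

```python
import re


def phrase_to_regex(phrase):
    """Escape each word of a fixed phrase and allow any whitespace run between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


def extract_between(text, opening, closing):
    """Return the (stripped) text found between each opening/closing phrase pair."""
    pattern = re.compile(
        phrase_to_regex(opening) + r"(.*?)" + phrase_to_regex(closing),
        re.DOTALL,  # let "." cross line breaks inside the extracted data
    )
    return [found.strip() for found in pattern.findall(text)]


# Phrases copied from the sample in the question, written on a single line.
OPENING = "[my opening pattern is here and can be multiline]"
CLOSING = "[my ending pattern which can be multiline as well]"

data = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

print(extract_between(data, OPENING, CLOSING))
```

Because every word is passed through re.escape, the square brackets in the phrases are matched literally, and the lazy (.*?) group stops at the first occurrence of the closing phrase.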

Update: to read a large file in chunks, I suggest the following approach:

## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366.
## Titled " How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""        
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = '' # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""

        while True:  # loop until a delimiter is found or EOF is reached
            b = self.f.read(256)
            if not b: # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string    
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break   

        return self.part

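if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        # The delimiter is one whole "[opening] data [ending]" record; the
        # capturing group makes re.split keep the matched record in its result.
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))

    print('job done.')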
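
A variation on the same buffering idea, for the case where the exact delimiter phrases are known, is to search for the opening and closing phrases separately and discard text that can no longer be part of a match. The generator below is only a sketch: iter_extracts, phrase_to_regex, chunk_size and tail are hypothetical names, and tail is an assumed upper bound on how long the opening phrase can be in the file, including its line breaks.

```python
import io
import re


def phrase_to_regex(phrase):
    """Escape each word of a fixed phrase and allow any whitespace run between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


def iter_extracts(f, opening, closing, chunk_size=4096, tail=1024):
    """Yield the text between opening and closing phrases, reading f in chunks.

    tail (assumed > 0) is how many trailing characters are kept while no
    opening phrase has been seen, so a phrase split across reads is not lost.
    """
    open_re = re.compile(phrase_to_regex(opening))
    close_re = re.compile(phrase_to_regex(closing))
    buffer = ""
    while True:
        chunk = f.read(chunk_size)
        buffer += chunk
        while True:
            m_open = open_re.search(buffer)
            if not m_open:
                # No opening phrase yet: keep only a tail that may hold a partial one.
                buffer = buffer[-tail:]
                break
            m_close = close_re.search(buffer, m_open.end())
            if not m_close:
                # Opening seen, closing not read yet: keep everything from the opening on.
                buffer = buffer[m_open.start():]
                break
            yield buffer[m_open.end():m_close.start()].strip()
            # Continue searching after the closing phrase.
            buffer = buffer[m_close.end():]
        if not chunk:  # EOF: nothing more can match
            return


OPENING = "[my opening pattern is here and can be multiline]"
CLOSING = "[my ending pattern which can be multiline as well]"

# An in-memory stream stands in for a large file opened with open().
sample = io.StringIO(
    "some data ... [my opening pattern\nis here\nand can be multiline] the data\n"
    "I want to extract [my ending\npattern which can be\nmultiline as well] ... more data\n"
)
for found in iter_extracts(sample, OPENING, CLOSING, chunk_size=16):
    print(repr(found))
```

Memory use stays bounded by tail plus one chunk while scanning for an opening phrase; once an opening phrase is found, the buffer grows until the matching closing phrase has been read, so widely separated delimiters still cost memory proportional to the data between them.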

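【Comments】:

Not exactly. I know regular expressions in Python; the file I will be working with is large, so I don't want to load it into memory. For the sake of simplicity, though, I may end up doing that anyway. I'll check your solution and get back to you, thanks.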