You can use groupby, assuming the sections are delimited by lines that start with #TYPE:
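from itertools import groupby, chain

def get_sections(fle):
    with open(fle) as f:
        # Group consecutive lines by whether they start with "#TYPE"
        grps = groupby(f, key=lambda x: x.lstrip().startswith("#TYPE"))
        for k, v in grps:
            if k:
                # The #TYPE line plus all lines up to the next #TYPE.
                # The default avoids an error if the file ends on a #TYPE line.
                yield chain([next(v)], next(grps, (None, []))[1])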
You can get each section as you iterate:
In [13]: cat in.txt
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
#TYPE Lorem.Text.C
third
In [14]: for sec in get_sections("in.txt"):
   ....:     print(list(sec))
   ....:
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']
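If no other lines start with #, then checking for "#" alone in startswith would be enough. There is nothing complex about your pattern, so it is not really a use case for a regex. This also holds only one section at a time in memory rather than the whole file.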
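As a rough illustration of the one-section-at-a-time point, the sketch below drains each section before asking for the next one. That ordering matters, because the groups produced by groupby share the underlying file iterator and a group is no longer usable once groupby has advanced. The helper name `count_body_lines` and the temporary sample file are assumptions made up for this example, and the default passed to `next()` is a small safeguard for a file that ends on a #TYPE line.

```python
import os
import tempfile
from itertools import groupby, chain


def get_sections(fle):
    # Same generator as in the answer, repeated so the sketch is self-contained.
    with open(fle) as f:
        grps = groupby(f, key=lambda x: x.lstrip().startswith("#TYPE"))
        for k, v in grps:
            if k:
                # Default (None, []) is an added safeguard for a trailing #TYPE line.
                yield chain([next(v)], next(grps, (None, []))[1])


def count_body_lines(path):
    # Hypothetical helper: map each #TYPE header to its number of body lines.
    counts = {}
    for sec in get_sections(path):
        header = next(sec).strip()            # first item is the #TYPE line
        counts[header] = sum(1 for _ in sec)  # drain the rest without building a list
    return counts


# Write a small sample file shaped like in.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("#TYPE Lorem.Text.A\nfirst\n#TYPE Lorem.Text.B\nsecond\nsecond\n")
    sample_path = tmp.name

counts = count_body_lines(sample_path)
os.remove(sample_path)
print(counts)
```

If you need the sections after the loop has moved on, copy each one first, for example with `[list(sec) for sec in get_sections(path)]`.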
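If you have no leading whitespace and the only place # appears is before TYPE, then it is enough to call groupby with a plain startswith("#") check as the key:
from itertools import groupby, chain

def get_sections(fle):
    with open(fle) as f:
        # Without a key, groupby would group identical lines, not sections
        grps = groupby(f, key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                # The #TYPE line plus all lines up to the next #TYPE
                yield chain([next(v)], next(grps, (None, []))[1])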
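To see why the `if k` test picks out the header lines, it may help to look at the raw (key, group) pairs that groupby produces. This is only a sketch on an in-memory list standing in for the file, and it assumes the key is a `startswith("#")` check:

```python
from itertools import groupby

# Stand-in for the lines of the file.
lines = [
    "#TYPE Lorem.Text.A\n",
    "first\n",
    "#TYPE Lorem.Text.B\n",
    "second\n",
    "second\n",
]

# Copy each group right away; a group is only valid until groupby advances.
pairs = [(k, list(v)) for k, v in groupby(lines, key=lambda x: x.startswith("#"))]
for k, group in pairs:
    print(k, group)
```

The keys alternate between True (a header) and False (the lines that follow it), which is why the generator takes the header from the True group and chains it with the next group.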
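If there is some metadata at the start of the file, you can use dropwhile to skip lines until we hit the first #TYPE line and then just group:
from itertools import groupby, chain, dropwhile

def get_sections(fle):
    with open(fle) as f:
        # Skip everything before the first line that starts with "#"
        body = dropwhile(lambda x: not x.startswith("#"), f)
        grps = groupby(body, key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                # The #TYPE line plus all lines up to the next #TYPE
                yield chain([next(v)], next(grps, (None, []))[1])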
Demo:
In [16]: cat in.txt
meta
more meta
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
second
#TYPE Lorem.Text.C
third
In [17]: for sec in get_sections("in.txt"):
   ....:     print(list(sec))
   ....:
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']
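The part dropwhile plays can be checked in isolation. The sketch below runs it over an in-memory list standing in for the file and shows that only the leading run of non-# lines is discarded; later lines that do not start with # are kept:

```python
from itertools import dropwhile

# Stand-in for the lines of the file, with metadata at the top.
lines = [
    "meta\n",
    "more meta\n",
    "#TYPE Lorem.Text.A\n",
    "first\n",
    "#TYPE Lorem.Text.B\n",
    "second\n",
]

# dropwhile stops testing as soon as the predicate is False once.
kept = list(dropwhile(lambda x: not x.startswith("#"), lines))
print(kept)
```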