Match start and end of file in Python with regex: answers

【Question title】: Match start and end of file in python with regex
【Posted】: 2010-03-02 10:37:57
【Question description】:

I'm having a hard time finding a regex for the start and end of a file in python. How would I accomplish this?

【Comments on the question】:

Regular expressions are applied to strings, not files.

Tags: python regex

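【Solution 1】:

Read the whole file into a string; then \A matches only at the beginning of the string and \Z matches only at the end of the string. With re.MULTILINE, '^' matches at the beginning of the string and just after each newline, while '$' matches at the end of the string and just before each newline. See the Python documentation for re syntax.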

import re

data = '''sentence one.
sentence two.
a bad sentence
sentence three.
sentence four.'''

# find lines ending in a period
print(re.findall(r'^.*\.$', data, re.MULTILINE))
# match if the first line ends in a period
print(re.findall(r'\A^.*\.$', data, re.MULTILINE))
# match if the last line ends in a period
print(re.findall(r'^.*\.$\Z', data, re.MULTILINE))

Output:

['sentence one.', 'sentence two.', 'sentence three.', 'sentence four.']
['sentence one.']
['sentence four.']
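
One caveat when the string really comes from a file: text files usually end with a newline, and \Z matches only at the very end of the string, after that newline. The sketch below shows the last-line pattern finding nothing in that case, plus one possible adjustment. The name `data_nl` and the `\n?\Z` variant are illustrative assumptions, not part of the answer above.

```python
import re

# Same sample text, but with the trailing newline a real file would usually have.
data_nl = "sentence one.\nsentence two.\na bad sentence\nsentence three.\nsentence four.\n"

# '$' can match just before the final newline, but \Z cannot,
# so the original last-line pattern finds nothing here.
strict = re.findall(r'^.*\.$\Z', data_nl, re.MULTILINE)

# Allowing one optional newline before \Z tolerates the trailing newline.
# The capture group keeps that newline out of the result.
relaxed = re.findall(r'^(.*\.)\n?\Z', data_nl, re.MULTILINE)

print(strict)
print(relaxed)
```

Stripping the trailing newline before matching (for example with `data_nl.rstrip('\n')`) would be another way to get the same effect.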

【Comments】:

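【Solution 2】:

Maybe you should state your question more clearly, e.g. what you are trying to do. That said, you can slurp the file into one whole string and match your pattern using re.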

import re

# read the whole file into one string
with open("file") as f:
    data = f.read()
pat = re.compile("^.*pattern.*$", re.M | re.DOTALL)
print(pat.findall(data))

There are better ways to do what you want, whatever it is, without re.
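
As a rough sketch of what doing this without re could look like, assuming the goal is simply to inspect the first and last lines of a file (the question does not say): the helper name `first_and_last_line` and the sample text are hypothetical.

```python
def first_and_last_line(text):
    """Return (first, last) line of text, or (None, None) if it has no lines."""
    lines = text.splitlines()
    if not lines:
        return None, None
    return lines[0], lines[-1]

# In practice the text would come from something like open(path).read().
text = "sentence one.\nsentence two.\na bad sentence\n"
first, last = first_and_last_line(text)

print(first.endswith("."))  # does the first line end in a period?
print(last.endswith("."))   # does the last line end in a period?
```

Because `splitlines()` drops the line terminators, a trailing newline at the end of the file does not produce an extra empty last line.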

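【Comments】:

Because .* is greedy, this will find only one instance of 'pattern' in the file. Since you specified the re.M flag, $ matches just before every newline in the file, so with greedy .* and re.DOTALL, the first .* matches everything up to the last 'pattern' in the file and the second matches everything after it.
Whatever. This isn't a complete solution, since we aren't sure what the OP really wants to do. The best I can do is tell him that he can read the whole file as a string and run a regex on it like on any ordinary string.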
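
The greedy behaviour described in the first comment can be checked with a small sketch. The sample text is made up for illustration; the second call simply drops re.DOTALL to show the per-line alternative.

```python
import re

data = "a pattern b\nc pattern d\nno hit here\n"

# With DOTALL, '.' also matches newlines, so the greedy '.*' spans lines
# and the whole text comes back as a single match.
greedy = re.findall(r"^.*pattern.*$", data, re.M | re.DOTALL)

# Without DOTALL, '.*' stops at each newline, giving one match per matching line.
per_line = re.findall(r"^.*pattern.*$", data, re.M)

print(greedy)
print(per_line)
```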

【Solution 3】:

The regex $ is NOT your friend; see this SO answer

【Comments】: