【Question Title】: Regex to find all sentences of text?
【Posted】: 2011-04-02 17:18:10
【Question】:

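I've been trying to teach myself regular expressions in Python, and I decided to print out all the sentences of a text. I've been tweaking the regular expression for the past 3 hours, to no avail.

I just tried the following, but it got me nowhere.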

import re

# Read the whole file, then try to match sentences with the original pattern
with open('anan.txt') as p:
    process = p.read()
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)

My input file looks like this:

OMG is this a question ! Is this a sentence ? My.
name is.

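This doesn't print any output. But when I remove "My. name is.", it prints OMG is this a question and Is this a sentence together, as if it only reads the first line.

What is the best regex solution that can find all the sentences in a text file, whether or not a sentence wraps onto a new line, and that also reads the entire text? Thanks.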

【Question Comments】:

Maybe this will help: stackoverflow.com/questions/587345/…

Tags: python regex

【Solution 1】:

Something like this works:

## pattern: uppercase letter, then anything that is not in (.!?), then one of them
>>> import re
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

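Note that name is. is not in the result, because it does not start with an uppercase letter.

Your problem comes from using the ^ and $ anchors: without re.MULTILINE they tie the match to the start and end of the whole text.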
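
To make the anchoring problem concrete, here is a minimal sketch that runs the pattern from the question against a one-line and a two-line input. The variable names (`question_pattern`, `one_line`, `two_lines`) are illustrative and not from the original post, and the in-memory strings stand in for the file contents.

```python
import re

question_pattern = r'^[A-Z].+\s+[.!?]$'   # the pattern from the question
one_line = "OMG is this a question ! Is this a sentence ?"
two_lines = "OMG is this a question ! Is this a sentence ? My.\nname is."

# Anchored pattern on a single line: the whole line comes back as ONE match.
whole = re.findall(question_pattern, one_line, re.I)

# Anchored pattern on the two-line input: "." cannot cross the newline and
# the pattern must span the entire text, so nothing matches.
nothing = re.findall(question_pattern, two_lines, re.I)

# Unanchored pattern: each sentence is found on its own.
unanchored = re.findall(r'[A-Z][^.!?]*[.!?]', two_lines)

print(whole)
print(nothing)
print(unanchored)
```

The first call returns a single element holding both sentences, which matches the behaviour described in the question.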

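【Comments】:

Thank you very much. I adapted it to re.findall, since I have to process a txt file. Is there a way to keep the '\n' character out of the results? I mean, in sentences that wrap onto a new line, \n shows up between the words on different lines.
@sarevok: You can remove the \n before splitting, with text.replace('\n', '').
Does this cover numbers as well?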
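
On the newline question in the comments above: replacing '\n' with an empty string glues together words that were separated only by the line break ("My\nname" becomes "Myname"). One possible alternative, sketched here as an assumption rather than as the commenter's method, is to collapse every run of whitespace into a single space before matching. The names `flat` and `sentences` are illustrative.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My\nname is."

# Collapse each run of whitespace (spaces, tabs, newlines) into one space,
# so wrapped sentences become single-line sentences.
flat = re.sub(r'\s+', ' ', text)

# Then apply the unanchored pattern from Solution 1.
sentences = re.findall(r'[A-Z][^.!?]*[.!?]', flat)
print(sentences)
```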

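【Solution 2】:

There are two problems in your regex:

1. Your expression is anchored by ^ and $, which are the "start of line" and "end of line" anchors respectively (without re.MULTILINE, the start and end of the whole string). This means your pattern is trying to match an entire line of text.
2. You search for \s+ before the punctuation, which specifies one or more whitespace characters. If there is no whitespace before the punctuation, the expression will not match.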
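
The second problem can be checked in isolation. The sketch below compares a pattern that requires whitespace before the punctuation with one that makes it optional; the names `strict` and `relaxed` are illustrative, and `\s*` is one possible relaxation, not necessarily what the answerer had in mind.

```python
import re

strict = re.compile(r'[A-Z].+\s+[.!?]')   # whitespace before punctuation is required
relaxed = re.compile(r'[A-Z].+\s*[.!?]')  # whitespace before punctuation is optional

spaced = "Is this a sentence ?"
unspaced = "Is this a sentence?"

# fullmatch returns a match object or None, so bool() shows success or failure.
print(bool(strict.fullmatch(spaced)))     # whitespace present
print(bool(strict.fullmatch(unspaced)))   # no whitespace before "?"
print(bool(relaxed.fullmatch(spaced)))
print(bool(relaxed.fullmatch(unspaced)))
```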

【Comments】:

Upvoted for actually explaining the two problems instead of just handing out a fixed regex.

【Solution 3】:

Edited: it now handles multi-line sentences as well.

>>> import re
>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall(r"[A-Z].*?[\.!?]", t, re.MULTILINE | re.DOTALL)
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

Only one thing needs explaining: re.DOTALL makes . match newline characters as well, as described in the documentation for the re module.
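
The effect of re.DOTALL is easiest to see by running the same pattern with and without the flag on a sentence that wraps onto a second line. This is a small sketch; the variable names are illustrative.

```python
import re

t = "My\n name is."
pattern = r'[A-Z].*?[.!?]'

# Without DOTALL, "." stops at the newline, so the sentence is never completed.
without_dotall = re.findall(pattern, t)

# With DOTALL, "." also matches "\n" and the wrapped sentence is found.
with_dotall = re.findall(pattern, t, re.DOTALL)

print(without_dotall)
print(with_dotall)
```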

【Comments】:

【Solution 4】:

Thanks, cji and Jochen Ritzel.

sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

I think this one is the best: it just adds a space at the end of the pattern.

SampleReport = 'I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area.  It has been resected during the interval between scans.  The'

If you use

pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
pat.findall(SampleReport)

the result will be:

['I image from 08/25 through 12.',
 'The patient image 1.',
 'It has been resected during the interval between scans.']

The flaw is that it cannot handle numbers such as 1.2. This one, however, works well:

sentence.findall(SampleReport)

Result:

['I image from 08/25 through 12. ',
 'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ',
 'It has been resected during the interval between scans. ']
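
A side effect of requiring a literal trailing space is that every match keeps that space, and a final sentence that ends the text without a space after it is skipped. One possible refinement, offered here as an assumption and not as part of the original answer, is a lookahead that requires whitespace or the end of the text after the punctuation without consuming it. The names `trailing_space` and `lookahead` are illustrative.

```python
import re

text = "Version 1.2 is out. It works."

# Pattern from this answer: a literal space must follow the punctuation.
trailing_space = re.compile(r'[A-Z].*?[.!?] ', re.DOTALL)

# Variant: the punctuation must be followed by whitespace or end of text,
# but that following character is not part of the match.
lookahead = re.compile(r'[A-Z].*?[.!?](?=\s|$)', re.DOTALL)

print(trailing_space.findall(text))
print(lookahead.findall(text))
```

Both patterns leave the "1.2" intact, because the dot inside the number is followed by a digit rather than by whitespace.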

【Comments】:

"[A-Z].*?[\.!?] " with the trailing space really is the right answer; it solved my problem. Thanks, 叶宁荣.

【Solution 5】:

Try it the other way round: split the text at the sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

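If this doesn't work, add a \ before the . (inside a character class the dot is already literal, so this should not normally be needed).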
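
A quick sketch of what the split approach returns: the delimiters are consumed, so the punctuation is lost, and a trailing empty string appears when the text ends with a delimiter. The second pattern, a lookbehind that splits only on the whitespace after the punctuation, is an assumed variant for keeping the punctuation and is not part of the original answer.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My name is."

# Split from the answer: punctuation and surrounding whitespace are removed.
parts = re.split(r'\s*[!?.]\s*', text)

# Assumed variant: split on whitespace that follows ., ! or ?, keeping the
# punctuation attached to each sentence.
kept = re.split(r'(?<=[.!?])\s+', text)

print(parts)
print(kept)
```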

【Comments】:

【Solution 6】:

You could try:

import re

# Read the whole file, echo it, then collect every delimiter-terminated chunk
with open('a') as p:
    process = p.read()
print(process)
regexMatch = re.findall(r'[^.!?]+[.!?]', process)
print(regexMatch)

The regex used here is [^.!?]+[.!?], which tries to match one or more characters that are not sentence delimiters, followed by a sentence delimiter.
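
Because [^.!?] also matches spaces and newlines, each match after the first starts with the whitespace that separated it from the previous sentence. The sketch below shows this on the input from the question and strips the results afterwards; the `strip()` step is an added assumption, not part of the original answer.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My.\nname is."

# Raw matches keep the leading space or newline between sentences.
raw = re.findall(r'[^.!?]+[.!?]', text)

# Strip surrounding whitespace from each match.
clean = [s.strip() for s in raw]

print(raw)
print(clean)
```

Unlike the patterns that start with [A-Z], this one also returns name is., since it does not require an uppercase first letter.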

【Comments】:

【Solution 7】:

I tried it in Notepad++ and came up with this:

.*$

and enable the multiline option:

re.MULTILINE

Cheers
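
For reference, a line-anchored pattern with re.MULTILINE returns lines rather than sentences. The sketch below uses a close variant, ^.+$, instead of .*$ so that no empty matches appear in the output; that substitution is an assumption made for readability.

```python
import re

text = "OMG is this a question ! Is this a sentence ? My.\nname is."

# With re.MULTILINE, ^ and $ match at every line boundary, so each
# non-empty line is returned as one match.
lines = re.findall(r'^.+$', text, re.MULTILINE)
print(lines)
```

On the input from the question this yields two lines, the first of which still contains three sentences.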

【Comments】: