Regular expression pattern when using lookbehind or lookahead to find a match

【Question title】: Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match
【Posted】: 2016-11-05 21:24:39
【Question】:

I am trying to split text into sentences correctly in Python, following normal grammar rules.

The text I want to split is

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

The expected output is

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.

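To achieve this I am using regular expressions. After a lot of searching I found the following regex, which does the trick. new_str is just 's' with the \n characters removed.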

import re

m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', new_str)

for i in m:
    print(i)



Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.

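The way I understand the regex is that we first select

1) all the characters like i.e.

2) from the whitespace left after the first selection, we keep the positions that are not preceded by words like Mr. or Mrs.

3) from the result of the second step, we keep only the positions where a dot or a question mark is followed by whitespace.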

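So I tried changing the order as follows:

1) Filter out all the titles first.

2) From the filtered result, select the whitespace preceded by a dot or a question mark.

3) Remove all phrases like i.e.

But when I do that, the whitespace after i.e. is split on as well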

m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', new_str)

for i in m:
    print(i)


Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.

Shouldn't the last step of the modified procedure be able to identify phrases like i.e.? Why does it fail to detect it?

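【Comments】:

You should use nltk to split text into sentences. Writing an exact splitting regex in Python is not possible (you can try matching one, but it will be a challenge).
@WiktorStribiżew I agree, but in this case I want to understand the nuances of the regex and why changing the order produces incorrect results.
Do you mean that the input in new_str has its newlines replaced with regular spaces, like here?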

Tags: python regex nlp tokenize negative-lookbehind

【Solution 1】:

First, the last . in (?<!\w\.\w.) looks suspicious. If you need to match a literal dot, escape it: (?<!\w\.\w\.).
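A minimal sketch of what the unescaped dot changes. The sample string and variable names are made up for illustration and are not from the question. Because the split position must already be preceded by . or ?, the unescaped dot effectively also accepts a question mark there:

```python
import re

# Hypothetical sample text, chosen so that "a.b?" ends with "?" rather than "."
text = "Is it a.b? Yes."

# Final "." is unescaped: it matches any character, including "?"
unescaped = r'(?<!\w\.\w.)(?<=\.|\?)\s'
# Final "\." is escaped: it matches only a literal dot
escaped = r'(?<!\w\.\w\.)(?<=\.|\?)\s'

parts_unescaped = re.split(unescaped, text)
parts_escaped = re.split(escaped, text)

print(parts_unescaped)  # ['Is it a.b? Yes.']
print(parts_escaped)    # ['Is it a.b?', 'Yes.']
```

With the unescaped version, "a.b?" is treated like an abbreviation and no split happens after it.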

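Back to the question: when you use the r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks whether the position after the whitespace is not preceded by a word char, a dot, a word char, and any char (since the . is unescaped). That condition is true, because the four characters before that position are a dot, e, another ., and a space, so the first of them is not a word char.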
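To see this concretely, here is a small sketch that slices out the four characters the lookbehind inspects at each position. The sample sentence and the names ws, before_ws and after_ws are assumptions for illustration only:

```python
import re

# Hypothetical sample sentence containing the abbreviation "i.e."
text = "He paid a lot, i.e. too much."

# Index of the whitespace that directly follows "i.e."
ws = text.index(" ", text.index("i.e."))

# Four characters ending just BEFORE the whitespace
# (what a lookbehind placed before \s would inspect)
before_ws = text[ws - 4:ws]
# Four characters ending just AFTER the whitespace
# (what a lookbehind placed after \s would inspect)
after_ws = text[ws - 3:ws + 1]

print(repr(before_ws))  # 'i.e.'
print(repr(after_ws))   # '.e. '

# \w\.\w. matches the first slice but not the second
print(bool(re.fullmatch(r'\w\.\w.', before_ws)))  # True
print(bool(re.fullmatch(r'\w\.\w.', after_ws)))   # False
```

Since the pattern does not match '.e. ', the negative lookbehind succeeds and the split goes ahead.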

To make the lookbehind behave the same way as it did before \s, put the \s into the lookbehind pattern as well:

(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)

See the regex demo.
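As a rough check, the three patterns discussed here can be compared side by side. The sample text below is a shortened, made-up variant of the question's input (single line, no newlines), so the output is for this sketch only:

```python
import re

# Hypothetical single-line sample with a title, an abbreviation and a suffix
text = "Mr. Smith paid a lot, i.e. too much. Did he mind? Adam Jones Jr. thinks not."

original = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'     # all lookbehinds before \s
reordered = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)'    # last lookbehind moved after \s
fixed = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)'      # \s added inside the lookbehind

result_original = re.split(original, text)
result_reordered = re.split(reordered, text)
result_fixed = re.split(fixed, text)

print(result_original)
# ['Mr. Smith paid a lot, i.e. too much.', 'Did he mind?', 'Adam Jones Jr. thinks not.']
print(result_reordered)
# ['Mr. Smith paid a lot, i.e.', 'too much.', 'Did he mind?', 'Adam Jones Jr. thinks not.']
print(result_fixed == result_original)  # True
```

Only the reordered pattern splits after "i.e."; the fixed one matches the original's result on this input.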

Another enhancement is to use a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
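A quick sketch showing that the two forms behave the same, at least on a made-up sample string (the text and variable names are illustrative assumptions):

```python
import re

# Hypothetical sample text with both "?" and "." sentence endings
text = "Did he mind? He did not. Fine."

# Alternation of two single-character branches
with_alternation = re.split(r'(?<=\.|\?)\s', text)
# Equivalent character class; inside [...] neither . nor ? needs escaping
with_char_class = re.split(r'(?<=[.?])\s', text)

print(with_alternation == with_char_class)  # True
print(with_char_class)  # ['Did he mind?', 'He did not.', 'Fine.']
```

The character class is shorter and avoids the alternation.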

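【Comments】:

Thanks. So when I use a negative lookbehind, it checks that the condition does not hold for the current string. Shouldn't it then start looking at the words before the whitespace rather than after it, given that both regexes select all the whitespace?
It should not start looking before the whitespace: where you place a lookaround matters. If you put the lookbehind after \s, the lookbehind pattern is searched for immediately before the position after the whitespace. When the lookbehind is before \s, it asserts the presence (or absence) of the pattern before the whitespace. As for the words after the whitespace: the lookbehind does not look at them here. Because it sits after the whitespace pattern, it only asserts the absence of a pattern that ends at, and includes, the whitespace.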
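A toy sketch of this position dependence, using a positive lookbehind for simplicity. The string "a. b" and the helper match_starts are made up for illustration:

```python
import re

# Hypothetical toy string: the only whitespace is at index 2
text = "a. b"

def match_starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Lookbehind BEFORE \s: checks the character just before the whitespace (".")
before = match_starts(r'(?<=\.)\s')
# Lookbehind AFTER \s: checks the character just before the position
# after the whitespace, which is the whitespace itself, so it fails
after = match_starts(r'\s(?<=\.)')
# Lookbehind AFTER \s with \s included: checks ". " and succeeds again
after_with_ws = match_starts(r'\s(?<=\.\s)')

print(before)         # [2]
print(after)          # []
print(after_with_ws)  # [2]
```

The same lookbehind gives different results depending on which side of \s it is placed, unless the whitespace is included in its pattern.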