Get repeating content by regex

【Question Title】: Get repeating content by regex
【Posted】: 2018-11-16 06:21:51
【Question Description】:

I have content in the following format:

text = """Pos no
...
... 25/gm
The Text to be 
...
excluded
Pos no
...
... 46 kg
The Text to be 
...
excluded
Pos no
...
... 46 xunit
End of My Text"""

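Here, Pos no ... 25/gm is a tabular structure from which I have to extract the values.

The Text to be ... excluded - this has a constant start (say, The Text to be) but no definite end, i.e. excluded may not be present.

End of My Text - this text will always be present.

I want a list that contains only the table contents, i.e.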

["Pos no
...
... 25/gm",
"Pos no
...
... 46 kg",
"Pos no
...
... 46 xunit"]

This is my attempt, but it does not return the correct list:

re.findall(r'(Pos no .+?)(?: |The Text to be|End of My Text)', text, re.DOTALL | re.M)

【Question Comments】:

【Solution 1】:

You can use

re.findall(r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)', text)

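Note that there is no space after Pos no in the text, while your pattern requires one. Also, matching the right-hand context only at the start of a line makes the match safer.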
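The first point can be checked quickly. This is a minimal sketch: `sample` is a shortened, hypothetical stand-in for the question's text, assuming plain LF line breaks, and the pattern is the one from the question, unchanged.

```python
import re

# Shortened stand-in for the question's text (assumption: plain "\n" line breaks).
sample = "\n".join([
    "Pos no", "...", "... 25/gm",
    "The Text to be", "...", "excluded",
    "End of My Text",
])

# The question's pattern, unchanged: it needs the literal "Pos no " (with a space).
attempt = re.findall(
    r'(Pos no .+?)(?: |The Text to be|End of My Text)',
    sample,
    re.DOTALL | re.M,
)

print("Pos no " in sample)  # is the space-suffixed prefix present at all?
print(attempt)
```

Because every Pos no in this sample is followed directly by a line break, the membership check prints False and the result list is empty.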

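Pattern details

(?sm) - inline modifiers for re.DOTALL and re.MULTILINE (for shorter code)
(Pos no\r?\n.+?) - Group 1 (the content that re.findall returns):
- Pos no - a literal substring
- \r?\n - a CRLF or LF line break
- .+? - any 1 or more chars, as few as possible, up to the leftmost occurrence of the subsequent subpatterns
[\r\n]+ - 1 or more line break chars
(?:The Text to be|End of My Text) - either of the two substrings, The Text to be or End of My Text.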
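Putting the pattern to work on the question's sample might look like the sketch below. The text is rebuilt line by line, assuming LF line breaks, and the names `TABLE_RE` and `tables` are only illustrative.

```python
import re

# The question's sample, rebuilt line by line (assumption: LF line breaks).
text = "\n".join([
    "Pos no", "...", "... 25/gm",
    "The Text to be ", "...", "excluded",
    "Pos no", "...", "... 46 kg",
    "The Text to be ", "...", "excluded",
    "Pos no", "...", "... 46 xunit",
    "End of My Text",
])

# Pattern from the answer; the inline flags must stay at the very start.
TABLE_RE = re.compile(
    r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)'
)

# findall returns only Group 1, i.e. each table block without its terminator.
tables = TABLE_RE.findall(text)
for block in tables:
    print(repr(block))
```

Each returned item starts at Pos no and stops just before the line breaks that precede The Text to be or End of My Text, so the excluded text never ends up in the list.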

【Comments】:

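Another demo with the same approach, only the results are printed differently.
Thanks for your effort. But it looks like it somehow does not work on the actual client data. One guess of mine is that the actual data contains utf-8 characters, so I wonder what difference it makes when the text contains utf-8 characters.
@Laxmikant Do you mean there are Unicode line breaks? Replace [\r\n] with [\u000A\u000B\u000C\u000D\u0085\u2028\u2029], and \r?\n with \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]. Also, are Pos no, The Text to be and End of My Text on their own lines, possibly padded with whitespace? Add \s* to allow leading or trailing whitespace. See this regex demo.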
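One way the Unicode-aware variant from the last comment could be assembled is sketched below. The exact placement of \s* (after Pos no and before the closing marker) is an assumption, as is the hypothetical `data` string, which mixes U+2028, CRLF and padded marker lines. The client data itself was never shown.

```python
import re

# Character class from the comment above: LF, VT, FF, CR, NEL, LS, PS.
NL = r'[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]'
# CRLF as one unit, otherwise any single line break character.
LINE_BREAK = r'(?:\u000D\u000A|' + NL + r')'

# Assumed placement of \s*: after "Pos no" and before the closing marker.
UNICODE_TABLE_RE = re.compile(
    r'(Pos no\s*' + LINE_BREAK + r'.+?)'
    + NL + r'+\s*(?:The Text to be|End of My Text)',
    re.DOTALL,
)

# Hypothetical input mixing U+2028, CRLF and padded marker lines.
data = (
    "Pos no  \u2028...\u2028... 25/gm\u2028  The Text to be\u2028...\u2028"
    "Pos no\r\n...\r\n... 46 kg\r\n\tEnd of My Text"
)

blocks = UNICODE_TABLE_RE.findall(data)
for block in blocks:
    print(repr(block))
```

Any whitespace that follows Pos no on the same line stays inside the captured block, so the items may need a strip or a per-line cleanup afterwards.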