【Question title】: python re.sub non-greedy substitute fails with a newline in the string [duplicate]
【Posted】: 2016-08-09 22:43:53
【Question description】:

I have run into a regular expression problem in Python (2.7.9).

I am trying to strip HTML <span> tags using a regular expression like this:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)

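(The regular expression reads: <span, anything that is not a >, then a >, then a non-greedy match of anything, followed by </span>, with re.S (re.DOTALL) so that . matches newline characters.)

This seems to work unless there is a newline in the text. It looks as if re.S (DOTALL) is not applied to non-greedy matches.

Here is the test code. Remove the newline from text1 and re.sub works. Put it back and re.sub fails. Move the newline outside the <span> tag and re.sub works again.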

#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)

For comparison, I wrote a Perl script that does the same thing; there the regular expression works as I expect.

#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s;
print "$text1\n";

Any ideas?

Tested in Python 2.6.6 and Python 2.7.9.

【Question discussion】:

Another instance of the same question: stackoverflow.com/questions/42581/…. This problem comes up fairly often. The answer is: read the docs.

Tags: python regex

【Solution 1】:

The fourth argument of re.sub is count, not flags.

re.sub(pattern, repl, string, count=0, flags=0)

You need to pass flags explicitly as a keyword argument:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑

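Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) rather than as the S (DOTALL) flag. With flags left at 0, . does not match a newline, so the pattern never matches a span that contains one: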

>>> import re
>>> re.S
16

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
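The asker's three observations all follow from that binding. The sketch below is only an illustration written for Python 3; the shortened test strings and the variable names are made up for this example and are not from the original post. It spells the accidental call out as count=16 and compares a newline inside the span with one outside it.

```python
import re

# Hypothetical test strings: the newline is inside the span in one, outside in the other.
text_inside = '<b>a <span color="red">test\n with newline</span></b>'
text_outside = '<b>a <span color="red">test</span>\n with newline</b>'
pattern = r'<span[^>]*>(.*?)</span>'

# re.S is an integer-valued flag; its value is 16.
flag_value = int(re.S)

# Passing re.S as the fourth positional argument binds it to count,
# so the call behaves like this one: up to 16 replacements, no flags.
inside_no_flag = re.sub(pattern, r'\1', text_inside, count=flag_value)

# Without DOTALL, '.' stops at the newline, so the span above is never matched.
# When the newline sits outside the span, the same call does match.
outside_no_flag = re.sub(pattern, r'\1', text_outside, count=flag_value)

# With flags given by keyword, '.' crosses the newline and the span is stripped.
inside_with_flag = re.sub(pattern, r'\1', text_inside, flags=re.S)

print(repr(inside_no_flag))
print(repr(outside_no_flag))
print(repr(inside_with_flag))
```

Only the first call leaves its input unchanged, which is why the bug looked like a non-greedy or DOTALL problem rather than an argument-order problem.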

【Discussion】:

Thanks @falsetru, that solved it. (Phew!) Interesting side note: the flags argument is not recognized in Python 2.6, so we are using 2.7.
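For interpreters where re.sub does not accept a flags keyword, two other spellings should give the same result: compiling the pattern with the flag, or embedding the flag in the pattern as (?s). This is a sketch, not something from the thread; it was written for Python 3 and is assumed, not verified, to behave the same on 2.6. The variable names are illustrative.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

# Option 1: give the flag to re.compile, then call sub() on the pattern object.
# The pattern object's sub() takes (repl, string, count), so there is no flags slot to confuse.
span_re = re.compile(r'<span[^>]*>(.*?)</span>', re.S)
via_compile = span_re.sub(r'\1', text1)

# Option 2: put the inline DOTALL flag at the very start of the pattern.
via_inline = re.sub(r'(?s)<span[^>]*>(.*?)</span>', r'\1', text1)

print(repr(via_compile))
print(repr(via_inline))
```

Either form keeps the flag next to the pattern it modifies, so it cannot be mistaken for a replacement count.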