【Posted】: 2014-08-20 20:20:29
【Question】:
Consider this example, which I am running on Python 2.7:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
tstr = r''' <div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp"> </span></span><a
id="Xtester"></a><span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
'''
# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print( re.findall(regstr, tout2, re.DOTALL)) # finds
print("------") #
print( re.sub(regstr, "AAAAAAA", tout2, re.DOTALL )) # does nothing?
When I run it, the first regex substitution works as expected (the <a id> element is gone); then in the output I get:
[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
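... which means the second regex is written correctly (all three groups are found). However, when I try to replace that whole snippet with "AAAAAAA", nothing happens in that part of the output: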
------
<div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp"> </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
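Clearly, the "AAAAAAA" I expected is nowhere in this output.
What is the problem, and what should I do to get sub to replace the match that findall evidently found?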
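A likely cause, sketched under assumptions: the two functions take their optional arguments in a different order. The documented signatures are re.findall(pattern, string, flags=0) and re.sub(pattern, repl, string, count=0, flags=0), so a positional re.DOTALL in re.sub lands in count rather than flags. The html string below is a hypothetical, shortened stand-in for the markup in the question; the pattern is the one from regstr.

```python
import re

# Hypothetical, shortened stand-in for the HTML in the question.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'
pattern = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''

# re.findall(pattern, string, flags=0): the third positional argument is flags,
# so "." is allowed to match the newline after "<a".
found = re.findall(pattern, html, re.DOTALL)

# re.sub(pattern, repl, string, count=0, flags=0): the fourth positional
# argument is count, so re.DOTALL (integer value 16) is read as count=16
# and flags stays 0. Without DOTALL the pattern cannot cross the newline.
# (Recent Python 3 releases may warn about passing count positionally.)
unchanged = re.sub(pattern, "AAAAAAA", html, re.DOTALL)

# Passing the flag by keyword applies DOTALL to the substitution as well.
replaced = re.sub(pattern, "AAAAAAA", html, flags=re.DOTALL)

print(found)
print(unchanged == html)
print(replaced)
```

This would also explain why the first substitution in the question works: that pattern uses [\s]*? to step over the line break and never needs "." to match a newline, so the missing flag goes unnoticed there.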
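【Comments】:
-
Thanks for the comment, @Jerry. However, they are the same: first I call re.findall(regstr, ..., and then I call re.sub(regstr, ...; the regex pattern is stored in the string regstr (which is why I put it in a variable in the first place). Cheers!
-
Oh, oops. There were two different re's there, but I didn't see them.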
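Since the same pattern is meant to be used by both calls, one option (a minimal sketch, not taken from the thread) is to compile it once with re.compile so the flags travel with the pattern object instead of being repeated at each call site. The name link_class and the sample html string are hypothetical; the replacement r"\1\3" assumes the goal stated in the code comment, removing only the class attribute.

```python
import re

# Flags are bound to the compiled pattern, so findall and sub see the same ones.
link_class = re.compile(r'''(<a.*?)(class=['"].*?['"])([\s]*>)''', re.DOTALL)

# Hypothetical, shortened stand-in for the HTML in the question.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'

matches = link_class.findall(html)

# Keep groups 1 and 3 and drop group 2 (the class attribute).
cleaned = link_class.sub(r"\1\3", html)

print(matches)
print(cleaned)
```

The methods of a compiled pattern take no flags argument at all, which removes the chance of passing a flag in the wrong position.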
Tags: python html regex replace html-parsing