【Posted】: 2012-01-26 20:38:17
【Question】:
Possible duplicates:
Strip html from strings in python
RegEx match open tags except XHTML self-contained tags
My Python module has a regular expression pattern that strips HTML tags from a given string.
It does not work in the following case.
Input string:
string=<li class="
tal
"><h3><a href="/aclk?sa=l&ai=CoS4y-Wz0TrnqC8y0rAfysK2DB46PiJECzoK8_yKPwd4FCAAQAigCUL7Kz4P9_____wFg5erjg5gOoAH0m_XuA8gBAakCoqvilYNWVD6qBB1P0Dm6CNzrf62IC36fDvUIh77EpeheIRdH_YEaPw&sig=AOD64_2z9xPK8vOxUCpIGTjBcc2Lg-GAeA&adurl=http://www.policybazaar.com/creditcards/creditcard-india.aspx%3Futm_source%3Dgoogle%26utm_medium%3Dppc%26utm_term%3DCreditcard_delhi_only%26utm_campaign%3Dcredit_card" id="pa2">Compare <b>Credit Cards</b> | PolicyBazaar.com</a></h3>Get Best <b>Credit Card</b> For Free, Now U Have a Choice, Choose wisely!<br /><cite>www.policybazaar.com/<b>credit</b>-<b>Cards</b></cite></li>
Regular expression pattern:
In [64]: p = re.compile(r'<.*?>')
In [65]: text = p.sub('', str(string))
In [66]: text
Out[66]: '<li class="\n tal\n ">Compare Credit Cards | PolicyBazaar.comGet Best Credit Card For Free, Now U Have a Choice, Choose wisely!www.policybazaar.com/credit-Cards'
The result still contains the <li> tag. How can it be removed regardless of the class name and the string pattern?
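A note on the likely cause, with a minimal sketch: by default `.` in Python's `re` does not match a newline, and the `<li>` tag here spans several lines, so `<.*?>` never matches it. The `sample` string below is a shortened stand-in for the input above, and the names `strip_tags_regex`, `strip_tags_parser` and `_TextExtractor` are hypothetical, chosen only for illustration. The first function compiles the same pattern with `re.DOTALL`; the second avoids regular expressions and collects text with the standard library's `html.parser`, which tends to be more robust on real-world markup.

```python
import re
from html.parser import HTMLParser

# Shortened stand-in for the input above (an assumption for illustration);
# the newlines inside the class attribute are what matters.
sample = ('<li class="\n tal\n "><h3><a href="/aclk?sa=l" id="pa2">'
          'Compare <b>Credit Cards</b></a></h3>Get Best<br /></li>')

# With re.DOTALL, '.' also matches '\n', so a tag spanning lines is matched.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_regex(html):
    """Remove anything that looks like a tag, including multi-line tags."""
    return TAG_RE.sub('', html)

class _TextExtractor(HTMLParser):
    """Collect only the text nodes seen while parsing."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags_parser(html):
    """Strip tags by parsing the markup instead of pattern matching."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

print(strip_tags_regex(sample))
print(strip_tags_parser(sample))
```

The pattern `<[^>]*>` would behave the same way as the `re.DOTALL` version on this input, because a negated character class matches newlines. Either regex can still misfire on a `>` inside an attribute value or on comments and scripts, which is why the linked duplicates recommend a parser.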
【Discussion】: