Regular Expression in Python for Removing XML Comments and HTML Elements: Answers

【Question title】: Regular Expression in Python for Removing XML Comments and HTML Elements
【Posted】: 2011-10-12 11:42:49
【Question】:

I am using Universal Feed Parser to parse RSS content. Sometimes the description tag contains content like the following:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>

To remove the HTML elements/tags, I use the following regular expression.

pattern = re.compile(u'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This removes the HTML tags, but not the XML comments. How can I remove both the elements and the XML comments?
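The pattern above requires a word character right after `<` (or `</`), so `<!--` never matches, which is why comments survive. A minimal sketch of one possible fix, assuming comments are well formed, adds a comment alternative in front of a simplified tag pattern. The names `COMMENT_OR_TAG` and `strip_markup` are hypothetical, chosen only for this illustration.

```python
import re

# Try the comment alternative first, then a simplified version of the tag pattern.
# re.DOTALL lets a comment span several lines.
COMMENT_OR_TAG = re.compile(r'<!--.*?-->|</?\w+[^>]*>', re.DOTALL)

def strip_markup(text):
    """Replace every comment and tag with a single space."""
    return COMMENT_OR_TAG.sub(' ', text)

desc = ('<!--This is the XML comment -->\n'
        '<p>This is a Test Paragraph</p></br>\n'
        '<b>Sample Bold</b>\n'
        '<m:Table>Sampe Text</m:Table>')

# Collapse the leftover whitespace for display.
print(' '.join(strip_markup(desc).split()))
# This is a Test Paragraph Sample Bold Sampe Text
```

This still treats markup as flat text, so it does not handle CDATA sections or a `>` inside an attribute value.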

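【Comments】:

Isn't this enough? r'<.*?>'
The right way is to use an XML parser, as @duffymo says. Try BeautifulSoup.
A parser is overkill in this case. You don't need to know the tree structure, tag namespaces, names, and attributes just to throw them away, right? Oh, and @rplnt, you forgot CDATA (<![CDATA[some text <this is not a tag!> some more text]]>).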
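The exchange above can be checked directly. The sketch below applies the suggested `r'<.*?>'` pattern to a comment and to the CDATA example from the last comment; `naive_strip` is a hypothetical helper name used only here.

```python
import re

NAIVE = re.compile(r'<.*?>', re.DOTALL)

def naive_strip(text):
    """Delete everything between a '<' and the next '>'."""
    return NAIVE.sub('', text)

# A simple comment is removed together with the tags.
print(naive_strip('<!--This is the XML comment --><b>Sample Bold</b>'))
# Sample Bold

# In a CDATA section the match stops at the first '>', which loses text
# and leaves the closing ']]>' behind.
print(repr(naive_strip('<![CDATA[some text <this is not a tag!> some more text]]>')))
# ' some more text]]>'
```

The same early stop happens for a comment that contains a `>` character.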

Tags: python regex string

【Solution 1】:

Use lxml:

import lxml.html as LH

content='''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

doc=LH.fromstring(content)
print(doc.text_content())

yields

This is a Test Paragraph
Sample Bold
Sampe Text
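Where lxml is not installed, a similar result can be sketched with the standard library's `html.parser` module. This is a different technique from the lxml answer above, not a drop-in equivalent: the `TextExtractor` class and `html_to_text` function are hypothetical names, and the whitespace is collapsed to single spaces rather than kept as line breaks.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes; tags and comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(content):
    parser = TextExtractor()
    parser.feed(content)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace into single spaces.
    return ' '.join(''.join(parser.chunks).split())

content = '''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

print(html_to_text(content))
# This is a Test Paragraph Sample Bold Sampe Text
```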

【Comments】:

【Solution 2】:

Using regular expressions this way is a bad idea.

I would use a real parser, navigate the DOM tree, and remove what I want that way.

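【Comments】:

See the accepted answer at stackoverflow.com/questions/1732348/…. Use Beautiful Soup instead.
You people from the Ban Regex Movement really freak me out. Regular expressions can't be used to PARSE XML, because tags can be nested (<b><i></i></b>), but they can be used to STRIP tags, because a tag is just whatever sits between angle brackets. Read Wikipedia, damn it. (Sorry.)
There is no movement to ban regex; the point is only that every task should use the right tool. Before you can strip a tag you have to find it, and how would you do that? With a regex? Bad idea.
So why exactly would it be bad?
Because a DOM tree carries more context, gives you element type information, and has a nice API (XPath) for finding things.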
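The answer gives no code, so the following is only a sketch of what "navigate the DOM tree and remove what I want" could look like with the standard library's `xml.dom.minidom`. It assumes well-formed XML, which the sample in the question is not (it has a stray `</br>` and an undeclared `m:` prefix), and minidom has no XPath support, so the tree is walked by hand. The helper names `drop_comments` and `text_of` are hypothetical.

```python
from xml.dom import minidom

def drop_comments(node):
    """Remove every comment node below `node`, in place."""
    for child in list(node.childNodes):  # copy, because we mutate while iterating
        if child.nodeType == child.COMMENT_NODE:
            node.removeChild(child)
        else:
            drop_comments(child)

def text_of(node):
    """Concatenate the text and CDATA content below `node`."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return ''.join(text_of(child) for child in node.childNodes)

# Assumption: a well-formed variant of the question's sample, wrapped in a root element.
xml = ('<root><!--This is the XML comment -->'
       '<p>This is a Test Paragraph</p> <b>Sample Bold</b></root>')

doc = minidom.parseString(xml)
drop_comments(doc.documentElement)
print(text_of(doc.documentElement))
# This is a Test Paragraph Sample Bold
```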

【Solution 3】:

There is a simple way to do this in pure Python:

def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = False  # True while inside a quoted attribute value
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

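The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it in action here: http://youtu.be/HPkNPcYed9M?t=35s

PS: if you are interested in the course (on smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!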
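A short usage sketch for the function above, repeated here so the block runs on its own. The inputs are illustrative assumptions, not taken from the answer. They show that comments are stripped like any other tag, and that a quote character inside a comment flips the quote flag, so everything after it is dropped.

```python
def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = False  # True while inside a quoted attribute value
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out

# A comment is treated like any other <...> span.
print(remove_html_markup('<!--This is the XML comment --><b>Sample Bold</b>'))
# Sample Bold

# A '>' inside a quoted attribute value does not end the tag.
print(remove_html_markup('<a href="x>y">link</a>'))
# link

# Limitation: the apostrophe inside the comment opens a "quote" that never
# closes, so the rest of the input is swallowed.
print(repr(remove_html_markup("<!-- don't --><b>Sample Bold</b>")))
# ''
```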

【Comments】:

【Solution 4】:

Why so complicated? re.sub('<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)
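The one-liner can be unpacked as follows: the first alternative captures the body of a CDATA section in group 1, and the replacement function keeps that body while deleting every other `<...>` span. This is a sketch with a raw-string pattern and a hypothetical wrapper name, `strip_keep_cdata`.

```python
import re

# Group 1 is the CDATA body; for ordinary tags and comments it is None.
CDATA_OR_TAG = re.compile(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', re.DOTALL)

def strip_keep_cdata(desc):
    """Unwrap CDATA sections and delete every other <...> span."""
    return CDATA_OR_TAG.sub(lambda m: m.group(1) or '', desc)

print(strip_keep_cdata('<!--This is the XML comment --><b>Sample Bold</b>'))
# Sample Bold

print(strip_keep_cdata('<![CDATA[some text <this is not a tag!> some more text]]>'))
# some text <this is not a tag!> some more text
```

Because the CDATA alternative comes first, the `<` inside the CDATA body is never seen by the generic `<.*?>` branch.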

If you want XML tags left intact, you should probably look up the list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use a regular expression like '(<!\[CDATA\[.*?\]\]>)|</?(?:tag names separated by pipes)(?:\s.*?)?>'.
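A sketch of that whitelist idea, under several assumptions: the tag list `p|br|b|i|a|div|span` is a small illustrative subset rather than the full HTML list, the comment alternative `<!--.*?-->` is an addition made here to cover the original question, and `strip_html_only` is a hypothetical name. CDATA sections are captured in group 1 and put back unchanged.

```python
import re

# Assumption: a short illustrative subset of HTML tag names.
HTML_TAGS = 'p|br|b|i|a|div|span'

HTML_ONLY = re.compile(
    r'(<!\[CDATA\[.*?\]\]>)'                  # group 1: CDATA, kept as-is
    r'|<!--.*?-->'                            # comments (added for this sketch)
    r'|</?(?:' + HTML_TAGS + r')(?:\s.*?)?>', # listed HTML tags only
    re.DOTALL,
)

def strip_html_only(desc):
    """Remove comments and listed HTML tags; keep CDATA and other XML tags."""
    return HTML_ONLY.sub(lambda m: m.group(1) or '', desc)

print(strip_html_only('<!--c--><p>Hi</p><m:Table>Sampe Text</m:Table>'))
# Hi<m:Table>Sampe Text</m:Table>
```

As written, the tag branch does not match a self-closing form without a space, such as `<br/>`, and tag names are matched case-sensitively unless `re.IGNORECASE` is added.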

【Comments】: