How do I remove HTML entities (and more) using lxml?

[Posted]: 2011-08-18 00:22:13
[Question]:

I have an HTML file containing some text that looks like this (this is the output of etree.tostring(table, pretty_print=True) after running the file through lxml.html's parse and lxml.html.clean):

 <tr><td>&#13;
224&#13;
9:00 am&#13;
-3:00 pm&#13;
NPHC Leadership</td>&#13;
<td>&#13;
<font>ALSO IN 223; WALL OPEN</font></td>&#13;

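The documentation I've found on lxml is a bit spotty. I've managed to get this far, but what I want to do is strip every tag except <table>, <td>, and <tr>. I also want to remove all attributes from those tags, and I want to get rid of entities such as &#13;.

To strip the attributes I'm currently using: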

    etree.strip_attributes(tree, 'width', 'href', 'style', 'onchange',
                           'ondblclick', 'class', 'colspan', 'cols',
                           'border', 'align', 'color', 'value',
                           'cellpadding', 'nowrap', 'selected',
                           'cellspacing')

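This works, but it seems like there should be a better way. There ought to be a fairly simple way to do what I want, but I can't find any example that works for me.

I tried using Cleaner, but when I pass in allow_tags, like this:

    Cleaner(allow_tags=['table', 'td', 'tr']).clean_html(tree)

it gives me this error:

    ValueError: It does not make sense to pass in both allow_tags and remove_unknown_tags

And when I add remove_unknown_tags=False, I get this error: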

Traceback (most recent call last):
  File "parse.py", line 73, in <module>
    SParser('schedule.html').test()
  File "parse.py", line 38, in __init__
    self.clean()
  File "parse.py", line 42, in clean
    Cleaner(allow_tags=['table', 'td', 'tr'], remove_unknown_tags=False).clean_html(tree)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 488, in clean_html
    self(doc)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 390, in __call__
    el.drop_tag()
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 191, in drop_tag
    assert parent is not None
AssertionError

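So, to summarize:

I want to remove HTML entities such as &#13;.
I want to remove all tags except <table>, <tr>, and <td>.
I want to remove all attributes from the remaining tags.

Any help would be greatly appreciated!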

[Comments]:

Using BeautifulSoup would work well.

Tags: python html-parsing lxml

[Answer 1]:

Here is an example that strips all attributes and allows only the tags in [table, tr, td]. I added some Unicode entities for illustration.

DATA = '''<table border="1"><tr colspan="4"><td rowspan="2">\r
224&#13;
&#8220;hi there&#8221;
9:00 am\r
-3:00 pm&#13;
NPHC Leadership</td>\r
<td rowspan="2">\r
<font>ALSO IN 223; WALL OPEN</font></td>\r
</table>'''

import lxml.html
from lxml.html import clean

def _clean_attrib(node):
    for n in node:
        _clean_attrib(n)
    node.attrib.clear()

tree = lxml.html.fromstring(DATA)
cleaner = clean.Cleaner(allow_tags=['table','tr','td'],
                        remove_unknown_tags=False)
cleaner.clean_html(tree)
_clean_attrib(tree)

# encoding='unicode' returns a str, so this prints readable text on Python 3
print(lxml.html.tostring(tree, encoding='unicode', pretty_print=True,
                         method='html'))

Result:

<table><tr>
<td>
224
“hi there”
9:00 am
-3:00 pm
NPHC Leadership</td>
<td>
<font>ALSO IN 223; WALL OPEN</font>
</td>
</tr></table>
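
Note that the result above still contains <font>. As far as I can tell, that is because clean_html works on a copy when it is handed an element and returns the cleaned copy, so the return value has to be kept (or the cleaner called directly, as in cleaner(tree), to clean in place). As an alternative that does not rely on Cleaner at all, here is a minimal sketch that whitelists tags by hand with drop_tag() and then clears the attributes. The helper name keep_only_tags and the sample markup are made up for illustration.

```python
import lxml.html

ALLOWED = {"table", "tr", "td"}

def keep_only_tags(tree, allowed=ALLOWED):
    """Unwrap every element whose tag is not in `allowed`, then drop all attributes."""
    # Collect first, so the tree is not modified while it is being iterated.
    unwanted = [el for el in tree.iter()
                if isinstance(el.tag, str) and el.tag not in allowed]
    for el in unwanted:
        # drop_tag() keeps the element's text and children but needs a parent
        # (that is the assertion seen in the question's traceback), so an
        # element without a parent is left alone here.
        if el.getparent() is not None:
            el.drop_tag()
    for el in tree.iter():
        if isinstance(el.tag, str):  # skip comments and processing instructions
            el.attrib.clear()
    return tree

sample = ('<table border="1"><tr><td rowspan="2">224 '
          '<font color="red">ALSO IN <b>223</b></font>; WALL OPEN</td></tr></table>')
tree = keep_only_tags(lxml.html.fromstring(sample))
html_out = lxml.html.tostring(tree, encoding="unicode")
print(html_out)
```

The tags are collected into a list before any of them is dropped, because changing the tree while iterating over it can skip elements.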

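Are you sure you want to remove all entities? &#13; corresponds to a carriage return, and when lxml parses the document it converts all entities to their corresponding Unicode characters.

Whether entities show up also depends on the output method and encoding. For example, if you use lxml.html.tostring(tree, encoding='ascii', method='xml'), the '\r' and the Unicode characters are output as entities: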

<table>
  <tr><td>&#13;
  &#8220;hi there&#8221;
...
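
To illustrate the point above: after parsing, &#13; is just a '\r' character inside .text or .tail, so one way to keep it out of the serialized output is to strip that character before serializing. This is a minimal sketch, assuming carriage returns carry no meaning in your data. strip_carriage_returns is a made-up helper name and the sample markup is a shortened version of the data above.

```python
import lxml.html

def strip_carriage_returns(tree):
    # Remove carriage returns from every text and tail in the tree, in place.
    for el in tree.iter():
        if el.text:
            el.text = el.text.replace("\r", "")
        if el.tail:
            el.tail = el.tail.replace("\r", "")
    return tree

tree = lxml.html.fromstring(
    "<table><tr><td>&#13;\n224&#13;\n&#8220;hi there&#8221;</td></tr></table>")
td = tree.find(".//td")
before = td.text
print(repr(before))  # the entities are already plain characters at this point

strip_carriage_returns(tree)
xml_ascii = lxml.html.tostring(tree, encoding="ascii", method="xml")
print(xml_ascii)
```

The curly quotes still come out as numeric entities here only because the output encoding is ASCII; with encoding='unicode' or 'utf-8' they are written as the characters themselves.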


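[Answer 2]:

For me, writing it in terms of the basic element pieces (text, tag, and tail) makes it easier to tailor the behavior to exactly what you want and to include error checking (for example, making sure there are no unexpected tags in the incoming data).

The if statements on text and tail are there because they return None rather than "" when they are empty.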

import lxml.html

def ctext(el):
    # Serialize the contents of el, keeping only table/tr/td tags.
    result = []
    if el.text:
        result.append(el.text)
    for sel in el:
        if sel.tag in ["tr", "td", "table"]:
            # Allowed tag: re-emit it without its attributes.
            result.append("<%s>" % sel.tag)
            result.append(ctext(sel))
            result.append("</%s>" % sel.tag)
        else:
            # Any other tag: keep its contents, drop the tag itself.
            result.append(ctext(sel))
        if sel.tail:
            result.append(sel.tail)
    return "".join(result)

html = """your input string"""
el = lxml.html.fromstring(html)
print(ctext(el))
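
The error checking mentioned above could look something like the following variant. It is only a sketch: ctext_strict, KEEP, and UNWRAP are names introduced here, the UNWRAP list is a guess at which markup counts as "expected", and escaping the text with xml.sax.saxutils.escape is an addition that the ctext function above does not perform.

```python
import lxml.html
from xml.sax.saxutils import escape  # stdlib helper: escapes &, < and >

KEEP = {"table", "tr", "td"}           # tags re-emitted in the output
UNWRAP = {"font", "b", "i", "span"}    # hypothetical: tags whose markup is dropped

def ctext_strict(el):
    parts = [escape(el.text or "")]
    for child in el:
        if not isinstance(child.tag, str):
            pass  # comment or processing instruction: skip its content
        elif child.tag in KEEP:
            parts.append("<%s>%s</%s>" % (child.tag, ctext_strict(child), child.tag))
        elif child.tag in UNWRAP:
            parts.append(ctext_strict(child))
        else:
            # Fail loudly instead of silently passing unknown markup through.
            raise ValueError("unexpected tag: %r" % child.tag)
        parts.append(escape(child.tail or ""))
    return "".join(parts)

el = lxml.html.fromstring(
    "<table><tr><td>A &amp; B <font>x</font> y</td></tr></table>")
print(ctext_strict(el))
```

Like ctext, this serializes only the contents of the element it is given, so the root tag itself is not part of the output.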

Remember that the relationship is:

  <b>text of the bold <i>text of the italic</i> tail of the italic</b>
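
The same relationship can be checked directly on a parsed element. The snippet is parsed here with lxml.etree.fromstring because it is well-formed on its own; elements produced by lxml.html expose the same .text and .tail attributes.

```python
from lxml import etree

b = etree.fromstring(
    "<b>text of the bold <i>text of the italic</i> tail of the italic</b>")
i = b[0]

print(repr(b.text))  # 'text of the bold '
print(repr(i.text))  # 'text of the italic'
print(repr(i.tail))  # ' tail of the italic'
print(repr(b.tail))  # None
```

The text after </i> belongs to the <i> element as its tail, not to <b>, which is why ctext appends sel.tail after handling each child.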
