Python/Etree: Get text from element and its children

【Question title】: Python/Etree: Get text from element and its children
【Posted】: 2011-05-21 14:52:11
【Question】:

I have to work with some HTML that looks like this:

<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> (trying something to find out about it) <i>"a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"</i></li>

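The problem is that I need to get the text both from the child elements (such as the a and i elements) and from the text nodes (such as the ", " parts between the children).

All I have managed to do is either get the text from each child and join it together (which gives me everything except the text nodes) or get only the text nodes (and not the a and i elements). Is there a way to get both?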

【Comments】:

Tags: python html xml parsing elementtree

【Solution 1】:

The lxml changelog shows that lxml v2.3 is compatible with Python 3.1.2 and later.

You can also use a regular expression, re.sub(r'<[^>]*?>', '', val), as suggested in "Python's equivalent to PHP's strip_tags".
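As a rough illustration of the regex approach, here is a minimal sketch. The helper name strip_tags and the shortened sample string are assumptions made for this example, not part of the original answer. The pattern simply deletes anything that looks like a tag, so it does not decode entities such as &amp; and it can misfire on comments, script blocks, or a ">" inside an attribute value.

```python
import re

# Hypothetical helper; the pattern is the one quoted in the answer.
TAG_RE = re.compile(r'<[^>]*?>')

def strip_tags(markup):
    """Delete everything that looks like a tag and keep the text in between."""
    return TAG_RE.sub('', markup)

# Shortened version of the markup from the question (an assumption for brevity).
html = '<li><a href="#">trial</a>, <b>test</b> (trying something)</li>'
print(strip_tags(html))  # trial, test (trying something)
```

For small, trusted fragments this is probably enough; for arbitrary HTML a real parser is the safer choice.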

【Comments】:

【Solution 2】:

You can do this with lxml.html.

In [1]: import lxml.html

In [2]: el = lxml.html.fromstring('<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> (trying something to find out about it) <i>"a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"</i></li>')

In [3]: print(el.text_content())
S: (n) trial, trial run, test, tryout (trying something to find out about it) "a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"
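If lxml is not an option, the standard library's xml.etree.ElementTree may be enough. The sketch below is not part of the original answer, and it assumes the fragment is well-formed XML, as the one in the question happens to be; typical real-world HTML often is not, and ET.fromstring would raise a ParseError on it. Element.itertext() walks the element's own text, each descendant's text, and the tail text after each child, in document order. The sample string is a shortened copy of the markup from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question; it must be well-formed XML
# for ET.fromstring to accept it.
markup = ('<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, '
          '<a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> '
          '(trying something to find out about it)</li>')

el = ET.fromstring(markup)

# itertext() yields element text and tail text in document order,
# so joining the pieces gives both the children's text and the text nodes.
text = ''.join(el.itertext())
print(text)
# S: (n) trial, trial run, test, tryout (trying something to find out about it)
```

itertext() is available in Python 2.7 and in Python 3.2 and later.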

【Comments】:

I need something that is compatible with Python 3. AFAIK, lxml is not yet.
From lxml.de: the latest release works with all CPython versions from 2.4 to 3.2.