ElementTree text mixed with tags

【Question title】: ElementTree text mixed with tags
【Posted】: 2015-12-16 18:07:19
【Question】:

Consider the following text:

<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>

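How do I parse this using the etree interface? Given the description tag, the .text attribute returns only the first word, "the". The .getchildren() method returns the <b> elements, but not the rest of the text.

Many thanks!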

【Question discussion】:

Tags: python html elementtree

【Solution 1】:

Use .text_content(). A working example using lxml.html:

from lxml.html import fromstring   

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# Parse the fragment, then flatten the <description> element to plain text.
tree = fromstring(data)

print(tree.xpath("//description")[0].text_content().strip())

Prints:

the thing stuff is very important for various reasons, notably other things.
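
Since the question is about the ElementTree interface, the same flattened string can probably also be obtained without lxml. The following is a minimal sketch using the standard library's xml.etree.ElementTree and its itertext() method; it assumes the input is well-formed XML (unlike lxml.html, the standard parser does not tolerate broken markup), and the variable names root and flat are only illustrative.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# Parse the fragment; the root element is <description> itself.
root = ET.fromstring(data)

# itertext() yields every text fragment in document order,
# including the text inside the nested <b> elements.
flat = "".join(root.itertext()).strip()

print(flat)
```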

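I forgot to specify one thing, sorry. My ideal parsed version would contain a list of segments: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?

Assuming the description contains only text nodes and b elements: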

for item in tree.xpath("//description/*|//description/text()"):
    # Text nodes come back as str subclasses (basestring exists only in Python 2); elements are the <b> tags.
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])

Prints:

['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
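
For readers who want to stay with the plain ElementTree interface, the segment list can be sketched without XPath by relying on how ElementTree stores mixed content: an element's .text holds the text before its first child, and each child's .tail holds the text that follows that child. The helper name segments below is hypothetical, and the sketch makes the same assumption as above, namely that the description contains only text and flat <b> elements.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""


def segments(elem):
    """Return [text, style] pairs in document order (hypothetical helper)."""
    result = []
    # elem.text is the text that precedes the first child element.
    if elem.text and elem.text.strip():
        result.append([elem.text.strip(), "normal"])
    for child in elem:
        # The child's own text is the bold run (assumes no nested tags in <b>).
        result.append([(child.text or "").strip(), "bold"])
        # child.tail is the text between this child and the next sibling.
        if child.tail and child.tail.strip():
            result.append([child.tail.strip(), "normal"])
    return result


root = ET.fromstring(data)
for seg in segments(root):
    print(seg)
```

With the sample input this should produce the same five pairs as the XPath version; nested markup inside <b> would need a recursive variant.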

【Discussion】:
