How to select text without the HTML markup

[Question title]: How to select text without the HTML markup
[Posted]: 2015-06-06 13:40:38
[Question description]:

I am working on a web scraper (in Python), so I have a chunk of HTML that I am trying to extract text from. One of the snippets looks like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this class. Right now, I could use something like

//p[@class='something']//text()

but this leads to each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all the text in one element, like this:

This class has some text and a few links in it.

Is there an easy or elegant way to achieve this?

Edit: Here is the code that produces the result described above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

[Question comments]:

Which HTML parsing library are you using?
I am using lxml; I have updated the question.

Tags: python html xpath web-scraping lxml

[Solution 1]:

An alternative one-liner for the original code: use join with an empty-string separator:

print("".join(query_results))
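Joining every result at once merges the text of all matching elements into one string. When a page has several matching paragraphs, one option is to select the elements first and join the text nodes of each one separately. The following is a minimal sketch of that idea; the two-paragraph input and the name `paragraphs` are assumptions made for illustration and are not part of the original question.

```python
from lxml import html

# Hypothetical input: the question's snippet plus a second matching paragraph.
html_snippet = (
    '<div>'
    '<p class="something">This class has some <strong>text</strong> and a few '
    '<a href="http://www.example.com">links</a> in it.</p>'
    '<p class="something">A <em>second</em> paragraph.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Select the elements first, then join the text nodes of each element
# on its own so that separate paragraphs stay separate.
paragraphs = [
    "".join(p.xpath(".//text()"))
    for p in tree.xpath("//p[@class='something']")
]

for text in paragraphs:
    print(text)
```

The relative query `.//text()` is evaluated against each `p` element, so each list entry holds the text of exactly one paragraph.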

[Comments]:

[Solution 2]:

You can call .text_content() on the lxml element instead of fetching the text with XPath.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
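text_content() returns the text exactly as it appears in the markup, including any line breaks and indentation. If the real pages are formatted across several lines, it may help to collapse the whitespace afterwards. The sketch below assumes such an input; the helper name `clean_text` and the multi-line snippet are hypothetical and only illustrate the idea.

```python
from lxml import html

# Hypothetical input with line breaks and indentation inside the paragraph.
html_snippet = '''<p class="something">
    This class has some <strong>text</strong>
    and a few <a href="http://www.example.com">links</a> in it.
</p>'''

tree = html.fromstring(html_snippet)


def clean_text(element):
    """Return the element's text with runs of whitespace collapsed."""
    # text_content() concatenates all descendant text; split() and join()
    # then reduce newlines and repeated spaces to single spaces.
    return " ".join(element.text_content().split())


cleaned = [clean_text(p) for p in tree.xpath("//p[@class='something']")]
print(cleaned)
```

The split/join step is plain Python string handling, so it works the same way on the output of any of the approaches in this thread.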

[Comments]:

[Solution 3]:

You can use normalize-space() in the XPath. Then

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it.
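One caveat worth keeping in mind: in XPath 1.0, a string function applied to a node-set uses only the first node in document order, so this query returns the text of the first matching paragraph only. If several paragraphs match, a possible workaround is to evaluate normalize-space(.) relative to each element. The sketch below illustrates this with a made-up two-paragraph input; the names `first_only` and `per_element` are assumptions for the example.

```python
from lxml import html

# Hypothetical input with two matching paragraphs.
html_snippet = (
    '<div>'
    '<p class="something">First   <strong>paragraph</strong>.</p>'
    '<p class="something">Second <em>paragraph</em>.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Applied to a node-set, normalize-space() converts only the first node.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating it relative to each element covers every match.
per_element = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath("//p[@class='something']")
]

print(first_only)
print(per_element)
```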

[Comments]: