Get div text with XPath, including link text

【Question title】: Get div text with xpath, including link text
【Posted】: 2018-05-29 02:33:38
【Question】:

What XPath selector returns the full text of a Tweet div as a single value, including the text of any links?

//*[contains(@class, 'tweet-text')][2]/text()

The above works for divs without links, but when a tweet contains a link it returns only the first string segment.

【Comments】:

Can you share the URL you are testing against? Please update your question with that information.

Tags: html xpath

【Answer 1】:

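"The above works for divs without links, but when a tweet contains a link it returns only the first string segment."

That is because of the /text() part: it matches only the text nodes that are direct children of the div. To match all text nodes inside the element, at any depth, use //text() instead: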

//*[contains(@class, 'tweet-text')][2]//text()

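HTML parsers usually do this automatically when you ask for a node's "text" value: they recursively visit all child nodes, collect each one's text, and join the results.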
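The recursive collection described above can be sketched with only the Python standard library. This is an illustration of the idea, not how any particular parser is implemented; collect_text and SNIPPET are hypothetical names introduced for this example, and the snippet is a simplified, well-formed stand-in for a real tweet div.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Recursively gather every text fragment inside element, in document order."""
    parts = []
    if element.text:                      # text before the first child
        parts.append(element.text)
    for child in element:
        parts.extend(collect_text(child)) # text inside the child, at any depth
        if child.tail:                    # text that follows the child's closing tag
            parts.append(child.tail)
    return parts


# Hypothetical sample markup (must be well-formed XML for ElementTree).
SNIPPET = '<div>div text here <a href="https://google.com">link text</a></div>'
div = ET.fromstring(SNIPPET)

fragments = collect_text(div)   # one entry per text node
full_text = "".join(fragments)  # the joined "text value" of the div
print(fragments)
print(full_text)
```

The standard library's Element.itertext() walks the tree in the same order, so "".join(div.itertext()) gives the same string without a hand-written helper.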

A demonstration of all of the above, using Python and the lxml parser:

In [1]: from lxml.html import fromstring 

In [2]: html = """
    ...: <div>
    ...:     div text here
    ...:     <a href="https://google.com">link text</a>
    ...: </div>"""

In [3]: root = fromstring(html)

In [4]: root.xpath('//div/text()')  # <- No text of the a element
Out[4]: ['\n    div text here\n    ', '\n']

In [5]: root.xpath('//div//text()')  # <- We've got all the texts now
Out[5]: ['\n    div text here\n    ', 'link text', '\n']

In [6]: root.xpath("//div")[0].text_content()  # <- but this would do that for us
Out[6]: '\n    div text here\n    link text\n'
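Since the question asks for a single return value, here is a hedged sketch of three ways to get one string with lxml, reusing the same sample markup. It relies on the XPath 1.0 functions string() and normalize-space(), which were not part of the original answer; whether they are usable depends on your tool accepting XPath expressions that evaluate to a string rather than to a node list.

```python
from lxml.html import fromstring

# Same sample markup as in the session above.
html = """
<div>
    div text here
    <a href="https://google.com">link text</a>
</div>"""
root = fromstring(html)

# Option 1: select every descendant text node, then join them in Python.
joined = "".join(root.xpath("//div//text()"))

# Option 2: string() returns the string-value of the first matched node,
# i.e. the concatenation of all of its descendant text nodes.
as_string = root.xpath("string(//div)")

# Option 3: normalize-space() does the same, then trims the ends and
# collapses each run of whitespace into a single space.
normalized = root.xpath("normalize-space(//div)")

print(repr(joined))
print(repr(as_string))
print(repr(normalized))
```

Note that string() and normalize-space() only look at the first node the inner expression matches, so on a page with several tweets you would still need a predicate such as the [2] in the question to pick the div you want.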

【Comments】: