Getting the href of an <a> tag with Nokogiri: answers

【Question title】: Get the link name of href <a> tag nokogiri
【Posted】: 2016-02-01 04:50:58
【Question description】:

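I'm scraping some data laid out as /h2/a, where the href of each a should contain http://www.thedomain.com. All the links look like thedomain.com/test and so on. Right now I only get the link text, not the href itself.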

For example:

<h2>
<a href="http://www.thedomain.com/test">Hey there</a>
<a href="http://www.thedomain.com/test1">2nd link</a>
<a href="http://www.thedomain.com/test2">3rd link</a>
</h2>

Here is my code:

html_doc.xpath('//h2/a[contains(@href, "http://www.thedomain.com")]/text()')

This returns the link text: Hey there, 2nd link, 3rd link

What I want instead is http://www.thedomain.com/test and so on.

【Question discussion】:

Tags: html ruby html-parsing nokogiri

【Solution 1】:

Just select @href instead of text():

//h2/a[contains(@href, "http://www.thedomain.com")]/@href

【Discussion】:

【Solution 2】:

You can also use a CSS selector for this, which may be easier to work with than XPath in this case. Select the <a> elements under an h2 with:

html_doc.css('h2 a')

Here is a complete working version of the code:

require 'nokogiri'

# Sample document from the question.
html = <<EOT
<html>
    <h2>
        <a href="http://www.thedomain.com/test">Hey there</a>
        <a href="http://www.thedomain.com/test1">2nd link</a>
        <a href="http://www.thedomain.com/test2">3rd link</a>
    </h2>
</html>
EOT

html_doc = Nokogiri::HTML(html)
# Print the href attribute of every <a> inside an <h2>.
html_doc.css('h2 a').each { |link| p link['href'] }
# Output:
# "http://www.thedomain.com/test"
# "http://www.thedomain.com/test1"
# "http://www.thedomain.com/test2"

【Discussion】: