Extract a link with Nokogiri from the text of the link? Answers

【Question title】: Extract a link with Nokogiri from the text of link?
【Posted】: 2012-12-15 15:15:31
【Question】:

I want to extract a specific link from a web page with Nokogiri, finding it by its link text:

<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>

I want the href of "site 3", which should return:

http://example.org/site/3/

Or I want the href of "site 1", which should return:

http://example.org/site/1/

How can I do this?

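【Question comments】:

Do you want a substring search or an exact match?
Both. This case is an exact match, but I would also like to know how to get the hrefs of links whose text merely starts with "site", for pages where not every link's text is "site".

Tags: ruby nokogiri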

【Solution 1】:

Original:

require 'nokogiri'

text = <<TEXT
<div class="links">
  <a href='http://example.org/site/1/'>site 1</a>
  <a href='http://example.org/site/2/'>site 2</a>
  <a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT

link_text = "site 1"

doc = Nokogiri::HTML(text)
# Select the href attribute of the <a> whose text is exactly link_text.
p doc.xpath("//a[text()='#{link_text}']/@href").to_s

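Update:

As far as I know, Nokogiri's XPath implementation does not support regular expressions. For a basic starts-with match there is a function called starts-with, which you can use like this (links whose text starts with "s"):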

doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)

【Comments】:

What if I want to get all hrefs whose link text starts with "s" or "si", for example (that is, a regular expression)? What should I do?

【Solution 2】:

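require 'nokogiri'

text = "site 1"

# DATA reads whatever follows __END__ in this script; the HTML from the
# question is assumed to be placed there.
doc = Nokogiri::HTML(DATA)
# Match <a> elements inside div.links whose text contains the search string.
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s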

【Comments】:

【Solution 3】:

Maybe you would prefer CSS-style selection:

doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere

【Comments】:

【Solution 4】:

Just to illustrate another way we can do this in Ruby, using the URI module:

require 'uri'

html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]

# => {
#      1 => "http://example.org/site/1/'",
#      2 => "http://example.org/site/2/'",
#      3 => "http://example.org/site/3/'"
#    }

uris[1]
# => "http://example.org/site/1/'"

uris[3]
# => "http://example.org/site/3/'"

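Behind the scenes, URI.extract uses a regular expression. That is not the most robust way to find links in a page, but it works quite well, because a URI that is going to be useful is normally a string without whitespace.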
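
Note that the results shown above keep the closing single quote from the href attribute, which is one symptom of the regex-based approach. A small sketch of cleaning that up, assuming the links are always wrapped in single quotes as in the question; the scheme list passed to URI.extract is an added restriction, not part of the original answer. Ruby's documentation also marks URI.extract as obsolete, so treat this as an illustration, not a recommendation.

```ruby
require 'uri'

html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Restrict extraction to http/https and drop a trailing single quote, if any.
uris = URI.extract(html, %w[http https]).map { |u| u.delete_suffix("'") }

# Build a 1-based lookup table: position in the page => URI.
by_position = uris.each_with_index.map { |u, i| [i + 1, u] }.to_h

puts by_position[1]
puts by_position[3]
```

Looking links up by position only works while the page order stays fixed; matching on the link text with Nokogiri, as in the other solutions, does not have that limitation.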

【Comments】: