Screen scraping through Nokogiri or Hpricot: answers

【Question title】: Screen scraping through nokogiri or hpricot
【Posted】: 2011-10-17 12:08:25
【Question description】:

I am trying to get the actual value at a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))
desc "Trying to get the value of given xapth"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error" 
  end
end

The output is:

View more issues ..

When I try to get the value for a different XPath, for example:
/html/body/div[4]/div[3]/h1/span
I get the "error" message instead.

I tried this with Nokogiri. I don't know why it returns results for only a few XPaths.

I also tried it in Hpricot:
http://hpricot.com/demonstrations

I pasted my URL and XPath there. For
//*[@id="view_more"]
I see the result
View more issues ..
[This text appears at the bottom of the recent issues heading.]

But no result is shown for the following:
/html/body/div[4]/div[3]/h1/span
For this XPath I expect the result Bad.
[It appears at http://www.changebadtogood.com/ as the first heading of the class="hero-unit" div.]

【Comments】:

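There are a lot of problems here. You did not include the code that raises the error. What good does it do to catch the error and print "error"? Let the error surface so that you can debug it. And you should fix your indentation before posting a question.
Also, you have asked 14 questions but have not accepted a single answer. I have answered your question below, but I urge you to revisit your previously-asked questions, find the answer that best answers each one (if there is one), and accept it (click the checkmark).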

Tags: ruby xpath screen-scraping nokogiri hpricot

【Solution 1】:

Your problem is a bad XPath selector; it has nothing to do with Nokogiri or Hpricot. Let's investigate:

irb:01:0> require 'nokogiri'; require 'open-uri'
#=> true
irb:02:0> doc = Nokogiri::HTML(open('http://www.changebadtogood.com/')); nil
#=> nil
irb:03:0> doc.xpath('//*[@id="view_more"]').each{ |link| puts link.content }
View more issues ..
#=> 0
irb:04:0> doc.at('#view_more').text  # Simpler version of the above.
#=> "View more issues .."
irb:05:0> doc.xpath('/html/body/div[4]/div[3]/h1/span')
#=> []
irb:06:0> doc.xpath('/html/body/div[4]')
#=> []
irb:07:0> doc.xpath('/html/body/div').length
#=> 2

From this we can see that only two divs are children of the <body> element, so div[4] has nothing to select.

It looks like you are trying to select the span here:

<h1 class="landing_page_title">
  Change <span style='color: #808080;'>Bad</span> To Good
</h1>

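Instead of relying on the fragile markup that led to this problem (an anonymous hierarchy of indexed elements), use the document's semantic structure to get a simpler and more robust selector. Using either CSS or XPath syntax: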

irb:08:0> doc.at('h1.landing_page_title > span').text
#=> "Bad"
irb:09:0> doc.at_xpath('//h1[@class="landing_page_title"]/span').text
#=> "Bad"

【Discussion】: