【Question title】: Encoding problems with hpricot
【Posted】: 2010-07-05 12:29:14
【Question】:

I get the following encoding error when trying to scrape a web page with hpricot on Ruby 1.9:

Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

I can reproduce the error as follows:

ska:~ sam$ rvm 1.9.2@hpricot
ska:~ sam$ ruby -v
ruby 1.9.2dev (2010-05-31 revision 28117) [x86_64-darwin10.4.0]
ska:~ sam$ gem list

*** LOCAL GEMS ***

hpricot (0.8.2)
rake (0.8.7)
rdoc (2.5.8)
ska:~ sam$ irb
ruby-1.9.2-preview3 > require 'rubygems'
 => false 
ruby-1.9.2-preview3 > require 'hpricot'
 => true 
ruby-1.9.2-preview3 > require 'open-uri'
 => true 

ruby-1.9.2-preview3 > page = Hpricot(open('http://www.imdb.com/title/tt0435761/'))
 => #<Hpricot::Doc "\n" {doctype "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">"} "\n" {elem <html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"> "\n" {elem <head> "\n" __TRUNCATED__


ruby-1.9.2-preview3 > page.search("//div[@class = 'info-content").collect { |f| f.inner_text }.join(', ')

Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `join'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `block in inner_text'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `map'
        from /Users/sam/.rvm/gems/ruby-1.9.2-preview3@hpricot/gems/hpricot-0.8.2/lib/hpricot/traverse.rb:160:in `inner_text'
        from (irb):5:in `block in irb_binding'
        from (irb):5:in `collect'
        from (irb):5
        from /Users/sam/.rvm/rubies/ruby-1.9.2-preview3/bin/irb:17:in `<main>'
ruby-1.9.2-preview3 > 

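【Comments】:

  • I got it working with Nokogiri.
  • Personally, I recommend Nokogiri over Hpricot, since I have run into far fewer problems with it.
  • Nokogiri is a "drop-in" replacement for hpricot. I suggest switching to it, since _why no longer maintains hpricot.

Tags: ruby character-encoding ruby-1.9 hpricot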


【Answer 1】:

Use Nokogiri.

【Comments】:

    【Answer 2】:

    Try changing the XPath from:

    page.search("//div[@class = 'info-content")

    to:

    page.search('//div[@class=info-content]')

    Running the example in IRB gave me:

    ruby-1.9.1-p378 > page.search("//div[@class=info-content]").map{ |i| i.inner_text }[0]
     => "Down 66% in popularity this week. See why on IMDbPro."
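For comparison, here is a minimal sketch of the same kind of query with the attribute value fully quoted and the predicate closed. It uses REXML rather than hpricot, purely so that it runs without extra gems, so it only illustrates the XPath syntax and says nothing about how hpricot parses the expression. The markup and the names `markup`, `doc`, and `texts` are made up for the example.

```ruby
require 'rexml/document'

# Made-up, well-formed markup standing in for the scraped page.
markup = <<~XML
  <html><body>
    <div class="info-content">First block</div>
    <div class="other">ignored</div>
    <div class="info-content">Second block</div>
  </body></html>
XML

doc = REXML::Document.new(markup)

# The attribute value is wrapped in single quotes and the predicate
# is closed with "]", unlike the expression in the question.
texts = REXML::XPath.match(doc, "//div[@class='info-content']").map(&:text)

puts texts.join(', ')
```

Using double quotes for the Ruby string and single quotes for the XPath literal keeps the two quoting levels easy to tell apart.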

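    【Comments】:

    • You're right, that is a bug, but I still get an encoding error. Maybe I should try 1.9.1.
    • 1.9.1 changed how encodings are handled. I haven't seen a case where 1.9.1 handles text better than 1.8.7, but that may be because I haven't needed to do any conversions recently. In any case, I think 1.9.1 is stable enough, and enough modules work with it, that I use it as my default version.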
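The error class from the question can be reproduced without hpricot, which may help when checking whether a given string is the culprit. The sketch below is plain Ruby and makes two assumptions: that the failing text arrives tagged as ASCII-8BIT, and that its bytes are actually valid UTF-8. The helper `as_utf8` and the sample strings are hypothetical and are not part of hpricot.

```ruby
# encoding: utf-8

# Bytes for "café" tagged as binary (ASCII-8BIT), as raw network data often is.
raw_bytes = "caf\xC3\xA9".dup.force_encoding(Encoding::ASCII_8BIT)
# A UTF-8 string that also contains a non-ASCII character.
utf8_text = "na\u00EFve"

# Joining fails because both strings hold non-ASCII bytes
# under different, incompatible encodings.
begin
  [raw_bytes, utf8_text].join(', ')
  join_failed = false
rescue Encoding::CompatibilityError
  join_failed = true
end

# Hypothetical helper: relabel a string as UTF-8 when its bytes are
# valid UTF-8, otherwise return the string unchanged.
def as_utf8(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  copy.valid_encoding? ? copy : str
end

joined = [raw_bytes, utf8_text].map { |s| as_utf8(s) }.join(', ')

puts join_failed
puts joined
```

`force_encoding` only changes the label on the bytes and does not convert them, so this approach is appropriate only when the data really is UTF-8. For text in another encoding, `String#encode` with an explicit source encoding would be the tool to reach for.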