【Posted】: 2014-09-11 07:29:12
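【Question】:
Ruby n00b here. I'm trying to scrape a p tag from each URL stored in a CSV file and write the scraped content, along with its URL, to a new file (myResults.csv). However, I keep getting an "invalid byte sequence in UTF-8 (ArgumentError)" error, which seems to suggest the URLs are invalid? (They are all of the standard form 'http://www.exmaple.com/page' and work in my browser.)
I've tried .parse and .encode as suggested in similar threads here, but no luck. Thanks for reading.
Code: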
require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|

    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
puts "scraping done!"
Error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
【Discussion】:
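The trace shows the ArgumentError being raised inside URI.escape (which URI.encode calls) on a line read from listOfURLs.xls, so it may be worth checking whether those lines are valid UTF-8 text in the first place. If the file is a binary Excel workbook rather than plain text, that would be one plausible cause, though the post does not confirm it. Below is a minimal sketch of such a check, assuming Ruby 2.1+ (String#scrub is not available in the 2.0.0 used in the post); the helper name clean_url and the sample lines are hypothetical and only for illustration.

```ruby
# Hypothetical helper (not from the original post): returns a cleaned URL
# string, or nil when the line does not look like an http(s) URL.
def clean_url(raw_line)
  line = raw_line.dup.force_encoding('UTF-8')
  # String#scrub (Ruby 2.1+) handles invalid byte sequences; passing ''
  # drops them instead of raising later in gsub/URI.escape.
  line = line.scrub('').strip
  line.start_with?('http://', 'https://') ? line : nil
end

# Made-up sample input standing in for lines read from the URL list.
# The second entry contains bytes that are not valid UTF-8.
sample_lines = [
  "http://www.exmaple.com/page\n",
  "\xD0\xCF\x11\xE0 binary junk\n".b,
  "http://www.exmaple.com/other\r\n"
]

# Report which lines are valid UTF-8 before any cleaning.
sample_lines.each do |raw|
  valid = raw.dup.force_encoding('UTF-8').valid_encoding?
  puts "valid UTF-8? #{valid}"
end

# Keep only the lines that survive cleaning.
urls = sample_lines.map { |raw| clean_url(raw) }.compact
puts urls
```

In the script from the question, each line yielded by File.foreach could be passed through a helper like this before URI.parse and open, skipping any nil results. If most lines come back nil, the input file is probably not a plain-text list of URLs.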
Tags: ruby-on-rails ruby web-scraping nokogiri export-to-csv