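【Posted】: 2015-11-29 14:27:18
【Question】:
I'm trying to scrape a set of pages using Mechanize and JRuby. I'm using JRuby for multithreading because the program is a bit slow on MRI. However, I've run into problems with what appear to be non-thread-safe data types in Mechanize and the http-cookie gem. In particular, I get errors like this: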
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
push at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/history.rb:28
add_to_history at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1290
get at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:441
(root) at main.rb:82
open_uri at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:150
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:678
open at /Users/user/.rvm/rubies/jruby-1.7.19/lib/ruby/1.9/open-uri.rb:33
(root) at main.rb:80
The seemingly offending code in Mechanize is here:
def push(page, uri = nil)
  super page
  index = uri ? uri : page.uri
  @history_index[index.to_s] = page # offending line
  shift while length > @max_size if @max_size
  self
end
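When I comment out the code in lib/mechanize.rb that adds visited pages to the history, that particular error goes away and is replaced by a very similar error concerning the http-cookie gem: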
RuntimeError: can't add a new key into hash during iteration
[]= at org/jruby/RubyHash.java:991
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar/hash_store.rb:56
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:108
add at (eval):3
add at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie_jar.rb:22
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:192
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:322
scan_set_cookie at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie/scanner.rb:212
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:281
tap at org/jruby/RubyKernel.java:1886
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie.rb:280
parse at (eval):3
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/cookie.rb:37
parse at /Users/user/.rvm/gems/jruby-1.7.19/gems/http-cookie-1.0.2/lib/http/cookie_jar.rb:191
save_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:857
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:845
each at org/jruby/RubyArray.java:1613
response_cookies at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:844
fetch at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:282
post_form at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:1281
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize.rb:548
submit at /Users/user/.rvm/gems/jruby-1.7.19/gems/mechanize-2.7.3/lib/mechanize/form.rb:223
(root) at main.rb:92
There is a very similar thing going on in http-cookie:
def add(cookie)
  path_cookies = ((@jar[cookie.domain] ||= {})[cookie.path] ||= {})
  path_cookies[cookie.name] = cookie # offending line
  cleanup if (@gc_index += 1) >= @gc_threshold
  self
end
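Once again, when I comment out the code in http-cookie that adds the cookie, the error goes away. But then my program stops scraping data correctly, presumably because I've removed functionality from the gems I'm using. The strangest part is that the program only fails after scraping a certain number of pages, so I'm wondering whether I'm doing something wrong. I would share my code, but it's a somewhat private program, and I'd rather share only the parts that are needed. By the way, my program runs correctly on MRI, albeit a bit slowly.
So, I guess my question is: are Mechanize and its dependencies incompatible with multithreading in JRuby, or am I doing something wrong?
【Comments】: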
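To illustrate the pattern in question without depending on Mechanize itself, here is a minimal sketch in plain Ruby. It assumes the root cause is several threads writing to one unsynchronized Hash, which the stack traces suggest but do not prove. `SynchronizedHistory` is a hypothetical stand-in written for this example; it is not Mechanize's `History` class and not part of http-cookie.

```ruby
# Hypothetical stand-in for a history-like object (NOT Mechanize's class).
# Every read and write of the shared Hash goes through one Mutex, so only
# one thread at a time can touch it.
class SynchronizedHistory
  def initialize
    @index = {}
    @lock = Mutex.new
  end

  # Store a page under its URI while holding the lock.
  def push(uri, page)
    @lock.synchronize { @index[uri] = page }
    self
  end

  # Read the size while holding the lock as well.
  def size
    @lock.synchronize { @index.size }
  end
end

history = SynchronizedHistory.new

# Four worker threads, each adding 250 distinct keys to the shared object.
threads = 4.times.map do |t|
  Thread.new do
    250.times do |i|
      history.push("http://example.com/#{t}/#{i}", "page #{t}-#{i}")
    end
  end
end
threads.each(&:join)

puts history.size
```

The other common option, under the same assumption, is to avoid sharing altogether and give each thread its own agent object, so that no history or cookie jar is ever written from two threads.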
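- Java's collections throw a ConcurrentModificationException when you modify a collection while iterating over it with an Iterator. This looks more like a JRuby issue than a concurrency problem.
- @sschmeck What do you mean? I don't see any ConcurrentModificationException errors.
- @sschmeck Oh, never mind, now I see what you mean. I didn't know about Java's ConcurrentModificationException before.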
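As a small illustration of the comparison made in the comments: Ruby has its own counterpart to that Java behaviour, and it can be triggered in a single thread. The sketch below uses only a plain Hash. It does not show how the error arises inside Mechanize, only that adding a key to a Hash while it is being iterated raises a RuntimeError of this kind.

```ruby
h = { "a" => 1 }
message = nil

begin
  # Adding a NEW key while the hash is being iterated is rejected by Ruby.
  h.each { |_key, _value| h["b"] = 2 }
rescue RuntimeError => e
  message = e.message
end

puts message
```

In a multithreaded program, one thread iterating over a shared Hash while another inserts into it is one plausible way to hit the same check, which may explain why the error only shows up after a number of pages have been fetched.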
Tags: ruby multithreading web-scraping jruby mechanize-ruby