【Question title】:Altering text of links, then clicking on them in Ruby with Mechanize
【Posted】:2012-08-23 18:06:53
【Question description】:

I managed to fill out a form with Mechanize and get back a list of links. Part of the result looks like this:

[
  #<Mechanize::Page::Link "View" "/cgi-bin/dcdev/forms/C00508200/800329/">,
  #<Mechanize::Page::Link "View" "/cgi-bin/dcdev/forms/C00487363/800634/">,
  #<Mechanize::Page::Link "View" "/cgi-bin/dcdev/forms/C00498097/800463/">
] 

I can't figure out what to do next.

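  1. The pages I need to scrape are not those links themselves but the same links with sa/ALL appended, for example: /cgi-bin/dcdev/forms/C00508200/800329/sa/ALL. How do I add sa/ALL to the end of each link?
  2. Then, how do I click each corrected link and save the resulting page? With a loop?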

【Comments】:

    Tags: ruby mechanize


    【Solution 1】:

    This is how you fish for them:

    require 'nokogiri'
    
    doc = Nokogiri::HTML(<<EOT)
    <html>
      <body>
        <a href="/cgi-bin/dcdev/forms/C00508200/800329/">View</a>
        <a href="/cgi-bin/dcdev/forms/C00487363/800634/">View</a>
        <a href="/cgi-bin/dcdev/forms/C00498097/800463/">View</a>
      </body>
    </html>
    EOT
    
    # The hrefs already end in "/", so append 'sa/ALL' (not '/sa/ALL') to avoid a double slash.
    hrefs = doc.search('a').map { |a| a['href'] + 'sa/ALL' }
    

    Mechanize uses Nokogiri internally as its HTML parser. After fetching a page, you can get at the document Mechanize uses via page.parser:

    require 'mechanize'
    
    agent = Mechanize.new
    page = agent.get('http://www.example.net')
    

    Proof that we're dealing with a Nokogiri document:

    page.parser.class # => Nokogiri::HTML::Document < Nokogiri::XML::Document
    

    Get the links in the page to operate on:

    page.parser.search('a').map(&:to_html)
    

    Which returns:

    [
        [ 0] "<a href=\"/\"><img src=\"/_img/iana-logo-pageheader.png\" alt=\"Homepage\"></a>",
        [ 1] "<a href=\"/domains/\">Domains</a>",
        [ 2] "<a href=\"/numbers/\">Numbers</a>",
        [ 3] "<a href=\"/protocols/\">Protocols</a>",
        [ 4] "<a href=\"/about/\">About IANA</a>",
        [ 5] "<a href=\"/go/rfc2606\">RFC 2606</a>",
        [ 6] "<a href=\"/about/\">About</a>",
        [ 7] "<a href=\"/about/presentations/\">Presentations</a>",
        [ 8] "<a href=\"/about/performance/\">Performance</a>",
        [ 9] "<a href=\"/reports/\">Reports</a>",
        [10] "<a href=\"/domains/\">Domains</a>",
        [11] "<a href=\"/domains/root/\">Root Zone</a>",
        [12] "<a href=\"/domains/int/\">.INT</a>",
        [13] "<a href=\"/domains/arpa/\">.ARPA</a>",
        [14] "<a href=\"/domains/idn-tables/\">IDN Repository</a>",
        [15] "<a href=\"/protocols/\">Protocols</a>",
        [16] "<a href=\"/numbers/\">Number Resources</a>",
        [17] "<a href=\"/abuse/\">Abuse Information</a>",
        [18] "<a href=\"http://www.icann.org/\">Internet Corporation for Assigned Names and Numbers</a>",
        [19] "<a href=\"mailto:iana@iana.org?subject=General%20website%20feedback\">iana@iana.org</a>"
    ]
    

    Catch them and eat them, appending the suffix to every href:

    links = page.parser.search('a').map{ |a| a['href'] + 'sa/ALL' }
    [
        [ 0] "/sa/ALL",
        [ 1] "/domains/sa/ALL",
        [ 2] "/numbers/sa/ALL",
        [ 3] "/protocols/sa/ALL",
        [ 4] "/about/sa/ALL",
        [ 5] "/go/rfc2606sa/ALL",
        [ 6] "/about/sa/ALL",
        [ 7] "/about/presentations/sa/ALL",
        [ 8] "/about/performance/sa/ALL",
        [ 9] "/reports/sa/ALL",
        [10] "/domains/sa/ALL",
        [11] "/domains/root/sa/ALL",
        [12] "/domains/int/sa/ALL",
        [13] "/domains/arpa/sa/ALL",
        [14] "/domains/idn-tables/sa/ALL",
        [15] "/protocols/sa/ALL",
        [16] "/numbers/sa/ALL",
        [17] "/abuse/sa/ALL",
        [18] "http://www.icann.org/sa/ALL",
        [19] "mailto:iana@iana.org?subject=General%20website%20feedbacksa/ALL"
    ]
    

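    Note that a blind append also mangles entries such as the mailto: link and hrefs without a trailing slash. Figuring out which links to apply it to, and how to fetch the results, is left as an exercise for you.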
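
One possible way to approach that exercise is sketched below, using only Ruby's standard `uri` library. The helper name `with_suffix` and its skip rules are assumptions made for illustration, not part of the original answer: it ignores anything with a scheme or host (mailto:, links to other sites) and adds a slash before the suffix when the path lacks one.

```ruby
require 'uri'

# Hypothetical helper: returns the href's path with the suffix appended, or
# nil when the href is not a site-relative link we want to touch.
# Note: any query string on the href is dropped in this sketch.
def with_suffix(href, suffix = 'sa/ALL')
  return nil if href.nil? || href.empty?

  uri = URI.parse(href)
  return nil if uri.scheme || uri.host # skip mailto:, http://other-site/, ...

  path = uri.path
  path += '/' unless path.end_with?('/') # avoid "/go/rfc2606sa/ALL"
  path + suffix
rescue URI::Error
  nil # unparsable hrefs are skipped as well
end

hrefs = [
  '/cgi-bin/dcdev/forms/C00508200/800329/',
  '/go/rfc2606',
  'http://www.icann.org/',
  'mailto:iana@iana.org?subject=General%20website%20feedback'
]

puts hrefs.map { |h| with_suffix(h) }.compact
# /cgi-bin/dcdev/forms/C00508200/800329/sa/ALL
# /go/rfc2606/sa/ALL
```

Whether off-site links should really be skipped depends on the page being scraped; adjust the guard to taste.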

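    【Comments】:

    • Thanks, this works well, but I can't figure out how to refine the "search" criteria any further. As you can probably see above, my criteria are 'a' and text='View'. I'll keep working on it, though!
    【Solution 2】: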
    page.links.each do |link|
      agent.get(link.href + 'sa/ALL').save
    end
    
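
The loop above visits every link on the page. A sketch that narrows it to the "View" links and picks a file name for each saved page might look like the following. The helper name `save_view_pages` and the file-naming scheme are hypothetical; `links_with(text: ...)`, `Page#uri`, `agent.get` and `save(filename)` are the Mechanize calls it relies on. The method is written against duck-typed `agent` and `page` arguments so it can be tried without network access.

```ruby
require 'uri'

# Hypothetical helper: for every link whose text is exactly "View", fetch
# "<href>sa/ALL" and save the response. `agent` should behave like a
# Mechanize agent and `page` like a Mechanize::Page. Returns the file names.
def save_view_pages(agent, page, suffix: 'sa/ALL')
  page.links_with(text: 'View').map do |link|
    next if link.href.nil? # anchors without an href cannot be followed

    # Resolve the modified href against the URL of the page it came from.
    url = page.uri.merge(link.href + suffix)

    # Name the file after the last two path segments,
    # e.g. "C00508200_800329.html" (an arbitrary choice for this sketch).
    filename = link.href.split('/').last(2).join('_') + '.html'

    agent.get(url).save(filename)
    filename
  end.compact
end

# Usage against a live site (not run here):
#   require 'mechanize'
#   agent = Mechanize.new
#   page  = agent.get('http://www.example.net')
#   save_view_pages(agent, page)
```

If the link text carries extra whitespace, passing a regular expression such as `text: /View/` to `links_with` is a looser alternative.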

    【Comments】:
