【Question title】: How to scrape a site every hour with nokogiri
【Posted】: 2014-01-06 20:48:33
【Question】:

Below is the scraper code I wrote. I need help adding a delay to this scraper: I would like it to scrape a page once every hour.

require 'open-uri'
require 'json'      # needed for JSON.parse in #scrape
require 'nokogiri'
require 'sanitize'

class Scraper

    def initialize(url_to_scrape)
        @url = url_to_scrape
    end

    def scrape
        # TO DO: change to JSON
        # page = Nokogiri::HTML(open(@url)) 
        puts "Initiating scrape..."
        # URI.open comes from open-uri; plain Kernel#open no longer accepts URLs on Ruby 3.0+
        raw_response = URI.open(@url)
        json_response = JSON.parse(raw_response.read)
        page = Nokogiri::HTML(json_response["html"]) 

        # your page should now be a hash. You need the page["html"]

        # Change this to parse the a tags with the class "article_title"
        # and build the links array for each href in these article_title links
        puts "Scraping links..."
        links = page.css(".article_title")
        articles = []

        # everything else here should work fine.
        # Limit the number of links to scrape for testing phase
        puts "Building articles collection..."
        links.each do |link|
            article_url = "http://seekingalpha.com" + link["href"]
            article_page = Nokogiri::HTML(URI.open(article_url))
            article = {}
            article[:company] = article_page.css("#about_primary_stocks").css("a")
            article[:content] = article_page.css("#article_content")
            article[:content] = Sanitize.clean(article[:content].to_s)
            unless article[:content].blank?
                articles << article
            end
        end

        puts "Clearing all existing transcripts..."
        Transcript.destroy_all
        # Iterate over the articles collection and save each record into the database
        puts "Saving new transcripts..."
        articles.each do |article|
            transcript = Transcript.new
            transcript.stock_symbol = article[:company].text.to_s
            transcript.content = article[:content].to_s
            transcript.save
        end

        #return articles
    end

end

【Comments】:

    Tags: ruby web-scraping nokogiri


    【Solution 1】:

    So, once you have finished scraping, what do you do with the articles array?

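    I'm not sure whether this is what you're looking for, but I would use cron to schedule this script to run every hour. If your script is part of a larger application, there is a neat gem called whenever that provides a Ruby wrapper for cron tasks.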
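As a rough illustration of the whenever approach, the configuration fragment below is a minimal sketch of what a `config/schedule.rb` could look like. It assumes the `Scraper` class from the question is loadable inside a Rails app (the `Transcript` model suggests one), and the feed URL is a made-up placeholder, not the address the asker actually uses. This file is read by the whenever gem and is not meant to be run on its own.

```ruby
# config/schedule.rb (read by the whenever gem; not a standalone script)

# Run the scraper once an hour.
# `runner` evaluates the given string inside the Rails environment,
# so Scraper and Transcript are available to it.
every 1.hour do
  # The URL is a hypothetical placeholder; substitute the real JSON endpoint.
  runner "Scraper.new('http://example.com/articles.json').scrape"
end
```

Running `whenever --update-crontab` from the application root should then write the matching entry into the current user's crontab. Without the gem, a hand-written crontab line starting with `0 * * * *` followed by the command that launches the script would give the same hourly schedule.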

    Hope that helps.

    【Discussion】:
