【Question title】: Scrapy question: why can I parse only the first record on each page instead of the whole page?
【Posted】: 2015-10-27 16:22:12
【Question】:

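I'm new to Scrapy and am trying to follow an example (link http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VcFiAjBVhBc) to scrape Craigslist.

However, every time I run my code I get only the first record on each page. Sample output is below; it contains only the first record from each page: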

link,title
/eby/npo/5155561393.html,Residential Administrator full time
/sfc/npo/5154403251.html,Sr. Director of Family Support Services
/eby/npo/5150280793.html,Veterans Program Internship
/eby/npo/5157174843.html,PROTECT OUR LIFE SAVING MEDICINE! $10-15/H
/eby/npo/5143949422.html,Program Supervisor - Multisystemic Therapy (MST)
/sby/npo/5145782515.html,Housing Specialist -- Santa Clara and Alameda Counties
/nby/npo/5148193893.html,Shipping Assistant for Non Profit
/sby/npo/5142160649.html,Companion for People with Developmental Disabilities
/sfc/npo/5139127862.html,Director of Vocational Services

I run the code with "scrapy crawl craig2 -o items_2.csv -t csv". Thanks in advance for your help.

The code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider#, Rule
from scrapy.selector import HtmlXPathSelector

from scrapy.http import Request
class CraigslistSampleItem(Item):
    title = Field()
    link = Field()



class MySpider(CrawlSpider):
    name = "craig2"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/"]

   # rules = (Rule (SgmlLinkExtractor(allow=("index\d00\.html", ),restrict_xpaths=('//p[@class="button next"]',))
   # , callback="parse_items", follow= True),
    #)


    def start_requests(self):
            for i in range(9):
                yield Request("http://sfbay.craigslist.org/search/npo?s=" + str(i) + "00" , self.parse_items)


    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        items = []
        for ii in titles:
            item = CraigslistSampleItem()
            item ["title"] = ii.select("a/text()").extract()
            item ["link"] = ii.select("a/@href").extract()
            items.append(item)
            return(items)

【Comments】:

    Tags: javascript python html scrapy


    【Solution 1】:

    Try the following code:

    class MySpider(CrawlSpider):
        name = "craig2"
        allowed_domains = ["sfbay.craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/search/npo?s=%s" % i for i in xrange(1,9)]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select('//span[@class="pl"]')
            items = []
            for ii in titles:
                item = CraigslistSampleItem()
                item ["title"] = ii.select("a/text()").extract()
                item ["link"] = ii.select("a/@href").extract()
                items.append(item)
                yield item
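
One detail worth checking in the list comprehension above: it formats `i` directly, so it requests `s=1` through `s=8`, whereas the question's `start_requests` appends "00" to `i`. A minimal sketch of building hundred-step offsets, assuming the `s` parameter is a result offset and that each page holds 100 results (the question's code implies this, but the source does not confirm it; `BASE_URL` and `PAGE_SIZE` are illustrative names):

```python
# Hypothetical constants for illustration; not part of the original answer.
BASE_URL = "http://sfbay.craigslist.org/search/npo?s=%d"
PAGE_SIZE = 100  # assumption: 100 results per page

# Nine pages, with offsets 0, 100, ..., 800.
start_urls = [BASE_URL % (page * PAGE_SIZE) for page in range(9)]

print(start_urls[0])
print(start_urls[-1])
```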
    

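    【Comments】:

      【Solution 2】:

      The problem with your code is that return(items) sits inside the for loop. That means the method returns as soon as the first title has been processed, so even if a page has 100 titles you only get the first one. Moving return(items) one indentation level to the left fixes it: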

      def parse_items(self, response):
          hxs = HtmlXPathSelector(response)
          titles = hxs.select('//span[@class="pl"]')
          items = []
          for ii in titles:
              item = CraigslistSampleItem()
              item ["title"] = ii.select("a/text()").extract()
              item ["link"] = ii.select("a/@href").extract()
              items.append(item)
          return(items)
      

      Note that here return(items) is at the same indentation level as the for loop, not inside it. On my machine this produces 900 entries in the CSV output.
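
The effect of the indentation can be reproduced without Scrapy. The following is a minimal sketch in plain Python; `first_only`, `all_items`, and the sample titles are illustrative stand-ins for the spider callback and the selector results, not part of the original spider:

```python
def first_only(titles):
    """Mimics the buggy callback: return is inside the for loop."""
    items = []
    for title in titles:
        items.append({"title": title})
        return items  # exits the function during the first iteration


def all_items(titles):
    """Mimics the fixed callback: return is after the for loop."""
    items = []
    for title in titles:
        items.append({"title": title})
    return items  # runs once, after every title has been collected


titles = [
    "Residential Administrator full time",
    "Veterans Program Internship",
    "Director of Vocational Services",
]

print(len(first_only(titles)))  # 1
print(len(all_items(titles)))   # 3
```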

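      Ooorza's solution also works, but you don't need all of it. The fix there is to yield each item inside the loop, which turns parse_items into a generator function that hands each parsed item on for further processing. In that case you don't need to append the current item to a list. The parse_items method then looks like this: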

      def parse_items(self, response):
          hxs = HtmlXPathSelector(response)
          titles = hxs.select('//span[@class="pl"]')
          for ii in titles:
              item = CraigslistSampleItem()
              item ["title"] = ii.select("a/text()").extract()
              item ["link"] = ii.select("a/@href").extract()
              yield item
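
To see why yield does not need a list, here is a small sketch in plain Python, again without Scrapy. `parse_titles` is a hypothetical stand-in for the callback, and the consumer loop stands in for whatever processes the items downstream:

```python
import inspect


def parse_titles(titles):
    """Generator version: hands back one item per loop iteration."""
    for title in titles:
        yield {"title": title}


gen = parse_titles(["Housing Specialist", "Shipping Assistant for Non Profit"])

# Calling the function only creates a generator; no loop body has run yet.
print(inspect.isgenerator(gen))  # True

# The consumer pulls items one at a time until the loop is exhausted.
collected = [item["title"] for item in gen]
print(collected)
```

Unlike return, yield suspends the function and resumes it on the next request, so every title in the loop is eventually produced.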
      

      【Comments】:
