【Question Title】:Why does my scrapy spider not scrape anything?
【Posted】:2016-02-04 04:22:21
【Question】:

I can't figure out where the problem is. It is probably very easy to fix, since I am new to scrapy. Thanks for your help!

My spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["economist.com"]
    start_urls = ['http://www.economist.com/sections/science-technology']

    rules = [
      Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = scrapy.Item()
            item ['title'] = sel.xpath('a/text()').extract()
            item ['link'] = sel.xpath('a/@href').extract()
            item ['desc'] = sel.xpath('text()').extract()
            return item

Items:

import scrapy

class EconomistItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Part of the log:

INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Crawled (200) <GET http://www.economist.com/sections/science-technology> (referer: None)

Edit:

After I applied the changes alecxe suggested, another problem appeared:

Log:

[scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology)
2016-02-04 14:05:01 [scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21689501-beating-go-champion-machine-learning-computer-says-go> (referer: http://www.economist.com/sections/science-technology)
2016-02-04 14:05:02 [scrapy] ERROR: Spider error processing <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/FvH/Desktop/Python/projects/economist/economist/spiders/article.py", line 18, in parse_item
    item = scrapy.Item()
NameError: global name 'scrapy' is not defined

Settings:

BOT_NAME = 'economist'

SPIDER_MODULES = ['economist.spiders']
NEWSPIDER_MODULE = 'economist.spiders'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"

If I try to export the data to a CSV file, the file is, unsurprisingly, just empty.

Thanks

【Question Discussion】:

    Tags: python python-2.7 web-scraping scrapy scrapy-spider


    【Solution 1】:

    parse_item is not indented correctly. It should be:

    class ArticleSpider(CrawlSpider):
        name = "article"
        allowed_domains = ["economist.com"]
        start_urls = ['http://www.economist.com/sections/science-technology']
    
        rules = [
          Rule(LinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
        ]
    
        def parse_item(self, response):
            for sel in response.xpath('//div/article'):
                item = scrapy.Item()
                item ['title'] = sel.xpath('a/text()').extract()
                item ['link'] = sel.xpath('a/@href').extract()
                item ['desc'] = sel.xpath('text()').extract()
                return item
    

    Besides that, there are two more things to fix:

    • The link extraction part should be fixed to match the article links:

      Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True),
      
    • You need to specify the USER_AGENT setting to pretend to be a real browser. Otherwise, the response will not contain the list of articles:

      USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
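
One more detail in the parse_item callback shown above, which this answer does not raise: `return item` sits inside the `for` loop, so the callback would stop after the first matching element if a page has several. The sketch below uses plain Python with hypothetical stand-in data (no scrapy involved) to show the difference between returning and yielding inside a loop; the function names are made up for illustration.

```python
def parse_with_return(articles):
    # Mirrors `return item` inside the loop: the function exits on the first article.
    for article in articles:
        item = {"title": article}
        return item


def parse_with_yield(articles):
    # A generator hands back one item per article.
    for article in articles:
        item = {"title": article}
        yield item


articles = ["first article", "second article"]  # hypothetical stand-in data
returned = parse_with_return(articles)
yielded = list(parse_with_yield(articles))
```

In a spider, the same change would mean replacing `return item` with `yield item` in parse_item, assuming every matching element should produce an item.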
      

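    【Comments】:

    • Thanks alecxe, I added your suggestions, but apparently I did something wrong, because now there is a different error. Thanks
    • @peter You just need import scrapy inside the spider. Or, I think you meant to instantiate the item defined in your items module rather than scrapy.Item()
    【Solution 2】: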

    You only imported Item (not the whole scrapy module):

    from scrapy.item import Item
    

    So instead of using scrapy.Item here:

    for sel in response.xpath('//div/article'):
        item = scrapy.Item()
        item ['title'] = sel.xpath('a/text()').extract()
    

    you should just use Item:

    for sel in response.xpath('//div/article'):
        item = Item()
        item ['title'] = sel.xpath('a/text()').extract()
    

    Or import your own item and use that. This should work (don't forget to replace project_name with your project's name):

    from project_name.items import EconomistItem
    ...
    for sel in response.xpath('//div/article'):
        item = EconomistItem()
        item ['title'] = sel.xpath('a/text()').extract()
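
The NameError in the traceback comes from how `from ... import ...` binds names, and that can be reproduced without scrapy. The sketch below uses the standard-library `fractions` module purely as a stand-in for scrapy; the two function names are hypothetical. As far as I know, a bare Item also declares no fields, so assigning item['title'] on it would likely raise a KeyError, which is another reason to prefer the EconomistItem variant.

```python
# Stand-in for `from scrapy.item import Item`: this binds the name
# `Fraction` only; the module name `fractions` stays undefined here.
from fractions import Fraction


def build_with_module_prefix():
    # Same shape as calling scrapy.Item() without `import scrapy`.
    return fractions.Fraction(1, 2)


def build_with_imported_name():
    # Same shape as calling Item() after `from scrapy.item import Item`.
    return Fraction(1, 2)


try:
    build_with_module_prefix()
    error_message = None
except NameError as exc:
    error_message = str(exc)

value = build_with_imported_name()
```

Here `error_message` ends up holding the NameError text that names the missing module, while the call through the imported name succeeds.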
    

    【Comments】:
