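【Question title】: Scrapy isn't extracting data
【Posted】: 2015-01-20 19:34:41
【Question description】:

Here is my Scrapy code. I want to scrape data from mouthshut.com that sits inside strong tags. I can run the spider and the title fields come back, but they are empty. Why isn't it extracting any data?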

import scrapy
from scrapy.selector import Selector

from shut.items import ShutItem

class criticspider(scrapy.Spider):
    name ="shut"
    allowed_domains =["mouthshut.com"]
    start_urls =["http://www.mouthshut.com/mobile-operators/vodafone-mobile-operator-reviews-925020930"]

    def parse(self,response):
        hxs = Selector(response)
        sites = hxs.select('//li[@class="profile"]')
        items = []
        for site in sites:
            item = ShutItem()
            item['title'] = site.select('//strong[@style=" font-size: 15px;font-weight: 700;"]//a/text()').extract()
            #item['date'] = site.select('div[@class="review_stats"]//div[@class="date"]/text()').extract()
            #item['desc'] = site.select('div[@class="review_body"]//span[@class="blurb blurb_expanded"]/text()').extract()
            items.append(item)
        return items

【Discussion】:

    Tags: python xpath scrapy web-crawler selector


    【Solution 1】:

    You should use a pipeline to export the data coming out of your spider! Here is an example that exports the data to a JSON file:

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # python import
    from scrapy import signals, log
    from scrapy.contrib.exporter import JsonItemExporter
    from datetime import datetime
    import os
    
    # project import
    from items import tgju
    from pymongo import MongoClient
    
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    def get_items(module):
        md = module.__dict__
        return (str(md[c].__name__) for c in md if (isinstance(md[c], type) and md[c].__module__ == module.__name__))
    
    
    class JsonPipeline(object):
        def __init__(self):
            self.files = dict()
            self.exporter = dict()
    
        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline
    
        def spider_opened(self, spider):
            for key in get_items(tgju):
                path = os.path.join('temp', key)
                if not os.path.exists(path):
                    os.makedirs(path)
                self.files[key] = open(os.path.join(path,
                                                    '%s_%s_%s.json' % (spider.name,
                                                                       key.lower(),
                                                                       datetime.now().strftime('%Y%m%dT%H%M%S'))),
                                       'w+b')
    
                self.exporter[key] = JsonItemExporter(self.files[key])
                self.exporter[key].start_exporting()
    
        def spider_closed(self, spider):
            for key in get_items(tgju):
                self.exporter[key].finish_exporting()
                self.files.pop(key).close()
    
        def process_item(self, item, spider):
    
            try:
                log.msg('-----------------%s------------------' % item.__class__.__name__)
                self.exporter[item.__class__.__name__].export_item(item)
            except KeyError:
                pass
            return item
    

    Add this to your settings file:

    ITEM_PIPELINES = {
        'pipelines.JsonPipeline': 800,
    }
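
For comparison, a much smaller pipeline can be sketched with only the standard library. This is a minimal illustration, not the pipeline from the answer above: the class name `JsonLinesPipeline`, the `path` constructor argument and the `items.jl` file name are assumptions made for the example. The method names `open_spider`, `close_spider` and `process_item` are the hooks Scrapy calls on an item pipeline; here they are driven by hand so the snippet runs without Scrapy installed.

```python
import json
import os
import tempfile


class JsonLinesPipeline(object):
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path):
        # `path` is an assumption for this sketch; a real project would
        # usually read the output location from its settings.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and pass it on to any later pipeline.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Drive the pipeline manually with two sample items.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": ["Forget Vodafone"]}, spider=None)
pipeline.process_item({"title": ["Vodafone Sucks"]}, spider=None)
pipeline.close_spider(spider=None)

with open(path) as fh:
    written = [json.loads(line) for line in fh]
```

Note that a pipeline only stores what the spider produces, so it cannot fill in titles that the XPath failed to match.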
    

    Also try to yield each item instead of returning a list.

    Update: also change your spider to this...

    import scrapy
    from scrapy.selector import Selector
    
    from shut.items import ShutItem
    
    class criticspider(scrapy.Spider):
        name ="shut"
        allowed_domains =["mouthshut.com"]
        start_urls =["http://www.mouthshut.com/mobile-operators/vodafone-mobile-operator-reviews-925020930"]
    
        def parse(self,response):
            hxs = Selector(response)
            sites = hxs.select('//li[@class="profile"]')
            for site in sites:
                item = ShutItem()
                item['title'] = site.select('//strong[@style=" font-size: 15px;font-weight: 700;"]//a/text()').extract()
                #item['date'] = site.select('div[@class="review_stats"]//div[@class="date"]/text()').extract()
                #item['desc'] = site.select('div[@class="review_body"]//span[@class="blurb blurb_expanded"]/text()').extract()
                yield item
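
The difference between returning a list and yielding items can be shown in plain Python, without Scrapy. This is a rough sketch; `parse_with_list`, `parse_with_yield` and the sample `rows` are hypothetical names used only for illustration.

```python
def parse_with_list(rows):
    # Builds the whole list before the caller sees any item.
    items = []
    for row in rows:
        items.append({"title": row})
    return items


def parse_with_yield(rows):
    # Hands each item to the caller as soon as it is built.
    for row in rows:
        yield {"title": row}


rows = ["a", "b", "c"]

listed = parse_with_list(rows)   # all three items exist at once
lazy = parse_with_yield(rows)    # nothing has been built yet
first = next(lazy)               # builds and returns only the first item
rest = list(lazy)                # builds the remaining items
```

Both versions produce the same items. The generator simply lets the caller start processing the first item before the last one has been built.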
    

    【Discussion】:

      【Solution 2】:
      def parse(self,response):
          hxs = HtmlXPathSelector(response)
          sites = hxs.select('//div[@class="reviewtitle fl"]')
          for site in sites:
              item = ShutItem()
              item['title'] = site.select('//strong[@style="  font-size: 15px;font-weight: 700;"]/a/text()').extract()
              #item['date'] = site.select('div[@class="review_stats"]//div[@class="date"]/text()').extract()
              #item['desc'] = site.select('div[@class="review_body"]//span[@class="blurb blurb_expanded"]/text()').extract()
              yield item
      

      This works fine.

      2015-01-21 19:06:33+0800 [shut] DEBUG: Scraped from <200 http://www.mouthshut.com/mobile-operators/vodafone-mobile-operator-reviews-925020930>
          {'title': [u'Vodafone 3G - Useless in Bangalore',
                     u'Worst Mobile Operator Ever',
                     u'Worst 3g connectivity of vodafone in bangalore',
                     u'Pathetic Network 3G',
                     u'HOW DO THEY STILL DO BUSINESS WITH SUCH SERVICES!!',
                     u'Bad customer service',
                     u'Vodafone Kolkata \u2013 My worst ever experience.',
                     u'Network connectivity - permanent nemesis',
                     u'VODAFONE MOBILE OPERATOR',
                     u'Beware of Vodafone billing plans',
                     u'Vodafone changed my billing plan without my notice',
                     u'Pathetic service.  They deduct balance unnecessari',
                     u'Worst service from Vodafone',
                     u'Forget Vodafone',
                     u'Vodafone Data Services sucks',
                     u'Outgoing calls has been barred',
                     u'Vodafone Sucks',
                     u'Worst Customer satisfaction I have ever Faced',
                     u'Untrained Customer Care... Seems like headline de',
                     u'3rd Party downloads - shameless way to make money!']}
      

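      Three things to note here: 1. In Scrapy, yielding items is much better than building a list. 2. The li node is not a parent of the strong node. 3. The value of the strong tag's style attribute contains extra whitespace.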
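
Points 2 and 3 can be reproduced on a small document. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the markup in `SAMPLE` is a hypothetical snippet loosely modeled on the structure described above, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <strong> lives under div.reviewtitle, not under li.profile,
# and its style value starts with two spaces.
SAMPLE = """
<root>
  <li class="profile"><span>user info</span></li>
  <div class="reviewtitle fl">
    <strong style="  font-size: 15px;font-weight: 700;"><a>First review</a></strong>
  </div>
  <div class="reviewtitle fl">
    <strong style="  font-size: 15px;font-weight: 700;"><a>Second review</a></strong>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)

ONE_SPACE = " font-size: 15px;font-weight: 700;"
TWO_SPACES = "  font-size: 15px;font-weight: 700;"


def titles(node, style):
    # Relative search (leading ".") for <strong style=...>/<a> below `node`.
    return [a.text for a in node.findall(".//strong[@style='%s']/a" % style)]


# Point 3: the attribute comparison is exact, so one missing space matches nothing.
no_match = titles(root, ONE_SPACE)
match = titles(root, TWO_SPACES)

# Point 2: <strong> is not inside <li class="profile">, so searching there is empty.
under_li = [t for li in root.findall(".//li[@class='profile']")
            for t in titles(li, TWO_SPACES)]

# Searching relative to each review container gives one title per container.
per_div = [titles(div, TWO_SPACES)
           for div in root.findall(".//div[@class='reviewtitle fl']")]
```

In Scrapy itself, an XPath that starts with `//` inside the loop is evaluated against the whole document, which is why the single item in the log above carries all twenty titles. Prefixing the expression with a dot (`.//strong[...]`) keeps it relative to the current node.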

      【Discussion】:

      • Thanks a lot!!! Could you tell me one more thing: if I want to extract all the links without visiting each page individually and adding its link to start_urls, how can I do that?
      • Here is the code I use to handle the URLs found on each page. Hope it helps.
            for url in urls:
                if not urlPattern.findall(url):
                    continue
                if str(url.encode('utf8')).startswith('http:'):
                    url = url.encode('utf8')
                elif str(url.encode('utf8')).startswith('/'):
                    url = rootUrl + str(url.encode('utf8'))
                else:
                    continue
                yield Request(url,callback=self.parse_start_url)
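
The URL handling in the last comment can also be sketched with `urllib.parse` from the Python 3 standard library. This is an illustration under assumptions, not the commenter's code: `normalize_links`, `ROOT_URL`, the `allowed_host` filter and the sample hrefs are all hypothetical, and unlike the comment's version it also resolves relative links that do not start with a slash.

```python
from urllib.parse import urljoin, urlparse

ROOT_URL = "http://www.mouthshut.com/"  # assumed base URL for relative links


def normalize_links(hrefs, base=ROOT_URL, allowed_host="www.mouthshut.com"):
    """Yield absolute http(s) URLs on the allowed host, skipping everything else."""
    for href in hrefs:
        absolute = urljoin(base, href)      # resolves "/path" against the base
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue                        # drops javascript:, mailto:, ...
        if parts.netloc != allowed_host:
            continue                        # drops links to other sites
        yield absolute


# Hypothetical hrefs, as they might be extracted from a page.
links = list(normalize_links([
    "/mobile-operators/page-2",
    "http://www.mouthshut.com/review/abc",
    "javascript:void(0)",
    "mailto:someone@example.com",
    "http://other.example.com/x",
]))
```

In a spider, each URL produced this way would then be passed to a `Request` with the desired callback, as in the comment above.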