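【Question title】: Scrapy merge subsite-item with site-item
【Posted】: 2016-12-09 18:34:57
【Question】:

I am trying to scrape details from a subsite and merge them with the details scraped from the main site. I have been researching this on Stack Overflow and in the documentation, but I still cannot get my code to work. It seems that my function for extracting the additional details from the subsite does not work. I would be very grateful if someone could take a look.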

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapeInfo.items import infoItem
import pyodbc


class scrapeInfo(Spider):
    name = "info"
    allowed_domains = ["http://www.nevermind.com"]
    start_urls = []

    def start_requests(self):

        #Get infoID and Type from database
        self.conn = pyodbc.connect('DRIVER={SQL Server};SERVER=server;DATABASE=dbname;UID=user;PWD=password')
        self.cursor = self.conn.cursor()
        self.cursor.execute("SELECT InfoID, category FROM dbo.StageItem")

        rows = self.cursor.fetchall()

        for row in rows:
            url = 'http://www.nevermind.com/info/'
            InfoID = row[0]
            category = row[1]
            yield self.make_requests_from_url(url+InfoID, InfoID, category, self.parse)

    def make_requests_from_url(self, url, InfoID, category, callback):
        request = Request(url, callback)
        request.meta['InfoID'] = InfoID
        request.meta['category'] = category
        return request

    def parse(self, response):
        hxs = Selector(response)
        infodata = hxs.xpath('div[2]/div[2]')  # input item path

        itemPool = []

        InfoID = response.meta['InfoID']
        category = response.meta['category']

        for info in infodata:
            item = infoItem()
            item_cur, item_hist = InfoItemSubSite()

            # Stem Details
            item['id'] = InfoID
            item['field'] = info.xpath('tr[1]/td[2]/p/b/text()').extract()
            item['field2'] = info.xpath('tr[2]/td[2]/p/b/text()').extract()
            item['field3'] = info.xpath('tr[3]/td[2]/p/b/text()').extract()
            item_cur['field4'] = info.xpath('tr[4]/td[2]/p/b/text()').extract()
            item_cur['field5'] = info.xpath('tr[5]/td[2]/p/b/text()').extract()
            item_cur['field6'] = info.xpath('tr[6]/td[2]/p/b/@href').extract()

            # Extract additional information about item_cur from refering site
            # This part does not work
            if item_cur['field6'] = info.xpath('tr[6]/td[2]/p/b/@href').extract():
                url = 'http://www.nevermind.com/info/sub/' + item_cur['field6'] = info.xpath('tr[6]/td[2]/p/b/@href').extract()[0]
                request = Request(url, housingtype, self.parse_item_sub)
                request.meta['category'] = category
                yield self.parse_item_sub(url, category)
            item_his['field5'] = info.xpath('tr[5]/td[2]/p/b/text()').extract()
            item_his['field6'] = info.xpath('tr[6]/td[2]/p/b/text()').extract()
            item_his['field7'] = info.xpath('tr[7]/td[2]/p/b/@href').extract()      

            item['subsite_dic'] = [dict(item_cur), dict(item_his)]

            itemPool.append(item)
            yield item
        pass

        # Function to extract additional info from the subsite, and return it to the original item.
        def parse_item_sub(self, response, category):
            hxs = Selector(response)
            subsite = hxs.xpath('div/div[2]')  # input base path

            category = response.meta['category']

            for i in subsite:        
                item = InfoItemSubSite()    
                if (category == 'first'):
                    item['subsite_field1'] = i.xpath('/td[2]/span/@title').extract()            
                    item['subsite_field2'] = i.xpath('/tr[4]/td[2]/text()').extract()
                    item['subsite_field3'] = i.xpath('/div[5]/a[1]/@href').extract()
                else:
                    item['subsite_field1'] = i.xpath('/tr[10]/td[3]/span/@title').extract()            
                    item['subsite_field2'] = i.xpath('/tr[4]/td[1]/text()').extract()
                    item['subsite_field3'] = i.xpath('/div[7]/a[1]/@href').extract()
                return item
            pass

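I have been looking at the examples below and many others (Stack Overflow is great for that!), as well as the Scrapy documentation, but I still cannot work out how to send details from one function and merge them with the item scraped in the original function.

how do i merge results from target page to current page in scrapy?
How can i use multiple requests and pass items in between them in scrapy python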

【Comments】:

    Tags: python function merge scrapy


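    【Solution 1】:

    What you are looking at here is called request chaining. Your problem is yielding a single item from several requests. The way to solve it is to chain the requests while carrying your item along in the request's meta attribute.
    Example: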

    from scrapy import Request

    def parse(self, response):
        item = MyItem()
        item['name'] = response.xpath("//div[@id='name']/text()").extract()
        more_page = ...  # placeholder: URL of some page that offers more details
        # go to the more page and take your item with you
        yield Request(more_page,
                      self.parse_more,
                      meta={'item': item})


    def parse_more(self, response):
        # get your item back from the meta
        item = response.meta['item']
        # fill it in with more data and yield!
        item['last_name'] = response.xpath("//div[@id='lastname']/text()").extract()
        yield item
    

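    【Discussion】:

    • Thanks! So in my case I have around 109 fields in the first request, all of which are included in meta={'item': item}. How do I send a different class this way?
    • @PhilipHoyos You can carry any object or reference in the request meta, but some more complex object types and classes may cause problems such as memory leaks, so you may want to stick to basic Python types where possible (scrapy.Item is pretty much just a Python dict, by the way, so it is perfectly safe).
    • @Granitosaurus Took a minute to give this a +1 for the cleanest, simplest, most direct and to-the-point explanation of request.meta I have seen. No gobbledygook, just "here it is, this is what it does, and this is how to make it work for you". Awesome!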