Scrapy: yield argument to non-consecutive callback (answers)

【Question title】: Scrapy: yield argument to non-consecutive callback
【Posted】: 2021-08-27 00:06:03
【Question】:

I'm new to Scrapy, but so far I've managed to write this spider (which actually works almost as expected).

I want to name the downloaded files f'{issue["number"]}_{issue["date"]}.pdf', but so far I haven't been able to. name is just a temporary workaround I came up with.

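I've been looking at the ItemLoader documentation, and maybe that's what I need, but it would mean rewriting the whole code. Maybe there is a simpler solution. I'll keep reading the docs.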

Any hints are welcome. Thanks in advance.

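PS: English is my second language and I'm still reading the Scrapy docs ;) If you're wondering why Venezuela: one of my tasks is to import these documents into our database.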

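import ftplib
import os

import scrapy

# ftp_host, ftp_user, ftp_password, ftp_directory and download_directory are defined elsewhere.
# List the files already on the FTP server so the spider can skip them.
ftp_connection = ftplib.FTP(host=ftp_host, user=ftp_user, passwd=ftp_password)
print(ftp_connection.getwelcome())
ftp_connection.cwd(ftp_directory)
ftp_files = ftp_connection.nlst()
print('Successfully created list of file names')
ftp_connection.quit()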

class DoVenezuela(scrapy.Spider):

    name = 'do_venezuela'
    start_urls = ['http://spgoin.imprentanacional.gob.ve/cgi-win/be_alex.cgi?forma=FGENERAL&nombrebd=spgoin&c01=Titulo&m01=frase&t01=&c03=Descriptor_TGO1&m03=comienzo&c04=FechaInicio&m04=%3E%3D&t04=01-01-2021&c05=FechaInicio&t05=&c06=Descriptor_EDR1&m06=frase&t06=Publicado&TSalida=T%3AGeneralGCTOF&recuperar=3000&MostrarHijos=E&Cizq=2&xsl=&pxsl=&TipoDoc=GCTOF&Submit2=Buscar&Orden=;FID;']

    def parse(self, response):
        # "Impar" is Spanish for odd, "Par" for even.
        odd_rows = response.css('tr.LineaTablaImpar')
        even_rows = response.css('tr.LineaTablaPar')
        all_rows = odd_rows + even_rows

        for row in all_rows:
            issue = Issue()
            issue['number'] = row.css('a.DocTitulo::text').get().replace('.', '')
            issue['edition'] = row.css('a.RefDescriptor::text').get()
            issue['date'] = row.css('td::text')[2].get().replace('-', '')
            issue['link'] =  row.css('a.DocTitulo').attrib['href']

            lookup = f'{issue["number"]}_{issue["date"]}.pdf'

            if lookup in ftp_files:
                # lookup already ends in .pdf
                print(f'Skipping {lookup}: already in ftp')
            elif os.path.isfile(f'{download_directory}/{lookup}'):
                print(f'Skipping {lookup}: already downloaded')
            else:
                yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{issue["link"]}', callback=self.parse_link1)

    def parse_link1(self, response):
        link1 = response.css('a')[21].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link1}', callback=self.parse_link2)

    def parse_link2(self, response):
        link2 = response.css('a')[17].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link2}', callback=self.download_pdf)
    
    def download_pdf(self, response):
        name = response.url.split('/')[4].replace('be_alex.cgi?Documento=', '')
        # Use a context manager so the file is flushed and closed.
        with open(f'{download_directory}/{name}.pdf', 'wb') as pdf_file:
            pdf_file.write(response.body)

【Comments】:

Tags: python scrapy yield

【Solution 1】:

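I managed to solve this by passing the file name along through the meta keyword on each request. I'm not sure this is the most convenient solution... According to the Scrapy docs, cb_kwargs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.cb_kwargs) is preferable.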

class DoVenezuela(scrapy.Spider):
    name = 'do_venezuela'
    start_urls = ['http://spgoin.imprentanacional.gob.ve/cgi-win/be_alex.cgi?forma=FGENERAL&nombrebd=spgoin&c01=Titulo&m01=frase&t01=&c03=Descriptor_TGO1&m03=comienzo&c04=FechaInicio&m04=%3E%3D&t04=01-01-2021&c05=FechaInicio&t05=&c06=Descriptor_EDR1&m06=frase&t06=Publicado&TSalida=T%3AGeneralGCTOF&recuperar=3000&MostrarHijos=E&Cizq=2&xsl=&pxsl=&TipoDoc=GCTOF&Submit2=Buscar&Orden=;FID;']

    def parse(self, response):
        # "Impar" is Spanish for odd, "Par" for even.
        odd_rows = response.css('tr.LineaTablaImpar')
        even_rows = response.css('tr.LineaTablaPar')
        all_rows = odd_rows + even_rows

        for row in all_rows:
            issue = Issue()
            issue['number'] = row.css('a.DocTitulo::text').get().replace('.', '')
            issue['edition'] = row.css('a.RefDescriptor::text').get()
            issue['date'] = row.css('td::text')[2].get().replace('-', '')
            issue['link'] =  row.css('a.DocTitulo').attrib['href']

            file_name = f'{issue["number"]}_{issue["date"]}.pdf'

            if file_name in ftp_files:
                # file_name already ends in .pdf
                print(f'Skipping {file_name}: already in ftp')
            elif os.path.isfile(f'{download_directory}/{file_name}'):
                print(f'Skipping {file_name}: already downloaded')
            else:
                yield response.follow(
                    url=f'http://spgoin.imprentanacional.gob.ve{issue["link"]}',
                    callback=self.parse_link1,
                    meta={'file_name': file_name}
                )

    def parse_link1(self, response):
        link1 = response.css('a')[21].attrib['href']
        file_name = response.meta.get('file_name')
        yield response.follow(
            url=f'http://spgoin.imprentanacional.gob.ve{link1}',
            callback=self.parse_link2,
            meta={'file_name': file_name}
        )

    def parse_link2(self, response):
        link2 = response.css('a')[17].attrib['href']
        file_name = response.meta.get('file_name')

        yield response.follow(
            url=f'http://spgoin.imprentanacional.gob.ve{link2}',
            callback=self.download_pdf,
            meta={'file_name': file_name}
        )

    def download_pdf(self, response):
        file_name = response.meta.get('file_name')
        # Use a context manager so the file is flushed and closed.
        with open(f'{download_directory}/{file_name}', 'wb') as pdf_file:
            pdf_file.write(response.body)

Still looking for a better solution.

【Comments】: