【Question title】: Scrapy: yield argument to non-consecutive callback
【Posted】: 2021-08-27 00:06:03
【Question】:

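I am new to Scrapy, but so far I have managed to write this spider, which almost works as intended.

I want to name the downloaded files f'{issue["number"]}_{issue["date"]}.pdf', but so far I have not been able to. The name variable in download_pdf is only a temporary workaround I came up with.

I am looking at the ItemLoader documentation. Maybe that is what I need, but it would mean rewriting the whole spider, and perhaps there is a simpler solution. I will keep reading the docs.

Any hints are welcome. Thanks in advance.

PS: English is my second language and I am still reading the Scrapy documentation ;) In case you are wondering why Venezuela: one of my tasks is to import these documents into our database.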

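import ftplib
import os

import scrapy

# ftp_host, ftp_user, ftp_password, ftp_directory, download_directory and the
# Issue item class are defined elsewhere (not shown in the question).

# List the files already on the FTP server so they are not downloaded again.
ftp_connection = ftplib.FTP(host=ftp_host, user=ftp_user, passwd=ftp_password)
print(ftp_connection.getwelcome())
ftp_connection.cwd(ftp_directory)
ftp_files = ftp_connection.nlst()
print('Successfully created list of file names')
ftp_connection.quit()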

class DoVenezuela(scrapy.Spider):

    name = 'do_venezuela'
    start_urls = ['http://spgoin.imprentanacional.gob.ve/cgi-win/be_alex.cgi?forma=FGENERAL&nombrebd=spgoin&c01=Titulo&m01=frase&t01=&c03=Descriptor_TGO1&m03=comienzo&c04=FechaInicio&m04=%3E%3D&t04=01-01-2021&c05=FechaInicio&t05=&c06=Descriptor_EDR1&m06=frase&t06=Publicado&TSalida=T%3AGeneralGCTOF&recuperar=3000&MostrarHijos=E&Cizq=2&xsl=&pxsl=&TipoDoc=GCTOF&Submit2=Buscar&Orden=;FID;']

    def parse(self, response):
        even_rows = response.css('tr.LineaTablaImpar')
        odd_rows = response.css('tr.LineaTablaPar')
        all_rows = even_rows + odd_rows

        for row in all_rows:
            issue = Issue()
            issue['number'] = row.css('a.DocTitulo::text').get().replace('.', '')
            issue['edition'] = row.css('a.RefDescriptor::text').get()
            issue['date'] = row.css('td::text')[2].get().replace('-', '')
            issue['link'] =  row.css('a.DocTitulo').attrib['href']

            lookup = f'{issue["number"]}_{issue["date"]}.pdf'

            if lookup in ftp_files:
                print(f'Skipping {lookup}: already in ftp')
            elif os.path.isfile(f'{download_directory}/{lookup}'):
                print(f'Skipping {lookup}: already downloaded')
            else:
                yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{issue["link"]}', callback=self.parse_link1)

    def parse_link1(self, response):
        link1 = response.css('a')[21].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link1}', callback=self.parse_link2)

    def parse_link2(self, response):
        link2 = response.css('a')[17].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link2}', callback=self.download_pdf)
    
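    def download_pdf(self, response):
        # Temporary workaround: derive the file name from the URL.
        name = response.url.split('/')[4].replace('be_alex.cgi?Documento=', '')
        # The context manager closes the file even if the write fails.
        with open(f'{download_directory}/{name}.pdf', 'wb') as pdf_file:
            pdf_file.write(response.body)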
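
The file naming and the skip check inside parse can be pulled out into two small functions, which makes them easy to try without running the spider. This is only a sketch: build_file_name and skip_reason are hypothetical helper names that do not appear in the original code, and the sample values in the usage note are made up.

```python
import os


def build_file_name(number, date):
    """Return '<number>_<date>.pdf' with '.' and '-' stripped, as parse() does."""
    return f"{number.replace('.', '')}_{date.replace('-', '')}.pdf"


def skip_reason(file_name, ftp_files, download_directory):
    """Return why a file can be skipped, or None if it still has to be downloaded."""
    if file_name in ftp_files:
        return 'already in ftp'
    if os.path.isfile(os.path.join(download_directory, file_name)):
        return 'already downloaded'
    return None
```

For example, build_file_name('42.123', '27-08-2021') returns '42123_27082021.pdf', and skip_reason returns None when that name is neither in ftp_files nor present in download_directory.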

【Comments】:

    Tags: python scrapy yield


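    【Solution 1】:

    I managed to solve this by passing the value along in the meta keyword of every request. I am not sure this is the most convenient solution: according to the Scrapy documentation, cb_kwargs (https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.cb_kwargs) is the preferred way to pass user data to callbacks.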

    class DoVenezuela(scrapy.Spider):
        name = 'do_venezuela'
        start_urls = ['http://spgoin.imprentanacional.gob.ve/cgi-win/be_alex.cgi?forma=FGENERAL&nombrebd=spgoin&c01=Titulo&m01=frase&t01=&c03=Descriptor_TGO1&m03=comienzo&c04=FechaInicio&m04=%3E%3D&t04=01-01-2021&c05=FechaInicio&t05=&c06=Descriptor_EDR1&m06=frase&t06=Publicado&TSalida=T%3AGeneralGCTOF&recuperar=3000&MostrarHijos=E&Cizq=2&xsl=&pxsl=&TipoDoc=GCTOF&Submit2=Buscar&Orden=;FID;']
    
        def parse(self, response):
            even_rows = response.css('tr.LineaTablaImpar')
            odd_rows = response.css('tr.LineaTablaPar')
            all_rows = even_rows + odd_rows
    
            for row in all_rows:
                issue = Issue()
                issue['number'] = row.css('a.DocTitulo::text').get().replace('.', '')
                issue['edition'] = row.css('a.RefDescriptor::text').get()
                issue['date'] = row.css('td::text')[2].get().replace('-', '')
                issue['link'] =  row.css('a.DocTitulo').attrib['href']
    
                file_name = f'{issue["number"]}_{issue["date"]}.pdf'
    
                if file_name in ftp_files:
                    print(f'Skipping {file_name}: already in ftp')
                elif os.path.isfile(f'{download_directory}/{file_name}'):
                    print(f'Skipping {file_name}: already downloaded')
                else:
    
                    yield response.follow(
                        url=f'http://spgoin.imprentanacional.gob.ve{issue["link"]}',
                        callback=self.parse_link1,
                        meta={'file_name': file_name}
                    )
    
        def parse_link1(self, response):
            link1 = response.css('a')[21].attrib['href']
            file_name = response.meta.get('file_name')
            yield response.follow(
                url=f'http://spgoin.imprentanacional.gob.ve{link1}',
                callback=self.parse_link2,
                meta={'file_name': file_name}
            )
    
        def parse_link2(self, response):
            link2 = response.css('a')[17].attrib['href']
            file_name = response.meta.get('file_name')
    
            yield response.follow(
                url=f'http://spgoin.imprentanacional.gob.ve{link2}',
                callback=self.download_pdf,
                meta={'file_name': file_name}
            )
    
        def download_pdf(self, response):
            file_name = response.meta.get('file_name')
            # The context manager closes the file once the body is written.
            with open(f'{download_directory}/{file_name}', 'wb') as pdf_file:
                pdf_file.write(response.body)
    
    

    Still looking for a better solution.
