【Posted】: 2021-08-27 00:06:03
【Question】:
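I'm new to Scrapy, but so far I've managed to write this spider, which almost works as intended.
I'd like the downloaded files to be named f'{issue["number"]}_{issue["date"]}.pdf', but I haven't been able to get that working yet. The name variable in download_pdf is only a temporary workaround I came up with.
I've been looking at the ItemLoader documentation, and maybe that is what I need, but it would mean rewriting the whole spider. Perhaps there is a simpler solution. I'll keep reading the docs.
Any hints are welcome. Thanks in advance.
P.S.: English is my second language, and I'm still working through the Scrapy docs ;) In case you're wondering why Venezuela: one of my tasks is to import these documents into our database.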
import ftplib
import os

import scrapy

# Issue (the item class used below), ftp_host, ftp_user, ftp_password, ftp_directory
# and download_directory are not shown here and are assumed to be defined elsewhere.

# Fetch the list of files already on the FTP server once, before the crawl starts.
ftp_connection = ftplib.FTP(host=ftp_host, user=ftp_user, passwd=ftp_password)
print(ftp_connection.getwelcome())
ftp_connection.cwd(ftp_directory)
ftp_files = ftp_connection.nlst()
print('Successfully created list of file names')
ftp_connection.quit()


class DoVenezuela(scrapy.Spider):
    name = 'do_venezuela'
    start_urls = ['http://spgoin.imprentanacional.gob.ve/cgi-win/be_alex.cgi?forma=FGENERAL&nombrebd=spgoin&c01=Titulo&m01=frase&t01=&c03=Descriptor_TGO1&m03=comienzo&c04=FechaInicio&m04=%3E%3D&t04=01-01-2021&c05=FechaInicio&t05=&c06=Descriptor_EDR1&m06=frase&t06=Publicado&TSalida=T%3AGeneralGCTOF&recuperar=3000&MostrarHijos=E&Cizq=2&xsl=&pxsl=&TipoDoc=GCTOF&Submit2=Buscar&Orden=;FID;']

    def parse(self, response):
        # "Impar" is Spanish for odd and "Par" for even.
        odd_rows = response.css('tr.LineaTablaImpar')
        even_rows = response.css('tr.LineaTablaPar')
        all_rows = odd_rows + even_rows
        for row in all_rows:
            issue = Issue()
            issue['number'] = row.css('a.DocTitulo::text').get().replace('.', '')
            issue['edition'] = row.css('a.RefDescriptor::text').get()
            issue['date'] = row.css('td::text')[2].get().replace('-', '')
            issue['link'] = row.css('a.DocTitulo').attrib['href']
            # This is the file name I would like the download to end up with.
            lookup = f'{issue["number"]}_{issue["date"]}.pdf'
            if lookup in ftp_files:
                print(f'Skipping {lookup}: already in ftp')
            elif os.path.isfile(f'{download_directory}/{lookup}'):
                print(f'Skipping {lookup}: already downloaded')
            else:
                yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{issue["link"]}', callback=self.parse_link1)

    def parse_link1(self, response):
        # First intermediate page: the link to follow is the 22nd anchor.
        link1 = response.css('a')[21].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link1}', callback=self.parse_link2)

    def parse_link2(self, response):
        # Second intermediate page: the PDF link is the 18th anchor.
        link2 = response.css('a')[17].attrib['href']
        yield response.follow(url=f'http://spgoin.imprentanacional.gob.ve{link2}', callback=self.download_pdf)

    def download_pdf(self, response):
        # Temporary workaround: the name is derived from the URL because
        # issue['number'] and issue['date'] are not available in this callback.
        name = response.url.split('/')[4].replace('be_alex.cgi?Documento=', '')
        with open(f'{download_directory}/{name}.pdf', 'wb') as pdf_file:
            pdf_file.write(response.body)
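The naming and skip logic inside parse can be pulled out into plain functions and checked without running a crawl. The following is only a minimal sketch: the helper names pdf_name and already_have and the sample values are hypothetical, and the extra strip() call is an added precaution that the spider above does not perform.

```python
from pathlib import Path


def pdf_name(number: str, date: str) -> str:
    """Build '<number>_<date>.pdf' from the raw table cell texts."""
    clean_number = number.strip().replace('.', '')
    clean_date = date.strip().replace('-', '')
    return f'{clean_number}_{clean_date}.pdf'


def already_have(name: str, ftp_files, download_directory) -> bool:
    """Return True if the file is in the FTP listing or already on disk."""
    on_ftp = name in set(ftp_files)
    on_disk = (Path(download_directory) / name).is_file()
    return on_ftp or on_disk


# Hypothetical sample values, not taken from the real site.
print(pdf_name('42.123', '27-08-2021'))
```

If ftp_files is large, converting it to a set once, right after nlst(), avoids rebuilding it on every call.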
【Discussion】: