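【Posted】: 2019-12-22 08:19:32
【Question】:
I am trying to scrape all the .pdf links, the title of each PDF, and the time it was received, from this webpage. To find the href links on the page, I tried the following code: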
from bs4 import BeautifulSoup
import requests

# Download the raw HTML of the announcements page
source = requests.get('https://www.bseindia.com/corporates/ann.html?scrip=532538').text
soup = BeautifulSoup(source, 'lxml')

# Print the href of every anchor tag that has one
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
I get the following output:
{{CorpannData.Table[0].NSURL}}
{{CorpannData.Table[0].NSURL}}
#
/xml-data/corpfiling/AttachLive/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachLive/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}
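The `{{...}}` expressions in this output look like client-side template placeholders, which would mean the real links are filled in by JavaScript after the page loads and never appear in the HTML that `requests` downloads. That is an inference from the output above, not something confirmed for this site. A minimal sketch for telling unrendered hrefs apart from real ones, where the helper name `is_unrendered` and the sample list are made up for illustration:

```python
import re

# Matches a double-curly-brace template expression such as {{cann.ATTACHMENTNAME}}.
TEMPLATE_RE = re.compile(r"\{\{.*?\}\}")


def is_unrendered(href):
    """Return True if the href still contains a template placeholder."""
    return bool(TEMPLATE_RE.search(href))


# Sample hrefs: three taken from the output above, plus one real PDF path.
hrefs = [
    "{{CorpannData.Table[0].NSURL}}",
    "#",
    "/xml-data/corpfiling/AttachLive/{{cann.ATTACHMENTNAME}}",
    "/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf",
]

unrendered = [h for h in hrefs if is_unrendered(h)]
print(unrendered)
```

If every PDF-looking href on the page is flagged this way, the data probably has to come from wherever the page's script loads it, or from a browser that actually runs the script.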
The output I want is all of the PDF links, like these:
https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
https://www.bseindia.com/xml-data/corpfiling/AttachHis/d2355247-3287-4c41-be61-2a5655276e79.pdf
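Comparing these URLs with the template hrefs in the output, each full link appears to be the site root, plus an attachment folder, plus an attachment file name. A small sketch of assembling such a URL with `urllib.parse.urljoin`, assuming the file name has already been obtained somehow; the constant and function names are illustrative, and the choice of `AttachHis` over `AttachLive` is an assumption taken from the two example links:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bseindia.com"
# Assumption: folder taken from the example links; the page also references AttachLive.
ATTACH_PATH = "/xml-data/corpfiling/AttachHis/"


def pdf_url(attachment_name):
    """Build an absolute PDF URL from an attachment file name."""
    return urljoin(BASE_URL, ATTACH_PATH + attachment_name)


print(pdf_url("e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf"))
```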
(Optional) The output I want from the whole program is:
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 19-12-2019 13:49:14
PDF link: https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
...
I would also like the program to check the webpage for new updates every second.
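The formatting and the repeated checking could be sketched roughly as below. Everything here is hypothetical: `fetch` stands in for whatever function ends up returning the announcements, and the dictionary keys `title`, `received`, and `pdf_url` are invented for illustration. The loop remembers which PDF links it has already printed, so only new announcements appear on each pass.

```python
import time


def format_announcement(item):
    """Render one announcement in the three-line layout shown above."""
    return (
        f"Title: {item['title']}\n"
        f"Exchange received time: {item['received']}\n"
        f"PDF link: {item['pdf_url']}"
    )


def poll(fetch, interval=1.0, max_rounds=None, sleep=time.sleep):
    """Call fetch() repeatedly and print announcements not seen before.

    fetch      -- hypothetical callable returning a list of dicts
    interval   -- seconds to wait between rounds
    max_rounds -- stop after this many rounds (None means run forever)
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for item in fetch():
            if item["pdf_url"] not in seen:
                seen.add(item["pdf_url"])
                print(format_announcement(item))
        rounds += 1
        sleep(interval)
    return seen


def fake_fetch():
    # Stand-in data copied from the desired output above.
    return [{
        "title": "Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate",
        "received": "19-12-2019 13:49:14",
        "pdf_url": "https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf",
    }]


# Two rounds with no real waiting; the announcement is printed only once.
seen_urls = poll(fake_fetch, interval=0, max_rounds=2, sleep=lambda s: None)
```

Passing `fetch` and `sleep` in as arguments keeps the loop independent of how the data is actually retrieved and makes it easy to try out without hitting the site.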
【Discussion】:
Tags: python beautifulsoup python-requests