【Question title】: How to get href links from a webpage using Python?
【Posted】: 2019-12-22 08:19:32
【Question】:

I am trying to scrape all the .pdf links, each PDF's title, and the time it was received from this webpage. To find the href links on the page, I tried the following code:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.bseindia.com/corporates/ann.html?scrip=532538').text
soup = BeautifulSoup(source, 'lxml')

for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])

I get the following output:

{{CorpannData.Table[0].NSURL}}
{{CorpannData.Table[0].NSURL}}
#
/xml-data/corpfiling/AttachLive/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachLive/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}

The output I want is all of the PDF links, like this:

https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf

https://www.bseindia.com/xml-data/corpfiling/AttachHis/d2355247-3287-4c41-be61-2a5655276e79.pdf

(Optional) The output I would like from the complete program is:

Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 19-12-2019 13:49:14 
PDF link: https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
...

I would also like the program to check the webpage for new updates every second.

【Question comments】:

    Tags: python beautifulsoup python-requests


    【Solution 1】:
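Some context for the code that follows: the `{{...}}` placeholders in the question's output suggest that the page fills in its links with JavaScript after loading, so the static HTML fetched by `requests` never contains the real hrefs. The answer reads the same data from a JSON endpoint instead. The sketch below only shows how that endpoint's query string could be assembled from named parameters. `build_announcements_url` is a hypothetical helper, not part of any library; `strScrip` matches the scrip code in the question's page URL, while the meaning of the other parameters is not documented in this thread, so their values are copied unchanged.

```python
from urllib.parse import urlencode

# Endpoint and parameter values are copied from the answer's URL.
API_URL = "https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w"


def build_announcements_url(scrip_code):
    """Return the announcements URL for one scrip code (hypothetical helper)."""
    params = {
        "strCat": "-1",
        "strPrevDate": "",
        "strScrip": str(scrip_code),  # same code as ?scrip=532538 on the page
        "strSearch": "A",
        "strToDate": "",
        "strType": "C",
    }
    return f"{API_URL}?{urlencode(params)}"


print(build_announcements_url(532538))
```
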
    import requests

    # The page's links are template placeholders, so query the JSON endpoint directly.
    r = requests.get(
        'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w?strCat=-1&strPrevDate=&strScrip=532538&strSearch=A&strToDate=&strType=C').json()

    for item in r['Table']:
        # Timestamps look like "2019-12-19T13:49:14"; swap the "T" for a space.
        if item['News_submission_dt'] is None:
            received = "N/A"
        else:
            received = item['News_submission_dt'].replace("T", " ")
        # Build the full PDF URL; an empty or missing attachment name becomes "N/A".
        if not item['ATTACHMENTNAME']:
            pdf = "N/A"
        else:
            pdf = f"https://www.bseindia.com/xml-data/corpfiling/AttachHis/{item['ATTACHMENTNAME']}"

        print(
            f"Title: {item['NEWSSUB']}\nExchange received time: {received}\nPDF: {pdf}")
    

    Output:

    Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
    Exchange received time: 2019-12-19 13:49:14
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
    Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
    Exchange received time: 2019-12-16 15:48:22
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/d2355247-3287-4c41-be61-2a5655276e79.pdf
    Title: Announcement under Regulation 30 (LODR)-Analyst / Investor Meet - Intimation
    Exchange received time: 2019-12-16 09:50:00
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/6d7ba756-a541-4c85-b711-7270db7cb003.pdf
    Title: Allotment Of Non-Convertible Debentures
    Exchange received time: 2019-12-11 16:44:33
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/cdb18e51-725f-43ac-b01f-89f322ae2f5b.pdf
    Title: Lntimation Regarding Change Of Name Of Karvy Fintech Private Limited, Registrar & Transfer Agents
    Exchange received time: 2019-12-09 15:48:49
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/9dd527d7-d39d-422d-8de8-c428c24e169e.pdf
    Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
    Exchange received time: 2019-12-05 14:44:23
    PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/38af1a6e-a597-47e7-85b8-b620a961df84.pdf
    Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
    

    And so on...
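
The output above prints dates as `2019-12-19 13:49:14`, while the question asked for `19-12-2019 13:49:14`. A minimal sketch of one way to reorder the timestamp follows. `to_exchange_time` is a hypothetical helper name, and the handling of fractional seconds is an assumption about the API's data that nothing in this thread confirms.

```python
from datetime import datetime


def to_exchange_time(raw):
    """Convert '2019-12-19T13:49:14' to '19-12-2019 13:49:14'; empty or None gives 'N/A'."""
    if not raw:
        return "N/A"
    # Drop any fractional seconds before parsing (assumed possible, not confirmed).
    parsed = datetime.strptime(raw.split(".")[0], "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%d-%m-%Y %H:%M:%S")


print(to_exchange_time("2019-12-19T13:49:14"))  # 19-12-2019 13:49:14
```

In the loop above, this could stand in for the `.replace("T", " ")` step.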

    Output to a CSV file:

    import requests
    import csv
    
    r = requests.get(
        'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w?strCat=-1&strPrevDate=&strScrip=532538&strSearch=A&strToDate=&strType=C').json()
    
    data = []
    for item in r['Table']:
        # Timestamps look like "2019-12-19T13:49:14"; swap the "T" for a space.
        if item['News_submission_dt'] is None:
            received = "N/A"
        else:
            received = item['News_submission_dt'].replace("T", " ")
        # Build the full PDF URL; an empty or missing attachment name becomes "N/A".
        if not item['ATTACHMENTNAME']:
            pdf = "N/A"
        else:
            pdf = f"https://www.bseindia.com/xml-data/corpfiling/AttachHis/{item['ATTACHMENTNAME']}"

        # One CSV row per announcement: title, received time, PDF link.
        data.append((item['NEWSSUB'], received, pdf))
    
    with open('output.csv', 'w', newline="", encoding='UTF-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Exchange Received Time', 'PDF Link'])
        writer.writerows(data)
    

    (Copy of the CSV file)
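
The question also asked for the program to look for new updates every second, which the code above does not cover. One possible approach, sketched here under assumptions, is to remember which announcements have already been seen and report only the unseen ones on each round. The names `announcement_key`, `new_announcements` and `poll` are hypothetical. The key uses only the fields that appear in the answer, and whether the server tolerates one request per second is unknown, so a longer interval may be the safer choice.

```python
import time


def announcement_key(item):
    """Identify an announcement by fields the answer already uses."""
    return (
        item.get("NEWSSUB"),
        item.get("News_submission_dt"),
        item.get("ATTACHMENTNAME"),
    )


def new_announcements(items, seen):
    """Return the items not seen before and record their keys in `seen`."""
    fresh = []
    for item in items:
        key = announcement_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


def poll(fetch, on_new, interval=1.0, max_rounds=None, sleep=time.sleep):
    """Call fetch() repeatedly and pass each unseen item to on_new().

    fetch is any callable returning a list of dicts, for example
    lambda: requests.get(url).json()['Table'] with the answer's URL.
    max_rounds=None keeps polling until the program is interrupted.
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for item in new_announcements(fetch(), seen):
            on_new(item)
        rounds += 1
        # Wait between rounds, but not after the final one.
        if max_rounds is None or rounds < max_rounds:
            sleep(interval)
```

Note that the first round treats every existing announcement as new, since nothing has been seen yet. Passing `fetch` in as a callable keeps the loop independent of the network, which also makes it easy to try out with canned data.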

    【Comments】:
