【Question title】:How can I create a loop to scrape multiple pages from a source URL using BeautifulSoup?
【Posted】:2020-08-07 20:05:56
【Question】:

My current script only scrapes a single page, but I want to scrape all 5 pages from the source URL. How can I loop over the remaining 4 pages?

#Import Libraries
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://www.sustainalytics.com/esg-ratings/?industry=Aerospace%20&%20Defense&currentpage=1').text
soup = BeautifulSoup(source, 'lxml')

#Start CSV
csv_file = open('aerospacedata_1.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['company_name', 'company_exchange', 'company_risk'])

#Scrape Data from Web and write to csv
for company_info in soup.find_all(class_='company-row d-flex'):
    company_name = company_info.a.text
    company_exchange = company_info.find("small").text
    company_risk = company_info.find("div", class_="col-2").text
    print(company_name, company_exchange,company_risk)
    csv_writer.writerow([company_name, company_exchange, company_risk])
csv_file.close()

Output:

company_name company_exchange company_risk

AECC Aviation Power Co Ltd SHG:600893 53.3

Airbus SE PAR:AIR 30.3

Aselsan Elektronik Sanayi ve Ticaret AS IST:ASELS 31.6

AVIC Aircraft Co., Ltd. SHE:000768 54.4

AVIC Shenyang Aircraft Co. Ltd. SHG:600760 51.3

AviChina Industry & Technology Company Limited HKG:2357 45.2

BAE Systems PLC LON:BA 34.3

Bombardier Inc. TSE:BBD.B 30

BWX Technologies, Inc. NYS:BWXT 42.3

CAE Inc. TSE:CAE 32.4

【Discussion】:

    Tags: python loops web-scraping beautifulsoup


    【Solution 1】:

    Wrap the code in a for loop and use the loop variable to build both the URL and the file name.

    # Import libraries
    from bs4 import BeautifulSoup
    import requests
    import csv

    pages = 5
    for i in range(1, pages + 1):
        print(f"Page - {i}")
        # The loop variable i selects the page through the currentpage parameter
        source = requests.get(f'https://www.sustainalytics.com/esg-ratings/?industry=Aerospace%20&%20Defense&currentpage={i}').text
        soup = BeautifulSoup(source, 'lxml')

        # Start one CSV per page; newline='' avoids blank rows on Windows
        with open(f'aerospacedata_{i}.csv', 'w', newline='') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(['company_name', 'company_exchange', 'company_risk'])

            # Scrape data from the page and write it to the CSV
            for company_info in soup.find_all(class_='company-row d-flex'):
                company_name = company_info.a.text
                company_exchange = company_info.find("small").text
                company_risk = company_info.find("div", class_="col-2").text
                print(company_name, company_exchange, company_risk)
                csv_writer.writerow([company_name, company_exchange, company_risk])

        print("---" * 30)
    
    

    Output:

    Page - 1
    AECC Aviation Power Co Ltd SHG:600893 53.3
    Airbus SE PAR:AIR 30.3
    Aselsan Elektronik Sanayi ve Ticaret AS IST:ASELS 31.6
    AVIC Aircraft Co., Ltd. SHE:000768 54.4
    AVIC Shenyang Aircraft Co. Ltd. SHG:600760 51.3
    AviChina Industry & Technology Company Limited HKG:2357 45.2
    BAE Systems PLC LON:BA 34.3
    Bombardier Inc. TSE:BBD.B 30
    BWX Technologies, Inc. NYS:BWXT 42.3
    CAE Inc. TSE:CAE 32.4
    ------------------------------------------------------------------------------------------
    Page - 2
    China Avionics Systems Co.,Ltd. SHG:600372 54.8
    Cobham PLC LON:COB 34.7
    Curtiss-Wright Corp NYS:CW 39
    Dassault Aviation S.A. PAR:AM 31.8
    Embraer S.A. BSP:EMBR3 36.3
    FACC AG WBO:FACC 37.9
    General Dynamics Corp NYS:GD 37.5
    Heico Corp NYS:HEI 39.3
    Hexcel Corp NYS:HXL 31.6
    Huntington Ingalls Industries, Inc. NYS:HII 41.3
    ------------------------------------------------------------------------------------------
    Page - 3
    Kongsberg Gruppen ASA OSL:KOG 29
    Korea Aerospace Industries, Ltd. KRX:047810 49.9
    L3Harris Technologies, Inc. NYS:LHX 38.8
    Leonardo S.p.a. MIL:LDO 28.7
    Lockheed Martin Corp NYS:LMT 30.6
    Macquarie Infrastructure Corp NYS:MIC 44.7
    Meggitt PLC LON:MGGT 32.7
    MTU Aero Engines AG ETR:MTX 23.8
    Northrop Grumman Corp. NYS:NOC 31.1
    QinetiQ Group PLC LON:QQ 23
    ------------------------------------------------------------------------------------------
    Page - 4
    Raytheon Co NYS:RTN 32.9
    Rheinmetall AG ETR:RHM 35.4
    Rolls-Royce Holdings PLC LON:RR 28.6
    Saab AB OME:SAAB.B 31.5
    Safran SA PAR:SAF 30.7
    Senior PLC LON:SNR 31.9
    Signature Aviation Plc LON:SIG 35.4
    Singapore Technologies Engineering Ltd. SES:S63 29.2
    Spirit AeroSystems Holdings Inc NYS:SPR 36.8
    Teledyne Technologies, Inc. NYS:TDY 37.5
    ------------------------------------------------------------------------------------------
    Page - 5
    Textron Inc. NYS:TXT 37.8
    Thales SA PAR:HO 28.6
    The Boeing Company NYS:BA 39
    TransDigm Group Inc NYS:TDG 40.9
    Ultra Electronics Holdings PLC LON:ULE 37.4
    United Technologies Corp NYS:UTX 29.3
    ------------------------------------------------------------------------------------------
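
If a single combined file is preferable to one CSV per page, the same loop can be arranged roughly as below. This is only a sketch under assumptions: the helper names `parse_companies` and `scrape_to_single_csv`, the `URL_TEMPLATE` constant, the `.strip()` calls, and the 30-second timeout are illustrative additions that are not part of the answer above. It also swaps in the built-in `html.parser` instead of `lxml` to avoid an extra dependency. The class names and URL are taken from the question and may no longer match the live site.

```python
import csv

import requests
from bs4 import BeautifulSoup

# URL pattern copied from the question; {page} is filled in by the loop variable.
URL_TEMPLATE = ('https://www.sustainalytics.com/esg-ratings/'
                '?industry=Aerospace%20&%20Defense&currentpage={page}')


def parse_companies(html):
    """Return (name, exchange, risk) tuples for every company row in one page."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for company_info in soup.find_all(class_='company-row d-flex'):
        # .strip() is an added assumption: it trims stray whitespace around values
        name = company_info.a.text.strip()
        exchange = company_info.find('small').text.strip()
        risk = company_info.find('div', class_='col-2').text.strip()
        rows.append((name, exchange, risk))
    return rows


def scrape_to_single_csv(path, pages=5):
    """Fetch pages 1..pages and write all rows into one CSV (hypothetical helper)."""
    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['company_name', 'company_exchange', 'company_risk'])
        for page in range(1, pages + 1):
            response = requests.get(URL_TEMPLATE.format(page=page), timeout=30)
            response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
            writer.writerows(parse_companies(response.text))
```

Calling `scrape_to_single_csv('aerospacedata_all.csv')` (the file name is just an example) would write the header once and then append the rows of each page in order. Keeping the parsing in its own function means it can be checked against a saved HTML snippet without hitting the network.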
    

    【Comments】:

    • Works great, thank you very much for your help!