【Title】: How to scrape a paginated table with BeautifulSoup and store the results in CSV?
【Posted】: 2022-01-05 20:30:44
【Question】:

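I want to scrape https://www.airport-data.com/manuf/Reims.html, iterate over all of its pages, and extract the results into AircraftListing.csv.

The code runs without errors, but the output is populated incorrectly, and not all records from the web pages end up in the .csv file.

How can I export all Reims aircraft records to AircraftListing.csv?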

import requests
from bs4 import BeautifulSoup
import csv

root_url = "https://www.airport-data.com/manuf/Reims.html"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')

paging = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")

start_page = paging[1].text
last_page = paging[len(paging)-2].text


outfile = open('AircraftListing.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])


pages = list(range(1,int(last_page)+1))
for page in pages:
    url = 'https://www.airport-data.com/manuf/Reims:%s.html' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    print ('https://www.airport-data.com/manuf/Reims:%s' %(page))

    product_name_list = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")

    # Each row has 6 elements in it.
    # Loop through every sixth element. (The first element of each row)
    # Get all the other elements in the row by adding to index of the first.
    for i in range(int(len(product_name_list)/6)):
        Tail_Number = product_name_list[(i*6)].get_text('td')
        Year_Maker_Model = product_name_list[(i*6)+1].get_text('td')
        C_N = product_name_list[(i*6)+2].get_text('td')
        Engines = product_name_list[(i*6)+3].get_text('td')
        Seats = product_name_list[(i*6)+4].get_text('td')
        Location = product_name_list[(i*6)+5].get_text('td')

        writer.writerow([Tail_Number, Year_Maker_Model, C_N, Engines, Seats, Location])

outfile.close()
print ('Done')

【Comments】:

    Tags: python csv web-scraping beautifulsoup pagination


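    【Solution 1】:

    To improve your code, especially the part with the for loop, try to select more specifically. Instead of selecting every <td>, select the <tr> rows; this minimizes the work you do per iteration and is more generic, because it no longer depends on a fixed number of columns.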

    for row in soup.select('table tbody tr'):
        writer.writerow([c.text if c.text else '' for c in row.select('td')])
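
As a minimal sketch of the row-wise selection described above, the snippet below runs the same selector against a small inline HTML fragment instead of the live site. The markup is made up for illustration, and the real page's structure is an assumption here. One thing to keep in mind: `html.parser` does not insert a missing `<tbody>`, so `table tbody tr` only matches when the source HTML actually contains that element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of the listing.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Tail Number</th><th>Engines</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>0008</td><td>2</td><td>France</td></tr>
    <tr><td>13701</td><td></td><td>Portugal</td></tr>
  </tbody>
</table>
"""

def extract_rows(html):
    """Return one list of cell strings per <tr> in the table body."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in row.select("td")]
        for row in soup.select("table tbody tr")
    ]

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Because each row is handled on its own, an empty cell simply becomes an empty string and the header row in `<thead>` is skipped without any index arithmetic.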
    

    Example

    import requests, csv
    from bs4 import BeautifulSoup
    
    url = 'https://www.airport-data.com/manuf/Reims.html'
    
    # newline="" prevents the csv module from writing blank lines between rows on Windows.
    with open('AircraftListing.csv', "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])
    
        while True:
            html = requests.get(url)
            soup = BeautifulSoup(html.text, 'html.parser')
            for row in soup.select('table tbody tr'):
                writer.writerow([c.text if c.text else '' for c in row.select('td')])
    
    
            if soup.select_one('li.active + li a'):
                url = soup.select_one('li.active + li a')['href']
            else:
                break
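
The loop above finds the next page with the adjacent-sibling selector `li.active + li a`. The sketch below isolates that step on made-up pagination markup (the live page may be structured differently). The helper name `next_page_url` is hypothetical, and resolving the link with `urljoin` is an added assumption to cover the case where the site emits root-relative links; if the `href` is already absolute, `urljoin` returns it unchanged.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical pagination markup; the live page may differ.
SAMPLE_PAGER = """
<ul class="pagination">
  <li><a href="/manuf/Reims.html">1</a></li>
  <li class="active"><a href="/manuf/Reims:2.html">2</a></li>
  <li><a href="/manuf/Reims:3.html">3</a></li>
</ul>
"""

def next_page_url(html, current_url):
    """Return the absolute URL of the page after the active one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # "li.active + li" is the <li> immediately following the active one.
    link = soup.select_one("li.active + li a[href]")
    if link is None:
        return None
    # Absolute hrefs pass through; root-relative ones such as /manuf/... are resolved.
    return urljoin(current_url, link["href"])

current = "https://www.airport-data.com/manuf/Reims:2.html"
print(next_page_url(SAMPLE_PAGER, current))
```

When the active item is the last one in the list, the selector matches nothing and the function returns `None`, which corresponds to the `break` in the loop.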
    

    Output

    Tail Number,Year Maker Model,C/N,Engines,Seats,Location
    0008,1987 Reims F406 Caravan II,F406-0008,2,14.0,France
    0010,1987 Reims F406 Caravan II,F406-0010,2,12.0,France
    13701,0000 Reims FTB337G,0002,2,4.0,Portugal
    13705,0000 Reims FTB337G,0016,2,4.0,Portugal
    13710,0000 Reims FTB337G,0011,2,4.0,Portugal
    ...,...,...,...,...,...
    ZS-OHP,0000 Reims FR172J Reims Rocket,0496,1,4.0,South Africa
    ZS-OTT,1989 Reims F406 Caravan II,F406-0040,2,12.0,South Africa
    ZS-OXS,0000 Reims FR172J Reims Rocket,0418,1,4.0,South Africa
    ZS-SSC,1988 Reims BPSW,F406-0032,2,12.0,South Africa
    ZS-SSE,1990 Reims F406 Caravan II,F406-0043,2,12.0,South Africa
    

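    Alternative with pandas

    Another way to iterate over all 51 pages is to use pandas.read_html to fetch the tables, append them to a list, concat() the DataFrames from all pages, and save them as a CSV file containing all 5085 records.

    Example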

    from io import StringIO

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    url = 'https://www.airport-data.com/manuf/Reims.html'
    
    data = []
    
    while True:
        #print(url)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Wrap the markup in StringIO: newer pandas versions deprecate passing literal HTML strings.
        data.append(pd.read_html(StringIO(soup.select_one('table').prettify()))[0])
    
        if soup.select_one('li.active + li a[href]'):
            url = soup.select_one('li.active + li a')['href']
        else:
            break
    df = pd.concat(data)
    df.to_csv('AircraftListing.csv',index=False)
    

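    【Comments】:

    • HedgeHog, thank you. The efficiency of your code humbles me. I'm impressed! I imagine your code could extract the data for all manufacturers in one go, from airport-data.com/manuf/09.html to airport-data.com/manuf/Z.html
    • Happy to help. Humbled? Then nothing stands in the way of an upvote. With some adjustments it is quite possible to do that. However, you should try it yourself first, and if you really get stuck, just ask a new question that focuses on exactly that. SO will take notice and support you. It is all about learning, and I'm sure you can do it. Hint: the constructs can be nested and will give you the data.
    • Dear HedgeHog, to respect international languages, the correct Python code is ---> df.to_csv('AircraftListing.csv',encoding='utf-8-sig',index=False)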
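
The `utf-8-sig` encoding suggested in the last comment writes the same UTF-8 bytes as `utf-8`, preceded by a three-byte byte order mark that some spreadsheet programs use to recognise the file as UTF-8. The sketch below shows that difference with the standard library only; the helper `write_rows` and the sample values are made up for illustration.

```python
import csv
import os
import tempfile

def write_rows(path, rows, encoding):
    """Write rows to a CSV file using the given text encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)

# Made-up sample data containing a non-ASCII character.
rows = [["Tail_Number", "Location"], ["F-GXYZ", "Orléans"]]

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.csv")
    bom_path = os.path.join(tmp, "bom.csv")
    write_rows(plain_path, rows, "utf-8")
    write_rows(bom_path, rows, "utf-8-sig")
    with open(plain_path, "rb") as f:
        plain_bytes = f.read()
    with open(bom_path, "rb") as f:
        bom_bytes = f.read()

# The two files differ only by the byte order mark at the start.
print(bom_bytes[:3])
print(bom_bytes[3:] == plain_bytes)
```

Whether the mark is wanted depends on the program that will read the file, so treat the choice of encoding as a judgment call rather than a correction.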
    【Solution 2】:

    There are better ways to do this, but on lines 32-40 use:

    # Each row has 6 elements in it.
    # Loop through every sixth element. (The first element of each row)
    # Get all the other elements in the row by adding to index of the first.
    for i in range(int(len(product_name_list)/6)):
        Tail_Number = product_name_list[(i*6)].get_text('td')
        Year_Maker_Model = product_name_list[(i*6)+1].get_text('td')
        C_N = product_name_list[(i*6)+2].get_text('td')
        Engines = product_name_list[(i*6)+3].get_text('td')
        Seats = product_name_list[(i*6)+4].get_text('td')
        Location = product_name_list[(i*6)+5].get_text('td')
    
        writer.writerow([Tail_Number, Year_Maker_Model, C_N, Engines, Seats, Location])
    

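    The comments explain what is happening.

    【Comments】:

    • Thank you, I have reflected your suggestion in the original post. The .csv output is now organized, but it contains only 200 of the 5085 records. I want to capture all of them; the last record is on airport-data.com/manuf/Reims:51.html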
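
The index arithmetic in the loop amounts to splitting a flat list of cells into fixed-width rows. A minimal sketch of that idea without any scraping is shown below; the helper name `chunk_rows` is hypothetical, and the cell values are copied from the sample output earlier in this thread, plus one stray trailing cell to show what happens to an incomplete row.

```python
def chunk_rows(cells, width=6):
    """Group a flat list of cell values into rows of `width` items."""
    # Integer division drops any incomplete trailing row, as in the loop above.
    return [cells[i * width:(i + 1) * width] for i in range(len(cells) // width)]

# Stand-ins for the text of consecutive <td> elements.
cells = [
    "0008", "1987 Reims F406 Caravan II", "F406-0008", "2", "14.0", "France",
    "0010", "1987 Reims F406 Caravan II", "F406-0010", "2", "12.0", "France",
    "stray",
]

rows = chunk_rows(cells)
print(len(rows))
print(rows[1][0])
```

This grouping is only correct while every row really has six cells. A single extra or missing `<td>` shifts all later rows, which is one reason selecting `<tr>` elements is usually the safer choice.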