Dynamically scrape paginated table with BeautifulSoup and store results in csv? (Answers)

【Question title】: Dynamically scrape paginated table with BeautifulSoup and store results in csv?
【Posted】: 2022-01-11 19:49:25
【Question】:

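The code runs, but the dataframe is empty. In the URL below, both YEAR and PAGE are dynamic. I would like to iterate through both, grab the table td values and (if possible) the dependent data behind the acc. date link, and write the results for each year to year.csv.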

import requests, csv
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://aviation-safety.net/wikibase/dblist.php?Year=1916&sorteer=datekey&page=1'
req = Request(url , headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})


with open('1916_aviation-safety.csv', "w", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["acc. date", "Type", "Registration","operator", "fat", "Location", " ", "dmg", " ", " "])

    while True:
        print(url)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')

        # Go throught table = tbody and extract the data under the 'td' tag
        for row in soup.select('table > tbody > tr'):
            writer.writerow([c.text if c.text else '' for c in row.select('td')])
            print(row)

        # If more than one page then iterate through all of them        
        if soup.select_one('div.pagenumbers > span.current + div.a'):
            url = soup.select_one('div.pagenumbers > span.current + div.a')['href']
        else:
            break

【Comments】:

Tags: python csv web-scraping beautifulsoup pagination

【Solution 1】:

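I've made a few changes to your script that should make it easier to debug and maintain. It uses pandas to make writing the CSV easier and concurrent.futures to speed things up. Let me know if you have questions. Basically, all the years are scraped concurrently: for each year I scrape the first page to get the number of pages, then loop over every page and parse the HTML. The key information is put into a dictionary and appended to a list, which pandas can write to csv easily because a list of dictionaries is essentially already a dataframe.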

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import concurrent.futures

def scrape_year(year):

    headers =   {
        'accept':'*/*',
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        }

    url = f'https://aviation-safety.net/wikibase/dblist.php?Year={year}&sorteer=datekey&page=1'
    req = requests.get(url, headers=headers)

    soup = BeautifulSoup(req.text,'html.parser')

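    page_container = soup.find('div',{'class':'pagenumbers'})
    # A year that fits on one page has no pagination links, so fall back to 1
    page_links = page_container.find_all('a') if page_container else []
    pages = max([int(page['href'].split('=')[-1]) for page in page_links], default=1)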

    info = []
    for page in range(1,pages+1):

        new_url = f'https://aviation-safety.net/wikibase/dblist.php?Year={year}&sorteer=datekey&page={page}'
        print(new_url)

        data = requests.get(new_url,headers=headers)
        soup = BeautifulSoup(data.text,'html.parser')


        table = soup.find('table',{'class':'hp'})


        regex = re.compile('list.*')
        for index,row in enumerate(table.find_all('tr',{'class':regex})):
            if index == 0:
                continue

            acc_link = 'https://aviation-safety.net/'+row.find('a')['href']
            try:
                acc_date = datetime.strptime(row.find('a').text.strip(),'%d-%b-%Y').strftime('%Y-%m-%d')
            except ValueError:
                try:
                    acc_date = datetime.strptime("01"+row.find('a').text.strip(),'%d-%b-%Y').strftime('%Y-%m-%d')
                except ValueError:
                    try:
                        acc_date = datetime.strptime("01-01"+row.find('a').text.strip(),'%d-%b-%Y').strftime('%Y-%m-%d')
                    except ValueError:
                        continue

            # Look up the cells once instead of re-querying the row for every column
            cells = row.find_all('td')
            acc_type = cells[1].text
            acc_reg = cells[2].text
            acc_operator = cells[3].text
            acc_fat = cells[4].text
            acc_location = cells[5].text
            acc_dmg = cells[7].text

            item = {
                'acc_link' : acc_link,
                'acc_date': acc_date,
                'acc_type': acc_type,
                'acc_reg': acc_reg,
                'acc_operator' :acc_operator,
                'acc_fat':acc_fat,
                'acc_location':acc_location,
                'acc_dmg':acc_dmg
                }

            info.append(item)

    df= pd.DataFrame(info)
    df.to_csv(f'{year}_aviation-safety.csv',index=False)


if __name__ == "__main__":

    START = 1916
    STOP = 2022

    years = [year for year in range(START,STOP+1)]

    print(f'Scraping {len(years)} years of data')

    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        # Consume the iterator so exceptions raised inside scrape_year surface here
        final_list = list(executor.map(scrape_year,years))
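
The page-count step can also be written so that it reads the page parameter by name and tolerates a year with no pagination links. The following is only a sketch under assumptions: last_page is a hypothetical helper, not part of the answer, and the href strings are assumed to look like the query strings in the URLs above.

```python
from urllib.parse import urlsplit, parse_qs

def last_page(hrefs):
    """Return the highest 'page' value found in the given hrefs, or 1 if none."""
    pages = []
    for href in hrefs:
        # parse_qs maps each query parameter name to a list of values
        values = parse_qs(urlsplit(href).query).get("page", [])
        if values and values[0].isdigit():
            pages.append(int(values[0]))
    # default=1 covers a year whose results fit on a single page
    return max(pages, default=1)

# Hypothetical hrefs shaped like the pagination links
print(last_page(["?Year=1916&sorteer=datekey&page=2",
                 "?Year=1916&sorteer=datekey&page=3"]))
print(last_page([]))
```

Reading the parameter by name keeps working if the site ever reorders the query string, which splitting on "=" and taking the last piece would not.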

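【Comments】:

This is very detailed. Admittedly, I need to study it a bit. Bushcat69, I really appreciate it. The acc_link can provide more details; could capturing those details be added to the code?
Dear Bushcat69, looking at [aviation-safety.net/wikibase/…], the third record from the bottom of the page is missing. Could that be because the date is missing the day? Can this record be captured?
I have edited the code to capture incidents with missing date information, assuming the 1st of the month. Unfortunately, adding details by going into each incident would take a lot of time, because you would need to make thousands of extra requests. Maybe you could make that a new project.
1) I just use default headers to cover my tracks, in case they block requests that lack "accept" and "user-agent", which sometimes happens.
2) Sorry, that one is overly complicated, but what it does is get the maximum page number with a list comprehension: I get all the links ('a' tags) at the bottom of the page, take the [href] of each, split it on "=" so that each one becomes a list, take the last item ([-1]) and convert that text to an integer, so I can take the maximum of all the integers, i.e. the last page.
3) This is a context manager that starts 60 concurrent threads, each working on one year at a time. I give the "scrape_year" function a list (years) and the years are processed concurrently.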
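
The date fallback described in the comments above (assume the 1st when the day is missing) can be sketched as a small helper that tries a list of formats in order. This is an illustration, not the answerer's code: parse_acc_date is a hypothetical name, and the partial formats ('JAN-1916', '1916') are assumptions about how the site prints incomplete dates.

```python
from datetime import datetime

# Formats tried in order; the two partial ones are assumptions about the site
FORMATS = ("%d-%b-%Y", "%b-%Y", "%Y")

def parse_acc_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches.

    strptime fills a missing day or month with 1, so a partial date
    falls back to the first of the month or the first of January.
    """
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

print(parse_acc_date("12-JAN-1916"))
print(parse_acc_date("JAN-1916"))
print(parse_acc_date("no date"))
```

Returning None instead of skipping the row leaves the decision to the caller, so a record with an unparseable date can still be written to the csv with an empty date column.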

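【Solution 2】:

What happens?

First of all, always take a look at your soup; that is where the truth is.

You are missing the headers in the request inside the while loop, which leads to a 403 error, and the table selection is not correct.

How to fix it?

Set the headers on the request inside the while loop properly: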

html = requests.get(url , headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})

Select the rows more specifically; note that there is no tbody in the html:

        # Go throught table = tbody and extract the data under the 'td' tag
        for row in soup.select('table tr.list'):

Also check the selector for the pagination:

# If more than one page then iterate through all of them        
if soup.select_one('div.pagenumbers span.current + a'):
    url = 'https://aviation-safety.net/wikibase/dblist.php'+soup.select_one('div.pagenumbers span.current + a')['href']
else:
    break
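
The prefix is needed because the pagination href is relative. As a hedged alternative to concatenating strings, the next URL can be resolved against the page that was just fetched with urljoin from the standard library. The sketch assumes the href is a query-only reference such as '?Year=1916&sorteer=datekey&page=2', which is what the concatenation above implies; next_page_url is a hypothetical helper name.

```python
from urllib.parse import urljoin

def next_page_url(current_url, href):
    """Resolve a relative pagination href against the page it was found on."""
    # A query-only href keeps the base path and replaces only the query string
    return urljoin(current_url, href)

current = "https://aviation-safety.net/wikibase/dblist.php?Year=1916&sorteer=datekey&page=1"
print(next_page_url(current, "?Year=1916&sorteer=datekey&page=2"))
```

urljoin also copes with hrefs that turn out to be absolute paths or full URLs, so the loop does not need to know which form the site uses.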

Example

import requests, csv
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://aviation-safety.net/wikibase/dblist.php?Year=1916&sorteer=datekey&page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# newline="" stops the csv module from writing blank lines between rows on Windows
with open('1916_aviation-safety.csv', "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["acc. date", "Type", "Registration","operator", "fat", "Location", " ", "dmg", " ", " "])

    while True:
        print(url)
        html = requests.get(url , headers = headers)
        soup = BeautifulSoup(html.text, 'html.parser')

        # Go throught table = tbody and extract the data under the 'td' tag
        for row in soup.select('table tr.list'):
            writer.writerow([c.text if c.text else '' for c in row.select('td')])
            print(row)

        # If more than one page then iterate through all of them        
        if soup.select_one('div.pagenumbers span.current + a'):
            url = 'https://aviation-safety.net/wikibase/dblist.php'+soup.select_one('div.pagenumbers span.current + a')['href']
        else:
            break

Just in case

An alternative solution using pandas.read_html() that iterates over all the years:

import requests,time,random
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
url = 'https://aviation-safety.net/wikibase/'
req = requests.get(url , headers = headers)
soup = BeautifulSoup(req.text, 'html.parser')


data = []

for url in ['https://aviation-safety.net/'+a['href'] for a in soup.select('a[href*="/wikibase/dblist.php"]')]:
    while True:

        html = requests.get(url, headers = headers)
        soup = BeautifulSoup(html.text, 'html.parser')

        data.append(pd.read_html(soup.prettify())[0])

        # If more than one page then iterate through all of them        
        if soup.select_one('div.pagenumbers span.current + a'):
            url = 'https://aviation-safety.net/wikibase/dblist.php'+soup.select_one('div.pagenumbers span.current + a')['href']
        else:
            break
        time.sleep(random.random())

df = pd.concat(data)
df.loc[:, ~df.columns.str.contains('^Unnamed')].to_csv('aviation-safety.csv', index=False)

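【Comments】:

HedgeHog, thanks for pointing out that tbody does not exist. The page-iteration code in the "if" statement does not iterate; it just stays on the first page.
On the second one, I forgot to replace it with the variable. Just edited.
Dear HedgeHog, this works like a Swiss watch! Thank you very much!!!
Great that you could solve this on your own, and I'm happy to take over that addition, although it does not happen for me; it usually depends on system specifics as well. As for the second point, this is just a script, not a complete approach, so you can certainly extend it: write down the connections that broke and call them again later, since the other side has to cooperate too.
Oh, it doesn't work? You could try setting encoding='utf-8' for the request or for read_html.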
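
The suggestion in the comments to note the connections that broke and call them again could look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: fetch_with_retry and crawl are hypothetical helpers, and fetch stands for any callable that downloads a URL and raises an exception on failure (for example a thin wrapper around requests.get).

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url); on an exception wait and retry, return None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt} failed for {url}: {exc}")
            if attempt < attempts:
                # Wait a little longer after each failure
                time.sleep(delay * attempt)
    return None

def crawl(urls, fetch, attempts=3, delay=1.0):
    """Fetch every URL; return (results, failed) so failed URLs can be rerun."""
    results, failed = {}, []
    for url in urls:
        body = fetch_with_retry(fetch, url, attempts=attempts, delay=delay)
        if body is None:
            failed.append(url)
        else:
            results[url] = body
    return results, failed
```

Keeping the failed URLs in a list makes it easy to run crawl again on just those entries once the site responds again.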