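What happens?
First, always take a look at your soup; that is where the truth is.
The requests inside your while loop are sent without headers, so the server responds with a 403 error. In addition, the table is not selected correctly.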
How to fix?
Set the headers on the request inside your while loop:
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})
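One way to avoid forgetting the header on a later request is to set it once on a `requests.Session` and reuse that session for every call. The following is only a sketch under assumptions, not part of the original fix: `make_session` and `fetch` are hypothetical helper names, and the 30-second timeout is an arbitrary choice. `raise_for_status()` is used so that a 403 surfaces as an exception instead of silently producing an empty table.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
)


def make_session():
    """Hypothetical helper: a session that sends the User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def fetch(session, url):
    """Hypothetical helper: fetch a page, raising requests.HTTPError on 4xx/5xx."""
    response = session.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.text
```

Inside the loop, `fetch(session, url)` would then take the place of the bare `requests.get(url)` call.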
Select the rows more specifically. Note that there is no tbody in the HTML:
# Go through the table rows (there is no tbody) and extract the data under the 'td' tags
for row in soup.select('table tr.list'):
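To see why the selector matters, here is a small offline sketch. The markup in `SAMPLE_HTML` is made up for illustration and only mimics the shape described above (rows with class `list` directly under `table`, no `tbody`); it is not copied from the site.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: rows sit directly under <table>, without <tbody>
SAMPLE_HTML = """
<table>
  <tr><th>acc. date</th><th>Type</th></tr>
  <tr class="list"><td>01-JAN-1916</td><td>Example type A</td></tr>
  <tr class="list"><td>02-FEB-1916</td><td>Example type B</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# html.parser keeps the markup as written, so a selector requiring tbody matches nothing
rows_via_tbody = soup.select("table tbody tr")

# Selecting by the row class works whether or not tbody is present
rows_via_class = soup.select("table tr.list")

# One list of cell texts per data row
cells = [[td.get_text(strip=True) for td in row.select("td")] for row in rows_via_class]
```

Browsers insert `tbody` into the rendered DOM, which is why a selector copied from the developer tools can fail against the raw HTML that `requests` receives.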
Also check your selector for the pagination:
# If there is more than one page, iterate through all of them
if soup.select_one('div.pagenumbers span.current + a'):
    url = 'https://aviation-safety.net/wikibase/dblist.php' + soup.select_one('div.pagenumbers span.current + a')['href']
else:
    break
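The string concatenation above relies on the pagination `href` being a query-only value such as `?Year=...&page=2`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves the link against the page it was found on, so it also copes with root-relative links. `next_page_url` is a hypothetical helper name, and the two `href` values below are assumed shapes, not values taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Hypothetical helper: resolve a pagination href against the current page URL."""
    return urljoin(current_url, href)


current = "https://aviation-safety.net/wikibase/dblist.php?Year=1916&sorteer=datekey&page=1"

# Assumed query-only href: the path is kept and the query string is replaced
query_only = next_page_url(current, "?Year=1916&sorteer=datekey&page=2")

# Assumed root-relative href: the path and the query string are both replaced
root_relative = next_page_url(current, "/wikibase/dblist.php?Year=1917")
```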
Example
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://aviation-safety.net/wikibase/dblist.php?Year=1916&sorteer=datekey&page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('1916_aviation-safety.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["acc. date", "Type", "Registration", "operator", "fat", "Location", " ", "dmg", " ", " "])
    while True:
        print(url)
        html = requests.get(url, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Go through the table rows (there is no tbody) and extract the data under the 'td' tags
        for row in soup.select('table tr.list'):
            writer.writerow([c.text for c in row.select('td')])
            print(row)
        # If there is more than one page, iterate through all of them
        next_link = soup.select_one('div.pagenumbers span.current + a')
        if next_link:
            url = 'https://aviation-safety.net/wikibase/dblist.php' + next_link['href']
        else:
            break
Just in case
An alternative solution using pandas.read_html() that iterates over all the years:
import random
import time
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
url = 'https://aviation-safety.net/wikibase/'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
data = []

# Visit every per-year list page linked from the wikibase start page
for url in ['https://aviation-safety.net/' + a['href'] for a in soup.select('a[href*="/wikibase/dblist.php"]')]:
    while True:
        html = requests.get(url, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Newer pandas versions deprecate passing literal HTML, so wrap it in StringIO
        data.append(pd.read_html(StringIO(soup.prettify()))[0])
        # If there is more than one page, iterate through all of them
        next_link = soup.select_one('div.pagenumbers span.current + a')
        if next_link:
            url = 'https://aviation-safety.net/wikibase/dblist.php' + next_link['href']
        else:
            break
        # Short random pause between requests
        time.sleep(random.random())

df = pd.concat(data)
# Drop the unnamed columns before writing the CSV
df.loc[:, ~df.columns.str.contains('^Unnamed')].to_csv('aviation-safety.csv', index=False)