【Question title】: Select CSV file URLs from a soup object
【Posted】: 2020-12-04 08:22:34
【Question】:
How can I select the CSV file URLs from the soccer historical data page and save each file under a name of the form "state" + "season" + "CSV file name"? I'm lost in this area...
driver2 = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver2.get("https://www.football-data.co.uk/englandm.php")
pgsource2 = driver2.page_source
soup2 = BeautifulSoup(pgsource2, 'html.parser')
x = soup2.find_all('table')
for a in x.find_all('a', href=True):
    y = a['href']
    print(y)
【Comments】:
Tags:
python
web-scraping
beautifulsoup
【Solution 1】:
Here is my version:
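import re
from bs4 import BeautifulSoup
# Assumption: resp holds the page HTML, e.g. requests.get(url).text
soup2 = BeautifulSoup(resp, 'html.parser')
# The CSV links and their group labels share one parent element
main_table = soup2.find('a', href=re.compile(r'\.csv')).parent
result = {}
current_key = 'none'
for item in main_table:
    if item.name == 'i':
        # An <i> tag starts a new group; its text becomes the dictionary key
        current_key = item.text
        print(current_key)
        if current_key not in result:
            result[current_key] = []
        else:
            continue
    # Collect each link under the most recent <i> label
    if item.name == 'a' and item.get('href') and current_key in result:
        result[current_key].append({'href': item['href'], 'text': item.text})
print(result)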
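The grouping idea in this answer can be tried offline. The following is a minimal sketch, not the answerer's code: `SAMPLE_HTML` and `group_csv_links` are hypothetical names, and the markup is an assumed stand-in for the real page (labels in `<i>` tags, each followed by sibling `<a>` links), so the live page may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: each <i> label is
# followed by sibling <a> links to CSV files.
SAMPLE_HTML = """
<div>
<i>Season 2020/2021</i><br>
<a href="mmz4281/2021/E0.csv">Premier League</a><br>
<a href="mmz4281/2021/E1.csv">Championship</a><br>
<i>Season 2019/2020</i><br>
<a href="mmz4281/1920/E0.csv">Premier League</a><br>
</div>
"""

def group_csv_links(html):
    """Return {label: [href, ...]} for CSV links that follow each <i> label."""
    soup = BeautifulSoup(html, "html.parser")
    first_link = soup.find("a", href=re.compile(r"\.csv$"))
    if first_link is None:
        return {}
    grouped = {}
    label = None
    # Walk the siblings of the first CSV link in document order.
    for node in first_link.parent.children:
        name = getattr(node, "name", None)  # text nodes have no tag name
        if name == "i":
            label = node.get_text(strip=True)
            grouped.setdefault(label, [])
        elif name == "a" and label is not None and node.get("href", "").endswith(".csv"):
            grouped[label].append(node["href"])
    return grouped

groups = group_csv_links(SAMPLE_HTML)
print(groups)
```

Links that appear before any `<i>` label are skipped in this sketch, which mirrors the `current_key in result` check above.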
【Solution 2】:
You can find all the .csv files and save them locally like this:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
base_url = "https://www.football-data.co.uk/"
page = requests.get(urljoin(base_url, "englandm.php")).text
# Collect every anchor whose href mentions .csv; .get() avoids a KeyError
# on anchors that have no href attribute
anchors = BeautifulSoup(page, "html.parser").find_all(
    lambda t: t.name == "a" and ".csv" in t.get("href", ""),
)
csv_links = [urljoin(base_url, a["href"]) for a in anchors]
# Map the site's file codes to readable league names
name_mapping = {
    "E0.csv": "Premier_League",
    "E1.csv": "Championship",
    "E2.csv": "League_1",
    "E3.csv": "League_2",
    "EC.csv": "Conference",
}
for csv_link in csv_links:
    # The last two URL segments are unpacked as date and file_name
    *_, date, file_name = csv_link.split("/")
    print(f"Fetching {csv_link}...")
    # Fall back to the bare file code for files missing from name_mapping
    league = name_mapping.get(file_name, file_name.rsplit(".", 1)[0])
    with open(f"{date}_{league}.csv", "wb") as f:
        f.write(requests.get(csv_link).content)
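The question asked for names of the form "state" + "season" + "CSV file name", which the code above only partly covers. A minimal sketch of such a naming helper follows. `build_filename` is a hypothetical name, the country has to be supplied by the caller, and the example URL assumes that the second-to-last path segment is the season, as the `split("/")` unpacking above suggests.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def build_filename(csv_url, country):
    """Build '<country>_<season>_<file name>' from a CSV URL.

    Assumes the URL path ends in '<season>/<file name>'.
    """
    parts = PurePosixPath(urlparse(csv_url).path).parts
    season, file_name = parts[-2], parts[-1]
    return f"{country}_{season}_{file_name}"

# Assumed URL shape, used for illustration only.
example = build_filename(
    "https://www.football-data.co.uk/mmz4281/2021/E0.csv", "England"
)
print(example)
```

Inside the download loop, the result could replace the file name passed to `open`, for example `open(build_filename(csv_link, "England"), "wb")`.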