Answers: Select csv file URLs from a soup object

[Question title]: Select csv files urls from soup object
[Posted]: 2020-12-04 08:22:34
[Question description]:

How can I select the csv file URLs from the soccer historical data page and save the files under names of the form "state" + "season" + "csv file name"? I'm lost here...

driver2 = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver2.get("https://www.football-data.co.uk/englandm.php")
pgsource2 = driver2.page_source
soup2 = BeautifulSoup(pgsource2, 'html.parser')

x = soup2.find_all('table')

for a in x.find_all('a', href=True):
        y = a['href']
        print(y)

[Question discussion]:

Tags: python web-scraping beautifulsoup

[Solution 1]:

Here is my version:

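    import re

    import requests
    from bs4 import BeautifulSoup

    # `resp` is the page HTML. Fetching it with requests is an assumption;
    # the original answer did not show how it was obtained.
    resp = requests.get("https://www.football-data.co.uk/englandm.php").text

    soup2 = BeautifulSoup(resp, 'html.parser')
    # Take the parent of the first csv link; this assumes the season
    # headings (<i>) and the csv links (<a>) are its direct children.
    main_table = soup2.find('a', href=re.compile(r'\.csv')).parent
    result = {}
    current_key = 'none'
    for item in main_table:
        if item.name == 'i':
            # A new heading: the links that follow are grouped under it.
            current_key = item.text
            print(current_key)
            if current_key not in result:
                result[current_key] = []
            else:
                continue
        if item.name == 'a' and item.get('href') and current_key in result:
            result[current_key].append({'href': item['href'], 'text': item.text})

    print(result)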
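
The dictionary built above holds relative links grouped by heading. As a minimal sketch of a possible next step, the helper below flattens it into rows with absolute URLs. The name `flatten_links` and the sample data are hypothetical and only mirror the shape of `result`; the sample paths are illustrative, not taken from the page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.football-data.co.uk/"


def flatten_links(result, base_url=BASE_URL):
    """Turn {heading: [{'href': ..., 'text': ...}]} into (heading, text, absolute_url) tuples."""
    rows = []
    for heading, links in result.items():
        for link in links:
            # urljoin resolves the relative href against the site root.
            rows.append((heading.strip(), link["text"].strip(), urljoin(base_url, link["href"])))
    return rows


# Hypothetical sample shaped like the `result` dictionary.
sample = {
    "Season 2020/2021": [
        {"href": "mmz4281/2021/E0.csv", "text": "Premier League"},
        {"href": "mmz4281/2021/E1.csv", "text": "Championship"},
    ],
}

for row in flatten_links(sample):
    print(row)
```

Each tuple carries everything needed to download a file and name it after its heading and link text.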

[Discussion]:

[Solution 2]:

You can find all the .csv files and save them locally like this:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = "https://www.football-data.co.uk/"
page = requests.get(urljoin(base_url, "englandm.php")).text
# Keep only anchors whose href mentions ".csv"; .get() avoids a KeyError
# on anchors that have no href attribute.
anchors = BeautifulSoup(page, "html.parser").find_all(
    lambda t: t.name == "a" and ".csv" in t.get("href", ""),
)
csv_links = [urljoin(base_url, a["href"]) for a in anchors]

# Readable league names for the file codes.
name_mapping = {
    "E0.csv": "Premier_League",
    "E1.csv": "Championship",
    "E2.csv": "League_1",
    "E3.csv": "League_2",
    "EC.csv": "Conference",
}

for csv_link in csv_links:
    # The last two URL segments are the season folder and the file name.
    *_, season, file_name = csv_link.split("/")
    # Fall back to the bare file code if it is missing from the mapping.
    league = name_mapping.get(file_name, file_name.rsplit(".", 1)[0])
    print(f"Fetching {csv_link}...")
    with open(f"{season}_{league}.csv", "wb") as f:
        f.write(requests.get(csv_link).content)
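
The question asked for names of the form "state" + "season" + "csv file name", while the loop above names files by season and league. A minimal sketch of that naming scheme follows. The helper `build_filename` is hypothetical, the state is passed in by the caller because it is not part of the URL, and the sample URL is illustrative; the sketch assumes the link ends in `/<season>/<file>.csv`, as the split in the loop above does.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def build_filename(csv_url, state):
    """Return '<state>_<season>_<csv file name>' for a URL ending in /<season>/<file>.csv."""
    # Split only the path component, so the scheme and host are ignored.
    parts = PurePosixPath(urlsplit(csv_url).path).parts
    season, file_name = parts[-2], parts[-1]
    return f"{state}_{season}_{file_name}"


print(build_filename("https://www.football-data.co.uk/mmz4281/2021/E0.csv", "England"))
```

Inside the download loop, the returned string could be used as the argument to `open()` in place of the f-string.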

[Discussion]: