【Question title】: How do I find a tag using BeautifulSoup?
【Posted】: 2022-01-25 05:12:46
【Question】:

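I am trying to scrape results from the German billiards league website, tabulate them, and submit them to a scoring system that requires a specific format. I am no Python expert, but I have muddled through pulling the full list of links from the root league page and getting the date and the home/visitor teams. Now I am trying to capture the individual match data.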

Here is the relevant HTML:

<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
                                Spiel 2<br/>8-Ball                      </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>

Website: https://hbu.billardarea.de/cms_leagues/matchday/344947

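I am trying to find the "td" tag that contains the text string "Spiel 2". From there I should be able to pull out the game ("8-Ball") and then work out how to capture the data in the cells with the related "class" attributes. For the life of me, I cannot get a result. I have tried many permutations of various soup commands, but I get either "None" or "[]". I *think* it may have something to do with the extra whitespace, but I have tried various regex-based commands and still cannot select this td tag for further data collection.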

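What am I doing wrong? I know I am not coding this in the most efficient way; this is my first attempt at writing a web scraper, and I am a Python newbie in general.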

'''

import requests
import re
import os
from bs4 import BeautifulSoup

URL = "https://hbu.billardarea.de/cms_leagues/plan/7870/10406"

def import_all_links():
    page = requests.get(URL).text
    soup = BeautifulSoup(page, "html.parser")
    path = soup.select("a[href*=matchday]")

    for link in path:
        file1 = open("league.txt", "a")  # append mode
        file1.write("https://hbu.billardarea.de" + link['href'] + '\n')
        file1.close()

def get_date():
    links_file = open(r'C:\Users\Russ\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.10\league.txt', "r") 
    for day_link in links_file:
        day_link = day_link.rstrip("\n")
        soup = requests.get(day_link).text
        day_links_parse = BeautifulSoup(soup, "html.parser")
        date = day_links_parse.select('label:contains(Datum)')
        league = day_links_parse.select('label:contains(Saison)')
        home = day_links_parse.find(attrs={"class": "home"}).text
        home = home.partition(":")[2]
        visitor = day_links_parse.find(attrs={"class": "visitor"}).text
        visitor = visitor.partition(":")[2]
        print(day_links_parse)
        play_table = day_links_parse.td.find_all(text = re.compile('Spiel 2'))  # <<<<< Issue
        print(play_table)  # <<<<< Returns 0 results

        for item in date:
            date = item.next_sibling.next_sibling.text
            date = date.partition(" ")[0]
            date = date.split(".")
            date = date[1] + "\\" + date[0] + "\\" + date[2]
        for item in league:
            league = item.next_sibling.next_sibling.text
            league = league.partition(" ")[0]

        print(date, ",", league, ",", home, " (H) vs ", visitor, "(V)", sep='')

import_all_links()
get_date() '''
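
A note on the line marked as the issue: `day_links_parse.td` is only the first `<td>` in the document, so `find_all` searches inside that one cell. The sketch below reproduces this offline with the HTML from the question. The extra "Datum" row is an assumption added here to stand in for whatever cell comes first on the real page, which has not been checked. It also uses `string=`, the newer name for the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# HTML from the question, wrapped in a table. The first row ("Datum") is an
# assumed stand-in for an earlier, unrelated cell on the real page.
HTML = """
<table>
<tr><td>Datum</td></tr>
<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
                                Spiel 2<br/>8-Ball                      </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# soup.td is the FIRST <td> only, so this search never reaches the later cell.
inside_first_td = soup.td.find_all(string=re.compile("Spiel 2"))
print(inside_first_td)

# Searching from the document root finds the text node wherever it sits.
from_root = soup.find_all(string=re.compile("Spiel 2"))

# Climb from the matching text node to the <td> that encloses it.
cell = from_root[0].find_parent("td")
label = cell.get_text(" ", strip=True)
print(label)
```

The surrounding whitespace is not the obstacle here: the regex is applied with `search`, so it matches the text node despite the leading spaces.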

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    You could try something like this, for example:

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get("https://hbu.billardarea.de/cms_leagues/matchday/344947")
    table_rows = (
        BeautifulSoup(page.text, "lxml")
        .select(".report_table, .matchday_table, .score_table > tr")
    )
    
    spiel_zwei = table_rows[5].select_one("b").getText(strip=True, separator=" ")
    heim, gast = [spieler.getText() for spieler in table_rows[5].select("td")[1:]]
    spielergebnis = table_rows[6].getText(strip=True, separator=" vs. ")
    
    print(f"{spiel_zwei}\n{heim} - {gast}\n{spielergebnis}")
    

    Output:

    Spiel 2 8-Ball
    Rolf Berghöfer - Zühtü Uyanik
    6 vs. 7
    

    【Comments】:

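    • Thank you so much! This code works great! However, I can't quite follow how the data is being extracted. What if I want to pull "Spiel 3" and its related fields? I'm not sure how you found the right table rows.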
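
    One possible way to look up a game by its label instead of by row index is sketched below. It is not part of the answer above. `find_game` is a hypothetical helper, and it assumes every game follows the two-row layout shown in the question (label cell plus names in one row, scores in the next). The "Spiel 3" rows and the player names in them are invented sample data.

```python
import re
from bs4 import BeautifulSoup

# "Spiel 2" rows are from the question; "Spiel 3" rows are invented sample data.
HTML = """
<table>
<tr>
<td colspan="3" rowspan="2"><b> Spiel 2<br/>8-Ball </b></td>
<td class="home up" colspan="6">Christian Fachinger</td>
<td class="visitor up" colspan="7">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6">7</td>
<td class="visitor down" colspan="7">4</td>
</tr>
<tr>
<td colspan="3" rowspan="2"><b> Spiel 3<br/>9-Ball </b></td>
<td class="home up" colspan="6">Player A</td>
<td class="visitor up" colspan="7">Player B</td>
</tr>
<tr>
<td class="home down" colspan="6">5</td>
<td class="visitor down" colspan="7">8</td>
</tr>
</table>
"""


def find_game(soup, label):
    """Hypothetical helper: details for the game labelled `label`, or None."""
    # The trailing \b keeps "Spiel 2" from also matching "Spiel 20".
    node = soup.find(string=re.compile(re.escape(label) + r"\b"))
    if node is None:
        return None
    name_row = node.find_parent("tr")                # row with label and names
    score_row = name_row.find_next_sibling("tr")     # assumed: scores come next
    if score_row is None:
        return None
    parts = list(node.find_parent("td").stripped_strings)
    return {
        "label": parts[0],
        "game": parts[1] if len(parts) > 1 else None,
        "home": name_row.select_one("td.home").get_text(strip=True),
        "visitor": name_row.select_one("td.visitor").get_text(strip=True),
        "home_score": score_row.select_one("td.home").get_text(strip=True),
        "visitor_score": score_row.select_one("td.visitor").get_text(strip=True),
    }


soup = BeautifulSoup(HTML, "html.parser")
print(find_game(soup, "Spiel 3"))
```

    Anchoring on the label text rather than on `table_rows[5]` means the lookup should keep working if rows are added above the game, as long as the two-row layout holds.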
    【Solution 2】:
    from bs4 import BeautifulSoup
    import requests
    from urllib.parse import urljoin
    import re
    import pandas as pd
    
    
    def get_soup(content):
        return BeautifulSoup(content, 'lxml')
    
    
    def main(url):
        with requests.Session() as req:
            r = req.get(url)
            soup = get_soup(r.text)
            urls = [urljoin(url, x['href'])
                    for x in soup.select('a[href*=matchday]')]
            allin = []
            for link in urls:
                r = req.get(link)
                soup = get_soup(r.text).select_one('#main_frontend')
                match = soup.find(text=re.compile('Spiel 2'))
                allin.append(
                    {
                        'Date': soup.select('.nochange')[3].text.split()[0],
                        'League': soup.select('.nochange')[1].text.split()[1],
                        'Home': soup.select_one('.home').text.split(':')[1],
                        'Visitor': soup.select_one('.visitor').text.split(':')[1],
                        'Game': list(match.next_elements)[1].strip(),
                        'Whom': [x.text for x in match.find_all_next('td')[:2]],
                        'Result': [x['colspan'] for x in match.find_all_next('td')[:2]]
                    }
                )
    
            df = pd.DataFrame(allin)
            print(df)
    
    
    main('https://hbu.billardarea.de/cms_leagues/plan/7870/10406')
    

    Output:

             Date       League  ...                                    Whom  Result
    0  11.09.2021  (2021/2022)  ...          [Rolf Berghöfer, Zühtü Uyanik]  [6, 7]
    1  11.09.2021  (2021/2022)  ...     [Christian Roller, Balthasar Nebel]  [6, 7]
    2  11.09.2021  (2021/2022)  ...  [Peter Graessner, Christian Fachinger]  [6, 7]
    
    [3 rows x 7 columns]
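
    The `Result` column above shows `[6, 7]` on every row. Those values match the `colspan` attributes of the name cells in the question's HTML, which is what `x['colspan']` reads. If the scores are wanted instead, one option is to read the text of the cells whose class includes `down`. The sketch below runs against the question's HTML only and assumes the live page has the same structure.

```python
import re
from bs4 import BeautifulSoup

# HTML from the question, wrapped in a table so it parses as a unit.
HTML = """
<table>
<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
                                Spiel 2<br/>8-Ball                      </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
match = soup.find(string=re.compile("Spiel 2"))

# The first two <td> cells after the label hold the player names.
name_cells = match.find_all_next("td")[:2]
whom = [td.get_text(strip=True) for td in name_cells]

# Reading the colspan attribute of those cells yields layout widths, not scores.
colspans = [td["colspan"] for td in name_cells]

# The scores are the text of the next two cells whose class list includes "down".
score_cells = match.find_all_next("td", class_="down")[:2]
result = [td.get_text(strip=True) for td in score_cells]

print(whom, colspans, result)
```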
    

    【Comments】:
