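【Posted】: 2022-01-25 05:12:46
【Question】:
I am trying to scrape results from the German Billiards Union website, tabulate them, and submit them to a scoring system that requires a specific format. I'm not a Python expert, but I have muddled through pulling the complete list of links from the root league page and getting the date and the home/visitor teams. Now I'm trying to capture the individual match data.
Here is the relevant HTML: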
<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
Spiel 2<br/>8-Ball </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>
</tr>
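Website: https://hbu.billardarea.de/cms_leagues/matchday/344947
I'm trying to find the "td" tag that contains the text string "Spiel 2". From there I should be able to pull out the game ("8-Ball") and then work out how to capture the data in the related "class" tags. For the life of me, I can't get a result. I've tried many permutations of various soup commands, but I get either "None" or "[]". I *think* it may have something to do with the extra whitespace, but I've tried various regex-based commands and still can't "select" this td tag for further data collection.
What am I doing wrong? I know I'm not coding this in the most efficient way. This is my first attempt at writing a web scraper, and I'm a Python novice in general.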
'''
import requests
import re
import os
from bs4 import BeautifulSoup

URL = "https://hbu.billardarea.de/cms_leagues/plan/7870/10406"

def import_all_links():
    page = requests.get(URL).text
    soup = BeautifulSoup(page, "html.parser")
    path = soup.select("a[href*=matchday]")
    for link in path:
        file1 = open("league.txt", "a")  # append mode
        file1.write("https://hbu.billardarea.de" + link['href'] + '\n')
        file1.close()

def get_date():
    links_file = open(r'C:\Users\Russ\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.10\league.txt', "r")
    for day_link in links_file:
        day_link = day_link.rstrip("\n")
        soup = requests.get(day_link).text
        day_links_parse = BeautifulSoup(soup, "html.parser")
        date = day_links_parse.select('label:contains(Datum)')
        league = day_links_parse.select('label:contains(Saison)')
        home = day_links_parse.find(attrs={"class": "home"}).text
        home = home.partition(":")[2]
        visitor = day_links_parse.find(attrs={"class": "visitor"}).text
        visitor = visitor.partition(":")[2]
        print(day_links_parse)
        play_table = day_links_parse.td.find_all(text = re.compile('Spiel 2'))  # <<<<< Issue
        print(play_table)  # <<<<< Returns 0 results
        for item in date:
            date = item.next_sibling.next_sibling.text
            date = date.partition(" ")[0]
            date = date.split(".")
            date = date[1] + "\\" + date[0] + "\\" + date[2]
        for item in league:
            league = item.next_sibling.next_sibling.text
            league = league.partition(" ")[0]
        print(date, ",", league, ",", home, " (H) vs ", visitor, "(V)", sep='')

import_all_links()
get_date()
'''
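One likely reason for the empty result, sketched here against the HTML fragment from the question rather than the live page: `day_links_parse.td` is shorthand for `find("td")`, so it returns only the first `<td>` in the whole document, and the `find_all` call then searches inside that single cell. The leading "first cell on the page" row below is a made-up stand-in for whatever the real first cell is. Searching from the root and then climbing to the enclosing `<td>` with `find_parent` seems to behave as intended on this fragment.

```python
import re
from bs4 import BeautifulSoup

# Fragment from the question, wrapped in <table>. The first row is a
# hypothetical placeholder for the real page's first <td> (an assumption).
HTML = """
<table>
<tr>
<td>first cell on the page</td>
</tr>
<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
Spiel 2<br/>8-Ball </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# soup.td is only the FIRST <td>, so this never looks inside the "Spiel 2" cell.
first_td_only = soup.td.find_all(string=re.compile("Spiel 2"))

# Searching from the root finds the text node; find_parent climbs to its <td>.
label = soup.find(string=re.compile(r"Spiel\s+2"))
game_cell = label.find_parent("td")

# The cell holds two text nodes split by <br/>: the match label and the discipline.
parts = list(game_cell.stripped_strings)

print(first_td_only)
print(parts)
```

The regex itself does not need to account for the surrounding whitespace, because `re.compile(...)` patterns are applied as a search within each text node, not as a full match.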
【Discussion】:
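For the follow-up step of capturing the data in the "home"/"visitor" cells, here is a minimal sketch. It assumes every game on the page uses the two-row layout shown in the question (a label cell with rowspan="2", names in the first row, scores in the second) and that scores are plain integers. That has not been checked against the live site. `parse_games` is a hypothetical helper name, not something from the original code.

```python
import re
from bs4 import BeautifulSoup

# HTML fragment from the question, wrapped in <table> so the rows are siblings.
HTML = """
<table>
<tr>
<td colspan="3" nowrap="" rowspan="2" width="100"><b>
Spiel 2<br/>8-Ball </b>
</td>
<td class="home up" colspan="6" valign="top">Christian Fachinger</td>
<td class="visitor up" colspan="7" valign="top">Michael Schneider</td>
</tr>
<tr>
<td class="home down" colspan="6" valign="top">7</td>
<td class="visitor down" colspan="7" valign="top">4</td>
</tr>
</table>
"""


def parse_games(soup):
    """Return one dict per 'Spiel N' block (hypothetical helper)."""
    games = []
    for label in soup.find_all(string=re.compile(r"Spiel\s+\d+")):
        name_row = label.find_parent("tr")              # row with the player names
        score_row = name_row.find_next_sibling("tr")    # assumed: scores in the next row
        # Label cell text is "Spiel N" and the discipline, split by <br/>.
        match_label, discipline = list(label.find_parent("td").stripped_strings)[:2]
        games.append({
            "match": match_label,
            "discipline": discipline,
            "home": name_row.select_one("td.home.up").get_text(strip=True),
            "visitor": name_row.select_one("td.visitor.up").get_text(strip=True),
            "home_score": int(score_row.select_one("td.home.down").get_text(strip=True)),
            "visitor_score": int(score_row.select_one("td.visitor.down").get_text(strip=True)),
        })
    return games


games = parse_games(BeautifulSoup(HTML, "html.parser"))
print(games)
```

If some games on the real page are laid out differently (for example unplayed matches with empty score cells), the `int(...)` conversions and the two-value unpacking would need guarding.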
Tags: python beautifulsoup