【Question title】: How can I extract text from a class tag that appears after an <a href> tag within a <div> using BeautifulSoup and Python?
【Posted】: 2020-06-24 01:24:30
【Question description】:

I am trying to extract the text of a class that appears inside (and after) a tag, as shown below:

from bs4 import BeautifulSoup


html = """<div class="wisbb_teamA">
    <a href="http://www.example.com/eg1" class="wisbb_name">Phillies</a>
</div>"""

soup = BeautifulSoup(html,"lxml")

for div in soup.findAll('div', attrs={'class':'wisbb_teamA'}):
    print(div.find('a').contents[0])

This returns the following:

Phillies

That is correct, but when I try to extract from the actual page, I get the following:

TypeError: object of type 'Response' has no len()

The page is at

https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23

I used the following:

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

soup = BeautifulSoup(url,'lxml')

for div in soup.findAll('div', attrs={'class':'wisbb_teamA'}):
    print(div.find('a').contents[0])

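Thanks.

【Discussion】:

  • Try this: soup = BeautifulSoup(url.text, 'lxml')
  • @Matt, what exactly are you after? There may be a simpler way to get it through an API.
  • @chitown88 - I'm looking to print the team names. I'll try your solution as well.

Tags: html python-3.x beautifulsoup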


【Solution 1】:

The error you are getting,

TypeError: object of type 'Response' has no len()

occurs because your url variable is a Response object, not the actual HTML. You can get the HTML from its .text attribute, as shown below:

url = requests.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")

print(url) # Response object
print(url.text) # html

soup = BeautifulSoup(url.text, 'lxml') # new soup code: pass the HTML string and name the parser explicitly
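
To see why the original call fails without touching the network, here is a minimal sketch. `FakeResponse` is a hypothetical stand-in for a `requests.Response` (it only models the `.text` attribute), and the HTML string is the sample from the question.

```python
from bs4 import BeautifulSoup


class FakeResponse:
    """Hypothetical stand-in for requests.Response; only .text is modelled."""

    def __init__(self, text):
        self.text = text


html = ('<div class="wisbb_teamA">'
        '<a href="http://www.example.com/eg1" class="wisbb_name">Phillies</a>'
        '</div>')
resp = FakeResponse(html)

# Passing the wrapper object itself: BeautifulSoup expects a string, bytes,
# or an open file handle, so this raises TypeError.
try:
    BeautifulSoup(resp, "html.parser")
    raised = False
except TypeError:
    raised = True

# Passing the markup string works as in the question's first example.
soup = BeautifulSoup(resp.text, "html.parser")
name = soup.find("div", attrs={"class": "wisbb_teamA"}).find("a").get_text()
print(raised, name)
```

The same reasoning applies to a real response: hand BeautifulSoup `response.text` (or `response.content`), not the response itself.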

Here is an example that extracts the links, which may point you in the right direction:

for link in soup.find_all('a'):
    print(link.get('href'))

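【Discussion】:

  • This does return all the links on the page, but how can I specifically return the text in the <a> under the "wisbb_teamA" class?
【Solution 2】:

You need to pass url.text to BeautifulSoup, and try changing the parser from 'lxml' to 'html.parser'.

Your approach is fine; the problem is that the full content is not being fetched.

Below is a working code sample.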

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# Path to the chromedriver executable; adjust it for your machine.
driver = webdriver.Chrome('../chromedriver.exe')

driver.get("https://www.foxsports.com/mlb/scores?season=2019&date=2019-09-23")
delay = 10 # seconds
try:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'wisbb_teamA')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# Print the first matching div; a bare soup.find(...) call discards its result in a script.
print(soup.find("div", attrs={"class":"wisbb_teamA"}))

Credit: post

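I also found that requests.get does not give you the fully built DOM: it returns the HTML the server sends and does not run the page's JavaScript, so elements added by scripts are missing. I am not aware of a requests or urllib option that changes this, so check their documentation if you want to avoid a browser.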

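But this will do the job for you.

Note that I had to use an explicit wait, because even with Selenium the full HTML is not available immediately, so I wait for a specific element to load.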
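
Before reaching for a browser, it can help to check whether the target element is present in the HTML the server actually sends. The helper below is a hypothetical sketch: the name `has_team_divs` and both sample strings are made up for illustration. If it returns False for the text of a plain `requests.get` response, the content is most likely added by JavaScript, and you need either a rendered page source or the site's API.

```python
from bs4 import BeautifulSoup


def has_team_divs(html, css_class="wisbb_teamA"):
    """Return True if the markup contains at least one <div> with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_=css_class) is not None


# Made-up markup: what a server might send before any script has run ...
static_html = '<div id="scores"></div><script src="app.js"></script>'
# ... and what the page might look like after rendering.
rendered_html = ('<div id="scores"><div class="wisbb_teamA">'
                 '<a>Phillies</a></div></div>')

print(has_team_divs(static_html))
print(has_team_divs(rendered_html))
```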

Update:

Use this to print the names:

for s in soup.findAll("div", attrs={"class":"wisbb_teamA"}):
    links = s.findAll("a")
    for link in links:
        print(link.text)
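
As an alternative sketch of the same lookup, a CSS selector can express "every <a> inside a div with class wisbb_teamA" in one call. The HTML below is the sample from the question plus a second div, `wisbb_teamB`, which is an assumption added only to show that other divs are ignored.

```python
from bs4 import BeautifulSoup

html = """<div class="wisbb_teamA">
    <a href="http://www.example.com/eg1" class="wisbb_name">Phillies</a>
</div>
<div class="wisbb_teamB">
    <a href="http://www.example.com/eg2" class="wisbb_name">Nationals</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; get_text(strip=True) trims surrounding whitespace.
names = [a.get_text(strip=True) for a in soup.select("div.wisbb_teamA a")]
print(names)
```

On the real page this only finds something if the divs are present in the markup you parsed, so it pairs with the Selenium page source above rather than replacing it.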

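【Discussion】:

  • Yes, I already did that, but unfortunately it still does not return the exact information.
  • Thanks, the page loads successfully, but I cannot get each team's name as in the sample in the post above. I am trying to print all of them.
【Solution 3】:

You can get exactly the data you want straight from the API. Below I also parse the response into a table, but you don't necessarily need to do that.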

import requests

url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'

payload = {
'enable': 'odds,teamdetails',
'date': '20190923',
'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()

for each in jsonData['page']:
    homeTeam = each['homeTeam']['name']
    awayTeam = each['awayTeam']['name']
    print('Home Team: %s\nAway Team: %s\n' % (homeTeam, awayTeam))

Output:

Home Team: Nationals
Away Team: Phillies

Home Team: Blue Jays
Away Team: Orioles

Home Team: Rays
Away Team: Red Sox

Home Team: Mets
Away Team: Marlins

Home Team: Diamondbacks
Away Team: Cardinals  
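
The loop above can be tried offline against a small hand-written payload. This is a sketch under assumptions: the `sample` dict only mimics the shape the loop relies on (`page` → `homeTeam`/`awayTeam` → `name`), the real response carries many more fields, and `team_names` is a hypothetical helper that uses `.get` so a missing field yields None instead of a KeyError.

```python
# Hand-written stand-in for the API response; only the fields the loop uses.
sample = {
    "page": [
        {"homeTeam": {"name": "Nationals"}, "awayTeam": {"name": "Phillies"}},
        {"homeTeam": {"name": "Blue Jays"}},  # awayTeam deliberately missing
    ]
}


def team_names(data):
    """Return (home, away) name pairs, with None where a field is absent."""
    pairs = []
    for event in data.get("page", []):
        home = event.get("homeTeam", {}).get("name")
        away = event.get("awayTeam", {}).get("name")
        pairs.append((home, away))
    return pairs


pairs = team_names(sample)
for home, away in pairs:
    print("Home Team: %s\nAway Team: %s\n" % (home, away))
```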

Parsing into a table:

import pandas as pd
import requests
import re

def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '.')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '.')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out


url = 'https://api.foxsports.com/sportsdata/v1/baseball/mlb/events.json'

payload = {
'enable': 'odds,teamdetails',
'date': '20190923',
'apikey': 'jE7yBJVRNAwdDesMgTzTXUUSx1It41Fq'}

jsonData = requests.get(url, params=payload).json()
flat = flatten_json(jsonData)


results = pd.DataFrame()
special_cols = []

columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\.(\d+)\.', item )[0]
    except IndexError:
        # No '.<index>.' segment in the key: a top-level field, not a row.
        special_cols.append(item)
        continue
    column = re.findall(r'\.\d+\.(.*)', item )[0]

    row_idx = int(row_idx)
    value = flat[item]

    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
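
To make the flattening and the row/column split easier to follow, here is a small standard-library sketch on a made-up payload. The `sample` dict and its `count` key are assumptions for illustration, and plain dicts stand in for the DataFrame: a key such as `page.0.homeTeam.name` becomes row 0, column `homeTeam.name`, while keys without an index segment are collected separately.

```python
import re


def flatten_json(y):
    """Flatten nested dicts/lists into one dict with dot-separated keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key in x:
                flatten(x[key], name + key + '.')
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '.')
        else:
            out[name[:-1]] = x  # drop the trailing '.'

    flatten(y)
    return out


sample = {
    "page": [
        {"homeTeam": {"name": "Nationals"}},
        {"homeTeam": {"name": "Blue Jays"}},
    ],
    "count": 2,  # hypothetical top-level field
}
flat = flatten_json(sample)

rows = {}     # row index -> {column name: value}
special = {}  # keys that carry no list index
for key, value in flat.items():
    match = re.search(r'\.(\d+)\.(.*)', key)
    if match is None:
        special[key] = value
        continue
    rows.setdefault(int(match.group(1)), {})[match.group(2)] = value

print(flat)
print(rows)
print(special)
```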

【Discussion】:
