【Question Title】: Opening and Closing Tags are Removed from html When Using BeautifulSoup
【Posted】: 2020-06-01 22:43:33
【Question】:

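I'm having trouble scraping data from www.basketball-reference.com with BeautifulSoup. I've used BeautifulSoup on Bballreference before, so I'm a bit stumped about what's happening (granted, I'm a pretty big noob, so please bear with me).

I'm trying to scrape team season stats from https://www.basketball-reference.com/leagues/NBA_2020.html and I'm running into trouble right from the start: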

from bs4 import BeautifulSoup
import requests

web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
soup = BeautifulSoup(web_response, 'lxml')

table = soup.find('table', id='team-stats-per_game')
print(table)

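This shows that the table in question is not found, even though I can clearly locate the tag when I inspect the web page. Okay... no big deal so far (these errors are usually on my end), so I just print out the whole soup: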

soup = BeautifulSoup(web_response, 'lxml')
print(soup)

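I copied that and pasted it into https://codebeautify.org/htmlviewer/ to get a better view than the terminal gives, and it doesn't look the way I'd expect. Basically the meta tags are fine, but everything else seems to have lost its opening and closing tags, turning my soup into an actual soup...

Again, no big deal (I'm still pretty sure it's something I'm doing), so I grabbed the html from a simple blog site, printed it, and pasted it into codebeautify, and lo and behold, it looks normal. Now I suspect something is going on on basketball-reference's side that gets in the way of my even grabbing the html.

So my question is this: what exactly is going on here? I'd say there's an 80% chance it's still me, but the other 20% isn't so sure right now. Can someone point out what I'm doing wrong or how to get the html?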

【Comments】:

  • This is because the html content is created dynamically, which bs4 can't parse. One solution is to use a headless browser or go straight to Selenium.

Tags: python html parsing web-scraping beautifulsoup


【Solution 1】:

The data is stored in the page itself, but inside HTML comments.
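
One way to check this kind of thing on the raw response text, before reaching for a headless browser, is to see whether the element id occurs only inside a comment. The following is a rough sketch, not part of the original answer: the helper name `locate_id` and the sample snippets are made up for illustration, and it assumes the attribute is written as `id="..."` with double quotes.

```python
import re

def locate_id(html, element_id):
    """Classify where an element id occurs in raw HTML text.

    Returns 'markup' if it appears outside comments, 'comment' if it only
    appears inside <!-- ... -->, and 'absent' if it is not in the text at all
    (which would suggest the element is built later by JavaScript).
    Assumes the attribute is written as id="..." with double quotes.
    """
    needle = f'id="{element_id}"'
    # Drop every comment, then see whether the id survives
    without_comments = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
    if needle in without_comments:
        return 'markup'
    if needle in html:
        return 'comment'
    return 'absent'

# Hypothetical snippets standing in for a downloaded page
print(locate_id('<div><!-- <table id="stats"></table> --></div>', 'stats'))  # comment
print(locate_id('<div><table id="stats"></table></div>', 'stats'))           # markup
print(locate_id('<div id="other"></div>', 'stats'))                          # absent
```

With the code from the question, this could be called as `locate_id(web_response, 'team-stats-per_game')`.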

To parse it, you can do this:

import requests
from bs4 import BeautifulSoup, Comment

web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
soup = BeautifulSoup(web_response, 'lxml')

# the table is not in the parsed tree; its markup sits inside an HTML comment
# within this wrapper div
wrapper = soup.select_one('div#all_team-stats-per_game')

# find the comment section where the data is stored
comment = next(c for c in wrapper.contents if isinstance(c, Comment))

# load the data from comment:
soup2 = BeautifulSoup(comment, 'html.parser')

# print data:
for tr in soup2.select('tr:has(td)'):
    tds = tr.select('td')
    for td in tds:
        print(td.get_text(strip=True), end='\t')
    print()

Prints:

Dallas Mavericks    67  241.5   41.6    90.0    .462    15.3    41.5    .369    26.3    48.5    .542    17.9    23.1    .773    10.6    36.4    47.0    24.5    6.3 5.0 12.8    19.0    116.4   
Milwaukee Bucks*    65  240.8   43.5    91.2    .477    13.7    38.6    .356    29.8    52.6    .567    17.8    24.0    .742    9.5 42.2    51.7    25.9    7.4 6.0 14.9    19.2    118.6   
Houston Rockets 64  241.2   41.1    90.7    .454    15.4    44.3    .348    25.7    46.4    .554    20.5    26.0    .787    10.4    34.6    44.9    21.5    8.5 5.1 14.7    21.6    118.1   
Portland Trail Blazers  66  240.8   41.9    90.9    .461    12.6    33.8    .372    29.3    57.1    .513    17.3    21.7    .798    10.1    35.4    45.5    20.2    6.1 6.2 13.0    21.4    113.6   
Atlanta Hawks   67  243.0   40.6    90.6    .449    12.0    36.1    .333    28.6    54.5    .525    18.5    23.4    .790    9.9 33.4    43.3    24.0    7.8 5.1 16.2    23.1    111.8   
New Orleans Pelicans    64  242.3   42.6    92.2    .462    14.0    37.6    .372    28.6    54.6    .525    16.9    23.2    .729    11.2    35.8    47.0    27.0    7.6 5.1 16.2    21.0    116.2   
Los Angeles Clippers    64  241.2   41.6    89.7    .464    12.2    33.2    .366    29.5    56.5    .522    20.8    26.2    .792    11.0    37.0    48.0    23.8    7.1 5.0 14.8    22.0    116.2   
Washington Wizards  64  241.2   41.9    91.0    .461    12.3    33.1    .372    29.6    57.9    .511    19.5    24.8    .787    10.1    31.6    41.7    25.3    8.1 4.3 14.1    22.6    115.6   
Memphis Grizzlies   65  240.4   42.8    91.0    .470    10.9    31.1    .352    31.8    59.9    .531    16.2    21.3    .761    10.4    36.3    46.7    27.0    8.0 5.6 15.3    20.8    112.6   
Phoenix Suns    65  241.2   40.8    87.8    .464    11.2    31.7    .353    29.6    56.1    .527    19.8    24.0    .826    9.8 33.3    43.1    27.2    7.8 4.0 15.1    22.1    112.6   
Miami Heat  65  243.5   39.6    84.4    .470    13.4    34.8    .383    26.3    49.6    .530    19.5    25.1    .778    8.5 36.0    44.5    26.0    7.4 4.5 14.9    20.4    112.2   
Minnesota Timberwolves  64  243.1   40.4    91.6    .441    13.3    39.7    .336    27.1    52.0    .521    19.1    25.4    .753    10.5    34.3    44.8    23.8    8.7 5.7 15.3    21.4    113.3   
Boston Celtics* 64  242.0   41.2    89.6    .459    12.4    34.2    .363    28.8    55.4    .519    18.3    22.8    .801    10.7    35.3    46.0    22.8    8.3 5.6 13.6    21.4    113.0   
Toronto Raptors*    64  241.6   40.6    88.5    .458    13.8    37.0    .371    26.8    51.5    .521    18.1    22.6    .800    9.7 35.5    45.2    25.4    8.8 4.9 14.4    21.5    113.0   
Los Angeles Lakers* 63  240.8   42.9    88.6    .485    11.2    31.4    .355    31.8    57.1    .556    17.3    23.7    .730    10.6    35.5    46.1    25.9    8.6 6.8 15.1    20.6    114.3   
Denver Nuggets  65  242.3   41.8    88.9    .471    10.9    30.4    .358    31.0    58.5    .529    15.9    20.5    .775    10.8    33.5    44.3    26.5    8.1 4.6 13.7    20.0    110.4   
San Antonio Spurs   63  242.8   42.0    89.5    .470    10.7    28.7    .371    31.4    60.8    .517    18.4    22.8    .809    8.8 35.6    44.4    24.5    7.2 5.5 12.3    19.2    113.2   
Philadelphia 76ers  65  241.2   40.8    87.7    .465    11.4    31.6    .362    29.4    56.1    .523    16.6    22.1    .752    10.4    35.1    45.5    25.9    8.2 5.4 14.2    20.6    109.6   
Indiana Pacers  65  241.5   42.2    88.4    .477    10.0    27.5    .363    32.2    60.9    .529    15.1    19.1    .787    8.8 34.0    42.8    25.9    7.2 5.1 13.1    19.6    109.3   
Utah Jazz   64  240.4   40.1    84.6    .475    13.2    34.4    .383    27.0    50.2    .537    17.6    22.8    .772    8.8 36.3    45.1    22.2    5.9 4.0 14.9    20.0    111.0   
Oklahoma City Thunder   64  241.6   40.3    85.1    .473    10.4    29.3    .355    29.9    55.8    .536    19.8    24.8    .797    8.1 34.6    42.7    21.9    7.6 5.0 13.5    18.8    110.8   
Brooklyn Nets   64  243.1   40.0    90.0    .444    12.9    37.9    .340    27.1    52.2    .519    18.0    24.1    .744    10.8    37.6    48.5    24.0    6.5 4.6 15.5    20.7    110.8   
Detroit Pistons 66  241.9   39.3    85.7    .459    12.0    32.7    .367    27.3    53.0    .515    16.6    22.4    .743    9.8 32.0    41.7    24.1    7.4 4.5 15.3    19.7    107.2   
New York Knicks 66  241.9   40.0    89.3    .447    9.6 28.4    .337    30.4    61.0    .499    16.3    23.5    .694    12.0    34.5    46.5    22.1    7.6 4.7 14.3    22.2    105.8   
Sacramento Kings    64  242.3   40.4    87.8    .459    12.6    34.7    .364    27.7    53.2    .522    15.6    20.3    .769    9.6 32.9    42.5    23.4    7.6 4.2 14.4    21.9    109.0   
Cleveland Cavaliers 65  241.9   40.3    87.9    .458    11.2    31.8    .351    29.1    56.1    .519    15.1    19.9    .758    10.8    33.4    44.2    23.1    6.9 3.2 16.5    18.3    106.9   
Chicago Bulls   65  241.2   39.6    88.6    .447    12.2    35.1    .348    27.4    53.5    .511    15.5    20.5    .755    10.5    31.4    41.9    23.2    10.0    4.1 15.5    21.8    106.8   
Orlando Magic   65  240.4   39.2    88.8    .442    10.9    32.0    .341    28.3    56.8    .498    17.0    22.1    .770    10.4    34.2    44.5    24.0    8.4 5.7 12.6    17.6    106.4   
Golden State Warriors   65  241.9   38.6    88.2    .438    10.4    31.3    .334    28.2    56.9    .495    18.7    23.2    .803    10.0    32.9    42.8    25.6    8.2 4.6 14.9    20.1    106.3   
Charlotte Hornets   65  242.3   37.3    85.9    .434    12.1    34.3    .352    25.2    51.6    .489    16.2    21.6    .748    11.0    31.8    42.8    23.8    6.6 4.1 14.6    18.8    102.9   
League Average  65  241.7   40.8    88.8    .460    12.1    33.9    .357    28.7    54.9    .523    17.7    22.9    .771    10.1    34.7    44.9    24.3    7.7 4.9 14.5    20.6    111.4   
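
The same idea can be tried offline on a small inline document. This is only a sketch under stated assumptions: `SAMPLE_HTML`, the ids, the team names, and the helper `extract_commented_table` are invented for illustration and are not taken from the real page. It scans every comment in the document instead of one known wrapper div, and returns the rows as lists.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: the table exists only inside an HTML comment
SAMPLE_HTML = """
<div id="all_demo">
  <div class="placeholder"></div>
  <!--
  <table id="demo">
    <tr><th>Team</th><th>G</th><th>PTS</th></tr>
    <tr><td>Team A</td><td>67</td><td>116.4</td></tr>
    <tr><td>Team B</td><td>65</td><td>118.6</td></tr>
  </table>
  -->
</div>
"""

def extract_commented_table(html, table_id):
    """Return the data rows of a table hidden inside an HTML comment, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Comments are string nodes, so search the text nodes for Comment instances
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as HTML in its own right
        inner = BeautifulSoup(comment, 'html.parser')
        table = inner.find('table', id=table_id)
        if table is None:
            continue
        rows = []
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:  # skip header rows, which contain only <th>
                rows.append(cells)
        return rows
    return None

# A direct lookup fails because the table is not part of the parsed tree
print(BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table', id='demo'))  # None

rows = extract_commented_table(SAMPLE_HTML, 'demo')
print(rows)  # [['Team A', '67', '116.4'], ['Team B', '65', '118.6']]
```

Returning lists instead of printing makes it easier to pass the rows on to something like `csv.writer`.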

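【Comments】:

  • Thanks. Since I wasn't familiar with Comment objects or how to navigate them, it took me a minute to break down what you were doing. Reading the documentation alongside your suggestion really helped!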