【Question title】: Parsing NBA season stats from basketball-reference.com, how to remove html comment tags
【Posted】: 2019-10-24 11:10:08
【Question description】:

I am trying to parse the miscellaneous stats table from Basketball Reference (https://www.basketball-reference.com/leagues/NBA_1980.html). However, the table I want to parse is inside an HTML comment.

With the following code

html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))

I get the following error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-93508687bbc6> in <module>()
----> 1 cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))

~/.pyenv/versions/3.7.0/lib/python3.7/re.py in sub(pattern, repl, string, count, flags)
    190     a callable, it's passed the Match object and must return
    191     a replacement string to be used."""
--> 192     return _compile(pattern, flags).sub(repl, string, count)
    193 
    194 def subn(pattern, repl, string, count=0, flags=0):

TypeError: cannot use a string pattern on a bytes-like object

I am using Python 3.7.

【Comments】:

    Tags: python regex beautifulsoup


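    【Solution 1】:

    Rather than using re to strip the comment markers so that all the commented-out HTML is merged back into your page, use BeautifulSoup to return the comments from the HTML. Each comment can then be parsed with BeautifulSoup in turn to extract whatever table elements you need, for example: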

    import requests
    from bs4 import BeautifulSoup, Comment
    
    
    # Download the page; BeautifulSoup accepts the raw bytes directly.
    html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
    soup = BeautifulSoup(html, "html.parser")
    
    # Find every HTML comment node in the document.
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        # Parse the contents of the comment as HTML in its own right.
        comment_html = BeautifulSoup(comment, "html.parser")
    
        for table in comment_html.find_all("table"):
            for tr in table.find_all("tr"):
                row = [td.text for td in tr.find_all("td")]
                print(row)
            print()
    

    This gives you the rows inside the tables, starting with:

    ['Finals', 'Cleveland Cavaliers \nover \nGolden State Warriors\n\xa0(4-3)\n', 'Series Stats']
    ['\n\n\nGame 1\nThu, June 2\nCleveland Cavaliers\n89@ Golden State Warriors\n104\n\nGame 2\nSun, June 5\nCleveland Cavaliers\n77@ Golden State Warriors\n110\n\nGame 3\nWed, June 8\nGolden State Warriors\n90@ Cleveland Cavaliers\n120\n\nGame 4\nFri, June 10\nGolden State Warriors\n108@ Cleveland Cavaliers\n97\n\nGame 5\nMon, June 13\nCleveland Cavaliers\n112@ Golden State Warriors\n97\n\nGame 6\nThu, June 16\nGolden State Warriors\n101@ Cleveland Cavaliers\n115\n\nGame 7\nSun, June 19\nCleveland Cavaliers\n93@ Golden State Warriors\n89\n\n\n', 'Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', '@ Golden State Warriors', '104', 'Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', '@ Golden State Warriors', '110', 'Game 3', 'Wed, June 8', 'Golden State Warriors', '90', '@ Cleveland Cavaliers', '120', 'Game 4', 'Fri, June 10', 'Golden State Warriors', '108', '@ Cleveland Cavaliers', '97', 'Game 5', 'Mon, June 13', 'Cleveland Cavaliers', '112', '@ Golden State Warriors', '97', 'Game 6', 'Thu, June 16', 'Golden State Warriors', '101', '@ Cleveland Cavaliers', '115', 'Game 7', 'Sun, June 19', 'Cleveland Cavaliers', '93', '@ Golden State Warriors', '89']
    ['Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', '@ Golden State Warriors', '104']
    ['Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', '@ Golden State Warriors', '110']
    ['Game 3', 'Wed, June 8', 'Golden State Warriors', '90', '@ Cleveland Cavaliers', '120']
    ['Game 4', 'Fri, June 10', 'Golden State Warriors', '108', '@ Cleveland Cavaliers', '97']
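
If you want the rows back as data rather than printed, the loop above can be wrapped in a function. This is only a minimal sketch: `tables_in_comments`, `SAMPLE_HTML` and the `misc_stats` id are hypothetical names made up for illustration, and the sample markup is a small stand-in for the real page so that it runs without a network request. Unlike the loop above, it also collects `th` header cells and keys each table by its `id` attribute, on the assumption that the commented-out tables carry one.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div id="all_misc">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>Pace</th></tr>
  <tr><td>Team A</td><td>95.8</td></tr>
  <tr><td>Team B</td><td>99.3</td></tr>
</table>
-->
</div>
"""


def tables_in_comments(html):
    """Return {table id: list of rows} for tables found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    tables = {}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as a separate HTML fragment.
        inner = BeautifulSoup(comment, "html.parser")
        for table in inner.find_all("table"):
            rows = []
            for tr in table.find_all("tr"):
                # Collect header and data cells, trimming surrounding whitespace.
                cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
                if cells:
                    rows.append(cells)
            tables[table.get("id")] = rows
    return tables


result = tables_in_comments(SAMPLE_HTML)
for table_id, rows in result.items():
    print(table_id, rows)
```

With the real page you would pass in the downloaded HTML in place of `SAMPLE_HTML` and look up the table you need by its id.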
    

    Note: to avoid the `cannot use a string pattern on a bytes-like object` error, you can use .text instead of .content, so that a string rather than bytes is passed to your regular expression.
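
If you prefer to keep the regex approach from the question, the rule is that the pattern and the input must be the same type: str with str, or bytes with bytes. The sketch below shows both variants on a small inline sample. `SAMPLE_BYTES`, `strip_markers_text` and `strip_markers_bytes` are hypothetical names, and decoding as UTF-8 is an assumption about the page encoding.

```python
import re

# Hypothetical stand-in for requests.get(url).content, which is a bytes object.
SAMPLE_BYTES = b"<div><!--<table id='misc'><tr><td>1</td></tr></table>--></div>"


def strip_markers_text(markup: str) -> str:
    """Remove comment markers from a str using a str pattern."""
    return re.sub(r"<!--|-->", "", markup)


def strip_markers_bytes(markup: bytes) -> bytes:
    """Remove comment markers from bytes using a bytes pattern."""
    return re.sub(rb"<!--|-->", b"", markup)


# Variant 1: decode first (with requests, .text gives you the decoded str).
as_text = strip_markers_text(SAMPLE_BYTES.decode("utf-8"))

# Variant 2: stay in bytes and use a bytes pattern and replacement.
as_bytes = strip_markers_bytes(SAMPLE_BYTES)

print(as_text)
print(as_bytes)
```

Either result can then be handed to BeautifulSoup. Stripping the markers also turns every genuine comment on the page into live markup, which is one reason to prefer parsing the comments individually as shown above.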

    【Comments】:
