【Question title】: How to use BeautifulSoup to scrape a specific table and turn it into a pandas DataFrame?
【Posted】: 2021-09-12 04:25:57
【Question】:

How can I use bs4 to get the "Per Game Stats" table here and turn it into a pandas DataFrame?

This is what I have tried so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

I got stuck from there.

Thanks.

【Comments on the question】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    Use pd.read_html:

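    from io import StringIO

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Locate the table by its id attribute
    table = soup.find('table', id='per_game-team')
    # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(str(table)))[0]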
    

    The table you want has the id "per_game-team". You can find it with the inspector in your browser's developer tools.

    Output:

    >>> df.head(10)
         Rk                     Team   G     MP  ...  BLK   TOV    PF    PTS
    0   1.0         Milwaukee Bucks*  72  240.7  ...  4.6  13.8  17.3  120.1
    1   2.0           Brooklyn Nets*  72  241.7  ...  5.3  13.5  19.0  118.6
    2   3.0      Washington Wizards*  72  241.7  ...  4.1  14.4  21.6  116.6
    3   4.0               Utah Jazz*  72  241.0  ...  5.2  14.2  18.5  116.4
    4   5.0  Portland Trail Blazers*  72  240.3  ...  5.0  11.1  18.9  116.1
    5   6.0            Phoenix Suns*  72  242.8  ...  4.3  12.5  19.1  115.3
    6   7.0           Indiana Pacers  72  242.4  ...  6.4  13.5  20.2  115.3
    7   8.0          Denver Nuggets*  72  242.8  ...  4.5  13.5  19.1  115.1
    8   9.0     New Orleans Pelicans  72  242.1  ...  4.4  14.6  18.0  114.6
    9  10.0    Los Angeles Clippers*  72  240.0  ...  4.1  13.2  19.2  114.0
    
    [10 rows x 25 columns]
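
If the browser inspector is not at hand, the table ids can also be listed from the parsed page itself. The following is a minimal sketch on a small inline HTML sample, so it runs without a network request. The markup and the id `other-table` are made up for illustration; only `per_game-team` comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real page is much larger.
sample_html = """
<html><body>
  <table id="per_game-team"><tr><td>1</td></tr></table>
  <table id="other-table"><tr><td>2</td></tr></table>
  <table><tr><td>no id</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# id=True matches only <table> tags that actually carry an id attribute
table_ids = [t['id'] for t in soup.find_all('table', id=True)]
print(table_ids)
```

On the real page, the same list comprehension applied to the `soup` built from `page.content` should show which ids are available to pass to `soup.find`.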
    

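    【Comments】:

      【Solution 2】:

      pandas.read_html() is the way to go here (it can use BeautifulSoup under the hood, depending on the parser it picks). And since it also handles the request for you, you can simplify the solution Corral provided to: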

      import pandas as pd
      
      url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
      df = pd.read_html(url, attrs = {'id': 'per_game-team'})[0]
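
To see what the `attrs` filter does without hitting the site, here is a small sketch on an inline two-table sample. The sample markup and the id `other-table` are assumptions for illustration; the values are copied from the output shown in this thread.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page; only the id 'per_game-team' is taken from the answer.
sample_html = """
<table id="other-table">
  <thead><tr><th>A</th></tr></thead>
  <tbody><tr><td>0</td></tr></tbody>
</table>
<table id="per_game-team">
  <thead><tr><th>Team</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Milwaukee Bucks*</td><td>120.1</td></tr>
    <tr><td>Brooklyn Nets*</td><td>118.6</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per table that matches attrs
tables = pd.read_html(StringIO(sample_html), attrs={'id': 'per_game-team'})
df = tables[0]
print(len(tables))
print(df)
```

Only the table whose id matches is returned, which is why indexing with `[0]` is enough.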
      

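      But since you specifically asked how to convert the table to a DataFrame using bs4, I'll provide that solution as well.

      The basic logic/steps are:

      1. Get the table tag
      2. From the table object, get the header names from the <th> tags under the <thead> tag
      3. Iterate over the rows (<tr> tags) and get the <th> (rank) and <td> contents from each row

      Code: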

      import pandas as pd
      import requests
      from bs4 import BeautifulSoup
      
      url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
      response = requests.get(url)
      
      soup = BeautifulSoup(response.text, 'html.parser')
      table = soup.find('table', {'id':'per_game-team'})
      
      # Column names come from the <th> cells in <thead>
      headers = [x.text for x in table.find('thead').find_all('th')]
      
      data = []
      table_body_rows = table.find('tbody').find_all('tr')
      for row in table_body_rows:
          # The rank is stored in a <th> cell; the remaining values are in <td> cells
          rank = [row.find('th').text]
          row_data = rank + [x.text for x in row.find_all('td')]
          data.append(row_data)
      
      
      df = pd.DataFrame(data, columns=headers)
      

      Output:

      print(df)
          Rk                     Team   G     MP    FG  ...  STL  BLK   TOV    PF    PTS
      0    1         Milwaukee Bucks*  72  240.7  44.7  ...  8.1  4.6  13.8  17.3  120.1
      1    2           Brooklyn Nets*  72  241.7  43.1  ...  6.7  5.3  13.5  19.0  118.6
      2    3      Washington Wizards*  72  241.7  43.2  ...  7.3  4.1  14.4  21.6  116.6
      3    4               Utah Jazz*  72  241.0  41.3  ...  6.6  5.2  14.2  18.5  116.4
      4    5  Portland Trail Blazers*  72  240.3  41.3  ...  6.9  5.0  11.1  18.9  116.1
      5    6            Phoenix Suns*  72  242.8  43.3  ...  7.2  4.3  12.5  19.1  115.3
      6    7           Indiana Pacers  72  242.4  43.3  ...  8.5  6.4  13.5  20.2  115.3
      7    8          Denver Nuggets*  72  242.8  43.3  ...  8.1  4.5  13.5  19.1  115.1
      8    9     New Orleans Pelicans  72  242.1  42.5  ...  7.6  4.4  14.6  18.0  114.6
      9   10    Los Angeles Clippers*  72  240.0  41.8  ...  7.1  4.1  13.2  19.2  114.0
      10  11           Atlanta Hawks*  72  241.7  40.8  ...  7.0  4.8  13.2  19.3  113.7
      11  12         Sacramento Kings  72  240.3  42.6  ...  7.5  5.0  13.4  19.4  113.7
      12  13    Golden State Warriors  72  240.3  41.3  ...  8.2  4.8  15.0  21.2  113.7
      13  14      Philadelphia 76ers*  72  242.1  41.4  ...  9.1  6.2  14.4  20.2  113.6
      14  15       Memphis Grizzlies*  72  241.7  42.8  ...  9.1  5.1  13.3  18.7  113.3
      15  16          Boston Celtics*  72  241.4  41.5  ...  7.7  5.3  14.1  20.4  112.6
      16  17        Dallas Mavericks*  72  240.3  41.1  ...  6.3  4.3  12.1  19.4  112.4
      17  18   Minnesota Timberwolves  72  241.7  40.7  ...  8.8  5.5  14.3  20.9  112.1
      18  19          Toronto Raptors  72  240.3  39.7  ...  8.6  5.4  13.2  21.2  111.3
      19  20        San Antonio Spurs  72  242.8  41.9  ...  7.0  5.1  11.4  18.0  111.1
      20  21            Chicago Bulls  72  241.4  42.2  ...  6.7  4.2  15.1  18.9  110.7
      21  22      Los Angeles Lakers*  72  242.4  40.6  ...  7.8  5.4  15.2  19.1  109.5
      22  23        Charlotte Hornets  72  241.0  39.9  ...  7.8  4.8  14.8  18.0  109.5
      23  24          Houston Rockets  72  240.3  39.3  ...  7.6  5.0  14.7  19.5  108.8
      24  25              Miami Heat*  72  241.4  39.2  ...  7.9  4.0  14.1  18.9  108.1
      25  26         New York Knicks*  72  242.1  39.4  ...  7.0  5.1  12.9  20.5  107.0
      26  27          Detroit Pistons  72  242.1  38.7  ...  7.4  5.2  14.9  20.5  106.6
      27  28    Oklahoma City Thunder  72  241.0  38.8  ...  7.0  4.4  16.1  18.1  105.0
      28  29            Orlando Magic  72  240.7  38.3  ...  6.9  4.4  12.8  17.2  104.0
      29  30      Cleveland Cavaliers  72  242.1  38.6  ...  7.8  4.5  15.5  18.2  103.8
      
      [30 rows x 25 columns]
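
One difference from the read_html result: the cells collected with `.text` are all strings, so the columns of this DataFrame are not numeric (note that `Rk` prints as `1` here and as `1.0` in the read_html output). A possible follow-up, sketched on a few hand-copied rows rather than the full table, is to convert the stat columns with `pd.to_numeric`. Treating every column except `Team` as numeric is an assumption here.

```python
import pandas as pd

# A few values copied from the output above, stored as strings like the bs4 version produces
data = [
    ['1', 'Milwaukee Bucks*', '72', '120.1'],
    ['2', 'Brooklyn Nets*', '72', '118.6'],
    ['7', 'Indiana Pacers', '72', '115.3'],
]
df = pd.DataFrame(data, columns=['Rk', 'Team', 'G', 'PTS'])

# Convert every column except 'Team' to a numeric dtype
numeric_cols = df.columns.drop('Team')
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

After the conversion, operations such as sorting by `PTS` or computing means behave numerically instead of comparing strings.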
      

      【Comments】:
