【Question title】:How to use BeautifulSoup to scrape a certain table and turn it into a pandas DataFrame?
【Posted】:2021-09-12 04:25:57
【Question】:
How can I use bs4 to get the "Per Game Stats" table here and turn it into a pandas DataFrame?
Here is what I have tried so far:
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I'm stuck from there.
Thanks.
Tags:
python
pandas
beautifulsoup
【Solution 1】:
Use pd.read_html:
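import requests
from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Locate the "Per Game Stats" table by its id attribute
table = soup.find('table', id='per_game-team')
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(str(table)))[0]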
The table you want has the id "per_game-team". You can find it with the inspector in your browser's developer tools.
Output:
>>> df.head(10)
Rk Team G MP ... BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 72 240.7 ... 4.6 13.8 17.3 120.1
1 2.0 Brooklyn Nets* 72 241.7 ... 5.3 13.5 19.0 118.6
2 3.0 Washington Wizards* 72 241.7 ... 4.1 14.4 21.6 116.6
3 4.0 Utah Jazz* 72 241.0 ... 5.2 14.2 18.5 116.4
4 5.0 Portland Trail Blazers* 72 240.3 ... 5.0 11.1 18.9 116.1
5 6.0 Phoenix Suns* 72 242.8 ... 4.3 12.5 19.1 115.3
6 7.0 Indiana Pacers 72 242.4 ... 6.4 13.5 20.2 115.3
7 8.0 Denver Nuggets* 72 242.8 ... 4.5 13.5 19.1 115.1
8 9.0 New Orleans Pelicans 72 242.1 ... 4.4 14.6 18.0 114.6
9 10.0 Los Angeles Clippers* 72 240.0 ... 4.1 13.2 19.2 114.0
[10 rows x 25 columns]
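If you would rather not hunt for the id in the browser, the same lookup can be done in code by listing the id of every table on the page. The following is only a minimal sketch: the inline HTML is a made-up stand-in for the downloaded page, and the id "other-table" and its caption are hypothetical placeholders, not something taken from the real site.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for page.content; the real page contains many more tables.
html = """
<html><body>
  <table id="per_game-team"><caption>Per Game Stats</caption>
    <tr><th>Team</th><th>PTS</th></tr>
    <tr><td>Milwaukee Bucks*</td><td>120.1</td></tr>
  </table>
  <table id="other-table"><caption>Other Stats</caption>
    <tr><th>Team</th><th>PTS</th></tr>
    <tr><td>Example Team</td><td>1.0</td></tr>
  </table>
  <table><tr><td>no id here</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Map each table's id to its caption text so the right one is easy to spot.
table_ids = {}
for table in soup.find_all('table'):
    table_id = table.get('id')  # None when the table has no id attribute
    caption = table.find('caption')
    table_ids[table_id] = caption.get_text(strip=True) if caption else None

print(table_ids)
```

On the real page you would build the soup from page.content instead of the inline string, then pass the id you picked to soup.find('table', id=...).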
【Solution 2】:
pandas' .read_html() is the way to go here (it can use BeautifulSoup under the hood). And since it also handles the request itself, you can simplify the solution provided by Corral to:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
df = pd.read_html(url, attrs = {'id': 'per_game-team'})[0]
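To see what attrs does without hitting the network, here is a small sketch that runs read_html on an inline HTML string. The HTML, the "other-table" id, and the "Example Team" row are invented for illustration; only the "per_game-team" id and the two team rows come from the page discussed above.

```python
from io import StringIO

import pandas as pd

# Invented two-table document standing in for the real page.
html = """
<table id="other-table">
  <thead><tr><th>Team</th><th>PTS</th></tr></thead>
  <tbody><tr><td>Example Team</td><td>1.0</td></tr></tbody>
</table>
<table id="per_game-team">
  <thead><tr><th>Team</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Milwaukee Bucks*</td><td>120.1</td></tr>
    <tr><td>Brooklyn Nets*</td><td>118.6</td></tr>
  </tbody>
</table>
"""

# Without attrs, read_html returns one DataFrame per table it finds.
all_tables = pd.read_html(StringIO(html))

# With attrs, only tables whose attributes match are returned.
wanted = pd.read_html(StringIO(html), attrs={'id': 'per_game-team'})
df = wanted[0]

print(len(all_tables), len(wanted))
print(df)
```

read_html always returns a list, which is why the one-liner above ends with [0] even when only a single table matches.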
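But since you specifically asked how to use bs4 to convert the table into a DataFrame, I'll provide that solution as well.
The basic logic/steps for doing this are:
- Get the table tag
- From the table object, get the header names from the <th> tags under the <thead> tag
- Iterate through the rows (<tr> tags) and get the <td> contents from each row
Code: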
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Step 1: get the table tag by its id
table = soup.find('table', {'id': 'per_game-team'})

# Step 2: the header names come from the <th> tags under <thead>
headers = [x.text for x in table.find('thead').find_all('th')]

# Step 3: each body row holds the rank in a <th> and the other values in <td> tags
data = []
table_body_rows = table.find('tbody').find_all('tr')
for row in table_body_rows:
    rank = [row.find('th').text]
    row_data = rank + [x.text for x in row.find_all('td')]
    data.append(row_data)

df = pd.DataFrame(data, columns=headers)
Output:
print(df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1 Milwaukee Bucks* 72 240.7 44.7 ... 8.1 4.6 13.8 17.3 120.1
1 2 Brooklyn Nets* 72 241.7 43.1 ... 6.7 5.3 13.5 19.0 118.6
2 3 Washington Wizards* 72 241.7 43.2 ... 7.3 4.1 14.4 21.6 116.6
3 4 Utah Jazz* 72 241.0 41.3 ... 6.6 5.2 14.2 18.5 116.4
4 5 Portland Trail Blazers* 72 240.3 41.3 ... 6.9 5.0 11.1 18.9 116.1
5 6 Phoenix Suns* 72 242.8 43.3 ... 7.2 4.3 12.5 19.1 115.3
6 7 Indiana Pacers 72 242.4 43.3 ... 8.5 6.4 13.5 20.2 115.3
7 8 Denver Nuggets* 72 242.8 43.3 ... 8.1 4.5 13.5 19.1 115.1
8 9 New Orleans Pelicans 72 242.1 42.5 ... 7.6 4.4 14.6 18.0 114.6
9 10 Los Angeles Clippers* 72 240.0 41.8 ... 7.1 4.1 13.2 19.2 114.0
10 11 Atlanta Hawks* 72 241.7 40.8 ... 7.0 4.8 13.2 19.3 113.7
11 12 Sacramento Kings 72 240.3 42.6 ... 7.5 5.0 13.4 19.4 113.7
12 13 Golden State Warriors 72 240.3 41.3 ... 8.2 4.8 15.0 21.2 113.7
13 14 Philadelphia 76ers* 72 242.1 41.4 ... 9.1 6.2 14.4 20.2 113.6
14 15 Memphis Grizzlies* 72 241.7 42.8 ... 9.1 5.1 13.3 18.7 113.3
15 16 Boston Celtics* 72 241.4 41.5 ... 7.7 5.3 14.1 20.4 112.6
16 17 Dallas Mavericks* 72 240.3 41.1 ... 6.3 4.3 12.1 19.4 112.4
17 18 Minnesota Timberwolves 72 241.7 40.7 ... 8.8 5.5 14.3 20.9 112.1
18 19 Toronto Raptors 72 240.3 39.7 ... 8.6 5.4 13.2 21.2 111.3
19 20 San Antonio Spurs 72 242.8 41.9 ... 7.0 5.1 11.4 18.0 111.1
20 21 Chicago Bulls 72 241.4 42.2 ... 6.7 4.2 15.1 18.9 110.7
21 22 Los Angeles Lakers* 72 242.4 40.6 ... 7.8 5.4 15.2 19.1 109.5
22 23 Charlotte Hornets 72 241.0 39.9 ... 7.8 4.8 14.8 18.0 109.5
23 24 Houston Rockets 72 240.3 39.3 ... 7.6 5.0 14.7 19.5 108.8
24 25 Miami Heat* 72 241.4 39.2 ... 7.9 4.0 14.1 18.9 108.1
25 26 New York Knicks* 72 242.1 39.4 ... 7.0 5.1 12.9 20.5 107.0
26 27 Detroit Pistons 72 242.1 38.7 ... 7.4 5.2 14.9 20.5 106.6
27 28 Oklahoma City Thunder 72 241.0 38.8 ... 7.0 4.4 16.1 18.1 105.0
28 29 Orlando Magic 72 240.7 38.3 ... 6.9 4.4 12.8 17.2 104.0
29 30 Cleveland Cavaliers 72 242.1 38.6 ... 7.8 4.5 15.5 18.2 103.8
[30 rows x 25 columns]
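One difference from the read_html output is worth noting: Rk prints as 1 here rather than 1.0 because the manual approach stores every cell as text, so nothing has been converted to a number yet. The sketch below repeats the three steps on a tiny inline table (a cut-down, hand-written stand-in for the real one, keeping only four of its columns) and then converts the numeric columns with pd.to_numeric. Which columns to convert is an assumption you would adapt to the full table.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written, cut-down stand-in for the real "per_game-team" table.
html = """
<table id="per_game-team">
  <thead><tr><th>Rk</th><th>Team</th><th>G</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Milwaukee Bucks*</td><td>72</td><td>120.1</td></tr>
    <tr><th>2</th><td>Brooklyn Nets*</td><td>72</td><td>118.6</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'per_game-team'})

# Header names from <thead>, then one list of cell texts per body row.
headers = [th.text for th in table.find('thead').find_all('th')]
data = []
for row in table.find('tbody').find_all('tr'):
    rank = [row.find('th').text]
    data.append(rank + [td.text for td in row.find_all('td')])

df = pd.DataFrame(data, columns=headers)

# Every value is still a string at this point, e.g. '120.1' rather than 120.1.
numeric_cols = ['Rk', 'G', 'PTS']  # assumed list; extend it for the full table
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

After the conversion, sorting and arithmetic on those columns behave numerically instead of alphabetically.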