How to parse NHL Team Defense stats to create a Pandas DataFrame using Python?

【Question title】: How to Parse NHL Team Defense stats to create Pandas DataFrame using Python?
【Posted】: 2020-02-24 09:18:59
【Question】:

I have scraped the data but need help parsing it correctly. I'm still learning and would appreciate any advice I can get.

I'm looking for the data for these two variables: TEAM, SA/G

Here is my code so far:


#import modules
from selenium import webdriver

from bs4 import BeautifulSoup

#set path for driver
driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')  # raw string keeps the backslashes literal

# open page
driver.get('http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals')

# driver.page_source
soup = BeautifulSoup(driver.page_source,'lxml')

#close driver
driver.close()

#grab table data
table = soup.find(class_='tablehead')

#parse data (extra data included)
for t in table:
    td_tags = table.find_all('td')
    # print(td_tags)
    for td in td_tags:
        a_tags = table.find('a')
        print(td.text)

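I have scraped the right data, but it comes with extra information that I could use help parsing out. Any suggestions on how to get just the TEAM and SA/G data?

Here is an example of the Pandas DataFrame output I'm looking for: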

Team             SA/G
Nashville        30.1
Colorado         33.6
Washington       31.0

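Thanks in advance for any help you can offer!

Code update:

The first attempt only retrieved the team information and still included extra data (e.g. "GP").

First attempt at fixing the code: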

# parse data (closer to desired output but missing SA/G data)
for tab in table:
    tr = table.find_all('tr')
    for t in tr:
        td = table.find_all('td')
        print(t.a.text)

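The second attempt retrieved the team data and SA/G, but it also has extra data (e.g. the "TEAM" and "SA/G" header text repeated every 11 rows).

Here is the second attempt: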

#parses TEAM and SA/G
import pandas as pd
x = pd.read_html("http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals")[0]

print(x[[1, 9]])

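【Question discussion】:

You should first collect all the rows, then get the values of the td tags in each row. The first row should be the header. Not all columns contain a tag, so take care of that before extracting the values.
Thank you @satyam soni for getting back to me! I did try your suggestion and I believe it helped. However, there is still some extra data I'd like to parse out. Do you have any suggestions for that? Thank you, sir!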
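
The row-by-row approach suggested in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the ESPN table (the real page has 15 columns), `parse_team_stats` is a made-up helper name, and it assumes the first row of the table is the header and that the header row is repeated further down. With Selenium, `driver.page_source` would be passed in instead of the sample string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the ESPN table (class "tablehead").
# The header row is repeated, as it is on the real page.
SAMPLE_HTML = """
<table class="tablehead">
  <tr><td>RK</td><td>TEAM</td><td>GP</td><td>SA/G</td></tr>
  <tr><td>1</td><td><a href="#">Nashville</a></td><td>11</td><td>30.1</td></tr>
  <tr><td>2</td><td><a href="#">Colorado</a></td><td>11</td><td>33.6</td></tr>
  <tr><td>RK</td><td>TEAM</td><td>GP</td><td>SA/G</td></tr>
  <tr><td>3</td><td><a href="#">Washington</a></td><td>13</td><td>31.0</td></tr>
</table>
"""


def parse_team_stats(html):
    """Return a DataFrame with the TEAM and SA/G columns of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(class_="tablehead")

    # Collect every row as a list of cell texts; get_text() also covers
    # cells whose text sits inside a nested <a> tag.
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # drop rows without <td> cells

    header = rows[0]  # assumption: the first row holds the column names
    team_idx = header.index("TEAM")
    sag_idx = header.index("SA/G")

    # Keep data rows only: same width as the header, and not a repeated header.
    records = [
        (row[team_idx], float(row[sag_idx]))
        for row in rows[1:]
        if len(row) == len(header) and row != header
    ]
    return pd.DataFrame(records, columns=["TEAM", "SA/G"])


df = parse_team_stats(SAMPLE_HTML)
print(df)
```

Looking the columns up by header name rather than by position keeps the sketch independent of how many columns the real table has.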

Tags: python pandas dataframe parsing web-scraping

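【Solution 1】:

If you want to read a table from a URL, I would use pandas' read_html function. Under the hood, pandas parses the web page for you (it tries lxml first and falls back to bs4 with html5lib). You can see an example below: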

In [3]: import pandas as pd 
In [4]: pd.read_html("http://www.espn.com/nhl/statistics/team/_/stat/scoring/sort/avgGoals")[0]
Out[4]:
     0             1   2   3   4     5     6      7     8     9      10     11   12    13    14
0    RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
1     1     Nashville  11  45  33  4.09  3.00   1.09  31.9  30.1   01.8   .900   87   109   -22
2     2      Colorado  11  44  30  4.00  2.73   1.27  31.4  33.6  -02.3   .919  102   140   -38
3     3    Washington  13  49  43  3.77  3.31   0.46  30.3  31.0  -00.7   .893  125   111    14
4     4     Vancouver  11  40  26  3.64  2.36   1.27  32.6  31.3   01.4   .924  103   119   -16
5   NaN      Montreal  11  40  35  3.64  3.18   0.45  34.4  31.1   03.3   .898   77    83    -6
6     6       Toronto  13  46  44  3.54  3.38   0.15  32.7  32.8  -00.1   .897   88    82     6
7     7       Florida  12  42  45  3.50  3.75  -0.25  34.0  30.0   04.0   .875   78    86    -8
8   NaN  Philadelphia  10  35  30  3.50  3.00   0.50  35.4  27.4   08.0   .891   78    90   -12
9     9       Buffalo  13  43  32  3.31  2.46   0.85  30.2  33.5  -03.2   .926  100   118   -18
10   10     Tampa Bay  10  33  32  3.30  3.20   0.10  31.4  34.5  -03.1   .907  100    88    12
11   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
12   11        Boston  11  36  23  3.27  2.09   1.18  33.3  31.5   01.7   .934   82    80     2
13  NaN      Carolina  11  36  29  3.27  2.64   0.64  32.9  29.4   03.5   .910   97    87    10
14   13    Pittsburgh  12  39  30  3.25  2.50   0.75  31.9  29.8   02.1   .916   82    84    -2
15   14    NY Rangers   9  29  34  3.22  3.78  -0.56  28.2  36.9  -08.7   .898   90    82     8
16   15     St. Louis  12  37  38  3.08  3.17  -0.08  29.0  30.3  -01.3   .895   87    91    -4
17   16         Vegas  13  40  36  3.08  2.77   0.31  35.3  32.7   02.6   .915  143   143     0
18   17      Edmonton  12  36  32  3.00  2.67   0.33  27.9  30.6  -02.7   .913   80    74     6
19  NaN       Arizona  11  33  24  3.00  2.18   0.82  31.5  29.8   01.6   .927   68    74    -6
20  NaN  NY Islanders  11  33  27  3.00  2.45   0.55  27.6  31.5  -03.8   .922   95    67    28
21   20      Columbus  11  30  39  2.73  3.55  -0.82  33.6  31.1   02.5   .886   75    81    -6
22   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
23   21        Ottawa  11  29  36  2.64  3.27  -0.64  31.1  35.0  -03.9   .906  134   110    24
24   22       Calgary  13  34  39  2.62  3.00  -0.38  30.9  31.2  -00.3   .904  147   122    25
25   23      San Jose  12  31  43  2.58  3.58  -1.00  28.3  31.8  -03.4   .887  128   124     4
26  NaN   Los Angeles  12  31  49  2.58  4.08  -1.50  37.3  28.3   08.9   .856  102   116   -14
27   25      Winnipeg  12  30  37  2.50  3.08  -0.58  33.2  33.3  -00.1   .907   52    88   -36
28  NaN       Chicago  10  25  30  2.50  3.00  -0.50  31.6  32.9  -01.3   .909   66    68    -2
29   27       Anaheim  13  32  31  2.46  2.38   0.08  27.5  31.5  -04.0   .924  131    99    32
30   28    New Jersey   9  22  34  2.44  3.78  -1.33  29.3  29.0   00.3   .870   99    93     6
31   29     Minnesota  11  26  37  2.36  3.36  -1.00  29.5  30.4  -00.8   .889   87    93    -6
32   30       Detroit  12  27  45  2.25  3.75  -1.50  31.5  33.2  -01.7   .887  105    96     9
33   RK          TEAM  GP   G  GA  GF/G  GA/G   DIFF  SF/G  SA/G   DIFF  SVPCT  PIM  PIMA  DIFF
34   31        Dallas  13  25  35  1.92  2.69  -0.77  27.8  28.8  -01.1   .907   89    79    10
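
The output above still contains the header row repeated every 11 rows, and the columns are labelled 0 to 14. One possible way to reduce it to the two columns asked for is sketched below. This is not part of the original answer: `tidy` is a hypothetical helper name, and `raw` is a small hand-built stand-in for the frame that `pd.read_html(...)[0]` returns (four columns instead of fifteen), so the example runs without network access.

```python
import pandas as pd

# Stand-in for pd.read_html(url)[0]: integer column labels, every cell a
# string, and the header row repeated inside the data.
raw = pd.DataFrame(
    [
        ["RK", "TEAM", "GP", "SA/G"],
        ["1", "Nashville", "11", "30.1"],
        ["2", "Colorado", "11", "33.6"],
        ["RK", "TEAM", "GP", "SA/G"],
        ["3", "Washington", "13", "31.0"],
    ]
)


def tidy(frame, team_col, sag_col):
    """Keep only the team and SA/G columns and drop repeated header rows."""
    out = frame[[team_col, sag_col]]
    # Header rows are the ones whose team cell is the literal text "TEAM".
    out = out[out[team_col] != "TEAM"].copy()
    out.columns = ["Team", "SA/G"]
    # The cells arrive as strings; convert SA/G to numbers.
    out["SA/G"] = pd.to_numeric(out["SA/G"])
    return out.reset_index(drop=True)


tidy_df = tidy(raw, team_col=1, sag_col=3)
print(tidy_df)
```

For the full table shown above, the equivalent call would be `tidy(x, team_col=1, sag_col=9)`, since TEAM and SA/G sit in columns 1 and 9 there.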

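【Discussion】:

Thank you @mmngreco for getting back to me! I did learn something new; however, there is still some extra data I'd like to parse out. Thanks to your suggestion, I've updated my code. Please let me know if there are any additional improvements I can make. Thank you, sir!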