【Question title】: Trying to Scrape Data from Pro Football Reference
【Posted】: 2021-03-27 03:34:39
【Question description】:

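By way of preface, I have very little experience with Python. I'm trying to collect football data for my favorite NFL team, the New England Patriots. The page I want to scrape is https://www.pro-football-reference.com/teams/nwe/2020.htm, and the table I care about is Schedule & Game Results. My code does pull the data I want, but the resulting format is all wrong.

Any help would be greatly appreciated.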

import requests
import lxml.html as lh
import pandas as pd
import argparse
import re
import os
from bs4 import BeautifulSoup
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

twenty_twenty = []

link = "https://www.pro-football-reference.com/teams/nwe/2020.htm"
session = requests.Session()  # create the session object used for the request
r = session.get(link)
soup = BeautifulSoup(r.text,'html.parser')
table_all = soup.find_all('div',{"class":"overthrow table_container"})
tbody = table_all[1].table.tbody
trs = tbody.find_all('tr')
week_dict = {}
for tr in trs:
    stat = str(tr.find('th')) #['data-stat'])     
    val = str(tr.find('th').getText())            
    
    week_dict.update({stat:val})
    tds = tr.find_all('td')
    for td in tds:
        stat = str((td)['data-stat'])              
        val = str((td).getText())                   
        if stat == 'team_record':
            record = (val.split('-'))
            wins = record[0]
            losses = record[-1]
            week_dict.update({'wins_to_date':wins,'losses_to_date':losses})
        if stat == 'game_location':
            if val == '@': 
                week_dict.update({'home':0})
            else:
                week_dict.update({'home':1})
        if stat == 'overtime':
            if val == 'OT':
                week_dict.update({'OT':1})
            else: 
                week_dict.update({'OT':0})
        week_dict.update({stat:val})
    twenty_twenty.append(week_dict) 
    print("Patriots" + " " + "Year 2020" + " " + "stats added.")  
df2020 = pd.DataFrame(twenty_twenty)
df2020.head(16)  
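
A note on the loop above: `week_dict` is created once, before the loop, and the same object is appended on every pass, so every entry in `twenty_twenty` refers to one dict holding the last row's values. In addition, `str(tr.find('th'))` uses the tag's full HTML as the key; `tr.find('th')['data-stat']` would give the short stat name, assuming the `th` carries a `data-stat` attribute like the `td` cells do. The sketch below reproduces the shared-dict effect with made-up stand-in rows rather than the scraped page, so the names `rows`, `shared`, and `separate` are illustrative only.

```python
import pandas as pd

# Made-up stand-in rows; in the real script these values come from each <tr>.
rows = [
    {"week_num": "1", "opp": "Miami Dolphins"},
    {"week_num": "2", "opp": "Seattle Seahawks"},
]

# Shared dict: every list entry is the same object, so all rows look identical.
shared = {}
aliased = []
for row in rows:
    shared.update(row)
    aliased.append(shared)

# Fresh dict per row: each list entry keeps its own values.
separate = []
for row in rows:
    week_dict = {}
    week_dict.update(row)
    separate.append(week_dict)

df_aliased = pd.DataFrame(aliased)    # both rows show week 2
df_separate = pd.DataFrame(separate)  # rows show week 1 and week 2
```

Moving `week_dict = {}` inside the `for tr in trs:` loop is the equivalent change in the original script.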

【Question discussion】:

Tags: python pandas for-loop


【Solution 1】:

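That's a lot of work to pull one table. Pandas can parse HTML tables directly with `read_html` (it uses lxml or BeautifulSoup under the hood). Once the table is in a dataframe, you can aggregate it however you like:

Code: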

import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2020.htm'
df = pd.read_html(url)[1]

Output:

print (df.to_string())
   Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0    Unnamed: 9_level_0 Score       Offense                           Defense                           Expected Points                
                 Week                Day               Date Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1                 OT                Rec Unnamed: 8_level_1                   Opp    Tm   Opp    1stD  TotYd  PassY  RushY   TO    1stD  TotYd  PassY  RushY   TO         Offense Defense Sp. Tms
0                   1                Sun       September 13          1:00PM ET           boxscore                  W                NaN                1-0                NaN        Miami Dolphins  21.0  11.0    29.0  357.0  140.0  217.0  1.0    20.0  269.0  182.0   87.0  3.0           14.11    2.20   -5.64
1                   2                Sun       September 20          8:20PM ET           boxscore                  L                NaN                1-1                  @      Seattle Seahawks  30.0  35.0    29.0  464.0  397.0   67.0  1.0    22.0  429.0  275.0  154.0  1.0           19.82  -14.54   -6.98
2                   3                Sun       September 27          1:00PM ET           boxscore                  W                NaN                2-1                NaN     Las Vegas Raiders  36.0  20.0    25.0  406.0  156.0  250.0  1.0    22.0  375.0  249.0  126.0  3.0           11.03   -0.71    5.81
3                   4                Mon          October 5          7:05PM ET           boxscore                  L                NaN                2-2                  @    Kansas City Chiefs  10.0  26.0    21.0  357.0  172.0  185.0  4.0    19.0  323.0  229.0   94.0  1.0          -10.82   -5.72    3.78
4                   5                NaN                NaN                NaN                NaN                NaN                NaN                NaN                NaN              Bye Week   NaN   NaN     NaN    NaN    NaN    NaN  NaN     NaN    NaN    NaN    NaN  NaN             NaN     NaN     NaN
5                   6                Sun         October 18          1:00PM ET           boxscore                  L                NaN                2-3                NaN        Denver Broncos  12.0  18.0    14.0  288.0  171.0  117.0  3.0    15.0  299.0  164.0  135.0  2.0          -13.30    6.67    1.59
6                   7                Sun         October 25          4:25PM ET           boxscore                  L                NaN                2-4                NaN   San Francisco 49ers   6.0  33.0    17.0  241.0  147.0   94.0  4.0    26.0  467.0  270.0  197.0  2.0           -9.49  -22.44    6.55
7                   8                Sun         November 1          1:00PM ET           boxscore                  L                NaN                2-5                  @         Buffalo Bills  21.0  24.0    20.0  349.0  161.0  188.0  1.0    22.0  339.0  149.0  190.0  1.0            8.26   -8.85   -0.85
8                   9                Mon         November 9          8:15PM ET           boxscore                  W                NaN                3-5                  @         New York Jets  30.0  27.0    30.0  433.0  274.0  159.0  NaN    18.0  322.0  257.0   65.0  1.0           18.70  -18.71    2.16
9                  10                Sun        November 15          8:20PM ET           boxscore                  W                NaN                4-5                NaN      Baltimore Ravens  23.0  17.0    25.0  308.0  135.0  173.0  NaN    19.0  357.0  242.0  115.0  1.0           13.95   -3.45   -1.86
10                 11                Sun        November 22          1:00PM ET           boxscore                  L                NaN                4-6                  @        Houston Texans  20.0  27.0    22.0  435.0  349.0   86.0  NaN    21.0  399.0  344.0   55.0  NaN           13.67  -15.69   -0.27
11                 12                Sun        November 29          1:00PM ET           boxscore                  W                NaN                5-6                NaN     Arizona Cardinals  20.0  17.0    16.0  179.0   69.0  110.0  2.0    23.0  298.0  160.0  138.0  1.0           -4.68    0.00    9.06
12                 13                Sun         December 6          4:25PM ET           boxscore                  W                NaN                6-6                  @  Los Angeles Chargers  45.0   0.0    22.0  291.0  126.0  165.0  NaN    17.0  258.0  188.0   70.0  2.0            9.66   16.13   22.19
13                 14                Thu        December 10          8:20PM ET           boxscore                  L                NaN                6-7                  @      Los Angeles Rams   3.0  24.0    10.0  220.0  113.0  107.0  1.0    17.0  318.0  132.0  186.0  1.0          -29.77   -1.19    8.58
14                 15                Sun        December 20          1:00PM ET            preview                NaN                NaN                NaN                  @        Miami Dolphins   NaN   NaN     NaN    NaN    NaN    NaN  NaN     NaN    NaN    NaN    NaN  NaN             NaN     NaN     NaN
15                 16                Mon        December 28          8:15PM ET            preview                NaN                NaN                NaN                NaN         Buffalo Bills   NaN   NaN     NaN    NaN    NaN    NaN  NaN     NaN    NaN    NaN    NaN  NaN             NaN     NaN     NaN
16                 17                Sun          January 3          1:00PM ET            preview                NaN                NaN                NaN                NaN         New York Jets   NaN   NaN     NaN    NaN    NaN    NaN  NaN     NaN    NaN    NaN    NaN  NaN             NaN     NaN     NaN
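
The output above has a two-level header, with pandas filling in `Unnamed: ...` placeholders where the page leaves a header cell blank. One possible way to collapse it into single-level names is sketched below on a small hand-built frame that mimics a few of those columns (the values are taken from the week 2 row). The `flatten` helper is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Small stand-in for the two-level header that read_html produced above.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Week"),
    ("Unnamed: 8_level_0", "Unnamed: 8_level_1"),
    ("Unnamed: 9_level_0", "Opp"),
    ("Score", "Tm"),
    ("Score", "Opp"),
])
df = pd.DataFrame([[2, "@", "Seattle Seahawks", 30.0, 35.0]], columns=columns)

def flatten(top, bottom):
    """Join both header levels, skipping pandas' 'Unnamed: ...' placeholders."""
    parts = [str(p) for p in (top, bottom) if not str(p).startswith("Unnamed:")]
    # If both levels are placeholders, keep the lower one so names stay unique.
    return "_".join(parts) if parts else str(bottom)

df.columns = [flatten(top, bottom) for top, bottom in df.columns]
print(df.columns.tolist())
# ['Week', 'Unnamed: 8_level_1', 'Opp', 'Score_Tm', 'Score_Opp']
```

Columns that are unnamed on both levels (the game-location column, for example) keep their placeholder and would still need a manual rename such as `df.rename(columns={"Unnamed: 8_level_1": "game_location"})`.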

Code 2:

To get all of the teams into one dataframe:

import pandas as pd

team_abbrev = ['list','of','teams','...']

year = 2020
list_of_dataframes = []
for team in team_abbrev: 
    # Full URL with scheme and a "/" after "teams", matching the single-team URL above
    link = "https://www.pro-football-reference.com/teams/" + team + "/" + str(year) + ".htm"
    df = pd.read_html(link)[1]
    df['team'] = team
    list_of_dataframes.append(df)
    
final_df = pd.concat(list_of_dataframes).reset_index(drop=True)
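
The same tag-then-concatenate pattern can be tried without hitting the site. The sketch below swaps the `pd.read_html` call for two made-up frames and uses a hypothetical `build_url` helper that follows the form of the single-team URL in the first example. The `"mia"` abbreviation and the numbers in `fake_tables` are assumptions for illustration only.

```python
import pandas as pd

BASE = "https://www.pro-football-reference.com/teams"

def build_url(team, year):
    """Hypothetical helper: assemble a team-season URL like the nwe/2020 one."""
    return f"{BASE}/{team}/{year}.htm"

# Made-up per-team frames standing in for pd.read_html(build_url(team, year))[1].
fake_tables = {
    "nwe": pd.DataFrame({"Week": [1, 2], "Tm": [21.0, 30.0]}),
    "mia": pd.DataFrame({"Week": [1, 2], "Tm": [11.0, 28.0]}),
}

frames = []
for team, table in fake_tables.items():
    df = table.copy()
    df["team"] = team                     # tag every row with its team
    df["source"] = build_url(team, 2020)  # record where the rows would come from
    frames.append(df)

# Stack the per-team frames and renumber the rows from 0.
final_df = pd.concat(frames).reset_index(drop=True)
print(final_df)
```

Collecting the frames in a list and calling `pd.concat` once at the end avoids the `update`/`append` bookkeeping on dicts entirely.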

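【Discussion】:

  • Thanks for the help with the table! ' week_dict = {} twenty_twenty = [] year = 2020 df2020 = pd.DataFrame(twenty_twenty) for team in team_abbrev: link = "pro-football-reference.com/teams" + team + "/" + str(year) + ".htm" session = requests.Session() r = session.get(link) df = pd.read_html(link)[1] df_s = (df.to_string()) twenty_twenty.update(df) twenty_twenty.append(week_dict) df2020 = pd.DataFrame(twenty_twenty) ' Now I'm trying to use a for loop to get each team's data, and I'm struggling.
  • I was able to get it to loop through every team and fetch the table, but I can't figure out the syntax for accumulating each team's full-season data. Essentially, I want to build one master dataset containing all 32 teams with their schedules and results. I thought the update and append functions would do it, but they didn't work the way I expected.
  • Thanks for your help! The code you provided helped me tremendously! I've now run into some new issues and need further clarification.
  • So you're able to get each team's table, but you want them all in one table?
  • @football_2020, I added to the code. See if that works for you.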