【问题标题】:Python Beautiful soup get correct column headers for each tablePython Beautiful soup 为每个表获取正确的列标题
【发布时间】:2021-12-31 23:24:27
【问题描述】:

以下代码获取玩家数据,但每个数据集都不同。它看到的第一个数据是四分卫数据,因此它将这些列用于所有未来的数据。如何更改标头,以便对于它遇到的每个不同数据集,正确的标头与正确的数据一起使用?

import pandas as pd
import csv
from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'# Create object page
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = soup.select("table.mod-data tr")
#rows = soup.find_all("table.mod-data tr")
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("th")]

data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
        for row in rows[1:]]

df = pd.DataFrame(data)
df.to_csv('_Data_{}.csv'.format(pd.datetime.now().strftime("%Y-%m-%d %H%M%S")),index=False)

# see what the data looks like at this point
pprint(data)

【问题讨论】:

  • 我无法完全理解这个问题,四分卫数据有什么问题?是不是只显示了四分卫的数据,但我们想要其他球员的数据?正确的标题和正确的数据是什么意思?
  • 你是说表格都有不同的列名并且在收集所有玩家数据时使用第一个?因此,您想要的是每个具有适当列的不同表,而不是具有第一个表的列的一个大表?
  • 你考虑过使用api吗? https://site.api.espn.com/apis/site/v2/sports/football/nfl/summary?event=401326313

标签: python python-3.x beautifulsoup


【解决方案1】:

这是我的尝试。有几点需要注意。我没有打印到 CSV,只是向您显示带有正确标题信息的数据帧,您可以稍后处理 CSV 输出。

运行程序后按回车键可以查看下一个带有不同标题的表格。

import pandas as pd
import csv
from pprint import pprint

from bs4 import BeautifulSoup
import requests


#Made this function because the rows we want for different headers are different.
def get_individual_table(headers, rows, start,end):
    '''for header in headers:
        print(header)'''


    data = [dict(zip(headers, [cell.get_text(strip=True).encode("utf-8") for cell in row.find_all("td")]))
            for row in rows[start+1:end]]

    df = pd.DataFrame(data)
    #print(df)
    return df

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'# Create object page
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = soup.select("table.mod-data tr")

row_team_names = soup.select("div.team-name")

#rows = soup.find_all("table.mod-data tr")


#Only some of the row indicies have correct header information, we find them here.
correct_row_indices = []
for idx, row in enumerate(rows):
    #print(idx, row.find_all("th"))
    if len(row.find_all("th"))>0:
        correct_row_indices.append(idx)


print(correct_row_indices)
headers = [header.get_text(strip=True).encode("utf-8") for header in rows[0].find_all("th")]

team_names =[]

for name in row_team_names:
    team_names.append(name.text)
    

for i in range(len(correct_row_indices)-1):
    headers = [header.get_text(strip=True).encode("utf-8") for header in rows[correct_row_indices[i]].find_all("th")]
    #print(headers)
    

    #Use the function to get the tables individually
    df = get_individual_table(headers, rows, correct_row_indices[i], correct_row_indices[i+1])
    if not df.empty:
        df["Team Name"] = team_names[i]

    #Press enter to see each individual table.
    print(df)
    input()
    df.to_csv('_Data_{}.csv'.format(pd.datetime.now().strftime("%Y-%m-%d %H%M%S")),index=False)
    
    






#df.to_csv('_Data_{}.csv'.format(pd.datetime.now().strftime("%Y-%m-%d %H%M%S")),index=False)

# see what the data looks like at this point
#pprint(data)

这是输出的一部分:

如您所见,我们根据需要获取具有不同标头的每个不同 DataFrame。观察每个 DataFrame 如何与 ESPN 网站上的表格匹配:

您可以选择稍后在循环逻辑中单独处理它们的方式。 并未显示所有表格。初次运行程序后请一直按回车键,该页面上的所有表格信息最终都会显示出来。

编辑: 我已更改代码以为每个表输出单独的 CSV 文件。

我显示了上面输出的前两个表。

【讨论】:

  • @RicrardKYu 您提供的代码显示数据。我更改了几行以将数据输出到 csv 文件,但创建的 csv 文件为空。这是我对代码的代码更改 data = get_individual_table(headers, rows, correct_row_indices[i], correct_row_indices[i+1]) df = pd.DataFrame(data) df.to_csv('_Data_{}.csv'.format( pd.datetime.now().strftime("%Y-%m-%d %H%M%S")),index=False)
  • 我明白了,您在导出为 CSV 时是否需要任何帮助,或者到目前为止还可以吗。
  • 是的,理查德,我需要一种可以将其插入数据库的格式的数据
  • 每次按下回车后,我都更新了代码以生成 CSV 文件。如果是多个CSV文件可以吗?
  • 是的@Richard 它必须是多个文件,因为这些表都有不同的模式。有没有办法将表名“迈阿密接收”添加为列中所有行的值?我刚刚意识到,一旦我有数百场比赛,我将不知道谁在哪个团队中。
【解决方案2】:

如前所述,预期结果并不那么清楚,但如果您只想阅读表格,请使用 pandas.read_html 来实现您的目标 - index_col=0 避免将没有标题的第一列命名为 Unnamed_0 em>。

pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313',index_col=0)

示例

import pandas as pd

for table in pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313',index_col=0):
    pd.DataFrame(table).to_csv('_Data_{}.csv'.format(pd.datetime.today().strftime("%Y-%m-%d %H%M%S.%f")))

您也可以reset_index() 并使用to_csv(index=False)

pd.DataFrame(table).rename_axis('').reset_index().to_csv('_Data_{}.csv'.format(pd.datetime.today().strftime("%Y-%m-%d %H%M%S.%f")),index=False)

编辑

在表格中使用标题并将结果存储在命名的 csv 文件中:

import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'
soup = BeautifulSoup(requests.get(url).content, "html.parser")


for table in soup.select('article.boxscore-tabs table'):
    caption = '_'.join(table.parent.select_one('.table-caption').text.split(' '))
    df = pd.read_html(table.prettify(),index_col=0)[0].rename_axis('').reset_index()
    df.insert(0, 'caption', caption)
    df.to_csv(f'_DATA_{caption}_{pd.datetime.now().strftime("%Y-%m-%d %H%M%S")}.csv',index=False)

csv 文件的输出

caption,,C/ATT,YDS,AVG,TD,INT,SACKS,QBR,RTG
Miami_Passing,Tua Tagovailoa  T. Tagovailoa,16/27,202,7.5,1,1,2-17,47.5,79.6
Miami_Passing,TEAM,16/27,185,7.5,1,1,2-17,--,79.6

caption,,C/ATT,YDS,AVG,TD,INT,SACKS,QBR,RTG
New_England_Passing,Mac Jones  M. Jones,29/39,281,7.2,1,0,1-13,76.9,102.6
New_England_Passing,TEAM,29/39,268,7.2,1,0,1-13,--,102.6

caption,,CAR,YDS,AVG,TD,LONG
Miami_Rushing,Myles Gaskin  M. Gaskin,9,49,5.4,0,15
Miami_Rushing,Malcolm Brown  M. Brown,5,16,3.2,0,5
Miami_Rushing,Jacoby Brissett  J. Brissett,2,4,2.0,0,2
Miami_Rushing,Salvon Ahmed  S. Ahmed,3,4,1.3,0,8
Miami_Rushing,Tua Tagovailoa  T. Tagovailoa,4,1,0.3,1,3
Miami_Rushing,TEAM,23,74,3.2,1,15

caption,,CAR,YDS,AVG,TD,LONG
New_England_Rushing,Damien Harris  D. Harris,23,100,4.3,0,35
New_England_Rushing,James White  J. White,4,12,3.0,0,10
New_England_Rushing,Jonnu Smith  J. Smith,1,6,6.0,0,6
New_England_Rushing,Brandon Bolden  B. Bolden,1,5,5.0,0,5
New_England_Rushing,Rhamondre Stevenson  R. Stevenson,1,2,2.0,0,2
New_England_Rushing,TEAM,30,125,4.2,0,35

caption,,REC,YDS,AVG,TD,LONG,TGTS
Miami_Receiving,DeVante Parker  D. Parker,4,81,20.3,0,30,7
Miami_Receiving,Jaylen Waddle  J. Waddle,4,61,15.3,1,36,5
Miami_Receiving,Myles Gaskin  M. Gaskin,5,27,5.4,0,12,5
Miami_Receiving,Salvon Ahmed  S. Ahmed,2,24,12.0,0,18,3
Miami_Receiving,Durham Smythe  D. Smythe,1,9,9.0,0,9,2
Miami_Receiving,Albert Wilson  A. Wilson,0,0,0.0,0,0,2
Miami_Receiving,Mike Gesicki  M. Gesicki,0,0,0.0,0,0,3
Miami_Receiving,TEAM,16,202,12.6,1,36,27

caption,,REC,YDS,AVG,TD,LONG,TGTS
New_England_Receiving,Nelson Agholor  N. Agholor,5,72,14.4,1,25,7
New_England_Receiving,James White  J. White,6,49,8.2,0,26,7
New_England_Receiving,Jakobi Meyers  J. Meyers,6,44,7.3,0,22,9
New_England_Receiving,Jonnu Smith  J. Smith,5,42,8.4,0,11,5
New_England_Receiving,Hunter Henry  H. Henry,3,31,10.3,0,16,3
New_England_Receiving,Kendrick Bourne  K. Bourne,1,17,17.0,0,17,3
New_England_Receiving,Damien Harris  D. Harris,2,17,8.5,0,9,3
New_England_Receiving,Rhamondre Stevenson  R. Stevenson,1,9,9.0,0,9,1
New_England_Receiving,TEAM,29,281,9.7,1,26,38

caption,,FUM,LOST,REC
Miami_Fumbles,Xavien Howard  X. Howard,0,0,1
Miami_Fumbles,Zach Sieler  Z. Sieler,0,0,1
Miami_Fumbles,TEAM,0,0,2

caption,,FUM,LOST,REC
New_England_Fumbles,David Andrews  D. Andrews,0,0,1
New_England_Fumbles,Damien Harris  D. Harris,1,1,0
New_England_Fumbles,Rhamondre Stevenson  R. Stevenson,1,1,0
New_England_Fumbles,Jonnu Smith  J. Smith,1,0,1
New_England_Fumbles,Mac Jones  M. Jones,1,0,0
New_England_Fumbles,TEAM,4,2,2

caption,,tackles,tackles,tackles,tackles,misc,misc,misc,misc
,,TOT,SOLO,SACKS,TFL,PD,QB HTS,TD,Unnamed: 8_level_1
Miami_Defensive,Jerome Baker  J. Baker,12,9,0,0,0,0,0,
Miami_Defensive,Eric Rowe  E. Rowe,9,6,0,0,0,0,0,
Miami_Defensive,Byron Jones  B. Jones,6,5,0,0,1,0,0,
Miami_Defensive,Nik Needham  N. Needham,6,5,0,0,0,0,0,
Miami_Defensive,Sam Eguavoen  S. Eguavoen,6,2,0,0,0,3,0,
Miami_Defensive,Xavien Howard  X. Howard,5,4,0,0,0,0,0,
Miami_Defensive,Jason McCourty  J. McCourty,5,3,0,0,1,0,0,
Miami_Defensive,Brennan Scarlett  B. Scarlett,5,2,0,0,1,1,0,
Miami_Defensive,Andrew Van Ginkel  A. Van Ginkel,5,2,0,0,0,1,0,
Miami_Defensive,John Jenkins  J. Jenkins,4,4,0,0,0,0,0,
Miami_Defensive,Emmanuel Ogbah  E. Ogbah,3,3,0,1,1,1,0,
Miami_Defensive,Zach Sieler  Z. Sieler,3,2,0,1,0,0,0,
Miami_Defensive,Christian Wilkins  C. Wilkins,3,2,0,0,0,1,0,
Miami_Defensive,Elandon Roberts  E. Roberts,2,2,0,0,1,1,0,
Miami_Defensive,Jamal Perry  J. Perry,2,2,0,0,0,0,0,
Miami_Defensive,Brandon Jones  B. Jones,2,2,0,0,0,0,0,
Miami_Defensive,Jevon Holland  J. Holland,2,2,0,0,0,0,0,
Miami_Defensive,Adam Butler  A. Butler,2,1,0,0,0,0,0,
Miami_Defensive,Mack Hollins  M. Hollins,2,0,0,0,0,0,0,
Miami_Defensive,Team  Team,1,1,1,0,0,0,0,
Miami_Defensive,Mike Gesicki  M. Gesicki,1,1,0,0,0,0,0,
Miami_Defensive,Durham Smythe  D. Smythe,1,0,0,0,0,0,0,
Miami_Defensive,Jaelan Phillips  J. Phillips,0,0,0,0,0,1,0,
Miami_Defensive,TEAM,87,60,1,2,5,9,0,

caption,,tackles,tackles,tackles,tackles,misc,misc,misc,misc
,,TOT,SOLO,SACKS,TFL,PD,QB HTS,TD,Unnamed: 8_level_1
New_England_Defensive,Kyle Dugger  K. Dugger,7,6,0,1,0,0,0,
New_England_Defensive,Devin McCourty  D. McCourty,7,4,0,0,0,0,0,
New_England_Defensive,Ja'Whaun Bentley  J. Bentley,4,4,0,1,0,0,0,
New_England_Defensive,Matthew Judon  M. Judon,4,3,0,1,0,1,0,
New_England_Defensive,Lawrence Guy  L. Guy,4,2,0,0,0,1,0,
New_England_Defensive,Dont'a Hightower  D. Hightower,4,2,0,0,0,0,0,
New_England_Defensive,J.C. Jackson  J.C. Jackson,3,3,0,0,1,0,0,
New_England_Defensive,Kyle Van Noy  K. Van Noy,3,2,1,1,1,1,0,
New_England_Defensive,Adrian Phillips  A. Phillips,3,2,0,2,0,0,0,
New_England_Defensive,Davon Godchaux  D. Godchaux,3,2,0,0,0,0,0,
New_England_Defensive,Jalen Mills  J. Mills,2,2,0,0,1,0,0,
New_England_Defensive,Josh Uche  J. Uche,1,1,1,1,0,1,0,
New_England_Defensive,Carl Davis  C. Davis,1,1,0,0,0,0,0,
New_England_Defensive,Chase Winovich  C. Winovich,1,1,0,0,0,0,0,
New_England_Defensive,Joejuan Williams  J. Williams,1,1,0,0,0,0,0,
New_England_Defensive,Christian Barmore  C. Barmore,1,0,0,0,0,0,0,
New_England_Defensive,Jonathan Jones  J. Jones,0,0,0,0,1,0,0,
New_England_Defensive,TEAM,49,36,2,7,4,4,0,

caption,,INT,YDS,TD
Miami_Interceptions,No Miami Interceptions,,,

caption,,INT,YDS,TD
New_England_Interceptions,Jonathan Jones  J. Jones,1,0,0
New_England_Interceptions,TEAM,1,0,0

caption,,NO,YDS,AVG,LONG,TD
Miami_Kick_Returns,No Miami Kick Returns,,,,,

caption,,NO,YDS,AVG,LONG,TD
New_England_Kick_Returns,Brandon Bolden  B. Bolden,1,23,23.0,23,0
New_England_Kick_Returns,Gunner Olszewski  G. Olszewski,1,17,17.0,17,0
New_England_Kick_Returns,TEAM,2,40,20.0,23,0

caption,,NO,YDS,AVG,LONG,TD
Miami_Punt_Returns,Jakeem Grant Sr.  J. Grant Sr.,1,18,18.0,18,0
Miami_Punt_Returns,TEAM,1,18,18.0,18,0

caption,,NO,YDS,AVG,LONG,TD
New_England_Punt_Returns,Gunner Olszewski  G. Olszewski,3,20,6.7,14,0
New_England_Punt_Returns,TEAM,3,20,6.7,14,0

caption,,FG,PCT,LONG,XP,PTS
Miami_Kicking,Jason Sanders  J. Sanders,1/1,100.0,48,2/2,5
Miami_Kicking,TEAM,1/1,100.0,48,2/2,5

caption,,FG,PCT,LONG,XP,PTS
New_England_Kicking,Nick Folk  N. Folk,3/3,100.0,42,1/1,10
New_England_Kicking,TEAM,3/3,100.0,42,1/1,10

caption,,NO,YDS,AVG,TB,In 20,LONG
Miami_Punting,Michael Palardy  M. Palardy,4,180,45.0,1,0,52
Miami_Punting,TEAM,4,180,45.0,1,0,52

caption,,NO,YDS,AVG,TB,In 20,LONG
New_England_Punting,Jake Bailey  J. Bailey,2,99,49.5,1,0,62
New_England_Punting,TEAM,2,99,49.5,1,0,62

【讨论】:

  • 惊人的简短而有效的解决方案。我添加了这两行: data = pd.DataFrame(table) data.to_csv('_Data_{}.csv'.format(pd.datetime.now().strftime("%Y-%m-%d %H% M%S")),索引=假)。但是我注意到,当文件导出到 csv 时,播放器名称丢失了。结果是没有播放器的数据。
  • @LeoTorres - 以防万一您需要知道 csv 中存储的内容 - 添加了一个附加示例,该示例也适用于表格的标题来命名 csv 文件。
猜你喜欢
  • 1970-01-01
  • 2018-06-29
  • 1970-01-01
  • 2023-01-13
  • 2012-11-27
  • 1970-01-01
  • 2012-02-03
  • 1970-01-01
  • 2013-05-27
相关资源
最近更新 更多