【Question Title】: Trying to Capture the Table from Multiple Pages With For Loops
【Posted】: 2021-11-11 14:57:27
【Question】:

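Hi everyone.

I'm trying to grab the table from each page behind the links collected in "player_page". I want every player's per-game stats for that season, and the table I need is listed on each player's individual page. Every link I build is correct, but when I run the loop I can't capture the right information.

Any idea what I'm doing wrong here?

Any help is appreciated.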

from bs4 import BeautifulSoup
import requests
import pandas as pd

from numpy import sin


url = 'https://www.pro-football-reference.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
}
year = 2018

r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
soup = BeautifulSoup(r.content, 'lxml')


player_list = soup.find_all('td', attrs= {'class': 'left', 'data-stat': 'player'})
player_page = []
for player in player_list:
    for link in player.find_all('a', href= True):
        #names = str(link['href'])strip('')
        link = str(link['href'].strip('.htm'))
        player_page.append(url + link + '/gamelog' + '/' + str(year))



for page in player_page:
    dfs = pd.read_html(page)

yearly_stats = []
for df in dfs:
        yearly_stats.append(df)
final_stats = pd.concat(yearly_stats)
final_stats.to_excel('Fantasy2018.xlsx')

【Comments】:

    Tags: python-3.x pandas web-scraping


    【Solution 1】:

    This works. I believe the table columns vary with the player's position. For example, not every player has tackle information.
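To illustrate the point about position-dependent columns, here is a minimal offline sketch. The two frames and their column names (`Passing|Yds`, `Tackles|Solo`) are made up for illustration and are not scraped data. It shows that `pd.concat` aligns frames by column name and fills the gaps with NaN, so players with different stat columns can still be stacked into one table.

```python
import pandas as pd

# Hypothetical game logs: a quarterback has passing columns,
# a linebacker has tackle columns.
qb = pd.DataFrame({"Name": ["QB A", "QB A"], "Passing|Yds": [250, 310]})
lb = pd.DataFrame({"Name": ["LB B"], "Tackles|Solo": [7]})

# concat aligns on column names; a column missing from one frame
# is filled with NaN for that frame's rows.
combined = pd.concat([qb, lb], ignore_index=True)
print(combined)
```

The result keeps the union of all columns, which is why the final spreadsheet can end up wide and sparse.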

    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup
    import requests
    
    
    url = 'https://www.pro-football-reference.com'
    year = 2018
    
    r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
    soup = BeautifulSoup(r.content, 'lxml')
    
    
    player_list = soup.find_all('td', attrs= {'class': 'left', 'data-stat': 'player'})
    
    dfs = []
    for player in player_list:
        for link in player.find_all('a', href= True):
            name = link.getText()
            # Remove the '.htm' suffix; str.strip('.htm') strips characters, not a suffix
            link = str(link['href']).replace('.htm', '')
            try:
                df = pd.read_html(url + link + '/gamelog' + '/' + str(year))[0]
                for i, columns_old in enumerate(df.columns.levels):
                    columns_new = np.where(columns_old.str.contains('Unnamed'), '' , columns_old)
                    df.rename(columns=dict(zip(columns_old, columns_new)), level=i, inplace=True)
                df.columns = df.columns.map('|'.join).str.strip('|')
                df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
                df = df.dropna(subset=['Date'])
                df.insert(0,'Name',name)
                df.insert(1,'Moment','Regular Season')
                dfs.append(df)
            except Exception:
                # Skip players whose page has no usable regular-season table
                pass
            try:
                df1 = pd.read_html(url + link + '/gamelog' + '/' + str(year))[1]
                for i, columns_old in enumerate(df1.columns.levels):
                    columns_new = np.where(columns_old.str.contains('Unnamed'), '' , columns_old)
                    df1.rename(columns=dict(zip(columns_old, columns_new)), level=i, inplace=True)
                df1.columns = df1.columns.map('|'.join).str.strip('|')
                df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')
                df1 = df1.dropna(subset=['Date'])
                df1.insert(0,'Name',name)
                df1.insert(1,'Moment','Playoffs')
                dfs.append(df1)
            except Exception:
                # Skip players whose page has no usable playoff table
                pass
    
        
    
    dfall = pd.concat(dfs)
    dfall.to_excel('Fantasy2018.xlsx')
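For context on why the loop in the question did not produce the expected output: `dfs = pd.read_html(page)` is reassigned on every pass, so by the time the second loop runs only the last page's tables are left. The sketch below shows the accumulate-then-concat pattern without touching the network. `fake_read_html`, `flatten_columns`, and the page names are hypothetical stand-ins, and the header flattening uses `to_flat_index` as a simpler alternative to the `np.where`/`rename` approach above, not the same code.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Join a multi-level header into 'Top|Bottom', dropping 'Unnamed...' placeholders."""
    flat = []
    for labels in df.columns.to_flat_index():
        parts = [str(p) for p in labels if not str(p).startswith("Unnamed")]
        flat.append("|".join(parts))
    out = df.copy()
    out.columns = flat
    return out


def fake_read_html(page: str) -> list:
    """Hypothetical stand-in for pd.read_html(page); returns one table per page."""
    header = pd.MultiIndex.from_tuples(
        [("Unnamed: 0_level_0", "Date"), ("Rushing", "Yds")]
    )
    yards = 50 if page.endswith("a") else 80
    return [pd.DataFrame([["2018-09-09", yards]], columns=header)]


pages = ["page_a", "page_b"]  # placeholder identifiers, not real URLs

frames = []  # created once, before the loop
for page in pages:
    table = flatten_columns(fake_read_html(page)[0])
    table.insert(0, "Page", page)
    frames.append(table)  # append inside the loop so every page is kept

final_stats = pd.concat(frames, ignore_index=True)
print(final_stats)
```

The key difference from the question's code is that the list is filled inside the page loop, so each page contributes its rows before the single `pd.concat` call.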
    

    【Discussion】:

    • Can't thank you enough. I've been losing sleep over this. Thank you very, very much for your help.