【Question title】: How can I improve the performance (runtime) of my web-scraping script (Python and Selenium)?
【Posted】: 2020-10-13 22:58:18
【Question】:

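I wrote a script to scrape tables from a website: NFL rosters for 32 teams over 4 years. However, the site only displays one team and one year at a time. So my script opens a team's page, selects a year, scrapes the data, then moves on to the next year, and so on until all four years have been collected. It then repeats the process for the remaining teams.

I am new to web scraping, so I am not sure whether what I am doing is the best approach computationally. Right now it takes about 40-50 seconds to scrape one year for one team, so roughly 4 minutes per team in total. Collecting all years for all teams takes more than two hours.

Is there a way to scrape the data with a shorter runtime?

The code is below: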

import requests
import lxml.html as lh
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Team list
team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
           'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
           'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
           'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
           'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']

# Format list for URL
team_ls = [team.lower().replace(' ','-') for team in team_ls]

# Change the year shown on the current page via the dropdown
def next_year(driver, year_idx):
    
    driver.find_element_by_xpath('//*[@id="main-dropdown"]').click()
    parentElement = driver.find_element_by_xpath('/html/body/app-root/app-nfl/app-roster/div/div/div[2]/div/div/div[1]/div/div/div')
    elementList = parentElement.find_elements_by_tag_name("button")
    elementList[year_idx].click()
    time.sleep(3)

# Create scraping function
def sel_scrape(driver, team, year):
    
    # Get main table
    main_table = driver.find_element_by_tag_name('table')
    
    # Scrape rows and header
    rows = [[td.text.strip() for td in row.find_elements_by_xpath(".//td")] for row in main_table.find_elements_by_xpath(".//tr")][1:]
    header = [[th.text.strip() for th in row.find_elements_by_xpath(".//th")] for row in main_table.find_elements_by_xpath(".//tr")][0]
    
    # compile in dataframe
    df=pd.DataFrame(rows,columns=header)
    
    # Edit data frame
    df['Merge Name'] = df['Name'].str.split(' ',1).str[0].str[0] + '.' + df['Name'].str.split(' ').str[1]
    df['Team'] = team.replace('-',' ').title()
    df['Year'] = year
    
    return df

url='https://www.lineups.com/nfl/roster/'

df = pd.DataFrame()
years = [2020,2019,2018,2017]

start_time = time.time()

for team in team_ls:
    driver = webdriver.Chrome()
    # Generate team link
    driver.get(url+team)
    
    # For each of the four years
    for idx in range(0,4):
        print("Starting {} {}".format(team, years[idx]))
        # Scrape the page
        df = pd.concat([df, sel_scrape(driver, team, years[idx])])
        
        # Change to next year
        next_year(driver, idx)
    driver.close()

print("--- %s seconds ---" % (time.time() - start_time))
    
df.head()

【Comments】:

    Tags: python selenium web-scraping runtime


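    【Solution 1】:

    You can improve this by not using Selenium at all. Selenium (when it works) drives a full browser, so it will naturally run slower. The best way to get the data is through the API the page itself uses to render it: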

    import pandas as pd
    import requests
    import time
    
    # Team list
    team_ls = ['Arizona Cardinals','Atlanta Falcons','Baltimore Ravens','Buffalo Bills','Carolina Panthers','Chicago Bears','Cincinnati Bengals',
               'Cleveland Browns','Dallas Cowboys','Denver Broncos','Detroit Lions','Green Bay Packers','Houston Texans','Indianapolis Colts',
               'Jacksonville Jaguars','Kansas City Chiefs','Las Vegas Raiders','Los Angeles Chargers','Los Angeles Rams','Miami Dolphins','Minnesota Vikings','New England Patriots',
               'New Orleans Saints','New York Giants','New York Jets','Philadelphia Eagles','Pittsburgh Steelers','San Francisco 49ers','Seattle Seahawks',
               'Tampa Bay Buccaneers','Tennessee Titans','Washington Redskins']
    
    
    rows = []
    start_time = time.time()
    for team in team_ls:
        # URL slug for the team, e.g. 'Arizona Cardinals' -> 'arizona-cardinals'
        teamStr = '-'.join(team.split()).lower()
        for season in range(2017, 2021):
            print('Season: %s\tTeam: %s' % (season, team))
            url = 'https://api.lineups.com/nfl/fetch/roster/{season}/{teamStr}'.format(season=season, teamStr=teamStr)

            # The endpoint returns JSON; the roster records are under the 'data' key
            jsonData = requests.get(url).json()
            roster = jsonData['data']
            # Tag each player record with its season and team
            for item in roster:
                item.update({'Year': season, 'Team': team})
            rows += roster

    # Build one DataFrame from all collected records
    df = pd.DataFrame(rows)

    print("--- %s seconds ---" % (time.time() - start_time))

    print(df.head())
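
The loop above sends its 128 requests one after another. If that is still too slow, the requests could be issued concurrently. The following is only a sketch under stated assumptions, not part of the original answer: `team_slug`, `fetch_roster`, `fetch_all`, and `requests_get_json` are hypothetical helper names, `max_workers=8` is an arbitrary choice, and whether the server tolerates parallel requests is unknown, so keep the worker count modest.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# URL pattern taken from the answer above.
API_URL = 'https://api.lineups.com/nfl/fetch/roster/{season}/{slug}'


def team_slug(team):
    """Turn 'Arizona Cardinals' into 'arizona-cardinals'."""
    return '-'.join(team.split()).lower()


def fetch_roster(get_json, team, season):
    """Fetch one team-season and tag every record with Year and Team."""
    url = API_URL.format(season=season, slug=team_slug(team))
    roster = get_json(url)['data']
    for item in roster:
        item.update({'Year': season, 'Team': team})
    return roster


def fetch_all(teams, seasons, get_json, max_workers=8):
    """Fetch every (team, season) pair in a thread pool.

    pool.map yields results in the order of the submitted jobs, so the rows
    come back grouped by team and then by season.
    """
    jobs = list(product(teams, seasons))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda job: fetch_roster(get_json, *job), jobs)
        return [row for roster in results for row in roster]


def requests_get_json(url):
    """Default fetcher (hypothetical helper): one GET request, parsed as JSON."""
    import requests  # imported lazily so the helpers above need only the stdlib
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()
```

With `team_ls` defined as in the answer, `pd.DataFrame(fetch_all(team_ls, range(2017, 2021), requests_get_json))` would build the same kind of DataFrame. Passing the fetcher in as a parameter keeps the threading logic separate from the HTTP call, which also makes it easy to try out with a stub before hitting the real site.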
    

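    【Comments】:

    • Thanks! Where did you get that API URL?
    • Open Dev Tools (right-click and choose Inspect; you may need to reload the page). In the pane, go to the Network -> XHR -> Headers tab to see the requests the page makes. I'll add a picture to the solution above.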
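
Once a request URL has been spotted in the Network tab, it can be replayed outside the browser to check what it returns. The sketch below is one way to do that with only the standard library. It makes several assumptions: `fetch_json` and `summarize_payload` are hypothetical helpers, the `User-Agent` header is a guess at what a server might expect, and the `'data'` key is taken from the solution's code rather than from any documented API.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Replay a GET request copied from the browser's Network tab."""
    # Assumption: some servers reject requests without a browser-like User-Agent.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode('utf-8'))


def summarize_payload(payload):
    """Report the top-level keys, the record count, and the first record's fields."""
    records = payload.get('data', [])
    fields = sorted(records[0]) if records else []
    return {'keys': sorted(payload), 'records': len(records), 'fields': fields}
```

For example, `summarize_payload(fetch_json('https://api.lineups.com/nfl/fetch/roster/2020/arizona-cardinals'))` would show which columns to expect before writing the full loop.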