【Question title】: Is there a way to optimize the for loop? Selenium is taking too long to scrape 38 pages
【Posted】: 2020-10-21 14:30:39
【Question description】:

I am trying to scrape https://arxiv.org/search/?query=healthcare&searchtype=all with Selenium and Python. The for loop takes too long to execute. I tried scraping with a headless browser and with PhantomJS, but then the abstract field is not scraped (the abstract has to be expanded by clicking the More button).

import pandas as pd
import selenium
import re
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

browser = Firefox()
url_healthcare = 'https://arxiv.org/search/?query=healthcare&searchtype=all'
browser.get(url_healthcare)

dfs = []
for i in range(1, 39):
    articles = browser.find_elements_by_tag_name('li[class="arxiv-result"]')

    for article in articles:
        title = article.find_element_by_tag_name('p[class="title is-5 mathjax"]').text
        arxiv_id = article.find_element_by_tag_name('a').text.replace('arXiv:','')
        arxiv_link = article.find_elements_by_tag_name('a')[0].get_attribute('href') 
        pdf_link = article.find_elements_by_tag_name('a')[1].get_attribute('href')
        authors = article.find_element_by_tag_name('p[class="authors"]').text.replace('Authors:','')

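        try:
            # Search inside this article so the click expands its own abstract
            link1 = article.find_element_by_link_text('▽ More')
            link1.click()
        except Exception:
            pass  # short abstracts have no More link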

        abstract = article.find_element_by_tag_name('p[class="abstract mathjax"]').text
        date = article.find_element_by_tag_name('p[class="is-size-7"]').text
        date = re.split(r"Submitted|;",date)[1]
        tag = article.find_element_by_tag_name('div[class="tags is-inline-block"]').text.replace('\n', ',')
        
        try:
            doi = article.find_element_by_tag_name('div[class="tags has-addons"]').text
            doi = re.split(r'\s', doi)[1] 
        except NoSuchElementException:
            doi = 'None'

        all_combined = [title, arxiv_id, arxiv_link, pdf_link, authors, abstract, date, tag, doi]
        dfs.append(all_combined)

    print('Finished Extracting Page:', i)

    try:
        link2 = browser.find_element_by_class_name('pagination-next')
        link2.click()
    except:
        browser.close()  # call the method; a bare `browser.close` does nothing
        
    time.sleep(0.1)


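【Comments】:

  • The scraped df should have 9 columns per article: title, id, link, PDF link, authors, abstract, date, tags, DOI. So the resulting df should be (1890 x 9). I need help with the abstract, because it has a More button that reveals the expanded abstract I need when clicked, but I cannot extract it!
  • Please do not vandalize your posts. This includes editing a post so that it invalidates existing answers or otherwise makes your question unanswerable.
  • @user1234 To have content removed, the copyright owner or their agent needs to issue a DMCA takedown notice in the prescribed manner. Given that you say you had no right to post the code, it is someone else who would need to issue that notice. That does not mean we are unwilling to work with you toward a question that is still valid, does not invalidate the answers, and does not contain the code you are concerned about. However, given the specifics of this question and its answers, that is not trivial (i.e. it would take a lot of work on your part).
  • The most likely way to accomplish all three is for you to rewrite the code in the question as new code that A) still exhibits the same problem and B) contains none of the code you believe you are not allowed to share and is not a derivative work of it. You would then need the consent of the people who answered to integrate your new code into their answers, while keeping the quality of each answer and solving the problem in the same way as originally (e.g. by obtaining consent through a suggested edit).
  • I should probably also point out that even if it is deleted here, copies will remain on multiple archive sites and on multiple sites that mirror SO content. SO has no control over those third-party sites, so the copyright owner would need to find and deal with each one separately. While deleting it here reduces its visibility, it certainly does not remove it from the internet. Removing it completely from everywhere would be a daunting task and may be impossible (i.e. it is difficult and takes a lot of time and effort).

Tags: python selenium web-scraping optimization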


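【Solution 1】:

The implementation below does this in about 16 seconds.

To speed up execution, I took the following measures:

  • Removed Selenium entirely (no clicking needed)
  • For the abstract, used BeautifulSoup's output and processed it afterwards
  • Added multiprocessing to speed things up significantly
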
from multiprocessing import Process, Manager
import requests 
from bs4 import BeautifulSoup
import re
import time

start_time = time.time()

def get_no_of_pages(showing_text):
    no_of_results = int((re.findall(r"(\d+,*\d+) results for all",showing_text)[0].replace(',','')))
    pages = (no_of_results + 199)//200  # ceiling division: no extra empty page on exact multiples of 200
    print("total pages:",pages)
    return pages 

def clean(text):
    return text.replace("\n", '').replace("  ",'')

def get_data_from_page(url,page_number,data):
    print("getting page",page_number)
    response = requests.get(url+"start="+str(page_number*200))
    soup = BeautifulSoup(response.content, "lxml")
    
    # attrs must be a dict ({"class": ...}); the original {"class", ...} was a set
    arxiv_results = soup.find_all("li",{"class":"arxiv-result"})

    for arxiv_result in arxiv_results:
        paper = {} 
        paper["titles"]= clean(arxiv_result.find("p",{"class":"title is-5 mathjax"}).text)
        links = arxiv_result.find_all("a")
        paper["arxiv_ids"]= links[0].text.replace('arXiv:','')
        paper["arxiv_links"]= links[0].get('href')
        paper["pdf_link"]= links[1].get('href')
        paper["authors"]= clean(arxiv_result.find("p",{"class":"authors"}).text.replace('Authors:',''))

        split_abstract = arxiv_result.find("p",{"class":"abstract mathjax"}).text.split("▽ More\n\n\n",1)
        if len(split_abstract) == 2:
            paper["abstract"] = clean(split_abstract[1].replace("△ Less",''))
        else: 
            paper["abstract"] = clean(split_abstract[0].replace("△ Less",''))

        # Read from the current result, not arxiv_results[0], so each paper gets its own values
        paper["date"] = re.split(r"Submitted|;",arxiv_result.find("p",{"class":"is-size-7"}).text)[1]
        paper["tag"] = clean(arxiv_result.find("div",{"class":"tags is-inline-block"}).text) 
        doi = arxiv_result.find("div",{"class":"tags has-addons"})       
        if doi is None:
            paper["doi"] = "None"
        else:
            paper["doi"] = re.split(r'\s', doi.text)[1] 

        data.append(paper)
    
    print(f"page {page_number} done")


if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'

    response = requests.get(url+"start=0")
    soup = BeautifulSoup(response.content, "lxml")

    with Manager() as manager:
        data = manager.list()  
        processes = []
        get_data_from_page(url,0,data)


        showing_text = soup.find("h1",{"class":"title is-clearfix"}).text
        for i in range(1,get_no_of_pages(showing_text)):
            p = Process(target=get_data_from_page, args=(url,i,data))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

        print("Number of entires scraped:",len(data))

        stop_time = time.time()

        print("Time taken:", stop_time-start_time,"seconds")

Output:

>>> python test.py
getting page 0
page 0 done
total pages: 10
getting page 1
getting page 4
getting page 2
getting page 6
getting page 5
getting page 3
getting page 7
getting page 9
getting page 8
page 9 done
page 4 done
page 1 done
page 6 done
page 2 done
page 7 done
page 3 done
page 5 done
page 8 done
Number of entires scraped: 1890
Time taken: 15.911492586135864 seconds
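
The abstract handling in this answer can be isolated as a small helper. The sketch below is a variation, not the answer's exact code: the name `extract_full_abstract` and the sample strings are made up for illustration, and it collapses whitespace to single spaces instead of deleting newlines the way `clean` does. It assumes the paragraph text has the shape the answer relies on: a short preview, the `▽ More` toggle, the full abstract, then `△ Less`.

```python
def extract_full_abstract(raw_text: str) -> str:
    """Return the full abstract from the text of a 'p.abstract' element.

    Hypothetical helper: if the '▽ More' toggle is present, keep only what
    follows it (the full abstract); otherwise keep the whole text.
    """
    _, toggle, tail = raw_text.partition("▽ More")
    body = tail if toggle else raw_text
    body = body.replace("△ Less", "")
    # Collapse runs of whitespace and newlines into single spaces
    return " ".join(body.split())


# Made-up sample strings that only imitate the layout of the page text
long_text = "Abstract: Short preview ▽ More\n\n\n  Full abstract text\n  spanning lines.  △ Less\n"
short_text = "Abstract: Tiny abstract.\n"

print(extract_full_abstract(long_text))
print(extract_full_abstract(short_text))
```

Using `partition` avoids depending on the exact number of newlines after the toggle, which the `split("▽ More\n\n\n", 1)` call in the answer requires.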

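Notes:

  • Please put the code above in a .py file. For a Jupyter notebook, see this
  • The multiprocessing code is taken from here
  • The order of entries in the data list does not match the order on the website, because the Manager appends the dictionaries as they arrive.
  • The code above works out the number of pages by itself, so it generalizes to any arXiv search result. Unfortunately, to do so it first fetches page 0, then computes the number of pages, and only then starts multiprocessing for the remaining pages. The drawback is that no other process is running while page 0 is being processed. So if you drop that part and simply run the loop for 10 pages, the time taken should drop to around 8 seconds.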
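
If the ordering matters, one possible alternative is sketched below. It swaps the answer's `Process`/`Manager` approach for `concurrent.futures.ThreadPoolExecutor`, whose `map` returns results in input order, so the rows come back in page order. The names `count_pages`, `fetch_all_pages` and `fake_fetch` are hypothetical, the heading string is a made-up sample, and `fake_fetch` stands in for the real request-and-parse step so the sketch runs offline.

```python
import math
import re
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 200  # matches size=200 in the search URL used in the answer


def count_pages(showing_text: str, page_size: int = PAGE_SIZE) -> int:
    """Parse a heading such as '... of 1,890 results ...' into a page count."""
    match = re.search(r"([\d,]+) results", showing_text)
    if match is None:
        raise ValueError("no result count found in heading text")
    total = int(match.group(1).replace(",", ""))
    return math.ceil(total / page_size)


def fetch_all_pages(fetch_page, n_pages: int, max_workers: int = 8) -> list:
    """Call fetch_page(page_number) concurrently and flatten rows in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch_page, range(n_pages))  # map preserves input order
        return [row for page in pages for row in page]


def fake_fetch(page_number: int) -> list:
    # Stand-in for an HTTP request plus parsing; returns two rows per page
    return [f"page{page_number}-row{i}" for i in range(2)]


n_pages = count_pages("Showing 1-200 of 1,890 results for all: healthcare")
rows = fetch_all_pages(fake_fetch, 3)
print(n_pages, rows)
```

Threads are usually sufficient here because the work is dominated by waiting on the network rather than by CPU time; whether this is faster than separate processes for this particular site is not something the sketch measures.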

【Comments】:

    【Solution 2】:

    You can try the requests and Beautiful Soup approach. There is no need to click the More link.

    from requests import get
    from bs4 import BeautifulSoup
    
    # you can change the size to retrieve all the results at one shot.
    
    url = 'https://arxiv.org/search/?query=healthcare&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0'
    response = get(url)  # keep TLS certificate verification on (the default)
    soup = BeautifulSoup(response.content, "lxml")
    #print(soup)
    queryresults = soup.find_all("li", attrs={"class": "arxiv-result"})
    
    for result in queryresults:
        title = result.find("p",attrs={"class": "title is-5 mathjax"})
        print(title.text)
    
    # If you need the full abstract content, try this (you do not need to click the More button)
    for result in queryresults:
        abstractFullContent = result.find("span",attrs={"class": "abstract-full has-text-grey-dark mathjax"})
        print(abstractFullContent.text)
    

    Output:

     Interpretable Deep Learning for Automatic Diagnosis of 12-lead Electrocardiogram
                
      Leveraging Technology for Healthcare and Retaining Access to Personal Health Data to Enhance Personal Health and Well-being
      Towards new forms of particle sensing and manipulation and 3D imaging on a smartphone for healthcare applications
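
The URL in this answer fetches only the first 50 results. A minimal sketch of generating one URL per page by stepping the `start` parameter is shown below. The parameter names are the ones that appear in the URLs in this thread; the helper name `page_urls` is hypothetical, and the total of 1890 results is the figure reported above, which would change over time.

```python
from urllib.parse import urlencode

BASE_URL = "https://arxiv.org/search/"


def page_urls(query: str, total_results: int, size: int = 200):
    """Yield one search URL per page, advancing `start` by `size` each time."""
    for start in range(0, total_results, size):
        params = {
            "query": query,
            "searchtype": "all",
            "abstracts": "show",
            "order": "-announced_date_first",
            "size": size,
            "start": start,
        }
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(page_urls("healthcare", 1890))
for u in urls:
    print(u)
```

Each generated URL can then be passed to `get` and parsed the same way as the single page in this answer.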
    

    【Comments】:
