【Question title】: Scraper looping through a list of URLs using Selenium without getting blocked
【Posted】: 2021-10-19 08:49:56
【Question description】:

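I want to get exactly the same information from several pages, where it is nested in the same way. I put the URLs in a list and wrote a for loop to iterate over these pages. The scraper works well on the first URL, but unfortunately it gets stuck on the second URL and I get a MaxRetryError.

My idea with Selenium is to open a page, get the information I need, put it in a dataframe, and then close the page. Then open another page, get the similar information, append it to the dataframe, close the page, and so on, and finally save the dataframe as a .csv file.

Here is the code: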

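import csv
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()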
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH, options=options)
driver.maximize_window()
driver.implicitly_wait(30)

time.sleep(10)
wait = WebDriverWait(driver,30)

# Create the csv at the good place
csv_file = open('\path_to_folder.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['titre', 'contrat', 'localisation', 'description'])

# A list of two URL's
listurl = ['https://candidat.pole-emploi.fr/offres/emploi/horticulteur/s1m1','https://candidat.pole-emploi.fr/offres/emploi/ouvrier-agricole/s1m2']

# Loop through the list
for i in listurl:
    driver.get(i)
    # Click cookies popup
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT,"Continuer sans accepter"))).click()
    time.sleep(3)

    # Get the elements
    try:
        zone = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "zone-resultats"))
        )
        offres = zone.find_elements_by_css_selector("div.media-body")
        offres2 = zone.find_elements_by_css_selector("div.media-right.media-middle.hidden-xs")
        for offre in offres:
            titre = (offre.find_element_by_css_selector("h2.t4.media-heading")).text
            print(titre)
            localisation = (offre.find_element_by_css_selector("span")).text
            print(localisation)
            description = (offre.find_element_by_class_name("description")).text
            print(description)

        for offre2 in offres2:
            contrat = (offre2.find_element_by_class_name("contrat")).text
            print(contrat)
        csv_writer.writerow([titre, contrat, localisation, description])
    except Exception as ex:
        print(ex)
    finally:
        csv_file.close()
        driver.quit()

Here is the error message:

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=52938): Max retries exceeded with url: /session/cab64f2c3688431768dfcdba1c4ca98f/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000021FBC3B6730>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

【Discussion】:

    Tags: python selenium loops web-scraping


    【Solution 1】:

    This code should work for you:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd

    options = webdriver.ChromeOptions()
    # options.add_argument("--incognito")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # Collect the results column by column instead of writing the CSV by hand
    data = {
        "titre": [],
        "contrat": [],
        "localisation": [],
        "description": []
    }

    # A list of two URLs
    listurl = ['https://candidat.pole-emploi.fr/offres/emploi/horticulteur/s1m1',
               'https://candidat.pole-emploi.fr/offres/emploi/ouvrier-agricole/s1m2']

    # Loop through the list, starting a fresh browser session for each URL
    for i in listurl:
        # Raw string so the backslashes in the Windows path are kept literally.
        # On Selenium 4.10+, pass the path through a Service object instead.
        driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe", options=options)
        driver.maximize_window()
        try:
            driver.get(i)
            # Click cookies popup
            WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.LINK_TEXT, "Continuer sans accepter"))
            ).click()

            # Get the elements
            zone = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "zone-resultats"))
            )
            offres = zone.find_elements(By.CSS_SELECTOR, "div.media-body")
            offres2 = zone.find_elements(By.CSS_SELECTOR, "div.media-right.media-middle.hidden-xs")
            # Assumes offres and offres2 are parallel lists, one entry per offer
            for offre, offre2 in zip(offres, offres2):
                titre = offre.find_element(By.CSS_SELECTOR, "h2.t4.media-heading").text
                localisation = offre.find_element(By.CSS_SELECTOR, "span").text
                description = offre.find_element(By.CLASS_NAME, "description").text
                contrat = offre2.find_element(By.CLASS_NAME, "contrat").text
                print(titre, contrat, localisation, description, sep="\n")
                # Append inside the loop so that every offer becomes a row
                data["titre"].append(titre)
                data["contrat"].append(contrat)
                data["localisation"].append(localisation)
                data["description"].append(description)
        except Exception as ex:
            print(ex)
        finally:
            # Quit only the session that belongs to this iteration
            driver.quit()

    df = pd.DataFrame.from_dict(data)
    print(df)
    df.to_csv("data.csv")
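
The answer's code starts a new browser for every URL. Another option, sketched here under assumptions, is to keep a single session open for all URLs and call `quit()` only once, after the loop. A MaxRetryError with "connection refused" on `127.0.0.1` typically appears when a command is sent to a chromedriver process that has already been shut down, which is what happens if `quit()` runs inside the loop. The names `scrape_all` and `scrape_page` are hypothetical, and `FakeDriver` is a made-up stand-in so the sketch runs without a browser.

```python
from typing import Any, Callable, Iterable, List


def scrape_all(urls: Iterable[str],
               make_driver: Callable[[], Any],
               scrape_page: Callable[[Any], List[dict]]) -> List[dict]:
    """Visit every URL with a single driver and quit it once at the end."""
    rows: List[dict] = []
    driver = make_driver()
    try:
        for url in urls:
            driver.get(url)
            try:
                rows.extend(scrape_page(driver))
            except Exception as ex:
                # A failure on one page should not end the whole session
                print(f"{url}: {ex}")
    finally:
        # Quit after the loop, not inside it
        driver.quit()
    return rows


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome, for demonstration only."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        if self.closed:
            raise RuntimeError("session already closed")
        self.visited.append(url)

    def quit(self):
        self.closed = True


fake = FakeDriver()
result = scrape_all(
    ["https://example.com/a", "https://example.com/b"],
    make_driver=lambda: fake,
    scrape_page=lambda d: [{"url": d.visited[-1]}],
)
print(result)
```

With Selenium, `make_driver` could be something like `lambda: webdriver.Chrome(options=options)`, and `scrape_page` would hold the `WebDriverWait` and `find_elements` calls. In a shared session the cookie banner may only appear on the first page, so the click on "Continuer sans accepter" would probably need to tolerate a timeout.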
    

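    【Comments】:

    • Thanks. I have updated my code with your edits and made a few minor changes. By the way, do you know how to append to the dataframe correctly? It only returns the last result, so my csv has only one row.
    • I added new code to my answer above. This should work for you.
    • The scraper now works well and iterates over both URLs, thanks. It still only writes the last result for each URL, though, so the CSV has only two rows. But this is an important step forward anyway.
    • I finally figured out that the block with data["titre"].append(titre) has to be inside the loop. Now it works perfectly.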
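
The fix described in the last comment can be sketched without a browser: the row has to be recorded inside the loop over offers, otherwise only the values left over from the final iteration are written. The sample offers below are made-up placeholders, not data from the site, `write_rows` is a hypothetical helper, and the CSV goes to an in-memory buffer so the sketch has no side effects.

```python
import csv
import io

# Made-up placeholder offers standing in for the scraped elements
offers = [
    {"titre": "Offer A", "contrat": "CDI", "localisation": "75 - Paris", "description": "First"},
    {"titre": "Offer B", "contrat": "CDD", "localisation": "69 - Lyon", "description": "Second"},
    {"titre": "Offer C", "contrat": "CDI", "localisation": "13 - Marseille", "description": "Third"},
]
columns = ["titre", "contrat", "localisation", "description"]


def write_rows(offers, append_inside_loop):
    """Write the offers as CSV and return the resulting lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    row = None
    for offer in offers:
        row = [offer[c] for c in columns]
        if append_inside_loop:
            writer.writerow(row)  # one row per offer
    if not append_inside_loop and row is not None:
        writer.writerow(row)  # only the last offer survives
    return buffer.getvalue().splitlines()


buggy = write_rows(offers, append_inside_loop=False)
fixed = write_rows(offers, append_inside_loop=True)
print(len(buggy) - 1, "data row(s) when writing after the loop")
print(len(fixed) - 1, "data row(s) when writing inside the loop")
```

The same reasoning applies to the `data[...]` lists that feed the DataFrame: each `append` call needs to run once per offer, so it belongs in the body of the loop that reads the offer.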