【Question title】: Selenium button click in loop fails after first try
【Posted】: 2021-03-12 08:46:40
【Question】:

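I am currently building a web scraper that should download article text from a Dutch newspaper archive. The first link works fine, but the second link suddenly raises an error, and I don't know how to fix it.

It looks as if Selenium cannot click the button on the second link after clicking it successfully on the first.

Do you know what causes the second link (the Telegraaf page) to fail?

Updated code: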

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.common.exceptions import TimeoutException


from selenium.webdriver.common.action_chains import ActionChains

#Set up the path to the chrome driver
driver = webdriver.Chrome()
html = driver.find_element_by_tag_name('html')

all_details = []
for c in range(1,2):
    try:
        driver.get("https://www.delpher.nl/nl/kranten/results?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page={}&sortfield=date&cql%5B%5D=(date+_gte_+%2201-01-1970%22)&cql%5B%5D=(date+_lte_+%2201-01-2018%22)&coll=ddd".format(c))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        incategory = driver.find_elements_by_class_name("search-result")
        print(driver.current_url)
        
        links = [ i.find_element_by_class_name("thumbnail search-result__thumbnail").get_attribute("href") for i in incategory]
            
        # Let's loop through each link to access the page of each book
        for link in links:
            # get one book url
            driver.get(link)
                      
            # newspaper 
            newspaper = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/h1/span[2]")
            
            # date of the article
            date = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/div/ul/li[1]")
            
            #click button and find title
            div_element = WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located((By.XPATH,'//*[@id="object"]/div/div/div')))
            hover = ActionChains(driver).move_to_element(div_element)
            hover.perform()
            div_element.click()
            
            button = WebDriverWait(driver, 90).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="object-viewer__ocr-button"]')))
            hover = ActionChains(driver).move_to_element(button)
            hover.perform()

            button.click()
            
            element = driver.find_element_by_css_selector(".object-viewer__ocr-panel-results")
            driver.execute_script("$(arguments[0]).click();", element)
            
            # content of article
                        
            try:
                content = driver.find_elements_by_xpath("//*[contains(text(), 'kernenergie')]").text
                
            except:
                content = None
                
            # Define a dictionary with details we need
            r = {
                "1Newspaper":newspaper.text,
                "2Date":date.text,
                "3Content":content,
            }
            # append r to all details
            all_details.append(r)
            
    except Exception as e:
        print(str(e))
        pass
            
# save the information into a CSV file
df = pd.DataFrame(all_details)
df = df.to_string()

time.sleep(3)
driver.close()

【Comments】:

    Tags: python selenium web text web-crawler


    【Solution 1】:

    There are a few problems with your code.

    driver.implicitly_wait(10)
    

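    needs to be called only once; the implicit wait then applies to every element lookup for the rest of the session.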

     links = [ i.find_element_by_class_name("search-result__thumbnail-link").get_attribute("href") for i in incategory]
    

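    is a more reliable way to collect all the links. find_element_by_class_name expects a single class name, so the compound value "thumbnail search-result__thumbnail" in your code does not work as intended.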
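
When an element really has to be matched on several classes at once, one option is to turn the class attribute into a CSS selector and use a CSS lookup instead. A minimal sketch follows; `classes_to_css` is a hypothetical helper, not part of Selenium, and the commented usage line assumes the same `find_element_by_css_selector` call that the question already uses.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("no class names given")
    # ".a.b" matches elements that carry both class a and class b
    return tag + "".join("." + name for name in names)


selector = classes_to_css("thumbnail search-result__thumbnail")
print(selector)
# Possible use with the objects from the question (assumed, not run here):
# links = [i.find_element_by_css_selector(selector).get_attribute("href")
#          for i in incategory]
```

Whether the matched element actually carries the `href` still has to be checked against the page; the single-class lookup in the answer avoids that question.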

    print(driver.current_url)
    

    can replace

     print("https://www.delpher.nl/nl/kranten/results?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page={}&sortfield=date&cql%5B%5D=(date+_gte_+%2201-01-1970%22)&cql%5B%5D=(date+_lte_+%2201-01-2018%22)&coll=ddd".format(c))
    

    There is no need for url = link:

    for link in links:
        driver.get(link)
    

    Your title does not actually appear on the second page. Use something like this for every value:

    try:
        content = driver.find_element_by_xpath('//*[@id="object-viewer__ocr-panel"]/div[2]/div[5]').text
    except:
        content = None

    # Define a dictionary
    r = {
        "1Newspaper": newspaper,
        "2Date": date,
        "3Title": title,
        "4Content": content,
    }
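
Repeating that try/except for every field gets verbose. As a sketch, the pattern can be wrapped in one small function. `text_or_none` and its `missing` parameter are names made up for this example; with Selenium you would probably pass `NoSuchElementException` as the exception to swallow. The stand-in classes below only imitate a driver so that the snippet runs without a browser.

```python
def text_or_none(find, *locator, missing=(Exception,)):
    """Return find(*locator).text, or None when the lookup raises.

    `find` is any callable returning an object with a .text attribute,
    e.g. driver.find_element_by_xpath. Hypothetical helper, not Selenium API.
    """
    try:
        return find(*locator).text
    except missing:
        return None


# Stand-ins that imitate a driver, for demonstration only
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_element_by_xpath(self, xpath):
        if xpath not in self._elements:
            raise LookupError("no element matches " + xpath)
        return FakeElement(self._elements[xpath])


fake_driver = FakeDriver({"//h1": "De Telegraaf"})
r = {
    "1Newspaper": text_or_none(fake_driver.find_element_by_xpath, "//h1"),
    "3Title": text_or_none(fake_driver.find_element_by_xpath, "//h2"),
}
print(r)
```

Narrowing `missing` to the lookup exception keeps unrelated errors, such as a lost browser session, from being silently turned into None.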
    

    You can replace your exception handler with the following to find out what went wrong:

    except Exception as e:
            print(str(e))
            pass
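
A single try around the whole loop stops at the first failing link, so the remaining links are never visited. A sketch of catching errors per link instead, keeping the full traceback for each failure: `scrape_all` and `scrape_one` are hypothetical names, and `scrape_one` stands for whatever code handles one article page.

```python
import traceback


def scrape_all(links, scrape_one):
    """Run scrape_one(link) for every link and collect failures separately.

    Hypothetical helper: an error on one link is reported and recorded,
    then the loop continues with the next link.
    """
    results, failures = [], []
    for link in links:
        try:
            results.append(scrape_one(link))
        except Exception as exc:
            traceback.print_exc()  # full stack trace, not just the message
            failures.append((link, repr(exc)))
    return results, failures


# Demonstration with a stand-in scraper that fails on one link
def fake_scrape(link):
    if link == "page-2":
        raise ValueError("button not clickable")
    return {"url": link}


results, failures = scrape_all(["page-1", "page-2", "page-3"], fake_scrape)
print(len(results), "scraped,", len(failures), "failed")
```

The `failures` list shows afterwards exactly which links need another look.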
    

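    【Comments】:

    • What I suggest now is that whenever you look up an element, you either set the value to its .text or set it to None.
    • Will do. In the meantime I think I found the error. The presence-of-element-located check currently refers to an ID that already exists before loading. I will switch to the div class and style element that only appear after loading and thereby reveal the button.
    • Selenium now seems unable to open the retrieved links. Do you know why that happens?
    • The links print out fine.
    • The collected links cannot be opened; I get the following error: 'WebElement' object has no attribute 'WebDriverWait'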
    【Solution 2】:

    The button you are trying to reach may be inside an iframe, in which case you have to switch to that iframe before searching for the XPath:

    iframe = driver.find_element_by_tag_name('iframe')  # a single element, not a list
    driver.switch_to.frame(iframe)

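    It is also possible that the object you are trying to click is not visible yet, which can be solved by waiting for it with a timeout.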
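
Selenium's own tool for this is an explicit wait: `WebDriverWait(driver, timeout).until(...)` combined with a condition such as `expected_conditions.element_to_be_clickable`, which waits until the element is visible and enabled rather than merely present. The sketch below only illustrates the polling idea behind such a wait in plain Python; `wait_until` is a made-up name, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Exceptions raised by condition() count as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the DOM yet
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Demonstration: a stand-in "button" that becomes ready on the third check
attempts = {"count": 0}


def button_ready():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(button_ready, timeout=2.0, poll=0.01)
print(found, "after", attempts["count"], "checks")
```

Waiting on a condition like this avoids fixed `time.sleep` calls that are either too short for slow pages or needlessly long for fast ones.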

    【Comments】:

    • Thanks, but as far as I can tell, inspecting the page does not show any (i)frame elements.