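【Question title】: Python - Selenium next page
【Posted】: 2018-09-20 09:49:34
【Question description】:

I'm trying to build a scraping application for Hants.gov.uk. Right now I'm only working on getting it to click through the pages rather than scrape them. When it reached the last row of page 1 it simply stopped, so I made it click the "Next page" button, but first it has to go back to the original URL. It clicks through to page 2, but after page 2 has been scraped it doesn't go on to page 3; it just restarts page 2.

Can someone help me solve this?

Code: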

import time
import config # Don't worry about this. This is an external file to make a DB
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"

driver = webdriver.Chrome(executable_path=r"C:\Users\Goten\Desktop\chromedriver.exe")
driver.get(url)

driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()

def start():
    elements = driver.find_elements_by_css_selector(".searchResult a")
    links = [link.get_attribute("href") for link in elements]

    result = []
    for link in links:
        if link not in result:
            result.append(link)
        else:
            driver.get(link)
            goUrl = urllib.request.urlopen(link)
            soup = BeautifulSoup(goUrl.read(), "html.parser")
            #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
            for i in range(20):
                pass # Don't worry about all this commented code, it isn't relevant right now
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                #print(table.text)
            #   div = soup.select("div.applicationDetails")
            #   getDiv = div[i].split(":")[1].get_text()
            #   log = open("log.txt", "a")
            #   log.write(getDiv + "\n")
            #log.write("\n")

start()
driver.get(url)

for i in range(5):
    driver.find_element_by_id("ctl00_mainContentPlaceHolder_lvResults_bottomPager_ctl02_NextButton").click()
    url = driver.current_url
    start()
    driver.get(url)
driver.close()

【Comments】:

    Tags: python selenium selenium-webdriver web-scraping webdriver


    【Solution 1】:

    Try this:

    import time
    # import config # Don't worry about this. This is an external file to make a DB
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    url = "https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True"
    
    driver = webdriver.Chrome()
    driver.get(url)
    
    driver.find_element_by_id("mainContentPlaceHolder_btnAccept").click()
    
    result = []
    
    
    def start():
        elements = driver.find_elements_by_css_selector(".searchResult a")
        links = [link.get_attribute("href") for link in elements]
        result.extend(links)
    
    def start2():
        for link in result:
            # if link not in result:
            #     result.append(link)
            # else:
                driver.get(link)
                goUrl = urllib.request.urlopen(link)
                soup = BeautifulSoup(goUrl.read(), "html.parser")
                #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                for i in range(20):
                    pass # Don't worry about all this commented code, it isn't relevant right now
                    #table = soup.find_element_by_id("table", {"class": "applicationDetails"})
                    #print(table.text)
                #   div = soup.select("div.applicationDetails")
                #   getDiv = div[i].split(":")[1].get_text()
                #   log = open("log.txt", "a")
                #   log.write(getDiv + "\n")
                #log.write("\n")
    
    
    while True:
        start()
        element = driver.find_element_by_class_name('rdpPageNext')
        try:
            check = element.get_attribute('onclick')
            if check != "return false;":
                element.click()
            else:
                break
    
        except:
            break
    
    print(result)
    start2()
    driver.get(url)
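
The `while True` loop above collects links on every page before visiting any of them. A minimal, driver-independent sketch of that idea follows. The function `collect_links` and its three callable hooks are hypothetical names, not part of Selenium; with a real driver they would wrap the `find_elements_by_css_selector` call and the `rdpPageNext` check from the answer. Unlike `result.extend(links)`, this version also drops duplicate and empty links, and it caps the number of pages as a safety net.

```python
def collect_links(get_page_links, has_next_page, go_to_next_page, max_pages=100):
    """Gather links from every results page, keeping first-seen order.

    The three callables are hypothetical hooks supplied by the caller:
    get_page_links() returns the hrefs on the current page,
    has_next_page() says whether a usable "next" control exists,
    go_to_next_page() clicks it.
    """
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for _ in range(max_pages):
        for link in get_page_links():
            if link:  # skip None or empty hrefs
                seen.setdefault(link, None)
        if not has_next_page():
            break
        go_to_next_page()
    return list(seen)


# Stand-in data instead of a live browser session.
pages = [["a", "b", "a"], ["b", "c", None], ["d"]]
state = {"index": 0}

links = collect_links(
    get_page_links=lambda: pages[state["index"]],
    has_next_page=lambda: state["index"] < len(pages) - 1,
    go_to_next_page=lambda: state.update(index=state["index"] + 1),
)
print(links)  # ['a', 'b', 'c', 'd']
```

Because the detail pages are only opened after the loop has finished, the browser never leaves the results list while paging.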
    

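    【Comments】:

    • Yes, but it also needs code to check each application. There are 7 per page.
    • It is checking them. I used a while loop.
    • Use a sleep between loop iterations, as needed. I can't run the code right now, but I think it will work fine.
    • I assumed your problem was with going through each page, so that is the only part I solved. You'll have to add the other code to get the data from the table. You can add it after the while True: line.
    • Let me know if this works. If it does, I'll explain the logic.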
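
As an alternative to the fixed sleep suggested in the comments, a polling wait returns as soon as the page is ready. The sketch below uses only the standard library; `wait_until` is a hypothetical helper, and the `ready` function stands in for a real check such as "the results list is present". With an actual driver, Selenium's own `WebDriverWait` plays this role.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout)
        time.sleep(interval)


# Stand-in condition that becomes true on the third check.
attempts = {"count": 0}


def ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(ready, timeout=2.0, interval=0.01)
print(attempts["count"])  # 3
```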
    【Solution 2】:

    To click through all the pages at the URL https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True, you can use the following solution:

    • Code block:

      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      
      options = Options()
      options.add_argument("start-maximized")
      options.add_argument("disable-infobars")
      options.add_argument("--disable-extensions")
      driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get('https://planning.hants.gov.uk/SearchResults.aspx?RecentDecisions=True')
      WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "mainContentPlaceHolder_btnAccept"))).click()
      numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ctl00_mainContentPlaceHolder_lvResults_topPager div.rdpWrap.rdpNumPart>a"))))
      print(numLinks)
      for i in range(numLinks):
          print("Perform your scraping here on page {}".format(str(i+1)))
          WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='ctl00_mainContentPlaceHolder_lvResults_topPager']//div[@class='rdpWrap rdpNumPart']//a[@class='rdpCurrentPage']/span//following::span[1]"))).click()
      driver.quit()
      
    • Console output:

      8
      Perform your scraping here on page 1
      Perform your scraping here on page 2
      Perform your scraping here on page 3
      Perform your scraping here on page 4
      Perform your scraping here on page 5
      Perform your scraping here on page 6
      Perform your scraping here on page 7
      Perform your scraping here on page 8
      

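    【Comments】:

    • While this is a brilliant idea, I want to get this done with my own code; I'm just trying to figure it out :) Thank you
    • @FeitanPortor We know neither your requirements nor your use case. You asked your question, and contributors are trying to help you in their own capacity. Feel free to use the code logic :) That is your choice.
    • I know. It just doesn't exactly answer my question. I upvoted it earlier.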
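    【Solution 3】:

    Hi @Feitan Portor, the code you wrote is fine. The only reason you are sent back to the first page is that, in the final for loop, you set url = driver.current_url. That URL stays static; only JavaScript triggers the next-page click event. So just remove url = driver.current_url and driver.get(url).

    Then you're good to go; I've tested it myself. Also, to see which page your scraper is currently on, add this part inside the for loop:

    ss = driver.find_element_by_class_name('rdpCurrentPage').text
    print(ss)
    

    Hope this clears up your confusion.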
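
The effect described in this answer can be reproduced with a toy model, without a browser. `FakePager` below is a hypothetical stand-in, not Selenium code, and it assumes what the answer states: the page number is not part of the URL, so loading the URL always shows page 1.

```python
class FakePager:
    """Toy stand-in for a browser on a site that paginates via JavaScript."""

    def __init__(self):
        self.page = 1

    def get(self, url):
        # The URL carries no page number, so loading it shows page 1 again.
        self.page = 1

    def click_next(self):
        self.page += 1


def buggy_loop(pager, url, clicks):
    """Mirrors the question: reload the URL at the end of every iteration."""
    visited = []
    pager.get(url)
    for _ in range(clicks):
        pager.click_next()
        visited.append(pager.page)
        pager.get(url)  # resets the pager to page 1
    return visited


def fixed_loop(pager, url, clicks):
    """Same loop without the reload, as this answer suggests."""
    visited = []
    pager.get(url)
    for _ in range(clicks):
        pager.click_next()
        visited.append(pager.page)
    return visited


print(buggy_loop(FakePager(), "https://example.invalid/results", 3))  # [2, 2, 2]
print(fixed_loop(FakePager(), "https://example.invalid/results", 3))  # [2, 3, 4]
```

One caveat: if the scraping step itself navigates away from the results list (for example by calling `driver.get(link)` on a detail page), the Next button is no longer on the page when the loop comes back to click it. Collecting the links first and visiting them afterwards, as in Solution 1, avoids that.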

    【Comments】:
