【Question title】: How to paginate in selenium Python?
【Posted】: 2020-04-02 16:37:42
【Question description】:

I need to paginate through a site and save each page's HTML in a list.

The HTML looks like this. On the first page, the first element, with class="sc-4j28w0-1 fDeSdf", is the arrow '>':

<li disabled="" class="sc-4j28w0-1 fDeSdf"></li>
<li data-testid="current-page-item" class="sc-4j28w0-1 sc-4j28w0-2 jDlZyl">1</li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">2</span></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">3</span></li>
<li class="sc-4j28w0-1 lhEbhI"></li>

The second and subsequent pages (not the last one):

<li class="sc-4j28w0-1 lhEbhI"></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">1</span></li>
<li data-testid="current-page-item" class="sc-4j28w0-1 sc-4j28w0-2 jDlZyl">2</li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">3</span></li>
<li class="sc-4j28w0-1 lhEbhI"></li>

On the last page, the last element, with class="sc-4j28w0-1 fDeSdf", is the arrow:

<li class="sc-4j28w0-1 lhEbhI"></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">1</span></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">2</span></li>
<li data-testid="current-page-item" class="sc-4j28w0-1 sc-4j28w0-2 jDlZyl">3</li>
<li disabled="" class="sc-4j28w0-1 fDeSdf"></li>

So on the first or the last page, the disabled arrow has the class 'sc-4j28w0-1 fDeSdf'.
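The rule above can be checked directly against the markup shown. This is a minimal sketch using only the standard library, assuming the `disabled` attribute (rather than the generated class name, which may change between builds) is what marks an inactive arrow. `PaginationItems` and `arrow_state` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class PaginationItems(HTMLParser):
    """Record, for each <li>, whether it carries the `disabled` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.disabled_flags: List[bool] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # attrs is a list of (name, value) pairs, e.g. ("disabled", "")
            self.disabled_flags.append(any(name == "disabled" for name, _ in attrs))


def arrow_state(pagination_html: str) -> Dict[str, bool]:
    """Report whether the first or last <li> (the arrows) is disabled."""
    parser = PaginationItems()
    parser.feed(pagination_html)
    flags = parser.disabled_flags
    return {
        "is_first_page": bool(flags) and flags[0],
        "is_last_page": bool(flags) and flags[-1],
    }


# Markup of the last page, copied from the question.
last_page = """
<li class="sc-4j28w0-1 lhEbhI"></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">1</span></li>
<li class="sc-4j28w0-1 lhEbhI"><span class="sc-4j28w0-3 jAKnhT">2</span></li>
<li data-testid="current-page-item" class="sc-4j28w0-1 sc-4j28w0-2 jDlZyl">3</li>
<li disabled="" class="sc-4j28w0-1 fDeSdf"></li>
"""

state = arrow_state(last_page)
print(state)
```

With a live browser, the equivalent check would probably be `element.get_attribute("disabled") is not None` on the last `<li>`. `is_enabled()` is defined in terms of form controls, so it may well report `True` for a disabled `<li>`.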

I tried to paginate with a while loop:

#  list for html pages 
news_list = []

while True: 
    wait = WebDriverWait(driver, 10) 

    #  by clicking on the last element of pagination == >
    search = wait.until(EC.presence_of_element_located((By.XPATH, '/html/body/div/div/div[2]/div[2]/div/ol/li[5]')))
    # if it is active click
    if search.is_enabled():
        search.click()
        time.sleep(5)
        html = driver.page_source
        soup_news = BeautifulSoup(html)
        news_list.append(soup_news)
    else:
        pass

The problem is that the loop never stops and keeps saving the last page.

I also tried this:

wait = WebDriverWait(driver, 10) 

search = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div/div/div[2]/div[2]/div/ol/li[5]')))

while search.get_property('disabled') is False:
    search.click()
    time.sleep(5)
    html = driver.page_source
    soup_news = BeautifulSoup(html)
    news_list.append(soup_news)

Then I get this error:

---------------------------------------------------------------------------
StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-51-49e862d6475f> in <module>
     34 
     35 
---> 36 while search.is_enabled():
     37     try:
     38         search.click()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in is_enabled(self)
    157     def is_enabled(self):
    158         """Returns whether the element is enabled."""
--> 159         return self._execute(Command.IS_ELEMENT_ENABLED)['value']
    160 
    161     def find_element_by_id(self, id_):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    631             params = {}
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 
    635     def find_element(self, by=By.ID, value=None):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

StaleElementReferenceException: Message: The element reference of <li class="sc-4j28w0-1 lhEbhI"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

Any help is appreciated.

【Comments】:

  • Did you mean else: break instead of else: pass?
  • I tried both; neither works.

Tags: python selenium pagination


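【Solution 1】:

There are several ways to paginate here. I'll highlight one:

  1. Get the current page number.
  2. Look for the link to the next page number; if it is not found, exit.

Code: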

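from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# `driver` is the WebDriver instance from the question, already on the first page.
while True:
    # Look the current-page item up again on every pass, so no stale reference is reused.
    # find_element(By..., ...) replaces the older find_element_by_* helpers.
    current_page_number = int(driver.find_element(By.CSS_SELECTOR, 'li[data-testid=current-page-item]').text)

    print(f"Processing page {current_page_number}..")

    try:
        # The next page's <li> wraps its number in a <span>; the arrows have no <span>.
        next_page_link = driver.find_element(By.XPATH, f'.//li[span = "{current_page_number + 1}"]')
        next_page_link.click()
    except NoSuchElementException:
        print(f"Exiting. Last page: {current_page_number}.")
        break

    # TODO: save the page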
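To fill in the "save the page" step, the same two-step idea can be written as a loop that saves before it navigates, so the first page is kept too. This is only a sketch under assumptions: `collect_pages`, its three callables and `FakePager` are hypothetical names, and the fake pager stands in for a browser so the loop can be run without Selenium.

```python
from typing import Callable, List


def collect_pages(
    get_source: Callable[[], str],
    get_current_page: Callable[[], int],
    go_to_page: Callable[[int], bool],
    max_pages: int = 1000,
) -> List[str]:
    """Save every page's HTML, starting with the page currently shown.

    go_to_page(n) is expected to look the link for page n up afresh on each
    call (so no stale element is reused) and to return False if it is missing.
    """
    pages: List[str] = []
    for _ in range(max_pages):  # hard cap so the loop cannot run forever
        pages.append(get_source())  # save before navigating away
        if not go_to_page(get_current_page() + 1):
            break  # no link for the next number: this was the last page
    return pages


class FakePager:
    """In-memory stand-in for a browser, used only to exercise the loop."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.current = 1

    def source(self) -> str:
        return f"<html>page {self.current}</html>"

    def current_page(self) -> int:
        return self.current

    def go_to(self, number: int) -> bool:
        if number > self.total:
            return False
        self.current = number
        return True


pager = FakePager(total=3)
saved = collect_pages(pager.source, pager.current_page, pager.go_to)
print(len(saved))
```

With Selenium, `get_source` could be `lambda: driver.page_source`, `get_current_page` could read the `li[data-testid=current-page-item]` text, and `go_to_page` could wrap the XPath lookup and click from the answer, returning `False` on `NoSuchElementException`. It would likely also need to wait until the current-page item shows the new number before returning, otherwise the old page may be saved twice.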

【Comments】:

  • Hi, I tried it and got an error: TypeError: int() argument must be a string, a bytes-like object or a number, not 'FirefoxWebElement'
  • @AnnaDmitrieva Oh, yes, I forgot the .text there. An'ka, prover' :)
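The TypeError from the comment can be reproduced without a browser. A small sketch, assuming only that the element exposes its visible text as `.text`; `FakeElement` and `page_number` are hypothetical names, not part of Selenium.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only `.text` is modeled."""

    def __init__(self, text: str) -> None:
        self.text = text


def page_number(element) -> int:
    """Read the page number from the element's visible text."""
    return int(element.text.strip())


item = FakeElement("2")

try:
    int(item)  # passing the element itself, as in the first version of the answer
except TypeError as exc:
    print(f"TypeError: {exc}")

print(page_number(item) + 1)  # number of the next page to look for
```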