【Question Title】: Load More using Selenium on Webscraping
【Posted】: 2020-01-07 03:51:01
【Question】:

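I am trying to scrape Reuters for NLP analysis. Most of it works, but I cannot get the code to click the "Load More" button to fetch more news articles. Below is the code I am currently using: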

import csv
import time
import pprint
from datetime import datetime, timedelta
import requests
import nltk
nltk.download('vader_lexicon')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag

comp_name = 'Apple'
url = 'https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all'

res = requests.get(url.format(1))
soup = BeautifulSoup(res.text,"lxml")
for item in soup.find_all("h3",{"class":"search-result-title"}):
    s = str(item)
    article_addr = s.partition('a href="')[2].partition('">')[0]
    headline = s.partition('a href="')[2].partition('">')[2].partition('</a></h3>')[0]
    article_link = 'https://www.reuters.com' + article_addr

    try:
        resp = requests.get(article_addr)
    except Exception as e:
        try:
            resp = requests.get(article_link)
        except Exception as e:
            continue

    sauce = BeautifulSoup(resp.text,"lxml")
    dateTag = sauce.find("div",{"class":"ArticleHeader_date"})
    contentTag = sauce.find("div",{"class":"StandardArticleBody_body"})

    date = None
    title = None
    content = None

    if isinstance(dateTag,Tag):
        date = dateTag.get_text().partition('/')[0]
    if isinstance(contentTag,Tag):
        content = contentTag.get_text().strip()
    time.sleep(3)
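    # content stays None when the article body is not found, so fall back to an empty string
    link_soup = BeautifulSoup(content or "", "lxml")
    sentences = link_soup.find_all("p")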
    print(date, headline, article_link)

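from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time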

browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
try:
    element = WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID,'Id_Of_Element')))
except TimeoutException: 
    print("Time out!") 

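【Question Comments】:

  • ^ When the editor is open, inline help is available for code formatting.
  • It depends on how many times you want to click. Just once, or for as long as the element is visible?
  • Thanks for the reply. Ideally, we would like to click the "Load More Results" button 10 times.

Tags: javascript python selenium web-scraping lazy-loading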


【Solution 1】:

To click the element with the text LOAD MORE RESULTS, you need to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:

  • Code block:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # Selenium 3 style; Selenium 4 deprecates executable_path in favour of a Service object
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    
    comp_name = 'Apple'
    driver.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
    while True:
        try:
            driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='search-result-more-txt']"))))
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))).click()
            print("LOAD MORE RESULTS button clicked")
        except TimeoutException:
            print("No more LOAD MORE RESULTS button to be clicked")
            break
    driver.quit()
    
  • Console output:

    LOAD MORE RESULTS button clicked
    LOAD MORE RESULTS button clicked
    LOAD MORE RESULTS button clicked
    .
    .
    No more LOAD MORE RESULTS button to be clicked
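
The loop above relies on WebDriverWait polling a condition until it holds or a timeout expires. As a rough illustration of that idea only (a simplified model, not Selenium's actual source), the sketch below uses hypothetical names `wait_until` and `WaitTimeout` and assumes a default polling interval of 0.5 seconds:

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    end_time = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            # Return the truthy value, for example the element that was found
            return value
        if time.monotonic() > end_time:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_frequency)
```

This is why the `except TimeoutException` branch doubles as the exit condition: once the button no longer becomes clickable within the timeout, the wait raises and the loop breaks.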
    

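【Comments】:

  • So how do I extract the titles of the search results here?
【Solution 2】:

To click LOAD MORE RESULTS, induce WebDriverWait() with element_to_be_clickable().

Use a while loop and check a counter.

I tested this on Chrome because I do not have the Safari browser, but it should work there as well.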

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

comp_name = "Apple"
browser = webdriver.Chrome()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')

# Accept the terms button (consent banner)
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#_evidon-banner-acceptbutton"))).click()
i = 1
while i < 11:
    try:
        element = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt' and text()='LOAD MORE RESULTS']")))
        # Accessing this property scrolls the element into view
        element.location_once_scrolled_into_view
        # Click through JavaScript so an overlapping element cannot intercept the click
        browser.execute_script("arguments[0].click();", element)
        print(i)
        i = i + 1
    except TimeoutException:
        # The button did not become clickable in time; stop instead of retrying forever
        print("Time out!")
        break

【讨论】:

  • TimeoutException Traceback (most recent call last): raised at line 11, WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click(), ending in: raise TimeoutException(message, screen, stacktrace) TimeoutException: Message:
  • Thanks for the reply, but I keep getting the error message above ^
  • I am not sure whether this pop-up appears in Safari; you can comment out that line and check. I get the pop-up when interacting with Chrome, which is why I added it.
  • I would also like to know whether there are any suggestions for feeding the results into my earlier code after clicking the "Load More Results" button. Since the URL does not change after more results are loaded, the following two lines still return the same content: res = requests.get(url.format(1)) soup = BeautifulSoup(res.text,"lxml"). Could you please advise?
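
On the two comments asking how to get the titles once more results are loaded: because the extra results are added by JavaScript, a fresh `requests.get` on the same URL will not see them, but the browser's current DOM is available as `browser.page_source`. The sketch below is one possible way to pull titles and links out of that HTML. It uses only the standard library so it can be tried without a browser; the helper names `HeadlineParser` and `extract_headlines` are made up for illustration, and it assumes the results still use the `h3` elements with class `search-result-title` from the question's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.reuters.com"


class HeadlineParser(HTMLParser):
    """Collect (title, absolute URL) pairs from <h3 class="search-result-title"> blocks."""

    def __init__(self):
        super().__init__()
        self.results = []     # finished (title, url) pairs
        self._in_h3 = False   # True while inside a search-result-title heading
        self._href = None     # href of the link currently being read
        self._text = []       # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "search-result-title" in classes:
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a result link (nested tags included)
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append((title, urljoin(BASE_URL, self._href)))
            self._href = None
        elif tag == "h3":
            self._in_h3 = False


def extract_headlines(html):
    """Return a list of (title, url) tuples found in the given HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.results
```

After the click loop finishes, calling `extract_headlines(browser.page_source)` should return every result loaded so far. With BeautifulSoup, the equivalent would be to build the soup from `browser.page_source` instead of `res.text` and keep the question's `soup.find_all("h3", {"class": "search-result-title"})` loop unchanged.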