Load More using Selenium on Webscraping: answers

【Question title】: Load More using Selenium on Webscraping
【Posted】: 2020-01-07 03:51:01
【Question description】:

I am scraping Reuters for NLP analysis. Most of it works, but I cannot get the code to click the "Load More" button to fetch more news articles. Here is the code I am currently using:

import csv
import time
import pprint
from datetime import datetime, timedelta
import requests
import nltk
nltk.download('vader_lexicon')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag

comp_name = 'Apple'
url = 'https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all'

res = requests.get(url.format(1))
soup = BeautifulSoup(res.text,"lxml")
for item in soup.find_all("h3",{"class":"search-result-title"}):
    s = str(item)
    article_addr = s.partition('a href="')[2].partition('">')[0]
    headline = s.partition('a href="')[2].partition('">')[2].partition('</a></h3>')[0]
    article_link = 'https://www.reuters.com' + article_addr

    try:
        resp = requests.get(article_addr)
    except Exception as e:
        try:
            resp = requests.get(article_link)
        except Exception as e:
            continue

    sauce = BeautifulSoup(resp.text,"lxml")
    dateTag = sauce.find("div",{"class":"ArticleHeader_date"})
    contentTag = sauce.find("div",{"class":"StandardArticleBody_body"})

    date = None
    title = None
    content = None

    if isinstance(dateTag,Tag):
        date = dateTag.get_text().partition('/')[0]
    if isinstance(contentTag,Tag):
        content = contentTag.get_text().strip()
    time.sleep(3)
    link_soup = BeautifulSoup(content)
    sentences = link_soup.findAll("p")
    print(date, headline, article_link)

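from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time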

browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
try:
    element = WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID,'Id_Of_Element')))
except TimeoutException: 
    print("Time out!")

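【Question comments】:

^ Online help for code formatting is available while the editor is open.
It depends on how many times you want to click. Just once, or for as long as the element is visible?
Thanks for the reply. Ideally, we would like to click the "LOAD MORE RESULTS" button 10 times.

Tags: javascript python selenium web-scraping lazy-loading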

【Solution 1】:

To click the element with the text LOAD MORE RESULTS, you need to induce WebDriverWait for element_to_be_clickable(). You can use the following locator strategy:

Code block:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')

comp_name = 'Apple'
driver.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
while True:
    try:
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='search-result-more-txt']"))))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))).click()
        print("LOAD MORE RESULTS button clicked")
    except TimeoutException:
        print("No more LOAD MORE RESULTS button to be clicked")
        break
driver.quit()

Console output:

LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
.
.
No more LOAD MORE RESULTS button to be clicked
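
Since the asker wants a fixed number of clicks (10) rather than clicking until the button disappears, the same idea can be capped. The following is only a sketch: `click_up_to` and `make_clicker` are hypothetical helper names, not part of Selenium, and the XPath is the one used in this answer. Selenium is imported inside `make_clicker` so the counting logic can be read and tried on its own.

```python
def click_up_to(click_once, max_clicks=10):
    """Call click_once() until it returns False or max_clicks is reached.

    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks and click_once():
        clicks += 1
    return clicks


def make_clicker(driver, xpath="//div[@class='search-result-more-txt']", timeout=20):
    """Build a click_once() callable for a Selenium driver (hypothetical helper)."""
    # Imported lazily so click_up_to() can be used without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def click_once():
        try:
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.XPATH, xpath)))
        except TimeoutException:
            # The button never became clickable: assume there is nothing more to load.
            return False
        driver.execute_script("arguments[0].scrollIntoView(true);", button)
        button.click()
        return True

    return click_once
```

With the `driver` from the code block above, `click_up_to(make_clicker(driver), max_clicks=10)` would stop after ten clicks, or earlier if the button times out.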

Reference

You can find a relevant detailed discussion in:

Clicking “More” button via selenium

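【Comments】:

So how do I extract the titles of the search results here?

【Solution 2】:

To click LOAD MORE RESULTS, induce WebDriverWait() with element_to_be_clickable().

Use a while loop and check a counter.

I tested this on Chrome because I don't have the Safari browser, but it should work there too.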

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

comp_name = "Apple"
browser = webdriver.Chrome()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')

# Accept the terms banner
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click()
i = 1
while i < 11:
    try:
        element = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.XPATH,"//div[@class='search-result-more-txt' and text()='LOAD MORE RESULTS']")))
        # Accessing this property scrolls the element into view
        element.location_once_scrolled_into_view
        # Click via JavaScript, which bypasses Selenium's interactability checks
        browser.execute_script("arguments[0].click();", element)
        print(i)
        i = i + 1
    except TimeoutException:
        # Stop instead of retrying forever once the button no longer appears
        print("Time out!")
        break
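
One thing a counter loop like this does not check is whether a click actually loaded more results before the next click is attempted. A minimal sketch of a polling helper follows, assuming the number of `h3.search-result-title` elements (the class used in the question) grows after each successful click; that is an assumption about the page, not something confirmed in this thread. `wait_for_more` and `count_results` are hypothetical names.

```python
import time


def wait_for_more(count_results, previous, timeout=10.0, poll=0.5):
    """Poll count_results() until it exceeds `previous`.

    Returns the new count, or None if nothing changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_results()
        if current > previous:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

With Selenium, `count_results` could be something like `lambda: len(browser.find_elements(By.CSS_SELECTOR, "h3.search-result-title"))`, called once before the click to get `previous` and then passed to `wait_for_more` after it.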

【Comments】:

TimeoutException Traceback (most recent call last) in
      9
     10 #Accept the trems button
---> 11 WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click()
     12 i=1
     13 while i
[...] end_time:
     79     break
---> 80 raise TimeoutException(message, screen, stacktrace)
     81
     82 def until_not(self, method, message=''):
TimeoutException: Message:
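Thanks for the reply, but I keep getting the error message above ^
I'm not sure whether this pop-up appears for you on Safari; try commenting out that line and check. The pop-up appeared when I was interacting with Chrome, which is why I added it, so you can comment out that line of code.
I'd also like to know whether there are any suggestions for merging the results into my earlier code after clicking the "LOAD MORE RESULTS" button. Since the URL does not change after more results are loaded, the following 2 lines still return the same content: res = requests.get(url.format(1)) soup = BeautifulSoup(res.text,"lxml") Could you please advise?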
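
On that last point: `requests.get` fetches the page again from the server and never sees what the browser loaded after the clicks, whereas the Selenium session exposes the expanded DOM through its `page_source` attribute. A minimal sketch of extracting headlines and links from that HTML follows. It uses the standard-library `HTMLParser` instead of the BeautifulSoup calls from the question, purely so it runs without extra dependencies; `TitleParser` and `extract_results` are hypothetical names, and the `h3.search-result-title` markup is taken from the question's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.reuters.com"


class TitleParser(HTMLParser):
    """Collect (headline, absolute link) pairs from <h3 class="search-result-title">."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_title = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "search-result-title" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside the link of a result title.
        if self._in_title and self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_title:
            if self._href is not None:
                headline = "".join(self._text).strip()
                # urljoin leaves absolute links alone and resolves relative ones.
                self.results.append((headline, urljoin(BASE_URL, self._href)))
            self._in_title = False
            self._href = None


def extract_results(page_source):
    """Return a list of (headline, link) tuples found in the given HTML."""
    parser = TitleParser()
    parser.feed(page_source)
    return parser.results
```

After the click loop has finished, `extract_results(browser.page_source)` would return every headline currently in the browser, and each link could then be fetched with `requests.get` as in the original loop. Passing `browser.page_source` to `BeautifulSoup(..., "lxml")` and reusing the question's `find_all` call should work the same way.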