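【Question title】: Selenium in Python is skipping articles while trying to scrape the data
【Posted】: 2020-07-16 05:05:54
【Question】:

I am trying to extract data from articles using Selenium in Python. The code identifies the articles, but when the loop runs, some articles are skipped seemingly at random. Any help resolving this would be greatly appreciated.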

#Importing libraries
import requests
import os
import json
import pandas as pd
from bs4 import BeautifulSoup
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import traceback
from webdriver_manager.chrome import ChromeDriverManager

#opening a chrome instance
options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options, executable_path=r"C:/selenium/chromedriver.exe")

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/div[3]/main/section/div/div/div[1]/div/div[3]/div[2]/div[3]/div/div/div/div/h5')))

#loop to get in and out of articles
for article in articles:
    try:
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to_window(window1)
        driver.close()
        driver.switch_to_window(window0)
    except:
        print("couldnt get the article")

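【Comments】:

  • This may seem a bit simple, but have you tried increasing the wait time for each click? 10 seconds may not be enough for the article to open. I can't see anything seriously wrong with your code.
  • It may be worth shortening the XPATH selector. (By.XPATH, '//h5[@class="customLink item-title"]') is more concise. Having looked at the selector, are you sure the h5 is what you want to click, rather than its direct child a element?
  • @AaronS I tried increasing the wait time, but it didn't work. And yes, using the xpath you suggested makes the code look a bit cleaner. Thanks.

Tags: python-3.x selenium selenium-webdriver xpath web-scraping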


【Solution 1】:

First, to collect all the article elements, you can use this CSS selector:

articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

Second, this is the wrong method:

driver.switch_to_window(window1)

It should be:

driver.switch_to.window(window1)

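Note the difference between the _ and the . in the two lines above.

Third, you forgot to initialize the window0 variable, so the final switch back raises a NameError that the bare except silently swallows: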

window0 = driver.window_handles[0]
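
Relying on the positions 0 and 1 in driver.window_handles works here, but it assumes the handles come back in a particular order. A minimal sketch of a more defensive variant is shown below: record the handles before the click and pick out the one that is new afterwards. The helper name find_new_handle is hypothetical (it is not part of Selenium); driver.window_handles and driver.current_window_handle are the real attributes it is meant to be fed from.

```python
def find_new_handle(handles_before, handles_after):
    """Return the one handle that exists after the click but not before it.

    Hypothetical helper, not a Selenium API. Both arguments are plain lists
    of window-handle strings, e.g. snapshots of driver.window_handles.
    """
    known = set(handles_before)
    new_handles = [h for h in handles_after if h not in known]
    if len(new_handles) != 1:
        raise RuntimeError(
            f"expected exactly one new window, found {len(new_handles)}"
        )
    return new_handles[0]


# Intended use inside the loop (needs a live driver, so shown as comments):
#   original = driver.current_window_handle
#   before = driver.window_handles
#   ... ctrl-click the article link ...
#   driver.switch_to.window(find_new_handle(before, driver.window_handles))
#   driver.close()
#   driver.switch_to.window(original)
```

Raising an error when zero or several new windows appear makes a failed click visible instead of silently switching to the wrong tab.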

Finally, try the following code:

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

#loop to get in and out of articles
for article in articles:
    try:
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to.window(window1)
        driver.close()
        window0 = driver.window_handles[0]
        driver.switch_to.window(window0)
    except:
        print("couldnt get the article")

driver.quit()
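
If the goal is only to reach each article page, one alternative that avoids window switching altogether is to read every link's href first and then load the URLs one by one in the same window. The sketch below is an assumption-laden illustration, not the answer's method: it assumes the elements matched by '.customLink.item-title a' carry an href attribute, and collect_hrefs and visit_each are hypothetical helper names. It also records the actual exception for each failure, since a bare except hides errors such as the undefined window0 in the question.

```python
def collect_hrefs(link_elements):
    """Read the href of every link up front, skipping elements without one."""
    hrefs = []
    for element in link_elements:
        # get_attribute is the WebElement method for reading an attribute
        href = element.get_attribute("href")
        if href:
            hrefs.append(href)
    return hrefs


def visit_each(driver, urls):
    """Load each URL in the current window; return (url, error) for failures."""
    failures = []
    for url in urls:
        try:
            driver.get(url)
            # ... extract the data from the article page here ...
        except Exception as exc:
            # Keep the real error text instead of printing a generic message
            failures.append((url, repr(exc)))
    return failures


# Intended use with the objects from the code above (needs a live driver):
#   urls = collect_hrefs(articles)
#   for url, error in visit_each(driver, urls):
#       print("could not get", url, error)
```

Collecting the URLs before navigating matters: once driver.get leaves the issue page, the previously located elements become stale and can no longer be queried.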

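【Comments】:

  • Thanks, it works. I'm an amateur in the scraping field, so could you spare a few seconds to let me know the difference between switch_to_window and switch_to.window?
  • @VenuBhaskar Glad it worked. Actually, so far I haven't been able to find a switch_to_window method in Selenium for Python. I think it is the wrong method.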