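【Posted】: 2020-07-16 05:05:54
【Question】:
I am trying to extract data from articles using selenium in Python. The code finds the articles, but when the loop runs, some articles are randomly skipped. Any help with this would be greatly appreciated.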
#Importing libraries
import os
import json
import time
import traceback
from time import sleep
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
#opening a chrome instance
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r"C:/selenium/chromedriver.exe")
#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')
#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/div[3]/main/section/div/div/div[1]/div/div[3]/div[2]/div[3]/div/div/div/div/h5')))
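#loop to get in and out of articles
#handle of the issue page, so the driver can return to it after each article
window0 = driver.current_window_handle
for article in articles:
    try:
        #ctrl+click opens the article in a new tab
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to.window(window1)
        driver.close()
        driver.switch_to.window(window0)
    except Exception:
        print("couldn't get the article")
        #show the actual error instead of hiding it
        traceback.print_exc()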
【Discussion】:
-
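This may sound a bit basic, but have you tried increasing the wait time for each click? 10 seconds may not be enough for an article to open. I can't see anything seriously wrong with your code.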
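As an alternative to simply raising the timeout, it may help to wait for the specific new window handle rather than for a window count. The following is only a minimal sketch of that idea using plain polling. `wait_for_new_window` is a hypothetical helper name, not part of selenium, and it assumes nothing about the driver beyond the `window_handles` attribute already used in the question.

```python
import time


def wait_for_new_window(driver, known_handles, timeout=30.0, poll=0.5):
    """Return the handle of a window that is not in known_handles.

    Polls driver.window_handles until a new handle shows up, or raises
    TimeoutError once `timeout` seconds have passed.
    (Hypothetical helper; the name and parameters are illustrative.)
    """
    deadline = time.monotonic() + timeout
    while True:
        new_handles = [h for h in driver.window_handles if h not in known_handles]
        if new_handles:
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no new window appeared within the timeout")
        time.sleep(poll)
```

In the loop, this could be used by recording `known = set(driver.window_handles)` before the click and then calling `driver.switch_to.window(wait_for_new_window(driver, known))`, so the new tab is identified by its handle and not by its position in the list.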
-
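It might be worth shortening the XPath selector. (By.XPATH, '//h5[@class="customLink item-title"]') is more concise. Looking at the selector, are you sure the h5 is what you want to click, rather than its direct child element?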
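To illustrate how the shorter selector behaves, here is a small offline sketch using the standard library's `xml.etree.ElementTree` in place of a browser. The markup, titles, and hrefs are made up for the example, and the `<a>` child inside each `h5` is an assumption; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the issue page.
SAMPLE = """
<main>
  <div>
    <h5 class="customLink item-title"><a href="/example/article-1">First article</a></h5>
  </div>
  <div>
    <h5 class="customLink item-title"><a href="/example/article-2">Second article</a></h5>
  </div>
  <h5 class="other">Not an article</h5>
</main>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree needs a relative path (".//"); in selenium the same
# selector would be written '//h5[@class="customLink item-title"]'.
headings = root.findall(".//h5[@class='customLink item-title']")

# Selecting the child element directly gives access to its href.
links = root.findall(".//h5[@class='customLink item-title']/a")
titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(len(headings), titles, hrefs)
```

The attribute match selects elements by class regardless of how deeply they are nested, which is why the short form should be less brittle than an absolute path starting at /html/body.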
-
@AaronS I tried increasing the wait time, but it didn't help. And yes, using the XPath you suggested makes the code look a bit cleaner. Thanks.
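Since a longer wait did not help, another option might be to avoid clicking and tab switching altogether: read every article URL up front, then load each one in the same window. This is only a sketch under assumptions. `collect_hrefs` and `visit_all` are hypothetical names, and it assumes the matched elements expose an `href` attribute, for example by selecting with '//h5[@class="customLink item-title"]/a', which has not been verified against the page.

```python
def collect_hrefs(driver, xpath):
    """Return the href of every element matched by xpath, read up front."""
    # "xpath" is the string value of selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", xpath)
    hrefs = [element.get_attribute("href") for element in elements]
    # Drop elements without an href (get_attribute returns None for them).
    return [href for href in hrefs if href]


def visit_all(driver, urls, handle_page):
    """Load each URL in the current window; return the URLs that failed."""
    failed = []
    for url in urls:
        try:
            driver.get(url)
            handle_page(driver)  # caller-supplied extraction step
        except Exception as error:
            print(f"could not process {url}: {error!r}")
            failed.append(url)
    return failed
```

Because the URLs are plain strings collected before any navigation, the loop no longer depends on element references from the issue page or on which window currently has focus, and the returned list makes any skipped article visible.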
Tags: python-3.x selenium selenium-webdriver xpath web-scraping