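【Question title】: python selenium driver not scrolling to collect all data points
【Posted】: 2019-05-29 07:43:00
【Question】:

I'm trying to scrape some data from the EPA website. Unfortunately I can't capture all of the data points; I suspect this is due to a combination of scrolling and waiting for the tags to become visible. I've been working on this since yesterday with no luck.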

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys
import numpy as np

options = webdriver.ChromeOptions()
path = '/Users/<user>/Applications/chromedriver'
options.set_headless(True) 

driver = webdriver.Chrome(chrome_options= options, executable_path=path)
url = 'https://edap.epa.gov/public/single/?appid=73b2b6a5-70c6-4820-b3fa-186ac094f10d&obj=b5bf280c-3488-4e46-84f6-58e2a0c34108&opt=noanimate%2Cnoselections&select=clearall'

driver.set_window_size(1920, 1080)
driver.get(url)



SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

rin_data = []

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME,"qv-st-value-overflow")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    tableURL = soup.select('.qv-st-value-overflow')
    for rin_val in tableURL:
        rin_data.append(rin_val.get_text())

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

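【Comments】:

  • If you look at the Network tab, you can see the request that fetches all the raw data: https://edap.epa.gov/public/qrs/extension/schema?xrfkey=i3EE7XfLtdhxtTBr. If you only need to get it once, you can copy the data from the inspector.
  • Hi Jacob, I'm trying to write a script that pulls the data periodically, since it is updated every month. I'm not sure I can access that URL, because I get the error "XSRF prevention check failed. Possible XSRF discovered."

Tags: python selenium selenium-webdriver web-scraping beautifulsoup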


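【Solution 1】:

The page fetches its data over a WebSocket rather than Ajax, so you need to scroll table[ng-style="tableStyles.content"] instead of body. That table does not respond to window.scrollTo; it needs custom scrolling, or scrolling with the mouse wheel. The function below is taken from here.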

SCROLL_PAUSE_TIME = 2

driver.get(url)

# add mouse wheel function to the page
driver.execute_script('''
window.scrollTable = function() {
  var element = document.querySelector('table[ng-style="tableStyles.content"]')
  var box = element.getBoundingClientRect();
  var deltaY = box.height;
  var clientX = box.left + (box.width / 2);
  var clientY = box.top + (box.height / 2);
  var target = element.ownerDocument.elementFromPoint(clientX, clientY);

  for (var e = target; e; e = e.parentElement) {
    if (e === element) {
      target.dispatchEvent(new WheelEvent('wheel', {view: window, bubbles: true, cancelable: true, clientX: clientX, clientY: clientY, deltaY: deltaY}));
    }
  }
}
''')

rin_data = []

while True:
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'tr[class^="qv-st-data-row"]'))
    )
    last_position = driver.find_element(By.CSS_SELECTOR, ".scrollbar-thumb").get_attribute('style')
    rows = driver.find_elements(By.CSS_SELECTOR, 'tr[class^="qv-st-data-row"]')
    for row in rows:
        rin_data.append(row.text)

    # Scroll down the table
    driver.execute_script('scrollTable()')
    # Wait for content to load over the WebSocket; increase the pause if rows are missed
    time.sleep(SCROLL_PAUSE_TIME)

    # Read the new scroll position and compare it with the last one
    new_position = driver.find_element(By.CSS_SELECTOR, ".scrollbar-thumb").get_attribute('style')
    if new_position == last_position:
        break

Note that you don't need BeautifulSoup in this case.
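
One thing to watch for: the loop reads every row currently rendered in the table on each pass, so if a scroll step moves by less than a full table height (for example on the last step), some rows may be appended twice. A minimal sketch of an order-preserving filter follows. The helper name merge_rows is hypothetical, and it assumes each row's text is unique, so genuinely identical rows in the table would be collapsed into one.

```python
def merge_rows(collected, seen, new_rows):
    """Append rows from new_rows that were not seen before, keeping order.

    collected: list of row texts gathered so far (modified in place)
    seen:      set of row texts already in collected (modified in place)
    new_rows:  iterable of row texts read after the latest scroll step
    """
    for text in new_rows:
        if text not in seen:
            seen.add(text)
            collected.append(text)
    return collected


# Simulate two overlapping scroll steps (illustrative data, not from the EPA table)
rin_data = []
seen = set()
merge_rows(rin_data, seen, ["row A", "row B"])
merge_rows(rin_data, seen, ["row B", "row C"])
print(rin_data)  # ['row A', 'row B', 'row C']
```

Inside the scraping loop, this would replace the inner for loop with something like merge_rows(rin_data, seen, [row.text for row in rows]).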

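【Comments】:

  • This worked, thank you very much. I wasn't aware of the WebSocket usage. Thanks also for the link to the earlier post. On a side note, can you recommend any other good scraping documentation that gives more insight into nuances like this?
  • Hi ewwink, I'm working through your code because I seem to be missing some rows, and I think scrolling may be the cause. I tried to look up the scrollTable() function so I could learn more about it, but when I googled it I didn't find it used anywhere.
  • scrollTable() is a custom function. If some rows are missing, try changing the value set in var deltaY = box.height; or increasing SCROLL_PAUSE_TIME.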
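
To make the deltaY suggestion easier to experiment with, the injected script can be generated from Python with the scroll step as a parameter. This is only a sketch: build_scroll_script and its fraction argument are hypothetical names, not part of Selenium or of the original answer, and the JavaScript body is the same function as above with deltaY scaled by that fraction. A smaller fraction scrolls less per step, which makes skipped rows less likely but means consecutive reads overlap more.

```python
# Template for the custom wheel-scroll function; __FRACTION__ is a placeholder
# that gets replaced with a number before the script is sent to the browser.
_SCROLL_TEMPLATE = '''
window.scrollTable = function() {
  var element = document.querySelector('table[ng-style="tableStyles.content"]');
  var box = element.getBoundingClientRect();
  var deltaY = box.height * __FRACTION__;
  var clientX = box.left + (box.width / 2);
  var clientY = box.top + (box.height / 2);
  var target = element.ownerDocument.elementFromPoint(clientX, clientY);

  for (var e = target; e; e = e.parentElement) {
    if (e === element) {
      target.dispatchEvent(new WheelEvent('wheel', {view: window, bubbles: true, cancelable: true, clientX: clientX, clientY: clientY, deltaY: deltaY}));
    }
  }
}
'''


def build_scroll_script(fraction=1.0):
    """Return the scrollTable() definition with deltaY = box.height * fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    return _SCROLL_TEMPLATE.replace("__FRACTION__", repr(float(fraction)))


script = build_scroll_script(0.5)
print("box.height * 0.5;" in script)  # True
```

With a driver already on the page, driver.execute_script(build_scroll_script(0.5)) would stand in for the inline script, and driver.execute_script('scrollTable()') is then called in the loop as before.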