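【Posted】: 2019-05-29 07:43:00
【Question】:
I am trying to scrape some data from the EPA website, but unfortunately I cannot capture all of the data points. I suspect the cause is a combination of scrolling and waiting for the tags to become visible, but I have been working on this since yesterday with no luck.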
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys
import numpy as np
options = webdriver.ChromeOptions()
path = '/Users/<user>/Applications/chromedriver'
options.set_headless(True)
driver = webdriver.Chrome(chrome_options= options, executable_path=path)
url = 'https://edap.epa.gov/public/single/?appid=73b2b6a5-70c6-4820-b3fa-186ac094f10d&obj=b5bf280c-3488-4e46-84f6-58e2a0c34108&opt=noanimate%2Cnoselections&select=clearall'
driver.set_window_size(1920, 1080)
driver.get(url)
SCROLL_PAUSE_TIME = 0.5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
rin_data = []
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME,"qv-st-value-overflow")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    tableURL = soup.select('.qv-st-value-overflow')
    for rin_val in tableURL:
        rin_data.append(rin_val.get_text())
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
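In the loop above, every pass re-parses the whole page and appends every matching cell again, so rin_data can pick up the same cell more than once. The loop also stops as soon as document.body.scrollHeight stops changing. The following is only a minimal sketch of the same scroll-and-collect idea with the browser calls passed in as plain functions, so the logic can be checked without a browser. The name scroll_and_collect and its parameters are hypothetical, not part of Selenium. It also assumes that cell texts are unique, so a value that legitimately repeats in the table would be kept only once.

```python
import time
from typing import Callable, Iterable, List


def scroll_and_collect(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    read_values: Callable[[], Iterable[str]],
    pause: float = 0.5,
    max_rounds: int = 200,
) -> List[str]:
    """Scroll until the height stops growing, keeping each value once.

    Hypothetical helper: the three callables stand in for the
    driver.execute_script(...) and BeautifulSoup calls in the question.
    """
    collected: List[str] = []
    seen = set()

    def read_new() -> None:
        # Append only values that have not been seen yet, in page order.
        for value in read_values():
            if value not in seen:
                seen.add(value)
                collected.append(value)

    last_height = get_height()
    for _ in range(max_rounds):
        read_new()              # read what is rendered before scrolling away
        scroll_to_bottom()
        time.sleep(pause)       # give the page time to render new rows
        new_height = get_height()
        if new_height == last_height:
            break               # height did not grow, so stop scrolling
        last_height = new_height

    read_new()                  # pick up rows rendered by the last scroll
    return collected
```

With the driver from the question, get_height could be `lambda: driver.execute_script("return document.body.scrollHeight")` and read_values a function that runs `soup.select('.qv-st-value-overflow')` on a fresh `driver.page_source`. If the table scrolls inside its own container rather than the window (an assumption, not verified here), the height check would have to target that container instead.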
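【Comments】:
- If you look at the Network tab, you can see the request that fetches all of the raw data: https://edap.epa.gov/public/qrs/extension/schema?xrfkey=i3EE7XfLtdhxtTBr. If you only need to fetch it once, you can copy the data from the inspector.
- Hi Jacob, I am trying to write a script that will periodically pull the data as it is updated each month. I am not sure I can access that URL, because I get the error "XSRF prevention check failed. Possible XSRF discovered."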
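The error text quoted in the last comment matches what Qlik Sense services return when the xrfkey query parameter is not accompanied by an X-Qlik-Xrfkey request header carrying the same 16-character value. Assuming this site is served by Qlik Sense (the /qrs/ path and the qv- class names suggest it, but this is not confirmed), a request could be prepared roughly as sketched below. The function name build_xrf_request is hypothetical, and whether the endpoint permits anonymous access at all is unknown; the sketch only addresses the XSRF check.

```python
import secrets
import string
from typing import Dict, Tuple
from urllib.parse import urlencode

# Characters allowed in the key: letters and digits only.
XRF_ALPHABET = string.ascii_letters + string.digits


def build_xrf_request(base_url: str) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) with a matching xrfkey parameter and header.

    Hypothetical helper: generates a random 16-character key and places
    the same value in both the query string and the request header.
    """
    key = "".join(secrets.choice(XRF_ALPHABET) for _ in range(16))
    url = f"{base_url}?{urlencode({'xrfkey': key})}"
    headers = {"X-Qlik-Xrfkey": key}
    return url, headers


# URL taken from the comment above, without its query string.
url, headers = build_xrf_request(
    "https://edap.epa.gov/public/qrs/extension/schema"
)
```

The returned pair can then be handed to whichever HTTP client is in use, for example `requests.get(url, headers=headers)` if the requests package is installed.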
Tags: python selenium selenium-webdriver web-scraping beautifulsoup