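【Posted】: 2020-07-05 12:08:27
【Question】:
I wrote code that uses BeautifulSoup and Selenium to fetch a table.
However, only part of the table is retrieved. Rows and columns that are not shown when the website is first opened are not captured in the soup object.
I am fairly sure the problem lies in the line WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "contenttabledivjqxGrid")))
...I tried several alternatives, but none gave me the expected result (that is, loading all rows and columns of this table before I change the date with Selenium).
The code follows: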
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
# Configure a Firefox profile and options, then create the driver
profile = webdriver.FirefoxProfile()
profile.set_preference('intl.accept_languages', 'pt-BR, pt')
profile.set_preference('browser.download.folderList', 2)  # integer pref: 2 = use a custom download directory
profile.set_preference('browser.download.manager.showWhenStarting', False)  # boolean pref
profile.set_preference('browser.download.dir', 'dwnd_path')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/octet-stream,application/vnd.ms-excel')
options = Options()
options.headless = False
driver = webdriver.Firefox(firefox_profile=profile, options=options)
# Open the page
site = 'http://mananciais.sabesp.com.br/HistoricoSistemas'
driver.get(site)
WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.ID,"contenttabledivjqxGrid")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Header
header = soup.find_all('div', {'class': 'jqx-grid-column-header'})
for i in header:
    print(i.get_text())
# Keep only the relevant columns
head = []
for i in header:
    if i.get_text().startswith(('Represa', 'Equivalente')):
        print('Excluído: ' + i.get_text())
    else:
        print(i.get_text())
        head.append(i.get_text())
print('-'*70)
print(head)
print('-'*70)
print('Número de Colunas: ' + str(len(head)))
# Values
data = soup.find_all('div', {'class': 'jqx-grid-cell'})
values = []
for i in data:
    print(i.get_text())
    values.append(i.get_text())
import numpy as np
import pandas as pd
# Convert data to a NumPy array
num = np.array(values)
# The array is currently one-dimensional
n_rows = len(num) // len(head)
n_cols = len(head)
reshaped = num.reshape(n_rows, n_cols)
# Construct the table
pd.DataFrame(reshaped, columns=head)
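A side note on the reshape step in the code above: the header list drops the columns starting with 'Represa' or 'Equivalente', but the cell list is not filtered the same way, so the number of cells may not be a multiple of len(head). A minimal sketch of one way to keep the two in sync is shown below. It assumes the cells arrive in row-major order with one cell per original header, which is the same assumption the reshape already makes. The function name build_table and its parameters are hypothetical, not part of any library.

```python
from typing import List, Sequence, Tuple


def build_table(
    headers: Sequence[str],
    cells: Sequence[str],
    excluded_prefixes: Tuple[str, ...] = ("Represa", "Equivalente"),
) -> Tuple[List[str], List[List[str]]]:
    """Split a flat list of cell texts into rows and drop excluded columns.

    Assumes `cells` is row-major and has exactly one entry per header per row.
    """
    n_cols = len(headers)
    if n_cols == 0 or len(cells) % n_cols != 0:
        raise ValueError(
            f"{len(cells)} cells cannot be split into rows of {n_cols} columns"
        )
    # Indices of the columns to keep, decided from the *unfiltered* headers.
    keep = [i for i, h in enumerate(headers) if not h.startswith(excluded_prefixes)]
    kept_headers = [headers[i] for i in keep]
    # Walk the flat list one full row at a time and pick only the kept columns.
    rows = [
        [cells[start + i] for i in keep]
        for start in range(0, len(cells), n_cols)
    ]
    return kept_headers, rows
```

With the variables from the code above, this could be called as build_table([i.get_text() for i in header], values), and the result passed to pd.DataFrame(rows, columns=kept_headers). Raising an error on a mismatched cell count makes a partially loaded grid visible instead of silently producing a misaligned table.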
I am just a hydrologist trying to get this reservoir data. Can anyone help me?
Currently my result table looks like this:
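【Comments】:
-
Without digging into the code of the page you are scraping, I don't know the solution. Basically, the page only loads what you are currently viewing, so your HTML response does not contain all the data you want. There is probably some kind of listener on the page (jQuery, for example) that loads more data as soon as you need it. If you look at that jQuery script, you may be able to scrape directly from whatever resource it queries.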
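If the underlying resource turns out to be hard to query directly, another option that follows from the point above is to scroll the grid and accumulate whatever is rendered at each step. The sketch below shows only the accumulation logic. The two callbacks, read_visible_rows and scroll_down, are hypothetical placeholders: with Selenium, the first might parse driver.page_source and the second might scroll the grid element. It also assumes each row is unique (for example because it contains a date), since duplicates are detected by row content.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, ...]


def collect_all_rows(
    read_visible_rows: Callable[[], Iterable[Row]],
    scroll_down: Callable[[], None],
    max_steps: int = 1000,
) -> List[Row]:
    """Accumulate rows from a grid that only renders its visible slice.

    Both callbacks are supplied by the caller (hypothetical here): one returns
    the rows currently rendered, the other scrolls the grid by some amount.
    """
    seen = set()
    collected: List[Row] = []
    for _ in range(max_steps):
        new_rows = 0
        for row in read_visible_rows():
            row = tuple(row)
            if row not in seen:
                seen.add(row)
                collected.append(row)
                new_rows += 1
        if new_rows == 0:
            # Nothing new appeared after the last scroll: assume the end was reached.
            break
        scroll_down()
    return collected
```

Passing the callbacks in keeps the loop independent of how the page is read, so it can be tried against a fake grid before wiring it to a real browser. The scroll step should be smaller than the visible window so that no rows are skipped between snapshots.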
Tags: python selenium web-scraping beautifulsoup