Web scraping does not get the whole table

【Question title】: Web scraping does not get the whole table
【Posted】: 2020-07-05 12:08:27
【Question】:

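I wrote code that uses BeautifulSoup and Selenium to get a table.

However, only part of the table is retrieved. The rows and columns that are not shown when you visit the website are not captured in the soup object.

I'm sure the problem lies in the line WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "contenttabledivjqxGrid")))

...I have tried several alternatives, but none of them gave me the expected result (that is, loading all rows and columns of this table before I change the dates with Selenium).

Here is the code: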

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# Configure the Firefox profile and options
profile = webdriver.FirefoxProfile()
profile.set_preference('intl.accept_languages', 'pt-BR, pt')
# folderList expects an integer (2 = use the custom download directory)
profile.set_preference('browser.download.folderList', 2)
# showWhenStarting expects a boolean, not the string 'false'
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', 'dwnd_path')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/octet-stream,application/vnd.ms-excel')

options = Options()
options.headless = False

# Create the driver
driver = webdriver.Firefox(firefox_profile=profile, options=options)

# Open the site

site = 'http://mananciais.sabesp.com.br/HistoricoSistemas'
driver.get(site)


WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.ID,"contenttabledivjqxGrid")))
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Header
header = soup.find_all('div', {'class': 'jqx-grid-column-header'})
for i in header:
    print(i.get_text())


# Keep only the relevant columns
head = []
for i in header:
    if i.get_text().startswith(('Represa', 'Equivalente')):
        print('Excluído: ' + i.get_text())
    else:
        print(i.get_text())
        head.append(i.get_text())

print('-'*70)
print(head)
print('-'*70)
print('Número de Colunas: ' + str(len(head)))

# Values
data = soup.find_all('div', {'class': 'jqx-grid-cell'})
values = []
for i in data:
    print(i.get_text())
    values.append(i.get_text())


import numpy as np
import pandas as pd

# Convert data to numpy array
num = np.array(values)

# Currently its shape is single dimensional
n_rows = int(len(num)/len(head))
n_cols = int(len(head))
reshaped = num.reshape(n_rows, n_cols)

# Construct Table
pd.DataFrame(reshaped, columns=head)

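I'm just a hydrologist who wants to get this reservoir data. Can anyone help me?

Currently my result table looks like this:

【Comments】:

Without digging into the code of the page you are scraping, I can't give you a solution. Basically, the page only loads what you are looking at, so the HTML you get back doesn't contain all the data you want. There is probably some kind of listener on the page (e.g. jQuery) that loads more data as you need it. If you look at that jQuery script, you may be able to scrape directly from whatever resource it queries.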

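Tags: python selenium web-scraping beautifulsoup

【Solution 1】:

It looks like the table is loaded dynamically and only the visible part of it is present in the HTML, which is why you only get part of the data. A possible solution is to scroll the table with Selenium and read the data piece by piece.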
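
The scroll-and-read idea above can be sketched as follows. This is only a minimal sketch, not something verified against the site: collect_rows, read_visible_rows and scroll_down are hypothetical names, and the sketch assumes every row is unique (for example because it carries a date), since duplicates are used to detect rows that are still on screen after a scroll.

```python
def collect_rows(read_visible_rows, scroll_down, max_scrolls=500):
    """Accumulate rows from a virtualized grid.

    read_visible_rows: callable returning the rows currently rendered,
                       each row being a sequence of cell values.
    scroll_down:       callable that scrolls the grid one step.
    Stops when a pass yields no unseen row, or after max_scrolls passes.
    """
    seen = set()
    rows = []
    for _ in range(max_scrolls):
        found_new = False
        for row in read_visible_rows():
            key = tuple(row)
            if key not in seen:  # assumption: identical rows are the same row
                seen.add(key)
                rows.append(list(row))
                found_new = True
        if not found_new:
            break  # nothing new appeared, so the end was probably reached
        scroll_down()
    return rows
```

With Selenium, read_visible_rows could parse driver.page_source for the jqx-grid-cell divs as in the question, and scroll_down could send Keys.PAGE_DOWN to the grid or call driver.execute_script. Which of these actually moves this particular grid has not been checked here.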

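【Discussion】:

【Solution 2】:

I just had a look at the website. In Firefox, if you go to Developer Tools > Network and inspect the file named "0", you will see that its response is a JSON document containing all the information you need (Figure 1). To get this information, you have to reproduce the request headers (Figure 2).

Figure 1: Request response

Figure 2: Request headers

You need to perform a GET request to the website with these headers; if it is accepted, the response will be JSON containing all your data. Keep in mind that some requests may require a Cookie header, which you need to obtain before making the request.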

I don't know Beautiful Soup very well, but I know this can be done with Scrapy or the Requests library. I'm pretty sure this will point you in the right direction.
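
A minimal sketch of the GET request described above, using Requests and pandas. The function names fetch_json and json_to_frame are hypothetical, and the request URL, the header values and the layout of the JSON are not given in this answer, so they are left as parameters to be filled in from the Network tab.

```python
import pandas as pd
import requests


def fetch_json(url, headers, timeout=30):
    """GET `url` with the copied request headers and return the decoded JSON."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly if the request was rejected
    return response.json()


def json_to_frame(payload, record_path=None):
    """Flatten a JSON payload into a DataFrame.

    payload:     a list of records, or a dict that contains one.
    record_path: key (or list of keys) leading to the list of records;
                 this depends on the real response and is an assumption.
    """
    return pd.json_normalize(payload, record_path=record_path)
```

Usage would look roughly like json_to_frame(fetch_json(url, headers), record_path=...), where url and headers are copied from the request shown in the browser and record_path is chosen after inspecting the response. If the site requires a Cookie header, a requests.Session that first visits the page may be needed.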

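【Discussion】:

Where is the answer itself?
Wow! Thank you very much! That's it! Definitely! It works! I've dropped BeautifulSoup and am now working out how to convert the JSON into a DataFrame properly!
Glad this helped! Good luck with your project!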