Selenium and Pandas to scrape table in page

【Question title】: Selenium and Pandas to scrape table in page
【Posted】: 2021-07-26 01:52:52
【Question description】:

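I am trying to load the table available at "https://umich-biostatistics.shinyapps.io/covid19/" into a data frame, after navigating to the "Metrics" section of the page. Because the page loads its data only after it has opened, I am trying to use Selenium. Can anyone help me find my mistake?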

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"C:\\Selenium\\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url = 'https://umich-biostatistics.shinyapps.io/covid19/'

page = driver.get(url)
time.sleep(10)

df = pd.read_html(driver.page_source)[0]
print(df.head())

Running the code above, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-4781af4c4464> in <module>
     10 time.sleep(10)
     11 
---> 12 df = pd.read_html(driver.page_source)[0]
     13 print(df.head())

~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1098         na_values=na_values,
   1099         keep_default_na=keep_default_na,
-> 1100         displayed_only=displayed_only,
   1101     )

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    913             break
    914     else:
--> 915         raise retained
    916 
    917     ret = []

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    893 
    894         try:
--> 895             tables = p.parse_tables()
    896         except ValueError as caught:
    897             # if `io` is an io-like object, check if it's seekable

~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
    211         list of parsed (header, body, footer) tuples from tables.
    212         """
--> 213         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    214         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    215 

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
    543 
    544         if not tables:
--> 545             raise ValueError("No tables found")
    546 
    547         result = []

ValueError: No tables found

【Question discussion】:

Tags: python pandas selenium

【Solution 1】:

Use:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

chrome_path = r"C:\Selenium\chromedriver.exe"
# Selenium 3 style call, as in the question; Selenium 4 takes the path via a Service object instead.
driver = webdriver.Chrome(chrome_path)
url = 'https://umich-biostatistics.shinyapps.io/covid19/'

driver.get(url)

try:
    # Block until an element with class "gt_table" is in the DOM, or give up after 1000 s.
    WebDriverWait(driver, 1000).until(
        EC.presence_of_element_located((By.CLASS_NAME, "gt_table"))
    )
    df = pd.read_html(driver.page_source)[0]
finally:
    # Always close the browser, even if the wait times out.
    driver.quit()

# Kept outside the finally block: if the wait timed out, df was never assigned.
print(df.head())

Output

             Day   Date  Deaths   Cases    Tests    TPR  Vaccines
0          Today  05/01    3685  392576  1504698  26.1%   1826396
1      Yesterday  04/30    3525  402014  1804954  22.3%   2744456
2   One week ago  04/24    2761  348996  1402367  24.9%   2536585
3  One month ago  04/01     468   81398  1046605   7.8%   3671242

Explanation

time.sleep() is a very crude way to wait for a web page to load. Your guess is as good as mine as to how long an element will take to load.

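In your case the page itself has fully loaded, but because this is a Shiny app, the table is loaded asynchronously, well after the page has finished loading.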

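So time.sleep(10) is hit-or-miss: even when the page has loaded, the table may not have been fully retrieved yet, in which case pd.read_html finds no table in driver.page_source.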
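To make the failure mode concrete, here is a minimal sketch using only the standard library. The two HTML strings are hypothetical stand-ins for driver.page_source before and after the app renders; they are not the real page markup, and count_tables is a helper invented for illustration. When the count is zero, pd.read_html has nothing to parse and raises "No tables found".

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.count += 1


def count_tables(html: str) -> int:
    """Return how many <table> elements appear in the given HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snapshots of driver.page_source (assumed, not the real markup).
before_render = "<html><body><div id='app'>Loading...</div></body></html>"
after_render = (
    "<html><body><div id='app'>"
    "<table class='gt_table'><tr><td>Today</td></tr></table>"
    "</div></body></html>"
)

print(count_tables(before_render))  # 0
print(count_tables(after_render))   # 1
```

A check like this on the real driver.page_source can help confirm whether the table had arrived at the moment the snapshot was taken.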

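The ideal implementation here is a WebDriverWait with a hard timeout (the longest you are willing to wait), so that the program moves on the moment the table has loaded.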
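The idea behind such an explicit wait can be sketched in plain Python. This is not Selenium's implementation, only an illustration of the polling pattern: check a condition repeatedly, return as soon as it holds, and raise once a hard deadline passes. The names wait_until and table_is_present are hypothetical, and the condition is simulated with a timer rather than a real browser.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for "the table has appeared": becomes true 0.2 s from now.
ready_at = time.monotonic() + 0.2


def table_is_present():
    return time.monotonic() >= ready_at


start = time.monotonic()
wait_until(table_is_present, timeout=5, poll_interval=0.05)
elapsed = time.monotonic() - start
print(f"waited {elapsed:.2f} s")  # returns long before the 5 s timeout
```

Unlike a fixed time.sleep(10), the caller pays only for the time the condition actually takes, and a generous timeout costs nothing when the table arrives early.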

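On your page, the table carries the CSS class gt_table, which is how I was able to pull it out. Ideally you would have an element ID so that you can locate it with By.ID; if you do not expect the page to change much, By.XPATH is an even better option.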
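Trying locators normally needs a live browser, so the sketch below uses the standard library's ElementTree as a stand-in to show how a class-based and an ID-based XPath expression can select the same table. The fragment and the id "metrics" are invented for illustration; the real page may have no such ID. ElementTree supports only a subset of XPath, but with Selenium an equivalent expression would be passed as (By.XPATH, "//table[@class='gt_table']").

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the id "metrics" is an assumption.
fragment = """
<div>
  <table class="other"><tr><td>ignore me</td></tr></table>
  <table id="metrics" class="gt_table"><tr><td>Today</td></tr></table>
</div>
"""

root = ET.fromstring(fragment)

# Locate the table by its class attribute and by its (assumed) id attribute.
by_class = root.find(".//table[@class='gt_table']")
by_id = root.find(".//table[@id='metrics']")

print(by_class is by_id)              # True: both expressions select the same element
print(by_class.find("./tr/td").text)  # Today
```

Note that an attribute test such as @class='gt_table' matches only when the whole class attribute equals that string, so it would miss an element that carries several classes.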

【Discussion】: