Scraping a table from a website with Selenium returns an empty DataFrame: answers

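【Question title】: Scraping Table from Website with Selenium Returning Empty DataFrame
【Posted】: 2022-01-23 18:52:47
【Question】:

I have just started learning web scraping and am trying to extract data from the "Holdings" table at https://www.ishares.com/us/products/268752/ishares-global-reit-etf

At first I used pandas, but it returned an empty DataFrame. I later found out that the table is dynamic and that I need Selenium. Then again, that also returned an empty DataFrame. Can anyone help me? I would really appreciate it.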

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True

# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver',options=options)
wd.get(site)

# Load the HTML page
html = wd.page_source

# Extract data with pandas
df = pd.read_html(html)
table = df[6]

【Question discussion】:

Tags: pandas dataframe selenium web-scraping webdriverwait

【Solution 1】:

To extract the data from the Holdings table on the iShares Global REIT ETF webpage, you need to induce WebDriverWait for visibility_of_element_located() and then build a DataFrame with pandas. You can use the following:

Code block:

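from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Create the driver first: `wd` must exist before wd.get() is called.
# Service() with no path assumes chromedriver can be located on PATH (or by Selenium Manager).
options = Options()
wd = webdriver.Chrome(service=Service(), options=options)

wd.get("https://www.ishares.com/us/products/268752/ishares-global-reit-etf")
# Accept the cookie banner once it is clickable.
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
# Scroll the "Holdings" heading into view.
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
# Wait until the holdings table is visible, then grab its HTML.
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
# read_html returns a list of DataFrames; StringIO avoids the literal-HTML deprecation in newer pandas.
df = pd.read_html(StringIO(data))
# df = pd.read_html(StringIO(data), flavor='html5lib')
print(df)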

Console output:

[  Ticker                                Name       Sector Asset Class  ...      CUSIP          ISIN    SEDOL  Accrual Date
0    PLD                   PROLOGIS REIT INC  Real Estate      Equity  ...  74340W103  US74340W1036  B44WZD7             -
1   EQIX                    EQUINIX REIT INC  Real Estate      Equity  ...  29444U700  US29444U7000  BVLZX12             -
2    PSA                 PUBLIC STORAGE REIT  Real Estate      Equity  ...  74460D109  US74460D1090  2852533             -
3    SPG       SIMON PROPERTY GROUP REIT INC  Real Estate      Equity  ...  828806109  US8288061091  2812452             -
4    DLR       DIGITAL REALTY TRUST REIT INC  Real Estate      Equity  ...  253868103  US2538681030  B03GQS4             -
5      O             REALTY INCOME REIT CORP  Real Estate      Equity  ...  756109104  US7561091049  2724193             -
6   WELL                       WELLTOWER INC  Real Estate      Equity  ...  95040Q104  US95040Q1040  BYVYHH4             -
7    AVB      AVALONBAY COMMUNITIES REIT INC  Real Estate      Equity  ...  053484101  US0534841012  2131179             -
8    ARE  ALEXANDRIA REAL ESTATE EQUITIES RE  Real Estate      Equity  ...  015271109  US0152711091  2009210             -
9    EQR             EQUITY RESIDENTIAL REIT  Real Estate      Equity  ...  29476L107  US29476L1070  2319157             -

[10 rows x 12 columns]]
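The original attempt read page_source immediately after wd.get(), before the page's JavaScript had filled in the table, which is why an explicit wait helps. As a rough picture of what such a wait does, here is a minimal pure-Python sketch of the polling idea. It is not Selenium's implementation: `wait_until`, `WaitTimeout`, and `rows_loaded` are hypothetical names made up for this illustration, and the real WebDriverWait additionally ignores NoSuchElementException by default while it polls.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulate a table whose rows only show up on the third check,
# similar to a table that JavaScript fills in after the page loads.
attempts = {"count": 0}


def rows_loaded():
    attempts["count"] += 1
    return ["PLD", "EQIX"] if attempts["count"] >= 3 else []


rows = wait_until(rows_loaded, timeout=5.0, poll_interval=0.01)
print(rows)
```

Reading the page once, with no wait, corresponds to calling rows_loaded() a single time and getting the empty list back.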

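【Discussion】:

Thank you very much, DebanjanB. I'm not fluent in coding, so I find this difficult.
I tried the code, but it shows "NameError: name 'driver' is not defined", so I added "driver = webdriver.Chrome()" and then it shows "WebDriverException: chrome not reachable (Session info: chrome=96.0.4664.110)"
@Mango Check out the updated answer and let me know the status.
I'm still getting an error saying "NameError: name 'wd' is not defined". Maybe I have to declare 'wd' first or install something on my computer? Sorry, I'm not a coder or developer, and my coding knowledge is still very limited =(
Thanks a lot @undetected Selenium, I finally got it working by using WebDriverWait to wait for the data to load completely. However, the output is truncated from 400+ rows to only 10 rows. How should I load all 400+ rows? Thank you very much for your help.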
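On the last question: the 10-row result is presumably because the table is paginated and only the page currently shown is in the DOM, though the thread does not confirm this. Under that assumption, one possible approach is to read the visible page, move to the next page, and repeat until there is no next page. The sketch below shows only that loop, with the browser replaced by stand-in functions so it runs on its own; `collect_all_pages`, `read_rows`, and `go_to_next_page` are hypothetical names, not part of Selenium or pandas.

```python
def collect_all_pages(read_rows, go_to_next_page, max_pages=100):
    """Gather rows from every page of a paginated table.

    read_rows()        -> list of rows on the page currently shown
    go_to_next_page()  -> True if it advanced, False on the last page
    max_pages          -> safety cap so a faulty "next" check cannot loop forever
    """
    all_rows = []
    for _ in range(max_pages):
        all_rows.extend(read_rows())
        if not go_to_next_page():
            break
    return all_rows


# Stand-in for a browser: three "pages" of tickers.
pages = [["PLD", "EQIX"], ["PSA", "SPG"], ["DLR"]]
state = {"page": 0}


def read_rows():
    return pages[state["page"]]


def go_to_next_page():
    if state["page"] + 1 >= len(pages):
        return False
    state["page"] += 1
    return True


tickers = collect_all_pages(read_rows, go_to_next_page)
print(tickers)
```

With Selenium, read_rows could parse the table's outerHTML with pd.read_html as in the answer, and go_to_next_page could click the table's "next" control and return False once that control is disabled. The selector for that control is not given anywhere in this thread, so it would have to be looked up in the page itself.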