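【Question Title】:Scraping Table from Website with Selenium Returning Empty DataFrame
【Posted】:2022-01-23 18:52:47
【Question】:

I have just started learning web scraping and am trying to extract data from the "Holdings" table at https://www.ishares.com/us/products/268752/ishares-global-reit-etf

At first I used pandas, but it returned an empty DataFrame. I later found out that the table is dynamic and requires Selenium. But then again, that also returns an empty DataFrame. Can anyone help me? I would really appreciate it.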

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True

# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver',options=options)
wd.get(site)

# Load the HTML page
html = wd.page_source

# Extract data with pandas
df = pd.read_html(html)
table = df[6]

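【Comments】:

    Tags: pandas dataframe selenium web-scraping webdriverwait


    【Solution 1】:

    To extract the data from the Holdings table on the iShares Global REIT ETF webpage, you need to induce WebDriverWait for visibility_of_element_located() and then load the table into a pandas DataFrame. You can use the following locator strategy:

    Code block: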

    from io import StringIO

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Create the driver. Assumes Selenium 4.6+, which locates a matching chromedriver itself.
    options = Options()
    wd = webdriver.Chrome(options=options)

    wd.get("https://www.ishares.com/us/products/268752/ishares-global-reit-etf")
    # Accept the cookie banner once it becomes clickable.
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    # Scroll the "Holdings" heading into view.
    heading = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']")))
    wd.execute_script("arguments[0].scrollIntoView();", heading)
    # Wait until the holdings table is visible, then take its HTML.
    data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
    # Wrapping in StringIO avoids the pandas 2.1+ deprecation warning for literal HTML strings.
    df = pd.read_html(StringIO(data))
    # df = pd.read_html(StringIO(data), flavor='html5lib')
    print(df)
    

    Console output:

    [  Ticker                                Name       Sector Asset Class  ...      CUSIP          ISIN    SEDOL  Accrual Date
    0    PLD                   PROLOGIS REIT INC  Real Estate      Equity  ...  74340W103  US74340W1036  B44WZD7             -
    1   EQIX                    EQUINIX REIT INC  Real Estate      Equity  ...  29444U700  US29444U7000  BVLZX12             -
    2    PSA                 PUBLIC STORAGE REIT  Real Estate      Equity  ...  74460D109  US74460D1090  2852533             -
    3    SPG       SIMON PROPERTY GROUP REIT INC  Real Estate      Equity  ...  828806109  US8288061091  2812452             -
    4    DLR       DIGITAL REALTY TRUST REIT INC  Real Estate      Equity  ...  253868103  US2538681030  B03GQS4             -
    5      O             REALTY INCOME REIT CORP  Real Estate      Equity  ...  756109104  US7561091049  2724193             -
    6   WELL                       WELLTOWER INC  Real Estate      Equity  ...  95040Q104  US95040Q1040  BYVYHH4             -
    7    AVB      AVALONBAY COMMUNITIES REIT INC  Real Estate      Equity  ...  053484101  US0534841012  2131179             -
    8    ARE  ALEXANDRIA REAL ESTATE EQUITIES RE  Real Estate      Equity  ...  015271109  US0152711091  2009210             -
    9    EQR             EQUITY RESIDENTIAL REIT  Real Estate      Equity  ...  29476L107  US29476L1070  2319157             -
    
    [10 rows x 12 columns]]
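
The square brackets around the console output are there because `pd.read_html` returns a list of DataFrames, one per `<table>` it parses, and the `...` is only pandas shortening the printout. A minimal sketch of unpacking that list and printing every column follows; the HTML string is a made-up stand-in for the table HTML captured with Selenium, not the real page.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the table HTML captured with Selenium.
sample_html = """
<table>
  <thead><tr><th>Ticker</th><th>Name</th><th>Sector</th></tr></thead>
  <tbody>
    <tr><td>PLD</td><td>PROLOGIS REIT INC</td><td>Real Estate</td></tr>
    <tr><td>EQIX</td><td>EQUINIX REIT INC</td><td>Real Estate</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(StringIO(sample_html))  # a list of DataFrames
holdings = tables[0]                          # the only table in this snippet

# Lift the display limits for this print only, so no column is replaced by "...".
with pd.option_context("display.max_columns", None, "display.width", None):
    print(holdings)
```

Applied to the answer's code, the same indexing (`df[0]`) would give the holdings table as a single DataFrame rather than a one-element list.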
    

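    【Comments】:

    • Thank you very much, DebanjanB. I'm not fluent in coding, so I find this hard.
    • I tried the code, but it shows "NameError: name 'driver' is not defined", so I added "driver = webdriver.Chrome()" and it then shows "WebDriverException: chrome not reachable (Session info: chrome=96.0.4664.110)"
    • @Mango Check out the updated answer and let me know the status.
    • I still get an error saying "NameError: name 'wd' is not defined". Maybe I have to declare 'wd' first or install something on my computer? Sorry, I'm not a coder or developer, and my coding knowledge is still very limited =(
    • Thank you very much, @undetected Selenium. I finally got it working with WebDriverWait, waiting for the data to load completely. However, the output is truncated from 400+ rows to only 10. How should I load all 400+ rows? Thanks a lot for your help.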
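
On the last question: the answer's code reads only the rows currently present in the table element, which the console output shows to be 10, so the remaining rows presumably sit on further pages of the table widget. One possible approach, sketched here under that assumption, is to read each page in turn and concatenate the results. `read_all_pages`, `get_table_html` and `go_to_next_page` are hypothetical names, and the two callables are stubbed with made-up data so that the sketch runs without a browser.

```python
from io import StringIO
from typing import Callable

import pandas as pd


def read_all_pages(
    get_table_html: Callable[[], str],
    go_to_next_page: Callable[[], bool],
    max_pages: int = 100,
) -> pd.DataFrame:
    """Read the table on every page and stack the rows into one DataFrame."""
    frames = []
    for _ in range(max_pages):  # hard cap so a stuck "next" control cannot loop forever
        frames.append(pd.read_html(StringIO(get_table_html()))[0])
        if not go_to_next_page():  # False means there is no further page
            break
    return pd.concat(frames, ignore_index=True)


def _page(rows):
    """Build a made-up HTML table page from (ticker, name) pairs."""
    body = "".join(f"<tr><td>{t}</td><td>{n}</td></tr>" for t, n in rows)
    head = "<thead><tr><th>Ticker</th><th>Name</th></tr></thead>"
    return f"<table>{head}<tbody>{body}</tbody></table>"


# Stand-in for the browser: two made-up pages of a paginated table.
fake_pages = [
    _page([("PLD", "PROLOGIS REIT INC"), ("EQIX", "EQUINIX REIT INC")]),
    _page([("PSA", "PUBLIC STORAGE REIT")]),
]
position = {"index": 0}


def get_table_html() -> str:
    return fake_pages[position["index"]]


def go_to_next_page() -> bool:
    if position["index"] + 1 >= len(fake_pages):
        return False
    position["index"] += 1
    return True


all_holdings = read_all_pages(get_table_html, go_to_next_page)
print(all_holdings)
```

With Selenium, `get_table_html` would return the table's `outerHTML` as in the answer, and `go_to_next_page` would click the table's next-page control and return `False` once that control is disabled. The locator for that control is not given in this thread and has to be read from the page itself.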