Failed to retrieve/web-scrape tables on a website using Python Selenium and BeautifulSoup

【Question title】: Failed to retrieve/webscrape tables on website using Python Selenium and BeautifulSoup
【Posted】: 2021-07-28 05:31:43
【Question】:

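I am trying to retrieve the table from the following dynamic website and save it to a DataFrame: https://www.grants.gov/web/grants/search-grants.html

I have tried several approaches (pandas, requests.post, BeautifulSoup, and Selenium), and none of them returns any results, as if the table did not exist or was never detected.

Here is my code: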

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup 
import requests

#using pandas
pd.read_html('https://www.grants.gov/web/grants/search-grants.html')

# Using beautifulsoup
URL='https://www.grants.gov/web/grants/search-grants.html'
response = requests.get(URL, headers={})
soup = BeautifulSoup(response.text, 'lxml')
print(soup)

job_elems = soup.findAll('table')
print(job_elems)
for i in job_elems:
    txt=i.find("td").text.strip()
    print(txt)

tr=soup.findAll("tr",class_='gridevenrow')
for element in tr:
    row=element.find('td')
    print(row.text)


#using selenium
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver=webdriver.Firefox(executable_path ='/Users/**/geckodriver',options=options)
driver.get(URL)

elems = driver.find_elements_by_xpath("//td")
for e in elems:
    print(e.text)


#requests.post
url= "https://www.grants.gov/grantsws/rest/opportunities/search/"
data = """{"startRecordNum":0,"sortBy":"openDate|desc","oppStatuses":"forecasted|posted"}"""
soup = BeautifulSoup(requests.post(url, data=data).content, "xml")
data = []
for sn in soup.findAll("tr"):
    text=sn.find('td').text
    print(text)
    

#selenium + soup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path ='/Users/**/geckodriver',options=options)
driver.get('https://www.grants.gov/grantsws/rest/opportunities/search/') 

element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, "//tr"))) #waits 10 seconds until element is located. Can have other wait conditions  such as visibility_of_element_located or text_to_be_present_in_element

html = driver.page_source
soup = bs(html, "lxml")
dynamic_text = soup.find_all("td") #or other attributes, optional
print(dynamic_text)

【Question comments】:

Tags: python api selenium web-scraping beautifulsoup

【Solution 1】:

The data you see on the page is loaded from an external URL. You can use this example to load it into a pandas DataFrame:

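import json
import requests
import pandas as pd

# JSON endpoint that supplies the data shown on the search page
url = "https://www.grants.gov/grantsws/rest/opportunities/search/"

# same filters as in the question: forecasted and posted opportunities, newest first
payload = {
    "oppStatuses": "forecasted|posted",
    "sortBy": "openDate|desc",
    "startRecordNum": 0,
}

# send the payload as a JSON body and decode the JSON response
data = requests.post(url, json=payload).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

# each entry of "oppHits" becomes one row of the DataFrame
df = pd.DataFrame(data["oppHits"])
# cfdaList holds a list of strings per row; join it into a single cell
df["cfdaList"] = df["cfdaList"].apply(lambda x: ", ".join(x))
print(df)
df.to_csv("data.csv", index=False)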

Prints:

        id                      number                                              title       agencyCode                                             agency    openDate   closeDate oppStatus   docType                                cfdaList
0   333348            72062421RFA00003                               SERVIR West Africa 2        USAID-GHA                                  Ghana USAID-Accra  05/05/2021  05/10/2021    posted  synopsis                                  98.001
1   333281  USDA-AMS-TM-RFSP-G-21-0009                  Regional Food System Partnerships         USDA-AMS                     Agricultural Marketing Service  05/05/2021  07/06/2021    posted  synopsis                                  10.177
2   333336            72038821RFI00002  USAID/Bangladesh Request for Information on US...        USAID-BAN                             Bangladesh USAID-Dhaka  05/05/2021  05/27/2021    posted  synopsis                                  98.001
3   333307            N00014-21-S-SN10  2021 Office of Naval Research (ONR) Global Res...          DOD-ONR                           Office of Naval Research  05/05/2021  07/09/2021    posted  synopsis                                  12.300
4   333337   SCAISB-21-AW-015-05052021      Promoting a Culture of Inclusion and Research          DOS-PAK                           U.S. Mission to Pakistan  05/05/2021  06/04/2021    posted  synopsis                                  19.501
5   333338               030ADV21R0179                      Teaching with Primary Sources              LOC                                Library of Congress  05/05/2021  05/28/2021    posted  synopsis                                  42.010
6   333343         RFI-675-21-HFECA-01  Guinea Health Facility Electrification and Con...        USAID-GUI                               Guinea USAID-Conakry  05/05/2021  06/28/2021    posted  synopsis                                  98.001
7   333342            O-BJA-2021-93001  BJA FY 21 Second Chance Act Pay for Success In...    USDOJ-OJP-BJA                       Bureau of Justice Assistance  05/05/2021  06/22/2021    posted  synopsis                                  16.812
8   333308            O-BJA-2021-04001  BJA FY 21 Sexual Assault Forensic Evidence - I...    USDOJ-OJP-BJA                       Bureau of Justice Assistance  05/05/2021  06/07/2021    posted  synopsis                                  16.741
9   333344            O-BJA-2021-94002  BJA FY 21 Safeguarding Correctional Facilities...    USDOJ-OJP-BJA                       Bureau of Justice Assistance  05/05/2021  06/07/2021    posted  synopsis                                  16.844
10  333313              DE-FOA-0002527          Equitable Access to Community-based Solar          DOE-GFO                                Golden Field Office  05/05/2021  06/01/2021    posted  synopsis                                  81.117
11  333310                 SFOP0008106  Tunisia Supporting the Inclusion of Vulnerable...       DOS-NEA-AC                            Assistance Coordination  05/05/2021  06/01/2021    posted  synopsis                                  19.600
12  333358                  L21AS00499  Department of the Interior - Bureau of Land Ma...          DOI-BLM                          Bureau of Land Management  05/05/2021  06/04/2021    posted  synopsis                                  15.224
13  333352                  PAR-21-224  NeuroNEXT Small Business Innovation in Clinica...        HHS-NIH11                      National Institutes of Health  05/05/2021  04/05/2024    posted  synopsis                                  93.853
14  333353               RFA-HL-23-004  NHLBI Outstanding Investigator Award (OIA) (R3...        HHS-NIH11                      National Institutes of Health  05/05/2021  04/25/2024    posted  synopsis  93.840, 93.233, 93.838, 93.839, 93.837
15  333311               RFA-FD-21-032  Integrated Pathogen Reduction Technologies for...          HHS-FDA                       Food and Drug Administration  05/05/2021  07/06/2021    posted  synopsis                                  93.103
16  333312              DE-FOA-0002526  Workforce Development Strategies Supporting th...          DOE-GFO                                Golden Field Office  05/05/2021  06/01/2021    posted  synopsis                                  81.117
17  333315               RFA-HL-23-005  NHLBI Emerging Investigator Award (EIA) (R35 C...        HHS-NIH11                      National Institutes of Health  05/05/2021  04/25/2024    posted  synopsis  93.840, 93.233, 93.838, 93.839, 93.837
18  333351                  PAR-21-223  NeuroNEXT Clinical Trials (U01 Clinical Trial ...        HHS-NIH11                      National Institutes of Health  05/05/2021  03/05/2024    posted  synopsis                                  93.853
19  333350          O-OJJDP-2021-00002  OJJDP FY 2021 Strategies To Support Children E...  USDOJ-OJP-OJJDP  Office of Juvenile Justice Delinquency Prevent...  05/05/2021  06/22/2021    posted  synopsis                                  16.818
20  333346               RFA-HD-22-020  Human Milk as a Biological System (R01 Clinica...        HHS-NIH11                      National Institutes of Health  05/05/2021  11/29/2021    posted  synopsis                                  93.865
21  333349          O-OJJDP-2021-92009            OJJDP FY 2021 Family Drug Court Program  USDOJ-OJP-OJJDP  Office of Juvenile Justice Delinquency Prevent...  05/05/2021  06/22/2021    posted  synopsis                                  16.585
22  333354                      21CS16      Women&rsquo;s Risk and Need Assessment (WRNA)    USDOJ-BOP-NIC                  National Institute of Corrections  05/05/2021  07/05/2021    posted  synopsis                                  16.601
23  329057   HHS-2021-ACF-ACYF-EV-1942  Family Violence Prevention and Services Discre...     HHS-ACF-FYSB  Administration for Children & Families - ACYF/...  05/05/2021  07/05/2021    posted  synopsis                                  93.592
24  333269   HHS-2021-ACF-OPRE-YR-1967  Head Start University Partnerships:  Building ...     HHS-ACF-OPRE    Administration for Children and Families - OPRE  05/05/2021  07/06/2021    posted  synopsis                                  93.600

It also saves the table to data.csv (the original post showed a LibreOffice screenshot of the file).
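The output above contains only the first batch of results. If more rows are needed, one possible approach is to repeat the request with a larger "startRecordNum". The sketch below is untested against the live service and rests on two assumptions: that the server treats "startRecordNum" as a row offset, and that it returns an empty "oppHits" list once the offset passes the last record. The names make_payload and iter_hits are hypothetical helpers, not part of any library.

```python
from typing import Callable, Dict, Iterator


def make_payload(start: int) -> Dict:
    """Build the search payload from the answer for a given offset."""
    return {
        "oppStatuses": "forecasted|posted",
        "sortBy": "openDate|desc",
        "startRecordNum": start,  # assumption: acts as a row offset
    }


def iter_hits(fetch_page: Callable[[int], Dict], max_pages: int = 10) -> Iterator[Dict]:
    """Yield hits page by page, stopping at an empty page or after max_pages."""
    start = 0
    for _ in range(max_pages):
        page = fetch_page(start)
        hits = page.get("oppHits", [])
        if not hits:
            # assumption: an empty list means there are no more records
            break
        yield from hits
        start += len(hits)  # advance by the number of rows actually received


# Wiring it to the real endpoint could look like this (requires requests):
#   fetch = lambda start: requests.post(url, json=make_payload(start)).json()
#   df = pd.DataFrame(iter_hits(fetch))
```

Passing the fetch function in as an argument keeps the paging logic independent of the network, so it can be checked with canned responses before it is pointed at the server. The max_pages cap is a safeguard against looping forever if the stop condition assumed here turns out not to hold.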

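【Comments】:

This is great, thank you so much! One question: is there also a way to retrieve the links? Thanks again for the great input!
@zaza001 When you uncomment print(json.dumps(data, indent=4)), you will see all the information the server returns. Details about each grant are available at the URL https://www.grants.gov/grantsws/rest/opportunity/details (POST request). You can inspect these requests in Firefox Developer Tools -> Network tab (or the Chrome equivalent).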
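As a rough illustration of the comment above, the sketch below prepares a POST request to the details URL for one opportunity and builds a browsable link from its id. Only the details URL comes from this thread. The form field name "oppId" and the view-opportunity.html link pattern are assumptions that should be checked in the browser's Network tab, and opportunity_link and details_request are hypothetical helper names. It uses the standard library, so nothing is sent until the request is explicitly opened.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL mentioned in the comment above
DETAILS_URL = "https://www.grants.gov/grantsws/rest/opportunity/details"
# assumption: pattern of the public page for a single opportunity
VIEW_URL = "https://www.grants.gov/web/grants/view-opportunity.html"


def opportunity_link(opp_id) -> str:
    """Return a browsable link for one opportunity id (assumed URL pattern)."""
    return f"{VIEW_URL}?{urlencode({'oppId': opp_id})}"


def details_request(opp_id) -> Request:
    """Prepare, but do not send, a POST request for one opportunity's details."""
    # assumption: the id is sent as a form field named "oppId"
    body = urlencode({"oppId": opp_id}).encode("ascii")
    return Request(DETAILS_URL, data=body, method="POST")


# To actually fetch the details (network access required):
#   import json, urllib.request
#   with urllib.request.urlopen(details_request(333348)) as resp:
#       details = json.load(resp)
```

With the DataFrame from the answer, a link column could then be added with df["link"] = df["id"].map(opportunity_link).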