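【Posted】:2021-07-28 05:31:43
【Problem description】:
I am trying to retrieve the table from the following dynamic website and save it to a DataFrame: https://www.grants.gov/web/grants/search-grants.html
I have tried several approaches, including pandas, requests.post, BeautifulSoup, and selenium. None of them returned any results, as if the table did not exist or was never detected.
Here is my code: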
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
import requests
#using pandas
pd.read_html('https://www.grants.gov/web/grants/search-grants.html')
# Using beautifulsoup
URL='https://www.grants.gov/web/grants/search-grants.html'
response = requests.get(URL, headers={})
soup = BeautifulSoup(response.text, 'lxml')
print(soup)
job_elems = soup.findAll('table')
print(job_elems)
for i in job_elems:
    txt=i.find("td").text.strip()
    print(txt)
tr=soup.findAll("tr",class_='gridevenrow')
for element in tr:
    row=element.find('td')
    print(row.text)
#using selenium
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver=webdriver.Firefox(executable_path ='/Users/**/geckodriver',options=options)
driver.get(URL)
elems = driver.find_elements_by_xpath("//td")
for e in elems:
    print(e.text)
#requests.post
url= "https://www.grants.gov/grantsws/rest/opportunities/search/"
data = """{"startRecordNum":0,"sortBy":"openDate|desc","oppStatuses":"forecasted|posted"}"""
soup = BeautifulSoup(requests.post(url, data=data).content, "xml")
data = []
for sn in soup.findAll("tr"):
    text=sn.find('td').text
    print(text)
#selenium + soup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path ='/Users/**/geckodriver',options=options)
driver.get('https://www.grants.gov/grantsws/rest/opportunities/search/')
# Wait up to 10 seconds for the element to be located. Other wait conditions, such as
# visibility_of_element_located or text_to_be_present_in_element, can be used instead.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//tr")))
html = driver.page_source
soup = bs(html, "lxml")
dynamic_text = soup.find_all("td") #or other attributes, optional
print(dynamic_text)
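A note on the requests.post attempt: the URL path suggests a REST endpoint, and such endpoints usually return JSON rather than HTML or XML, in which case there are no tr or td tags for BeautifulSoup to find. The sketch below shows one way such a response could be turned into table rows. It rests on assumptions not confirmed in this post: that the reply is a JSON object and that the records sit in a list under a key such as "oppHits". The helper names rows_from_payload and fetch_payload are hypothetical, and the sample payload is made up purely to illustrate the shape.

```python
import json


def rows_from_payload(payload, records_key="oppHits"):
    """Return the list of record dicts stored under records_key.

    records_key is an assumption about the response layout; inspect
    payload.keys() on a real response to confirm which key holds the rows.
    """
    records = payload.get(records_key, [])
    # Keep only dict entries so each one maps cleanly to a table row.
    return [dict(r) for r in records if isinstance(r, dict)]


def fetch_payload(url, body):
    """POST body as JSON and decode the JSON reply (needs network access)."""
    import requests  # imported lazily so the parser above works without it

    response = requests.post(url, json=body, timeout=30)
    response.raise_for_status()
    return response.json()


# Made-up sample that only mimics the assumed shape; not real grants.gov data.
sample = json.loads(
    '{"oppHits": [{"id": "1", "title": "Example A"},'
    ' {"id": "2", "title": "Example B"}]}'
)
rows = rows_from_payload(sample)
print(rows)
```

With a live response, pd.DataFrame(rows_from_payload(fetch_payload(url, body))) would build the DataFrame, where body is a dict holding the same fields as the JSON string posted above. Passing the dict through json= lets requests serialize it and set the JSON content type, which a plain string passed through data= does not do.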
【Discussion】:
Tags: python api selenium web-scraping beautifulsoup