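【Posted】: 2021-08-15 11:24:16
【Question】:
Website: https://www.newamerica.org/events/?period=past
I'm trying to scrape the event names and URLs. But when I run the code, the only output is "Finished in 1.7s" and nothing else. I think this may be because the events load a moment after the page opens rather than immediately, but that's only a guess. What can I do to fix this?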
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = "https://www.newamerica.org/events/?period=past"

# Download the page and parse the HTML the server returned
data = requests.get(url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(data.text, "lxml")
document = Document()

# Each event is expected to sit in a <div class="card__text">
events = soup.find_all("div", class_="card__text")
for event in events:
    event_name = event.find("span")
    link = event.find("a")
    try:
        print(event_name.text)
        document.add_paragraph(event_name.text, style='List Bullet')
        print(link['href'])
        document.add_paragraph(link['href'])
    except (AttributeError, TypeError, KeyError):
        # Skip cards that lack a <span> title, an <a> tag, or an href
        continue
document.save('demo.docx')
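
The guess above, that the events only appear some time after the page opens, can be checked before changing the scraper. Below is a minimal sketch that counts how many elements in a raw HTML string carry the `card__text` class. It uses only the standard library. The helper names `ClassCounter` and `count_static_cards` and the two sample strings are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_static_cards(html, class_name="card__text"):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical samples standing in for a server response (not the real page)
rendered = '<div class="card__text"><a href="/events/x/"><span>Talk</span></a></div>'
shell = '<div id="root"></div><script src="/static/app.js"></script>'

print(count_static_cards(rendered))
print(count_static_cards(shell))
```

With the code from the question, the same check would be `count_static_cards(data.text)`. If that returns 0, the cards are not in the HTML that `requests` received, which would explain why `find_all` returns an empty list and the loop never prints anything. In that case the usual options are to render the page in a real browser through an automation tool, or to look in the browser's network tab for the request that delivers the event data.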
【Discussion】:
- ...my guess is that the page is heavy on JavaScript, so when you fetch it, it is not fully rendered, but in
Tags: python python-3.x beautifulsoup