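【Question title】: BeautifulSoup - Why is it impossible to scrape this website?
【Posted】: 2021-08-15 11:24:16
【Question】:

Website: https://www.newamerica.org/events/?period=past

I am trying to scrape the event names and URLs. But when I run the code, the only output is "Finished in 1.7 seconds." and nothing else. I think this may be because the events load some time after the page opens rather than immediately, but that is only a guess. What can I do to fix this?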

from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = "https://www.newamerica.org/events/?period=past"
data = requests.get(url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(data.text, "lxml")

document = Document()

events = soup.find_all("div", class_ = "card__text")

for event in events:
    event_name = event.find("span")
    link = event.find("a")
    try:
        print(event_name.text)
        document.add_paragraph(event_name.text, style='List Bullet')
        print(link['href'])
        document.add_paragraph(link['href'])
    except:
        continue

document.save('demo.docx')

【Comments】:

  • ... my guess is that the page is heavy on JavaScript, and when you fetch it, it is not fully rendered, but in

Tags: python python-3.x beautifulsoup


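【Solution 1】:

The page loads its data from an API, so the events are not present in the HTML that requests downloads. You can send a request to that API endpoint directly and get the event data.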
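One way to check this diagnosis is to count how many `div` elements with the class `card__text` (the selector used in the question) appear in the raw HTML. The following is a minimal sketch using only the standard library; the names `ClassCounter` and `count_cards` are illustrative helpers, not part of any library.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_cards(html, tag="div", css_class="card__text"):
    """Return how many matching elements exist in the given HTML text."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count
```

Passing the text of `requests.get(url).text` for the events page to `count_cards` shows what the scraper actually receives. A count of 0 would be consistent with the empty output described in the question, since `find_all` would then have nothing to iterate over.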

Here is the API:

https://www.newamerica.org/api/event/?time_period=past&page_size=12&page=1&story_image_rendition=small

This is how you can make the request and get the event data. This code prints the title and url of every event on that page.

import requests
url = 'https://www.newamerica.org/api/event/?time_period=past&page_size=12&page=1&story_image_rendition=small'

r = requests.get(url)
data = r.json()

for i in data['results']:
    title = i['title']
    link = i['url']
    print(f'Title: {title}\nURL: {link}\n\n')

Output:

Title: [ONLINE] - INSide Out: Youth-Led Policy in the Heartland
URL: /indianapolis/events/inside-out-youth-led-policy-in-the-heartland-3/

Title: [ONLINE] - Stretch Your Impact: Building Pathways Towards Tech for Good Careers
URL: /pit-un/events/online-stretch-your-impact-building-pathways-towards-tech-for-good-careers/

Title: [ONLINE] - Designing Accessible and Inclusive Digital Public Infrastructure
URL: /digital-impact-governance-initiative/events/designing-accessible-and-inclusive-digital-public-goods/
.
.
.
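The URLs in the output above are relative, and the request only covers `page=1`. The sketch below shows one possible way to build absolute links and walk several pages. It assumes that the `page` parameter can simply be incremented and that each response keeps the `results`, `title`, and `url` keys used above. How the API signals its last page is not shown in the answer, so the loop stops at the first empty `results` list, is capped by `max_pages`, and lets HTTP errors propagate. `fetch_page`, `to_absolute`, and `collect_events` are hypothetical helper names.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.newamerica.org"
API_URL = "https://www.newamerica.org/api/event/"


def fetch_page(page, page_size=12):
    """Fetch one page of past events and return the decoded JSON dict."""
    import requests  # third-party; imported here so the helpers below work without it

    params = {
        "time_period": "past",
        "page_size": page_size,
        "page": page,
        "story_image_rendition": "small",
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


def to_absolute(events, base=BASE_URL):
    """Turn API items into (title, absolute_url) pairs."""
    return [(e["title"], urljoin(base, e["url"])) for e in events]


def collect_events(max_pages, fetch=fetch_page):
    """Collect events from pages 1..max_pages, stopping at the first empty page."""
    collected = []
    for page in range(1, max_pages + 1):
        results = fetch(page).get("results", [])
        if not results:
            break
        collected.extend(to_absolute(results))
    return collected
```

For example, `collect_events(3)` would request at most three pages and return a list of `(title, url)` pairs, which can then be written to the Word document with `document.add_paragraph` as in the question.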

【Comments】:
