【Title】: Beautiful Soup to Download PDFs from Multiple Pages
【Posted】: 2021-09-24 14:34:30
【Question】:

Andrej helped me write this code, but how do I now navigate to each of these pages and download every PDF whose name contains the text/title "Public Comment"?

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.find_all("a"):
    if key in link.get("href", ""):
        print("https://www.ci.atherton.ca.us/" + link.get("href"))

Prints:

https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552

Tags: python html web-scraping beautifulsoup hyperlink


【Solution 1】:

Try:

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_links = []
for link in soup.find_all("a"):
    if key in link.get("href", ""):
        all_links.append("https://www.ci.atherton.ca.us/" + link.get("href"))

for link in all_links:
    print("Checking {}...".format(link))
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
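    # Match only anchors that have an href and whose text mentions "public comment".
    for a in soup.find_all(
        lambda tag: tag.name == "a"
        and tag.has_attr("href")
        and "public comment" in tag.text.lower()
    ):
        pdf_link = "https://www.ci.atherton.ca.us" + a["href"]
        # The URL ends in a document slug with no extension, so append ".pdf".
        filename = a["href"].split("/")[-1] + ".pdf"
        print("Downloading {} to {}".format(pdf_link, filename))
        with open(filename, "wb") as f_out:
            f_out.write(requests.get(pdf_link).content)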

Prints:

...

Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3514...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3505...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15 to Public-Comments-1202021---ITEM-No-15.pdf
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3498...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3479...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8516/Wayne-Lee---Public-Comments_12162020 to Wayne-Lee---Public-Comments_12162020.pdf
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8532/Discher-Stephanie_Public-Comments_12162020 to Discher-Stephanie_Public-Comments_12162020.pdf

...

It also saves the PDF behind each of those URLs to a file.
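The code above builds URLs and file names by plain string concatenation, which works for the links shown but depends on every href having the expected shape. As a minimal sketch of a slightly more defensive variant, the helpers below use only the standard library. The names `is_public_comment`, `absolute_url`, and `pdf_filename` are hypothetical and not part of the original answer, and the character whitelist used for file names is an assumption about what is safe on the local file system.

```python
import re
from urllib.parse import unquote, urljoin, urlsplit

# Site root taken from the URLs in the question.
BASE_URL = "https://www.ci.atherton.ca.us/"


def is_public_comment(link_text):
    """Case-insensitive check for 'public comment' in the anchor text."""
    return "public comment" in link_text.lower()


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or root-relative href against the site root."""
    return urljoin(base, href)


def pdf_filename(href):
    """Derive a local file name from the last path segment of href."""
    path = urlsplit(href).path  # drops any ?query or #fragment
    name = unquote(path.rstrip("/").split("/")[-1])
    # Assumption: keep letters, digits, dot, underscore and hyphen only.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name) or "download"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name
```

Inside the download loop, `absolute_url(a["href"])` could stand in for the hard-coded prefix and `pdf_filename(a["href"])` for the `split("/")[-1] + ".pdf"` expression. For the links in the output above, both approaches should produce the same URL and file name.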
