Beautiful Soup to Download PDFs from Multiple Pages: Answers

【Question title】: Beautiful Soup to Download PDFs from Multiple Pages
【Posted】: 2021-09-24 14:34:30
【Question】:

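Andrej helped me write the code below. Now I would like to know how to navigate to each of the pages it lists and download every PDF whose name contains the text/title "Public Comment".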

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.find_all("a"):
    if key in link.get("href", ""):
        print("https://www.ci.atherton.ca.us/" + link.get("href"))

Prints:

https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552

【Comments】:

Where should the string "Public Comments" appear? Can you give an example URL?
A better example: ci.atherton.ca.us/Archive.aspx?ADID=3461 > ci.atherton.ca.us/DocumentCenter/View/8399/…

Tags: python html web-scraping beautifulsoup hyperlink

【Solution 1】:

Try:

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect the URL of every archive page linked from the index page.
all_links = []
for link in soup.find_all("a"):
    if key in link.get("href", ""):
        all_links.append("https://www.ci.atherton.ca.us/" + link.get("href"))

# Visit each archive page and look for links whose text mentions "public comment".
for link in all_links:
    print("Checking {}...".format(link))
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    for a in soup.find_all(
        lambda tag: tag.name == "a"
        and tag.has_attr("href")  # skip anchors that have no target
        and "public comment" in tag.text.lower()
    ):
        pdf_link = "https://www.ci.atherton.ca.us" + a["href"]
        # The last path segment of the link becomes the local file name.
        filename = a["href"].split("/")[-1] + ".pdf"
        print("Downloading {} to {}".format(pdf_link, filename))
        with open(filename, "wb") as f_out:
            f_out.write(requests.get(pdf_link).content)

Prints:

...

Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3514...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3505...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15 to Public-Comments-1202021---ITEM-No-15.pdf
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3498...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3479...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8516/Wayne-Lee---Public-Comments_12162020 to Wayne-Lee---Public-Comments_12162020.pdf
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8532/Discher-Stephanie_Public-Comments_12162020 to Discher-Stephanie_Public-Comments_12162020.pdf

...

It also saves the PDF behind each of these URLs to a file in the current working directory.
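
The link-matching step can also be separated from the downloading, which makes it possible to check it without touching the network. The following is only a sketch under some assumptions: `find_pdf_links`, `BASE_URL`, and `sample_html` are hypothetical names that do not appear in the original answer, the HTML fragment is made up to resemble the links in the output above, and `urljoin` is swapped in for the string concatenation used in the answer.

```python
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup

# Assumption: the same site root that the answer prepends to each link.
BASE_URL = "https://www.ci.atherton.ca.us/"


def find_pdf_links(html, base_url=BASE_URL, phrase="public comment"):
    """Return (absolute_url, filename) pairs for links whose text contains phrase."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        if phrase not in a.get_text().lower():
            continue
        # urljoin handles both relative and absolute href values.
        pdf_url = urljoin(base_url, a["href"])
        # Use the last path segment as the file name, ignoring any query string.
        name = urlsplit(pdf_url).path.rstrip("/").split("/")[-1]
        results.append((pdf_url, name + ".pdf"))
    return results


# Hypothetical fragment shaped like the archive pages' document links.
sample_html = """
<a href="/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15">Public Comments - Item 15</a>
<a href="/DocumentCenter/View/1/Agenda">Agenda</a>
<a name="top">Public Comment (anchor without href)</a>
"""

links = find_pdf_links(sample_html)
for pdf_url, filename in links:
    print("Would download {} to {}".format(pdf_url, filename))
```

Each returned pair could then be passed to `requests.get` and written to disk in the same way as in the loop above.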

【Discussion】: