How to download all the href (pdf) inside a class with python beautiful soup?

【Question title】: How to download all the href (pdf) inside a class with python beautiful soup?
【Posted】: 2021-12-30 11:25:02
【Question description】:

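I have about 900 pages, and each page contains 10 buttons (each button links to a pdf). I want to download all the pdfs: the program should browse through all the pages and download the pdfs one by one.

The code below only searches for hrefs ending in .pdf, but my hrefs do not contain .pdf. The page_no parameter runs from 1 to 900.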

https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3

That is the website, and below is one of the links:

Bid No: GEM/2021/B/1804626

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

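【Question comments】:

Good answers need good questions, so please help everyone understand your problem by improving your question --> what did your previous attempts look like, and where did you get stuck? Thanks
Edited, please check @HedgeHog
Is your website only available in India?
In case you have absolutely no idea how it works, this post may help you - the point is that we are happy to help with a specific problem you are facing, but without your previous attempts we cannot craft a ready-made solution.
Sorry, this is just a copy of the accepted answer in the link I posted - this behavior is not good and shows no effort - I'm out

Tags: python beautifulsoup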

【Solution 1】:

You only need the href associated with each of the links you call buttons. Then prefix it with the appropriate protocol + domain.
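
As a minimal sketch of the prefixing step, assuming the hrefs are root-relative paths such as /showbidDocument/2993132 (the form mentioned in Solution 2), urljoin from the standard library can resolve them against the site. The names BASE_URL and absolute_url are illustrative and not part of the original answer.

```python
from urllib.parse import urljoin

BASE_URL = "https://bidplus.gem.gov.in"  # protocol + domain of the site


def absolute_url(href: str) -> str:
    """Resolve a possibly relative href against the site root (hypothetical helper)."""
    return urljoin(BASE_URL, href)


print(absolute_url("/showbidDocument/2993132"))
```

An href that is already absolute is returned unchanged, so the same helper works for both kinds of link.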

The links can be matched with the following selector:

.bid_no > a

That is, anchor (a) tags whose direct parent element has the class bid_no.
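
To see what the child combinator does, here is a small sketch against made-up markup. Only the class name bid_no comes from the thread. The surrounding HTML, the hrefs and the link texts are assumptions for illustration, not the site's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page layout may differ.
html = """
<div class="block">
  <p class="bid_no pull-left"><a href="/showbidDocument/1">BID-A</a></p>
  <p class="other"><a href="/elsewhere">not a bid link</a></p>
  <p class="bid_no pull-left"><span><a href="/nested">nested link</a></span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# ".bid_no > a" keeps only anchors whose direct parent has class bid_no.
links = [a["href"] for a in soup.select(".bid_no > a")]
print(links)
```

The anchor wrapped in a span is skipped because its direct parent is the span, not the element with class bid_no.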

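This should extract 10 links per page. Since each download needs a filename, I suggest using a global dictionary that stores the links as values and the link text as keys. I replace the "/" in the link text with "_". You simply add to the dictionary while looping over the desired number of pages.

Example of some dictionary entries: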
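
A rough sketch of how one entry could be built is shown here. The bid number is the one quoted in the question and the path is the one quoted in Solution 2; pairing them is an assumption made purely for illustration, and add_link is a hypothetical helper name.

```python
BASE_URL = "https://bidplus.gem.gov.in"


def add_link(pdf_links: dict, link_text: str, href: str) -> None:
    """Store one link under a filename-safe key (hypothetical helper)."""
    key = link_text.strip().replace("/", "_")  # "/" is not allowed in a filename
    pdf_links[key] = BASE_URL + href


pdf_links = {}
# Illustrative pairing only: these two values are not known to belong together.
add_link(pdf_links, " GEM/2021/B/1804626 ", "/showbidDocument/2993132")
print(pdf_links)
```

Because the link text is the key, two links with identical text would end up sharing a single entry, with the later one overwriting the earlier one.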

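Since there are more than 800 pages, I chose to add an extra variable called end_number as an early-termination page count. I didn't want to loop over all the pages, so this lets me exit early. You can remove this parameter if you wish.

Next, you need to determine the actual number of pages. To do that, you can use the following css selector to get the Last pagination link, then extract its data-ci-pagination-page value and convert it to an integer. That value can then serve as num_pages (the number of pages) to terminate the loop: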

.pagination li:last-of-type > a

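This looks for an a tag that is a direct child of the last li element, where those li elements sit inside the element with class pagination. In other words, it is the anchor tag in the last li, which is the Last page link in the pagination element.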
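
The page-count lookup can be tried out on a small stand-in for the pagination block. Only the pagination class and the data-ci-pagination-page attribute come from the answer. The markup is an assumption, and 836 is used because it is the page count implied by the request estimate in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup: the real page may contain more items.
html = """
<ul class="pagination">
  <li><a href="#" data-ci-pagination-page="1">1</a></li>
  <li><a href="#" data-ci-pagination-page="2">2</a></li>
  <li><a href="#" data-ci-pagination-page="836">Last</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
last_link = soup.select_one(".pagination li:last-of-type > a")
num_pages = int(last_link["data-ci-pagination-page"])
print(num_pages)
```

select_one returns None when nothing matches, so on a page without pagination the attribute lookup would raise a TypeError; a real script may want to check for that case.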

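Once your dictionary holds all the required links and filenames (the link text), loop over the key, value pairs and issue a request for each document. Write that content to disk.

TODO:

I suggest you look into ways of optimizing how the final requests are issued and written to disk. For example, you could first issue all the requests asynchronously and store the responses in a dictionary, which optimizes the I/O-bound part. Then loop over that dictionary to write to disk, perhaps with a multiprocessing approach to optimize the more CPU-bound part.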
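
One possible shape for the "fetch everything first, then write" idea is sketched here. It swaps in a thread pool from the standard library rather than true asynchronous I/O, and it does not cover the multiprocessing part. The function names, the fetch parameter and the worker count are all assumptions; fetch is any callable that takes a URL and returns bytes, and a fake one is used so that the sketch runs without network access.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_all(pdf_links, fetch, max_workers=8):
    """Fetch every URL concurrently and return {name: bytes}."""
    names = list(pdf_links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and bodies line up.
        bodies = pool.map(fetch, (pdf_links[name] for name in names))
        return dict(zip(names, bodies))


def write_all(contents, folder):
    """Write each downloaded body to <folder>/<name>.pdf."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, body in contents.items():
        (folder / f"{name}.pdf").write_bytes(body)


def fake_fetch(url):
    """Stand-in for a real download: returns the URL as bytes."""
    return url.encode()


links = {"a": "https://example.invalid/1", "b": "https://example.invalid/2"}
contents = fetch_all(links, fake_fetch)
with tempfile.TemporaryDirectory() as tmp:
    write_all(contents, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(written)
```

With the requests library, fetch could be something like `lambda url: requests.get(url).content`. Holding every response in memory at once may be too much for thousands of PDFs, in which case writing each file as it arrives is the safer choice.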

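I would also consider whether some kind of wait should be introduced between requests, or whether the requests should be batched. In theory, you currently have something like (836 * 10) + 836 requests.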
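
A sketch of batching with a pause between batches could look like this. The batch size, the pause length and the helper names are assumptions chosen for illustration, and the fetch and sleep callables are passed in so the logic can be exercised without touching the network.

```python
import time


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_politely(urls, fetch, batch_size=10, pause=1.0, sleep=time.sleep):
    """Fetch urls batch by batch, pausing between batches but not after the last."""
    results = []
    batches = list(batched(list(urls), batch_size))
    for index, batch in enumerate(batches):
        results.extend(fetch(url) for url in batch)
        if index < len(batches) - 1:
            sleep(pause)
    return results


# Request estimate from the answer: one request per PDF plus one per listing page.
pages, per_page = 836, 10
total_requests = pages * per_page + pages
print(total_requests)  # 9196
```

At one second per batch of ten, the pauses alone would add roughly fourteen minutes across 8360 documents, which gives a feel for the trade-off between politeness and run time.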

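import requests
from bs4 import BeautifulSoup as bs

end_number = 3    # early-exit page count; raise it or remove the check to fetch more pages
current_page = 1
pdf_links = {}    # link text with "/" replaced by "_" -> absolute document URL
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)
        if current_page == 1:
            # read the total page count from the Last pagination link
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        # as written, the loop stops once current_page exceeds end_number,
        # so pages 1 to end_number + 1 are fetched
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    # download inside the with block so the same open session is reused
    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)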

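【Discussion】:

For some reason, not all of the PDFs are downloaded............
Did you notice my comment about or current_page > end_number? I set it to 3 pages. Did you remove that piece of code to get all the results, or set end_number higher?
I didn't comment it out, but I put 822 into END_NUMBER. It should yield 8220 PDFs, but there are only 4412 records. I also checked that it reaches the last page and picks up the last page's records, but it does not pick up some of the PDF files between the first and last pages.
Is it missing the same pdf files every time, or different ones? Did you check the length of pdf_links?
pdf_links len is 4416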
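
The thread does not establish why fewer entries than expected were collected. One way to investigate, sketched under the unconfirmed assumption that some link texts repeat across pages and therefore overwrite each other in the dictionary, is to record every (key, url) pair in a list and count the repeated keys. The sample data and the helper name duplicate_keys are made up for illustration.

```python
from collections import Counter


def duplicate_keys(pairs):
    """Return {key: count} for keys that occur more than once."""
    counts = Counter(key for key, _ in pairs)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample standing in for the scraped (key, url) pairs.
pairs = [
    ("GEM_2021_B_1", "https://bidplus.gem.gov.in/showbidDocument/1"),
    ("GEM_2021_B_2", "https://bidplus.gem.gov.in/showbidDocument/2"),
    ("GEM_2021_B_1", "https://bidplus.gem.gov.in/showbidDocument/3"),
]
total, unique, dupes = len(pairs), len(dict(pairs)), duplicate_keys(pairs)
print(total, unique, dupes)
```

If the number of pairs matches the expected total while the dictionary is smaller, key collisions are the likely explanation; if the list itself is short, the missing links were never scraped in the first place.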

【Solution 2】:

Your website is not available to 90% of people. But you did provide an example of the html, so I hope this helps you:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        # urljoin leaves a full link unchanged and resolves a root-relative
        # one, like /showbidDocument/2993132, against the site
        href = urljoin(url, pdf.get('href'))
        # name each file after the link text so downloads do not overwrite each other
        filename = pdf.get_text(strip=True).replace('/', '_') + '.pdf'
        response = requests.get(href)
        with open(filename, 'wb') as f:
            f.write(response.content)

【Discussion】:

The 90% part is quite accurate.