Downloading PDFs using Python web scraping not working: answers

【Question title】: Downloading PDFs using Python web scraping not working
【Posted】: 2020-11-10 14:46:04
【Question description】:

Here is my code:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"

#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

Any help as to why this code does not download any of the files from the maths revision site would be appreciated. Thanks.

【Comments】:

Tags: python html web web-scraping beautifulsoup

【Solution 1】:

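Looking at the page itself: although it may look static, it is not. The content you are trying to reach is loaded by JavaScript after the initial HTML arrives. To check this, I simply saved the page that BS4 actually receives and opened it in a text editor: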

# Save the raw response; os.path.join avoids the invalid "\p" escape in the path
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)

From the looks of it, the page fills in placeholders with JS, as the comment on line 70 of the saved HTML file suggests: // interpolate json by replacing placeholders with variables
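The same check can also be done programmatically. The sketch below is only an illustration: find_pdf_links and PdfLinkParser are hypothetical names, and it uses the standard library's html.parser instead of BS4 so that it runs without extra installs. It looks for roughly the same thing as the a[href$='.pdf'] selector in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a> tags whose href path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare the URL path only, so a trailing query string is ignored
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return every PDF link present in the given HTML text."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Calling find_pdf_links(response.text, url) on the response from the question shows what the script really has to work with. An empty list means the links are not in the HTML that requests downloaded, so the for loop never runs and no file is written.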

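As for a solution to your problem: BS4 only parses the HTML it is given and does not execute JavaScript. I suggest taking a look at this answer, written for someone with a similar problem. If you plan to do more complex web scraping, I would also recommend looking into Scrapy.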
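One common way around this is to let a real browser run the page's scripts and then hand the rendered HTML to BS4. The sketch below assumes Selenium with headless Chrome, which is my assumption and not necessarily what the linked answer uses. The helper names (target_path, render_html, download_pdfs) and the fixed five-second wait are hypothetical, and I have not verified that the rendered page exposes links ending in .pdf.

```python
import os
import time
from urllib.parse import urljoin, urlparse


def target_path(folder, pdf_url):
    """Build a local path from the last segment of the URL path (query string ignored)."""
    return os.path.join(folder, os.path.basename(urlparse(pdf_url).path))


def render_html(url, wait_seconds=5):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait; an explicit wait would be more robust
        return driver.page_source
    finally:
        driver.quit()


def download_pdfs(url, folder):
    """Render url, then save every linked PDF into folder. Returns the saved paths."""
    import requests  # third-party
    from bs4 import BeautifulSoup  # third-party

    os.makedirs(folder, exist_ok=True)
    soup = BeautifulSoup(render_html(url), "html.parser")
    saved = []
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link["href"])
        path = target_path(folder, pdf_url)
        reply = requests.get(pdf_url, timeout=30)
        reply.raise_for_status()  # fail loudly instead of saving an error page as a PDF
        with open(path, "wb") as f:
            f.write(reply.content)
        saved.append(path)
    return saved
```

With Chrome and a matching driver available, download_pdfs("https://mathsmadeeasy.co.uk/gcse-maths-revision/", r"E:\webscraping") would be the equivalent of the original script. If it still returns an empty list, the PDF URLs are probably built some other way (for example from JSON fetched by the page), and the browser's network tab is the next place to look.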

【Comments】:

Thanks for your answer. Very helpful.