How to save attachments from websites using Beautiful Soup? (Answers)

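【Question title】: How to save attachments from websites using Beautiful Soup?
【Posted】: 2020-06-20 06:08:46
【Question description】:

I wrote some code to scrape the attachments on a website. In practice it only scrapes the hyperlinks that point to the attachments. I cannot find a way to save the attachments themselves directly to a local folder.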

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.amfiindia.com/research-information/amfi-monthly'
response = requests.get(url, verify=False)
html_soup = BeautifulSoup(response.content, 'html.parser')

filetype = '.xls'
excel_sheets = html_soup.find_all('a')

# File where the links to the Excel sheets need to be saved --> here: "All_Links_2.csv"
destination = open('All_Links_2.csv', 'wb')

for link in excel_sheets:
    # <a> tags without an href return None, so fall back to an empty string
    href = (link.get('href') or '') + '\n'
    if filetype in href:
        print(href)

Can anyone help?

【Question discussion】:

destination.write(href) instead of print(href)? stackoverflow.com/questions/33289247/…

Tags: python beautifulsoup get python-requests

【Solution 1】:

This is not really a job for Beautiful Soup. Use the urllib library instead.

import urllib.request

urllib.request.urlretrieve(href, "file.jpg")

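This takes the image address in href and saves it as file.jpg. If you want a different file name for each download, which applies in your case, build the string "file" + str(i) + ".jpg", where i is a value that you increment.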
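The naming idea above can be sketched as follows. This is a minimal sketch, not the answerer's code: the helper names local_name and download_all and the "downloads" folder are hypothetical, and the URLs passed in are assumed to be absolute already.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def local_name(url, index, folder="downloads"):
    """Build a unique local path such as downloads/file3.xls for a URL."""
    # Reuse the extension from the URL path; fall back to .bin if there is none.
    ext = os.path.splitext(urlsplit(url).path)[1] or ".bin"
    return os.path.join(folder, "file" + str(index) + ext)


def download_all(urls, folder="downloads"):
    """Fetch every URL with urlretrieve and return the list of saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        path = local_name(url, i, folder)
        urllib.request.urlretrieve(url, path)  # performs the actual download
        saved.append(path)
    return saved
```

Taking the extension from the URL rather than hard-coding ".jpg" means that .xls attachments keep their type on disk.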

【Discussion】:

Thanks, 雪莉. But urllib does not seem to recognize the attachment URLs. Code: ` import bs4 from bs4 import BeautifulSoup html_soup = BeautifulSoup(response.content,'html.parser') filetype = '.xls' excel_sheets = html_soup.find_all('a') for link in excel_sheets: href = link.get('href') + '\n' if filetype in href: urllib.request.urlretrieve(href,"file.xls") ` Error : You can visit the website here: amfiindia.com/research-information/amfi-monthly. I am trying to download the attachments on that page.
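The error text is not shown in the comment, so the following is only a guess at two common causes. The code in the comment appends '\n' to every href before passing it to urlretrieve, and an href on a page may be relative, while urlretrieve needs a full URL. A minimal sketch of cleaning the links first, where the helper name absolute_links is hypothetical and hrefs is assumed to be the list of values returned by link.get('href'):

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.amfiindia.com/research-information/amfi-monthly"


def absolute_links(hrefs, page_url=PAGE_URL, filetype=".xls"):
    """Return clean absolute URLs for the hrefs that contain filetype."""
    links = []
    for href in hrefs:
        if not href:         # anchors without an href yield None
            continue
        href = href.strip()  # drop stray whitespace such as a trailing newline
        if filetype in href:
            # urljoin leaves absolute URLs alone and resolves relative ones
            links.append(urljoin(page_url, href))
    return links
```

Each string in the returned list can then be handed to urllib.request.urlretrieve together with its own file name.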

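【Solution 2】:

If you only want to collect the links, you do not need binary mode, and since you have already imported pandas you can use it to save them.

First create a DataFrame:

# a.get('href') skips <a> tags that have no href instead of raising KeyError
df = pd.DataFrame([a['href'] for a in excel_sheets if a.get('href') and filetype in a['href']])

Then save it without the column names (header=False):

df.to_csv('All_Links_2.csv', header=False)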

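【Discussion】:

Thanks for the input, Bagon. However, I am trying to download the attachments from the website. I tried saving the links in a CSV and then retrieving them, but the hyperlink format does not stick when the links are read back from the CSV. It would be very helpful if you could show me how to preserve the hyperlink format when extracting the links from the CSV.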
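A CSV file stores plain text only, so a link read back from it is an ordinary string, which is all a download call needs. A minimal sketch of the round trip using the standard csv module rather than pandas; the helper names save_links and load_links are hypothetical:

```python
import csv


def save_links(links, csv_path):
    """Write one URL per row, with no header and no index column."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for link in links:
            writer.writerow([link])


def load_links(csv_path):
    """Read the URLs back as plain strings, skipping empty rows."""
    with open(csv_path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row and row[0]]
```

The strings returned by load_links can be passed one by one to urllib.request.urlretrieve or requests.get, provided they are absolute URLs.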