Using Beautiful Soup to get all mp3 files from a website -- recursively (answers)

【Question title】: Using beautiful soup to get all mp3 files from a website -- recursively
【Posted】: 2020-12-14 19:26:47
【Question description】:

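I have had a hard time finding an answer to this here, and I have been searching for over an hour. Many answers come close, and I have tried bits and pieces of several of them, but the solution still eluded me. (Until now; see the update.)

I am trying to pull all the MP3 files from https://www.crrow777radio.com/free-episodes/, but they are deeply nested. I thought I could supply that URL, bs would recursively follow all the links, and I would filter them to download my specific files. Apparently, each href that is found has to be requested separately and that page parsed for its own links.

I have code that extracts the MP3 from a page that contains it, but doing this for all such pages (recursively, from the top down) is not as easy as the bs docs led me to believe.

Update: With help from MendelG and others, I revised the code. I believe this does the job, but holding the [large] file content in a variable could probably be improved with some download-write, download-write scheme to reduce the memory impact: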

import re

import requests
from bs4 import BeautifulSoup

# Start page named in the question
URL = "https://www.crrow777radio.com/free-episodes/"

def getMP3sOnPageP(session, h, p):
    soup = BeautifulSoup(session.get(p, headers=h).content, "html.parser")

    # Select all the buttons on this page with the text `LISTEN`
    for tag in soup.select("a.button"):
        # Extract the link from the button and request that episode page
        page = tag["href"]
        episode_soup = BeautifulSoup(session.get(page, headers=h).content, "html.parser")

        # Find the link to the mp3 file
        download_link = episode_soup.select_one("a.btn[download]")["href"]
        # The dot is escaped so it only matches a literal "." before "mp3"
        file_name = re.search(r"(\d+-Hour-1\.mp3)", download_link.split("/")[-1]).group()

        # Request the mp3 file (the whole body is held in memory before writing)
        print("Downloading ", file_name)
        mp3_file = session.get(download_link).content
        with open(file_name, "wb") as f:
            f.write(mp3_file)

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
with requests.Session() as session:
    soup = BeautifulSoup(session.get(URL, headers=HEADERS).content, "html.parser")

    # Select all the links on this page with a class of "page-numbers"
    for a in soup.select("a.page-numbers"):
        getMP3sOnPageP(session, HEADERS, a.get("href"))

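As you can see, my code needs 3 calls to Beautiful Soup (bs). My reading of the docs led me to believe that bs operates recursively, so a single call with the right filters/arguments would be enough. If that is the case, I certainly don't know how to do it.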
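For what it's worth, Beautiful Soup only parses the HTML it is handed. The "recursive" behavior in its docs is about descending through nested tags inside that one document, not about requesting the pages that links point to, so the link-following has to be written by the caller. Below is a minimal sketch of that idea, not the asker's code and not run against the site. The names `find_links`, `crawl_for_mp3s`, `fetch`, and `max_pages` are made up for illustration, and `fetch` is assumed to be any callable that turns a URL into HTML text.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def find_links(html, base_url):
    """Return absolute URLs for every <a href> in one parsed document."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all walks the whole parse tree (recursive=True is the default),
    # but it never requests the pages those links point to.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl_for_mp3s(start_url, fetch, max_pages=50):
    """Breadth-first crawl from start_url; return the .mp3 URLs found.

    `fetch` is a hypothetical callable mapping a URL to its HTML text,
    e.g. lambda url: session.get(url, headers=HEADERS).text
    """
    site = urlparse(start_url).netloc
    queue, seen, mp3s = deque([start_url]), {start_url}, []
    pages_fetched = 0
    while queue and pages_fetched < max_pages:
        page_url = queue.popleft()
        pages_fetched += 1
        for link in find_links(fetch(page_url), page_url):
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).path.lower().endswith(".mp3"):
                mp3s.append(link)      # a file to download, not a page to parse
            elif urlparse(link).netloc == site:
                queue.append(link)     # same-site page: visit it later
    return mp3s
```

The `seen` set and the `max_pages` cap are there so the crawl cannot loop forever on pages that link back to each other. Targeted selectors like `a.page-numbers` and `a.button` in the code above are still the cheaper option when the site layout is known, since a blind crawl fetches many pages that hold no MP3 at all.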

【Question comments】:

Tags: html python-3.x beautifulsoup

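【Solution 1】:

The update I added to the question solved my problem. User MendelG posted a reply that was very helpful, but it did not solve the whole problem. His contribution is reflected in my updated getMP3sOnPageP function.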
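On the memory concern raised in the update: one common way to avoid holding a whole MP3 in a variable is to let requests stream the response and write it out chunk by chunk. This is only a sketch under assumptions. The helper name `download_file` and the 64 KiB `chunk_size` are made up for illustration, and `session` is expected to be a `requests.Session` like the one in the question.

```python
def download_file(session, url, file_name, headers=None, chunk_size=64 * 1024):
    """Stream `url` to `file_name` in chunks; return the number of bytes written.

    `session` is assumed to be a requests.Session (hypothetical helper,
    not part of the original code).
    """
    written = 0
    # stream=True defers fetching the body until iter_content is consumed,
    # so only one chunk at a time is held in memory.
    with session.get(url, headers=headers, stream=True) as response:
        response.raise_for_status()  # fail early on 4xx/5xx responses
        with open(file_name, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

Inside getMP3sOnPageP, the `session.get(download_link).content` line and the `open`/`write` lines after it could then be replaced by a single call such as `download_file(session, download_link, file_name, headers=h)`.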

【Comments】: