Why is BeautifulSoup not returning child elements? (Answers)

【Question title】: Why is BeautifulSoup not returning child elements?
【Posted】: 2020-04-13 03:43:46
【Question description】:

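I am trying to use Python 3 and BeautifulSoup 4 to get the URL for downloading an xlsx file from this page: https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/

I need the URL of the most recent file, which is at index 0 in the list of <p> tags inside a <div>. I can get it in the browser console with JS like this: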

var link = document.getElementsByClassName("toggle_container")[2].children[1].children[0].href

If I use BS4 to get all the <p> tags on the page, the link I want is in the list:

import urllib
import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

page = requests.get(cat_m_site)

soup = BeautifulSoup(page.text, 'html.parser')
p_elements = soup.find_all('p')

for item in p_elements:
    print(item)

If I try to reproduce the JS solution by getting the <div> that contains the links, I should get a list of 29 <p> elements, but the list is empty:

import urllib
import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

page = requests.get(cat_m_site)

soup = BeautifulSoup(page.text, 'html.parser')

divs = soup.find_all('div', {'class':'toggle_container'})
children = divs[2].findChildren("p", recursive=True)

for child in children:
    print(child)

I would prefer this approach, because I "know" the link will be in the 0th element of this div, but I feel I am missing something about the findChildren method.

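【Comments on the question】:

Does the page use JavaScript to create these elements dynamically? If so, you can't use requests to get the page content; you need something that supports JavaScript, such as Selenium.
The correct URL is: psnc.org.uk/funding-and-statistics/pharmacy-funding/… The URL you provided has no links to xlsx files.
@cgte - Sorry, maybe my question was not clear enough. The URL I provided contains links to around 29 xlsx files inside one of the <div class="toggle_container"> elements. The link text is "Category M: products and prices".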

Tags: python beautifulsoup

【Solution 1】:

Use the lxml parser instead (the lxml package must be installed): soup = BeautifulSoup(page.text, 'lxml')

import urllib
import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

page = requests.get(cat_m_site)

soup = BeautifulSoup(page.text, 'lxml')

divs = soup.find_all('div', {'class':'toggle_container'})
children = divs[2].findChildren("p", recursive=True)

for child in children:
    print(child)

Output:

<p><a href="https://psnc.org.uk/wp-content/uploads/2019/10/Category-M-201920-Q3-Oct-Dec-with-Aug-19-combined.xlsx">Category M 2019/20 Q3 Oct-Dec (with Aug 19 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/08/Category-M-2019-August-with-Jul-19-combined.xlsx">Category M 2019 August (with Jul 19 combined</a>) (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/08/Category-M-2019-20-Q2-Jul-Sep-with-Apr-19-combined.xlsx">Category M 2019/20 Q2 Jul-Sep (with Apr 19 combined) </a>(MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/05/Cat-M-Apr-2019-1.xlsx">Category M: 2019/20 Q1 Apri-June (with Jan 2019 combined) </a>(MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/01/Category-M-2018.19-Q4-JanMar-with-Nov-18-combined.xlsx">Category M: 2018/19 Q4 Jan-Mar (with Nov 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/01/Category-M-Nov-18.xlsx">Category M: 2018 November (with Oct 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2018/09/Category-M-2018.19-Q3-OctDec-with-Aug-18-combined.xlsx">Category M: 2018/19 Q3 Oct-Dec (with Aug 18 combined)</a> (MS Excel)</p>
<p><a href="http://psnc.org.uk/wp-content/uploads/2018/06/Category-M-2018.19-Q2-JulSep-with-Apr-18-combined.xlsx">Category M: 2018/19 Q2 Jul-Sep (with Apr 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2018/04/Category-M-2018.19-Q1-AprJun-with-Jan-18-combined-v2.xlsx">Category M: 2018/19 Q1 Apr-Jun (with Jan 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2017/12/Category-M-Jan-18.xlsx">Category M: 2017/18 Q4 Jan-Mar (with Oct 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Oct-17.xlsx">Category M: 2017/18 Q3 Oct-Dec (with Aug 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Aug-17.xlsx">Category M: 2017 August (with Jul 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jul-17.xlsx">Category M: 2017/18 Q2 Jul-Sep (with Apr 17 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Apr-17.xlsx">Category M: 2017/18 Q1 Apr-Jun (with Jan 17 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jan-17.xlsx">Category M: 2016/17 Q4 Jan-Mar (with Oct 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Oct-16.xlsx">Category M: 2016/17 Q3 Oct-Dec (with Jul 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jul-16.xlsx">Category M: 2016/17 Q2 Jul – Sep (with Jun 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-June-16.xlsx">Category M: 2016 June (with Apr 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-April-16.xlsx" rel="">Category M: 2016/17 Q1 Apr – Jun (with Jan 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q4-Jan-Mar-with-Oct-15-combined.xlsx">Category M: 2015/16 Q4 Jan – Mar (with Oct 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q3-Oct-Dec-with-Jul-15-combined.xlsx">Category M: 2015/16 Q3 Oct </a><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jun-15-and-Apr-15-Cat-M-prices.xlsx">–</a><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q3-Oct-Dec-with-Jul-15-combined.xlsx"> Dec (with Jul 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jun-15-and-Apr-15-Cat-M-prices.xlsx">Category M: 2015/16 Q2 Jul – Sep (with Apr 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Apr_15_and_Jan_15_Cat_M_prices-2.xlsx">Category M: 2015/16 Q1 Apr – Jun (with Jan 15 combined) updated</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jan_15_and_Oct_14_Cat_M_prices.xlsx">Category M: 2014/15 Q4 Jan – Mar (with Oct 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Oct_14_and_Jul_14_Cat_M_prices.xlsx">Catgegory M: 2014/15 Q3 Oct – Dec (with Jul 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2013/07/Jul_14_and_Apr_14_Cat_M_Prices.xlsx">Category M: 2014/15 Q2 Jul – Sep (with Apr 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2013/07/Apr_14_and-Jan_14_Cat_M_Prices.xls.xlsx">Category M: 2014/15 Q1 Apr – Jun (with Jan 14 combined)</a> (MS Excel)</p>
<p></p>
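
To get just the newest file's URL, which is what the JS one-liner in the question does, the first <a> inside the chosen div can be read directly. The following is a minimal sketch, not a tested solution for the live page: SAMPLE_HTML, the example.org URLs, and the helper name first_link_href are made up for illustration, and the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<div class="toggle_container"><p>No links here</p></div>
<div class="toggle_container">
  <h4>Category M prices</h4>
  <p><a href="https://example.org/files/latest.xlsx">Latest</a> (MS Excel)</p>
  <p><a href="https://example.org/files/older.xlsx">Older</a> (MS Excel)</p>
</div>
"""


def first_link_href(html, container_index, parser="html.parser"):
    """Return the href of the first link in the Nth toggle_container div, or None."""
    soup = BeautifulSoup(html, parser)
    containers = soup.find_all("div", class_="toggle_container")
    if container_index >= len(containers):
        return None  # fewer containers than expected
    # href=True matches only <a> tags that actually carry an href attribute.
    link = containers[container_index].find("a", href=True)
    return link["href"] if link is not None else None


print(first_link_href(SAMPLE_HTML, 1))
```

For the page in the question, the call would presumably be first_link_href(page.text, 2, parser="lxml"). Returning None instead of raising makes it easier to notice when the page layout changes.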

【Comments】:

Thanks very much. Could you add something briefly explaining why lxml works but html.parser doesn't?
@JeremyFox, see here
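
As general background on the parser question: Beautiful Soup's parsers repair invalid HTML in different ways, so the same document can yield different trees. The thread does not say which part of this page's markup triggers the difference, so the sketch below only illustrates the general effect on a tiny invalid snippet. The helper name parse_with is made up, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: </p> closes a tag that was never opened.
BROKEN_HTML = "<a></p>"


def parse_with(parser, markup=BROKEN_HTML):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# Print how each available parser repairs the same broken input.
for name in ("html.parser", "lxml", "html5lib"):
    print(f"{name:12} -> {parse_with(name)}")
```

If find_all returns fewer elements than the browser shows, comparing the output of two parsers on the same page.text is a quick way to check whether parser-specific repair is the cause.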