【Question title】: Why is BeautifulSoup not returning child elements?
【Posted】: 2020-04-13 03:43:46
【Question description】:

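I am trying to use Python 3 and BeautifulSoup 4 to get the URL for downloading an xlsx file from this page: https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/

I need the URL of the most recent file, which is at index 0 in a list of <p> tags inside a <div>. I can get it in the browser console with JS like this: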

var link = document.getElementsByClassName("toggle_container")[2].children[1].children[0].href

If I use BS4 to get all of the <p> tags on the page, the link I want is in the list:

import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

# Download the page and parse it with Python's built-in parser.
page = requests.get(cat_m_site)
soup = BeautifulSoup(page.text, 'html.parser')

# Collect every <p> element on the page.
p_elements = soup.find_all('p')

for item in p_elements:
    print(item)

If I try to reproduce the JS solution by first getting the <div> that contains the links, I should get a list of 29 <p> elements, but the list is empty:

import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

# Download the page and parse it with Python's built-in parser.
page = requests.get(cat_m_site)
soup = BeautifulSoup(page.text, 'html.parser')

# Take the third toggle_container div and look for <p> descendants in it.
divs = soup.find_all('div', {'class':'toggle_container'})
children = divs[2].findChildren("p", recursive=True)

for child in children:
    print(child)

I would prefer this approach, because I "know" the link will be at index 0 in this div, but I feel I am missing something about the findChildren method.
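One way to narrow this down is to check how many <p> descendants each matched div reports under the parser in use. The following is a minimal sketch that runs against a small inline HTML sample rather than the live page; the helper name `count_paragraphs_per_div` and the sample markup are illustrative assumptions, not part of the original code:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="toggle_container"><p>first</p></div>
<div class="toggle_container"></div>
<div class="toggle_container">
  <p><a href="https://example.com/a.xlsx">A</a></p>
  <p><a href="https://example.com/b.xlsx">B</a></p>
</div>
"""


def count_paragraphs_per_div(html, parser="html.parser", css_class="toggle_container"):
    """Return the number of <p> descendants found in each matching <div>."""
    soup = BeautifulSoup(html, parser)
    divs = soup.find_all("div", class_=css_class)
    # find_all searches all descendants by default, like findChildren(recursive=True).
    return [len(div.find_all("p")) for div in divs]


if __name__ == "__main__":
    for index, count in enumerate(count_paragraphs_per_div(SAMPLE_HTML)):
        print(index, count)
```

If a div that should hold the links reports 0 on the real page while soup.find_all('p') still finds those paragraphs, the parser has probably built a different tree from the one the browser shows, and findChildren itself is not at fault.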

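【Question discussion】:

  • Does the page use JavaScript to create these elements dynamically? If so, you cannot use requests to get the page content; you need something that supports JavaScript, such as Selenium.
  • The correct URL is: psnc.org.uk/funding-and-statistics/pharmacy-funding/… The URL you provided has no links to xlsx files.
  • @cgte - Sorry, perhaps my question was not clear enough. The URL I provided contains links to about 29 xlsx files inside one of the <div class="toggle_container"> elements. The text of the links is "Category M: Products and Prices".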

Tags: python beautifulsoup


【Solution 1】:

Use soup = BeautifulSoup(page.text, 'lxml') instead:

import requests
from bs4 import BeautifulSoup

cat_m_site = "https://psnc.org.uk/funding-and-statistics/funding-distribution/retained-margin-category-m/"

# Download the page and parse it with lxml instead of html.parser.
page = requests.get(cat_m_site)
soup = BeautifulSoup(page.text, 'lxml')

# Take the third toggle_container div and list its <p> descendants.
divs = soup.find_all('div', {'class':'toggle_container'})
children = divs[2].findChildren("p", recursive=True)

for child in children:
    print(child)

Output:

<p><a href="https://psnc.org.uk/wp-content/uploads/2019/10/Category-M-201920-Q3-Oct-Dec-with-Aug-19-combined.xlsx">Category M 2019/20 Q3 Oct-Dec (with Aug 19 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/08/Category-M-2019-August-with-Jul-19-combined.xlsx">Category M 2019 August (with Jul 19 combined</a>) (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/08/Category-M-2019-20-Q2-Jul-Sep-with-Apr-19-combined.xlsx">Category M 2019/20 Q2 Jul-Sep (with Apr 19 combined) </a>(MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/05/Cat-M-Apr-2019-1.xlsx">Category M: 2019/20 Q1 Apri-June (with Jan 2019 combined) </a>(MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/01/Category-M-2018.19-Q4-JanMar-with-Nov-18-combined.xlsx">Category M: 2018/19 Q4 Jan-Mar (with Nov 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2019/01/Category-M-Nov-18.xlsx">Category M: 2018 November (with Oct 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2018/09/Category-M-2018.19-Q3-OctDec-with-Aug-18-combined.xlsx">Category M: 2018/19 Q3 Oct-Dec (with Aug 18 combined)</a> (MS Excel)</p>
<p><a href="http://psnc.org.uk/wp-content/uploads/2018/06/Category-M-2018.19-Q2-JulSep-with-Apr-18-combined.xlsx">Category M: 2018/19 Q2 Jul-Sep (with Apr 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2018/04/Category-M-2018.19-Q1-AprJun-with-Jan-18-combined-v2.xlsx">Category M: 2018/19 Q1 Apr-Jun (with Jan 18 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2017/12/Category-M-Jan-18.xlsx">Category M: 2017/18 Q4 Jan-Mar (with Oct 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Oct-17.xlsx">Category M: 2017/18 Q3 Oct-Dec (with Aug 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Aug-17.xlsx">Category M: 2017 August (with Jul 17 combined)</a> (MS Excel)</p>
<p><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jul-17.xlsx">Category M: 2017/18 Q2 Jul-Sep (with Apr 17 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Apr-17.xlsx">Category M: 2017/18 Q1 Apr-Jun (with Jan 17 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jan-17.xlsx">Category M: 2016/17 Q4 Jan-Mar (with Oct 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Oct-16.xlsx">Category M: 2016/17 Q3 Oct-Dec (with Jul 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-Jul-16.xlsx">Category M: 2016/17 Q2 Jul – Sep (with Jun 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-June-16.xlsx">Category M: 2016 June (with Apr 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-April-16.xlsx" rel="">Category M: 2016/17 Q1 Apr – Jun (with Jan 16 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q4-Jan-Mar-with-Oct-15-combined.xlsx">Category M: 2015/16 Q4 Jan – Mar (with Oct 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q3-Oct-Dec-with-Jul-15-combined.xlsx">Category M: 2015/16 Q3 Oct </a><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jun-15-and-Apr-15-Cat-M-prices.xlsx">–</a><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Category-M-2015.16-Q3-Oct-Dec-with-Jul-15-combined.xlsx"> Dec (with Jul 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jun-15-and-Apr-15-Cat-M-prices.xlsx">Category M: 2015/16 Q2 Jul – Sep (with Apr 15 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Apr_15_and_Jan_15_Cat_M_prices-2.xlsx">Category M: 2015/16 Q1 Apr – Jun (with Jan 15 combined) updated</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Jan_15_and_Oct_14_Cat_M_prices.xlsx">Category M: 2014/15 Q4 Jan – Mar (with Oct 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2014/09/Oct_14_and_Jul_14_Cat_M_prices.xlsx">Catgegory M: 2014/15 Q3 Oct – Dec (with Jul 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2013/07/Jul_14_and_Apr_14_Cat_M_Prices.xlsx">Category M: 2014/15 Q2 Jul – Sep (with Apr 14 combined)</a> (MS Excel)</p>
<p style="text-align: justify;"><a href="https://psnc.org.uk/wp-content/uploads/2013/07/Apr_14_and-Jan_14_Cat_M_Prices.xls.xlsx">Category M: 2014/15 Q1 Apr – Jun (with Jan 14 combined)</a> (MS Excel)</p>
<p></p>
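
To get from this list to the single URL the question asks for, one option is to take the first <a> with an href inside the chosen div. The following is a minimal sketch on an inline sample; `first_link_href` and the sample HTML are hypothetical. The sample is parsed with 'html.parser' only so that the snippet runs without lxml installed; for the real page, pass 'lxml' as in the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="toggle_container"><p>intro</p></div>
<div class="toggle_container"><p>other</p></div>
<div class="toggle_container">
  <h3>Category M</h3>
  <p><a href="https://example.com/latest.xlsx">Latest</a> (MS Excel)</p>
  <p><a href="https://example.com/older.xlsx">Older</a> (MS Excel)</p>
</div>
"""


def first_link_href(soup, div_index=2, css_class="toggle_container"):
    """Return the href of the first <a> inside a <p> in the chosen div, or None."""
    divs = soup.find_all("div", class_=css_class)
    if div_index >= len(divs):
        return None
    # Only consider anchors that actually carry an href attribute.
    for paragraph in divs[div_index].find_all("p"):
        link = paragraph.find("a", href=True)
        if link is not None:
            return link["href"]
    return None


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(first_link_href(soup))
```

Returning None instead of raising keeps the caller in control when the page layout changes and the expected div or link is no longer there.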

【Discussion】:

  • Thank you very much. Could you add something that briefly explains why lxml works but html.parser does not?
  • @JeremyFox, see here
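
In general, Beautiful Soup delegates the parsing to an underlying parser library, and the available parsers recover from invalid markup in different ways, so the same document can produce different trees. The sketch below parses one deliberately malformed fragment with every parser that happens to be installed and returns the resulting trees for comparison. The fragment and the helper name `parse_with_available` are illustrative assumptions; the specific markup problem on the PSNC page is not identified here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid fragment: a closing </p> with no matching opening tag.
MALFORMED_HTML = "<a></p>"


def parse_with_available(html, parsers=("html.parser", "lxml", "html5lib")):
    """Parse html with each installed parser and return {parser: tree as str}."""
    trees = {}
    for name in parsers:
        try:
            trees[name] = str(BeautifulSoup(html, name))
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
    return trees


if __name__ == "__main__":
    for name, tree in parse_with_available(MALFORMED_HTML).items():
        print(f"{name:12} {tree}")
```

Running the same comparison on the HTML of the div in question would show whether the parsers disagree about where that div ends.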