【Question title】: Searching Beautiful Soup HTML with a variable
【Posted】: 2021-12-03 20:33:55
【Question】:

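For each species in a list, I am searching a web page that should contain the same text box with dictionary-style information such as <dt> english name </dt> <dd> water shrew </dd>, <dt> status </dt> <dd> endangered </dd>, and so on. The information I want is in the text box preceded by this heading: <h2 class="text-center" id="_02"> COSEWIC assessment aumary</h2>.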

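Ultimately I am trying to extract the "Endangered" string from this box; later I want to put it into a dictionary together with the species name and so on. The value will differ for each species as I loop over the URLs: the page structure should be the same, but each page holds information about a different species.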

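Because the values for "status" and "english name" differ for every species, I cannot search for that text directly. I also cannot use an if-else statement, because this box is not the only place on the page where the keywords "endangered" or "threatened" appear. So is there a way to select only the elements inside that text box and then search further within it? (It is not the only text box on the page, either.) Or is there a way to search by dt and retrieve the corresponding dd?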

For reference: https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html

Thank you for your time!

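【Comments】:

  • What do you want to extract from the website?
  • OK, I updated the post to include this information. Mostly I want to extract the "Endangered" string. But yes, that string will be different for other species, so my search method has to be more general than searching for the string itself.

Tags: html web-scraping beautifulsoup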


【Answer 1】:

Assuming I understand you correctly, this should do it:

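from bs4 import BeautifulSoup as bs
import requests

url = "https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html"
req = requests.get(url)
req.raise_for_status()  # stop early on an HTTP error

soup = bs(req.text, 'html.parser')
# Find the div whose direct child <h2> contains the heading text (matched literally,
# exactly as spelled here), then take the .dl-horizontal element inside the panel div
# that immediately follows it.
sumr = soup.select_one('div:has(> h2:-soup-contains-own("COSEWIC assessment aummary"))+div[class="mwspanel section"] .dl-horizontal')
# Keep only the <dt> elements that contain a <strong>.
targets = sumr.select('dt:has(strong)')
for target in targets:
    # Pair each <dt> with the first <dd> that follows it in the document.
    print(target.text.strip(), ":", target.find_next('dd').text.strip())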

Output:

Common name : Pacific Water Shrew
Scientific name : Sorex bendirii
Status : Endangered
Reason for designation : This shrew is restricted to British Columbia’s Lower Mainland and adjacent low valleys. It is rare there, associated with freshwater streams and adjacent wet habitats. Urban development, agriculture, and forestry have reduced the amount and quality of habitat. There is an inferred and projected ongoing decline in habitat and subpopulations in much of its range in Canada.
Occurrence : British Columbia
Status history : Designated Threatened in April 1994 and in May 2000. Status re-examined and designated Endangered in April 2006. Status re-examined and confirmed in April 2016.
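
Since the question mentions feeding the result into a dictionary, the dt/dd pairs can also be collected into a dict and looked up by label instead of printed. The following is only a minimal sketch: the helper name `dl_to_dict` and the inline sample markup are assumptions made for illustration (the real page has more fields and surrounding panels), and it assumes each dd is a sibling of its dt.

```python
from bs4 import BeautifulSoup


def dl_to_dict(dl):
    """Map the text of each <dt> in a <dl> element to the text of its paired <dd>."""
    record = {}
    for dt in dl.find_all("dt"):
        # The paired value is assumed to be the next <dd> at the same nesting level.
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            record[dt.get_text(strip=True)] = dd.get_text(" ", strip=True)
    return record


# Hypothetical, simplified markup standing in for the summary panel on the page.
SAMPLE_HTML = """
<dl class="dl-horizontal">
  <dt><strong>Common name</strong></dt><dd>Pacific Water Shrew</dd>
  <dt><strong>Status</strong></dt><dd>Endangered</dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
record = dl_to_dict(soup.select_one("dl.dl-horizontal"))
print(record["Status"])
```

On the live page, the element returned by `select_one` in the answer above could be passed to the helper in place of the sample, provided that element is the dl itself.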

【Comments】:

  • Wow, this is exactly what I needed! Thank you! I haven't tested it in a loop yet, though.
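
For the loop case, one possible shape is sketched below. It is an untested-against-the-live-site sketch under stated assumptions: the function names (`fetch_html`, `extract_field`, `collect_statuses`), the default selector `dl.dl-horizontal dt`, and the 30-second timeout are all hypothetical choices, not part of the answer. The answer's more specific selector can be passed as `selector` (with ` dt` appended) if a page has several such lists. Pages that fail to load or lack the field yield `None` instead of raising.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text, raising on HTTP errors."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


def extract_field(html, label, selector="dl.dl-horizontal dt"):
    """Return the <dd> text paired with the <dt> whose text equals `label`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for dt in soup.select(selector):
        if dt.get_text(strip=True) == label:
            dd = dt.find_next_sibling("dd")
            return dd.get_text(strip=True) if dd is not None else None
    return None


def collect_statuses(urls, fetch=fetch_html):
    """Map each species name to its status; None if the page fails or has no status."""
    results = {}
    for species, url in urls.items():
        try:
            results[species] = extract_field(fetch(url), "Status")
        except requests.RequestException:
            results[species] = None
    return results


# Usage (performs a network request, so it is left commented out here):
# statuses = collect_statuses({
#     "Pacific Water Shrew": "https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html",
# })
```

Passing the download function in as `fetch` keeps the parsing logic testable without network access.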