Trouble extracting some content from a webpage using BeautifulSoup (answers)

【Question title】: Trouble extracting some content from a webpage using BeautifulSoup
【Posted】: 2019-10-22 07:06:28
【Question】:

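I've written a script using Python and the BeautifulSoup library to scrape specific content from a webpage. The content I'm interested in sits under the heading What does that mean on that page.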

Link to that page

More specifically, the content I want to parse:

Everything under the heading What does that mean, except the image.

This is what I've tried so far:

import requests
from bs4 import BeautifulSoup

link = "https://www.obd-codes.com/p0100"

def fetch_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    [script.extract() for script in soup.select("script")]
    elem = [item.text for item in soup.select("h2:contains('What does that mean') ~ p")]
    print(elem)

if __name__ == '__main__':
    fetch_data(link)

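However, the way I've tried it gives me almost everything on that page, which is not what I expected.

How can I get only the content between What does that mean and What are some possible symptoms from the page above?

PS: I do not want to use regular expressions.

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup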

【Solution 1】:

You can use the itertools.takewhile function (official doc) to get what you want:

import requests
from bs4 import BeautifulSoup

from itertools import takewhile

link = "https://www.obd-codes.com/p0100"

def fetch_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    [script.extract() for script in soup.select("script")]
    elems = [i.text for i in takewhile(lambda tag: tag.name != 'h2', soup.select("h2:contains('What does that mean') ~ *"))]
    print(elems)

if __name__ == '__main__':
    fetch_data(link)

Prints:

['This diagnostic trouble code (DTC) is a generic powertrain code, which means that it applies to OBD-II equipped vehicles that have a mass airflow sensor. Brands include but are not limited to Toyota, Nissan, Vauxhall, Mercedes Benz, Mitsubishi, VW, Saturn, Ford, Jeep, Jaguar, Chevy, Infiniti, etc. Although generic, the specific repair steps may vary depending on make/model.', "The MAF (mass air flow) sensor is a sensor mounted in a vehicle's engine air intake tract downstream from the air filter, and is used to measure the volume and density of air being drawn into the engine. The MAF sensor itself only measures a portion of the air entering and that value is used to calculate the total volume and density of air being ingested.", '\n\n\n\n\xa0', '\n', 'The powertrain control module (PCM) uses that reading along with other sensor parameters to ensure proper fuel delivery at any given time for optimum power and fuel efficiency.', 'This P0100 diagnostic trouble code (DTC) means that there is a detected problem with the Mass Air Flow (MAF)\nsensor or circuit. The PCM detects that the actual MAF sensor frequency signal\nis not performing within the normal expected range of the calculated MAF value.', 'Note: Some MAF sensors also incorporate an air temperature sensor, which is another value used by the PCM for optimal engine operation.', 'Closely related MAF circuit trouble codes include:', '\nP0101 Mass or Volume Air Flow "A" Circuit Range/Performance\nP0102 Mass\nor Volume Air Flow "A" Circuit Low Input\nP0103 Mass\nor Volume Air Flow "A" Circuit High Input\nP0104 Mass or Volume Air Flow "A" Circuit Intermittent\n', 'Photo of a MAF sensor:']
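To see why takewhile helps here, it may be useful to look at it without any HTML at all. The sketch below uses a made-up list of tag names (an assumption for illustration, not the real page structure) to stand in for the siblings that follow the first h2. takewhile stops at the first element that fails the predicate, whereas a plain filter keeps going and picks up the later sections as well.

```python
from itertools import takewhile

# Hypothetical tag names standing in for the siblings after the first <h2>.
following = ["p", "p", "div", "p", "ul", "h2", "p", "ul"]

# takewhile stops for good at the first "h2" it meets.
until_h2 = list(takewhile(lambda name: name != "h2", following))

# A plain filter only skips the "h2" and keeps everything after it.
filtered = [name for name in following if name != "h2"]

print(until_h2)
print(filtered)
```

The first list ends right before the next heading, which is the behaviour the answer relies on.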

Edit:

If you only want the <p> tags that come directly after the <h2> tag, use lambda tag: tag.name == 'p'.
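The difference between the two predicates can be sketched on a small inline document. The HTML below is invented for illustration and is not the real page; the sketch also uses html.parser and locates the heading with find plus find_next_siblings instead of the :contains selector, purely so it runs without a network connection or lxml.

```python
from itertools import takewhile
from bs4 import BeautifulSoup

# Made-up markup that mimics the heading/sibling layout discussed above.
HTML = """
<div>
<h2>What does that mean?</h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<ul><li>A list item</li></ul>
<p>Paragraph after the list.</p>
<h2>What are some possible symptoms?</h2>
<p>Symptom text.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
heading = soup.find("h2", string=lambda s: s and "What does that mean" in s)

# With no arguments this returns only Tag siblings, in document order.
siblings = heading.find_next_siblings()

# Everything up to (not including) the next <h2>.
until_next_h2 = [t.get_text(strip=True)
                 for t in takewhile(lambda t: t.name != "h2", siblings)]

# Only the unbroken run of <p> tags right after the heading.
only_leading_p = [t.get_text(strip=True)
                  for t in takewhile(lambda t: t.name == "p", siblings)]

print(until_next_h2)
print(only_leading_p)
```

Note that the tag.name == 'p' variant stops at the first non-<p> sibling, so a paragraph that follows a list or an image wrapper is left out.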

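【Comments】:

I would not have thought of this ~ * myself, @Andrej Kesely. The way you applied the general sibling selector is really nice.
Hi Andrej! Is there any way to adjust this selector, soup.select("h2:contains('Symptoms') ~ *"), so that it handles both a lowercase s and an uppercase S and matches symptoms as well as Symptoms, other than writing soup.select("h2:contains('symptoms'),h2:contains('Symptoms') ~ *")?
@MITHU I'm afraid that is not possible with CSS selectors, but you can do this: [h2 for h2 in soup.select("h2") if 'symptom' in h2.text.lower()]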
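Building on that last comment, a case-insensitive lookup could be combined with takewhile roughly as follows. This is only a sketch: section_text is a hypothetical helper name, the HTML snippets are invented, and html.parser is used so the example has no extra dependencies.

```python
from itertools import takewhile
from bs4 import BeautifulSoup

def section_text(html, keyword):
    """Return the text of the siblings after the first <h2> whose text
    contains `keyword` (case-insensitive), up to the next <h2>.
    Hypothetical helper, not part of BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    for h2 in soup.select("h2"):
        if keyword.lower() in h2.get_text().lower():
            following = h2.find_next_siblings()
            return [t.get_text(" ", strip=True)
                    for t in takewhile(lambda t: t.name != "h2", following)]
    return []  # no matching heading found

# Two made-up pages that differ only in the capitalisation of the heading.
upper = "<div><h2>Symptoms</h2><p>Rough idle</p><h2>Causes</h2><p>Dirty sensor</p></div>"
lower = "<div><h2>Possible symptoms</h2><p>Rough idle</p><h2>Causes</h2><p>Dirty sensor</p></div>"

print(section_text(upper, "symptom"))
print(section_text(lower, "symptom"))
```

Doing the comparison in Python keeps the selector simple and avoids listing every capitalisation in the CSS.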

【Solution 2】:

There is another way to achieve the same result: keep collecting elements until the script runs into the next h2 tag.

import requests
from bs4 import BeautifulSoup

url = "https://www.obd-codes.com/p0100"

res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
[script.extract() for script in soup.select("script")]
elem_start = [elem for elem in soup.select_one("h2:contains('What does that mean')").find_all_next()]
content = []
for item in elem_start:
    if item.name=='h2': break
    content.append(' '.join(item.text.split()))
print(content)
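One thing worth knowing about this approach: find_all_next walks every following tag in document order, including tags nested inside others, so the text of a nested element can show up twice (once for the parent, once for the child). The sketch below illustrates this on invented markup and compares it with find_next_siblings; collect is a hypothetical helper that mirrors the loop above.

```python
from bs4 import BeautifulSoup

# Made-up markup with nested tags (<b> inside <p>, <li> inside <ul>).
HTML = ("<div><h2>Start</h2><p>Intro <b>bold</b></p>"
        "<ul><li>One</li><li>Two</li></ul><h2>Stop</h2><p>Later</p></div>")

soup = BeautifulSoup(HTML, "html.parser")
start = soup.find("h2")

def collect(elements):
    """Gather whitespace-normalised text until the next <h2> is reached."""
    out = []
    for el in elements:
        if el.name == "h2":
            break
        out.append(" ".join(el.get_text(" ").split()))
    return out

# find_all_next also visits <b> and each <li>, so their text is repeated.
with_descendants = collect(start.find_all_next())

# find_next_siblings stays on the same level as the heading.
siblings_only = collect(start.find_next_siblings())

print(with_descendants)
print(siblings_only)
```

If the repeated entries are unwanted, iterating over siblings instead is one way to avoid them, assuming the content sits at the same level as the heading.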

【Comments】: