How to extract a description from an HTML paragraph using Python: answers

【Question title】: How to extract a description from an HTML paragraph using Python
【Posted】: 2021-05-31 16:55:44
【Question】:

I want to extract the HTML paragraph from the page source, but my code also picks up the colour and style ID data.

import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com/gb/t/air-max-viva-shoe-ZQTSV8/DB5268-003"

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')

description = soup.find(
    'div', {'class': 'description-preview body-2 css-1pbvugb'}).text
print(description)

【Comments on the question】:

By the way, for your future questions: <p> is called an HTML paragraph.

Tags: python selenium beautifulsoup python-requests webdriver

【Solution 1】:

Chain a .find("p") call after it.

description = soup.find('div', {'class':'description-preview body-2 css-1pbvugb'}).find("p").text

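【Comments】:

There is no need to give the full class name, and you do not even need to search for the tag explicitly: you can access it as an attribute. soup.select_one('.description-preview').p.string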
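To make the difference between these calls concrete, here is a minimal sketch on a static snippet. The markup is hypothetical, loosely modelled on what the question describes (a paragraph followed by colour and style details inside the same div); the real page may be structured differently. It uses the built-in html.parser so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the page, not copied from it.
html = """
<div class="description-preview body-2 css-1pbvugb">
  <p>Designed with every woman in mind.</p>
  <ul><li>Colour Shown: Black/White</li><li>Style: DB5268-003</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A CSS selector matches on one class, so the generated-looking
# "css-1pbvugb" class does not have to be spelled out.
div = soup.select_one(".description-preview")

# Text of the whole div: includes the colour and style lines.
whole = div.get_text(" ", strip=True)

# Text of the first <p> only, via an explicit search ...
first_p = div.find("p").get_text(strip=True)

# ... or via attribute access, which returns the first <p> descendant.
via_attr = div.p.string

print(whole)
print(first_p)
print(via_attr)
```

Note that .string returns None when a tag has more than one child (for example a <p> that contains a <br> or a <b>), so .text or get_text() is the safer choice when the paragraph may contain nested markup.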

【Solution 2】:

It looks like you want the text of the next <p>:

description = soup.find('div', {'class':'description-preview body-2 css-1pbvugb'}).find_next('p').text
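For reference, find_next differs from find in where it looks: find only searches inside the tag, while find_next walks forward through the rest of the document. When the <p> is a child of the div, both return the same element. The sketch below uses hypothetical markup in which the <p> follows the div instead, to show the case where the two calls diverge.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: here the <p> is a sibling of the div, not a child.
html = """
<div class="description-preview"><span>Overview</span></div>
<p>Paragraph that follows the div.</p>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="description-preview")

# find() searches only the descendants of the div, so nothing is found.
inside = div.find("p")

# find_next() continues past the end of the div in document order.
after = div.find_next("p")

print(inside)
print(after.get_text())
```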

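【Comments】:

There is no need to give the full class name, and you do not even need to search for the tag explicitly: you can access it as an attribute. soup.select_one('.description-preview').p.string
Yes. I left it in so that the answer is the smallest possible change from the OP's original code.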

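【Solution 3】:

If this is the only thing you need from that page, you do not need a full parser in this case, because a parser loads the entire document into memory.

You can compare the running time of the regex approach with that of the bs4 parser.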
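One way to run that comparison is with timeit. This is only a sketch: the html string is a synthetic stand-in for the downloaded page, the helper names with_regex and with_bs4 are made up for illustration, and html.parser is used so that the example runs without lxml. Absolute numbers will depend on the machine, the parser, and the size of the real page.

```python
import re
import timeit

from bs4 import BeautifulSoup

# Synthetic stand-in for response.text (an assumption, not the real page).
html = (
    "<html><body>"
    + '<div class="filler"><p>filler</p></div>' * 200
    + '<div class="description-preview"><p>Sample description.</p></div>'
    + '<script>{"descriptionPreview":"Sample description."}</script>'
    + "</body></html>"
)


def with_regex(text):
    """Pull the value straight out of the embedded JSON-like text."""
    m = re.search(r'descriptionPreview":"(.+?)"', text)
    return m.group(1) if m else None


def with_bs4(text):
    """Parse the whole document, then select the paragraph."""
    soup = BeautifulSoup(text, "html.parser")
    tag = soup.select_one(".description-preview p")
    return tag.get_text() if tag else None


# Time 20 runs of each approach on the same input.
regex_time = timeit.timeit(lambda: with_regex(html), number=20)
bs4_time = timeit.timeit(lambda: with_bs4(html), number=20)
print(f"regex: {regex_time:.4f}s  bs4: {bs4_time:.4f}s")
```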

Here is a quick capture:

import re
import requests

r = requests.get(
    'https://www.nike.com/gb/t/air-max-viva-shoe-ZQTSV8/DB5268-003')

match = re.search(r'descriptionPreview\":\"(.+?)\"', r.text).group(1)
print(match)

Output:

Designed with every woman in mind, the mixed material upper of the Nike Air Max Viva 
features a plush collar, detailed patterning and intricate stitching. The new lacing 
system uses 2 separate laces constructed from heavy-duty tech chord, letting you find the perfect fit. Mixing comfort with style, it combines Nike Air with a lifted foam 
heel for and unbelievable ride that looks as good as it feels.
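The one-line capture above raises AttributeError if the key is missing, and the non-greedy (.+?) stops at the first quote, even an escaped one. A slightly more defensive sketch follows. It assumes the value is a JSON string embedded in the page; the page fragment and the helper name extract_description are hypothetical.

```python
import json
import re

# Hypothetical fragment of embedded JSON, with an escaped quote and newline.
page = r'<script>{"title":"Shoe","descriptionPreview":"A \"plush\" collar.\nSecond line."}</script>'

# Capture a complete JSON string literal: either ordinary characters
# or a backslash escape, repeated, between double quotes.
pattern = re.compile(r'"descriptionPreview"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_description(text):
    """Return the decoded description, or None if the key is absent."""
    m = pattern.search(text)
    if m is None:
        return None
    # json.loads turns escapes such as \" and \n into real characters.
    return json.loads(m.group(1))


print(extract_description(page))
print(extract_description("<html></html>"))
```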

If you want to use bs4:

Here is a short usage example:

soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('.description-preview').p.string)

Note: I use the lxml parser because, according to the bs4 documentation, it is the fastest parser.

【Comments】: