【Question title】: How to get the text enclosed within a tag, which contains multiple sub-tags, with beautifulsoup?
【Posted】: 2021-01-21 15:05:15
【Question】:

I am trying to scrape a web page that contains the following markup:

  <div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>

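I want to scrape the "Pick me please." string, but not the "Do not pick me please!" string. Any idea how to do this?

Edit: I would like a more general solution. I always want the text that sits directly under a specific tag and is not inside any of its child tags.

【Comments】:

Tags: python python-3.x web-scraping beautifulsoup web-crawler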

【Solution 1】:

Edit

A more "generic" solution that uses find() to get the first non-empty text node directly under the div:

parent = soup.select_one('div')
parent.find(text=lambda text: text and text.strip(), recursive=False).strip()
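If the tag may hold more than one piece of direct text, the same idea can be extended to collect all of them. This is only a sketch: `direct_text` is a hypothetical helper name, and `html.parser` is used here so that the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<div style="text-align: center;">
    <img src="https://documents.google.com/" alt="" width="60" height="30" />
    <br />
    Pick me please.
    <p> Do not pick me please! </p>
    <br />
    <br />
</div>
'''


def direct_text(tag):
    """Return the non-empty strings that are direct children of `tag`.

    Hypothetical helper: text inside child tags (such as <p>) is skipped,
    because only the tag's immediate children are inspected.
    """
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(html, 'html.parser')
result = direct_text(soup.select_one('div'))
print(result)  # ['Pick me please.']
```

Whitespace-only nodes between the tags are dropped by the `child.strip()` check, so only the visible text is left.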

To get the text node, use previous_sibling and call strip() on the result to remove the surrounding whitespace.

soup.select_one('div p').previous_sibling.strip()

Or use get_text() with strip=True:

soup.select_one('div').get_text('|', strip=True).split('|')[0]
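One thing to watch with the separator trick: if the wanted text itself contains the separator character, the split cuts it short. A possible alternative is to take the first item of `stripped_strings`. The markup below is made up for illustration (it is not from the question), and note that `stripped_strings` also walks into child tags, so it only helps when the wanted text comes first.

```python
from bs4 import BeautifulSoup

# Made-up markup where the wanted text contains the '|' separator
html = '<div><br/> Pick | me please. <p> Do not pick me please! </p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one('div')

# Splitting on the separator truncates the text at the first '|'
split_result = div.get_text('|', strip=True).split('|')[0]

# stripped_strings yields each non-empty string, already stripped
first_string = next(div.stripped_strings)

print(repr(split_result))  # 'Pick '
print(repr(first_string))  # 'Pick | me please.'
```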

Example

from bs4 import BeautifulSoup

html = '''
<div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.select_one('div p').previous_sibling.strip())

Output

Pick me please.

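【Comments】:

Thanks for your answer, I appreciate the solution. Is there a more "generic" solution to this problem? What if I always want the text under a specific tag that is not inside any of its child tags?
@IntoAbhi: take a look, I added a more "generic" solution that should do what you want - does it?
Yes, that is exactly what solves the problem. Thanks for your answer!

【Solution 2】:

You can also use the get_text() method, which returns all the text in a document or beneath a tag as a single Unicode string. Here, though, I use find() with a regular expression (re.compile) to match the text node.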

import re
from bs4 import BeautifulSoup
html= """<div style="text-align: center;">
            <img src="https://documents.google.com/" alt="" width="60" height="30" />
            <br />
            Pick me please.

        <p> Do not pick me please! </p>

        <br />
        <br />
    </div>"""

soup = BeautifulSoup(html, 'lxml')
print(soup.find(text=re.compile("Pick me please.")).strip())
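Keep in mind that the "." in the pattern is a regex wildcard, so the search matches any character in that position. A small sketch of the difference, using made-up markup (not the question's) and `re.escape` to match the text literally; `string=` is used here as the newer spelling of the `text=` argument:

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: the two strings differ only in the last character
html = '<div><br/> Pick me please. <p> Pick me please! </p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Unescaped: the trailing "." matches "!" as well
loose = soup.find_all(string=re.compile("Pick me please."))

# Escaped: only the literal text, including the full stop, matches
strict = soup.find_all(string=re.compile(re.escape("Pick me please.")))

loose_texts = [s.strip() for s in loose]
strict_texts = [s.strip() for s in strict]

print(loose_texts)   # ['Pick me please.', 'Pick me please!']
print(strict_texts)  # ['Pick me please.']
```

This approach also needs the text to be known in advance, so it fits best when you are checking for a specific string rather than extracting unknown text.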

【Comments】:

【Solution 3】:

You can search for the <br> tag and then call the find_next() method, which returns the first match that comes after it.

from bs4 import BeautifulSoup

# html is the markup string from the question
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one('div br').find_next(text=True).strip())

Output:

Pick me please.
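Note that find_next() walks forward through the whole document, so this depends on the wanted text coming right after the first <br>. A short sketch of both cases; the second markup string is made up to show what happens when a child tag comes first, and `string=True` is used as the newer spelling of `text=True`:

```python
from bs4 import BeautifulSoup

# Markup from the question: the text follows the first <br> directly
html = '''<div style="text-align: center;">
    <img src="https://documents.google.com/" alt="" width="60" height="30" />
    <br />
    Pick me please.
    <p> Do not pick me please! </p>
    <br />
    <br />
</div>'''

# Made-up variant: a <p> sits between the <br> and the wanted text
reordered = '<div><br/><p> Do not pick me please! </p> Pick me please. </div>'

first = BeautifulSoup(html, 'html.parser').select_one('div br')
second = BeautifulSoup(reordered, 'html.parser').select_one('div br')

# find_next(string=True) returns the next text node, wherever it lives
text_original = first.find_next(string=True).strip()
text_reordered = second.find_next(string=True).strip()

print(text_original)   # Pick me please.
print(text_reordered)  # Do not pick me please!
```

In the second case the text inside <p> is returned, so for the general "direct text only" requirement a search limited to the tag's own children is the safer choice.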

【Comments】: