BeautifulSoup: get text between elements

【Question title】: BeautifulSoup get text between elements
【Posted】: 2019-10-04 07:08:08
【Question】:

I have something like this:

<b>foo:</b> bar

<br />


<b>baz:</b>
<font color="green">YES</font> spam

<br />


<b>eggs:</b> ham

<br />

Now I want to get all of the strings that sit between these elements.

Here is what I can do:

from bs4 import BeautifulSoup
# get the html here
soup = BeautifulSoup(content, 'html.parser')
for element in soup.find_all('b'):
    print(element.next_sibling)

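It works, but only for text that is not wrapped in another tag, such as the <font> tag. So I get bar and ham, but I don't get YES and, surprisingly, I don't even get spam. Is there a way to parse this without using regular expressions?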
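For context, here is a minimal sketch of why `next_sibling` comes back empty for the `baz:` entry. It uses a one-line excerpt of the markup above; the excerpt and the variable names are illustrative assumptions, not part of the original post. The node right after `</b>` is the newline, and both the `<font>` tag and the trailing ` spam` text are later siblings.

```python
from bs4 import BeautifulSoup

# Excerpt of the markup from the question (shortened for illustration).
html = '<b>baz:</b>\n<font color="green">YES</font> spam\n<br />'
soup = BeautifulSoup(html, 'html.parser')

b = soup.find('b')
siblings = list(b.next_siblings)

# Show every node that follows </b> at the same nesting level.
for node in siblings:
    print(repr(node))
```

So `b.next_sibling` only ever returns the first of these nodes, which here is a whitespace-only string.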

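【Comments】:

BeautifulSoup4 has a built-in function specifically for getting the text inside tags. It is called get_text(). More information here: crummy.com/software/BeautifulSoup/bs4/doc/#get-text
But how does that help me? It looks exactly the same as .text
Possible duplicate of (Web scraping) I've located the proper tags, now how do I extract the text?
Not a duplicate. I know how to get the text inside a tag, but here the text sits between elements.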

Tags: python html python-3.x beautifulsoup

【Solution 1】:

You can call find_all() with no arguments to walk over every tag, branch on the tag name, and use next_element to reach the text you want.

from bs4 import BeautifulSoup
html='''<b>foo:</b> bar

<br />


<b>baz:</b>
<font color="green">YES</font> spam

<br />


<b>eggs:</b> ham

<br />'''
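soup = BeautifulSoup(html, 'lxml')
for item in soup.find_all():
    if item.name == 'font':
        # The text inside <font>, then the text node that follows it.
        print(item.text.strip())
        print(item.next_element.next_element.strip())
    if item.name == 'b':
        # next_element is the label text inside <b>; the element after
        # that is the text node following the closing </b>.
        value = item.next_element.next_element.strip()
        if value != '':
            print(value)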

Output:

bar
YES
spam
ham
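The same four values can also be collected without chaining next_element. The following is only a sketch of a different technique, not the answer's method: it asks for every text node with find_all(string=True) and drops the ones that live inside a <b> label. The names `values` and `html` are illustrative, and it assumes every non-label string in the document is wanted.

```python
from bs4 import BeautifulSoup

html = '''<b>foo:</b> bar

<br />


<b>baz:</b>
<font color="green">YES</font> spam

<br />


<b>eggs:</b> ham

<br />'''

soup = BeautifulSoup(html, 'html.parser')

# Every text node in document order, minus label text and pure whitespace.
values = [
    s.strip()
    for s in soup.find_all(string=True)
    if s.parent.name != 'b' and s.strip()
]
print(values)
```

On a full page this would also pick up unrelated text, so it is best applied to the smallest container that holds the labels.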

【Discussion】:

【Solution 2】:

I gave it a try. Hope it works:


from bs4 import BeautifulSoup

# get the html here
soup = BeautifulSoup(content, 'html.parser')
for b in soup.find_all('b'):
    print(b.get_text())
    next_b = b.find_next('b')
    for sibling in b.next_siblings:
        if sibling is next_b:
            # Stop once the next <b> label is reached.
            break
        if isinstance(sibling, str):
            # Plain text node between tags.
            print(sibling)
        else:
            # A tag such as <font>: print the text it contains.
            print(sibling.get_text())
    print("new")

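【Discussion】:

Well, it sort of works, but if I also want to print the text in between, I get: foo: barbaz: baz: YES, so it repeats, and the <script> part at the bottom of the page gets parsed too.
@dabljues I think you have to take a recursive approach to expand nested tags.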
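One way to address both points raised in this thread is sketched below. It is an illustration, not code from either answer: the function name `text_after_labels` is hypothetical, it assumes each label and its values are siblings under the same parent, and it skips <script> and <style> tags as a guess at what the commenter wanted. Explicit recursion may not be needed here, because get_text() already descends into nested tags.

```python
from bs4 import BeautifulSoup, Tag

html = '''<b>foo:</b> bar

<br />


<b>baz:</b>
<font color="green">YES</font> spam

<br />


<b>eggs:</b> ham

<br />'''


def text_after_labels(soup):
    """Map each <b> label to the text pieces found before the next <b>."""
    result = {}
    for b in soup.find_all('b'):
        parts = []
        for node in b.next_siblings:
            if isinstance(node, Tag):
                if node.name == 'b':
                    break  # the next label starts here
                if node.name in ('script', 'style'):
                    continue  # assumption: page scripts are not wanted
                text = node.get_text(strip=True)  # includes nested tags
            else:
                text = str(node).strip()
            if text:
                parts.append(text)
        result[b.get_text(strip=True)] = parts
    return result


soup = BeautifulSoup(html, 'html.parser')
fields = text_after_labels(soup)
print(fields)
```

Each label is visited once and the inner loop stops at the following <b>, so values are not repeated across labels.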