How to get all text between just two specified tags using BeautifulSoup? Answers

【Question title】: How to get all text between just two specified tags using BeautifulSoup?
【Posted】: 2012-08-02 06:32:45
【Question description】:

html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""

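I want to get all the text from the first opening big tag up to the first occurrence of the a tag. For the example above, that means I should get (iterable) as a single string.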

【Question comments】:

Tags: python html-parsing beautifulsoup

【Solution 1】:

An iterative approach.

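# BeautifulSoup 3 import; the function expects the `html` string from the question
from BeautifulSoup import BeautifulSoup as bs
from itertools import takewhile, chain

def get_text(html, from_tag, until_tag):
    soup = bs(html)
    for big in soup(from_tag):
        # The first until_tag that follows this from_tag marks where to stop
        until = big.findNext(until_tag)
        # Keep only the following siblings that carry non-blank text
        strings = (node for node in big.nextSiblingGenerator() if getattr(node, 'text', '').strip())
        selected = takewhile(lambda node: node != until, strings)
        try:
            yield ''.join(getattr(node, 'text', '') for node in chain([big, next(selected)], selected))
        except StopIteration:
            # No text-bearing sibling precedes until_tag, so skip this from_tag
            pass

for text in get_text(html, 'big', 'a'):
    print(text)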

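【Comments】:

【Solution 2】:

I would avoid nextSibling, because from your question you want to include everything up to the next <a>, regardless of whether it sits in a sibling, parent, or child element.

So I think the best approach is to find the node of the next <a> element and loop recursively until you reach it, appending every string encountered along the way. You may need to tidy up the code below if your HTML differs much from the example, but something like this should work: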

from bs4 import BeautifulSoup, NavigableString
# Uses the `html` variable from the question.
soup = BeautifulSoup(html, 'html.parser')
firstBigTag = soup.find_all('big')[0]
nextATag = firstBigTag.find_next('a')
def loopUntilA(text, element):
    # Stop once the walk reaches the target <a> tag or runs out of nodes
    if element is None or element is nextATag:
        return text
    # Only string nodes carry text; strip the whitespace between tags
    if isinstance(element, NavigableString):
        text += element.strip()
    # Step one node at a time so whitespace-only strings cannot skip past <a>
    return loopUntilA(text, element.next_element)
targetString = loopUntilA('', firstBigTag)
print(targetString)
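
The recursive walk above can run into Python's recursion limit on long documents. As a rough alternative, here is a minimal iterative sketch of the same idea, assuming bs4 with the built-in html.parser. The helper name text_between is made up for illustration, and stripping each string is a choice made here so that the whitespace between tags does not end up in the result.

```python
from bs4 import BeautifulSoup, NavigableString

def text_between(html, from_tag, until_tag):
    """Join every string from the first from_tag up to the next until_tag.

    text_between is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(from_tag)
    if start is None:
        return ""
    stop = start.find_next(until_tag)
    parts = []
    # next_elements yields nodes in document order: children first, then
    # following siblings, then whatever comes after the parent.
    for node in start.next_elements:
        if node is stop:
            break
        if isinstance(node, NavigableString):
            # Assumption: whitespace around each string is not wanted.
            parts.append(node.strip())
    return "".join(parts)

html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
"""
print(text_between(html, "big", "a"))  # (iterable)
```

If no until_tag follows the start tag, this sketch simply collects every string to the end of the document.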

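【Comments】:

Yes, exactly. I want to include everything up to the next a tag, and there may be any number of tags and pieces of text between the first big tag and the first a tag.

【Solution 3】:

You can do it like this: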

from BeautifulSoup import BeautifulSoup
html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""
soup = BeautifulSoup(html)
print(soup.find('big').nextSibling.next.text)

For details, see the BeautifulSoup documentation on traversing the DOM.

【Comments】:

This returns "iterable" rather than "(iterable)".

【Solution 4】:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> parsed = bs(html)
>>> txt = []
>>> for i in parsed.findAll('big'):
...     txt.append(i.text)
...     if i.nextSibling.name != u'a':
...         txt.append(i.nextSibling.text)
...
>>> ''.join(txt)
u'(iterable)'

【Comments】:

nextSibling cannot be used here, because I want to include every piece of text up to the first occurrence of the a tag.
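
To illustrate the point of this comment, here is a small sketch, assuming bs4 with html.parser. The markup is invented for the example: the closing parenthesis and the <a> tag are nested inside a <span>, so they are not siblings of the first <big>. It compares a sibling-only walk with a document-order walk over next_elements.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: ")" and <a> live in a different parent than the first <big>.
html = ('<p><big>(</big><em>iterable</em></p>'
        '<span><big>)</big><a href="#all">link</a></span>')

soup = BeautifulSoup(html, "html.parser")
first_big = soup.find("big")
stop = first_big.find_next("a")

# Sibling walk: only sees nodes that share a parent with the first <big>.
sibling_text = first_big.get_text() + "".join(
    str(sib) if isinstance(sib, NavigableString) else sib.get_text()
    for sib in first_big.next_siblings
)

# Document-order walk: crosses parent boundaries until it reaches <a>.
element_text = ""
for node in first_big.next_elements:
    if node is stop:
        break
    if isinstance(node, NavigableString):
        element_text += node

print(sibling_text)  # (iterable
print(element_text)  # (iterable)
```

The sibling walk stops at the end of the enclosing <p> and loses the closing parenthesis, while the document-order walk keeps going until it meets the <a> tag.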