Beautiful Soup: extracting everything between two tags when the tags have different ids

【Question title】: Beautiful soup: Extract everything between two tags when these tags have different ids
【Posted】: 2022-01-24 23:55:31
【Question】:

Beautiful soup: Extract everything between two tags

I came across the question linked above, which extracts the content between two tags. I need to do the same thing when the two tags have different id attribute values:


    <h1 id = 'beautiful' ></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1 id = 'good' ></h1>

I am using BeautifulSoup to extract data from an HTML file. I want to get everything between two tags. That is, given a piece of HTML like this:


    <h1></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1></h1>

If I want everything between the first h1 and the second h1, the output should look like this:


    Text <i>here</i> has no tag
    <div>This is in a div</div>

from bs4 import BeautifulSoup


html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

soup = BeautifulSoup(html_doc, 'html.parser')

for c in list(soup.contents):
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()

for h1 in soup.select('h1'):
    h1.extract()

print(soup)

Prints:

Text <i>here</i> has no tag
<div>This is in a div</div>

This works when the tags have no id.

Can someone help me with this?

【Comments】:

Tags: python html beautifulsoup tags

【Answer 1】:

The parent attribute and the decompose method may help you.

from bs4 import BeautifulSoup

# Assumes html_doc holds the question's HTML, with id='beautiful' and
# id='good' on the two <h1> tags.

# 1. Find the first item you are looking for.

soup = BeautifulSoup(html_doc, 'html.parser')
hElem = soup.find("h1", {'id': 'beautiful'})


# 2. Find the second condition.

endElem = soup.find('h1', {'id': 'good'})


# 3. Get parent element that contains both.

hParent = hElem.parent  # Can be made more complex if multiple ancestors are needed to contain both conditions.


# 4. Iterate through children and remove all children outside the conditions.

inBetween = False
for child in list(hParent.children):  # copy the children: the tree is modified inside the loop
    if child is endElem:
        inBetween = False
    if not inBetween:
        child.extract()  # extract() is used because it works on text nodes as well as tags
    if child is hElem:
        inBetween = True

# Remaining data.
print(hParent)
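
If the original tree should stay intact, the content can be collected instead of carving away everything around it. What follows is a minimal sketch and not the answerer's code: the helper name between_ids is hypothetical, the sample markup is the question's HTML with ids added, and it assumes both h1 tags are siblings under the same parent. If the end tag is not a sibling of the start tag, the loop simply runs to the last sibling.

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1 id="beautiful"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="good"></h1>
This I <b>don't</b> want too
'''


def between_ids(soup, start_id, end_id):
    """Return the nodes between two sibling tags found by id (hypothetical helper)."""
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        return []
    nodes = []
    # next_siblings walks forward over the nodes that share start's parent.
    for node in start.next_siblings:
        if node is end:
            break
        nodes.append(node)
    return nodes


soup = BeautifulSoup(html_doc, 'html.parser')
fragment = ''.join(str(node) for node in between_ids(soup, 'beautiful', 'good'))
print(fragment.strip())
```

Because nothing is extracted or decomposed, soup can still be queried for other id pairs afterwards.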

【Comments】:

【Answer 2】:

You can iterate over the soup's contents yourself and build a block of elements between each pair of <h1> tags:

from bs4 import BeautifulSoup
from bs4.element import Tag


html = """
<h1 id = '1' ></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '2' ></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id = '3' ></h1>
Text3 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")

block = []
blocks = []
h1 = False

for el in soup.contents:
    if isinstance(el, Tag) and el.name == 'h1':
        # Has a h1 tag been seen yet?
        if h1:
            blocks.append(block)
            block = []
        h1 = True
    elif h1:
        block.append(el)

# Add any final elements (missing a next h1)
if block:
    blocks.append(block)
        
# Display each block as html, without overwriting soup.contents
for b in blocks:
    print("".join(str(el) for el in b))
    print("--------------")

This example yields 3 such blocks of elements:

Text1 <i>here</i> has no tag
<div>This is in a div</div>

--------------

Text2 <i>here</i> has no tag
<div>This is in a div</div>

--------------

Text3 <i>here</i> has no tag

--------------
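
Since the question identifies the tags by id, the same loop can key each block by the id of the h1 that opens it. This is a sketch under assumptions, not part of the answer above: the name blocks_by_id is made up for illustration, the sample markup is the answer's HTML with normalized attribute quoting, and every h1 is assumed to carry a unique id.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<h1 id="1"></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="2"></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="3"></h1>
Text3 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")

blocks_by_id = {}   # maps an <h1> id to the pieces of HTML that follow it
current_id = None   # id of the most recent <h1>, or None before the first one

for el in soup.contents:
    if isinstance(el, Tag) and el.name == 'h1':
        # A new <h1> starts a new block.
        current_id = el.get('id')
        blocks_by_id[current_id] = []
    elif current_id is not None:
        blocks_by_id[current_id].append(str(el))

# Join each block into one HTML string and trim the surrounding newlines.
blocks_by_id = {key: ''.join(parts).strip() for key, parts in blocks_by_id.items()}

print(blocks_by_id['1'])
```

Looking up blocks_by_id['1'] then gives the content between the h1 with id 1 and the h1 with id 2, which is the case the question asks about.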

【Comments】: