【Question title】: Beautiful soup: Extract everything between two tags when these tags have different ids
【Posted】: 2022-01-24 23:55:31
【Question description】:

Beautiful soup: Extract everything between two tags

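I came across the question linked above, which shows how to get the content between two tags. I need to get the content between two tags when those tags have two different id attribute values.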


    <h1 id = 'beautiful' ></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1 id = 'good' ></h1>



I am using BeautifulSoup to extract data from an HTML file. I want to get all of the content between two tags. That is, if I have a section of HTML like this:


    <h1></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1></h1>

If I want all of the content between the first h1 and the second h1, the output would look like this:


    Text <i>here</i> has no tag
    <div>This is in a div</div>

    from bs4 import BeautifulSoup


    html_doc = '''
    This I <b>don't</b> want
    <h1></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1></h1>
    This I <b>don't</b> want too
    '''

    soup = BeautifulSoup(html_doc, 'html.parser')

    for c in list(soup.contents):
        if c is soup.h1 or c.find_previous('h1') is soup.h1:
            continue
        c.extract()

    for h1 in soup.select('h1'):
        h1.extract()

    print(soup)

Prints:

    Text <i>here</i> has no tag
    <div>This is in a div</div>

This works when the h1 tags have no id.

Can someone help me with this?

【Comments】:

    Tags: python html beautifulsoup tags


    【Solution 1】:

    The parent attribute and the decompose method may help you.

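    from bs4 import BeautifulSoup

    # Sample input based on the question, with ids on the two h1 tags.
    html_doc = '''
    This I <b>don't</b> want
    <h1 id="beautiful"></h1>
    Text <i>here</i> has no tag
    <div>This is in a div</div>
    <h1 id="good"></h1>
    This I <b>don't</b> want too
    '''

    # 1. Find the first item you are looking for.
    soup = BeautifulSoup(html_doc, 'html.parser')
    hElem = soup.find('h1', {'id': 'beautiful'})

    # 2. Find the second condition.
    endElem = soup.find('h1', {'id': 'good'})

    # 3. Get the parent element that contains both.
    hParent = hElem.parent  # Can be made more complex if multiple ancestors are needed to contain both conditions.

    # 4. Iterate over a copy of the children and remove everything outside the two markers.
    #    A copy is needed because removing nodes while iterating .children skips elements.
    inBetween = False
    for child in list(hParent.children):
        if child is endElem:
            inBetween = False
        if not inBetween:
            # extract() is used here because it is available on text nodes as well as tags.
            child.extract()
        if child is hElem:
            inBetween = True

    # Remaining data.
    print(hParent)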
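
As an alternative that leaves the parsed tree untouched, the content between the two markers can be collected by walking the siblings of the first tag until the second one is reached. The following is only a minimal sketch: `between_ids` is a hypothetical helper name, and it assumes both h1 tags share the same parent, as they do in the question's HTML.

```python
from bs4 import BeautifulSoup


def between_ids(soup, start_id, end_id):
    """Return the HTML between two sibling tags identified by id (hypothetical helper)."""
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        return None
    parts = []
    # next_siblings yields both tags and text nodes that follow `start`.
    for node in start.next_siblings:
        if node is end:
            return "".join(str(n) for n in parts)
        parts.append(node)
    return None  # `end` is not a later sibling of `start`


html_doc = """
This I <b>don't</b> want
<h1 id="beautiful"></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="good"></h1>
This I <b>don't</b> want too
"""

soup = BeautifulSoup(html_doc, "html.parser")
result = between_ids(soup, "beautiful", "good")
print(result)
```

Because nothing is removed from the soup, the same document can be queried again for other pairs of ids.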
    

    【Comments】:

      【Solution 2】:

      You can iterate over the soup's contents yourself and build a block of elements between each pair of <h1> tags:

      from bs4 import BeautifulSoup
      from bs4.element import Tag
      
      
      html = """
      <h1 id = '1' ></h1>
      Text1 <i>here</i> has no tag
      <div>This is in a div</div>
      <h1 id = '2' ></h1>
      Text2 <i>here</i> has no tag
      <div>This is in a div</div>
      <h1 id = '3' ></h1>
      Text3 <i>here</i> has no tag
      """
      
      soup = BeautifulSoup(html, "html.parser")
      
      block = []
      blocks = []
      h1 = False
      
      for el in soup.contents:
          if type(el) == Tag and el.name == 'h1':
              # Has a h1 tag been seen yet?
              if h1:
                  blocks.append(block)
                  block = []
              h1 = True
          elif h1:
              block.append(el)
      
      # Add any final elements (missing a next h1)
      if block:
          blocks.append(block)
              
      # Display each block as html soup
      for b in blocks:        
          soup.contents = b
          print(soup)        
          print("--------------")
          
      

      This example produces 3 such blocks of elements:

      Text1 <i>here</i> has no tag
      <div>This is in a div</div>
      
      --------------
      
      Text2 <i>here</i> has no tag
      <div>This is in a div</div>
      
      --------------
      
      Text3 <i>here</i> has no tag
      
      --------------
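
Since the question is about h1 tags that carry different ids, the same loop could also key each block by the id of the h1 that opens it. This is a sketch of one possible adaptation, not part of the original answer: the names `blocks_by_id` and `html_by_id` are made up for illustration, and content following an h1 without an id is simply skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<h1 id="1"></h1>
Text1 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="2"></h1>
Text2 <i>here</i> has no tag
<div>This is in a div</div>
<h1 id="3"></h1>
Text3 <i>here</i> has no tag
"""

soup = BeautifulSoup(html, "html.parser")

blocks_by_id = {}   # maps an h1 id to the list of nodes that follow it
current_id = None   # id of the most recent h1, or None before the first one

for el in soup.contents:
    if isinstance(el, Tag) and el.name == "h1":
        current_id = el.get("id")  # None if this h1 has no id
        if current_id is not None:
            blocks_by_id[current_id] = []
    elif current_id is not None:
        blocks_by_id[current_id].append(str(el))

# Join each block back into an HTML string.
html_by_id = {key: "".join(parts).strip() for key, parts in blocks_by_id.items()}

print(html_by_id["2"])
```

Looking up `html_by_id["1"]` then gives the content between the h1 with id 1 and the next h1, whatever its id is.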
              
          
      

      【Comments】:
