【问题标题】:Split Tag into paragraphs along nested <br> tags沿嵌套的 <br> 标签将标签拆分为段落
【发布时间】:2023-01-14 00:43:51
【问题描述】:

我已经在同一个问题上停留了一天半了,但似乎没有任何效果。我正在解析 HTML 文件并提取文本段落。但是,有些页面的结构如下:

<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>

我想要的输出是这样的:

<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>

我尝试了 BS4 replace_with 函数,但它似乎不起作用,因为我收到此错误:TypeError: 'NoneType' object is not callable

from bs4 import BeautifulSoup

html = "<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>"
soup = BeautifulSoup(html, "html.parser")
allparas = soup.find_all('p') #In the actual files there is more code

for p in allparas:
    if p.find_all(["br", "br/"]): #Some files don't have br tags
        for br in p.find_all(["br", "br/"]):
            new_p = br.new_tag('p', closed=True)
            br.replace_with(new_p)

我得到的最接近的是用字符串替换标签,但编码似乎出了问题:

if html.find_all(["br", "br/"]):
    for br in html.find_all(["br", "br/"]):
        br.replace_with("</p><p>")
        reslist = [p for p in html.find_all("p")]
        allparas = ''.join(str(p) for p in reslist) #Overwriting allparas here as I need it later

这有效,但我的打印输出如下:

<p>First paragraph.&lt;/p&gt;&lt;p&gt;Second paragraph.&lt;/p&gt;&lt;p&gt;Third paragraph.</p>

将字符串转换为 BS4 标签时出现问题。任何帮助将不胜感激!

【问题讨论】:

    标签: python html web-scraping beautifulsoup


    【解决方案1】:

    我会用 css 选择器(只是个人喜好)。在任何情况下,完全基于您的示例 html,您可以执行以下操作:

    for s in list(soup.strings):
        #wrap the text segments with a new tag
        s.wrap(soup.new_tag("p"))
    for br in soup.select('br'):
        #remove the original br tags
        br.extract()
    soup
    

    输出应该是您的预期输出。

    【讨论】:

      【解决方案2】:

      常规功能

      这是一个实现,它处理那些 &lt;br&gt; 标签(不仅仅是字符串)的任意兄弟标签:

      from bs4 import BeautifulSoup, Tag
      
      
      def breaks_to_paragraphs(
          tag: Tag,
          soup: BeautifulSoup,
          recursive: bool = False,
      ) -> None:
          """
          If `tag` contains <br> elements, it is split into `<p>` tags instead.
      
          The `<br>` tags are removed from `tag`.
          If no `<br>` tags are found, this function does nothing.
      
          Args:
              tag:
                  The `Tag` instance to mutate
              soup:
                  The `BeautifulSoup` instance the tag belongs to (for `new_tag`)
              recursive (optional):
                  If `True`, the function is applied to all nested tags recursively;
                  otherwise (default) only the children are affected.
          """
          elements = []
          contains_br = False
          for child in list(tag.children):
              if isinstance(child, Tag) and child.name != "br":
                  if recursive:
                      breaks_to_paragraphs(child, soup, recursive=recursive)
                  elements.append(child)
              elif not isinstance(child, Tag):  # it is a `NavigableString`
                  elements.append(child)
              else:  # it is a `<br>` tag
                  contains_br = True
                  p = soup.new_tag("p")
                  child.replace_with(p)
                  p.extend(elements)
                  elements.clear()
          if elements and contains_br:
              p = soup.new_tag("p")
              tag.append(p)
              p.extend(elements)
          soup.smooth()
      

      子类方法

      或者,由于您需要原始的 BeautifulSoup 实例来调用 new_tag 方法,您也可以将其子类化并将其实现为方法:

      from bs4 import BeautifulSoup, Tag
      
      
      class CustomSoup(BeautifulSoup):
          def breaks_to_paragraphs(self, tag: Tag, recursive: bool = False) -> None:
              """
              If `tag` contains <br> elements, it is split into `<p>` tags instead.
      
              The `<br>` tags are removed from `tag`.
              If no `<br>` tags are found, this method does nothing.
      
              Args:
                  tag:
                      The `Tag` instance to mutate
                  recursive (optional):
                      If `True`, the function is applied to all nested tags recursively;
                      otherwise (default) only the children are affected.
              """
              elements = []
              contains_br = False
              for child in list(tag.children):
                  if isinstance(child, Tag) and child.name != "br":
                      if recursive:
                          self.breaks_to_paragraphs(child, recursive=recursive)
                      elements.append(child)
                  elif not isinstance(child, Tag):  # it is a `NavigableString`
                      elements.append(child)
                  else:  # it is a `<br>` tag
                      contains_br = True
                      p = self.new_tag("p")
                      child.replace_with(p)
                      p.extend(elements)
                      elements.clear()
              if elements and contains_br:
                  p = self.new_tag("p")
                  tag.append(p)
                  p.extend(elements)
              self.smooth()
      

      演示

      这是一个快速测试:

      ...
      
      def main() -> None:
          html = """
          <p>
              First paragraph. <br/>
              Second paragraph.<br/> 
              <span>foo</span>
              <span>bar<br>baz</span>
          </p>
          """
          soup = CustomSoup(html, "html.parser")
          soup.breaks_to_paragraphs(soup.p)
          print(soup.p.prettify())
      
      
      if __name__ == "__main__":
          main()
      

      输出:

      <p>
       <p>
        First paragraph.
       </p>
       <p>
        Second paragraph.
       </p>
       <p>
        <span>
         foo
        </span>
        <span>
         bar
         <br/>
         baz
        </span>
       </p>
      </p>
      

      如果您改为使用 soup.breaks_to_paragraphs(soup.p, recursive=True) 调用它:

      <p>
       <p>
        First paragraph.
       </p>
       <p>
        Second paragraph.
       </p>
       <p>
        <span>
         foo
        </span>
        <span>
         <p>
          bar
         </p>
         <p>
          baz
         </p>
        </span>
       </p>
      </p>
      

      注意它是如何沿着嵌套的 &lt;br&gt; 拆分成 &lt;p&gt; 标签的。

      【讨论】:

        猜你喜欢
        • 2015-07-15
        • 2015-08-22
        • 2020-04-16
        • 2012-02-04
        • 2011-01-02
        • 1970-01-01
        • 1970-01-01
        • 2014-01-17
        • 2013-02-22
        相关资源
        最近更新 更多