沿嵌套的 标签将标签拆分为段落答案

【问题标题】：Split Tag into paragraphs along nested tags沿嵌套的 标签将标签拆分为段落
【发布时间】：2023-01-14 00:43:51
【问题描述】：

我已经在同一个问题上停留了一天半了，但似乎没有任何效果。我正在解析 HTML 文件并提取文本段落。但是，有些页面的结构如下：

<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>

我想要的输出是这样的：

<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>

我尝试了 BS4 replace_with 函数，但它似乎不起作用，因为我收到此错误：TypeError: 'NoneType' object is not callable：

from bs4 import BeautifulSoup

html = "<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>"
soup = BeautifulSoup(html, "html.parser")
allparas = soup.find_all('p') #In the actual files there is more code

for p in allparas:
    if p.find_all(["br", "br/"]): #Some files don't have br tags
        for br in p.find_all(["br", "br/"]):
            new_p = br.new_tag('p', closed=True)
            br.replace_with(new_p)

我得到的最接近的是用字符串替换标签，但编码似乎出了问题：

if html.find_all(["br", "br/"]):
    for br in html.find_all(["br", "br/"]):
        br.replace_with("</p><p>")
        reslist = [p for p in html.find_all("p")]
        allparas = ''.join(str(p) for p in reslist) #Overwriting allparas here as I need it later

这有效，但我的打印输出如下：

<p>First paragraph.&lt;/p&gt;&lt;p&gt;Second paragraph.&lt;/p&gt;&lt;p&gt;Third paragraph.</p>

将字符串转换为 BS4 标签时出现问题。任何帮助将不胜感激！

【问题讨论】：

标签： python html web-scraping beautifulsoup

【解决方案1】：

我会用 css 选择器（只是个人喜好）。在任何情况下，完全基于您的示例 html，您可以执行以下操作：

for s in list(soup.strings):
    #wrap the text segments with a new tag
    s.wrap(soup.new_tag("p"))
for br in soup.select('br'):
    #remove the original br tags
    br.extract()
soup

输出应该是您的预期输出。

【讨论】：

【解决方案2】：

常规功能

这是一个实现，它处理那些   标签（不仅仅是字符串）的任意兄弟标签：

from bs4 import BeautifulSoup, Tag


def breaks_to_paragraphs(
    tag: Tag,
    soup: BeautifulSoup,
    recursive: bool = False,
) -> None:
    """
    If `tag` contains <br> elements, it is split into `<p>` tags instead.

    The `<br>` tags are removed from `tag`.
    If no `<br>` tags are found, this function does nothing.

    Args:
        tag:
            The `Tag` instance to mutate
        soup:
            The `BeautifulSoup` instance the tag belongs to (for `new_tag`)
        recursive (optional):
            If `True`, the function is applied to all nested tags recursively;
            otherwise (default) only the children are affected.
    """
    elements = []
    contains_br = False
    for child in list(tag.children):
        if isinstance(child, Tag) and child.name != "br":
            if recursive:
                breaks_to_paragraphs(child, soup, recursive=recursive)
            elements.append(child)
        elif not isinstance(child, Tag):  # it is a `NavigableString`
            elements.append(child)
        else:  # it is a `<br>` tag
            contains_br = True
            p = soup.new_tag("p")
            child.replace_with(p)
            p.extend(elements)
            elements.clear()
    if elements and contains_br:
        p = soup.new_tag("p")
        tag.append(p)
        p.extend(elements)
    soup.smooth()

子类方法

或者，由于您需要原始的 BeautifulSoup 实例来调用 new_tag 方法，您也可以将其子类化并将其实现为方法：

from bs4 import BeautifulSoup, Tag


class CustomSoup(BeautifulSoup):
    def breaks_to_paragraphs(self, tag: Tag, recursive: bool = False) -> None:
        """
        If `tag` contains <br> elements, it is split into `<p>` tags instead.

        The `<br>` tags are removed from `tag`.
        If no `<br>` tags are found, this method does nothing.

        Args:
            tag:
                The `Tag` instance to mutate
            recursive (optional):
                If `True`, the function is applied to all nested tags recursively;
                otherwise (default) only the children are affected.
        """
        elements = []
        contains_br = False
        for child in list(tag.children):
            if isinstance(child, Tag) and child.name != "br":
                if recursive:
                    self.breaks_to_paragraphs(child, recursive=recursive)
                elements.append(child)
            elif not isinstance(child, Tag):  # it is a `NavigableString`
                elements.append(child)
            else:  # it is a `<br>` tag
                contains_br = True
                p = self.new_tag("p")
                child.replace_with(p)
                p.extend(elements)
                elements.clear()
        if elements and contains_br:
            p = self.new_tag("p")
            tag.append(p)
            p.extend(elements)
        self.smooth()

演示

这是一个快速测试：

...

def main() -> None:
    html = """
    <p>
        First paragraph. <br/>
        Second paragraph.<br/> 
        <span>foo</span>
        <span>bar<br>baz</span>
    </p>
    """
    soup = CustomSoup(html, "html.parser")
    soup.breaks_to_paragraphs(soup.p)
    print(soup.p.prettify())


if __name__ == "__main__":
    main()

输出：

<p>
 <p>
  First paragraph.
 </p>
 <p>
  Second paragraph.
 </p>
 <p>
  <span>
   foo
  </span>
  <span>
   bar
   <br/>
   baz
  </span>
 </p>
</p>

如果您改为使用 soup.breaks_to_paragraphs(soup.p, recursive=True) 调用它：

<p>
 <p>
  First paragraph.
 </p>
 <p>
  Second paragraph.
 </p>
 <p>
  <span>
   foo
  </span>
  <span>
   <p>
    bar
   </p>
   <p>
    baz
   </p>
  </span>
 </p>
</p>

注意它是如何沿着嵌套的   拆分成  标签的。

【讨论】：