BeautifulSoup: split tag (containing other tags) into two at string

【Question title】: BeautifulSoup: split tag (containing other tags) into two at string
【Posted】: 2021-08-05 23:51:19
【Question description】:

I'm converting some HTML dictionary data to XML so that it can be imported into some dictionary software.

The original HTML looks like this:

<div class="entry">
  <span class="headword">word</span> 
  <span class="pos">part of speech</span> 
  <span class="definition">sense1; sense2 
    <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
  </span> 
</div>

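Edit: In fact, the input classes do not exactly match the output XML tags. They match in my example only to illustrate the relationship. I need to replace specific classes with specific XML tags, but the names differ.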

The desired final result looks like this:

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <sense>
    <definition>sense1</definition>
  </sense>
  <sense>
    <definition>sense2</definition>
    <example>example2.1</example>
    <example>example2.2</example>
  </sense>
  <sense>
    <definition>sense3</definition>
    <example>example3.1</example>
    <example>example3.2</example>
  </sense>
</entry>

The current state of my soup (with the simple replacements done) is:

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <definition>sense1; sense2
    <example>example2.1</example>
    <example>example2.2</example>
    ; sense3 
    <example>example3.1</example>
    <example>example3.2</example>
  </definition>
</entry>

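Mapping the 1:1 divisions is easy, and wrapping each definition plus its examples in a sense tag should be too. The problem is that the original uses ; indiscriminately to separate both senses and examples. This means I need to split the example tags first, and then split the definition tags at each ; (i.e. effectively replace ; with </example>\n<example> or </definition>\n<definition>). Since I started writing this question, I have worked out how to do this for the examples (because they contain only strings), but the definitions may well contain other tags themselves, so I can't simply use split(): the contents come back as a list, and I get 'list' object has no attribute 'split'.

Is there an easier way to split a tag that contains other tags, or do I have to iterate over the list of contents and recreate all the tags?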

import re

# Assumes `soup` is the BeautifulSoup object after the simple replacements above.
tags = soup.find_all("example")
for tag in tags:
    tag.string = re.sub(r"[()]", "", tag.string)     # remove parentheses
    egs = tag.string.split("; ")     # or str(tag.contents).split("; ") ?
    if len(egs) > 1:
        # insert in reverse so the new tags end up in the original order
        for eg in reversed(egs[1:]):
            new = soup.new_tag("example")
            new.string = eg
            tag.insert_after(new)
        tag.string = egs[0]             # orig tag becomes 1st seg only
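
The reason split() works for the examples but not for the definitions can be reproduced in a few lines. This is only a minimal sketch on a shortened, made-up fragment of the soup above: when a tag holds both text and a nested tag, .string is None and .contents is a list of mixed nodes, so there is no single string to split.

```python
from bs4 import BeautifulSoup

# Shortened, made-up fragment of the intermediate soup (an assumption for illustration).
fragment = "<definition>sense1; sense2 <example>example2.1</example></definition>"
soup = BeautifulSoup(fragment, "html.parser")
tag = soup.definition

# .string is only set when a tag has exactly one child.
print(tag.string)

# .contents is a plain list mixing NavigableString and Tag objects.
for child in tag.contents:
    print(type(child).__name__, repr(str(child)))
```

Each text node in that list can still be split on its own, which is what both approaches on this page rely on.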

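【Question discussion】:

I didn't have time to look at this again earlier, but now that I have some time I'm surprised to see one fewer answer than before. Odd. I wonder whether it was because of a downvote. For the record, I did not downvote.
Could you make clear what the input is and what the expected output is?

Tags: python html xml beautifulsoup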

【Solution 1】:

You can inspect each element's contents and build the structure by recursing into the non-string elements of those contents:

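from bs4 import BeautifulSoup, NavigableString
import re

def to_xml(d):
    # r collects (text segments, child tags that follow them); s and k hold the current pair
    r, s, k = [], None, []
    # skip whitespace-only strings between tags so they are not treated as text segments
    for i in filter(lambda x: not (isinstance(x, NavigableString) and not x.strip()), d.contents):
        if isinstance(i, NavigableString):
            if s is not None:
                r.append((s, k))
            # drop enclosing parentheses, split on '; ', trim non-word characters from each piece
            s = [j for part in re.sub(r'^\(|\)$', '', i).split('; ') if (j := re.sub(r'^\W+|\W+$', '', part))]
            k = []
        else:
            k.append(i)
    r.append((s, k))
    for a, b in r:
        if a is not None:
            if len(a) == 1 and not b:
                # a single text segment without child tags: one tag named after the class
                yield f'<{(c:=" ".join(d["class"]))}>{a[0]}</{c}>\n'
            elif not b:
                # The wrapper name c is derived from the text by stripping trailing digits and dots
                # ('sense1' -> 'sense'), so this relies on the idealized sample input.
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format(c, c1, i, c1, c) if (c:=re.sub(r'[\d+\.]+$', '', i)) != (c1:=" ".join(d["class"])) else f"<{c}>{i}</{c}>" for i in a]
            else:
                # all segments but the last get their own wrapper; the last one also takes the child tags
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', i)), (c1:=" ".join(d["class"])), i, c1, c) for i in a[:-1]]
                yield "<{}>\n<{}>{}</{}>\n{}\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', a[-1])), (c1:=' '.join(d['class'])), a[-1], c1, '\n'.join(j for k in b for j in to_xml(k)), c)
        else:
            # no text at all: wrap the converted children in a tag named after the class
            yield '<{}>{}</{}>'.format((c1:=" ".join(d["class"])), "\n".join(j for k in b for j in to_xml(k)), c1)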

s = """
 <div class="entry">
   <span class="headword">word</span> 
   <span class="pos">part of speech</span> 
   <span class="definition">sense1; sense2 
   <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
   </span> 
 </div>
"""
r = BeautifulSoup(''.join(to_xml(BeautifulSoup(s, 'html.parser').div)), 'html.parser')
print(r)

Output (indentation added for readability):

<entry>
   <headword>word</headword>
   <pos>part of speech</pos>
   <sense>
      <definition>sense1</definition>
   </sense>
   <sense>
      <definition>sense2</definition>
      <example>example2.1</example>
      <example>example2.2</example>
   </sense>
   <sense>
      <definition>sense3</definition>
      <example>example3.1</example>
      <example>example3.2</example>
   </sense>
</entry>

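【Discussion】:

I have to admit this is very hard for me to follow, but I suspect I sabotaged myself by providing idealized input. In reality, the input classes don't match the XML output tags at all: they are related, but the names are different.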
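
Following up on that point, here is a minimal sketch of a variant that takes the output tag names from an explicit mapping instead of deriving them from the text. It is not the answerer's method. The names TAG_FOR_CLASS, DEFINITION_CLASS, split_senses and convert are hypothetical, the mapping values are placeholders for whatever the real tags are, and it assumes that every ;-separated piece of text inside a definition starts a new sense and that any nested tag holds examples for the most recent sense.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical mapping from input class to output tag; the values can be any tag names.
TAG_FOR_CLASS = {"headword": "headword", "pos": "pos"}
DEFINITION_CLASS = "definition"   # class whose content is split into senses (assumption)

def split_senses(out, span):
    """Turn one definition span into a list of <sense> tags."""
    senses = []
    for child in span.children:
        if isinstance(child, NavigableString):
            # Each non-empty piece of text between ';' starts a new sense.
            for piece in child.split(";"):
                piece = piece.strip()
                if piece:
                    sense = out.new_tag("sense")
                    definition = out.new_tag("definition")
                    definition.string = piece
                    sense.append(definition)
                    senses.append(sense)
        elif senses:
            # A nested tag: treat its text as examples of the most recent sense.
            text = child.get_text().strip().strip("()")
            for piece in text.split(";"):
                piece = piece.strip()
                if piece:
                    example = out.new_tag("example")
                    example.string = piece
                    senses[-1].append(example)
    return senses

def convert(entry_div):
    """Build a new <entry> tag from one div.entry element."""
    out = BeautifulSoup("", "html.parser")
    entry = out.new_tag("entry")
    for span in entry_div.find_all("span", recursive=False):
        cls = span["class"][0]
        if cls == DEFINITION_CLASS:
            for sense in split_senses(out, span):
                entry.append(sense)
        else:
            # Fall back to the class name when no mapping is given.
            new = out.new_tag(TAG_FOR_CLASS.get(cls, cls))
            new.string = span.get_text(strip=True)
            entry.append(new)
    return entry

html = """
<div class="entry">
  <span class="headword">word</span>
  <span class="pos">part of speech</span>
  <span class="definition">sense1; sense2
    <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
  </span>
</div>
"""
result = convert(BeautifulSoup(html, "html.parser").div)
print(result.prettify())
```

Building fresh tags with new_tag() avoids having to split an existing tag in place, at the cost of recreating every tag, which is the trade-off raised in the question.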