How to extract tag offsets in an XML document using Python BeautifulSoup

【Question title】: How to extract tag offsets in xml document using Python BeautifulSoup
【Posted】: 2015-02-25 13:01:41
【Question】:

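I need some help finding the text offsets of certain tags in an XML document. I have a dataset in the format shown below, where the ROOT element contains several RECORD elements and each RECORD contains exactly one TEXT element. The text may contain several TAG elements that annotate parts of it. I need to convert these annotations to another format, which requires the begin and end offsets of each tag, and I need to do this in Python.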

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
        </TEXT>
    </RECORD>
</ROOT>

Basically, I want to convert the format above into the following:

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at December 29th to illustrate the problem.
        </TEXT>
        <TAG TYPE="DATE" BEGIN=36 END=49/>
    </RECORD>
</ROOT>

I tried using BeautifulSoup but couldn't find a way to extract the tag offsets. Any ideas?

Thanks for your help!

/Jacob

【Question comments】:

Why was this downvoted?

Tags: python xml annotations beautifulsoup

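【Solution 1】:

The idea is to iterate over all TEXT nodes, find all TAG nodes inside each one, get the position of each TAG's text within the TEXT's text, create a new tag at the RECORD level, and then unwrap() the TAG out of TEXT: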

from bs4 import BeautifulSoup

data = """
<ROOT>
    <RECORD ID="123">
        <TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
        </TEXT>
    </RECORD>
</ROOT>
"""

soup = BeautifulSoup(data, "xml")

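# Process each TEXT element and its inline TAG annotations.
for text in soup.find_all('TEXT'):
    record = text.parent
    for tag in text.find_all('TAG'):
        # index() returns the first match of the TAG's text inside TEXT.
        begin = text.text.index(tag.text)
        end = len(tag.text) + begin

        # Add an empty TAG carrying the offsets at the RECORD level.
        # Note: the original attributes (such as TYPE) are not copied.
        record.append(soup.new_tag(tag.name, BEGIN=begin, END=end))

        # Remove the TAG markup but keep its text inside TEXT.
        tag.unwrap()

print(soup)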

Prints:

<?xml version="1.0" encoding="utf-8"?>
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at December 29th to illustrate the problem.
        </TEXT>
<TAG BEGIN="36" END="49"/></RECORD>
</ROOT>

Note: I haven't tested this with multiple TAGs inside a single TEXT, but it should at least give you a starting point.

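【Comments】:

Thanks for your answer. It does run into a problem when several tags have the same content, but I'll work that out.
This is a very clever trick. Thanks for the idea.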
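
The duplicate-content problem mentioned in the comment above comes from looking up each TAG's text with index(), which always returns the first match. One possible way around it is to walk the children of TEXT in document order and keep a running character count, so that each offset follows from position rather than from a search. The following is only a sketch and not the answerer's method: it swaps BeautifulSoup for the standard-library xml.etree.ElementTree, the helper name convert_record and the sample data are made up for illustration, and it assumes TAG elements are direct, non-nested children of TEXT. Offsets are counted from the start of the unstripped TEXT content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two annotations that have identical content.
SAMPLE = """<ROOT>
    <RECORD ID="123">
        <TEXT>Due <TAG TYPE="DATE">May 1st</TAG>, not <TAG TYPE="DATE">May 1st</TAG> next year.</TEXT>
    </RECORD>
</ROOT>"""


def convert_record(record):
    """Move inline TAG annotations out of TEXT into stand-off TAG elements."""
    text_elem = record.find("TEXT")
    parts = [text_elem.text or ""]
    pos = len(parts[0])  # running offset into the plain text
    annotations = []

    # Iterate over a copy, because children are removed inside the loop.
    for child in list(text_elem):
        inner = "".join(child.itertext())
        attrs = dict(child.attrib)  # keep original attributes such as TYPE
        attrs["BEGIN"] = str(pos)
        attrs["END"] = str(pos + len(inner))
        annotations.append(ET.Element(child.tag, attrs))

        # The tail is the text between this child and the next one.
        tail = child.tail or ""
        parts.extend([inner, tail])
        pos += len(inner) + len(tail)
        text_elem.remove(child)

    text_elem.text = "".join(parts)
    record.extend(annotations)  # stand-off tags go after TEXT in RECORD
    return text_elem.text, annotations


root = ET.fromstring(SAMPLE)
for record in root.iter("RECORD"):
    text, annotations = convert_record(record)
    for ann in annotations:
        begin, end = int(ann.get("BEGIN")), int(ann.get("END"))
        print(ann.get("TYPE"), begin, end, repr(text[begin:end]))
# DATE 4 11 'May 1st'
# DATE 17 24 'May 1st'
```

Because the offsets are accumulated rather than searched for, the second "May 1st" gets its own position instead of repeating the first one.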

【Solution 2】:

Using lxml.etree:

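from lxml import etree

# `data` is the XML string defined in Solution 1.
root = etree.fromstring(data)

# Snapshot the TAG elements first, because the loop modifies the tree.
# Like the original, this handles one TAG per TEXT element.
for tag in list(root.iter("TAG")):
    tag_text = tag.text.strip()
    attrs = dict(tag.attrib)  # keep original attributes such as TYPE
    text_elem = tag.getparent()
    record = text_elem.getparent()

    # Merge the TAG's text into TEXT, dropping the TAG markup itself.
    etree.strip_tags(text_elem, "TAG")

    # Offsets are relative to the whitespace-stripped TEXT content.
    text = text_elem.text.strip()
    begin = text.find(tag_text)
    end = begin + len(tag_text)

    # Build a fresh element for each annotation.
    new_tag = etree.Element("TAG", attrs)
    new_tag.set("BEGIN", str(begin))
    new_tag.set("END", str(end))

    # Insert the new TAG right after TEXT inside RECORD.
    record.insert(record.index(text_elem) + 1, new_tag)

print(etree.tostring(root).decode())

Prints:
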
<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at December 29th to illustrate the problem.
        </TEXT>
    <TAG TYPE="DATE" BEGIN="35" END="48"/></RECORD>
</ROOT>
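
The two answers report different offsets for the same annotation (36 to 49 in Solution 1, 35 to 48 in Solution 2). The difference appears to come from whitespace: Solution 1 searches the raw TEXT content, which in its sample data begins with a newline, while Solution 2 calls strip() before searching. A small string-only sketch shows the effect; the variable names are illustrative, and the string is the TEXT content from Solution 1's data once the TAG markup is removed.

```python
# TEXT content from Solution 1's sample data without the TAG markup:
# a leading newline, the sentence, then a newline and eight spaces.
raw = "\nThis is an example text written at December 29th to illustrate the problem.\n        "
needle = "December 29th"

# Offset in the raw content (what Solution 1 computes).
raw_begin = raw.index(needle)
# Offset after stripping surrounding whitespace (what Solution 2 computes).
stripped_begin = raw.strip().index(needle)

print(raw_begin, raw_begin + len(needle))            # 36 49
print(stripped_begin, stripped_begin + len(needle))  # 35 48
```

Whichever convention is chosen, the consumer of the BEGIN and END attributes has to slice the same version of the text, stripped or unstripped, for the offsets to line up.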

【Comments】: