How can I associate XML text with a preceding empty element in Python?

【Question title】: How can I associate xml text with a preceding empty element in Python?
【Posted】: 2013-12-21 21:29:02
【Question】:

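I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having trouble associating text that appears after an empty element with that empty element's tag. The real XML is considerably more complex than what I have pasted below, but I have simplified it to make the problem clearer (I hope!).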

The result I want is a dictionary like this:

Desired result

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}

The tuples could also contain strings (e.g., ('9', '1')). I really don't care at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9">
  <p>
    <section num="1"/> <!-- The empty element -->
      As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
      poverty, itch, and pride.
  </p>
</div1>

What I have tried

Attempt 1

>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('test1.xml')
>>> root = tree.getroot()
>>> chapter = root.attrib['num']
>>> d = dict()
>>> for p in root:
    for section in p:
        d[(int(chapter), int(section.attrib['num']))] = section.text


>>> d
{(9, 2): None, (9, 1): None}    # This of course makes sense, since the elements are empty

Attempt 2

>>> for p in root:
    for section, text in zip(p, p.itertext()):    # unfortunately, p and p.itertext() are two different lengths, which also makes sense
        d[(int(chapter), int(section.attrib['num']))] = text.strip()


>>> d
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}

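As you can see in the second attempt, p and p.itertext() have different lengths. The value stored under (9, 2) is the one I want associated with the key (9, 1), and the value I want associated with (9, 2) does not appear in d at all (because zip stops at the shorter of its inputs, cutting off the longer p.itertext()).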
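The length mismatch can be checked directly. The following is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree module; the variable names (xml_text, children, texts) are illustrative, and the sample XML omits the inline comments for brevity. It lists the children of p next to the text chunks that p.itertext() yields.

```python
import xml.etree.ElementTree as ET

xml_text = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/>
      As they say, A student has usually three maladies:
    <section num="2"/>
      poverty, itch, and pride.
  </p>
</div1>"""

root = ET.fromstring(xml_text)
p = root.find('p')

children = list(p)          # the two empty <section> elements
texts = list(p.itertext())  # every non-empty text chunk inside <p>, in document order

print(len(children), len(texts))
for chunk in texts:
    # The first chunk is the whitespace between <p> and the first <section>,
    # which is why zip() pairs section 1 with an empty string after strip().
    print(repr(chunk.strip()))
```

The extra leading chunk shifts every later string by one position, which matches the off-by-one pairing seen in Attempt 2.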

Any help would be greatly appreciated. Thanks in advance.

【Question comments】:

Tags: python xml xml.etree

【Solution 1】:

Have you tried using .tail?

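# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET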

txt = """<div1 type="chapter" num="9">
         <p>
           <section num="1"/> <!-- The empty element -->
             As they say, A student has usually three maladies: <!-- Here lies the trouble -->
           <section num="2"/> <!-- Another empty element -->
             poverty, itch, and pride.
         </p>
         </div1>"""
root = ET.fromstring(txt)
for p in root:
    for s in p:
        print(s.attrib['num'], s.tail)
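Building on .tail, the dictionary the question asks for can be assembled as follows. This is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree module; the function name sections_by_chapter is hypothetical, and the sample XML omits the inline comments for brevity.

```python
import xml.etree.ElementTree as ET

xml_text = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/>
      As they say, A student has usually three maladies:
    <section num="2"/>
      poverty, itch, and pride.
  </p>
</div1>"""


def sections_by_chapter(text):
    """Map (chapter, section) to the text that follows each empty <section>.

    Hypothetical helper: it assumes the root element carries the chapter
    number in its 'num' attribute, as in the sample above.
    """
    root = ET.fromstring(text)
    chapter = int(root.attrib['num'])
    result = {}
    # iter('section') walks every <section> descendant in document order.
    for section in root.iter('section'):
        # .tail holds the text between this element's end and the next tag;
        # it is None when nothing follows, hence the "or ''" fallback.
        key = (chapter, int(section.attrib['num']))
        result[key] = (section.tail or '').strip()
    return result


d = sections_by_chapter(xml_text)
print(d)
```

The keys are built with int() to match the desired result in the question; dropping the int() calls would give string keys such as ('9', '1') instead.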

【Comments】:

Fantastic. Works like a charm. Thanks.

【Solution 2】:

I would use BeautifulSoup for this:

from bs4 import BeautifulSoup

html_doc = """<div1 type="chapter" num="9">
  <p>
    <section num="1"/>
      As they say, A student has usually three maladies:
    <section num="2"/>
      poverty, itch, and pride.
  </p>
</div1>"""

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')

result = {}
for chapter in soup.find_all(type='chapter'):
    for section in chapter.find_all('section'):
        result[(chapter['num'], section['num'])] = section.next_sibling.strip()

import pprint
pprint.pprint(result)

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:',
 (u'9', u'2'): u'poverty, itch, and pride.'}

【Comments】: