Python: extract text from tag inside tag in XML Tree

【Question title】: Python: extract text from tag inside tag in XML Tree
【Posted】: 2017-08-08 20:59:54
【Question】:

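I am currently parsing a Wikipedia dump, trying to extract some useful information. The dump is XML, and I only want to extract the text/content of each page. Now I am wondering how to find all the text inside a tag that is itself nested inside another tag. I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve: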

  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>

  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </example_tag>

How would I extract the text inside the text tag, but only when it is contained in the revision tree?

【Comments】:

Tags: python xml tags extract

【Solution 1】:

You can use the xml.etree.ElementTree module for this, together with an XPath query:

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(where the_xml_string is a string containing the XML code).
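As a minimal runnable sketch of the query above, the snippet below uses text as the inner tag and a shortened version of the sample from the question. The page wrapper element and the text placed in example_tag are assumptions added for illustration: ET.fromstring needs a single root element, and the two fragments in the question have none.

```python
import xml.etree.ElementTree as ET

# Hypothetical <page> wrapper: ET.fromstring requires exactly one root element,
# so the two sibling fragments are placed inside an assumed parent.
the_xml_string = """
<page>
  <revision>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <example_tag>
    <text>Text outside any revision.</text>
  </example_tag>
</page>
"""

root = ET.fromstring(the_xml_string)

# './/revision/text' matches <text> elements that are direct children of any
# <revision> found below the root; the <text> in <example_tag> is skipped.
revision_texts = [el.text for el in root.findall('.//revision/text')]
print(revision_texts)
```

Only the text element under revision should be selected here, which is the filtering the question asks for.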

Or obtain a list of the text contents with a list comprehension:

import xml.etree.ElementTree as ET

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

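So .text holds the inner text of each matched element. Note that you have to replace othertag with the actual tag name (for instance text). If that tag can sit arbitrarily deep inside the revision tag, rather than being a direct child, use .//revision//othertag as the XPath query instead.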
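The difference between the two queries can be sketched with a small made-up document. The page and wrapper elements and their contents are hypothetical and serve only to put one text element directly under revision and another one level deeper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <text> is a direct child of <revision>,
# the other is nested one level deeper inside an assumed <wrapper>.
nested_xml = """
<page>
  <revision>
    <text>direct child</text>
    <wrapper>
      <text>nested deeper</text>
    </wrapper>
  </revision>
</page>
"""

root = ET.fromstring(nested_xml)

# Single slash: only <text> elements that are direct children of <revision>.
direct = [el.text for el in root.findall('.//revision/text')]

# Double slash: <text> elements at any depth below <revision>.
any_depth = [el.text for el in root.findall('.//revision//text')]

print(direct)
print(any_depth)
```

Two caveats apply to both queries. The leading .// searches below the element that findall is called on, so a revision element that is itself the root is not matched. And if the parsed file declares a default XML namespace, the tag names in the query have to be namespace-qualified for ElementTree to match them.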

【Comments】: