Extract XML text when elements appear in between text: answers

【Question title】: Extract xml text when elements in between text
【Posted】: 2019-01-31 11:41:06
【Question】:

I have this XML file:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see the documentation).

This is the simple code I use to parse and explore the file:

import xml.etree.ElementTree as ET

tree = ET.parse(file_path)  # file_path: path to the XML file shown above
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)

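Everything works, except that the <P> element does not come back with its full text. In particular, I seem to be missing "but then has some more stuff." (the text inside <P> that comes after the <af> element).

The XML file is given, so I cannot improve it, even if there is a better recommended way to write it (and there is far too much of it to fix by hand).

Is there a way to get all of the text?

The output my code produces (in case it helps) is this: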

do
{'title': 'Example document', 'date': 'today'}

db
{'descr': 'First level'}

P 
{}
        Some text here that

af
{'d': 'reference 1'}
continues

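Edit:

The accepted answer made me realise that I had not read the documentation as carefully as I should have. People with a related problem may also find .tail useful.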
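For readers who want to see what .tail holds, here is a minimal sketch that walks the tree recursively, like explore_element above, but also collects each child's tail. The helper name collect_text is hypothetical, and the sample string assumes the document is closed with </do>. In ElementTree, an element's .text is the text before its first child, while the text that follows a child's end tag is stored on that child as .tail, which is why printing only .text loses the end of <P>.

```python
import xml.etree.ElementTree as ET

# Sample document from the question; the closing </do> tag is assumed.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""


def collect_text(element):
    """Return the non-blank text pieces under element, in document order.

    Hypothetical helper written for illustration only.
    """
    pieces = []
    # .text is the text between the start tag and the first child.
    if element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        pieces.extend(collect_text(child))
        # .tail is the text between the child's end tag and the next tag.
        if child.tail and child.tail.strip():
            pieces.append(child.tail.strip())
    return pieces


root = ET.fromstring(XML)
# ElementTree tag matching is case-sensitive, so search for 'P', not 'p'.
p_text = collect_text(root.find('.//P'))
print(p_text)
```

The third piece in the list is the af element's tail, which is the text the original explore_element never printed.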

【Comments】:

Can we use BeautifulSoup?
I see you have provided a solution with it, that's perfect. Let me see if I can get it working!

Tags: python xml parsing xml-parsing

【Solution 1】:

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

Then:

from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    # html.parser lowercases tag names, so <P> is matched as 'p'
    soup = BeautifulSoup(f.read(), "html.parser")
    for line in soup.find_all('p'):
        print(line.text)

Output:

Some text here that
continues
but then has some more stuff.

Edit:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

Output:

Some text here that continues but then has some more stuff.
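The one-line sample above hides the whitespace issue in the real file, where <P> spans several indented lines. As a rough sketch, assuming the document from the question is closed with </do>, itertext() can be applied to each <P> and the result normalised with split() and join(). The variable names here are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample document from the question; the closing </do> tag is assumed.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""

root = ET.fromstring(XML)

paragraphs = []
# ElementTree is case-sensitive: the tag in the file is 'P', not 'p'.
for p in root.iter('P'):
    # itertext() yields the .text and .tail strings of p and its
    # descendants in document order.
    raw = ''.join(p.itertext())
    # split() with no arguments splits on any run of whitespace, so the
    # join collapses indentation and newlines into single spaces.
    paragraphs.append(' '.join(raw.split()))

for text in paragraphs:
    print(text)
```

Collapsing whitespace this way is a choice that suits prose; if line breaks inside <P> are meaningful, the raw string should be kept instead.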
