Extract XML data with Python: answers

【Question title】: Extract XML-data with python
【Posted】: 2021-05-29 14:05:20
【Question】:

I have a large list of different authors and their selected works inside a <list> in XML format (the file is named bibliography.xml). Here is an example:

<list type="index">
                <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                        Cat 1843 (<abbr>Cat.</abbr>).</bibl> — <bibl>The Gold-Bug 1843
                            (<abbr>Bug.</abbr>).</bibl> — <bibl>The Raven 1845
                        (<abbr>Rav.</abbr>).</bibl></item>

                <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                            (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                        (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                            (<abbr>PolyL.</abbr>)</bibl></item>
                
                <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                        Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                            (<abbr>Gil.</abbr>)</bibl></item>
            </list>

import xml.etree.ElementTree as ET

tree = ET.parse('bibliography.xml')
root = tree.getroot()

for work in root:
    if(work.tag=='item'):
        print work.get('persName')
            if (attr.tag=='abbr')
                print (attr.text)

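Obviously it doesn't work, but since I'm completely new to Python, I can't wrap my head around what I'm doing wrong. If anyone could help me here, I would greatly appreciate it.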

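【Comments】:

Well, that's strange, because Oxygen and some other validators accept the XML. Keep in mind that I only posted a snippet of the <list>, not the whole TEI header, body, and so on.

Tags: python xml extract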

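【Solution 1】:

I tried the same approach as you and ran into the same problem. I had no choice but to convert the whole XML to pretty-printed XML and treat it as a single string, then iterate over it line by line looking for a specific tag.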

import xml.dom.minidom

dom = xml.dom.minidom.parse("bibliography.xml")
pretty_xml = dom.toprettyxml()
pretty_xml = pretty_xml.split("\n")
start, end = [], []  # line indices of the opening and closing "item" tags

for idx, line in enumerate(pretty_xml):
    if "item" in line:
        if "/" not in line:
            start.append(idx)  # opening <item>
        else:
            end.append(idx)  # closing </item>

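Now you know that the first data point lies between start[0] and end[0]. Likewise, iterate over all elements of the two lists in turn, using "if" conditions where needed. The structure looks something like this (I am not writing the whole code):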

for idx in range(len(start)):
    for line in pretty_xml[start[idx] + 1 : end[idx]]:
        line.split("persName")[1].replace("<","").replace(">","").replace("/","")
        ...
        ...

(If you find a better-structured approach, please let me know.)
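One possibly more structured variant stays with xml.dom.minidom but walks the DOM instead of splitting pretty-printed lines. This is only a sketch under assumptions: SAMPLE is a shortened stand-in for bibliography.xml, and node_text and extract are hypothetical helper names that do not appear in the answer above.

```python
import xml.dom.minidom

# Shortened stand-in for bibliography.xml (assumption: same element layout).
SAMPLE = """<list type="index">
  <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
      (<abbr>Ben.</abbr>)</bibl>, <bibl>Moby-Dick 1851 (<abbr>MobD.</abbr>)</bibl></item>
  <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
      Factor 1960 (<abbr>Fac.</abbr>)</bibl></item>
</list>"""


def node_text(node):
    """Join the direct text-node children of a DOM element."""
    return "".join(
        child.data for child in node.childNodes if child.nodeType == child.TEXT_NODE
    ).strip()


def extract(dom):
    """Return one (names, abbreviations) pair per <item> element."""
    results = []
    for item in dom.getElementsByTagName("item"):
        names = [node_text(n) for n in item.getElementsByTagName("persName")]
        abbrs = [node_text(n) for n in item.getElementsByTagName("abbr")]
        results.append((names, abbrs))
    return results


dom = xml.dom.minidom.parseString(SAMPLE)
for names, abbrs in extract(dom):
    print("; ".join(names), "->", " ".join(abbrs))
```

Because getElementsByTagName returns an empty list when a tag is absent, an <item> without <persName> yields an empty name list rather than an IndexError.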

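【Discussion】:

Thank you very much for your answer. I tried it, but I got the response: IndexError: list index out of range. There are also some solutions to my problem on Stack Overflow (stackoverflow.com/questions/37619848/…), but it still doesn't work.
Did the error occur in the first or the second part of the code snippet (the one I shared)? Were you able to populate the "start" and "end" lists?
The first part says "dom = xml.dom.minidom.parse("bibliography.xml")"; the second part says what I already mentioned above.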

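【Solution 2】:

Consider using XPath to get the data. Simply call tree.xpath("//item") to return all items.

Below is a working example based on the XML snippet. tree.getroot() only applies when the complete XML file is parsed; etree.fromstring() already returns the root element, so there is nothing to call getroot() on.

Basic working example: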

import lxml.etree as etree

xml = '''<list type="index">
            <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                    Cat 1843 <abbr>(Cat.).</abbr></bibl> — <bibl>The Gold-Bug 1843
                        <abbr>(Bug.)</abbr>.</bibl> — <bibl>The Raven 1845
                    <abbr>(Rav.)</abbr>.</bibl></item>

            <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                        (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                    (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                        (<abbr>PolyL.</abbr>)</bibl></item>
            
            <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                    Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                        (<abbr>Gil.</abbr>)</bibl></item>
        </list>
'''
tree = etree.fromstring(xml)
#root = tree.getroot()

for work in tree.xpath("//item"):
    persName = work.find('persName').text.strip()
    abbr =' '.join([x.text for x in work.xpath('bibl/abbr')])
    print (f'{persName} {abbr}')

Output:

Poe, Edgar Allan (Cat.). (Bug.) (Rav.)
Melville, Herman Ben. MobD. PolyL.
Barth, John Fac. Gil.
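
The same loop can also read the data from a file instead of an inline string. The sketch below swaps in the standard-library xml.etree.ElementTree for lxml, so it uses findall() and iter() rather than xpath(). SAMPLE is a shortened stand-in for the real data, the temporary file only imitates bibliography.xml, and abbreviations_by_author is a hypothetical helper name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the real data (assumption: same element layout).
SAMPLE = """<list type="index">
  <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
      (<abbr>Ben.</abbr>)</bibl>, <bibl>Moby-Dick 1851 (<abbr>MobD.</abbr>)</bibl></item>
  <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
      Factor 1960 (<abbr>Fac.</abbr>)</bibl></item>
</list>"""


def abbreviations_by_author(root):
    """Return one (name, abbreviations) pair per <item> element."""
    rows = []
    for item in root.iter("item"):
        name = item.findtext("persName", default="").strip()
        abbrs = [a.text for a in item.findall("bibl/abbr")]
        rows.append((name, abbrs))
    return rows


# From a string: fromstring() returns the root Element directly.
root_from_string = ET.fromstring(SAMPLE)

# From a file: parse() returns an ElementTree, so getroot() is needed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bibliography.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(SAMPLE)
    root_from_file = ET.parse(path).getroot()

for name, abbrs in abbreviations_by_author(root_from_file):
    print(name, " ".join(abbrs))
```

If the complete file declares a default namespace (TEI documents commonly use http://www.tei-c.org/ns/1.0), plain tag names such as "item" will not match and would need the {namespace} prefix. Whether that applies here cannot be told from the snippet alone.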

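【Discussion】:

Thank you very much, it worked. But only when an author has a single <persName>. If there are several <persName> elements per <item>, as in my case, it prints only one name followed by all the <abbr> values, without the other names. I'm also wondering whether I have to put the whole XML data into the script, or whether I can point the script to the XML file instead?
You can replace work.find('persName') with work.xpath('//persName') or work.findall('persName') and loop over the results with a for loop. If you provide an XML sample, I can update the answer.
Depending on the XML, you might also be able to do: for work in tree.xpath("//persName"):
Thanks for your reply. When I replace it with each of your suggestions, I get the response "AttributeError: 'list' object has no attribute 'text'". I will provide more XML data in the post above, I have just edited it. (PS: the list goes on in the same way for about 500 bibliographic entries.)
Running the XML from your question still gives the correct result. The error "AttributeError: 'list' object has no attribute 'text'" most likely means you are calling .text on a list (rather than on an individual item).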
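
To illustrate the last point: findall() and xpath() return lists, so .text has to be read from each element inside a loop or comprehension, not from the list itself. The sketch below uses the standard-library xml.etree.ElementTree rather than lxml. The two-author <item> is invented for illustration, since the asker's multi-name entries are not shown, and names_and_abbrs is a hypothetical helper name. It also assumes that every name in an <item> belongs with every abbreviation in that <item>.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first <item> carries two <persName> elements.
SAMPLE = """<list type="index">
  <item><persName>Author, One</persName>, <persName>Author, Two</persName>,
    <bibl>Shared Work 1900 (<abbr>ShW.</abbr>)</bibl></item>
  <item><persName>Melville, Herman</persName>, <bibl>Moby-Dick 1851
    (<abbr>MobD.</abbr>)</bibl></item>
</list>"""


def names_and_abbrs(item):
    """Collect all names and all abbreviations of a single <item>."""
    # findall() returns a list, so .text is read per element, not on the list.
    names = [(p.text or "").strip() for p in item.findall("persName")]
    abbrs = [(a.text or "").strip() for a in item.findall("bibl/abbr")]
    return names, abbrs


root = ET.fromstring(SAMPLE)
for item in root.iter("item"):
    names, abbrs = names_and_abbrs(item)
    print(f"{'; '.join(names)}: {' '.join(abbrs)}")
```

When using lxml's xpath() for the same purpose, note that an expression starting with // searches from the document root even when called on an <item>; a relative search needs .//persName or simply persName.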