【Question title】: Loop through XML in Python
【Posted】: 2021-04-21 14:38:12
【Question】:

My dataset looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        date="2021-01-15">
 <dept dept_id="00001" 
            col_two="00001value" 
            col_three="00001false"
            name = "some_name">     
    <owners>
      <currentowner col_four="00001value" 
                    col_five="00001value" 
                    col_six="00001false"
                    name = "some_name">
        <addr col_seven="00001value" 
                col_eight="00001value" 
                col_nine="00001false"/>
      </currentowner>
      <currentowner col_four="00001bvalue" 
                    col_five="00001bvalue" 
                    col_six="00001bfalse"
                    name = "some_name">
        <addr col_seven="00001bvalue" 
                col_eight="00001bvalue" 
                col_nine="00001bfalse"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" 
            col_two="00002value" 
            col_three="00002value"
            name = "some_name">
    <owners>
      <currentowner col_four="00002value" 
                    col_five="00002value" 
                    col_six="00002false"
                    name = "some_name">
        <addr col_seven="00002value" 
                col_eight="00002value" 
                col_nine="00002false"/>
      </currentowner>
    </owners>
  </dept> 
</depts>

Currently I have two loops: one iterates over the child data, the other over the grandchild data.

import pandas
import xml.etree.ElementTree as element_tree
from xml.etree.ElementTree import parse

tree = element_tree.parse('<HERE_GOES_XML>')
root = tree.getroot()
name_space = {'ns0': 'http://SOMELINK'}

#root
date_from = root.attrib['date']
print(date_from)

#child
for pharma in root.findall('.//ns0:dept', name_space):
    for key, value in pharma.items():
        print(key +': ' + value)
    
#grandchild: this must be merged into the loop above so the script iterates through an entire dept node before moving to the next
for owner in root.findall('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    owner_dict = {}
    
    for key, value in owner.items():
        print(key +': ' + value)

The current result is:

2021-01-15
dept_id: 00001
col_two: 00001value
col_three: 00001false
dept_id: 00002
col_two: 00002value
col_three: 00002value
col_four: 00001value
col_five: 00001value
col_six: 00001false
col_four: 00002value
col_five: 00002value
col_six: 00002false

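My goal is a nested loop that first iterates over an entire dept child together with its grandchildren, and only then moves on to the next dept. The expected result is shown below; later it will be converted into a pandas DataFrame (I will attempt that in the next step). Some columns have the same name at the child and grandchild levels, so I need either a prefix or a way to loop over specific children only.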

dept.dept_id: 00001
dept.col_two: 00001value
dept.col_three: 00001false
dept.name: some_name
currentowner.col_four: 00001value
currentowner.col_five: 00001value
currentowner.col_six: 00001false
currentowner.name: some_name

currentowner.col_four: 00001bvalue
currentowner.col_five: 00001bvalue
currentowner.col_six: 00001bfalse
currentowner.name: some_name

addr.col_seven: 00001value
addr.col_eight: 00001value
addr.col_nine: 00001false

dept.dept_id: 00002
dept.col_two: 00002value
dept.col_three: 00002value
dept.name: some_name
currentowner.col_four: 00002value
currentowner.col_five: 00002value
currentowner.col_six: 00002false
currentowner.name: some_name
addr.col_seven: 00002value
addr.col_eight: 00002value
addr.col_nine: 00002false

[Update] - I came across zip, which should do the trick.

dept_list = []
for item in root.iterfind('.//ns0:dept', name_space):
    #print(item.attrib)
    dept_list.append(item.attrib)
#print(dept_list)


owner_list = []
for item in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    #print(item.attrib)
    owner_list.append(item.attrib)
#print(owner_list)

zipped = zip(dept_list, owner_list)

【Question comments】:

    Tags: python xml pandas loops elementtree


    【Solution 1】:

    The looping can be done in a list comprehension, building each dict by navigating the DOM. The following code goes straight to a DataFrame.

    xml = """<depts xmlns="http://SOMELINK" 
            xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            date="2021-01-15">
      <dept dept_id="00001" 
                col_two="00001value" 
                col_three="00001false">
        <owners>
          <currentowner col_four="00001value" 
                        col_five="00001value" 
                        col_six="00001false">
            <addr col_seven="00001value" 
                    col_eight="00001value" 
                    col_nine="00001false"/>
          </currentowner>
        </owners>
      </dept>
      <dept dept_id="00002" 
                col_two="00002value" 
                col_three="00002value">
        <owners>
          <currentowner col_four="00002value" 
                        col_five="00002value" 
                        col_six="00002false">
            <addr col_seven="00002value" 
                    col_eight="00002value" 
                    col_nine="00002false"/>
          </currentowner>
        </owners>
      </dept> 
    </depts>"""
    
    import xml.etree.ElementTree as ET
    import pandas as pd
    
    root = ET.fromstring(xml)
    
    root.attrib
    ns = {'ns0': 'http://SOMELINK'}
    pd.DataFrame([{**d.attrib, 
      **d.find("ns0:owners/ns0:currentowner", ns).attrib, 
      **d.find("ns0:owners/ns0:currentowner/ns0:addr", ns).attrib}
     for d in root.findall("ns0:dept", ns)
    ])
    

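    A safer version

    If any dept has no currentowner or currentowner/addr, the version above fails, because find() returns None and None has no .attrib. Treat these elements as optional and traverse the DOM instead. The dict keys are now built from the element tag plus the attribute name. How you structure the comprehension depends on your data design: you need to consider 1-to-1, 1-to-optional and 1-to-many relationships. This really goes back to the paper Codd wrote in 1970.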

    import xml.etree.ElementTree as ET
    import pandas as pd
    
    root = ET.fromstring(xml)
    ns = {'ns0': 'http://SOMELINK'}
    pd.DataFrame([{**{f"{d.tag.split('}')[1]}.{k}":v for k,v in d.items()}, 
      **{f"{co.tag.split('}')[1]}.{k}":v  for k,v in co.items()}, 
      **{f"{addr.tag.split('}')[1]}.{k}":v for addr in co.findall("ns0:addr", ns) for k,v in addr.items()} }
     for d in root.findall("ns0:dept", ns)
     for co in d.findall("ns0:owners/ns0:currentowner", ns)
    ])
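
The same idea can be written with explicit loops, which makes the 1-to-optional and 1-to-many cases easier to see. This is only a sketch under assumptions: the helper names `prefixed` and `dept_rows` and the shortened `SAMPLE` document are made up for illustration and are not part of the original answer. Unlike the comprehension above, it also keeps a dept that has no currentowner at all.

```python
import xml.etree.ElementTree as ET

NS = {"ns0": "http://SOMELINK"}

# Shortened, hypothetical sample: one dept with two owners (one without addr),
# and one dept with no owners at all.
SAMPLE = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="d1">
    <owners>
      <currentowner col_four="a" name="o1">
        <addr col_seven="x"/>
      </currentowner>
      <currentowner col_four="b" name="o2"/>
    </owners>
  </dept>
  <dept dept_id="00002" name="d2"/>
</depts>"""


def prefixed(elem):
    """Return elem's attributes keyed as '<local tag>.<attribute>'."""
    local = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' part if present
    return {f"{local}.{key}": value for key, value in elem.items()}


def dept_rows(root):
    """Build one flat dict per (dept, currentowner) pair."""
    rows = []
    for dept in root.findall("ns0:dept", NS):
        base = prefixed(dept)
        owners = dept.findall("ns0:owners/ns0:currentowner", NS)
        if not owners:
            # 1-to-optional: keep the dept even though it has no owner
            rows.append(base)
            continue
        for owner in owners:  # 1-to-many: one row per owner
            row = {**base, **prefixed(owner)}
            addr = owner.find("ns0:addr", NS)
            if addr is not None:  # find() returns None when nothing matches
                row.update(prefixed(addr))
            rows.append(row)
    return rows


rows = dept_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should give one row per dict, with NaN in the columns a given row does not have. The prefixes keep `dept.name` and `currentowner.name` in separate columns.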
    
    

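    【Comments】:

    • This looks awesome! Since a few things have come up, I added some updates to the expected output.
    • I forgot that the name column appears in both nodes, so I need to add a child prefix in the loop, or a filter.
    • I replaced the source with my original file and got the error AttributeError: 'NoneType' object has no attribute 'attrib'. The file has about 20k records.
    • Follow the pattern of the other two elements... use **{f"{d.tag.split('}')[1]}.{k}":v for k,v in d.items()} instead of **d.attrib
    • Updated - see also the updated comments. Your follow-up use case is not really about technical Python / ElementTree / pandas usage; it is about understanding your data design and how to navigate it.
    【Solution 2】:

    You can do a depth-first search: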

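    from xml.etree import ElementTree

    root = ElementTree.parse('data.xml').getroot()
    ns = {'ns0': 'http://SOMELINK'}
    
    # attribute on the root element
    date_from = root.get('date')
    print(f'{date_from=}')
    
    for dept in root.findall('./ns0:dept', ns):
        # attributes of the dept element itself
        for key, value in dept.items():
            print(f'{key}: {value}')
        
        # attributes of every descendant of this dept, in document order
        for node in dept.findall('.//*'):
            for key, value in node.items():
                print(f'{key}: {value}')
                
        print()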
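
To keep same-named attributes apart, each printed key can carry the local tag name of the element it came from. The following is a rough sketch, not the answerer's code: `local_name`, `walk` and the shortened `SAMPLE` document are assumptions introduced for illustration. It relies on `Element.iter()`, which yields the element itself first and then its descendants in document order.

```python
import xml.etree.ElementTree as ET

NS = {"ns0": "http://SOMELINK"}

# Shortened, hypothetical sample based on the question's data.
SAMPLE = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def walk(dept):
    """Return 'tag.attribute: value' lines for dept and all its descendants."""
    lines = []
    for node in dept.iter():  # dept first, then descendants depth-first
        prefix = local_name(node.tag)
        for key, value in node.items():
            lines.append(f"{prefix}.{key}: {value}")
    return lines


root = ET.fromstring(SAMPLE)
for dept in root.findall("ns0:dept", NS):
    print("\n".join(walk(dept)))
    print()
```

Elements without attributes, such as owners, contribute no lines, so the output contains only dept, currentowner and addr entries.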
    

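    【Comments】:

    • Can we also add a prefix such as child.? I noticed that some columns in my final file have the same name at the child/grandchild level.
    • Or could we work out a mechanism to loop over specific children only?
    • You can use XPath to get the nodes you want. .//* means "look at all descendants". You also don't have to cram everything into the XPath expression; you can use Python string matching to pick out the ones you need. I don't have access to your actual file, so I'm working a bit blind here. Include a snippet I can look at.
    • I have updated my post with an extended example. It now fully represents the original file I am working with. Thanks!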
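
The two selection approaches mentioned in the comments can be sketched as follows. This is an illustration under assumptions: the `WANTED` set and the shortened `SAMPLE` document are made up for the example, and the real file may need different tag names.

```python
import xml.etree.ElementTree as ET

NS = {"ns0": "http://SOMELINK"}

# Shortened, hypothetical sample based on the question's data.
SAMPLE = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value">
        <addr col_seven="00001value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

root = ET.fromstring(SAMPLE)
dept = root.find("ns0:dept", NS)

# Option 1: let the XPath expression pick one element type at any depth below dept.
addrs = dept.findall(".//ns0:addr", NS)

# Option 2: take every descendant and filter on the local tag name in Python.
WANTED = {"currentowner", "addr"}
picked = [
    node
    for node in dept.findall(".//*")
    if node.tag.rsplit("}", 1)[-1] in WANTED  # strip '{namespace}' before comparing
]

print([a.attrib for a in addrs])
print([p.tag.rsplit("}", 1)[-1] for p in picked])
```

Option 1 is shorter when only one element type is needed; option 2 keeps document order across several element types in a single pass.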