【Question Title】: Parse XML Schema Definition to CSV with Python
【Posted】: 2014-06-24 15:36:58
【Question】:

I would like to parse the elements of an XML Schema Definition into a CSV file for documentation and analysis. My XSD has the following form:

<xs:element name="ELEMENT">
<xs:complexType>
    <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
    </xs:sequence>
</xs:complexType>
</xs:element>

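For a given element name, I would like to create a CSV containing element 1, element 2, element 3, and so on.

I have tried the Python lxml library, but have not yet been able to access or filter by an individual element.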

import xml.etree.ElementTree as ET

tree = ET.parse('doc.xsd')
root = tree.getroot()
# Print the tag and attributes of each top-level child of the schema root
for child in root:
    print(child.tag, child.attrib)

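【Comments】:

  • Do you want the elements as columns or as rows? By the way, the XML above is incomplete and not valid XML. Try updating it to a minimal working XSD file.
  • I recommend lxml. You have to install it, which takes a little time, but you get a very powerful package with strong XPath support, schema validation, and more. To follow up, go through the tutorial lxml provides; it will answer all of your questions.
  • Jan, thanks for the quick reply. I have the complete, valid XSD locally; this is just a snippet. I tried lxml but got stuck. With lxml, how do I find a specific element? Once I have found it, how do I access its child elements? By the way, a list of element1, element2, element3 would be enough.
  • The tutorial explains several ways. One of them is XPath.

Tags: python xml xsd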


【Solution 1】:

The following code shows how to search an XSD for element names.

from lxml import etree
xsdstr = """
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

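# Encode to bytes: under Python 3, lxml rejects str input that carries an encoding declaration
doc = etree.fromstring(xsdstr.strip().encode("utf-8"))

namespaces = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Collect the ref attribute of every xs:element in the schema
names = doc.xpath("//xs:element/@ref", namespaces=namespaces)
print(names)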

Running it prints:

['element 1', 'element 2', 'element 3']

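If your schema is more complex, you may need to target the names more precisely. One possible example:

print("trying more precise targeting ------")
# Only take refs that sit inside the sequence of the element named ELEMENT
names = doc.xpath("//xs:element[@name='ELEMENT']//xs:sequence/xs:element/@ref", namespaces=namespaces)
print(names)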

In our case, the result is the same.
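To get from the list of names to the CSV the question asks for, a minimal sketch could look like the following. It uses only the standard library (`xml.etree.ElementTree` and `csv`), as in the question's own snippet, so it runs without lxml. The function names `child_refs` and `write_refs_csv`, the column headers, and the one-row-per-child layout are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS_NS}

# Same schema shape as in the question (XML declaration omitted for brevity)
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ELEMENT">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="element 1"/>
        <xs:element ref="element 2"/>
        <xs:element ref="element 3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def child_refs(root, element_name):
    """Return the ref values in the sequence of the xs:element called element_name."""
    refs = []
    for parent in root.iter("{%s}element" % XS_NS):
        if parent.get("name") != element_name:
            continue
        # Only look at the direct complexType/sequence children of the match
        for child in parent.iterfind("xs:complexType/xs:sequence/xs:element", NS):
            ref = child.get("ref")
            if ref is not None:
                refs.append(ref)
    return refs


def write_refs_csv(element_name, refs, out):
    """Write a header plus one (element, child) row per ref to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["element", "child"])
    for ref in refs:
        writer.writerow([element_name, ref])


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XSD)
    write_refs_csv("ELEMENT", child_refs(root, "ELEMENT"), sys.stdout)
```

To write to a file instead of standard output, pass a handle opened with `open("out.csv", "w", newline="")`; if the children should be columns rather than rows, a single `writer.writerow(refs)` call would do.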

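【Comments】:

  • Thank you very much. This has definitely put me on the right track. I would upvote, but I don't have enough reputation yet. Thanks again.
【Solution 2】:

Below is an XSD-to-CSV parser. The same code can also parse XML with multiple nodes.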

import pandas as pd
from bs4 import BeautifulSoup


def xsd_to_dict(xsd_path):
    super_dict = {}
    # Close the file handle once BeautifulSoup has read the schema
    with open(xsd_path) as xsd_file:
        soup = BeautifulSoup(xsd_file, "html.parser")
    for complex_type in soup.find_all('xs:complextype'):
        xsd_parsed = [x for x in ",".join(str(complex_type).split("\n"))
            .replace("</xs:sequence>", "")
            .replace("'<xs:sequence>", "")
            .replace("<xs:", "")
            .replace("</xs:complextype>", "")
            .replace("</xs:element>", "")
            .replace(">", "").replace("sequence", "")
            .split(",") if x != ""]

        if len(xsd_parsed[0]) > len("complextype") + 1:
            matrix_list = [e.split(" ") for e in xsd_parsed[-len(xsd_parsed) + 1:]]

            level_1 = ["|".join(["".join([":".join(final.split("=")) for final in y if len(final.split("=")) == 2])
                                 for y in [x.split(",") for x in item]]) for item in matrix_list]
            level_1.insert(0, xsd_parsed[0])
            for x in level_1[-len(xsd_parsed) + 1:]:
                flattened_dict = {x.split(":")[0]:"-".join(x.split(":")[-len(x.split(":")) + 1:])
                       for x in (level_1[0] + x).replace("=", ":").split("|")}
                xPath = flattened_dict.get("complextype name")
                xmlName = flattened_dict.get("name")
                dataType = flattened_dict.get("type")

                if xmlName is not None:
                    final_dict = {x.split(":")[0]:x.split(":")[1]
                                for x in str("xpath:"+str(xPath)+",xmlFieldName:"+str(xmlName)+",dataPath:"+str(dataType)).split(",")}
                    for k, v in final_dict.items():
                        super_dict.setdefault(k, []).append(v)

    return super_dict



def xsd_to_csv(xsd_path):
    pd.DataFrame(xsd_to_dict(xsd_path)).to_csv(xsd_path.replace(".xsd", ".csv"))
    return "done"


xsd_to_csv("CustomersOrders.xsd")

Input: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/sample-xsd-file-customers-and-orders1

Output:

,xpath,xmlFieldName,dataPath
0,"""CustomerType""","""CompanyName""","""xs-string"""
1,"""CustomerType""","""ContactName""","""xs-string"""
2,"""CustomerType""","""ContactTitle""","""xs-string"""
3,"""CustomerType""","""Phone""","""xs-string"""
4,"""CustomerType""","""Fax""","""xs-string"""
5,"""CustomerType""","""FullAddress""","""AddressType"""
6,"""CustomerType""","""CustomerID""","""xs-token""</xs-attribute"
7,"""AddressType""","""Address""","""xs-string"""
8,"""AddressType""","""City""","""xs-string"""
9,"""AddressType""","""Region""","""xs-string"""
10,"""AddressType""","""PostalCode""","""xs-string"""
11,"""AddressType""","""Country""","""xs-string"""
12,"""AddressType""","""CustomerID""","""xs-token""</xs-attribute"
13,"""OrderType""","""CustomerID""","""xs-token"""
14,"""OrderType""","""EmployeeID""","""xs-token"""
15,"""OrderType""","""OrderDate""","""xs-dateTime"""
16,"""OrderType""","""RequiredDate""","""xs-dateTime"""
17,"""OrderType""","""ShipInfo""","""ShipInfoType"""
18,"""ShipInfoType""","""ShipVia""","""xs-integer"""
19,"""ShipInfoType""","""Freight""","""xs-decimal"""
20,"""ShipInfoType""","""ShipName""","""xs-string"""
21,"""ShipInfoType""","""ShipAddress""","""xs-string"""
22,"""ShipInfoType""","""ShipCity""","""xs-string"""
23,"""ShipInfoType""","""ShipRegion""","""xs-string"""
24,"""ShipInfoType""","""ShipPostalCode""","""xs-string"""
25,"""ShipInfoType""","""ShipCountry""","""xs-string"""
26,"""ShipInfoType""","""ShippedDate""","""xs-dateTime""
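The output above still carries quoting and markup residue (for example `</xs-attribute`) because the schema is processed as text. As an alternative, here is a hedged sketch that produces the same three columns by walking the parsed tree with the standard-library `xml.etree.ElementTree`. It is not the answer's method: the function name `complex_type_rows` and the inline sample schema are hypothetical, and it only looks at named complex types with a direct `xs:sequence` of named elements plus direct `xs:attribute` children.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical miniature schema, used only to make the sketch self-contained
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="PostalCode" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="CustomerID" type="xs:token"/>
  </xs:complexType>
</xs:schema>
"""


def complex_type_rows(root):
    """Return (complex type name, field name, field type) tuples."""
    rows = []
    for ctype in root.iterfind(".//xs:complexType[@name]", NS):
        # Elements of the direct sequence first, then direct attributes
        fields = ctype.findall("xs:sequence/xs:element", NS)
        fields += ctype.findall("xs:attribute", NS)
        for field in fields:
            name = field.get("name")
            if name is not None:  # skip ref-only declarations
                rows.append((ctype.get("name"), name, field.get("type", "")))
    return rows


def rows_to_csv(rows, out):
    """Write the rows under the column names used in the output above."""
    writer = csv.writer(out)
    writer.writerow(["xpath", "xmlFieldName", "dataPath"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    rows_to_csv(complex_type_rows(ET.fromstring(SAMPLE_XSD)), buffer)
    print(buffer.getvalue(), end="")
```

For a schema on disk, `ET.parse(path).getroot()` can replace the `ET.fromstring` call. Nested anonymous types, `xs:choice` groups, and `xs:extension` bases are outside what this sketch covers.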

【Comments】:
