【Question title】: How can we split an XML document with lxml?
【Posted】: 2022-01-21 09:51:12
【Question】:

I am looking for a good way to split the following XML

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

into fragments (two in this case):

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>

I am experimenting with the code below, but it does not look very elegant. Is there a better way to achieve this?

from copy import deepcopy

from lxml import etree
starting_xml_string = '''<?xml version='1.0' encoding='US-ASCII'?><cml xmlns="http://www.chemaxon.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.chemaxon.com/marvin/schema/mrvSchema_20_20_0.xsd" version="ChemAxon file format v20.20.0, generated by v21.14.0">
<MDocument><MChemicalStruct><molecule molID="m1"><atomArray atomID="a1 a2 a3" elementType="C C O"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
<MDocument><MChemicalStruct><molecule molID="m2"><atomArray atomID="a1 a2 a3 a4" elementType="C C C C"/><bondArray><bond id="b1" atomRefs2="a1 a2" order="1"/><bond id="b2" atomRefs2="a2 a3" order="1"/><bond id="b3" atomRefs2="a3 a4" order="1"/></bondArray></molecule></MChemicalStruct></MDocument>
</cml>'''
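root = etree.fromstring(starting_xml_string.encode('utf-8'))

# build an empty envelope: a copy of the root with all children removed
envelope = deepcopy(root)
for mol in list(envelope):  # iterate over a snapshot; removing while iterating skips elements
    envelope.remove(mol)
encoding = root.getroottree().docinfo.encoding
fragments = []
for fragment in list(root):  # list(root) replaces the deprecated getchildren()
    tmp = deepcopy(envelope)
    tmp.append(fragment)  # note: this moves the fragment out of root
    tmp = etree.tostring(tmp, xml_declaration=True, encoding=encoding).decode('utf-8')
    fragments.append(tmp)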

Thank you very much for your help.

【Comments】:

    Tags: python lxml


    【Solution 1】:

    I would approach it as follows. Note that you have to account for namespaces, so the code reflects that:

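    from lxml import etree

    # declare the default namespace under a prefix so XPath can address it
    ns = {"xx": "http://www.chemaxon.com"}

    # helper: write a copy of the document with one molecule's MDocument removed
    # (with two molecules, each output file is left with exactly one)
    def cleanup(mol_id):
        # re-parse so that each output starts from the complete document
        root = etree.fromstring(starting_xml_string.encode('utf-8'))
        # select the MDocument that contains the molecule with this molID
        exp = f'//xx:MDocument[.//xx:molecule[@molID="{mol_id}"]]'
        target = root.xpath(exp, namespaces=ns)[0]
        target.getparent().remove(target)
        fn = f"myfile_without_{mol_id}.xml"
        # tostring() returns bytes here, so write the file in binary mode
        with open(fn, 'wb') as doc:
            doc.write(etree.tostring(root, xml_declaration=True, pretty_print=True))

    # collect the target molecule ids
    root = etree.fromstring(starting_xml_string.encode('utf-8'))
    ids = root.xpath('//xx:MDocument//xx:molecule/@molID', namespaces=ns)
    for mol_id in ids:
        cleanup(mol_id)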
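
The code above removes one molecule per output file, which only yields single-molecule files when the input holds exactly two. As a rough sketch of a variant that produces one document per MDocument regardless of how many there are, something like the following might work. The function name `split_cml` and the constant `CML_NS` are made up for this illustration, and the output encoding is hard-coded to US-ASCII on the assumption that it should match the sample input.

```python
from copy import deepcopy

from lxml import etree

# Prefix chosen for illustration; only the URI comes from the sample document.
CML_NS = {"xx": "http://www.chemaxon.com"}


def split_cml(xml_bytes):
    """Return one serialized <cml> document (bytes) per MDocument child."""
    root = etree.fromstring(xml_bytes)
    fragments = []
    for doc in root.xpath("xx:MDocument", namespaces=CML_NS):
        # Rebuild only the root element itself: same tag, attributes and
        # namespace declarations, but no children.
        envelope = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=root.nsmap)
        envelope.text = root.text
        # Copy the child so the parsed input tree is left untouched.
        envelope.append(deepcopy(doc))
        fragments.append(
            etree.tostring(envelope, xml_declaration=True, encoding="US-ASCII")
        )
    return fragments
```

Calling `split_cml(starting_xml_string.encode('utf-8'))` on the sample from the question should return two byte strings, each holding the `cml` root with a single `MDocument` inside. Building a fresh root with `etree.Element` avoids deep-copying the whole tree only to delete its children again.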
    

    【Comments】:
