【Question title】: How to parse an XML string with deep structures using Python
【Posted】: 2023-03-29 20:49:02
【Question description】:

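A similar question was asked here (Python XML Parsing), but I could not get at the content I am interested in.

I need to extract all the information contained in each patent-classification element whose classification-scheme child has scheme="CPC". There are several such elements, all enclosed in the patent-classifications tag.

In the example given below there are three such values:

C 07 K 16 22 I
A 61 K 2039 505 A
C 07 K 2317 21 A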

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="21"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
            <bibliographic-data>
                <publication-reference>
                    <document-id document-id-type="docdb">
                        <country>US</country>
                        <doc-number>2009234106</doc-number>
                        <kind>A1</kind>
                        <date>20090917</date>
                    </document-id>
                    <document-id document-id-type="epodoc">
                        <doc-number>US2009234106</doc-number>
                        <date>20090917</date>
                    </document-id>
                </publication-reference>
                <classifications-ipcr>
                    <classification-ipcr sequence="1">
                        <text>C07K  16/    44            A I                    </text>
                    </classification-ipcr>
                </classifications-ipcr>
                <patent-classifications>
                    <patent-classification sequence="1">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>C</section>
                        <class>07</class>
                        <subclass>K</subclass>
                        <main-group>16</main-group>
                        <subgroup>22</subgroup>
                        <classification-value>I</classification-value>
                    </patent-classification>
                    <patent-classification sequence="2">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>A</section>
                        <class>61</class>
                        <subclass>K</subclass>
                        <main-group>2039</main-group>
                        <subgroup>505</subgroup>
                        <classification-value>A</classification-value>
                    </patent-classification>
                    <patent-classification sequence="7">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>C</section>
                        <class>07</class>
                        <subclass>K</subclass>
                        <main-group>2317</main-group>
                        <subgroup>92</subgroup>
                        <classification-value>A</classification-value>
                    </patent-classification>
                    <patent-classification sequence="1">
                        <classification-scheme office="US" scheme="UC"/>
                        <classification-symbol>530/387.9</classification-symbol>
                    </patent-classification>
                </patent-classifications>
            </bibliographic-data>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

【Discussion】:

    Tags: python xml


    【Solution 1】:

    If you don't have BeautifulSoup, install it (lxml provides the XML parser):

    $ pip install beautifulsoup4 lxml

    Then try this:

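    from bs4 import BeautifulSoup

    # read the file as bytes so the parser can honor the declared encoding
    with open('example.xml', 'rb') as f:
        xml = f.read()
    # the 'xml' parser needs lxml to be installed
    bs = BeautifulSoup(xml, 'xml')

    # find every patent-classification element
    patents = bs.find_all('patent-classification')
    # keep only the ones whose classification-scheme has scheme="CPC"
    for pa in patents:
        if pa.find('classification-scheme', {'scheme': 'CPC'}):
            # join the child texts with single spaces, e.g. "C 07 K 16 22 I"
            print(pa.get_text(' ', strip=True))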
    

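    【Discussion】:

    • Thanks, but where is xml used as a variable?
    • Well, the xml variable is where you load the XML. To try the exact code, create a file named example.xml and paste into it what you posted in the question. I have also edited my answer; I had missed a line. Thanks.
    • @user1140126 Check the answer again, I updated it. I had left out a line.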
    【Solution 2】:

    You can use the xml module from the Python standard library:

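    import xml.etree.ElementTree as ET

    root = ET.parse('a.xml').getroot()

    # match each classification-scheme with scheme="CPC", then step up ("..") to its
    # parent patent-classification; the {...} part is the document's default namespace
    for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
        # iterate the element directly; getchildren() was removed in Python 3.9
        data = [d.text for d in node if d.text]
        print(' '.join(data))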
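
Since the question asks about an XML string rather than a file, the same idea can be sketched with ET.fromstring and a namespace map. This is only an illustration under a few assumptions: the function name cpc_classifications, the prefix "ex", and the SAMPLE text (a trimmed-down copy of the XML from the question) are not part of either answer.

```python
import xml.etree.ElementTree as ET

# Namespace map: "ex" is an arbitrary local alias (an assumption); only the
# URI has to match the default namespace declared on the root element.
NS = {"ex": "http://www.epo.org/exchange"}

# Trimmed-down copy of the question's XML, kept short for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document>
      <bibliographic-data>
        <patent-classifications>
          <patent-classification sequence="1">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>16</main-group>
            <subgroup>22</subgroup>
            <classification-value>I</classification-value>
          </patent-classification>
          <patent-classification sequence="1">
            <classification-scheme office="US" scheme="UC"/>
            <classification-symbol>530/387.9</classification-symbol>
          </patent-classification>
        </patent-classifications>
      </bibliographic-data>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""


def cpc_classifications(xml_text):
    """Return one space-joined string per CPC patent-classification element."""
    root = ET.fromstring(xml_text)
    results = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        # Skip entries that have no scheme element or use another scheme (e.g. UC).
        if scheme is None or scheme.get("scheme") != "CPC":
            continue
        # Collect the non-empty text of each direct child, in document order.
        parts = [c.text.strip() for c in pc if c.text and c.text.strip()]
        results.append(" ".join(parts))
    return results


print(cpc_classifications(SAMPLE))  # ['C 07 K 16 22 I']
```

Returning a list instead of printing inside the loop makes the result easier to reuse, and the namespace map keeps the long URI out of the path expressions.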
    

    【Discussion】:
