【Question title】: Parsing specific field in XML file in Python
【Posted】: 2015-08-31 02:26:08
【Question description】:

I have an XML file that looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="http://data.treasury.gov:8001/Feed.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">DailyTreasuryYieldCurveRateData</title>
  <id>http://data.treasury.gov:8001/feed.svc/DailyTreasuryYieldCurveRateData</id>
  <updated>2015-08-30T15:17:09Z</updated>
  <link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6404)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6404)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6404</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-03T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.03</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.17</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.33</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.68</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">0.99</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.89</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.55</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.86</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.86</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6405)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6405)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6405</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-04T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.05</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.18</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.37</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.74</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">1.08</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.6</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.97</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.23</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.59</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.9</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.9</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
</feed>

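How can I parse the "2.16" value of "BC_10YEAR"? I have been looking at other examples for ElementTree and lxml, but I can't seem to match the XML format in those examples to my file.

The last thing I tried was: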

from lxml import etree
doc = etree.parse(yield_xml)
memoryElem = doc.find('content')
print memoryElem.text        # element text
print memoryElem.get('type') # attribute

I get an error: AttributeError: 'NoneType' object has no attribute 'text'

Is there a simple way to do this?

【Question comments】:

    Tags: python xml parsing


    【Solution 1】:

    You could try the built-in split method:

    >>>[data.split('>')[1].split('<')[0] for data in str(xml_file).split('<d:') if 'BC_10YEAR' in data][0]
    '2.16'
    

    【Comments】:

    • I tried 'with open('test.xml', 'rb') as xml_file: [data.split('>')[1].split('
    • That means your test.xml file object is different from the example above.
    • That's odd, I'm pretty sure my file matches exactly what I pasted above. Anyway, I modified this to get it working: with open(yield_xml, 'rb') as yield_file: for line in yield_file: if 'BC_10YEAR' in line: cur_yield = float(line.split('>')[1].split('<')[0]) break
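The line-by-line variant from the last comment is easier to read when laid out as a function. The following is only a sketch under a few assumptions: it targets Python 3 and reads text rather than bytes, `first_yield` is a hypothetical helper name, and an in-memory `io.StringIO` stands in for the real file. One plausible reason the original one-liner failed is that `str()` on an open file object returns the object's repr rather than the file contents, so the contents would have to be read first.

```python
import io

# A few lines copied from the feed in the question, used in place of the real file.
SAMPLE = """\
        <d:BC_7YEAR m:type="Edm.Double">1.89</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.55</d:BC_20YEAR>
"""


def first_yield(lines, field="BC_10YEAR"):
    """Return the first value for `field` as a float, or None if absent.

    Hypothetical helper: relies on each element sitting on its own line,
    exactly as in the sample, so it is fragile compared to a real XML parser.
    """
    for line in lines:
        if field in line:
            # Text between the first '>' and the following '<' is the value.
            return float(line.split(">")[1].split("<")[0])
    return None


cur_yield = first_yield(io.StringIO(SAMPLE))
print(cur_yield)
```

Note that a plain substring test can match more than intended: searching for `BC_30YEAR` would also match the `BC_30YEARDISPLAY` line, which is one more reason to prefer a namespace-aware parser.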
    【Solution 2】:

    I would suggest using lxml's xpath() method, which provides better support for XPath expressions:

    from lxml import etree
    
    doc = etree.parse(yield_xml)
    
    #register prefixes to be used in xpath
    ns = {"foo": "http://www.w3.org/2005/Atom",
          "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
          "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"}
    
    #select element <d:BC_10YEAR>, and convert the value to number
    result = doc.xpath("number(//foo:content/m:properties/d:BC_10YEAR)", namespaces=ns)
    
    #print the result
    print(result)
    print(type(result))
    

    Output:

    2.16
    <type 'float'>
    

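    If you are wondering why the XPath expression above uses foo:content rather than just content, it is because content implicitly inherits the default namespace from the root element, and the code above maps that default namespace URI to the prefix foo. Related question: parsing xml containing default namespace to get an element value using lxml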
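The same namespace rule explains the `NoneType` error in the question: an unprefixed `content` does not match an element that lives in the Atom default namespace. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` instead of lxml, so that it runs without extra packages; the trimmed `SAMPLE` document and the prefix name `atom` are assumptions made for the illustration (any prefix works, as `foo` did above).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question (one entry, one field).
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Prefix-to-URI map; the prefix names are local to this script.
ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Without a prefix the search looks for a 'content' element in no namespace.
unqualified = root.find(".//content")

# With the prefix mapped to the default namespace URI, the element is found.
qualified = root.find(".//atom:content", ns)
value = qualified.find("m:properties/d:BC_10YEAR", ns)

print(unqualified)            # None
print(qualified.get("type"))  # application/xml
print(float(value.text))      # 2.16
```

In the question's attempt there is a second obstacle as well: `find('content')` only searches direct children of the root, while `content` is nested inside `entry`.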

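    【Comments】:

    • Thanks, the code works. Unfortunately my knowledge of XML is very limited, so I can't follow much of what you said. I do have one question though: how does the code distinguish between the two "BC_10YEAR" values in the XML file? The first is 2.16, but there is also 2.23.
    • The code will return only the first one. With a small change to the xpath argument it is entirely possible to get the other one, or all of the BC_10YEAR values.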
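As a rough illustration of that last comment, here is one way to collect every BC_10YEAR value, or to pick one by its date. This is a sketch rather than the answerer's own code: it uses the standard library's `xml.etree.ElementTree` and `findall()` so that it runs without lxml, and the trimmed `SAMPLE` document and the `by_date` mapping are assumptions made for the example. With lxml, comparable XPath expressions would be `//d:BC_10YEAR/text()` for all values and `number((//d:BC_10YEAR)[2])` for the second one.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question (two entries, two fields each).
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-03T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-04T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.23</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)
ns = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# All BC_10YEAR values, in document order.
all_yields = [float(el.text) for el in root.findall(".//d:BC_10YEAR", ns)]

# Map each entry's date to its 10-year yield, so a value can be picked by date.
by_date = {
    props.find("d:NEW_DATE", ns).text: float(props.find("d:BC_10YEAR", ns).text)
    for props in root.findall(".//m:properties", ns)
}

print(all_yields)
print(by_date["2015-08-04T00:00:00"])
```

Keying on NEW_DATE avoids depending on the order of entries in the feed, which may matter if the feed is not guaranteed to be sorted.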