【Question title】: Accessing values in an XML file with namespaces in Python 2.7 lxml
【Posted】: 2016-03-25 20:48:22
【Question description】:

I am following this link to try to get the values of several tags:

Parsing XML with namespace in Python via 'ElementTree'

In that link, accessing the root tag is not a problem:

import sys
from lxml import etree as ET


doc = ET.parse('file.xml')

namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'} # add more as needed
namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'} # add more as needed
namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}

print doc.findall('rdf:RDF', namespaces_rdf)
print doc.findall('dcat:Dataset', namespaces_dcat)
print doc.findall('dct:identifier', namespaces_dct)

Output:

[]
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x2269b98>]
[]

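I can only access dcat:Dataset, and I cannot see how to access the value of rdf:about

and, later on, dct:identifier.

Of course, once I can access this information, I also need to access the dcat:distribution info.

This is my sample file, generated with ckanext-dcat: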

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcat="http://www.w3.org/ns/dcat#"
>
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
    <dct:description>FOO-Description</dct:description>
    <dct:title>FOO-title</dct:title>
    <dcat:keyword>keyword1</dcat:keyword>
    <dcat:keyword>keyword2</dcat:keyword>
    <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-10-08T08:55:04.566618</dct:issued>
    <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-06-25T11:04:10.328902</dct:modified>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>
        <dct:description>FOO-Description-1</dct:description>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
        <dct:format>XLS</dct:format>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:format>XLS</dct:format>
        <dct:title>FOO-title-2</dct:title>
        <dct:description>FOO-Description-2</dct:description>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>

Any ideas on how to access this information? Thanks
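
One possible way to collect the dcat:Distribution details from a file shaped like this sample is to walk every dcat:Distribution element and read its children with findtext(). This is only a sketch under assumptions: SAMPLE is a trimmed inline copy of the file above, the names BASE, RESOURCE, ABOUT and distributions() are made up for illustration, and the import falls back to the standard-library ElementTree (which shares findall(), findtext() and get() with lxml) when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: the standard library is an acceptable fallback
    import xml.etree.ElementTree as ET

namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
# Attribute names carry their namespace in Clark notation: {uri}localname.
ABOUT = '{%s}about' % namespaces['rdf']

BASE = 'http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01'
RESOURCE = BASE + '/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f'

# Trimmed, illustrative copy of the sample file (one distribution only).
SAMPLE = '''<rdf:RDF
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="%(base)s">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
    <dcat:distribution>
      <dcat:Distribution rdf:about="%(res)s">
        <dct:title>FOO-title-1</dct:title>
        <dcat:accessURL>%(res)s/download/myxls.xls</dcat:accessURL>
        <dct:format>XLS</dct:format>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>''' % {'base': BASE, 'res': RESOURCE}


def distributions(root):
    """Return one dict per dcat:Distribution found anywhere below root."""
    records = []
    # './/' searches all descendants, not only direct children of root.
    for dist in root.findall('.//dcat:Distribution', namespaces):
        records.append({
            'about': dist.get(ABOUT),
            'title': dist.findtext('dct:title', namespaces=namespaces),
            'format': dist.findtext('dct:format', namespaces=namespaces),
            'accessURL': dist.findtext('dcat:accessURL', namespaces=namespaces),
        })
    return records


root = ET.fromstring(SAMPLE)
for record in distributions(root):
    print(record)
```

With a real file, root would come from ET.parse('file.xml').getroot() instead of the inline string.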

Update: OK, I need to access rdf:about in:

<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">

So I used this code, taken from:

Parse xml with lxml - extract element value

for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        print '@' + attrib + '=' + node.attrib[attrib]

I get this output:

[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x23d8ee0>]
@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about=http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01

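So, the question is:

How can I check whether an attribute is about so that I can take its value? In other files I have several attributes.

Update 2: fixed the way I get the value (Clark notation)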

for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        if attrib.endswith('about'):
            pass  # do my jobs
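
Matching on the suffix works for this file, but endswith('about') would also match any other attribute whose name happens to end in about. A possible alternative is to ask for the attribute by its exact Clark-notation name, {namespace-uri}about, through get(). The sketch below is illustrative only: SAMPLE is a trimmed inline stand-in for the real file, RDF_NS, ABOUT and dataset_abouts() are made-up names, findall('.//...') stands in for the xpath('//...') call so the code also runs on the standard-library ElementTree, and the import falls back to that module when lxml is missing.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: the standard library is an acceptable fallback
    import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
namespaces = {'rdf': RDF_NS, 'dcat': 'http://www.w3.org/ns/dcat#'}

# Clark notation names the attribute exactly: {namespace-uri}localname.
ABOUT = '{%s}about' % RDF_NS

DATASET_URL = 'http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01'

# Trimmed, illustrative stand-in for the sample file.
SAMPLE = (
    '<rdf:RDF xmlns:rdf="%s" xmlns:dcat="http://www.w3.org/ns/dcat#">'
    '<dcat:Dataset rdf:about="%s"/>'
    '</rdf:RDF>' % (RDF_NS, DATASET_URL)
)


def dataset_abouts(root):
    """Return the rdf:about value of every dcat:Dataset that has one."""
    values = []
    for node in root.findall('.//dcat:Dataset', namespaces):
        about = node.get(ABOUT)  # None when the attribute is absent
        if about is not None:
            values.append(about)
    return values


root = ET.fromstring(SAMPLE)
print(dataset_abouts(root))
```

With lxml specifically, passing '//dcat:Dataset/@rdf:about' to xpath() with the same namespaces mapping should return the attribute values directly as a list of strings.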

Well, almost done, but I have one last question: when I access a

<dct:title>

I need to know which element it belongs to. I have:

<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
       <dct:title>FOO-title</dct:title>

<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>

<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:title>FOO-title-2</dct:title>

If I do this, I get:

for node in doc.xpath('//dct:title', namespaces=namespaces):
   print node.tag, node.text

{http://purl.org/dc/terms/}title FOO-title
{http://purl.org/dc/terms/}title FOO-title-1
{http://purl.org/dc/terms/}title FOO-title-2
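
A possible way to keep each title tied to its owner is to start from the owning elements (dcat:Dataset and dcat:Distribution) and ask each one for its direct dct:title child, instead of collecting every dct:title in the document. The following is a sketch under assumptions: SAMPLE is a trimmed inline copy of the file, BASE, RES1, RES2, ABOUT and titles_by_owner() are illustrative names, and the import falls back to the standard-library ElementTree when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: the standard library is an acceptable fallback
    import xml.etree.ElementTree as ET

namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
ABOUT = '{%s}about' % namespaces['rdf']

BASE = 'http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01'
RES1 = BASE + '/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f'
RES2 = BASE + '/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f'

# Trimmed, illustrative copy of the sample file (titles only).
SAMPLE = '''<rdf:RDF
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="%(base)s">
    <dct:title>FOO-title</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="%(res1)s">
        <dct:title>FOO-title-1</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="%(res2)s">
        <dct:title>FOO-title-2</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>''' % {'base': BASE, 'res1': RES1, 'res2': RES2}


def titles_by_owner(root):
    """Return (owner local name, owner rdf:about, title text) tuples."""
    rows = []
    for path in ('.//dcat:Dataset', './/dcat:Distribution'):
        for owner in root.findall(path, namespaces):
            # A path without './/' matches direct children only, so the
            # Dataset does not pick up the titles of its distributions.
            title = owner.find('dct:title', namespaces)
            if title is not None:
                local_name = owner.tag.split('}')[1]  # strip '{uri}'
                rows.append((local_name, owner.get(ABOUT), title.text))
    return rows


root = ET.fromstring(SAMPLE)
for row in titles_by_owner(root):
    print(row)
```

lxml elements also provide getparent(), so going the other way (from each dct:title up to its owner) should be possible there as well; the standard-library ElementTree has no such method.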

Thanks

【Question discussion】:

    Tags: python namespaces lxml


    【Solution 1】:

    Use the xpath() method together with the namespaces named argument:

    namespaces = {
        'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
        'dcat': 'http://www.w3.org/ns/dcat#',
        'dct': 'http://purl.org/dc/terms/'
    }
    
    print(doc.xpath('//rdf:RDF', namespaces=namespaces))
    print(doc.xpath('//dcat:Dataset', namespaces=namespaces))
    print(doc.xpath('//dct:identifier', namespaces=namespaces))
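
A likely reason the original findall() calls returned empty lists: a plain path such as 'rdf:RDF' or 'dct:identifier' is evaluated relative to the root element and matches only its direct children, so the root itself and anything nested deeper are missed, while a leading // searches the whole document. The sketch below illustrates that difference with findall(), whose ElementPath counterpart of a leading // is .//. It rests on assumptions: SAMPLE is a trimmed inline copy of the question's file, count() is a made-up helper, and the import falls back to the standard-library ElementTree when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # assumption: the standard library is an acceptable fallback
    import xml.etree.ElementTree as ET

namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}

# Trimmed, illustrative copy of the question's file.
SAMPLE = '''<rdf:RDF
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
  </dcat:Dataset>
</rdf:RDF>'''

root = ET.fromstring(SAMPLE)


def count(path):
    """Number of elements matched by path, evaluated from the root element."""
    return len(root.findall(path, namespaces))


print(root.tag)                    # the root element is rdf:RDF itself
print(count('rdf:RDF'))            # 0: rdf:RDF is the root, not a child of it
print(count('dcat:Dataset'))       # 1: a direct child of the root
print(count('dct:identifier'))     # 0: nested one level deeper
print(count('.//dct:identifier'))  # 1: './/' searches all descendants
```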
    

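    【Discussion】:

    • Great!!! It works like a charm. I did not know about xpath, so I will take the time to understand it. Thanks
    • @davisoski Sure, see stackoverflow.com/help/someone-answers. Also check the answers you received earlier to see whether any of them deserve to be accepted.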