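[Posted]: 2020-11-18 00:04:22
[Question]:
I am using Python in a Jupyter Notebook to wrangle an OSM document into MongoDB. I parse the XML file with xml.etree.ElementTree and write the result to a JSON file.
Many tag keys are compound keys whose parts are separated by colons: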
<node id='1234'>
  <tag k='service:bicycle:diy' v='yes'/>
  <tag k='service:bicycle:second_hand' v='yes'/>
  <tag k='service:vehicle:brakes' v='yes'/>
</node>
While parsing the XML, I want to build a tree of nested dictionaries from these tags:
{'id': '1234',
 'service': {'bicycle': {'diy': 'yes',
                         'second_hand': 'yes'},
             'vehicle': {'brakes': 'yes'}}}
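I also want to do this recursively, so that keys with any number of colons are handled, for example <tag k='addr:street' v='Main Street'/>.
I have tried several approaches, but each one overwrites the dictionary, so only one entry survives at each level. (For example, the {'diy': 'yes'} entry is lost.)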
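The overwriting described above is consistent with how `dict.update` treats nested values: it replaces whatever is stored at a key and does not merge into it. A minimal illustration, independent of the OSM data (the variable name `doc` is only for this sketch):

```python
doc = {}

# The first compound key creates the nested branch.
doc.update({"service": {"bicycle": {"diy": "yes"}}})

# The second update replaces the whole value stored under "service";
# the dictionaries are not merged level by level.
doc.update({"service": {"bicycle": {"second_hand": "yes"}}})

print(doc)  # {'service': {'bicycle': {'second_hand': 'yes'}}}
```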
Here is the most pared-down version I could produce that still includes the important parts:
### bicycle_node.osm ###
# <?xml version="1.0" encoding="UTF-8"?>
# <osm version="0.6" generator="Overpass API 0.7.56.7 b85c4387">
#   <note>Data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
#   <meta osm_base="2020-11-05T23:56:03Z"/>
#   <bounds minlat="48.6458000" minlon="-122.5844000" maxlat="48.8595000" maxlon="-122.3455000"/>
#   <node id="255801452">
#     <tag k="name" v="The Hub"/>
#     <tag k="service:bicycle:diy" v="yes"/>
#     <tag k="service:bicycle:second_hand" v="yes"/>
#     <tag k="service:vehicle:painting" v="no"/>
#     <tag k="payment:coin" v="yes"/>
#     <tag k="payment:cash" v="yes"/>
#   </node>
#   <way id="4176487913">
#     <tag k="name" v="Some Place"/>
#     <tag k="service" v="driveway"/>
#   </way>
# </osm>

### Expected JSON ###
# {"_id": "255801452",
#  "name": "The Hub",
#  "service": {"bicycle": {"diy": "yes",
#                          "second_hand": "yes"},
#              "vehicle": {"painting": "no"}},
#  "payment": {"coin": "yes",
#              "cash": "yes"}}
# {"_id": "4176487913",
#  "name": "Some Place",
#  "service": "driveway"}

import xml.etree.ElementTree as ET
import codecs
import json


def get_subdiv_dict():
    return {"service": dict(), "payment": dict(), "wiki": dict()}


def subdiv_key(k, v, subdoc_dict):
    k_split = k.split(":")
    if len(k_split) == 1:
        subdoc_dict.update({ k_split[0]: v })
    else:
        subdoc_dict.update({ k_split[0]: subdiv_key(k=":".join(k_split[1:]),
                                                    v=v,
                                                    subdoc_dict=dict()) })
    return subdoc_dict


def shape_element(element):
    doc = dict()
    if element.tag in ["node", "way"]:
        # Get attributes.
        for att_k, att_v in element.attrib.items():
            if att_k == "id":
                doc["_id"] = att_v
        # Handle subelements.
        # Subdocs for subdivided keys.
        subdiv_dict = get_subdiv_dict()
        for sub_el in element.iter():
            if sub_el.tag == "tag":
                k = sub_el.attrib["k"]
                v = sub_el.attrib["v"]
                # Subdivide where appropriate.
                k_split = k.split(":")
                if k_split[0] in subdiv_dict.keys() and len(k_split) > 1:
                    subdiv_dict = subdiv_key(k=k, v=v, subdoc_dict=subdiv_dict)
                else:
                    doc[k] = v
        # Add subdocs to element
        for subdoc_k in subdiv_dict.keys():
            if subdiv_dict[subdoc_k]:
                doc[subdoc_k] = subdiv_dict[subdoc_k]
    return doc


def process_map(file_in, file_out):
    file_out = file_out.format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                fo.write(json.dumps(el) + "\n")
    return data


process_map('bicycle_node.osm', 'bicycle_node.json')
# Out[1]:
# [{'_id': '255801452',
# 'name': 'The Hub',
# 'service': {'vehicle': {'painting': 'no'}},
# 'payment': {'cash': 'yes'}},
# {'_id': '4176487913', 'name': 'Some Place', 'service': 'driveway'}]
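One possible way to get the expected structure is to descend into a sub-dictionary that already exists, rather than passing a fresh `dict()` into each recursive call. The following is a minimal sketch, not a definitive fix: `insert_nested` is a hypothetical helper name, it splits every colon-separated key (not only the `service`, `payment`, and `wiki` prefixes used above), and it assumes that a plain value already stored at a prefix may be replaced by a sub-dictionary when a longer key arrives.

```python
def insert_nested(tree, key_parts, value):
    """Store value in tree under the path given by key_parts, in place.

    insert_nested is a hypothetical helper, not part of the question's code.
    """
    head, rest = key_parts[0], key_parts[1:]
    if not rest:
        # Last key part: store the value itself.
        tree[head] = value
        return tree
    child = tree.get(head)
    if not isinstance(child, dict):
        # Assumed policy: a missing or non-dict entry is replaced by a new sub-dict.
        child = tree[head] = {}
    # Recurse into the existing (or newly created) sub-dict, not a fresh one.
    insert_nested(child, rest, value)
    return tree


# Tags of node 255801452 from the sample file above.
tags = [
    ("name", "The Hub"),
    ("service:bicycle:diy", "yes"),
    ("service:bicycle:second_hand", "yes"),
    ("service:vehicle:painting", "no"),
    ("payment:coin", "yes"),
    ("payment:cash", "yes"),
]

doc = {"_id": "255801452"}
for k, v in tags:
    insert_nested(doc, k.split(":"), v)

print(doc)
```

For the sample node this produces the nested `service` and `payment` sub-documents shown under "Expected JSON". If an element carries both a plain key such as `service` and a compound key such as `service:bicycle:diy`, whichever tag comes later wins under the assumed policy, so real data may need an explicit rule for that case.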
[Comments]:
Tags: python json xml dictionary recursion