【Question title】: Parsing an XML and storing values if they occur inside a list
【Posted】: 2020-01-14 12:19:44
【Question description】:

I have the following XML file (3 GB in total, which is why I parse it incrementally):

<events version="1.0">
    <event time="13834.0" type="actend" person="1537047" link="335909" facility="home811408" actType="home"  />
    <event time="13834.0" type="departure" person="1537047" link="335909" legMode="car_passenger"  />
    <event time="14516.0" type="travelled" person="1537047" distance="9749.86232009391"  />
    <event time="14516.0" type="arrival" person="1537047" link="79554" legMode="car_passenger"  />
    <event time="14516.0" type="actstart" person="1537047" link="79554" facility="105155" actType="work"  />
    <event time="15380.0" type="actend" person="3716370" link="280959" facility="outside_484" actType="outside"  />
    <event time="15380.0" type="departure" person="3716370" link="280959" legMode="car"  />
    <event time="15380.0" type="PersonEntersVehicle" person="3716370" vehicle="3716370"  />
    <event time="15380.0" type="vehicle enters traffic" person="3716370" link="280959" vehicle="3716370" networkMode="car" relativePosition="1.0"  />
    <event time="15380.0" type="coldEmissionEvent" linkId="280959" vehicleId="3716370" NO2="0.00273337378166616" NOx="0.33" HC="3.78" CO="19.99" FC="23.79" PM="0.00789998099207878" NMHC="3.57"  />
    <event time="15381.0" type="left link" vehicle="3716370" link="280959"  />
    <event time="15381.0" type="entered link" vehicle="3716370" link="103801"  />
    <event time="15386.0" type="left link" vehicle="3716370" link="103801"  />
    <event time="15386.0" type="entered link" vehicle="3716370" link="502211"  />
    <event time="15386.0" type="warmEmissionEvent" linkId="103801" vehicleId="3716370" NO2="0.0016834393054024187" CO2_TOTAL="5.211468969715323" NOx="0.010865835516688339" SO2="2.6488925864494008E-5" HC="0.0029077588002405412" CO="0.02157863109652191" FC="1.6554329969579966" PM="4.59119810564296E-4" NMHC="0.002754718863385776"  />
    <event time="15391.0" type="left link" vehicle="3716370" link="502211"  />
</events>

In addition, I created the following list, which contains the links I am interested in.

closed_links = ["280959", "171962","7478","7477","335574","335575","7476","7475","7474","435947","254910","254911","294486","294487","172002","172003"
,"172004","172005","172000","172001","103801","294483","310984","310985","310982","310983","652344","255111","492823","537639","485764","485763"
,"639147","485766","485765","259614","259615","259612","259613","270874","244174","540827","658808","207","609975","609974","609973","537632"
,"537631","569248","345419","259731","557381","414858","573518","468058","83791","468029"]

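I want a table that shows each person that has been registered on any of the closed_links (the attribute is called link in the XML). In the final table, every value of person should be unique. Including link in the output is not mandatory; I only want it as a quality check to see whether the code works.

So far my code gives no result, mainly because I don't know how to make it select only the events whose link matches one of the values in the list: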

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

tree = ET.iterparse(gzip.open('V0_1pm/output_events.xml.gz', 'r'))
agents_o_i = defaultdict(list)
for xml_event, elem in tree:
    attributes = elem.attrib
    if elem.tag == 'event' and elem.attrib["link"] in closed_links:
         agents_o_i[attributes['person']].append(attributes['link'])

agents_o_i = pd.DataFrame.from_dict(agents_o_i, orient='index')
agents_o_i.to_csv("out/V1_10pct/traveltimes_V1.csv", sep=';')

Desired output:

person  link   
3716370 280959 

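Any help is much appreciated!

【Question comments】:

  • The if block is the problem. Could you post the expected output for this sample?
  • @alec_djinn Thanks for your input, please see the bottom of the updated question.
  • Link 103801 does not appear in any row that contains a person. How could it end up in your output?
  • You are right, my mistake.

Tags: python xml pandas collections elementtree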


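【Solution 1】:

Your if block crashes with a KeyError because not every event has a link (or a person) attribute.

Check that a key is present in the attributes before you read it.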

for xml_event, elem in tree:
    if elem.tag == 'event' \
    and 'person' in elem.attrib \
    and 'link' in elem.attrib \
    and elem.attrib['link'] in closed_links:
        agents_o_i[elem.attrib['person']].append(elem.attrib['link'])

Result for the sample above:

>>> print(agents_o_i)
defaultdict(list, {'3716370': ['280959', '280959', '280959']})
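
The question asks for one row per person, with link shown only as a quality check. A minimal sketch of that last step follows. It uses the standard-library csv module instead of pandas, which is a substitution on my part, and the helper name unique_person_rows is made up for illustration. The ';' separator is taken from the question's to_csv call.

```python
import csv
import io
from collections import defaultdict

# Example data in the same shape as the result printed above.
agents_o_i = defaultdict(list, {'3716370': ['280959', '280959', '280959']})


def unique_person_rows(agents):
    """Return one (person, links) row per person, with duplicate links removed."""
    return [(person, ','.join(sorted(set(links)))) for person, links in agents.items()]


rows = unique_person_rows(agents_o_i)

# Written to an in-memory buffer here; pass an open file object to write to disk.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerow(['person', 'link'])
writer.writerows(rows)
print(buffer.getvalue())
```

Because the dictionary is keyed by person, each person can only appear once, and turning the list of links into a set collapses the repeated 280959 entries.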

Alternatively, you can parse the file by hand, line by line, in much the same way.

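import gzip
from collections import defaultdict

agents_o_i = defaultdict(list)
# Open in text mode ('rt') so that each line is a str rather than bytes.
with gzip.open('output_events.xml.gz', 'rt') as f:
    for line in f:
        # Match whole attribute names so that linkId="..." is not mistaken for link="...".
        if ' person="' in line and ' link="' in line:
            link = line.split(' link="')[1].split('"')[0]
            if link in closed_links:
                person = line.split(' person="')[1].split('"')[0]
                agents_o_i[person].append(link)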

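【Comments】:

  • Unfortunately, I run out of memory when I try to run this on the larger file. I inserted elem.clear() below the last line of the solution, with the same indentation. Do you have any suggestions for making it more memory-efficient?
  • You can read the file line by line and parse it with a custom function. That avoids loading all the data into memory.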
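
On the memory question: placing elem.clear() at the indentation of the last line puts it inside the if block, so only matching events get cleared. Even a cleared element stays attached to the root, so the tree keeps growing. A hedged sketch of one common workaround is to keep a reference to the root and clear it after every event. The function name persons_on_links and the small inline sample are illustrative assumptions, not part of the original answer.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def persons_on_links(source, links_of_interest):
    """Stream <event> elements and collect the links of interest seen per person."""
    wanted = set(links_of_interest)  # set lookups avoid scanning the whole list
    found = defaultdict(set)
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first 'start' event delivers the root element
    for xml_event, elem in context:
        if xml_event != 'end' or elem.tag != 'event':
            continue
        person = elem.attrib.get('person')
        link = elem.attrib.get('link')
        if person is not None and link in wanted:
            found[person].add(link)
        # Detach processed children after every event, not only after matches.
        root.clear()
    return found


# Tiny in-memory stand-in for the gzipped events file.
sample = b'''<events version="1.0">
    <event time="15380.0" type="departure" person="3716370" link="280959" legMode="car"  />
    <event time="15381.0" type="left link" vehicle="3716370" link="280959"  />
    <event time="14516.0" type="arrival" person="1537047" link="79554" legMode="car_passenger"  />
</events>'''

with gzip.open(io.BytesIO(gzip.compress(sample)), 'rb') as f:
    result = persons_on_links(f, ["280959", "103801"])

print(dict(result))  # {'3716370': {'280959'}}
```

For the real file, the same call would take gzip.open('V0_1pm/output_events.xml.gz', 'rb') and closed_links. Whether this is enough for a 3 GB input depends on how many distinct persons match, since the result dictionary itself still lives in memory.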