[Posted]: 2020-01-14 12:19:44
[Question]:
I have the following XML file (3 GB in total, which is why I rely on iterative parsing):
<events version="1.0">
<event time="13834.0" type="actend" person="1537047" link="335909" facility="home811408" actType="home" />
<event time="13834.0" type="departure" person="1537047" link="335909" legMode="car_passenger" />
<event time="14516.0" type="travelled" person="1537047" distance="9749.86232009391" />
<event time="14516.0" type="arrival" person="1537047" link="79554" legMode="car_passenger" />
<event time="14516.0" type="actstart" person="1537047" link="79554" facility="105155" actType="work" />
<event time="15380.0" type="actend" person="3716370" link="280959" facility="outside_484" actType="outside" />
<event time="15380.0" type="departure" person="3716370" link="280959" legMode="car" />
<event time="15380.0" type="PersonEntersVehicle" person="3716370" vehicle="3716370" />
<event time="15380.0" type="vehicle enters traffic" person="3716370" link="280959" vehicle="3716370" networkMode="car" relativePosition="1.0" />
<event time="15380.0" type="coldEmissionEvent" linkId="280959" vehicleId="3716370" NO2="0.00273337378166616" NOx="0.33" HC="3.78" CO="19.99" FC="23.79" PM="0.00789998099207878" NMHC="3.57" />
<event time="15381.0" type="left link" vehicle="3716370" link="280959" />
<event time="15381.0" type="entered link" vehicle="3716370" link="103801" />
<event time="15386.0" type="left link" vehicle="3716370" link="103801" />
<event time="15386.0" type="entered link" vehicle="3716370" link="502211" />
<event time="15386.0" type="warmEmissionEvent" linkId="103801" vehicleId="3716370" NO2="0.0016834393054024187" CO2_TOTAL="5.211468969715323" NOx="0.010865835516688339" SO2="2.6488925864494008E-5" HC="0.0029077588002405412" CO="0.02157863109652191" FC="1.6554329969579966" PM="4.59119810564296E-4" NMHC="0.002754718863385776" />
<event time="15391.0" type="left link" vehicle="3716370" link="502211" />
</events>
In addition, I created the following list containing some links of interest.
closed_links = ["280959", "171962","7478","7477","335574","335575","7476","7475","7474","435947","254910","254911","294486","294487","172002","172003"
,"172004","172005","172000","172001","103801","294483","310984","310985","310982","310983","652344","255111","492823","537639","485764","485763"
,"639147","485766","485765","259614","259615","259612","259613","270874","244174","540827","658808","207","609975","609974","609973","537632"
,"537631","569248","345419","259731","557381","414858","573518","468058","83791","468029"]
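I want a table showing each person that has been registered on any of the closed_links (the attribute is called link in the XML). In the final table, each value of person should be unique. Including link in the output is not mandatory; I only want it as a quality check to see whether the code works.
So far my code produces no result, mainly because I don't know how to make it select only the events whose link matches any value in the list: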
import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
tree = ET.iterparse(gzip.open('V0_1pm/output_events.xml.gz', 'r'))
agents_o_i = defaultdict(list)
for xml_event, elem in tree:
    attributes = elem.attrib
    if elem.tag == 'event' and elem.attrib["link"] in closed_links:
        agents_o_i[attributes['person']].append(attributes['link'])
agents_o_i = pd.DataFrame.from_dict(agents_o_i, orient='index')
agents_o_i.to_csv("out/V1_10pct/traveltimes_V1.csv", sep=';')
Desired output:
person link
3716370 280959
Any help is much appreciated!
[Comments]:
-
The if block is the problem. Could you post the expected output for this example?
-
@alec_djinn Thanks for your input, please see the bottom of the updated question.
-
Link 103801 does not appear in any line that contains person. How could it end up in your output?
-
You're right, my mistake.
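A minimal sketch of one way the condition could be written, offered as an illustration and not as the definitive fix. In the posted loop, `elem.attrib["link"]` raises a `KeyError` for events that have no `link` attribute (for example `travelled`), and `attributes['person']` does the same for events such as `entered link` that carry only `vehicle`. The sketch uses `elem.get(...)`, which returns `None` when an attribute is absent, and keeps the first matching link per person so that each person appears once. The function name `persons_on_links` and the `SAMPLE` excerpt are hypothetical names introduced here; the excerpt is four events copied from the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical excerpt: four events copied from the question's XML.
SAMPLE = b"""<events version="1.0">
<event time="13834.0" type="departure" person="1537047" link="335909" legMode="car_passenger" />
<event time="14516.0" type="travelled" person="1537047" distance="9749.86232009391" />
<event time="15380.0" type="departure" person="3716370" link="280959" legMode="car" />
<event time="15381.0" type="entered link" vehicle="3716370" link="103801" />
</events>"""


def persons_on_links(source, links):
    """Stream <event> elements and return {person: first matching link}.

    `source` is a filename or a binary file object; this helper is an
    assumption for illustration, not part of the original code.
    """
    wanted = set(links)  # set membership avoids scanning the list per event
    seen = {}
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "event":
            person = elem.get("person")  # None if the attribute is missing
            link = elem.get("link")      # None if the attribute is missing
            if person is not None and link in wanted:
                seen.setdefault(person, link)  # one entry per person
            root.clear()  # drop processed children so memory stays bounded
    return seen


result = persons_on_links(io.BytesIO(SAMPLE), ["280959", "103801"])
print(result)  # {'3716370': '280959'}
```

For the real file, the same function could be called with `gzip.open('V0_1pm/output_events.xml.gz', 'rb')` as `source` and `closed_links` as `links`. The resulting dictionary can then be turned into the two-column table with `pd.DataFrame(list(result.items()), columns=["person", "link"])`. Note that link 103801 yields no row here because, as pointed out in the comments, the events on that link carry no `person` attribute.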
Tags: python xml pandas collections elementtree