Parse XML with XSD to CSV using Python? Answers

【Question title】: Parse xml w/ xsd to CSV with Python?
【Posted】: 2020-03-28 16:41:41
【Question description】:

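I am trying to parse a very large XML file that I downloaded from the OSHA website and convert it to CSV, so that I can use it in a SQLite database alongside some other spreadsheets. I would just use an online converter, but the OSHA file is apparently too large for all of them.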

I wrote a script in Python that looks like this:

import csv
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.parse('data.xml')
root = tree.getroot()

xml_data_to_csv =open('Out.csv', 'w')

list_head=[]

Csv_writer=csv.writer(xml_data_to_csv)

count=0
for element in root.findall('data'): 
    List_nodes =[]

    if count== 0:
        inspection_number = element.find('inspection_number').tag
        list_head.append(inspection_number)
        
        establishment_name = element.find('establishment_name').tag
        list_head.append(establishment_name)
        
        city = element.find('city')
        list_head.append(city)

        state = element.find('state')
        list_head.append(state)
        
        zip_code = element.find('zip_code')
        list_head.append(zip_code)
        
        sic_code = element.find('sic_code')
        list_head.append(sic_code)
        
        naics_code = element.find('naics_code')
        list_head.append(naics_code)
        
        sampling_number = element.find('sampling_number')
        list_head.append(sampling_number)
        
        office_id = element.find('office_id')
        list_head.append(office_id)
        
        date_sampled = element.find('date_sampled')
        list_head.append(date_sampled)
        
        date_reported = element.find('date_reported')
        list_head.append(date_reported)
        
        eight_hour_twa_calc = element.find('eight_hour_twa_calc')
        list_head.append(eight_hour_twa_calc)
        
        instrument_type = element.find('instrument_type')
        list_head.append(instrument_type)
        
        lab_number = element.find('lab_number')
        list_head.append(lab_number)
        
        field_number = element.find('field_number')
        list_head.append(field_number)
        
        sample_type = element.find('sample_type')
        list_head.append(sample_type)
        
        blank_used = element.find('blank_used')
        list_head.append(blank_used)
        
        time_sampled = element.find('time_sampled')
        list_head.append(time_sampled)
        
        air_volume_sampled = element.find('air_volume_sampled')
        list_head.append(air_volume_sampled)
        
        sample_weight = element.find('sample_weight')
        list_head.append(sample_weight)
        
        imis_substance_code = element.find('imis_substance_code')
        list_head.append(imis_substance_code)
        
        substance = element.find('substance')
        list_head.append(substance)
        
        sample_result = element.find('sample_result')
        list_head.append(sample_result)
        
        unit_of_measurement = element.find('unit_of_measurement')
        list_head.append(unit_of_measurement)
        
        qualifier = element.find('qualifier')
        list_head.append(qualifier)

        Csv_writer.writerow(list_head)
        count = +1

    inspection_number = element.find('inspection_number').text
    List_nodes.append(inspection_number)

    establishment_name = element.find('establishment_name').text
    List_nodes.append(establishment_name)

    city = element.find('city').text
    List_nodes.append(city)

    state = element.find('state').text
    List_nodes.append(state)

    zip_code = element.find('zip_code').text
    List_nodes.append(zip_code)    

    sic_code = element.find('sic_code').text
    List_nodes.append(sic_code)

    naics_code = element.find('naics_code').text
    List_nodes.append(naics_code)

    sampling_number = element.find('sampling_number').text
    List_nodes.append(sampling_number)

    office_id = element.find('office_id').text
    List_nodes.append(office_id)

    date_sampled = element.find('date_sampled').text
    List_nodes.append(date_sampled)

    date_reported = element.find('date_reported').text
    List_nodes.append(date_reported)

    eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
    List_nodes.append(eight_hour_twa_calc)    
    
    instrument_type = element.find('instrument_type').text
    List_nodes.append(instrument_type)

    lab_number = element.find('lab_number').text
    List_nodes.append(lab_number)

    field_number = element.find('field_number').text
    List_nodes.append(field_number)

    sample_type = element.find('sample_type').text
    List_nodes.append(sample_type)

    blank_used = element.find('blank_used').text
    List_nodes.append(blank_used)

    time_sampled = element.find('time_sampled').text
    List_nodes.append(time_sampled)

    air_volume_sampled = element.find('air_volume_sampled').text
    List_nodes.append(air_volume_sampled)    
    
    sample_weight = element.find('sample_weight').text
    List_nodes.append(sample_weight)

    imis_substance_code = element.find('imis_substance_code').text
    List_nodes.append(imis_substance_code)

    substance = element.find('substance').text
    List_nodes.append(substance)

    sample_result = element.find('sample_result').text
    List_nodes.append(sample_result)

    unit_of_measurement = element.find('unit_of_measurement').text 
    List_nodes.append(unit_of_measurement)

    qualifier= element.find('qualifier').text
    List_nodes.append(qualifier)

    Csv_writer.writerow(List_nodes)

xml_data_to_csv.close()

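But when I run the code, I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not entirely sure.

Does anyone know what the problem is?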

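【Comments】:

The details you skipped matter: what does "very large" mean? If you add a few simple print statements, or use a debugger, do you actually see any results coming back from root.findall('data')? Can you include a minimal XML example that reproduces the problem with the code in your question? Without that example it is hard to see what you might be doing wrong.
But I can't see any reason why the xsd would affect this. If nothing else, try removing the xsd file and see whether you get the same behavior.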
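
The print-statement check suggested above could look roughly like the sketch below. It is only an illustration: the inline `sample` string is a made-up stand-in for the real file, and its element names (`main`, `DATA_RECORD`, `lab_number`) are assumptions borrowed from the OSHA sample quoted in the answer further down. With a real file, `root` would come from `ET.parse('data.xml').getroot()` instead.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the real file; names follow the OSHA sample in the answer.
sample = '''<main>
  <DATA_RECORD><lab_number>L13645</lab_number></DATA_RECORD>
  <DATA_RECORD><lab_number>L13355</lab_number></DATA_RECORD>
</main>'''

root = ET.fromstring(sample)

# Show what the parser actually sees at the top of the tree.
print(root.tag)
child_tags = sorted({child.tag for child in root})
print(child_tags)

# findall() matches on tag name, so a name that never occurs matches nothing
# and the for-loop body in the question would never run.
data_matches = root.findall('data')
record_matches = root.findall('DATA_RECORD')
print(len(data_matches), len(record_matches))
```

If the first count is zero, the loop in the question never executes, which would be consistent with an empty CSV.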

Tags: python xml csv

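【Solution 1】:

The code below is a "compact" version of your code.

It assumes the XML structure looks like the one in the script variable xml (based on https://www.osha.gov/opengov/sample_data_2011.zip).

The main difference between this sample code and yours is that I define the fields I want to collect once (see FIELDS) and use that definition throughout the script.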

import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # TODO add more fields

xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
  <DATA_RECORD>
    <inspection_number>316180165</inspection_number>
    <establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
    <city>EUFAULA</city>
    <state>AL</state>
    <zip_code>36027</zip_code>
    <sic_code>1799</sic_code>
    <naics_code>238990</naics_code>
    <sampling_number>434866166</sampling_number>
    <office_id>418600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>TBD</instrument_type>
    <lab_number>L13645</lab_number>
    <field_number>S1</field_number>
    <sample_type>B</sample_type>
    <blank_used>N</blank_used>
    <time_sampled></time_sampled>
    <air_volume_sampled></air_volume_sampled>
    <sample_weight></sample_weight>
    <imis_substance_code>S777</imis_substance_code>
    <substance>Soil</substance>
    <sample_result>0</sample_result>
    <unit_of_measurement>AAAAA</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD>
  <DATA_RECORD>
    <inspection_number>315516757</inspection_number>
    <establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
    <city>WORCESTER</city>
    <state>MA</state>
    <zip_code>1608</zip_code>
    <sic_code>1771</sic_code>
    <naics_code>238110</naics_code>
    <sampling_number>423259902</sampling_number>
    <office_id>112600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>GRAV</instrument_type>
    <lab_number>L13355</lab_number>
    <field_number>9831B</field_number>
    <sample_type>P</sample_type>
    <blank_used>N</blank_used>
    <time_sampled>184</time_sampled>
    <air_volume_sampled>340.4</air_volume_sampled>
    <sample_weight>.06</sample_weight>
    <imis_substance_code>9135</imis_substance_code>
    <substance>Particulates not otherwise regulated (Total Dust)</substance>
    <sample_result>0.176</sample_result>
    <unit_of_measurement>M</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD></main>'''

root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
with open('out.csv', 'w') as out:
    out.write(','.join(FIELDS) + '\n')
    for record in records:
        values = [record.find(f).text or '' for f in FIELDS]  # empty elements have .text None
        out.write(','.join(values) + '\n')

out.csv

lab_number,instrument_type
L13645,TBD
L13355,GRAV
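
Since the question mentions a very large file, a streaming variant may be worth considering. The sketch below is not part of the original answer: the helper name `xml_to_csv` and its `record_tag` parameter are hypothetical, the inline bytes are a shortened stand-in for the OSHA data, and `io.StringIO` stands in for a real output file. It relies on `ET.iterparse`, which yields elements as they are read instead of loading the whole tree first.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']

def xml_to_csv(source, out, fields=FIELDS, record_tag='DATA_RECORD'):
    """Hypothetical helper: stream records from source, write one CSV row each."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(fields)
    count = 0
    # With events=('end',), each element is yielded once its closing tag is read.
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == record_tag:
            # findtext() gives '' for empty elements and the default for absent ones.
            writer.writerow([elem.findtext(f, default='') for f in fields])
            count += 1
            # Drop the record's children; the emptied element stays attached to the root.
            elem.clear()
    return count

# Shortened, assumed sample; a real run would pass a file name such as 'data.xml'.
sample = io.BytesIO(
    b'<main>'
    b'<DATA_RECORD><instrument_type>TBD</instrument_type>'
    b'<lab_number>L13645</lab_number></DATA_RECORD>'
    b'<DATA_RECORD><instrument_type>GRAV</instrument_type>'
    b'<lab_number>L13355</lab_number></DATA_RECORD>'
    b'</main>'
)
out_buffer = io.StringIO()
n = xml_to_csv(sample, out_buffer)
print(n)
print(out_buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with `newline=''`, for example `open('out.csv', 'w', newline='')`.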

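【Comments】:

You should probably split the code into two code blocks to show how compact the actual code is. Right now, the 7 lines of actual code are hidden by the long input string.
I think the "main difference" is that you are looking for .//DATA_RECORD, whereas the OP is looking for data. That difference would fully explain why the OP gets an empty CSV. By the way, you really should use csv to generate the csv file, because csv.writer, for example, will correctly handle fields that contain commas or embedded quote characters.
@barny Did you compare the number of lines of code in the two solutions? The gap is huge.
Er, with all due respect: so what? The OP didn't ask for a one-liner; the question is about functionality, not size. And using csv.writer doesn't add much code and would be more robust.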
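
To illustrate the point about csv.writer made in the comments, here is a minimal sketch. It reuses one establishment name from the sample data in the answer, which happens to contain a comma. The trimmed `xml` string, the choice of `FIELDS`, and the use of `io.StringIO` in place of a real file are assumptions made for the sake of a short example.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ['establishment_name', 'time_sampled']

# Trimmed copy of one record from the answer's sample; note the comma in the name.
xml = '''<main>
  <DATA_RECORD>
    <establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
    <time_sampled></time_sampled>
  </DATA_RECORD>
</main>'''

root = ET.fromstring(xml)

buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(FIELDS)
for record in root.findall('.//DATA_RECORD'):
    # findtext() returns '' for an empty element, so no None reaches the writer.
    writer.writerow([record.findtext(f, default='') for f in FIELDS])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows the comma stayed inside a single field.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])
```

With a plain `','.join(values)`, the same record would be split into three columns by any CSV reader, because the comma inside the name would not be quoted.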