Python iterparse is skipping values: answers

【Question title】: Python iterparse is skipping values
【Posted】: 2019-05-12 18:21:21
【Question】:

I am using iterparse to parse a large XML file (1.8 GB) and write all the data to a CSV file. The script I wrote runs fine, but for some reason it randomly skips values. Here is my script:

# cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET
import csv
xml_data_to_csv =open('Out2.csv','w', newline='', encoding='utf8')
Csv_writer=csv.writer(xml_data_to_csv, delimiter=';')

file_path = "Products_50_producten.xml"
context = ET.iterparse(file_path, events=("start", "end"))

EcommerceProductGuid = ""
ProductNumber = ""
Description = ""
ShopSalesPriceInc = ""
Barcode = ""
AvailabilityStatus = ""
Brand = ""
# turn it into an iterator
#context = iter(context)
product_tag = False
for event, elem in context:
    tag = elem.tag

    if event == 'start' :
        if tag == "Product" :
            product_tag = True

        elif tag == 'EcommerceProductGuid' :
            EcommerceProductGuid = elem.text

        elif tag == 'ProductNumber' :
            ProductNumber = elem.text

        elif tag == 'Description' :
            Description = elem.text

        elif tag == 'SalesPriceInc' :
            ShopSalesPriceInc = elem.text

        elif tag == 'Barcode' :
            Barcode = elem.text

        elif tag == 'AvailabilityStatus' :
            AvailabilityStatus = elem.text


        elif tag == 'Brand' :
            Brand = elem.text

    if event == 'end' and tag =='Product' :
        product_tag = False
        List_nodes = []
        List_nodes.append(EcommerceProductGuid)
        List_nodes.append(ProductNumber)
        List_nodes.append(Description)
        List_nodes.append(ShopSalesPriceInc)
        List_nodes.append(Barcode)
        List_nodes.append(AvailabilityStatus)
        List_nodes.append(Brand)
        Csv_writer.writerow(List_nodes)
        print(EcommerceProductGuid)
        List_nodes.clear()
        EcommerceProductGuid = ""
        ProductNumber = ""
        Description = ""
        ShopSalesPriceInc = ""
        Barcode = ""
        AvailabilityStatus = ""
        Brand = ""

    elem.clear()


xml_data_to_csv.close()

The layout of the "Products_50_producten.xml" file is as follows:

<?xml version="1.0" encoding="utf-16" ?>
<ProductExport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<ExportInfo>
<ExportDateTime>2018-11-07T00:01:03+01:00</ExportDateTime>
<Type>Incremental</Type>
<ExportStarted>Automatic</ExportStarted>
</ExportInfo>
<Products>
<Product><EcommerceProductGuid>4FB8A271-D33E-4501-9EB4-17CFEBDA4177</EcommerceProductGuid><ProductNumber>982301017</ProductNumber><Description>Ducati Jas Radiaal Zwart Xxl Heren Tekst - 982301017</Description><Brand>DUCATI</Brand><ProductVariations><ProductVariation><SalesPriceInc>302.2338</SalesPriceInc><Barcodes><Barcode BarcodeOrder="1">982301017</Barcode></Barcodes></ProductVariation></ProductVariations></Product>
<Product><EcommerceProductGuid>4FB8A271-D33E-4501-9EB4-17CFEBDA4177</EcommerceProductGuid><ProductNumber>982301017</ProductNumber><Description>Ducati Jas Radiaal Zwart Xxl Heren Tekst - 982301017</Description><Brand>DUCATI</Brand><ProductVariations><ProductVariation><SalesPriceInc>302.2338</SalesPriceInc><Barcodes><Barcode BarcodeOrder="1">982301017</Barcode></Barcodes></ProductVariation></ProductVariations></Product>
</Products>
</ProductExport>

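For example, if I copy the Product element 300 times, the "EcommerceProductGuid" value is left empty on row 155 of the CSV file. If I copy Product 400 times, rows 155, 310 and 368 end up with an empty value. How is this possible?

【Comments】:

There is probably an empty elem.text in the EcommerceProductGuid tag. Add the condition if not elem.tag: print('Found empty elem.tag').
Is it possible to read the line again with "if EcommerceProductGuid == None:"? For some reason it skips the line even though the information is inside the "EcommerceProductGuid" tag.

Tags: python python-3.x xml-parsing elementtree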

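【Solution 1】:

I think the problem is with if event == 'start'.

According to other questions/answers, the content of the text attribute is not guaranteed to be defined at that point. The ElementTree documentation says the same: when iterparse emits a "start" event, only the element's attributes are guaranteed to be defined; text, tail and the children may not have been read yet, and "end" events should be used when a fully populated element is needed.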
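
The effect can be reproduced in miniature with xml.etree.ElementTree.XMLPullParser, which lets the caller decide where the input is split. This is only a sketch and not the asker's script: the two-chunk split and the shortened Guid tag are made up for illustration, and the exact moment at which text becomes available is an implementation detail of the parser.

```python
import xml.etree.ElementTree as ET

# Feed the document in two pieces so that the split lands in the
# middle of the element's text, as a read-buffer boundary might.
parser = ET.XMLPullParser(events=("start", "end"))
seen = []

for chunk in ("<Product><Guid>4FB8", "A271</Guid></Product>"):
    parser.feed(chunk)
    for event, elem in parser.read_events():
        # Record elem.text as it looks at the moment the event is handled.
        seen.append((event, elem.tag, elem.text))

parser.close()

for row in seen:
    print(row)

# Text observed for <Guid> at its "start" event and at its "end" event.
start_text = next(t for e, tag, t in seen if e == "start" and tag == "Guid")
end_text = next(t for e, tag, t in seen if e == "end" and tag == "Guid")
```

With CPython's ElementTree, the "start" entry for Guid would be expected to show no text yet, because the characters read so far have not been attached to the element, while the "end" entry carries the complete value. In a large file the same thing can happen wherever a read buffer happens to end inside an element's text, which would explain why only a few scattered rows are affected.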

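However, it does not seem to be as simple as switching to if event == 'end'. When I tried that myself, I got more blank fields than populated ones. (Update: if I remove events=("start", "end") from the iterparse call, using event == 'end' does work.)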
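
A likely reason for the extra blanks: with events=("start", "end"), the loop's elem.clear() also runs on every "start" event and wipes any text the parser had already attached, so nothing is left by the time the "end" event arrives. The following is a minimal sketch of an end-only variant, assuming the sample layout from the question. The helper name iter_products, the FIELDS tuple and the in-memory test document are illustrative and not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

FIELDS = ("EcommerceProductGuid", "ProductNumber", "Description",
          "SalesPriceInc", "Barcode", "AvailabilityStatus", "Brand")


def iter_products(source):
    """Yield one dict per <Product>, reading text only on "end" events."""
    values = dict.fromkeys(FIELDS, "")
    # The default events value is ("end",): each element is complete when seen.
    for _event, elem in ET.iterparse(source):
        if elem.tag in FIELDS and elem.text:
            values[elem.tag] = elem.text
        elif elem.tag == "Product":
            yield values
            values = dict.fromkeys(FIELDS, "")
            # Free the finished product's subtree; nothing is cleared
            # before its own "end" event has been handled.
            elem.clear()


# Hypothetical in-memory document with 500 products (several read buffers).
product = ("<Product><EcommerceProductGuid>GUID-{0}</EcommerceProductGuid>"
           "<ProductNumber>{0}</ProductNumber><Brand>DUCATI</Brand></Product>")
xml_bytes = ("<ProductExport><Products>"
             + "".join(product.format(i) for i in range(500))
             + "</Products></ProductExport>").encode("utf-8")

rows = list(iter_products(io.BytesIO(xml_bytes)))
print(len(rows), rows[0]["EcommerceProductGuid"], rows[-1]["EcommerceProductGuid"])
```

Clearing only the finished Product keeps memory use low for the field data, although the emptied Product elements stay attached to their parent until parsing ends.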

What I ended up doing was to ignore the event type entirely and only test whether text is populated.

Updated code...

# cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET
import csv

xml_data_to_csv = open('Out2.csv', 'w', newline='', encoding='utf8')
Csv_writer = csv.writer(xml_data_to_csv, delimiter=';')

file_path = "Products_50_producten.xml"
context = ET.iterparse(file_path, events=("start", "end"))

EcommerceProductGuid = ""
ProductNumber = ""
Description = ""
ShopSalesPriceInc = ""
Barcode = ""
AvailabilityStatus = ""
Brand = ""
for event, elem in context:
    tag = elem.tag
    text = elem.text

    if tag == 'EcommerceProductGuid' and text:
        EcommerceProductGuid = text

    elif tag == 'ProductNumber' and text:
        ProductNumber = text

    elif tag == 'Description' and text:
        Description = text

    elif tag == 'SalesPriceInc' and text:
        ShopSalesPriceInc = text

    elif tag == 'Barcode' and text:
        Barcode = text

    elif tag == 'AvailabilityStatus' and text:
        AvailabilityStatus = text

    elif tag == 'Brand' and text:
        Brand = text

    if event == 'end' and tag == "Product":
        List_nodes = []
        List_nodes.append(EcommerceProductGuid)
        List_nodes.append(ProductNumber)
        List_nodes.append(Description)
        List_nodes.append(ShopSalesPriceInc)
        List_nodes.append(Barcode)
        List_nodes.append(AvailabilityStatus)
        List_nodes.append(Brand)
        Csv_writer.writerow(List_nodes)
        print(EcommerceProductGuid)
        List_nodes.clear()
        EcommerceProductGuid = ""
        ProductNumber = ""
        Description = ""
        ShopSalesPriceInc = ""
        Barcode = ""
        AvailabilityStatus = ""
        Brand = ""

    elem.clear()

xml_data_to_csv.close()

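This seems to work with my test file of 300 Product elements.

Also, I think you could simplify the code by using a dictionary and csv.DictWriter.

Example (produces the same output as the code above)...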

# cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET
import csv
from copy import deepcopy

field_names = ['EcommerceProductGuid', 'ProductNumber', 'Description',
               'SalesPriceInc', 'Barcode', 'AvailabilityStatus', 'Brand']

values_template = {'EcommerceProductGuid': "",
                   'ProductNumber': "",
                   'Description': "",
                   'SalesPriceInc': "",
                   'Barcode': "",
                   'AvailabilityStatus': "",
                   'Brand': ""}

with open('Out2.csv', 'w', newline='', encoding='utf8') as xml_data_to_csv:

    csv_writer = csv.DictWriter(xml_data_to_csv, delimiter=';', fieldnames=field_names)

    file_path = "Products_50_producten.xml"
    context = ET.iterparse(file_path, events=("start", "end"))

    values = deepcopy(values_template)

    for event, elem in context:
        tag = elem.tag
        text = elem.text

        if tag in field_names and text:
            values[tag] = text

        if event == 'end' and tag == "Product":
            csv_writer.writerow(values)
            print(values.get('EcommerceProductGuid'))
            values = deepcopy(values_template)

        elem.clear()

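【Comments】:

Great, thanks a lot! I tested both and both work. I tested with 200,000 products and the second script was a bit faster: the first script took 57 seconds, the second 51 seconds. I added "values.clear()" between "values = deepcopy(values_template)" and "elem.clear()" because some fields are sometimes empty.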
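
For products where some fields are missing, csv.DictWriter can fill the gaps itself: keys absent from a row are written as the writer's restval, which defaults to an empty string. The sketch below uses made-up records and a shortened field list (both are assumptions for illustration), and calls writeheader() only to make the columns visible.

```python
import csv
import io

field_names = ["EcommerceProductGuid", "ProductNumber", "Brand"]

# Made-up records; the second one has no Brand key at all.
records = [
    {"EcommerceProductGuid": "A1", "ProductNumber": "1", "Brand": "DUCATI"},
    {"EcommerceProductGuid": "B2", "ProductNumber": "2"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=field_names,
                        delimiter=";", restval="")
writer.writeheader()
for record in records:
    # Missing keys are written as restval instead of raising an error.
    writer.writerow(record)

csv_text = buffer.getvalue()
print(csv_text)
```

This would also explain why an emptied dict still produces a complete row: every column that has no value for the current product falls back to restval.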

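【Solution 2】:

For what it's worth, and for anyone who may be searching: the answer above also applies to the lxml library's iterparse(). I ran into a similar problem with lxml, gave this a try, and it works almost exactly the same way.

At random points, a start event fires before the element's text has been picked up, so the text is missing if you read it there. Reading the value in the end event instead seems to have fixed my problem with large XML files. Daniel Haley's check for whether text is present looks like an extra layer of protection on top of that.

【Comments】: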