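【Posted】: 2019-05-12 18:21:21
【Question】:
I am using iterparse to parse a large XML file (1.8 GB) and write all of the data to a CSV file. The script I wrote runs fine, but for some reason it randomly skips values. Here is my script: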
import xml.etree.cElementTree as ET
import csv

xml_data_to_csv = open('Out2.csv', 'w', newline='', encoding='utf8')
Csv_writer = csv.writer(xml_data_to_csv, delimiter=';')

file_path = "Products_50_producten.xml"
context = ET.iterparse(file_path, events=("start", "end"))

EcommerceProductGuid = ""
ProductNumber = ""
Description = ""
ShopSalesPriceInc = ""
Barcode = ""
AvailabilityStatus = ""
Brand = ""

# turn it into an iterator
#context = iter(context)
product_tag = False

for event, elem in context:
    tag = elem.tag
    if event == 'start':
        if tag == "Product":
            product_tag = True
        elif tag == 'EcommerceProductGuid':
            EcommerceProductGuid = elem.text
        elif tag == 'ProductNumber':
            ProductNumber = elem.text
        elif tag == 'Description':
            Description = elem.text
        elif tag == 'SalesPriceInc':
            ShopSalesPriceInc = elem.text
        elif tag == 'Barcode':
            Barcode = elem.text
        elif tag == 'AvailabilityStatus':
            AvailabilityStatus = elem.text
        elif tag == 'Brand':
            Brand = elem.text
    if event == 'end' and tag == 'Product':
        product_tag = False
        List_nodes = []
        List_nodes.append(EcommerceProductGuid)
        List_nodes.append(ProductNumber)
        List_nodes.append(Description)
        List_nodes.append(ShopSalesPriceInc)
        List_nodes.append(Barcode)
        List_nodes.append(AvailabilityStatus)
        List_nodes.append(Brand)
        Csv_writer.writerow(List_nodes)
        print(EcommerceProductGuid)
        List_nodes.clear()
        EcommerceProductGuid = ""
        ProductNumber = ""
        Description = ""
        ShopSalesPriceInc = ""
        Barcode = ""
        AvailabilityStatus = ""
        Brand = ""
        elem.clear()

xml_data_to_csv.close()
The "Products_50_producten.xml" file has the following layout:
<?xml version="1.0" encoding="utf-16" ?>
<ProductExport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <ExportInfo>
    <ExportDateTime>2018-11-07T00:01:03+01:00</ExportDateTime>
    <Type>Incremental</Type>
    <ExportStarted>Automatic</ExportStarted>
  </ExportInfo>
  <Products>
    <Product><EcommerceProductGuid>4FB8A271-D33E-4501-9EB4-17CFEBDA4177</EcommerceProductGuid><ProductNumber>982301017</ProductNumber><Description>Ducati Jas Radiaal Zwart Xxl Heren Tekst - 982301017</Description><Brand>DUCATI</Brand><ProductVariations><ProductVariation><SalesPriceInc>302.2338</SalesPriceInc><Barcodes><Barcode BarcodeOrder="1">982301017</Barcode></Barcodes></ProductVariation></ProductVariations></Product>
    <Product><EcommerceProductGuid>4FB8A271-D33E-4501-9EB4-17CFEBDA4177</EcommerceProductGuid><ProductNumber>982301017</ProductNumber><Description>Ducati Jas Radiaal Zwart Xxl Heren Tekst - 982301017</Description><Brand>DUCATI</Brand><ProductVariations><ProductVariation><SalesPriceInc>302.2338</SalesPriceInc><Barcodes><Barcode BarcodeOrder="1">982301017</Barcode></Barcodes></ProductVariation></ProductVariations></Product>
  </Products>
</ProductExport>
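For example, if I copy the Product element 300 times, the "EcommerceProductGuid" value is left empty on line 155 of the CSV file. If I copy Product 400 times, the value is empty on lines 155, 310 and 368. How is this possible?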
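One plausible explanation, offered as a sketch rather than a confirmed diagnosis: the script reads elem.text on "start" events, and the ElementTree documentation says that a "start" event only guarantees the opening tag has been seen, so text and tail are undefined at that point. iterparse reads the file in fixed-size chunks, so an element whose opening tag and text fall into different chunks can appear with text still None. That would also fit the roughly regular spacing of the empty rows. The snippet below forces that situation with XMLPullParser; the shortened GUID and the hand-picked split point are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, split by hand in the middle of the GUID text to
# imitate a read-chunk boundary falling inside an element.
first_chunk = "<Product><EcommerceProductGuid>4FB8A271"
second_chunk = "-D33E</EcommerceProductGuid></Product>"

parser = ET.XMLPullParser(events=("start", "end"))

# After the first chunk only the opening tags have been processed.
parser.feed(first_chunk)
text_at_start = {}
for event, elem in parser.read_events():
    if event == "start":
        text_at_start[elem.tag] = elem.text

# Once the rest has arrived, the "end" events carry the complete text.
parser.feed(second_chunk)
parser.close()
text_at_end = {}
for event, elem in parser.read_events():
    if event == "end":
        text_at_end[elem.tag] = elem.text

print(text_at_start["EcommerceProductGuid"])  # None
print(text_at_end["EcommerceProductGuid"])    # 4FB8A271-D33E
```

The same element object is handed out for both events; only its text has been filled in by the time "end" arrives.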
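【Discussion】:
- There may be an empty elem.text in the EcommerceProductGuid tag. Add the condition if not elem.tag: print('Found empty elem.tag').
- Is it possible to read the line again with "if EcommerceProductGuid == None:"? For some reason it skips the line even when there is a value inside the "EcommerceProductGuid" tag.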
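Reading the value a second time is probably unnecessary. A minimal sketch, assuming the layout shown in the question: listen only for "end" events, when an element and all of its children have been fully parsed, and pull the fields out of each finished Product with findtext. The helper name product_rows, the FIELDS tuple and the in-memory sample are illustrative assumptions, not part of the original script.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tag names taken from the question's script; ".//" searches all descendants.
FIELDS = ("EcommerceProductGuid", "ProductNumber", "Description",
          "SalesPriceInc", "Barcode", "AvailabilityStatus", "Brand")

def product_rows(source):
    """Yield one list of field values per <Product> (hypothetical helper)."""
    # Only "end" events: the element and its children are complete by then.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Product":
            yield [elem.findtext(f".//{name}", default="") for name in FIELDS]
            elem.clear()  # drop the finished product's children

# Small in-memory stand-in for the real file (assumed sample data).
sample = io.BytesIO(
    b"<ProductExport><Products>"
    b"<Product><EcommerceProductGuid>4FB8A271</EcommerceProductGuid>"
    b"<ProductNumber>982301017</ProductNumber><Brand>DUCATI</Brand>"
    b"<ProductVariations><ProductVariation>"
    b"<SalesPriceInc>302.2338</SalesPriceInc>"
    b"<Barcodes><Barcode BarcodeOrder='1'>982301017</Barcode></Barcodes>"
    b"</ProductVariation></ProductVariations></Product>"
    b"</Products></ProductExport>"
)

out = io.StringIO()  # for real use: open('Out2.csv', 'w', newline='', encoding='utf8')
writer = csv.writer(out, delimiter=";")
rows = list(product_rows(sample))
writer.writerows(rows)
print(out.getvalue())
```

Fields missing from a product (Description and AvailabilityStatus in this sample) come out as empty strings. For the real file, passing the file path to product_rows should work the same way, since iterparse accepts either a filename or a file object. The cleared Product elements stay attached to their parent, so memory use still grows slowly over a very large file.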
Tags: python python-3.x xml-parsing elementtree