【Posted】:2021-01-21 05:10:18
【Question】:
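I have a text file with 6000 lines. Each line is a run of characters that encodes several columns. Example: 'AS202003402092020MF1003 EXESTBOPF 01163500116000 000120200381R000540000116000WC05 Watawala Tea Ceylon Ltd. 1M'
The expression above is a single string, and each column needs to be extracted separately, as follows: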
Borkername = AS
Sale year = 2020
Saleno = 0340
sale_dte = 20/9/2020 # date needs to be formatted
Factoryno = MF1003
Catalogu code = EXEST
Grade = BOPF
Gross weight = 01163.50 # decimal point needed
Net Weight = 01160.00 # decimal point needed
Lot_No = 0001
invoice_year = 2020
invoice_no = WC05
price = 000540.00 # decimal point needed
Netweight = 01160.00 # decimal point needed
Buyer = 'Watawala Tea Ceylon Ltd.'
Buyer_code = '1M'
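The two conversions noted in the list above (inserting an implied decimal point and formatting the date) can be sketched as small helpers. This is only an illustration: the names `implied_decimal` and `format_date` are hypothetical, the two-decimal-place convention is inferred from the examples in the list, and the date is assumed to be eight digits in DDMMYYYY order, which is how the `(.{8})` group in the pattern later in the question treats it. If the date is really seven characters wide, as the `Saleno = 0340` entry would imply, the format string needs adjusting.

```python
from datetime import datetime


def implied_decimal(digits, places=2):
    """Insert a decimal point `places` digits from the right, keeping leading zeros."""
    return f"{digits[:-places]}.{digits[-places:]}"


def format_date(ddmmyyyy):
    """Reformat an 8-digit DDMMYYYY string (assumed layout) as DD/MM/YYYY."""
    return datetime.strptime(ddmmyyyy, "%d%m%Y").strftime("%d/%m/%Y")


if __name__ == "__main__":
    print(implied_decimal("0116350"))   # gross weight digits from the example line
    print(implied_decimal("00054000"))  # price digits from the example line
    print(format_date("02092020"))
```

The helpers return strings so that leading zeros survive; wrapping the result of `implied_decimal` in `float()` gives a numeric value if that is preferred.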
I wrote some Python code that uses a regular expression to split each field into a column of a pandas DataFrame.
import re
import csv

### headings of the dataframe
headings = [
    "Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight",
    "Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]

re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')

with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)

    for line in f_input:
        fields = list(re_fields.match(line).groups())
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))
        csv_writer.writerow(fields)
Unfortunately, this code raises an error when I run it:
fields = list(re_fields.match(line).groups())
AttributeError: 'NoneType' object has no attribute 'groups'
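`re.match` returns `None` when the pattern does not match at the start of the string, so calling `.groups()` on the result raises exactly this `AttributeError`. A minimal way to find out which lines are responsible is to test for `None` before unpacking. The helper name `find_unmatched` and the toy pattern below are illustrative assumptions, not part of the original code.

```python
import re


def find_unmatched(pattern, lines):
    """Return (line_number, text) pairs for lines the pattern fails to match."""
    misses = []
    for number, line in enumerate(lines, start=1):
        if pattern.match(line) is None:
            misses.append((number, line.rstrip("\n")))
    return misses


if __name__ == "__main__":
    # Toy pattern and toy lines, only to show the check; substitute the real ones.
    demo = re.compile(r"(.{2})(\d{4}) (\w+)$")
    lines = ["AS2020 BOPF\n", "AS2020BOPF\n", "\n"]
    for number, text in find_unmatched(demo, lines):
        print(f"line {number} did not match: {text!r}")
```

Running the same check with `re_fields` over `input.txt` would show whether every line fails (a sign the pattern's column widths are off) or only some of them (blank lines, wrapped records, or optional fields).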
I hope someone can suggest the right way to do this and get rid of the error. A sample of the text file is attached:
AS202003402092020MF1003 EXESTBOPF 01163500116000 000120200381R000540000116000WC05 Watawala Tea Ceylon Ltd. 1M
AS202003402092020MF0663 EXESTBOPF 01123500112000 000420200165R000550000112000WC05 Watawala Tea Ceylon Ltd. 1M
AS202003402092020MF0069 EXESTBOP 00963500096000 000520200278R000570000096000CM01 Ceylon Tea Marketing Ltd. 1M
AS202003402092020MF0069 EXESTBOPF 01103500110000 000620200282R000580000110000CM01 Ceylon Tea Marketing Ltd. 1M
AS202003402092020MF0348 EXESTBOPF 01163500116000 000720200259R000570000116000CM01 Ceylon Tea Marketing Ltd. 1M
AS202003402092020MF0348 EXESTBOPF 01163500116000 000820200264R000560000116000TT01 Tea Tang (Pvt) Ltd 0M
AS202003402092020MF0703 EXESTBOPF 01123500112000 000920200193R000540000112000AB01 Akbar Brothers (Pvt) Ltd 1M
AS202003402092020MF0552 EXESTBOPF 01123500112000 001120200266 000520000112000AB01 Akbar Brothers (Pvt) Ltd 1M
AS202003402092020MF0294 EXESTBOP 01003500100000 001220200097R000560000100000UL01 Unilever Lipton Ceylon Ltd, Tea Division 1M
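Judging by the sample as posted, the pattern in the question does not line up with the data: for instance, it expects the catalogue code immediately after the factory number, while the sample has a space there, and the grade is sometimes three characters (`BOP`) rather than four. One possible approach is a verbose pattern with named groups that tolerates whitespace between blocks. The sketch below rests on several assumptions: each record sits on one physical line, the sale number is three digits followed by an eight-digit date (as in the question's pattern, not its field list), and the group names `ref_no`, `flag`, and `net_repeat` are hypothetical labels for columns the question does not name unambiguously.

```python
import re

# Assumed layout, inferred from the posted sample; widths may differ in the real file.
RECORD = re.compile(r"""
    (?P<broker>.{2})
    (?P<sale_year>\d{4})
    (?P<sale_no>\d{3})          # assumption: 3 digits, then an 8-digit date
    (?P<sale_date>\d{8})
    (?P<factory_no>.{6})
    \s+
    (?P<catalogue>.{5})
    (?P<grade>\S+)              # BOP or BOPF in the sample
    \s+
    (?P<gross>\d{7})
    (?P<net>\d{7})
    \s+
    (?P<lot_no>\d{4})
    (?P<invoice_year>\d{4})
    (?P<ref_no>\d{4})           # hypothetical name
    (?P<flag>.)                 # 'R' or a blank in the sample; hypothetical name
    (?P<price>\d{8})
    (?P<net_repeat>\d{7})       # hypothetical name for the second net weight
    (?P<invoice_no>\S{4})       # e.g. WC05, as labelled in the question
    \s+
    (?P<buyer>.*?)
    \s+
    (?P<buyer_code>\S{2})
    \s*$
""", re.VERBOSE)


def parse_record(line):
    """Return a dict of raw field strings, or None if the line does not fit the layout."""
    match = RECORD.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ("AS202003402092020MF0552 EXESTBOPF 01123500112000 "
              "001120200266 000520000112000AB01 Akbar Brothers (Pvt) Ltd 1M")
    print(parse_record(sample))
```

The function returns raw strings and leaves decimal and date conversion to the caller. Because it returns `None` instead of raising, the calling loop can log or skip lines that do not fit. The resulting dicts can be passed to `csv.DictWriter` or `pandas.DataFrame`.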
【Discussion】:
Tags: python regex pandas dataframe csv