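【Posted】: 2021-06-30 12:53:42
【Question】:
Task:
The PDF is a bank statement containing the columns Date, Description, Deposits, Withdrawals, and Balance. Parse each column into its own field and export the data in CSV format.
My code: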
import pdftotext
import re
import csv
# open PDF file
with open('test.pdf', 'rb') as pdf_file:
    pdf = pdftotext.PDF(pdf_file)
# extract tabular text
lines = pdf[2].split('\n')[4:]
# CSV table
table = []
# loop over lines in table
for line in lines:
    # replace trailing spaces with commas
    row = re.sub(' ', ',', line)
    # reduce runs of commas to a single comma
    row = [cols.strip() for cols in re.sub(',+', ',', row).split(',')]
    # handle missed separators
    row = ','.join(row).replace(' ', ',').split(',')
    # append row to table
    table.append(row)
print(table)
# write CSV output
with open('test.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(table)
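Problem:
I am not getting the desired output: half of the description ends up under the Date column. I have attached the CSV for further reference here.
Expected output:
For example
['04/02','Chrysler Capital Payment 0023582513','$469.88-','$51.15']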
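One way a row of that shape could be produced is to split each line on runs of two or more spaces, so that the single spaces inside a description are left alone. The sketch below is only an illustration under that assumption; the helper name `split_statement_line` and the sample line are hypothetical and are not taken from the actual PDF.

```python
import re


def split_statement_line(line):
    """Split one line of layout-preserving text into columns.

    Assumption: columns are separated by two or more whitespace
    characters, while words inside a description are separated
    by exactly one space.
    """
    return [col for col in re.split(r'\s{2,}', line.strip()) if col]


# Hypothetical line, shaped like the expected output in the question
sample = '04/02    Chrysler Capital Payment 0023582513      $469.88-     $51.15'
print(split_statement_line(sample))
```

Note that this approach cannot tell which column is missing when a cell is empty (for example, a row with a withdrawal but no deposit), so a real statement may need slicing by character position instead.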
【Discussion】:
Tags: python pdf-scraping