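[Posted]: 2014-08-03 17:15:43
[Question]:
I have parsed a PDF and cleaned it up as best I can, but I am stuck on lining up the information in the resulting text file.
My output currently looks like this: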
Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.
I would like it to look like this:
Zone: 1
Report Name: ARREST
Incident Time: 01:41
Location of Occurrence: 1300 block Liverpool St
Neighborhood: Highland Park
Incident: 14081898
Age: 27
Gender: M
Section, Description:
3921(a): Theft by Unlawful Taking or Disposition - Movable item
3925: Receiving Stolen Property.
903: Criminal Conspiracy.
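I tried enumerating the list, but the problem is that some fields are missing from some incidents, so indexing relative to a label picks up the wrong information.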
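One way to sidestep the missing-field problem is to stop relying on fixed offsets and instead treat every known label as the start of a new field, collecting whatever non-label lines follow it. The following is only a minimal sketch under stated assumptions: `FIELD_LABELS`, `group_fields`, and `format_fields` are hypothetical names, the label set is copied from the sample output above and may not be exhaustive, and the input is assumed to hold a single incident.

```python
# Labels copied from the sample output; an assumption, not an exhaustive list.
FIELD_LABELS = {
    "Zone", "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender", "Section", "Description",
}


def group_fields(lines):
    """Return (label, values) pairs; values is empty when a field is blank."""
    fields = []
    for raw in lines:
        text = raw.strip()          # drops the trailing '\n'
        if not text:
            continue                # skip blank lines
        if text in FIELD_LABELS:
            fields.append((text, []))       # a label opens a new field
        elif fields:
            fields[-1][1].append(text)      # anything else belongs to the last label
    return fields


def format_fields(fields):
    """Render one incident as 'Label: value' lines, pairing Section with Description."""
    by_label = dict(fields)
    out = []
    for label, values in fields:
        if label in ("Section", "Description"):
            continue                # handled together below
        out.append(("%s: %s" % (label, " ".join(values))).rstrip())
    sections = by_label.get("Section", [])
    descriptions = by_label.get("Description", [])
    if sections or descriptions:
        out.append("Section, Description:")
        # zip stops at the shorter list, so unmatched entries are dropped
        for section, description in zip(sections, descriptions):
            out.append("%s: %s" % (section, description))
    return out
```

A blank field simply produces a label with nothing after the colon, rather than shifting every later value. The sketch would misfire if a value happened to be identical to a label, and a file with several incidents would first need to be split, for example at each `Zone` line.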
This is the code that parses the PDF:
import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)
    # write to the given file, or fall back to stdout
    if outfile:
        outfp = open(outfile, 'w+')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = open(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                  maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return

# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file (binary mode, since it is a PDF)
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
f = open("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = open("../txt/" + yesterday_short + ".txt", "r").read()
This is what I have so far:
import os

OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]

if not os.path.exists("../out/"):
    os.makedirs("../out/")

with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any([o in line for o in OddsnEnds]):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print('Time:%s' % cleanList[i+1])
But the enumerate approach gives me this output:
Time:16:20
Time:17:40
Time:17:53
Time:18:05
Time:Location of Occurrence
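because no time was given for that incident, so the line after the label is the next label. As a side note, every string ends with \n.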
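If the enumerate approach is kept, one possible guard is to check whether the line following a label is itself a label (or is missing) before treating it as the value. This is a rough sketch, not a tested fix for the real blotter files: `KNOWN_LABELS` and `values_after` are hypothetical names, and the label set is assumed from the sample output.

```python
# Assumed label set, taken from the sample output; extend it for real data.
KNOWN_LABELS = {
    "Zone", "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender", "Section", "Description",
}


def values_after(lines, label, default=""):
    """Yield the value after each occurrence of label, or default if the field is blank."""
    for i, line in enumerate(lines):
        if line.strip() != label:
            continue
        # Look at the next line, if there is one, without its trailing '\n'.
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if not following or following in KNOWN_LABELS:
            yield default       # the next line is another label: the value is missing
        else:
            yield following
```

With this, something like `for t in values_after(cleanList, 'Incident Time', 'unknown'): print('Time:%s' % t)` would print a placeholder for the incident without a time instead of `Location of Occurrence`.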
Any ideas and help are greatly appreciated.
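[Comments]:
-
I am not the one who downvoted, but I don't quite understand your question either. Is the time printing the problem? Or do you want your PDF output to look like the second block?
-
My current output is the best example. I am trying to walk through the list of strings and change the output into the second example. I tried using enumerate to iterate over the list, restructure it, and then write it to a text file, but it breaks when a field is blank.
-
Could you please add the code that generates the top example? I think it would be best to change it there.
-
I have edited the question to include the parsing code.
Tags: python parsing pdf python-3.x text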