Cleaning a text file after parsing a PDF: answers

【Question title】: Cleaning a text file after parsing a PDF
【Posted】: 2014-08-03 17:15:43
【Question】:

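I have parsed a PDF and cleaned it up as best I can, but I am stuck on aligning the information in the resulting text file.

My output looks like this: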

Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.

I would like it to look like this:

Zone:    1
Report Name:    ARREST
Incident Time:    01:41
Location of Occurrence:    1300 block Liverpool St
Neighborhood:    Highland Park
Incident:    14081898
Age:    27
Gender:    M
Section, Description:
3921(a): Theft by Unlawful Taking or Disposition - Movable item
3925: Receiving Stolen Property.
903: Criminal Conspiracy.

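I tried enumerating the list, but some fields are missing for some incidents, so the wrong information gets extracted.

Here is the code that parses the PDF: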

import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):

    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)

    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout

    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):

        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  


# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
# Write in binary mode so the PDF bytes are not altered by newline translation
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()

This is what I have so far:

import os

OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]    


if not os.path.exists("../out/"):
    os.makedirs("../out/")  
with open("../txt/20140731.txt", 'r') as file:
    blotterList = file.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any ([o in line for o in OddsnEnds]):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print ('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print ('Time:%s' % cleanList[i+1])

But enumerating gives me this output:

Time:16:20

Time:17:40

Time:17:53

Time:18:05

Time:Location of Occurrence

because no time was given for that incident. As a side note, every string ends with \n.

Any and all ideas and help are greatly appreciated.

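【Comments on the question】:

The downvote was not mine, but I don't quite understand your question either. Is the time printing the problem? Or do you want your PDF output to look like the second block?
My current output is the best example. I am trying to go through the list of strings and change the output to match the second example. I tried using enumerate to walk the list, restructure it, and then write it to a text file, but it breaks when a field is blank.
Could you please add the code that generates the top example? I think it would be best to change it there.
I edited the question to include the parsing code.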

Tags: python parsing pdf python-3.x text

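【Solution 1】:

My favorite way to scrape a PDF file to text is to use pdftotext (from the poppler utilities) with the -layout option. It does a very good job of preserving the original layout of the document.

You can call it from Python with the subprocess module.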
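As a rough illustration of that suggestion, here is a minimal Python 3 sketch. It assumes the `pdftotext` executable from poppler is installed and on the PATH; the helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example and are not part of any library.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path):
    """Build the argument list for `pdftotext -layout <pdf> <txt>`."""
    # -layout asks pdftotext to keep the physical layout of the page
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext (assumed to be on the PATH) and write txt_path.

    check=True makes subprocess raise CalledProcessError if the
    tool exits with a non-zero status.
    """
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
```

Calling `pdf_to_text("../pdf/20140731.pdf", "../txt/20140731.txt")` would then stand in for the `parsePDF` call in the question, with the paths taken from the question's own folder layout.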

【Comments】:

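【Solution 2】:

In general, extracting text from PDF files (especially when you want to preserve the formatting, spacing, or layout of the text) is a task that may not always be 100% accurate. I learned this a while ago, when I was working on a project in this area, from a support technician at a company that makes a popular library (xpdf) for extracting text from PDFs. At the time I explored several libraries for extracting text from PDFs, including xpdf and a few others. There are definite technical reasons why they cannot always give perfect results (though in many cases they do); these reasons have to do with the nature of the PDF format and with how PDFs are generated. When you extract text from some PDFs, the layout and spacing may not be preserved, even if you use a library option such as keep_format=True or its equivalent.

The only permanent solution to this problem is not to need to extract text from PDF files at all. Instead, whenever possible, work with the data format and data source from which the PDF was generated, and do your text extraction or manipulation on that. Of course, that is easier said than done if you do not have access to those sources.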

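【Comments】:

Poppler, mentioned by Roland Smith in a sibling answer, is a fork of xpdf, according to the Wikipedia article Roland linked to: en.wikipedia.org/wiki/Poppler_%28software%29
Understood, which is why I am trying to develop an algorithm that turns the output into something I can use. Modifying the parser is a moot point.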
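
One possible shape for such a post-processing step is sketched below in Python 3. Nothing here comes from the original posts: it assumes the field labels are exactly the ones shown in the question's sample, that every record starts with "Zone", and that no value is ever spelled exactly like a label. The names `LABELS`, `RECORD_START`, `group_records`, and `format_record` are hypothetical. Instead of taking "the line after the label" as the value, it collects every non-label line under the most recent label, so a missing value leaves an empty field rather than shifting the next label into it.

```python
from itertools import zip_longest

# Assumption: these are the only field labels, in display order, and no
# value is ever spelled exactly like one of them.
LABELS = ["Zone", "Report Name", "Incident Time", "Location of Occurrence",
          "Neighborhood", "Incident", "Age", "Gender", "Section", "Description"]
RECORD_START = "Zone"  # assumption: every record begins with this label


def group_records(lines):
    """Turn a flat list of lines into a list of {label: [values]} dicts."""
    records, current, label = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line in LABELS:
            # A new record starts at RECORD_START (or at the first label seen)
            if line == RECORD_START or current is None:
                current = {}
                records.append(current)
            label = line
            current[label] = []
        elif current is not None:
            # Any non-label line is a value of the most recent label
            current[label].append(line)
    return records


def format_record(record):
    """Render one record in the 'Label:    value' layout from the question."""
    out = []
    for label in LABELS[:-2]:  # every field except Section and Description
        value = " ".join(record.get(label, []))
        out.append(("%s:    %s" % (label, value)).rstrip())
    out.append("Section, Description:")
    pairs = zip_longest(record.get("Section", []),
                        record.get("Description", []), fillvalue="")
    for section, description in pairs:
        out.append("%s: %s" % (section, description))
    return "\n".join(out)
```

With the list from the question, something like `"\n\n".join(format_record(r) for r in group_records(cleanList))` could be written to the output file. Note that the question's filter drops lines containing "Report Name", so that entry would have to be removed from `OddsnEnds` for the label to reach this step.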