PDFMiner - Iterating through pages and converting them to text

【Question title】: PDFMiner - Iterating through pages and converting them to text
【Posted】: 2014-02-02 12:37:30
【Question】:

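So I'm trying to get specific text out of some PDFs, and I'm using Python with PDFMiner, but I'm having some trouble because of the API change that happened in November 2013. Basically, to get the part of the text I want out of a PDF, I currently have to convert the whole file to text and then use string functions to pull out the part I want. What I want to do is loop through each page of the PDF and convert the pages to text one at a time. Then, once I've found the part I want, I'd stop it from reading that PDF.

I'll post the code that's sitting in my text editor atm, but it isn't a working version; it's more of a halfway-to-an-efficient-solution version :P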

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import LTChar, TextConverter
from pdfminer.layout import LAParams
from subprocess import call
from cStringIO import StringIO
import re
import sys
import os

argNum = len(sys.argv)
pdfLoc = str(sys.argv[1]) #CLI arguments

def convert_pdf_to_txt(path): #converts pdf to raw text (not my function)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

if (pdfLoc[-4:] == ".pdf"):
    contents = ""
    try: # Get the outlines (contents) of the document
        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        outlines = document.get_outlines()
        for (level,title,dest,a,se) in outlines:
            title = re.sub(r".*\s", "", title) #get raw titles, stripped of formatting
            contents += title + "\n"
    except: #if pdfMiner can't get contents then manually get contents from text conversion
        #contents = convert_pdf_to_txt(pdfLoc)
        #startToCpos = contents.find("TABLE OF CONTENTS")
        #endToCpos = contents.rfind(". . .")
        #contents = contents[startToCpos:endToCpos+8]

        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        pages = PDFPage(document, 3, {'Resources':'thing', 'MediaBox':'Thing'}) #God knows what's going on here
        for pageNumber, page in enumerate(pages.get_pages(PDFDocument, fp)): #The hell is the first argument?
            if pageNumber == 42:
                print "Hello"

        #for line in s:
        #   print line
        #   if (re.search("(\.\s){2,}", line) and not re.search("NOTES|SCOPE", line)):
        #       line = re.sub("(\.\s){2,}", "", line)
        #       line = re.sub("(\s?)*[0-9]*\n", "\n", line)
        #       line = re.sub("^\s", "", line)
        #       print line,


        #contents = contents.lower()
        #contents = re.sub("“", "\"", contents)
        #contents = re.sub("”", "\"", contents)
        #contents = re.sub("ﬁ", "f", contents)
        #contents = re.sub(r"(TABLE OF CONTENTS|LIST OF TABLES|SCOPE|REFERENCED DOCUMENTS|Identification|System (o|O)verview|Document (o|O)verview|Title|Page|Table|Tab)(\n)?|\.\s?|Section|[0-9]", "", contents)
        #contents = re.sub(r"This document contains proprietary information and may not be reproduced in any form whatsoever, nor may be used by or its contents divulged to third\nparties without written permission from the ownerAll rights reservedNumber:  STP SMEDate: -Jul-Issue: A  of CMC STPNHIndustriesCLASSIFICATION\nNATO UNCLASSIFIED                  AGUSTAEUROCOPTEREUROCOPTER DEUTSCHLAND                 FOKKER", "", contents)
        #contents = re.sub(r"(\r?\n){2,}", "", contents)
        #contents = contents.lstrip()
        #contents = contents.rstrip()
    #print contents
else:
    print "Not a valid PDF file"

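This is the old way of doing it (or at least I know how the old way worked; the thread wasn't very useful to me). But now I have to use PDFPage.get_pages instead of PDFDocument.get_pages, and the methods and arguments are completely different.

Currently, I'm trying to figure out what exactly the "Klass" variable is that I pass to PDFPage's get_pages method.

If anyone could shed some light on this part of the API, or even provide a working example, I'd be very grateful.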

【Comments】:

Tags: python pdf pdfminer

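【Answer 1】:

Maybe I'm late and you've already solved this, but still, for future reference:

After some searching, I was reminded of this link, from which I would point out the following part (relevant parts in bold):

Python decided to do methods in a way that makes the instance to which the method belongs be passed automatically, but not received automatically: the first parameter of methods is the instance the method is called on. That makes methods entirely the same as functions, and leaves the actual name to use up to you (although self is the convention, and people will generally frown at you when you use something else.) self is not special to the code, it's just another object.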
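To make the quoted point concrete for the "Klass" question, here is a minimal sketch. It assumes that `PDFPage.get_pages` is declared as a class method, which is what a first parameter named `klass` suggests. The `Page` class below is a hypothetical stand-in written for illustration, not PDFMiner's real `PDFPage`, and it is written for Python 3.

```python
class Page(object):
    """Toy stand-in for a page class (hypothetical, not PDFMiner's PDFPage)."""

    def __init__(self, number):
        self.number = number

    def describe(self):
        # Instance method: `self` is the instance the method was called on.
        return "page %d" % self.number

    @classmethod
    def get_pages(klass, count):
        # Class method: `klass` is the class itself and is supplied by Python.
        # The caller passes only `count`.
        for number in range(1, count + 1):
            yield klass(number)


pages = list(Page.get_pages(3))        # no explicit `klass` argument
print([p.describe() for p in pages])   # ['page 1', 'page 2', 'page 3']
print(Page.describe(pages[0]))         # page 1 (instance passed by hand)
```

Under that assumption, the call in the question would not pass anything for `klass`: it would be made on the class itself, with the open file object as the first visible argument.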

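【Comments】:

【Answer 2】:

Try PyPDF2. It is much simpler to use, and it isn't as unnecessarily feature-rich as PDFMiner (which is a good thing in your case). It does what you want, and it's super simple to implement.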

from PyPDF2 import PdfFileReader

PDF = PdfFileReader(file(pdf_fp, 'rb'))

if PDF.isEncrypted:
    decrypt = PDF.decrypt('')
    if decrypt == 0:
        print "Password Protected PDF: " + pdf_fp
        raise Exception("Nope")
    elif decrypt == 1 or decrypt == 2:
        print "Successfully Decrypted PDF"

for page in PDF.pages:
    print page.extractText()
    '''page.extractText() is the unicode string of the contents of the page
    And I am assuming you know how to play with a string and use regex
    If you find what you want just break like so:
    if some_condition == True:
        break'''

【Comments】:

PyPDF2 does not convert all of the content of a PDF page to text. It does not extract 100% of it.
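If PyPDF2's incomplete extraction is a problem, the page-by-page loop the question asks for can also be sketched with PDFMiner itself. The following is a rough sketch, not a tested drop-in: it assumes Python 3 with a recent pdfminer.six (where `TextConverter` accepts a text stream), and the helper names `iter_page_texts` and `find_first_page` are made up for illustration.

```python
from io import StringIO


def iter_page_texts(path):
    """Yield the text of each page of the PDF at `path`, one page at a time."""
    # Imported here so the rest of the sketch loads without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    with open(path, "rb") as fp:
        # get_pages is called on the class; only the file object is passed.
        for page in PDFPage.get_pages(fp):
            buf = StringIO()  # fresh buffer, so each yield is a single page
            device = TextConverter(rsrcmgr, buf, laparams=laparams)
            PDFPageInterpreter(rsrcmgr, device).process_page(page)
            device.close()
            yield buf.getvalue()


def find_first_page(page_texts, needle):
    """Return (page_number, text) for the first page containing `needle`, else None."""
    for number, text in enumerate(page_texts, start=1):
        if needle in text:
            return number, text  # returning here stops reading further pages
    return None
```

A call such as `find_first_page(iter_page_texts("manual.pdf"), "TABLE OF CONTENTS")` (with `manual.pdf` as a placeholder file name) would then convert pages only until the first match. Because `iter_page_texts` is a generator, later pages are never processed once the search returns.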