Reading a PDF containing Polish characters with PyPDF2

【Question title】: Reading pdf using pyPDF2 with polish characters
【Posted】: 2018-07-22 19:38:52
【Question】:

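I'm trying to use the PyPDF2 library to read a PDF file that contains Polish characters (e.g. ń, ś), but after calling extractText() the output string is missing the Polish characters. Is there a way to keep using PyPDF2 but encode or decode the PDF file correctly first? I tried opening the file with encoding='utf-8' and with 'latin-1', but neither worked. Thanks for your help!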

Code snippet:

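import PyPDF2

# myPDFfile is the path to the PDF; a PDF is binary, so it is opened in "rb" mode
file = open(myPDFfile, "rb")
pdfreader = PyPDF2.PdfFileReader(file, strict=True)
page_obj = pdfreader.getPage(0)
page_txt = page_obj.extractText()  # the Polish characters are missing from this string
page_txt_split = page_txt.split()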
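
The encoding attempts mentioned in the question can be looked at in isolation from PyPDF2. The sketch below uses only the standard library and does not explain why extractText() drops the characters; it only illustrates, as far as the open() call goes, why 'latin-1' and an encoding argument were unlikely to help. The file name example.pdf is a placeholder and is never actually opened.

```python
# Polish letters such as "ń" and "ś" have no code point in Latin-1 (ISO-8859-1).
not_in_latin1 = []
for ch in "ńś":
    try:
        ch.encode("latin-1")
    except UnicodeEncodeError:
        not_in_latin1.append(ch)

# UTF-8 can encode them, but reading those bytes back as Latin-1 gives mojibake.
utf8_bytes = "ń".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# open() rejects an encoding argument in binary mode before touching the file.
try:
    open("example.pdf", "rb", encoding="utf-8")
except ValueError as exc:
    binary_mode_error = str(exc)

print(not_in_latin1, utf8_bytes, repr(mojibake))
print(binary_mode_error)
```

In other words, the file-level encoding is not the lever here: the bytes of a PDF are not plain text in any single text encoding, so the choice of extraction library matters more than how the file is opened.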

【Comments on the question】:

Tags: python file pdf encode pypdf2

【Solution 1】:

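OK, I approached it a different way. Thanks to jmcarp's GitHub, I used pdfminer to extract the text from my PDF file with UTF-8 encoding, and everything works (no Polish characters are lost). Here is a snippet of the working code: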

# Python 3 version; requires the pdfminer.six package
from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def pdf_to_text(pdfname):
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    device = TextConverter(rsrcmgr, sio, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Get text from the file, page by page
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    # Get text from StringIO
    text = sio.getvalue()
    # Close objects
    device.close()
    sio.close()

    return text
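
To check the "no Polish characters lost" claim on your own file, a small sanity check may help. This is only a sketch and not part of the original answer: POLISH_DIACRITICS, polish_chars_in, and check_pdf are hypothetical names introduced here, and check_pdf assumes the pdfminer.six package, whose pdfminer.high_level.extract_text function is a shorter route to the page text.

```python
# Hypothetical helper names; only the standard library is needed at import time.
POLISH_DIACRITICS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def polish_chars_in(text):
    """Return the Polish diacritic letters that occur in text, sorted by code point."""
    return sorted(set(text) & POLISH_DIACRITICS)


def check_pdf(path):
    """Extract text from the PDF at path and list the Polish letters found in it.

    Assumes pdfminer.six is installed; the import is done lazily so that
    polish_chars_in() can be used without it.
    """
    from pdfminer.high_level import extract_text

    return polish_chars_in(extract_text(path))


# Quick check on a literal string rather than on a real PDF.
print(polish_chars_in("Gdańsk, Śląsk"))
```

If check_pdf returns an empty list for a document you know contains Polish text, the extraction step is losing the characters, which is the symptom described in the question.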

【Comments】: