Iterate through .PDFs and convert them to .txt using PDFMiner

【Question title】: Iterate through .PDFs and convert them to .txt using PDFMiner
【Posted】: 2017-05-09 18:53:43
【Question】:

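I am trying to combine two things that I have each managed to do on their own. Unfortunately, the PDFMiner documentation is no help at all.

I have a folder containing hundreds of PDFs named "[0-9].pdf". They are in no particular order, and I have no need to sort them. I just need a way to go through them and convert each one to text.

Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from a single PDF.

Parts of this post: batch process text to csv using python - helped me work out how to open a folder full of PDFs and work with them.

What I can't figure out is how to combine the two: open the PDFs one at a time, convert each to a text object, save it to a text file named original-filename.txt, and then move on to the next PDF in the directory.

Here is my code: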

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob

directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))

resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)

password = ""
maxPages = 0
caching = True
pageNums=set()

for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = file(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = returnString.getvalue()
    filenameString = str(one_pdf) + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()
    fp.close()

device.close()
returnString.close()

I don't get any errors, but my code doesn't do anything.

Thanks for your help!

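【Comments】:

Hmm! pdfFiles may be empty... can you check?
Could I see the output of print(pdfFiles) placed just before the for ... loop?
I think pdfFiles is empty, because I don't see anything printed. But why would that be? @LaurentLAPORTE @stovfl
Your directory, directory = r'./Documents/003/', does not exist: it is a relative path, so the result depends on where you are in the directory tree when you invoke the Python program. Use an absolute path.
That worked! I used os.path.abspath("../Documents/003/") and it works. Thanks!! @LaurentLAPORTE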
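
The diagnosis in the comments can be reproduced with a short check. The following is a minimal sketch using only the standard library; the helper name find_pdfs is hypothetical and not part of the original code. It resolves the directory against the current working directory and reports what glob actually finds. Note that glob.glob returns an empty list, rather than raising an error, when the directory does not exist, which is why the loop in the question ran zero times without complaint.

```python
import glob
import os


def find_pdfs(directory):
    """Return the absolute directory that was searched and the sorted .pdf paths in it."""
    # abspath resolves a relative path against the current working directory.
    absolute = os.path.abspath(directory)
    return absolute, sorted(glob.glob(os.path.join(absolute, "*.pdf")))


if __name__ == "__main__":
    searched, pdf_files = find_pdfs("./Documents/003/")
    print("Working directory:", os.getcwd())
    print("Searched:", searched)
    print("Directory exists:", os.path.isdir(searched))
    print("PDFs found:", len(pdf_files))
```

If "Directory exists" prints False, the relative path does not point where you expect from the directory in which the script was started.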

Tags: python python-2.7 python-3.x glob pdfminer

【Solution 1】:

Just answering my own question, using @LaurentLAPORTE's suggestion.

Use os to set directory to an absolute path, like this: os.path.abspath("../Documents/003/"). After that, the code works.
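
For reference, here is one way the absolute-path fix might be combined with the loop from the question. It is a sketch under assumptions, not the asker's final script: it assumes Python 3 with the pdfminer.six fork installed, and the function names extract_text, text_path_for and convert_folder are hypothetical. Two details differ from the code in the question. It creates a fresh buffer for every PDF, because a single shared StringIO keeps accumulating the text of earlier files. It also writes 7.txt rather than 7.pdf.txt for an input named 7.pdf.

```python
import glob
import io
import os


def extract_text(pdf_path):
    """Extract the text of one PDF with PDFMiner's TextConverter."""
    # Imported here so the rest of the module loads without PDFMiner installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = io.StringIO()  # fresh buffer per PDF, so text does not accumulate
    device = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        with open(pdf_path, "rb") as fp:
            for page in PDFPage.get_pages(fp, check_extractable=True):
                interpreter.process_page(page)
        return buffer.getvalue()
    finally:
        device.close()
        buffer.close()


def text_path_for(pdf_path):
    """Map '/some/dir/7.pdf' to '/some/dir/7.txt'."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def convert_folder(directory, extract=extract_text):
    """Write one .txt file next to each .pdf in directory; return the paths written."""
    # An absolute path makes the result independent of the working directory.
    directory = os.path.abspath(directory)
    written = []
    for pdf_path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        out_path = text_path_for(pdf_path)
        with io.open(out_path, "w", encoding="utf-8") as out:
            out.write(extract(pdf_path))
        written.append(out_path)
    return written


if __name__ == "__main__":
    for path in convert_folder("../Documents/003/"):
        print("Wrote", path)
```

The extract parameter exists only so the folder loop can be exercised without real PDFs; sorting the glob result is for reproducible output and is not required.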
