Error while trying to extract text from a PDF file using pdfminer.six

[Question title]: Error while trying to extract text from pdf file using pdfminer.six
[Posted]: 2020-11-09 23:17:44
[Question description]:

I am trying to extract text from a PDF using the pdfminer.six library, which I have installed in my virtual environment. Here is my code:

import pdfminer as miner

text = miner.high_level.extract_text('file.pdf')


print(text)

But when I run the code with python pdfreader.py, I get the following error:

Traceback (most recent call last):
  File ".\pdfreader.py", line 9, in <module>
    text = miner.high_level.extract_text('pdfBulletins/corona1.pdf')
AttributeError: module 'pdfminer' has no attribute 'high_level'

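I suspect it has something to do with the Python path: I installed pdfminer in a virtual environment, yet I see that pdf2txt.py was installed in my system Python installation. Is this behavior normal? I mean, whatever happens inside my venv should not change my system Python installation.

I successfully extracted the text using the pdf2txt.py utility that ships with pdfminer.six (from the command line, using the system Python installation), but not from the code in my venv project. My pdfminer.six version is 20201018.

What could be wrong with my code?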

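[Question comments]:

Does this answer help? stackoverflow.com/a/26495057/14316282
@RolvApneseth I tried the code there and it does not work. I suspect it has something to do with the Python path: I installed pdfminer in a virtual environment, yet I see that pdf2txt.py was installed outside it, in my system Python installation. Is this behavior normal? I mean, whatever happens inside my venv should not change my system Python installation.
That behavior is definitely not normal. Do any other modules you install end up on the system instead of in the virtual environment?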
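
To check the suspicion raised in the comments above, a minimal sketch like the following reports which interpreter is actually running the script and whether it belongs to a virtual environment. The helper name in_virtualenv is hypothetical, and the check assumes a venv-style environment (where sys.prefix differs from sys.base_prefix).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# The path printed here shows which Python executable is running this script.
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the interpreter path points at the system installation, the script is probably not being run with the venv's Python, which would explain packages appearing to be missing.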

Tags: python pdf windows-10 pdfminer

[Solution 1]:

pdfminer's high_level extract_text accepts several optional parameters. The code below passes them explicitly, uses pdfminer.six, and extracts the text from my PDF file.

from pdfminer.high_level import extract_text

# Open the PDF in binary mode; the with block closes the file when done.
with open('my_file.pdf', 'rb') as pdf_file:
    text = extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)
print(text)

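[Comments]:

[Solution 2]:

Your problem is that you are trying to use a function from a module that has not been imported. Importing pdfminer does not automatically import pdfminer.high_level as well.

This works: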

from pdfminer.high_level import extract_text

text = extract_text('file.pdf')

print(text)
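
To illustrate why importing a package does not make its submodules available, here is a small sketch that does not need pdfminer at all. It builds a throwaway package on disk; the names demo_pkg and its high_level module are hypothetical stand-ins, and the package's __init__.py is assumed to be empty, so it imports no submodules on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: demo_pkg/__init__.py and demo_pkg/high_level.py
tmp_dir = tempfile.mkdtemp()
pkg_dir = Path(tmp_dir) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # empty: imports no submodules
(pkg_dir / "high_level.py").write_text(
    "def extract_text(path):\n    return 'text from ' + path\n"
)

# Make the temporary directory importable.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import demo_pkg

# The submodule has not been imported yet, so it is not an attribute of the package.
before = hasattr(demo_pkg, "high_level")

import demo_pkg.high_level

# Importing the submodule binds it as an attribute on the parent package.
after = hasattr(demo_pkg, "high_level")

print(before, after)
```

Accessing demo_pkg.high_level before the second import would raise the same kind of AttributeError as in the question.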

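[Comments]:

[Solution 3]:

Try pdfreader to extract text from a PDF document (both plain text and text that includes PDF operators).

Here is sample code that extracts all of the above from every page of the document.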

from pdfreader import SimplePDFViewer, PageDoesNotExist

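your_pdf_file_name = "file.pdf"  # placeholder: path to your PDF file

plain_text = ""
pdf_markdown = ""
# Open in binary mode; the with block closes the file when done.
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        # Render page after page until the viewer runs past the last page.
        while True:
            viewer.render()
            pdf_markdown += viewer.canvas.text_content
            plain_text += "".join(viewer.canvas.strings)
            viewer.next()
    except PageDoesNotExist:
        pass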

[Comments]:

[Solution 4]:

You need to install pdfminer.six rather than just pdfminer:

pip install pdfminer.six

Only after that can you import extract_text as follows:

from pdfminer.high_level import extract_text
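
If it is unclear whether the right package is in place, a guarded import can turn the failure into a clear message. This is only a sketch: the helper name load_extract_text is hypothetical, and it assumes that a missing or wrong pdfminer installation shows up as an ImportError when pdfminer.high_level is imported.

```python
import importlib


def load_extract_text():
    """Return pdfminer.high_level.extract_text, or None if it cannot be imported."""
    try:
        module = importlib.import_module("pdfminer.high_level")
    except ImportError:
        # Covers both a missing pdfminer package and one without a high_level module.
        return None
    return getattr(module, "extract_text", None)


extract_text = load_extract_text()
if extract_text is None:
    print("pdfminer.high_level is unavailable; try: pip install pdfminer.six")
else:
    print("extract_text is ready to use")
```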

[Comments]:

[Solution 5]:

My problem

Both pdfminer and pdfminer.six were installed, and from pdfminer.high_level import extract_text tried to use the wrong package.

Solution

For me, uninstalling pdfminer worked:

pip uninstall pdfminer

Now you should have only pdfminer.six installed and should be able to import extract_text.
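
To see whether both distributions are present in the current environment, a sketch along these lines may help. It relies on importlib.metadata from the standard library (Python 3.8 or later); the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# If both names report a version, the two packages are installed side by side.
for name in ("pdfminer", "pdfminer.six"):
    print(name, "->", installed_version(name))
```

Run it with the same interpreter that runs your script, since each environment has its own set of installed distributions.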

[Comments]:

Please explain the downvotes!