How to read multiple PDFs from a folder one by one

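【Question title】: How to read multiple PDFs from a folder one by one
【Posted】: 2022-01-25 12:03:05
【Question description】:

I am trying to extract data from PDF files and convert it into a pandas DataFrame. I use "fitz" from the PyMuPDF module to extract the data, then use pandas to convert it into a DataFrame.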

from pathlib import Path
# returns all file paths that have .pdf as their extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]

# Data extraction code:
import fitz  # PyMuPDF, needed for fitz.open below

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()

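The code above only extracts the data of the last PDF in the folder, so it only gives the result for that PDF.

However, I have a folder containing many PDF documents. My goal is to read each PDF file in the folder one by one, extract its text, and then convert it into a DataFrame. How can I do this in Python?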

【Question discussion】:

Tags: python dataframe pdf

【Solution 1】:

Try this:

import PyPDF2

# assumes the files are named file1.pdf ... file99.pdf
for k in range(1, 100):
    # open the pdf file
    # (legacy PyPDF2 API from before 3.0; newer releases use PdfReader instead)
    reader = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf" % k)

    # get number of pages
    num_pages = reader.getNumPages()

    # extract the text of each page
    for i in range(num_pages):
        page_obj = reader.getPage(i)
        print("this is page " + str(i))
        text = page_obj.extractText()
        # print(text)

Or this:

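import os
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

fold = 'F:/technophile/Proj/SOURCE/'
allyourpdf = []

# find the first PDF in the folder
allyourfiles = os.listdir(fold)
firstpdf = ""
for i in allyourfiles:
    if i.lower().endswith('.pdf'):
        firstpdf = i
        break

# pdfminer objects that collect the extracted text in memory
resource_manager = PDFResourceManager()
fake_file_handle = StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open(fold + firstpdf, 'rb') as fh:

    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()
    allyourpdf.append(text)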

【Discussion】:

I got the following error: NameError: name 'PDFPage' is not defined @Tal Folkman
You need the import, which I have added @AyeshaGondekar
Does the first snippet work? @AyeshaGondekar
Sort of @Tal Folkman

【Solution 2】:

You can use the built-in pathlib module to list all the PDF files in your directory

from pathlib import Path
# returns all file paths that have .pdf as their extension in the specified directory
pdf_search = Path("<path>/<to>/<pdfs>/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]

Now you can simply run your code block in a loop to iterate over the PDFs.

For example:

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        ...
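
The listing step can also be wrapped in a small helper. This is only a sketch: the name `list_pdfs` is made up for illustration and is not part of pathlib or PyMuPDF. It returns absolute paths on purpose, because a bare file name such as `1.pdf` can only be opened when the script happens to run from inside that folder.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return absolute paths (as strings) of the .pdf files directly inside `folder`.

    `list_pdfs` is a hypothetical helper name used only for this sketch.
    """
    folder = Path(folder)
    # glob("*.pdf") yields Path objects lazily; sorting makes the order predictable
    return [str(p.absolute()) for p in sorted(folder.glob("*.pdf"))]
```

The returned list can be used directly in place of `pdf_files` in the loop above.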

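【Discussion】:

I tried the code and got this error: 'mupdf: cannot open 1.pdf: No such file or directory'
I have updated the code to return absolute paths instead of file names. Try the updated version of line 6: pdf_files = [str(file.absolute()) for file in pdf_search]
Yes, it works (it now iterates over the folder), but it only extracts the data of the last PDF (not all the PDFs in the folder). I am editing the question to include the complete code. It would be a great help if you could check it once.
I have updated the question with the complete code @Sabbir Ahmed
Declare the statement pypdf_text = "" outside the for block. Each time the loop runs, your code reinitializes pypdf_text to an empty string, so you only get the text of the last PDF. @AyeshaGondekar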
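
One way to apply that advice while keeping each PDF's text separate (which is usually more convenient for a DataFrame than one long string) is sketched below. The helper names `extract_with_fitz` and `collect_texts` and the `file`/`text` column names are assumptions made for this example, not something from the original posts. The sketch also assumes a PyMuPDF release where the page method is spelled `get_text()`; the question's code uses the older `getText()` spelling.

```python
from pathlib import Path


def extract_with_fitz(pdf_path):
    """Return the text of every page in one PDF, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so collect_texts works without it

    with fitz.open(pdf_path) as doc:
        # get_text() is the current name; older releases spelled it getText()
        return "".join(page.get_text() for page in doc)


def collect_texts(folder, extract=extract_with_fitz):
    """Build one {"file", "text"} row per PDF found directly in `folder`.

    `extract` is any callable taking a path string and returning text
    (hypothetical parameter, added so other extractors can be swapped in).
    """
    rows = []  # created once, before the loop, so earlier results are kept
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        rows.append({"file": pdf_path.name, "text": extract(str(pdf_path))})
    return rows


# Possible usage (needs PyMuPDF and pandas installed):
#   import pandas as pd
#   df = pd.DataFrame(collect_texts("<path>/<to>/<pdfs>/"))
```

Because `rows` is created before the loop rather than inside it, every PDF contributes one row instead of overwriting the previous result.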