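【Question title】: Comparing keywords with PDF files
【Posted】: 2021-06-24 07:42:12
【Question】:

This program opens the files in a folder by name and extracts their data. Now I want to compare the extracted data with the keywords used in the program below, but it gives me: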

pdfReader = pdfFileObj.loadPage(0)
AttributeError: '_io.BufferedReader' object has no attribute 'loadPage'

I want to get rid of the error and compare the keywords with the extracted data. I am using the PyMuPDF library in this program.

import fitz
import os

pdfFiles = []
for filename in os.listdir('resume/'):
    if filename.endswith('.pdf'):
        print(filename)
        # pdfFiles.append(filename)
        os.chdir('C:/Users/M. Abrar Hussain/Desktop/cv/resume')
        print('Current working dir : %s' % os.getcwd())
        pdfFileObj = open(filename, 'rb')
        pdfReader = pdfFileObj.loadPage(0)
        with fitz.open(pdfFileObj) as doc:
            text = ""
            for page in doc:
                text += page.getText()
                print(text)
                # split the docs
                pageObj = pdfReader.getpage(0)
                t1 = (pageObj.getText())
                t1 = t1.split(",")
                search_keywords = ['python', 'Laravel', 'Java']
                for sentence in t1:
                    lst = []
                    for word in search_keywords:
                        if word in search_keywords:
                            list.append(word)
                        print('{0} key word(s) in sentence: {1}'.format(len(lst), ', '.join(lst)))
        pdfFileObj.close()

【Question discussion】:

    Tags: python pdf pymupdf python-pdfreader


    【Solution 1】:

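    You are missing two lines:

        import PyPDF2
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    Note that getPage(0) returns the page object for page 0, so your for loop keeps reading the same page. To read a new page on each iteration, check how many pages the document has and loop an index i from 0 up to (but not including) pdfReader.numPages.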
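The page-by-page loop described above can be sketched as follows. This is a minimal sketch that assumes the legacy PyPDF2 interface used in this answer (`numPages`, `getPage`, `extractText`, as found in PyPDF2 releases before 3.0); the helper name `iter_page_texts` is hypothetical and not part of any library.

```python
def iter_page_texts(reader):
    """Yield the extracted text of every page, one page per iteration.

    `reader` is assumed to expose the legacy PyPDF2 reader interface:
    a `numPages` attribute and a `getPage(i)` method whose result has
    an `extractText()` method.
    """
    for i in range(reader.numPages):      # i runs over 0 .. numPages - 1
        page = reader.getPage(i)          # a different page on each iteration
        yield page.extractText() or ""    # treat a missing result as empty text


# Usage with a real file (assumes PyPDF2 < 3.0 is installed;
# 'cv.pdf' is a placeholder file name):
#
#     import PyPDF2
#     with open('cv.pdf', 'rb') as fh:
#         reader = PyPDF2.PdfFileReader(fh)
#         for number, text in enumerate(iter_page_texts(reader)):
#             print(number, text[:80])
```

Because the index comes from `range(reader.numPages)`, the loop never asks for a page past the end of the document.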

    import fitz
    import os
    import PyPDF2
    
    pdfFiles = []
    for filename in os.listdir('resume/'):
        if filename.endswith('.pdf'):
            print(filename)
            # pdfFiles.append(filename)
            os.chdir('C:/Users/M. Abrar Hussain/Desktop/cv/resume')
            print('Current working dir : %s' % os.getcwd())
            pdfFileObj = open(filename, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            pageObj = pdfReader.getPage(0)
            with fitz.open(pdfFileObj) as doc:
                text = ""
                for page in doc:
                    text += page.getText()
                    print(text)
                    # split the docs
                    pageObj = pdfReader.getPage(0)
                    t1 = (pageObj.getText())
                    t1 = t1.split(",")
                    search_keywords = ['python', 'Laravel', 'Java']
                    for sentence in t1:
                        # collect the keywords that occur in this sentence
                        lst = []
                        for word in search_keywords:
                            if word in sentence:
                                lst.append(word)
                        print('{0} key word(s) in sentence: {1}'.format(len(lst), ', '.join(lst)))
            pdfFileObj.close()
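
For comparison, the same task can be sketched with PyMuPDF alone, since the question already uses it. This is a hedged sketch rather than the answerer's code: it assumes a recent PyMuPDF release where the method is spelled `page.get_text()` (older releases used `getText()`), it passes the file path to `fitz.open()` instead of an open file object, and the helper names `find_keywords`, `extract_text` and `scan_folder` are made up for illustration.

```python
import os


def find_keywords(text, keywords):
    """Return the keywords that occur in `text`, ignoring case."""
    lowered = text.lower()
    return [word for word in keywords if word.lower() in lowered]


def extract_text(path):
    """Concatenate the text of every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so find_keywords works without it

    with fitz.open(path) as doc:          # open by path, not by file object
        return "".join(page.get_text() for page in doc)


def scan_folder(folder, keywords):
    """Map each PDF file name in `folder` to the keywords found in it."""
    results = {}
    for filename in sorted(os.listdir(folder)):
        if filename.lower().endswith('.pdf'):
            text = extract_text(os.path.join(folder, filename))
            results[filename] = find_keywords(text, keywords)
    return results


# Example call (the folder name comes from the question):
#     print(scan_folder('resume/', ['python', 'Laravel', 'Java']))
```

Building each path with `os.path.join` avoids calling `os.chdir` inside the loop. The substring match is deliberately simple, so 'Java' also matches 'JavaScript'; a stricter comparison would need word boundaries.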
    

    working-with-pdf-files-in-python

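    【Comments】:

    • Thanks for the guidance. I applied the program you suggested, but I still get the following error: t1 = (pageObj.getText()) AttributeError: 'NoneType' object has no attribute 'getText'
    • Are you using an IDE? It would help you a lot. Try pageObj.extractText(), and make sure the method names you call use the correct case. Please update the code in your question.
    • I'm using PyCharm, and that helped solve the problem. Thanks for your help.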