【Question Title】: Reading contents of Scanned PDF (JPEG) using OCR (Optical Character Recognition)
【Posted】: 2020-05-27 09:08:29
【Question】:

I have been trying to use OCR (Optical Character Recognition) to convert a scanned, non-selectable PDF (JPEG) into text.

Scanned PDF Document to be Converted

However, I am getting the attached error.

Please look into this and suggest how I can get the expected result.

# Import libraries 
from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os 

# Path of the pdf 
PDF_file = "document.pdf"

''' 
Part #1 : Converting PDF to images 
'''

# Store all the pages of the PDF in a variable 
pages = convert_from_path(PDF_file, 500) 

# Counter to store images of each page of PDF to image 
image_counter = 1

# Iterate through all the pages stored above 
for page in pages: 

    # Declaring filename for each page of PDF as JPG 
    # For each page, filename will be: 
    # PDF page 1 -> page_1.jpg 
    # PDF page 2 -> page_2.jpg 
    # PDF page 3 -> page_3.jpg 
    # .... 
    # PDF page n -> page_n.jpg 
    filename = "page_"+str(image_counter)+".jpg"

    # Save the image of the page in system 
    page.save(filename, 'JPEG') 

    # Increment the counter to update filename 
    image_counter = image_counter + 1

''' 
Part #2 - Recognizing text from the images using OCR 
'''

# Variable to get count of total number of pages 
filelimit = image_counter-1

# Creating a text file to write the output 
outfile = "out_text.txt"

# Open the file in append mode so that 
# All contents of all images are added to the same file 
f = open(outfile, "a") 

# Iterate from 1 to total number of pages 
for i in range(1, filelimit + 1): 

    # Set filename to recognize text from 
    # Again, these files will be: 
    # page_1.jpg 
    # page_2.jpg 
    # .... 
    # page_n.jpg 
    filename = "page_"+str(i)+".jpg"

    # Recognize the text as string in image using pytesseract 
    text = pytesseract.image_to_string(Image.open(filename))

    # The recognized text is stored in variable text 
    # Any string processing may be applied on text 
    # Here, basic formatting has been done: 
    # In many PDFs, at line ending, if a word can't 
    # be written fully, a 'hyphen' is added. 
    # The rest of the word is written in the next line 
    # Eg: This is a sample text this word here GeeksF- 
    # orGeeks is half on first line, remaining on next. 
    # To remove this, we replace every '-\n' to ''. 
    text = text.replace('-\n', '')   

    # Finally, write the processed text to the file. 
    f.write(text) 

# Close the file after writing all the text. 
f.close() 

Attached are the document to be converted and the error I am getting.

【Question Comments】:

    Tags: python-3.x python-2.7 ocr data-analysis python-tesseract


    【Answer 1】:

    The problem is in your PDF-to-image conversion. I haven't tried pdf2image; I use fitz (PyMuPDF). It can even extract multiple images embedded in a single PDF page.
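Since the error itself is not reproduced here, one thing worth ruling out before switching libraries is a missing external tool: pdf2image relies on the Poppler command-line programs, and pytesseract relies on the tesseract executable. The following is only a sketch under the assumption that the executables carry their usual names (`pdftoppm`, `tesseract`); the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a name is reported as missing, installing that tool (or adding its directory to PATH) may be enough to make the original pdf2image code work.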

    Install the package:

    pip install PyMuPDF
    

    Then:

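    import os

    import fitz  # PyMuPDF

    def converted(directory_to_store, path_of_pdf_file):
        # Extract every embedded image from the PDF and save it as a PNG.
        # Note: newer PyMuPDF releases name these methods
        # get_page_images and Pixmap.save instead.
        file = fitz.open(path_of_pdf_file)
        j = 0  # running index used for the output file names
        for i in range(len(file)):
            for image in file.getPageImageList(i):
                my_xref = image[0]  # xref number of the embedded image
                pic = fitz.Pixmap(file, my_xref)
                # Convert to RGB so the image can be written as PNG
                final_image = fitz.Pixmap(fitz.csRGB, pic)
                image_path = os.path.join(directory_to_store, str(j) + '.png')
                final_image.writePNG(image_path)
                j += 1
                # Drop the references so the pixmaps can be freed
                pic = None
                final_image = None

        print('Conversion Complete')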
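
To connect this back to the question, the extracted PNGs still have to be passed through OCR. Here is a minimal sketch, assuming the files were written as `0.png`, `1.png`, ... by `converted` above and that pytesseract and Pillow are installed. The helper names (`numbered_pngs`, `join_hyphenated`, `ocr_directory`) and the `ocr` parameter are hypothetical and introduced only for illustration.

```python
import os
import re


def numbered_pngs(directory):
    """Return paths of files named <number>.png, sorted by that number."""
    names = [n for n in os.listdir(directory) if re.fullmatch(r"\d+\.png", n)]
    names.sort(key=lambda n: int(n[:-4]))  # numeric order, so 2.png precedes 10.png
    return [os.path.join(directory, n) for n in names]


def join_hyphenated(text):
    """Rejoin words that were split across lines with a trailing hyphen."""
    return text.replace("-\n", "")


def ocr_directory(directory, ocr=None):
    """Run OCR over every numbered PNG and return the combined text.

    `ocr` is any callable mapping an image path to a string. When omitted,
    pytesseract is imported lazily and used for recognition.
    """
    if ocr is None:
        import pytesseract
        from PIL import Image

        def ocr(path):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img)

    return "".join(join_hyphenated(ocr(p)) for p in numbered_pngs(directory))
```

Sorting by the numeric part of the file name keeps the images in extraction order, which a plain string sort would not (`10.png` would come before `2.png`). The hyphen handling is the same `'-\n'` replacement used in the question's code.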
    
