【Question title】: Count number of images in PDF with Python
【Posted】: 2021-09-24 02:48:30
【Question】:

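I am trying to use Python to count the number of images in a PDF and write the result to a CSV file. Ideally, I would like a CSV with one column for the file and one column per page, each holding the number of images on that page. A column with the file name and a column with the total number of images in the document would be enough, though.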

I tried:

import fitz
import io
from PIL import Image
import csv

with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer 
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers 
    propertyWriter.writerow(['file', 'results', 'error'])
    for file in pdfs:

        # open the file
        pdf_file = fitz.open(file)


        # printing number of images found in this page
        if image_list:
            results = len(image_list[0])
            error = ""
            #print(results)
            #results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")

        else:
            error = str("! No images found on page", page_index)
        propertyWriter.writerow([file, results, error])

Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/ However, this approach reports 9 images in every PDF, which is not the case.

Then I tried:

import fitz
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer 
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers 
    propertyWriter.writerow(['file', 'results'])
    for file in pdfs[0:5]:
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                results = str(pix)

    propertyWriter.writerow([file, results])

Reference: Extract images from PDF without resampling, in python? But this again reports the same number of images for every PDF, which is not the case.

【Question discussion】:

    Tags: python python-3.x pdf


    【Solution 1】:

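    I tried the first reference you mentioned (https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/), and the code on that page works fine. What is the problem? It counts the images on each page of the PDF, so you only need to sum those counts for each PDF.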
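One likely explanation for the constant 9, offered as a guess: the first attempt in the question takes `len(image_list[0])`, which measures the first entry of the list rather than the list itself. The sketch below uses a hand-made stand-in for the image list (no PDF is opened) and assumes that each entry is a 9-item tuple describing one image, which is how PyMuPDF's image list is documented when the `full` option is left off. All values in it are made up.

```python
# Stand-in for the list returned for one page that contains two images.
# Assumption: each entry is a 9-item tuple
# (xref, smask, width, height, bpc, colorspace, alt_colorspace, name, filter).
# The values are invented for illustration.
image_list = [
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode"),
    (15, 0, 100, 100, 8, "DeviceGray", "", "Im2", "FlateDecode"),
]

# What the question's first attempt measures: the fields of the first entry.
fields_in_first_entry = len(image_list[0])

# What was intended: the number of entries, i.e. images on the page.
images_on_page = len(image_list)

print(fields_in_first_entry, images_on_page)
```

Under that assumption, any page with at least one image yields the same field count no matter how many images it holds. In the second attempt, `doc` does not appear to be reopened for each `file`, so every row would describe the same document.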

    If you put it in a for loop over your files, you should be able to reach your goal:

    import fitz  # PyMuPDF
    
    file = "doctest.pdf"
    pdf_file = fitz.open(file)
    results = 0
    
    for page_index in range(len(pdf_file)):
        # one entry per image referenced by this page
        # (the method is named getImageList() in older PyMuPDF releases)
        image_list = pdf_file[page_index].get_images()
        
        # add the number of images found on this page to the running total
        results += len(image_list)
    
    print("Total images in this PDF: ", results)
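To get the CSV described in the question (one row per file, with the total and one column per page), the loop above can be wrapped in a few helpers. This is a minimal sketch, not a tested tool: the function names `count_images_per_page`, `build_rows` and `write_image_counts` are made up for illustration, and it assumes a PyMuPDF release in which the page method is called `get_images()` (older releases name it `getImageList()`).

```python
import csv


def count_images_per_page(doc):
    """Return the number of image entries on each page of `doc`.

    `doc` is expected to behave like a PyMuPDF document: iterating over it
    yields pages, and every page offers get_images().
    """
    return [len(page.get_images()) for page in doc]


def build_rows(counts_by_file):
    """Turn {file: [count_page_1, count_page_2, ...]} into CSV rows with a header."""
    max_pages = max((len(c) for c in counts_by_file.values()), default=0)
    header = ["file", "total"] + [f"page_{i + 1}" for i in range(max_pages)]
    rows = [header]
    for name, counts in counts_by_file.items():
        # PDFs with fewer pages than the longest one get empty cells
        padding = [""] * (max_pages - len(counts))
        rows.append([name, sum(counts)] + counts + padding)
    return rows


def write_image_counts(pdf_paths, csv_path):
    """Count images in every PDF in `pdf_paths` and write the table to `csv_path`."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    counts_by_file = {}
    for path in pdf_paths:
        doc = fitz.open(path)  # open a fresh document for every file
        try:
            counts_by_file[path] = count_images_per_page(doc)
        finally:
            doc.close()

    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        writer.writerows(build_rows(counts_by_file))
```

A call such as `write_image_counts(pdfs, "output.csv")` would then produce the table, with `pdfs` being the list of paths from the question. Note that the sketch opens the file in `"w"` mode, which overwrites an existing CSV, whereas the question uses `"x"`, which refuses to.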
    
