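[Posted]: 2021-09-24 02:48:30
[Question]:
I am trying to use Python to count the number of images in a PDF and write the results to a CSV file. Ideally, I would like a CSV with one column for the file name and one column per page, each holding the number of images on that page. A CSV with just the file name and the total number of images in the document would also be enough.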
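To make the target layout concrete, here is a minimal sketch of the CSV described above: one row per file, with the file name, the total, and one column per page. It uses only the standard `csv` module. The function name `write_image_counts`, the `total` and `page_N` headers, and the sample counts are hypothetical and chosen only for illustration.

```python
import csv
import io


def write_image_counts(rows, out):
    """Write one CSV row per file: file name, total, then one column per page.

    `rows` maps a file name to a list of per-page image counts.
    """
    max_pages = max((len(counts) for counts in rows.values()), default=0)
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["file", "total"] + [f"page_{i + 1}" for i in range(max_pages)])
    for name, counts in rows.items():
        # Pad shorter documents so every row has the same number of columns.
        padding = [""] * (max_pages - len(counts))
        writer.writerow([name, sum(counts)] + list(counts) + padding)


# Hypothetical counts, for illustration only.
example = {"a.pdf": [2, 0, 1], "b.pdf": [4]}
buffer = io.StringIO()
write_image_counts(example, buffer)
print(buffer.getvalue())
```

Padding with empty cells keeps the rows rectangular when the PDFs have different page counts, which most spreadsheet tools expect.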
I tried:
import fitz
import io
from PIL import Image
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results', 'error'])
    for file in pdfs:
        # open the file
        pdf_file = fitz.open(file)
        # printing number of images found in this page
        if image_list:
            results = len(image_list[0])
            error = ""
            #print(results)
            #results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")
        else:
            error = str("! No images found on page", page_index)
        propertyWriter.writerow([file, results, error])
Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/ However, this approach reports 9 images in every PDF, which is not correct.
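A possible explanation for the constant 9: `len(image_list[0])` measures the first entry of the list, not the list itself. In PyMuPDF each entry is a tuple describing one image (xref, mask, width, height and so on), and in releases from around the time of this post that tuple appears to have nine fields, so the result would be 9 for any page with at least one image. The sketch below counts the entries per page instead. It assumes a recent PyMuPDF in which the method is called `get_images` (older releases call it `getImageList`). The names `images_per_page` and `FakePage` and the sample tuple are made up for illustration so that the snippet runs without a PDF.

```python
def images_per_page(doc):
    """Return a list with the number of image entries on each page.

    `doc` is anything iterable over pages that expose `get_images()`,
    for example a document opened with `fitz.open(path)`.
    """
    return [len(page.get_images()) for page in doc]


class FakePage:
    """Stand-in for a PyMuPDF page so the sketch runs without a PDF."""

    def __init__(self, entries):
        self._entries = entries

    def get_images(self):
        return self._entries


# One entry per image; each entry is a tuple of properties (xref first).
# The values here are invented.
entry = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im0", "DCTDecode")
doc = [FakePage([entry, entry]), FakePage([])]

# Number of fields in one entry: this is what len(image_list[0]) measures.
print(len(doc[0].get_images()[0]))
# Number of entries on each page: this is the per-page image count.
print(images_per_page(doc))
```

With a real file, passing `fitz.open(path)` as `doc` should give the per-page counts, and `sum(...)` of that list the total for the document.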
Then I tried:
import fitz
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results'])
    for file in pdfs[0:5]:
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                results = str(pix)
                propertyWriter.writerow([file, results])
Reference: Extract images from PDF without resampling, in python? But this again reports the same number of images for every PDF, which is not correct.
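One likely reason the second attempt gives identical results for every file is that `doc` is never opened inside the `for file in ...` loop, so each iteration inspects whatever document `doc` already referred to. It also writes one row per image (the text form of a `Pixmap`) rather than a count. A minimal sketch of opening a fresh document per file follows. It assumes a recent PyMuPDF with `page.get_images()`. The `opener` parameter, `FakePage`, `FakeDoc`, `fake_open` and the sample data are hypothetical and exist only so the example can run without PyMuPDF or real PDFs.

```python
def count_images(path, opener=None):
    """Return per-page image counts for the PDF at `path`.

    `opener` defaults to `fitz.open`; it is a parameter only so this
    sketch can be exercised without PyMuPDF installed.
    """
    if opener is None:
        import fitz  # PyMuPDF, imported lazily

        opener = fitz.open
    doc = opener(path)  # a fresh document for every file
    try:
        return [len(page.get_images()) for page in doc]
    finally:
        doc.close()


class FakePage:
    """Stand-in page that reports a fixed number of image entries."""

    def __init__(self, n):
        self._n = n

    def get_images(self):
        return [(i,) for i in range(self._n)]


class FakeDoc(list):
    """Stand-in document: a list of pages with a no-op close()."""

    def close(self):
        pass


# Invented per-page counts for two pretend files.
fake_files = {"a.pdf": [2, 0, 1], "b.pdf": [4]}


def fake_open(path):
    return FakeDoc(FakePage(n) for n in fake_files[path])


for name in fake_files:
    pages = count_images(name, opener=fake_open)
    print(name, sum(pages), pages)
```

With PyMuPDF installed, calling `count_images(path)` without `opener` would open the real file, and the returned list could be written to the CSV as one row per file.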
[Discussion]:
Tags: python python-3.x pdf