PDF to image to PDF, all done in memory

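[Question title]: PDF to IMG to PDF all done in memory
[Posted]: 2018-06-12 05:28:03
[Question description]:

To strip sensitive content from a PDF, I convert it to an image and then convert it back to a PDF.

I can do this when I save the JPEG image to disk along the way, but ultimately I want to adapt my code so that the file stays in memory the whole time: in-memory PDF -> in-memory JPEG -> in-memory PDF. I'm having trouble with the intermediate step.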

from pdf2image import convert_from_path, convert_from_bytes
import img2pdf

images = convert_from_path('testing.pdf', fmt='jpeg')
image = images[0]

# opening from filename
with open("output/output.pdf","wb") as f:
    f.write(img2pdf.convert(image.tobytes()))

On the last line, I get this error:

ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO object at 0x1040cc8f0>

I'm not sure how to convert this image into the string that img2pdf is looking for.

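[Comments]:

The img2pdf.convert method accepts a list of file names, so you have to store all the converted images in some directory and pass those image paths to convert. Then it will work.

Tags: python pdf in-memory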

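[Solution 1]:

The pdf2image module returns the pages as Pillow images. According to the Pillow tobytes() documentation: "This method returns the raw image data from the internal storage." That is a raw bitmap representation, not an encoded JPEG file, which is why img2pdf cannot identify it as an image.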
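The difference between raw pixel data and an encoded file can be checked directly. The following is a minimal sketch using only Pillow; the small solid-colour image is a hypothetical stand-in for a page returned by pdf2image, not something from the original question.

```python
import io

from PIL import Image

# Hypothetical stand-in for one page returned by pdf2image.
image = Image.new("RGB", (4, 3), color=(10, 20, 30))

# tobytes() returns raw pixels only: width * height * 3 bytes for RGB,
# with no file header that an image reader could recognise.
raw = image.tobytes()
print(len(raw))  # 36

# save() into a BytesIO buffer produces a complete JPEG file in memory.
buffer = io.BytesIO()
image.save(buffer, format="JPEG")
jpeg_bytes = buffer.getvalue()

# Every JPEG file begins with the start-of-image marker FF D8.
print(jpeg_bytes[:2] == b"\xff\xd8")  # True
print(raw[:2] == b"\xff\xd8")         # False
```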

To make your code work, encode the image into an io.BytesIO buffer, like this:

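import io

# Encode the Pillow image as JPEG in an in-memory buffer, then build the PDF from those bytes
with open("output/output.pdf", "wb") as f, io.BytesIO() as output:
    image.save(output, format="JPEG")  # Pillow's format name is "JPEG"; "jpg" raises KeyError
    f.write(img2pdf.convert(output.getvalue()))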
