How to extract images from a PDF form with iText

【Question title】: How to Extract Images from a PDF Form with iText
【Posted】: 2021-07-22 23:48:18
【Question】:

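This post (How to extract images from a PDF with iText in the correct order?) describes how to extract images from a regular PDF file. I need to extract images that users have entered into PDF form fields.

I am using iText 7. I can access the form fields in iText with code like this: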

PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("myFile.pdf").getFile()));
PdfDocument document = new PdfDocument(reader);
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
Map<String, PdfFormField> fields = acroForm.getFormFields();
PdfButtonFormField imageField = null;
PdfDictionary dictionary = null;
for (String fldName : fields.keySet()) {
      PdfFormField field = fields.get(fldName);
      if ("Image1_af_image".equals(fldName)) {
            imageField = (PdfButtonFormField)fields.get("Image1_af_image");
            dictionary = imageField.getPdfObject();
       }
}

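where Image1_af_image is the default name of the image field in the form. Is it possible to extract the image stream from the PdfButtonFormField or from its associated dictionary object?

Thank you for the very helpful reply. I have incorporated your code as follows: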

    public void iTextTest3() throws IOException {

        PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("templates/TestForm.pdf").getFile()));

        PdfDocument document = new PdfDocument(reader);
        String fieldname = "Image1_af_image";
        PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);

        PdfFormField imagefield = acroForm.getField(fieldname);
        // get the appearance dictionary
        PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
        // get the xobject resources
        PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
        for (PdfName key : xObjDic.keySet()) {
            System.out.println(key);
            PdfStream s = xObjDic.getAsStream(key);
            // only process images
            if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {  //*** code fails here ***
                PdfImageXObject pixo = new PdfImageXObject(s);
                byte[] imgbytes = pixo.getImageBytes();
                String ext = pixo.identifyImageFileExtension();

                // write the image to file
                String fileName = null;
                FileOutputStream fos = new FileOutputStream(fileName = key.toString().substring(1) + "." + ext);
                System.out.println(("image fileName: " + fileName));
                fos.write(imgbytes);
                fos.close();
            }
        }
        document.close();
    }

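The code fails because s.getAsName(PdfName.Subtype) returns the value "Form". I guess what I need to do is recurse into the XObject tree as you suggested in your post, but I am not sure how to do that. I tried xObjDic.getAsDictionary(), but I am not sure which PdfName to pass as the argument.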

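【Comments】:

You mention iText without stating the version. Based on your code, I assume it is iText 7?
Sorry, yes, it is iText 7.
@StephenSchultz I have revised my answer based on the edit to your question.

Tags: java forms pdf itext itext7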

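【Solution 1】:

The visual appearance of a button in a PDF can be fully customized, including text, graphics, and images. As a result, the image data may be stored in slightly different ways in different PDF documents. In general, though, the widget annotation of the form field has an appearance stream, and that stream's resource dictionary holds the image data as an XObject.

Creating a PDF that contains a button with an image, for testing: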

String fieldname = "Image1_af_image";
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
PdfButtonFormField imagefield = PdfFormField.createButton(pdfDoc, new Rectangle(100, 100, 50, 50),
        PdfButtonFormField.FF_PUSH_BUTTON);
imagefield.setImage("button.png").setFieldName(fieldname);
form.addField(imagefield);

Getting the image data from the button:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
for (PdfName key : xObjDic.keySet()) {
    System.out.println(key);
    PdfStream s = xObjDic.getAsStream(key);
    // only process images
    if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {
        PdfImageXObject pixo = new PdfImageXObject(s);
        byte[] imgbytes = pixo.getImageBytes();
        String ext = pixo.identifyImageFileExtension();
    
        // write the image to file
        FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext);
        fos.write(imgbytes);
        fos.close();
    }
}
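
The loop above writes each image with a bare FileOutputStream and builds the file name from key.toString().substring(1). As a minimal sketch using only the JDK, that step could be pulled into a small helper. ImageFileWriter, fileNameFor and write are hypothetical names, not part of iText, and the character whitelist is an assumption about what is safe in a file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImageFileWriter {

    // Turns a PDF name such as "/Im0" into a file name such as "Im0.png".
    static String fileNameFor(String pdfName, String extension) {
        String base = pdfName.startsWith("/") ? pdfName.substring(1) : pdfName;
        // Assumption: replace anything outside a conservative ASCII set with '_'.
        base = base.replaceAll("[^A-Za-z0-9._-]", "_");
        return base + "." + extension;
    }

    // Writes the bytes into targetDir (created if missing) and returns the file's path.
    static Path write(Path targetDir, String pdfName, String extension, byte[] imageBytes)
            throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameFor(pdfName, extension));
        // Files.write opens, writes and closes the file, even if writing fails.
        return Files.write(target, imageBytes);
    }
}
```

Inside the loop this would be called as ImageFileWriter.write(Paths.get("out"), key.toString(), ext, imgbytes), where key, ext and imgbytes are the variables from the snippet above and "out" is an arbitrary example directory.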

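You can use a PDF object viewer (such as iText RUPS or Adobe Acrobat's built-in "Browse Internal PDF Structure") to inspect the exact structure of the PDF document and find out where the image data is stored.

Edit:

A more general approach to extracting the image data, in case it sits inside nested Form XObjects: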

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(xObjDic);

public void extractImagesFromXObj(PdfDictionary xObjDic) throws IOException {
    for (PdfName key : xObjDic.keySet()) {
        System.out.println(key);
        PdfStream s = xObjDic.getAsStream(key);
        PdfName subType = s.getAsName(PdfName.Subtype);
        // only process images
        if (PdfName.Image.equals(subType)) {
            PdfImageXObject pixo = new PdfImageXObject(s);
            byte[] imgbytes = pixo.getImageBytes();
            String ext = pixo.identifyImageFileExtension();

            // write the image to file
            FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext);
            fos.write(imgbytes);
            fos.close();
        }
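        // process nested XObject dictionaries recursively
        else if (PdfName.Form.equals(subType)) {
            // a form XObject may have no /Resources or no /XObject entry
            PdfDictionary resources = s.getAsDictionary(PdfName.Resources);
            PdfDictionary nestedXObjDic = resources == null ? null : resources.getAsDictionary(PdfName.XObject);
            if (nestedXObjDic != null) {
                extractImagesFromXObj(nestedXObjDic);
            }
        }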
    }
}
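
To make the recursion easier to follow in isolation, here is a toy model that uses plain Java maps instead of iText types. XObjectWalk and Node are hypothetical stand-ins for the XObject dictionary and its streams. The sketch also adds an identity-based visited set, which is not in the answer above, so that a malformed PDF whose form XObjects reference each other cannot cause unbounded recursion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class XObjectWalk {

    // Toy stand-in for an XObject stream: a subtype plus its own nested XObject resources.
    static final class Node {
        final String subtype; // "Image" or "Form"
        final Map<String, Node> xObjects = new LinkedHashMap<>();

        Node(String subtype) {
            this.subtype = subtype;
        }
    }

    // Collects the resource names of all images reachable from the given XObject dictionary.
    static List<String> collectImageNames(Map<String, Node> xObjDic) {
        List<String> found = new ArrayList<>();
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(xObjDic, visited, found);
        return found;
    }

    private static void walk(Map<String, Node> xObjDic, Set<Node> visited, List<String> found) {
        if (xObjDic == null) {
            return; // a form without XObject resources
        }
        for (Map.Entry<String, Node> entry : xObjDic.entrySet()) {
            Node node = entry.getValue();
            if (node == null || !visited.add(node)) {
                continue; // missing entry, or already handled (shared or cyclic reference)
            }
            if ("Image".equals(node.subtype)) {
                found.add(entry.getKey());
            } else if ("Form".equals(node.subtype)) {
                walk(node.xObjects, visited, found);
            }
        }
    }
}
```

One consequence of the visited set is that an image shared by several forms is reported only once, under the first name it is reached by. In the iText version the same guard could be kept as a set of already processed PdfStream objects, passed along as an extra parameter.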

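【Discussion】:

Depending on the software that created the PDF, the exact structure varies: the image XObject may not be a direct resource of the normal appearance stream but may sit inside a nested form XObject. In general, therefore, one also has to recurse into the form XObjects of the appearance stream and look for image XObjects there.
Indeed, @mkl. I did not want to complicate my initial answer with this more general approach. But following your comment and the edit to the question, I have added some sample code that walks the nested dictionaries.
This is fantastic: it works exactly as you described and is just what I needed. Thank you! (I see that what I was missing was iterating over xObjDic.keySet(). Always good to learn a new trick!)
@StephenSchultz, if this solved your problem, please consider accepting the answer.
When I try to set the image programmatically using the code from the initial answer to my post, I get this exception: com.itextpdf.kernel.PdfException: There is no associate PdfWriter for making indirects. at com.itextpdf.kernel.pdf.PdfObject.makeIndirect(PdfObject.java:229). I interpreted "button.png" in the code as a resource file name. I tried a name relative to the application path as well as an absolute path.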