Using CAM::PDF for Perl - Cannot extract image from PDF

【Question title】: Using CAM::PDF for Perl - Can not extract image from pdf
【Posted】: 2014-01-16 06:21:38
【Question description】:

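I have a PDF file for which CAM::PDF's listimages.pl returns nothing, yet PDF::GetImages extracts an image from it. With the following code I can find the image object, but I can't figure out how to extract it into a file. I also can't figure out why the command-line utility isn't working.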

#!/usr/bin/perl -w
use strict;

use Cwd;
use File::Basename;
use Data::Dumper;
use CAM::PDF;
use CAM::PDF::PageText;
use CAM::PDF::Renderer::Images;

my $file = shift @ARGV || die "Usage: get-pdf-images /path/to/file.pdf \n";

my $pdf = CAM::PDF->new($file) || die "$CAM::PDF::errstr\n";

#print $pdf->toString();

foreach my $p ( 1 .. $pdf->numPages() ) {
    my $page = $pdf->getPageContentTree($p);
    my $str = $pdf->getPageText($p);
    if (defined $str) {
#        CAM::PDF->asciify(\$str);
        print $str;
    }

    print "-------------------------------\n";
    my $gs = $page->findImages();
    my @imageNodes = @{$gs->{images}};
    print "Found " . scalar @imageNodes . " images on page $p\n";
    print Data::Dumper->Dump([\@imageNodes],['imageNodes']);
}

If I run `pdfinfo.pl`, it reports:

$ pdfinfo.pl test.pdf
File:         test.pdf
File Size:    4599 bytes
Pages:        1
Author:       þÿadmin01
CreationDate: Fri Jan  3 03:48:53 2014
Creator:      þÿPDFCreator Version 1.7.2
Keywords:
ModDate:      Fri Jan  3 03:48:53 2014
Producer:     GPL Ghostscript 9.10
Subject:
Title:        þÿVision6Card
Page Size:    variable
Optimized:    no
PDF version:  1.4
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes

The test.pdf file can be downloaded from here: http://imaptools.com:8080/dl/test.pdf

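【Question comments】:

The image in question is a 3x10 pixel image that is encoded as an inline image. Maybe listimages.pl only recognizes XObject images? When analyzing the internal PDF structure, Adobe Acrobat Preflight also reports "PDFEngine error: Severity: 4, System: 0, Error: 3" for this image. So maybe the image embedding is broken and listimages.pl does not find it for that reason? Furthermore, I don't see the image when the PDF is displayed. Maybe listimages.pl only extracts visible images?
I also get an error from pdf-tools.com/pdf/validate-pdfa-online.aspx, but I don't think that is the issue, since both PDF::GetImages and the command-line tool pdfimages succeed at extracting the image. I am using CAM::PDF to extract other information and would prefer to use it for image extraction as well.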

Tags: perl pdf cam-pdf

【Solution 1】:

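Some parts of CAM::PDF are unfinished. If you look at the source of listimages.pl, you'll see that its content parsing for inline images is somewhat primitive: for example, it doesn't allow unmatched parentheses between BI and EI (which is the case here), so no image is found. There is also uninlinepdfimages.pl, which uses a different heuristic for parsing inline images, but for this file it seems to hang, and I have no intention of investigating what confuses it. CAM::PDF::Renderer::Images, as used in your code, is yet another take on the same problem. It finally does parse the content stream properly, but the library appears to provide no methods that help extract the image data here. Still, if you need it badly, I see no technical obstacle (other than your time) to extracting the image programmatically from the information in @imageNodes (width, height, depth, compression used, image data).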
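
To illustrate the last step of that idea only, here is a minimal sketch of wrapping raw samples in a PNM header so an image viewer can open them. It deliberately makes no claim about how CAM::PDF lays out the entries of @imageNodes. It assumes you have already pulled out the width, height, and sample bytes yourself, that any compression filter has already been undone, and that the image uses 8 bits per component in a gray or RGB color space. `samples_to_pnm` is a hypothetical helper, not part of CAM::PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (NOT part of CAM::PDF): wrap already-decoded,
# 8-bit-per-component samples in a binary PNM header.
#   $components == 1 -> grayscale (P5), $components == 3 -> RGB (P6)
sub samples_to_pnm {
    my ($width, $height, $components, $data) = @_;

    my $magic = $components == 1 ? 'P5'
              : $components == 3 ? 'P6'
              : die "Unsupported component count: $components\n";

    # Refuse data whose size does not match the declared dimensions.
    my $expected = $width * $height * $components;
    die "Expected $expected bytes, got " . length($data) . "\n"
        if length($data) != $expected;

    return "$magic\n$width $height\n255\n" . $data;
}

# Placeholder input: a 3x10 RGB image filled with mid-gray samples.
# Real sample bytes would come from the image node you located.
my $pnm = samples_to_pnm(3, 10, 3, "\x80" x (3 * 10 * 3));
printf "Built a %d-byte PPM\n", length $pnm;
```

To save the result, open the output file in raw mode (for example `open my $fh, '>:raw', 'image.ppm'`) before printing `$pnm` to it, so that no newline translation corrupts the binary samples. Images with other bit depths, color spaces, or still-compressed data would need extra handling that this sketch does not attempt.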

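【Comments】:

Agreed. I'm the author of CAM-PDF. When I first wrote it (back in 2002), I was trying to achieve some very specific goals and added features as I needed them. Many of the higher-level tools (like listimages.pl and pdftotext.pl) are just heuristics and don't even try to cover all possibilities.
Thanks for all the feedback and suggestions. It turns out the 3x10 image in the sample is not what I wanted anyway. So I ended up using CAM::PDF to extract the text I need and then using ImageMagick to render the PDF as a jpg. I am new to manipulating PDFs and I learned a lot - thanks!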