【Question title】: Finding JavaScript code in PDF using Apache PDFBox
【Posted】: 2016-01-17 15:52:13
【Question description】:

My goal is to extract and process any JavaScript code that a PDF document may contain. Opening a PDF in an editor, I can see objects like this:

    402 0 obj
    <</S/JavaScript/JS(\n\r\n   /* Set day 25 */\r\n    FormRouter_SetCurrentDate\("25"\);\r)>>
    endobj

I am trying to do this with Apache PDFBox, but so far without success.

This line returns an empty list:

 jsObj = doc.getObjectsByType(COSName.JAVA_SCRIPT);

Can anyone give me some guidance?

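【Comments】:

  • Tricky. If you read the PDF specification, you will see that JavaScript can appear in many places, and that it runs in response to user actions such as clicking an item or changing an item's content. So it isn't a matter of collecting all the JavaScript and running it.
  • Yes, but I only want a way to perform static analysis on the JS code myself. The code is somewhere in the PDF, so I should be able to extract it somehow. For example, it is easy to extract JS code from /OpenAction if such an action exists.
  • I think you basically have to traverse the object hierarchy from the document root and check every object that could have JS attached. That takes some work.
  • To understand what we mean, download PDFDebugger from the 2.0 release and browse the tree. Do this for several different PDFs.
  • @TilmanHausherr If you post the code you sent me on the PDFBox support group, I'll be happy to accept your answer. I would also appreciate some sample code showing how to go through all COSString objects in a PDF exhaustively (I don't know why I find this API so complicated; it is probably because I don't know much about the PDF format).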

Tags: java pdf pdfbox


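【Solution 1】:

This tool is based on the PrintFields example in PDFBox. It shows the JavaScript attached to form fields. I wrote it last year for someone who had trouble with the relationships between AcroForm fields (enabling or disabling some fields depending on the values of other fields). JavaScript can live in other places as well.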

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package pdfboxpageimageextraction;

import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionJavaScript;
import org.apache.pdfbox.pdmodel.interactive.action.PDFormFieldAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDNonTerminalField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

/**
 * This example will take a PDF document and print the JavaScript attached to
 * its form fields. It is based on the PrintFields example.
 *
 * @author Ben Litchfield
 *
 */
public class PrintJavaScriptFields
{

    /**
     * This will print the JavaScript actions of all the fields in the document.
     *
     * @param pdfDocument The PDF to get the fields from.
     *
     * @throws IOException If there is an error getting the fields.
     */
    public void printFields(PDDocument pdfDocument) throws IOException
    {
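        PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm == null)
        {
            // the document has no AcroForm, so there are no form fields to inspect
            return;
        }
        List<PDField> fields = acroForm.getFields();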

        //System.out.println(fields.size() + " top-level fields were found on the form");
        for (PDField field : fields)
        {
            processField(field, "|--", field.getPartialName());
        }
    }

    private void processField(PDField field, String sLevel, String sParent) throws IOException
    {
        String partialName = field.getPartialName();

        if (field instanceof PDTerminalField)
        {
            PDTerminalField termField = (PDTerminalField) field;
            PDFormFieldAdditionalActions fieldActions = field.getActions();
            if (fieldActions != null)
            {
                System.out.println(field.getFullyQualifiedName() + ": " + fieldActions.getClass().getSimpleName() + " js field actionS:\n" + fieldActions.getCOSObject());
                printPossibleJS(fieldActions.getK());
                printPossibleJS(fieldActions.getC());
                printPossibleJS(fieldActions.getF());
                printPossibleJS(fieldActions.getV());
            }
            for (PDAnnotationWidget widgetAction : termField.getWidgets())
            {
                PDAction action = widgetAction.getAction();
                if (action instanceof PDActionJavaScript)
                {
                    System.out.println(field.getFullyQualifiedName() + ": " + action.getClass().getSimpleName() + " js widget action:\n" + action.getCOSObject());
                    printPossibleJS(action);
                }
            }
        }

        if (field instanceof PDNonTerminalField)
        {
            if (!sParent.equals(field.getPartialName()))
            {
                if (partialName != null)
                {
                    sParent = sParent + "." + partialName;
                }
            }
            //System.out.println(sLevel + sParent);

            for (PDField child : ((PDNonTerminalField) field).getChildren())
            {
                processField(child, "|  " + sLevel, sParent);
            }
        }
        else
        {
            String fieldValue = field.getValueAsString();
            StringBuilder outputString = new StringBuilder(sLevel);
            outputString.append(sParent);
            if (partialName != null)
            {
                outputString.append(".").append(partialName);
            }
            outputString.append(" = ").append(fieldValue);
            outputString.append(", type=").append(field.getClass().getName());
            //System.out.println(outputString);
        }
    }

    private void printPossibleJS(PDAction kAction)
    {
        if (kAction instanceof PDActionJavaScript)
        {
            PDActionJavaScript jsAction = (PDActionJavaScript) kAction;
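            String jsString = jsAction.getAction();
            if (jsString == null)
            {
                // a JavaScript action without a usable /JS entry; nothing to print
                return;
            }
            if (!jsString.contains("\n"))
            {
                // avoid display problems with netbeans
                jsString = jsString.replaceAll("\r", "\n").replaceAll("\n\n", "\n");
            }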
            System.out.println(jsString);
            System.out.println();
        }
    }

    /**
     * This will read a PDF file and print out the form elements. <br />
     * see usage() for commandline
     *
     * @param args command line arguments
     *
     * @throws IOException If there is an error importing the FDF document.
     */
    public static void main(String[] args) throws IOException
    {
        PDDocument pdf = null;
        try
        {
            pdf = PDDocument.load(new File("XXXX", "YYYYY.pdf"));
            PrintJavaScriptFields exporter = new PrintJavaScriptFields();
            exporter.printFields(pdf);
        }
        finally
        {
            if (pdf != null)
            {
                pdf.close();
            }
        }
    }

} 
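Since form fields are only one of the places where JavaScript can live, here is a minimal sketch that checks a few of the others through the same high-level API: the document's /OpenAction (mentioned in the comments on the question), the open/close actions of each page, and the actions of link and widget annotations. It assumes PDFBox 2.0.x; the class name `OtherJavaScriptLocations` and its helper methods are hypothetical names chosen for illustration, and the list of locations is not exhaustive (document-level scripts in the /Names tree, for instance, are not covered).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionJavaScript;
import org.apache.pdfbox.pdmodel.interactive.action.PDPageAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;

// Hypothetical helper class, not part of PDFBox.
public class OtherJavaScriptLocations
{
    static List<String> collect(PDDocument doc) throws IOException
    {
        List<String> scripts = new ArrayList<String>();

        // /OpenAction is either a destination or an action
        addIfJavaScript(doc.getDocumentCatalog().getOpenAction(), scripts);

        for (PDPage page : doc.getPages())
        {
            // page-level additional actions: /O (page opened) and /C (page closed)
            PDPageAdditionalActions pageActions = page.getActions();
            if (pageActions != null)
            {
                addIfJavaScript(pageActions.getO(), scripts);
                addIfJavaScript(pageActions.getC(), scripts);
            }
            for (PDAnnotation annotation : page.getAnnotations())
            {
                if (annotation instanceof PDAnnotationLink)
                {
                    addIfJavaScript(((PDAnnotationLink) annotation).getAction(), scripts);
                }
                else if (annotation instanceof PDAnnotationWidget)
                {
                    addIfJavaScript(((PDAnnotationWidget) annotation).getAction(), scripts);
                }
            }
        }
        return scripts;
    }

    // Adds the script text if the candidate is a JavaScript action; ignores anything else.
    private static void addIfJavaScript(Object candidate, List<String> scripts)
    {
        if (candidate instanceof PDActionJavaScript)
        {
            String js = ((PDActionJavaScript) candidate).getAction();
            if (js != null)
            {
                scripts.add(js);
            }
        }
    }
}
```

Calling `OtherJavaScriptLocations.collect(pdf)` on a loaded document returns the script texts as plain strings, which is probably more convenient for static analysis than printing them.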

As a bonus, here is code that shows all COSString objects:

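import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSBoolean;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSNull;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdmodel.PDDocument;

public class ShowAllCOSStrings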
{
    static Set<COSString> strings = new HashSet<COSString>();

    static void crawl(COSBase base)
    {
        if (base instanceof COSString)
        {
            strings.add((COSString)base);
            return;
        }
        if (base instanceof COSDictionary)
        {
            COSDictionary dict = (COSDictionary) base;
            for (COSName key : dict.keySet())
            {
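                // getItem() returns the raw entry without dereferencing it; indirect
                // objects are reached through the loop in main(), which avoids endless
                // recursion on cyclic references such as /Parent and /Kids
                crawl(dict.getItem(key));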
            }
            return;
        }
        if (base instanceof COSArray)
        {
            COSArray ar = (COSArray) base;

            for (COSBase item : ar)
            {
                crawl(item);
            }
            return;
        }
        if (base instanceof COSNull || 
                base instanceof COSObject || 
                base instanceof COSName || 
                base instanceof COSNumber || 
                base instanceof COSBoolean || 
                base == null)
        {
            return;
        }
        System.out.println("huh? " + base);
    }

    public static void main(String[] args) throws IOException
    {
        PDDocument doc = PDDocument.load(new File("XXX","YYY.pdf"));

        for (COSObject obj : doc.getDocument().getObjects())
        {
            COSBase base = obj.getObject();
            //System.out.println(obj + ": " + base);
            crawl(base);
        }
        System.out.println(strings.size() + " strings:");
        for (COSString s : strings)
        {
            String str = s.getString();
            if (!str.contains("\n"))
            {
                // avoid display problems with netbeans
                str = str.replaceAll("\r", "\n").replaceAll("\n\n", "\n");
            }
            System.out.println(str);
        }
        doc.close();
    }
}

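However, JavaScript can also be in a stream. See the PDF specification, "Additional entries specific to a rendition action", JS entry:

A text string or stream containing a JavaScript script that shall be executed when the action is triggered.

You can also change the code above to catch COSStream objects; COSStream extends COSDictionary.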
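As a rough illustration of that change, the sketch below walks every object the same way as the crawler above, but instead of collecting strings it looks for dictionaries whose /S entry is /JavaScript and reads their /JS entry. Wrapping the dictionary in `PDActionJavaScript` lets `getAction()` deal with /JS being either a string or a stream. This assumes PDFBox 2.0.x and a document loaded from a file or byte array; the class name `CollectJavaScriptActions` is hypothetical. It may also explain the empty list in the question: as far as I can tell, `getObjectsByType` compares against the /Type entry, while the object shown there identifies itself only through /S.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionJavaScript;

// Hypothetical helper class, not part of PDFBox.
public class CollectJavaScriptActions
{
    static List<String> collect(PDDocument doc)
    {
        List<String> scripts = new ArrayList<String>();
        // every indirect object known to the document is a starting point
        for (COSObject obj : doc.getDocument().getObjects())
        {
            crawl(obj.getObject(), scripts);
        }
        return scripts;
    }

    static void crawl(COSBase base, List<String> scripts)
    {
        if (base instanceof COSDictionary)
        {
            // COSStream extends COSDictionary, so streams are covered here too
            COSDictionary dict = (COSDictionary) base;
            if (COSName.JAVA_SCRIPT.equals(dict.getDictionaryObject(COSName.S)))
            {
                // getAction() handles /JS stored as a string or as a stream
                String js = new PDActionJavaScript(dict).getAction();
                if (js != null)
                {
                    scripts.add(js);
                }
            }
            for (COSName key : dict.keySet())
            {
                // raw entries only: references are not followed, which avoids cycles
                crawl(dict.getItem(key), scripts);
            }
        }
        else if (base instanceof COSArray)
        {
            for (COSBase item : (COSArray) base)
            {
                crawl(item, scripts);
            }
        }
        // COSObject references and all other types are skipped; referenced
        // objects are visited through the loop in collect()
    }
}
```

Not following references inside `crawl` is a deliberate choice: each indirect object is visited exactly once from `collect`, so an action is reported once no matter how many fields or annotations point to it.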

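【Comments】:

  • The structure of PDF is a bit like this Oscar-winning film: youtube.com/watch?v=6mtluyHcOnk
  • If you go to the Javadocs -> "Use" page for PDAction, look for every method that returns a PDAction, take the code for those, and then pull out the ones that are instanceof PDActionJavaScript, is that a good start? Is the PDF spec's "rendition action" a PDAction?
  • @TimAllison I wouldn't trust anything; I'd search for "javascript" in the PDFBox source code and in the PDF specification.
  • Yes, I've finished with the spec; next stop is the source code. Thanks!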