PdfBox - How to load color from text

【Question title】: PdfBox - How to load color from text
【Posted】: 2021-04-09 18:22:41
【Question description】:

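I have seen this question on many different forums, but I have not yet seen it answered properly. A few of the answers may work for some people, but they are overly complicated. I found a solution myself, so if you are interested, see the answer.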

【Question discussion】:

【Solution 1】:

Answer: extract the color of each character through the processTextPosition() method of the PDFTextStripper class.

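To extract colors, you need to subclass PDFTextStripper and give the subclass a constructor that registers additional operators, because color handling is not enabled in the default PDFTextStripper. See the text extraction section of https://pdfbox.apache.org/2.0/migration.html for more information. From that page we get the operators to add in the subclass constructor: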

addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());

We can then add a boolean to our new subclass that is set to true each time a new line starts while the text is being processed:

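import java.io.IOException;

import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextStripperSuper extends PDFTextStripper {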
    boolean newLine = true;
    
    public PDFTextStripperSuper() throws IOException {
        addOperator(new SetStrokingColorSpace());
        addOperator(new SetNonStrokingColorSpace());
        addOperator(new SetStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetStrokingDeviceGrayColor());
        addOperator(new SetStrokingColor());
        addOperator(new SetStrokingColorN());
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
    }
    
    @Override
    protected void startPage(PDPage page) throws IOException {
        newLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException {
        newLine = true;
        super.writeLineSeparator();
    }
}

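So now we have a text stripper that can extract each line of text along with the character colors. To do this, all we need to do is override writeString() to capture each line of text and override processTextPosition() to capture the color of each character: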

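import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.text.TextPosition;

public class DocAnalyzer {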
    public DocAnalyzer(PDDocument doc) throws IOException {
        ArrayList<String> lines = new ArrayList<>();
        ArrayList<PDColor> charColors = new ArrayList<>();
        PDFTextStripperSuper tp = new PDFTextStripperSuper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions)
                    throws IOException {
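                // Records only the first chunk written after a line break; if the
                // stripper calls writeString() per word, later words are skipped.
                if (newLine) {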
                    lines.add(text);
                    newLine = false;
                }
                super.writeString(text, textPositions);
            }
            
            @Override
            protected void processTextPosition(TextPosition text) {
                super.processTextPosition(text);
                charColors.add(getGraphicsState().getNonStrokingColor());
            }
        };
        
        tp.getText(doc); // processes the text and fills both lists
    }
}

There you have it! All the text colors should now be in your charColors list. That is all the help I have for you ;)!
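If you want to do something with the colors afterwards, here is a minimal sketch of one way to go about it. It assumes you have already turned each PDColor into a packed 0xRRGGBB int (PDColor.toRGB() returns one in PDFBox 2.0) and that you recorded exactly one value per character, in the same order as the characters of your text. The class ColorRuns and its methods toHex() and groupRuns() are hypothetical names made up for this example; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: formats packed RGB values as hex
// strings and groups consecutive same-colored characters into runs.
class ColorRuns {

    // Formats a packed 0xRRGGBB int as "#RRGGBB".
    static String toHex(int rgb) {
        return String.format("#%06X", rgb & 0xFFFFFF);
    }

    // Groups consecutive characters that share a color into "#RRGGBB:text" entries.
    // Assumption: rgbPerChar holds exactly one value per char of text, in order.
    static List<String> groupRuns(String text, List<Integer> rgbPerChar) {
        if (text.length() != rgbPerChar.size()) {
            throw new IllegalArgumentException("need one color per character");
        }
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            // A run ends at the end of the text or where the color changes.
            boolean endOfRun = i == text.length()
                    || !rgbPerChar.get(i).equals(rgbPerChar.get(start));
            if (endOfRun) {
                runs.add(toHex(rgbPerChar.get(start)) + ":" + text.substring(start, i));
                start = i;
            }
        }
        return runs;
    }
}
```

Keep in mind that processTextPosition() sees the glyphs in the order they appear in the content stream, which may not match the order of the extracted text, so you may prefer to record text.getUnicode() next to each color inside processTextPosition() and build the input for groupRuns() from those pairs.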

【Discussion】: