Using PDFBox to get location of line of text

【Question title】: Using PDFBox to get location of line of text
【Posted】: 2015-10-06 19:33:12
【Question】:

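I'm using PDFBox to extract information from a PDF, and the information I'm currently trying to find is the x position of the first character in a line. I can't find anything on how to get that information, though. I know PDFBox has a class called TextPosition, but I can't work out how to get a TextPosition object from a PDDocument either. How do I get the location information of a line of text from a PDF?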

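【Comments】:

There are multiple examples showing how to get TextPosition objects from a document, e.g. this answer, in its sections on the general procedure and on PDFBox issues. The issue has been resolved in the meantime.
@mkl How is writeString called? It's protected, so it's most likely called from another method of the TextStripper, but I'm not sure which one. I also tried the charactersByArticle solution from the next answer, but the vector I got back was empty.
"How is writeString called?" You apply the PDFTextStripper instance to your document, and that instance then calls writeString again and again.
"I tried the charactersByArticle solution from the next answer": that only works for PDFs that contain certain additional meta-information separating multiple articles in the document. If your PDF has no such information, charactersByArticle won't help.
Sorry, I'm new to working with PDFs, and I feel like you're referencing things you think I should know but don't. You say that applying a PDFTextStripper instance to my document will do it, but how do I do that? I've tried calling startDocument and getText, but neither of them ran the code in my new writeString method.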

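Tags: java pdf pdfbox

【Answer 1】:

In general

To extract text using PDFBox (with or without extra information such as positions, colors, etc.), you instantiate a PDFTextStripper, or a class derived from it, and use it like this: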

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

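(There are a number of PDFTextStripper properties that allow you to restrict the pages from which text is extracted.)

During the execution of getText, the content streams of the pages in question (and those of form xObjects referenced from those pages) are parsed and the text drawing commands are processed.

If you want to change the text extraction behavior, you have to change this text drawing command processing, which you usually should do by overriding this method: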

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

If you additionally need to know when a new line starts, you may also want to override

/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator( ) throws IOException
{
    output.write(getLineSeparator());
}

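writeString can be overridden to channel the text information into separate members (e.g. if you want a result in a more structured format than a mere String), or it can be overridden to simply add some extra information to the result String.

writeLineSeparator can be overridden to trigger some specific output between lines.

There are more methods that can be overridden, but in general you are less likely to need them.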
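To make the call flow concrete, here is a minimal sketch in plain Java that runs without PDFBox on the classpath. MiniStripper and Glyph are hypothetical stand-ins for PDFTextStripper and TextPosition, not PDFBox classes, and the input is assumed to be already grouped into lines and words (the real stripper derives that grouping from the page content streams). It only illustrates how getText drives the two hooks, and how a subclass can channel the x position of the first character of each line into a separate member:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineStartSketch {

    /** Hypothetical stand-in for PDFBox's TextPosition: one character and its x coordinate. */
    public static final class Glyph {
        public final char character;
        public final float x;

        public Glyph(char character, float x) {
            this.character = character;
            this.x = x;
        }
    }

    /** Hypothetical stand-in for PDFTextStripper: getText calls the hooks again and again. */
    public static class MiniStripper {
        protected final StringBuilder output = new StringBuilder();

        /** Input: lines, each a list of words, each a list of glyphs (assumed pre-grouped). */
        public String getText(List<List<List<Glyph>>> lines) {
            output.setLength(0);
            for (List<List<Glyph>> line : lines) {
                boolean firstWord = true;
                for (List<Glyph> word : line) {
                    if (!firstWord) {
                        output.append(' ');
                    }
                    StringBuilder text = new StringBuilder();
                    for (Glyph glyph : word) {
                        text.append(glyph.character);
                    }
                    writeString(text.toString(), word); // one call per word
                    firstWord = false;
                }
                writeLineSeparator(); // one call per line
            }
            return output.toString();
        }

        protected void writeString(String text, List<Glyph> glyphs) {
            output.append(text);
        }

        protected void writeLineSeparator() {
            output.append('\n');
        }
    }

    /** Subclass that records the x of the first glyph of every line in a separate member. */
    public static class LineStartCollector extends MiniStripper {
        public final List<Float> lineStarts = new ArrayList<>();
        private boolean startOfLine = true;

        @Override
        protected void writeLineSeparator() {
            startOfLine = true; // the next writeString call begins a new line
            super.writeLineSeparator();
        }

        @Override
        protected void writeString(String text, List<Glyph> glyphs) {
            if (startOfLine && !glyphs.isEmpty()) {
                lineStarts.add(glyphs.get(0).x);
                startOfLine = false;
            }
            super.writeString(text, glyphs);
        }
    }

    /** Builds a word whose glyphs are spaced 6 units apart, starting at x (made-up metrics). */
    public static List<Glyph> word(String text, float x) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), x + i * 6f));
        }
        return glyphs;
    }

    /** Two sample lines: one starting at x = 72, an indented one starting at x = 90. */
    public static List<List<List<Glyph>>> samplePage() {
        return Arrays.asList(
                Arrays.asList(word("Hello", 72f), word("world", 110f)),
                Arrays.asList(word("indented", 90f)));
    }

    public static void main(String[] args) {
        LineStartCollector collector = new LineStartCollector();
        System.out.print(collector.getText(samplePage()));
        System.out.println(collector.lineStarts); // prints [72.0, 90.0]
    }
}
```

With real PDFBox, the same two overrides would go into a PDFTextStripper subclass, reading the coordinate from the first TextPosition of the list instead of the stand-in Glyph.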

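The case at hand

"I'm using PDFBox to extract information from a PDF, and the information I'm currently trying to find is the x position of the first character in a line."

This can be implemented as follows (the example simply adds the information at the start of each line):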

PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void startPage(PDPage page) throws IOException
    {
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine)
        {
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
    boolean startOfLine = true;
};

text = stripper.getText(document);

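(ExtractText.java: method extractLineStart, tested by testExtractLineStartFromSampleFile)

【Comments】:

This answer helped me a lot. I also found my problem with calling getText: before I knew to extend PDFTextStripper, I had put in a getText function of my own, which kept it from calling the new writeString function. Thanks!
@Beez Can you share your code? I'm stuck on this kind of problem too. I want to change blue text (starting with "http" or "https") to black.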