PDFBox hasGlyph() returns true for unsupported unicode control characters

【Question title】: PDFBox hasGlyph() returns true for unsupported unicode control characters
【Posted】: 2017-07-23 21:25:49
【Question description】:

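I am using Apache's PDFBox library to write a PdfDocumentBuilder class. Before trying to write a character to the file, I check whether it has a glyph using currentFont.hasGlyph(character). The problem is that when the character is a unicode control character such as '\u001f', hasGlyph() returns true, which causes encode() to throw an exception on writing (see the PdfDocumentBuilder code and the stack trace below for reference).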

I did some research, and it appears that the font I am using (Courier Prime) does not support these unicode control characters.

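So why does hasGlyph() return true when the unicode control characters are not supported? Of course, I could strip the control characters from the line with a simple replaceAll before it enters the writeTextWithSymbol() method, but if hasGlyph() does not work the way I expect, I have a bigger problem.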
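For reference, the replaceAll approach mentioned above could look roughly like the following. This is only a minimal sketch: the class name ControlCharFilter and the method name stripControlChars are made up for illustration and are not part of the original code. Note that \p{Cntrl} also matches tab, carriage return and line feed, so it is only suitable for text that has already been split into lines.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original PdfDocumentBuilder.
final class ControlCharFilter {
    // In Java's default (non-Unicode) mode, \p{Cntrl} matches U+0000-U+001F and U+007F.
    private static final Pattern CONTROL_CHARS = Pattern.compile("\\p{Cntrl}");

    private ControlCharFilter() {
    }

    // Returns the line with every ASCII control character removed.
    static String stripControlChars(String line) {
        return CONTROL_CHARS.matcher(line).replaceAll("");
    }
}
```

This only hides the symptom for control characters; it does not help with any other character that hasGlyph() might misreport.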

PdfDocumentBuilder:

private final PDType0Font baseFont;
private PDType0Font currentFont;

// PDType0Font.load() throws IOException, so the constructor has to declare it
public PdfDocumentBuilder () throws IOException {
    baseFont = PDType0Font.load(doc, this.getClass().getResourceAsStream("/CourierPrime.ttf"));
    currentFont = baseFont;
}

private void writeTextWithSymbol (String text) throws IOException {
    StringBuilder nonSymbolBuffer = new StringBuilder();
    for (char character : text.toCharArray()) {
        if (currentFont.hasGlyph(character)) {
            nonSymbolBuffer.append(character);
        } else {
            //handling writing line with symbols...
        }
    }
    if (nonSymbolBuffer.length() > 0) {
        content.showText(nonSymbolBuffer.toString());
    }
}

Stack trace:

java.lang.IllegalArgumentException: No glyph for U+001F in font CourierPrime
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.encode(PDCIDFontType2.java:400)
at org.apache.pdfbox.pdmodel.font.PDType0Font.encode(PDType0Font.java:351)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:316)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:414)
at org.main.export.PdfDocumentBuilder.writeTextWithSymbol(PdfDocumentBuilder.java:193)

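【Comments】:

Which version are you using? The latest is 2.0.4, so please retry with that to be sure. Does this also happen with an ordinary font such as Arial? How did you create the font object?
@TilmanHausherr I am using 2.0.4. I have tested with several other standard fonts and the problem persists. I have added the code that creates the currentFont object.
I created a simple test and yes, it happens with Arial too. I assume that either the bug is in hasGlyph, or the parameter of hasGlyph is not what one would think it is. The best approach for you is to call font.encode() for each character and catch the IllegalArgumentException to find out whether the character is supported. That way you know for sure. I will create an issue in JIRA later.
@TilmanHausherr OK, thanks. I think that is the best option. I dug a bit further, and for some reason the value of the character is in the CMap of the PDType0Font but not in the CMapSubtable of the PDCIDFontType2, so getGlyphId returns 0, which causes the exception.
I created issues.apache.org/jira/browse/PDFBOX-3708 . If you register in JIRA, you can follow any developments. If you like, you can answer this question here yourself with your workaround so that it helps others; I am a bit busy and I have enough points here :-)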

Tags: java unicode pdfbox

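【Solution 1】:

As explained in the comments above, hasGlyph() is not meant to take a unicode character as its argument. So if you need to check whether a character can be encoded before writing it, you can do something like this: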

private void writeTextWithSymbol (String text) throws IOException {
    StringBuilder nonSymbolBuffer = new StringBuilder();
    for (char character : text.toCharArray()) {
        if (isCharacterEncodeable(character)) {
            nonSymbolBuffer.append(character);
        } else {
            //handle writing line with symbols...
        }
    }
    if (nonSymbolBuffer.length() > 0) {
        content.showText(nonSymbolBuffer.toString());
    }
}

private boolean isCharacterEncodeable (char character) throws IOException {
    try {
        currentFont.encode(Character.toString(character));
        return true;
    } catch (IllegalArgumentException iae) {
        LOGGER.trace("Character cannot be encoded", iae);
        return false;
    }
}
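
Because this check relies on catching an exception for every unsupported character, it may be worth remembering the result per character when the same characters show up repeatedly. The following is only a sketch of that idea under stated assumptions: EncodableCache and its probe parameter are hypothetical names, not part of PDFBox or of the code above. The probe stands in for a call such as isCharacterEncodeable() (wrapped so that its checked IOException is handled), and a separate cache would be needed for each font.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper: remembers, per character, whether the probe accepted it.
final class EncodableCache {
    private final Map<Character, Boolean> cache = new HashMap<>();
    // Assumption: the probe performs the real encode-and-catch check for one font.
    private final Predicate<Character> probe;

    EncodableCache(Predicate<Character> probe) {
        this.probe = probe;
    }

    // Runs the probe only the first time a given character is seen.
    boolean isEncodable(char character) {
        return cache.computeIfAbsent(character, probe::test);
    }
}
```

If currentFont is switched, the cached answers no longer apply, so the cache would have to be replaced together with the font.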

【Comments】:

You are catching IOException in isCharacterEncodeable(). Shouldn't you be catching IllegalArgumentException?