PdfBox: extract text with the same font family from a PDF (answers)

【Question title】: PdfBox extract text with same font-family from pdf
【Posted】: 2013-09-23 12:55:28
【Question】:

I need to extract a piece of text from a PDF. What characterizes this text is that it all uses the same font family. Any ideas? Cheers

Edit: Let me ask the question another way: how can I extract the "bold" text from a PDF page?

【Question comments】:

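You can derive your own text extraction class from PDFTextStripper and, in it, filter the data that gets added to the extracted text. Depending on your source PDFs, though, the actual problem may be recognizing bold writing. Sometimes it is easy, because a genuinely bold font is used and it announces its boldness. Sometimes, however, the font does not tell, and sometimes boldness is emulated, for example by drawing the text twice with a small offset or by drawing it with a larger stroke width. I am not sure whether PDFBox recognizes all of these techniques out of the box.
Did you find a solution?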

Tags: java pdf pdfbox extraction

【Solution 1】:

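// Imports for PDFBox 1.8.x (in 2.x, PDFTextStripper moved to org.apache.pdfbox.text).
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Method of a class named PDFTextParser (the name the logger call refers to).
public String pdftoText(String fileName) {
    File f = new File(fileName);
    if (!f.isFile()) {
        System.out.println("File does not exist.");
        return null;
    }
    try {
        // Parse the file into a low-level COS document.
        PDFParser parser = new PDFParser(new FileInputStream(f));
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        // Wrap it in a PDDocument and extract the plain text of all pages.
        PDDocument pdDoc = new PDDocument(cosDoc);
        try {
            return new PDFTextStripper().getText(pdDoc);
        } finally {
            pdDoc.close(); // also closes the underlying COSDocument
        }
    } catch (IOException ex) {
        Logger.getLogger(PDFTextParser.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    }
}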

Before running: add pdfbox.jar to your project.

【Comments】:

This doesn't check for font characteristics such as bold at all, as the OP asked, does it?
stackoverflow.com/questions/19770987/…
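
For illustration only, here is a minimal sketch of the filtering idea raised in the question comments: keep a run of text only when its font name suggests a bold face. It is plain Java and does not call PDFBox. The class BoldFontNames, the Run type and the "name contains bold" rule are all assumptions made for this example. In a class derived from PDFTextStripper, the font of each piece of text would presumably be obtained from TextPosition.getFont(). As those comments point out, a name check like this cannot detect emulated bold (double drawing or a thicker stroke).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoldFontNames {

    /** Hypothetical stand-in for a run of extracted text and the name of its font. */
    public static final class Run {
        final String text;
        final String fontName;

        public Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and '+') if present. */
    public static String stripSubsetPrefix(String fontName) {
        if (fontName.matches("[A-Z]{6}\\+.*")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Heuristic (an assumption of this sketch): a font is bold if its name contains "bold". */
    public static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        return stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Concatenates the text of every run whose font name looks bold. */
    public static String boldText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            if (looksBold(run.fontName)) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Title ", "ABCDEF+Arial-BoldMT"));
        runs.add(new Run("body text ", "Arial"));
        runs.add(new Run("note", "Helvetica-Bold"));
        System.out.println(boldText(runs)); // prints "Title note"
    }
}
```

Matching on the font name is only the simplest case from the comments, the one where a genuinely bold font announces itself. Fonts whose names do not say so would need other information, such as the font descriptor.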