【Question title】: Get font of each line using PDFBox
【Posted】: 2014-03-09 11:33:29
【Question】:

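Is there a way to get the font of each line of a PDF file using PDFBox? I have tried the code below, but it only lists all the fonts used on each page. It does not show which lines or pieces of text are rendered in which font.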

List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any input is appreciated. Thanks!

【Question comments】:

    Tags: pdf fonts pdfbox


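    【Answer 1】:

    Whenever you try to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content for you.

    You use the plain PDFTextStripper class like this: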

    PDDocument document = ...;
    PDFTextStripper stripper = new PDFTextStripper();
    // set stripper start and end page or bookmark attributes unless you want all the text
    String text = stripper.getText(document);
    

    This returns only the plain text, e.g. from an R40 form:

    Claim for repayment of tax deducted 
    from savings and investments
    How to fill in this form
    Please fill in this form with details of your income for the
    above tax year. The enclosed Notes will help you (but there is
    not a note for every box on the form). If you need more help
    with anything on this form, please phone us on the number
    shown above.
    If you are not a UK resident, do not use this form – please 
    contact us.
    Please do not send us any personal records, or tax
    certificates or vouchers with your form. We will contact 
    you if we need these.
    Please allow four weeks before contacting us about your
    repayment. We will pay you as quickly as possible.
    Use black ink and capital letters
    Cross out any mistakes and write the
    correct information below
    ...
    

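    On the other hand, you can override its method writeString(String, List<TextPosition>) and process more information than the mere text. To add the name of the font in use wherever the font changes, you can use this: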

    PDFTextStripper stripper = new PDFTextStripper() {
        // base font name of the most recently written character
        String prevBaseFont = "";
    
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            StringBuilder builder = new StringBuilder();
    
            for (TextPosition position : textPositions)
            {
                String baseFont = position.getFont().getBaseFont();
                // emit a [font name] marker whenever the font changes
                if (baseFont != null && !baseFont.equals(prevBaseFont))
                {
                    builder.append('[').append(baseFont).append(']');
                    prevBaseFont = baseFont;
                }
                builder.append(position.getCharacter());
            }
    
            // hand the annotated text to the stripper's plain-text output
            writeString(builder.toString());
        }
    };
    

    For the same form you now get

    [DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
    from savings and investments
    How to fill in this form
    [OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
    above tax year. The enclosed Notes will help you (but there is
    not a note for every box on the form). If you need more help
    with anything on this form, please phone us on the number
    shown above.
    If you are not a UK resident, do not use this form – please 
    contact us.
    [DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
    certificates or vouchers with your form. We will contact 
    you if we need these.
    [OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
    repayment. We will pay you as quickly as possible.
    Use black ink and capital letters
    Cross out any mistakes and write the
    correct information below
    ...
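
The font names in this output carry a subset tag such as "DHSLTQ+". In PDF, the name of an embedded font subset is prefixed with six uppercase letters and a plus sign, and two subsets of the same font may carry different tags. If you want to compare fonts by their actual name, a small helper along these lines might be useful. This is only a sketch: the class name FontNames and the method stripSubsetTag are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox. */
public class FontNames {

    // A subset tag is exactly six uppercase letters followed by '+', e.g. "DHSLTQ+".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the font name without a leading subset tag; other names are returned unchanged. */
    public static String stripSubsetTag(String baseFont) {
        if (baseFont == null) {
            return null;
        }
        return SUBSET_TAG.matcher(baseFont).replaceFirst("");
    }
}
```

In the override above you could then compare FontNames.stripSubsetTag(baseFont) against the previous value instead of the raw base font name.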
    

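    If you do not want the font information merged into the text, simply build separate structures in your method override.

    TextPosition offers a lot more information on the piece of text it represents. Inspect it!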
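
One possible shape for such a separate structure is a list of runs, each holding a font name and the text drawn in that font. The sketch below uses only the Java standard library. FontRunCollector and Run are hypothetical names chosen for illustration, not PDFBox classes, and the null handling simply mirrors the override above (a null font name keeps the current run going).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups text into runs that share one font name. */
public class FontRunCollector {

    /** One stretch of text rendered in a single font. */
    public static final class Run {
        public final String fontName;
        public final String text;

        Run(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    private final List<Run> runs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private String currentFont = null;

    /** Appends one character; a non-null font name that differs from the current one starts a new run. */
    public void add(String fontName, String character) {
        if (fontName != null && !fontName.equals(currentFont)) {
            if (current.length() > 0) {
                // close the run collected so far
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = fontName;
        }
        current.append(character);
    }

    /** Returns a copy of all runs, including the one still in progress. */
    public List<Run> getRuns() {
        List<Run> result = new ArrayList<>(runs);
        if (current.length() > 0) {
            result.add(new Run(currentFont, current.toString()));
        }
        return result;
    }
}
```

Inside writeString you would call something like collector.add(position.getFont().getBaseFont(), position.getCharacter()) for each TextPosition instead of appending markers to the text, and read collector.getRuns() once getText has returned.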

    【Comments】:

      【Answer 2】:

      To add to mkl's answer, if you are using PDFBox 2.0.8:

      • Use position.getFont().getName() instead of position.getFont().getBaseFont()
      • Use position.getUnicode() instead of position.getCharacter()

      More information on PDFont and TextPosition can be found in their online Javadocs.

      【Comments】:
