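Whenever you try to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start by trying the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in PDF content parsing for you.
You use the plain PDFTextStripper class like this: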
PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);
This returns only the plain text, e.g. from some R40 form:
Claim for repayment of tax deducted
from savings and investments
How to fill in this form
Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please
contact us.
Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact
you if we need these.
Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
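On the other hand, you can override its method writeString(String, List<TextPosition>) and process more information than the mere text. To add the name of the font in use wherever the font changes, you can use this: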
PDFTextStripper stripper = new PDFTextStripper() {
    // Base font name of the most recently written character
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();
        for (TextPosition position : textPositions)
        {
            // PDFBox 1.8.x API; in 2.x use getFont().getName() and getUnicode()
            String baseFont = position.getFont().getBaseFont();
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                // Emit a [FontName] marker wherever the font changes
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }
        writeString(builder.toString());
    }
};
For the same form you get
[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
If you don't want the font information merged into the text, simply create separate structures in your method override.
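As a rough illustration of such a separate structure, the sketch below groups characters into runs of consecutive text that share a font. It is plain Java and does not touch PDFBox itself; FontRun and FontRunCollector are hypothetical names introduced here, not PDFBox classes. A null font name simply continues the current run, mirroring the null check in the override above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder: a run of consecutive text drawn in one font. */
final class FontRun {
    final String fontName;
    final StringBuilder text = new StringBuilder();

    FontRun(String fontName) {
        this.fontName = fontName;
    }
}

/** Hypothetical collector that keeps font information apart from the extracted text. */
final class FontRunCollector {
    private final List<FontRun> runs = new ArrayList<>();

    /** Call once per character with the font name reported for it. */
    void add(String fontName, String character) {
        FontRun last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // Start a new run for the first character, or when a known font differs from the current one
        if (last == null || (fontName != null && !fontName.equals(last.fontName))) {
            last = new FontRun(fontName);
            runs.add(last);
        }
        last.text.append(character);
    }

    /** Runs in the order they were encountered. */
    List<FontRun> runs() {
        return runs;
    }
}
```

Inside the writeString override you would then call something like collector.add(position.getFont().getBaseFont(), position.getCharacter()) for each TextPosition and pass the unmodified text on to writeString(text), so the stripper output stays plain while the runs accumulate on the side.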
TextPosition provides a lot more information about the piece of text it represents. Inspect it!