Font problems when parsing PDF to text using PDFBox, FontBox, etc.

【Question title】: Font problems in parsing PDF to text using PDFBox, FontBox etc.
【Posted】: 2011-09-17 11:06:28
【Question】:

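I am using the PDFBox API to extract text from a PDF.
My program runs fine and it does extract text from the PDF, but the problem is that the text in the PDF uses the CDAC-GISTSurekh font (a Hindi font), while my program's output comes out in Mangal.
It does not even match the text in the PDF.
I downloaded the same font, CDAC-GISTSurekh (Hindi font), and added it to my computer's fonts, but the output is still in Mangal.
Is there any way to change the output font while parsing?

Any help is appreciated.

The code I wrote: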



    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {
        static String pdftoText(String fileName) {
            PDFParser parser;
            String parsedText = null;
            PDFTextStripper pdfStripper = null;
            PDDocument pdDoc = null;
            COSDocument cosDoc = null;
            File file = new File(fileName);
            if (!file.isFile()) {
                System.out.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(file));
            } catch (IOException e) {
                System.out.println("Unable to open PDF Parser. " + e.getMessage());
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("An exception occurred in parsing the PDF Document. " + e.getMessage());
            } finally {
                try {
                    if (cosDoc != null)
                        cosDoc.close();
                    if (pdDoc != null)
                        pdDoc.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            return parsedText;
        }
        public static void main(String args[]){
            System.out.println(pdftoText("J:\\Users\\Shantanu\\Documents\\NetBeansProjects\\Pdf\\src\\PDfman\\A0410001.pdf"));
        }
    }

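【Comments】:

Are you trying to read a voter list? If so, one thing I found is that the text is stored as images, which makes it hard to parse. I am trying to do the same thing. Did you manage to parse it?

Tags: java pdfbox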

【Solution 1】:

When you create the new PDFTextStripper object, use the following syntax and pass it the encoding.

PDFTextStripper pdfStripper = new PDFTextStripper("ISO-XXXX");

where "ISO-XXXX" is the character encoding used in the PDF.
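
A side note on what an encoding setting can and cannot change: the String returned by `getText` is a sequence of Unicode characters and carries no font at all. The font seen in the output (Mangal, for instance) is most likely just whatever the console or editor picks to render Devanagari. What can be checked is whether the extracted characters really are Devanagari, or whether a legacy font mapping produced unrelated codes. Below is a minimal sketch of such a check using only the Java standard library. The class name `ExtractedTextCheck`, the helpers `devanagariRatio` and `writeUtf8`, and the sample string are all hypothetical and stand in for the real extraction result.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
class ExtractedTextCheck {

    /** Fraction of non-whitespace code points that lie in the Devanagari block (U+0900..U+097F). */
    static double devanagariRatio(String text) {
        long total = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
        if (total == 0) {
            return 0.0;
        }
        long devanagari = text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.DEVANAGARI)
                .count();
        return (double) devanagari / total;
    }

    /** Writes the text as UTF-8 so the file can be opened with any Devanagari font. */
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in for the value returned by pdfStripper.getText(pdDoc).
        String extracted = "\u0928\u092E\u0938\u094D\u0924\u0947";
        System.out.println("Devanagari ratio: " + devanagariRatio(extracted));

        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(out, extracted);
        System.out.println("Wrote " + out);
    }
}
```

If the ratio is close to 1, the text is already proper Unicode and only the display font differs; the saved file can then be opened in an editor with CDAC-GISTSurekh (or any other Devanagari font) selected. If the ratio is near 0 for a Hindi document, the PDF probably uses a non-Unicode glyph mapping, and changing the output font will not help.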

【Comments】:

Where did you find that code? Is there a way to find out which ISO encoding was used when the PDF was saved?
@Yonkee arg, there is no such constructor.