【Question title】: Writing the output of a program into a file
【Posted】: 2016-04-12 14:18:32
【Question】:

I wrote a program that parses a PDF into text. I get the output in the console, but I cannot write it to a file. Here is my code:

public class PDFTextParser {

    public static void main(String args[]) throws IOException {
        PDFTextStripper pdfStripper = null;
        COSDocument cosDoc = null;
        try {
            File file = new File("1.pdf");
            PDDocument pdDoc = PDDocument.load(file);
            pdfStripper = new PDFTextStripper();
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
            FileWriter out = new FileWriter("output.txt");
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line = in.readLine();
            while (line != null) {
                out.append(line);
                out.append("\n");
            }
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output is:

Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser      parseFileObject
WARNING: Object (6:0) at offset 1013093 does not end with 'endobj' but  with '7'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (7:0) at offset 1013211 does not end with 'endobj' but with '483'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (9:0) at offset 1020280 does not end with 'endobj' but with '10'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (10:0) at offset 1020396 does not end with 'endobj' but with '15'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (15:0) at offset 1020519 does not end with 'endobj' but with '16'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (16:0) at offset 1020640 does not end with 'endobj' but with '17'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (17:0) at offset 1020756 does not end with 'endobj' but with '18'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (18:0) at offset 1020874 does not end with 'endobj' but with '19'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (19:0) at offset 1020993 does not end with 'endobj' but with '24'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (24:0) at offset 1021111 does not end with 'endobj' but with '25'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (25:0) at offset 1021228 does not end with 'endobj' but with '26'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (26:0) at offset 1021350 does not end with 'endobj' but with '27'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (27:0) at offset 1021469 does not end with 'endobj' but with '28'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (28:0) at offset 1021589 does not end with 'endobj' but with '489'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (458:0) at offset 1026684 does not end with 'endobj' but with '463'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (463:0) at offset 1026809 does not end with 'endobj' but with '464'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (464:0) at offset 1026932 does not end with 'endobj' but with '465'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (465:0) at offset 1027050 does not end with 'endobj' but with '466'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (466:0) at offset 1027170 does not end with 'endobj' but with '495'

The parsed PDF text appears in the console, but the output file I get is empty.

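【Comments】:

  • What is the output of this code?
  • From what I see in your code, you only need to write parsedText to the file with out.append(parsedText); and then close out. Why are you using new InputStreamReader(System.in)? Do you want to get input from the user?
  • Your program just copies System.in into the output.txt file. So to see anything there, you need to give the program some input.
  • @Henry yup, and if the user types anything and hits Enter, the application goes into an infinite loop, appending the same line to the file writer :)
  • @Ria An addition to the existing answers: the warnings mean that the PDF does not conform to the PDF specification (if you upload the PDF, I can tell you more). It may be a binary file that was transferred in ASCII mode, or the producer of the PDF may have made a mistake. Sadly, this is not uncommon. If the PDF was produced in your company, tell them.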
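
The infinite loop mentioned in the comments comes from reading `line` once and never reading it again inside the loop. A minimal sketch of a loop that does terminate is shown below. The class name `LineCopy` and the method `copyLines` are hypothetical, and a `StringReader` stands in for `System.in` so the example runs without keyboard input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical illustration class; not part of the original post.
public class LineCopy {

    // Copies every line from source to sink and returns the number of lines copied.
    static int copyLines(Reader source, Writer sink) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int count = 0;
        String line;
        // readLine() is called on every iteration, so the loop ends at end of input.
        while ((line = in.readLine()) != null) {
            sink.append(line).append('\n');
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new InputStreamReader(System.in).
        StringWriter sink = new StringWriter();
        int copied = copyLines(new StringReader("first\nsecond\n"), sink);
        System.out.println(copied + " lines copied");
    }
}
```

Even with this fix, the loop only copies what is typed on standard input. It does not write `parsedText`, which is why the answers drop the loop entirely.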

Tags: java pdfbox netbeans-8.1


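【Solution 1】:

You have already extracted the text from the PDF; you only need to write it to the file. The rest of the code tries to read input from the user (for example, from the keyboard), which you do not need. Just use the following code: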

String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
FileWriter out = new FileWriter("output.txt"); 
out.append(parsedText);
out.close();

// No need for this code; it reads input from the user (using the keyboard).
/*
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line = in.readLine();
while (line != null) {
    out.append(line);
    out.append("\n");
}
out.close();
*/
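
As a variation on the snippet above, here is a minimal self-contained sketch that writes a string to a file using try-with-resources, so the writer is closed even if writing fails. The class name `TextFileWriter` and the method `writeText` are hypothetical, and the literal string stands in for the value returned by `pdfStripper.getText(pdDoc)`. The UTF-8 encoding is an assumption, not something the original post specifies.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the name is not from the original post.
public class TextFileWriter {

    // Writes the given text to the target file, replacing any existing content.
    static void writeText(Path target, String text) throws IOException {
        // try-with-resources flushes and closes the writer even if write() throws.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by pdfStripper.getText(pdDoc).
        String parsedText = "first line\nsecond line\n";
        writeText(Paths.get("output.txt"), parsedText);
    }
}
```

In the original program, the call would presumably become `writeText(Paths.get("output.txt"), parsedText)` right after the text is extracted.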

【Comments】:

【Solution 2】:

Have you looked at this post? system-out-to-a-file-in-java

I like the first answer there, though:

    java -jar myjar.jar > output.txt
    

In your case it would be something like this:

    java -cp <classpath> PDFTextParser > output.txt
    

Hope this helps.
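
If changing the launch command is not convenient, a similar effect can be approximated inside the program with `System.setOut`. The sketch below is only an illustration under assumptions: the class name `StdoutToFile` and the method `runWithStdoutRedirected` are hypothetical, and the `println` call stands in for `System.out.println(parsedText)`.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

// Hypothetical illustration class; not part of the original post.
public class StdoutToFile {

    // Runs body with System.out pointing at the given file, then restores the console.
    static void runWithStdoutRedirected(String fileName, Runnable body) throws IOException {
        PrintStream original = System.out;
        try (PrintStream fileOut =
                 new PrintStream(new FileOutputStream(fileName), true, "UTF-8")) {
            System.setOut(fileOut);
            body.run();
        } finally {
            // Restore the console stream after the file stream has been closed.
            System.setOut(original);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for System.out.println(parsedText) in the original program.
        runWithStdoutRedirected("output.txt", () -> System.out.println("parsed text goes here"));
        System.out.println("Back on the console");
    }
}
```

Both this approach and the shell redirection with `>` capture standard output only. The WARNING lines in the question look like java.util.logging output, which by default goes to standard error, so they would most likely still show up in the console rather than in output.txt.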

【Comments】:
