【Question title】: What is the best way to read data from a table in a PDF?
【Posted】: 2018-03-19 21:08:01
【Question】:

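I want to read the data from the table in this PDF.

PDF

I had thought about reading the PDF, exporting it to Excel, and then working with the data. The problem with that approach is that, because I read the file with Apache POI, some column elements shift into empty columns, and the whole PDF ends up stored in a single string.

Another approach would be to read the data by exact coordinates, but I don't think that is a good option.

Can anyone give me some advice? Which of these approaches is better, or is there another one I have not considered?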

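【Comments on the question】:

  • The best way to read a PDF is to use a PDF library. Now go and do some research to see whether you can find a tool that can read PDF table content from Java.

Tags: java pdf export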


【Solution 1】:

I have had the best luck with Xpdf's pdftotext, using a combination of the -layout and -table options.

Xpdf Link

You would call it like this:

pdftotext -table c:\temp\ENaB20180317.pdf c:\temp\output.txt
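If the conversion has to be started from Java, the same command can be launched with ProcessBuilder. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the pdftotext executable is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class PdfToTextRunner {
    // Builds the same command line as shown above.
    // Assumption: "pdftotext" can be resolved through the PATH.
    static List<String> buildCommand(String pdfPath, String txtPath) {
        return Arrays.asList("pdftotext", "-table", pdfPath, txtPath);
    }

    // Starts pdftotext, waits for it to finish and returns its exit code
    // (0 means the conversion succeeded).
    static int run(String pdfPath, String txtPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath, txtPath))
                .inheritIO() // show pdftotext's own messages on the console
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list, rather than as one command string, avoids quoting problems with paths that contain spaces.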

You can then parse the output by taking the starting column positions from the header line on each page.
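The header-based parsing could look roughly like the following sketch. It assumes that the header labels are known in advance, that each label starts at the same character offset as the cells below it, and that no cell runs into the next column. None of that is guaranteed for every PDF, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSlicer {
    // Finds the starting character offset of each known header label,
    // searching from left to right in the header line.
    static int[] columnStarts(String headerLine, List<String> labels) {
        int[] starts = new int[labels.size()];
        int from = 0;
        for (int i = 0; i < labels.size(); i++) {
            int pos = headerLine.indexOf(labels.get(i), from);
            if (pos < 0) {
                throw new IllegalArgumentException("Header not found: " + labels.get(i));
            }
            starts[i] = pos;
            from = pos + labels.get(i).length();
        }
        return starts;
    }

    // Cuts one data line at the header offsets and trims each cell.
    // Lines shorter than an offset simply yield empty cells.
    static List<String> slice(String line, int[] starts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            int begin = Math.min(starts[i], line.length());
            int end = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(begin, end).trim());
        }
        return cells;
    }
}
```

Because the offsets may differ from page to page, columnStarts would be called again on the header line of every page, as described above.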

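Another good option is PDFBox, which can extract the text in a format you can work with, without having to call a separate command-line application.

【Comments】: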

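    【Solution 2】:

    Your sample file is tagged. As long as the PDFs you are interested in are properly tagged, you can use a tool that extracts the tagging information. With iText 7, for example, you can do this: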

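    import com.itextpdf.kernel.pdf.PdfDocument;
    import com.itextpdf.kernel.pdf.PdfReader;
    import com.itextpdf.kernel.utils.TaggedPdfReaderTool;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    // resource is the tagged PDF, e.g. an InputStream or a file path
    PdfReader reader = new PdfReader(resource);
    PdfDocument document = new PdfDocument(reader);

    // Write the document's tag structure out as XML
    TaggedPdfReaderTool tool = new TaggedPdfReaderTool(document);
    try (OutputStream result = new FileOutputStream("ENaB 20180317.txt")) {
        tool.convertToXml(result);
    }
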

    The result looks like this:

    <Workbook>
    <Worksheet>
    <Table bBox="[-12.1039 12.0189 829.816 607.339 ]" o="/Layout">
    <TR>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    FECHA</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    HORA</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    CATEGOR&#205;A</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    PARTIDO</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" o="/Layout">
    PABELL&#211;N</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" o="/Layout">
    LOCALIDAD</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" o="/Layout">
    ARBITRO
    PRINCIPAL</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    ARBITRO
    AUXILIAR</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    ACTA</TD>
    <TD backgroundColor="[0.600006 0.800003 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    CRONO</TD>
    <TD>
    FECHA</TD>
    <TD>
    HORA</TD>
    <TD>
    CATEGOR&#205;A</TD>
    <TD>
    PARTIDO</TD>
    <TD>
    PABELL&#211;N</TD>
    <TD>
    LOCALIDAD</TD>
    <TD>
    ARBITRO
    PRINCIPAL</TD>
    <TD>
    ARBITRO
    AUXILIAR</TD>
    <TD>
    ACTA</TD>
    <TD>
    CRONO</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    16/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17:30</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    MINI. FEM. GR.B</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    EG&#220;ES-MENDILLORRI 2006*  -  ARDOI ABEJAS</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. TRINKETE</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD>
    </TD>
    <TD>
    </TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    I.VELASCO</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    16/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17:30</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    PREMINI. MAS. GR.1-10</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    SAN IGNACIO 07  -  CB BURLADA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. SAN IGNACIO</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    A.ELOUARDI</TD>
    <TD>
    </TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    A.IRAOLA</TD>
    </TR>
    <TR>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:10</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    INFANTIL MAS. 1&#170;</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    GAZTE BERRIAK  -  LAGUNAK</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    P. ARROSADIA P.I</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    P.HORME&#209;O</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    M.YKHELEF</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    I.VELASCO</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:10</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    INFANTIL MAS. 1&#170;</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    SAN CERNIN B  -  AUTOBUSES LA PAMPLONESA ARDOI</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. ARROSADIA P.II</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    J.HUEGUN</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    A.PIDAL</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    M.TESO</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    OFI.CNaB</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:10</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    MINI. MAS. GR.B</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    PATXI LARRAINZAR  LARANJA*  -  LA COMPASION*</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. AZPILAGA&#209;A P.III</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    S.VELIZAROVA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    N.VILLAR</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    L.ORTIZ</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    OFI.CNaB</TD>
    </TR>
    <TR>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:10</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    MINI. MAS. GR.A</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    NAVARRO VILLOSLADA 06M  -  ADEMAR</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    P. IRIBARREN P.II</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    A.ELOUARDI</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    R.MARTIN</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    A.DE LA NAVA</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    OFI.CNaB</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:10</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    CADETE MAS. 2&#170;</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    MENDIEG&#220;ES BNC*  -  BERRIOZAR MKE</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. LARRABIDE P.I</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD>
    </TD>
    <TD>
    </TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    I.LECUMBERRI</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    J.RAMIREZ</TD>
    </TR>
    <TR>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:15</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    PREINF. FEM. 1&#170;</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    MENDILLORRI 2005 AMARILLO  -  CBASK M</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    P. AZPILAGA&#209;A P.I</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" o="/Layout">
    A.PEREZGARCIA</TD>
    <TD>
    </TD>
    <TD backgroundColor="[1.0 1.0 1.0 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    A.IRIGOIEN</TD>
    </TR>
    <TR>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    17/03/2018</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    09:15</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    INFANTIL MAS. 2&#170;</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    CANTOLAGUA  -  CB ONCINEDA SK*</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    P. AZPILAGA&#209;A P.II</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    PAMPLONA</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" o="/Layout">
    P.ALVAREZ</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    R.GARCIAMIGUEL</TD>
    <TD backgroundColor="[0.752945 0.752945 0.752945 ]" blockAlign="/Middle" borderColor="[0.0 0.0 0.0 ]" borderStyle="/Solid" borderThickness="1.0" o="/Layout">
    I.TAGUA</TD>
    </TR>
    [...]
    </Table>
    </Worksheet>
    </Workbook>
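To work with that XML in Java, the rows can be read back with the JDK's built-in DOM parser. The sketch below is one possible way to do it, not part of the original answer; the class and method names are hypothetical, and it takes the XML as a string only to keep the example self-contained. Note that in the output above the rows do not all have the same number of cells: the first TR repeats the header cells, and some data rows have fewer cells than there are headers. The code therefore returns each row as found instead of assuming a fixed width.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TaggedTableReader {
    // Returns one list of cell texts per <TR>, in document order.
    static List<List<String>> readRows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the tag XML", e);
        }
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int r = 0; r < trs.getLength(); r++) {
            NodeList tds = ((Element) trs.item(r)).getElementsByTagName("TD");
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < tds.getLength(); c++) {
                // Collapse line breaks inside multi-line cells such as "ARBITRO / PRINCIPAL".
                cells.add(tds.item(c).getTextContent().trim().replaceAll("\\s+", " "));
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

Empty TD elements come back as empty strings, so a missing referee stays visible as a blank cell rather than shifting the following values to the left.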
    

    【Comments】:
