【Question title】: PDF form to CSV using Python or similar
【Posted】: 2014-12-03 14:57:27
【Question】:

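I have a bunch of PDF forms created with Adobe FormsCentral. They all share the same format, and I would like to extract the data from the fields into a CSV file. I am (starting to be) somewhat familiar with Python and have tried a few libraries to extract the text via the XML tags. However, I have reached the point where I am out of my depth :(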

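I have managed to read the PDF using "pdfquery" and/or "beautifulsoup", but I cannot find a simple tutorial anywhere that helps me parse the PDF into CSV/Excel. I have searched and cannot find anything fully relevant. The XML tree I managed to extract gives me the tags for the field names (see below), but I do not know how to proceed from here. Does anyone have experience with this kind of task, or can anyone point me toward a tutorial?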

Thanks for your help!

Matty

    <pdfxml ModDate="D:20140414114502+03'00'" CreationDate="D:20140407143830-04'00'" Producer="Adobe FormsCentral 889953 S" Creator="Adobe FormsCentral 738134">
  <LTPage bbox="[0, 0, 595.27, 841.89]" height="841.89" pageid="1" rotate="0" width="595.27" x0="0" x1="595.27" y0="0" y1="841.89" page_index="0" page_label="">
    <LTRect bbox="[0.0, 0.0, 595.27, 841.89]" height="841.89" linewidth="0" pts="[[0.0, 0.0], [595.27, 0.0], [595.27, 841.89], [0.0, 841.89]]" width="595.27" x0="0.0" x1="595.27" y0="0.0" y1="841.89">
      <LTTextLineHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" width="99.816" word_margin="0.1" x0="34.015" x1="133.831" y0="732.217" y1="745.798"><LTTextBoxHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" index="1" width="99.816" x0="34.015" x1="133.831" y0="732.217" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextLineHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" width="94.724" word_margin="0.1" x0="34.015" x1="128.739" y0="707.554" y1="721.135"><LTTextBoxHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" index="2" width="94.724" x0="34.015" x1="128.739" y0="707.554" y1="721.135">Type of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 631.024, 136.667, 657.37]" height="26.347" index="3" width="102.642" x0="34.025" x1="136.667" y0="631.024" y1="657.37"><LTTextLineHorizontal bbox="[34.025, 643.789, 136.667, 657.37]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="643.789" y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 631.024, 112.269, 645.166]" height="14.143" width="78.244" word_margin="0.1" x0="34.025" x1="112.269" y0="631.024" y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 581.871, 136.667, 620.462]" height="38.592" index="4" width="102.642" x0="34.025" x1="136.667" y0="581.871" y1="620.462"><LTTextLineHorizontal bbox="[34.025, 606.881, 136.667, 620.462]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="606.881" y1="620.462">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 594.116, 134.963, 608.259]" height="14.143" width="100.938" word_margin="0.1" x0="34.025" x1="134.963" y0="594.116" y1="608.259">members aged 18-35 </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 581.871, 64.076, 596.014]" height="14.143" width="30.051" word_margin="0.1" x0="34.025" x1="64.076" y0="581.871" y1="596.014">(male) </LTTextLineHorizontal></LTTextBoxHorizontal>
      <LTTextLineHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" width="78.836" word_margin="0.1" x0="34.025" x1="112.861" y0="557.728" y1="571.31"><LTTextBoxHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" index="5" width="78.836" x0="34.025" x1="112.861" y0="557.728" y1="571.31">Location/Address </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal bbox="[34.025, 494.974, 138.371, 533.045]" height="38.071" index="6" width="104.346" x0="34.025" x1="138.371" y0="494.974" y1="533.045"><LTTextLineHorizontal bbox="[34.025, 519.463, 99.821, 533.045]" height="13.582" width="65.795" word_margin="0.1" x0="34.025" x1="99.821" y0="519.463" y1="533.045">Type of waste </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 507.218, 138.371, 520.8]" height="13.582" width="104.346" word_margin="0.1" x0="34.025" x1="138.371" y0="507.218" y1="520.8">management activities </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 494.974, 85.066, 508.555]" height="13.582" width="51.04" word_margin="0.1" x0="34.025" x1="85.066" y0="494.974" y1="508.555">carried out: </LTTextLineHorizontal></LTTextBoxHorizontal>

【Comments】:

    Tags: python xml csv pdf converter


    【Solution 1】:

    I prefer the lxml package because it has a very handy objectify module that makes parsing XML straightforward.

    Here is an example showing several ways to extract data from the XML:

    from lxml import objectify
    
    
    def parser(xml):
        """Print the tag, text and attributes of each child of the first LTRect."""
        root = objectify.fromstring(xml)
        print(root.LTPage.LTRect.attrib)
        # iterchildren() replaces the deprecated getchildren()
        for item in root.LTPage.LTRect.iterchildren():
            print(item.tag)
            # text is None when the text sits in a nested child element
            print(item.text)
            print(item.attrib)
            print(item.attrib["bbox"])
    
    if __name__ == "__main__":
        xml = """<pdfxml ModDate="D:20140414114502+03'00'" CreationDate="D:20140407143830-04'00'" Producer="Adobe FormsCentral 889953 S" Creator="Adobe FormsCentral 738134">
      <LTPage bbox="[0, 0, 595.27, 841.89]" height="841.89" pageid="1" rotate="0" width="595.27" x0="0" x1="595.27" y0="0" y1="841.89" page_index="0" page_label="">
        <LTRect bbox="[0.0, 0.0, 595.27, 841.89]" height="841.89" linewidth="0" pts="[[0.0, 0.0], [595.27, 0.0], [595.27, 841.89], [0.0, 841.89]]" width="595.27" x0="0.0" x1="595.27" y0="0.0" y1="841.89">
          <LTTextLineHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" width="99.816" word_margin="0.1" x0="34.015" x1="133.831" y0="732.217" y1="745.798"><LTTextBoxHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" index="1" width="99.816" x0="34.015" x1="133.831" y0="732.217" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
          <LTTextLineHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" width="94.724" word_margin="0.1" x0="34.015" x1="128.739" y0="707.554" y1="721.135"><LTTextBoxHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" index="2" width="94.724" x0="34.015" x1="128.739" y0="707.554" y1="721.135">Type of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
          <LTTextBoxHorizontal bbox="[34.025, 631.024, 136.667, 657.37]" height="26.347" index="3" width="102.642" x0="34.025" x1="136.667" y0="631.024" y1="657.37"><LTTextLineHorizontal bbox="[34.025, 643.789, 136.667, 657.37]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="643.789" y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 631.024, 112.269, 645.166]" height="14.143" width="78.244" word_margin="0.1" x0="34.025" x1="112.269" y0="631.024" y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
          <LTTextBoxHorizontal bbox="[34.025, 581.871, 136.667, 620.462]" height="38.592" index="4" width="102.642" x0="34.025" x1="136.667" y0="581.871" y1="620.462"><LTTextLineHorizontal bbox="[34.025, 606.881, 136.667, 620.462]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="606.881" y1="620.462">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 594.116, 134.963, 608.259]" height="14.143" width="100.938" word_margin="0.1" x0="34.025" x1="134.963" y0="594.116" y1="608.259">members aged 18-35 </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 581.871, 64.076, 596.014]" height="14.143" width="30.051" word_margin="0.1" x0="34.025" x1="64.076" y0="581.871" y1="596.014">(male) </LTTextLineHorizontal></LTTextBoxHorizontal>
          <LTTextLineHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" width="78.836" word_margin="0.1" x0="34.025" x1="112.861" y0="557.728" y1="571.31"><LTTextBoxHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" index="5" width="78.836" x0="34.025" x1="112.861" y0="557.728" y1="571.31">Location/Address </LTTextBoxHorizontal></LTTextLineHorizontal>
          <LTTextBoxHorizontal bbox="[34.025, 494.974, 138.371, 533.045]" height="38.071" index="6" width="104.346" x0="34.025" x1="138.371" y0="494.974" y1="533.045"><LTTextLineHorizontal bbox="[34.025, 519.463, 99.821, 533.045]" height="13.582" width="65.795" word_margin="0.1" x0="34.025" x1="99.821" y0="519.463" y1="533.045">Type of waste </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 507.218, 138.371, 520.8]" height="13.582" width="104.346" word_margin="0.1" x0="34.025" x1="138.371" y0="507.218" y1="520.8">management activities </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 494.974, 85.066, 508.555]" height="13.582" width="51.04" word_margin="0.1" x0="34.025" x1="85.066" y0="494.974" y1="508.555">carried out: </LTTextLineHorizontal></LTTextBoxHorizontal>
        </LTRect>
        </LTPage>
        </pdfxml>
          """
        parser(xml)
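
Since the question asks for CSV output, here is a minimal sketch of how the label text could be collected and written out with the `csv` module. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra installs. The function names `extract_labels` and `labels_to_csv` are made up for illustration. The XML excerpt in the question contains only the field labels, so this sketch writes them as a header row; where the filled-in values live in the real files is not shown in the question and would need to be checked.

```python
import csv
import io
import xml.etree.ElementTree as ET


def extract_labels(xml_text):
    """Return the text of every LTTextBoxHorizontal, top of page first."""
    root = ET.fromstring(xml_text)
    boxes = []
    for box in root.iter("LTTextBoxHorizontal"):
        # itertext() also collects text held by nested LTTextLineHorizontal
        # children; split/join collapses the trailing and repeated spaces.
        text = " ".join("".join(box.itertext()).split())
        boxes.append((float(box.get("y1")), text))
    # PDF coordinates grow upwards, so a larger y1 means higher on the page.
    boxes.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in boxes]


def labels_to_csv(labels, out):
    """Write the labels as a single header row to the file-like object out."""
    csv.writer(out).writerow(labels)


if __name__ == "__main__":
    # Trimmed-down sample with the same element names as the question's XML.
    sample = """<pdfxml><LTPage><LTRect>
      <LTTextLineHorizontal y1="745.798"><LTTextBoxHorizontal index="1" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
      <LTTextBoxHorizontal index="3" y1="657.37"><LTTextLineHorizontal y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
    </LTRect></LTPage></pdfxml>"""
    buffer = io.StringIO()
    labels_to_csv(extract_labels(sample), buffer)
    print(buffer.getvalue())
```

To process many forms, the same two functions could be called once per file, writing the header row once and then one data row per form, once the location of the field values is known.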
    

    Note that I modified the XML so that it has proper closing tags. You may also find this tutorial useful:

    【Comments】:
