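How can I extract data from unbalanced tables in a PDF using Python? Answers

【Question title】: How can I extract data from unbalanced tables in a PDF using Python?
【Posted】: 2021-06-17 06:11:31
【Question】:

I need to extract data from a table in a PDF (like the one shown below) using Python. I want all of the left-hand data on a page first, followed by the right-hand data. I tried text.split('\n') and re.split(r'\s{3,}'), but neither gave me what I need.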

Link of the pdf

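import re
import pdfplumber

pdf_path = 'Example.pdf'

lines = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without text in some versions
        text = page.extract_text() or ''
        for line in text.split('\n'):
            # split each line (not the whole page text) on runs of 3+ whitespace characters
            nline = re.split(r'\s{3,}', line)
            lines.append(nline)
            print(nline)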

First, I want a list like the following:

Text:    
1110 Crop production
1111A0 Oilseed farming 11111-2
1111B0 Grain farming 11113-6, 11119
----------------------------------
----------------------------------
311520 Ice cream and frozen dessert manufacturing 311520
----------------------------------
----------------------------------

Can anyone help?

【Comments】:

Tags: python python-3.x pdftotext

【Solution 1】:

You can use these libraries to extract text from a PDF:

PyPDF2
PDFMiner

The linked page gives instructions for getting started with these libraries.

Hope this helps.
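As a rough illustration of what plain text extraction with one of these libraries could look like, here is a minimal sketch using pdfminer.six. The helper names `clean_lines` and `pdf_to_lines` are made up for this example, and the sketch only returns the page text line by line; it makes no attempt to separate left and right columns.

```python
def clean_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def pdf_to_lines(path):
    """Extract the text of a PDF with pdfminer.six and return it line by line."""
    # pdfminer.six is a third-party package; it is imported here so that
    # clean_lines() stays usable without it.
    from pdfminer.high_level import extract_text

    return clean_lines(extract_text(path))


if __name__ == "__main__":
    # 'Example.pdf' is the file name used in the question.
    for line in pdf_to_lines("Example.pdf"):
        print(line)
```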

【Comments】:

PyPDF2 and PDFMiner do not work in this case.

【Solution 2】:

How about using tabula-py or camelot? I recently used these packages to parse a PDF into pandas DataFrames.

Here is their documentation:

https://tabula-py.readthedocs.io/en/latest/

https://camelot-py.readthedocs.io/en/master/
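For reference, a minimal sketch of how either package might be called. The "stream" mode is chosen on the assumption that the table has no ruling lines, and `rows_to_lines`, `read_with_camelot`, and `read_with_tabula` are hypothetical helper names. Whether either library detects the table correctly depends on the PDF.

```python
def rows_to_lines(rows):
    """Join the non-empty cells of each table row into one text line."""
    lines = []
    for row in rows:
        cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if cells:
            lines.append(" ".join(cells))
    return lines


def read_with_camelot(path, pages="1"):
    import camelot  # third-party: camelot-py

    # flavor="stream" guesses columns from whitespace instead of ruling lines
    tables = camelot.read_pdf(path, pages=pages, flavor="stream")
    lines = []
    for table in tables:
        lines.extend(rows_to_lines(table.df.values.tolist()))
    return lines


def read_with_tabula(path, pages="all"):
    import tabula  # third-party: tabula-py, needs a Java runtime

    # header=None keeps the first row as data instead of column names
    frames = tabula.read_pdf(path, pages=pages, stream=True,
                             pandas_options={"header": None})
    lines = []
    for frame in frames:
        # empty cells come back as NaN; replace them before joining
        lines.extend(rows_to_lines(frame.fillna("").values.tolist()))
    return lines
```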

【Comments】:

tabula-py did not work, and I could not get camelot-py to work.
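One option that stays with pdfplumber, sketched here under the assumption that the two columns are separated at the horizontal midpoint of each page: crop the page into a left and a right half with `page.crop()` and extract the text of each half in turn, which yields the left-hand data first and then the right-hand data. The names `half_boxes` and `extract_left_then_right` are hypothetical, and the midpoint may need adjusting for the actual layout.

```python
def half_boxes(width, height):
    """Return (x0, top, x1, bottom) boxes for the left and right page halves."""
    mid = width / 2  # assumption: the column gap sits at the page midpoint
    return (0, 0, mid, height), (mid, 0, width, height)


def extract_left_then_right(path):
    """Collect text lines page by page: left half first, then right half."""
    import pdfplumber  # third-party package

    lines = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for box in half_boxes(page.width, page.height):
                # extract_text() may return None when the cropped area is empty
                text = page.crop(box).extract_text() or ""
                lines.extend(text.splitlines())
    return lines


if __name__ == "__main__":
    # 'Example.pdf' is the file name used in the question.
    for line in extract_left_then_right("Example.pdf"):
        print(line)
```

If a row's text crosses the midpoint it will be cut in two, so it is worth checking the split position against one page before running it on the whole file.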