How to extract multiple tables from one PDF file using Pandas and tabula-py

【Question title】: How to extract multiple tables from one PDF file using Pandas and tabula-py
【Posted】: 2021-07-16 12:01:02
【Question】:

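Can anyone help me extract multiple tables from ONE PDF file? I have 5 pages, and each page has a table with the same header columns.

Example of the table on each page: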

student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4

I want to extract all of these tables into a single DataFrame. First I tried:

df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)

But I got a messy output, something like this:

[student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4 ,student  Score Rang
Maxim    43     34
Nourah   93     5]

So I edited my code like this:

    import pandas as pd
    import tabula

    file_path = "filePath.pdf"
    
    # read my file
    df1 = tabula.read_pdf(file_path,pages=1,multiple_tables=True)
    df2 = tabula.read_pdf(file_path,pages=2,multiple_tables=True)
    df3 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
    df4 = tabula.read_pdf(file_path,pages=4,multiple_tables=True)
    df5 = tabula.read_pdf(file_path,pages=5,multiple_tables=True)

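This gives me a separate result for each table, but I don't know how to combine them into a single DataFrame, or whether there is another solution that avoids repeating lines of code.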

【Comments】:

Tags: python pandas dataframe pdf tabula

【Solution 1】:

According to the tabula documentation, read_pdf returns a list of DataFrames when the multiple_tables=True option is passed.

So you can call pandas.concat on its output to concatenate the DataFrames:

df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
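
A minimal sketch of what that one-liner does, using small hand-built DataFrames in place of tabula's output so that it runs without a PDF. The `tables` list is a stand-in built from the sample data in the question, and `ignore_index=True` is an addition that is not in the original answer; it is assumed here only to renumber the rows from 0.

```python
import pandas as pd

# Stand-in for tabula.read_pdf(file_path, pages='all', multiple_tables=True),
# which returns a list of DataFrames (one per detected table).
tables = [
    pd.DataFrame({"student": ["Alex", "Julia", "Mariana"],
                  "Score": [50, 80, 94],
                  "Rang": [23, 12, 4]}),
    pd.DataFrame({"student": ["Maxim", "Nourah"],
                  "Score": [43, 93],
                  "Rang": [34, 5]}),
]

# Stack the per-page tables vertically; ignore_index=True renumbers the rows
df = pd.concat(tables, ignore_index=True)
print(df.shape)
```

Because every table shares the same header, the columns line up and the rows are simply appended one table after another.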

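【Comments】:

I tried that too, but I got the error TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
What does this command return: type(tabula.read_pdf(file_path,pages=1,multiple_tables=True))? I suspect it is a list because of the multiple_tables=True option, in which case you need to take the first item. If it does return a list, please also provide the result of: type(tabula.read_pdf(file_path,pages=1,multiple_tables=True)[0])
According to tabula's documentation, read_pdf returns a list. See my updated answer.
Could you please also give me the output of pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))?
OK, I will update my answer to keep only the simplest solution.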
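
The TypeError quoted in the comments is the error pandas raises when an element of the sequence passed to pd.concat is itself a list rather than a DataFrame. The thread does not confirm the cause, but one plausible way to hit it is to pass the per-page results (df1 to df5, each of which is a list) to pd.concat together. A sketch under that assumption, with hypothetical stand-in data (`page1`, `page2`, `per_page`) instead of a real PDF:

```python
from itertools import chain

import pandas as pd

# Stand-ins for per-page results: with multiple_tables=True each call
# returns a list of DataFrames, so collecting them gives a list of lists.
page1 = [pd.DataFrame({"student": ["Alex"], "Score": [50], "Rang": [23]})]
page2 = [pd.DataFrame({"student": ["Maxim"], "Score": [43], "Rang": [34]})]
per_page = [page1, page2]

# Passing the nested list directly fails, because its elements are lists
try:
    pd.concat(per_page)
    nested_failed = False
except TypeError:
    nested_failed = True

# Flatten one level so every element is a DataFrame, then concatenate
combined = pd.concat(list(chain.from_iterable(per_page)), ignore_index=True)
print(combined)
```

With a real file, `per_page` could be built in a loop or list comprehension over the page numbers instead of five separate read_pdf lines, which also avoids the repeated code the question asks about.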