PDF to Pandas DataFrame: Answers

【Question title】: PDF to Pandas Data Frame
【Posted】: 2020-10-10 07:57:16
【Question】:

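Just when I thought I had finally got it. Such a newbie.

I'm trying to get a list of numbers from a column of a table in a PDF.

As a first step, I want to convert the table to a pandas DataFrame.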

pip install tabula-py
pip install PyPDF2

import pandas as pd
import tabula
df = tabula.read_pdf('/content/Manifest.pdf')

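However, the output I get is a list of length 1 rather than a DataFrame. When I inspect it, I can't work out how to access the data, because it is a list with one element.

So I don't know why I'm not getting a DataFrame, or what to do with the one-element list.

Not sure if it matters, but I'm using Google Colab.

Any help would be great.

Thanks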

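【Comments】:

Hey, since you're new, please check out How to Ask. You shouldn't include pictures/images of code. Also, without sample input (i.e. the PDF) it's hard to tell what the df should look like. And what exactly is your desired output? Take a look at the tabula documentation at tabula-py.readthedocs.io/en/latest/tabula.html, specifically the return type of the read_pdf() function.
Thanks for the info. Lots to learn, and it looks like how to ask properly is one of those things. Cheers

Tags: python pandas google-colaboratory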

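【Solution 1】:

Called without any additional arguments, tabula.read_pdf returns a list of DataFrames. To get at your particular DataFrame, select it from the list by index.

Here is an example where I read a document, take the first element, and compare the types: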

import tabula

df = tabula.read_pdf(
    "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf")

df_0 = df[0]

print("type of df :", type(df))
print("type of df_0", type(df_0))

type of df : <class 'list'>
type of df_0 <class 'pandas.core.frame.DataFrame'>
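
Since the original goal was a list of numbers from one column, here is a minimal sketch of that next step. It is not from the tabula-py docs. The helper name column_as_numbers, the column name "Qty", and the sample frames are made up for illustration, and the sample list only stands in for whatever tabula.read_pdf returns for your PDF.

```python
import pandas as pd


def column_as_numbers(frames, column):
    """Collect one column from a list of DataFrames as a list of numbers.

    `frames` is assumed to be the list returned by tabula.read_pdf(...).
    """
    # Keep only the tables that actually contain the requested column.
    matching = [f[[column]] for f in frames if column in f.columns]
    if not matching:
        return []
    # Stack the matching tables into a single frame.
    combined = pd.concat(matching, ignore_index=True)
    # Convert text such as "12" to numbers; cells that are not numeric
    # become NaN and are dropped.
    numbers = pd.to_numeric(combined[column], errors="coerce").dropna()
    return numbers.tolist()


# Hypothetical stand-in for: frames = tabula.read_pdf('/content/Manifest.pdf')
frames = [
    pd.DataFrame({"Item": ["a", "b"], "Qty": ["3", "5"]}),
    pd.DataFrame({"Item": ["c", "d"], "Qty": ["7", "n/a"]}),
]

print(len(frames))  # number of tables in the list
print(column_as_numbers(frames, "Qty"))
```

With a real PDF you would replace the sample list with the result of read_pdf and use the column header that actually appears in df[0].columns.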

【Comments】:

【Solution 2】:

Try df = tabula.read_pdf('/content/Manifest.pdf', sep=' ')

【Comments】: