Extracting tables from a PDF in R: answers

【Question title】: Extracting tables from pdf in R
【Posted】: 2018-09-05 13:49:35
【Question description】:

I need to extract tables from a PDF. Here is the link:

https://ainfo.cnptia.embrapa.br/digital/bitstream/item/155505/1/doc-202-1.pdf

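I want to extract the tables on pages 15 to 21. All of these tables have the same structure (18 columns) and the same header. Here is a snapshot of a single table.

In each table I am only interested in columns 6 - 8 and column 17: `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de Adaptacao`.

This is what I did: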

library(dplyr)
library(tabulizer)

out <- extract_tables("mydocument.pdf", pages = 15:21)

# this gives me a list of 7 tables. 

temp <- data.frame(out[[1]]) # taking the first table as an example
# these are the columns corresponding to `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`
temp %>% dplyr::select(X3, X4, X5, X12)

# this is a snapshot of first table

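However, when I extract the 7th table:

temp <- data.frame(out[[7]])

# Columns 1 to 4 are merged into a single column.

In summary, extract_tables does not keep column positions consistent, and it merges columns in some tables. How can I fix this so that I end up with one combined table, in a single csv file, containing the columns Ciclo, Graus Dias/dias, Epcaja de Plantion and Regiao de adaptacao?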

【Question discussion】:

Tags: r pdf

【Solution 1】:

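In my experience this is a data preparation and wrangling problem rather than a parsing problem, because tabulizer's parsing algorithm leaves little room for adjustment here beyond switching between methods. From what I saw when trying to extract your tables, the 7th table is not the only one that is parsed incorrectly. Every page is parsed differently, but all the data appears to be preserved. I can see that your first table has 13 columns, the second 17, the third 12, the fourth 10 and the last three 11. What I suggest is to parse each page separately, clean each one according to the output you expect from it, and then bind them together. This is a long process and very specific to each parsed table, so I only provide a sample script: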

library(dplyr)
library(tidyr)
library(tabulizer)
# I create a dummy list to iterate through all the pages and push a data.frame in
result <- list()
for (i in 15:21){
  out <- as.data.frame(extract_tables("mydocument.pdf", pages = i, method = 'stream'), stringsAsFactors = FALSE)
  result[[i]] <- out
}
# Remove excess list items -
# there is probably a better way to do this from within the for loop
result <- result[-(1:14)]

## ------- DATA CLEANING OPERATIONS examples:
# Remove top 3x lines from the first page of table1 not part of data
result[[1]] <- result[[1]][-(1:3),]
# Perform data cleaning operations such as split/ merge columns according to your liking
# for instance if you want to split column X1 into 4 (as in your original post), you can do that by splitting by whitespace
result[[1]] <- separate(result[[1]], 1, into = c('X1.1','X1.2','X1.3', 'X1.4'),sep = ' ', remove = TRUE)

## ---- After data cleaning operations:
# Bind all data frames into one (by now they should have the same number of columns;
# make sure the column names match as well)
df <- bind_rows(result)
# Write your output csv file
write.csv(df, 'yourfilename.csv')
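
The cleaning step is specific to each page, so it helps to see the "select, rename, bind" part in isolation. Below is a minimal sketch using made-up matrices in place of the parsed pages. The values, the column positions and the helper `pick_columns` are all assumptions for illustration, not taken from the PDF. It uses base R `rbind` so it runs without extra packages; `dplyr::bind_rows` would do the same job.

```r
# Toy stand-ins for two parsed tables with different layouts
# (made-up values, NOT data from the PDF).
out <- list(
  matrix(c("1", "Precoce", "850", "Set-Out", "Sul",
           "2", "Normal",  "900", "Out-Nov", "Norte"),
         ncol = 5, byrow = TRUE),
  matrix(c("3 x", "Tardio", "950", "y", "Nov", "Sul"),
         ncol = 6, byrow = TRUE)
)

# Check how many columns each table ended up with.
col_counts <- vapply(out, ncol, integer(1))

# Common names for the four columns of interest.
target_names <- c("Ciclo", "Graus Dias/dias",
                  "Epcaja de Plantion", "Regiao de adaptacao")

# Hypothetical helper: keep the given column positions and give them
# the same names in every table so the rows line up when bound.
pick_columns <- function(tbl, positions, new_names) {
  df <- as.data.frame(tbl, stringsAsFactors = FALSE)[, positions, drop = FALSE]
  names(df) <- new_names
  df
}

# The positions differ per table and have to be found by inspection.
positions <- list(2:5, c(2, 3, 5, 6))

cleaned <- Map(function(tbl, pos) pick_columns(tbl, pos, target_names),
               out, positions)
combined <- do.call(rbind, cleaned)

# Write the combined table to a csv file (a temporary file here).
csv_path <- tempfile(fileext = ".csv")
write.csv(combined, csv_path, row.names = FALSE)
```

With real output from `extract_tables`, `out` would be the list of parsed tables, and `positions` would be filled in after looking at each table. Merged columns would still need to be split first, as in the script above.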

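You may also want to look at tabulizer's different parsing methods (I set it to 'stream' here because in my experience that usually gives the best results, but 'lattice' may work better for some tables).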
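
One way to compare the methods is to count the columns each one produces for a given page. The sketch below assumes a hypothetical helper, `columns_by_method`, that is not part of tabulizer. Its `extract` argument defaults to `tabulizer::extract_tables` but accepts any function taking the same `pages` and `method` arguments, so the logic can be tried without the package or the PDF.

```r
# Hypothetical helper: report the number of columns that each parsing
# method yields for the tables found on one page.
columns_by_method <- function(file, page,
                              methods = c("stream", "lattice"),
                              extract = tabulizer::extract_tables) {
  counts <- lapply(methods, function(m) {
    tables <- extract(file, pages = page, method = m)
    vapply(tables, ncol, integer(1))
  })
  names(counts) <- methods
  counts
}

# Usage (requires tabulizer and the PDF on disk):
# columns_by_method("mydocument.pdf", page = 21)
```

A method whose column count matches the 18 columns of the printed table is a reasonable first candidate for that page, although the cell contents still need to be checked by eye.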

【Discussion】:

tabulizer has been deprecated; it would be nice to have a new solution that does not use tabulizer.
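
One possible direction without tabulizer is to read the raw page text with `pdftools::pdf_text()`, which returns one string per page, and split it into cells yourself. The sketch below has not been run against the PDF in question. The helper `page_text_to_matrix` and the sample text are assumptions for illustration, and the helper relies on columns being separated by at least two spaces, which may not hold on every page.

```r
# Hypothetical helper: turn the text of one page into a character matrix,
# treating runs of two or more spaces as column separators.
page_text_to_matrix <- function(page_text) {
  lines <- trimws(strsplit(page_text, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) {
    return(matrix(character(0), nrow = 0, ncol = 0))
  }
  cells <- strsplit(lines, " {2,}")
  width <- max(lengths(cells))
  # Pad short rows with NA so every row has the same number of cells.
  padded <- lapply(cells, function(x) {
    c(x, rep(NA_character_, width - length(x)))
  })
  do.call(rbind, padded)
}

# Made-up text standing in for one element of pdftools::pdf_text(file).
sample_page <- "BRS 1   Precoce   850   Sul\nBRS 2   Normal    900   Norte\n"
parsed <- page_text_to_matrix(sample_page)
```

With pdftools installed, the pages of interest would be `pdftools::pdf_text("mydocument.pdf")[15:21]`, and `lapply()` over them with the helper gives one matrix per page. Header lines and cells that wrap onto a second line would still need manual cleaning, as with the tabulizer output.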