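【Posted】: 2018-10-29 04:27:17
【Question】:
Any suggestions would be greatly appreciated. This is time-sensitive. I have PDF reports that are mostly blocks of text. They are long reports (roughly 50-100 pages). I am trying to write an R script that can extract specific sections of these PDF reports using start/stop position strings. Note: the reports vary in length. A short example: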
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
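So the goal in this example is to extract the paragraph under Section 2 and store it as a field/data point. I would also like to store Section 11 as a field/data point. Please note that the documents are in PDF format.
I have tried pdftools, tm, and stringr, and I have honestly spent more than 20 hours looking for solutions and tutorials on how to do this. I know it is possible, because I have done it before in SAS...
Please see the code below, where I have added comments containing my questions. I believe regex will be part of the solution, but I am quite lost.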
# Init Step
libs <- c("tm","class","stringr","testthat",
"pdftools")
lapply(libs, require, character.only= TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss as to how to parse the values I want. I have seen things like:
# sectionxyz <- str_extract_all(textFile, #??? )
# rm_between()
# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
# and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)
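One possible way to approach the four questions in the comments above is sketched here. It is only an illustration under stated assumptions: the `pages` vector is made-up text standing in for the output of `pdf_text(filename)`, and every section header is assumed to start a line with two digits, a period, and a space. Real reports may need a different header pattern.

```r
# Made-up stand-in for pdf_text(filename): one string per page
pages <- c(
  "DOCUMENT TITLE\n01. SECTION 1\nText I do not want.\n02. SECTION 2\nText I do want,",
  "continued on the next page.\n11. SECTION 11\nMore text I do want.\n12. SECTION 12\nText I do not want."
)

# 1) No explicit page loop is needed: split every page into lines and flatten
lines <- trimws(unlist(strsplit(pages, "\n", fixed = TRUE)))

# 2) Assumption: a header looks like "NN. TITLE" at the start of a line
is_header <- grepl("^[0-9]{2}\\. ", lines)

# 3) cumsum() tags each line with the index of the most recent header
#    (0 means the line comes before any header)
section_id <- cumsum(is_header)
body <- tapply(lines[!is_header], section_id[!is_header], paste, collapse = " ")

# 4) One row per section; lines before the first header (id 0) are dropped
headers <- lines[is_header]
ids <- as.integer(names(body))
sections <- data.frame(
  header = headers[ids[ids > 0]],
  text = as.vector(body[ids > 0]),
  stringsAsFactors = FALSE
)

# Keep only the sections of interest (here: 02 and 11)
wanted <- sections[grepl("^(02|11)\\. ", sections$header), ]
print(wanted)
```

Because the pages are flattened into one vector of lines before splitting, a section that runs across a page break stays in one piece.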
NEW-----
So I have been able to:
- Prepare the document correctly
- Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
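The length(grep()) statements verify that only one instance of each is found. From here, I am a bit lost as to how to use gsub or something similar to extract the part of the data I want. I have tried: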
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
Not much progress so far, but progress nonetheless.
【Comments】:
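A possible alternative to gsub, sketched under one assumption: `source_main2` is a character vector with one line of the report per element, which the `^...$` anchors in the start pattern suggest. In that case gsub works on each element separately, so a single pattern cannot span from the start line to the stop line. Since grep returns the index of the matching element, the lines between the two indices can be selected directly. The sample lines below are made up for illustration.

```r
# Hypothetical stand-in for source_main2: one element per line of the report
source_main2 <- c(
  "10. SOME OTHER SECTION",
  "Text that should be skipped.",
  "11. LIMITS OF LIABILITY ",
  "First line of the wanted section.",
  "Second line of the wanted section.",
  "Applicable Limits of Insurance",
  "Text that should be skipped."
)

pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"

# grep() returns the index of the matching element, i.e. the line number
i_start <- grep(pat_st_lol, source_main2)
i_end <- grep(pat_ed_lol, source_main2)
stopifnot(length(i_start) == 1, length(i_end) == 1, i_start < i_end)

# Keep the lines strictly between the start and stop markers
# (seq_len() returns an empty selection if the markers are adjacent)
section_lines <- source_main2[i_start + seq_len(i_end - i_start - 1)]
lol_text <- paste(section_lines, collapse = " ")

# Store the extracted section as a single field
result <- data.frame(section = "LIMITS OF LIABILITY", text = lol_text,
                     stringsAsFactors = FALSE)
print(result)
```

The `stopifnot()` check repeats the length(grep()) verification, so the script stops early if a report contains zero or several matches for either marker.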
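When converting the PDF to text I have tried several approaches... but the text just ends up stored as a "value" in RStudio. I really need to be able to identify the start POSITION (where to begin extracting the string) --> START EXTRACTING STRING INTO DATAFRAME --> IDENTIFY END POSITION (where to stop extracting the string)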
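The position-based flow described in this comment could look roughly like the sketch below. It assumes the whole report has been collapsed into a single string, for example with `paste(pdf_text(filename), collapse = "\n")`. The text and both marker strings are hypothetical placeholders.

```r
# Made-up text standing in for the whole report collapsed into one string
full_text <- "01. SECTION 1\nskip this\n02. SECTION 2\nkeep this text\n03. SECTION 3\nskip this too"

start_marker <- "02. SECTION 2"  # hypothetical start string
stop_marker <- "03. SECTION 3"   # hypothetical stop string

# regexpr() returns the character position of the first match (-1 if none);
# fixed = TRUE treats the markers as literal strings, not regular expressions
m_start <- regexpr(start_marker, full_text, fixed = TRUE)
m_stop <- regexpr(stop_marker, full_text, fixed = TRUE)

start_pos <- as.integer(m_start)
start_len <- attr(m_start, "match.length")
stop_pos <- as.integer(m_stop)
stopifnot(start_pos > 0, stop_pos > start_pos)

# Extract from just after the start marker up to just before the stop marker
extracted <- substr(full_text, start_pos + start_len, stop_pos - 1)

# Store the result as a field in a data frame
df <- data.frame(section_2 = trimws(extracted), stringsAsFactors = FALSE)
print(df)
```

A character vector shown under "Values" in RStudio is an ordinary R object, so it can be passed to `regexpr()` and `substr()` like this without any conversion.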
Tags: r regex text-extraction data-extraction