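[Posted]: 2017-09-14 06:14:13
[Question]:
I have a list of .pdf files in a folder. I want to extract the first two paragraphs of text from each file and store them in a .csv file. I can convert the PDF to text, but I cannot extract just the first two paragraphs.
Here is what I have tried: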
setwd("D:/All_PDF_Files")            # folder that holds the PDF files
install.packages("pdftools")
install.packages("qdapRegex")
library(pdftools)
library(qdapRegex)
All_files=Sys.glob("*.pdf")          # names of all PDF files in the folder
txt <- pdf_text("first.pdf")         # character vector, one element per page
cat(txt[1])
rm_between(txt, 'This ', '1. ', extract=TRUE)[[1]]
But this gives me "NA".
The output of cat(txt[1]) is:
"Maharashtra Real Estate Regulatory Authority
REGISTRATION CERTIFICATE OF PROJECT
FORM 'C'
[See rule 6(a)]
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,
District: Pune, Pin: 411001.
2. This registration is granted subject to the following conditions, namely:"
The text I want to extract is:
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
Is there a better way to do this?
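One possible direction, offered as a minimal sketch rather than a confirmed fix: match the wanted block with a base R regular expression that is allowed to span line breaks. The helper names extract_intro and extract_folder, the output column names, and the choice of "This " and "1. " as start and end markers are assumptions made for illustration; the markers are taken from the sample page shown above and may not hold for every PDF in the folder.

```r
# Hypothetical helper: return the text from the first line that starts with
# "This " up to (not including) the line that starts with "1. ".
# Returns NA when the markers are not found.
extract_intro <- function(page_text) {
  # (?s) lets "." match newlines; (?m) lets "^" match at every line start
  pattern <- "(?sm)^[ \\t]*This .*?(?=\\n\\s*1\\. )"
  hit <- regmatches(page_text, regexpr(pattern, page_text, perl = TRUE))
  if (length(hit) == 0) NA_character_ else trimws(hit)
}

# Hypothetical batch wrapper: apply extract_intro to page 1 of every PDF in
# `dir` and write one row per file to `out_csv`. Requires the pdftools package.
extract_folder <- function(dir, out_csv) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  intro <- vapply(
    files,
    function(f) extract_intro(pdftools::pdf_text(f)[1]),
    character(1),
    USE.NAMES = FALSE
  )
  write.csv(data.frame(file = basename(files), text = intro),
            out_csv, row.names = FALSE)
}
```

A call such as extract_folder("D:/All_PDF_Files", "intro.csv") would then produce the .csv file; the output file name here is only a placeholder.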
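[Comments]:
- Take a look at the read_pdf function in the textreadr package; it lets you skip a number of lines before it starts reading the data, which should serve your purpose.
- Using read_pdf: s=read_pdf("D:/All_PDF_Files/first.pdf", skip = 4, remove.empty = TRUE, trim = TRUE) followed by s$text[1:4] returns each line as a separate element rather than as a single string.
- After the step above, wouldn't simply removing the line that contains "Maharastra......" solve the problem?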
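The line-based idea from these comments could be sketched roughly as follows. This is an illustration under assumptions, not something confirmed in the thread: lines_between is a hypothetical helper, and it assumes the wanted block starts at the first line beginning with "This " and ends just before the next line beginning with "1. ". It works on any character vector of lines, for example s$text from read_pdf or strsplit(txt[1], "\n")[[1]] from pdf_text.

```r
# Hypothetical helper: keep the lines from the first one starting with
# start_marker up to (not including) the next line starting with end_marker,
# then join them into a single string. Returns NA if either marker is missing.
lines_between <- function(lines, start_marker = "This ", end_marker = "1. ") {
  lines <- trimws(lines)                       # drop layout padding
  from <- which(startsWith(lines, start_marker))[1]
  if (is.na(from)) return(NA_character_)
  ends <- which(startsWith(lines, end_marker))
  to <- ends[ends > from][1]                   # first end marker after the start
  if (is.na(to)) return(NA_character_)
  paste(lines[from:(to - 1)], collapse = " ")
}
```

Joining with a space addresses the point that the lines come back as separate elements; anchoring on the start marker also makes it unnecessary to count how many header lines to skip.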
Tags: r apply text-extraction qdapregex