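Import multiple .txt files into R and skip to the actual data rows

【Question title】: Import multiple .txt files into R and skip to actual data rows
【Posted】: 2016-07-10 23:04:13
【Question】:

I have 537 .txt files that I need to import into R, either as a list or as separate data frames. I do not want to append any of the data, because keeping everything separate is essential.

I have renamed every file, so the file names are uniform. Each file has a header section containing a lot of miscellaneous information. This header section is 12-16 lines long, depending on the file. For the data, I have 5 to 7 columns. The data are all tab-delimited. The number of columns varies between 5 and 9, and the columns are not always in the same order, so it is important that I can import the column names along with the data (the column names are uniform across files). The file format is as follows: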

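Header

Header... up to 16 lines

((the number of blank lines between the header and the column names varies))

Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4

((no blank line between the column names and the units))

mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units

((1 blank line between the units and the data))

01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6

((data repeats for up to 4000 rows))

To recap what I need: import all files into individual data frames or a list of data frames. Skip the header information down to the row with "Date" (and possibly remove the two rows holding the units and the blank line), leaving me with one row of column names followed by the data.

Here is a rough copy of the code I have been writing. The idea is that, after importing all the files into R, I determine the maximum value of 1-2 columns in each file. Then I export one file with 1 row per file and 2 columns containing the 2 maximum values for that file.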

##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()

##Null list for final data to be extracted to
results <- NULL

##add names to results list (using file name - extension)
results$name <- substr(basename(path),1,nchar(basename(path))-4)

##loop to read in data files and calculate max 
for(i in 1:length(path)){
   ##read files
   files[[i]] <- read.delim(path[[i]],header = FALSE, sep = "\t", skip = 18)

   ##will have to add code:
     ##"if columnx exists do this; if columny exists do this"
   ##convert 2 columns for calculation to numeric 
   x.x <- as.numeric(as.character(files[[i]]$columnx))
   x.y <- as.numeric(as.character(files[[i]]$columny))

   ##will have to add code: 
     ##"if column x exists, do this....if not, "NA"
   ##get max value for 2 specific columns 
   results$max.x[i] <- max(x.x)
   results$max.y[i] <- max(x.y)
}

##add results to data frame 
max <- data.frame(results)

##export to .csv
write.csv(max,file="PATH")

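I now realize that my code simply skips everything down to the data (the max does not occur until very late in the file, so skipping 1 or 2 extra rows does not hurt me), and that it assumes the columns are in the same order in every file. This is terrible practice and gave me bad results for about 5% of my data points, and I want to do this properly. My main concern is getting the data into R in a usable format. After that, I can add other calculations and transformations. I am new to R, and after 2 days of searching I have not found the help I need posted on any forum.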

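【Comments】:

It looks like you are on the right track with your approach above. To deal with the header information, here is my approach (without all the details): read the first 20 lines, header
Thanks, Dave2e. @TJGorrie added your reply to his code, so between the two of you I am much closer than I was!
Glad to hear there is progress. If you run into trouble, post a new question with your updated code and some sample data to get more help. The forum likes to see a sample of the starting point and of the desired end point.

Tags: r import

【Solution 1】:

Assuming the structure of the header follows line\nline\ndata, we can use grep to find the number of the line that contains "mm/dd/yyyy".

Like so: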

system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"

From this we can use strsplit to pull out the number before the first : and use it as the value needed for skip.

as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
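
For readers without a command-line grep (for example on Windows without Rtools), the same line number can be found from inside R. This is a minimal sketch and not part of the original answer: the temporary file and its contents are made up here to mimic the layout described in the question, and the name units_line is hypothetical.

```r
# Hypothetical sample file mimicking the layout in the question
# (header lines, column names, units row, blank line, data).
sample_lines <- c(
  "Header line 1",
  "Header line 2",
  "Header line 3",
  "",
  "Date\tTime\tdataCol1\tdataCol2",
  "mm/dd/yyyy\thh:mm:ss\tunits\tunits",
  "",
  "01/31/2016\t14:32:02\t14.9\t25.3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Read only the top of the file and locate the units row by its fixed text.
header <- readLines(tmp, n = 20)
units_line <- grep("mm/dd/yyyy", header, fixed = TRUE)[1]
units_line
```

Because grep() here works on a character vector that is already in memory, no strsplit step is needed to separate the line number from the matched text.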

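Since there is one blank line between the line we matched and the actual data, we can add 1 to that number and start reading the data from there.

Therefore: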

##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()

##Null list for final data to be extracted to
results <- NULL

##add names to results list (using file name - extension)
results$name <- substr(basename(path),1,nchar(basename(path))-4)

##loop to read in data files and calculate max 
for(i in 1:length(path)){
##read files
# Calculate the number of rows to skip.
# Using Dave2e's suggestion:
header <- readLines(path[[i]], n=20)
skip <- grep("^mm/dd/yy", header)
#Add one to also skip the blank line after the units row
skip <- skip + 1
files[[i]] <- read.delim(path[[i]],
                         header = FALSE,
                         sep = "\t",
                         skip = skip)

 ##will have to add code:
 ##"if columnx exists do this; if columny exists do this"
 ##convert 2 columns for calculation to numeric 
 x.x <- as.numeric(as.character(files[[i]]$columnx))
 x.y <- as.numeric(as.character(files[[i]]$columny))

 ##will have to add code: 
 ##"if column x exists, do this....if not, "NA"
 ##get max value for 2 specific columns 
 results$max.x[i] <- max(x.x)
 results$max.y[i] <- max(x.y)
}

##add results to data frame 
max <- data.frame(results)

##export to .csv
write.csv(max,file="PATH")

I think that covers everything.
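
The loop above reads the data with header = FALSE, so the columns arrive as V1, V2, and so on, and a lookup such as columnx will not find anything. The following is a minimal sketch, not the answerer's code, of one way the column names might be kept. It assumes, as in the question's layout, that the column-name line starts with "Date", is followed directly by the units row, and then by the data. The function names read_data_file and max_or_na, the max_header argument, the sample file, and the column name dataCol9 are all hypothetical.

```r
# Read one file, taking column names from the line that starts with "Date".
# Assumes the header section is at most `max_header` lines long.
read_data_file <- function(file, max_header = 30) {
  top <- readLines(file, n = max_header)
  name_line <- grep("^Date\t", top)[1]
  if (is.na(name_line)) stop("no column-name line found in ", file)
  col_names <- strsplit(top[name_line], "\t", fixed = TRUE)[[1]]
  # Skip everything up to and including the units row; the blank line
  # that follows is dropped by the default blank.lines.skip = TRUE.
  read.delim(file, header = FALSE, sep = "\t", skip = name_line + 1,
             col.names = col_names, stringsAsFactors = FALSE)
}

# Return the maximum of a column, or NA when the file lacks that column.
max_or_na <- function(df, column) {
  if (!column %in% names(df)) return(NA_real_)
  max(as.numeric(df[[column]]), na.rm = TRUE)
}

# Made-up sample file, for demonstration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Header line 1",
  "Header line 2",
  "",
  "Date\tTime\tdataCol1\tdataCol2",
  "mm/dd/yyyy\thh:mm:ss\tunits\tunits",
  "",
  "01/31/2016\t14:32:02\t14.9\t25.3",
  "01/31/2016\t14:32:03\t15.1\t24.8"
), tmp)

dat <- read_data_file(tmp)
max_or_na(dat, "dataCol1")  # column present in the sample file
max_or_na(dat, "dataCol9")  # column absent, so NA is returned
```

Applied to all files, something like `files <- lapply(path, read_data_file)` followed by `sapply(files, max_or_na, column = "dataCol1")` would give one value per file, with NA wherever a file lacks the column. Looking columns up by name rather than by position also sidesteps the varying column order.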

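【Comments】:

I have updated my code, but I get the following error: "Error in system(paste("grep -nr 'mm/dd/yyyy'", path[[i]]), intern = T) : 'grep' not found". Here is my code: ##loop to read in data files for (i in 1:length(path)) { ##read files ##calculate number of rows to skip skip <- as.numeric( strsplit( system(paste("grep -nr \'mm/dd/yyyy\'",path[[i]]), intern = T),":")[[1]][1]) skip <- skip + 1 files[[i]] <- read.delim(path[[i]],header = FALSE,sep = "\t",skip = skip) }
That was my mistake. I have revised my answer to incorporate @dave2e's comment, since that will also do what you want.
My mistake: I assumed you were on Linux. I suspect you are on Windows, in which case system will not recognize grep unless you have Rtools installed.
OK, thanks! Yes, I am using Windows. I did not even think to add that to my question.
I have made some progress, but my boss pulled me onto another project. I will let you know next week how it turns out.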

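【Solution 2】:

I thought I would add this here in case it helps someone else with a similar problem. @TJGorrie's solution helped me solve a slightly different challenge. I had several .rad files that I needed to read in, tag, and merge. The column headers in the .rad files start on a different line in each file, so I needed a way to find the line containing them. Apart from creating the tag column, I did not need to do any extra calculations. Hopefully this helps someone in the future, and thanks to @TJGorrie for the great answer!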

##list files and create list for data.frames
path <- list.files(pattern = "\\.rad$")  # pattern is a regular expression, not a glob
files <- list()

##loop to read in data files 
for(i in 1:length(path)){

# Using Dave2e's suggestion:
header <-readLines(path[[i]], n=20)
skip <- grep("Sample", header)

#Subtract one row to keep the row with "Sample" in it as the header
skip <- skip - 1

files[[i]] <- read.table(path[[i]],
                         header = TRUE,
                         fill = TRUE,
                         skip = skip,
                         stringsAsFactors = FALSE)

# Name each list element after its source file, without the extension.
names(files)[i] <- sub("\\.rad$", "", path[i])

files[[i]] <- na.omit(as.data.frame(files[[i]]))

# Create a new column holding the file name, to act as a tag
# when the data frames get merged through rbind
files[[i]]$Tag <- names(files)[i]
}

# Bind all the data frames in the list into a single data frame,
# once, after the loop has finished
df <- do.call("rbind",
              c(files, make.row.names = FALSE))

##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
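
The binding step only needs to run once, after every file has been read. As a small illustration with made-up data frames (the list names a and b, their contents, and the names tagged and combined are hypothetical and not taken from the .rad files), a named list can be tagged and combined like this:

```r
# Made-up stand-ins for the per-file data frames read in the loop.
files <- list(
  a = data.frame(Sample = c(1, 2), Value = c(0.5, 0.7)),
  b = data.frame(Sample = c(1, 2, 3), Value = c(1.1, 1.3, 1.2))
)

# Add a Tag column holding the list name, so each row keeps its origin.
tagged <- Map(function(d, tag) {
  d$Tag <- tag
  d
}, files, names(files))

# Bind once, after all data frames are ready.
combined <- do.call(rbind, c(tagged, make.row.names = FALSE))
combined
```

Note that rbind expects every data frame to have the same columns, which holds for this sketch but would not hold for the files in the original question, where the set of columns varies.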
