Rearranging the structure of many txt files and then merging them in one data frame

【Question title】: Rearranging the structure of many txt files and then merging them in one data frame
【Posted】: 2015-04-01 20:07:05
【Question】:

Any help with this would be greatly appreciated!

I have about 4.5k txt files that look like this:

Simple statistics using MSPA parameters: 8_3_1_1 on input file: 20130815 104359  875  000000 0528 0548_result.tif

 MSPA-class [color]:  Foreground/data pixels [%]  Frequency
============================================================
    CORE(s) [green]:               --                   0
    CORE(m) [green]:      48.43/13.45                   1
    CORE(l) [green]:               --                   0
      ISLET [brown]:       3.70/ 1.03                  20
 PERFORATION [blue]:       0.00/ 0.00                   0
       EDGE [black]:      30.93/ 8.59                  11
      LOOP [yellow]:       9.66/ 2.68                   6
       BRIDGE [red]:       0.00/ 0.00                   0
    BRANCH [orange]:       7.28/ 2.02                  40
  Background [grey]:       --- /72.22                  11
    Missing [white]:            0.00                    0

I would like to read all the txt files in a directory into R and then rearrange each of them before merging them together.

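The values in the txt files can change, so where there is now 0.00 there may be a meaningful number in other files (so we need those). For fields that now contain --, it would be great if the script could test whether the field holds -- or a number; if it holds --, it should be turned into NA. A real 0.00 value, on the other hand, is meaningful and I need to keep it. The Missing [white] row has only one value, which should be copied into both columns, i.e. foreground % and data pixels %.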
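The rules in the paragraph above could be sketched as a small helper. This is only an illustration under stated assumptions: `parse_pct` and the element names `foreground` and `data_pixels` are made up here, and the sketch assumes a placeholder is always a run of dashes, as in the sample file.

```r
# Hypothetical helper: split one "Foreground/data pixels" field into two numbers.
parse_pct <- function(field) {
  field <- trimws(field)
  parts <- if (grepl("/", field, fixed = TRUE)) {
    # "48.43/13.45" or "--- /72.22": one value per column
    trimws(strsplit(field, "/", fixed = TRUE)[[1]])
  } else {
    # A single value (the Missing row) or a lone "--" applies to both columns
    rep(field, 2)
  }
  # Fields made only of dashes are placeholders, not numbers; a real 0.00 is kept
  parts[grepl("^-+$", parts)] <- NA
  setNames(as.numeric(parts), c("foreground", "data_pixels"))
}

parse_pct("48.43/13.45")
parse_pct("--- /72.22")
parse_pct("0.00")
parse_pct("--")
```

Converting to numeric only after the dashes have been replaced avoids coercion warnings and keeps a true 0.00 distinct from NA.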

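The general rearrangement I need is to have all the data as columns, with only 1 row per txt file. For each data row in the txt file, the output should have 3 columns (foreground %, data pixels % and frequency for each color). The row name should be the image name given at the start of the file, here: 20130815 104359 875 000000 0528 0548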
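As a rough sketch of how the row name could be taken from the first line of a file: the regular expression below is an assumption based on the single sample header shown above (text after "input file:", an optional "_result" suffix, then ".tif"), and the variable names are illustrative.

```r
# First line of one report, copied from the sample above
header <- "Simple statistics using MSPA parameters: 8_3_1_1 on input file: 20130815 104359  875  000000 0528 0548_result.tif"

# Keep what follows "input file:", dropping an optional "_result" suffix and ".tif"
image_name <- sub(".*input file:\\s*(.*?)(_result)?\\.tif$", "\\1", header, perl = TRUE)
image_name
```

The lazy group `(.*?)` is what lets the optional `_result` suffix fall outside the captured name, which is why `perl = TRUE` is used.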

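The rest can be omitted.

The output should look like this (shown as an image in the original post):

I am working on this in parallel but am not sure which direction to take, so any help is very welcome!

Best, Moritz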

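【Comments】:

I don't see how the sample data matches the image you posted
Well, all the data in a file belongs to one case; see the string at the start of the txt file. So instead of a column foreground% and a row core(m), we would for example create a column named foreground_core_m and a row 20130815 104359 875 000000 0528 0548. The value would then be 48.43. The same goes for datapixels_core_m, which would use the same row with the value 13.45. Is this easier to understand? I am not sure how to implement it.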
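The naming scheme described in the comment above might be generated along these lines. `make_names` is a hypothetical helper, and the `frequency_` prefix is an assumption that follows the same pattern as the two names given in the comment.

```r
# Hypothetical helper: turn a class label from the report into three column names.
make_names <- function(label) {
  # Drop the "[color]:" part, e.g. "CORE(m) [green]:" -> "CORE(m)"
  cls <- trimws(sub("\\[.*$", "", label))
  # Lower-case, then replace every run of other characters with "_"
  cls <- gsub("[^a-z0-9]+", "_", tolower(cls))
  # Remove a leading or trailing underscore left over from brackets
  cls <- gsub("^_|_$", "", cls)
  paste(c("foreground", "datapixels", "frequency"), cls, sep = "_")
}

make_names("CORE(m) [green]:")
make_names("Background [grey]:")
```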

Tags: regex r merge dataframe text-files

【Solution 1】:

I think this gets it into the format you want, but the sample does not match your image, so I can't be sure:

(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"

lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
  ## assuming the file name is after "file: " until the end of the string
  ## and ends in .tif
  img_name <- gsub('.*file:\\s+(.*)\\.tif$', '\\1', rl[1])
  ## removes each string up to and including the ===== string
  rl <- rl[-(1:grep('==', rl))]
  ## remove leading whitespace
  rl <- gsub('^\\s+', '', rl)

  ## split the remaining lines by larger chunks of whitespace
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
  ## more cleaning, setting attributes, etc
  mat[mat == '--'] <- NA
  mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
  as.data.frame(mat)
})

I created three files from your sample and made each one slightly different to show how this works on a directory with multiple files:

# [[1]]
#                                        image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]:   PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
#   1 20130815 104359  875  000000 0528 0548_result             <NA>      48.43/13.45             <NA>     3.70/ 1.03          0.00/ 0.00     30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
# 
# [[2]]
#                                        image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]:   PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
#   1 20139341 104359  875  000000 0528 0548_result               23      48.43/13.45               23           <NA>          0.00/ 0.00     30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
# 
# [[3]]
#                                        image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]:  EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
#   1 20132343 104359  875  000000 0528 0548_result             <NA>             <NA>             <NA>           <NA>                <NA>    30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02               <NA>             0.00

Edit

Made some changes to extract all of the information:

(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"

res <- lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
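  ## image name: everything after "file: " up to the .tif extension
  img_name <- gsub('.*file:\\s+(.*)\\.tif$', '\\1', rl[1])
  ## drop the header lines up to and including the ===== line
  rl <- rl[-(1:grep('==', rl))]
  rl <- gsub('^\\s+', '', rl)
  ## split each line into class label, percentages, frequency
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))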
  dat <- as.data.frame(mat, stringsAsFactors = FALSE)
  tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
                      c('Foreground','Data pixels'))
  dat <- cbind(dat[, -2], tmp, image_name = img_name)
  dat[] <- lapply(dat, as.character)
  dat[dat == ''] <- NA
  names(dat)[1:2] <- c('MSPA-class','Frequency')

  zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'MSPA-class')
  names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
  zzz
})

Here is the result (I just converted it to a long matrix so it is easier to read; the real result is a list of very wide data frames, one per file):

`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]                                           
# image_name              "20130815 104359  875  000000 0528 0548_result"
# CORE(s)_Frequency       "0"                                            
# CORE(s)_Foreground      "NA"                                           
# CORE(s)_Data pixels     "NA"                                           
# CORE(m)_Frequency       "1"                                            
# CORE(m)_Foreground      "48.43"                                        
# CORE(m)_Data pixels     "13.45"                                        
# CORE(l)_Frequency       "0"                                            
# CORE(l)_Foreground      "NA"                                           
# CORE(l)_Data pixels     "NA"                                           
# ISLET_Frequency         "20"                                           
# ISLET_Foreground        "3.70"                                         
# ISLET_Data pixels       "1.03"                                         
# PERFORATION_Frequency   "0"                                            
# PERFORATION_Foreground  "0.00"                                         
# PERFORATION_Data pixels "0.00"                                         
# EDGE_Frequency          "11"                                           
# EDGE_Foreground         "30.93"                                        
# EDGE_Data pixels        "8.59"                                         
# LOOP_Frequency          "6"                                            
# LOOP_Foreground         "9.66"                                         
# LOOP_Data pixels        "2.68"                                         
# BRIDGE_Frequency        "0"                                            
# BRIDGE_Foreground       "0.00"                                         
# BRIDGE_Data pixels      "0.00"                                         
# BRANCH_Frequency        "40"                                           
# BRANCH_Foreground       "7.28"                                         
# BRANCH_Data pixels      "2.02"                                         
# Background_Frequency    "11"                                           
# Background_Foreground   "NA"                                           
# Background_Data pixels  "72.22"                                        
# Missing_Frequency       "0"                                            
# Missing_Foreground      "0.00"                                         
# Missing_Data pixels     "0.00"

Using your sample data:

lf <- list.files('~/desktop/data', pattern = '\\.txt$', full.names = TRUE)

`rownames<-`(matrix(res[[1]]), names(res[[1]]))

#                         [,1]                                    
# image_name              "20130815 103704  780  000000 0372 0616"
# CORE(s)_Frequency       "0"                                     
# CORE(s)_Foreground      "NA"                                    
# CORE(s)_Data pixels     "NA"                                    
# CORE(m)_Frequency       "1"                                     
# CORE(m)_Foreground      "54.18"                                 
# CORE(m)_Data pixels     "15.16"                                 
# CORE(l)_Frequency       "0"                                     
# CORE(l)_Foreground      "NA"                                    
# CORE(l)_Data pixels     "NA"                                    
# ISLET_Frequency         "11"                                    
# ISLET_Foreground        "3.14"                                  
# ISLET_Data pixels       "0.88"                                  
# PERFORATION_Frequency   "0"                                     
# PERFORATION_Foreground  "0.00"                                  
# PERFORATION_Data pixels "0.00"                                  
# EDGE_Frequency          "1"                                     
# EDGE_Foreground         "34.82"                                 
# EDGE_Data pixels        "9.75"                                  
# LOOP_Frequency          "1"                                     
# LOOP_Foreground         "4.96"                                  
# LOOP_Data pixels        "1.39"                                  
# BRIDGE_Frequency        "0"                                     
# BRIDGE_Foreground       "0.00"                                  
# BRIDGE_Data pixels      "0.00"                                  
# BRANCH_Frequency        "20"                                    
# BRANCH_Foreground       "2.89"                                  
# BRANCH_Data pixels      "0.81"                                  
# Background_Frequency    "1"                                     
# Background_Foreground   "NA"                                    
# Background_Data pixels  "72.01"                                 
# Missing_Frequency       "0"                                     
# Missing_Foreground      "0.00"                                  
# Missing_Data pixels     "0.00"
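Since `res` is a list with one wide data frame per file, a final merging step is still needed to get a single data frame. The sketch below is one possible way to do it, not part of the original answer. `res_demo` is a stand-in for `res` that reuses a few values from the two outputs above, and it assumes every element has the same column names.

```r
# Stand-in for `res`: two one-row data frames of character columns
res_demo <- list(
  data.frame(image_name = "20130815 104359  875  000000 0528 0548_result",
             `CORE(s)_Foreground` = NA_character_,
             `EDGE_Frequency` = "11", `EDGE_Foreground` = "30.93",
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(image_name = "20130815 103704  780  000000 0372 0616",
             `CORE(s)_Foreground` = NA_character_,
             `EDGE_Frequency` = "1", `EDGE_Foreground` = "34.82",
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Stack the per-file rows into one data frame
all_files <- do.call(rbind, res_demo)

# Use the image name as the row name, then convert the remaining columns to numbers
rownames(all_files) <- all_files$image_name
all_files$image_name <- NULL
all_files[] <- lapply(all_files, as.numeric)

str(all_files)
```

With roughly 4.5k files, a single `do.call(rbind, ...)` at the end is usually preferable to growing a data frame inside a loop.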

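【Comments】:

Thank you very much! I will look into it. In the meantime I have commented on how I think the data could be put into the format I need (where you commented on the original question)
Hmm, unfortunately not. I get: Error in strsplit(dat$V2, "[-\\/\\s]+", perl = TRUE) : non-character argument. Here is a link to some data files, which is probably the easiest way to sort this out. Thanks! dropbox.com/sh/f7trh2148otwnrq/AACMt46DSQkIkJ1RTVU6zG8ja?dl=0
Fixed the last few lines so the names are closer to what you want. Maybe dat contains factors? Check str(dat); that is what the line dat[] <- lapply(dat, as.character) is for
The code seems to work for me. I used lf <- list.files('~/desktop/data', pattern = '.txt', full.names = TRUE) to get the file paths and ran the new lapply function
@MoritzSchmid see the edit, just run it on your data files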
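The "non-character argument" error discussed in these comments can be reproduced in isolation. The snippet below is a minimal sketch with made-up variable names; it assumes the column had been read as a factor, which is the cause suggested in the comments.

```r
# A factor column, as as.data.frame() would produce with stringsAsFactors = TRUE
v2 <- factor(c("48.43/13.45", "--"))

# strsplit() refuses factors; capture the error message instead of stopping
err <- tryCatch(strsplit(v2, "/"), error = function(e) conditionMessage(e))
err

# Converting to character first avoids the error
parts <- strsplit(as.character(v2), "/")
parts
```

Passing `stringsAsFactors = FALSE` to `as.data.frame()`, as the edited answer does, prevents the factor from being created in the first place.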

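【Solution 2】:

I copied and pasted your data into a text file and adjusted the spacing to keep it consistent. You may want to do the same, or, if you can attach the text files, they would be easier to work with. You can use pastebin - http://en.wikipedia.org/wiki/Pastebin

First set your working directory as follows:

setwd("path of your file")

# EDIT: create a single data frame from all files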

split.row.data <- function(x){

  a1 = sub("( )+(.*)", '\\2', x)

  b1 = unlist(strsplit(sub("( )+(.*)", '\\2', (strsplit(a1, ":"))[[1]][2]), "  "))

  c1 = unlist(strsplit(b1[1], "/"))

  if(length(c1) == 1){
    if(which(b1[1:2] %in% "") == 1){
      c1 = c(NA, c1)
    }else if(which(b1[1:2] %in% "") == 2){
      c1 = c(c1, NA)
    }
  }

  c1[which(c1 %in% c("--", "--- "))] <- NA

  return(c(unlist(strsplit(strsplit(a1, ":")[[1]][1], " ")), 
           c1,
           b1[length(b1)]))
}

df2 <- data.frame(matrix(nrow = 1, ncol = 6), stringsAsFactors = FALSE)

file_list = list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE)

for (infile in file_list){

  file_data <- readLines(con <- file(infile))

  close(con)

  filename = sub("(.*)(input file:)(.*)(\\.tif)", "\\3", file_data[3])

  a2 <- file_data[7:length(file_data)]

  d1 = lapply(a2, function(x) split.row.data(x))

  df1 <- data.frame(matrix(nrow= length(d1), ncol = 5), stringsAsFactors = FALSE)

  for(i in seq_along(d1)){df1[i, ] <- d1[[i]]}

  df1 <- cbind(data.frame(rep(filename, nrow(df1)), stringsAsFactors = FALSE), 
               df1)

  colnames(df1) <- colnames(df2)

  df2 <- rbind(df2, df1)

}

df2 <- df2[2:nrow(df2), ]

df2[,4] <- as.numeric(df2[,4])

df2[,5] <- as.numeric(df2[,5])

df2[,6] <- as.numeric(df2[,6])

e1 = unlist(lapply(df2[,3], function(x) gsub(']', '', x)))

df2[,3] = unlist(lapply(e1, function(x) gsub("[[]", '', x)))

header_names <- unlist(lapply(strsplit(file_data[5], "/"), function(x) strsplit(x, "  ")))

colnames(df2) <- c("filename",  
                   strsplit(header_names[1], " ")[[1]][2],
                   "color", 
                   header_names[2:length(header_names)])

row.names(df2) <- 1:nrow(df2)

Output:

print(head(df2))

    filename                               MSPA-class  color  Foreground  data pixels [%] Frequency
 1  20130815 103739  599  000000 0944 0788     CORE(s) green         NA         NA         0
 2  20130815 103739  599  000000 0944 0788     CORE(m) green      63.46         17.41      1
 3  20130815 103739  599  000000 0944 0788     CORE(l) green         NA         NA         0
 4  20130815 103739  599  000000 0944 0788       ISLET brown       0.00         0.00       0
 5  20130815 103739  599  000000 0944 0788 PERFORATION  blue       0.00         0.00       0
 6  20130815 103739  599  000000 0944 0788        EDGE black      35.00         9.60       1

# get only the "Background" rows from the "MSPA-class" column

df2_background  <- df2[which(df2[, "MSPA-class"] %in% "Background"), ]

print(df2_background)

    filename                               MSPA-class  color  Foreground  data pixels [%] Frequency
11  20130815 103739  599  000000 0944 0788 Background  grey         NA    72.57           1
22  20130815 143233  712  000000 1048 0520 Background  grey         NA    77.51           1
33  20130902 163929  019  000000 0394 0290 Background  grey         NA    54.55           6
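The long table above has one row per file and class, while the question asks for one row per file. One possible way to pivot it with base R's `reshape()` is sketched below. The small `long` data frame is a stand-in built from a few of the values printed above, and its column names are simplified assumptions rather than the names produced by the code in this answer.

```r
# Stand-in for the long table: three classes for one file, one class for another
long <- data.frame(
  filename    = c(rep("20130815 103739  599  000000 0944 0788", 3),
                  "20130815 143233  712  000000 1048 0520"),
  class       = c("CORE(m)", "EDGE", "Background", "Background"),
  foreground  = c(63.46, 35.00, NA, NA),
  data_pixels = c(17.41, 9.60, 72.57, 77.51),
  frequency   = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# One row per file; every value column is repeated once per class
wide <- reshape(long, direction = "wide",
                idvar = "filename", timevar = "class", sep = "_")
rownames(wide) <- wide$filename

names(wide)
```

Classes that are absent for a file (here CORE(m) and EDGE for the second file) come out as NA in the wide result.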

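【Comments】:

Awesome! Thank you very much. I was able to reproduce it nicely here. I am now working on splitting the data into the foreground and data pixels columns as described above, and we can actually drop the color column. Any idea how to get the data frame into a format where I have only 1 row per image name (that information currently gets lost in your code) and the rows we have now become columns?
Also, the part where you define the structure makes it work only for this one file. Could you help change it so that part works inside a for loop over many files? Here are 3 sample files on pastebin! pastebin.com/embed_js.php?i=5gg9WNaN
You can use na.strings = '--' in read.table and skip that step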
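The `na.strings` suggestion in the last comment could look roughly like this. Because the class labels contain single spaces, this sketch first turns every run of two or more spaces into a tab; that preprocessing step and the column names are assumptions made for illustration, not something stated in the comment.

```r
# Two data lines copied from the sample report
raw <- c("    CORE(s) [green]:               --                   0",
         "    CORE(m) [green]:      48.43/13.45                   1")

# Runs of 2+ spaces become tabs, so each line has exactly three fields
tabbed <- gsub("\\s{2,}", "\t", trimws(raw))

# "--" is read directly as NA
tab <- read.table(text = tabbed, sep = "\t", na.strings = "--",
                  col.names = c("class", "pct", "frequency"),
                  stringsAsFactors = FALSE)
tab
```

Note that `na.strings = "--"` only matches a field that is exactly `--`; the `--- /72.22` value of the Background row would still need separate handling.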