R - subscript out of bounds with stringr and grep in lapply function: answers

【Question title】: R - subscript out of bounds with stringr and grep in lapply function
【Posted】: 2022-04-24 16:12:48
【Question description】:

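My goal is to extract a fixed-length (8-digit) numeric string, located after a specific pattern, from multiple text files in multiple folders.

I spent a whole day trying to build an lapply function so that all files in a subdirectory (up to 20) are processed automatically, but I failed. The workflow code runs, but because of my limited knowledge of R it only works for a single file.

Between the lines of numbers, each file contains one string, different in every file, that I want to extract. The extracted strings should be stored per folder.

The string has the following structure: String[one or two digits]_[eight digits], for example String1_20220101 or String12_20220108. I want to extract the part after the underscore.

The text files are structured as shown below, and each file has more than 10000 lines.

Example of file 1: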

     X1  X2
1 1000 100
2 1050 100
3 1100 100
4 1150 100
5 1200 100
6 String1_20220101
7 1250 100
8 1300 100
9 1350 100
10 1400 100

x1 <- list(c(seq(1000,1400, by=50)))
[1] 1000 1050 1100 1150 1200 1250 1300 1350 1400

x2 <- list(c(rep(100, 9)))
[1] 100 100 100 100 100 100 100 100 100

File 2:

   x1     x2
1 2000  200
2 3000  200
3 4000  200
4 5000  200
5 6000  200
6 7000  200
7 String12_20220108
8 8000  200
9 9000  200
10 10000 200


x1 <- list(c(seq(1000,10000,by=1000)))
[1]  1000  2000  3000  4000  5000  6000  7000  8000  9000 10000

x2 <- list(c(rep(200, 9)))
[1] 200 200 200 200 200 200 200 200 200

The files are located in numbered folders; they take their names from the folder number and belong to one observation.

My code for folder 1:

library(stringr)

Folderno1 <- list.files(path = "path/to/file/1/",
pattern = "*.txt",
full.names = TRUE)

FUN <- function(Folder1) {
folder_input <- readLines(Folderno1)
string <- grep("String[0-9]_", folder_input, value = TRUE)
output <- capture.output(as.numeric(str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]]))
write(output, file="/pathtofile/String1.tex")
}

lapply(Folderno1, FUN)

Error in str_extract_all(string, "(?<=[0-9]{1,2}_)[0-9]+")[[1]] : 
subscript out of bounds

The error message above appears. Despite the error, the file String1.tex is written, but it contains only one result:

[1] 20220101

Re-running with debug shows:

function (x) 
.Internal(withVisible(x))

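Could you please guide me on how to change the workflow so that every file is processed? I cannot wrap my head around it.

Thank you.

【Question discussion】:

Tags: r lapply stringr subscript

【Solution 1】: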

You are overwriting the same file on every call of the function (write(output, file="/pathtofile/String1.tex")). You probably want to create a new .tex file for each .txt file.
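
As a minimal sketch of that idea, an output path can be derived from each input path so that no two inputs share a target. The helper name `output_path_for`, the directory `results`, and the file names `a.txt` and `b.txt` are hypothetical and not part of the original answer.

```r
# Hypothetical helper: build one .tex path per input .txt file.
output_path_for <- function(input_path, out_dir) {
  # Drop the directory and the extension, e.g. "path/1/a.txt" -> "a"
  stem <- tools::file_path_sans_ext(basename(input_path))
  file.path(out_dir, paste0(stem, ".tex"))
}

# Made-up input paths, used only for illustration
inputs <- c("path/to/file/1/a.txt", "path/to/file/1/b.txt")
outputs <- vapply(inputs, output_path_for, character(1),
                  out_dir = "results", USE.NAMES = FALSE)
print(outputs)
```

Naming the output after the input file is only one option; the solution further down names it after the matched "StringN" prefix instead.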

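Based on the error message, I think some files do not contain the pattern we are looking for (String[0-9]_). String[0-9]_ does not match two-digit cases such as String12_20220108, so I have changed it to String[0-9]+_. To be safe, I have also added an if condition that checks the length of the output.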
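
The following base R sketch illustrates both points on made-up input lines. As far as I can tell, str_extract_all returns a list with one element per input string, so an empty input gives an empty list, and taking `[[1]]` of an empty list is what raises "subscript out of bounds". The vector `file_lines` is an assumption made for illustration.

```r
# Made-up file content with a two-digit "String" line
file_lines <- c("1200 100", "String12_20220108", "1250 100")

# The original pattern allows exactly one digit before the underscore
one_digit <- grep("String[0-9]_", file_lines, value = TRUE)
# The corrected pattern allows one or more digits
many_digits <- grep("String[0-9]+_", file_lines, value = TRUE)

print(length(one_digit))  # number of lines matched by the original pattern
print(many_digits)        # lines matched by the corrected pattern

# Taking the first element of an empty list raises an error;
# tryCatch captures its message instead of stopping the script
msg <- tryCatch(list()[[1]], error = function(e) conditionMessage(e))
print(msg)

# Guarding on length() skips inputs without a match instead of failing
status <- if (length(one_digit)) "match found" else "no match, skipped"
print(status)
```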

Try this solution -

Folderno1 <- list.files(path = "path/to/file/1/",
                        pattern = "\\.txt$",
                        full.names = TRUE)

FUN <- function(Folder1) {
  #Read the file
  folder_input <- readLines(Folder1)
  #Extract the line which has "String" in it
  string <- grep("String[0-9]+_", folder_input, value = TRUE)
  #If such line exists
  if(length(string)) {
    #Remove everything till underscore to get 8-digit number
    output <- sub('.*_', '', string)
    #Remove everything after underscore to get "String1", "String12"
    out <- sub('_.*', '', string)
    #Write the output
    write(output, file= paste0('/pathtofile/', out, '.tex'))
  }
}

lapply(Folderno1, FUN)
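
The question also asks for several numbered folders to be processed, with results stored per folder. A hedged sketch of one way to extend the approach is shown here. It builds a throwaway directory tree under tempdir() so it can be run as is. The helper `extract_code`, the folder layout, and the output name `codes.tex` are all assumptions, not part of the original answer.

```r
# Hypothetical helper: return the 8-digit code from one file, or NA if absent.
extract_code <- function(path) {
  file_lines <- readLines(path)
  hit <- grep("String[0-9]+_[0-9]{8}", file_lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Keep only the eight digits after the underscore (first matching line)
  sub(".*String[0-9]+_([0-9]{8}).*", "\\1", hit[1])
}

# Build a throwaway layout: <root>/1/a.txt and <root>/2/b.txt
root <- file.path(tempdir(), "string_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "1"), recursive = TRUE)
dir.create(file.path(root, "2"), recursive = TRUE)
writeLines(c("1000 100", "String1_20220101", "1250 100"),
           file.path(root, "1", "a.txt"))
writeLines(c("2000 200", "String12_20220108", "8000 200"),
           file.path(root, "2", "b.txt"))

# Process every numbered folder and write one result file per folder
folders <- list.dirs(root, recursive = FALSE)
results <- lapply(folders, function(folder) {
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  codes <- vapply(files, extract_code, character(1), USE.NAMES = FALSE)
  codes <- codes[!is.na(codes)]  # drop files without a "String" line
  writeLines(codes, file.path(folder, "codes.tex"))
  codes
})
names(results) <- basename(folders)
print(results)
```

Collecting all codes of a folder into a single file avoids the overwrite problem and keeps the per-folder grouping; writing one file per input would work equally well.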

【Comments】:

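Thanks for your help! It works as expected. Unfortunately, because I was tired last night, I forgot to mention that the eight digits are followed by a closing parenthesis, then characters and a digit at the end of the line, e.g. String1_20220101) Abcdefgh abcdeDIGIT. Could something like \\( be used? Thanks again.
How about replacing the output line with output <- sub('.*_(\\d+).*', '\\1', string)? I tried it on sub('.*_(\\d+).*', '\\1', 'String1_20220101) Abcdefgh abcde') and I think it works.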
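
A small sketch of the capture-group replacement from the comment above, applied to lines with trailing text. The trailing parts of the sample strings are made up for illustration. The stricter variant, which requires exactly eight digits followed by a literal closing parenthesis, is an assumption about what the asker may want, not something the thread confirms.

```r
# Made-up lines with text after the eight digits
x <- c("String1_20220101) Abcdefgh abcde5",
       "String12_20220108) Xyz xyz7")

# Capture the digits right after the underscore and drop everything else
codes <- sub(".*_(\\d+).*", "\\1", x)
print(codes)

# Stricter variant: exactly eight digits followed by a literal ")"
codes_strict <- sub(".*_(\\d{8})\\).*", "\\1", x)
print(codes_strict)
```

Note that sub returns its input unchanged when the pattern does not match, so the stricter variant leaves lines without a closing parenthesis untouched.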