Combining lines in a character vector in R: answers

【Question title】: Combining lines in character vector in R
【Posted】: 2015-10-07 15:01:11
【Question description】:

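I have a character vector (content) of about 50,000 lines in R. However, some of the lines read in from the text file ended up on separate lines when they should not have. Specifically, the lines look something like this: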

[1] hello,
[2] world
[3] ""
[4] how
[5] are 
[6] you
[7] ""

I would like to combine these lines so that I end up with something like this:

[1] hello, world
[2] how are you

I tried writing a for loop:

for(i in 1:length(content)){
    if(content[i+1] != ""){
        content[i+1] <- c(content[i], content[i+1])
    }
}

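But when I run the loop, I get an error: missing value where TRUE/FALSE needed.

Can anyone suggest a better way to do this, perhaps without using a loop at all?

Thanks!

EDIT: I am actually trying to apply this to a corpus in which each document has thousands of lines. Any ideas on how to turn these solutions into a function that can be applied to the content of each document?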

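【Question comments】:

You get the error because content[i+1] is missing.
@Heroka, can you explain a bit more?
You are iterating over the length of content and then accessing content at position length + 1. That produces a missing value (NA). But this approach will not easily give you the output you want anyway; I am looking for an answer for you.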
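
The comment above can be checked with a small sketch. The three-element `content` vector here is a made-up stand-in for the real 50,000-line data, and the `tryCatch` wrapper is only there to capture the error message instead of stopping the script.

```r
# Hypothetical miniature of the asker's vector
content <- c("hello,", "world", "")
n <- length(content)

# Indexing one position past the end does not fail; it returns NA
past_end <- content[n + 1]

# Comparing NA with "" yields NA rather than TRUE or FALSE
comparison <- past_end != ""

# if() cannot branch on NA, which is what triggers the reported error
err <- tryCatch(
  if (comparison) "ok",
  error = function(e) conditionMessage(e)
)
print(err)
```

Looping over `seq_len(n - 1)` would avoid the out-of-range access, but as the comment notes, the loop body still would not build the desired merged lines.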

Tags: regex r text

【Solution 1】:

You don't need a loop to do this:

x <- c("hello,", "world", "", "how", "\nare", "you", "")

dummy <- paste(
  c("\n", sample(letters, 20, replace = TRUE), "\n"), 
  collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end

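【Comments】:

Nice one. A small thing: your solution produces sentences with a leading space. Does this scale to 50,000 lines?
Can we be sure that no string contains \n?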
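
Both concerns raised in the comments can be addressed in a variant of the same idea. This is only a sketch under assumptions: the marker string `<<SPLIT>>` is an arbitrary choice made for illustration, the code checks explicitly that it does not occur in the data, and `fixed = TRUE` makes `strsplit()` treat it as a literal string rather than a regular expression.

```r
x <- c("hello,", "world", "", "how", "\nare", "you", "")

# Hypothetical split marker; any string absent from the data would do
marker <- "<<SPLIT>>"
stopifnot(!any(grepl(marker, x, fixed = TRUE)))

x2 <- x
x2[x2 == ""] <- marker                      # empty strings become markers
joined <- paste(x2, collapse = " ")         # one long string
pieces <- strsplit(joined, marker, fixed = TRUE)[[1]]

res <- trimws(pieces)                       # strip leading/trailing whitespace
res <- res[nzchar(res)]                     # drop pieces that were only whitespace
print(res)
```

A newline inside an element, such as `"\nare"`, is kept as-is because the split no longer depends on `\n`.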

【Solution 2】:

I think there are more elegant solutions, but this might work for you:

chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")

#split vector (and filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])

#paste the groups together
res <- sapply(splitted,paste, collapse=" ")

#remove names(if necessary, probably not)
res <- unname(res) #thanks @Roland

> res
[1] "hello, world" "how are you"

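【Comments】:

I was going to suggest this too. The last step could also use unname.
@Roland Didn't know that one, thanks. Edited the answer.
Internally it does basically what you were doing. It is just more convenient and more readable.
@Heroka - I edited the question above, but any idea how to apply this to a corpus of text files?
@Heroka I used lapply on the corpus after manipulating it a bit. Thanks!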
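
The last two comments describe wrapping this answer in a function and calling it with lapply. The asker's actual code is not shown, so the following is only a sketch of what that could look like: `combine_lines` and the `docs` list are hypothetical names, and the corpus is assumed to be a plain list of character vectors with no NA entries.

```r
# Hypothetical helper: merge consecutive non-empty lines, using "" as separator
combine_lines <- function(lines) {
  keep <- lines != ""                 # non-empty lines only
  ids <- cumsum(lines == "")          # group id grows at every empty line
  groups <- split(lines[keep], ids[keep])
  unname(vapply(groups, paste, character(1), collapse = " "))
}

# Hypothetical corpus: one character vector per document
docs <- list(
  doc1 = c("hello,", "world", "", "how", "are", "you", ""),
  doc2 = c("more", "text", "")
)

result <- lapply(docs, combine_lines)
print(result)
```

`vapply` is used instead of `sapply` so the result is always a character vector, even for a document that contains only empty lines.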

【Solution 3】:

Here is another approach using data.table, which may be faster than for or *apply loops:

library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"

Sample data:

x <- c("hello,", "world", "", "how", "are", "you", "")
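
To see what the grouping by `rleid(x == "")` does, the same run-length idea can be sketched in base R with `rle()`. This is an illustration of the grouping logic only, not the data.table implementation; `run_id` is a name introduced here.

```r
x <- c("hello,", "world", "", "how", "are", "you", "")

# Run-length encode the "is empty" flag: each run of equal values is one group
runs <- rle(x == "")
run_id <- rep(seq_along(runs$lengths), runs$lengths)
print(run_id)

# Paste each run together, then keep only the runs of non-empty strings
pasted <- tapply(x, run_id, paste, collapse = " ")
res <- as.vector(pasted[!runs$values])
print(res)
```

Selecting with `!runs$values` rather than comparing against `""` also copes with several consecutive empty lines, which would otherwise paste into a string of spaces.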

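【Comments】:

【Solution 4】:

Replace the "" with something you can split on later, collapse the strings together, and then use strsplit(). Here I used a newline because, if you simply paste the result, you get separate lines in the output; for example, cat(txt3) prints each phrase on its own line.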

txt <-  c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"

【Comments】:

【Solution 5】:

Another option to add:

tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
#             1              2              3 
#"hello, world"  "how are you"    "more text"

This groups the non-empty strings and pastes the elements together by group.
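
The one-liner can be unpacked into its intermediate steps. The sketch below assumes the ten-element vector from Solution 4 (without the comma after "hello"); `keep` and `group` are names introduced here for illustration.

```r
x <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")

keep <- x != ""                       # TRUE for the strings to be pasted
# cumsum() counts the empty strings seen so far, so every non-empty string
# gets the number of separators before it; + 1 makes the ids start at 1
group <- cumsum(x == "")[keep] + 1
print(group)

# Paste the kept strings within each group
res <- tapply(x[keep], group, paste, collapse = " ")
print(as.vector(res))
```

`tapply()` returns a named array, which is why the original output shows the group numbers as a header row; `as.vector()` drops them.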

【Comments】: