Separate text to variables in R: answers

【Question title】: Separate text to variables in R
【Posted】: 2014-04-17 16:52:49
【Question description】:

I have this in one column of a table:

paragemcard-resp+insufcardioresp
dpco+pneumonia
posopperfulceragastrica+ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb+insuf.resp
dpco+dhca+#femur
posde#subtroncantГ©ricaesqВЄ+complicepidural
dpco+asma

I would like to separate them like this:

paragemcard-resp                            insufcardioresp
dpco                                        pneumonia
posopperfulceragastrica                     ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb                        insuf.resp
dpco                                        dhca                   #femur
posde#subtroncantГ©ricaesqВЄ                complicepidural
dpco                                        asma

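The problem is that the rows do not all have the same number of fields. As you can see, row 3 has 2 variables and row 6 has 3.

I want to create these strings in the same column for further analysis.

Thanks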

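【Comments on the question】:

Have you tried strsplit()?
This is nothing more than a CSV file whose delimiter is the "+" character (instead of the usual ",", tab, or ";"). Just use read.csv/read.table as @AnanandaMahto says; it is even simpler than strsplit.
@smci, yes and no. One limitation of the direct read.table approach is that, in this case, the data are unbalanced and the row with the largest number of fields is not within the first five rows.
Ah. Couldn't you prepend a header or a dummy row 1 that has the maximum number of variables? E.g. var1+var2+var3... ?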

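Tags: r csv separator strsplit

【Solution 1】:

You can use read.table, but you should first use count.fields or some regular expression to determine the correct number of columns. Using Robert's sample "text" data: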

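## Largest number of "+" separators in any row, plus 1, gives the number of columns
Cols <- max(sapply(gregexpr("+", text, fixed = TRUE), length)) + 1
## Alternative (comment.char = "" so that "#" in the data is not treated as a comment):
## Cols <- max(count.fields(textConnection(text), sep = "+", comment.char = ""))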

read.table(text = text, comment.char="", header = FALSE, 
           col.names=paste0("V", sequence(Cols)), 
           fill = TRUE, sep = "+")
#                                        V1              V2     V3
# 1                        paragemcard-resp insufcardioresp       
# 2                                    dpco       pneumonia       
# 3                 posopperfulceragastrica            ards       
# 4 pos op hematoma #rim direito expontanea                       
# 5                    miopatiaduchenne-erb      insuf.resp       
# 6                                    dpco            dhca #femur
# 7            posde#subtroncantГ©ricaesqВЄ complicepidural       
# 8                                    dpco            asma
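
A small caveat on the gregexpr count, sketched on made-up strings (the vector x below is hypothetical, not the OP's data): for a row without any "+", gregexpr returns -1, which still has length 1, so that row is counted as if it had one separator. This does not affect the max() above as long as at least one row contains a "+", but if exact per-row counts are needed, going through regmatches is one way to get them:

```r
# Hypothetical sample: 2 separators, 0 separators, 1 separator
x <- c("a+b+c", "abc", "a+b")

m <- gregexpr("+", x, fixed = TRUE)

# Length of the raw gregexpr result: a no-match row still has length 1
naive_counts <- sapply(m, length)

# regmatches() returns character(0) for a no-match row, so lengths() gives 0
exact_counts <- lengths(regmatches(x, m))

print(naive_counts)
print(exact_counts)

# Number of columns needed to hold the widest row
Cols <- max(exact_counts) + 1
print(Cols)
```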

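Also possibly useful: the "stringi" library makes counting elements easy (as an alternative to the gregexpr step above).

library(stringi)
Cols <- max(stri_count_fixed(text, "+")) + 1

Why is the "Cols" step needed? read.table and family decide how many columns to use from either (1) the maximum number of fields detected in the first 5 rows of data or (2) the length of the col.names argument. In your sample rows, the row with the most fields is the sixth, so using read.csv or read.table directly would wrap the data incorrectly.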
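
To make the wrapping concrete, here is a minimal sketch. It assumes a shortened, ASCII-only copy of the sample rows (the vector name rows is mine, not from the question), with the three-field row still in sixth position:

```r
# Shortened copy of the sample data; the widest row (3 fields) is row 6
rows <- c("paragemcard-resp+insufcardioresp",
          "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp",
          "dpco+dhca+#femur",
          "dpco+asma")

# Without col.names: the column count is guessed from the first 5 rows (2 here),
# so the third field of row 6 spills over into an extra row
naive <- read.table(text = rows, sep = "+", header = FALSE, fill = TRUE,
                    comment.char = "", stringsAsFactors = FALSE)

# With col.names sized to the widest row, every input line stays on one row
Cols <- max(count.fields(textConnection(rows), sep = "+", comment.char = ""))
fixed <- read.table(text = rows, sep = "+", header = FALSE, fill = TRUE,
                    comment.char = "", stringsAsFactors = FALSE,
                    col.names = paste0("V", seq_len(Cols)))

print(dim(naive))
print(dim(fixed))
```

Comparing nrow(naive) with length(rows) is a cheap check for this kind of silent wrapping.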

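【Comments】:

Couldn't you just prepend a header or a dummy row 1 that has the maximum number of variables? E.g. var1+var2+var3... ?
@smci, sure... if you know how many variables you need in the first place. With ragged data like this, it can be very frustrating to think you have read the data in correctly, only to look at the last few rows and find things completely out of whack because somewhere in the file a few rows have more values than the others.
Trivial: you can use awk to find the line with the maximum count of '+'. Or egrep '(.*\+){5}(.*)', manually decrementing the number 5 from, say, 9. If this is a quick hack, not everything has to be programmatic.
@smci, not necessarily trivial if someone doesn't recognize that they are just dealing with a csv in disguise ;-)
Yes, that's why we're helping the OP... ;-) R's "CSV file" is a serious misnomer; it should really be called "delimited record format", because that is what it is. Also, the delimiter should be a regex, to allow multiple delimiters, e.g. '[,;\t]'. I can feel a more general read.records(...) coming, where sep can be a regex... Any other generalizations needed? Maybe hinting and inferring colClasses? Why don't we generally use a simple wc -l on the file to estimate nrows? All for time and memory efficiency. And for goodness' sake, do an optional gc(reset=TRUE) afterwards.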
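
For completeness, the dummy-header idea from the comments can be sketched like this. It assumes you already know that the widest row has three fields; the names var1, var2, var3 and the shortened ASCII-only sample rows are illustrative, not from the question:

```r
# Shortened copy of the sample data; the widest row (3 fields) is row 6
rows <- c("paragemcard-resp+insufcardioresp",
          "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp",
          "dpco+dhca+#femur",
          "dpco+asma")

# Prepend a header line that has as many fields as the widest row
header_line <- "var1+var2+var3"

df <- read.table(text = c(header_line, rows), sep = "+", header = TRUE,
                 fill = TRUE, comment.char = "", stringsAsFactors = FALSE)

print(names(df))
print(dim(df))
```

This only works if the header really is at least as wide as the widest row, which is the point made in the reply above: with ragged data you usually have to count first.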

【Solution 2】:

You can use strsplit:

text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia", "posopperfulceragastrica+ards", "pos op hematoma #rim direito expontanea", "miopatiaduchenne-erb+insuf.resp", "dpco+dhca+#femur", "posde#subtroncantГ©ricaesqВЄ+complicepidural", "dpco+asma")

strings <- strsplit(text, "+", fixed = TRUE)
maxlen <- max(sapply(strings, length))
strings <- lapply(strings, function(s) { length(s) <- maxlen; s })
strings <- data.frame(matrix(unlist(strings), ncol = maxlen, byrow = TRUE))

which looks like

                                          X1              X2     X3
   1                        paragemcard-resp insufcardioresp   <NA>
   2                                    dpco       pneumonia   <NA>
   3                 posopperfulceragastrica            ards   <NA>
   4 pos op hematoma #rim direito expontanea            <NA>   <NA>
   5                    miopatiaduchenne-erb      insuf.resp   <NA>
   6                                    dpco            dhca #femur
   7            posde#subtroncantГ©ricaesqВЄ complicepidural   <NA>
   8                                    dpco            asma   <NA>
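
A slightly more compact variant of the same padding idea, as a sketch rather than a replacement: indexing a character vector past its end yields NA, so each piece can be stretched to maxlen with `[` and the pieces stacked with rbind. The shortened ASCII-only sample and the names rows, pieces and padded are mine:

```r
# Shortened copy of the sample data
rows <- c("paragemcard-resp+insufcardioresp",
          "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp",
          "dpco+dhca+#femur",
          "dpco+asma")

pieces <- strsplit(rows, "+", fixed = TRUE)
maxlen <- max(lengths(pieces))

# s[1:maxlen] pads short vectors with NA; rbind stacks them into a
# character matrix with one row per input string
padded <- do.call(rbind, lapply(pieces, `[`, seq_len(maxlen)))
result <- as.data.frame(padded, stringsAsFactors = FALSE)

print(result)
```

One thing to keep in mind with any strsplit-based approach: strsplit drops a trailing empty field, so a string ending in "+" yields one piece fewer than read.table would report.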

【Comments】: