【Posted】: 2015-03-11 20:25:48
【Problem description】:
I have a data frame with a character column that holds email metadata as multiple strings separated by the newline character \n:
person myString
1 John To name5@email.com by sender6 on 01-12-2014\n
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n
3 Tim To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n
I want to split the substrings of myString into separate columns, so that the result looks like this:
person email1 email2 email3
1 John To name5@email.com by sender6 on 01-12-2014 <NA> <NA>
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
3 Tim To name2@email.com by sender2 on 05-11-2014 To name@email.com by sender2 on 06-03-2015 <NA>
My current approach uses separate from the tidyr package:
library(dplyr)
library(tidyr)
# split myString on newlines into three columns named email1, email2, email3
res1 <- df %>%
  separate(col = myString, into = paste0("email", 1:3), sep = "\\n", extra = "drop")
res1[res1 == ""] <- NA
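With this approach, however, I have to specify by hand that three columns should be extracted.
I would like to improve this process in one or both of the following ways:
- A way to automatically compute the maximum number of delimiter occurrences (i.e. how many new variables are needed)
- Other ways to split into an unknown number of columns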
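To make the first idea concrete, here is a minimal sketch of counting the delimiters up front and passing the result to separate. It assumes every record ends with "\n", so the largest newline count equals the number of columns needed. The data frame below is a shortened stand-in for the sample data, and `n_cols` and `res` are names made up for illustration.

```r
library(tidyr)

# Shortened stand-in for the sample data: one string per person,
# each record terminated by "\n" (assumption: same shape as the real data).
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n"),
  stringsAsFactors = FALSE
)

# Count the newline delimiters in each string and take the maximum.
newline_hits <- regmatches(df$myString, gregexpr("\n", df$myString, fixed = TRUE))
n_cols <- max(lengths(newline_hits))

# Build the column names from the count instead of hard-coding 3.
res <- separate(df, col = myString,
                into = paste0("email", seq_len(n_cols)),
                sep = "\n", extra = "drop", fill = "right")

# The trailing "\n" leaves empty strings behind; turn them into NA.
res[!is.na(res) & res == ""] <- NA
```

Here `fill = "right"` pads short rows with NA without a warning, and `extra = "drop"` discards the empty piece after the final newline in the longest row.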
A good solution that returns the data in long format rather than wide format would also be great.
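For the long-format variant, one possible sketch in base R is to split each string with strsplit and repeat the person once per piece. This is only an illustration under the same assumption as before (records separated by "\n"); the data frame is a shortened stand-in for the sample data, and `parts` and `long` are hypothetical names.

```r
# Shortened stand-in for the sample data (assumption: same shape as the real data).
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n"),
  stringsAsFactors = FALSE
)

# strsplit() does not return an empty piece for a delimiter at the end
# of the string, so the trailing "\n" needs no special handling.
parts <- strsplit(df$myString, "\n", fixed = TRUE)

# One row per record: repeat each person as many times as they have pieces.
long <- data.frame(
  person = rep(df$person, lengths(parts)),
  email = unlist(parts),
  stringsAsFactors = FALSE
)
```

Because the number of pieces per person is taken from `lengths(parts)`, nothing about the maximum number of records has to be known in advance.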
Sample data:
df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To name5@email.com by sender6 on 01-12-2014\n",
"To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
"To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
)), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")
【Discussion】: