【Question title】: Splitting string into unknown number of new dataframe columns
【Posted】: 2015-03-11 20:25:48
【Question】:

I have a data frame with a character column that holds email metadata as multiple strings separated by the newline character \n:

  person                                                                                                                                                 myString
1   John                                                                                                            To name5@email.com by sender6 on 01-12-2014\n
2   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n
3    Tim                                                                To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n

I would like to split the separate substrings of myString into different columns, so that it looks like this:

  person                                                     email1                                      email2                                        email3
1   John                To name5@email.com by sender6 on 01-12-2014                                        <NA>                                          <NA>
2   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
3    Tim                To name2@email.com by sender2 on 05-11-2014  To name@email.com by sender2 on 06-03-2015                                          <NA>

My current approach uses separate from the tidyr package:

library(dplyr)
library(tidyr)
res1 <- df %>% 
    separate(col = myString, into = paste0("email", 1:3), sep = "\\n", extra = "drop")
res1[res1 == ""] <- NA

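But with this approach I have to specify by hand that three columns should be extracted.

I would like to improve this process in one or both of the following ways:

  1. A way to automatically compute the maximum number of occurrences of the delimiter (i.e. how many new variables are needed)
  2. Other approaches for splitting into an unknown number of columns

A good solution that returns the data in long rather than wide format would also be great.

Sample data: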

df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To name5@email.com by sender6 on 01-12-2014\n", 
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n", 
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
    )), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")

【Question comments】:

    Tags: regex r string


    【Solution 1】:

    I would suggest cSplit from my "splitstackshape" package:

    library(splitstackshape)
    cSplit(df, "myString", "\n")
    #    person                                                 myString_1
    # 1:   John                To name5@email.com by sender6 on 01-12-2014
    # 2:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
    # 3:    Tim                To name2@email.com by sender2 on 05-11-2014
    #                                     myString_2
    # 1:                                          NA
    # 2: To name3@email.com by sender2 on 02-03-2014
    # 3:  To name@email.com by sender2 on 06-03-2015
    #                                       myString_3
    # 1:                                            NA
    # 2: To email5@domain.com by sender1 on 06-21-2014
    # 3:                                            NA
    

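    You could also try stri_split_fixed from the "stringi" package with the argument simplify = TRUE (although for your sample data this adds an extra empty column at the end). The approach would be something like: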

    library(stringi)
    data.frame(person = df$person, 
               stri_split_fixed(df$myString, "\n", 
                                simplify = TRUE))
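
The extra empty column mentioned above comes from the trailing "\n" in each string. One possible way around it, as a minimal sketch: this assumes stri_split_fixed's omit_empty argument (drop empty tokens) and simplify = NA (pad short rows with NA instead of ""). The email1, email2, ... column names are made up here for illustration.

```r
library(stringi)

# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

# omit_empty = TRUE drops the empty token that follows the trailing "\n";
# simplify = NA returns a character matrix whose short rows are padded with NA.
parts <- stri_split_fixed(df$myString, "\n", omit_empty = TRUE, simplify = NA)

# Illustrative column names (not part of the original answer)
colnames(parts) <- paste0("email", seq_len(ncol(parts)))

res <- data.frame(person = df$person, parts, stringsAsFactors = FALSE)
```

The number of columns is taken from the matrix itself, so nothing has to be specified by hand.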
    

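    【Comments】:

    • This is great, the most direct function I have seen for this particular problem. It looks like there is other good stuff in this package too. Thanks!
    • @SamFirke, thanks. I hope you noticed that cSplit also has a "direction" argument, which you can set to "long" if you want the long format.
    • Just stopping by to thank you for the package. cSplit is a thing of beauty!
    【Solution 2】: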

    It looks hacky, but here you go...

    Use strsplit to split the character vector. Get the maximum length and use it for the number of columns.

    df <- data.frame(
      person = c("John", "Jane", "Tim"),
      myString = c("To name5@email.com by sender6 on 01-12-2014\n",
                   "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
                   "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
      ), stringsAsFactors=FALSE
    )
    
    a <- strsplit(df$myString, "\n")
    max_len <- max(sapply(a, length))
    for(i in 1:max_len){
      df[,paste0("email", i)] <- sapply(a, "[", i)
    }
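
The same idea (split, take the maximum length, use it as the column count) can also be sketched without the loop. This is only a base R variant under the same assumptions as the code above; the names parts, n_cols and wide are illustrative.

```r
# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

# strsplit() does not return a trailing empty string, so the final "\n" is harmless
parts <- strsplit(df$myString, "\n", fixed = TRUE)

# Number of new columns = longest split result (lengths() needs R >= 3.2)
n_cols <- max(lengths(parts))

# Indexing past the end of a vector yields NA, which pads the short rows
padded <- do.call(rbind, lapply(parts, `[`, seq_len(n_cols)))
colnames(padded) <- paste0("email", seq_len(n_cols))

wide <- data.frame(person = df$person, padded, stringsAsFactors = FALSE)
```

Building the matrix with rbind keeps one row per person even when n_cols is 1.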
    

    【Comments】:

      【Solution 3】:

      Here is an efficient route to the long format:

      a <- strsplit(df$myString, "\n")
      lens <- vapply(a, length, integer(1L)) # or lengths(a) in R 3.2
      longdf <- df[rep(seq_along(a), lens),]
      longdf$string <- unlist(a)
      

      Note that stack() is often useful in these situations.
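
As a rough sketch of how stack() might be applied here: name each split result after its person and stack the list. The output column names string and person are chosen for illustration; stack() itself returns columns called values and ind.

```r
# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

# Split each string and label the pieces with the person they belong to
parts <- strsplit(df$myString, "\n", fixed = TRUE)
names(parts) <- df$person

# stack() turns the named list into a two-column data frame: values and ind
long <- stack(parts)
names(long) <- c("string", "person")        # illustrative names
long$string <- as.character(long$string)    # in case an older R made it a factor
long$person <- as.character(long$person)    # ind is returned as a factor
```

This carries only the grouping column along, so any other columns of df would have to be merged back in afterwards.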

      This can be simplified using the IRanges Bioconductor package:

      longdf <- df[togroup(a),]
      longdf$string <- unlist(a)
      

      Then, if really needed, go to wide format:

      longdf$myString <- NULL
      longdf$token <- sequence(lens)
      widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")
      

      【Comments】:

        【Solution 4】:

        This may be sufficient:

        library(data.table)
        dt = as.data.table(df) # or setDT to convert in place
        
        dt[, strsplit(myString, split = "\n"), by = person]
        #   person                                                         V1
        #1:   John                To name5@email.com by sender6 on 01-12-2014
        #2:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
        #3:   Jane                To name3@email.com by sender2 on 02-03-2014
        #4:   Jane              To email5@domain.com by sender1 on 06-21-2014
        #5:    Tim                To name2@email.com by sender2 on 05-11-2014
        #6:    Tim                 To name@email.com by sender2 on 06-03-2015
        

        It can then easily be converted to wide format:

        dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
              person ~ idx, value.var = 'V1')
        #   person                                                          1                                           2                                             3
        #1:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
        #2:   John                To name5@email.com by sender6 on 01-12-2014                                          NA                                            NA
        #3:    Tim                To name2@email.com by sender2 on 05-11-2014  To name@email.com by sender2 on 06-03-2015                                            NA
        
        # (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)
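
A different route to the wide format, not the one used in this answer, is data.table's tstrsplit, which splits and transposes in one step. The sketch below assumes a data.table version that provides tstrsplit (it is not in 1.9.4); the emailN column names are illustrative.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  )
)

# tstrsplit() returns one list element per output column, padded with NA
parts <- tstrsplit(dt$myString, "\n", fixed = TRUE)

# The number of columns comes from the split itself, not from a hard-coded value
cols <- paste0("email", seq_along(parts))
dt[, (cols) := parts]
dt[, myString := NULL]
```

Unlike the dcast result above, this keeps the original row order.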
        

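        【Comments】:

        • @SamFirke You don't need reshape2 with the latest version of data.table (edited the comment at the end to make that clearer)
        • Thanks for clarifying. I have the latest CRAN version of data.table, but I see that is 1.9.4, while 1.9.5 is the current development version on GitHub. This works for me now.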