【Question title】: Splitting data frame rows with concatenated values
【Posted】: 2015-06-11 09:32:27
【Question】:

I have a data.frame that looks like this:

df <- data.frame(col1=c("a","b","c","d"), col2=c("1","1;2;3","5","3;2;5;5;3"), col3=c("0","1;1;0","0","0;0;1;1;0"))

#   col1      col2      col3
# 1    a         1         0
# 2    b     1;2;3     1;1;0
# 3    c         5         0
# 4    d 3;2;5;5;3 0;0;1;1;0

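In other words, some rows have columns whose values are concatenated with ";". Before reading the data.frame I don't know which columns will contain concatenated values, but I do know that they are the same columns for every row that has such values. I also know that, for a row with concatenated values, the number of concatenated values is the same across all of those columns (row 2 has 3 values in both col2 and col3, and row 4 has 5 values in those columns).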

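I would like to create a new data.frame in which these concatenated values are split into separate rows. For those rows, the values in the columns without concatenated values should be replicated once per concatenated value.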

The resulting data.frame would be:

df <- data.frame(col1=c("a","b","b","b","c","d","d","d","d","d"), col2=c("1","1","2","3","5","3","2","5","5","3"), col3=c("0","1","1","0","0","0","0","1","1","0"))

#    col1 col2 col3
# 1     a    1    0
# 2     b    1    1
# 3     b    2    1
# 4     b    3    0
# 5     c    5    0
# 6     d    3    0
# 7     d    2    0
# 8     d    5    1
# 9     d    5    1
# 10    d    3    0

【Comments】:

    Tags: r split dataframe


    【Solution 1】:

    Here is one option

    df <- data.frame(col1=c("a","b","c","d"), col2=c("1","1;2;3","5","3;2;5;5;3"), col3=c("0","1;1;0","0","0;0;1;1;0"))
    
    df2 <- data.frame(col1=c("a","b","b","b","c","d","d","d","d","d"), col2=c("1","1","2","3","5","3","2","5","5","3"), col3=c("0","1","1","0","0","0","0","1","1","0"))
    
    
    ## reshape `col1` to make it look like the others
    v <- Vectorize(gsub)
    df$col1 <- v('\\b\\d\\b', df$col1, df$col2)
    
    #        col1      col2      col3
    # 1         a         1         0
    # 2     b;b;b     1;2;3     1;1;0
    # 3         c         5         0
    # 4 d;d;d;d;d 3;2;5;5;3 0;0;1;1;0
    
    
    ## split on `;` (a literal match, since the data uses no other separator)
    ## and coerce back into a data frame
    data.frame(do.call('cbind', lapply(df, function(x)
      unlist(strsplit(as.character(x), ';', fixed = TRUE)))))
    
    #    col1 col2 col3
    # 1     a    1    0
    # 2     b    1    1
    # 3     b    2    1
    # 4     b    3    0
    # 5     c    5    0
    # 6     d    3    0
    # 7     d    2    0
    # 8     d    5    1
    # 9     d    5    1
    # 10    d    3    0
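
The `gsub` step above matches `'\\b\\d\\b'`, so it relies on every value in `col2` being a single digit. A minimal sketch of a variant without that assumption, using only base R: count the pieces per row with `lengths()` and repeat `col1` that many times with `rep()`. The names `parts2`, `parts3`, `n` and `res` are illustrative, and `stringsAsFactors = FALSE` is assumed so that the columns are plain character vectors.

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("1", "1;2;3", "5", "3;2;5;5;3"),
                 col3 = c("0", "1;1;0", "0", "0;0;1;1;0"),
                 stringsAsFactors = FALSE)

# Split each concatenated column into a list with one character vector per row
parts2 <- strsplit(df$col2, ";", fixed = TRUE)
parts3 <- strsplit(df$col3, ";", fixed = TRUE)

# Number of pieces per row; both columns must agree, as the question states
n <- lengths(parts2)
stopifnot(identical(n, lengths(parts3)))

# Repeat col1 once per piece and flatten the split columns
res <- data.frame(col1 = rep(df$col1, times = n),
                  col2 = unlist(parts2),
                  col3 = unlist(parts3),
                  stringsAsFactors = FALSE)
res
```

This still names `col2` and `col3` by hand, so it only fits the case where the concatenated columns are already known.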
    

    【Comments】:

      【Solution 2】:

      Not as sophisticated as rawr's answer, but perhaps easier to see what is going on

      df1 <- data.frame(col1=c("a","b","c","d"), 
                        col2=c("1","1;2;3","5","3;2;5;5;3"), 
                        col3=c("0","1;1;0","0","0;0;1;1;0"),
                        stringsAsFactors=FALSE)
      
      df1_rows   <- nrow(df1)
      col1_split <- strsplit(df1$col1,";") 
      col2_split <- strsplit(df1$col2,";") 
      col3_split <- strsplit(df1$col3,";") 
      
      df2 <- data.frame(col1=character(), 
                        col2=character(), 
                        col3=character(),
                        stringsAsFactors=FALSE) 
      
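      # Build the result one input row at a time; data.frame() recycles the
      # single col1 value to match the number of split values in col2/col3
      for (n in seq_len(df1_rows)) {
        df2 <- rbind(df2,
                     data.frame(col1=col1_split[[n]],
                                col2=col2_split[[n]],
                                col3=col3_split[[n]],
                                stringsAsFactors=FALSE))
      }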
      

      This gives

      > df2 
         col1 col2 col3
      1     a    1    0
      2     b    1    1
      3     b    2    1
      4     b    3    0
      5     c    5    0
      6     d    3    0
      7     d    2    0
      8     d    5    1
      9     d    5    1
      10    d    3    0
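
The loop above hard-codes the three column names. As a rough sketch of the same idea for an arbitrary set of columns, one could split every column and let `data.frame()` recycle the length-one pieces. The function name `split_rows` and its `sep` argument are hypothetical, not part of any package, and empty strings are not handled.

```r
# Hypothetical helper: split every column on `sep`, then rebuild row by row
split_rows <- function(df, sep = ";") {
  # One list per column, each holding one character vector per input row
  split_cols <- lapply(df, function(x) strsplit(as.character(x), sep, fixed = TRUE))

  pieces <- lapply(seq_len(nrow(df)), function(i) {
    # Pieces of row i, one element per column, names taken from df
    row_vals <- lapply(split_cols, `[[`, i)
    # data.frame() recycles length-one columns and stops with an error
    # if the split columns of a row have incompatible lengths
    do.call(data.frame, c(row_vals, list(stringsAsFactors = FALSE)))
  })

  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

df1 <- data.frame(col1 = c("a", "b", "c", "d"),
                  col2 = c("1", "1;2;3", "5", "3;2;5;5;3"),
                  col3 = c("0", "1;1;0", "0", "0;0;1;1;0"),
                  stringsAsFactors = FALSE)
out <- split_rows(df1)
out
```

Collecting the pieces in a list and calling `rbind` once also avoids growing `df2` inside the loop, which can get slow for larger inputs.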
      

      【Comments】:

        【Solution 3】:

        This is exactly the kind of data I wrote the "splitstackshape" package for. You can use cSplit, like this:

        library(splitstackshape)
        cSplit(df, c("col2", "col3"), ";", "long")
        #     col1 col2 col3
        #  1:    a    1    0
        #  2:    b    1    1
        #  3:    b    2    1
        #  4:    b    3    0
        #  5:    c    5    0
        #  6:    d    3    0
        #  7:    d    2    0
        #  8:    d    5    1
        #  9:    d    5    1
        # 10:    d    3    0
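
The call above names `col2` and `col3` explicitly, whereas the question says the affected columns are not known in advance. A minimal sketch of one way to find them, assuming that any column in which at least one value contains ";" should be split. `split_cols` is an illustrative name, and the `cSplit` call only runs if splitstackshape is installed.

```r
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c("1", "1;2;3", "5", "3;2;5;5;3"),
                 col3 = c("0", "1;1;0", "0", "0;0;1;1;0"))

# Names of the columns where at least one value contains the separator
has_sep <- vapply(df,
                  function(x) any(grepl(";", as.character(x), fixed = TRUE)),
                  logical(1))
split_cols <- names(df)[has_sep]
split_cols

# Pass the detected columns to cSplit instead of typing them out
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  long <- splitstackshape::cSplit(df, split_cols, sep = ";", direction = "long")
  print(long)
}
```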
        

        【Comments】:
