【Question title】: R - break character vector of i comma-separated IDs into i discrete vectors of a data frame
【Posted】: 2021-08-12 23:58:01
【Question description】:

The data frame df contains two character vectors. Here are the first 10 rows:

rowid  codes_raw                            
a      15-1132, 15-1133                     
b      21-1091, 21-1094, 21-1099            
c      25-9011, 25-9021, 25-9031, 25-9099   
d      31-9093, 31-9099                     
e      33-9092, 33-9099                     
f      37-2011, 37-2019                     
g      39-4011, 39-4021                     
h      47-5051, 47-5099                     
i      49-2094, 49-2095                     
j      49-9041                    

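df$codes_raw contains between 1 and i discrete identifiers for a given row. These identifiers need to be spread across i new vectors in the same data frame. The result should look like this: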

rowid codes_raw                            code_1     code_2     code_3     code_4
a     15-1132, 15-1133                     15-1132    15-1133
b     21-1091, 21-1094, 21-1099            21-1091    21-1094    21-1099
c     25-9011, 25-9021, 25-9031, 25-9099   25-9011    25-9021    25-9031    25-9099
d     31-9093, 31-9099                     31-9093    31-9099
e     33-9092, 33-9099                     33-9092    33-9099
f     37-2011, 37-2019                     37-2011    37-2019
g     39-4011, 39-4021                     39-4011    39-4021
h     47-5051, 47-5099                     47-5051    47-5099
i     49-2094, 49-2095                     49-2094    49-2095
j     49-9041                              49-9041

My current solution involves a separate call to if_else() for each new column, which is clunky. For example:

df$code_2 <- if_else(
  grepl(',', df$codes_raw),
  sub('.*,\\s*', '', df$codes_raw),
  ' ')

I would also like the solution to work when df$codes_raw contains as many as 20 commas. I am looking for a more elegant, more dynamic alternative.
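
To make the limitation concrete, here is a small sketch on two made-up strings in the same format. Because `.*` is greedy, the sub() call above strips everything up to the last comma, so it returns the last code in the string rather than the second one. The variable names `x` and `last_code` are illustrative only.

```r
# Two sample strings: one with two codes, one with four
x <- c("15-1132, 15-1133", "25-9011, 25-9021, 25-9031, 25-9099")

# Greedy ".*" consumes up to the LAST comma, so only the final code survives
last_code <- sub(".*,\\s*", "", x)
last_code
```

For the two-code string this happens to be the second code, but for the four-code string it is the fourth, which is why a separate hand-written pattern would be needed per column.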

【Question comments】:

    Tags: r regex string vector data-cleaning


    【Solution 1】:

    Use 'separate()':

    library(tidyr)
    
    # The widest row determines how many code_* columns are needed
    n_codes   <- max(lengths(strsplit(df$codes_raw, split = ", ")))
    col_names <- paste0("code_", seq_len(n_codes))
    
    df %>%
      separate(codes_raw,
               into = col_names, sep = ", ", remove = FALSE)
    
       rowid                       codes_raw  code_1  code_2  code_3  code_4
    1      a                 15-1132,15-1133 15-1132 15-1133    <NA>    <NA>
    2      b         21-1091,21-1094,21-1099 21-1091 21-1094 21-1099    <NA>
    3      c 25-9011,25-9021,25-9031,25-9099 25-9011 25-9021 25-9031 25-9099
    4      d                 31-9093,31-9099 31-9093 31-9099    <NA>    <NA>
    5      e                 33-9092,33-9099 33-9092 33-9099    <NA>    <NA>
    6      f                 37-2011,37-2019 37-2011 37-2019    <NA>    <NA>
    7      g                 39-4011,39-4021 39-4011 39-4021    <NA>    <NA>
    8      h                 47-5051,47-5099 47-5051 47-5099    <NA>    <NA>
    9      i                 49-2094,49-2095 49-2094 49-2095    <NA>    <NA>
    10     j                         49-9041 49-9041    <NA>    <NA>    <NA>   
    

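    【Comments】:

    • This solution works well, but (a) it requires specifying the number of into variables by hand, and (b) it produces a series of warnings because of the NA values in the into vectors. A suppressWarnings() wrapper is the obvious fix, but it introduces an obvious problem if the original data frame is updated.
    • You can set the argument 'extra' to do what you want. See the documentation.
    • What if you don't know the number of columns the values have to be split into? That is why I would suggest dynamic naming.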
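
As a hedged follow-up to the warning discussed in these comments: in tidyr's separate(), `extra` controls rows with too many pieces, while `fill` controls rows with too few. A minimal sketch, assuming a three-row sample in the same format as the question, that pads short rows with NA without emitting a warning:

```r
library(tidyr)

# Small sample in the same shape as the question's data
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099", "49-9041"),
  stringsAsFactors = FALSE
)

# The widest row determines how many columns are needed
n_codes <- max(lengths(strsplit(df$codes_raw, ", ", fixed = TRUE)))

out <- separate(
  df, codes_raw,
  into = paste0("code_", seq_len(n_codes)),
  sep = ", ",
  remove = FALSE,   # keep the original codes_raw column
  fill = "right"    # pad short rows with NA on the right, without a warning
)
out
```

Since the column count is computed from the data, nothing needs to be updated by hand when the data frame changes.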
    【Solution 2】:

    To generate the column names automatically, I suggest this:

    library(tidyverse)
    df %>% 
      separate_rows(codes_raw, sep = ", ") %>% 
      group_by(rowid) %>% 
      mutate(code_no = row_number()) %>%   # position of each code within its row
      pivot_wider(id_cols = rowid, names_from = code_no, values_from = codes_raw, names_prefix = "code_") %>% 
      ungroup()
    
    # A tibble: 10 x 5
       rowid code_1  code_2  code_3  code_4 
       <chr> <chr>   <chr>   <chr>   <chr>  
     1 a     15-1132 15-1133 NA      NA     
     2 b     21-1091 21-1094 21-1099 NA     
     3 c     25-9011 25-9021 25-9031 25-9099
     4 d     31-9093 31-9099 NA      NA     
     5 e     33-9092 33-9099 NA      NA     
     6 f     37-2011 37-2019 NA      NA     
     7 g     39-4011 39-4021 NA      NA     
     8 h     47-5051 47-5099 NA      NA     
     9 i     49-2094 49-2095 NA      NA     
    10 j     49-9041 NA      NA      NA 
    

    # Alternative: build the column names from the maximum comma count, then use separate()
    nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))
    
    df %>% 
      separate(
        codes_raw, 
        into = nm, 
        sep = ", ")
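
A possible variant of the same idea, assuming tidyr 1.3.0 or later is installed: separate_wider_delim() can number the new columns itself through `names_sep`, so the column count does not have to be computed first. The sample data and the final renaming step are illustrative, not part of the answer above.

```r
library(tidyr)

# Small sample in the same shape as the question's data
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099", "49-9041"),
  stringsAsFactors = FALSE
)

out <- separate_wider_delim(
  df, codes_raw,
  delim = ", ",
  names_sep = "_",           # auto-numbered names: codes_raw_1, codes_raw_2, ...
  too_few = "align_start",   # short rows get NA in the trailing columns
  cols_remove = FALSE        # keep the original codes_raw column
)

# Rename codes_raw_1 to code_1, and so on
names(out) <- sub("^codes_raw_", "code_", names(out))
out
```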
    

    【Comments】:

      【Solution 3】:

      You said the maximum number of columns is 20, so one approach is to use a regex with capture groups (via library(namedCapture)), for example:

      rowid <- c("a","b","c","d","e")
      codes_raw <- c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099", "25-9011, 25-9021, 25-9031, 25-9099", "31-9093, 31-9099", "49-9041")
      df <- data.frame(rowid, codes_raw)
      
      library(namedCapture)
      n = 20                              # Max number of columns
      pattern <- "^(?P<code_1>\\d+-\\d+)" # Pattern start
      for (x in 2:n) {                    # Add more optional columns
        pattern <- paste0(pattern, "(?:\\s*,\\s*(?P<code_",x,">\\d+-\\d+))?")
      }
      pattern <- paste0(pattern,"$")      # End of string anchor added
      df1 <- str_match_named(df$codes_raw, pattern)  # Extract column data
      df1 <- df1[, colSums(df1 != "") != 0] # Remove empty columns
      df1 <- cbind(rowid, df1)              # Put back the rowid column
      

      Output:

      > df1
           rowid code_1    code_2    code_3    code_4   
      [1,] "a"   "15-1132" "15-1133" ""        ""       
      [2,] "b"   "21-1091" "21-1094" "21-1099" ""       
      [3,] "c"   "25-9011" "25-9021" "25-9031" "25-9099"
      [4,] "d"   "31-9093" "31-9099" ""        ""       
      [5,] "e"   "49-9041" ""        ""        ""   
      

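      Here is a sample regex demo.

      • ^ - start of string
      • (?P<code_1>\d+-\d+) - a named capturing group called code_1 that matches one or more digits, a -, and one or more digits
      • (?:\s*,\s*(?P<code_2>\d+-\d+))? - an optional sequence: a comma surrounded by zero or more whitespace characters, followed by the "code_2" group matching 1+ digits, a -, and 1+ digits; the same applies to the remaining groups.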
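
The pattern described above can also be assembled and checked with base R only. This is a minimal sketch, assuming a hypothetical helper named `build_pattern()` and `n = 3` for readability; it uses regexec() with `perl = TRUE` instead of namedCapture.

```r
# Hypothetical helper: one mandatory code_1 group plus n - 1 optional groups
build_pattern <- function(n) {
  optional <- if (n >= 2) {
    sprintf("(?:\\s*,\\s*(?P<code_%d>\\d+-\\d+))?", 2:n)
  } else {
    character(0)
  }
  paste0("^(?P<code_1>\\d+-\\d+)", paste(optional, collapse = ""), "$")
}

pattern <- build_pattern(3)

# A string with at most 3 codes matches; one with 4 codes does not,
# because the pattern is anchored at both ends
ok       <- grepl(pattern, "15-1132, 15-1133", perl = TRUE)
too_many <- grepl(pattern, "25-9011, 25-9021, 25-9031, 25-9099", perl = TRUE)

# Element 1 is the full match; elements 2 and 3 are code_1 and code_2
pieces <- regmatches("15-1132, 15-1133",
                     regexec(pattern, "15-1132, 15-1133", perl = TRUE))[[1]]
```

One consequence worth noting: because of the `^` and `$` anchors, a row with more than n codes fails to match entirely, so n has to be at least the largest number of codes in any row.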

      【Comments】:

        【Solution 4】:

        Do it dynamically (generating the column names) like this. This works for any number of strings joined together:

        df <- read.table(text = 'rowid  codes_raw                            
        a      "15-1132, 15-1133"                     
        b      "21-1091, 21-1094, 21-1099"            
        c      "25-9011, 25-9021, 25-9031, 25-9099"   
        d      "31-9093, 31-9099"                     
        e      "33-9092, 33-9099"                     
        f      "37-2011, 37-2019"                     
        g      "39-4011, 39-4021"                     
        h      "47-5051, 47-5099"                     
        i      "49-2094, 49-2095"                     
        j      49-9041', header = T)
        df
        #>    rowid                          codes_raw
        #> 1      a                   15-1132, 15-1133
        #> 2      b          21-1091, 21-1094, 21-1099
        #> 3      c 25-9011, 25-9021, 25-9031, 25-9099
        #> 4      d                   31-9093, 31-9099
        #> 5      e                   33-9092, 33-9099
        #> 6      f                   37-2011, 37-2019
        #> 7      g                   39-4011, 39-4021
        #> 8      h                   47-5051, 47-5099
        #> 9      i                   49-2094, 49-2095
        #> 10     j                            49-9041
        
        library(tidyr)
        library(stringr)
        df %>% separate(codes_raw, into = paste0('code_', seq_len(1 + max(str_count(df$codes_raw, ', ')))), 
                        remove = F, sep = ', ')
        #> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 9 rows [1, 2, 4,
        #> 5, 6, 7, 8, 9, 10].
        #>    rowid                          codes_raw  code_1  code_2  code_3  code_4
        #> 1      a                   15-1132, 15-1133 15-1132 15-1133    <NA>    <NA>
        #> 2      b          21-1091, 21-1094, 21-1099 21-1091 21-1094 21-1099    <NA>
        #> 3      c 25-9011, 25-9021, 25-9031, 25-9099 25-9011 25-9021 25-9031 25-9099
        #> 4      d                   31-9093, 31-9099 31-9093 31-9099    <NA>    <NA>
        #> 5      e                   33-9092, 33-9099 33-9092 33-9099    <NA>    <NA>
        #> 6      f                   37-2011, 37-2019 37-2011 37-2019    <NA>    <NA>
        #> 7      g                   39-4011, 39-4021 39-4011 39-4021    <NA>    <NA>
        #> 8      h                   47-5051, 47-5099 47-5051 47-5099    <NA>    <NA>
        #> 9      i                   49-2094, 49-2095 49-2094 49-2095    <NA>    <NA>
        #> 10     j                            49-9041 49-9041    <NA>    <NA>    <NA>
        

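        Created on 2021-05-25 by the reprex package (v2.0.0)

        【Comments】:

          【Solution 5】:

          You can use str_split() from the stringr library to split the codes into a list, convert that list of vectors (of unequal length) into a matrix, and then use mutate() to attach it to your original data frame. Here is an example: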

          #your example data
          df<-data.frame(rowid = c("a","b", "c","d", "e", "f","g","h","i","j"),
                         codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099" ,"25-9011, 25-9021, 25-9031, 25-9099","31-9093, 31-9099", "33-9092, 33-9099",                     
                  "37-2011, 37-2019","39-4011, 39-4021", "47-5051, 47-5099", "49-2094, 49-2095","49-9041"))
          
          library(stringr)
          library(dplyr)
          #Split codes_raw on a comma plus any following whitespace, so no code keeps a leading space
          l<-str_split(df$codes_raw, ",\\s*")
          #get the number of codes in each row
          n.codes <- sapply(l, length)
          #find the longest number of codes, and make a sequence from 1 to that number.
          seq.max <- seq_len(max(n.codes))
          #Fill NAs in blanks as you make a matrix. Convert to dataframe.
          codes_in_columns <- t(sapply(l, "[", i = seq.max)) %>% 
            data.frame(.)
          #Set the desired column names.
          names(codes_in_columns)<- paste0("code_",seq.max)
          #combine original with separated codes
          df<-df %>% mutate(codes_in_columns )
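
The same split, pad, and bind steps can also be sketched in base R without any packages. This is an illustrative variant rather than the answer's own method; the three-row sample and the names `parts`, `n_max`, `codes`, and `out` are assumptions.

```r
# Small sample in the same shape as the question's data
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099", "49-9041"),
  stringsAsFactors = FALSE
)

# Split on a comma plus optional whitespace
parts <- strsplit(df$codes_raw, ",\\s*")
n_max <- max(lengths(parts))

# `length<-` pads each vector with NA up to n_max elements
padded <- lapply(parts, `length<-`, n_max)

# Stack the padded vectors into a character matrix, one row per input row
codes <- do.call(rbind, padded)
colnames(codes) <- paste0("code_", seq_len(n_max))

# Attach the new columns to the original data frame
out <- cbind(df, codes, stringsAsFactors = FALSE)
out
```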
          

          【Comments】:
