【Question title】: Turning multiple String Patterns into Binary Columns
【Posted】: 2019-01-18 16:21:15
【Question description】:

I am trying to use R to turn specific string patterns into binary columns for three different columns.

Here is what I have:

have <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad", 
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp", 
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china", 
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china", 
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing", 
"giad", "china", "china", "china, camp", "camp", "na", "na", 
"cis", "trans", "stron, mon")), row.names = c(NA, -17L), class = c("data.table", 
"data.frame"))

This is what I want:

    want <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad", 
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp", 
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china", 
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china", 
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing", 
"giad", "china", "china", "china, camp", "camp", "na", "na", 
"cis", "trans", "stron, mon"), rep1_chi = c(1, 0, 0, 0, 0, 1, 
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep2_chi = c(1, 0, 0, 0, 0, 
1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep3_chi = c(0, 0, 0, 0, 
0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0), rep1_bay = c(0, 0, 1, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_bay = c(0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_bay = c(0, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_gia = c(0, 
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_gia = c(0, 
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_gia = c(0, 
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_sin = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_sin = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_sin = c(0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-17L))

I was able to build a working solution with ifelse and stringr::str_detect, as follows:

want <- have %>% dplyr::select(rep1, rep2, rep3) %>% mutate(
      rep1_chi = ifelse(str_detect(rep1,"chi") == T,1,0),
      rep2_chi = ifelse(str_detect(rep2,"chi") == T,1,0),
      rep3_chi = ifelse(str_detect(rep3,"chi") == T,1,0),
      rep1_bay = ifelse(str_detect(rep1,"bay") == T,1,0),
      rep2_bay = ifelse(str_detect(rep2,"bay") == T,1,0),
      rep3_bay = ifelse(str_detect(rep3,"bay") == T,1,0),          
      rep1_gia = ifelse(str_detect(rep1,"gia") == T,1,0),
      rep2_gia = ifelse(str_detect(rep2,"gia") == T,1,0),
      rep3_gia = ifelse(str_detect(rep3,"gia") == T,1,0),           
      rep1_sin = ifelse(str_detect(rep1,"sin") == T,1,0),
      rep2_sin = ifelse(str_detect(rep2,"sin") == T,1,0),
      rep3_sin = ifelse(str_detect(rep3,"sin") == T,1,0))

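My main problem is that this is quite repetitive. Is there a more elegant solution? Since the "rep" columns are numbered 1 to 3, I suspect there is a more programmatic way to do it.

Searching SO, I found a solution that uses model.matrix. It seems to work well when you want every pattern and are only interested in a single column. I tried turning it into a function so that I could select multiple columns, but I would still have to remove the strings for the patterns I am not interested in.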
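To illustrate the limitation described above, here is a minimal sketch of model.matrix applied to a single column. The data frame `one_col` is a made-up subset used only for illustration. Each distinct string becomes its own indicator column, so a combined value such as "china, camp" is treated as a separate level rather than as a match for the "chi" pattern.

```r
# Hypothetical four-row example; not the full data from the question.
one_col <- data.frame(
  rep1 = c("china", "na", "bay", "china, camp"),
  stringsAsFactors = FALSE
)

# "- 1" drops the intercept so every distinct value gets an indicator column.
dummies <- model.matrix(~ rep1 - 1, data = one_col)

# One column per distinct string, including the combined "china, camp" value.
colnames(dummies)
```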

【Comments】:

    Tags: r dplyr stringr


    【Solution 1】:

    Here is an approach using mutate_all. If you only want to do this for specific columns, use mutate_at and specify the columns.

    library(dplyr)
    library(stringr)
    
    mutate_all(have, funs(chi = as.numeric(str_detect(., "chi")),
                      bay = as.numeric(str_detect(., "bay")),
                      gia = as.numeric(str_detect(., "gia")),
                      sin = as.numeric(str_detect(., "sin"))))
    

    An example with mutate_at and vars:

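    # The names on the left become the column suffixes (rep1_chi, rep2_chi, ...),
    # so they are chosen to match the desired output in the question.
    want <- have %>% mutate_at(vars(rep1,rep2,rep3), funs( 
                               chi = as.numeric(str_detect(., "chi")), 
                               bay = as.numeric(str_detect(., "bay")), 
                               gia = as.numeric(str_detect(., "gia")), 
                               sin = as.numeric(str_detect(., "sin"))))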
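As far as I know, funs() has been deprecated since dplyr 0.8.0, and dplyr 1.0.0 added across() as the replacement for the mutate_all / mutate_at family. The sketch below shows what the same idea might look like with across(). The data frame `have_small` is a made-up subset used only for illustration, and the default naming scheme of across() is assumed to produce the `<column>_<suffix>` names.

```r
library(dplyr)
library(stringr)

# Hypothetical subset of the question's data, two columns only.
have_small <- data.frame(
  rep1 = c("china", "na", "bay", "giad", "sing", "china, camp"),
  rep2 = c("china", "na", "bay", "giad", "sing", "camp"),
  stringsAsFactors = FALSE
)

# Each named lambda yields one new column per input column,
# named <column>_<name>, e.g. rep1_chi.
want_small <- have_small %>%
  mutate(across(
    c(rep1, rep2),
    list(
      chi = ~ as.numeric(str_detect(.x, "chi")),
      bay = ~ as.numeric(str_detect(.x, "bay")),
      gia = ~ as.numeric(str_detect(.x, "gia")),
      sin = ~ as.numeric(str_detect(.x, "sin"))
    )
  ))
```

Note that across() groups the new columns by input column (rep1_chi, rep1_bay, ...), not by pattern as in the desired output, so a final select() or relocate() may be needed if the column order matters.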
    

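    【Comments】:

      【Solution 2】:

      Here is some ugly and inefficient (performance-wise) loop-based code that saves you from building the column names yourself: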

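      library(stringr)

      # Patterns to detect; each one becomes a column suffix
      pattern <- c("chi", "bay", "gia", "sin")

      # have is a data.table, so convert it to get plain data.frame indexing
      want_new <- as.data.frame(have)
      colold <- colnames(want_new)
      for (p in pattern) {
        # New column names, e.g. rep1_chi, rep2_chi, rep3_chi
        cname <- paste0(
          colold, 
          "_",
          p
        )
        for (col in cname) {
          # Strip the suffix to recover the source column, then test it for p
          want_new[, col] <- as.numeric(str_detect(
            want_new[, gsub(paste0("_", p), "", col, fixed = TRUE)],
            p
          ))
        }
      }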
      

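      I am fairly sure this can be improved with some further tweaking.

      【Comments】:

      • You could improve this by using the purrr package: use map_chr instead, so that they are output as character names
      • Hmm, I don't see how map_chr() would help here.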
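One possible tweak, sketched here in base R only: loop over the source columns directly so the new name is built once and never has to be reversed with gsub. The names `have_df`, `patterns` and `want_df` are made up for this illustration, and grepl with `fixed = TRUE` is assumed to be an acceptable stand-in for str_detect because the patterns are plain substrings.

```r
# Hypothetical small input; not the full data from the question.
have_df <- data.frame(
  rep1 = c("china", "na", "bay, sing"),
  rep2 = c("giad", "china, camp", "na"),
  stringsAsFactors = FALSE
)
patterns <- c("chi", "bay", "gia", "sin")

want_df <- have_df
for (p in patterns) {
  for (col in names(have_df)) {
    # Build the new name once; read from the untouched source column.
    want_df[[paste0(col, "_", p)]] <-
      as.numeric(grepl(p, have_df[[col]], fixed = TRUE))
  }
}
```

Putting the patterns in the outer loop keeps the new columns grouped by pattern (rep1_chi, rep2_chi, rep1_bay, ...), which matches the column order of the desired output in the question.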