【Question title】:How do I create a binary column based off characters in another column in R?
【Posted】:2021-08-31 21:44:32
【Question description】:
ROW   ID       SEX               RACE               
2  REC1000023   F                1.Black
7  REC1000032   M                6.White
8  REC1000066   M                4.Asian
9  REC1000078   M                6.White
10 REC1000099   M                5.Multiracial 

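I want to create a binary variable "Black" that is 0 or 1 depending on the value in the "RACE" column. I would also like a "White" column and an "Other" column. Like this: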

ROW   ID       SEX               RACE           Black   White  Other        
2  REC1000023   F                1.Black         1      0      0
7  REC1000032   M                6.White         0      1      0
8  REC1000066   M                4.Asian         0      0      1
9  REC1000078   M                6.White         0      1      0
10 REC1000099   M                5.Multiracial   0      0      1

【Question comments】:

    Tags: r dataframe if-statement


    【Solution 1】:

    Using ifelse:

    library(tidyverse)
    # Example data
    df <- data.frame(
      stringsAsFactors = FALSE,
                   ROW = c(2L, 7L, 8L, 9L, 10L),
                    ID = c("REC1000023","REC1000032",
                           "REC1000066","REC1000078","REC1000099"),
                   SEX = c("F", "M", "M", "M", "M"),
                  RACE = c("1.Black","6.White","4.Asian",
                           "6.White","5.Multiracial")
    )
    
    # Create new columns
    df2 <- df %>% 
      mutate(Black = ifelse(RACE == "1.Black", 1, 0),
             White = ifelse(RACE == "6.White", 1, 0),
             Other = ifelse(RACE != "1.Black" & RACE != "6.White", 1, 0))
    df2
    #  ROW         ID SEX          RACE Black White Other
    #1   2 REC1000023   F       1.Black     1     0     0
    #2   7 REC1000032   M       6.White     0     1     0
    #3   8 REC1000066   M       4.Asian     0     0     1
    #4   9 REC1000078   M       6.White     0     1     0
    #5  10 REC1000099   M 5.Multiracial     0     0     1
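
A quick sanity check for this kind of recoding is to confirm that every row ends up with exactly one indicator set. The following is a minimal base R sketch, assuming the same RACE codes as in the example; the vector names `race`, `black`, `white`, `other`, and `row_totals` are illustrative and not part of the answer.

```r
# Same RACE values as the example data
race <- c("1.Black", "6.White", "4.Asian", "6.White", "5.Multiracial")

# Same ifelse() logic as above, applied to a plain vector
black <- ifelse(race == "1.Black", 1, 0)
white <- ifelse(race == "6.White", 1, 0)
other <- ifelse(race != "1.Black" & race != "6.White", 1, 0)

# Each row should fall into exactly one of the three categories
row_totals <- black + white + other
all(row_totals == 1)
```

Note that a missing value in RACE would make the comparisons return NA, so all three indicators would be NA for that row; whether that is desirable depends on the analysis.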
    

    --

    Not sure whether speed is a factor for your application, but here is a benchmark using the example dataset:

    ronak_func <- function(df){
      df %>%
        mutate(col = sub('\\d+\\.', '', RACE), 
               col = replace(col, !col %in% c('Black', 'White'), 'Other')) %>%
        pivot_wider(names_from = col, values_from = col, 
                    values_fn = length, values_fill = 0)
    }
    
    jared_func <- function(df){
      df %>% 
        mutate(Black = ifelse(RACE == "1.Black", 1, 0),
               White = ifelse(RACE == "6.White", 1, 0),
               Other = ifelse(RACE != "1.Black" & RACE != "6.White", 1, 0))
    }
    
    karthik_func <- function(df){
      df %>% mutate(Black = +str_detect(RACE,'Black'),
                    White = +str_detect(RACE,'White'),
                    Other = +(!str_detect(RACE,'Black|White')))
    }
    
    jpdugo17_func <- function(df){
      map_dfc(list('1.Black', '6.White'), ~ transmute(df, '{str_sub(.x, 3, -1)}' := if_else(RACE == .x, 1, 0))) %>% 
        mutate(other = if_else(Black + White == 1, 0, 1)) %>% cbind(df, .)
    }
    
    GKi1_func <- function(df) {
      df$Black <- +(df$RACE == "1.Black")
      df$White <- +(df$RACE == "6.White")
      df$Other <- 1 - (df$Black | df$White)
      df
    }
    
    GKi2_func <- function(df) {
      df$Black <- +grepl("Black", df$RACE, fixed = TRUE)
      df$White <- +grepl("White", df$RACE, fixed = TRUE)
      df$Other <- 1 - (df$Black | df$White)
      df
    }
    
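    # Needs the data.table package installed. as.data.table() works on a copy,
    # whereas setDT(df) converts the data frame that was passed in by reference.
    jared_func_dt <- function(df){
      dt <- data.table::as.data.table(df)
      dt[, Black := +(RACE == "1.Black")][, White := +(RACE == "6.White")][, Other := 1 - (Black | White)]
    }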
    
    res <- microbenchmark::microbenchmark(ronak_func(df),
                                          jared_func(df),
                                          karthik_func(df),
                                          jpdugo17_func(df),
                                          GKi1_func(df),
                                          GKi2_func(df),
                                          jared_func_dt(df))
    autoplot(res)
    

    And a benchmark using an example dataset with 10k rows:

    df2 <- data.frame(stringsAsFactors = FALSE,
                      ROW = 1:10000,
                      ID = rep(c("REC1000023","REC1000032",
                                 "REC1000066","REC1000078",
                                 "REC1000099"), times = 2000),
                      SEX = sample(c("F", "M"),
                                   replace = TRUE,
                                   size = 10000),
                      RACE = sample(c("1.Black","6.White","4.Asian",
                               "6.White","5.Multiracial"),
                               replace = TRUE,
                               size = 10000))
    res <- microbenchmark::microbenchmark(ronak_func(df2),
                                          jared_func(df2),
                                          karthik_func(df2),
                                          jpdugo17_func(df2),
                                          GKi1_func(df2),
                                          GKi2_func(df2),
                                          jared_func_dt(df2))
    autoplot(res)
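
Before comparing timings, it can help to confirm that the candidates actually return the same result. The sketch below does this for the two base R variants only, so it runs without extra packages; `by_equal` and `by_grepl` are hypothetical names whose bodies mirror GKi1_func and GKi2_func.

```r
df <- data.frame(
  RACE = c("1.Black", "6.White", "4.Asian", "6.White", "5.Multiracial"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact comparison with ==
by_equal <- function(d) {
  d$Black <- +(d$RACE == "1.Black")
  d$White <- +(d$RACE == "6.White")
  d$Other <- 1 - (d$Black | d$White)
  d
}

# Hypothetical helper: literal substring match with grepl()
by_grepl <- function(d) {
  d$Black <- +grepl("Black", d$RACE, fixed = TRUE)
  d$White <- +grepl("White", d$RACE, fixed = TRUE)
  d$Other <- 1 - (d$Black | d$White)
  d
}

# TRUE when both approaches produce the same data frame
same <- isTRUE(all.equal(by_equal(df), by_grepl(df)))
same
```

The tidyverse and data.table variants return tibbles or data.tables and, in one case, a lowercase `other` column, so comparing them would need the indicator columns extracted first.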
    

    【Comments】:

      【Solution 2】:

      If Black is always coded as 1.Black and White always as 6.White, you can compare with == and then convert the resulting TRUE/FALSE vector to 1/0 using unary +:

      df$Black <- +(df$RACE == "1.Black")
      df$White <- +(df$RACE == "6.White")
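
To make the unary + step concrete, here is a small sketch of what it does to a logical vector. The names `flags` and `plus` are illustrative only.

```r
# A logical vector from an == comparison
flags <- c("1.Black", "6.White", "4.Asian") == "1.Black"
typeof(flags)   # "logical"

# Unary + coerces TRUE/FALSE to integer 1/0
plus <- +flags
typeof(plus)    # "integer"

# Equivalent to an explicit as.integer() call
identical(plus, as.integer(flags))
```

as.integer() is the more explicit spelling; unary + is simply shorter.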
      

      If the other characters vary, you can use grepl:

      df$Black <- +grepl("Black", df$RACE, fixed = TRUE)
      df$White <- +grepl("White", df$RACE, fixed = TRUE)
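
The difference between the two approaches shows up when the numeric prefix is not consistent. In the sketch below, the code "01.Black" is a made-up variant used purely for illustration; it does not appear in the question's data.

```r
# "01.Black" is a hypothetical variant of the code, for illustration only
race <- c("1.Black", "01.Black", "6.White", "4.Asian")

# Exact comparison only matches the literal string "1.Black"
exact <- +(race == "1.Black")
exact     # 1 0 0 0

# grepl() with fixed = TRUE matches any value containing the literal text "Black"
partial <- +grepl("Black", race, fixed = TRUE)
partial   # 1 1 0 0
```

The trade-off is that a substring match would also flag any other category whose label happens to contain "Black", so it is worth checking the distinct values of RACE first.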
      

      To get the remaining column Other, simply use what is already in Black and White:

      df$Other <- 1 - (df$Black | df$White)
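
A short sketch of how this expression evaluates: `|` treats the 0/1 values as logicals, and subtracting the result from 1 flips it. The vectors `black` and `white` below are illustrative stand-ins for the two columns.

```r
black <- c(1L, 0L, 0L, 0L, 0L)
white <- c(0L, 1L, 0L, 1L, 0L)

# (black | white) is TRUE where either indicator is set; 1 - TRUE is 0
other <- 1 - (black | white)
other           # 0 0 1 0 1
typeof(other)   # "double", because the literal 1 is a double

# If an integer column is preferred, negate and convert explicitly
other_int <- as.integer(!(black | white))
typeof(other_int)   # "integer"
```

Both forms carry the same values; they differ only in storage type.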
      

      Result:

      df
      #  ROW         ID SEX          RACE Black White Other
      #1   2 REC1000023   F       1.Black     1     0     0
      #2   7 REC1000032   M       6.White     0     1     0
      #3   8 REC1000066   M       4.Asian     0     0     1
      #4   9 REC1000078   M       6.White     0     1     0
      #5  10 REC1000099   M 5.Multiracial     0     0     1
      

      【Comments】:

        【Solution 3】:
        library(tidyverse)
        df <- 
        read_table(file = "ROW   ID       SEX               RACE               
        2  REC1000023   F                1.Black
        7  REC1000032   M                6.White
        8  REC1000066   M                4.Asian
        9  REC1000078   M                6.White
        10 REC1000099   M                5.Multiracial ")
        
        map_dfc(list('1.Black', '6.White'), ~ transmute(df, '{str_sub(.x, 3, -1)}' := if_else(RACE == .x, 1, 0))) %>% 
            mutate(other = if_else(Black + White == 1, 0, 1)) %>% cbind(df, .)
        #>        ROW   ID SEX          RACE Black White other
        #> 1 2  REC1000023   F       1.Black     1     0     0
        #> 2 7  REC1000032   M       6.White     0     1     0
        #> 3 8  REC1000066   M       4.Asian     0     0     1
        #> 4 9  REC1000078   M       6.White     0     1     0
        #> 5 10 REC1000099   M 5.Multiracial     0     0     1
        

        Created on 2021-06-16 by the reprex package (v2.0.0)

        【Comments】:

          【Solution 4】:

          Does this work:

          library(dplyr)
          library(stringr)
          df %>% mutate(Black = +str_detect(RACE,'Black'),
                        White = +str_detect(RACE,'White'),
                        Other = +(!str_detect(RACE,'Black|White')))
          # A tibble: 5 x 7
              ROW ID         SEX   RACE          Black White Other
            <dbl> <chr>      <chr> <chr>         <int> <int> <int>
          1     2 REC1000023 F     1.Black           1     0     0
          2     7 REC1000032 M     6.White           0     1     0
          3     8 REC1000066 M     4.Asian           0     0     1
          4     9 REC1000078 M     6.White           0     1     0
          5    10 REC1000099 M     5.Multiracial     0     0     1
          

          【Comments】:

          • This is clever - nice!
          【Solution 5】:

          Create a new column that changes any value other than c('Black', 'White') to 'Other', then use pivot_wider:

          library(dplyr)
          library(tidyr)
          
          df %>%
            mutate(col = sub('\\d+\\.', '', RACE), 
                   col = replace(col, !col %in% c('Black', 'White'), 'Other')) %>%
            pivot_wider(names_from = col, values_from = col, 
                        values_fn = length, values_fill = 0)
          
          #    ROW ID         SEX   RACE          Black White Other
          #  <int> <chr>      <chr> <chr>         <int> <int> <int>
          #1     2 REC1000023 F     1.Black           1     0     0
          #2     7 REC1000032 M     6.White           0     1     0
          #3     8 REC1000066 M     4.Asian           0     0     1
          #4     9 REC1000078 M     6.White           0     1     0
          #5    10 REC1000099 M     5.Multiracial     0     0     1
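
The same recode-then-spread idea can also be sketched in base R. This is a substitute technique, not the pivot_wider call used above: it strips the numeric prefix, collapses everything else into "Other", and builds one 0/1 column per level. The names `race`, `grp`, and `ind` are illustrative.

```r
race <- c("1.Black", "6.White", "4.Asian", "6.White", "5.Multiracial")

# Drop the leading digits and dot, e.g. "1.Black" becomes "Black"
grp <- sub("^\\d+\\.", "", race)

# Collapse every label other than Black/White into "Other"
grp[!grp %in% c("Black", "White")] <- "Other"
grp <- factor(grp, levels = c("Black", "White", "Other"))

# One integer indicator column per level; columns are named after the levels
ind <- sapply(levels(grp), function(lv) as.integer(grp == lv))
ind
```

The result is an integer matrix that can be attached to the original data frame with cbind().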
          

          【Comments】:
