【Question title】:Is there a more concise method to filter columns containing a number greater than or less than, in dplyr?
【Posted】:2018-11-30 06:32:41
【Question description】:

Here is a sample of my tibble:

    protein patient           value 
  <chr>   <chr>             <dbl> 
1 BOD1L2  RF0064_Case-9-d-   10.4  
2 PPFIA2  RF0064_Case-20-d-  7.83 
3 STAT4   RF0064_Case-11-d-  11.0 
4 TOM1L2  RF0064_Case-29-d-  13.0  
5 SH2D2A  RF0064_Case-2-d-   8.28  
6 TIGD4   RF0064_Case-49-d-  9.71  

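In the `patient` column, the "d" denotes days, as in "Case-x-d". What I want to do is create a new column stating whether the string in the `patient` column contains a value of less than 14d.

I have done this using the following command: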

under14 <- "-1d|-2d|-3d|-4d|-4d|-5d|-6d|-7d|-8d|-9d|-11d|-12d|-13d|-14d"

data <- data %>%
    mutate(case=ifelse(grepl(under14,data$patient),'under14days','over14days'))

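However, this seems very clunky and takes a long time to type. I will have to change my search term many times, so I would like a quicker way to do this. Perhaps some kind of regular expression is the best option, but I don't really know where to start.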

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.1.0    Rmisc_1.5       plyr_1.8.4      lattice_0.20-35 forcats_0.3.0   stringr_1.3.1   dplyr_0.7.5     purrr_0.2.5    
 [9] readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     cellranger_1.1.0 pillar_1.2.3     compiler_3.5.0   bindr_0.1.1      tools_3.5.0      lubridate_1.7.4 
 [8] jsonlite_1.5     nlme_3.1-137     gtable_0.2.0     pkgconfig_2.0.1  rlang_0.2.1      psych_1.8.4      cli_1.0.0       
[15] rstudioapi_0.7   yaml_2.1.19      parallel_3.5.0   haven_1.1.1      bindrcpp_0.2.2   xml2_1.2.0       httr_1.3.1      
[22] hms_0.4.2        grid_3.5.0       tidyselect_0.2.4 glue_1.2.0       R6_2.2.2         foreign_0.8-70   modelr_0.1.2    
[29] reshape2_1.4.3   magrittr_1.5     scales_0.5.0     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[36] utf8_1.1.4       stringi_1.2.3    lazyeval_0.2.1   munsell_0.5.0    broom_0.4.4      crayon_1.3.4    
> 

【Question comments】:

    Tags: r regex dplyr tidyverse


    【Solution 1】:

    Since the format of `patient` is well defined, one possible solution in base R is to use `gsub` to extract the days and then check the range:

    df$case <- ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", df$patient)) <= 14,
                                                                  "under14days", "over14days")
    

    Similarly, the OP can modify the code used in `mutate` to:

    library(dplyr)
    df <- df %>% 
    mutate(case = ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", patient)) <= 14,
                                                              "under14days", "over14days"))
    
    df
    
    #   protein           patient value        case
    # 1  BOD1L2  RF0064_Case-9-d- 10.40 under14days
    # 2  PPFIA2 RF0064_Case-20-d-  7.83  over14days
    # 3   STAT4 RF0064_Case-11-d- 11.00 under14days
    # 4  TOM1L2 RF0064_Case-29-d- 13.00  over14days
    # 5  SH2D2A  RF0064_Case-2-d-  8.28 under14days
    # 6   TIGD4 RF0064_Case-49-d-  9.71  over14days
    

    Data:

    df <- read.table(text = 
    "protein patient           value 
    1 BOD1L2  RF0064_Case-9-d-   10.4  
    2 PPFIA2  RF0064_Case-20-d-  7.83 
    3 STAT4   RF0064_Case-11-d-  11.0 
    4 TOM1L2  RF0064_Case-29-d-  13.0  
    5 SH2D2A  RF0064_Case-2-d-   8.28  
    6 TIGD4   RF0064_Case-49-d-  9.71",
    header = TRUE, stringsAsFactors = FALSE)
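
Because the question mentions having to change the search term many times, the extraction above could be wrapped in a small helper that takes the cutoff as an argument. This is only a sketch: `flag_days()` and its `cutoff` argument are hypothetical names that do not appear in the original answer, and it assumes every `patient` string follows the `Case-<number>-d-` layout shown in the data.

```r
# Hypothetical helper (not from the original answer): label each patient
# string according to whether its day count is at or below a cutoff.
# Assumes every string contains "Case-<digits>-d-".
flag_days <- function(patient, cutoff = 14) {
  # Keep only the digits captured between "Case-" and "-d-"
  days <- as.integer(sub(".*Case-(\\d+)-d-.*", "\\1", patient))
  ifelse(days <= cutoff,
         paste0("under", cutoff, "days"),
         paste0("over", cutoff, "days"))
}

patients <- c("RF0064_Case-9-d-", "RF0064_Case-20-d-", "RF0064_Case-11-d-")
flag_days(patients)               # default cutoff of 14
flag_days(patients, cutoff = 10)  # a different cutoff, no pattern to retype
```

Inside a pipeline this would be called as `mutate(case = flag_days(patient, 14))`. A string that does not match the assumed layout is returned unchanged by `sub()`, so `as.integer()` yields `NA` with a warning for that row.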
    

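    【Comments】:

      【Solution 2】:

      We can also extract the number directly with a regular expression. `(?<=-)` is a lookbehind: it anchors the match at the position immediately after a "-" without including the "-" in the match.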

      library(tidyverse)
      
      dat2 <- dat %>%
        mutate(Day = as.numeric(str_extract(patient, pattern = "(?<=-)[0-9]*"))) %>% 
        mutate(case = ifelse(Day <= 14,'under14days','over14days'))
      dat2
      #   protein           patient value Day        case
      # 1  BOD1L2  RF0064_Case-9-d- 10.40   9 under14days
      # 2  PPFIA2 RF0064_Case-20-d-  7.83  20  over14days
      # 3   STAT4 RF0064_Case-11-d- 11.00  11 under14days
      # 4  TOM1L2 RF0064_Case-29-d- 13.00  29  over14days
      # 5  SH2D2A  RF0064_Case-2-d-  8.28   2 under14days
      # 6   TIGD4 RF0064_Case-49-d-  9.71  49  over14days
      

      Data

      dat <- read.table(text = "    protein patient           value 
      1 BOD1L2  'RF0064_Case-9-d-'   10.4  
      2 PPFIA2  'RF0064_Case-20-d-'  7.83 
      3 STAT4   'RF0064_Case-11-d-'  11.0 
      4 TOM1L2  'RF0064_Case-29-d-'  13.0  
      5 SH2D2A  'RF0064_Case-2-d-'   8.28  
      6 TIGD4   'RF0064_Case-49-d-'  9.71",
                        header = TRUE, stringsAsFactors = FALSE)
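
To see what the lookbehind does in isolation, the same idea can be tried on a few strings without loading any packages. This sketch uses base R's `regexpr()` with `perl = TRUE` (lookbehind requires the PCRE engine there) together with `regmatches()`. It uses `[0-9]+` rather than `[0-9]*` so that an empty match is not possible, which is a small variation on the pattern in the answer rather than the answer's own code. The vector `x` is sample input made up for illustration.

```r
x <- c("RF0064_Case-9-d-", "RF0064_Case-20-d-", "RF0064_Case-11-d-")

# "(?<=-)" asserts that the previous character is "-" without consuming it,
# so the match itself contains only the digits.
m <- regexpr("(?<=-)[0-9]+", x, perl = TRUE)
days <- as.numeric(regmatches(x, m))
days

ifelse(days <= 14, "under14days", "over14days")
```

Note that `regmatches()` silently drops elements with no match, so this form is only safe when every string is known to contain a hyphen followed by digits; `str_extract()` returns `NA` for such elements instead.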
      

      【Comments】:

        【Solution 3】:

        One possibility is to use `tidyr::separate`:

        library(tidyverse)
        df %>%
            separate(patient, into = c("ID1", "Days", "ID2"), sep = "-", extra = "merge", remove = F) %>%
            mutate(case = ifelse(as.numeric(Days) <= 14, "under14days", "over14days")) %>%
            select(-ID1, -ID2)
        #  protein           patient Days value        case
        #1  BOD1L2  RF0064_Case-9-d-    9 10.40 under14days
        #2  PPFIA2 RF0064_Case-20-d-   20  7.83  over14days
        #3   STAT4 RF0064_Case-11-d-   11 11.00 under14days
        #4  TOM1L2 RF0064_Case-29-d-   29 13.00  over14days
        #5  SH2D2A  RF0064_Case-2-d-    2  8.28 under14days
        #6   TIGD4 RF0064_Case-49-d-   49  9.71  over14days
        

        Sample data

        df <-read.table(text =
            "    protein patient           value
        1 BOD1L2  RF0064_Case-9-d-   10.4
        2 PPFIA2  RF0064_Case-20-d-  7.83
        3 STAT4   RF0064_Case-11-d-  11.0
        4 TOM1L2  RF0064_Case-29-d-  13.0
        5 SH2D2A  RF0064_Case-2-d-   8.28
        6 TIGD4   RF0064_Case-49-d-  9.71  ", header = T, row.names = 1)
        

        【Comments】:

        • You can set `convert = TRUE` in the `separate` function to convert the numeric values automatically.
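
A minimal sketch of that suggestion, assuming `tidyr` is installed. The three-row `df` below is a cut-down copy of the sample data, built with `data.frame()` only to keep the example self-contained.

```r
library(tidyr)

df <- data.frame(
  protein = c("BOD1L2", "PPFIA2", "STAT4"),
  patient = c("RF0064_Case-9-d-", "RF0064_Case-20-d-", "RF0064_Case-11-d-"),
  value   = c(10.4, 7.83, 11.0),
  stringsAsFactors = FALSE
)

# convert = TRUE runs type conversion on the new columns, so Days comes back
# as an integer and no explicit as.numeric() is needed.
out <- separate(df, patient, into = c("ID1", "Days", "ID2"),
                sep = "-", extra = "merge", remove = FALSE, convert = TRUE)
out$case <- ifelse(out$Days <= 14, "under14days", "over14days")
out[, c("protein", "patient", "Days", "value", "case")]
```

The `ID1` and `ID2` columns are left as character because their contents do not look numeric; only `Days` changes type.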