【Question title】: Remove duplicated substrings in R
【Posted】: 2021-02-15 19:39:25
【Question】:

I have a data frame in R like the following:

   bacteria    sample
1    A         HM_001
2    B         HM_001_HM_001
3    C         A2_HM_001
4    D         A2_HM_001_HM_001
5    E         HM_002
6    F         HM_002_HM_002
7    G         A2_HM_002
8    H         A2_HM_002_HM_002

and I would like to remove the duplicated substrings from the sample column, so that the final output looks like this:

   bacteria    sample
1    A         HM_001
2    B         HM_001
3    C         A2_HM_001
4    D         A2_HM_001
5    E         HM_002
6    F         HM_002
7    G         A2_HM_002
8    H         A2_HM_002

【Comments】:

    Tags: r regex stringr


    【Solution 1】:

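    Use a regular expression with a backreference in gsub. The pattern captures an ID made of uppercase letters, an underscore, and digits (for example HM_001), then matches an immediately following repeat of that ID, optionally separated by an underscore, and replaces the whole match with the single captured ID: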

    df1$sample_new <- with(df1, gsub("([A-Z]+_\\d+)_?\\1+", "\\1", sample))
    

    Output:

    df1
    #   bacteria           sample sample_new
    #1        A           HM_001     HM_001
    #2        B    HM_001_HM_001     HM_001
    #3        C        A2_HM_001  A2_HM_001
    #4        D A2_HM_001_HM_001  A2_HM_001
    #5        E           HM_002     HM_002
    #6        F    HM_002_HM_002     HM_002
    #7        G        A2_HM_002  A2_HM_002
    #8        H A2_HM_002_HM_002  A2_HM_002
    
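The pattern in the answer is written for a single repeat separated by an underscore, which is all the question's data contains. If an ID could appear three or more times in a row, a slightly different grouping may be needed. The following is a minimal sketch under that assumption, not the answerer's method: `collapse_repeats` is a hypothetical helper name, the fourth test string is made up for illustration, and the sketch assumes every repeat is preceded by an underscore and that IDs look like uppercase letters, an underscore, and digits.

```r
# Hypothetical helper (not part of the original answer): collapse any run of
# two or more underscore-separated copies of the same ID into a single copy.
collapse_repeats <- function(x) {
  # ([A-Z]+_\\d+)  captures an ID such as "HM_001"
  # (?:_\\1)+      matches one or more further copies, each preceded by "_"
  gsub("([A-Z]+_\\d+)(?:_\\1)+", "\\1", x, perl = TRUE)
}

# The first three strings come from the question; the last one is an
# illustrative triple repeat that does not appear in the original data.
samples <- c("HM_001", "HM_001_HM_001", "A2_HM_001_HM_001",
             "HM_002_HM_002_HM_002")
result <- collapse_repeats(samples)
print(result)
```

To overwrite the column in place rather than adding `sample_new`, the same call could be assigned back, for example `df1$sample <- collapse_repeats(df1$sample)`.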

    Data:

    df1 <- structure(list(bacteria = c("A", "B", "C", "D", "E", "F", "G", 
    "H"), sample = c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001", 
    "HM_002", "HM_002_HM_002", "A2_HM_002", "A2_HM_002_HM_002")), 
    class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8"))
    

