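【Question title】: Sort variable according to multiple regex substrings
【Posted】: 2018-08-24 12:58:52
【Question description】:

I am trying to order a variable in R: a list of file names containing three substrings that I want to sort on. The file names are in the following format: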

MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2

I want to sort this list first on the MAF substring, then on the MHC substring, then on the S substring, so that the order is:

MAF001.incMHC.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S1
MAF001.noMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S2
MAF002.incMHC.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF002.incMHC.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.incMHC.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF003.incMHC.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF003.noMHC.zPGS.S2

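After seeing the answers to this question about a single substring, I played around with gsub: R Sort strings according to substring

But I am not sure how to extend this idea to multiple substrings (of mixed character and numeric classes) within a string.

【Comments】:

    Tags: r regex sorting substring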


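    【Solution 1】:

    This result matches your desired output, but it sorts only on MAF and S. I don't understand how you want to sort on the MHC string; please elaborate on that part if this answer doesn't fit your needs.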

    library(stringr)
    maf <- str_extract(filenames, "MAF\\d+\\.")
    mhc <- str_extract(filenames, "\\..*MHC.*\\.")
    s <- str_extract(filenames, "S\\d+$")
    
    library(magrittr)
    library(dplyr)
    
    data.frame(filenames, maf, mhc, s) %>% 
      arrange(maf, s) %>% 
      select(filenames)
    

    The output is:

                           filenames
    1          MAF001.incMHC.zPGS.S1
    2  MAF001.noMHC_incRS148.zPGS.S1
    3           MAF001.noMHC.zPGS.S1
    4          MAF001.incMHC.zPGS.S2
    5  MAF001.noMHC_incRS148.zPGS.S2
    6           MAF001.noMHC.zPGS.S2
    7          MAF002.incMHC.zPGS.S1
    8  MAF002.noMHC_incRS148.zPGS.S1
    9           MAF002.noMHC.zPGS.S1
    10         MAF002.incMHC.zPGS.S2
    11 MAF002.noMHC_incRS148.zPGS.S2
    12          MAF002.noMHC.zPGS.S2
    13         MAF003.incMHC.zPGS.S1
    14 MAF003.noMHC_incRS148.zPGS.S1
    15          MAF003.noMHC.zPGS.S1
    16         MAF003.incMHC.zPGS.S2
    17 MAF003.noMHC_incRS148.zPGS.S2
    18          MAF003.noMHC.zPGS.S2
    

    where filenames is

    filenames <- read.table(text="MAF001.incMHC.zPGS.S1
    MAF002.incMHC.zPGS.S1
    MAF003.incMHC.zPGS.S1
    MAF001.incMHC.zPGS.S2
    MAF002.incMHC.zPGS.S2
    MAF003.incMHC.zPGS.S2
    MAF001.noMHC_incRS148.zPGS.S1
    MAF002.noMHC_incRS148.zPGS.S1
    MAF003.noMHC_incRS148.zPGS.S1
    MAF001.noMHC_incRS148.zPGS.S2
    MAF002.noMHC_incRS148.zPGS.S2
    MAF003.noMHC_incRS148.zPGS.S2
    MAF001.noMHC.zPGS.S1
    MAF002.noMHC.zPGS.S1
    MAF003.noMHC.zPGS.S1
    MAF001.noMHC.zPGS.S2
    MAF002.noMHC.zPGS.S2
    MAF003.noMHC.zPGS.S2", header=FALSE, stringsAsFactors=FALSE)$V1  # keep the single column as a character vector
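If the MHC part should take part in the sort as well, one possible approach is to rank it against an explicit list of values. The following is only a sketch in base R. The key order (MAF, then S, then MHC) and the MHC order (incMHC, noMHC_incRS148, noMHC) are assumptions read off the desired output in the question, and the names `mhc_levels` and `mhc_rank` are made up for illustration.

```r
# Small sample in the same format as the question's file names
filenames <- c("MAF001.noMHC.zPGS.S1", "MAF001.incMHC.zPGS.S2",
               "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.incMHC.zPGS.S1")

# Split on literal dots and pick out the three parts
parts <- strsplit(filenames, ".", fixed = TRUE)
maf <- sapply(parts, `[`, 1)
mhc <- sapply(parts, `[`, 2)
s   <- as.integer(sub("^S", "", sapply(parts, `[`, 4)))  # numeric part of S

# Assumed custom order for the MHC part (taken from the desired output)
mhc_levels <- c("incMHC", "noMHC_incRS148", "noMHC")
mhc_rank   <- match(mhc, mhc_levels)

# order() uses the later keys to break ties in the earlier ones
sorted <- filenames[order(maf, s, mhc_rank)]
print(sorted)
```

Converting the S suffix to an integer is a precaution so that a hypothetical S10 would sort after S2 rather than before it.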
    

    【Comments】:

      【Solution 2】:

      Here is a one-liner in base R:

      bar <- foo[order(sapply(strsplit(foo, "\\."), function(x) paste(x[1], x[4])))]
      head(data.frame(result = bar), 10)
      
                                result
      1          MAF001.incMHC.zPGS.S1
      2  MAF001.noMHC_incRS148.zPGS.S1
      3           MAF001.noMHC.zPGS.S1
      4          MAF001.incMHC.zPGS.S2
      5  MAF001.noMHC_incRS148.zPGS.S2
      6           MAF001.noMHC.zPGS.S2
      7          MAF002.incMHC.zPGS.S1
      8  MAF002.noMHC_incRS148.zPGS.S1
      9           MAF002.noMHC.zPGS.S1
      10         MAF002.incMHC.zPGS.S2
      

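      Explanation:

      • Split each string on . using strsplit: strsplit(foo, "\\.")
      • Extract and combine elements 1 and 4: paste(x[1], x[4])
      • Use order to get the ordering of the combined keys
      • Index with foo[] to get the corresponding values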
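The steps above can be traced one at a time on a small sample. This is a sketch rather than a replacement for the one-liner; the four sample file names and the intermediate names `parts`, `key` and `idx` are chosen here for illustration.

```r
# Small sample in the same format as foo
foo <- c("MAF002.noMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
         "MAF001.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S1")

# Step 1: split on literal dots (fixed = TRUE avoids regex escaping)
parts <- strsplit(foo, ".", fixed = TRUE)

# Step 2: build one sort key per file name from elements 1 and 4
key <- sapply(parts, function(x) paste(x[1], x[4]))

# Step 3: positions that would sort the keys
idx <- order(key)

# Step 4: reorder the original vector
bar <- foo[idx]
print(bar)
```

The third and fourth sample names share the key "MAF001 S1". order() leaves such ties in their original order, so the MHC part is never compared: names with the same MAF and S values simply keep the order they had in the input.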

      Data (foo):

      foo <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
      "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
      "MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
      "MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
      "MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
      "MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
      "MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
      )
      

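      【Comments】:

      • This works great, thanks, and thank you for the explanation. I assume this ends up sorted by the MHC substring automatically because R already has that substring in the order I want?
      • strsplit(foo, '.', fixed = TRUE) also works.
      【Solution 3】:

      Using tidyr and dplyr: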

      library(tidyr)
      library(dplyr)
      
      df <- data.frame(filenames = c(...))
      
      pattern = "^([^.]+)\\.([^.]+)"
      df %>%
        extract(filenames, 
                into = c("maf", "mhc"), 
                regex = pattern, remove = FALSE) %>%
        arrange(maf, mhc) %>%
        select(filenames)
      

      which yields

                             filenames
      1          MAF001.incMHC.zPGS.S1
      2          MAF001.incMHC.zPGS.S2
      3           MAF001.noMHC.zPGS.S1
      4           MAF001.noMHC.zPGS.S2
      5  MAF001.noMHC_incRS148.zPGS.S1
      6  MAF001.noMHC_incRS148.zPGS.S2
      7          MAF002.incMHC.zPGS.S1
      8          MAF002.incMHC.zPGS.S2
      9           MAF002.noMHC.zPGS.S1
      10          MAF002.noMHC.zPGS.S2
      11 MAF002.noMHC_incRS148.zPGS.S1
      12 MAF002.noMHC_incRS148.zPGS.S2
      13         MAF003.incMHC.zPGS.S1
      14         MAF003.incMHC.zPGS.S2
      15          MAF003.noMHC.zPGS.S1
      16          MAF003.noMHC.zPGS.S2
      17 MAF003.noMHC_incRS148.zPGS.S1
      18 MAF003.noMHC_incRS148.zPGS.S2
      

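      【Comments】:

        【Solution 4】:

        Many good solutions have already been added here. I am adding another one that uses only vectors.

        Note: the OP aims to sort on the MAF, MHC and S substrings. I stuck to that rule and sorted on all three, so the result of my answer may not match the other answers.

        Approach:

        1. Use regmatches to find each of the substrings described in the OP
        2. Use paste to build the strings on which the sort is based
        3. Use setNames to set the names of the vector
        4. Order the vector by its names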
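The four steps can be written out with intermediate variables, which makes it easier to inspect what each pattern extracts. This is a sketch on a three-element sample; the names `maf`, `mhc`, `s`, `key` and `named` are introduced here for illustration.

```r
v <- c("MAF002.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2", "MAF001.incMHC.zPGS.S1")

# Step 1: extract each substring with regexpr() + regmatches()
maf <- regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE))
mhc <- regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE))
s   <- regmatches(v, regexpr("\\w+\\d+$", v, perl = TRUE))

# Step 2: combine the parts into one sort key per element
key <- paste(maf, mhc, s)

# Step 3: attach the keys as names
named <- setNames(v, key)

# Step 4: order by the names and index the original vector
result <- v[order(names(named))]
print(result)
```

Two points to keep in mind. Ordering by `names(named)` gives the same result as `v[order(key)]`, so the setNames step is a convenience rather than a requirement. Also, regmatches() used with regexpr() returns only the substrings that matched, so if a file name lacked one of the parts the extracted vectors would have different lengths and no longer line up element by element.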

          v[order(names(setNames(v, 
                paste(regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE)),
                      regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE)),
                      regmatches(v, regexpr("\\w+\\d+$", v, perl = TRUE))
                     ))))]
          #Result
          [1] "MAF001.incMHC.zPGS.S1"
          [2] "MAF001.incMHC.zPGS.S2"
          [3] "MAF001.noMHC.zPGS.S1"
          [4] "MAF001.noMHC.zPGS.S2"
          [5] "MAF001.noMHC_incRS148.zPGS.S1"
          [6] "MAF001.noMHC_incRS148.zPGS.S2"
          [7] "MAF002.incMHC.zPGS.S1"
          [8] "MAF002.incMHC.zPGS.S2"
          [9] "MAF002.noMHC.zPGS.S1"
          [10] "MAF002.noMHC.zPGS.S2"
          [11] "MAF002.noMHC_incRS148.zPGS.S1"
          [12] "MAF002.noMHC_incRS148.zPGS.S2"
          [13] "MAF003.incMHC.zPGS.S1"
          [14] "MAF003.incMHC.zPGS.S2"
          [15] "MAF003.noMHC.zPGS.S1"
          [16] "MAF003.noMHC.zPGS.S2"
          [17] "MAF003.noMHC_incRS148.zPGS.S1"
          [18] "MAF003.noMHC_incRS148.zPGS.S2"
          

        Data

        v <- c("MAF001.incMHC.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.noMHC.zPGS.S1", 
               "MAF001.incMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S2", "MAF001.noMHC.zPGS.S2", 
               "MAF002.incMHC.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", "MAF002.noMHC.zPGS.S1", 
               "MAF002.incMHC.zPGS.S2", "MAF002.noMHC_incRS148.zPGS.S2", "MAF002.noMHC.zPGS.S2", 
               "MAF003.incMHC.zPGS.S1", "MAF003.noMHC_incRS148.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
               "MAF003.incMHC.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", "MAF003.noMHC.zPGS.S2"
        )
        

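        【Comments】:

          【Solution 5】:

          I have a function designed specifically for tasks like this:

          Function

          library(magrittr)    # %>% and %<>%
          library(purrr)       # map(), map_chr(), map_lgl()
          library(stringr)     # str_extract()
          library(data.table)  # as.data.table(), .SD

          reg_sort <- function(x,...,verbose=F) {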
              ellipsis <-   sapply(as.list(substitute(list(...)))[-1], deparse, simplify="array")
              reg_list <-   paste0(ellipsis, collapse=',')
              reg_list %<>% strsplit(",") %>% unlist %>% gsub("\\\\","\\",.,fixed=T)
              pattern  <-   reg_list %>% map_chr(~sub("^-\\\"","",.) %>% sub("\\\"$","",.) %>% sub("^\\\"","",.) %>% trimws)
              descInd  <-   reg_list %>% map_lgl(~grepl("^-\\\"",.)%>%as.logical)
          
              reg_extr <-   pattern %>% map(~str_extract(x,.)) %>% c(.,list(x)) %>% as.data.table
              reg_extr[] %<>% lapply(., function(x) type.convert(as.character(x), as.is = TRUE))
          
              map(rev(seq_along(pattern)),~{reg_extr<<-reg_extr[order(reg_extr[[.]],decreasing = descInd[.])]})
          
              if(verbose) { tmp<-lapply(reg_extr[,.SD,.SDcols=seq_along(pattern)],unique);names(tmp)<-pattern;tmp %>% print }
          
              return(reg_extr[[ncol(reg_extr)]])
          }
          

          Data:

          vec <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
            "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
            "MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
            "MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
            "MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
            "MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
            "MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
          )
          

          Call:

          reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$")
          

          Result (a character vector):

          1          MAF001.incMHC.zPGS.S1
          2          MAF001.incMHC.zPGS.S2
          3           MAF001.noMHC.zPGS.S1
          4           MAF001.noMHC.zPGS.S2
          5  MAF001.noMHC_incRS148.zPGS.S1
          6  MAF001.noMHC_incRS148.zPGS.S2
          7          MAF002.incMHC.zPGS.S1
          8          MAF002.incMHC.zPGS.S2
          9           MAF002.noMHC.zPGS.S1
          10          MAF002.noMHC.zPGS.S2
          11 MAF002.noMHC_incRS148.zPGS.S1
          12 MAF002.noMHC_incRS148.zPGS.S2
          13         MAF003.incMHC.zPGS.S1
          14         MAF003.incMHC.zPGS.S2
          15          MAF003.noMHC.zPGS.S1
          16          MAF003.noMHC.zPGS.S2
          17 MAF003.noMHC_incRS148.zPGS.S1
          18 MAF003.noMHC_incRS148.zPGS.S2
          

          Other features:

          • Descending order (put - in front of the pattern): reg_sort(x=vec, -"^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)",-"S\\d+$")

          • Verbose mode: reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$",verbose=T) (to view and check what the regex patterns extract for sorting)
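For readers who only need ascending order and prefer to avoid extra packages, the core idea (extract one key per pattern, then order on all keys) can be sketched in base R. The helper `sort_by_patterns` below is hypothetical; it is not the reg_sort function above and does not support its descending or verbose options.

```r
# Hypothetical helper: sort x by the substrings matched by each pattern, in turn
sort_by_patterns <- function(x, patterns) {
  keys <- lapply(patterns, function(p) {
    m <- regexpr(p, x, perl = TRUE)
    out <- rep(NA_character_, length(x))  # NA where the pattern does not match
    out[m != -1] <- regmatches(x, m)      # fill in the matched substrings
    out
  })
  # order() takes the keys as successive tie-breakers
  x[do.call(order, keys)]
}

vec <- c("MAF002.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
         "MAF001.incMHC.zPGS.S1", "MAF001.incMHC.zPGS.S2")

sorted <- sort_by_patterns(vec, c("^MAF\\d+", "S\\d+$"))
print(sorted)
```

Elements whose keys all tie keep their input order, and elements where a pattern does not match get an NA key, which order() places last by default.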

          【Comments】:
