Fast way for character matching in R: answers

【Question title】: Fast way for character matching in R
【Posted】: 2017-03-30 01:49:07
【Question】:

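I am trying to find out whether the elements of one vector of characters map to (occur in) those of another, and I am looking for a fast way to do this in R.

Specifically, my alphabet is the amino acids: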

aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')

I have a vector of peptide sequences and a vector of protein sequences:

set.seed(1)
peptides.vec <- sapply(1:100,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,8,12)),replace=TRUE)],collapse=""))
proteins.vec <- sapply(1:1000,function(p) paste(aa.LETTERS[sample(20,ceiling(runif(1,200,400)),replace=TRUE)],collapse=""))

For each peptide sequence in peptides.vec, I'd like to check whether it occurs in any of the sequences in proteins.vec.

This is one obvious way to do it:

mapping.mat <- do.call(rbind,lapply(peptides.vec,function(p){
   grepl(p,proteins.vec)
}))

Another is to use the Biostrings Bioconductor package:

require(Biostrings)
peptides.set <- AAStringSet(x=peptides.vec)
proteins.set <- AAStringSet(x=proteins.vec)
mapping.mat <- vcountPDict(peptides.set,proteins.set)

Both are rather slow for the dimensions I'm working with:

> microbenchmark(do.call(rbind,lapply(peptides.vec,function(p){
   grepl(p,proteins.vec)
 })),times=100)
Unit: milliseconds
                                                                             expr      min       lq     mean   median       uq      max neval
 do.call(rbind, lapply(peptides.vec, function(p) {     grepl(p, proteins.vec) })) 477.2509 478.8714 482.8937 480.4398 484.3076 509.8098   100
> microbenchmark(vcountPDict(peptides.set,proteins.set),times=100)
Unit: milliseconds
                                    expr    min       lq     mean   median       uq      max neval
 vcountPDict(peptides.set, proteins.set) 283.32 284.3334 285.0205 284.7867 285.2467 290.6725   100

Any idea how to do this faster?

【Comments】:

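Off the top of my head (untested), adding fixed = TRUE usually improves speed. See also the "stringi" package.
(On another note, leaving the whitespace out of your code does not make it run any faster.)
Actually, depending on the nature of the actual data you are working with, funBASE_2 might be faster in some cases. At least that is what I got in some tests.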

Tags: r character match grepl

【Solution 1】:

As I mentioned in the comments, adding fixed = TRUE gives some performance improvement, and "stringi" is likely to give a good boost too.

Here are some tests:
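
For context, fixed = TRUE tells grepl to treat the pattern as a literal string rather than a regular expression. That should be safe here because the peptides consist only of the 20 amino-acid letters, so they contain no regex metacharacters. A small base-R illustration on toy strings (these strings are made up and are not part of the question's data):

```r
# With the default (regex) matching, "." matches any single character
grepl("A.C", c("ABC", "A.C"))                # TRUE TRUE

# With fixed = TRUE the pattern is matched literally
grepl("A.C", c("ABC", "A.C"), fixed = TRUE)  # FALSE TRUE

# For patterns made only of letters, both modes give the same answer
x <- c("GPAVL", "MCFYW")
agree <- identical(grepl("PAV", x), grepl("PAV", x, fixed = TRUE))
```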

N <- as.integer(length(proteins.vec))

funOP <- function() {
  do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec)))
}

funBASE_1 <- function() {
  # Just adds "fixed = TRUE"
  do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec, fixed = TRUE)))
}

funBASE_2 <- function() {
  # Does away with the `do.call` but probably won't improve performance
  vapply(peptides.vec, function(x) grepl(x, proteins.vec, fixed = TRUE), logical(N))
}

library(stringi)
funSTRINGI <- function() {
  # Should be considerably faster
  vapply(peptides.vec, function(x) stri_detect_fixed(proteins.vec, x), logical(N))
}

library(microbenchmark)
microbenchmark(funOP(), funBASE_1(), funBASE_2(), funSTRINGI())
# Unit: milliseconds
#          expr        min         lq      mean     median         uq       max neval
#       funOP() 344.500600 348.562879 352.94847 351.585206 356.508197 371.99683   100
#   funBASE_1() 128.724523 129.763464 132.55028 132.198112 135.277821 139.65782   100
#   funBASE_2() 128.564914 129.831660 132.33836 131.607216 134.380077 140.46987   100
#  funSTRINGI()   8.629728   8.825296   9.22318   9.038496   9.444376  11.28491   100

Go with "stringi"!

【Discussion】:
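
One detail worth checking when swapping between these functions: the vapply versions return a matrix with one row per protein and one column per peptide, which is the transpose of what the do.call(rbind, ...) versions return. A minimal base-R sketch on toy data (the pep and prot vectors are invented for illustration):

```r
# Toy data, invented for illustration only
pep  <- c("GPA", "VLI", "WWW")
prot <- c("AAGPAVV", "MVLIGPA", "CCCC", "KVLIK")

# rbind version: one row per peptide, one column per protein
by_rbind <- do.call(rbind, lapply(pep, function(p) grepl(p, prot, fixed = TRUE)))

# vapply version: one row per protein, one column per peptide
by_vapply <- vapply(pep, function(p) grepl(p, prot, fixed = TRUE),
                    logical(length(prot)))

dim(by_rbind)   # 3 4
dim(by_vapply)  # 4 3

# Same content once one of them is transposed
same <- all(by_rbind == t(by_vapply))

# Per-peptide answer to "does it occur in any protein?"
hit_any <- colSums(by_vapply) > 0
```

If downstream code expects the peptide-by-protein layout of the original mapping.mat, wrapping the vapply result in t() should restore it.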

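Would you consider a multi-core version of this? proteins.vec is quite large in reality, though.
@dan, I don't really have experience with that, but I'd imagine this would be a good candidate for it.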
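
For what a multi-core version might look like, here is a rough sketch using mclapply from the parallel package that ships with R. It is not from the answer and has not been benchmarked. The toy vectors and the choice of two cores are assumptions, and whether the forking overhead pays off will depend on the real data size.

```r
library(parallel)  # part of the standard R distribution

# Toy data, invented for illustration only
pep  <- c("GPA", "VLI", "WWW", "KVL")
prot <- c("AAGPAVV", "MVLIGPA", "CCCC", "KVLIK")

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there (2 cores is an arbitrary choice)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task handles one peptide and returns a logical vector over the proteins
hits <- mclapply(pep, function(p) grepl(p, prot, fixed = TRUE),
                 mc.cores = n_cores)

# One row per peptide, one column per protein, as in the question's mapping.mat
mapping.mat <- do.call(rbind, hits)
```

Splitting over peptides keeps each task independent. With a very large proteins.vec, splitting over chunks of proteins instead and combining with cbind may balance the work better.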