Optimized version of grep to match vector against vector: answers

【Question title】: Optimized version of grep to match vector against vector
【Posted】: 2014-02-16 12:32:27
【Question】:

Suppose I have two character vectors a and b:

set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

I want to apply the grep function to find, for each value of a, its matches in b. Of course I can do:

sapply(a, grep, b)

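But I wonder whether there is a more efficient way, because I will have to run this many times on much larger vectors in a simulation (note that I do not want to use mclapply either, since I already use it to run each iteration of my simulation):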

system.time(lapply(seq(100000), function(x) sapply(a, grep, b)))
library(parallel)
system.time(mclapply(seq(100000), function(x) sapply(a, grep, b), mc.cores=8))

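【Comments】:

Is there always at most one match? If so, pmatch(a, b) sounds like a good suggestion.
If appropriate, you could also use exact matching after removing the last two characters of the items in b: match(a, substr(b, 1L, nchar(b) - 2L)). You will have to try with large data to see which approach is faster.
Thanks flodel. My example is a bit specific, but I would like a general solution. So first, pmatch will definitely help in the case where there is only one match. Then, the last 2 letters of b are not informative here, but that may not always be the case.
This reduces it to a single grep: grep(paste(a, collapse = "|"), b)
@G.Grothendieck. That answers a different question, doesn't it?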
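To make the difference between the suggestions in these comments concrete, here is a small sketch on toy data (the vectors a and b below are made up for illustration and are not the question's data). It assumes every element of a is the start of exactly one element of b. Under that assumption pmatch(a, b) returns the same positions as sapply(a, grep, b), whereas the collapsed regular expression only reports which elements of b are matched by some element of a.

```r
# Toy data, invented for illustration: each element of a starts exactly one element of b
b <- c("Gr_1_1_1", "Or_1_1_1", "PMT_2_1_1")
a <- c("Or_1_1_", "Gr_1_1_")

per_element <- unname(sapply(a, grep, b))         # position in b for each element of a
partial     <- pmatch(a, b)                       # unique partial match at the start of each string
collapsed   <- grep(paste(a, collapse = "|"), b)  # positions in b matched by any element of a

per_element  # 2 1
partial      # 2 1
collapsed    # 1 2 (sorted positions in b; the link back to elements of a is lost)
```

This is why the collapsed pattern answers a different question: it cannot tell which element of a produced which match.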

Tags: r regex optimization vector

【Solution 1】:

Since you are not using regular expressions but want to find substrings within longer strings, you can use fixed = TRUE. It is much faster.

library(microbenchmark)
microbenchmark(lapply(a, grep, b),                # original
               lapply(paste0("^", a), grep, b),   # @flodel
               lapply(a, grep, b, fixed = TRUE))

Unit: microseconds
                             expr     min       lq   median       uq     max neval
               lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857   100
  lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851   100
 lapply(a, grep, b, fixed = TRUE)  21.004  22.5885  23.8580  24.6110  33.608   100

Testing with longer vectors (1000 times the original length).

ar <- rep(a, 1000)
br <- rep(b, 1000)

library(microbenchmark)
microbenchmark(lapply(ar, grep, br),               # original
               lapply(paste0("^", ar), grep, br),  # @flodel
               lapply(ar, grep, br, fixed = TRUE))

Unit: seconds
                               expr       min        lq    median       uq       max neval
               lapply(ar, grep, br) 32.288139 32.564223 32.726149 32.97529 37.818299   100
  lapply(paste0("^", ar), grep, br) 24.997339 25.343401 25.531138 25.71615 28.238802   100
 lapply(ar, grep, br, fixed = TRUE)  2.461934  2.494759  2.513931  2.55375  4.194093   100

(This took quite a while...)
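Besides speed, fixed = TRUE also changes what counts as a match, which matters once the patterns contain regex metacharacters. The sketch below uses a made-up vector x to show this. The startsWith() line is an addition that assumes R 3.3.0 or later; it is one way to get an anchored literal match, since "^" is no longer an anchor under fixed = TRUE.

```r
x <- c("abc", "a.c", "a+c")   # toy strings, invented for illustration

as_regex   <- grep("a.c", x)                  # "." matches any character
as_literal <- grep("a.c", x, fixed = TRUE)    # "." must be a literal dot
anchored   <- grep("^a.c", x, fixed = TRUE)   # "^" is treated as a literal caret
at_start   <- which(startsWith(x, "a.c"))     # literal match at the start of the string

as_regex    # 1 2 3
as_literal  # 2
anchored    # integer(0)
at_start    # 2
```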

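【Discussion】:

I'm surprised that fixed is this much faster when the regular expression contains no "wildcards".
Well, for my specific example, using pmatch as flodel suggested (in the comments on the question) is by far more efficient. For general purposes, though, your solution above is very good, and for very long vectors flodel's formal answer is also worth testing. If you agree, I think I will try to summarize all of this in an answer of my own.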

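【Solution 2】:

Following up on my last suggestion...

The biggest problem with what you are asking is that, a priori, you need to make length(a) * length(b) comparisons. However, you can take advantage of the fact that matches here can only occur at the beginning of the strings (which is what I gathered from the comments).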

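I suggest you first split your a and b vectors into lists according to the first word of each item ("Or", "Gr", "Control", "PMT", etc.), then look for matches only within the corresponding sets. In other words, take the items in a that start with Or_ and look for matches only among the items in b that also start with Or_.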
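As a minimal sketch of this grouping idea on toy data (the vectors below are made up, and this is a simplified illustration rather than the answer's full function), each element of a is searched only within the group of b that shares its first word:

```r
first.word <- function(string) sub("_.*", "", string)

# Toy data, invented for illustration
a <- c("Or_1", "Gr_2")
b <- c("Gr_2_x", "Or_1_x", "Or_3_x", "PMT_1_x")

# Group b by first word, then search each element of a only in its own group
b.groups <- split(b, first.word(b))
matches <- lapply(a, function(x)
  grep(x, b.groups[[first.word(x)]], fixed = TRUE, value = TRUE))

names(b.groups)  # "Gr" "Or" "PMT"
matches          # list("Or_1_x", "Gr_2_x")
```

Here "Or_1" is compared with two strings instead of four, which is where the saving comes from.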

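To give you an idea of why this is efficient in terms of complexity: imagine a and b both have length n, and there are x possible prefixes, evenly distributed in a and b. Then you only need to do x * (n/x * n/x) = n * n / x comparisons, versus n * n in your case. That is x times fewer comparisons. You could even imagine repeating the process recursively using the second word, the third, and so on.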
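The arithmetic can be checked with a couple of lines. The values of n and x below are arbitrary illustrative choices, and the even spread of prefixes is the same assumption as in the paragraph above.

```r
n <- 10000   # length of both a and b (illustrative value)
x <- 25      # number of distinct prefixes, assumed evenly spread

all_pairs <- n * n                   # every element of a against every element of b
grouped   <- x * (n / x) * (n / x)   # x groups, each of size (n/x) by (n/x)

all_pairs / grouped                  # 25, i.e. x times fewer comparisons
```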

Here is the code for it:

reduced.match <- function(a, b) {

   first.word <- function(string) sub("_.*", "", string)

   a.first <- first.word(a)
   b.first <- first.word(b)
   l.first <- unique(c(a.first, b.first))
   a.first <- factor(a.first, l.first)
   b.first <- factor(b.first, l.first)
   a.split <- split(a, a.first)
   b.split <- split(b, b.first)
   a.idx.split <- split(seq_along(a), a.first)
   b.idx.split <- split(seq_along(b), b.first)

   unsorted.matches <-
     Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
         a.split, b.split, b.idx.split, USE.NAMES = FALSE)

   sorted.matches <-
     unlist(unsorted.matches, recursive = FALSE)[
       match(seq_along(a), unlist(a.idx.split))]

   return(sorted.matches)
}

# sample data
set.seed(123)
n <- 10000
words <- paste0(LETTERS, LETTERS, LETTERS)
a <- paste(sample(words[-1], n, TRUE),
           sample(words, n, TRUE), sep = "_")
b <- paste(sample(words[-2], n, TRUE),
           sample(words, n, TRUE), sep = "_")

# testing
identical(reduced.match(a, b), lapply(a, grep, b, fixed = TRUE))
# [1] TRUE

# benchmarks
system.time(reduced.match(a, b))
#    user  system elapsed 
#   0.187   0.000   0.187 
system.time(lapply(a, grep, b, fixed = TRUE))
#    user  system elapsed 
#   2.915   0.002   2.920

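【Discussion】:

Well, for my specific example, using pmatch as you suggested in the comments is more efficient. For general use, though, Sven's solution is great, and for very long vectors your formal answer is also worth testing. If you agree, I think I will try to summarize all of this in an answer of my own.
Sure. I hope you will test our methods on your own data and report the timings.
I tested 5 different methods (see below; I could not test Martin's method because it does not support some elements of a being prefixes of other elements of a). In my specific case pmatch is the most efficient, but depending on the situation reduced.match may do better. Which answer do you think I should mark as accepted? The one where I compare all the methods, or would you like to write a formal answer saying I should use pmatch, so that I can accept that one? I'm not sure what the best practice is on Stack Overflow.
No pressure, choosing the accepted answer is a personal and subjective matter. Some people pick the one with the most votes, others the most helpful one, others the one that gave the fastest code, and so on. All the answers you got are correct and on point, so you can pick any of them and nobody will be offended.
Thanks, I will pick mine, since it summarizes the results of all the proposed solutions (people interested in the methods above can see them below).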

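【Solution 3】:

If a and b are sorted (and a is unique) and one is interested in exact matches at the start of the string, then the following C code will usually be relatively efficient (something on the order of length(a) + length(b) string comparisons?). The R wrapper makes sure that the C code and the R user each get appropriate data.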

f3 <- local({
    library(inline)
    .amatch <- cfunction(c(a="character", b="character"),
             includes="#include <string.h>", '
         int len_a = Rf_length(a), len_b = Rf_length(b);
         SEXP ans = PROTECT(allocVector(INTSXP, len_b));
         memset(INTEGER(ans), 0, sizeof(int) * len_b);
         int cmp, i = 0, j = 0;
         while (i < len_a) {
             const char *ap = CHAR(STRING_ELT(a, i));
             while (j < len_b) {
                 cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
                 if (cmp > 0) {
                     j += 1;
                 } else break;
             }
             if (j == len_b)
                 break;
             if (cmp == 0)
                 INTEGER(ans)[j++] = i + 1;
             else if (cmp < 0) i += 1;
         }
         UNPROTECT(1);
         return(ans);')

    function(a, b) {
        locale = Sys.getlocale("LC_COLLATE")
        if (locale != "C") {
            warning('temporarily trying to set LC_COLLATE to "C"')
            Sys.setlocale("LC_COLLATE", "C")
            on.exit(Sys.setlocale("LC_COLLATE", locale))
        }
        a0 <- a
        lvls <- unique(a)
        a <- sort(lvls)
        o <- order(b)
        idx <- .amatch(a, b[o])[order(o)]
        f <- factor(a[idx[idx != 0]], levels=lvls)
        split(which(idx != 0), f)[a0]
    }
})

Compared with this semi-friendly grep

f0 <- function(a, b) {
    a0 <- a
    a <- unique(a)
    names(a) <- a
    lapply(a, grep, b, fixed=TRUE)[a0]
}

which allows for (but does not pay much of a price for) duplicated values of a, the timings for @flodel's data set are

> microbenchmark(f0(a, b), f3(a, b), times=5)
Unit: milliseconds
     expr       min        lq    median        uq       max neval
 f0(a, b) 431.03595 431.45211 432.59346 433.96036 434.87550     5
 f3(a, b)  15.70972  15.75976  15.93179  16.05184  16.06767     5

Unfortunately, this simple algorithm fails when one element is a prefix of another

> str(f0(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr(0)
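Whether a given a is affected by this limitation can be checked up front. The helper below is a hypothetical addition, not part of the original answer. It assumes R 3.3.0 or later (for startsWith() and for sort(method = "radix"), which orders strings as in the C locale) and reports whether any element of a is, in its entirety, a prefix of another.

```r
# Hypothetical helper (not from the original answer): TRUE if some element
# of a is, as a whole, a prefix of another element of a.
has_prefix_overlap <- function(a) {
  a <- sort(unique(a), method = "radix")   # radix sort uses C-locale (byte) order
  if (length(a) < 2L) return(FALSE)
  # In byte order, every string that starts with s directly follows s,
  # so comparing each element with its successor is sufficient.
  any(startsWith(a[-1L], a[-length(a)]))
}

has_prefix_overlap(c("a", "ab"))     # TRUE:  "a" is a prefix of "ab"
has_prefix_overlap(c("aab", "aac"))  # FALSE: shared prefix, but neither contains the other
```

If it returns TRUE, the merge-style scan in f3 is not a safe replacement for grep on that input.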

Contrary to the comment, for this data set (a random number seed needs to be specified for reproducibility)

set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste0(genes, "_", rep(1:50, each=length(genes)), "_")
b0 <- paste0(a0, "1")
ite <- 50
lg <- 1000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

f3() returns the same values as grep

> identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE))
[1] TRUE

The algorithms f0 and f3 have been modified to return indices in a named list.

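【Discussion】:

@Martin Morgan Actually, with the following code categ <- c("Control", "Gr", "Or", "PMT", "P450"); genes <- paste(categ, rep(1:40, each=length(categ)), sep="_"); a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep=""); b0 <- paste (a0, "1", sep=""); ite <- 50; lg <- 1000; b <- b0[1:lg]; a <- (a0[1:lg])[sample(seq(lg), ite)];res5 <- unlist(f3(a, b)), the method does not work. First, it returns names instead of indices, and second, only a small fraction of the queries give a result. Any idea?
@LudoDuvaux See the response in the answer: the result of f3(a, b) is the same as lapply(a, grep, b, value=TRUE, fixed=TRUE), so if there is a difference it would be best to be precise, e.g. a single value of a and two values of b for which the result is not as expected; I have modified f3 to return indices rather than values.
@MartinMorgan OK, I see. In my previous example (the one in the question), no element is a prefix of another, which is why identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE)) is TRUE. In my broader example (see my answer below) there are many. So f3 fails to return the right values many times. I'm not sure whether f3 can be fixed without changing it completely (by the way, since I don't program in C, I can't offer any help with that function).
You need to specify a random number seed to make your data reproducible. Your data do not include values of a where one value (in its entirety) is a prefix of another (e.g. "aa" as a whole is a prefix of "aab" and causes problems, but "aab" and "aac" are fine even though they share a common prefix). My best guess is that you are using a locale (what does Sys.getlocale() report?) that causes sorting in R to differ from the comparison in C; you could try Sys.setlocale(locale="C") at the start of your session. I have updated my answer, but it sounds like pmatch fits your use case.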

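【Solution 4】:

I tested the different solutions proposed by @flodel and @Sven Hohenstein on my own data (note that @Martin Morgan's method cannot be tested for now, because it does not support the case where some elements of a are prefixes of other elements of a).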

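Important: although all methods give the same result in my specific case, keep in mind that each one works in its own way, so they may give different results depending on the structure of the data.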
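A small made-up example shows how the results can diverge. When an element of a is the start of more than one element of b, grep returns all of the positions, whereas pmatch treats the partial match as ambiguous and returns NA.

```r
# Toy data, invented for illustration: "a" is the start of both elements of b
a <- c("a", "ab")
b <- c("ab1", "ac1")

grep_res   <- lapply(a, grep, b, fixed = TRUE)  # every match for each element of a
pmatch_res <- pmatch(a, b)                      # NA when the partial match is not unique

grep_res    # list(1:2, 1L)
pmatch_res  # NA 1
```

So pmatch is only a drop-in replacement when each element of a matches at most one element of b, as in the tests below.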

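Here is a quick summary (the results are shown below):

in my tests, length(a) was set to 200 or 400 and length(b) to 2,000 or 10,000
each value of a has exactly one match in b
the best method really depends on the problem, and all of them are worth testing for each specific case
pmatch always performs very well (especially for short vectors a and b, with lengths below 100 and 1,000 respectively, not shown below)
sapply(a, grep, b, fixed=T) and the reduced.match function (flodel's method) always perform better than sapply(a, grep, b) and sapply(paste0("^", a), grep, b).

Here is the reproducible code along with the test results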

# set up the data set
library(microbenchmark)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")

# length(a)==200 & length(b)==2,000
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 188.810585 189.256705 189.827765
  as.vector(sapply(paste0("^", a), grep, b)) 157.600510 158.113507 158.560619
 as.vector(sapply(a, grep, b, fixed = TRUE))  23.954520  24.109275  24.269991
                 unlist(reduced.match(a, b))   7.999203   8.087931   8.140260
                                pmatch(a, b)   7.459394   7.489923   7.586329
         uq        max neval
 191.412879 222.131220   100
 160.129008 186.695822   100
  25.923741  26.380578   100
   8.237207  10.063783   100
   7.637560   7.888938   100


# length(a)==400 & length(b)==2,000
ite <- 400
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 376.85638 379.58441 380.46107
  as.vector(sapply(paste0("^", a), grep, b)) 314.38333 316.79849 318.33426
 as.vector(sapply(a, grep, b, fixed = TRUE))  49.56848  51.54113  51.90420
                 unlist(reduced.match(a, b))  13.31185  13.44923  13.57679
                                pmatch(a, b)  15.15788  15.24773  15.36917
        uq       max neval
 383.26959 415.23281   100
 320.92588 346.66234   100
  52.02379  81.65053   100
  15.56503  16.83750   100
  15.45680  17.58592   100


# length(a)==200 & length(b)==10,000
ite <- 200
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 975.34831 978.55579 981.56864
  as.vector(sapply(paste0("^", a), grep, b)) 808.79299 811.64919 814.16552
 as.vector(sapply(a, grep, b, fixed = TRUE)) 119.64240 120.41718 120.73548
                 unlist(reduced.match(a, b))  34.23893  34.56048  36.23506
                                pmatch(a, b)  37.57552  37.82128  38.01727
        uq        max neval
 986.17827 1061.89808   100
 824.41931  854.26298   100
 121.20605  151.43524   100
  36.57896   43.33285   100
  38.21910   40.87238   100



# length(a)==400 & length(b)==10,000
ite <- 400
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 1977.69564 2003.73443 2028.72239
  as.vector(sapply(paste0("^", a), grep, b)) 1637.46903 1659.96661 1677.21706
 as.vector(sapply(a, grep, b, fixed = TRUE))  236.81745  238.62842  239.67875
                 unlist(reduced.match(a, b))   57.18344   59.09308   59.48678
                                pmatch(a, b)   75.03812   75.40420   75.60641
         uq        max neval
 2076.45628 2223.94624   100
 1708.86306 1905.16534   100
  241.12830  283.23043   100
   59.76167   88.71846   100
   75.99034   90.62689   100

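【Discussion】:

If people are happy with my summary, I will tick it as the accepted answer. So please let me know.
Using mclapply with the faster methods does not help, as you can show for yourself by using lapply instead (e.g., for 100 replicates of f3 it is slower under mclapply than under lapply). If you want replicated observations, then perhaps use the microbenchmark (or another) package with only the "fast" methods, since it is clear that the grep solutions are polynomial while the others are closer to linear. Please be sure to provide a reproducible way of generating your sample (a, b) data, as @flodel did.
@MartinMorgan OK, I will try to provide a reproducible sample and redo the tests when I have time. Still, it could be interesting to see how each function interacts with mclapply, since people might want to use the function in a very short script that will be run many times. No?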