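Making a function that checks whether a vector exists in a matrix faster

【Question title】: making function that checks if vector exists in matrix faster
【Posted】: 2016-08-04 02:28:53
【Question description】:

I have the following function (funtest) to test whether a particular vector exists in a matrix. The vector always has length 2 and the matrix always has two columns. The function works fine; I just want to make it faster (ideally much faster), because my matrices can have hundreds to thousands of rows.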

x = c(1,2)

set.seed(100)
m <- matrix(sample(c(1,-2,3,4), 500*2, replace=TRUE), ncol=2)

funtest(m,x)
[1] TRUE

This is the current speed:

library(microbenchmark)
microbenchmark(funtest(m, x), times=100)
Unit: milliseconds
          expr      min       lq     mean   median       uq      max
 funtest(m, x) 1.501247 1.536157 1.674668 1.567826 1.708293 2.900046
 neval
   100

Here is the function:

funtest = function(m, x) {
    out = any(apply(m,1,function(n,x) all(n==x),x=x))
    return(out)
}

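【Question comments】:

I'm not an R user myself, but this looks like a highly vectorized expression, so you can do the check without branching. That is usually easier to parallelize on the processor, but it is sometimes simply slower than a more domain-driven approach. It may be better to first collect all row indices where the first element equals the given value, and then check only the filtered second column (comparing only against the positive indices from step 1, similar to short-circuiting in boolean evaluation). The speed-up should be limited to about 2x, though.
You should definitely try Zheyuan Li's approach first, since it has more potential for speed-up and probably feels more natural in a vectorized language. (Sadly, it all comes down to R's internals, and to the cases in which it applies compared with the alternative above.) Of course, the statistics of the data may also play a role in a branch-based approach.
I was thinking of some kind of hashing alternative to do the search in constant time?
@user3067923 I'm pretty sure a hash-based approach will always lose because of the large constants (complexity). This is a linear-complexity operation, so I would not approach it that way. (To be clearer: you need to look at all rows, which is a lower bound; direct comparison is always faster than hashing, at least if you hash all columns -> with many columns there may be a good probabilistic trade-off.)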
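
The two-step idea from the first comment (filter on the first column, then check only those rows in the second column) can be sketched in R as follows. The helper name two_step is hypothetical and not from the thread, and whether it beats a single vectorized comparison depends on the data. Note that values sampled from c(1, -2, 3, 4) never include 2, so c(1, 2) serves as a guaranteed miss here.

```r
# Hypothetical helper (not from the thread): filter on column 1, then test column 2.
two_step <- function(m, x) {
  hits <- which(m[, 1] == x[1])   # row indices whose first element matches
  any(m[hits, 2] == x[2])         # compare the second column only on those rows
}

set.seed(100)
m <- matrix(sample(c(1, -2, 3, 4), 500 * 2, replace = TRUE), ncol = 2)

two_step(m, c(1, 2))                                # FALSE: 2 is never sampled
two_step(rbind(c(1, -2), c(3, 4)), c(3, 4))         # TRUE: second row matches
```

If no row matches on the first column, hits is empty and any() on an empty logical vector returns FALSE, so no special case is needed.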

Tags: r matrix vector

【Solution 1】:

How about

paste(x[1], x[2], sep='&') %in% paste(m[,1], m[,2], sep='&')

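This should be super efficient! It is based on matching: once the first match is found, no further searching is done!

However, I'm sure it is not the fastest. The best solution would be to write this operation in C code using a single while loop. The potential speed-up factor should be no more than 2, though.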
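
As a rough sketch, the one-liner above can be wrapped as a function. The name pasteFn is an assumption here, chosen to match a benchmark label used later in the thread. Because paste() still converts every row to a string before any matching happens, a direct vectorized comparison of the two columns is a possible alternative for the two-column case; direct_cmp is a hypothetical name, not something from the thread.

```r
# Assumed wrapper around the paste-based one-liner.
pasteFn <- function(m, x) {
  paste(x[1], x[2], sep = "&") %in% paste(m[, 1], m[, 2], sep = "&")
}

# Hypothetical alternative: compare both columns directly, without strings.
direct_cmp <- function(m, x) {
  any(m[, 1] == x[1] & m[, 2] == x[2])
}

m_small <- rbind(c(1, -2), c(3, 4), c(4, 1))

pasteFn(m_small, c(3, 4))      # TRUE
direct_cmp(m_small, c(3, 4))   # TRUE
direct_cmp(m_small, c(4, 3))   # FALSE: the order within the row matters
```

Both helpers scan the whole matrix, so neither stops early at the first matching row.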

【Comments】:

【Solution 2】:

Here is an Rcpp (specifically RcppArmadillo) approach. Benchmarks are given at the end:

# Import the relevant packages (All for compiling the C++ code inline)
library(Rcpp)
library(RcppArmadillo)
library(inline)

# We need to include these namespaces in the C++ code 
includes <- '
using namespace Rcpp;
using namespace arma;
'

# This is the main C++ function 
# We cast 'm' as an Armadillo matrix 'm1' and compute the number of rows 'numRows'
# We cast 'x' as a row vector 'x1'
# We then loop through the rows of the matrix 
# As soon as we find a matching row (anyEqual = TRUE), we stop and return TRUE
# If no matching row is found, then anyEqual = FALSE and we return FALSE
# Note: Within the for loop, we do an elementwise comparison of a row of m1 to x1
# If the row is equal to x1, then the sum of the elementwise comparison should equal the number of elements of x1
src <- '
mat m1 = as<mat>(m); 
int numRows = m1.n_rows;
rowvec x1 = as<rowvec>(x);
bool anyEqual = false;
for (int i = 0; i < numRows && !anyEqual; i++){
    anyEqual = (sum(m1.row(i) == x1) == x1.size());
}
return(wrap(anyEqual));
'

# Here, we compile the function above
# Do this once (in a given R session) and use it as many times as desired
rcppFn <- cxxfunction(signature(m="numeric", x="numeric"), src, plugin='RcppArmadillo', includes=includes)

Benchmarks are here: (Edit: I've also added a benchmark below for @zheyuan-li's very simple solution; it is called pasteFn)

# Your function is called funtest
# Rcpp function is rcppFn
# Zheyuan's solution is pasteFn
microbenchmark(funtest(m, x), rcppFn(m, x), pasteFn(m, x), times=100, unit = "ms")
Unit: milliseconds
          expr      min        lq       mean    median        uq      max neval
 funtest(m, x) 1.127903 1.1984755 1.30559130 1.2514455 1.3431040 2.641258   100
  rcppFn(m, x) 0.005420 0.0061355 0.00879676 0.0073660 0.0084130 0.030305   100
 pasteFn(m, x) 0.741269 0.7610905 0.79174042 0.7752145 0.8228895 0.894389   100

Edit: if you want to use a matrix 'x' instead, the following source code should work:

src <- '
mat m1 = as<mat>(m); 
int numRows = m1.n_rows;
mat x1 = as<mat>(x);
vec anyEqual = zeros<vec>(x1.n_rows);
for (int j = 0; j < x1.n_rows; j++){
    for (int i = 0; i < numRows && !anyEqual(j); i++){
        anyEqual(j) = (sum(m1.row(i) == x1.row(j)) == x1.n_cols);
    }
}
return(wrap(anyEqual));
'

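Here I simply check, for each row of x, whether it exists in m. It is very similar to the original code, with just one more for loop. It returns 1 or 0 depending on whether a match exists (I'm not experienced enough with RcppArmadillo to create a boolean vector).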
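
For comparison, the same per-row check can be sketched in base R without compiling anything, and it returns a logical vector directly. This is only an illustration built on the paste-based matching from Solution 1, not the RcppArmadillo method; the name rows_in is hypothetical.

```r
# Hypothetical helper: for each row of x, is that row present in m?
rows_in <- function(m, x) {
  key_m <- paste(m[, 1], m[, 2], sep = "&")   # one string key per row of m
  key_x <- paste(x[, 1], x[, 2], sep = "&")   # one string key per row of x
  key_x %in% key_m
}

# matrix() fills column by column, so the rows of x are (1,5), (2,4), (-3,10).
x <- matrix(data = c(1, 2, -3, 5, 4, 10), ncol = 2)
m_small <- rbind(c(1, 5), c(0, 0), c(-3, 10))

rows_in(m_small, x)   # TRUE FALSE TRUE
```

This is expected to be slower than the compiled loop, since every row is converted to a string first.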

【Comments】:

Can it accept x as a matrix instead of a vector to search for? Say x = matrix(data = c(1,2,-3,5,4,10), ncol=2) instead of x = c(1,2)

【Solution 3】:

base::bitwXor() will yield 0 for a match between two integers.

Note: bitwXor() only works for integers
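
A small illustration of why XOR can stand in for an equality test: two identical integers cancel bit by bit, and any differing bit leaves a nonzero result. The variable names below are only for illustration.

```r
same     <- bitwXor(5L, 5L)     # 0: identical integers cancel bit by bit
differ   <- bitwXor(5L, 3L)     # 6: binary 101 XOR 011 = 110
negative <- bitwXor(-2L, -2L)   # 0: also holds for negative integers

# Vectorized over a column, comparing against 0 gives a logical match vector.
col <- c(1L, -2L, 3L, 4L)
bitwXor(col, 3L) == 0           # FALSE FALSE TRUE FALSE
```

As far as I understand the documentation, numeric inputs are coerced to integer first, so this trick should not be relied on for values with a fractional part.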

Edit: added the comparison against 0 in the bitwXor version, and added a data.table solution

library(microbenchmark)
set.seed(100)
m <- matrix(sample(c(1,-2,3,4), 500*2, replace=TRUE), ncol=2)

fun1 <- function(m,x) {any(apply(m,1,function(n,x) all(n==x),x=x))}
fun2 <- function(m,x) {paste(x[1], x[2], sep='&') %in% paste(m[,1], m[,2], sep='&')}
fun3 <- function(m,x) {any((bitwXor(m[,1], x[1]) == 0) & (bitwXor(m[,2], x[2]) == 0))}
fun4 <- function(m,x) {setDT(m)[X1 == x[1] & X2 == x[2], .N > 0]}

x <-  c(1,2)

microbenchmark(fun1(m,x),     # @user3067923
               fun2(m,x),     # @Zheyuan Li
               rcppFn(m, x),  # @jav
               fun3(m,x),
               times = 1000)

# Unit: microseconds
#         expr      min       lq       mean   median       uq      max neval
#   fun1(m, x) 1802.483 1920.007 2156.93459 1995.865 2094.820 9915.013  1000
#   fun2(m, x) 1540.716 1602.534 1674.39556 1641.256 1702.848 2832.344  1000
# rcppFn(m, x)   14.040   16.305   23.43586   21.739   29.439   95.107  1000
#   fun3(m, x)   70.650   76.992   86.36290   82.879   88.766  314.303  1000

data.table solution:

library(data.table)
m <- data.frame(m)
microbenchmark(fun4(m,x), times = 1000)

# Unit: microseconds
#       expr     min       lq     mean median      uq      max neval
# fun4(m, x) 836.026 887.6555 985.8596 920.49 968.269 9025.546  1000

【Comments】: