Permute all unique enumerations of a vector in R (answers)

【Question title】: Permute all unique enumerations of a vector in R
【Posted】: 2011-04-15 00:19:33
【Question】:

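I am trying to find a function that will generate all the unique permutations of a vector, without counting juxtapositions within subsets of the same element type. For example: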

dat <- c(1,0,3,4,1,0,0,3,0,4)

has

factorial(10)
> 3628800

possible permutations, but only 10!/(2!*2!*4!*2!)

factorial(10)/(factorial(2)*factorial(2)*factorial(2)*factorial(4))
> 18900

unique permutations when juxtapositions within subsets of the same element type are ignored.

I can get these by using unique() together with the permn() function from the combinat package

unique( permn(dat) )

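but this is computationally very expensive, because it involves enumerating all n! permutations, which can be orders of magnitude more than I need (here 3,628,800 versus 18,900). Is there a way to do this without first computing all n! of them?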

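【Question comments】:

Can you elaborate on what you mean by juxtapositions within subsets of the same element type? Maybe it's obvious, but I'm not seeing it right now.
@Chase: There are repeated values in the vector. You can see it with a smaller vector such as c(0,0,2): half of the permutations from permn(c(0,0,2)) are duplicates.
I don't have a solution, but I think looking at it a different way might help. If you split your original vector into k "value groups", each of size n_k, then what you really want to do is assign a set of n_k positions to each group (where the positions are anything between 1 and 10 in your example). So one "permutation" of your sample vector would be as follows: the zeros get positions 1, 2, 3, 4; the ones get positions 5, 6; the threes get positions 7, 8; and the fours get positions 9, 10. I'm hoping someone else can see where I'm going and take it from here.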
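The idea in the last comment can be made concrete with a small sketch. It builds the single arrangement described there by writing each value into its assigned set of positions. The names `positions` and `arrangement` are illustrative only and do not come from the original posts.

```r
dat <- c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4)

# One assignment of position sets to value groups, as in the comment:
# zeros at 1-4, ones at 5-6, threes at 7-8, fours at 9-10.
positions <- list("0" = 1:4, "1" = 5:6, "3" = 7:8, "4" = 9:10)

arrangement <- numeric(length(dat))
for (value in names(positions)) {
  arrangement[positions[[value]]] <- as.numeric(value)
}

arrangement
# [1] 0 0 0 0 1 1 3 3 4 4
```

Every distinct way of splitting the positions 1 to 10 into such sets gives exactly one unique permutation, which is why no duplicates ever need to be filtered out.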

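Tags: algorithm r permutation combinatorics

【Solution 1】:

Edit: here is a faster answer; again based on the ideas of Louisa Gray and Bryce Wagner, but with faster R code thanks to better use of matrix indexing. It is quite a bit faster than my original: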

> d <- c(1,0,3,4,1,0,0,3,0,4)
> system.time(up1 <- uniqueperm(d))
   user  system elapsed 
  0.183   0.000   0.186 
> system.time(up2 <- uniqueperm2(d))
   user  system elapsed 
  0.037   0.000   0.038

And the code:

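uniqueperm2 <- function(d) {
  dat <- factor(d)
  N <- length(dat)
  n <- tabulate(dat)                  # count of each distinct value
  ng <- length(n)                     # number of distinct values
  if(ng==1) return(d)                 # one distinct value: d is the only arrangement
  # a[i]: number of positions still free when group i is placed
  a <- N-c(0,cumsum(n))[-(ng+1)]
  # foo[[i]]: every way to choose n[i] of the a[i] free positions
  foo <- lapply(1:ng, function(i) matrix(combn(a[i],n[i]),nrow=n[i]))
  out <- matrix(NA, nrow=N, ncol=prod(sapply(foo, ncol)))
  # xxx[i,]: first and last row of out that belong to group i
  xxx <- c(0,cumsum(sapply(foo, nrow)))
  xxx <- cbind(xxx[-length(xxx)]+1, xxx[-1])
  # miss: positions not yet assigned, one column per partial arrangement
  miss <- matrix(1:N,ncol=1)
  for(i in seq_len(length(foo)-1)) {
    l1 <- foo[[i]]
    nn <- ncol(miss)
    miss <- matrix(rep(miss, ncol(l1)), nrow=nrow(miss))
    k <- (rep(0:(ncol(miss)-1), each=nrow(l1)))*nrow(miss) + 
               l1[,rep(1:ncol(l1), each=nn)]
    out[xxx[i,1]:xxx[i,2],] <- matrix(miss[k], ncol=ncol(miss))
    miss <- matrix(miss[-k], ncol=ncol(miss))
  }
  # the last group takes whatever positions are left
  k <- length(foo)
  out[xxx[k,1]:xxx[k,2],] <- miss
  # reorder rows so that row j holds the position assigned to d[j]
  out <- out[rank(as.numeric(dat), ties.method="first"),]
  # write each value into its assigned position (d is recycled across columns)
  foo <- cbind(as.vector(out), as.vector(col(out)))
  out[foo] <- d
  t(out)                              # one permutation per row
}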

The rows are not returned in the same order, but after sorting the results are identical.

up1a <- up1[do.call(order, as.data.frame(up1)),]
up2a <- up2[do.call(order, as.data.frame(up2)),]
identical(up1a, up2a)

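For my first attempt, see the edit history.

【Comments】:

Nice function, thanks! One small thing: the (not very sensible) edge case where d has length 1 fails in the for(i in 2:ng) loop, because foo then has only one component.
@Aaron: Is there a simple way to fix the bug caracal mentioned above?
Also just realized this is the same as Bryce's suggestion. Most likely the combining could be done faster, either by being more careful in R or by rewriting it in C; if anyone wants to speed it up, feel free. For one thing, I'm sure I create more matrices than necessary along the way.
I was wondering: is there a way to use multicore or foreach to run this on multiple cores and speed it up? It looks like out keeps getting overwritten inside the for loop, so maybe that isn't possible.
Not with the algorithm I use here, no, for exactly the reason you mention. I'm sure it could be rewritten in a smarter way so that at least part of it could use multiple cores, but I suspect that combining the pieces may depend on all the results and would slow it down. My hunch is that you would get more speedup more easily by thinking harder about the algorithm or by rewriting it in C.

【Solution 2】:

The following function (which implements the classic formula for permutations with repetition, just as you did by hand in your question) seems quite fast to me: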

upermn <- function(x) {
    n <- length(x)
    duplicates <- as.numeric(table(x))
    factorial(n) / prod(factorial(duplicates))
}

It does compute n!, but unlike the permn function it does not generate all the permutations first.
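If computing n! itself is a concern (in double precision, factorial(n) overflows to Inf once n exceeds 170), the same count can be written as a product of binomial coefficients. This is only a sketch; the name `upermn_choose` is made up here and is not part of the answer above.

```r
# Count unique permutations without ever forming n!:
# place the groups one after another, choosing each group's slots
# among the positions occupied so far.
upermn_choose <- function(x) {
  n <- as.numeric(table(x))      # size of each group of equal values
  prod(choose(cumsum(n), n))     # multinomial coefficient as a product of binomials
}

upermn_choose(c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4))
# [1] 18900
```

For the example vector this is choose(4,4) * choose(6,2) * choose(8,2) * choose(10,2) = 1 * 15 * 28 * 45 = 18900, matching the factorial formula.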

See it in action:

> dat <- c(1,0,3,4,1,0,0,3,0,4)
> upermn(dat)
[1] 18900
> system.time(upermn(dat))
   user  system elapsed 
  0.000   0.000   0.001

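Update: I have just realized that the question is about generating all the unique permutations, not just counting them. Sorry about that!

You can improve on the unique(permn(...)) part by computing the unique permutations of the vector with one element removed, and then prepending the removed element to each of them. Well, my explanation may fall short, so let the source speak: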

library(combinat)                     # provides permn()

uperm <- function(x) {
    u <- unique(x)                    # unique values of the vector
    result <- NULL                    # rows are appended below
    for (i in seq_along(u)) {
        v <- x[-which(x==u[i])[1]]    # drop the first occurrence of u[i]
        # prepend u[i] to every unique permutation of the remaining elements
        result <- rbind(result, cbind(u[i], do.call(rbind, unique(permn(v)))))
    }
    return(result)
}

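This gives you some speedup. I was too lazy to run the code on the vector you provided (it takes a long time), so here is a small comparison on a shorter vector: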

> dat <- c(1,0,3,4,1,0,0)
> system.time(unique(permn(dat)))
   user  system elapsed 
  0.264   0.000   0.268 
> system.time(uperm(dat))
   user  system elapsed 
  0.147   0.000   0.150

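I think you could gain even more by rewriting this function recursively!

Update (again): I tried to put together a recursive function with my limited knowledge: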

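uperm <- function(x) {
    u <- sort(unique(x))              # sorted, so u[i] lines up with table(x)[i]
    l <- length(u)
    if (l == length(x)) {             # no duplicates: plain permn() is enough
        return(do.call(rbind,permn(x)))
    }
    if (l == 1) return(x)             # a single repeated value has one arrangement
    # preallocate the result using the count from upermn()
    result <- matrix(NA, upermn(x), length(x))
    index <- 1
    for (i in 1:l) {
        v <- x[-which(x==u[i])[1]]    # drop the first occurrence of u[i]
        newindex <- upermn(v)         # number of rows contributed by u[i]
        if (table(x)[i] == 1) {
            result[index:(index+newindex-1),] <- cbind(u[i], do.call(rbind, unique(permn(v))))
        } else {
            result[index:(index+newindex-1),] <- cbind(u[i], uperm(v))
        }
        index <- index+newindex
    }
    return(result)
}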

This brings a big gain:

> system.time(unique(permn(c(1,0,3,4,1,0,0,3,0))))
   user  system elapsed 
 22.808   0.103  23.241 

> system.time(uperm(c(1,0,3,4,1,0,0,3,0)))
   user  system elapsed 
  4.613   0.003   4.645

Please report back if this works for you!
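The recursion above still falls back on unique(permn(...)) in some branches. As a rough sketch of where the idea leads, the function below recurses on every distinct value and never calls permn at all, so no duplicate is ever generated. The name `uperm_rec` is hypothetical; it uses base R only and is written for clarity, not speed.

```r
# Unique permutations by pure recursion: pick each distinct value as the
# first element, then permute what is left after removing one copy of it.
uperm_rec <- function(x) {
  if (length(x) <= 1) return(matrix(x, nrow = 1))
  u <- sort(unique(x))
  pieces <- lapply(u, function(val) {
    rest <- x[-match(val, x)]                      # remove one occurrence of val
    cbind(val, uperm_rec(rest), deparse.level = 0) # val is recycled down the rows
  })
  do.call(rbind, pieces)
}

uperm_rec(c(0, 0, 2))
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    0    2    0
# [3,]    2    0    0
```

Because the distinct values are visited in sorted order at every level, the rows come out in lexicographic order.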

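【Comments】:

I just got an error with the last one: Error: evaluation nested too deeply: infinite recursion / options(expressions=)? The first function works very well though, a big improvement in time. Thank you very much for taking the time. If you can fix the error, that would be great.
@Steve: The last uperm function works fine on my machine with the data you provided. Computing uperm(c(1,0,3,4,1,0,0,3,0,4)) takes 17 seconds here. Did you check the latest version of the function? I edited my answer 25 minutes ago.
@Steve: You can find the upermn function earlier in my post. Just run it before running the uperm function. It is used to compute and declare the number of rows of the result matrix up front, so that there is no need to grow it with rbind (which is good for performance).

【Solution 3】:

One option not yet mentioned here is the allPerm function from the multicool package. It can be used quite easily to get all the unique permutations: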

library(multicool)
perms <- allPerm(initMC(dat))
dim(perms)
# [1] 18900    10
head(perms)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    4    4    3    3    1    1    0    0    0     0
# [2,]    0    4    4    3    3    1    1    0    0     0
# [3,]    4    0    4    3    3    1    1    0    0     0
# [4,]    4    4    0    3    3    1    1    0    0     0
# [5,]    3    4    4    0    3    1    1    0    0     0
# [6,]    4    3    4    0    3    1    1    0    0     0

In benchmarking, I found it faster on dat than the solutions from the OP and daroczig, but slower than Aaron's solution.

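【Comments】:

On my computer, microbenchmark(uniqueperm2(dat),allPerm(initMC(dat))) tells me that allPerm is about 7 times faster.
@ruggero On my Mac, allPerm(initMC()) is 100 times slower than uniqueperm2, but I can't determine why. Could you also test the speed of the iterpc solution below?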

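【Solution 4】:

I don't actually know R, but here is how I would approach the problem:

Find out how many of each element type there are, i.e. four 0s, two 1s, two 3s and two 4s.

Sort by frequency (the above already is).

Start with the most frequent value, which takes up 4 of the 10 spots. Determine the unique combinations of 4 values within the 10 available spots. (0,1,2,3),(0,1,2,4),(0,1,2,5),(0,1,2,6) ... (0,1,2,9),(0,1,3,4),(0,1,3,5) ... (6,7,8,9)

Move on to the second most frequent value, which takes up 2 of the 6 remaining spots, and determine the unique combinations of 2 out of 6. (0,1),(0,2),(0,3),(0,4),(0,5),(1,2),(1,3) ... (3,5),(4,5)

Then 2 of 4: (0,1),(0,2),(0,3),(1,2),(1,3),(2,3)

And the remaining value, 2 of 2: (0,1)

Then you need to combine these into every possible combination. Here is some pseudocode (I'm sure there is a more efficient algorithm, but this shouldn't be too bad):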

lookup = (0,1,3,4)
For each of the above sets of combinations, example: input = ((0,2,4,6),(0,2),(2,3),(0,1))
newPermutation = (-1,-1,-1,-1,-1,-1,-1,-1,-1,-1)
for i = 0 to 3
  k = 0        // next entry of input(i) to place
  index = 0    // number of empty spots seen so far in this pass
  for j = 0 to 9
    if newPermutation(j) = -1
      if k < length(input(i)) and index = input(i)(k)
        newPermutation(j) = lookup(i)
        k = k + 1
      index = index + 1
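For readers who want to try this approach in R, here is a minimal sketch of the same idea: for each value, choose its spots among the positions that are still free, then recurse on the remaining values. The function name `uperm_positions` and its helper `place` are made up for illustration. Values are processed in sorted order rather than by frequency, which does not change the set of results.

```r
uperm_positions <- function(x) {
  vals   <- sort(unique(x))
  counts <- as.vector(table(x))        # table() uses the same sorted order as vals

  # partial: arrangement built so far (NA = empty spot)
  # free:    indices of the spots that are still empty
  # g:       index of the value currently being placed
  place <- function(partial, free, g) {
    if (g > length(vals)) return(list(partial))
    # choose by index into `free`, so a single remaining spot is handled correctly
    picks <- combn(length(free), counts[g], simplify = FALSE)
    unlist(lapply(picks, function(idx) {
      filled <- partial
      filled[free[idx]] <- vals[g]
      place(filled, free[-idx], g + 1)
    }), recursive = FALSE)
  }

  do.call(rbind, place(rep(NA, length(x)), seq_along(x), 1))
}

perms <- uperm_positions(c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4))
dim(perms)
# [1] 18900    10
```

The number of rows is choose(10,4) * choose(6,2) * choose(4,2) * choose(2,2) = 210 * 15 * 6 * 1 = 18900, so each unique permutation is produced exactly once.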

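【Comments】:

【Solution 5】:

Another option is the iterpc package, which I believe is the fastest of the existing methods. More importantly, the result is in dictionary order (which may be preferable).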

dat <- c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4)
library(iterpc)
getall(iterpc(table(dat), order=TRUE))

The benchmark indicates that iterpc is significantly faster than all the other methods described here

library(multicool)
library(microbenchmark)
microbenchmark(uniqueperm2(dat), 
               allPerm(initMC(dat)), 
               getall(iterpc(table(dat), order=TRUE))
              )

Unit: milliseconds
                                     expr         min         lq        mean      median
                         uniqueperm2(dat)   23.011864   25.33241   40.141907   27.143952
                     allPerm(initMC(dat)) 1713.549069 1771.83972 1814.434743 1810.331342
 getall(iterpc(table(dat), order = TRUE))    4.332674    5.18348    7.656063    5.989448
          uq        max neval
   64.147399   74.66312   100
 1855.869670 1937.48088   100
    6.705741   49.98038   100

【Comments】:

iterpc is deprecated; see the arrangements package instead.

【Solution 6】:

Another option is to use the Rcpp package. The difference is that it returns a list.

#include <Rcpp.h>
#include <algorithm>   // std::sort, std::next_permutation
#include <vector>

// [[Rcpp::export]]
std::vector<std::vector<int> > UniqueP(std::vector<int> v) {
    std::vector<std::vector<int> > out;
    std::sort(v.begin(), v.end());   // start from the smallest arrangement
    do {
        out.push_back(v);
    } while (std::next_permutation(v.begin(), v.end()));   // skips duplicate arrangements
    return out;
}

Unit: milliseconds
             expr       min      lq     mean    median       uq      max neval cld
 uniqueperm2(dat) 10.753426 13.5283 15.61438 13.751179 16.16061 34.03334   100   b
     UniqueP(dat)  9.090222  9.6371 10.30185  9.838324 10.20819 24.50451   100   a
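The C++ version works because std::next_permutation steps to the next lexicographically greater arrangement, so equal elements never produce the same row twice. To show what that step does, here is a rough base R transliteration. The names `next_perm` and `unique_perms_iter` are hypothetical, and this version is meant for reading; it will be much slower than the compiled code.

```r
# Return the next arrangement in lexicographic order, or NULL if v is the last one.
next_perm <- function(v) {
  n <- length(v)
  i <- n - 1
  # find the rightmost position whose value is smaller than its right neighbour
  while (i >= 1 && v[i] >= v[i + 1]) i <- i - 1
  if (i < 1) return(NULL)
  # find the rightmost value that is strictly larger than v[i]
  j <- n
  while (v[j] <= v[i]) j <- j - 1
  v[c(i, j)] <- v[c(j, i)]             # swap the two
  v[(i + 1):n] <- rev(v[(i + 1):n])    # reverse the tail to make it smallest
  v
}

# Collect all unique permutations, starting from the sorted vector.
unique_perms_iter <- function(x) {
  v <- sort(x)
  out <- list()
  while (!is.null(v)) {
    out[[length(out) + 1]] <- v
    v <- next_perm(v)
  }
  do.call(rbind, out)
}

unique_perms_iter(c(0, 0, 2))
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    0    2    0
# [3,]    2    0    0
```

Unlike the Rcpp function, this sketch returns a matrix with one permutation per row instead of a list.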

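【Comments】:

【Solution 7】:

As this question is old and continues to attract many views, this post is only meant to inform R users of the current state of the language with regard to the popular task outlined by the OP. As @RandyLai alludes to, there are packages developed with this task in mind. They are: arrangements and RcppAlgos^*.

Efficiency

They are very efficient and quite easy to use for generating permutations of a multiset.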

dat <- c(1, 0, 3, 4, 1, 0, 0, 3, 0, 4)
dim(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
[1] 18900    10

microbenchmark(algos = RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)),
               arngmnt = arrangements::permutations(sort(unique(dat)), freq = table(dat)),
               curaccptd = uniqueperm2(dat), unit = "relative")
Unit: relative
     expr       min        lq       mean    median        uq       max neval
    algos  1.000000  1.000000  1.0000000  1.000000  1.000000 1.0000000   100
  arngmnt  1.501262  1.093072  0.8783185  1.089927  1.133112 0.3238829   100
curaccptd 19.847457 12.573657 10.2272080 11.705090 11.872955 3.9007364   100

With RcppAlgos, we can use parallel processing for better efficiency on larger examples.

hugeDat <- rep(dat, 2)[-(1:5)]
RcppAlgos::permuteCount(sort(unique(hugeDat)), freqs = table(hugeDat))
[1] 3603600

microbenchmark(algospar = RcppAlgos::permuteGeneral(sort(unique(hugeDat)),
                                                    freqs = table(hugeDat), nThreads = 4),
               arngmnt = arrangements::permutations(sort(unique(hugeDat)), freq = table(hugeDat)),
               curaccptd = uniqueperm2(hugeDat), unit = "relative", times = 10)
Unit: relative
     expr      min        lq      mean    median       uq      max neval
 algospar  1.00000  1.000000  1.000000  1.000000  1.00000  1.00000    10
  arngmnt  3.23193  3.109092  2.427836  2.598058  2.15965  1.79889    10
curaccptd 49.46989 45.910901 34.533521 39.399481 28.87192 22.95247    10

Lexicographical order

A nice benefit of these packages is that the output is in lexicographical order:

head(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    1    1    3    3    4     4
[2,]    0    0    0    0    1    1    3    4    3     4
[3,]    0    0    0    0    1    1    3    4    4     3
[4,]    0    0    0    0    1    1    4    3    3     4
[5,]    0    0    0    0    1    1    4    3    4     3
[6,]    0    0    0    0    1    1    4    4    3     3

tail(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)))
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[18895,]    4    4    3    3    0    1    1    0    0     0
[18896,]    4    4    3    3    1    0    0    0    0     1
[18897,]    4    4    3    3    1    0    0    0    1     0
[18898,]    4    4    3    3    1    0    0    1    0     0
[18899,]    4    4    3    3    1    0    1    0    0     0
[18900,]    4    4    3    3    1    1    0    0    0     0

identical(RcppAlgos::permuteGeneral(sort(unique(dat)), freqs = table(dat)),
      arrangements::permutations(sort(unique(dat)), freq = table(dat)))
[1] TRUE

Iterators

Furthermore, both packages offer iterators that allow memory-efficient generation of permutations, one at a time:

algosIter <- RcppAlgos::permuteIter(sort(unique(dat)), freqs = table(dat))

algosIter$nextIter()
[1] 0 0 0 0 1 1 3 3 4 4

algosIter$nextNIter(5)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    1    1    3    4    3     4
[2,]    0    0    0    0    1    1    3    4    4     3
[3,]    0    0    0    0    1    1    4    3    3     4
[4,]    0    0    0    0    1    1    4    3    4     3
[5,]    0    0    0    0    1    1    4    4    3     3

## last permutation
algosIter$back()
[1] 4 4 3 3 1 1 0 0 0 0

## use reverse iterator methods
algosIter$prevNIter(5)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    4    3    3    1    0    1    0    0     0
[2,]    4    4    3    3    1    0    0    1    0     0
[3,]    4    4    3    3    1    0    0    0    1     0
[4,]    4    4    3    3    1    0    0    0    0     1
[5,]    4    4    3    3    0    1    1    0    0     0

^* I am the author of RcppAlgos

【Comments】: