Finding unique vector elements in a list efficiently: answers

【Question title】: Finding unique vector elements in a list efficiently
【Posted】: 2012-12-05 03:47:31
【Question】:

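I have a list of numeric vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function that checks every vector against every other.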

F1 <- function(x){

    to_remove <- c()
    for(i in seq_along(x)){
        for(j in seq_along(x)){
            # i < j keeps the first copy and marks only later copies for removal
            if(i < j && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
        }
    }
    if(is.null(to_remove)) x else x[-to_remove]
}

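The problem is that this function becomes very slow as the size of the input list x increases, partly because of the two large vectors allocated by the for loops. I'm hoping for a method that will run in under a minute for a list of length 1.5 million with vectors of length 15, but that may be optimistic.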

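Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be of equal length.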

Example output is shown below.

x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)

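【Comments on the question】:

Are we talking about numeric vectors or vectors of any type?
Ryan -- for the benefit of future searchers, could you switch the "accept" from my answer to @RicardoSaporta's answer? Thanks!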

Tags: r list vector

【Solution 1】:

You could hash each of the vectors and then use !duplicated() to identify the unique elements of the resulting character vector:

library(digest)  

## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)

ll[!duplicated(sapply(ll, digest))]
# [[1]]
#  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
# 
# [[2]]
# [1]  2  3  4  5  6  7  8  9 10
# 
# [[3]]
#  [1]  1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373  0.94088670
#  [7] -0.20254574 -1.08275938 -0.32937153  0.49454570

To see at a glance why this works, here is what the hashes look like:

sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"
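
If installing the digest package is not an option, the same "one key per vector, then !duplicated()" idea can be sketched in base R. This is only an illustration under assumptions: make_key is a hypothetical helper (not part of digest), and the vectors are assumed to be plain numeric vectors. A pasted key is coarser than a hash: it treats 1L and 1 as the same value, and doubles are formatted to about 15 significant digits, so it fits integer-valued data best.

```r
# Hypothetical helper (not from any package): collapse one vector into a string key.
make_key <- function(v) paste(v, collapse = ",")

ll <- list(1:4, 1:4, 2:5, 3:6, 2:5)

# One key per list element; vapply guarantees a character vector result.
keys <- vapply(ll, make_key, character(1))

# Keep the first occurrence of each key.
uniq <- ll[!duplicated(keys)]
length(uniq)
```

The comma separator keeps element boundaries visible in the key, so c(1, 23) and c(12, 3) do not collapse to the same string.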

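【Comments】:

Thanks very much, this part of my code is no longer rate limiting. The digest package looks powerful :)
@JoshO'Brien - I can't install packages on my current setup, but is this any faster than simply doing x[!duplicated(x)], which seems to work on list() objects?
@JoshuaUlrich -- Nice one. Thanks. I'll just have to console myself with knowing that this is still a pretty decent answer for how to efficiently find the unique elements in a list.
@thelatemail You mean "doom", eh? :-)
@Josh O'Brien - your answer was still very useful to me; I ended up using the digest package elsewhere in my code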

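【Solution 2】:

Per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine.
So should unique(ll). I previously suggested a method using sapply, the idea being to avoid checking every element in the list (I deleted that answer, since I think using unique makes more sense).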
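
As a quick check on the example list from the question, both calls can be run side by side. This is a minimal sketch; the names by_duplicated and by_unique are only illustrative.

```r
x <- list(1:4, 1:4, 2:5, 3:6)

by_duplicated <- x[!duplicated(x)]  # keeps the first copy of each vector
by_unique     <- unique(x)          # unique() also accepts a list directly

# Both approaches should agree element for element.
identical(by_duplicated, by_unique)
```

Neither call needs an extra package, which matters when installing packages is not possible.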

Since efficiency is a goal, we should benchmark these.

# Let's create some sample data:
# 15 random permutations of 1:100, sampled with replacement into a list of 1000
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, replace = TRUE))
ll

The candidate functions to benchmark:

fun1 <- function(ll) {
  ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}

fun2 <- function(ll) {
  # hash-based approach from Solution 1; needs library(digest) to be loaded
  ll[!duplicated(sapply(ll, digest))]
}

fun3 <- function(ll)  {
  ll[!duplicated(ll)]
}

fun4 <- function(ll)  {
  unique(ll)
}

# Make sure all four functions return the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)), 
    identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE


library(rbenchmark)

benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]

        test elapsed relative user.self sys.self
3     unique   0.048    1.000     0.049    0.000
2 duplicated   0.050    1.042     0.050    0.000
1     digest   8.427  175.563     8.415    0.038
# I took out fun1, since it ran extremely slowly when ll is large
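
If rbenchmark is not available, a rough comparison of the two base R options can be sketched with system.time(). This is an assumption-laden sketch rather than a replacement for the benchmark above: the seed, the 100 repetitions, and the variable names are arbitrary choices, and timings will differ by machine, so no expected numbers are shown.

```r
set.seed(1)  # assumption: fixed seed so the sample data is repeatable
xx <- lapply(rep(100, 15), sample)
ll <- sample(xx, 1000, replace = TRUE)

# Time 100 repetitions of each approach; the results are kept for a sanity check.
t_unique <- system.time(for (k in 1:100) res_unique <- unique(ll))
t_dup    <- system.time(for (k in 1:100) res_dup <- ll[!duplicated(ll)])

# Elapsed seconds for each approach.
c(unique = t_unique[["elapsed"]], duplicated = t_dup[["elapsed"]])

# Both approaches should return exactly the same list.
identical(res_unique, res_dup)
```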

The fastest option:

unique(ll)
