Insight into the Collections removeAll method: answers

【Question title】: Insight into Collections removeAll method
【Posted】: 2016-01-18 13:52:07
【Question description】:

I have a list of roughly 200k elements, and I am running into a problem when filtering it.

Here is the implementation:

// The element type was omitted in the post; T stands in for it here.
public <T> List<T> filterList(List<T> listToBeFiltered) {
    List<T> removeElementsFromList = listToBeFiltered.parallelStream()
                                        .filter(/* some filtering logic */)
                                        .collect(Collectors.toList());
    listToBeFiltered.removeAll(removeElementsFromList);
    return listToBeFiltered;
}

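The problem is that when removeElementsFromList approaches the size of listToBeFiltered, the program stalls at the removeAll statement. Any insights or alternative solutions would be much appreciated.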

【Question discussion】:

Tags: java collections

【Solution 1】:

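The problem is that x.removeAll(y) is an O(n×m) operation when y is a list, as it is here, where n is the size of collection x and m is the size of collection y (i.e. O(|x|×|y|)).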

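The removeAll method essentially iterates over the entire list for each element in y, checking each element in x for equality and removing it if it matches. Doing this in a single pass would be far more efficient.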
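
To make the cost concrete, here is a minimal sketch that mimics the pairwise comparison described above and counts the equality checks. It is an illustration only, not the JDK source: the class name NaiveRemoveAllSketch and the method removeAllCounting are hypothetical, and the real library code differs in detail. Whichever list ends up in the outer loop, the work grows roughly with the product of the two sizes.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration; not the JDK implementation of removeAll.
class NaiveRemoveAllSketch {

    // Removes from x every element that also occurs in y and
    // returns the number of equality comparisons performed.
    static <T> long removeAllCounting(List<T> x, List<T> y) {
        long comparisons = 0;
        Iterator<T> it = x.iterator();
        while (it.hasNext()) {
            T candidate = it.next();
            boolean found = false;
            // Linear scan of y, comparable to what List.contains does.
            for (T other : y) {
                comparisons++;
                if (Objects.equals(candidate, other)) {
                    found = true;
                    break;
                }
            }
            if (found) {
                it.remove();
            }
        }
        return comparisons;
    }
}
```

Every element of x that is not in y forces a full scan of y, which is where the n×m behaviour comes from.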

Assuming you are on Java 8, there is a more efficient way to do this:

List<Integer> xs = new ArrayList<>();
// TODO: initialize xs with a bunch of values
List<Integer> ys = new ArrayList<>();
// TODO: initialize ys with a bunch of values
Set<Integer> ysSet = new HashSet<>(ys);
List<Integer> xsPrime = xs.stream()
    .filter(x -> !ysSet.contains(x))
    .collect(Collectors.toList());

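For xs of size 100k and ys of size 66k, removeAll takes about 5500 ms, whereas the approach above takes only about 8 ms. Because of removeAll's quadratic complexity, I expect the gap to be even more pronounced when you scale up to 200k.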

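By contrast, the filter version above has complexity O(n+m): it takes O(m) to build the HashSet of all values in ys, and then O(n) to iterate over all values of xs and check that none is contained in the new ysSet. (This of course assumes that HashSet lookups are O(1).)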
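
As a rough worked example of the two bounds, the sketch below plugs in the sizes quoted above (n = 100k, m = 66k). The class and method names are hypothetical, and the figures are worst-case step counts, not measured timings.

```java
// Hypothetical helper for back-of-the-envelope step counts.
class ComplexityEstimate {

    // Upper bound on comparisons for the list-against-list approach.
    static long quadraticSteps(long n, long m) {
        return n * m;
    }

    // Steps for building the set (m) plus one pass over xs (n).
    static long linearSteps(long n, long m) {
        return n + m;
    }

    public static void main(String[] args) {
        long n = 100_000;
        long m = 66_000;
        System.out.println("n * m = " + quadraticSteps(n, m)); // 6600000000
        System.out.println("n + m = " + linearSteps(n, m));    // 166000
    }
}
```

That is up to 6.6 billion comparisons against 166 thousand set operations.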

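Looking at your question again, I realize you are already using filter... In that case, I suggest simply inverting your filter logic and then resetting the passed-in list to the filtered values: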

// The element type was omitted in the post; T stands in for it here.
public <T> List<T> filterList(List<T> listToBeFiltered) {
    List<T> filteredList = listToBeFiltered.parallelStream()
        .filter(/* some inverted filtering logic */)
        .collect(Collectors.toList());
    listToBeFiltered.clear();
    listToBeFiltered.addAll(filteredList);
    return listToBeFiltered;
}

If you don't need to mutate the original list, you can return filteredList directly. (That would be my preferred solution anyway.)
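
Here is a compilable sketch of both variants, assuming the filtering logic can be passed in as a Predicate. The class name FilterListSketch, the method names, and the shouldRemove parameter are hypothetical and do not come from the question. The in-place variant uses Collection.removeIf, a different Java 8 technique from the clear/addAll approach above, shown only as an alternative.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative only.
class FilterListSketch {

    // Non-mutating: returns a new list without the elements matching shouldRemove.
    static <T> List<T> withoutMatches(List<T> source, Predicate<? super T> shouldRemove) {
        return source.stream()
                .filter(item -> !shouldRemove.test(item)) // inverted filter logic
                .collect(Collectors.toList());
    }

    // Mutating alternative: removes matching elements from the list itself.
    static <T> List<T> removeMatchesInPlace(List<T> list, Predicate<? super T> shouldRemove) {
        list.removeIf(shouldRemove);
        return list;
    }
}
```

For example, withoutMatches(numbers, n -> n % 2 == 0) returns only the odd numbers and leaves numbers untouched.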

I just ran the test again, this time adding another version that uses a loop instead of a stream:

Set<Integer> ysSet = new HashSet<>(ys);
List<Integer> xsPrime = new ArrayList<>();
for (Integer x : xs) {
    if (!ysSet.contains(x)) {
        xsPrime.add(x);
    }
}
return xsPrime;

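This version finishes in about 7 ms rather than 8 ms. Since that is only slightly faster than the stream version (especially considering that the original removeAll version is 3 orders of magnitude slower), I would stick with the stream version, particularly because you can take advantage of parallelism there (you are already using parallelStream).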
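
For anyone who wants to reproduce the comparison, a minimal timing sketch follows. It is not the harness used for the numbers above: the class name RemoveAllTimingSketch, the helper methods, and the list sizes are assumptions chosen so that it finishes quickly. Single-shot System.nanoTime measurements without warm-up are only indicative; a proper benchmark would use a tool such as JMH.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical timing sketch; sizes and names are illustrative assumptions.
class RemoveAllTimingSketch {

    // Builds the list [fromInclusive, toExclusive).
    static List<Integer> range(int fromInclusive, int toExclusive) {
        List<Integer> values = new ArrayList<>(Math.max(0, toExclusive - fromInclusive));
        for (int i = fromInclusive; i < toExclusive; i++) {
            values.add(i);
        }
        return values;
    }

    // List-against-list removal on a copy, so xs itself is not modified.
    static List<Integer> viaRemoveAll(List<Integer> xs, List<Integer> ys) {
        List<Integer> copy = new ArrayList<>(xs);
        copy.removeAll(ys);
        return copy;
    }

    // Set-based filter; the set is only read while the parallel stream runs.
    static List<Integer> viaHashSet(List<Integer> xs, List<Integer> ys) {
        Set<Integer> ysSet = new HashSet<>(ys);
        return xs.parallelStream()
                .filter(x -> !ysSet.contains(x))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> xs = range(0, 20_000);
        List<Integer> ys = range(0, 13_000);

        long t0 = System.nanoTime();
        List<Integer> a = viaRemoveAll(xs, ys);
        long t1 = System.nanoTime();
        List<Integer> b = viaHashSet(xs, ys);
        long t2 = System.nanoTime();

        System.out.println("removeAll: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("HashSet filter: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same result: " + a.equals(b));
    }
}
```

Both methods return the same elements in the same order, so the comparison measures only how the work is done.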

【Discussion】: