Remove from a HashSet failing after iterating over it

【Question title】: Remove from a HashSet failing after iterating over it
【Posted】: 2010-10-19 17:56:46
【Question】:

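I'm writing an agglomerative clustering algorithm in Java and am having trouble with a remove operation. It always seems to fail once the number of clusters reaches half the initial number.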

In the sample code below, clusters is a Collection<Collection<Integer>>.

      while(clusters.size() > K){
           // determine smallest distance between clusters
           Collection<Integer> minclust1 = null;
           Collection<Integer> minclust2 = null;
           double mindist = Double.POSITIVE_INFINITY;

           for(Collection<Integer> cluster1 : clusters){
                for(Collection<Integer> cluster2 : clusters){
                     if( cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist){
                          minclust1 = cluster1;
                          minclust2 = cluster2;
                          mindist = getDistance(cluster1, cluster2);
                     }
                }
           }

           // merge the two clusters
           minclust1.addAll(minclust2);
           clusters.remove(minclust2);
      }

After a few passes through the loop, clusters.remove(minclust2) eventually returns false, but I don't understand why.

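I tested this code by first creating 10 clusters, each containing a single integer from 1 to 10. Distances are random numbers between 0 and 1. Here is the output after adding a few println statements. After the number of clusters, I print the actual clusters, the merge operation, and the result of clusters.remove(minclust2).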

Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false

From there the [10, 9, 5, 6, 4, 5, 6, 4, ...] set grows without bound.

Edit: to clarify, I'm using a HashSet<Integer> for each cluster in clusters (a HashSet<HashSet<Integer>>).

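【Comments on the question】:

[10, 9, 5, 6, 4, 5, 6, 4, ...] is clearly not a set. Is it a list?
Yes, good point. A HashSet should not contain duplicate objects. Something odd is going on here.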

Tags: java collections hash iterator hashset

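【Answer 1】:

Ah. When you alter a value that is already in a Set (or is a Map key), it is no longer necessarily stored in the right position, and its hash code will have been cached. You need to remove it, alter it, and then re-insert it.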
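
To illustrate, here is a minimal sketch of the failure in isolation. The class and method names are hypothetical and not taken from the question. The Set contract leaves the behavior unspecified when an element is changed in a way that affects equals, so the result shown is what OpenJDK's HashSet typically does rather than a guarantee.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class: shows a cluster becoming unremovable after an in-place merge.
class MutatedElementDemo {

    /** Adds a one-element cluster to a set, grows it in place, then tries to remove it. */
    static boolean removeAfterInPlaceMerge() {
        Set<Set<Integer>> clusters = new HashSet<>();
        Set<Integer> cluster = new HashSet<>();
        cluster.add(5);
        clusters.add(cluster);   // stored under the hash code of [5]

        cluster.add(6);          // in-place "merge": hashCode() is now that of [5, 6]

        // The lookup uses the new hash code, so the stale entry is not found.
        return clusters.remove(cluster);
    }

    public static void main(String[] args) {
        System.out.println(removeAfterInPlaceMerge());
    }
}
```

This matches the output in the question: removal only starts failing once minclust2 is a cluster that was itself grown by an earlier addAll, such as [5, 6, 4].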

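【Comments】:

Yes, you got it! The solution was to create a new cluster, add all the elements of minclust1 and minclust2 to it, remove minclust1 and minclust2 from clusters, and then add the new cluster. Changing an object that is inside a HashSet seems to be a bad idea.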
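Excellent. Immutability rocks. Technically you can change the elements of a HashSet as long as you don't break their equals and hashCode, but those should depend on either all of the data or none of it. If you had a HashSet of objects whose equals and hashCode depend on none of their data, you could change their components without fear.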
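
The fix described in the first comment above (build a new cluster instead of growing an existing one in place) can be sketched as follows. The names MergeClusters, merge, and demo are hypothetical, and the sketch assumes c1 and c2 have not been mutated since they were added to clusters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class: merges two clusters without mutating any set that is an element.
class MergeClusters {

    /** Replaces c1 and c2 in clusters with a newly created set holding their union. */
    static void merge(Set<Set<Integer>> clusters, Set<Integer> c1, Set<Integer> c2) {
        Set<Integer> merged = new HashSet<>(c1);  // copy, so c1 itself is never changed
        merged.addAll(c2);
        clusters.remove(c1);                      // both lookups use unchanged hash codes
        clusters.remove(c2);
        clusters.add(merged);
    }

    /** Small usage example: merges [10, 9] and [5, 6, 4], leaving [1] untouched. */
    static Set<Set<Integer>> demo() {
        Set<Integer> a = new HashSet<>(Arrays.asList(10, 9));
        Set<Integer> b = new HashSet<>(Arrays.asList(5, 6, 4));
        Set<Integer> c = new HashSet<>(Arrays.asList(1));
        Set<Set<Integer>> clusters = new HashSet<>(Arrays.asList(a, b, c));
        merge(clusters, a, b);
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(demo().size());
    }
}
```

As in the question's loop, merge should be called only after the nested for loops have finished, since modifying clusters while iterating over it would throw a ConcurrentModificationException.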

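【Answer 2】:

In the test shown, remove fails the first time you try to remove a collection that contains more than one integer. Is that always the case?

What is the concrete type of Collection being used?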

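【Comments】:

Yes, you're on the right track. It always fails the first time I try to remove a collection with more than one integer. I'm using a HashSet<Integer> for each cluster in clusters (a HashSet<HashSet<Integer>>).
That's strange. If you're using a HashSet, why are you getting duplicate values in the set?
As I said above, a HashSet should not contain duplicate objects. I think there is a deeper problem here.
And it apparently maintains order, too.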

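【Answer 3】:

The obvious problem is that clusters.remove is probably using equals to find the element to remove. Unfortunately, equals on collections generally compares whether the elements are the same rather than whether it is the same collection (I believe C# made a better choice in this regard).

An easy fix is to create clusters as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>()) (I think).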
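
A minimal sketch of that suggestion, assuming identity semantics are acceptable for the clustering code (the class and method names are hypothetical). An IdentityHashMap compares keys with == and hashes them with System.identityHashCode, so growing a cluster in place no longer affects lookups.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical demo class: an identity-based set tolerates in-place changes to its elements.
class IdentityClusters {

    /** Grows a cluster in place while it is in an identity set, then removes it. */
    static boolean removeAfterInPlaceMerge() {
        Set<Collection<Integer>> clusters =
                Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>());

        Collection<Integer> c1 = new HashSet<>();
        c1.add(5);
        Collection<Integer> c2 = new HashSet<>();
        c2.add(6);
        clusters.add(c1);
        clusters.add(c2);

        c1.addAll(c2);               // in-place merge changes c1.hashCode(), but not its identity
        clusters.remove(c2);

        // c1 is still found, because only reference identity is used for the lookup.
        return clusters.remove(c1) && clusters.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(removeAfterInPlaceMerge());
    }
}
```

One consequence of this design is that two distinct clusters with identical contents count as different elements, which is usually what a clustering algorithm wants.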

【Comments】:

Good point about equals, but even when using equals, why would the remove fail?