A bug in Mahout 0.7's random data selection

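In K-means clustering, we first need to pick K of the existing data points as the initial centers. The bug is in how those centers are randomly chosen: Mahout's implementation is not truly random.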

[Location]:

org.apache.mahout.clustering.kmeans.RandomSeedGenerator#buildRandom(...), lines 88-110.

Simplified a bit, Mahout's random sampling logic looks like this:

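import java.util.List;
import java.util.Random;

import com.google.common.collect.Lists;
import org.apache.mahout.common.RandomUtils;

/**
 * Sample K integers from integer interval [0, N).
 * (The method name "sample" is a placeholder.)
 * @param N size of the interval to sample from
 * @param K number of integers to sample
 * @return the sampled integers
 */
static List<Integer> sample(int N, int K) {
    List<Integer> chosen = Lists.newArrayListWithCapacity(K);
    Random random = RandomUtils.getRandom();
    for (int n = 0; n < N; ++n) {
        int currentSize = chosen.size();
        if (currentSize < K) {
            chosen.add(n);
        } else if (random.nextInt(currentSize + 1) != 0) {
            // evict one chosen randomly
            int indexToRemove = random.nextInt(currentSize);
            chosen.remove(indexToRemove);
            chosen.add(n);
        }
    }
    return chosen;
}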
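To make the bias easier to see, here is a small simulation sketch. It is not Mahout code: the class and method names are made up for illustration, and a seeded java.util.Random stands in for RandomUtils.getRandom(). The second method is textbook reservoir sampling (Algorithm R), included only as a uniform baseline to compare against, not as the fix Mahout adopted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical demo class; the names are illustrative and not part of Mahout.
class SamplingBiasDemo {

    // Same logic as the simplified listing above, with java.util.Random
    // standing in for RandomUtils.getRandom().
    static List<Integer> mahoutStyleSample(int N, int K, Random random) {
        List<Integer> chosen = new ArrayList<>(K);
        for (int n = 0; n < N; ++n) {
            int currentSize = chosen.size();
            if (currentSize < K) {
                chosen.add(n);
            } else if (random.nextInt(currentSize + 1) != 0) {
                // Once the list is full this branch runs with probability K/(K+1).
                int indexToRemove = random.nextInt(currentSize);
                chosen.remove(indexToRemove);
                chosen.add(n);
            }
        }
        return chosen;
    }

    // Textbook reservoir sampling (Algorithm R): element n replaces a
    // uniformly chosen slot with probability K/(n+1).
    static List<Integer> reservoirSample(int N, int K, Random random) {
        List<Integer> chosen = new ArrayList<>(K);
        for (int n = 0; n < N; ++n) {
            if (chosen.size() < K) {
                chosen.add(n);
            } else {
                int j = random.nextInt(n + 1); // uniform in [0, n]
                if (j < K) {
                    chosen.set(j, n);
                }
            }
        }
        return chosen;
    }

    // Counts how often each integer in [0, N) is selected over `trials` runs.
    static int[] inclusionCounts(int N, int K, int trials, boolean mahoutStyle, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[N];
        for (int t = 0; t < trials; ++t) {
            List<Integer> sample = mahoutStyle
                    ? mahoutStyleSample(N, K, random)
                    : reservoirSample(N, K, random);
            for (int value : sample) {
                counts[value]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int N = 20, K = 3, trials = 100_000;
        int[] biased = inclusionCounts(N, K, trials, true, 42L);
        int[] uniform = inclusionCounts(N, K, trials, false, 42L);
        // Under uniform sampling every index should show up in about K/N of the trials.
        for (int n = 0; n < N; ++n) {
            System.out.printf("%2d  mahout-style=%.4f  reservoir=%.4f%n",
                    n, biased[n] / (double) trials, uniform[n] / (double) trials);
        }
    }
}
```

With N = 20 and K = 3, the Mahout-style logic keeps the last index with probability K/(K+1) = 0.75, while a uniform sample would keep it with probability K/N = 0.15. The earliest indices are almost never kept, so the selection leans heavily toward the end of the input.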