【Question title】: scikit-learn: Comparison of the K-Means and MiniBatchKMeans clustering algorithms
【Posted】: 2019-12-06 19:35:55
【Question】:

I am reading the scikit-learn user guide on Clustering. It has an example comparing K-Means and MiniBatchKMeans.

I am a bit confused by the following code in the example:

# We want to have the same colors for the same cluster from the
# MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
# closest one.
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
mbk_means_cluster_centers = np.sort(mbk.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)
order = pairwise_distances_argmin(k_means_cluster_centers,
                                  mbk_means_cluster_centers)

The values of the k-means cluster centers before and after sorting are:

k_means.cluster_centers_
array([[ 1.07705469, -1.06730994],
       [-1.07159013, -1.00648645],
       [ 0.96700708,  1.01837274]])

k_means_cluster_centers
array([[-1.07159013, -1.06730994],
       [ 0.96700708, -1.00648645],
       [ 1.07705469,  1.01837274]])

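There are three centers, so I assume each row holds the xy coordinates of one center. I'm not sure why they call np.sort() before pairing each point with its nearest center, since this scrambles the x/y coordinates of the centers. Maybe they just wanted to sort by the x or y axis?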
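
To see the effect in isolation, here is a minimal sketch that applies np.sort(..., axis=0) to the three centers printed above and checks whether each sorted row is still one of the original centers. The helper `rows_preserved` is a name made up for this illustration and is not part of scikit-learn or NumPy.

```python
import numpy as np

# Cluster centers as printed in the question (one row per center: x, y).
centers = np.array([[ 1.07705469, -1.06730994],
                    [-1.07159013, -1.00648645],
                    [ 0.96700708,  1.01837274]])

# axis=0 sorts every column on its own, so the x values and the y values
# are reordered independently of each other.
sorted_centers = np.sort(centers, axis=0)


def rows_preserved(original, candidate):
    """Hypothetical helper: for each row of `candidate`, is it an exact row of `original`?"""
    return np.array([(original == row).all(axis=1).any() for row in candidate])


preserved = rows_preserved(centers, sorted_centers)
print(sorted_centers)
print(preserved)
```

None of the sorted rows matches an original center exactly, which is the scrambling described above.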

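【Comments】:

  • I created an issue at GitHub. Let's see what happens...
  • The file on GitHub seems to have been corrected, but the website still shows the incorrect version with np.sort. I stumbled on this thread because I was wondering about np.sort after getting confusing results when trying the k-means approach outlined in the example linked above.

Tags: python scikit-learn k-means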


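【Answer 1】:

I think you are right. Sorting the way this example does mixes up the x and y coordinates of the points. That it works in the example is more or less a coincidence.

We have the x-coordinates [1, -1, 1] and the y-coordinates [-1, -1, 1]. After sorting they become [-1, 1, 1] and [-1, -1, 1], which happen to form the same three pairs we started with: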

# original | sorted
# [ 1, -1] | [-1, -1]
# [-1, -1] | [ 1, -1]
# [ 1,  1] | [ 1,  1]

Observe below how this breaks down when using four clusters. In this case we have:

# original | sorted
# [-1, -1] | [-1, -1]
# [-1,  1] | [-1, -1]
# [ 1, -1] | [ 1,  1]
# [ 1,  1] | [ 1,  1]

These are no longer the same points.
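
The two tables above can be reproduced with a short sketch. It uses the idealized integer centers from the tables, not fitted values, and `same_point_set` is a hypothetical helper written only for this check.

```python
import numpy as np


def same_point_set(a, b):
    """Hypothetical helper: True if a and b contain the same rows, ignoring row order."""
    return sorted(map(tuple, a.tolist())) == sorted(map(tuple, b.tolist()))


# Idealized centers from the two tables above.
three = np.array([[1, -1], [-1, -1], [1, 1]])
four = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])

# Column-wise sort, as done in the example code.
three_sorted = np.sort(three, axis=0)
four_sorted = np.sort(four, axis=0)

print(same_point_set(three, three_sorted))  # three centers: the set of points survives
print(same_point_set(four, four_sorted))    # four centers: two points are lost
print(four_sorted)
```

With four centers the sorted array contains [-1, -1] and [1, 1] twice each, so [-1, 1] and [1, -1] disappear.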

Modified example code:

print(__doc__)

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
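# make_blobs is importable from sklearn.datasets; the samples_generator
# module path was removed in later scikit-learn releases.
from sklearn.datasets import make_blobs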

# #############################################################################
# Generate sample data
np.random.seed(0)

batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# #############################################################################
# Compute clustering with KMeans

k_means = KMeans(init='k-means++', n_clusters=4, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0

# #############################################################################
# Compute clustering with MiniBatchKMeans

mbk = MiniBatchKMeans(init='k-means++', n_clusters=4, batch_size=batch_size,
                      n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0

# #############################################################################
# Plot result

fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ['#4EACC5', '#FF9C34', '#4E9A06', '#123456']

# We want to have the same colors for the same cluster from the
# MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
# closest one.
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
mbk_means_cluster_centers = np.sort(mbk.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)
order = pairwise_distances_argmin(k_means_cluster_centers,
                                  mbk_means_cluster_centers)

# KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8,  'train time: %.2fs\ninertia: %f' % (
    t_batch, k_means.inertia_))

# MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('MiniBatchKMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' %
         (t_mini_batch, mbk.inertia_))

# Initialise the different array to all False
different = (mbk_means_labels == 4)
ax = fig.add_subplot(1, 3, 3)

for k in range(n_clusters):
    different += ((k_means_labels == k) != (mbk_means_labels == order[k]))

identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w',
        markerfacecolor='#bbbbbb', marker='.')
ax.plot(X[different, 0], X[different, 1], 'w',
        markerfacecolor='m', marker='.')
ax.set_title('Difference')
ax.set_xticks(())
ax.set_yticks(())

plt.show()

A more appropriate sort could look like this:

# order cluster centers by their x and y coordinates, weighted by 1 and 0.1 respectively
k_order = np.argsort(k_means.cluster_centers_[:, 0] + k_means.cluster_centers_[:, 1]*0.1)
mbk_order = np.argsort(mbk.cluster_centers_[:, 0] + mbk.cluster_centers_[:, 1]*0.1)
k_means_cluster_centers = k_means.cluster_centers_[k_order]
mbk_means_cluster_centers = mbk.cluster_centers_[mbk_order]

However, the correct approach is to align the cluster centers first and then impose an (arbitrary) order. This should do the job:

mbk_order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
k_means_cluster_centers = k_means.cluster_centers_
mbk_means_cluster_centers = mbk.cluster_centers_[mbk_order]
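
One caveat with the nearest-center lookup: pairwise_distances_argmin picks the closest MiniBatchKMeans center for each KMeans center independently, so in principle two KMeans centers could map to the same one. As a hedged alternative (a swapped-in technique, not what the scikit-learn example does), the sketch below forces a one-to-one pairing with scipy.optimize.linear_sum_assignment. The center values are stand-ins invented for illustration, not the output of an actual fit.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import pairwise_distances

# Stand-in centers for illustration only (assumed values, not fitted).
k_centers = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
mbk_centers = np.array([[-0.9, 1.1], [1.1, -0.9], [0.9, 1.0], [-1.0, -1.1]])

# cost[i, j] = Euclidean distance between KMeans center i and
# MiniBatchKMeans center j.
cost = pairwise_distances(k_centers, mbk_centers)

# Minimum-total-distance one-to-one assignment. For a square cost matrix
# row_ind is 0..n-1, so mbk_order[i] is the partner of KMeans center i.
row_ind, mbk_order = linear_sum_assignment(cost)

# Rows stay intact; only their order changes.
aligned_mbk_centers = mbk_centers[mbk_order]
print(mbk_order)
print(aligned_mbk_centers)
```

When the two fits agree well, this gives the same pairing as the argmin lookup; it only differs in the degenerate case where the nearest-center lookup is not a permutation.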

【Comments】:

    【Answer 2】:

    I'm not sure why we use np.sort() here.

    The answer is in the code comment. However, there is a bug in how it is implemented; see below.

    # We want to have the same colors for the same cluster from the
    # MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
    # closest one.
    

    The pairing is done in the following lines of the example code:

    k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
    mbk_means_cluster_centers = np.sort(mbk.cluster_centers_, axis=0)
    (...)
    order = pairwise_distances_argmin(k_means_cluster_centers,
                                      mbk_means_cluster_centers)
    

    In the code, order is effectively used as a lookup table to find the cluster in mbk_means_cluster_centers that corresponds to each cluster in k_means_cluster_centers.

    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
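
A small sketch of how such a lookup table behaves, using made-up centers and labels rather than a real fit: order[k] is the index of the MiniBatchKMeans center nearest to KMeans center k, and it can be used to express the MiniBatchKMeans labels in the KMeans numbering. The variable `relabeled` is introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Assumed, unscrambled centers (illustrative values only).
k_centers = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
mbk_centers = np.array([[-1.1, -0.9], [0.9, -1.0], [1.0, 1.1]])

# order[k] = index of the MiniBatchKMeans center closest to KMeans center k.
order = pairwise_distances_argmin(k_centers, mbk_centers)

# Made-up MiniBatchKMeans labels for five points.
mbk_labels = np.array([0, 0, 1, 2, 2])

# Rewrite the labels in the KMeans numbering via the lookup table.
relabeled = np.full_like(mbk_labels, -1)
for k in range(len(k_centers)):
    relabeled[mbk_labels == order[k]] = k

print(order)
print(relabeled)
```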
    

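    It distorts the coordinates of the computed cluster centers.

    (Updated following the discussion in the comments)

    Indeed, using np.sort(..., axis=0) mixes up the center coordinates. The correct way to sort is np.lexsort, like this: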

    arr = k_means.cluster_centers_
    k_means_cluster_centers = arr[np.lexsort((arr[:, 0], arr[:, 1]))]
    
    arr = mbk.cluster_centers_
    mbk_means_cluster_centers = arr[np.lexsort((arr[:, 0], arr[:, 1]))]
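
As a quick check of what np.lexsort does here, the sketch below sorts four idealized centers (assumed values, not fitted ones). Note that np.lexsort treats the last key as the primary one, so with the key tuple (x, y) the rows are ordered by y first and x only breaks ties.

```python
import numpy as np

# Idealized centers for illustration (one row per center: x, y).
arr = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])

# Last key is the primary sort key: order by y (column 1), then by x (column 0).
idx = np.lexsort((arr[:, 0], arr[:, 1]))

# Indexing with idx permutes whole rows, so every (x, y) pair stays together.
lex_sorted = arr[idx]

print(idx)
print(lex_sorted)
```

Because whole rows are permuted, the set of centers is unchanged. Whether two independently sorted sets of centers end up in matching order still depends on the two fits agreeing on that ordering, for example when several centers have nearly equal y values.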
    

    This does indeed change the result of the example:

    [figure: using sort(..., axis=0)]

    [figure: using np.lexsort]

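    【Comments】:

    • For example, one center is [1.07705469, -1.06730994]. After np.sort() it becomes [1.07705469, 1.01837274].
    • The problem is that np.sort sorts the coordinates independently, i.e. it does not preserve the information that an x and a y coordinate belong to the same point.
    • Indeed. I agree that this makes no sense and is a bug in the code. It seems to have been introduced in this commit: github.com/scikit-learn/scikit-learn/commit/…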