【问题标题】：Equivalent of Matlab's cluster quality function?相当于Matlab的聚类质量函数？
【发布时间】：2011-10-02 10:24:37
【问题描述】：

MATLAB 有一个很好的 silhouette function 来帮助评估 k-means 的集群数量。 Python 的 Numpy/Scipy 是否也有等价物？

【问题讨论】：

标签： python matlab numpy cluster-analysis scipy

【解决方案1】：

我在下面展示了一个在 MATLAB 和 Python/Numpy 中实现的示例 silhouette（请记住，我在 MATLAB 中更流利）：

1) MATLAB

function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end

为了模拟 MATLAB 中 silhouette 函数的绘图，我们将轮廓值按簇分组，在每个簇内排序，然后水平绘制条形图。 MATLAB 添加了NaNs 以将条形与不同的集群分开，我发现简单地对条形进行颜色编码更容易：

%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');           %#'

h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)

2) 蟒蛇

这是我在 Python 中提出的：

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """

    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind]) 
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0, 
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()

更新：

正如其他人所指出的，scikit-learn 此后添加了自己的silhouette metric implementation。要在上面的代码中使用它，将对自定义silhouette函数的调用替换为：

from sklearn.metrics import silhouette_samples

...

#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX)    # <-- scikit-learn function

...

其余代码仍可按原样使用以生成完全相同的图。

【讨论】：

对不起，我没有早点回过头来。非常感谢您花这么多时间在这上面。真的很感激。从我对数据的初步运行来看，结果看起来非常好！
嗨，阿姆罗。我想知道D = squareform( pdist(X,'euclidean').^2 )是什么意思我有5行3列，D给了我5行5列。它是什么公式？这个怎么运作？或者你能告诉我一些关于这个计算的源链接吗？谢谢你。 :)

【解决方案2】：

我看过，但我找不到 numpy/scipy 剪影函数，我什至在 pylab 和 matplotlib 中查看。我认为你必须自己实现它。

我可以将您指向http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462。它有一些实现剪影功能的功能。

希望这会有所帮助。

【讨论】：

【解决方案3】：

这有点晚了，但值得一提的是，scikits-learn 现在似乎实现了剪影功能。见their documentation page或直接查看source code。

【讨论】：