【Question title】: Equivalent of Matlab's cluster quality function?
【Posted】: 2011-10-02 10:24:37
【Question】:

MATLAB has a nice silhouette function to help evaluate the number of clusters for k-means. Does Python's Numpy/Scipy have an equivalent?

【Comments】:

    Tags: python matlab numpy cluster-analysis scipy


    【Answer 1】:

    Below is an example silhouette implementation in both MATLAB and Python/Numpy (keep in mind that I am more fluent in MATLAB):

    1) MATLAB

    function s = mySilhouette(X, IDX)
        %# X  : matrix of size N-by-p, data where rows are instances
        %# IDX: vector of size N, cluster index of each instance (starting from 1)
        %# s  : vector of size N, silhouette score value of each instance
    
        N = size(X,1);            %# number of instances
        K = numel(unique(IDX));   %# number of clusters
    
        %# compute pairwise squared Euclidean distance matrix (N-by-N)
        D = squareform( pdist(X,'euclidean').^2 );
    
        %# indices belonging to each cluster
        kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});
    
        %# compute a,b,s for each instance
        %# a(i): average distance from i to all other data within the same cluster.
        %# b(i): lowest average dist from i to the data of another single cluster
        a = zeros(N,1);
        b = zeros(N,1);
        for i=1:N
            ind = kIndices{IDX(i)}; ind = ind(ind~=i);
            a(i) = mean( D(i,ind) );
            b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
        end
        s = (b-a) ./ max(a,b);
    end
    

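    To mimic the plot produced by MATLAB's silhouette function, we group the silhouette values by cluster, sort them within each cluster, and draw the bars horizontally. MATLAB adds NaNs to separate the bars of different clusters; I found it easier to simply color-code the bars: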

    %# sample data
    load fisheriris
    X = meas;
    N = size(X,1);
    
    %# cluster and compute silhouette score
    K = 3;
    [IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
    s = mySilhouette(X, IDX);
    
    %# plot
    [~,ord] = sortrows([IDX s],[1 -2]);
    indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
    ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
    ytickLabels = num2str((1:K)','%d');           %#'
    
    h = barh(1:N, s(ord),'hist');
    set(h, 'EdgeColor','none', 'CData',IDX(ord))
    set(gca, 'CLim',[1 K], 'CLimMode','manual')
    set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
    xlabel('Silhouette Value'), ylabel('Cluster')
    
    %# compare against SILHOUETTE
    figure, silhouette(X,IDX)
    


    2) Python

    Here is what I came up with in Python:

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.spatial.distance import pdist, squareform
    from sklearn import datasets
    import matplotlib.pyplot as plt
    from matplotlib import cm
    
    def silhouette(X, cIDX):
        """
        Computes the silhouette score for each instance of a clustered dataset,
        which is defined as:
            s(i) = (b(i)-a(i)) / max{a(i),b(i)}
        with:
            -1 <= s(i) <= 1
    
        Args:
            X    : An N-by-p array of N observations in p dimensions
            cIDX : array of length N containing cluster indices (starting from zero)
    
        Returns:
            s    : silhouette value of each observation
        """
    
        N = X.shape[0]              # number of instances
        K = len(np.unique(cIDX))    # number of clusters
    
        # compute pairwise Euclidean distance matrix (not squared, unlike the MATLAB version)
        D = squareform(pdist(X))
    
        # indices belonging to each cluster
        kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]
    
        # compute a,b,s for each instance
        a = np.zeros(N)
        b = np.zeros(N)
        for i in range(N):
            # instances in same cluster other than instance itself
            a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
            # instances in other clusters, one cluster at a time
            b[i] = np.min( [np.mean(D[i][ind]) 
                            for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
        s = (b-a)/np.maximum(a,b)
    
        return s
    
    def main():
        # load Iris dataset
        data = datasets.load_iris()
        X = data['data']
    
        # cluster and compute silhouette score
        K = 3
        C, cIDX = kmeans2(X, K)
        s = silhouette(X, cIDX)
    
        # plot
        order = np.lexsort((-s,cIDX))
        indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
        ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
        ytickLabels = ["%d" % x for x in range(K)]
        cmap = cm.jet( np.linspace(0,1,K) ).tolist()
        clr = [cmap[i] for i in cIDX[order]]
    
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.barh(range(X.shape[0]), s[order], height=1.0, 
                edgecolor='none', color=clr)
        ax.set_ylim(ax.get_ylim()[::-1])
        plt.yticks(ytick, ytickLabels)
        plt.xlabel('Silhouette Value')
        plt.ylabel('Cluster')
        plt.show()
    
    if __name__ == '__main__':
        main()
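
    The per-instance loop above can also be written without explicit Python loops. The following is a minimal NumPy-only sketch, not part of the original answer: the name silhouette_vectorized is hypothetical, it uses plain Euclidean distances like the Python version above, and it assumes at least two clusters. Unlike the loop version, it assigns 0 to an instance that is alone in its cluster, which is a common convention rather than something the answer specifies.

```python
import numpy as np

def silhouette_vectorized(X, labels):
    """Per-instance silhouette values (hypothetical helper, Euclidean distances)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).ravel()
    # map arbitrary labels to 0..K-1
    uniq, inv = np.unique(labels, return_inverse=True)
    n, k = X.shape[0], uniq.size

    # N-by-N Euclidean distance matrix via broadcasting
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # one-hot membership matrix (N-by-K) and cluster sizes
    member = np.eye(k)[inv]
    sizes = member.sum(axis=0)

    # sums[i, c] = total distance from instance i to all instances in cluster c
    sums = D @ member
    own = (np.arange(n), inv)
    own_size = sizes[inv]

    # a(i): mean distance to the other members of i's own cluster
    a = np.divide(sums[own], own_size - 1, out=np.zeros(n), where=own_size > 1)

    # b(i): smallest mean distance from i to any other cluster
    means = sums / sizes
    means[own] = np.inf
    b = means.min(axis=1)

    s = (b - a) / np.maximum(a, b)
    s[own_size == 1] = 0.0   # assumed convention for singleton clusters
    return s

# four 1-D points in two clusters
X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_vectorized(X, labels)
print(np.round(s, 3))
```

    For the four points in the example, a is 1 for every point, and b is 5.5 for the two outer points and 4.5 for the two inner ones, so the values come out at about 0.818 and 0.778. The intermediate N-by-N-by-p difference array costs memory, so this form only suits moderately sized data.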
    


    Update:

    As others have pointed out, scikit-learn has since added its own silhouette metric implementation. To use it in the code above, replace the call to the custom silhouette function with:

    from sklearn.metrics import silhouette_samples
    
    ...
    
    #s = silhouette(X, cIDX)
    s = silhouette_samples(X, cIDX)    # <-- scikit-learn function
    
    ...
    

    The rest of the code can still be used as-is to generate the exact same plot.
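
    Since the question is about choosing the number of clusters, one way to use the scikit-learn metric is to compare the mean silhouette across several values of K. The following is only a sketch under stated assumptions: it swaps in scikit-learn's KMeans (instead of the kmeans2 call used above) so that the run is reproducible, it uses silhouette_score, which returns the mean of the per-instance values, and the helper name mean_silhouette_by_k and its parameters are hypothetical.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def mean_silhouette_by_k(X, candidate_ks, seed=0):
    """Return {k: mean silhouette} for each candidate number of clusters."""
    scores = {}
    for k in candidate_ks:
        # n_init and random_state are fixed so repeated runs give the same labels
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores

X = datasets.load_iris()['data']
scores = mean_silhouette_by_k(X, range(2, 7))   # silhouette needs at least 2 clusters
best_k = max(scores, key=scores.get)

for k, value in scores.items():
    print(k, round(value, 3))
print("K with the highest mean silhouette:", best_k)
```

    A higher mean silhouette suggests tighter, better separated clusters, but it is a heuristic: the per-instance plot above is still worth inspecting before settling on a value of K.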

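    【Comments】:

    • Sorry I didn't get back to this sooner. Thank you very much for spending so much time on this. I really appreciate it. From my initial runs on my data, the results look very good!
    • Hi Amro. I would like to know what D = squareform( pdist(X,'euclidean').^2 ) means. I have 5 rows and 3 columns, and D gives me 5 rows and 5 columns. What formula is it? How does it work? Or could you point me to a source link about this calculation? Thank you. :)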
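
    On the question in the last comment, a minimal sketch using the SciPy functions of the same names (this is not a reply from the original thread, and the data values are made up for illustration). pdist returns one Euclidean distance per unordered pair of rows, so 5 rows give 5*4/2 = 10 values; squareform then arranges them into a symmetric 5-by-5 matrix where entry (i, j) is the distance between rows i and j. Squaring in between yields squared Euclidean distances, as in the MATLAB code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# 5 observations (rows) in 3 dimensions (columns); made-up values
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [1.0, 2.0, 3.0]])

# condensed form: one Euclidean distance per unordered pair of rows
condensed = pdist(X, 'euclidean')

# square the distances, then expand to a symmetric N-by-N matrix
D = squareform(condensed ** 2)

print(condensed.shape)   # 10 pairwise distances
print(D.shape)           # 5-by-5 matrix
print(D[0, 4])           # squared distance between rows 0 and 4, about 1 + 4 + 9 = 14
```

    The diagonal of D is zero because each row is at distance zero from itself, which is why the size of D depends only on the number of rows and not on the number of columns.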
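    【Answer 2】:

    I looked, but I could not find a silhouette function in numpy/scipy; I even checked pylab and matplotlib. I think you will have to implement it yourself.

    I can point you to http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462. It has some functions that implement the silhouette computation.

    Hope this helps.

    【Comments】: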

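      【Answer 3】:

      This is a bit late, but it is worth mentioning that scikits-learn now seems to implement a silhouette function. See their documentation page or look directly at the source code.

      【Comments】: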
