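Reordering cluster numbers for correct correspondence

【Question title】: Reordering cluster numbers for correct correspondence
【Posted】: 2016-11-24 08:13:29
【Question】:

I have a dataset that I clustered using two different clustering algorithms. The results are roughly the same, but the cluster numbers are permuted. To display color-coded labels, I want the same cluster to have the same label ID in both results. How can I find the correct permutation between the two sets of label IDs?

I could do this by brute force, but perhaps there is a better or faster method. I would greatly appreciate any help or pointers. If possible, I am looking for a Python function.

【Comments】:

Tags: python python-2.7 cluster-analysis permutation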

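【Solution 1】:

The best-known algorithm for finding the optimal matching is the Hungarian method.

Because it cannot be explained in a few sentences, I have to refer you to a book of your choice, or to the Wikipedia article "Hungarian algorithm".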
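As a rough illustration of how the Hungarian method could be applied here, the sketch below builds the contingency table of the two labelings with NumPy and solves the assignment with SciPy's `linear_sum_assignment`. The function name `align_labels_hungarian` and the example arrays are made up for illustration, and integer labels are assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_labels_hungarian(ref_labels, other_labels):
    """Relabel other_labels so that its clusters best match ref_labels.

    Hypothetical helper for illustration; assumes 1-D integer labels.
    """
    ref_values, ref_inv = np.unique(np.asarray(ref_labels).ravel(), return_inverse=True)
    other_values, other_inv = np.unique(np.asarray(other_labels).ravel(), return_inverse=True)

    # Contingency table: cont[i, j] counts samples that are in
    # ref cluster i and in other cluster j.
    cont = np.zeros((ref_values.size, other_values.size), dtype=int)
    np.add.at(cont, (ref_inv, other_inv), 1)

    # linear_sum_assignment minimizes cost, so negate to maximize overlap.
    row_idx, col_idx = linear_sum_assignment(-cont)

    # Columns left unmatched (only possible when the cluster counts differ)
    # keep their original label, which may then collide with a ref label.
    new_label_for_col = other_values.copy()
    new_label_for_col[col_idx] = ref_values[row_idx]
    return new_label_for_col[other_inv]


ref = [0, 0, 0, 1, 1, 1, 2, 2, 2]
other = [1, 1, 1, 2, 2, 0, 0, 0, 0]
print(align_labels_hungarian(ref, other))  # [0 0 0 1 1 2 2 2 2]
```

Negating the table is used instead of a `maximize` argument so that the sketch does not depend on a recent SciPy version.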

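You will probably get good results (even perfect ones, if the difference between the clusterings is indeed tiny) by simply picking the maximum of the correspondence matrix, then removing that row and column, and repeating.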
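The greedy shortcut described above might look like the following sketch. It is only an illustration: the name `align_labels_greedy` is hypothetical, integer labels are assumed, and "removing" a row or column is done by overwriting it with -1, which is safe because the counts are never negative.

```python
import numpy as np


def align_labels_greedy(ref_labels, other_labels):
    """Greedy matching: repeatedly take the largest remaining overlap.

    Hypothetical helper for illustration; assumes 1-D integer labels.
    """
    ref_values, ref_inv = np.unique(np.asarray(ref_labels).ravel(), return_inverse=True)
    other_values, other_inv = np.unique(np.asarray(other_labels).ravel(), return_inverse=True)

    # Correspondence (contingency) matrix: rows are ref clusters,
    # columns are clusters of the labeling to be remapped.
    cont = np.zeros((ref_values.size, other_values.size), dtype=float)
    np.add.at(cont, (ref_inv, other_inv), 1)

    new_label_for_col = other_values.copy()
    for _ in range(min(cont.shape)):
        # Position of the largest remaining entry (first one on ties).
        r, c = np.unravel_index(np.argmax(cont), cont.shape)
        new_label_for_col[c] = ref_values[r]
        # Remove the matched row and column from further consideration.
        cont[r, :] = -1
        cont[:, c] = -1
    return new_label_for_col[other_inv]


ref = [0, 0, 0, 1, 1, 1, 2, 2, 2]
other = [1, 1, 1, 2, 2, 0, 0, 0, 0]
print(align_labels_greedy(ref, other))  # [0 0 0 1 1 2 2 2 2]
```

Unlike the Hungarian method, this is not guaranteed to find the best overall matching; ties are broken by position, so strongly disagreeing clusterings can end up with a poor assignment.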

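【Comments】:

【Solution 2】:

I have a function that works for me. It may fail when the two clustering results are very inconsistent, because that leads to duplicated maximum values in the contingency matrix. If your clustering results are roughly the same, it should work.

Here is my code: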

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
    """
    Remap the cluster indices of map_cluster according to ref_cluster.
    Both inputs must be NumPy arrays with the same number of unique cluster index values.
    Note: map_cluster is modified in place and also returned.

    Xin Niu Jan-15-2020
    """

    ref_values = np.unique(ref_cluster)
    map_values = np.unique(map_cluster)

    print(ref_values)
    print(map_values)

    num_values = ref_values.shape[0]

    if ref_values.shape[0] != map_values.shape[0]:
        print('error: both inputs must have same number of unique cluster index values.')
        return ()

    switched_col = set()
    while True:
        cont_mat = contingency_matrix(ref_cluster, map_cluster)
        print(cont_mat)
        # divide contingency_matrix by its row and col sums to avoid potential duplicated values:
        col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
        row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
        print(col_sum)
        print(row_sum)

        cont_mat = cont_mat / (col_sum + row_sum)
        print(cont_mat)

        # ignore columns that have been switched:
        cont_mat[:, list(switched_col)] = -1

        print(cont_mat)

        sort_0 = np.argsort(cont_mat, axis=0)
        sort_1 = np.argsort(cont_mat, axis=1)

        print('argsort contmat:')
        print(sort_0)
        print(sort_1)

        # stop once every row's largest entry lies on the diagonal:
        if np.array_equal(sort_1[:, -1], np.array(range(num_values))):
            break

        # switch values according to the max value in the contingency matrix:
        # get the position of max value:
        idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
        print(cont_mat)
        print(idx_max)

        if (cont_mat[idx_max] > 0) and (idx_max[0] not in switched_col):
            cluster_tmp = map_cluster.copy()
            print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
            map_cluster[cluster_tmp == map_values[idx_max[1]]] = ref_values[idx_max[0]]
            map_cluster[cluster_tmp == map_values[idx_max[0]]] = ref_values[idx_max[1]]

            switched_col.add(idx_max[0])
            print(switched_col)

        else:
            break

    print('final argsort contmat:')
    print(sort_0)
    print(sort_1)

    print('final cont_mat:')
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
    cont_mat = cont_mat / (col_sum + row_sum)

    print(cont_mat)

    return map_cluster

Here is some test code:

ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])

c = align_cluster_index(ref_cluster, map_cluster)
print(ref_cluster)
print(c)

>>>[2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>>[2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]
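
One simple way to judge a remapping like this is to count how many samples carry the same label in both arrays before and after alignment. The sketch below does that for the arrays printed above; the helper name `count_agreements` is made up for illustration, and the `aligned` array is copied from the posted output rather than recomputed.

```python
import numpy as np

# Arrays taken from the test code and its printed output above.
ref_cluster = np.array([2, 2, 3, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0, 3, 3, 3, 3])
original = np.array([0, 0, 0, 1, 1, 3, 2, 3, 2, 2, 0, 0, 0, 2, 0, 3, 3])
aligned = np.array([2, 2, 2, 1, 1, 3, 0, 3, 0, 0, 2, 2, 2, 0, 2, 3, 3])


def count_agreements(a, b):
    """Number of positions where the two labelings use the same ID."""
    return int(np.sum(np.asarray(a) == np.asarray(b)))


before = count_agreements(ref_cluster, original)
after = count_agreements(ref_cluster, aligned)
print(before, after)  # 5 8
```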

【Comments】: