优化邓恩指数计算？答案

【问题标题】：Optimizing Dunn Index calculation?优化邓恩指数计算？
【发布时间】：2020-06-25 05:09:58
【问题描述】：

邓恩指数是一种评估聚类的方法。值越高越好。它的计算方法是最小的簇间距离（即任何两个簇质心之间的最小距离）除以最高的簇内距离（即任何簇中任意两点之间的最大距离）。

我有一个代码 sn-p 用于计算邓恩指数：

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    numerator = inf
    for c in cf: # for each cluster
        for t in cf: # for each cluster
            if t is c: continue # if same cluster, ignore
            numerator = min(numerator, distance(t, c)) # find distance between centroids
    denominator = 0
    for c in cf: # for each cluster
        for p in pf: # for each point
            if p.get_cluster() is not c: continue # if point not in cluster, ignore
            for t in pf: # for each point
                if t.get_cluster() is not c: continue # if point not in cluster, ignore
                if t is p: continue # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator/denominator

问题是这非常慢：对于一个由 5000 个实例和 15 个集群组成的示例数据集，上面的函数在最坏的情况下需要执行超过 3.75 亿次距离计算。实际上它要低得多，但即使是数据已经按集群排序的最佳情况，也需要大约 2500 万次距离计算。我想节省时间，我已经尝试过直线距离与欧几里得距离，但效果不佳。

如何改进这个算法？

【问题讨论】：

你有几个维度？
@hilberts_drinking_problem 只有两个。

标签： python-3.x artificial-intelligence cluster-analysis k-means

【解决方案1】：

TLDR：重要的是，问题是在二维中设置的。对于大尺寸，这些技术可能无效。

在 2D 中，我们可以在 O(n log n) 时间计算每个簇的直径（簇内距离），其中 n 是使用凸包的簇大小。矢量化用于加速剩余操作。文末提到了两种可能的渐近改进，欢迎投稿；）

设置和伪造数据：

import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt

# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)

看起来有点像这样：

接下来，我们定义了一个 diameter 函数，用于计算最大的簇内距离，基于使用凸包的 this 方法。

# compute the diameter based on convex hull 
def diameter(pts):
  # need at least 3 points to construct the convex hull
  if pts.shape[0] <= 1:
    return 0
  if pts.shape[0] == 2:
    return ((pts[0] - pts[1])**2).sum()
  # two points which are fruthest apart will occur as vertices of the convex hull
  hull = spatial.ConvexHull(pts)
  candidates = pts[spatial.ConvexHull(pts).vertices]
  return spatial.distance_matrix(candidates, candidates).max()

对于 Dunn 指数计算，我假设我们已经计算了点、聚类标签和聚类质心。

如果集群的数量很大，以下基于 Pandas 的解决方案可能会表现良好：

import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas)[0].max()
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

否则，我们可以继续使用纯 numpy 解决方案。

def dunn_index(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms

对于具有 i.i.d. ~U[1,1000] 集群大小的 1000 集群，这需要 2.2。在我的机器上秒。对于此示例（许多小集群），使用 Pandas 方法，这个数字下降到 0.8 秒。

当集群数量很大时，还有两个相关的进一步优化机会：

首先，我使用蛮力 O(k^2) 方法计算最小集群间距离，其中 k 是集群的数量。如here 所讨论的，这可以简化为O(k log(k))。
其次，max(diameter(pts[labels==i]) for i in np.unique(labels)) 需要 k 传递大小为 n 的数组。对于许多集群，这可能成为瓶颈（如本例所示）。使用 pandas 方法可以稍微缓解这种情况，但我希望这可以进一步优化。对于当前参数，大约三分之一的计算时间花费在计算簇内距离的簇间之外。

【讨论】：

这是一个非常聪明的方法，使用凸包上的点。

【解决方案2】：

这与优化算法本身无关，但我认为以下建议之一可以提高性能。

使用multiprocessing的pool of workers。
将python 代码提取到c/cpp。参考official documentation。

https://www.python.org 上还有Performance Tips。

【讨论】：