Answers: k-means with a centroid constraint

【Question title】: k-means with a centroid constraint
【Posted】: 2017-11-04 05:33:27
【Question description】:

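I'm working on a data science project for my intro to data science class, and we've decided to tackle a problem related to desalination plants in California: "Where should we place k plants to minimize the distance to zip codes?"

The data we have so far is: zip, city, county, pop, latitude, longitude, amount of water.

The problem is that I can't find any resources on how to force the centroids to stay on the coast. What we've thought of so far:
1. Use a normal k-means algorithm, but move the centroids to the coast once the clusters have settled (bad).
2. Use a normal k-means algorithm with weights, giving the coastal zip codes infinite weight (I've been told this isn't a great solution).

What do you think?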

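【Question discussion】:

IANA data scientist, but could you discretize the coastline and then pick the coastline point with the smallest within-cluster sum of squares? Redefining the update step would be harder, though. I'll elaborate on this now.
I don't think my original idea would scale well. Instead, you could redefine the update step to project the new mean back onto the coast. That should be straightforward: first compute the new mean, then find the point on the coast closest to that new mean. The algorithm will keep trying to pull the means away from the coast, and you will have to keep pushing them back. I expect that eventually the deltas will be perpendicular to the coast, but that's just a guess.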
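
A minimal sketch of that projected update step, assuming the coastline has already been discretized into an array of sample points and that plain Euclidean distance is acceptable. The names `coast`, `project_to_coast` and `projected_kmeans` are hypothetical and do not come from any library:

```python
import numpy as np

def nearest_labels(X, centers):
    """Index of the nearest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def project_to_coast(point, coast):
    """Return the coast sample closest to `point` (coast is an (m, 2) array)."""
    return coast[np.argmin(np.linalg.norm(coast - point, axis=1))]

def projected_kmeans(X, coast, k, n_iter=100, seed=0):
    """k-means whose update step snaps each new mean back onto the coast."""
    rng = np.random.default_rng(seed)
    # Start from k distinct coast samples so every center is feasible.
    centers = coast[rng.choice(len(coast), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_labels(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old center
                new_centers[j] = project_to_coast(members.mean(axis=0), coast)
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_labels(X, centers)
```

Because every center is replaced by a coast sample, the result is always feasible, but nothing here guarantees that it is the best feasible placement.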

Tags: python algorithm k-means data-science

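【Solution 1】:

K-means does not minimize distances.

It minimizes squared errors, which is quite different. The difference is roughly that between the median and the mean of one-dimensional data: the mean minimizes the sum of squared deviations, while the median minimizes the sum of absolute deviations. The error can be massive.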
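
To make the mean/median distinction concrete, here is a small one-dimensional check. The data values are made up for illustration, and the search is a brute-force grid scan rather than anything k-means itself does:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 100.0])  # made-up data with one outlier
grid = np.linspace(0, 100, 10001)           # candidate centers, step 0.01

# Cost of every candidate center under the two objectives.
sq_cost = ((x[None, :] - grid[:, None]) ** 2).sum(axis=1)
abs_cost = np.abs(x[None, :] - grid[:, None]).sum(axis=1)

best_sq = grid[sq_cost.argmin()]    # minimizer of squared error
best_abs = grid[abs_cost.argmin()]  # minimizer of absolute distance

print("squared error is minimized near", best_sq, "; mean =", x.mean())
print("absolute distance is minimized near", best_abs, "; median =", np.median(x))
```

The single outlier drags the squared-error minimizer far away from where most of the data sits, while the distance minimizer stays with the bulk of the points.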

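Here is a counterexample. Assume we have the coordinates:

The center chosen by k-means would be (0, 25). The optimal location is (0, 0). The sum of distances for the k-means center is > 152, while the optimal location has a total distance of 104. So the centroid here is almost 50% worse than the optimum! But the centroid (= multivariate mean) is what k-means uses!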
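
These figures can be checked numerically. The four points below are an assumption: they were picked only because they reproduce the quoted centroid and distance sums, and they are not necessarily the coordinates the answer had in mind.

```python
import numpy as np

# Hypothetical coordinates, chosen to be consistent with the numbers quoted above.
pts = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 101.0]])

def total_distance(center, pts):
    """Sum of Euclidean distances from `center` to every point."""
    return np.linalg.norm(pts - center, axis=1).sum()

centroid = pts.mean(axis=0)  # what k-means uses for a single cluster
cost_centroid = total_distance(centroid, pts)
cost_origin = total_distance(np.array([0.0, 0.0]), pts)

print("centroid:", centroid)
print("total distance from centroid:", cost_centroid)
print("total distance from (0, 0):", cost_origin)
```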

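k-means does not minimize Euclidean distance!

This is one variant of "k-means is sensitive to outliers".

It does not get any better if you try to constrain it to place "centers" only on the coast...

Also, you may want to use at least the haversine distance, because in California one degree north does not cover the same ground distance as one degree east, since California is not on the equator.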
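
A minimal sketch of the haversine distance, assuming a spherical Earth with a mean radius of 6371 km (the function name `haversine_km` and the sample coordinates are illustrative only):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree north vs. one degree east, starting at 36 N, 120 W.
north = haversine_km(36.0, -120.0, 37.0, -120.0)  # about 111 km
east = haversine_km(36.0, -120.0, 36.0, -119.0)   # about 90 km
print(north, east)
```

At this latitude a degree of longitude is roughly 20% shorter than a degree of latitude, so treating raw latitude/longitude as planar coordinates distorts east-west distances.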

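Furthermore, you probably should not assume that every location needs its own pipe; rather, the locations would be connected like a tree. This greatly reduces the cost.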
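
As a rough illustration of the gap between "one pipe per location" and "connected like a tree", the sketch below compares the total length of direct pipes with the length of a minimum spanning tree. Using a minimum spanning tree is an assumption made for this example; it ignores pipe capacities, terrain and extra junction points. The coordinates are made up.

```python
import numpy as np

def mst_length(points):
    """Total edge length of a minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest known link from the tree to each node
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf          # never re-add nodes already in the tree
        j = int(best.argmin())
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return total

# Row 0 is the plant; the remaining rows are towns lined up far from it.
points = np.array([[0, 0], [10, 0], [10, 1], [10, 2], [10, 3]], dtype=float)
star = np.linalg.norm(points[1:] - points[0], axis=1).sum()  # one pipe per town
tree = mst_length(points)                                    # shared pipes
print("separate pipes:", star, "tree:", tree)
```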

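I strongly suggest treating this as a generic optimization problem rather than as k-means. K-means is an optimization too, but it may optimize the wrong function for your problem...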
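
One way to phrase it as a generic optimization problem is to choose k sites from a finite list of candidate coast locations so that the (optionally weighted) total distance to the demand points is minimal. The sketch below does this by exhaustive search, which is only practical for small inputs. The function name `best_sites`, the use of Euclidean distance and the optional `weights` (for example population or water demand) are assumptions made for illustration:

```python
from itertools import combinations

import numpy as np

def best_sites(demand, candidates, k, weights=None):
    """Exhaustively pick k candidate sites minimizing weighted total distance."""
    if weights is None:
        weights = np.ones(len(demand))
    # dist[i, j] = distance from demand point i to candidate site j
    dist = np.linalg.norm(demand[:, None, :] - candidates[None, :, :], axis=2)
    best_cost, best_combo = np.inf, None
    for combo in combinations(range(len(candidates)), k):
        # Each demand point is served by its nearest chosen site.
        cost = (weights * dist[:, list(combo)].min(axis=1)).sum()
        if cost < best_cost:
            best_cost, best_combo = cost, combo
    return candidates[list(best_combo)], best_cost

demand = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 5.0], [1.0, 5.1]])
candidates = np.array([[0.0, 0.0], [0.0, 2.5], [0.0, 5.0]])  # "coast" at x = 0
sites, cost = best_sites(demand, candidates, k=2)
print(sites, cost)
```

Unlike k-means, the objective here is the sum of distances itself, and the constraint is built in because only candidate sites can be chosen. The number of combinations grows quickly, so larger instances would need a heuristic or a proper solver.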

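【Discussion】:

【Solution 2】:

I would approach this by defining a set of possible points that are allowed to be centers, i.e. your coastline.
I think this is close to Nathaniel Saul's first comment.
That way, in each iteration, instead of taking the mean, a point from the possible set is chosen by its proximity to the cluster.

I have simplified the setup to only two data columns (longitude and latitude), but you should be able to extrapolate the concept. For simplicity, and for demonstration purposes, I based this on the code from here (the source URL is in the code comments).

In this example, the purple dots are locations on the coastline. If I understood correctly, the optimal coastline locations should look something like this:

See the code below: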

#! /usr/bin/python3.6

# Code based on:
# https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

import matplotlib.pyplot as plt
import numpy as np
import random

##### Simulation START #####
# Generate possible points.
def possible_points(n=20):
    y=list(np.linspace( -1, 1, n ))
    x=[-1.2]
    X=[]
    for i in list(range(1,n)):
        x.append(x[i-1]+random.uniform(-2/n,2/n) )
    for a,b in zip(x,y):
        X.append(np.array([a,b]))
    X = np.array(X)
    return X

# Generate sample
def init_board_gauss(N, k):
    n = float(N)/k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05,0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a,b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
##### Simulation END #####    

# Identify points for each center.
def cluster_points(X, mu):
    clusters  = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
                    for i in enumerate(mu)], key=lambda t:t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

# Get the possible point with the smallest total distance to a cluster.
def closest_point(cluster,possiblePoints):
    closestPoints=[]
    # Compute the total distance from each possible point to the cluster.
    for possible in possiblePoints:
        distances=[]
        for point in cluster:
            distances.append(np.linalg.norm(possible-point))
        # Append once per possible point (outside the inner loop), so that
        # the index found below lines up with possiblePoints.
        closestPoints.append(np.sum(distances)) # minimize total distance
        # closestPoints.append(np.mean(distances))
    return possiblePoints[closestPoints.index(min(closestPoints))]

# Calculate new centers.
# Here the 'coast constraint' goes.
def reevaluate_centers(clusters,possiblePoints):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(closest_point(clusters[k],possiblePoints))
    return newmu

# Check whether centers converged.
def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

# Meta function that runs the steps of the process in sequence.
def find_centers(X, K, possiblePoints):
    # Initialize to K random centers taken from the possible points
    mu = random.sample(list(possiblePoints), K)
    # Always run at least one iteration so that clusters is defined,
    # even if two independent random draws would happen to coincide.
    while True:
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Re-evaluate centers
        mu = reevaluate_centers(clusters,possiblePoints)
        if has_converged(mu, oldmu):
            break
    return(mu, clusters)


K=3
X = init_board_gauss(30,K)
possiblePoints=possible_points()
results=find_centers(X,K,possiblePoints)

# Show results
centers, clusters = results

# Show constraints (the possible points) in magenta
plt.plot(possiblePoints[:, 0], possiblePoints[:, 1], 'm.')

# List point types for centers and for cluster members
centertypes=["gx","gD","g*"]
pointtypes=["bx","yD","c*"]

# Show each center together with the points of its cluster.
# Iterating over the sorted keys matches the order used in
# reevaluate_centers and avoids a KeyError if a cluster ended up empty.
for i, key in enumerate(sorted(clusters)):
    members = np.array(clusters[key])
    plt.plot(centers[i][0], centers[i][1], centertypes[i])
    plt.plot(members[:, 0], members[:, 1], pointtypes[i])
plt.show()

Edited to minimize total distance.

【Discussion】: