Assigning random centroids by index

【Question title】: Assigning random centroids by index
【Posted】: 2020-08-10 20:04:20
【Question description】:

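I am trying to assign a random centroid to 2D data (i.e., each 2D point should be assigned to a single centroid that was chosen at random beforehand from the data points).

Code 1 demonstrates a simple example in which:

points is a numpy.ndarray holding 10 two-dimensional data points.
5 random point indices (indices into the points array) are selected as the initial cluster centroids.
A pandas.DataFrame is created from the 10 points in the points array, with column labels corresponding to the point coordinates (i.e., x1 and x2).

Code 1: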

import numpy as np
import pandas as pd

points = np.array(
    [
        [ 4.5,  7.0],
        [ 8.8,  7.7],
        [-9.2, -7.2],
        [-1.3,  6.9],
        [-5.7, -8.3],
        [ 8.8, -2.8],
        [-3.8, -6.7],
        [ 1.3,  4.5],
        [ 9.4,  8.5],
        [ 0.4,  1.5],
    ]
)

n_clusters = 5  # number of initial centroids to draw (5 in this example)
init_centroids_idx = np.random.choice(points.shape[0], n_clusters, replace=False)
print(f'initial centroid indices: {init_centroids_idx}')

data_df = pd.DataFrame(
    {
        'x1': points[:, 0],
        'x2': points[:, 1]
    }
)
data_df

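Requirements:

Randomly assign each point in data_df to a single centroid from points, where the centroids are identified by the point indices in init_centroids_idx, so that the resulting data_df looks like the one in Example 1.

Example 1:

If the selected centroid indices are:

init_centroids_idx = [9, 7, 8, 5, 2]

then the corresponding data points from the points array, which serve as the centroids, are:

initial centroids (points): [[ 0.4,  1.5], [ 1.3,  4.5], [ 9.4,  8.5], [ 8.8, -2.8], [-9.2, -7.2]]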
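
For reference, going from centroid indices to centroid coordinates is plain NumPy integer-array indexing. A minimal sketch using the indices from this example (the name init_centroids is introduced here only for illustration):

```python
import numpy as np

points = np.array(
    [
        [ 4.5,  7.0],
        [ 8.8,  7.7],
        [-9.2, -7.2],
        [-1.3,  6.9],
        [-5.7, -8.3],
        [ 8.8, -2.8],
        [-3.8, -6.7],
        [ 1.3,  4.5],
        [ 9.4,  8.5],
        [ 0.4,  1.5],
    ]
)

# Fixed indices from Example 1 (normally these come from a random draw)
init_centroids_idx = np.array([9, 7, 8, 5, 2])

# Integer-array indexing picks the rows of `points` in the given order
init_centroids = points[init_centroids_idx]
print(init_centroids.tolist())
# [[0.4, 1.5], [1.3, 4.5], [9.4, 8.5], [8.8, -2.8], [-9.2, -7.2]]
```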

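So the final data_df should have:

Two additional columns, centroid_x1 and centroid_x2, holding the coordinates of a point from initial_centroids, randomly assigned to each data point.
Each centroid itself has its own coordinates as its centroid.

Example of the desired output:

Notes:

The first two points are assigned to cluster centroid 2, the third point is assigned to the centroid point at index 9, and so on.
The centroids themselves (i.e., the points at indices 2, 5, 7, 8, 9) are assigned to themselves.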
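
The two conditions an assignment has to satisfy (every point maps to one of the chosen centroids, and every centroid maps to itself) can be written as a small check. This is only a sketch: the helper is_valid_assignment and both assignment vectors below are hypothetical, made up for illustration, and are not the output shown in the question.

```python
import numpy as np

def is_valid_assignment(assignment, centroid_idx):
    """Check an index-based point-to-centroid assignment (hypothetical helper)."""
    assignment = np.asarray(assignment)
    centroid_idx = np.asarray(centroid_idx)
    # Condition 1: every point is assigned to one of the chosen centroids
    all_from_centroids = np.isin(assignment, centroid_idx).all()
    # Condition 2: each centroid is assigned to itself
    centroids_self_assigned = (assignment[centroid_idx] == centroid_idx).all()
    return bool(all_from_centroids and centroids_self_assigned)

centroid_idx = [9, 7, 8, 5, 2]

# Entry i is the centroid index assigned to point i (made-up vectors)
good = [2, 2, 2, 9, 5, 5, 7, 7, 8, 9]
bad = [2, 2, 2, 9, 5, 2, 7, 7, 8, 9]  # point 5 is a centroid but maps to 2

print(is_valid_assignment(good, centroid_idx))  # True
print(is_valid_assignment(bad, centroid_idx))   # False
```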

My question:

What is the best way to achieve this?

Thanks in advance for your help.

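【Question discussion】:

There was a mistake; it has been corrected. Thanks.
So everything works with your solution. What do you mean by "more elegant"?
I mean, is there a way to solve it with fewer lines of code? I.e., is there a method in numpy or another module that can accomplish this? For example, calling choice once instead of twice?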

Tags: python python-3.x pandas numpy

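【Solution 1】:

I was able to solve this as follows:

Generate a random vector of centroid indices with the numpy.random.choice() method; the vector has as many entries as there are points in the points array.
Build the sample of points whose indices correspond to the sampled indices in that random vector, as shown in the code.

Code: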

import numpy as np
import pandas as pd

points = np.array(
    [
        [ 4.5,  7.0],
        [ 8.8,  7.7],
        [-9.2, -7.2],
        [-1.3,  6.9],
        [-5.7, -8.3],
        [ 8.8, -2.8],
        [-3.8, -6.7],
        [ 1.3,  4.5],
        [ 9.4,  8.5],
        [ 0.4,  1.5],
    ]
)

n_clusters = 5  # number of initial centroids to draw
init_centroids_idx = np.random.choice(points.shape[0], n_clusters, replace=False)
print(f'initial centroid indices: {init_centroids_idx}')

# for every point, draw one of the chosen centroid indices at random
centroid_idx_2_point = np.random.choice(init_centroids_idx, points.shape[0], replace=True)
# each centroid must be assigned to itself, whether or not it was drawn above
centroid_idx_2_point[init_centroids_idx] = init_centroids_idx
print(f'centroid index to point assignment vector: {centroid_idx_2_point}')

point_2_centroid = points[centroid_idx_2_point]
print(f'point-centroid assignment vector: \n{point_2_centroid}')

data_df = pd.DataFrame(
    {
        'x1': points[:, 0],
        'x2': points[:, 1],
        'centroid_x1': point_2_centroid[:, 0],
        'centroid_x2': point_2_centroid[:, 1],
    }
)
data_df

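That said, I would welcome any suggestions for improving my code, i.e., making it more elegant or efficient (for example, replacing the two choice calls with a single equivalent method).

Thanks.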
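
One possible tidy-up, offered as a sketch rather than a definitive answer: wrap the steps in a function and use a seeded np.random.default_rng generator so runs are reproducible. The function name assign_random_centroids and its seed parameter are assumptions made for this example. It still draws twice (once for the centroids, once for the assignment); I am not aware of a single NumPy call that does both.

```python
import numpy as np
import pandas as pd

def assign_random_centroids(points, n_clusters, seed=None):
    """Pick n_clusters points as centroids and randomly assign every point to one."""
    rng = np.random.default_rng(seed)
    n_points = points.shape[0]
    # Draw distinct point indices to act as the initial centroids
    centroid_idx = rng.choice(n_points, size=n_clusters, replace=False)
    # For every point, draw one of the centroid indices (with replacement)
    assignment = rng.choice(centroid_idx, size=n_points, replace=True)
    # Centroids are always assigned to themselves
    assignment[centroid_idx] = centroid_idx
    assigned = points[assignment]
    data_df = pd.DataFrame(
        {
            'x1': points[:, 0],
            'x2': points[:, 1],
            'centroid_x1': assigned[:, 0],
            'centroid_x2': assigned[:, 1],
        }
    )
    return data_df, centroid_idx

points = np.array(
    [
        [ 4.5,  7.0], [ 8.8,  7.7], [-9.2, -7.2], [-1.3,  6.9], [-5.7, -8.3],
        [ 8.8, -2.8], [-3.8, -6.7], [ 1.3,  4.5], [ 9.4,  8.5], [ 0.4,  1.5],
    ]
)

data_df, centroid_idx = assign_random_centroids(points, n_clusters=5, seed=0)
print(f'initial centroid indices: {centroid_idx}')
print(data_df)
```

Passing seed=None gives a different assignment on every call; a fixed seed is only there to make the example repeatable.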

【Discussion】: