Answers: Problems with cluster assignment after clustering

【Question title】: Problems with cluster assignment after clustering
【Posted】: 2017-09-22 14:55:03
【Question description】:

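I am having trouble understanding cluster assignment in k-means clustering. Specifically, I know that each point is assigned to the nearest cluster (the shortest distance to a cluster center), but I cannot reproduce that result. Details follow.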

Suppose I have a data frame df1:

set.seed(16)
df1 = data.frame(matrix(sample(1:50, replace = T), ncol=10, nrow=10000))
head(df1, n=4)

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 35 35 35 35 35 35 35 35 35  35
2 13 13 13 13 13 13 13 13 13  13
3 23 23 23 23 23 23 23 23 23  23
4 12 12 12 12 12 12 12 12 12  12

On this data frame I want to run k-means clustering (with scaling):

for_clst_km = scale(df1, center=F) #standardization with z-scores

kclust = 6 #number of clusters
Clusters <- kmeans(for_clst_km, kclust)

Once the clustering is done, I can assign the clusters back to the original data frame:

df1$cluster = Clusters$cluster

For testing purposes, let's pick cluster 3.

library(dplyr)
cluster3 = df1 %>% filter(cluster == 3)

Because I want to scale cluster3 first, I need to drop the cluster column and then apply the z-standardization:

cluster3$cluster = NULL

cluster3_1 = (cluster3-colMeans(df1))/apply(df1,2,sd)

Now that the values in cluster3_1 are scaled, I can compute the distance from each point to every cluster center:

centroids = data.matrix(Clusters$centers)

dist_to_clust1 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[1,])^2)))
dist_to_clust2 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[2,])^2)))
dist_to_clust3 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[3,])^2)))
dist_to_clust4 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[4,])^2)))
dist_to_clust5 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[5,])^2)))
dist_to_clust6 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[6,])^2)))

dist_to_clust = cbind(dist_to_clust1, dist_to_clust2, dist_to_clust3, dist_to_clust4, dist_to_clust5, dist_to_clust6)

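Finally, after looking at the distances to each cluster, it is clear that I am doing something wrong. For example, in the fifth row the point is closest to cluster 4 (that column holds the minimum value), even though every row here was assigned to cluster 3.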

head(dist_to_clust)

     dist_to_clust1 dist_to_clust2 dist_to_clust3 dist_to_clust4 dist_to_clust5 dist_to_clust6
[1,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986
[2,]      13.136060      12.848511      12.967084      13.379930      12.840414      12.861085
[3,]      13.681588      13.314994      13.492713      13.942535      13.322293      13.360695
[4,]      10.506083      10.725233      10.467843      10.636465      10.621233      10.529714
[5,]       2.157906       5.392285       3.120574       1.168265       4.855553       4.197457
[6,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986

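I think the error is in my scaling approach. I am not sure whether I can really scale the cluster 3 points with the mean and standard deviation of the whole data frame.

Could you share your thoughts on what I am doing wrong? Many thanks!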

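【Question comments】:

One problem is your data generation. You generate only 100 distinct points and then put them into a 100 x 10 matrix, so you get 10 identical columns. Try head(df1). Once your test data is fixed, we can dig into your clustering.
@G5W The same problem occurs with a data frame of more than 600k rows. So I think there is something wrong with my approach.
Maybe, but you have not given us a reasonable test case to work on.
@G5W Please see my edited question. It is the same with 10000 rows... The 100 distinct points are there because I run into exactly this kind of problem in reality.
Because the latter is not exactly what scale does? But it does show you how fragile your results are to differences in scaling.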
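
The scaling difference can be made concrete. According to the R documentation for scale, with center = FALSE each column is divided by its root mean square, sqrt(sum(x^2)/(n-1)), rather than by its standard deviation, so scale(df1, center=F) does not produce z-scores. A minimal sketch on a made-up matrix x (illustrative data, not the question's):

```r
# Hypothetical toy matrix, only for illustration
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Root mean square per column, as described in ?scale for center = FALSE
rms <- apply(x, 2, function(col) sqrt(sum(col^2) / (nrow(x) - 1)))

no_center <- scale(x, center = FALSE)  # what the question used for clustering
z_scores  <- scale(x)                  # (x - mean) / sd, true z-scores

# Uncentered scaling equals dividing each column by its RMS
rms_match <- isTRUE(all.equal(as.vector(no_center),
                              as.vector(sweep(x, 2, rms, "/"))))

# ...and it is not the same as z-scoring
same_as_z <- isTRUE(all.equal(as.vector(no_center), as.vector(z_scores)))

print(rms_match)
print(same_as_z)
```

If the centroids come from data scaled one way and the points are later scaled another way, the distances are computed in two different coordinate systems and the nearest centroid need not match the assigned cluster.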

Tags: r statistics cluster-analysis k-means clustered-index

【Solution 1】:

From my answer on Cross Validated:

This is because df-colMeans(df) does not do what you think it does.

Let's try some code:

a=matrix(1:9,nrow=3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

colMeans(a)

[1] 2 5 8

a-colMeans(a)

     [,1] [,2] [,3]
[1,]   -1    2    5
[2,]   -3    0    3
[3,]   -5   -2    1

apply(a,2,function(x) x-mean(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

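You will see that a-colMeans(a) does something different from apply(a,2,function(x) x-mean(x)); the latter is the centering you want. R recycles the vector of means element by element in column-major order, so in this 3 x 3 example row i has the mean of column i subtracted, instead of each column having its own mean subtracted.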
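
For completeness, a short sketch of two other base R ways to subtract each column's own mean, using the same matrix a. The variable names centered_sweep, centered_t and centered_apply are only illustrative:

```r
a <- matrix(1:9, nrow = 3)

# sweep() applies the statistic along the given margin (2 = columns);
# its default function is "-", so this subtracts mean j from column j
centered_sweep <- sweep(a, 2, colMeans(a))

# Transposing first makes column-major recycling line up with the columns of a
centered_t <- t(t(a) - colMeans(a))

# The apply() version from above, for comparison
centered_apply <- apply(a, 2, function(x) x - mean(x))

print(all(centered_sweep == centered_apply))
print(all(centered_t == centered_apply))
```

Both comparisons should agree with the apply result, whereas a - colMeans(a) does not.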

You can write an apply that does the complete autoscaling for you:

apply(a,2,function(x) (x-mean(x))/sd(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

scale(a)

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1

But there is no point in doing that, because scale does it for you. :)
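
If the raw rows have to be scaled later (for example after filtering the original data), one option is to reuse the centering and scaling values that scale stored as attributes. This is a minimal sketch on made-up data; m, raw_subset and the chosen row indices are assumptions for illustration only:

```r
set.seed(1)
# Hypothetical raw data: 10 rows, 4 columns
m <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 4)

m_sc <- scale(m)                       # z-scores of the full data
ctr  <- attr(m_sc, "scaled:center")    # column means used by scale()
scl  <- attr(m_sc, "scaled:scale")     # column standard deviations used by scale()

# Take some raw rows and scale them with the SAME parameters
rows <- c(2, 5, 7)
raw_subset <- m[rows, , drop = FALSE]
subset_sc  <- scale(raw_subset, center = ctr, scale = scl)

# The result should match the corresponding rows of the full scaled matrix
print(isTRUE(all.equal(as.vector(subset_sc), as.vector(m_sc[rows, ]))))
```

The point of the design is that the subset is never scaled by its own mean and standard deviation, only by those of the data the clustering was fitted on.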

Next, try the clustering:

set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc,replace = T), ncol=nc, nrow=nr)
head(df1, n=4)

for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)

# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]

# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))

centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,])  # Calculate observation distances to centroid d=1..nclust

whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize

> table(whichMins)
whichMins
   3 
2532
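
The same check can be run for all clusters at once. The sketch below is an assumption-laden variant, not the code above: it uses made-up, well-separated toy data, computes squared Euclidean distances with the identity ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, and compares the nearest centroid with the assignment returned by kmeans. The names x, x_sc, km, d2 and nearest are illustrative:

```r
set.seed(42)
# Hypothetical toy data: three well-separated groups in two dimensions
x <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)
x_sc <- scale(x)
km <- kmeans(x_sc, centers = 3, nstart = 10)

# Squared distances from every observation (rows) to every centroid (columns)
d2 <- outer(rowSums(x_sc^2), rowSums(km$centers^2), "+") -
  2 * x_sc %*% t(km$centers)

# Index of the closest centroid for each observation
nearest <- apply(d2, 1, which.min)

# Cross-tabulate the nearest centroid against the kmeans assignment
print(table(nearest, km$cluster))
```

When the points and the centroids live in the same scaled space, the table should have non-zero counts only on its diagonal.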

HTH,
Carl

【Comments】:

【Solution 2】:

Your hand-written scaling code is broken. Check the standard deviation of the resulting data; it is not 1.
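
One way to run that check, sketched on a small made-up matrix (m, bad and good are illustrative names, and the data is not the question's):

```r
# Hypothetical toy matrix with columns on very different scales
m <- cbind(a = c(1, 2, 3, 4),
           b = c(10, 20, 30, 40),
           c = c(100, 200, 300, 400))

# Hand-written scaling in the style of the question:
# the mean and sd vectors are recycled down the rows, not across the columns
bad <- (m - colMeans(m)) / apply(m, 2, sd)

# Built-in column-wise z-scoring
good <- scale(m)

# Column standard deviations after each kind of scaling
print(apply(bad, 2, sd))
print(apply(good, 2, sd))
```

After a correct z-standardization every column has standard deviation 1, which makes this a cheap sanity check for any hand-written scaling.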

Why don't you just use

# for_clst_km is a matrix without a cluster column, so index it by the cluster vector
cluster3 = for_clst_km[Clusters$cluster == 3, ]

【Comments】:

That is a valid point, but unfortunately I cannot use it in my business case. I have to use cluster3 = df1 %>% filter(cluster == 3).
Then use the hand-written scaling both times?