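【Posted】: 2021-03-04 16:28:36
【Question】:
I'm trying to find the coordinates of the K-means cluster centers for each group in a large data frame. One option is a brute-force loop, but I'd prefer a tidy way to do it. This previous Q&A was not clear to me. Any ideas? Below are a few things I tried, along with an MCVE I made with reprex. Many thanks,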
library(magrittr)
library(dplyr)
labels <- c(rep(1,4),rep(2,4))
x <- sample.int(100, length(labels))
y <- sample.int(100, length(labels))
df <- as.data.frame(list(labels=labels,x=x,y=y))
df
#> labels x y
#> 1 1 75 32
#> 2 1 71 10
#> 3 1 41 68
#> 4 1 38 69
#> 5 2 99 95
#> 6 2 15 56
#> 7 2 73 96
#> 8 2 67 92
# Error Idea 1
df %>% group_by(labels) %>% summarize(center=kmeans(c(x,y), centers=2))
#> Error: Problem with `summarise()` input `center`.
#> x Input `center` must be a vector, not a `kmeans` object.
#> i Input `center` is `kmeans(c(x, y), centers = 2)`.
#> i The error occurred in group 1: labels = 1.
# Error Idea 2
df %>% group_by(labels) %>% summarize(x=list(x), y=list(y)) %>% select(x,y) %>% lapply(kmeans, centers=2)
#> Error in storage.mode(x) <- "double": (list) object cannot be coerced to type 'double'
# Brute force loop - works but cumbersome
ulabs <- unique(df$labels)
ctr <- vector("list", length(ulabs))
for (i in seq_along(ulabs)) {
  tmp <- df[df$labels == ulabs[i], ]
  ctr[[i]] <- kmeans(tmp[, c('x', 'y')], centers = 2)$centers
}
ctr
#> [[1]]
#> x y
#> 1 20.5 42
#> 2 78.0 36
#>
#> [[2]]
#> x y
#> 1 9.00000 20.00000
#> 2 65.33333 34.33333
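For what it's worth, Idea 1 seems to fail for two separate reasons: `summarise()` expects each result to be a vector, and `c(x, y)` concatenates the two columns into one long vector instead of a two-column matrix. A minimal sketch of one possible fix, assuming dplyr 1.0 or later, is to pass `cbind(x, y)` to `kmeans()` and wrap the centers matrix in `list()` so it is stored in a list-column. The `set.seed()` call and the name `centers` are additions for illustration, not part of the original reprex.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(
  labels = rep(1:2, each = 4),
  x = sample.int(100, 8),
  y = sample.int(100, 8)
)

centers <- df %>%
  group_by(labels) %>%
  summarise(
    # cbind() builds the n x 2 matrix kmeans() needs;
    # list() turns the 2 x 2 centers matrix into a single list-column entry
    center = list(kmeans(cbind(x, y), centers = 2)$centers),
    .groups = "drop"
  )

centers$center[[1]]  # 2 x 2 matrix of centers for labels == 1
```

The result has one row per label, with the matrix of centers available as `centers$center[[i]]`, which mirrors the `ctr` list built by the loop.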
【Comments】:
- You could use the tidymodels approach described here: tidymodels.org/learn/statistics/k-means
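In the same tidy spirit, here is a dplyr-only sketch that returns the centers as a long data frame. It uses `group_modify()` and is not the nest/map/broom recipe from the linked tidymodels article. It assumes dplyr 0.8.1 or later; the `set.seed()` call, the `cluster` column, and the name `centers_long` are illustrative additions.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(
  labels = rep(1:2, each = 4),
  x = sample.int(100, 8),
  y = sample.int(100, 8)
)

centers_long <- df %>%
  group_by(labels) %>%
  group_modify(~ {
    # .x holds the rows of one group, without the grouping column
    km <- kmeans(.x[, c("x", "y")], centers = 2)
    # one row per cluster center, with an explicit cluster id
    data.frame(cluster = seq_len(nrow(km$centers)), km$centers)
  }) %>%
  ungroup()

centers_long  # columns: labels, cluster, x, y (two rows per label)
```

Compared with a list-column of matrices, this shape is easier to join back to the original data or to plot, at the cost of losing the rest of the `kmeans` object.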