Kmeans centers in a group_by and summarize pipe

【Question title】: Kmeans centers in a group_by and summarize pipe
【Posted】: 2021-03-04 16:28:36
【Question】:

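I'm trying to find the coordinates of K-means cluster centers within each group of a large data frame. One option is a brute-force loop, but it would be nicer to find a tidy way to make it work. This previous Q&A is unclear. Any ideas? Below are some things I've tried, along with an MCVE I made with reprex. Many thanks,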

library(magrittr)
library(dplyr)
labels <- c(rep(1,4),rep(2,4))
x <- sample.int(100, length(labels))
y <- sample.int(100, length(labels))

df <- as.data.frame(list(labels=labels,x=x,y=y))
df
#>   labels  x  y
#> 1      1 75 32
#> 2      1 71 10
#> 3      1 41 68
#> 4      1 38 69
#> 5      2 99 95
#> 6      2 15 56
#> 7      2 73 96
#> 8      2 67 92

# Error Idea 1
df %>% group_by(labels) %>% summarize(center=kmeans(c(x,y), centers=2))
#> Error: Problem with `summarise()` input `center`.
#> x Input `center` must be a vector, not a `kmeans` object.
#> i Input `center` is `kmeans(c(x, y), centers = 2)`.
#> i The error occurred in group 1: labels = 1.

# Error Idea 2
df %>% group_by(labels) %>% summarize(x=list(x), y=list(y)) %>% select(x,y) %>% lapply(kmeans, centers=2)
#> Error in storage.mode(x) <- "double": (list) object cannot be coerced to type 'double'

# Brute force loop - works but cumbersome
ulabs <- unique(df$labels)
ctr <- vector("list", length(ulabs))
for (i in seq_along(ulabs)) {
    tmp <- df[df$labels==ulabs[i],]
    ctr[[i]] <- (kmeans(tmp[, c('x','y')], centers=2))$centers
    
}

ctr
#> [[1]]
#>      x  y
#> 1 20.5 42
#> 2 78.0 36
#> 
#> [[2]]
#>          x        y
#> 1  9.00000 20.00000
#> 2 65.33333 34.33333

【Comments】:

You could take the tidymodels approach laid out here: tidymodels.org/learn/statistics/k-means
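
For reference, here is a rough sketch of what that tidymodels-style approach might look like on the data frame from the question. It assumes the tidyr, purrr and broom packages are installed; the data frame is rebuilt here with a seed so the block runs on its own, and the names `km`, `tidied` and `centers` are just illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
df <- data.frame(
  labels = rep(1:2, each = 4),
  x = sample.int(100, 8),
  y = sample.int(100, 8)
)

centers <- df %>%
  nest(data = c(x, y)) %>%                            # one row per label, x/y in a list column
  mutate(
    km     = map(data, ~ kmeans(.x, centers = 2)),    # fit k-means within each group
    tidied = map(km, tidy)                            # one row per cluster: center coordinates, size, withinss
  ) %>%
  select(labels, tidied) %>%
  unnest(tidied)

centers
```

The result should be a flat tibble with one row per (label, cluster) pair, so the centroid coordinates end up in ordinary `x` and `y` columns.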

Tags: r dplyr tidyverse

【Solution 1】:

With the help of the purrr package, here is a slight modification of your Idea 1. The results are stored as a list in the Center column.

library(magrittr)
library(dplyr)
library(tidyr)  # provides nest()
library(purrr)

set.seed(1)

labels <- c(rep(1,4),rep(2,4))
x <- sample.int(100, length(labels))
y <- sample.int(100, length(labels))

df <- as.data.frame(list(labels=labels,x=x,y=y))

df2 <- df %>% 
  group_by(labels) %>%
  nest() %>%
  summarize(Kmeans = map(data, ~kmeans(.x[, c("x", "y")], 
                                centers = 2))) %>%
  mutate(Center = map(Kmeans, "centers"))

df2$Center
# [[1]]
#      x  y
# 1 53.5 55
# 2 17.5 91
# 
# [[2]]
#      x  y
# 1 28.5 64
# 2 84.5 14
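
As a side note on why Idea 1 from the question failed: `summarize()` needs each result to be a vector, and `c(x, y)` glues both columns into a single one-dimensional vector. A minimal sketch of a more direct fix, assuming only dplyr is loaded, is to bind the columns into a matrix with `cbind()` and wrap the centers in `list()`. The name `res` is just illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  labels = rep(1:2, each = 4),
  x = sample.int(100, 8),
  y = sample.int(100, 8)
)

res <- df %>%
  group_by(labels) %>%
  # cbind() keeps x and y as two columns; list() makes the matrix a valid summary value
  summarize(center = list(kmeans(cbind(x, y), centers = 2)$centers))

res$center  # a list with one 2 x 2 matrix of centers per label
```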

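【Comments】:

Thanks. The suggested code seems to treat all of the input data as one-dimensional. For example, your Cluster means result is a 2x1 matrix, i.e. each cluster has only one coordinate. Compare that with my brute-force loop (which I have edited to show the results). In case it matters, I'm really only interested in storing the cluster centroid coordinates, not the whole cluster object.
@saintsfan342000 Please see my update. The results are in the Center column.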
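
If only the centroid coordinates are needed, one possible alternative (a sketch, not part of the answer above) is `dplyr::group_modify()`, which lets each group return a small data frame and stacks the results. The column name `cluster` and the result name `centers_flat` are made up for illustration.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  labels = rep(1:2, each = 4),
  x = sample.int(100, 8),
  y = sample.int(100, 8)
)

centers_flat <- df %>%
  group_by(labels) %>%
  group_modify(~ {
    # .x holds the rows of the current group, without the grouping column
    km <- kmeans(.x[, c("x", "y")], centers = 2)
    # return the centers as a data frame, with a cluster index column added
    data.frame(cluster = seq_len(nrow(km$centers)), km$centers)
  }) %>%
  ungroup()

centers_flat
```

This gives one row per label and cluster with plain `x` and `y` columns, so nothing has to be pulled out of a list column afterwards.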