Group values in column based on common words

【Question title】: Group values in column based on common words
【Posted】: 2020-10-31 13:36:33
【Question description】:

I have a data frame:

ID    message
1     request body: <?xml version="2.0",<code> dwfkjn34241
2     request body: <?xml version="2.0",<code> jnwg3425
3     request body: <?xml version="2.0", <PlatCode>, <code> qwefn2
4     received an error
5     <MarkCheckMSG>
6     received an error
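
For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds the table above in R. The column types (integer `ID`, character `message`) are an assumption, since the question only shows the printed values.

```r
# Rebuild the example data frame from the question.
# Column types are assumed: integer ID, character message.
df <- data.frame(
  ID = 1:6,
  message = c(
    'request body: <?xml version="2.0",<code> dwfkjn34241',
    'request body: <?xml version="2.0",<code> jnwg3425',
    'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
    "received an error",
    "<MarkCheckMSG>",
    "received an error"
  ),
  stringsAsFactors = FALSE  # keep messages as plain strings
)
```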

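I want to group the values in the column based on the words they have in common. The first three rows of the message column can be treated as one group even though they differ slightly, and the fourth and sixth rows are members of another group. How can I group the values in the message column using word and structural similarity criteria? What would be a good approach? The data frame above is only an example, so I am more interested in a method that fits the general idea of the problem than in a regex-based solution.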

【Comments】:

Tags: r dataframe group-by cluster-computing

【Solution 1】:

Maybe try k-medoids clustering with a string distance metric?

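library(cluster)
library(stringdist)

# Cluster the strings in x with k-medoids (PAM) for every k from k_from up to
# the number of distinct strings, and return the fit with the largest average
# silhouette width. k_from should be at least 2, because the silhouette is
# not defined for a single cluster.
find_medoids <- function(x, k_from, method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)) {
  # Pairwise string distances, labelled by the strings themselves
  diss <- stringdist::stringdistmatrix(x, x, method = method, weight = weight)
  dimnames(diss) <- list(x, x)
  # pam() requires k < length(x), so cap the upper bound accordingly
  k_to <- min(length(unique(x)), length(x) - 1L)
  trials <- lapply(
    seq(from = k_from, to = k_to),
    function(i) cluster::pam(diss, i, diss = TRUE)
  )
  # Pick the fit with the highest average silhouette width
  sel <- which.max(vapply(trials, `[[`, numeric(1L), c("silinfo", "avg.width")))
  trials[[sel]]
}

# Look up the cluster label of each string by name
map_cluster <- function(x, med_obj) {
  unname(med_obj$clustering[x])
}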

Output

> map_cluster(df$message, find_medoids(df$message, 2, "cosine"))
[1] 1 1 1 2 3 2
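
To make the label vector easier to read, the labels can be lined up against the messages. The sketch below copies the labels from the output above rather than recomputing them, so it runs in base R without the clustering packages; the variable names `messages`, `clusters` and `grouped` are only illustrative.

```r
# Messages from the question, in row order.
messages <- c(
  'request body: <?xml version="2.0",<code> dwfkjn34241',
  'request body: <?xml version="2.0",<code> jnwg3425',
  'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
  "received an error",
  "<MarkCheckMSG>",
  "received an error"
)

# Cluster labels copied from the output above (not recomputed here).
clusters <- c(1, 1, 1, 2, 3, 2)

# One list element per cluster, holding the messages assigned to it.
grouped <- split(messages, clusters)
print(grouped)
```

Each list element is named after its cluster label, so `grouped[["2"]]` holds the two "received an error" rows.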

For your real data, you may need to tune some parameters, such as the string distance method (the example above uses cosine distance).
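
As one illustration of how much the choice of distance matters, here is a rough base-R sketch that swaps in a different technique: a word-level Jaccard distance with average-linkage hierarchical clustering, instead of `stringdist` and `pam()`. It is not the method from the answer above. The helpers `tokenize` and `jaccard_dist` and the cut height of 0.5 are assumptions made for this example, and the sketch assumes every message contains at least one word.

```r
messages <- c(
  'request body: <?xml version="2.0",<code> dwfkjn34241',
  'request body: <?xml version="2.0",<code> jnwg3425',
  'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
  "received an error",
  "<MarkCheckMSG>",
  "received an error"
)

# Hypothetical helper: split a message into its set of lowercase word tokens
# (runs of letters and digits); everything else acts as a separator.
tokenize <- function(s) {
  tokens <- unlist(strsplit(tolower(s), "[^a-z0-9]+"))
  unique(tokens[nzchar(tokens)])
}

# Hypothetical helper: Jaccard distance between two token sets,
# 1 - |intersection| / |union|. Assumes the union is not empty.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Pairwise word-level distances between all messages.
token_sets <- lapply(messages, tokenize)
n <- length(token_sets)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jaccard_dist(token_sets[[i]], token_sets[[j]])
  }
}

# Average-linkage hierarchical clustering, cut at an arbitrary height:
# messages whose clusters merge below 0.5 end up in the same group.
tree <- hclust(as.dist(d), method = "average")
groups <- cutree(tree, h = 0.5)
print(groups)
```

With these six messages `groups` comes out as 1 1 1 2 3 2, the same grouping as the k-medoids output. On real data the cut height would need tuning in the same way as the distance method.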

【Discussion】:

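This is a brilliant solution.
Wow ekoam, great approach.
I don't quite understand what the result means.
Elements 1-3 (the "request body" messages) belong to group 1; elements 4 and 6 ("received an error") belong to group 2; element 5 ("<MarkCheckMSG>") belongs to group 3. @french_fries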
I get the result 2 3 1 1 1 1 for this data frame: ID dwfkjn34241','request body: jnwg3425','request body: , qwefn2', 'received an error', '', 'received an error') df