Clustering leads to very concentrated clusters

【Question title】: Clustering leads to very concentrated clusters
【Posted】: 2018-11-05 01:21:58
【Question description】:

To understand my problem you will need the whole dataset: https://pastebin.com/82paf0G8

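Preprocessing: I have a list of orders covering 696 unique item numbers, and I want to cluster the items by how often each pair is ordered together. For every pair of items I counted the number of orders in which both appear; the highest count for any pair is 489. I then "computed" a similarity/correlation as: frequency / maximum frequency over all pairs (489). The result is the dataset I uploaded.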
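The raw order list is not shown in the post, so the following is only a minimal sketch of the preprocessing described above, on made-up data. The data frame `orders` and its columns `order_id` and `item` are assumptions for illustration.

```r
# Hypothetical order lines: one row per (order, item); not the real data
orders <- data.frame(
  order_id = c(1, 1, 1, 2, 2, 3, 3, 4),
  item     = c("A", "B", "C", "A", "B", "A", "C", "D"),
  stringsAsFactors = FALSE
)

# Orders x items incidence matrix (1 if the item is in the order)
inc <- unclass(table(orders$order_id, orders$item))

# Items x items co-occurrence counts: entry [i, j] is the number of
# orders that contain both item i and item j
co <- crossprod(inc)
diag(co) <- 0  # ignore an item's co-occurrence with itself

# Similarity as described in the post: count / maximum count over all pairs
sim <- co / max(co)
```

In this toy example item D is only ever ordered alone, so its whole row of `sim` is zero. In a long-format pair list such an item would not appear at all.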

Similarity/correlation: I don't know whether my similarity measure is the best choice here. I also tried the Jaccard coefficient/index, but got almost the same results.
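For reference, the Jaccard index of two items is the number of orders containing both, divided by the number of orders containing at least one of them. The post does not show how it was computed, so this is a sketch on made-up data; `orders`, `order_id` and `item` are hypothetical names.

```r
# Hypothetical order lines: one row per (order, item); not the real data
orders <- data.frame(
  order_id = c(1, 1, 1, 2, 2, 3, 3, 4),
  item     = c("A", "B", "C", "A", "B", "A", "C", "D"),
  stringsAsFactors = FALSE
)

inc <- unclass(table(orders$order_id, orders$item))  # orders x items incidence
co  <- crossprod(inc)                                # pairwise co-occurrence counts
n   <- diag(co)                                      # orders containing each item

# Jaccard: |A and B| / |A or B| = n_ab / (n_a + n_b - n_ab)
jac <- co / (outer(n, n, "+") - co)
```

Unlike dividing by the global maximum, Jaccard normalizes each pair by how often the two items are ordered at all. Pairs that never co-occur still get 0 either way, which may be why the two measures gave similar clusterings here.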

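Dataset: the dataset contains the material numbers V1 and V2, and N is the correlation between the two material numbers, a value between 0 and 1.

With someone else's help I managed to build a distance matrix and run PAM clustering.

Why PAM clustering? A data scientist suggested: more than 95% of your pairs carry no information, which puts all of those materials at the same distance and makes a single cluster very dispersed. The PAM algorithm can address this, but you will still end up with one very concentrated group. Another solution is to increase the weight of distances other than one.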

Question 1: The matrix is only 567x567. I think I need the full 696x696 matrix for clustering, even if many of its entries are zero, but I'm not sure.
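If the full item list is available, one way to get a square matrix over all items is to allocate it from that list and fill in the known pairs. This is a sketch under assumptions: the toy `df` only mimics the V1/V2/N layout, and `all_items` is a hypothetical vector that would have to come from the original orders.

```r
# Hypothetical pair table in the same V1 / V2 / N layout as the dataset
df <- data.frame(
  V1 = c("A", "A", "B"),
  V2 = c("B", "C", "C"),
  N  = c(1, 1, 0.5),
  stringsAsFactors = FALSE
)

# Assumed full item list; "D" never occurs together with another item
all_items <- c("A", "B", "C", "D")

# Start from an all-zero similarity matrix over every item
ss <- matrix(0, nrow = length(all_items), ncol = length(all_items),
             dimnames = list(all_items, all_items))

# Fill both triangles by name so the matrix is symmetric
ss[cbind(df$V1, df$V2)] <- df$N
ss[cbind(df$V2, df$V1)] <- df$N
diag(ss) <- 1

# Same similarity-to-distance transform as in the post
dd <- as.dist(1 - sqrt(ss))
```

Items that never co-occur with anything end up at distance 1 from every other item, so they carry no grouping information. Whether to include them is a modelling decision rather than a requirement of the algorithm.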

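Question 2: The clustering doesn't work very well. I get very concentrated clusters: a lot of items end up in the first cluster. Also, judging by the usual ways of validating PAM clusters, my results are poor. Is this due to the similarity measure? What else should I use? Is it because 95% of the data is zero? Should I change the zeros to something else?

Full code and results: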

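# Suppose X is the dataset (columns V1, V2, N)
library(data.table)
df <- data.table(X)
# Mirror every pair so the wide table is symmetric, then drop the V1 id column
ss <- as.matrix(dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1])
ss <- ss / max(ss, na.rm = TRUE)  # rescale so the largest similarity is 1
ss[is.na(ss)] <- 0                # pairs never ordered together get similarity 0
diag(ss) <- 1                     # every item is fully similar to itself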

Now apply PAM clustering:

library(cluster)
dd2 <- as.dist(1 - sqrt(ss))  # turn similarities into dissimilarities
pam2 <- pam(dd2, 4)           # PAM with k = 4 clusters
summary(as.factor(pam2$clustering))

But I get very concentrated clusters, like:

1   2   3   4 
382 100  23  62
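The post does not say which validation was used. One common check for PAM is the average silhouette width, which `pam()` from the cluster package stores in `silinfo$avg.width`. The sketch below compares several values of k on a made-up similarity matrix with two obvious blocks; the matrix and the range of k are illustrative assumptions, not the real data.

```r
library(cluster)

# Hypothetical similarity matrix: two blocks of 5 items each,
# high similarity inside a block and low similarity across blocks
n <- 10
ss <- matrix(0.05, n, n)
ss[1:5, 1:5]   <- 0.8
ss[6:10, 6:10] <- 0.8
diag(ss) <- 1

# Same similarity-to-distance transform as in the post
dd <- as.dist(1 - sqrt(ss))

# Average silhouette width for each candidate number of clusters
ks <- 2:4
avg_sil <- sapply(ks, function(k) pam(dd, k)$silinfo$avg.width)
names(avg_sil) <- ks

# The k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
```

On the real matrix, `plot(silhouette(pam2))` would additionally show which clusters contain poorly placed items (silhouette widths near or below zero).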

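【Question comments】:

Your code fails for me at ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1].
ncol(df) # 3 ncol(df[, .(V1 = V2, V2 = V1, N)]) # 4. In the rbind you have different numbers of columns, so they cannot be bound together.
I have updated my post with a new pastebin. Could you try again? Thanks.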

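Tags: r cluster-analysis distance correlation

【Solution 1】:

I'm not sure where you got the number 696 from. After the rbind you have a data frame with 567 unique values in both V1 and V2; you then run dcast and end up with a 567 x 567 matrix, as expected. Clustering-wise, I don't see anything wrong with your clusters.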

dim(df) # [1] 7659    3

test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318     3

length(unique(test$V1)) # 567
length(unique(test$V2)) # 567

test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567

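【Comments】:

I can't find the number 696 in your dataset. dim(unique(df)) # 7659 3; dim(unique(df[,c("V1", "V2")])) # 7659 2; length(unique(df$V1)) # 422; length(unique(df$V2)) # 533
No, you're right. It's because some of the 696 items never occur together with any other item, so they are not mentioned in the dataset. I don't know whether it affects the clustering that the rbind simply leaves those out instead of creating a 696x696 matrix. Do you have a solution for the clustering, though? I have to hand in my thesis soon and would really like to figure this out. Thanks for your help so far!
There is nothing wrong with uneven clusters. Do you have a reason to believe they should be even?
Yes, because I know these items and have analyzed them from several angles. I don't expect perfectly even clusters, but no single cluster should contain more than 60% of all items. I'm fairly sure the problem is that so many items never appear in the same order, so their similarity coefficient is 0, and I think that ruins the clustering.
@Mayo, I've read this whole post but I'm not sure whether the original dataset was preprocessed, since that isn't mentioned anywhere. Assuming it wasn't, I'd recommend steps such as a high-correlation filter and missing-value treatment, and then trying PAM clustering again. Also, you do know that PAM is best suited to mixed datasets (i.e. datasets with both continuous and categorical variables; that is the reason for the medoid approach)? So is there any particular reason for applying PAM, given that your data has only continuous values?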

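【Solution 2】:

@Mayo, forget what the data scientist said about PAM. Since you mentioned that this work is for a thesis: from an academic point of view, your current justification for why PAM is needed carries no weight. In essence, you need to demonstrate why PAM is necessary for your case study. Given the nature of the (continuous) variables in the dataset, V1, V2, N, I don't see the logic for why PAM applies here (as I mentioned in the comments, PAM works best for mixed variables). Moving on, see this post on correlation detection in R: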

# Objective: detect highly correlated variables, visualize them, and remove them
data("mtcars")
my_data <- mtcars[, c(1, 3, 4, 5, 6, 7)]
# print the first 6 rows
head(my_data, 6)
# compute the correlation matrix using cor()
res <- cor(my_data)
round(res, 2) # Unfortunately, cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red. Color intensity and circle size are proportional to the correlation coefficients. The legend on the right of the correlogram maps colors to coefficient values.
# tl.col (text label color) and tl.srt (text label string rotation) change the label color and rotation.

# Apply a correlation filter at 0.80
# install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
# [1] "disp" "mpg"

removeHighCor <- findCorrelation(res, cutoff = 0.80) # returns column indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
# [1] 32  4

Hope this helps.

【Comments】: