R Clustering from pairs [closed]

【Question title】: R Clustering from pairs [closed]
【Posted】: 2015-07-19 23:03:58
【Question】:

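There is a problem I have been unable to solve for several weeks.

I have a database of users and the TV series they like. There are tens of thousands of users (A, B, C, D, ...) and thousands of TV series (1, 2, 3, 4, ...), so the database holds millions of "user;likedseries" pairs. For example: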

A;10 #user A liked series 10
A;23
A;233
A;500
B;5
B;10
B;343
C;10
C;233
C;340
...

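I am looking for a way in R to find:

1) clusters of similar users, based on the TV series they like

2) clusters of similar TV series, based on the users who like them

Do you know how to solve this?

Thanks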

【Comments on the question】:

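If you have visited Amazon, you will have seen it recommend books or other items. Something similar can be done here; the k-nearest neighbors algorithm could be one option.
What you have given us is not entirely clear, but my intuition is that the data are very sparse. Random forests might work, but I suspect most of your clusters would form around the more popular shows (which may be fine for your first request). For the second question, consider something like market basket analysis: you do not have to partition the data, but instead derive likely relationships, e.g. {The Simpsons and Futurama} -> {Family Guy}, which may be useful.
If the number of TV series is orders of magnitude smaller than the number of users, you could use a generative model that follows Bayes' rule. Essentially P(cluster_i) = prod(P(cluster_i(show_j))), and you initially assign the shows at random to a varying number of clusters. As long as you have a smoothing parameter (e.g. a minimum probability of cluster membership for a show), you will probably be fine. There are better ways to do this, but it is one of the simplest.
This really is not about programming (wanting to use R does not by itself make it a programming question). You need to choose a statistical method suited to your data. There are many clustering algorithms, and you should know which one you want to implement before you start programming. This might be a better question for Cross Validated or Data Science (but check what is on-topic there first).
Many thanks for all the comments and answers. You are right, this is not a programming question, and a board like Cross Validated or Data Science would be better (I did not know about them).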
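
As a rough illustration of the similarity-based ideas raised in the comments above, here is a minimal base-R sketch that clusters users by the Jaccard distance between their sets of liked series. It is not taken from the thread: the choice of Jaccard distance with average-linkage hierarchical clustering is an assumption, the object names (`pairs`, `m`, `user_groups`) are made up for the example, and the data are the ten pairs from the question. A dense matrix like this would not scale to millions of pairs without a sparse representation.

```r
# Toy data: the "user;likedseries" pairs from the question.
pairs <- data.frame(
  user   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  series = c(10, 23, 233, 500, 5, 10, 343, 10, 233, 340)
)

# Binary user x series incidence matrix (1 = the user liked the series).
m <- (unclass(table(pairs$user, pairs$series)) > 0) * 1

# dist(method = "binary") is the Jaccard distance between rows:
# the share of series liked by exactly one of two users among
# the series liked by at least one of them.
d_users <- dist(m, method = "binary")

# Hierarchical clustering of users, cut into 2 groups (k is arbitrary here).
hc_users    <- hclust(d_users, method = "average")
user_groups <- cutree(hc_users, k = 2)
print(user_groups)

# Transposing the matrix clusters series by the users who like them.
d_series <- dist(t(m), method = "binary")
```

On this toy input, users A and C share two of five distinct series, so they are the closest pair and end up in the same group. The same call on `d_series` with `hclust` would give the series-side clustering asked for in the second part of the question.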

Tags: r cluster-analysis

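【Solution 1】:

Here is an example of a generative algorithm you could use. If your sample size is very large, you may want to optimize it with the data.table package and/or an external database. The code is written to be relatively easy for beginners to read.

Below there are 12,000 users and 90 shows, with 5 different types of shows/users. Each user is 7 times as likely to like a show in their own category as a show outside it. The resulting data frames hold each user's estimated cluster, the user's cluster membership probabilities, and the probability that a particular show is associated with a cluster (you will need to normalize these values, because the probabilities within each column sum to 1). This is the algorithm used here.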

library(plyr)

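#create the "true" values: each of the 12,000 users belongs to one of 5 classes
trueclass = sample(5,12000,replace=TRUE)
#draw one show id (1..90) per class in x; a show whose (id - 1) %% 5 equals
#class - 1 gets weight 7, every other show gets weight 1
sid.sample <-function(x){ sapply(x,function(x) sample(1:90,1,prob = rep(1,90)*1+((0:89)%%5 == (x-1))*6))}
#four liked shows per user (the same show can be drawn twice for one user)
df = data.frame(user = rep(1:12000,each = 4),sid = sid.sample(rep(trueclass,each=4)))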

#number of clusters, and the distinct user and show ids
k = 5
uids = unique(as.numeric(df$user))
sids = unique(df$sid)

#initialize probabilities
uclass = uprobs = rdply(function() {x=rep(0,k);x[sample(k,1)] = 1;return(x)},
                        .n = length(uids))[,-1]
sprobs = matrix(0,nrow = length(sids),ncol = k)
scounts = sprobs*0

row.to.max <- function(x) rep(1,length(x)) * (1:length(x) == which.max(x))

#priors for each group; initially make them unbiased
priors = rep(1/k,k)

#slow method that still works
#40 iterations
for (counter in 1:40){
  print(counter)
  #smoothing
  scounts[,] = 1
  #calculate show probabilities
  for (i in 1:nrow(df)){
    scounts[df[i,2],which.max(uclass[df[i,1],])]=scounts[df[i,2],which.max(uclass[df[i,1],])]+1
  }
  sprobs = apply(scounts,2,function(x) x/sum(x))
  #to calculate user probabilities
  uprobs[,] = 0
  for (i in 1:nrow(df)){
    uprobs[df[i,1],] = uprobs[df[i,1],] + log(sprobs[df[i,2],])
  }
  #add the log prior, then convert from log scale to normalized probabilities
  uprobs = t(apply(uprobs,1,function(x,priors){ x = x + log(priors);x=x-max(x);x=exp(x);x/sum(x)},priors = priors))
  uclass = t(apply(uprobs,1,row.to.max))
  priors = colSums(uclass)
  #small bit of smoothing
  priors = (priors+0.01)/sum(priors+0.01)
  print(priors)
}

final.classes = apply(uclass,1,which.max)
table(trueclass,final.classes)
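
The cluster numbers produced by such an algorithm are arbitrary, so a cross-table of true against estimated classes will generally not line up on the diagonal even when the grouping is recovered well. As a small sketch of one way to read such a table (not part of the original answer), the hypothetical helper `cluster_purity` below gives each estimated cluster the true class it overlaps most and reports the share of users that match. The vectors `truth` and `estimate` are made-up toy values.

```r
# Purity: for each estimated cluster (column), count the users in its
# most common true class, then divide the total by the number of users.
cluster_purity <- function(truth, estimate) {
  tab <- table(truth, estimate)
  sum(apply(tab, 2, max)) / sum(tab)
}

# Toy example: the labels are permuted and one user is misassigned.
truth    <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
estimate <- c(2, 2, 2, 3, 3, 1, 1, 1, 1)

purity <- cluster_purity(truth, estimate)
print(purity)
```

With the simulation above, the call would be `cluster_purity(trueclass, final.classes)`. A value near 1 means the clusters match the true classes up to relabeling. Purity also rises when there are many small clusters, so it is best compared at a fixed number of clusters.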

【Comments】:

Thank you for this interesting solution.

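【Solution 2】:

If you turn your data into transactions, you have a classic market basket analysis scenario, which is popular in recommender systems:

UserA: M1 M11 M17

There are many algorithms and tools for this, for example the arules package.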
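
To make the transaction idea concrete, here is a minimal sketch using arules on the ten pairs from the question. It assumes the arules package is installed. The object names and the support and confidence thresholds are illustrative choices for this tiny data set, not values from the thread, and real data would need much lower thresholds.

```r
library(arules)

# Toy data: the "user;likedseries" pairs from the question.
pairs <- data.frame(
  user   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  series = c(10, 23, 233, 500, 5, 10, 343, 10, 233, 340)
)

# One basket per user: split the series ids by user,
# then coerce the list to arules' "transactions" class.
baskets <- split(as.character(pairs$series), pairs$user)
trans   <- as(baskets, "transactions")

# Mine rules "users who like X also like Y" with Apriori.
# The thresholds are arbitrary and only suit this toy input.
rules <- apriori(trans,
                 parameter = list(support = 0.5, confidence = 0.6, minlen = 2))

# Show the rules, strongest association first.
inspect(sort(rules, by = "lift"))
```

Each rule reads like the {The Simpsons and Futurama} -> {Family Guy} example from the comments: the left-hand side is a set of liked series and the right-hand side is a series that tends to be liked along with them.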

【Comments】:

Thanks. I used Apriori from the arules package and the results were very good.