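【Posted】: 2010-11-17 01:02:12
【Question】:
Note: see the update/solution at the bottom of this question.
As part of a product recommendation engine, I am trying to segment users by their product preferences, starting with the k-means clustering algorithm.
My data is a dictionary of the following form: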
prefs = {
    'user_id_1': { 1: 3.0, 2: 1.0, },
    'user_id_2': { 4: 1.0, 8: 1.5, },
}
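where the product IDs are long integers and the ratings are floats. The data is sparse: I currently have about 60,000 users, most of whom have rated only a handful of products. Each user's dictionary of values is implemented as a defaultdict(float) to simplify the code.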
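As a rough illustration of this layout, the sketch below builds a `prefs` dictionary from a few made-up `(user_id, product_id, rating)` tuples. The sample data and the `build_prefs` helper are hypothetical and not part of the original code.

```python
from collections import defaultdict

def build_prefs(ratings):
    """Group (user_id, product_id, rating) tuples into a nested sparse dict."""
    prefs = {}
    for user_id, product_id, rating in ratings:
        # Each user's row is a defaultdict(float), as described above.
        row = prefs.setdefault(user_id, defaultdict(float))
        row[product_id] = float(rating)
    return prefs

# Made-up sample ratings, for illustration only.
sample = [
    ("user_id_1", 1, 3.0),
    ("user_id_1", 2, 1.0),
    ("user_id_2", 4, 1.0),
    ("user_id_2", 8, 1.5),
]
prefs = build_prefs(sample)
```

Only the products a user actually rated are stored, so each row stays as small as that user's rating history.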
My k-means clustering implementation is as follows:
def kcluster(prefs,sim_func=pearson,k=100,max_iterations=100):
    from collections import defaultdict

    users = prefs.keys()
    centroids = [prefs[random.choice(users)] for i in range(k)]

    lastmatches = None
    for t in range(max_iterations):
        print 'Iteration %d' % t
        bestmatches = [[] for i in range(k)]

        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,0)

            for i in range(k):
                d = simple_pearson(row,centroids[i])
                if d < bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches

        centroids = [defaultdict(float) for i in range(k)]

        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])

            if len_best > 0:
                items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])

                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in items:
                        if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
    return bestmatches
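As far as I can tell, the first part of the algorithm (assigning each user to its nearest centroid) works fine.
The problem comes in the next part, which takes the average rating of each product within each cluster and uses those averages as the centroids for the next pass.
Basically, before it even finishes the calculation for the first cluster (i=0), the algorithm fails with a MemoryError on this line:
    if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
Originally the division step was in a separate loop, but that fared no better.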
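One thing worth checking here, assuming every row really is a `defaultdict(float)` as described earlier: merely reading `row[m]` for a product the user never rated inserts that key with the value 0.0. Looping over the union `items` would then widen every member row to the cluster's full product set, which could plausibly account for the memory growth. A minimal sketch of the effect, using made-up data:

```python
from collections import defaultdict

# A sparse row: this hypothetical user rated only two products.
row = defaultdict(float, {1: 3.0, 2: 1.0})
items = range(1, 1001)  # stand-in for the union of products in a cluster

size_before = len(row)
for m in items:
    if row[m] > 0.0:  # the lookup alone inserts m when it is missing
        pass
size_after = len(row)

# .get() reads a default without inserting anything.
safe_row = defaultdict(float, {1: 3.0, 2: 1.0})
total = sum(safe_row.get(m, 0.0) for m in items)

print("row grew from %d to %d keys" % (size_before, size_after))
print("safe_row still has %d keys" % len(safe_row))
```

With plain `dict` rows the same lookup would raise a `KeyError` rather than grow the row.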
This is the exception I get:
malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
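Any help would be greatly appreciated.
Update: the final algorithm
Thanks to the help received here, this is my fixed algorithm. If you spot anything blatantly wrong, please add a comment.
First, the simple_pearson implementation: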
from math import sqrt

def simple_pearson(v1,v2):
    # Only products rated in both rows contribute to the score
    si = [val for val in v1 if val in v2]
    n = len(si)
    if n==0: return 0.0

    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0

    for v in si:
        sum1+=v1[v]
        sum2+=v2[v]
        sum1_sq+=pow(v1[v],2)
        sum2_sq+=pow(v2[v],2)
        p_sum+=(v1[v]*v2[v])

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0

    r = num/den
    return r
A simple way to turn simple_pearson into a distance value:
def distance(v1,v2):
    return 1.0-simple_pearson(v1,v2)
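Because a Pearson coefficient lies in [-1, 1], this distance lies in [0, 2]: 0 for perfectly correlated rows, 1 for uncorrelated ones, and 2 for perfectly anti-correlated ones. The sketch below illustrates the mapping with a stand-in `to_distance` helper (a hypothetical name), applied to fixed correlation values rather than to real rating data.

```python
def to_distance(r):
    """Map a Pearson coefficient r in [-1, 1] to a distance in [0, 2]."""
    return 1.0 - r

# Fixed correlation values, chosen only for illustration.
for r in (1.0, 0.0, -1.0):
    print("r = %+.1f -> distance = %.1f" % (r, to_distance(r)))
```

An upper bound of 2.0 is therefore a safe starting value when searching for the nearest centroid.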
Finally, the k-means clustering implementation:
import random

def kcluster(prefs,k=21,max_iterations=50):
    from collections import defaultdict

    # A list, so random.sample receives a sequence
    users = list(prefs.keys())
    centroids = [prefs[u] for u in random.sample(users, k)]

    lastmatches = None
    for t in range(max_iterations):
        print('Iteration %d' % t)
        bestmatches = [[] for i in range(k)]

        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,2.0)

            for i in range(k):
                d = distance(row,centroids[i])
                if d <= bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches

        centroids = [defaultdict(float) for i in range(k)]

        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])

            if len_best > 0:
                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in row:
                        centroids[i][m]+=row[m]
            # Iterate over a copy of the keys, since entries may be deleted
            for key in list(centroids[i].keys()):
                centroids[i][key]/=len_best
                # We may have made the centroids quite dense which significantly
                # slows down subsequent iterations, so we delete values below a
                # threshold to speed things up
                if centroids[i][key] < 0.001:
                    del centroids[i][key]
    return centroids, bestmatches
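The centroid update in the function above can also be read in isolation. The sketch below is one possible standalone version, using a hypothetical `mean_centroid` helper and a `min_value` cutoff that mirrors the 0.001 threshold. It visits only the products each member actually rated, so member rows are never widened.

```python
from collections import defaultdict

def mean_centroid(rows, min_value=0.001):
    """Average sparse rating rows, treating missing ratings as 0."""
    totals = defaultdict(float)
    if not rows:
        return totals
    for row in rows:
        for product_id, rating in row.items():
            totals[product_id] += rating
    n = float(len(rows))
    # Build a new dict so nothing is deleted while iterating.
    kept = {}
    for product_id, total in totals.items():
        mean = total / n
        if mean >= min_value:
            kept[product_id] = mean
    return defaultdict(float, kept)

# Made-up member rows for one cluster.
members = [{1: 3.0, 2: 1.0}, {1: 1.0, 4: 2.0}]
centroid = mean_centroid(members)
```

Dividing by the cluster size rather than by the number of members who rated each product matches the update in `kcluster`, where an unrated product counts as 0.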
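【Comments】:
- I ran this on a Windows Vista laptop with 4GB of RAM, and with 100k users the memory usage seems to be about 100MB, so I don't get the problem you describe. However, I do get this: >>> print kcluster(prefs,k=100,max_iterations=100) Iteration 0 Traceback (most recent call last): File "", line 1, in File "", line 38, in kcluster KeyError: 3 So there may be a problem with your algorithm. Or with the indentation: I couldn't cut and paste your code from SO without reformatting it, so I may have gotten it wrong.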
Tags: python