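【Posted】:2016-06-28 16:11:08
【Question】:
I am running scikit-learn's "Selecting the number of clusters" example in Python. The example takes a set of samples with 2 features and finds the best k for k-means clustering.
In my case the samples have 3 features; they are actually 3-dimensional coordinates. So in the code I only changed the input to my own samples and left the rest unchanged. I have a very large number of sample points, possibly more than 10,000.
When I feed in all of the data, I get a memory error (I have 16 GB of RAM and all of it fills up). With half of the data there is no error. Although the IPython notebook reports the error in the silhouette function, I am fairly sure it actually happens in kmeans: it does not perform the clustering and jumps straight to this error.
With the same amount of data, I ran k-means clustering in C++ and it had no problems at all and was very fast.
Any ideas on how to solve this?
This is the error I get: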
MemoryError Traceback (most recent call last)
<ipython-input-4-ed4b060ccea1> in <module>()
41 # This gives a perspective into the density and separation of the formed
42 # clusters
---> 43 silhouette_avg = silhouette_score(X, cluster_labels)
44 print("For n_clusters =", n_clusters,
45 "The average silhouette_score is :", silhouette_avg)
/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
82 else:
83 X, labels = X[indices], labels[indices]
---> 84 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
85
86
/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
141
142 """
--> 143 distances = pairwise_distances(X, metric=metric, **kwds)
144 n = labels.shape[0]
145 A = np.array([_intra_cluster_distance(distances[i], labels, i)
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
649 func = pairwise_distance_functions[metric]
650 if n_jobs == 1:
--> 651 return func(X, Y, **kwds)
652 else:
653 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
181 distances.flat[::distances.shape[0] + 1] = 0.0
182
--> 183 return distances if squared else np.sqrt(distances)
184
185
MemoryError:
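The traceback above ends in `pairwise_distances`, not in the clustering step: in the scikit-learn version shown, `silhouette_samples` builds a dense n x n distance matrix, and `np.sqrt(distances)` then allocates a second array of the same size. The signature in the traceback also shows a `sample_size` argument, which scores a random subset instead of every point. Below is a minimal sketch of both ideas, assuming float64 distances; `pairwise_matrix_gib` is a hypothetical helper, and the random `X` is only a stand-in for the real 3-D coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pairwise_matrix_gib(n_samples, bytes_per_value=8):
    """Approximate size in GiB of one dense n x n float64 distance matrix."""
    return n_samples ** 2 * bytes_per_value / 1024.0 ** 3


# Synthetic stand-in for the real data (assumption: array of shape (n, 3)).
rng = np.random.RandomState(0)
X = rng.rand(5000, 3)

# Cluster all points; k-means itself never needs the n x n matrix.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Score a random subset of 2000 points, so at most a 2000 x 2000
# distance matrix is built instead of a 5000 x 5000 one.
score = silhouette_score(X, labels, sample_size=2000, random_state=0)

print("one full matrix for 100000 points: %.1f GiB" % pairwise_matrix_gib(100000))
print("subsampled silhouette score: %.3f" % score)
```

The subsampled score is an estimate of the full silhouette average, so it can vary with `random_state`; whether 2000 points is a reasonable sample size for a given data set is a judgment call.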
【Comments】:
-
How are you loading the data? Maybe it could be generated lazily.
-
Like this:
from os import listdir
from os.path import isfile, join
import numpy as np
mypath = '/Desktop/trainingFiles/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
results_trajectories = []
for i in range(6, len(onlyfiles)):
    fname = onlyfiles[i]
    filepath = mypath + fname
    f = open(filepath, 'r')
    t = f.read().split('\n')
    for line in t:
        if line:
            ll = [float(x) for x in line.split(',')]
            results_trajectories.append(ll)
all_Trajectories = np.array(results_trajectories)
print(all_Trajectories)
X = all_Trajectories
range_n_clusters = [4, 5, 6, 7, 8, 9, 10]
-
Then I use X as the input.
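As a possible take on the "lazily" suggestion, the loading loop from the comment can be written as a generator that yields one row at a time. This is only a sketch, assuming comma-separated float rows as in that comment; `iter_rows`, `load_points` and the `skip_files` parameter are hypothetical names, and the files are sorted here, whereas the original relies on whatever order `listdir` returns.

```python
import os

import numpy as np


def iter_rows(directory, skip_files=0):
    """Yield one list of floats per non-empty line of each file in `directory`."""
    names = sorted(f for f in os.listdir(directory)
                   if os.path.isfile(os.path.join(directory, f)))
    # skip_files mirrors the range(6, ...) offset in the original loop.
    for name in names[skip_files:]:
        with open(os.path.join(directory, name)) as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield [float(x) for x in line.split(',')]


def load_points(directory, skip_files=0):
    """Collect all rows into an (n_samples, n_features) float array."""
    return np.array(list(iter_rows(directory, skip_files)), dtype=float)
```

The final array still has to be held in memory for clustering, so this mainly tidies the loading step; an array of n points with 3 float64 coordinates takes only 24 n bytes, which is small next to an n x n distance matrix.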
Tags: python c++ scikit-learn k-means