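【Posted】: 2018-02-02 12:17:23
【Question】:
I have one million sentences, and I am using the Affinity Propagation algorithm to cluster similar sentences together. With a data set this large, I run into a memory error.
Error: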
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in <module>()
     72 #]
     73
---> 74 clusters = get_clusters(sentences)
     75 #print(clusters)
     76

in get_clusters(sentences)
     18 def get_clusters(sentences):
     19     tf_idf_matrix = vectorizer.fit_transform(sentences)
---> 20     similarity_matrix = (tf_idf_matrix * tf_idf_matrix.T).A
     21     affinity_propagation = AffinityPropagation(affinity="precomputed", damping=0.5)
     22     affinity_propagation.fit(similarity_matrix)

~/.local/lib/python3.5/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    562     def __getattr__(self, attr):
    563
--> 564             return self.toarray()
    565
    566

~/.local/lib/python3.5/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    962
    963         """See the docstring for `spmatrix.toarray`."""
--> 964         return self.tocoo(copy=False).toarray(order=order, out=out)
    965
    966

~/.local/lib/python3.5/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
    250     def toarray(self, order=None, out=None):
    251         """See the docstring for `spmatrix.toarray`."""
--> 252         B = self._process_toarray_args(order, out)
    253         fortran = int(B.flags.f_contiguous)
    254         if not fortran and not B.flags.c_contiguous:

~/.local/lib/python3.5/site-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
   1037             return out
   1038         else:
-> 1039             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1040
   1041     def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):

MemoryError:
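For scale: the `np.zeros(self.shape, ...)` call at the bottom of the traceback is trying to allocate a dense n × n array, where n is the number of sentences. The sketch below is a rough estimate only. It assumes float64 entries (the default dtype of `TfidfVectorizer`) and two row counts: one million, as stated in the post, and 50,000, which is a guess based on the file name `first_50K.csv`. The helper `dense_matrix_bytes` is a hypothetical name used for illustration.

```python
def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array.

    itemsize=8 assumes float64 entries (an assumption, see the lead-in).
    """
    return n_rows * n_cols * itemsize


GIB = 1024 ** 3  # bytes per gibibyte

# Row counts are assumptions: 50,000 from the file name, 1,000,000 from the post.
for n in (50000, 1000000):
    size = dense_matrix_bytes(n, n)
    print("n = {:>9,d}: {:,d} bytes ({:,.1f} GiB)".format(n, size, size / GIB))
```

Under these assumptions the similarity matrix alone needs about 20 GB for 50,000 sentences and about 8 TB for one million, before Affinity Propagation allocates any working arrays of its own.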
Code:
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation
import pandas as pd
from collections import Counter

# Translation table that deletes every punctuation character
punctuation_map = dict((ord(char), None) for char in string.punctuation)
stemmer = nltk.stem.snowball.SpanishStemmer()

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    # Lowercase, strip punctuation, tokenize, then stem
    return stem_tokens(nltk.word_tokenize(text.lower().translate(punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize)

def get_clusters(sentences):
    tf_idf_matrix = vectorizer.fit_transform(sentences)
    # .A turns the sparse product into a dense n x n array; the MemoryError is raised here
    similarity_matrix = (tf_idf_matrix * tf_idf_matrix.T).A
    affinity_propagation = AffinityPropagation(affinity="precomputed", damping=0.5)
    affinity_propagation.fit(similarity_matrix)
    labels = affinity_propagation.labels_
    cluster_centers = affinity_propagation.cluster_centers_indices_
    tagged_sentences = zip(sentences, labels)
    clusters = {}
    # Key each cluster by its exemplar sentence
    for sentence, cluster_id in tagged_sentences:
        clusters.setdefault(sentences[cluster_centers[cluster_id]], []).append(sentence)
    return clusters

# Load the data file
filename = "/home/ubuntu/VA_data/first_50K.csv"
df = pd.read_csv(filename, header=None)
sentences = df.iloc[:, 0].values.tolist()
clusters = get_clusters(sentences)

# Print cluster labels in descending order of the number of sentences in each
for k in sorted(clusters, key=lambda k: len(clusters[k]), reverse=True):
    print(k, "\n")

# Print each cluster with the sentences in it
for cluster in clusters:
    print(cluster, ':')
    count = 0
    for element in clusters[cluster]:
        print(' - ', element)
        count += 1
    print('Cluster size: ', count)
    print('% of queries within the cluster', (count / len(sentences)) * 100)
# cluster_centers is local to get_clusters, so count the returned clusters instead
print('Number of clusters: ', len(clusters))
How should I solve this problem? Please help.
【Comments】:
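One possible direction, sketched here under stated assumptions rather than offered as a definitive fix: scikit-learn's Affinity Propagation works on a full n × n similarity matrix, so its memory use grows quadratically with the number of sentences. The sketch below swaps in a different algorithm, `MiniBatchKMeans`, which accepts the sparse TF-IDF matrix directly and never builds the n × n matrix. Unlike Affinity Propagation, it needs the number of clusters chosen up front. The function name `cluster_sparse`, the `n_clusters` value, and the toy sentences are all hypothetical, and the default tokenizer is used instead of the stemming tokenizer from the question to keep the example self-contained.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_sparse(sentences, n_clusters, seed=0):
    """Group sentences by k-means label without densifying any matrix."""
    # TF-IDF rows are L2-normalised by default, so Euclidean k-means on them
    # is closely related to grouping by cosine similarity.
    tf_idf_matrix = TfidfVectorizer().fit_transform(sentences)  # stays sparse
    model = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    labels = model.fit_predict(tf_idf_matrix)
    clusters = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(int(label), []).append(sentence)
    return clusters


# Hypothetical toy input, only to show the call shape.
toy_sentences = [
    "reset my password",
    "how do I reset the password",
    "password reset link expired",
    "track my order",
    "where is my order",
    "order tracking number missing",
]
toy_clusters = cluster_sparse(toy_sentences, n_clusters=2)
for label, members in sorted(toy_clusters.items()):
    print(label, members)
```

Here the clusters are keyed by integer label rather than by an exemplar sentence, because k-means centroids are not members of the data set. Whether the resulting groups are as useful as Affinity Propagation's would have to be checked on the real data.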
- Buy more memory?
- @Mad: I have 16 GB of RAM on a 64-bit Ubuntu system.
- You should probably edit that into your question.
Tags: python memory memory-management cluster-analysis