【Posted】: 2015-01-09 08:25:07
【Question】:
I am using scikit-learn to build n-grams from multiple text documents. I need to compute the document frequency of each term with CountVectorizer.
Example:
document1 = "john is a nice guy"
document2 = "person can be a guy"
So the document frequency would be
{'be': 1,
'can': 1,
'guy': 2,
'is': 1,
'john': 1,
'nice': 1,
'person': 1}
Here the documents are just short strings, but when I run the same code on a large amount of data it raises a MemoryError.
Code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(document).todense()
tranformer = vectorizer.transform(document).todense()
matrix_terms = np.array(vectorizer.get_feature_names())
lst_freq = map(sum,zip(*tranformer.A))
matrix_freq = np.array(lst_freq)
final_matrix = np.array([matrix_terms,matrix_freq])
Error:
Traceback (most recent call last):
File "demo1.py", line 13, in build_ngrams_matrix
X = vectorizer.fit_transform(document).todense()
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 605, in todense
return np.asmatrix(self.toarray(order=order, out=out))
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 901, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 269, in toarray
B = self._process_toarray_args(order, out)
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
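The last frame of the traceback shows todense() calling np.zeros(self.shape), which allocates one cell for every (document, n-gram) pair. A rough sketch of how to compare that dense allocation with the sparse storage, without actually allocating it, might look like this. The two-document corpus is a toy stand-in for the real data, and the variable names are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real corpus (hypothetical data, not the 7 MB input).
docs = ["john is a nice guy", "person can be a guy"]

# fit_transform returns a SciPy sparse matrix in CSR format.
X = CountVectorizer(ngram_range=(1, 5)).fit_transform(docs)

n_docs, n_terms = X.shape

# Bytes that todense() would have to allocate: one cell per (document, n-gram).
dense_bytes = n_docs * n_terms * X.dtype.itemsize

# Bytes the CSR representation actually uses (values + column indices + row pointers).
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print("shape:", X.shape, "stored entries:", X.nnz)
print("dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)
```

On a real corpus with ngram_range=(1, 5) the number of columns grows very quickly, so the dense figure can exceed available memory even when the sparse matrix itself is small.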
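【Comments】:
- I think calling todense() is what triggers the MemoryError. Without todense(), the result comes back as a sparse matrix, and I don't know how to read that sparse matrix. Any help?
- If you really want to look at the sparse matrix, you can inspect a small part of it (for example the first 10 rows) with X[:10,:].todense(). Most other operations, such as summing, work the same way on sparse and dense matrices, so you don't actually need to call todense/A/toarray.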
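A minimal sketch of that suggestion, assuming the goal is the document frequency shown in the question: keep the matrix sparse, count the documents in which each column is nonzero, and densify only a small slice for inspection. The names docs, doc_freq, and df_by_term are illustrative, and the lookup uses vectorizer.vocabulary_ (term to column index) so that it does not depend on which feature-name method the installed scikit-learn version provides:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["john is a nice guy", "person can be a guy"]

# Unigrams only, to match the example in the question;
# pass ngram_range=(1, 5) to count n-grams instead.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # stays sparse, no todense()

# Document frequency: number of documents in which each term occurs at least once.
# (X > 0) is still sparse; summing over rows gives a 1 x n_terms result.
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()

# Map each term to its document frequency via the fitted vocabulary.
df_by_term = {term: int(doc_freq[col]) for term, col in vectorizer.vocabulary_.items()}
print(df_by_term)

# Densify only a small slice (here at most the first 10 rows) when a look is needed.
print(X[:10, :].toarray())
```

Note that summing X directly over rows gives total occurrence counts, which differ from document frequency whenever a term appears more than once in the same document. The default tokenizer also drops single-character tokens, which is why 'a' does not appear in the result.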
Tags: python memory numpy scikit-learn n-gram