Unable to resolve this memory error in Python, pandas NLP

【Question title】: sklearn TfidfVectorizer giving MemoryError (unable to resolve this memory error in Python, pandas NLP)
【Posted】: 2021-05-23 07:59:09
【Question description】:

MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64

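from sklearn.feature_extraction.text import TfidfVectorizer

# remove_stopwords is a callable defined elsewhere; it is used as the analyzer
tfidf = TfidfVectorizer(analyzer=remove_stopwords)

# fit_transform returns a SciPy sparse matrix, so this step succeeds
X = tfidf.fit_transform(df['lemmatize'])
print(X.shape)

Output: (50000, 164921)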

The memory error is raised at the next step, when the sparse matrix is converted to a dense DataFrame:

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names())

MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64
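
The 61.4 GiB in the message follows directly from the shape and the data type: a dense array stores every cell, zero or not, at 8 bytes per float64 value. The sketch below reproduces that arithmetic. The helper name `dense_size_gib` and the byte-size table are illustrative, not part of any library.

```python
# Bytes per element for a few common NumPy dtypes (hard-coded for illustration).
BYTES_PER_ELEMENT = {"float64": 8, "float32": 4, "uint8": 1}


def dense_size_gib(shape, dtype):
    """Return the memory a dense array of this shape and dtype would need, in GiB."""
    n_rows, n_cols = shape
    return n_rows * n_cols * BYTES_PER_ELEMENT[dtype] / 2**30


shape = (50000, 164921)  # shape reported by print(X.shape) in the question
size_f64 = dense_size_gib(shape, "float64")
size_f32 = dense_size_gib(shape, "float32")

print(f"float64: {size_f64:.1f} GiB")  # float64: 61.4 GiB
print(f"float32: {size_f32:.1f} GiB")  # float32: 30.7 GiB
```

Even at half precision per value the dense array stays in the tens of GiB, which is why the sparse result of fit_transform fits in memory while X.toarray() does not.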

【Question comments】:

【Solution 1】:

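You are running out of memory because X.toarray() builds a dense float64 array. A smaller data type reduces the footprint, but uint8 is not a good choice here: with the default L2 normalization, TF-IDF weights are fractions between 0 and 1, so casting them to an integer type truncates almost all of them to 0. float32 keeps the values and halves the requirement to about 30.7 GiB, which may still be more than your machine has. Note also that np.array(X) does not densify a SciPy sparse matrix; cast the sparse matrix first and then call toarray(). Please try this, and let me know if it raises the same error again.

df = pd.DataFrame(X.astype(np.float32).toarray(), columns=tfidf.get_feature_names())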
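
As an alternative to shrinking the data type, the matrix can stay sparse all the way into pandas. The following is a minimal sketch of that different technique, not the method proposed in this answer. It uses `pd.DataFrame.sparse.from_spmatrix` on a small random matrix that stands in for the TF-IDF output; the toy shape, the density, and the `term_i` column names are assumptions made for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for the sparse matrix returned by tfidf.fit_transform(...).
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

# Hypothetical column names; in the question these would be the vectorizer's feature names.
columns = [f"term_{i}" for i in range(X.shape[1])]

# Build a DataFrame whose columns are sparse arrays, so zeros are never materialized.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=columns)

# Compare with what a dense array of the same shape and dtype would need.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
sparse_bytes = int(df.memory_usage(deep=False).sum())

print(df.shape)
print(f"dense: {dense_bytes} bytes, sparse DataFrame: {sparse_bytes} bytes")
```

With this many columns, building the sparse DataFrame can be slow on real data. Many scikit-learn estimators also accept the sparse matrix X directly, so a DataFrame may not be needed at all.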

【Comments】: