Euclidean distance of sentences: answers

【Question title】: Euclidean distance of sentences
【Posted】: 2021-06-01 05:53:03
【Question description】:

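I have a scipy sparse matrix with 116,098 sentences and 30,119 features. I want to compute the Euclidean distance between every pair of sentences and print the 5 most similar sentences for each one.

I am using CountVectorizer to build the vocabulary and encode the words.

But I am getting an error. Please help. I have only just started implementing NLP in Python.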

vectorizer = CountVectorizer(stop_words = 'english')
features = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)

features.shape
(116098, 30119)

print(len(vectorizer.vocabulary_))
30119

for i in range(0,116098):  
    for j in features:
        print(euclidean_distances(features[j],i))

IndexError                                Traceback (most recent call last)
<ipython-input-56-528966153c16> in <module>
      1 for i in range(0,116098):
      2     for j in features:
----> 3         print(euclidean_distances(features[j],i))

~\Anaconda3\lib\site-packages\scipy\sparse\_index.py in __getitem__(self, key)
     33     """
     34     def __getitem__(self, key):
---> 35         row, col = self._validate_indices(key)
     36         # Dispatch to specialized methods.
     37         if isinstance(row, INT_TYPES):

~\Anaconda3\lib\site-packages\scipy\sparse\_index.py in _validate_indices(self, key)
    128     def _validate_indices(self, key):
    129         M, N = self.shape
--> 130         row, col = _unpack_index(key)
    131 
    132         if isintlike(row):

~\Anaconda3\lib\site-packages\scipy\sparse\_index.py in _unpack_index(index)
    274         # not work because spmatrix.ndim is always 2.
    275         raise IndexError(
--> 276             'Indexing with sparse matrices is not supported '
    277             'except boolean indexing where matrix and index '
    278             'are equal shapes.')

IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.

【Question comments】:

Tags: python numpy machine-learning scipy nlp

【Answer 1】:

You probably meant:

print(euclidean_distances(features[j],features[i]))

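which should work fine, provided i and j are both integer row indices. In the question's loop, "for j in features" yields rows of the matrix rather than indices, so j would also need to come from range(features.shape[0]).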
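A minimal sketch of that fix, assuming a small made-up corpus (the `corpus` list below is hypothetical and stands in for the asker's data). Both loop variables are integer row indices, so `features[i]` and `features[j]` are each a single sparse row, which `euclidean_distances` accepts directly:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy corpus standing in for the asker's sentences.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices fell sharply today",
]

vectorizer = CountVectorizer(stop_words="english")
features = vectorizer.fit_transform(corpus)  # scipy sparse matrix, one row per sentence

n_sentences = features.shape[0]
for i in range(n_sentences):
    for j in range(i + 1, n_sentences):
        # features[i] and features[j] are 1 x n_features sparse rows.
        dist = euclidean_distances(features[i], features[j])[0, 0]
        print(i, j, dist)
```

A nested loop like this visits every pair of sentences, so it is only a way to check the indexing on a small sample; with 116,098 sentences the number of pairs is in the billions.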

【Comments】:

【Answer 2】:

CountVectorizer returns a sparse matrix.

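From the sklearn CountVectorizer documentation:

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

You can convert it to a numpy array to use with euclidean_distances.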

features = features.toarray()
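One caveat on the dense conversion: a 116,098 x 30,119 array has about 3.5 billion entries, which is roughly 28 GB assuming 8-byte entries, so `toarray()` is realistic only for a subset of the rows. scikit-learn's distance routines also accept sparse input, so the matrix can stay sparse. The following is a rough sketch of the "5 most similar sentences" goal under those assumptions, not the answerer's method: it swaps in `NearestNeighbors` with brute-force Euclidean search, and the `corpus` list and the `top_k` dictionary are hypothetical names used for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus; the real data would be the asker's sentences.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
    "the market closed higher today",
    "a cat and a dog played outside",
]

vectorizer = CountVectorizer(stop_words="english")
# Keep the matrix sparse; cast the counts to float for the distance computation.
features = vectorizer.fit_transform(corpus).astype(float)

k = 5
# Ask for k + 1 neighbours because each sentence is usually its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean", algorithm="brute")
nn.fit(features)
distances, indices = nn.kneighbors(features)

top_k = {}
for i in range(features.shape[0]):
    # Drop the query sentence itself, then keep the k closest remaining ones.
    pairs = [(int(j), float(d)) for j, d in zip(indices[i], distances[i]) if j != i]
    top_k[i] = pairs[:k]
    print(corpus[i], "->", [corpus[j] for j, _ in top_k[i]])
```

Filtering on `j != i` rather than always discarding the first neighbour keeps the result correct when two sentences have identical count vectors and the query is not returned first.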

【Comments】: