[Posted]: 2018-09-16 18:44:23
[Question]:
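I am trying to write a Python method that efficiently returns the n words closest to a given word, based on their respective embedding vectors. Each vector has 200 dimensions, and there are several million of them.
This is what I have so far. It simply computes the cosine similarity between the target word and every other word, which is very, very slow: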
from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity


def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    # Compare with None explicitly: the truth value of a multi-element NumPy array is ambiguous
    if word_vector is not None:
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            # cosine_similarity returns a 1x1 array here; [0, 0] extracts the scalar
            key=lambda other_word: cosine_similarity(
                word_vector, word_vectors[other_word].reshape((1, -1)))[0, 0],
            reverse=True)
        return sorted_by_sim[1:n + 1]  # ignore first item, which should be target word itself
    return list()
Does anyone have a better suggestion?
[Discussion]:
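One possible direction, offered as a sketch under assumptions rather than a definitive answer: most of the cost in the question's code comes from calling cosine_similarity once per word and then sorting the whole vocabulary. If the vectors are stacked into a single matrix and normalized to unit length once, the cosine similarity to every word becomes a single matrix-vector product, and np.argpartition can pick out the top n without a full sort. The names build_index and n_nearest_words_fast are hypothetical, the lookup is an exact key match (it does not reproduce the capitalization fallback of get_word_vector()), and the matrix is assumed to fit in memory (roughly 800 MB per million 200-dimensional float32 vectors).

```python
import numpy as np


def build_index(word_vectors):
    """Build the lookup structures once, up front (hypothetical helper).

    Returns the word list, a matrix of unit-length row vectors, and a
    dict mapping each word to its row number.
    """
    words = list(word_vectors.keys())
    matrix = np.stack([np.asarray(word_vectors[w], dtype=np.float32) for w in words])
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors as zeros instead of dividing by 0
    word_to_row = {w: i for i, w in enumerate(words)}
    return words, matrix / norms, word_to_row


def n_nearest_words_fast(word, n, words, unit_matrix, word_to_row):
    """Return the n words most similar to `word` by cosine similarity."""
    row = word_to_row.get(word)  # exact match only; no capitalization fallback
    if row is None:
        return []
    n = min(n, len(words) - 1)  # cannot return more neighbours than exist
    if n <= 0:
        return []
    # Rows are unit length, so the dot product equals the cosine similarity.
    sims = unit_matrix @ unit_matrix[row]
    sims[row] = -np.inf  # exclude the target word itself
    # argpartition finds the n best in linear time; only those n are then sorted.
    top = np.argpartition(-sims, n - 1)[:n]
    top = top[np.argsort(-sims[top])]
    return [words[i] for i in top]
```

With this layout the index is built once and reused, so each query costs one pass over the matrix. If exact results are not required, an approximate nearest-neighbour library may be faster still, at the cost of occasionally missing a true neighbour.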
Tags: python vector similarity cosine-similarity word-embedding