Which trained embedding vectors from Gensim (word2vec model) should be used for Tensorflow? Unnormalised or normalised ones? Answers

【Question title】: Which trained embedding vectors from Gensim (word2vec model) should be used for Tensorflow? Unnormalised or normalised ones?
【Posted】: 2021-04-06 07:07:40
【Question description】:

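I want to use vectors trained with Gensim (word2vec model) in a neural network (Tensorflow). There are two sets of weights I could use for this. The first is model.syn0, and the second is model.vectors_norm (after calling model.init_sims(replace=True)). The second is the set of vectors used to compute similarity. Which one has the correct order (matching model.wv.index2word and model.wv.vocab[X].index) and the right weights for the embedding layer of a neural network?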

【Question discussion】:

Tags: tensorflow keras gensim word2vec word-embedding

【Solution 1】:

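If you are using Google's GoogleNews-vectors as the pretrained model, you can use model.syn0. If you are using Facebook's fastText word embeddings, you can load the binary file directly.
Below are examples of loading both.

Loading the GoogleNews pretrained embeddings: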

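import gensim

model_path = 'GoogleNews-vectors.kv'  # hypothetical local path for the cached copy
# First-time load from the original word2vec binary (slow).
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)
# Save in gensim's native format so that later loads are faster and can be memory-mapped.
model.save(model_path)
model = gensim.models.KeyedVectors.load(model_path, mmap='r')
model.syn0norm = model.syn0  # reuse the raw vectors so the normalized copy is not recomputed (pre-4.0 attribute names)
index2word_set = set(model.index2word)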

model[word] gives the vector representation of the word which can be used to find similarity.

Loading the fastText pretrained embeddings:

import gensim
from gensim.models import FastText

# First-time load from the Facebook fastText binary (slow).
model = FastText.load_fasttext_format('cc.en.300')
# Save only the word vectors in gensim's native format so that later loads are faster.
model.wv.save("fasttext_en_bin")
model = gensim.models.KeyedVectors.load("fasttext_en_bin", mmap="r")
index2word_set = set(model.index2word)

model[word] gives the vector representation of the word which can be used to find similarity.

General example:

# Look the word up only if it is in the pretrained vocabulary.
if word in index2word_set:
    feature_vec = model[word]
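
The same membership check extends to building a weight matrix for an embedding layer. The following is a minimal sketch, not the answerer's code: build_embedding_matrix, toy_vectors, and word_index are hypothetical names, and a plain dict of NumPy arrays stands in for the loaded gensim model so the snippet runs without the pretrained files. It assumes the network's word ids are the contiguous integers 1..N, with row 0 reserved for padding; words missing from the pretrained vocabulary keep an all-zero row.

```python
import numpy as np

def build_embedding_matrix(vectors, word_index, dim):
    """Return a (len(word_index) + 1, dim) matrix; row 0 is reserved for padding.

    vectors: mapping word -> 1-D array (stand-in for the loaded model)
    word_index: mapping word -> positive integer id used by the network
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        if word in vectors:  # same idea as `word in index2word_set`
            matrix[idx] = vectors[word]
    return matrix

# Toy stand-in for pretrained vectors (assumption: the real ones have 300 dimensions).
toy_vectors = {
    "cat": np.array([1.0, 0.0, 2.0]),
    "dog": np.array([0.5, 1.5, 0.0]),
}
word_index = {"cat": 1, "dog": 2, "unicorn": 3}  # "unicorn" is out of vocabulary

embedding_matrix = build_embedding_matrix(toy_vectors, word_index, dim=3)
print(embedding_matrix.shape)
```

A matrix built this way could then be supplied as the initial weights of an embedding layer; the exact call depends on the framework version, so it is left out here.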

【Discussion】:

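Why do you use model.syn0norm = model.syn0?
@Eghbal That line basically prevents the normalized vectors from being recomputed, which improves speed, since similarity operations have to be performed on unit-normalized vectors.
I think the main question still hasn't been answered here. Should we use Gensim's normalized weights, or just syn0, when using these weights in a neural network?
You can use syn0.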
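
To illustrate how the two sets of weights differ, here is a small sketch that uses a random NumPy array as a stand-in for model.syn0 (an assumption for illustration, not gensim's own code). Unit normalization divides each row by its L2 norm, so the row order, and therefore the correspondence with index2word, is the same in both arrays; only the vector lengths differ.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 4)).astype(np.float32)  # stand-in for model.syn0

# Row-wise L2 normalization: each row is divided by its own length.
norms = np.linalg.norm(raw, axis=1, keepdims=True)
normalized = raw / norms  # stand-in for the normalized copy

# Cosine similarity on raw vectors equals a plain dot product on normalized ones.
i, j = 0, 3
cosine_raw = raw[i] @ raw[j] / (np.linalg.norm(raw[i]) * np.linalg.norm(raw[j]))
dot_normalized = normalized[i] @ normalized[j]
print(np.isclose(cosine_raw, dot_normalized, atol=1e-5))
```

Either array has one row per vocabulary entry in the same order. The raw one keeps the vector magnitudes learned during training, which is consistent with the suggestion above to use syn0.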