How to import word2vec into a TensorFlow Seq2Seq model? Answers

[Question title]: How to import word2vec into TensorFlow Seq2Seq model?
[Posted]: 2016-03-17 22:15:32
[Question description]:

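I am using the TensorFlow sequence-to-sequence translation model. I would like to know whether I can import my own word2vec vectors into this model, instead of using the original "dense representation" mentioned in the tutorial.

From my point of view, TensorFlow seems to use a one-hot representation for the seq2seq model. First, for the function tf.nn.seq2seq.embedding_attention_seq2seq, the encoder input is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715', and the function requires num_encoder_symbols. So I think it makes me provide the position of each word and the total number of words, and the function can then represent each word as a one-hot vector. I am still studying the source code, but it is hard to understand.

Could anyone give me an idea about the problem above?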
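
On the one-hot point raised above: an integer ID that indexes a row of an embedding matrix gives the same result as multiplying a one-hot vector by that matrix, so passing IDs does not by itself mean the model works on one-hot vectors. The following is a minimal pure-Python sketch of that equivalence. The matrix values and helper names are made up for illustration and are not taken from TensorFlow.

```python
# Toy embedding matrix: one row per vocabulary ID (4 words, 3 dimensions).
# All values are made up for illustration.
embedding = [
    [0.1, 0.2, 0.3],   # ID 0
    [0.4, 0.5, 0.6],   # ID 1
    [0.7, 0.8, 0.9],   # ID 2
    [1.0, 1.1, 1.2],   # ID 3
]


def one_hot(token_id, size):
    """Return a one-hot list with 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def one_hot_times_matrix(token_id, matrix):
    """Multiply the one-hot row vector for token_id by the matrix."""
    hot = one_hot(token_id, len(matrix))
    dim = len(matrix[0])
    return [sum(hot[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(dim)]


def lookup(token_id, matrix):
    """Select the row for token_id directly, as an embedding lookup would."""
    return list(matrix[token_id])


token_id = 2
via_one_hot = one_hot_times_matrix(token_id, embedding)
via_lookup = lookup(token_id, embedding)
```

Both routes select the same row, which is why a vocabulary size and a list of IDs are all such a lookup needs.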

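[Question comments]:

Tags: python tensorflow

[Solution 1]:

The seq2seq embedding_* functions do create embedding matrices very similar to those from word2vec. They are stored in a variable named like this: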

EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"

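Knowing this, you can simply modify this variable. I mean: get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the vectors you read in the way illustrated by the snippet below (it is just a snippet, so you will have to change it to make it work, but I hope it shows the idea).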

    # Needs: import sys; import numpy as np; import tensorflow as tf;
    #        from tensorflow.python.platform import gfile
    # session, model, FLAGS and vec_size come from the surrounding program.
    vectors_variable = [v for v in tf.trainable_variables()
                        if EMBEDDING_KEY in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many.")
        sys.exit(1)
    vectors_variable = vectors_variable[0]
    # Start from the current (initialized) values so that words missing
    # from the file keep their existing vectors.
    vectors = vectors_variable.eval()
    print("Setting word vectors from %s" % FLAGS.word_vector_file)
    with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
        # Lines have format: dog 0.045123 -0.61323 0.413667 ...
        for line in f:
            line_parts = line.split()
            # The first part is the word.
            word = line_parts[0]
            if word in model.vocab:
                # Remaining parts are components of the vector.
                # (A list is built explicitly because map() is lazy in Python 3.)
                word_vector = np.array([float(x) for x in line_parts[1:]])
                if len(word_vector) != vec_size:
                    print("Warn: Word '%s', Expecting vector size %d, found %d"
                          % (word, vec_size, len(word_vector)))
                else:
                    vectors[model.vocab[word]] = word_vector
    # Assign the modified vectors to the vectors_variable in the graph.
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
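
The file-reading part of the snippet above can be tried without TensorFlow. Here is a minimal sketch that uses plain Python lists in place of the embedding variable. The function name load_vectors, the toy vocabulary and the in-memory "file" are assumptions made for illustration, not part of the tutorial code.

```python
import io


def load_vectors(lines, vocab, vectors, vec_size):
    """Overwrite rows of `vectors` with pre-trained values where available.

    lines:    iterable of text lines, e.g. "dog 0.045123 -0.61323 0.413667"
    vocab:    dict word -> row index (hypothetical stand-in for model.vocab)
    vectors:  list of rows (lists of floats), modified in place
    Returns the number of rows that were overwritten.
    """
    replaced = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        if word not in vocab:
            continue  # word is not used by the model
        if len(values) != vec_size:
            print("Warn: word %r has %d components, expected %d"
                  % (word, len(values), vec_size))
            continue
        vectors[vocab[word]] = [float(v) for v in values]
        replaced += 1
    return replaced


# Toy data, made up for illustration.
vocab = {"PAD": 0, "a": 1, "dog": 2}
vectors = [[0.0, 0.0, 0.0] for _ in vocab]  # stands in for the initial values
word_vector_file = io.StringIO(
    "dog 0.045123 -0.61323 0.413667\n"
    "cat 0.1 0.2 0.3\n"   # not in vocab: ignored
    "a 0.5 0.5\n"         # wrong size: warned about and skipped
)
replaced = load_vectors(word_vector_file, vocab, vectors, vec_size=3)
```

Rows for words that are absent from the file, or whose vector has the wrong length, are left at their initial values. The same holds in the snippet above, because it starts from the variable's current contents.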

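[Comments]:

Very useful suggestion. But I am still wondering where the original embeddings are stored. Why use a scoping approach such as with vs.variable_scope(scope or type(self).__name__): when I would think the most direct way is to load an embedding vector file from somewhere? Thanks again for your help :)
What about the "PAD", "EOS" and "GO" tokens? Where do these go?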

[Solution 2]:

I guess that with the scope style Matthew mentioned, you can get the variable:

    with tf.variable_scope("embedding_attention_seq2seq"):
        with tf.variable_scope("RNN"):
            with tf.variable_scope("EmbeddingWrapper", reuse=True):
                # [shape] and [trainable=] stand for the optional arguments.
                embedding = tf.get_variable("embedding", [shape], [trainable=])

Also, I imagine you would want to inject the embeddings into the decoder as well; its key (or scope) would be something like:

"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
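
To make the two keys concrete, here is a small sketch that picks one variable name per key out of a list of names and fails if a key matches zero or several names. The list of names is hypothetical; in a real graph it would come from something like [v.name for v in tf.trainable_variables()], and the actual names may differ between TensorFlow versions.

```python
ENCODER_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
DECODER_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"


def find_unique(names, key):
    """Return the single name containing `key`; raise if not exactly one."""
    matches = [name for name in names if key in name]
    if len(matches) != 1:
        raise ValueError("expected exactly one match for %r, found %d"
                         % (key, len(matches)))
    return matches[0]


# Hypothetical variable names, for illustration only.
names = [
    ENCODER_KEY + ":0",
    DECODER_KEY + ":0",
    "some_other_scope/weights:0",
]
encoder_name = find_unique(names, ENCODER_KEY)
decoder_name = find_unique(names, DECODER_KEY)
```

Raising on an ambiguous match plays the same role as the len(...) != 1 check in the first answer: it avoids silently overwriting the wrong variable.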

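Thanks for your answer, Lucas!

I was wondering, what exactly does model.vocab[word] in the code snippet stand for? Just the position of the word in the vocabulary?

In that case, wouldn't it be faster to iterate over the vocabulary and inject the w2v vectors for the words that exist in the w2v model?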
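
A rough sketch of that alternative, assuming the w2v vectors are already in memory as a dict from word to vector (a hypothetical stand-in for whatever the w2v model object offers) and that model.vocab maps each word to its row index. Whether it is actually faster depends on how the vectors are stored: if they live in a text file, the file still has to be read once.

```python
def inject_by_vocab(vocab, w2v, vectors):
    """For each vocabulary word present in w2v, copy its vector into `vectors`.

    vocab:   dict word -> row index
    w2v:     dict word -> list of floats (hypothetical in-memory w2v model)
    vectors: list of rows, modified in place
    Returns the words that had no pre-trained vector.
    """
    missing = []
    for word, row in vocab.items():
        vector = w2v.get(word)
        if vector is None:
            missing.append(word)  # keeps its existing row
        else:
            vectors[row] = list(vector)
    return missing


# Toy data, made up for illustration.
vocab = {"PAD": 0, "EOS": 1, "GO": 2, "a": 3, "dog": 4}
w2v = {"a": [0.1, 0.2], "dog": [0.3, 0.4], "cat": [0.5, 0.6]}
vectors = [[0.0, 0.0] for _ in vocab]
missing = inject_by_vocab(vocab, w2v, vectors)
```

With this loop the cost grows with the vocabulary size rather than with the size of the w2v model. Words without a pre-trained vector, which in this toy data are the special tokens PAD, EOS and GO, are reported in missing and keep whatever values their rows already had.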

[Comments]: