Obtaining a sentence embedding by taking the mean of all its word embeddings in TensorFlow? Answers

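【Question title】: Obtaining sentence embedding by getting the mean of all its word embeddings in Tensorflow?
【Posted】: 2018-12-31 19:58:34
【Question】:

This is my code for splitting an input tensor of type tf.string and extracting the embedding of each of its words using a pretrained GloVe model. However, I am getting an unjustified error about the cond implementation. I am wondering whether there is a cleaner way to obtain the embeddings of all the words in a string tensor.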

# Take out the words
target_words = tf.string_split([target_sentence], delimiter=" ")

# Tensorflow parallel while loop variable, condition and body
i = tf.constant(0, dtype=tf.int32)
cond = lambda self, i: tf.less(x=tf.cast(i, tf.int32), y=tf.cast(tf.shape(target_words)[0], tf.int32))
sentence_mean_embedding = tf.Variable([], trainable=False)

def body(i, sentence_mean_embedding):
    sentence_mean_embedding = tf.concat(1, tf.nn.embedding_lookup(params=tf_embedding, ids=tf.gather(target_words, i)))

    return sentence_mean_embedding

embedding_sentence = tf.reduce_mean(tf.while_loop(cond, body, [i, sentence_mean_embedding]))

【Question comments】:

Tags: python tensorflow

【Solution 1】:

There is a cleaner way using index_table_from_file and the Dataset API.

First, create your own tf.data.Dataset (I assume we have two sentences with some arbitrary labels):

sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))

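Second, create a vocab.txt file in which the number of each line maps to the same index in the GloVe embedding. For example, if the first vocabulary word in GloVe is "absent", then the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt contains the following words: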

first
is
test
this
second
sentence
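
One way to keep line numbers and embedding rows aligned is to derive vocab.txt directly from the GloVe text file. The sketch below assumes the usual GloVe text layout (one word per line, followed by its space-separated vector components); `build_vocab_file` and both file paths are hypothetical names, not part of the original answer.

```python
def build_vocab_file(glove_path, vocab_path):
    """Write GloVe words to vocab_path, one per line, in row order.

    Returns the vectors as a list of float lists, so row i of the result
    belongs to the word on line i of the vocabulary file.
    """
    vectors = []
    with open(glove_path, encoding="utf-8") as src, \
            open(vocab_path, "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:  # skip blank or malformed lines
                continue
            dst.write(parts[0] + "\n")
            vectors.append([float(value) for value in parts[1:]])
    return vectors
```

The returned list could then be converted to an array and used as the `embedding` matrix that initializes the weights in the last step.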

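Then, based on the reference linked as "here" in the original answer, define a table whose purpose is to convert each word into a specific id: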

table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))

dataset = dataset.batch(1)
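
To make the two `map` steps concrete, here is a plain-Python analogue of what they compute for the six-word vocab.txt above. It is an illustration only, not TensorFlow code: `sentence_to_ids` is a hypothetical helper, words are assumed to be separated by single spaces, and it assumes that with `num_oov_buckets=1` every out-of-vocabulary word receives the single extra id equal to the vocabulary size.

```python
VOCAB = ["first", "is", "test", "this", "second", "sentence"]
WORD_TO_ID = {word: index for index, word in enumerate(VOCAB)}
OOV_ID = len(VOCAB)  # the one out-of-vocabulary bucket


def sentence_to_ids(sentence):
    """Split on whitespace and map each word to its line index in VOCAB."""
    return [WORD_TO_ID.get(word, OOV_ID) for word in sentence.split()]


print(sentence_to_ids("this is first sentence"))   # [3, 1, 0, 5]
print(sentence_to_ids("this is second sentence"))  # [3, 1, 4, 5]
print(sentence_to_ids("this is unknown"))          # [3, 1, 6]
```

Because the OOV id is one past the last vocabulary line, the embedding matrix would need one more row than vocab.txt has lines for unknown words to be looked up safely.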

Finally, based on this answer, convert each sentence into an embedding using nn.embedding_lookup():

glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()

embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
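
The last two lines amount to gathering one row per word id and averaging over the word axis. A plain-Python sketch of that computation, using a made-up 2-dimensional weight table (`TOY_WEIGHTS` and `mean_embeddings` are illustrative names only, not part of the answer):

```python
# Toy stand-in for glove_weights: row i is the vector for word id i.
TOY_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]


def mean_embeddings(weights, batch_ids):
    """For each id list in the batch, average the selected rows per dimension."""
    result = []
    for ids in batch_ids:
        rows = [weights[i] for i in ids]          # like embedding_lookup
        dim = len(rows[0])
        result.append([sum(row[d] for row in rows) / len(rows)
                       for d in range(dim)])      # like reduce_mean(axis=1)
    return result


print(mean_embeddings(TOY_WEIGHTS, [[0, 1, 2]]))  # [[1.0, 1.0]]
```

The input has shape [batch, words] and the output has shape [batch, dim], which is why `axis=1` is the word axis.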

Complete code in eager mode:

import tensorflow as tf

tf.enable_eager_execution()

sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])

dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))

dataset = dataset.batch(1)

glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())

for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)

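【Comments】:

ids: A Tensor of type int32 or int64 containing the ids to be looked up in params.
I missed the fact that ids is a tensor (a list) of multiple ids. As a result, I was trying to iterate over the sentence ids one by one. Passing the vectorized sentence produced by tf.string_split as ids is enough.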
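
As a side note on the loop in the question: `tf.while_loop` calls both `cond` and `body` with all loop variables, and `body` has to return a value for each of them. The plain-Python analogue below (`while_loop` here is a hypothetical stand-in, not the TensorFlow function) shows a loop that satisfies that contract. It is a sketch of the calling convention only and does not claim to reproduce the exact error reported in the question.

```python
def while_loop(cond, body, loop_vars):
    """Plain-Python stand-in: call cond/body with every loop variable."""
    loop_vars = tuple(loop_vars)
    while cond(*loop_vars):
        loop_vars = tuple(body(*loop_vars))
    return loop_vars


word_ids = [3, 1, 0, 5]

# cond and body both accept (i, collected); body returns both, updated.
cond = lambda i, collected: i < len(word_ids)
body = lambda i, collected: (i + 1, collected + [word_ids[i]])

final_i, collected = while_loop(cond, body, (0, []))
print(final_i, collected)  # 4 [3, 1, 0, 5]
```

In the question, `cond` is declared as `lambda self, i`, so its second parameter would receive the accumulator rather than the counter, and `body` returns only one of the two loop variables.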