There is a cleaner way to do this with index_table_from_file and the Dataset API.
First, create your own tf.data.Dataset (I assume we have two sentences with some arbitrary labels):
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
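Second, create a vocab.txt file in which the line number of each word matches that word's index in the GloVe embedding. For example, if the first word in the GloVe vocabulary is "absent", then the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt contains the following words: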
first
is
test
this
second
sentence
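One way to keep the line numbers and the embedding rows aligned is to walk the GloVe text file once and write the words out in the order they appear. The sketch below assumes the usual GloVe text layout (a word followed by its space-separated vector components on each line). The helper name split_glove and the sample vectors are made up for illustration.

```python
import io

def split_glove(glove_lines, vocab_out):
    """Write one word per line to vocab_out; return the vectors in the same order."""
    vectors = []
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vocab_out.write(parts[0] + "\n")
        vectors.append([float(v) for v in parts[1:]])
    return vectors

# Tiny stand-in for a GloVe file (hypothetical 2-dimensional values).
sample = ["first 0.1 0.2", "is 0.3 0.4", "test 0.5 0.6"]
vocab_file = io.StringIO()  # replace with open("vocab.txt", "w") for a real file
vectors = split_glove(sample, vocab_file)
vocab_words = vocab_file.getvalue().splitlines()

print(vocab_words)
print(vectors[vocab_words.index("is")])
```

Because both outputs are produced in the same pass, line i of the vocabulary and row i of the vector list always describe the same word.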
Then, following the linked example, define a lookup table that converts each word to its id:
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
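To make the mapping concrete, here is a plain-Python sketch of what the two map calls compute for the vocab.txt above. It is not the TensorFlow implementation, only an imitation of its result for the single-bucket case: known words get their zero-based line number, and with num_oov_buckets=1 every unknown word gets the id equal to the vocabulary size. The function names are hypothetical.

```python
def build_index(vocab_words):
    """Map each word to its zero-based line number in vocab.txt."""
    return {word: i for i, word in enumerate(vocab_words)}

def lookup(sentence, index):
    """Split on whitespace and convert words to ids (single OOV bucket assumed)."""
    oov_id = len(index)  # the one OOV bucket sits right after the last known id
    return [index.get(word, oov_id) for word in sentence.split()]

vocab_words = ["first", "is", "test", "this", "second", "sentence"]
index = build_index(vocab_words)

print(lookup("this is first sentence", index))  # [3, 1, 0, 5]
print(lookup("this is unknown", index))         # [3, 1, 6]
```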
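Finally, based on this answer, convert each sentence into an embedding using tf.nn.embedding_lookup():
# embedding is assumed to be the pre-trained GloVe matrix loaded as a NumPy array of shape (vocab_size, dim)
glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)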
iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()
embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
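The lookup and the averaging can be pictured without TensorFlow: the lookup selects one row of the weight matrix per word id, and the mean collapses those rows into a single sentence vector. The sketch below uses made-up weights and hypothetical helper names. It also assumes the matrix has vocab_size + num_oov_buckets rows, since the OOV id would otherwise point past the last row.

```python
def embedding_lookup(weights, ids):
    """Select one row of `weights` per id."""
    return [weights[i] for i in ids]

def mean_over_words(rows):
    """Average the word vectors component-wise into one sentence vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

vocab_size, num_oov_buckets = 6, 1
# Hypothetical weights: row i is [i, 2*i]; the last row (id 6) serves OOV words.
weights = [[float(i), float(2 * i)] for i in range(vocab_size + num_oov_buckets)]

ids = [3, 1, 0, 5]  # 'this is first sentence' under the vocab.txt above
sentence_vec = mean_over_words(embedding_lookup(weights, ids))
print(sentence_vec)  # [2.25, 4.5]
```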
Full code in eager mode:
import tensorflow as tf
tf.enable_eager_execution()
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())
for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)