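Getting embedding lookup result from BERT

【Question title】: Getting embedding lookup result from BERT
【Posted】: 2020-08-17 13:28:18
【Question】:

Before passing my tokens through BERT, I would like to perform some processing on their embeddings (the output of the embedding lookup layer). The HuggingFace BERT TensorFlow implementation lets us access the output of the embedding lookup using: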

import tensorflow as tf
from transformers import BertConfig, BertTokenizer, TFBertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
attention_mask = tf.ones_like(input_ids)   # no padding, so attend to every token
token_type_ids = tf.zeros_like(input_ids)  # single sentence, so every token is in segment 0

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)

result = bert_model(inputs={'input_ids': input_ids, 
                            'attention_mask': attention_mask, 
                            'token_type_ids': token_type_ids})
inputs_embeds = result[-1][0]  # output of embedding lookup

Subsequently, inputs_embeds can be processed and then sent as input to the same model using:

inputs_embeds = process(inputs_embeds)  # some processing on inputs_embeds done here (dimensions kept the same)
result = bert_model(inputs={'inputs_embeds': inputs_embeds, 
                            'attention_mask': attention_mask, 
                            'token_type_ids': token_type_ids})
output = result[0]

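output now contains BERT's output for the modified input. However, this requires two full passes through BERT. Instead of running BERT all the way through just to perform the embedding lookup, I would like to get only the output of the embedding lookup layer. Is this possible, and if so, how?

【Comments】:

Tags: python tensorflow nlp huggingface-transformers bert-language-model

【Solution 1】: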

Treating the first output, result[-1][0], as the result of the embedding lookup is actually incorrect. The raw embedding lookup is given by:

embeddings = bert_model.bert.get_input_embeddings()
word_embeddings = embeddings.word_embeddings
inputs_embeds = tf.gather(word_embeddings, input_ids)

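whereas result[-1][0] gives the embedding lookup plus the position embeddings and token type embeddings. The code above does not require a full pass through BERT, and the result can be processed before being fed into the remaining layers of BERT.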
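
To make the distinction concrete, here is a minimal sketch using NumPy and toy, randomly initialised matrices instead of the real BERT weights. The sizes and the names word_emb, position_emb and segment_emb are made up for illustration; they are not attributes of the HuggingFace model. It only shows that the raw lookup is row selection, while the embedding layer's output also adds position and token type vectors (the real layer then applies LayerNorm and dropout, omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the real model is far larger).
vocab_size, max_positions, num_segments, hidden = 10, 6, 2, 4

word_emb = rng.normal(size=(vocab_size, hidden))         # word embedding matrix
position_emb = rng.normal(size=(max_positions, hidden))  # position embedding matrix
segment_emb = rng.normal(size=(num_segments, hidden))    # token type embedding matrix

input_ids = np.array([[1, 7, 3, 2]])                     # shape (batch=1, seq_len=4)
token_type_ids = np.zeros_like(input_ids)                # single sentence: segment 0
position_ids = np.arange(input_ids.shape[1])[None, :]    # positions 0..seq_len-1

# Raw embedding lookup: pick one row per token id (the role tf.gather plays above).
inputs_embeds = word_emb[input_ids]                      # shape (1, 4, hidden)

# What the embedding layer sums before LayerNorm and dropout (both omitted here).
summed = inputs_embeds + position_emb[position_ids] + segment_emb[token_type_ids]

print(inputs_embeds.shape, summed.shape)
```

Any processing that is meant to act on the word vectors alone would be applied to inputs_embeds, before the position and token type terms are added.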

EDIT: To get the result of adding the position and token type embeddings to an arbitrary inputs_embeds, one can use:

full_embeddings = embeddings(inputs=[None, None, token_type_ids, inputs_embeds])

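Here, the call method of the embeddings object accepts a list, which is passed on to the _embeddings method. The first value is input_ids, the second is position_ids, the third is token_type_ids, and the fourth is inputs_embeds. (See here for details.) If you have multiple sentences in one input, you may need to set position_ids.

【Comments】: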
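
The answer does not say what position_ids should look like for multi-sentence input. As one possible reading, the sketch below builds position ids that restart at 0 after every separator token and prints them next to the default 0..seq_len-1 positions. The helper restart_positions is hypothetical and not part of transformers, the separator id 102 is assumed to be [SEP] in bert-base-uncased, and the other token ids are arbitrary. Whether restarting positions is appropriate depends on your setup.

```python
import numpy as np

def restart_positions(input_ids, sep_id):
    """Hypothetical helper: position ids that restart at 0 after each separator."""
    input_ids = np.asarray(input_ids)
    position_ids = np.zeros_like(input_ids)
    for row, ids in enumerate(input_ids):
        pos = 0
        for col, tok in enumerate(ids):
            position_ids[row, col] = pos
            # The separator keeps its own position; the next token starts over.
            pos = 0 if tok == sep_id else pos + 1
    return position_ids

SEP = 102                                   # assumed [SEP] id
ids = [[101, 7592, 102, 2026, 3899, 102]]   # two short segments, arbitrary token ids

default_positions = np.arange(len(ids[0]))[None, :]
restarted = restart_positions(ids, SEP)

print(default_positions)
print(restarted)
```

The resulting array has the same shape as input_ids, so it could take the place of the second None in the list passed to embeddings.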