Unexpected result from transformer model prediction: Answers

【Question title】: Unexpected result from transformer model prediction
【Posted】: 2021-12-31 10:47:47
【Question】:

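I am using the huggingface transformers library for a masked language modeling task. I expected the prediction to return the same input string, with the masked token filled in: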

from transformers import BertConfig, BertTokenizer, BertForMaskedLM

model1 = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer1 = BertTokenizer.from_pretrained("bert-base-uncased")

# Read the rest of this [MASK] to understand things in more detail
text = ["Read the rest of this [MASK] to understand things in more detail"]
encoding1 = tokenizer1(text, return_tensors="pt")

# forward pass
outputs1 = model1(**encoding1)
outputs1.logits.argmax(-1)

The output is:

tensor([[1012, 3191, 1996, 2717, 1997, 2023, 2338, 2000, 3305, 2477, 1999, 2062,
         1012, 1012]])

But when I decode the output, the last input token, detail, is missing:

tokenizer1.convert_ids_to_tokens([1012, 3191, 1996, 2717, 1997, 2023, 2338, 2000, 3305, 2477, 1999, 2062, 1012, 1012])


['.',
 'read',
 'the',
 'rest',
 'of',
 'this',
 'book',
 'to',
 'understand',
 'things',
 'in',
 'more',
 '.',
 '.']

Am I perhaps using the model incorrectly? Or is there some other reason?

【Question comments】:

Tags: python pandas huggingface-transformers

【Solution 1】:

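I would put this down simply to an error by the neural network. BERT is very good at predicting most samples, but individual tokens can still be predicted incorrectly.
Also note that BERT cannot add or remove tokens: the output always has the same length as the input (counted in WordPiece subword tokens).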

Interestingly, I tried adding a period to the end of your example sentence, i.e.

# Adding period
text = ["Read the rest of this [MASK] to understand things in more detail."]

Now the output is "more correct":

'. read the rest of this book to understand things in more detail..'

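So BERT seems to have learned that a sentence must always end with a period, and that overrides the correct prediction of detail. Note that the [CLS] and [SEP] tokens (at the beginning and end) are also turned into periods, but that probably has a similar explanation.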
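If the goal is just to get the input sentence back with the blank filled in, one option is to ignore the predictions at unmasked positions and take the model's output only where the input was [MASK]. As far as I understand, the masked-LM objective only scores the positions selected for masking during training, so the other positions are not guaranteed to reproduce the input. Below is a minimal sketch of that idea. The names `fill_masks` and `demo` are made up for illustration and are not part of transformers; `demo` assumes torch and transformers are installed and that bert-base-uncased can be downloaded.

```python
def fill_masks(input_ids, predicted_ids, mask_token_id):
    """Keep every input token; use the model's prediction only at [MASK] positions.

    Hypothetical helper for illustration, not a transformers API.
    """
    return [
        pred if tok == mask_token_id else tok
        for tok, pred in zip(input_ids, predicted_ids)
    ]


def demo():
    # Imported here so fill_masks can be used without torch/transformers installed.
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()  # disable dropout for inference

    text = "Read the rest of this [MASK] to understand things in more detail"
    encoding = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (batch, seq_len, vocab_size)

    input_ids = encoding["input_ids"][0].tolist()
    predicted_ids = logits[0].argmax(-1).tolist()

    # Original tokens everywhere except the masked slot.
    filled = fill_masks(input_ids, predicted_ids, tokenizer.mask_token_id)
    print(tokenizer.decode(filled, skip_special_tokens=True))
```

Calling `demo()` runs the sentence from the question through the model. Since the unmasked tokens are copied from the input, detail and the special tokens are no longer replaced by periods; only the [MASK] slot comes from the model.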

【Comments】: