Fine-tuning a causal language model using transformers and PyTorch: answers

[Question title]: fine tune causal language model using transformers and pytorch
[Posted]: 2020-12-15 18:00:58
[Question description]:

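I have some questions about fine-tuning a causal language model using transformers and PyTorch.

My main goal is to fine-tune XLNet. However, most of the posts I found online target text classification, such as this post. Is there a way to fine-tune the model without using run_language_model.py from the transformers GitHub repository?

Here is a piece of code from my attempt to fine-tune XLNet: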

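import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased", do_lower_case=True)
LOSS = torch.nn.CrossEntropyLoss()
batch_texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
# Calling the tokenizer directly encodes the whole batch;
# padding=True pads every text to the longest one so a tensor can be returned.
encodings = tokenizer(batch_texts, add_special_tokens=True, padding=True,
                      return_tensors="pt", return_attention_mask=True)
outputs = model(encodings["input_ids"], attention_mask=encodings["attention_mask"])
# target_ids is not defined yet: how to build it is the question below.
loss = LOSS(outputs[0], target_ids)
loss.backward()
# ignoring the rest of codes...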

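I am stuck at the last two lines. First, with this LM model I don't seem to have any labels, as supervised learning usually does; second, since a language model is trained by minimizing a loss (cross-entropy here), I need some target_ids to compute the loss and perplexity against the input_ids.

Here are my follow-up questions:

1) How should I deal with the labels during model fitting?
2) Should I set something like target_ids=encodings["input_ids"].copy() to compute the cross-entropy loss and perplexity? If not, how should I set up this target_ids?
3) Looking at the perplexity page of the transformers documentation, how should I adapt its method to input texts of non-fixed length?
4) I saw another post from the documentation saying that causal language modeling requires padding the text. However, the link in 3) shows no sign of padding. Which one should I follow?

Any suggestions and comments would be appreciated!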

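[Question discussion]:

Please first define what you want to achieve. Fine-tuning is a meaningless term when you have no goal, because you fine-tune a model in a certain direction. In other words, what kind of output do you expect from the fine-tuned model when you give it a specific input?
@cronoik My goal is to fine-tune the model to minimize the perplexity of the input text.
What does that mean? Do you want to classify or summarize your text? Please give me an example (add it directly to your question).
@cronoik I am trying to fine-tune a causal language model. Technically, the target for this task is the input_ids from the tokenizer, not a binary label like 0 or 1.
If you are seeking my help, it would be great if you could simply answer my question. I did not ask you anything technical; I asked you to clarify the overall goal. I can read your code myself. So please imagine for a moment that your model works as expected: what would the model output be for an example input? (Please add the example input and expected output directly to your question.)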

Tags: python-3.x pytorch huggingface-transformers language-model

[Solution 1]:

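When fine-tuning a model with a language-modeling head, the labels are simply the next tokens (you predict the next word). Hugging Face's library hides most of the complexity of the process inside its methods, which makes many things very easy and is great when you want to do something standard. But if you want to do something special, or if you want to learn and understand the details, I suggest implementing the training loop directly in PyTorch; writing low-level code is the best way to learn.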
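
To make "the labels are the next tokens" concrete, here is a minimal sketch in plain PyTorch. Random logits stand in for a model's output, so `vocab_size`, `input_ids` and `logits` are illustrative assumptions rather than anything taken from the question; the only point is the one-position shift between predictions and targets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 50                                  # hypothetical vocabulary size
input_ids = torch.tensor([[5, 17, 3, 42, 8]])    # one sequence of 5 made-up token ids

# Stand-in for model(input_ids).logits: shape (batch, seq_len, vocab_size).
logits = torch.randn(1, input_ids.size(1), vocab_size)

# Position t predicts token t+1, so drop the last prediction
# and the first target before comparing them.
shift_logits = logits[:, :-1, :]     # (1, 4, vocab_size)
shift_labels = input_ids[:, 1:]      # (1, 4)

# Mean cross-entropy over the 4 predicted positions.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
# Perplexity is the exponential of the mean token-level cross-entropy.
perplexity = torch.exp(loss)
```

GPT-2-style heads in transformers apply this shift internally when `labels` is passed, which is why the labels can simply be a copy of `input_ids` there. XLNet's permutation objective is set up differently, so it is worth checking its documentation before relying on the same shortcut.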

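For this case, here is a draft to get started; the training loop is far from complete, but it has to be adapted to each specific case anyway, so I hope these few lines help you get going...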

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# our input:
s = tokenizer.encode('In winter, the weather is', return_tensors='pt')
# we want to fine-tune to force a fake output as follows:
ss = tokenizer.encode('warm and hot', return_tensors='pt')
# forward pass:
outputs = model(s)
# check that the output logits are given for every input token:
print(outputs.logits.size())
# we're gonna train on the token that follows the last input one
# so we extract just the last logit:
lasty = outputs.logits[0, -1].view(1, -1)
# prepare backprop (torch.optim.AdamW replaces the deprecated transformers.AdamW):
lossfct = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# just take the first next token (you should repeat this for the next ones)
labels = ss[0][0].view(1)
loss = lossfct(lasty, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# one fine-tuning step done: you may check that the answer is already different:
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
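
The draft above trains on a single next token of a single sequence. As a hedged sketch of how the same idea might extend to a padded batch of texts with different lengths, the snippet below scores every position at once and skips padding by setting those labels to -100, the default `ignore_index` of PyTorch's cross-entropy. Random logits again replace the model, and `vocab_size`, `pad_id` and the token ids are made-up assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 50   # hypothetical vocabulary size
pad_id = 0        # hypothetical padding token id

# Two sequences of different length, right-padded to a common length.
input_ids = torch.tensor([
    [5, 17, 3, 42, 8],
    [9, 21, 4, pad_id, pad_id],
])
# In practice the tokenizer returns this mask; it is rebuilt here for the sketch.
attention_mask = (input_ids != pad_id).long()

# Labels are a copy of the inputs, with padding replaced by -100
# so that cross_entropy ignores those positions.
labels = input_ids.clone()
labels[attention_mask == 0] = -100

# Stand-in for model(input_ids, attention_mask=attention_mask).logits.
logits = torch.randn(2, input_ids.size(1), vocab_size, requires_grad=True)

# Same one-position shift as for a single sequence.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

# Mean loss over the non-padding targets only.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
    ignore_index=-100,
)
loss.backward()

perplexity = torch.exp(loss.detach())
n_scored = int((shift_labels != -100).sum())   # number of targets that contributed
```

Because the loss is averaged only over real tokens, the resulting perplexity does not depend on how much padding the batch happens to contain.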

[Discussion]: