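【Question title】: BERT for next sentence prediction
【Posted】: 2021-06-30 10:40:05
【Question】:

I am trying to fine-tune a BERT model for next sentence prediction on my own dataset, but it is not working. Can anyone tell me how my dataset should be structured, and how to fine-tune with the Hugging Face Trainer()?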

def train(bert_model,bert_tokenizer,path,eval_path=None):
    out_dir = "/content/drive/My Drive/next_sentence/"

    training_args = TrainingArguments(output_dir=out_dir,
                                      overwrite_output_dir=True,
                                      num_train_epochs=1,
                                      per_device_train_batch_size=30,
                                      save_steps=100,
                                      save_total_limit=5,
                                      )

    data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
    
      
    trainer = Trainer(
      model=bert_model,
      args=training_args,
      data_collator=data_collator,
      train_dataset="c:/data.txt",
      tokenizer=BertTokenizer)
    
    trainer.train()
    trainer.save_model(out_dir)

import transformers

from torch.nn.functional import softmax

from transformers import BertTokenizer, BertTokenizerFast, BertForNextSentencePrediction,TextDatasetForNextSentencePrediction
import torch

from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForLanguageModeling

def main():
  bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
  bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
  train_data_set_path = "c:/data.txt"
  train(bert_model,BertTokenizer,train_data_set_path)
  #prepare_data_set(bert_tokenizer)
main()

【Comments】:

Tags: prediction next bert-language-model sentence


【Answer 1】:

According to the Hugging Face source code, the input dataset needs the following structure:

Input file format:

    # (1) One sentence per line. These should ideally be actual sentences, not
    # entire paragraphs or arbitrary spans of text. (Because we use the
    # sentence boundaries for the "next sentence prediction" task).
    # (2) Blank lines between documents. Document boundaries are needed so
    # that the "next sentence prediction" task doesn't span between documents.
    #
    # Example:
    # I am very happy.
    # Here is the second sentence.
    #
    # A new document.
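
As a minimal sketch of producing a file in this layout, assuming your corpus is already split into documents and sentences: the helper names `write_nsp_file` and `read_nsp_file`, the sample sentences, and the temporary output path are all illustrative and not part of transformers.

```python
from pathlib import Path
import tempfile


def write_nsp_file(documents, path):
    """Write documents (lists of sentences) one sentence per line,
    with a single blank line between documents."""
    blocks = []
    for sentences in documents:
        # Drop empty sentences so a stray blank line never splits a document.
        cleaned = [s.strip() for s in sentences if s.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    Path(path).write_text("\n\n".join(blocks) + "\n", encoding="utf-8")


def read_nsp_file(path):
    """Parse the file back into documents, splitting on blank lines."""
    documents, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


# Hypothetical sample corpus: two documents, mirroring the example above.
docs = [
    ["I am very happy.", "Here is the second sentence."],
    ["A new document."],
]
out_path = Path(tempfile.gettempdir()) / "nsp_data_example.txt"
write_nsp_file(docs, out_path)
round_trip = read_nsp_file(out_path)
```

Reading the file back is only a sanity check that the blank lines fall where the document boundaries are; the resulting path is what you would pass as the dataset's `file_path`.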

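【Comments】:

    【Answer 2】:

    You should create a TextDatasetForNextSentencePrediction and pass it to the Trainer, rather than passing the dataset path.

    So create the TextDatasetForNextSentencePrediction dataset inside your training function, as shown below.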

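    from transformers import (
        BertForNextSentencePrediction,
        BertTokenizer,
        DataCollatorForLanguageModeling,
        TextDatasetForNextSentencePrediction,
        Trainer,
        TrainingArguments,
    )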
    
    def train(bert_model, bert_tokenizer, path, eval_path=None):
        out_dir = "/content/drive/My Drive/next_sentence/"
    
        training_args = TrainingArguments(output_dir=out_dir,
                                          overwrite_output_dir=True,
                                          num_train_epochs=1,
                                          per_device_train_batch_size=30,
                                          save_steps=100,
                                          save_total_limit=5,
                                          )
    
        data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
        
    
        train_dataset = TextDatasetForNextSentencePrediction(
            tokenizer = bert_tokenizer,
            file_path = path,
            block_size = 256
        )
          
        trainer = Trainer(
          model=bert_model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
          tokenizer=bert_tokenizer)  # pass the loaded tokenizer instance, not the BertTokenizer class
        
        trainer.train()
        trainer.save_model(out_dir)
    

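    You should also pass bert_tokenizer rather than BertTokenizer. The Trainer and the dataset need a loaded, pretrained tokenizer instance, not the tokenizer class itself.

    So your main function should look like this: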

    
    def main():
      bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
      bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
      train_data_set_path = "c:/data.txt"
      train(bert_model, bert_tokenizer, train_data_set_path)
    
    main()
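
For intuition about what a next-sentence-prediction dataset contains, here is a simplified pure-Python sketch of how sentence pairs and labels can be sampled from documents. It is an illustration only, not the library's implementation (the real dataset class works on tokenized spans limited by `block_size`); the function name `make_nsp_pairs` and the sample documents are hypothetical. The label convention follows BertForNextSentencePrediction: 0 means the second sentence is the true continuation, 1 means it is a random sentence.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """Return (sentence_a, sentence_b, label) triples.

    label 0: sentence_b really follows sentence_a in the same document.
    label 1: sentence_b was drawn at random from a different document.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_index, sentences in enumerate(documents):
        # Candidate sources for a "random next" sentence: every other document.
        other_docs = [d for i, d in enumerate(documents) if i != doc_index and d]
        for a, true_next in zip(sentences, sentences[1:]):
            if other_docs and rng.random() < 0.5:
                b = rng.choice(rng.choice(other_docs))
                pairs.append((a, b, 1))
            else:
                pairs.append((a, true_next, 0))
    return pairs


# Hypothetical documents; each inner list is one document's sentences.
documents = [["A1", "A2", "A3"], ["B1", "B2"]]
pairs = make_nsp_pairs(documents)
```

Because pairs are only ever built from adjacent sentences within one document, and negatives only ever come from a different document, this shows why the input file needs blank lines marking document boundaries.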
    
    

    【Comments】:
