Running a stacked PyTorch model on Colab TPU

【Question title】: Run PyTorch stacked model on Colab TPU
【Posted】: 2021-03-14 23:41:15
【Question】:

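I'm trying to run my model on a multi-core Colab TPU, but I really don't know how to do it. I tried this tutorial notebook, but I got some errors that I couldn't fix, and I suspect there may be a simpler way.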

About my model:

class BERTModel(nn.Module):
    def __init__(self,...):
        super().__init__()
        if ...:
            self.bert_model = XLMRobertaModel.from_pretrained(...)   # huggingface XLM-R
        elif ...:
            self.bert_model = others_model.from_pretrained(...)   # some other huggingface model
        
        ... # some other model's parameters
        
    def forward(self,...):
        bert_input = ...
        output = self.bert_model(bert_input)
        
        ... # some function that processes the output
        
    def other_function(self,...):
        # just post-processes the output, e.g. concatenates the layers' embeddings and returns ...
        
class MAINModel(nn.Module):
    def __init__(self,...):
        super().__init__()
        
        print('Using model 1')
        self.bert_model_1 = BERTModel(...)
        
        print('Using model 2')
        self.bert_model_2 = BERTModel(...)
        
        self.linear = nn.Linear(...)
        
    def forward(self,...):
        bert_input = ...
        bert_output = self.bert_model(bert_input)
        linear_output = self.linear(bert_output)
   
        return linear_output

Can you tell me how to run a model like mine on a Colab TPU? I'm using Colab Pro, so RAM should not be a major issue. Thank you very much.

【Question comments】:

Sharing the error message you received is always helpful. Please add the full stack trace to your question.

Tags: pytorch google-colaboratory huggingface-transformers tpu google-cloud-tpu

【Solution 1】:

I would work from the examples here: https://github.com/pytorch/xla/tree/master/contrib/colab

Maybe start with a simpler model, such as: https://github.com/pytorch/xla/blob/master/contrib/colab/mnist-training.ipynb

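The pseudocode you shared contains no reference to the torch_xla library, which is required to use PyTorch on TPUs. I'd recommend starting from one of the working Colab notebooks in the directory I linked and then swapping in parts of your own model. If you want to run the model on TPU, you will need to modify a few places (usually about 3-4) in the overall training code, compared with a model that runs on GPU with native PyTorch. See here for a description of some of those changes. The other big change is wrapping the default data loader with ParallelLoader, as shown in the sample MNIST Colab I linked.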

If you see any specific errors in one of the Colabs, feel free to open an issue: https://github.com/pytorch/xla/issues

【Comments】: