【Posted】: 2017-07-24 14:44:36
【Question】:
When I train on a single GPU, training works with mini-batches (the default setup).
if USE_CUDA:
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()
However, when I train on all available GPUs, I get an error.
if USE_CUDA:
    encoderchar = torch.nn.DataParallel(encoderchar, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoder = torch.nn.DataParallel(encoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    decoder = torch.nn.DataParallel(decoder, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    encoderchar = encoderchar.cuda()
    encoder = encoder.cuda()
    decoder = decoder.cuda()
The following error is raised during the forward pass.
RuntimeError Traceback (most recent call last)
<ipython-input-10-227f3e86847c> in <module>()
18 loss, ar1, ar2 = train(data_input_batch_index, data_input_batch_length, data_target_batch_index, data_target_batch_length,
19 encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer,
---> 20 criterion, batch_size)
21
22 # Keep track of loss
<ipython-input-8-21861d792653> in train(input_batch, input_batch_length, target_batch, target_batch_length, encoderchar, encoder, decoder, encoderchar_optimizer, encoder_optimizer, decoder_optimizer, criterion, batch_size)
21 #reshaped_input_length = Variable(torch.LongTensor(reshaped_input_length)).cuda()
22 hidden_all, output = encoderchar(w, reshaped_input_length)
---> 23 encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)
24
25 temporary_target_batch_length = [15] * batch_size
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/variable.py in __setitem__(self, key, value)
78 else:
79 if isinstance(value, Variable):
---> 80 return SetItem(key)(self, value)
81 else:
82 return SetItem(key, value)(self)
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py in forward(self, i, value)
37 else: # value is Tensor
38 self.value_size = value.size()
---> 39 i._set_index(self.index, value)
40 return i
41
RuntimeError: sizes do not match at /py/conda-bld/pytorch_1493681908901/work/torch/lib/THC/THCTensorCopy.cu:31
The arguments passed to the encoderchar forward call are a CUDA LongTensor and a list.
hidden_all, output = encoderchar(w, reshaped_input_length)
encoder_input[ix] = output.transpose(0,1).contiguous().view(batch_size, -1)
After the error is thrown, nvidia-smi shows the following.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     18320    C   python                                         453MiB |
|    1     18320    C   python                                         266MiB |
|    2     18320    C   python                                         266MiB |
|    3     18320    C   python                                         266MiB |
|    4     18320    C   python                                         266MiB |
|    5     18320    C   python                                         266MiB |
|    6     18320    C   python                                         266MiB |
|    7     18320    C   python                                         262MiB |
+-----------------------------------------------------------------------------+
What is going wrong here?
【Comments】:
- What are the sizes/dimensions of hidden_all, output, and encoder_input? And what is the value of batch_size?
- Here are the sizes: hidden_all - torch.Size([15, 128, 500]), output - torch.Size([1, 128, 500]), encoder_input - torch.Size([15, 128, 500]). **This code runs fine in a single-GPU environment.**
- batch_size is 128
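
One way to reason about the size mismatch is to trace the shapes by hand. The sketch below is plain Python (no PyTorch needed) and rests on two assumptions that the post does not confirm: that w is time-major with shape (15, 128), and that every replica returns an output shaped like the single-GPU one, (1, 128, 500). It relies on the documented default of torch.nn.DataParallel, which scatters tensor inputs and gathers outputs along dim 0, splitting the way torch.chunk does. The helper names are hypothetical and exist only for this illustration.

```python
import math


def chunk_sizes(length, num_chunks):
    """Chunk lengths when a dimension of `length` is split torch.chunk-style."""
    step = math.ceil(length / num_chunks)
    return [min(step, length - start) for start in range(0, length, step)]


def gather_dim0(shapes):
    """Shape after concatenating per-replica outputs along dim 0."""
    return (sum(s[0] for s in shapes),) + tuple(shapes[0][1:])


def transpose01_then_view(shape, batch_size):
    """Shape after .transpose(0, 1).contiguous().view(batch_size, -1)."""
    numel = 1
    for d in shape:
        numel *= d
    return (batch_size, numel // batch_size)


NUM_GPUS = 8
BATCH_SIZE = 128
HIDDEN = 500
SEQ_LEN = 15  # ASSUMPTION: w has shape (SEQ_LEN, BATCH_SIZE), i.e. time-major

# How dim 0 of w would be divided among the eight devices.
splits = chunk_sizes(SEQ_LEN, NUM_GPUS)

# ASSUMPTION: each replica returns the single-GPU output shape.
replica_outputs = [(1, BATCH_SIZE, HIDDEN) for _ in splits]

gathered = gather_dim0(replica_outputs)
flattened = transpose01_then_view(gathered, BATCH_SIZE)
slot = (BATCH_SIZE, HIDDEN)  # shape of encoder_input[ix]

print("per-GPU chunk lengths:", splits)
print("gathered output shape:", gathered)
print("after transpose/view: ", flattened)
print("target slot shape:    ", slot)
print("shapes match:         ", flattened == slot)
```

Under these assumptions the gathered output would flatten to (128, 4000), which cannot be written into a (128, 500) slot, and the last GPU would receive a shorter chunk than the others. If w is actually batch-first, this trace does not apply. torch.nn.DataParallel accepts a dim argument that selects the scatter/gather dimension, so it may be worth checking that value against the layout of w.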