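Fail to train large data in pytorch: answers

[Question title]: Fail to train large data in pytorch
[Posted]: 2018-09-22 03:55:07
[Question description]:

I am trying to build two fully connected layers in PyTorch that map features such as [x1,x2,...,xn] to multiple targets [y1,y2,y3,y4,y5]. My code is below: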

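import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

class FullConnect(nn.Module):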
    def __init__(self):
        super(FullConnect, self).__init__()        
        self.fc = nn.Sequential(
            nn.Linear(195, 100),
            nn.Linear(100, 5)
        )
    def forward(self, x):
        out = self.fc(x)
        return out


class LossFunc(nn.Module):
    def __init__(self):
        super(LossFunc, self).__init__() 
    def forward(self,x,y):
        loss=torch.div(torch.sum(torch.pow(torch.log(torch.div(x+1,y+1)),2)),5)
        return loss

small_data=np.random.randn(100, 200)
small_data[small_data<0]=0
model = FullConnect()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.3)
criterion = LossFunc()
for epoch in range(5):
    acc=0
    for i in range(small_data.shape[0]):
        x = Variable(torch.FloatTensor(small_data[i][5:]))
        y = Variable(torch.FloatTensor(small_data[i][:5]))
        output=model(x)
        loss=criterion(output,y)
        optimizer.zero_grad()
        loss.backward()  
        optimizer.step()
        acc+=loss
    print("epoch:",epoch)
    print("Loss:",acc)

This code runs fine when I feed it a small training set, and returns:

epoch: 0
Loss: Variable containing:
 15.7719
[torch.FloatTensor of size 1]

epoch: 1
Loss: Variable containing:
 12.0258
[torch.FloatTensor of size 1]

epoch: 2
Loss: Variable containing:
 9.9758
[torch.FloatTensor of size 1]

epoch: 3
Loss: Variable containing:
 8.5442
[torch.FloatTensor of size 1]

epoch: 4
Loss: Variable containing:
 7.4562
[torch.FloatTensor of size 1]

But when I replace small_data with a large training set:

large_data=np.random.randn(60000, 200)
large_data[large_data<0]=0

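Jupyter notebook returns an error: "The kernel appears to have died. It will restart automatically." I guess this error is related to the size of the input.

My CUDA 9.1 installation is available, but cuDNN is not accepted by torch. I am now looking for ways to improve my code and make this training process work. I would appreciate any suggestion that can help.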

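[Question discussion]:

Tags: python-3.x neural-network deep-learning torch pytorch

[Solution 1]:

The problem may be that you are not defining your input data as volatile (see this documentation). I suggest changing the lines that define x and y to: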

x = Variable(torch.FloatTensor(small_data[i][5:]), volatile=True)
y = Variable(torch.FloatTensor(small_data[i][:5]), volatile=True)

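This makes the forward computation more memory-efficient. Note that volatile is an inference-only flag: it turns off gradient tracking, so it cannot be combined with loss.backward() in a training loop, and PyTorch 0.4 removed it in favor of torch.no_grad().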
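
For readers on PyTorch 0.4 or later, where the volatile flag no longer has any effect, the inference-only behavior is expressed with torch.no_grad(). The following is a minimal sketch, assuming a stand-in model with the question's layer sizes and a random input (both chosen here for illustration). It shows that outputs computed under no_grad carry no gradient history, which saves memory but also means such a forward pass cannot feed loss.backward().

```python
import torch
import torch.nn as nn

# Stand-in for the question's FullConnect model (same layer sizes).
model = nn.Sequential(nn.Linear(195, 100), nn.Linear(100, 5))
x = torch.randn(195)  # illustrative random input

# Inference only: no autograd graph is recorded inside this block.
with torch.no_grad():
    out_eval = model(x)

# Training-style forward pass: the graph is recorded so backward() can run.
out_train = model(x)

print(out_eval.requires_grad)   # False
print(out_train.requires_grad)  # True
```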

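One way to improve your code is to use mini-batch stochastic gradient descent instead of feeding one example at a time; put simply, split your dataset into batches and feed those to the model.

After running the forward pass on each batch, you should call backward() on the loss variable and step() on the optimizer.

Have a look at this example, which uses PyTorch's built-in DataLoader class to do the batching for you.

You can see what I described in its training loop: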

for batch_idx, (data, target) in enumerate(train_loader):
    if args.cuda:
        data, target = data.cuda(), target.cuda()
    data, target = Variable(data), Variable(target)
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
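
Applied to the data in the question, the batching described above might look like the sketch below. It assumes PyTorch 0.4 or later (plain tensors instead of Variable, and loss.item()). The batch size of 64, the two epochs, the helper name log_ratio_loss, and the clamp that keeps the logarithm's argument positive are all choices made here for illustration and are not part of the original post. The running total is kept as a Python float; accumulating the loss tensor itself, as in acc+=loss, keeps every iteration's graph alive and is a likely contributor to memory growth on a large dataset.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
np.random.seed(0)

# Same synthetic data as in the question: 5 targets followed by 195 features.
large_data = np.random.randn(60000, 200).astype(np.float32)
large_data[large_data < 0] = 0
targets = torch.from_numpy(large_data[:, :5])
features = torch.from_numpy(large_data[:, 5:])

# batch_size=64 is an arbitrary illustrative choice.
train_loader = DataLoader(TensorDataset(features, targets),
                          batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(195, 100), nn.Linear(100, 5))
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.3)


def log_ratio_loss(output, target):
    """The question's loss per sample, averaged over the batch.

    The clamp is an added safeguard (not in the original) so that
    log() never receives a non-positive argument.
    """
    ratio = (output + 1).clamp(min=1e-3) / (target + 1)
    return torch.log(ratio).pow(2).sum(dim=1).div(5).mean()


epoch_losses = []
for epoch in range(2):
    running = 0.0
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = log_ratio_loss(output, target)
        loss.backward()
        optimizer.step()
        # .item() returns a plain float, so no autograd graph is retained.
        running += loss.item()
    epoch_losses.append(running / len(train_loader))
    print("epoch:", epoch, "mean batch loss:", epoch_losses[-1])
```

With a GPU available, the model and each batch could be moved with .cuda() as in the loop quoted above; the sketch stays on the CPU so that it runs anywhere.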

[Discussion]: