How does weight update work in PyTorch's dynamic computation graph?

[Question title]: How does weight update work in the dynamic computation graph of PyTorch?
[Posted]: 2019-06-12 12:32:08
[Question description]:

How does weight update work in PyTorch code with a dynamic computation graph when weights are shared (= reused multiple times)?

https://pytorch.org/tutorials/beginner/examples_nn/dynamic_net.html#sphx-glr-beginner-examples-nn-dynamic-net-py

import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred

I want to know what happens to the weights of middle_linear, which are used multiple times within each step.

[Question comments]:

Tags: deep-learning pytorch computation-graph

[Solution 1]:

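When you call backward (either as a function or as a method on a tensor), the gradient of the tensor you called backward on is computed with respect to every operand that has requires_grad == True. These gradients are accumulated in the .grad attributes of those operands. If the same operand A appears multiple times in the expression, you can conceptually treat the occurrences as separate entities A1, A2, ... for the backpropagation algorithm and sum their gradients at the end, so that A.grad = A1.grad + A2.grad + ....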
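The summation rule can be checked on a toy expression. The following is a minimal sketch, not the tutorial's code: the scalar tensors a, a1, a2 and x are made up for illustration. It uses one tensor twice, then repeats the computation with two separate copies and compares the gradients.

```python
import torch

x = torch.tensor(3.0)

# One shared operand used twice: y = a * (a * x) = a^2 * x, so dy/da = 2 * a * x.
a = torch.tensor(2.0, requires_grad=True)
y = a * (a * x)
y.backward()
shared_grad = a.grad.item()

# The same computation with two independent copies a1 and a2 of the operand.
a1 = torch.tensor(2.0, requires_grad=True)
a2 = torch.tensor(2.0, requires_grad=True)
y_split = a1 * (a2 * x)
y_split.backward()
summed_grad = (a1.grad + a2.grad).item()

print(shared_grad, summed_grad)  # both are 12.0
```

Each copy receives a gradient of 6.0, and the shared tensor receives their sum, which matches the analytic value 2 * a * x = 12.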

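Now, strictly speaking, the answer to your question

I want to know what happens to the middle_linear weight at each backward

is: nothing. backward does not change the weights; it only computes gradients. To change the weights you have to perform an optimization step, for example with one of the optimizers in torch.optim. The weights are then updated according to their .grad attribute, so if an operand was used multiple times, it is updated according to the sum of the gradients from each of its uses.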
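A small sketch of that separation, under assumptions of my own: a single torch.nn.Linear(4, 4) stands in for middle_linear, it is applied twice, and the optimizer is plain torch.optim.SGD with lr=0.1 and no momentum or weight decay. The loss is arbitrary and chosen only to produce gradients.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(4, 4)  # hypothetical stand-in for middle_linear
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(2, 4)
h = layer(layer(x))            # the same weights are used twice
loss = h.pow(2).sum()          # arbitrary scalar loss for illustration

before = layer.weight.detach().clone()

optimizer.zero_grad()
loss.backward()
# backward only fills .grad; the weights themselves are untouched.
unchanged_after_backward = torch.equal(layer.weight.detach(), before)

grad = layer.weight.grad.clone()  # already the sum over both uses
optimizer.step()
# Plain SGD applies: weight = weight - lr * weight.grad
matches_sgd_rule = torch.allclose(layer.weight.detach(), before - 0.1 * grad)

print(unchanged_after_backward, matches_sgd_rule)
```

The single .grad tensor that the optimizer reads already contains the contributions of both applications of the layer, so one step accounts for every use.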

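In other words, if your matrix element x has a positive gradient from its first use and a negative gradient from its second use, the net effects may cancel out and x will stay as it is (or change only slightly). If both uses call for x to be higher, it will increase more than if it had been used just once, and so on.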
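Both cases can be reproduced with a scalar. This is an illustrative sketch with made-up coefficients: w is used twice in each expression, first with opposing gradients and then with gradients of the same sign.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Opposing uses: d/dw of (3 * w) is +3 and d/dw of (-3 * w) is -3.
(3.0 * w + (-3.0) * w).backward()
cancelled = w.grad.item()

# Reset the accumulated gradient before the second experiment.
w.grad = None

# Agreeing uses: +3 from the first use and +2 from the second.
(3.0 * w + 2.0 * w).backward()
reinforced = w.grad.item()

print(cancelled, reinforced)  # 0.0 and 5.0
```

With a zero accumulated gradient an SGD step would leave w unchanged, while in the second case the step is larger than either use alone would produce.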

[Comments]:

Thanks! A.grad = A1.grad + A2.grad + ... is the part I wanted to know. Thank you for your help.