【Question title】: How to initialize weights in PyTorch?
【Posted】: 2018-09-01 04:51:21
【Question】:

How do I initialize the weights and biases of a network in PyTorch (for example, with He or Xavier initialization)?

【Comments】:

PyTorch often initializes the weights automatically.
If I asked this question, I would get 100 downvotes for not showing enough research effort and code.

Tags: python machine-learning deep-learning neural-network pytorch

【Answer 1】:

Single layer

To initialize the weights of a single layer, use a function from torch.nn.init. For example:

conv1 = torch.nn.Conv2d(...)
torch.nn.init.xavier_uniform_(conv1.weight)

Alternatively, you can modify the parameters by writing to conv1.weight.data (which is a torch.Tensor). Example:

conv1.weight.data.fill_(0.01)

The same applies to biases:

conv1.bias.data.fill_(0.01)
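
As a minimal sketch of the single-layer case, the snippet below initializes one convolution's weight with Xavier uniform and its bias with a constant, using only `torch.nn.init` (no `.data` access). The channel counts and kernel size are hypothetical values picked for illustration, and the bound check assumes the default gain of 1.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes, chosen only for illustration.
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# In-place initializers from torch.nn.init (note the trailing underscore).
nn.init.xavier_uniform_(conv1.weight)
nn.init.constant_(conv1.bias, 0.01)

# Xavier uniform samples from U(-b, b) with b = sqrt(6 / (fan_in + fan_out)).
fan_in = 3 * 3 * 3      # in_channels * kernel_height * kernel_width
fan_out = 16 * 3 * 3    # out_channels * kernel_height * kernel_width
bound = (6.0 / (fan_in + fan_out)) ** 0.5

# Small tolerance because the weights are stored as float32.
print(conv1.weight.abs().max().item() <= bound + 1e-6)
```

The initializers run without recording gradients, so the parameters remain ordinary trainable `nn.Parameter` objects afterwards.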

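`nn.Sequential` or custom `nn.Module`

Pass an initialization function to torch.nn.Module.apply. It will initialize the weights in the entire nn.Module recursively.

apply(fn): Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch-nn-init).

Example: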

import torch
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)
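
The same pattern also works for a custom `nn.Module` with nested submodules. The sketch below is only an illustration: `TinyNet` is a made-up model, and the choice of Kaiming uniform with zero biases is an assumption, not something the answer prescribes. It also guards against layers created with `bias=False`.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical model with nested submodules, for illustration only."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3, bias=False),
            nn.ReLU(),
        )
        self.head = nn.Linear(4, 2)

    def forward(self, x):
        x = self.features(x)          # (N, 4, H-2, W-2)
        x = x.mean(dim=(2, 3))        # global average pool -> (N, 4)
        return self.head(x)


def init_weights(m):
    # apply() calls this once per submodule, including nested ones.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
        if m.bias is not None:        # layers built with bias=False have none
            nn.init.zeros_(m.bias)


net = TinyNet()
net.apply(init_weights)
print(net.head.bias)
```

Because `apply` walks the whole module tree, the convolution inside `self.features` is reached without any extra code.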

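【Comments】:

I found a reset_parameters method in the source code of many modules. Should I override that method for weight initialization?
What if I want to use a normal distribution with some mean and std?
What is the default initialization if I don't specify one?

【Answer 2】:

We compare different modes of weight initialization using the same neural network (NN) architecture.

All zeros or ones

If you follow the principle of Occam's razor, you might think that setting all the weights to 0 or 1 would be the best solution. This is not the case.

When every weight is the same, all the neurons in each layer produce the same output. This makes it hard to decide which weights to adjust.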

    # initialize two NN's with 0 and 1 constant weights
    model_0 = Net(constant_weight=0)
    model_1 = Net(constant_weight=1)

After 2 epochs:

Validation Accuracy
9.625% -- All Zeros
10.050% -- All Ones
Training Loss
2.304  -- All Zeros
1552.281  -- All Ones
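
The symmetry problem behind these numbers can be checked directly. The sketch below uses a hypothetical two-layer network (not the `Net` class from this answer) with every parameter set to the same constant, and an arbitrary squared-output loss chosen just to get gradients. It shows that all hidden units produce the same activation and receive the same gradient, so a gradient step cannot make them differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A hypothetical two-layer network with every parameter set to the same constant.
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))
for p in net.parameters():
    nn.init.constant_(p, 0.5)

x = torch.randn(5, 4)
hidden = net[1](net[0](x))             # activations of the hidden layer
loss = net(x).pow(2).sum()             # arbitrary scalar loss for the demo
loss.backward()

# Every hidden unit computes the same value for a given input...
same_outputs = torch.allclose(hidden, hidden[:, :1].expand_as(hidden))
# ...and every hidden unit receives the same gradient, so they stay identical.
grad = net[0].weight.grad
same_grads = torch.allclose(grad, grad[:1].expand_as(grad))
print(same_outputs, same_grads)
```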

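Uniform initialization

A uniform distribution has an equal probability of picking any number from a set of numbers.

Let's see how well the neural network trains with uniform weight initialization, where low=0.0 and high=1.0.

Below, we'll see another way (besides the code in the Net class) to initialize the weights of a network. To define weights outside of the model definition, we can:

1. Define a function that assigns weights by the type of network layer, then

2. Apply those weights to an initialized model using model.apply(fn), which applies the function to each model layer.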

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # apply a uniform distribution to the weights and a bias=0
            m.weight.data.uniform_(0.0, 1.0)
            m.bias.data.fill_(0)

    model_uniform = Net()
    model_uniform.apply(weights_init_uniform)

After 2 epochs:

Validation Accuracy
36.667% -- Uniform Weights
Training Loss
3.208  -- Uniform Weights

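General rule for setting weights

The general rule for setting the weights in a neural network is to set them close to zero without being too small.

Good practice is to start your weights in the range [-y, y], where y=1/sqrt(n)
(n is the number of inputs to a given neuron).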

    import numpy as np

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform_rule(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # get the number of the inputs
            n = m.in_features
            y = 1.0/np.sqrt(n)
            m.weight.data.uniform_(-y, y)
            m.bias.data.fill_(0)

    # create a new model with these weights
    model_rule = Net()
    model_rule.apply(weights_init_uniform_rule)
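
To get a feel for the numbers the rule produces, here is a small standard-library sketch. The layer widths are hypothetical (784 would correspond to a flattened 28x28 image), and `uniform_rule_bound` is a helper name invented for this example.

```python
import math
import random


def uniform_rule_bound(n_inputs):
    """Return y = 1/sqrt(n) for a neuron with n_inputs incoming connections."""
    return 1.0 / math.sqrt(n_inputs)


# Hypothetical layer widths, for illustration only.
for n in (784, 256, 64):
    y = uniform_rule_bound(n)
    print(f"n={n:4d}  ->  weights drawn from [-{y:.4f}, {y:.4f})")

# Draw the incoming weights of one neuron with 784 inputs.
random.seed(0)
y = uniform_rule_bound(784)
weights = [random.uniform(-y, y) for _ in range(784)]
```

The wider the layer, the narrower the range, which keeps the weighted sum of many inputs from growing with the layer width.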

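Below we compare the performance of the NN with weights initialized from a uniform distribution [-0.5, 0.5) against weights initialized using the general rule.

After 2 epochs: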

Validation Accuracy
75.817% -- Centered Weights [-0.5, 0.5)
85.208% -- General Rule [-y, y)
Training Loss
0.705  -- Centered Weights [-0.5, 0.5)
0.469  -- General Rule [-y, y)

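Normal distribution to initialize the weights

The normal distribution should have a mean of 0 and a standard deviation of y=1/sqrt(n), where n is the number of inputs to the layer (m.in_features in the code).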

    # takes in a module and applies the specified weight initialization
    def weights_init_normal(m):
        '''Takes in a module and initializes all linear layers with weight
           values taken from a normal distribution.'''

        classname = m.__class__.__name__
        # for every Linear layer in a model
        if classname.find('Linear') != -1:
            # number of inputs to the layer
            n = m.in_features
            # m.weight.data should be taken from a normal distribution
            m.weight.data.normal_(0.0, 1/np.sqrt(n))
            # m.bias.data should be 0
            m.bias.data.fill_(0)

Below we show the performance of two NNs, one initialized using a uniform distribution and the other using a normal distribution.

After 2 epochs:

Validation Accuracy
85.775% -- Uniform Rule [-y, y)
84.717% -- Normal Distribution
Training Loss
0.329  -- Uniform Rule [-y, y)
0.443  -- Normal Distribution
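
One way to see why a standard deviation of 1/sqrt(n) is a reasonable choice is to simulate a single neuron's weighted sum. The sketch below is a standard-library simulation, not part of the experiment above, and it assumes the inputs are independent with zero mean and unit variance. Under that assumption the sum of n products has variance n * std^2, so std = 1/sqrt(n) keeps it near 1.

```python
import math
import random
import statistics

random.seed(0)


def preactivation_std(n_inputs, weight_std, trials=2000):
    """Empirical std of sum_i w_i * x_i with x_i ~ N(0, 1), w_i ~ N(0, weight_std)."""
    sums = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n_inputs):
            total += random.gauss(0.0, weight_std) * random.gauss(0.0, 1.0)
        sums.append(total)
    return statistics.pstdev(sums)


n = 100
scaled = preactivation_std(n, 1.0 / math.sqrt(n))   # std = 1/sqrt(n)
unscaled = preactivation_std(n, 1.0)                # std = 1
print(f"std=1/sqrt(n): {scaled:.2f}   std=1: {unscaled:.2f}")
```

With the scaled weights the weighted sum stays around unit scale, while unit-variance weights inflate it by roughly sqrt(n).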

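【Comments】:

What task are you optimizing? And how can an all-zeros solution give zero loss?
@ashunigion I think you misrepresent what Occam says: "entities should not be multiplied beyond necessity". He does not say you should go for the simplest approach. If that were the case, you shouldn't have used a neural network in the first place.

【Answer 3】:

To initialize layers, you typically don't need to do anything. PyTorch will do it for you. This makes sense if you think about it: why should we initialize layers ourselves when PyTorch can do so following the latest trends?

Check, for example, the Linear layer.

In the __init__ method it calls the Kaiming He init function (via reset_parameters):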

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

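The same goes for other layer types. For conv2d, for example, see its source.

Note: the gain of proper initialization is faster training. If your problem deserves special initialization, you can do it afterwards.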
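
To see what range a `kaiming_uniform_` call with a given `a` actually produces, the bound can be worked out by hand. The sketch below follows the formula documented for `kaiming_uniform_` with its default `mode='fan_in'` and `nonlinearity='leaky_relu'`. The helper name `kaiming_uniform_bound` and the value of `fan_in` are assumptions made for this example.

```python
import math


def kaiming_uniform_bound(fan_in, a=0.0):
    """Bound b such that kaiming_uniform_ samples from U(-b, b) in 'fan_in' mode."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))   # gain for leaky_relu with negative slope a
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std


fan_in = 512  # hypothetical number of input features
print(kaiming_uniform_bound(fan_in, a=math.sqrt(5)))
print(1.0 / math.sqrt(fan_in))
```

For example, with `a = sqrt(5)` the gain is sqrt(1/3), so the bound collapses to 1/sqrt(fan_in), the same range that `reset_parameters` uses for the bias.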

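【Comments】:

The default initialization doesn't always give the best results, though. I recently implemented the VGG16 architecture in PyTorch and trained it on the CIFAR-10 dataset, and I found that just by switching to xavier_uniform initialization for the weights (with biases initialized to 0), rather than using the default initialization, my validation accuracy after 30 epochs of RMSprop increased from 82% to 86%. I also got 86% validation accuracy when using PyTorch's built-in VGG16 model (not pre-trained), so I think I implemented it correctly. (I used a learning rate of 0.00001.)
This is because they didn't use Batch Norms in VGG16. It is true that proper initialization matters and that for some architectures you need to pay attention to it. For instance, if you use a (nn.conv2d(), ReLU()) sequence, you would initialize the conv layer with the Kaiming He initialization designed for ReLU. PyTorch cannot predict the activation function that follows conv2d. This makes sense if you evaluate the eigenvalues, but typically you don't have to do much if you use Batch Norms, since they normalize the outputs for you. If you plan to win the SotaBench competition, it matters.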

【Answer 4】:

import torch.nn as nn        

# a simple network
rand_net = nn.Sequential(nn.Linear(in_features, h_size),
                         nn.BatchNorm1d(h_size),
                         nn.ReLU(),
                         nn.Linear(h_size, h_size),
                         nn.BatchNorm1d(h_size),
                         nn.ReLU(),
                         nn.Linear(h_size, 1),
                         nn.ReLU())

# initialization function, first checks the module type,
# then applies the desired changes to the weights
def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.uniform_(m.weight)

# use the modules apply function to recursively apply the initialization
rand_net.apply(init_normal)

【Comments】:

【Answer 5】:

If you want some extra flexibility, you can also set the weights manually.

Say you have an input of all ones:

import torch
import torch.nn as nn

input = torch.ones((8, 8))
print(input)

tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

And you want to make a dense layer with no bias (so we can visualize it):

d = nn.Linear(8, 8, bias=False)

Set all the weights to 0.5 (or anything else):

d.weight.data = torch.full((8, 8), 0.5)
print(d.weight.data)

The weights:

Out[14]: 
tensor([[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

All your weights are now 0.5. Pass the data through:

d(input)

Out[13]: 
tensor([[4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.]], grad_fn=<MmBackward>)

Remember that each neuron receives 8 inputs, all of which have weight 0.5 and value 1 (and no bias), so it sums up to 4 for each.
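
An alternative to assigning `d.weight.data` is to overwrite the parameter in place inside `torch.no_grad()`. The sketch below repeats the example above that way. Preferring this style is a judgment call rather than something this answer claims.

```python
import torch
import torch.nn as nn

d = nn.Linear(8, 8, bias=False)

# Overwrite the weights in place; no_grad() keeps autograd from tracking the write.
with torch.no_grad():
    d.weight.fill_(0.5)

x = torch.ones(8, 8)
out = d(x)
print(out[0, 0].item())   # 8 inputs * 1.0 * 0.5
```

The parameter object itself is untouched, so it keeps `requires_grad=True`, and an optimizer that already holds a reference to it continues to work.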

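【Comments】:

【Answer 6】:

Sorry for being so late; I hope my answer helps.

To initialize the weights with a normal distribution, use:

torch.nn.init.normal_(tensor, mean=0, std=1)

Or, to use a constant value, write:

torch.nn.init.constant_(tensor, value)

Or, to use a uniform distribution:

torch.nn.init.uniform_(tensor, a=0, b=1) # a: lower_bound, b: upper_bound

You can check other methods for initializing tensors in the torch.nn.init documentation.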
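
As a quick illustration of the three calls above, the sketch below applies them in turn to the same tensor and prints simple statistics. The tensor shape and the specific mean, std, and bounds are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

w = torch.empty(1000, 100)   # uninitialized tensor with a hypothetical shape

nn.init.normal_(w, mean=0.0, std=0.02)
print("normal  :", w.mean().item(), w.std().item())

nn.init.constant_(w, 0.3)
print("constant:", w.min().item(), w.max().item())

nn.init.uniform_(w, a=-0.1, b=0.1)
print("uniform :", w.min().item(), w.max().item())

# Each function also returns the tensor it modified, so calls can be chained.
b = nn.init.zeros_(torch.empty(100))
```

Each call overwrites the previous contents in place, so only the last initializer applied to a tensor has any effect.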

【Comments】:

【Answer 7】:

Iterate over parameters

If you cannot use apply, for instance because the model does not implement Sequential directly:

Same for all

# see UNet at https://github.com/milesial/Pytorch-UNet/tree/master/unet


def init_all(model, init_func, *params, **kwargs):
    for p in model.parameters():
        init_func(p, *params, **kwargs)

model = UNet(3, 10)
init_all(model, torch.nn.init.normal_, mean=0., std=1) 
# or
init_all(model, torch.nn.init.constant_, 1.)

Depending on shape

def init_all(model, init_funcs):
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

model = UNet(3, 10)
init_funcs = {
    1: lambda x: torch.nn.init.normal_(x, mean=0., std=1.), # can be bias
    2: lambda x: torch.nn.init.xavier_normal_(x, gain=1.), # can be weight
    3: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv1D filter
    4: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv2D filter
    "default": lambda x: torch.nn.init.constant_(x, 1.), # everything else
}

init_all(model, init_funcs)

You can try torch.nn.init.constant_(x, len(x.shape)) to check that they are appropriately initialized:

init_funcs = {
    "default": lambda x: torch.nn.init.constant_(x, len(x.shape))
}
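
Dispatching on shape alone cannot tell a 1-D bias from a 1-D normalization scale. As a possible variation (not part of the original answer), the sketch below iterates over `named_parameters()` and also looks at the parameter name. The function `init_by_name` and the small stand-in model are hypothetical, and the chosen initializers are only an example.

```python
import torch
import torch.nn as nn


def init_by_name(model):
    """Initialize parameters based on their name, not only their shape."""
    for name, p in model.named_parameters():
        if name.endswith("bias"):
            nn.init.zeros_(p)
        elif p.dim() >= 2:
            nn.init.xavier_uniform_(p)   # needs at least 2 dimensions
        else:
            nn.init.ones_(p)             # e.g. 1-D normalization scales


# Hypothetical stand-in model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 2))
init_by_name(model)
```

Here the BatchNorm scale ends up at 1 and every bias at 0, while the two Linear weight matrices get Xavier uniform values.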

【Comments】:

【Answer 8】:

As I don't have enough reputation yet, I can't add a comment under the answer posted by prosti on 2019-06-26 at 13:16.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

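But I want to point out that some assumptions in Kaiming He's paper, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, are actually known not to be appropriate, even though the deliberately designed initialization method appears to be a hit in practice.

For example, in the subsection on the backward propagation case, they assume that $w_l$ and $\delta y_l$ are independent of each other. But, as is well known, this does not hold in general: take the score map $\delta y^L_i$ as an example. With a typical cross-entropy loss objective, it depends on the network's output and hence on the weights.

So I think the true underlying reason why He's initialization works well remains to be unraveled, since everyone has witnessed its power in boosting deep learning training.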
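
To make the dependence concrete, here is the standard gradient of a softmax cross-entropy loss with respect to the last layer's pre-activations. The notation is an assumption introduced for this sketch and is not taken from the paper: $t$ is a one-hot target, $p$ the softmax output, and $W^L$, $x^L$, $b^L$ the last layer's weights, input and bias.

```latex
\begin{aligned}
y^L &= W^L x^L + b^L, \qquad
p_i = \frac{\exp(y^L_i)}{\sum_j \exp(y^L_j)}, \qquad
\mathcal{L} = -\sum_i t_i \log p_i, \\
\delta y^L_i &= \frac{\partial \mathcal{L}}{\partial y^L_i} = p_i - t_i .
\end{aligned}
```

Since $p$ is a function of $W^L x^L$, the quantity $\delta y^L_i$ is not independent of the last layer's weights.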

【Comments】:

【Answer 9】:

If you see a deprecation warning (@Fábio Perez)...

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

【Comments】:

You can comment on Fábio Perez's answer there to keep the answers clean.

【Answer 10】:

This is a better way: just pass your whole model.

import torch.nn as nn
def initialize_weights(model):
    # Initializes weights according to the DCGAN paper
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.BatchNorm2d)):
            nn.init.normal_(m.weight.data, 0.0, 0.02)
        # if you also want for linear layers ,add one more elif condition
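
A possible version with the extra branch for linear layers is sketched below. Treating linear layers like the convolutions is an assumption made for this example. Unlike the snippet above, it also centers the BatchNorm scale at 1 and zeroes the shift, which is what the PyTorch DCGAN tutorial does. Treat that as a deliberate deviation, not as the author's code.

```python
import torch
import torch.nn as nn


def initialize_weights(model):
    """DCGAN-style init, extended to linear layers (the extension is an assumption)."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.normal_(m.weight, 0.0, 0.02)
        elif isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, 0.0, 0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            # Scale centered at 1 (as in the PyTorch DCGAN tutorial), shift at 0.
            nn.init.normal_(m.weight, 1.0, 0.02)
            nn.init.zeros_(m.bias)


# Hypothetical small model used only to exercise the function.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 6 * 6, 1),
)
initialize_weights(model)
```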

【Comments】:
