Why does the tf.contrib.layers.instance_norm layer contain a StopGradient operation?

【Question title】: Why does the tf.contrib.layers.instance_norm layer contain a StopGradient operation?
【Posted】: 2021-02-22 20:32:10
【Question description】:

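Why does the tf.contrib.layers.instance_norm layer contain a StopGradient operation? In other words, why is it needed?

StopGradient seems to appear even in the simpler tf.nn.moments op, which can serve as a building block of tf.contrib.layers.instance_norm.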

x_m, x_v = tf.nn.moments(x, [1, 2], keep_dims=True)

I also found a comment about StopGradient in the tf.nn.moments source code:

# The dynamic range of fp16 is too limited to support the collection of
# sufficient statistics. As a workaround we simply perform the operations
# on 32-bit floats before converting the mean and variance back to fp16
y = math_ops.cast(x, dtypes.float32) if x.dtype == dtypes.float16 else x
# Compute true mean while keeping the dims for proper broadcasting.
mean = math_ops.reduce_mean(y, axes, keepdims=True, name="mean")
# sample variance, not unbiased variance
# Note: stop_gradient does not change the gradient that gets
#       backpropagated to the mean from the variance calculation,
#       because that gradient is zero
variance = math_ops.reduce_mean(
    math_ops.squared_difference(y, array_ops.stop_gradient(mean)),
    axes,
    keepdims=True,
    name="variance")

So is this just an optimization, given that this gradient is always zero?
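The claim in the source comment can be checked by hand. The following is a short derivation, assuming the population (biased) variance over n elements; the symbols n, \mu, and \sigma^2 are introduced here for illustration and do not appear in the TensorFlow source.

```latex
\begin{aligned}
\mu &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2, \\
\frac{\partial \sigma^2}{\partial \mu}
  &= -\frac{2}{n}\sum_{i=1}^{n} (x_i - \mu)
   = -2\left(\frac{1}{n}\sum_{i=1}^{n} x_i - \mu\right) = 0, \\
\frac{d \sigma^2}{d x_j}
  &= \frac{2}{n}(x_j - \mu)
   + \frac{\partial \sigma^2}{\partial \mu}\,\frac{\partial \mu}{\partial x_j}
   = \frac{2}{n}(x_j - \mu).
\end{aligned}
```

Under these assumptions the path through the mean contributes exactly zero, so wrapping the mean in stop_gradient leaves the gradient of the variance unchanged and only removes a backward path from the graph.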

【Comments】:

Tags: tensorflow deep-learning batch-normalization

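【Solution 1】:

An attempt at an answer.

This design tells us that when minimizing the second moment we do not want to propagate gradients through the first moment. Does that make sense? If we try to minimize E[x^2]-E[x]^2, we minimize E[x^2] while maximizing E[x]^2. The first term shrinks the absolute value of every element (pulling them toward the center). The second term's gradient increases all values, which does nothing to minimize the variance but may negatively affect other gradient paths.

So we do not propagate the second moment's gradient through the first moment, because that gradient does not affect the second moment, at least when using plain SGD.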
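A quick numerical check of this point, written as a minimal pure-Python sketch rather than with TensorFlow itself. The helpers `variance` and `numeric_grad` are hypothetical names introduced here; holding the mean fixed is only meant to mimic what stop_gradient(mean) does in the squared-difference form used by tf.nn.moments.

```python
def variance(xs, mean=None):
    """Population variance; if `mean` is given, it is treated as a constant."""
    n = len(xs)
    m = sum(xs) / n if mean is None else mean
    return sum((x - m) ** 2 for x in xs) / n


def numeric_grad(f, xs, eps=1e-6):
    """Central finite-difference gradient of f at the point xs."""
    grad = []
    for j in range(len(xs)):
        hi = list(xs)
        lo = list(xs)
        hi[j] += eps
        lo[j] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


xs = [1.0, 4.0, -2.0, 7.0]
frozen_mean = sum(xs) / len(xs)

# Mean recomputed from the perturbed input: gradient flows through the mean.
grad_full = numeric_grad(variance, xs)
# Mean held fixed: mimics stop_gradient(mean).
grad_stopped = numeric_grad(lambda v: variance(v, mean=frozen_mean), xs)
# Closed form expected in both cases: 2 * (x_j - mean) / n.
grad_analytic = [2 * (x - frozen_mean) / len(xs) for x in xs]

max_gap = max(abs(a - b) for a, b in zip(grad_full, grad_stopped))
print("full:   ", grad_full)
print("stopped:", grad_stopped)
print("max gap:", max_gap)
```

Both gradients should agree up to floating-point noise, which is consistent with the source comment that the gradient reaching the mean from the variance calculation is zero. Note that this holds for the squared-difference form; it is not a statement about stopping the gradient of E[x] in the E[x^2]-E[x]^2 form.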

【Discussion】: