What's the best way to access single gradients in a batch in TensorFlow?

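【Question title】: What's the best way to access single gradients in a batch in TensorFlow?
【Posted】: 2020-06-08 11:33:58
【Question】:

I am currently analyzing how gradients evolve over the course of training a CNN with TensorFlow 2.x. What I want to do is compare the gradient of each individual example in a batch with the gradient of the whole batch. At the moment I use this simple code snippet for each training step: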

[...]
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
[...]

# One training step
# x_train is a batch of input data, y_train the corresponding labels
def train_step(model, optimizer, x_train, y_train):

    # Process batch
    with tf.GradientTape() as tape:
        batch_predictions = model(x_train, training=True)
        batch_loss = loss_object(y_train, batch_predictions)
    batch_grads = tape.gradient(batch_loss, model.trainable_variables)
    # Do something with gradient of whole batch
    # ...

    # Process each data point in the current batch
    for index in range(len(x_train)):
        with tf.GradientTape() as single_tape:
            single_prediction = model(x_train[index:index+1], training=True)
            single_loss = loss_object(y_train[index:index+1], single_prediction)
        single_grad = single_tape.gradient(single_loss, model.trainable_variables)
        # Do something with gradient of single data input
        # ...

    # Use batch gradient to update network weights
    optimizer.apply_gradients(zip(batch_grads, model.trainable_variables))

    train_loss(batch_loss)
    train_accuracy(y_train, batch_predictions)

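My main problem is that computation time explodes when I compute each gradient individually, even though TensorFlow should already have done these computations when it calculated the batch gradient. The reason is that GradientTape, like compute_gradients, always returns a single gradient, no matter whether it was given one data point or several. So the computation has to be repeated for each data point.

I know I could build the batch gradient for the weight update from the single gradients computed for each data point, but that saves only a little computation time.

Is there a more efficient way to compute the single gradients?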

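【Comments】:

Tags: python tensorflow machine-learning tensorflow2.0

【Solution 1】:

You can use the jacobian method of the gradient tape to get the Jacobian, which gives you the gradient for each individual loss value: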

import tensorflow as tf

# Make a random linear problem
tf.random.set_seed(0)
# Random input batch of ten four-vector examples
x = tf.random.uniform((10, 4))
# Random weights
w = tf.random.uniform((4, 2))
# Random batch label
y = tf.random.uniform((10, 2))
with tf.GradientTape() as tape:
    tape.watch(w)
    # Prediction
    p = x @ w
    # Loss
    loss = tf.losses.mean_squared_error(y, p)
# Compute Jacobian
j = tape.jacobian(loss, w)
# The Jacobian gives you the gradient for each loss value
print(j.shape)
# (10, 4, 2)
# Gradient of the loss wrt the weights for the first example
tf.print(j[0])
# [[0.145728424 0.0756840706]
#  [0.103099883 0.0535449386]
#  [0.267220169 0.138780832]
#  [0.280130595 0.145485848]]
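
To apply the same idea to a Keras model like the one in the question, the loss has to stay per-example instead of being averaged over the batch. The following is only a minimal sketch under assumptions: the small dense model, the random data, and the name per_example_loss_object are hypothetical stand-ins, not the asker's CNN, and reduction="none" is used as the string form of tf.keras.losses.Reduction.NONE in TF 2.x. With that, tape.jacobian returns one tensor per trainable variable with a leading batch axis, and averaging over that axis should reproduce the ordinary batch gradient.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical tiny classifier standing in for the CNN in the question
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Random batch of 5 examples with integer class labels
x_train = tf.random.uniform((5, 4))
y_train = tf.constant([0, 1, 2, 1, 0])

# Keep one loss value per example instead of averaging over the batch
per_example_loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction="none")

with tf.GradientTape() as tape:
    predictions = model(x_train, training=True)
    losses = per_example_loss_object(y_train, predictions)  # shape (5,)

# One Jacobian per trainable variable, each with a leading batch axis,
# e.g. shape (5, 4, 8) for the first Dense kernel
jacobians = tape.jacobian(losses, model.trainable_variables)

# Averaging over the batch axis gives the gradient of the mean loss
batch_grads = [tf.reduce_mean(j, axis=0) for j in jacobians]

# Reference: the usual batch gradient with the default (averaging) reduction
with tf.GradientTape() as ref_tape:
    ref_predictions = model(x_train, training=True)
    ref_loss = tf.keras.losses.SparseCategoricalCrossentropy()(
        y_train, ref_predictions)
ref_grads = ref_tape.gradient(ref_loss, model.trainable_variables)

# Largest absolute difference between the two results, per variable
for g, r in zip(batch_grads, ref_grads):
    tf.print(tf.reduce_max(tf.abs(g - r)))
```

Here jacobians[k][i] is the gradient of example i with respect to the k-th trainable variable, and batch_grads can be passed to optimizer.apply_gradients(zip(batch_grads, model.trainable_variables)) in the same way as the result of tape.gradient. Keep in mind that the Jacobians hold one gradient per example for every variable, so memory use grows with the batch size.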

【Comments】:

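If I understand this correctly, I can replace my tape.gradient(batch_loss, model.trainable_variables) call with tape.jacobian(batch_loss, model.trainable_variables), right? If I do that (for a batch size of 250 and a CNN with 8 layers), I still get a j of size 8 (not 250, as I would expect), which means I get a gradient per layer rather than per input example. Do you know what I am doing wrong?
@ItsMarvolo Well, that depends on the rest of your code, namely loss_object. What I posted assumes that your batch_loss is a 1-D tensor with one loss value per example in the batch. But if loss_object already aggregates the batch loss into a single value, or outputs something else, then you would need to make additional changes...
Okay, thanks for the hint. I added my current loss_object to my question. You were right about the loss being aggregated into a single value. If I add reduction=tf.keras.losses.Reduction.NONE to loss_object, my optimizer.apply_gradients(..) call now breaks because of the additional dimension. I'll keep working on it and see whether it solves my problem.
Okay, just to clarify: the TensorFlow docs state that "for almost all cases" the AUTO reduction defaults to SUM_OVER_BATCH_SIZE. That should be the case in my original post, where no reduction was specified. So the result of ..SparseCategoricalCrossentropy() + tape.gradient(...) should be equal to the new result of ..SparseCategoricalCrossentropy(..NONE) + tape.jacobian(...) + tf.reduce_mean(..). Maybe this helps anybody who is struggling at the same point as me.
@ItsMarvolo Thanks for the additional information, sounds good. I forgot to mention: if you need the gradient of the loss with respect to the inputs, you can use batch_jacobian, which will be faster (it assumes that each y[i] depends only on x[i]). But I don't think that applies to your case.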
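
For reference, the batch_jacobian variant mentioned in the last comment could look roughly like the sketch below. It reuses the linear toy problem from the answer and differentiates the per-example loss with respect to the inputs x rather than the weights. The explicit squared-error formula and the keepdims=True reshaping are choices made for this illustration: batch_jacobian expects both target and source to have at least two dimensions, with the batch axis first.

```python
import tensorflow as tf

tf.random.set_seed(0)
# Same toy linear problem as in the answer
x = tf.random.uniform((10, 4))
w = tf.random.uniform((4, 2))
y = tf.random.uniform((10, 2))

with tf.GradientTape() as tape:
    # x is a plain tensor, so it has to be watched explicitly
    tape.watch(x)
    p = x @ w
    # Per-example mean squared error, kept as shape (10, 1) because
    # batch_jacobian needs a target of rank 2 or higher
    loss = tf.reduce_mean(tf.square(y - p), axis=1, keepdims=True)

# Shape (10, 1, 4): row i holds d loss[i] / d x[i] only,
# the cross terms between different examples are never computed
j = tape.batch_jacobian(loss, x)

# Drop the singleton loss axis to get one input gradient per example
input_grads = tf.squeeze(j, axis=1)
print(input_grads.shape)
```

This only gives correct per-example gradients when loss[i] depends on x[i] alone, which holds here but would not hold, for example, for a model with batch normalization in training mode.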