Memory leak in TensorFlow.js: How to clean up unused tensors? - Answers

【Question title】: Memory leak in TensorFlow.js: How to clean up unused tensors?
【Posted】: 2019-10-24 02:13:52
【Question】:

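I'm writing a script that sometimes leaks tensors. This can happen in several situations, for example when I'm training a neural network and the training crashes. In that case the training is interrupted and the tensors are not disposed correctly. The result is a memory leak, which I'm trying to clean up by disposing the unused tensors.

Example

In the snippet below, I'm training two (very simple) models. The first run works and does not leak any tensors (number of tensors before training = number of tensors after training). In the second run, I use an invalid reshape layer to force a crash during training. An error is thrown, and the tensors from the dataset (I guess?) are not disposed correctly. The code is an example that shows how tensors can leak.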

async function train(shouldCrash) {
  console.log(`Training, shouldCrash=${shouldCrash}`);
  const dataset = tf.data.zip({ // setup data
    xs: tf.data.array([[1],[1]]),
    ys: tf.data.array([1]),
  }).batch(1);

  const model = tf.sequential({ // setup model
    layers: [
      tf.layers.dense({units: 1, inputShape: [1]}),
      tf.layers.reshape({targetShape: [(shouldCrash ? 2 : 1)]}), // use invalid shape when crashing
    ],
  });
  model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });
  console.log('  Tensors before:', tf.memory().numTensors);
  try {
    const history = await model.fitDataset(dataset, { epochs: 1 });
  } catch (err) {
    console.log(`    Error: ${err.message}`);
  }
  console.log('  Tensors after:', tf.memory().numTensors);
}

(async () => {
  await train(false); // normal training
  await train(true); // training with error
})();

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.1.2/dist/tf.min.js"></script>

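Question

There is tf.tidy, which helps me dispose unused tensors in some cases, but it can only be used for synchronous function calls. So it cannot be used when calling await model.fitDataset(...).

Is there a way to dispose any unused tensors? Alternatively, is there a way to dispose all existing tensors on the page (without reloading it)?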

【Comments on the question】:

Tags: javascript node.js machine-learning memory-leaks tensorflow.js

【Solution 1】:

The way to clean up any unused tensors in async code is to wrap the code that creates them between a startScope() and an endScope() call.

tf.engine().startScope()
// do your thing
tf.engine().endScope()

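【Comments】:

I figured this out by looking at the tf.tidy() method. TF.js also uses it in its tests: github.com/tensorflow/tfjs-core/blob/master/tfjs-core/src/…
Yes, this is great! It solved the memory problem I was having. Thanks!
Thanks a lot! This really works wonders! It cleared all the tensors created between the start and the end of the scope :)
Is this API documented anywhere?
The maintainers of TensorFlow.js recommend using .dispose() instead of the approach above: github.com/tensorflow/tfjs/issues/4685. If you have several promises running at the same time, or if you create promises inside promises (which is common in async/await code), the approach above can cause problems: microtasks from different promises can interleave, which can lead to the following execution order: startScope, startScope, endScope, endScope. The first endScope can then dispose tensors that are still going to be used in the other promise.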
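
The interleaving described in the last comment can be reproduced without TensorFlow.js. The sketch below uses a toy scope stack (createToyEngine and all names in it are hypothetical stand-ins, not the real engine) and assumes that scopes behave like a plain stack, as the comment implies.

```javascript
// Toy stand-in for a stack of scopes. It is NOT the TensorFlow.js engine;
// it only records which scope each endScope() call ends up closing.
function createToyEngine() {
  const log = [];
  const scopes = [];
  return {
    log,
    startScope(caller) {
      scopes.push({ owner: caller, tensors: [] });
      log.push(`startScope(${caller})`);
    },
    track(tensorName) {
      // New tensors are attached to the innermost open scope.
      scopes[scopes.length - 1].tensors.push(tensorName);
    },
    endScope(caller) {
      const scope = scopes.pop(); // closes the innermost scope, whoever opened it
      log.push(`endScope(${caller}) disposed [${scope.tensors.join(', ')}] from scope ${scope.owner}`);
    },
  };
}

async function task(engine, name) {
  engine.startScope(name);
  engine.track(`${name}-tensor`);
  await Promise.resolve(); // yields, so the other task can start its scope
  engine.endScope(name);
}

async function demo() {
  const engine = createToyEngine();
  await Promise.all([task(engine, 'A'), task(engine, 'B')]);
  return engine.log;
}

demo().then((log) => console.log(log.join('\n')));
// startScope(A)
// startScope(B)
// endScope(A) disposed [B-tensor] from scope B
// endScope(B) disposed [A-tensor] from scope A
```

Task A's endScope closes the scope that task B opened, so B's tensor is gone while B is still running.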

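【Solution 2】:

According to the documentation, the function passed to tf.tidy "must not return a Promise". Internally, the tf backend disposes all the tensors it uses while fitting a model. That is why model.fit should not be placed inside tf.tidy. To clean up after a model that crashed, you can call tf.dispose on the model.

It is true that there currently seems to be a memory leak, but a model that crashes because of how it was defined is a poor implementation. This should not happen in a proper setup, because you can check beforehand whether the given parameters match what the layers expect as input. For example, reshaping a shape of size 1 to size 2 can be ruled out before the model is built, which prevents the memory leak.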
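
One way such a pre-check might look is sketched below. The helper names elementCount and canReshape are hypothetical, and the sketch assumes both shapes are fully specified (no null dimensions) and exclude the batch dimension.

```javascript
// Hypothetical pre-check (not a TensorFlow.js API): a reshape is only valid
// when the total number of elements stays the same.
function elementCount(shape) {
  return shape.reduce((product, dim) => product * dim, 1);
}

function canReshape(inputShape, targetShape) {
  return elementCount(inputShape) === elementCount(targetShape);
}

// The dense layer in the question outputs shape [1].
console.log(canReshape([1], [1])); // true  (the working model)
console.log(canReshape([1], [2])); // false (the model that crashes)
```

Calling canReshape before tf.sequential(...) lets the script skip an invalid configuration without ever starting the training that would crash.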

async function train(shouldCrash) {
  console.log(`Training, shouldCrash=${shouldCrash}`);
  const dataset = tf.data.zip({ // setup data
    xs: tf.data.array([[1],[1]]),
    ys: tf.data.array([1]),
  }).batch(1);

  const model = tf.sequential({ // setup model
    layers: [
      tf.layers.dense({units: 1, inputShape: [1]}),
      tf.layers.reshape({targetShape: [(shouldCrash ? 2 : 1)]}), // use invalid shape when crashing
    ],
  });
  model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });
  console.log('  Tensors before:', tf.memory().numTensors);
  try {
    const history = await model.fitDataset(dataset, { epochs: 1 });
  } catch (err) {
    console.log(`    Error: ${err.message}`);
  }
  
  console.log('  Tensors after:', tf.memory().numTensors);
  return model
}

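(async () => {
  const m1 = await train(false); // normal training
  tf.dispose(m1); // dispose the tensors held by the first model
  const m2 = await train(true); // training with error
  tf.dispose(m2); // dispose the tensors held by the crashed model
  tf.disposeVariables(); // dispose all variables kept in the backend engine
  console.log('Tensors after cleanup:', tf.memory().numTensors);
})();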

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.1.2/dist/tf.min.js"></script>
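
If the model is not needed after training, the explicit-dispose idea can be folded into a try/finally so that cleanup also happens when fitDataset throws. This is a hedged sketch: fitAndDispose is a hypothetical helper name, and it relies only on the model's own dispose() method. It releases the model's weights; whether it also frees every intermediate tensor left behind by a crash is not something this sketch guarantees.

```javascript
// Hypothetical helper (not a TensorFlow.js API): trains a model and always
// disposes it afterwards, whether training succeeds or throws.
async function fitAndDispose(model, dataset, fitArgs) {
  try {
    return await model.fitDataset(dataset, fitArgs);
  } finally {
    model.dispose(); // release the model's weight tensors
  }
}

// Usage (assumes `model` and `dataset` are set up as in the answer):
// try {
//   await fitAndDispose(model, dataset, { epochs: 1 });
// } catch (err) {
//   console.log(`Error: ${err.message}`);
// }
// console.log('Tensors after:', tf.memory().numTensors);
```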

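【Comments】:

I know the code I showed is a "poor implementation", but as I said, it is only there to demonstrate the memory leak. The tf.disposeVariables function looks very useful, and I didn't even know I could pass a model to tf.dispose. Thanks! :)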