Memory Leak in TensorFlow Google Cloud ML Training: Answers

【Question title】: Memory Leak in TensorFlow Google Cloud ML Training
【Posted】: 2017-11-08 19:36:30
【Question description】:

I have been trying out the TensorFlow tutorial scripts on Google Cloud ML. In particular, I used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, memory leaks at a rate of about 0.5% per hour.

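I have not made any changes to the scripts other than packaging them into the format GCP requires (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket that contains the .bin data files.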

If I run locally, i.e. not in Google Cloud, and use TCMALLOC by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option in Google Cloud ML.
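For readers who want to reproduce the local workaround, the following is a minimal sketch of launching a training script with tcmalloc preloaded. The helper names `env_with_preload` and `run_with_tcmalloc` are hypothetical, the library path is the one quoted above and may differ on other systems, and this only applies to a local run, not to Cloud ML.

```python
import os
import subprocess

# Path quoted in the question; adjust it for your own system.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so"


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list of shared objects.
    env["LD_PRELOAD"] = library_path + (":" + existing if existing else "")
    return env


def run_with_tcmalloc(command, library_path=TCMALLOC_PATH):
    """Run command (a list of arguments), preloading tcmalloc if the file exists."""
    if os.path.exists(library_path):
        env = env_with_preload(library_path)
    else:
        # Fall back to the default allocator when the library is not installed.
        env = dict(os.environ)
    return subprocess.call(command, env=env)
```

A local run could then be started with something like `run_with_tcmalloc(["python", "cifar10_multi_gpu_train.py", "--num_gpus=4"])`, where the script path is assumed to point at the tutorial's multi-GPU trainer.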

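What could be causing the leak, and what can I do to fix it? Why haven't other users noticed the same problem? Although the leak is small, it is large enough to make my training sessions run out of memory and fail when I run against my own data for several days. The leak happens regardless of the number of GPUs I use.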
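To show why a leak this small still matters over a multi-day run, here is a rough sketch that estimates the growth rate from logged memory samples and projects when memory would be exhausted. It assumes the 0.5% figure means percentage points of total memory per hour. The sample values and function names are hypothetical, and the samples are assumed to come from whatever memory monitoring is available.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use (percent) against time (hours).

    samples is a list of (hours_elapsed, memory_percent) pairs.
    """
    n = float(len(samples))
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until_full(current_percent, rate_per_hour):
    """Hours left until memory use reaches 100% at a constant growth rate."""
    if rate_per_hour <= 0:
        return float("inf")  # no growth means no projected exhaustion
    return (100.0 - current_percent) / rate_per_hour


# Hypothetical samples: 20% used at the start, growing 0.5 points per hour.
samples = [(0, 20.0), (1, 20.5), (2, 21.0)]
rate = growth_per_hour(samples)
remaining = hours_until_full(samples[0][1], rate)
```

With these made-up numbers the projection is (100 - 20) / 0.5 = 160 hours, so a job that starts at 20% memory use would fail in just under seven days.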

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.cifar10_multi_gpu_train --region europe-west1 --staging-bucket gs://tfoutput --scale-tier CUSTOM --config config.yml --runtime-version 1.0 -- --num_gpus=4

The configuration file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help is appreciated, thank you.

【Question comments】:

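Could you share the output of running python -c "from google.protobuf.internal import api_implementation; print(api_implementation._default_implementation_type)" locally? Is it 'cpp'?
@rhaertel80 Yes, it is 'cpp'.
That matches the output in CloudML Engine. We will continue to investigate.
Also, we recommend using github.com/tensorflow/models/pull/1538, which has huge performance benefits and may be enough to let you complete your training while we investigate.
Thanks @rhaertel80, this seems much better in terms of both memory usage and performance.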

Tags: memory-leaks tensorflow google-cloud-ml google-cloud-ml-engine

【Solution 1】:

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits (the shorter the run time, the less likely you are to hit an OOM).

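Granted, this is probably not a permanent fix, but according to our tests TensorFlow 1.2 appears to resolve the issue. TensorFlow 1.2 will be available on CloudML Engine soon. If you still have problems, please let us know.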
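Since the fix appears tied to the TensorFlow version, a trainer could log whether it is running on 1.2 or later. The sketch below is only an illustration: `version_tuple` and `is_at_least` are hypothetical helpers, and the only thing assumed about TensorFlow is that it exposes its version string as `tf.__version__`.

```python
import re


def version_tuple(version):
    """Extract (major, minor) from a version string such as '1.2.0' or '1.2.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def is_at_least(version, minimum=(1, 2)):
    """True if version is at or above the (major, minor) minimum."""
    return version_tuple(version) >= tuple(minimum)


# Usage inside a trainer (requires TensorFlow to be installed):
#   import tensorflow as tf
#   if not is_at_least(tf.__version__):
#       print("TensorFlow %s predates 1.2" % tf.__version__)
```

Comparing integer tuples rather than raw strings avoids treating "1.10" as older than "1.2".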

【Comments】: