Tensorflow ResourceExhaustedError OOM on CPU after calling DNNClassifier.predict repeatedly

【Question title】: Tensorflow ResourceExhaustedError OOM on CPU after calling DNNClassifier.predict repeatedly
【Posted】: 2018-04-08 17:32:35
【Question description】:

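I'm using a Tensorflow DNNClassifier running on CPU. I've finished training and am now calling estimator.predict repeatedly; after a few thousand calls I get the error below. I'm confused, because I thought making predictions should not by itself increase memory use (I've seen other people report similar errors, but they were using a GPU and saw the error during training).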

....
File "C:\Users\Zvi\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[973771,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
 [[Node: save/AssignVariableOp = AssignVariableOp[dtype=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dnn/input_from_feature_columns/input_layer/product_hub_module_embedding/module/embeddings/part_0, save/Identity_7)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

【Question discussion】:

Tags: python tensorflow deep-learning out-of-memory tensorflow-estimator

【Solution 1】:

I'm confused, because I thought making predictions should not by itself increase memory use.

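That's actually not correct, and it is probably the cause of the OOM. Estimator.predict() rebuilds the graph from scratch on every call and loads the weights from disk for inference. See this question and this issue on GitHub for details. Yes, the graph, tensors, and other objects become eligible for GC after the call, but that doesn't mean they are all collected immediately.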

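When this method is called a thousand times, the stability of the whole application depends on how quickly previously allocated memory can be reclaimed. But Python GC can be postponed for a long time. And even if the GC collects garbage regularly, you can still run into memory fragmentation.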
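To illustrate the point about postponed collection, here is a minimal sketch in plain Python (no TensorFlow involved). `GraphLike` is a hypothetical stand-in, and the assumption that the objects left behind by a predict call sit in reference cycles is mine, not something confirmed above. The behaviour shown is specific to CPython, where cyclic garbage is only freed when the cyclic collector runs.

```python
import gc
import weakref


class GraphLike:
    """Hypothetical stand-in for an object caught in a reference cycle."""

    def __init__(self):
        self.owner = self  # self-reference: reference counting alone cannot free it


def observe_delayed_collection():
    """Return (alive_before_collect, alive_after_collect) for a dropped cycle."""
    was_enabled = gc.isenabled()
    gc.disable()  # mimic a cyclic collector that has not run yet
    try:
        # The only strong reference left to the object is its own cycle.
        probe = weakref.ref(GraphLike())
        alive_before = probe() is not None
        gc.collect()  # force a full collection
        alive_after = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


alive_before, alive_after = observe_delayed_collection()
print(alive_before, alive_after)
```

On CPython the object is still alive before the explicit collection and gone after it, so unreachable memory can stay allocated until the collector gets around to it.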

This means you should either call Estimator.predict() fewer times with more input data per call, or migrate from the Estimator API to Keras, TF-Slim, or a pure TensorFlow implementation.
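A rough sketch of the first option, fewer calls with more data per call. `predict_in_chunks` and `predict_many` are hypothetical names of my own: `predict_many` stands for whatever wrapper you write around a single Estimator.predict() call, and a fake one is used below so the snippet runs without TensorFlow. The commented-out wrapper assumes TF 1.x and a made-up feature key.

```python
def predict_in_chunks(predict_many, rows, chunk_size=10000):
    """Yield one prediction per row, calling predict_many once per chunk.

    predict_many: hypothetical callable taking a list of rows and returning
    an iterable of predictions (e.g. a wrapper around one predict() call).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        yield from predict_many(chunk)


# Hypothetical TF 1.x wrapper, not executed here; "feature" is a made-up key:
# def predict_many(chunk):
#     input_fn = tf.estimator.inputs.numpy_input_fn(
#         x={"feature": np.array(chunk)}, shuffle=False)
#     return estimator.predict(input_fn=input_fn)

# Stand-in used only to show the call pattern.
calls = []


def fake_predict_many(chunk):
    calls.append(len(chunk))  # record how many rows each call received
    return [value * 2 for value in chunk]


predictions = list(predict_in_chunks(fake_predict_many, list(range(10)), chunk_size=4))
print(predictions, calls)
```

With ten rows and a chunk size of four, the underlying predict function is called three times instead of ten. With a real estimator, each of those calls would still rebuild the graph and reload the checkpoint, so the larger the chunk, the fewer times that cost is paid.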

【Discussion】: