Tensorflow Object Detection API: Training gets stuck at step=0 for ssd + mobilenetv2 with custom data

【Question title】: Tensorflow Object Detection API: Training gets stuck at step=0 for ssd + mobilenetv2 with custom data
【Posted】: 2020-07-31 12:48:57
【Question description】:

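I want to do transfer learning with the ssd + mobilenetv2 model and my own images. I have only one class. The images were downloaded from OpenImageDataSet. I used TensorFlow's Object Detection API, but training gets stuck at step = 0.

I verified that the TFRecord files were created correctly, because I can train fast_rcnn on the same data with the Object Detection API. I created my own config file based on the one from the repo: ssd_mobilenet_v2_oid_v4.config.

I also tried starting from ssd_mobilenet_v2_coco_2018_03_29.tar.gz with the corresponding config file. The behavior is the same: it gets stuck at the same place.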

####################
CONSOLE LOG:
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0416 16:30:39.198738 19792 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0416 16:30:39.632495 19792 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
I0416 16:30:48.724722 19792 basic_session_run_hooks.py:606] Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
2020-04-16 16:30:59.919297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 16:31:00.964680: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 16:31:00.986098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 12.512502, step = 0
I0416 16:31:02.740392 19792 basic_session_run_hooks.py:262] loss = 12.512502, step = 0 [STUCK HERE]

【Question comments】:

Tags: tensorflow training-data object-detection-api mobilenet tensorflow-ssd

【Solution 1】:

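I found that the combination of the TF 1.15 GPU version and my setup causes the problem: "Invoking ptxas not supported on Windows". Downgrading to TF 1.14 GPU, or using TF 1.15 CPU, resolves it. This is a common and still-open issue on Tensorflow: HERE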
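
To see quickly whether an installation matches the combination described in this answer (TF 1.15, GPU build, Windows), a small check along these lines may help. This is only a sketch: the helper names `matches_reported_setup` and `check_installed_tensorflow` are made up for illustration, and the check covers only the version, the CUDA build flag, and the operating system, not the rest of "my setup".

```python
import platform


def matches_reported_setup(tf_version, built_with_cuda, system_name):
    """Return True for the combination this answer reports as hanging:
    TF 1.15.x, a GPU (CUDA) build, running on Windows."""
    major_minor = tuple(tf_version.split(".")[:2])
    return bool(
        major_minor == ("1", "15")
        and built_with_cuda
        and system_name == "Windows"
    )


def check_installed_tensorflow():
    """Apply the check above to the TensorFlow installed in this environment."""
    # Imported lazily so the pure helper works without TensorFlow installed.
    import tensorflow as tf

    return matches_reported_setup(
        tf.__version__, tf.test.is_built_with_cuda(), platform.system()
    )
```

If `check_installed_tensorflow()` returns True, the workaround reported here (TF 1.14 GPU or TF 1.15 CPU) is worth trying before looking into the model config.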

【Comments】:

【Solution 2】:

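Are you sure it is stuck? Do you get any errors? During training, the TF OD API writes logs to an events file under the model directory (you can open it with tensorboard). Look in your model directory for an events file, check its timestamp, and see whether it is still being updated.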
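
The timestamp check can also be scripted. The sketch below assumes the event files follow the usual `events.out.tfevents.*` naming and sit directly in the model directory (subdirectories such as evaluation output are not scanned). The function name `event_file_ages` is hypothetical.

```python
import glob
import os
import time


def event_file_ages(model_dir, now=None):
    """Return (path, seconds since last modification) for each event file
    directly inside model_dir, most recently modified first."""
    now = time.time() if now is None else now
    # glob.escape keeps characters like '[' in the directory name literal.
    pattern = os.path.join(glob.escape(model_dir), "events.out.tfevents.*")
    ages = [(path, now - os.path.getmtime(path)) for path in glob.glob(pattern)]
    return sorted(ages, key=lambda item: item[1])
```

Calling it a few minutes apart on the model directory shows whether the age of the newest file keeps resetting (training is still writing summaries) or only grows (nothing is being written).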

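【Comments】:

Thanks @Tamir Tapuhi for pointing me to tensorboard and the events file. I confirmed that the events file is not being updated. In the tensorboard graphs there is only a single point, at step=0. Any other suggestions? Thanks!
That sounds strange. 1. How much time did you give the process before deciding it was stuck? 2. What is your batch size? 3. Try running htop to check memory and CPU consumption.
I found that the combination of the TF 1.15 GPU version and my setup caused this problem. Downgrading to TF 1.14 solved it. This is a common and still-open issue on Tensorflow: github.com/tensorflow/models/issues/7640 Thank you very much for your help!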