Using a model trained in GCP for inferencing?

【Question title】: Using model trained in GCP for inferencing?
【Posted】: 2020-05-20 21:15:13
【Question】:

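I am new to this topic, so please bear with me.

I have been following this tutorial to train my own segmentation model: ShapeMask on GCP. The training process completed successfully and I got the following output:

Now I am trying to use it in the Colab notebook provided by Google: Colab

But I cannot feed my trained model to it. That notebook needs a SavedModel, and I have had little luck converting my output into one. I am using TF version 1.15.2 on both the VM and the TPU.

There are a few steps between training and inference that I am missing, but I don't know what they are. Any help is much appreciated. Thanks!

So far I have tried using this to convert my files to a SavedModel, and I read through this but did not understand how to use it.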

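【Question comments】:

Could you be more specific about the particular step where something is missing, an error occurs, or the behavior is unexpected? Which part of the process are you stuck on?
Sure. In the Colab notebook I linked, if you go to cell 6, "Load the pretrained model", the tutorial loads the model with the tf.saved_model.loader.load function. I want to supply my trained model output there, but I don't know how.
@AlbertAlbesa Please see my answer below. Any help or further clarification is appreciated, as I am not an expert in this area! Thanks!

Tags: tensorflow google-cloud-platform image-segmentation tpu gcp-ai-platform-training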

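【Solution 1】:

So I was able to save a model from the checkpoint, using the following snippet in a Colab notebook. I had to enable the TPU in the Colab notebook (Runtime > Change runtime type > TPU), probably because I trained on a TPU (otherwise it raised an error).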

import os
import tensorflow.compat.v1 as tf

trained_checkpoint_prefix = '<GC storage bucket path>/model.ckpt-1000'
export_dir = '<GC storage bucket path>'
# Address of the Colab TPU runtime; only set when the TPU runtime type is enabled
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

graph = tf.Graph()
with tf.Session(target=tpu_address, graph=graph) as sess:
    # Restore from checkpoint
    loader = tf.train.import_meta_graph(trained_checkpoint_prefix + '.meta', clear_devices=True)
    loader.restore(sess, trained_checkpoint_prefix)
    # Export checkpoint to SavedModel
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(sess, [tf.saved_model.TRAINING, tf.saved_model.SERVING], strip_default_attrs=True)
    builder.save()

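I say "able to save" because plugging this SavedModel into the Colab tutorial notebook does not work. The notebook reads the model successfully in cell 6, but the inference part fails, right here: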

num_detections, detection_boxes, detection_classes, detection_scores, detection_masks, detection_outer_boxes, image_info = session.run(
['NumDetections:0', 'DetectionBoxes:0', 'DetectionClasses:0', 'DetectionScores:0', 'DetectionMasks:0', 'DetectionOuterBoxes:0', 'ImageInfo:0'],
feed_dict={'Placeholder:0': np_image_string})

The process ends with the following error:

KeyError: "The name 'Placeholder:0' refers to a Tensor which does not exist. The operation, 'Placeholder', does not exist in the graph."

It cannot find any of the other tensor names either. I am not sure what causes this; I will update the answer once I figure it out!
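One way to narrow down a KeyError like this is to compare the tensor names the notebook requests against the operations that actually exist in the loaded graph. The following is only a minimal sketch: the helper names are hypothetical, and the op list is made-up sample data standing in for what sess.graph.get_operations() would return in a live session.

```python
def split_tensor_name(name):
    """Split a tensor name such as 'Placeholder:0' into ('Placeholder', 0)."""
    op_name, _, index = name.partition(':')
    return op_name, int(index) if index else 0


def find_missing_tensors(op_names, tensor_names):
    """Return the requested tensor names whose producing op is not in the graph."""
    available = set(op_names)
    return [t for t in tensor_names if split_tensor_name(t)[0] not in available]


def find_placeholders(ops):
    """Return the names of all ops of type 'Placeholder' from (name, type) pairs."""
    return [name for name, op_type in ops if op_type == 'Placeholder']


# Assumption: in a real session this list would be built as
#   ops = [(op.name, op.type) for op in sess.graph.get_operations()]
# The entries below are invented purely for illustration.
ops = [('input_image', 'Placeholder'), ('NumDetections', 'Identity')]
wanted = ['Placeholder:0', 'NumDetections:0']

print(find_missing_tensors([name for name, _ in ops], wanted))
print(find_placeholders(ops))
```

If the missing list contains every name the notebook asks for, the exported graph most likely is not the serving graph the notebook expects, and the placeholder list shows which input names the graph really exposes.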

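EDIT 1:

I solved this using the following readme.

First, I used TF 2.2 and the master branch of the TPU repo instead of the shapemask branch. Then I followed the exact steps of the original tutorial for training, and exported the SavedModel with the following command: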

python ~/tpu/models/official/detection/export_saved_model.py \
--export_dir="${EXPORT_DIR?}" \
--checkpoint_path="${CHECKPOINT_PATH?}" \
--params_override="${PARAMS_OVERRIDE?}" \
--batch_size=${BATCH_SIZE?} \
--input_type="${INPUT_TYPE?}" \
--input_name="${INPUT_NAME?}"

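Here the params_override flag should be pointed at the params.yaml file created during training. The batch size is set to 1 to process one image at a time. More details can be found in the readme.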
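When the export has to be launched from a notebook or a script, the same command can be assembled as an argument list. This is a rough sketch rather than part of the TPU repo: build_export_command is a hypothetical helper, every value in angle brackets is a placeholder to fill in, and only the flag names and the batch size of 1 come from the command above.

```python
import os
import shlex


def build_export_command(export_dir, checkpoint_path, params_yaml,
                         input_type, input_name, batch_size=1,
                         script='~/tpu/models/official/detection/export_saved_model.py'):
    """Assemble the export_saved_model.py invocation as an argument list."""
    return [
        'python', os.path.expanduser(script),
        '--export_dir=' + export_dir,
        '--checkpoint_path=' + checkpoint_path,
        # Points at the params.yaml file written during training
        '--params_override=' + params_yaml,
        # 1 means one image is processed per inference call
        '--batch_size=' + str(batch_size),
        '--input_type=' + input_type,
        '--input_name=' + input_name,
    ]


# All values below are placeholders, not real paths or settings.
cmd = build_export_command(
    export_dir='<export dir>',
    checkpoint_path='<GC storage bucket path>/model.ckpt-1000',
    params_yaml='<model dir>/params.yaml',
    input_type='<input type>',
    input_name='<input name>',
)

# Print a shell-safe version of the command for inspection
print(' '.join(shlex.quote(arg) for arg in cmd))
```

Passing the list to subprocess.run(cmd, check=True) would then run the export; building a list instead of a single string avoids shell-quoting problems with bucket paths.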

Note: I had to comment out the following line to get it to run:

from serving import segmentation

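It exported the model, and I was able to load and use it in the Colab notebook with only a few minor tweaks to the notebook.

【Comments】:

What is the output of sess.graph.get_operations()?
@AlbertAlbesa I'm not sure whether I can share that data, but I have fixed the error now and updated the answer. Thanks for your help!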