【Question title】: TensorFlow Serving returns NaN when predicting
【Posted】: 2020-06-18 01:15:33
【Question】:

I trained a GAN model and saved the generator with the following call:

    tf.keras.models.save_model(
        generator,
        filepath=os.path.join(MODEL_PATH, 'model_saver'),
        overwrite=True,
        include_optimizer=False,
        save_format=None,
        options=None
    )

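Predictions succeed when I load the model in Python with tf.keras.models.load_model. However, when I serve the same model with the TensorFlow model server, it returns NaN values. I serve the model as follows: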

zhaocc:~/products/tensorflow_server$ sudo docker run -t --rm -p 8502:8501     -v "/tmp/pix2pix/sketch_photo/model_saver:/models/photo2sketch"     -e MODEL_NAME=photo2sketch     tensorflow/serving &
[3] 30089
zhaocc:~/products/tensorflow_server$ 2020-06-17 12:57:31.745339: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config:  model_name: photo2sketch model_base_path: /models/photo2sketch
2020-06-17 12:57:31.745448: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-06-17 12:57:31.745459: I tensorflow_serving/model_servers/server_core.cc:575]  (Re-)adding model: photo2sketch
2020-06-17 12:57:31.846162: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: photo2sketch version: 1}
2020-06-17 12:57:31.846213: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846233: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: photo2sketch version: 1}
2020-06-17 12:57:31.846282: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/photo2sketch/1
2020-06-17 12:57:31.874158: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-06-17 12:57:31.874182: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:295] Reading SavedModel debug info (if present) from: /models/photo2sketch/1
2020-06-17 12:57:31.874315: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-17 12:57:31.952982: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Restoring SavedModel bundle.
2020-06-17 12:57:32.172641: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/photo2sketch/1
2020-06-17 12:57:32.248514: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:364] SavedModel load for tags { serve }; Status: success: OK. Took 402236 microseconds.
2020-06-17 12:57:32.256576: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /models/photo2sketch/1/assets.extra/tf_serving_warmup_requests
2020-06-17 12:57:32.265064: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: photo2sketch version: 1}
2020-06-17 12:57:32.267113: I tensorflow_serving/model_servers/server.cc:355] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-06-17 12:57:32.269289: I tensorflow_serving/model_servers/server.cc:375] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...

When I predict through a REST request, it returns NaNs with the correct shape:

[[[[nan nan nan]
   [nan nan nan]
   [nan nan nan]
   ...
   [nan nan nan]
   [nan nan nan]
   [nan nan nan]]

Does anyone know why? How can I debug this? Thanks a lot!
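A first debugging step could be to measure how much of the response is actually NaN, rather than eyeballing the truncated printout. The following is a minimal sketch, assuming the response uses the row format of the REST predict API (a top-level "predictions" key). The names `iter_leaves`, `nan_report`, and the `sample` body are hypothetical and only for illustration; the sample is not real server output.

```python
import json
import math


def iter_leaves(value):
    """Yield every scalar inside an arbitrarily nested list."""
    if isinstance(value, list):
        for item in value:
            yield from iter_leaves(item)
    else:
        yield value


def nan_report(response_text):
    """Return (number of NaN entries, total entries) for a predict response."""
    # Python's json module accepts the bare NaN token by default.
    body = json.loads(response_text)
    leaves = list(iter_leaves(body["predictions"]))
    nan_count = sum(1 for v in leaves if isinstance(v, float) and math.isnan(v))
    return nan_count, len(leaves)


# Hypothetical response body, shortened to a 1x1x2x3 tensor for illustration.
sample = '{"predictions": [[[[NaN, NaN, NaN], [0.5, NaN, -1.0]]]]}'
print(nan_report(sample))
```

If every entry is NaN, the problem is more likely in the exported graph itself than in a few bad input pixels.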

【Comments】:

  • I tried serving another model (an LSTM for classification), and it works fine. Hmm, so maybe the generator model itself has a bug?

Tags: tensorflow tensorflow-serving


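【Solution 1】:

I had the same problem with my Pix2Pix generator. The cause was the training argument. As explained in "What does `training=True` mean when calling a TensorFlow Keras model?", this argument changes the output of the network. One possible fix is to remove all dropout layers (and any other layers affected by the argument) before saving the network. That did not work for me (I probably missed something). So, as a temporary workaround, I added two signatures to the model: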

@tf.function(input_signature=[tf.TensorSpec([None, 256,256,3], dtype=tf.float32)])
def model_predict1(input_batch):
  return {'outputs': generator(input_batch, training=True)}

@tf.function(input_signature=[tf.TensorSpec([None, 256,256,3], dtype=tf.float32)])
def model_predict2(input_batch):
  return {'outputs': generator(input_batch, training=False)}
...
generator.save(base_path + "kerassave",signatures={'predict1': model_predict1, 'predict2': model_predict2})

predict2 still always returns NaNs, but predict1 works.
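With two signatures exported, a REST request has to name the one it wants through the "signature_name" field. The sketch below only builds the request and does not send it. The host, port 8502, and model name photo2sketch are assumptions borrowed from the question, `build_predict_request` is a hypothetical helper, and the all-zero image is a placeholder input.

```python
import json
import urllib.request


def build_predict_request(host, port, model_name, signature_name, instances):
    """Build (without sending) a TensorFlow Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    payload = {"signature_name": signature_name, "instances": instances}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder input: one 256x256 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(256)] for _ in range(256)]
request = build_predict_request("localhost", 8502, "photo2sketch", "predict1", [image])
print(request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     print(response.read()[:200])
```

Sending the same input once with "predict1" and once with "predict2" gives a direct comparison of the two modes on the server.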

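【Comments】:

  • Thank you very much for the answer! I will give it a try!
  • How does training=True affect the model's predictions? As I understand it, dropout randomly drops nodes/branches during training but not during prediction. Does that mean that with training=True at inference time, we may get different results when passing in the same image? Cheers!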
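On the last comment: if the generator contains dropout layers, then calling it with training=True can indeed give different outputs for the same image. Below is a minimal pure-Python sketch of inverted dropout, not the Keras implementation. The `dropout` function and its parameters are hypothetical names chosen for illustration.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout over a flat list of floats."""
    if not training:
        # Inference mode: the layer passes its input through unchanged.
        return list(values)
    # Training mode: zero each value with probability `rate`,
    # and scale the survivors so the expected sum stays the same.
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random()
x = [1.0, 2.0, 3.0, 4.0]
print(dropout(x, rate=0.5, training=False, rng=rng))  # always equal to x
print(dropout(x, rate=0.5, training=True, rng=rng))   # random mask, may differ per call
print(dropout(x, rate=0.5, training=True, rng=rng))
```

Layers such as BatchNormalization are affected by the same flag: in training mode they normalize with the statistics of the current batch, and in inference mode with the moving averages stored in the model.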