How to control when to compute evaluation vs training using the Estimator API of tensorflow?

【Question title】: How to control when to compute evaluation vs training using the Estimator API of tensorflow?
【Posted】: 2018-09-12 04:26:22
【Question description】:

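As stated in this question:

the tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggests using Experiment (which is deprecated according to this README).

Everything I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (training and evaluation). I have tried the following: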

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config = tf.estimator.RunConfig(
        save_checkpoints_steps = 2000,
        save_summary_steps = 100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)    

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

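Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data.

However, I get the following logs: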

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

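From the logs, it seems that training stops after the first evaluation step. What am I missing from the documentation? Could you explain how I should implement what I think my code is doing?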

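Additional info: I am running everything on the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs × 50,000 / batch_size = 100 × 50,000 / 70 ≈ 71,000 steps.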
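As a quick check of that estimate, the arithmetic can be worked through with the values from the code above (100 epochs, batch size 70, max_steps=125) and the 50,000 training images mentioned here. The variable names are only illustrative, and the calculation assumes every batch is full except possibly the last one.

```python
import math

num_examples = 50_000   # MNIST training images, as stated in the question
batch_size = 70         # from train_input_fn
num_epochs = 100        # from train_input_fn
max_steps = 125         # from TrainSpec

# Steps needed to consume every epoch of the training data.
steps_for_all_epochs = math.ceil(num_epochs * num_examples / batch_size)

# Fraction of a single epoch actually seen when training stops at max_steps.
epochs_seen_at_max_steps = max_steps * batch_size / num_examples

print(steps_for_all_epochs)       # 71429
print(epochs_seen_at_max_steps)   # 0.175
```

So with max_steps=125 the training loop ends long before even one epoch has been consumed.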

Thank you very much for your help!

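EDIT: After running experiments, I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing metrics on the test set. Reading tf.estimator.Estimator.train, I see that it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.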
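One possible way around the missing steps argument, offered here as a sketch rather than an official recipe, is to skip train_and_evaluate and alternate Estimator.train(steps=...) with Estimator.evaluate(...) in a plain loop. The helper name train_with_periodic_eval and its parameters are hypothetical; the only assumption about the estimator is that it exposes train(input_fn, steps) and evaluate(input_fn, steps) the way tf.estimator.Estimator does, and that train writes a checkpoint when it returns so that evaluate can restore it.

```python
def train_with_periodic_eval(estimator, train_input_fn, eval_input_fn,
                             total_steps, eval_every, eval_steps=None):
    """Alternate training and evaluation in fixed step increments.

    Returns a list of (steps_trained_so_far, metrics) pairs, one per evaluation.
    """
    if total_steps <= 0 or eval_every <= 0:
        raise ValueError("total_steps and eval_every must be positive")

    history = []
    steps_done = 0
    while steps_done < total_steps:
        # Never train past the requested total.
        chunk = min(eval_every, total_steps - steps_done)

        # `steps` is incremental: each call continues from the latest
        # checkpoint in model_dir and trains `chunk` more steps.
        estimator.train(input_fn=train_input_fn, steps=chunk)
        steps_done += chunk

        # Evaluate the checkpoint that the train() call just produced.
        metrics = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
        history.append((steps_done, metrics))
    return history


# Hypothetical usage with the objects defined in the question:
# history = train_with_periodic_eval(
#     estimator, train_input_fn, eval_input_fn,
#     total_steps=2000, eval_every=150, eval_steps=30)
```

Keep in mind that each train call rebuilds the graph and restarts the input pipeline, so very small values of eval_every add noticeable overhead, and training also stops early if the training input_fn runs out of data.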

【Question comments】:

Tags: python tensorflow

【Solution 1】:

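In fact, the estimator switches from the training phase to the evaluation phase every 200 seconds, or when your training finishes.

However, we can see in your code that you were able to reach 125 steps before the evaluation started, which means your training had completed. max_steps is the number of training steps to run before stopping, and it has no connection with the number of epochs (since the epoch count is not used by tf.estimator.train_and_evaluate). During your training, your evaluation metrics will appear every throttle_secs (= 200 here).

Regarding the metrics, here is what you can add to your model: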

import tensorflow as tf  # TF 1.x API

# This fragment belongs inside model_fn, where logits, labels and mode are defined.
predict = tf.nn.softmax(logits, name="softmax_tensor")
classes = tf.cast(tf.argmax(predict, 1), tf.uint8)

def conv_model_eval_metrics(classes, labels, mode):
    if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
        # Pass labels and predictions by keyword: the tf.metrics functions take
        # labels first, and swapping them would exchange precision and recall.
        # Note that tf.metrics.precision and tf.metrics.recall cast their inputs to bool.
        return {
            'accuracy': tf.metrics.accuracy(labels=labels, predictions=classes),
            'precision': tf.metrics.precision(labels=labels, predictions=classes),
            'recall': tf.metrics.recall(labels=labels, predictions=classes),
        }
    else:
        return None

eval_metrics = conv_model_eval_metrics(classes, labels, mode)
if eval_metrics is not None:  # no metrics are built in PREDICT mode
    with tf.variable_scope("performance_metrics"):
        # Accuracy: ratio of correctly predicted observations to all observations.
        tf.summary.scalar('accuracy', eval_metrics['accuracy'][1])

        # Precision (how many selected items are relevant): ratio of correctly
        # predicted positives to all predicted positives.
        tf.summary.scalar('precision', eval_metrics['precision'][1])

        # Recall (how many relevant items are selected): ratio of correctly
        # predicted positives to all observations in the actual class.
        tf.summary.scalar('recall', eval_metrics['recall'][1])

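This works very well for tracking precision, recall and accuracy in TensorBoard during both training and evaluation.

PS: Sorry, this is my first answer, which is why it is awful to read ^^

【Comments】:

Thank you for your answer! Although it is useful, it does not answer the question. I will post what I believe is the answer, based on some experiments I ran.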

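【Solution 2】:

The number of repetitions can be controlled by setting tf.data.Dataset.repeat(num_epochs) inside input_fn(). The training function runs until the number of epochs has been consumed, then the evaluation function runs, then the training function runs again for that number of epochs, and so on; in the end, the train_and_evaluate method stops when the max_steps defined in TrainSpec is reached.

This is what I concluded from some experiments; corrections are welcome.

【Comments】: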

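【Solution 3】:

As far as I understand, evaluation is performed on a model restored from the latest checkpoint. In your case, you do not save a checkpoint until 2000 steps. You also specify max_steps=125, which takes precedence over the dataset you feed to the model.

So even though you specify a batch size of 70 and 100 epochs, your model stops training at 125 steps, which is far below the 2000-step checkpoint interval. This in turn limits evaluation, because evaluation depends on a checkpointed model.

Note that by default an evaluation is triggered every time a checkpoint is saved, provided you have not set a throttle_secs limit.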
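If evaluation is indeed tied to checkpoints as described above, then the evaluation interval can be expressed in steps by aligning save_checkpoints_steps with it and letting max_steps cover the whole run. The sketch below only assembles the keyword arguments; the helper name periodic_eval_settings is hypothetical, and it assumes a TF 1.x version in which local train_and_evaluate evaluates on each new checkpoint and in which a throttle_secs of 0 does not suppress any of those evaluations.

```python
def periodic_eval_settings(total_train_steps, eval_every_steps,
                           eval_batches, throttle_secs=0):
    """Build keyword arguments for RunConfig, TrainSpec and EvalSpec.

    The idea: a checkpoint every `eval_every_steps` steps gives the evaluator
    something new to evaluate at that interval, while `max_steps` bounds the
    whole training run instead of the time until the first evaluation.
    """
    if total_train_steps <= 0 or eval_every_steps <= 0 or eval_batches <= 0:
        raise ValueError("step and batch counts must be positive")
    if throttle_secs < 0:
        raise ValueError("throttle_secs must be non-negative")
    return {
        "run_config": {"save_checkpoints_steps": eval_every_steps},
        "train_spec": {"max_steps": total_train_steps},
        "eval_spec": {"steps": eval_batches, "throttle_secs": throttle_secs},
    }


# Hypothetical usage with the names from the question:
# settings = periodic_eval_settings(
#     total_train_steps=71429, eval_every_steps=2000, eval_batches=30)
# config = tf.estimator.RunConfig(**settings["run_config"])
# train_spec = tf.estimator.TrainSpec(train_input_fn, **settings["train_spec"])
# eval_spec = tf.estimator.EvalSpec(eval_input_fn, **settings["eval_spec"])
```

Whether each checkpoint really produces an evaluation depends on the TensorFlow version and on how long an evaluation takes, so it is worth confirming against the logs of a short run.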

【Comments】: