Why does tf.keras model.fit() take so long to initialize? How can it be optimized?

【Question title】: Why does tf.keras model.fit() take so long to initialize? How can it be optimized?
【Posted】: 2019-04-20 04:59:48
【Question】:

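Using tensorflow.keras (2.0-alpha0 with GPU support), I am seeing extremely long initialization times for tf.keras.model.fit(), both on a newly compiled model and on models that were previously saved and reloaded.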

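I believe this happens after the tf.data.Datasets() have already been loaded and preprocessed, so I don't understand what is taking so long, and TF/Keras prints nothing during that time: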

2019-04-19 23:29:18.109067: tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device
Resizing images and creating data sets with num_parallel_calls=8
Loading existing model to continue training.
Starting model.fit()
Epoch 1/100
2019-04-19 23:32:22.934394: tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Shuffle buffer filled.
2019-04-19 23:38:52.374924: tensorflow/core/common_runtime/bfc_allocator.cc:230] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.62GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

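Three minutes to load the model and fill the shuffle buffer, and then six minutes for... what? And how can this mystery work be optimized? (5 GHz 8700K, 32 GB RAM, NVMe SSD, 1080ti 11G DDR5. Task Manager shows 100% single-thread CPU usage, moderate disk access, RAM usage slowly growing to a maximum of ~28 GB, and zero GPU usage during this period.)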

Is there a way to serialize or store models more efficiently, so that they can be started and stopped regularly without 10 minutes of overhead?

Is TF/Keras somehow lazily loading and preprocessing the datasets during this time?

【Comments】:

Tags: python-3.x tensorflow tf.keras

【Answer 1】:

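This looks like a problem with using multiple workers for the tf.data.Datasets(). As the log message shows, you are running 8 parallel calls (num_parallel_calls=8), which could explain why your CPU/RAM usage is so high. So the model itself is not the problem.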

As far as I know, the first pass over a Dataset is expected to be slow, but it gets faster once the data has been cached.
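The caching behaviour mentioned above can be illustrated with a small sketch. It assumes TensorFlow 2.x with eager execution; the `py_preprocess` function and the toy `range(5)` dataset are hypothetical stand-ins for the asker's image resizing, not code from the question. It shows that building the pipeline does no preprocessing work (the work happens when the dataset is first iterated), and that with `.cache()` a second pass does not re-run the preprocessing.

```python
import tensorflow as tf

calls = []  # records every element the Python-side preprocessing sees


def py_preprocess(x):
    # Hypothetical stand-in for expensive work such as image resizing.
    calls.append(int(x))
    return x * 2


def preprocess(x):
    # tf.py_function runs the Python function once per element at iteration time.
    return tf.py_function(py_preprocess, [x], tf.int64)


# Building the pipeline only describes the work; nothing is preprocessed yet.
ds = tf.data.Dataset.range(5).map(preprocess).cache()
calls_after_build = len(calls)

# The first full pass runs the preprocessing and fills the in-memory cache.
first_pass = [int(v) for v in ds]
calls_after_first = len(calls)

# The second pass is served from the cache, so preprocessing is not repeated.
second_pass = [int(v) for v in ds]
calls_after_second = len(calls)

print(calls_after_build, calls_after_first, calls_after_second)
```

Note that an in-memory cache holds the whole preprocessed dataset in RAM, so this only helps if the data fits; `cache()` also accepts a filename to cache on disk instead.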

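If the model.fit() call is still extremely slow to start, you could reduce the number of parallel calls to 4 or 2. This may affect your training time, since with fewer workers the data will probably be loaded from your SSD more slowly.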
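One way to check whether the worker count matters is to time how long the pipeline takes to deliver its first batch for several `num_parallel_calls` values. The following is a minimal sketch, assuming TensorFlow 2.x; `build_pipeline`, `time_first_batch`, the toy data, and the buffer and batch sizes are all hypothetical and would need to be replaced by the real loading and resizing code. Because the asker's log shows the shuffle buffer taking about three minutes to fill, the sketch also exposes the shuffle buffer size as a parameter worth experimenting with.

```python
import time

import tensorflow as tf


def build_pipeline(num_parallel_calls, n=64, shuffle_buffer=32, batch_size=8):
    """Build a toy input pipeline; the map step stands in for image resizing."""
    ds = tf.data.Dataset.range(n)
    ds = ds.map(
        lambda x: tf.cast(x, tf.float32) / 255.0,
        num_parallel_calls=num_parallel_calls,
    )
    # A larger buffer shuffles better but must be filled before the first batch.
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.batch(batch_size)
    # Prepare one batch ahead while the consumer is busy with the current one.
    return ds.prefetch(1)


def time_first_batch(ds):
    """Return (seconds until the first batch arrives, the first batch)."""
    start = time.perf_counter()
    batch = next(iter(ds))
    return time.perf_counter() - start, batch


for workers in (2, 4, 8):
    elapsed, batch = time_first_batch(build_pipeline(workers))
    print(f"num_parallel_calls={workers}: first batch after {elapsed:.4f}s")
```

On toy data the timings will be nearly identical; the comparison only becomes meaningful with the real dataset and preprocessing in place.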

【Comments】: