【Question title】: Caffe: resuming from a trained snapshot fails with an error
【Posted】: 2016-07-17 07:08:27
【Question description】:

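I trained my network with a particular configuration and saved a snapshot of it.
Now I am trying to resume from the last snapshot, but it fails with the following error message: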

I0328 13:44:30.756110 24238 net.cpp:283] Network initialization done.
I0328 13:44:30.756206 24238 solver.cpp:60] Solver scaffolding done.
I0328 13:44:30.757062 24238 caffe.cpp:209] Resuming from /media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5
HDF5-DIAG: Error detected in HDF5 (1.8.15-patch1) thread 0:
  #000: H5D.c line 358 in H5Dopen2(): not found
    major: Dataset
    minor: Object not found
  #001: H5Gloc.c line 430 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #002: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #004: H5Gloc.c line 385 in H5G_loc_find_cb(): object 'iter' doesn't exist
    major: Symbol table
    minor: Object not found
F0328 13:44:30.786376 24238 hdf5.cpp:153] Check failed: status >= 0 (-1 vs. 0) Failed to load int dataset with name iter
*** Check failure stack trace: ***
    @     0x7f2d6e635daa  (unknown)
    @     0x7f2d6e635ce4  (unknown)
    @     0x7f2d6e6356e6  (unknown)
    @     0x7f2d6e638687  (unknown)
    @     0x7f2d6ed74acd  caffe::hdf5_load_int()
    @     0x7f2d6ed678d0  caffe::SGDSolver<>::RestoreSolverStateFromHDF5()
    @     0x7f2d6ed4bf19  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7f2d6d943ec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

This is how I am trying to resume it:

#!/usr/bin/env sh

TOOLS=./build/tools

$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_bn_lr2.prototxt \
    --snapshot=/media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5

I then gave up on that and tried BINARYPROTO instead of HDF5, but got this error:

I0328 16:35:34.721277 27243 net.cpp:283] Network initialization done.
I0328 16:35:34.721369 27243 solver.cpp:60] Solver scaffolding done.
I0328 16:35:34.722338 27243 caffe.cpp:209] Resuming from /media/hossein/tmpstore/caffe_new/examples/cifar10_full_relu_bn_iter_60000.caffemodel
F0328 16:35:39.143900 27243 sgd_solver.cpp:316] Check failed: state.history_size() == history_.size() (0 vs. 28) Incorrect length of history blobs.
*** Check failure stack trace: ***
    @     0x7fd1c2cbbdaa  (unknown)
    @     0x7fd1c2cbbce4  (unknown)
    @     0x7fd1c2cbb6e6  (unknown)
    @     0x7fd1c2cbe687  (unknown)
    @     0x7fd1c33ef097  caffe::SGDSolver<>::RestoreSolverStateFromBinaryProto()
    @     0x7fd1c33d1ed3  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7fd1c1fc9ec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

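When I try this at different times with different models, the numbers in the history check change (e.g. 58 vs. 28, 32 vs. 28, and so on). The overall error is the same; only the numbers differ.

What should I do? This is driving me crazy!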

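【Comments】:

  • About the hdf5 format: I ran into the same problem and went back to binaryproto. I suspect exporting/importing hdf5 weights still needs some work.
  • What about the second part, with the binaryproto format? I can't figure it out :-/
  • Sorry, I haven't run into that one...
  • @Shai: Thanks, man ;)
  • One cause I found is that Caffe has a bug with the Adam and AdaDelta solver types. I hit this error whenever I use the AdaDelta solver.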

Tags: snapshot caffe pre-trained-model resuming-training


【Solution 1】:

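As the value of the --snapshot argument, you must pass the .solverstate.h5 file, not the .caffemodel.h5 file (for the binaryproto format, that means .solverstate rather than .caffemodel). The .caffemodel file holds only the network weights, while the solver state (the iteration counter and the history blobs) is saved in the .solverstate file, which is why the restore fails on `iter` and on the history length.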
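
A minimal sketch of the corrected resume script follows. It assumes the solver state was written next to the weights under the same prefix (e.g. `..._iter_60000.solverstate.h5` beside `..._iter_60000.caffemodel.h5`); `solverstate_for` is a hypothetical helper introduced here for illustration and is not part of Caffe.

```sh
#!/usr/bin/env sh

# Hypothetical helper (not part of Caffe): map a weights path to the
# solver-state path expected beside it, assuming both share a prefix.
solverstate_for() {
    case "$1" in
        *.caffemodel.h5) printf '%s\n' "${1%.caffemodel.h5}.solverstate.h5" ;;
        *.caffemodel)    printf '%s\n' "${1%.caffemodel}.solverstate" ;;
        *)               printf '%s\n' "$1" ;;  # leave any other path unchanged
    esac
}

TOOLS=./build/tools
WEIGHTS=/media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5
STATE=$(solverstate_for "$WEIGHTS")

# Resume only if the solver state is actually present (assumed location).
if [ -f "$STATE" ]; then
    $TOOLS/caffe train \
        --solver=examples/cifar10/cifar10_full_solver_bn_lr2.prototxt \
        --snapshot="$STATE"
else
    echo "solver state not found: $STATE" >&2
fi
```

The existence check is only a convenience: it reports a missing file up front instead of letting the restore abort inside Caffe.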
