【Question title】: Training Job Running on Google Cloud Platform but Not Consuming Any CPU
【Posted】: 2020-09-24 06:17:22
【Question】:

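My training job on AI Platform on Google Cloud Platform appears to be running but is not consuming any CPU. The program does not terminate, but some errors do appear when the job first starts running. They look like this: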

INFO    2020-06-05 04:33:38 +0000       master-replica-0                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:38 +0000       master-replica-0                I0605 04:33:38.890919 139686838036224 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:41 +0000       worker-replica-0                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-0                I0605 04:33:41.006648 140712303798016 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       worker-replica-4                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-4                I0605 04:33:41.482944 139947128342272 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       worker-replica-2                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       worker-replica-2                I0605 04:33:41.927765 140284058486528 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:41 +0000       master-replica-0                Graph was finalized.
ERROR   2020-06-05 04:33:41 +0000       master-replica-0                I0605 04:33:41.995326 139686838036224 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:42 +0000       master-replica-0                Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
ERROR   2020-06-05 04:33:42 +0000       master-replica-0                I0605 04:33:42.216852 139686838036224 saver.py:1284] Restoring parameters from gs://lasertagger_v1/output/models/wikisplit_experiment_name_2/model.ckpt-0
INFO    2020-06-05 04:33:43 +0000       worker-replica-3                Done calling model_fn.
ERROR   2020-06-05 04:33:43 +0000       worker-replica-3                I0605 04:33:43.411592 140653000845056 estimator.py:1150] Done calling model_fn.
INFO    2020-06-05 04:33:43 +0000       worker-replica-3                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:43 +0000       worker-replica-3                I0605 04:33:43.413079 140653000845056 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:44 +0000       worker-replica-1                Done calling model_fn.
ERROR   2020-06-05 04:33:44 +0000       worker-replica-1                I0605 04:33:44.139685 140410730743552 estimator.py:1150] Done calling model_fn.
INFO    2020-06-05 04:33:44 +0000       worker-replica-1                Create CheckpointSaverHook.
ERROR   2020-06-05 04:33:44 +0000       worker-replica-1                I0605 04:33:44.141169 140410730743552 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO    2020-06-05 04:33:47 +0000       worker-replica-1                Graph was finalized.
ERROR   2020-06-05 04:33:47 +0000       worker-replica-1                I0605 04:33:47.280014 140410730743552 monitored_session.py:240] Graph was finalized.
INFO    2020-06-05 04:33:47 +0000       worker-replica-3                Graph was finalized.
ERROR   2020-06-05 04:33:47 +0000       worker-replica-3                I0605 04:33:47.335122 140653000845056 monitored_session.py:240] Graph was finalized.

Every INFO message is followed by an ERROR message, and I am confused about what is happening with this training job. Thanks!

Here is a more detailed view of one of the error entries:

2020-06-05 13:12:50.583 EDT
worker-replica-4
I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
{
 insertId: "o5flw8f1urq2q"  
 jsonPayload: {
  created: 1591377170.5835383   
  levelname: "ERROR"   
  lineno: 328   
  message: "I0605 17:12:50.583258 140104498276096 basic_session_run_hooks.py:541] Create CheckpointSaverHook."   
  pathname: "/runcloudml.py"   
 }
 labels: {
  compute.googleapis.com/resource_id: "2069730006064940177"   
  compute.googleapis.com/resource_name: "gke-cml-0605-170056-7fb-n1-highmem-96-9990517e-rvlx"   
  compute.googleapis.com/zone: "us-east1-c"   
  ml.googleapis.com/job_id/log_area: "root"   
  ml.googleapis.com/trial_id: ""   
 }
 logName: "projects/smart-content-summary/logs/worker-replica-4"  
 receiveTimestamp: "2020-06-05T17:13:00.962017815Z"  
 resource: {
  labels: {…}   
  type: "ml_job"   
 }
 severity: "ERROR"  
 timestamp: "2020-06-05T17:12:50.583538292Z"  
}
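
One way to read entries like the one above: the `message` field starts with `I0605`, which in the absl/glog line format means an INFO-level message logged on June 5, even though the entry's `severity` is `ERROR`. That is consistent with the logger writing to stderr and the platform tagging stderr output as ERROR, though that is an assumption about the platform and not something the logs confirm. A minimal Python sketch for recovering the severity the program itself assigned (the `real_severity` helper is a hypothetical name, not part of any Google library):

```python
import re

# absl/glog line prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
ABSL_PREFIX = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<text>.*)$"
)

LEVELS = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def real_severity(message):
    """Return the severity encoded in an absl-style prefix, or None if absent."""
    match = ABSL_PREFIX.match(message)
    if match is None:
        return None
    return LEVELS[match.group("level")]


if __name__ == "__main__":
    sample = (
        "I0605 17:12:50.583258 140104498276096 "
        "basic_session_run_hooks.py:541] Create CheckpointSaverHook."
    )
    print(real_severity(sample))
```

If every ERROR entry decodes to INFO this way, the entries are probably duplicates of the INFO lines and not real failures, and the stall would need to be diagnosed from other signals.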

【Comments】:

    Tags: google-cloud-platform google-cloud-ai


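    【Answer 1】:

    I strongly suspect the problem occurs while the model is being saved. The likely cause is one of:

    1. Running out of memory
    2. Running out of disk space

    Could you share monitoring metrics for both, or consider:

    1. Increasing the machine memory
    2. Increasing the root partition size?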
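
If the platform's monitoring charts are not at hand, the two suspects above can also be checked from inside the training process. The following is a minimal sketch, assuming a Linux worker where `/proc/meminfo` is readable; `parse_meminfo` and `resource_snapshot` are hypothetical helper names, and the call would have to be added to the training code by hand (for example just before a checkpoint is written).

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def resource_snapshot(path="/", meminfo_path="/proc/meminfo"):
    """Report disk space for `path` and, where available, free memory."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "disk_total_gb": disk.total / 1e9,
        "disk_free_gb": disk.free / 1e9,
    }
    # /proc/meminfo only exists on Linux; skip the memory figure elsewhere.
    if os.path.exists(meminfo_path):
        with open(meminfo_path) as handle:
            mem = parse_meminfo(handle.read())
        if "MemAvailable" in mem:
            snapshot["mem_available_gb"] = mem["MemAvailable"] * 1024 / 1e9
    return snapshot


if __name__ == "__main__":
    print(resource_snapshot())
```

A free-disk or available-memory figure close to zero around checkpoint time would support the diagnosis above; comfortable headroom on both would point elsewhere.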
