[Question title]: Amazon SageMaker training job using a prebuilt Docker image
[Posted]: 2020-04-27 12:38:23
[Question]:

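Hello, I am new to AWS SageMaker and I am trying to deploy a custom time-series model on SageMaker. I built a Docker image from the SageMaker terminal, but when I try to create a training job it fails with an error. I have been struggling with this for the past four days; can anyone please help me? Here is my code: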

lstm = sage.estimator.Estimator(image,
                       role, 1, 'ml.m4.xlarge',
                       output_path='s3://' + s3Bucket,
                       sagemaker_session=sess)

lstm.fit(upload_data)

Here is my error. I have attached a policy granting full ECR access to the SageMaker IAM role, and the account is in the same region.

ClientError                               Traceback (most recent call last)
<ipython-input-48-1d7f3ff70f18> in <module>()
      4                        sagemaker_session=sess)
      5 
----> 6 lstm.fit(upload_data)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name, experiment_config)
    472         self._prepare_for_training(job_name=job_name)
    473 
--> 474         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    475         self.jobs.append(self.latest_training_job)
    476         if wait:

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs, experiment_config)
   1036             train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
   1037 
-> 1038         estimator.sagemaker_session.train(**train_args)
   1039 
   1040         return cls(estimator.sagemaker_session, estimator._current_job_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
    588         LOGGER.info("Creating training-job with name: %s", job_name)
    589         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 590         self.sagemaker_client.create_training_job(**train_request)
    591 
    592     def process(

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Cannot find repository: sagemaker-model in registry ID: 534860077983 Please check if your ECR repository exists and role arn:aws:iam::534860077983:role/service-role/AmazonSageMaker-ExecutionRole-20190508T215284 has proper pull permissions for SageMaker: ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer

[Question discussion]:

    Tags: amazon-web-services amazon-sagemaker


    [Answer 1]:

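    TL;DR: It looks like you are not passing the correct ECR image repository to the SageMaker estimator. The error says SageMaker cannot find a repository named sagemaker-model in registry 534860077983, so perhaps the repository does not exist?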
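One way to test the "repository does not exist" theory is to split the image URI passed to the estimator into its parts and ask ECR whether that repository is there. The following is a minimal sketch, not the asker's code: the helper names `parse_ecr_image_uri` and `repository_exists` are hypothetical, the region `us-east-1` in the usage comment is a placeholder (the question does not state the region), and the pattern only handles the common `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` form (no digests, no `.com.cn` endpoints). The ECR client is passed in so the check can be exercised without AWS credentials.

```python
import re

# Common form of a private ECR image URI (assumption: tag form only):
#   <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError("not a recognised ECR image URI: %r" % (uri,))
    parts = match.groupdict()
    # Docker falls back to the "latest" tag when none is given.
    parts["tag"] = parts["tag"] or "latest"
    return parts


def repository_exists(ecr_client, account, repository):
    """Return True if ECR reports the repository in the given registry."""
    try:
        ecr_client.describe_repositories(
            registryId=account, repositoryNames=[repository]
        )
    except ecr_client.exceptions.RepositoryNotFoundException:
        return False
    return True


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   parts = parse_ecr_image_uri(image)   # `image` as passed to the Estimator
#   ecr = boto3.client("ecr", region_name=parts["region"])
#   print(repository_exists(ecr, parts["account"], parts["repository"]))
```

If this returns False, the repository name or region embedded in the URI does not match what was actually pushed, which would be consistent with the ValidationException in the question.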

    Also make sure the repository's permissions are configured to allow the principal sagemaker.amazonaws.com to perform ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer.
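A repository policy granting those three actions to the SageMaker service principal could look roughly like the sketch below. The helper names `build_pull_policy` and `apply_pull_policy` and the statement ID `AllowSageMakerPull` are made up for illustration; the actions and the principal are the ones named above. The ECR client is injected so the policy document can be inspected before anything is sent to AWS.

```python
import json

# The three pull actions listed in the error message and in this answer.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_pull_policy(service="sagemaker.amazonaws.com"):
    """Build an ECR repository policy that lets a service pull images."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSageMakerPull",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": list(PULL_ACTIONS),
            }
        ],
    }


def apply_pull_policy(ecr_client, repository):
    """Attach the pull policy to the named repository."""
    return ecr_client.set_repository_policy(
        repositoryName=repository,
        policyText=json.dumps(build_pull_policy()),
    )


# Usage against a real account (requires boto3 and AWS credentials;
# the region is a placeholder):
#   import boto3
#   ecr = boto3.client("ecr", region_name="us-east-1")
#   apply_pull_policy(ecr, "sagemaker-model")
```

Note that `set_repository_policy` replaces any policy already on the repository, so it is worth reading the existing one first if the repository is shared.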

    [Discussion]:
