【Question Title】: SageMaker in local Jupyter notebook: cannot use AWS hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")
【Posted】: 2020-08-14 00:54:28
【Question】:

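Running SageMaker in a local Jupyter notebook (using VS Code) works without issue, except that training an XGBoost model with the AWS hosted container results in errors (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).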

Jupyter notebook

import os

import sagemaker

session = sagemaker.LocalSession()

# Load and prepare the training and validation data
...

# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)

role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)

xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
                                  subsample=0.8, objective='reg:squarederror', early_stopping_rounds=10,
                                  num_round=200)

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

Docker container KeyError

algo-1-tfcvc_1  | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1  | ERROR:sagemaker-containers:framework error: 
algo-1-tfcvc_1  | Traceback (most recent call last):
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1  |     entrypoint()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1  |     train(framework.training_env())
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1  |     run_algorithm_mode()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1  |     checkpoint_config=checkpoint_config
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1  |     validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1  |     channel_obj.validate(value)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1  |     if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1  | KeyError: 'S3DistributionType'
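The last frame of the traceback indexes the channel configuration directly with `value[S3_DIST_TYPE]`, so any channel description that lacks the `S3DistributionType` key fails before the supported-combination check runs. The following is only a minimal sketch of that failure mode, not the container's actual code: the `validate` function, the `SUPPORTED` set, the two example dictionaries, and the key names other than `'S3DistributionType'` are assumptions made for illustration. It also assumes that the configuration written for the local run was missing the key, which the traceback suggests but does not prove.

```python
# Hypothetical stand-in for the container's channel validation.
# Only the key name 'S3DistributionType' is taken from the traceback.
CONTENT_TYPE = "ContentType"                # assumed key name
TRAINING_INPUT_MODE = "TrainingInputMode"   # assumed key name
S3_DIST_TYPE = "S3DistributionType"         # from the KeyError message

# Assumed example of one supported (content type, input mode, distribution) tuple.
SUPPORTED = {("text/csv", "File", "FullyReplicated")}


def validate(value):
    """Mimic the failing line: plain dict indexing with no default values."""
    key = (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])
    if key not in SUPPORTED:
        raise ValueError(f"unsupported channel configuration: {key}")
    return value


# A channel description that carries all three keys passes validation.
hosted_channel = {
    CONTENT_TYPE: "text/csv",
    TRAINING_INPUT_MODE: "File",
    S3_DIST_TYPE: "FullyReplicated",
}
validate(hosted_channel)

# A channel description without the distribution type raises KeyError.
local_channel = {CONTENT_TYPE: "text/csv", TRAINING_INPUT_MODE: "File"}
try:
    validate(local_channel)
except KeyError as err:
    missing_key = err.args[0]
    print(f"KeyError: {missing_key!r}")
```

Under these assumptions the `KeyError` is the root failure inside the container, and the `docker-compose` error reported on the local machine is a consequence of the container exiting.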

Local PC RuntimeError

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

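If the Jupyter notebook is run in the Amazon cloud SageMaker environment (rather than on the local PC), there are no errors. Note that when running in the cloud notebook, the session is initialized as: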

session = sagemaker.Session()

There appears to be an issue with how LocalSession() works with the hosted Docker container.

【Comments】:

    Tags: python docker jupyter-notebook xgboost amazon-sagemaker


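    【Solution 1】:

    When SageMaker is run in a local Jupyter notebook with a LocalSession, it expects the Docker container to run on the local machine as well.

    The key to making SageMaker (running in a local notebook) use the AWS hosted Docker container is to omit the LocalSession object when initializing the Estimator.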

    Wrong

    xgb_estimator = sagemaker.estimator.Estimator(
        container, role, train_instance_count=1, train_instance_type=instance_type,
        output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)
    

    Correct

    xgb_estimator = sagemaker.estimator.Estimator(
        container, role, train_instance_count=1, train_instance_type=instance_type,
        output_path=f's3://{session.default_bucket()}/{prefix}/output')
    

      

    Additional information

    The SageMaker Python SDK source code provides the following helpful hints:

    File: sagemaker/local/local_session.py

    class LocalSagemakerClient(object):
        """A SageMakerClient that implements the API calls locally.
    
        Used for doing local training and hosting local endpoints. It still needs access to
        a boto client to interact with S3 but it won't perform any SageMaker call.
        ...
    

    File: sagemaker/estimator.py

    class EstimatorBase(with_metaclass(ABCMeta, object)):
        """Handle end-to-end Amazon SageMaker training and deployment tasks.
    
        For introduction to model training and deployment, see
        http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
    
        Subclasses must define a way to determine what image to use for training,
        what hyperparameters to use, and how to create an appropriate predictor instance.
        """
    
        def __init__(self, role, train_instance_count, train_instance_type,
                     train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File',
                     output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
            """Initialize an ``EstimatorBase`` instance.
    
            Args:
                role (str): An AWS IAM role (either name or full ARN). ...
                
            ...
    
                sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
                    Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
                    using the default AWS configuration chain.
            """
    
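The docstring above says that when no `sagemaker_session` is given, the estimator creates one from the default AWS configuration chain. The toy model below is a rough sketch of why omitting the argument changes where training runs. `HostedSession`, `LocalSession`, `ToyEstimator`, and `fit_target` are hypothetical stand-ins written for this illustration; they are not the SDK's classes and make no AWS or Docker calls.

```python
class HostedSession:
    """Stand-in for a session that submits jobs to the SageMaker service."""

    runs_on = "aws"


class LocalSession(HostedSession):
    """Stand-in for a session that runs the training container in local Docker."""

    runs_on = "local-docker"


class ToyEstimator:
    """Toy estimator that only models how the session argument is resolved."""

    def __init__(self, image, role, sagemaker_session=None):
        self.image = image
        self.role = role
        # Use the caller's session if one was passed; otherwise fall back
        # to a default hosted session, as the docstring describes.
        if sagemaker_session is None:
            sagemaker_session = HostedSession()
        self.sagemaker_session = sagemaker_session

    def fit_target(self):
        """Report where a training job would be executed."""
        return self.sagemaker_session.runs_on


# Passing a local session keeps training on the local machine.
with_local = ToyEstimator("image-uri", "role-arn", sagemaker_session=LocalSession())
print(with_local.fit_target())

# Omitting the session falls back to the hosted default.
with_default = ToyEstimator("image-uri", "role-arn")
print(with_default.fit_target())
```

In this sketch the only difference between the two estimators is the session they hold, which matches the fix in the answer: the training code stays the same and only the `sagemaker_session` argument is dropped.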
