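【Posted】: 2020-08-14 00:54:28
【Question】:
Running SageMaker from a local Jupyter notebook (in VS Code) works without issue, except that attempting to train an XGBoost model with the AWS-hosted container results in an error (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).
Jupyter notebook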
import os
import sagemaker
session = sagemaker.LocalSession()
# Load and prepare the training and validation data
...
# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)
role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'
xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output', sagemaker_session=session)
xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
                                  subsample=0.8, objective='reg:squarederror', early_stopping_rounds=10,
                                  num_round=200)
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')
xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})
Docker container KeyError
algo-1-tfcvc_1 | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1 | ERROR:sagemaker-containers:framework error:
algo-1-tfcvc_1 | Traceback (most recent call last):
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1 | entrypoint()
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1 | train(framework.training_env())
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1 | run_algorithm_mode()
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1 | checkpoint_config=checkpoint_config
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1 | validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1 | channel_obj.validate(value)
algo-1-tfcvc_1 | File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1 | if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1 | KeyError: 'S3DistributionType'
Local PC RuntimeError
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
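The error does not occur if the Jupyter notebook runs in the Amazon cloud SageMaker environment (rather than on the local PC). Note that when running in the cloud notebook, the session is initialized as: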
session = sagemaker.Session()
There appears to be a problem with how LocalSession() works with the hosted Docker container.
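To make the failure mode in the traceback concrete, here is a minimal sketch of the kind of lookup that raises the KeyError. It does not use the real container code: `validate_channel`, `missing_keys`, and the `SUPPORTED` set are hypothetical names for illustration. Only the key `'S3DistributionType'` is taken from the error message; the other two key strings are assumptions. The sketch also assumes, without having confirmed it, that the channel configuration reaching the container in local mode simply lacks that key.

```python
# 'S3DistributionType' comes from the KeyError in the traceback; the other two
# string values are assumptions made for this sketch.
CONTENT_TYPE = "ContentType"
TRAINING_INPUT_MODE = "TrainingInputMode"
S3_DIST_TYPE = "S3DistributionType"

# Hypothetical set of supported (content type, input mode, distribution) tuples.
SUPPORTED = {("csv", "File", "FullyReplicated")}


def validate_channel(value):
    """Mimic the tuple lookup shown on the failing line of the traceback."""
    # Indexing a dict with a missing key raises KeyError before the
    # membership test is ever evaluated.
    key = (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])
    if key not in SUPPORTED:
        raise ValueError(f"unsupported channel configuration: {key}")
    return key


def missing_keys(value):
    """Return the required keys that a channel configuration lacks."""
    required = (CONTENT_TYPE, TRAINING_INPUT_MODE, S3_DIST_TYPE)
    return [k for k in required if k not in value]


# A configuration with all three keys passes validation.
complete = {
    CONTENT_TYPE: "csv",
    TRAINING_INPUT_MODE: "File",
    S3_DIST_TYPE: "FullyReplicated",
}

# Hypothetical configuration without the distribution type.
incomplete = {CONTENT_TYPE: "csv", TRAINING_INPUT_MODE: "File"}

try:
    validate_channel(incomplete)
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'S3DistributionType'
```

If this reading is right, the error is raised by the dictionary lookup itself rather than by an unsupported value, so comparing the channel configuration produced by `sagemaker.Session()` with the one produced by `sagemaker.LocalSession()` would be a reasonable next diagnostic step.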
【Discussion】:
Tags: python docker jupyter-notebook xgboost amazon-sagemaker