【Posted】: 2021-05-15 11:04:30
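【Question】:
Hello, and thanks for reading. In short, I want to run a Batch Transform on an XGBoost model that I built with SageMaker Experiments. I trained the model on CSV data stored in S3, deployed an endpoint for it, and successfully called that endpoint with a single CSV row and got the expected inference.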
(I followed this tutorial to the letter before starting to work on Batch Transformation)
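Now I am trying to run a Batch Transform with the model created in that tutorial, but I am getting an error (skip to the bottom for the error log). Before getting to the error, here is my Batch Transform code.
(Imports are from SageMaker SDK v2.24.4)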
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model

region = boto3.Session().region_name
role = get_execution_role()

# Built-in XGBoost container image (version 1.2-1) for the current region
image = sagemaker.image_uris.retrieve('xgboost', region, '1.2-1')
model_location = '{mys3info}/output/model.tar.gz'

# Wrap the trained model artifact so it can be used for Batch Transform
model = Model(image_uri=image,
              model_data=model_location,
              role=role,
              )

transformer = model.transformer(instance_count=1,
                                instance_type='ml.m5.xlarge',
                                strategy='MultiRecord',
                                assemble_with='Line',
                                output_path='myOutputPath',
                                accept='text/csv',
                                max_concurrent_transforms=1,
                                max_payload=20)

# Split the input CSV by line and join each prediction to its input row
transformer.transform(data='s3://test-s3-prefix/short_test_data.csv',
                      content_type='text/csv',
                      split_type='Line',
                      join_source='Input'
                      )
transformer.wait()
short_test_data.csv
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown
57,blue-collar,married,primary,no,52,yes,no,unknown,5,may,38,1,-1,0,unknown
32,blue-collar,single,primary,no,23,yes,yes,unknown,5,may,160,1,-1,0,unknown
53,technician,married,secondary,no,-3,no,no,unknown,5,may,1666,1,-1,0,unknown
29,management,single,tertiary,no,0,yes,no,unknown,5,may,363,1,-1,0,unknown
32,management,married,tertiary,no,0,yes,no,unknown,5,may,179,1,-1,0,unknown
38,management,single,tertiary,no,424,yes,no,unknown,5,may,104,1,-1,0,unknown
I created the CSV test data above from my original dataset by running the following on the command line:
head original_training_data.csv > short_test_data.csv
Then I uploaded it manually to my S3 bucket.
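For anyone reproducing this step without a shell, the `head` command above can be sketched in Python. This is only an illustration: `write_head` is a hypothetical helper name, not part of any SDK, and it assumes plain UTF-8 text. Like `head`, it copies the first 10 lines as-is, so a header row, if the source file has one, would be copied too.

```python
from itertools import islice


def write_head(src: str, dst: str, n: int = 10) -> int:
    """Copy the first n lines of src into dst, like `head -n N src > dst`.

    Returns the number of lines actually written (fewer than n for short files).
    """
    with open(src, "r", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        lines = list(islice(fin, n))
        fout.writelines(lines)
    return len(lines)


# Example call, using the file names from the post:
# write_head("original_training_data.csv", "short_test_data.csv")
```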
Logs
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=20, BatchStrategy=MULTI_RECORD
[sagemaker logs]: */short_test_data.csv: ClientError: 415
[sagemaker logs]: */short_test_data.csv:
[sagemaker logs]: */short_test_data.csv: Message:
[sagemaker logs]: */short_test_data.csv: Loading csv data failed with Exception, please ensure data is in csv format:
[sagemaker logs]: */short_test_data.csv: <class 'ValueError'>
[sagemaker logs]: */short_test_data.csv: could not convert string to float: 'entrepreneur'
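The last log line has the same wording Python's built-in `float()` uses when it is given a non-numeric string. The sketch below is a local check only, assuming the container parses each CSV field roughly the way `float()` does (the container's actual parsing code is not shown here). `non_numeric_fields` is a hypothetical helper that lists every field in a CSV text that would fail that conversion.

```python
import csv
import io


def non_numeric_fields(csv_text: str):
    """Return (row_index, column_index, value) for each field float() rejects."""
    problems = []
    for r, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for c, value in enumerate(row):
            try:
                float(value)
            except ValueError:
                problems.append((r, c, value))
    return problems


# First row of short_test_data.csv from the post
sample = "33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown\n"
bad = non_numeric_fields(sample)
print(bad[0])  # (0, 1, 'entrepreneur')
```

The first offending field is the same value named in the log, which suggests the job stopped at the first string column it met.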
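I understand the concept of one-hot encoding and the other ways of converting strings to numbers for algorithms such as XGBoost. My problem is that I can feed data in exactly this format to the deployed endpoint and get results without doing any such encoding. I am clearly missing something, though, so any help is much appreciated!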
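For reference, the kind of one-hot encoding mentioned above can be sketched with the standard library alone. This is an illustration, not the preprocessing used by the tutorial or the endpoint: the helper names, the column indices, and the category lists are all hypothetical, and any real encoding would have to match the one the model was trained with.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a 1.0 at the position of value (all zeros if unseen)."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode_row(row, vocab):
    """Encode one CSV row: categorical fields expand, numeric fields pass through.

    vocab maps a column index to the ordered list of categories for that column.
    """
    out = []
    for i, value in enumerate(row):
        if i in vocab:
            out.extend(one_hot(value, vocab[i]))
        else:
            out.append(float(value))
    return out


# Hypothetical vocabulary for a shortened four-column row (assumption, not from the post)
vocab = {1: ["married", "single", "divorced"], 3: ["no", "yes"]}
encoded = encode_row(["33", "single", "2", "yes"], vocab)
print(encoded)  # [33.0, 0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

After encoding, every field is numeric, so a row like this would no longer trigger the string-to-float error shown in the logs.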
【Comments】:
Tags: python amazon-web-services machine-learning xgboost amazon-sagemaker