Requirements.txt is not packed in the model.tar.gz using Sagemaker Pytorch Estimator

【Question title】: Requirements.txt is not packed in the model.tar.gz using Sagemaker Pytorch Estimator
【Posted】: 2022-01-04 09:16:38
【Question description】:

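I am using a SageMaker Pipeline workflow to train a model and register it. Later I create an endpoint from the registered model.

My inference.py file needs some Python packages installed, for example gensim. I put the requirements.txt file in the same folder as train.py and inference.py.

The problem is that requirements.txt is not packed into model.tar.gz. As a result, although training and endpoint creation succeed, the logs of the deployed endpoint show the following error:

ModuleNotFoundError: No module named 'gensim'

Here is the part of my script that trains and registers the model.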

import os

import sagemaker
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import (
    ProcessingStep,
    TrainingStep,
)

# BASE_DIR, role, step_process, model_package_group_name and
# model_approval_status are defined elsewhere in the script (not shown).
train_estimator = PyTorch(
    entry_point="train.py",
    source_dir=BASE_DIR,
    instance_type="ml.m5.2xlarge",
    instance_count=1,
    role=role,
    framework_version="1.8.0",
    py_version="py3",
)

# Train on the "train_data" output of the processing step.
step_train = TrainingStep(
    name="TrainStep",
    estimator=train_estimator,
    inputs={
        "train": sagemaker.TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train_data"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

# Register the trained model together with the inference script.
step_register = RegisterModel(
    name="RegisterStep",
    estimator=train_estimator,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.2xlarge"],
    transform_instances=["ml.m5.large"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    source_dir=BASE_DIR,
    entry_point=os.path.join(BASE_DIR, "inference.py"),
    depends_on=[step_train],
)

This is my file structure:

-abalone
  - __init__.py
  - train.py
  - inference.py
  - requirements.py
  - preprocess.py
  - evaluate.py
  - pipeline.py

BASE_DIR refers to the abalone folder.
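Since the pipeline can only package files that actually exist in source_dir, a quick pre-flight check before building the pipeline may be worthwhile. The sketch below is not part of the original post; the function name `missing_files` and the list of expected files are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def missing_files(source_dir, expected):
    """Return the names in `expected` that are not regular files in source_dir."""
    root = Path(source_dir)
    return [name for name in expected if not (root / name).is_file()]


# Hypothetical usage with the BASE_DIR from the question:
# missing = missing_files(BASE_DIR, ["train.py", "inference.py", "requirements.txt"])
# if missing:
#     raise FileNotFoundError(f"Not found in source_dir: {missing}")
```

Note that the file listing in the question shows requirements.py rather than requirements.txt. If that reflects the real file name and is not just a typo in the post, a check like this would flag it.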

In model.tar.gz I see:

- model.pth
- model.pth.wv.vectors_ngrams.npy
- code
  - __pycache__
  - train.py
  - _repack_model.py
  - inference.py
  - preprocess.py
  - evaluate.py
  - __init__.py
  - pipeline.py

You can see that it contains everything except the requirements.txt file.
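A listing like the one above can be produced locally once model.tar.gz has been downloaded. The following is a minimal sketch using only the standard library; the helper names `list_archive` and `has_requirements` are hypothetical, and the expected location code/requirements.txt is an assumption based on the code directory shown in the listing.

```python
import posixpath
import tarfile


def list_archive(path):
    """Return the sorted, normalized member names of a tar archive."""
    with tarfile.open(path, "r:*") as tar:
        # normpath turns "./code/x" into "code/x" so names compare cleanly.
        return sorted(posixpath.normpath(name) for name in tar.getnames())


def has_requirements(path):
    """True if the archive contains code/requirements.txt (assumed location)."""
    return "code/requirements.txt" in list_archive(path)


# Hypothetical usage on a locally downloaded archive:
# print("\n".join(list_archive("model.tar.gz")))
# print(has_requirements("model.tar.gz"))
```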

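The SageMaker documentation says:

"The PyTorch and PyTorchModel classes repack model.tar.gz to include the inference script (and related files), as long as the framework_version is set to 1.2 or higher."

But as you can see, although my framework_version is higher than 1.2, the requirements.txt file is still not packed into model.tar.gz.

Can anyone help me solve this problem?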

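【Question discussion】:

This link may be useful: github.com/aws/sagemaker-python-sdk/issues/2759
I could not reproduce this issue, @user_5. I created a pipeline with a TrainingStep that uses a PyTorch estimator, plus a RegisterModel step. I added a requirements.txt file to my BASE_DIR. The model tarball contains all the files in BASE_DIR, including requirements.txt.
Note that the repacking is done by the pipeline, through a training job, not by the PyTorch container. You should see a training job with "RegisterStep" in its name that runs _repack_model.py. I would check that job's CloudWatch logs to see whether anything stands out. This is only speculation, but the training container might be locking the requirements.txt file, which would prevent it from being copied to the output directory.
@Payton, thank you very much for your effort. Could you send me the configuration, or the versions of PyTorch, Python and SageMaker that you used?
@Payton, I did not find anything useful in the registration logs. They only show that the dependencies are empty: /miniconda3/bin/python _repack_model.py --dependencies --inference_script inference.py --model_archive model.tar.gz --source_dir /root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/...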

Tags: pytorch amazon-sagemaker endpoint inference requirements.txt

【Solution 1】:

A workaround is to install the required packages from within inference.py:

import os
# Runs when inference.py is loaded, so place it before importing the packages.
os.system("pip install package1 package2 ...")
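A slightly more defensive variant of this workaround is sketched below. It is not from the original answer: the helper names `pip_install_command` and `ensure_installed` are hypothetical, and it assumes the endpoint container has network access to a package index. It only installs packages that cannot already be imported, and it uses the running interpreter's own pip.

```python
import importlib.util
import subprocess
import sys


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", *packages]


def ensure_installed(modules):
    """Install every package whose top-level module cannot be found.

    `modules` maps a top-level import name to its pip package name
    (the two can differ). Returns the list of packages that were installed.
    """
    missing = [pkg for mod, pkg in modules.items()
               if importlib.util.find_spec(mod) is None]
    if missing:
        # check_call raises CalledProcessError if pip exits with an error,
        # so a failed install is visible instead of a later import error.
        subprocess.check_call(pip_install_command(missing))
    return missing


# Hypothetical usage at the top of inference.py, before importing gensim:
# ensure_installed({"gensim": "gensim"})
```

Compared with a bare os.system call, a failed install raises an exception here rather than being silently ignored, which may make the endpoint logs easier to read.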

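To troubleshoot this, I suggest deploying the estimator with train_estimator.deploy, which creates the model, the endpoint configuration and the endpoint. Then check the CloudWatch logs to see whether it still fails to package requirements.txt.

requirements.txt may be used as an environment variable attached to the model created by estimator.deploy, and since you are using RegisterModel, that parameter may be ignored.

【Discussion】:

Thanks, I have already installed the packages in inference.py using os.system(..), but I would like to solve the problem of packaging requirements.txt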
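Neither the answer nor the comments settle why the pipeline's repack step omits the file. For readers who want requirements.txt inside the archive regardless, one option is to add it by hand. The sketch below is an assumption-laden illustration, not a fix confirmed in this thread: `add_to_archive` is a hypothetical helper, the target path code/requirements.txt is assumed from the archive layout shown in the question, and uploading the new archive and pointing model_data at it is not shown.

```python
import tarfile


def add_to_archive(src_archive, dst_archive, file_path, arcname):
    """Copy src_archive to dst_archive and add file_path under arcname.

    Does not check whether arcname already exists in the source archive.
    """
    with tarfile.open(src_archive, "r:*") as src:
        with tarfile.open(dst_archive, "w:gz") as dst:
            for member in src.getmembers():
                # Only regular files carry data; directories get no file object.
                fileobj = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, fileobj)
            dst.add(file_path, arcname=arcname)


# Hypothetical usage on a locally downloaded archive:
# add_to_archive("model.tar.gz", "model_with_requirements.tar.gz",
#                "requirements.txt", "code/requirements.txt")
```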