How to trigger an Azure ML pipeline on file change? (Answers)

[Question title]: How to trigger an Azure ML pipeline with a file change?
[Posted]: 2022-11-16 13:54:46
[Question description]:

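I am new to Azure ML, and I want to trigger a training pipeline whenever new data is added to a dataset.

This is the training code, and it all works fine: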

prep_train_step = PythonScriptStep(
    name=PREPROCESS_TRAIN_PIPELINE_STEP_NAME,
    script_name=PREPROCESS_TRAIN_PIPELINE_SCRIPT_NAME, 
    compute_target=train_compute_instance, 
    source_directory=PREPROCESS_TRAIN_PIPELINE_SCRIPT_SOURCE_DIR,
    runconfig=train_run_config,
    allow_reuse=False,
    arguments=['--classifier-type', "xgBoost", "--train", train_dataset.as_mount(), "--test", test_dataset.as_mount()]
    )

print("Classification model preprocessing and training step created")

pipeline = Pipeline(workspace=ws, steps=[prep_train_step], )
print ("Pipeline is built")

# Submit the pipeline to be run once
experiment_name = PREPROCESS_TRAIN_EXPERIMENT_NAME
pipeline_run1 = Experiment(ws, experiment_name).submit(pipeline)
pipeline_run1.wait_for_completion(show_output=True)

Now for the schedule, which I took from the documentation. First I publish the pipeline:

published_pipeline = pipeline.publish(name='training_pipeline',
                                      description='Model training pipeline mock',
                                      version='1.0')

Check the published pipeline's REST endpoint:

rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

So far so good: this prints the endpoint URL.

Now for the last part, where I have to schedule the pipeline:

from azureml.pipeline.core import Schedule

reactive_schedule = Schedule.create(ws, name='MyReactiveScheduleTraining',
                                    description='trains based on input file change.',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='retraining_Pipeline_data_changes',
                                    datastore=blob_storage,
                                    path_on_datastore='./toy_data/train1')

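When I upload anything to ./toy_data/train1, the pipeline is not triggered, and I have no idea why.

Even when I change path_on_datastore and upload the data to the new target location, still nothing happens.

Any helpful ideas?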

[Question comments]:

Tags: azure-pipelines azure-machine-learning-service azure-ml-pipelines

[Solution 1]:

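The scenario is: [file] => [datastore] -> trigger (AML pipeline with an input data parameter) -> [output file]. For more detail on how a pipeline can be triggered, see the Schedule class documentation (https://learn.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.schedule(class)?view=azure-ml-py). It supports two kinds of trigger: a time interval, or an added or modified blob. The example below creates a time-based schedule: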

import azureml.core
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PublishedPipeline
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
from azureml.core.experiment import Experiment

ws = Workspace.from_config()

pipeline_id = ""  # Retrieve from GetPublishedPipelines script
experiment_name = ""
recurrence = ScheduleRecurrence(
    frequency="Day", interval=1, time_of_day="08:00"
)  # time_of_day is UTC
recurring_schedule = Schedule.create(
    ws,
    name=experiment_name + "_RecurringJob",
    description="Based on time",
    pipeline_id=pipeline_id,
    experiment_name=experiment_name,
    recurrence=recurrence,
)
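
For the added-or-modified-blob case from the question, a few points from the Schedule documentation are worth checking: the datastore is polled every `polling_interval` minutes (5 by default), so a run does not start the moment a file lands; `path_on_datastore` is relative to the datastore's container; and changes inside subfolders of that path are not monitored. The sketch below is not a confirmed fix. It assumes that the leading `./` in `./toy_data/train1` might prevent the path from matching blob names, and strips it before calling `Schedule.create`. The helper names `normalize_datastore_path`, `reactive_schedule_kwargs`, and `create_reactive_schedule` are hypothetical and not part of the SDK.

```python
from posixpath import normpath


def normalize_datastore_path(path: str) -> str:
    """Turn './toy_data/train1' or '/toy_data/train1/' into 'toy_data/train1'."""
    cleaned = normpath(path.replace("\\", "/")).strip("/")
    return "" if cleaned == "." else cleaned


def reactive_schedule_kwargs(pipeline_id, experiment_name, datastore,
                             path, polling_interval=5):
    """Collect keyword arguments for Schedule.create (no Azure call is made here)."""
    kwargs = {
        "name": "MyReactiveScheduleTraining",
        "description": "Trains when a blob is added or modified.",
        "pipeline_id": pipeline_id,
        "experiment_name": experiment_name,
        "datastore": datastore,
        "polling_interval": polling_interval,  # minutes between datastore checks
    }
    folder = normalize_datastore_path(path)
    if folder:  # leave the argument out to watch the whole container
        kwargs["path_on_datastore"] = folder
    return kwargs


def create_reactive_schedule(ws, **kwargs):
    """Create the schedule; requires the azureml-pipeline-core package."""
    # Imported lazily so the helpers above can be tried without the SDK.
    from azureml.pipeline.core import Schedule
    return Schedule.create(ws, **kwargs)


# Usage with the names from the question (ws, published_pipeline, blob_storage):
# schedule = create_reactive_schedule(ws, **reactive_schedule_kwargs(
#     published_pipeline.id, "retraining_Pipeline_data_changes",
#     blob_storage, "./toy_data/train1"))
```

If the schedule still does not fire, it may help to upload a file directly into the monitored folder rather than a subfolder, and to wait at least one polling interval before concluding that nothing happened.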

[Comments]: