【Question title】: AWS Step Functions does not add the next step to the EMR cluster when the current step fails
【Posted】: 2021-03-10 12:22:54
【Question】:

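I set up a state machine in AWS Step Functions that creates an EMR cluster, adds several EMR steps, and then terminates the cluster. This works as long as every step runs to completion without errors. If a step fails, the next step is not added, even though I added a Catch that is supposed to continue to it. Whenever a step fails, it is marked as caught (shown in orange in the graph), but the next step is marked as cancelled.

In case it helps, here is my state machine definition: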

{
  "StartAt": "MyEMR-SMFlowContainer-beta",
  "States": {
    "MyEMR-SMFlowContainer-beta": {
      "Type": "Parallel",
      "End": true,
      "Branches": [
        {
          "StartAt": "CreateClusterStep-feature-generation-cluster-beta",
          "States": {
            "CreateClusterStep-feature-generation-cluster-beta": {
              "Next": "Step-SuccessfulJobOne",
              "Type": "Task",
              "ResultPath": "$.Cluster.1.CreateClusterTask",
              "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
              "Parameters": {
                "Instances": {
                  "Ec2SubnetIds": [
                    "subnet-*******345fd38423"
                  ],
                  "InstanceCount": 2,
                  "KeepJobFlowAliveWhenNoSteps": true,
                  "MasterInstanceType": "m4.xlarge",
                  "SlaveInstanceType": "m4.xlarge"
                },
                "JobFlowRole": "MyEMR-emrInstance-beta-EMRInstanceRole",
                "Name": "emr-step-fail-handle-test-cluster",
                "ServiceRole": "MyEMR-emr-beta-EMRRole",
                "Applications": [
                  {
                    "Name": "Spark"
                  },
                  {
                    "Name": "Hadoop"
                  }
                ],
                "AutoScalingRole": "MyEMR-beta-FeatureG-CreateClusterStepfeature-NJB2UG1J1EWB",
                "Configurations": [
                  {
                    "Classification": "spark-env",
                    "Configurations": [
                      {
                        "Classification": "export",
                        "Properties": {
                          "PYSPARK_PYTHON": "/usr/bin/python3"
                        }
                      }
                    ]
                  }
                ],
                "LogUri": "s3://MyEMR-beta-feature-createclusterstepfeature-1jpp1wp3dfn04/emr/logs/",
                "ReleaseLabel": "emr-5.32.0",
                "VisibleToAllUsers": true
              }
            },
            "Step-SuccessfulJobOne": {
              "Next": "Step-AlwaysFailingJob",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Step-AlwaysFailingJob"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.SuccessfulJobOne.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "SuccessfulJobOne",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "1",
                      "--JOB_KEY",
                      "SuccessfulJobOne"
                    ]
                  }
                }
              }
            },
            "Step-AlwaysFailingJob": {
              "Next": "Step-SuccessfulJobTwo",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "Step-SuccessfulJobTwo"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.AlwaysFailingJob.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "AlwaysFailingJob",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "2",
                      "--JOB_KEY",
                      "AlwaysFailingJob"
                    ]
                  }
                }
              }
            },
            "Step-SuccessfulJobTwo": {
              "Next": "TerminateClusterStep-feature-generation-cluster-beta",
              "Catch": [
                {
                  "ErrorEquals": [
                    "States.ALL"
                  ],
                  "Next": "TerminateClusterStep-feature-generation-cluster-beta"
                }
              ],
              "Type": "Task",
              "TimeoutSeconds": 7200,
              "ResultPath": "$.ClusterStep.SuccessfulJobTwo.AddSparkTask",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId",
                "Step": {
                  "Name": "DeviceJob",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "spark-submit",
                      "--deploy-mode",
                      "client",
                      "--master",
                      "yarn",
                      "--conf",
                      "spark.logConf=true",
                      "--class",
                      "com.test.sample.core.EMRJobRunner",
                      "s3://my-****-bucket/jars/77/my-****-bucketBundleJar-1.0.jar",
                      "--JOB_NUMBER",
                      "3",
                      "--JOB_KEY",
                      "SuccessfulJobTwo"
                    ]
                  }
                }
              }
            },
            "TerminateClusterStep-feature-generation-cluster-beta": {
              "End": true,
              "Type": "Task",
              "ResultPath": null,
              "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
              "Parameters": {
                "ClusterId.$": "$.Cluster.1.CreateClusterTask.ClusterId"
              }
            }
          }
        }
      ]
    }
  },
  "TimeoutSeconds": 43200
}

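Can anyone suggest how to catch a failure in a step, ignore it, and add the next step? Thanks in advance.

【Comments】:

  • I notice that you wrapped all the states in a Parallel state. Is there a reason for that? Or do you have more branches and posted a simplified version here? Have you tested without the Parallel state?
  • I have other branches that I left out here for simplicity. I tried removing the Parallel state earlier, but it did not help. I have solved the problem and posted the cause as an answer.

Tags: amazon-web-services amazon-emr aws-step-functions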


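【Answer 1】:

The problem was that I had not specified a ResultPath in the Catch block. The default ResultPath of a Catch is $, so the error output replaced the entire state input. The cluster information was lost, the next step could not read the ClusterId, and it was therefore cancelled. This is the Catch I had: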

      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Step-SuccessfulJobTwo"
        }
      ],
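
To make the failure mode concrete, here is a rough Python model of the overwrite. It is only a sketch: `apply_result_path` and `resolve` are hypothetical helpers that handle simple dotted paths, not the Step Functions implementation, and the cluster id and error values are placeholders.

```python
import copy

def apply_result_path(state_input, result, result_path="$"):
    """Rough model: merge `result` into `state_input` at a simple dotted path."""
    if result_path == "$":
        # Default behaviour: the result replaces the whole input.
        return result
    output = copy.deepcopy(state_input)
    node = output
    keys = result_path[2:].split(".")  # drop the leading "$."
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output

def resolve(path, data):
    """Follow a simple dotted path such as $.a.b.c; raises KeyError if missing."""
    node = data
    for key in path[2:].split("."):
        node = node[key]
    return node

# Placeholder values, shaped like the state data in the question.
state_input = {"Cluster": {"1": {"CreateClusterTask": {"ClusterId": "j-EXAMPLE"}}}}
error_output = {"Error": "States.TaskFailed", "Cause": "EMR step failed"}
CLUSTER_ID_PATH = "$.Cluster.1.CreateClusterTask.ClusterId"

# Catch without ResultPath: the error output becomes the next state's input.
after_default_catch = apply_result_path(state_input, error_output)

try:
    resolve(CLUSTER_ID_PATH, after_default_catch)
    cluster_id_found = True
except KeyError:
    cluster_id_found = False

print(cluster_id_found)  # False: the next state has no ClusterId to read
```

In this model the next state receives only the `Error` and `Cause` fields, so the `ClusterId.$` parameter has nothing to resolve against.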

Once I updated the Catch with a proper ResultPath, it worked as expected:

      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Step-SuccessfulJobTwo",
          "ResultPath": "$.ClusterStep.SuccessfulJobOne.AddSparkTask.Error"
        }
      ],
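
With several steps in the definition, the same change has to be repeated for every Catch. One possible way to apply it in bulk is sketched below, assuming the definition is available as a Python dict. The helper `add_catch_result_paths` and the `Error` suffix are illustrative choices, not part of any AWS SDK, and the definition is trimmed to the fields the helper reads.

```python
import json

def add_catch_result_paths(states, suffix="Error"):
    """Give every Catch entry that lacks a ResultPath one nested under
    the state's own ResultPath, so the original input is preserved."""
    for state in states.values():
        # Recurse into Parallel branches, as used in the question.
        for branch in state.get("Branches", []):
            add_catch_result_paths(branch["States"], suffix)
        base = state.get("ResultPath")
        for catcher in state.get("Catch", []):
            if "ResultPath" not in catcher and isinstance(base, str) and base != "$":
                catcher["ResultPath"] = f"{base}.{suffix}"
    return states

# Trimmed-down fragment of the definition from the question.
definition = {
    "StartAt": "Step-AlwaysFailingJob",
    "States": {
        "Step-AlwaysFailingJob": {
            "Type": "Task",
            "ResultPath": "$.ClusterStep.AlwaysFailingJob.AddSparkTask",
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "Step-SuccessfulJobTwo"}
            ],
            "Next": "Step-SuccessfulJobTwo",
        }
    },
}

add_catch_result_paths(definition["States"])
print(json.dumps(definition["States"]["Step-AlwaysFailingJob"]["Catch"], indent=2))
```

States whose ResultPath is `null` or `$` are left untouched by this sketch, since there is no obvious place to nest the error under.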

