【Question title】: AWS data pipeline unable to create through serverless yaml template
【Posted】: 2020-09-29 18:41:52
【Question】:

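I am creating a data pipeline that exports a DynamoDB table to S3. The template provided for serverless yaml does not work with the "PAY_PER_REQUEST" billing mode.

I created a pipeline in the AWS console and it works fine. I exported its definition and tried to create the pipeline in serverless from the same definition, but it gives me the following error: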

ServerlessError: An error occurred: UrlReportDataPipeline - Pipeline Definition failed to validate because of following Errors: [{ObjectId = 'TableBackupActivity', errors = [Object references invalid id: 's3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}']}] and Warnings: [].

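Can anyone help me with this? The pipeline created through the console has the same value for the step in the table backup activity.

The pipeline template is pasted below: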

    UrlReportDataPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties: 
        Name: ***pipeline name****
        Activate: true
        ParameterObjects: 
          - Id: "myDDBReadThroughputRatio"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB read throughput ratio"
              - Key: "type"
                StringValue: "Double"
              - Key: "default"
                StringValue: "0.9"
          - Id: "myOutputS3Loc"
            Attributes: 
              - Key: "description"
                StringValue: "S3 output bucket"
              - Key: "type"
                StringValue: "AWS::S3::ObjectKey"
              - Key: "default"
                StringValue: 
                  !Join [ "", [ "s3://", Ref: "UrlReportBucket" ] ]
          - Id: "myDDBTableName"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB Table Name"
              - Key: "type"
                StringValue: "String"
          - Id: "myDDBRegion"
            Attributes:
              - Key: "description"
                StringValue: "DynamoDB region"
        ParameterValues: 
          - Id: "myDDBTableName"
            StringValue: 
              Ref: "UrlReport"
          - Id: "myDDBRegion"
            StringValue: "eu-west-1"
        PipelineObjects: 
          - Id: "S3BackupLocation"
            Name: "Copy data to this S3 location"
            Fields: 
              - Key: "type"
                StringValue: "S3DataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "directoryPath"
                StringValue: "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
          - Id: "DDBSourceTable"
            Name: "DDBSourceTable"
            Fields: 
              - Key: "tableName"
                StringValue: "#{myDDBTableName}"
              - Key: "type"
                StringValue: "DynamoDBDataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "readThroughputPercent"
                StringValue: "#{myDDBReadThroughputRatio}"
          - Id: "DDBExportFormat"
            Name: "DDBExportFormat"
            Fields: 
              - Key: "type"
                StringValue: "DynamoDBExportDataFormat"
          - Id: "TableBackupActivity"
            Name: "TableBackupActivity"
            Fields: 
              - Key: "resizeClusterBeforeRunning"
                StringValue: "true"
              - Key: "type"
                StringValue: "EmrActivity"
              - Key: "input"
                RefValue: "DDBSourceTable"
              - Key: "runsOn"
                RefValue: "EmrClusterForBackup"
              - Key: "output"
                RefValue: "S3BackupLocation"
              - Key: "step"
                RefValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
          - Id: "DefaultSchedule"
            Name: "Every 1 day"
            Fields: 
              - Key: "occurrences"
                StringValue: "1"
              - Key: "startDateTime"
                StringValue: "2020-09-17T1:00:00"
              - Key: "type"
                StringValue: "Schedule"
              - Key: "period"
                StringValue: "1 Day"
          - Id: "Default"
            Name: "Default"
            Fields: 
              - Key: "type"
                StringValue: "Default"
              - Key: "scheduleType"
                StringValue: "cron"
              - Key: "failureAndRerunMode"
                StringValue: "CASCADE"
              - Key: "role"
                StringValue: "DatapipelineDefaultRole"
              - Key: "resourceRole"
                StringValue: "DatapipelineDefaultResourceRole"
              - Key: "schedule"
                RefValue: "DefaultSchedule"
          - Id: "EmrClusterForBackup"
            Name: "EmrClusterForBackup"
            Fields: 
              - Key: "terminateAfter"
                StringValue: "2 Hours"
              - Key: "masterInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceCount"
                StringValue: "1"
              - Key: "type"
                StringValue: "EmrCluster"
              - Key: "releaseLabel"
                StringValue: "emr-5.23.0"
              - Key: "region"
                StringValue: "#{myDDBRegion}"

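【Comments】:

  • When your code has #{myOutputS3Loc}, is that referencing an environment variable or something else? With serverless I had to use $ instead of #. Could you try hardcoding the values instead of using this format, to rule that out as the problem?
  • #{} is replaced by Data Pipeline at run time. I also tried hardcoding these values, but no luck.
  • It looks like for "step" you have multiple values configured for the RefValue. Is that correct?

Tags: amazon-web-services serverless amazon-data-pipeline aws-data-pipeline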


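【Solution 1】:

Folks, I resolved this with the help of the AWS support team. As of today, the yaml below creates a data pipeline for a DynamoDB table that uses on-demand (pay-per-request) billing.

You can also convert it to json if needed.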

    UrlReportBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: ***bucketname***

    UrlReportDataPipeline:
      Type: AWS::DataPipeline::Pipeline
      Properties: 
        Name: ***pipelinename***
        Activate: true
        ParameterObjects: 
          - Id: "myDDBReadThroughputRatio"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB read throughput ratio"
              - Key: "type"
                StringValue: "Double"
              - Key: "default"
                StringValue: "0.9"
          - Id: "myOutputS3Loc"
            Attributes: 
              - Key: "description"
                StringValue: "S3 output bucket"
              - Key: "type"
                StringValue: "AWS::S3::ObjectKey"
              - Key: "default"
                StringValue: 
                  !Join [ "", [ "s3://", Ref: "UrlReportBucket" ] ]
          - Id: "myDDBTableName"
            Attributes: 
              - Key: "description"
                StringValue: "DynamoDB Table Name"
              - Key: "type"
                StringValue: "String"
          - Id: "myDDBRegion"
            Attributes:
              - Key: "description"
                StringValue: "DynamoDB region"
        ParameterValues: 
          - Id: "myDDBTableName"
            StringValue: 
              Ref: "UrlReport"
          - Id: "myDDBRegion"
            StringValue: "eu-west-1"
        PipelineObjects: 
          - Id: "S3BackupLocation"
            Name: "Copy data to this S3 location"
            Fields: 
              - Key: "type"
                StringValue: "S3DataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "directoryPath"
                StringValue: "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
          - Id: "DDBSourceTable"
            Name: "DDBSourceTable"
            Fields: 
              - Key: "tableName"
                StringValue: "#{myDDBTableName}"
              - Key: "type"
                StringValue: "DynamoDBDataNode"
              - Key: "dataFormat"
                RefValue: "DDBExportFormat"
              - Key: "readThroughputPercent"
                StringValue: "#{myDDBReadThroughputRatio}"
          - Id: "DDBExportFormat"
            Name: "DDBExportFormat"
            Fields: 
              - Key: "type"
                StringValue: "DynamoDBExportDataFormat"
          - Id: "TableBackupActivity"
            Name: "TableBackupActivity"
            Fields: 
              - Key: "resizeClusterBeforeRunning"
                StringValue: "true"
              - Key: "type"
                StringValue: "EmrActivity"
              - Key: "input"
                RefValue: "DDBSourceTable"
              - Key: "runsOn"
                RefValue: "EmrClusterForBackup"
              - Key: "output"
                RefValue: "S3BackupLocation"
              - Key: "step"
                StringValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{myDDBTableName},#{myDDBReadThroughputRatio}"
          - Id: "DefaultSchedule"
            Name: "Every 1 day"
            Fields: 
              - Key: "occurrences"
                StringValue: "1"
              - Key: "startDateTime"
                StringValue: "2020-09-23T1:00:00"
              - Key: "type"
                StringValue: "Schedule"
              - Key: "period"
                StringValue: "1 Day"
          - Id: "Default"
            Name: "Default"
            Fields: 
              - Key: "type"
                StringValue: "Default"
              - Key: "scheduleType"
                StringValue: "cron"
              - Key: "failureAndRerunMode"
                StringValue: "CASCADE"
              - Key: "role"
                StringValue: "DatapipelineDefaultRole"
              - Key: "resourceRole"
                StringValue: "DatapipelineDefaultResourceRole"
              - Key: "schedule"
                RefValue: "DefaultSchedule"
          - Id: "EmrClusterForBackup"
            Name: "EmrClusterForBackup"
            Fields: 
              - Key: "terminateAfter"
                StringValue: "2 Hours"
              - Key: "masterInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceType"
                StringValue: "m3.xlarge"
              - Key: "coreInstanceCount"
                StringValue: "1"
              - Key: "type"
                StringValue: "EmrCluster"
              - Key: "releaseLabel"
                StringValue: "emr-5.23.0"
              - Key: "region"
                StringValue: "#{myDDBRegion}"
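Comparing this template with the one in the question, the change that matters appears to be in the "step" field of TableBackupActivity. It is now a StringValue rather than a RefValue, and its last two arguments use the pipeline parameters directly instead of #{input.tableName} and #{input.readThroughputPercent}. The fragment below isolates that field; it is an excerpt of the template above, not a complete resource.

```yaml
# Excerpt: PipelineObjects -> TableBackupActivity -> Fields (one entry only).
# In the question this entry used RefValue, which Data Pipeline treats as the
# Id of another pipeline object, hence "Object references invalid id".
- Key: "step"
  # StringValue is passed through as a literal; the #{...} expressions are
  # evaluated by Data Pipeline at run time.
  StringValue: "s3://dynamodb-dpl-#{myDDBRegion}/emr-ddb-storage-handler/4.11.0/emr-dynamodb-tools-4.11.0-SNAPSHOT-jar-with-dependencies.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{myDDBTableName},#{myDDBReadThroughputRatio}"
```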

【Comments】:

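【Solution 2】:

Step has a RefValue that points to multiple resources, and they appear to be specified as a single string. According to the serverless documentation, a RefValue is

    A field value that you specify as an identifier of another object in the same pipeline definition.

If you look at where S3BackupLocation is used, it is created under PipelineObjects and then referenced by its Id.

For step, the RefValue is a string, and that string contains commas, so it looks as if it specifies multiple objects.

I'm not sure what this is meant to do, but if you want to use RefValue, perhaps create the object elsewhere in the template and use its Id here?

You could also try using StringValue here instead of RefValue.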
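As a minimal sketch of the distinction described above, the fragment below is trimmed from the question's own template and shows one field of each kind. It is illustrative only and omits fields a working pipeline needs.

```yaml
# Illustrative fragment only, trimmed from the template in the question.
PipelineObjects:
  - Id: "S3BackupLocation"            # an object defined in this pipeline
    Name: "Copy data to this S3 location"
    Fields:
      - Key: "type"
        StringValue: "S3DataNode"     # StringValue: a literal value
  - Id: "TableBackupActivity"
    Name: "TableBackupActivity"
    Fields:
      - Key: "type"
        StringValue: "EmrActivity"
      - Key: "output"
        RefValue: "S3BackupLocation"  # RefValue: must match the Id of an object above
```

A value such as the jar path and argument list for "step" is not the Id of any object, which is why it belongs in a StringValue.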

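【Comments】:

  • RefValues are pipeline run-time identifiers. Suppose I provide the value of #{output.directoryPath}: it will create a new S3 bucket every time the EMR activity runs.
  • Yes, but you have a RefValue for step where you should have a StringValue, as I mentioned in my answer. I see you made this change in your answer.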